rababa 0.1.0 → 0.1.1
- checksums.yaml +4 -4
- data/.github/workflows/python.yml +81 -0
- data/.github/workflows/release.yml +36 -0
- data/.github/workflows/ruby.yml +27 -0
- data/.gitignore +3 -0
- data/.rubocop.yml +1 -1
- data/CODE_OF_CONDUCT.md +13 -13
- data/README.adoc +80 -0
- data/Rakefile +1 -1
- data/docs/{research-arabic-diacritization-06-2021.md → research-arabic-diacritization-06-2021.adoc} +52 -37
- data/exe/rababa +1 -1
- data/lib/README.adoc +95 -0
- data/lib/rababa/diacritizer.rb +16 -8
- data/lib/rababa/encoders.rb +2 -2
- data/lib/rababa/harakats.rb +1 -1
- data/lib/rababa/reconcile.rb +1 -33
- data/lib/rababa/version.rb +1 -1
- data/models-data/README.adoc +6 -0
- data/python/README.adoc +211 -0
- data/python/config/cbhg.yml +1 -1
- data/python/config/test_cbhg.yml +51 -0
- data/python/dataset.py +23 -31
- data/python/diacritization_model_to_onnx.py +216 -15
- data/python/diacritizer.py +35 -31
- data/python/log_dir/CA_MSA.base.cbhg/models/README.adoc +2 -0
- data/python/log_dir/README.adoc +1 -0
- data/python/{requirement.txt → requirements.txt} +1 -1
- data/python/setup.py +32 -0
- data/python/trainer.py +10 -4
- data/python/util/reconcile_original_plus_diacritized.py +2 -0
- data/python/util/text_cleaners.py +59 -4
- data/rababa.gemspec +1 -1
- data/test-datasets/data-arabic-pointing/{Readme.md → README.adoc} +2 -1
- metadata +22 -18
- data/.github/workflows/main.yml +0 -18
- data/README.md +0 -73
- data/lib/README.md +0 -82
- data/models-data/README.md +0 -6
- data/python/README.md +0 -163
- data/python/log_dir/CA_MSA.base.cbhg/models/Readme.md +0 -2
- data/python/log_dir/README.md +0 -1
data/exe/rababa
CHANGED
@@ -10,7 +10,7 @@ parser = Rababa.parser
 
 config_path = parser.has_key?(:config) ? parser[:config] : "config/model.yml"
 
-diacritizer = Rababa::Diacritizer.new(parser[:model_path], config_path)
+diacritizer = Rababa::Diacritizer.new(parser[:model_path], YAML.load_file(config_path))
 
 if parser.has_key?(:text)
   # run diacritization text if has :text
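With this change the CLI loads the YAML config itself and hands the resulting Hash to the constructor. A minimal standalone sketch of that loading step (the two keys mirror the `max_len` and `batch_size` entries the diacritizer reads; the real `config/model.yml` may hold more):

```ruby
require 'yaml'
require 'tempfile'

# Hypothetical minimal config with the keys the Diacritizer reads.
file = Tempfile.new(['model', '.yml'])
file.write("max_len: 200\nbatch_size: 32\n")
file.close

config = YAML.load_file(file.path)
# The CLI now passes this Hash along:
#   Rababa::Diacritizer.new(parser[:model_path], config)
puts config['max_len']    # => 200
puts config['batch_size'] # => 32
```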
data/lib/README.adoc
ADDED
@@ -0,0 +1,95 @@
= Arabic Diacritization in Ruby with Rababa

== Try out Rababa

* Install the gems listed below
* Download a Ruby model from the https://github.com/secryst/rababa-models[releases]

== Usage

=== Install

[source,sh]
----
gem install rababa
----

=== Download the ONNX model

Please download the `diacritization_model_max_len_200.onnx` model file
from https://github.com/secryst/rababa-models/releases/tag/0.1.

=== Running examples

One can diacritize either single strings:

[source,sh]
----
rababa -t 'قطر' -m diacritization_model_max_len_200.onnx
# or when inside the gem directory during development
bundle exec exe/rababa -t 'قطر' -m diacritization_model_max_len_200.onnx
----

Or files such as `data/example.txt` or your own Arabic file (the maximum string
length is fixed by the model and has to match the `max_len` parameter in
`config/model.yml`):

[source,sh]
----
rababa -f data/example.txt -m diacritization_model_max_len_200.onnx
# or when inside the gem directory during development
bundle exec exe/rababa -f data/example.txt -m diacritization_model_max_len_200.onnx
----

In general, Arabic texts have to be preprocessed before running Rababa.
Sentence beginnings can, for instance, be handled with
https://github.com/Hamza5/Pipeline-diacritizer[Hamza5's Pipeline-diacritizer]:

----
python __main__.py preprocess source destination
----

== Training

=== ONNX Models

They can either be built in the `/python` directory or downloaded from the
https://github.com/secryst/rababa-models[releases].

An ONNX model can also be generated by running the Python
https://github.com/interscript/rababa/blob/main/python/diacritization_model_to_onnx.py[conversion script]
in this library. This requires going through some of the steps described in the
link above.

=== Parameters

* Text to diacritize: `-t TEXT`, `--text=TEXT`
* Path to a file to diacritize: `-f FILE`, `--text_filename=FILE`
* Path to the ONNX model (*mandatory*): `-m MODEL`, `--model_file=MODEL`
* Path to the config file (default: `config/model.yml`): `-c CONFIG`, `--config=CONFIG`

=== Config

==== Key parameters

* `max_len`: `200` -- `600`

** Parameter that has to match the ONNX model built using the
https://github.com/interscript/rababa/blob/main/python/diacritization_model_to_onnx.py[code]
and following the python/README.adoc.

** Longer sentences will need to be preprocessed, which can be done for
instance using https://github.com/Hamza5[Hamza5]'s
https://github.com/Hamza5/Pipeline-diacritizer/blob/master/pipeline_diacritizer/pipeline_diacritizer.py[code].

** The smaller the value, the faster the neural network runs.

* `text_encoder`, corresponding to the https://github.com/interscript/rababa/blob/main/python/util/text_encoders.py[encoding rules]:
** `BasicArabicEncoder`
** `ArabicEncoderWithStartSymbol`

* `text_cleaner`, corresponding to the https://github.com/interscript/rababa/blob/main/python/util/text_cleaners.py[cleaning logic]:
** `basic_cleaners`: collapse redundant whitespace and strip the string
** `valid_arabic_cleaners`: basic cleaning plus filtering to Arabic-only words
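The `basic_cleaners` behavior described in the README (collapse redundant whitespace, then strip) can be illustrated with a tiny sketch; `basic_clean` is a hypothetical helper for illustration, not the gem's implementation:

```ruby
# Sketch of what basic_cleaners does per the description above:
# collapse runs of whitespace into one space, then strip the ends.
def basic_clean(text)
  text.gsub(/\s+/, ' ').strip
end

puts basic_clean("  foo   bar ") # => "foo bar"
```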
data/lib/rababa/diacritizer.rb
CHANGED
@@ -1,6 +1,6 @@
 
 # this refers to:
-# https://github.com/interscript/rababa/blob/
+# https://github.com/interscript/rababa/blob/main/python/diacritizer.py
 # as well a drastic simplification of
 # https://github.com/almodhfer/Arabic_Diacritization/blob/master/config_manager.py
 
@@ -16,27 +16,33 @@ module Rababa
     include Rababa::Harakats
     include Rababa::Reconcile
 
-    def initialize(onnx_model_path,
+    def initialize(onnx_model_path, config)
 
       # load inference model from model_path
      @onnx_session = OnnxRuntime::InferenceSession.new(onnx_model_path)
 
      # load config
-      @config =
+      @config = config
      @max_length = @config['max_len']
      @batch_size = @config['batch_size']
 
      # instantiate encoder's class
-      @encoder = get_text_encoder
+      @encoder = get_text_encoder
      @start_symbol_id = @encoder.start_symbol_id
 
    end
 
    # preprocess text into indices
    def preprocess_text(text)
-      #if (text.length > @max_length)
-      #
-      #end
+      # if (text.length > @max_length)
+      #   raise ValueError.new('text length larger than max_length')
+      # end
+      # hack in absence of preprocessing!
+      if text.length > @max_length
+        text = text[0..@max_length]
+        warn('WARNING:: string cut length > #{@max_length},\n')
+        warn('text:: '+text)
+      end
 
      text = @encoder.clean(text)
      text = remove_diacritics(text)
@@ -47,6 +53,8 @@ module Rababa
 
    # Diacritize single arabic strings
    def diacritize_text(text)
+      """Diacritize single arabic strings"""
+      text = text.strip()
      seq = preprocess_text(text)
 
      # initialize onnx computation
@@ -66,7 +74,7 @@ module Rababa
    def diacritize_file(path)
      texts = []
      File.open(path).each do |line|
-        texts.push(line.chomp)
+        texts.push(line.chomp.strip())
      end
 
      # process batches
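The truncation hack added to `preprocess_text` can be sketched standalone as follows; `truncate_to_max` is a hypothetical helper for illustration, mirroring the gem's inclusive `text[0..@max_length]` slice (which keeps `max_length + 1` characters):

```ruby
# Standalone sketch of the new behavior: strings longer than max_length
# are cut with a warning instead of raising.
def truncate_to_max(text, max_length)
  if text.length > max_length
    warn("WARNING:: string cut, length > #{max_length}")
    text = text[0..max_length] # inclusive range, as in the gem
  end
  text
end

puts truncate_to_max('abcdef', 4) # => "abcde"
puts truncate_to_max('abc', 4)    # => "abc"
```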
data/lib/rababa/encoders.rb
CHANGED
@@ -1,8 +1,8 @@
 """
 corresponds to:
-https://github.com/interscript/rababa/blob/
+https://github.com/interscript/rababa/blob/main/python/util/text_encoders.py
 and
-https://github.com/interscript/rababa/blob/
+https://github.com/interscript/rababa/blob/main/python/util/text_cleaners.py
 """
 
 require_relative "arabic_constants"
data/lib/rababa/harakats.rb
CHANGED
data/lib/rababa/reconcile.rb
CHANGED
@@ -31,7 +31,7 @@ module Rababa::Reconcile
    (idx_ori..d_original.length).each {|i|
      if (c_dia == d_original[i])
        idx_ori = i
-        l_map
+        l_map << [idx_dia, idx_ori]
        break
      end
    }
@@ -99,35 +99,3 @@ module Rababa::Reconcile
    end
 
 end
-
-
-"""TESTS
-TODO: MOVE TO RSPEC
-d_tests = [{'original' => '# گيله پسمير الجديد 34',
-            'diacritized' => 'يَلِهُ سُمِيْرٌ الجَدِيدُ',
-            'reconciled' => '# گيَلِهُ پسُمِيْرٌ الجَدِيدُ 34' },
-
-           {'original' => 'abc',
-            'diacritized' => '',
-            'reconciled' => 'abc'},
-
-           {'original' => '‘Iz. Ibrāhīm as-Sa‘danī',
-            'diacritized' => '',
-            'reconciled' => '‘Iz. Ibrāhīm as-Sa‘danī'},
-
-           {'original' => '26 سبتمبر العقبة',
-            'diacritized' => 'سَبْتَمْبَرِ العَقَبَة',
-            'reconciled' => '26 سَبْتَمْبَرِ العَقَبَة'}]
-
-d_tests.each {|d| \
-  if not d['reconciled']==reconcile_strings(d['original'], d['diacritized'])
-    raise Exception.new('reconcile string not matched')
-  end
-}
-
-or:
-for s in '# گيله پسمير الجديد 34' 'abc' '‘Iz. Ibrāhīm as-Sa‘danī' '26 سبتمبر العقبة'
-do;
-  ruby rababa.rb -t $s -m '../models-data/diacritization_model.onnx'
-done
-"""
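The fixed line appends each matched pair of indices to the map between the diacritized and the original string. A simplified standalone sketch of that matching loop (`build_index_map` is hypothetical; the real reconcile code also handles diacritic marks and non-Arabic characters):

```ruby
# For each character of the diacritized string, scan forward in the
# original string and record the index pair at the first match.
def build_index_map(diacritized, original)
  l_map = []
  idx_ori = 0
  diacritized.chars.each_with_index do |c_dia, idx_dia|
    (idx_ori...original.length).each do |i|
      if c_dia == original[i]
        idx_ori = i
        l_map << [idx_dia, idx_ori] # the appended pair, as in the fix above
        break
      end
    end
  end
  l_map
end

p build_index_map('ab', 'xaby') # => [[0, 1], [1, 2]]
```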
data/lib/rababa/version.rb
CHANGED
data/python/README.adoc
ADDED
@@ -0,0 +1,211 @@
= Rababa Python for diacritization

== Purpose

Rababa Python is used for both:

* Training of the Rababa diacritization models
* Conversion of non-diacritized Arabic into diacritized Arabic
  (i.e. running the Rababa diacritization models)

== Introduction

Rababa uses deep learning models for recovering Arabic language diacritics.

Rababa implements the models described in the paper
https://ieeexplore.ieee.org/document/9274427[Effective Deep Learning Models for Automatic Diacritization of Arabic Text]
and builds on the implementation from
https://github.com/almodhfer/Arabic_Diacritization[almodhfer],
which we have selected for this project from a list of alternatives listed in
the README.

Out of the four models that https://github.com/almodhfer[almodhfer] has
implemented, we selected the simplest and most performant ones:

* The baseline model (`baseline`): consists of 3 bidirectional LSTM layers with
optional batch norm layers.

* The CBHG model (`cbhg`): uses only the encoder of the Tacotron-based model
with optional post LSTM and batch norm layers.

== Usage

=== Prerequisites

Python version: 3.6+.

Set up dependencies with:

[source,bash]
----
pip install -r requirements.txt
----

=== Quickstart

. Set up the prerequisites.

. Download the released model
https://github.com/secryst/rababa-models/releases/download/0.1/2000000-snapshot.pt[here]
and place it under `python/log_dir/CA_MSA.base.cbhg/models/2000000-snapshot.pt`.

. Single sentences and text files can now be diacritized as below:

[source,bash]
----
python diacritize.py --model_kind "cbhg" --config config/cbhg.yml --text 'قطر'
python diacritize.py --model_kind "cbhg" --config config/cbhg.yml --text_file relative_path_to_text_file
----

The maximal string length is set in the configs at `600`.

Longer lines need to be broken down, for instance using the library
introduced in the link:../lib/README.adoc[Ruby quickstart] section.

== Training

=== Datasets

* We have chosen the "Tashkeela processed" corpus of ~2,800,000 sentences:
** https://github.com/interscript/rababa-tashkeela

Other datasets are discussed in the reviewed literature and in the article
referenced above.

For training, data needs to be stored in the `data/CA_MSA` directory in the
following format:

[source,bash]
----
> ls data/CA_MSA/*
--> data/CA_MSA/eval.csv data/CA_MSA/train.csv data/CA_MSA/test.csv
----

For instance:

[source,bash]
----
mkdir -p data/CA_MSA
cd data
curl -sSL https://github.com/interscript/rababa-tashkeela/archive/refs/tags/v1.0.zip -o tashkeela.zip
unzip tashkeela.zip
for d in `ls rababa-tashkeela-1.0/tashkeela_val/*`; do cat $d >> CA_MSA/eval.csv; done
for d in `ls rababa-tashkeela-1.0/tashkeela_train/*`; do cat $d >> CA_MSA/train.csv; done
for d in `ls rababa-tashkeela-1.0/tashkeela_test/*`; do cat $d >> CA_MSA/test.csv; done
----

Alternatively, the dataset can be downloaded from
https://github.com/interscript/rababa-tashkeela[rababa-tashkeela].

=== Load Model

Alternatively, trained CBHG models are available under the
https://github.com/secryst/rababa-models[releases].

Models are to be copied, as specified in the link just above, under:

[source,bash]
----
> log_dir/CA_MSA.base.cbhg/models/2000000-snapshot.pt
----

=== Config Files

One can adjust the model configurations in the `/config` directory.

The model configurations cover the layers, but also the dataset to be used
and various other options.

The configuration files are passed explicitly to the applications below.

=== Data Preprocessing

The original work cited above allows for both raw and preprocessed data.

We go for the simplest, raw version here:

- As mentioned above, the corpus must have `test.csv`, `train.csv`, and
  `eval.csv`.

- Specify in the config that the data is not preprocessed.
  In that case, each batch will be processed and the text and diacritics
  will be extracted from the original text.

- You also have to specify the text encoder and the cleaner functions.
  Two text encoders are included: `BasicArabicEncoder` and
  `ArabicEncoderWithStartSymbol`.

Moreover, we have one cleaning function, `valid_arabic_cleaners`, which removes
all characters except valid Arabic ones: Arabic letters, punctuation, and
diacritics.

=== Training

All model configs are placed in the `config` directory.

[source,bash]
----
python train.py --model "cbhg" --config config/cbhg.yml
----

The model will report the WER and DER while training, using the
`diacritization_evaluation` package. The frequency of calculating WER and
DER can be specified in the config file.

=== Testing

Testing is done in the same way as training.
For instance, with the CBHG model on the data in `/data/CA_MSA/test.csv`:

[source,bash]
----
python test.py --model 'cbhg' --config config/cbhg.yml
----

The model will load the last saved model unless one is specified in the config
under `test_data_path`. The test file is expected to carry the correct
diacritization!

If the test file name is different than `test.csv`, you can set it in the
config under `test_file_name`.

=== Diacritize text or files

Single sentences or files can be processed. The code outputs the diacritized
text or lines.

[source,bash]
----
python diacritize.py --model_kind "cbhg" --config config/cbhg.yml --text 'قطر'
python diacritize.py --model_kind "cbhg" --config config/cbhg.yml --text_file relative_path_to_text_file
----

=== Convert the CBHG Python model to ONNX

The last model stored during training is chosen automatically and the ONNX
model is saved to a hardcoded location:

* `../models-data/diacritization_model.onnx`

==== Run

[source,bash]
----
python diacritization_model_to_onnx.py
----

==== Important parameters

They are hardcoded at the beginning of the script:

* `max_len`:
** Must match the string length; the initial model value is given in the config.
** This parameter allows tuning the model speed and size!
** The Ruby link:../lib/README.adoc[README] points to resources for preprocessing.

* `batch_size`:
** The value is given by the original model and its training.
** This constrains how the ONNX model can be put in production:
... if > 1, single lines involve redundant computations
... if > 1, files are processed in batches.
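The fixed-batch constraint noted in the last bullets can be sketched as follows: inputs are padded up to a multiple of the batch size, so a single line still fills a whole batch. `pad_batch` is a hypothetical illustration, not part of either codebase:

```ruby
# Pad the list of lines with empty strings up to a multiple of batch_size,
# then group into batches; a lone sentence still occupies a full batch.
def pad_batch(lines, batch_size)
  padded = lines.dup
  padded << '' while padded.length % batch_size != 0
  padded.each_slice(batch_size).to_a
end

p pad_batch(['one line'], 4) # => [["one line", "", "", ""]]
```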