rababa 0.1.0 → 0.1.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/.github/workflows/python.yml +81 -0
- data/.github/workflows/release.yml +36 -0
- data/.github/workflows/ruby.yml +27 -0
- data/.gitignore +3 -0
- data/.rubocop.yml +1 -1
- data/CODE_OF_CONDUCT.md +13 -13
- data/README.adoc +80 -0
- data/Rakefile +1 -1
- data/docs/{research-arabic-diacritization-06-2021.md → research-arabic-diacritization-06-2021.adoc} +52 -37
- data/exe/rababa +1 -1
- data/lib/README.adoc +95 -0
- data/lib/rababa/diacritizer.rb +16 -8
- data/lib/rababa/encoders.rb +2 -2
- data/lib/rababa/harakats.rb +1 -1
- data/lib/rababa/reconcile.rb +1 -33
- data/lib/rababa/version.rb +1 -1
- data/models-data/README.adoc +6 -0
- data/python/README.adoc +211 -0
- data/python/config/cbhg.yml +1 -1
- data/python/config/test_cbhg.yml +51 -0
- data/python/dataset.py +23 -31
- data/python/diacritization_model_to_onnx.py +216 -15
- data/python/diacritizer.py +35 -31
- data/python/log_dir/CA_MSA.base.cbhg/models/README.adoc +2 -0
- data/python/log_dir/README.adoc +1 -0
- data/python/{requirement.txt → requirements.txt} +1 -1
- data/python/setup.py +32 -0
- data/python/trainer.py +10 -4
- data/python/util/reconcile_original_plus_diacritized.py +2 -0
- data/python/util/text_cleaners.py +59 -4
- data/rababa.gemspec +1 -1
- data/test-datasets/data-arabic-pointing/{Readme.md → README.adoc} +2 -1
- metadata +22 -18
- data/.github/workflows/main.yml +0 -18
- data/README.md +0 -73
- data/lib/README.md +0 -82
- data/models-data/README.md +0 -6
- data/python/README.md +0 -163
- data/python/log_dir/CA_MSA.base.cbhg/models/Readme.md +0 -2
- data/python/log_dir/README.md +0 -1
data/exe/rababa
CHANGED
@@ -10,7 +10,7 @@ parser = Rababa.parser
 
 config_path = parser.has_key?(:config) ? parser[:config] : "config/model.yml"
 
-diacritizer = Rababa::Diacritizer.new(parser[:model_path], config_path)
+diacritizer = Rababa::Diacritizer.new(parser[:model_path], YAML.load_file(config_path))
 
 if parser.has_key?(:text)
   # run diacritization text if has :text
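The call-site change above means `Rababa::Diacritizer.new` now expects the parsed config Hash rather than a file path. A minimal sketch of the new calling convention, using a hypothetical throwaway config file (the real `config/model.yml` ships with the gem and may carry more keys):

```ruby
require "yaml"
require "tempfile"

# Hypothetical minimal config mirroring config/model.yml (assumed keys)
CONFIG_YAML = <<~YAML
  max_len: 200
  batch_size: 32
YAML

def load_config(path)
  # The caller now parses the YAML itself and hands the resulting Hash
  # to the diacritizer, instead of passing the file path through
  YAML.load_file(path)
end

Tempfile.create(["model", ".yml"]) do |f|
  f.write(CONFIG_YAML)
  f.flush
  config = load_config(f.path)
  # Rababa::Diacritizer.new(model_path, config)  # the 0.1.1 call shape
  puts config["max_len"]   # => 200
end
```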
data/lib/README.adoc
ADDED
@@ -0,0 +1,95 @@
+= Arabic Diacritization in Ruby with Rababa
+
+== Try out Rababa
+
+* Install the gem as shown below
+* Download a Ruby model from the https://github.com/secryst/rababa-models[releases]
+
+== Usage
+
+=== Install
+
+[source,sh]
+----
+gem install rababa
+----
+
+=== Download the ONNX model
+
+Please download the `diacritization_model_max_len_200.onnx` model file
+from https://github.com/secryst/rababa-models/releases/tag/0.1.
+
+
+=== Running examples
+
+One can diacritize either single strings:
+
+[source,sh]
+----
+rababa -t 'قطر' -m diacritization_model_max_len_200.onnx
+# or, when inside the gem directory during development
+bundle exec exe/rababa -t 'قطر' -m diacritization_model_max_len_200.onnx
+----
+
+Or files, such as `data/example.txt` or your own Arabic file (the maximum
+string length is fixed in the model and has to match the `max_len` parameter
+in `config/models.yaml`):
+
+[source,sh]
+----
+rababa -f data/example.txt -m diacritization_model_max_len_200.onnx
+# or, when inside the gem directory during development
+bundle exec exe/rababa -f data/example.txt -m diacritization_model_max_len_200.onnx
+----
+
+Generic Arabic texts generally have to be preprocessed before running Rababa.
+Splitting text at sentence beginnings can be done, for instance, with
+https://github.com/Hamza5/Pipeline-diacritizer[Pipeline-diacritizer]:
+
+----
+python __main__.py preprocess source destination
+----
+
+== Training
+
+=== ONNX Models
+
+They can either be built in the `/python` directory or downloaded from the
+https://github.com/secryst/rababa-models[releases].
+
+An ONNX model can also be generated by running the Python
+https://github.com/interscript/rababa/blob/main/python/diacritization_model_to_onnx.py[script]
+in this library.
+
+This requires going through some of the steps described in the link above.
+
+=== Parameters
+
+* text to diacritize: `-t TEXT`, `--text=TEXT`
+* path to a file to diacritize: `-f FILE`, `--text_filename=FILE`
+* path to the ONNX model (**mandatory**): `-m MODEL`, `--model_file=MODEL`
+* path to the config file (default: `config/model.yml`): `-c CONFIG`, `--config=CONFIG`
+
+=== Config
+
+==== Key parameters
+
+* `max_len`: `200` -- `600`
+
+** Has to match the ONNX model built with the
+https://github.com/interscript/rababa/blob/main/python/diacritization_model_to_onnx.py[script],
+following python/README.adoc.
+
+** Longer sentences need to be preprocessed, which can be done, for
+instance, using https://github.com/Hamza5[Hamza5]'s
+https://github.com/Hamza5/Pipeline-diacritizer/blob/master/pipeline_diacritizer/pipeline_diacritizer.py[code].
+
+** The smaller the value, the faster the neural network runs.
+
+* `text_encoder`, corresponding to these https://github.com/interscript/rababa/blob/main/python/util/text_encoders.py[rules]:
+** `BasicArabicEncoder`
+** `ArabicEncoderWithStartSymbol`
+
+* `text_cleaner`, corresponding to this https://github.com/interscript/rababa/blob/main/python/util/text_cleaners.py[logic]:
+** `basic_cleaners`: collapse redundant whitespace and strip the string
+** `valid_arabic_cleaners`: basic cleaning plus keeping only valid Arabic characters
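The two cleaners listed above can be sketched as follows; this is an illustrative re-implementation of the described behaviour, not the gem's actual code (method names here are made up):

```ruby
# Sketch of `basic_cleaners`: collapse redundant whitespace and strip
def basic_clean(text)
  text.gsub(/\s+/, " ").strip
end

# Sketch of `valid_arabic_cleaners`: basic cleaning plus keeping only
# Arabic-script characters (and spaces), then re-collapsing whitespace
def valid_arabic_clean(text)
  basic_clean(text).gsub(/[^\p{Arabic}\s]/, "").gsub(/\s+/, " ").strip
end

puts basic_clean("  قطر   الجديد  ")    # => "قطر الجديد"
puts valid_arabic_clean("abc قطر 123")  # => "قطر"
```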
data/lib/rababa/diacritizer.rb
CHANGED
@@ -1,6 +1,6 @@
 
 # this refers to:
-# https://github.com/interscript/rababa/blob/
+# https://github.com/interscript/rababa/blob/main/python/diacritizer.py
 # as well a drastic simplification of
 # https://github.com/almodhfer/Arabic_Diacritization/blob/master/config_manager.py
 
@@ -16,27 +16,33 @@ module Rababa
     include Rababa::Harakats
     include Rababa::Reconcile
 
-    def initialize(onnx_model_path,
+    def initialize(onnx_model_path, config)
 
       # load inference model from model_path
      @onnx_session = OnnxRuntime::InferenceSession.new(onnx_model_path)
 
      # load config
-      @config =
+      @config = config
      @max_length = @config['max_len']
      @batch_size = @config['batch_size']
 
      # instantiate encoder's class
-      @encoder = get_text_encoder
+      @encoder = get_text_encoder
      @start_symbol_id = @encoder.start_symbol_id
 
    end
 
    # preprocess text into indices
    def preprocess_text(text)
-      #if (text.length > @max_length)
-      #
-      #end
+      # if (text.length > @max_length)
+      #   raise ValueError.new('text length larger than max_length')
+      # end
+      # hack in absence of preprocessing!
+      if text.length > @max_length
+        text = text[0..@max_length]
+        warn('WARNING:: string cut length > #{@max_length},\n')
+        warn('text:: '+text)
+      end
 
      text = @encoder.clean(text)
      text = remove_diacritics(text)
@@ -47,6 +53,8 @@ module Rababa
 
    # Diacritize single arabic strings
    def diacritize_text(text)
+      """Diacritize single arabic strings"""
+      text = text.strip()
      seq = preprocess_text(text)
 
      # initialize onnx computation
@@ -66,7 +74,7 @@ module Rababa
    def diacritize_file(path)
      texts = []
      File.open(path).each do |line|
-        texts.push(line.chomp)
+        texts.push(line.chomp.strip())
      end
 
      # process batches
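The new guard in `preprocess_text` truncates over-long input instead of raising. A standalone sketch of that logic (the constant is illustrative; note that `text[0..max]` is inclusive, so the gem's version keeps `max_len + 1` characters, and that `#{}` only interpolates inside double-quoted strings, so the diff's single-quoted `warn` prints the placeholder literally):

```ruby
MAX_LEN = 10 # stands in for @config['max_len']

def truncate_for_model(text, max_len = MAX_LEN)
  if text.length > max_len
    # mirror the gem's hack: cut the string and warn; double quotes are
    # used here so the #{} interpolation actually fires
    warn("WARNING:: string cut, length > #{max_len}")
    text = text[0..max_len] # inclusive range: keeps max_len + 1 chars
  end
  text
end

puts truncate_for_model("abcdefghijklmnop").length  # => 11
puts truncate_for_model("short")                    # => short
```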
data/lib/rababa/encoders.rb
CHANGED
@@ -1,8 +1,8 @@
 """
 corresponds to:
-https://github.com/interscript/rababa/blob/
+https://github.com/interscript/rababa/blob/main/python/util/text_encoders.py
 and
-https://github.com/interscript/rababa/blob/
+https://github.com/interscript/rababa/blob/main/python/util/text_cleaners.py
 """
 
 require_relative "arabic_constants"
data/lib/rababa/harakats.rb
CHANGED
data/lib/rababa/reconcile.rb
CHANGED
@@ -31,7 +31,7 @@ module Rababa::Reconcile
       (idx_ori..d_original.length).each {|i|
         if (c_dia == d_original[i])
           idx_ori = i
-          l_map
+          l_map << [idx_dia, idx_ori]
           break
         end
       }
@@ -99,35 +99,3 @@ module Rababa::Reconcile
   end
 
 end
-
-
-"""TESTS
-TODO: MOVE TO RSPEC
-d_tests = [{'original' => '# گيله پسمير الجديد 34',
-            'diacritized' => 'يَلِهُ سُمِيْرٌ الجَدِيدُ',
-            'reconciled' => '# گيَلِهُ پسُمِيْرٌ الجَدِيدُ 34' },
-
-           {'original' => 'abc',
-            'diacritized' => '',
-            'reconciled' => 'abc'},
-
-           {'original' => '‘Iz. Ibrāhīm as-Sa‘danī',
-            'diacritized' => '',
-            'reconciled' => '‘Iz. Ibrāhīm as-Sa‘danī'},
-
-           {'original' => '26 سبتمبر العقبة',
-            'diacritized' => 'سَبْتَمْبَرِ العَقَبَة',
-            'reconciled' => '26 سَبْتَمْبَرِ العَقَبَة'}]
-
-d_tests.each {|d| \
-  if not d['reconciled']==reconcile_strings(d['original'], d['diacritized'])
-    raise Exception.new('reconcile string not matched')
-  end
-}
-
-or:
-for s in '# گيله پسمير الجديد 34' 'abc' '‘Iz. Ibrāhīm as-Sa‘danī' '26 سبتمبر العقبة'
-do;
-  ruby rababa.rb -t $s -m '../models-data/diacritization_model.onnx'
-done
-"""
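The one-line fix above makes `l_map` actually accumulate `[diacritized_index, original_index]` pairs instead of discarding them. A self-contained sketch of that matching loop (a simplified stand-in for the reconcile logic, assuming characters appear in order):

```ruby
# Map each character of a diacritics-free string back to its position in
# the original string, skipping characters the model never saw (digits,
# Latin letters, etc.). Simplified version of the reconcile matching loop.
def index_map(original, bare)
  l_map = []
  idx_ori = 0
  bare.each_char.with_index do |c, idx_dia|
    (idx_ori...original.length).each do |i|
      if c == original[i]
        idx_ori = i
        l_map << [idx_dia, idx_ori]  # the accumulation restored by this diff
        break
      end
    end
  end
  l_map
end

p index_map("a1b2c", "abc")  # => [[0, 0], [1, 2], [2, 4]]
```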
data/lib/rababa/version.rb
CHANGED
data/python/README.adoc
ADDED
@@ -0,0 +1,211 @@
+= Rababa Python for diacritization
+
+== Purpose
+
+Rababa Python is used for both:
+
+* Training of the Rababa diacritization models
+* Conversion of non-diacritized Arabic into diacritized Arabic
+(i.e. running the Rababa diacritization models)
+
+== Introduction
+
+Rababa uses deep learning models for recovering Arabic language diacritics.
+
+Rababa implements the models described in the paper
+https://ieeexplore.ieee.org/document/9274427[Effective Deep Learning Models for Automatic Diacritization of Arabic Text] and builds on the implementation from
+https://github.com/almodhfer/Arabic_Diacritization[almodhfer],
+which we have selected for this project from the list of alternatives in
+the README.
+
+Out of the four models that https://github.com/almodhfer[almodhfer] has
+implemented, we selected the simplest and most performant ones:
+
+* The baseline model (`baseline`): consists of 3 bidirectional LSTM layers with
+optional batch norm layers.
+
+* The CBHG model (`cbhg`): uses only the encoder of the Tacotron-based model
+with optional post-LSTM and batch norm layers.
+
+
+== Usage
+
+=== Prerequisites
+
+Python version: 3.6+.
+
+Install dependencies with:
+
+[source,bash]
+----
+pip install -r requirements.txt
+----
+
+
+=== Quickstart
+
+. Set up the prerequisites
+
+. Download the released model from
+https://github.com/secryst/rababa-models/releases/download/0.1/2000000-snapshot.pt[here]
+and place it under `python/log_dir/CA_MSA.base.cbhg/models/2000000-snapshot.pt`
+
+. Single sentences and text files can now be diacritized as below:
+
+[source,bash]
+----
+python diacritize.py --model_kind "cbhg" --config config/cbhg.yml --text 'قطر'
+python diacritize.py --model_kind "cbhg" --config config/cbhg.yml --text_file relative_path_to_text_file
+----
+
+The maximal string length is set in the configs at `600`.
+
+Longer lines need to be broken down, for instance using the library
+introduced in the link:../lib/README.adoc[Ruby quickstart] section.
+
+
+== Training
+
+=== Datasets
+
+* We have chosen the "Tashkeela processed" corpus of ~2,800,000 sentences:
+** https://github.com/interscript/rababa-tashkeela
+
+Other datasets are discussed in the reviewed literature and in the article
+referenced above.
+
+For training, data needs to be stored in the `data/CA_MSA` directory in the
+following format:
+
+[source,bash]
+----
+> ls data/CA_MSA/*
+--> data/CA_MSA/eval.csv data/CA_MSA/train.csv data/CA_MSA/test.csv
+----
+
+For instance:
+
+[source,bash]
+----
+mkdir -p data/CA_MSA
+cd data
+curl -sSL https://github.com/interscript/rababa-tashkeela/archive/refs/tags/v1.0.zip -o tashkeela.zip
+unzip tashkeela.zip
+for d in `ls rababa-tashkeela-1.0/tashkeela_val/*`; do cat $d >> CA_MSA/eval.csv; done
+for d in `ls rababa-tashkeela-1.0/tashkeela_train/*`; do cat $d >> CA_MSA/train.csv; done
+for d in `ls rababa-tashkeela-1.0/tashkeela_test/*`; do cat $d >> CA_MSA/test.csv; done
+----
+
+Alternatively, the dataset can be downloaded from
+https://github.com/interscript/rababa-tashkeela[rababa-tashkeela].
+
+=== Load Model
+
+Trained CBHG models are also available under
+https://github.com/secryst/rababa-models[releases].
+
+Models are to be copied, as specified in the link just above, under:
+
+[source,bash]
+----
+> log_dir/CA_MSA.base.cbhg/models/2000000-snapshot.pt
+----
+
+
+=== Config Files
+
+One can adjust the model configurations in the `/config` directory.
+
+The model configurations cover the layers as well as the dataset to be used
+and various other options.
+
+The configuration files are passed explicitly to the applications below.
+
+=== Data Preprocessing
+
+The original work cited above allows for both raw and preprocessed data.
+
+We go for the simplest, raw version here:
+- As mentioned above, the corpus must have `test.csv`,
+`train.csv`, and `valid.csv`.
+
+- Specify in the config that the data is not preprocessed.
+In that case, each batch will be processed and the text and diacritics
+will be extracted from the original text.
+
+- You also have to specify the text encoder and the cleaner functions.
+Two text encoders are included: `BasicArabicEncoder` and
+`ArabicEncoderWithStartSymbol`.
+
+Moreover, we have one cleaning function, `valid_arabic_cleaners`, which removes
+all characters except valid Arabic ones: Arabic letters,
+punctuation, and diacritics.
+
+=== Training
+
+All model configs are placed in the config directory.
+
+[source,bash]
+----
+python train.py --model "cbhg" --config config/cbhg.yml
+----
+
+The model will report the WER and DER while training, using the
+`diacritization_evaluation` package. The frequency of calculating WER and
+DER can be specified in the config file.
+
+=== Testing
+
+Testing is done in the same way as training.
+For instance, with the CBHG model on the data in `/data/CA_MSA/test.csv`:
+
+[source,bash]
+----
+python test.py --model 'cbhg' --config config/cbhg.yml
+----
+
+The model will load the last saved model unless you specify one in the config
+via `test_data_path`. The test file is expected to carry correct diacritization!
+
+If the test file name is different from `test.csv`, you
+can set it in the config via `test_file_name`.
+
+=== Diacritize text or files
+
+Single sentences or files can be processed. The code outputs the diacritized
+text or lines.
+
+[source,bash]
+----
+python diacritize.py --model_kind "cbhg" --config config/cbhg.yml --text 'قطر'
+python diacritize.py --model_kind "cbhg" --config config/cbhg.yml --text_file relative_path_to_text_file
+----
+
+=== Convert the CBHG Python model to ONNX
+
+The last model stored during training is automatically chosen, and the ONNX
+model is saved to a hardcoded location:
+
+* `../models-data/diacritization_model.onnx`
+
+==== Run
+
+[source,bash]
+----
+python diacritization_model_to_onnx.py
+----
+
+==== Important parameters
+
+They are hardcoded at the beginning of the script:
+
+* `max_len`:
+** must match the string length; the initial model value is given in the config.
+** this parameter allows tuning the model speed and size!
+** the Ruby ../lib/README.adoc points to resources for preprocessing
+
+* `batch_size`:
+** the value is given by the original model and its training.
+** this constrains how the ONNX model can be put into production:
+*** if > 1, single lines involve redundant computations
+*** if > 1, files are processed in batches.
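The `batch_size` constraint noted above (a fixed-size batch baked into the exported ONNX model, so files are processed in batches and single lines trigger padded, redundant work) can be sketched as follows, here in Ruby for consistency with the gem; `make_batches` is an illustrative helper, not part of Rababa:

```ruby
# Group lines into fixed-size batches, padding the last batch so the
# model always receives batch_size inputs; a single line thus costs
# batch_size - 1 redundant slots, as noted above.
def make_batches(lines, batch_size)
  lines.each_slice(batch_size).map do |batch|
    batch + [""] * (batch_size - batch.size)  # pad with empty strings
  end
end

batches = make_batches(["one line"], 4)
p batches.size        # => 1
p batches.first.size  # => 4  (3 padded slots are redundant work)
```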