rababa 0.1.0 → 0.1.1
- checksums.yaml +4 -4
- data/.github/workflows/python.yml +81 -0
- data/.github/workflows/release.yml +36 -0
- data/.github/workflows/ruby.yml +27 -0
- data/.gitignore +3 -0
- data/.rubocop.yml +1 -1
- data/CODE_OF_CONDUCT.md +13 -13
- data/README.adoc +80 -0
- data/Rakefile +1 -1
- data/docs/{research-arabic-diacritization-06-2021.md → research-arabic-diacritization-06-2021.adoc} +52 -37
- data/exe/rababa +1 -1
- data/lib/README.adoc +95 -0
- data/lib/rababa/diacritizer.rb +16 -8
- data/lib/rababa/encoders.rb +2 -2
- data/lib/rababa/harakats.rb +1 -1
- data/lib/rababa/reconcile.rb +1 -33
- data/lib/rababa/version.rb +1 -1
- data/models-data/README.adoc +6 -0
- data/python/README.adoc +211 -0
- data/python/config/cbhg.yml +1 -1
- data/python/config/test_cbhg.yml +51 -0
- data/python/dataset.py +23 -31
- data/python/diacritization_model_to_onnx.py +216 -15
- data/python/diacritizer.py +35 -31
- data/python/log_dir/CA_MSA.base.cbhg/models/README.adoc +2 -0
- data/python/log_dir/README.adoc +1 -0
- data/python/{requirement.txt → requirements.txt} +1 -1
- data/python/setup.py +32 -0
- data/python/trainer.py +10 -4
- data/python/util/reconcile_original_plus_diacritized.py +2 -0
- data/python/util/text_cleaners.py +59 -4
- data/rababa.gemspec +1 -1
- data/test-datasets/data-arabic-pointing/{Readme.md → README.adoc} +2 -1
- metadata +22 -18
- data/.github/workflows/main.yml +0 -18
- data/README.md +0 -73
- data/lib/README.md +0 -82
- data/models-data/README.md +0 -6
- data/python/README.md +0 -163
- data/python/log_dir/CA_MSA.base.cbhg/models/Readme.md +0 -2
- data/python/log_dir/README.md +0 -1
data/models-data/README.md
DELETED
data/python/README.md
DELETED
@@ -1,163 +0,0 @@
# Diacritization Model

## Try out Rababa

* Download the Torch model under /Assets at [releases](https://github.com/secryst/rababa-models/releases)
* Put the model under python/log_dir/CA_MSA.base.cbhg/models/2000000-snapshot.pt
* Single sentences and text can then be diacritized as below:

```bash
python diacritize.py --model_kind "cbhg" --config config/cbhg.yml --text 'قطر'
python diacritize.py --model_kind "cbhg" --config config/cbhg.yml --text_file relative_path_to_text_file
```

The maximal string length is set in the configs at 600.
Longer lines will need to be broken down, for instance using the library introduced in the Ruby "Try out" section: ../lib/README.md

## Core: Python Deep Learning models for recovering Arabic language diacritics

We refer here to the [code](https://github.com/almodhfer/Arabic_Diacritization) and
[Effective Deep Learning Models for Automatic Diacritization of Arabic Text](https://ieeexplore.ieee.org/document/9274427)
that we selected for this project from the alternatives listed in the
docs readme.

Out of the four models that [almodhfer](https://github.com/almodhfer) has
implemented, we selected the simplest and most performant ones:

- The baseline model (`baseline`): consists of 3 bidirectional LSTM layers with
  optional batch norm layers.

- The CBHG model (`cbhg`): uses only the encoder of the Tacotron-based model, with
  optional post-LSTM and batch norm layers.

### Python Version & Dependencies

- version: 3.6
- dependencies:

```bash
pip install -r requirement.txt
```

### Datasets

- We have chosen the Tashkeela corpus, ~2,800,000 sentences:
  * [sourceforge](https://sourceforge.net/projects/tashkeela-processed/)

Other datasets are discussed in the reviewed literature and in the article referenced above.

```bash
mkdir data
mkdir data/CA_MSA
```

For training, the data needs to be in the following format:

```bash
> ls data/CA_MSA/*
--> data/CA_MSA/eval.csv data/CA_MSA/train.csv data/CA_MSA/test.csv
```

For instance:

```bash
unzip data.zip
for d in tashkeela_val/*; do cat "$d" >> data/CA_MSA/eval.csv; done
for d in tashkeela_train/*; do cat "$d" >> data/CA_MSA/train.csv; done
for d in tashkeela_test/*; do cat "$d" >> data/CA_MSA/test.csv; done
```
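The shell loops above can equally be written in Python; a sketch using the same directory and file names (`concat_shards` is a hypothetical helper, not part of the codebase):

```python
from pathlib import Path

def concat_shards(src_dir, out_csv):
    """Append every shard file in src_dir to out_csv, mirroring the shell loop."""
    out_csv = Path(out_csv)
    out_csv.parent.mkdir(parents=True, exist_ok=True)
    with open(out_csv, "a", encoding="utf-8") as out:
        for shard in sorted(Path(src_dir).glob("*")):
            out.write(shard.read_text(encoding="utf-8"))

# Same split-to-file mapping as the shell example above.
for src, name in [("tashkeela_val", "eval.csv"),
                  ("tashkeela_train", "train.csv"),
                  ("tashkeela_test", "test.csv")]:
    concat_shards(src, Path("data/CA_MSA") / name)
```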
### Load Model

Alternatively, trained CBHG models are available under
[releases](https://github.com/secryst/rababa-models).
Models are to be copied, as specified in the link just above, under:

> log_dir/CA_MSA.base.cbhg/models/2000000-snapshot.pt

### Config Files

One can adjust the model configurations in the `/config` directory.

The model configurations cover the layers, but also the dataset to be used
and various other options.

The configuration files are passed explicitly to the applications below.

### Data Preprocessing

The original work cited above allows for both raw and preprocessed data.

We go for the simplest, raw version here:

- As mentioned above, the corpus must have test.csv, train.csv, and valid.csv.

- Specify in the config that the data is not preprocessed.
  In that case, each batch will be processed and the text and diacritics
  will be extracted from the original text.

- You also have to specify the text encoder and the cleaner functions.
  Two text encoders are included: BasicArabicEncoder and ArabicEncoderWithStartSymbol.

Moreover, we have one cleaning function, valid_arabic_cleaners, which removes
all characters except valid Arabic ones: Arabic letters, punctuation,
and diacritics.
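The README describes valid_arabic_cleaners only informally; a minimal sketch of such a cleaner (the exact character set kept by the real function is an assumption here):

```python
import re

# Keep the Arabic Unicode block (U+0600-U+06FF: letters, diacritics,
# Arabic punctuation and digits) plus whitespace; strip everything else.
# The real valid_arabic_cleaners may use a narrower character set.
_NON_ARABIC = re.compile(r"[^\u0600-\u06FF\s]")

def clean_arabic(text):
    """Strip non-Arabic characters and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", _NON_ARABIC.sub("", text)).strip()
```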
### Training

All model configs are placed in the config directory.

```bash
python train.py --model model_name --config config/config_name.yml
```

The model will report the WER and DER while training, using the
diacritization_evaluation package. The frequency of calculating WER and
DER can be specified in the config file.

### Testing

Testing is done in the same way as training.
For instance, with the CBHG model on the data in `/data/CA_MSA/test.csv`:

```bash
python test.py --model 'cbhg' --config config/cbhg.yml
```

The model will load the last saved model unless you specify one in the config:
`test_data_path`. The test file is expected to carry the correct diacritization!

If the test file name is different from `test.csv`, you
can set it in the config under `test_file_name`.

### "Diacritize" Text or Files

Single sentences or files can be processed. The code outputs the diacritized
text or lines.

```bash
python diacritize.py --model_kind "cbhg" --config config/cbhg.yml --text 'قطر'
python diacritize.py --model_kind "cbhg" --config config/cbhg.yml --text_file relative_path_to_text_file
```

### Convert the CBHG Python model to ONNX

The last model stored during training is automatically chosen, and the ONNX model
is saved to a hardcoded location: `../models-data/diacritization_model.onnx`

#### Run

```bash
python diacritization_model_to_onnx.py
```

#### Important parameters

They are hardcoded at the beginning of the script:

* `max_len`:
  * matches the string length; the initial model value is given in the config.
  * this parameter allows tuning the model speed and size!
  * the Ruby ../lib/README.md points to resources for preprocessing

* `batch_size`:
  * the value is given by the original model and its training.
  * this constrains how the ONNX model can be put into production:
    1. if > 1, single lines involve redundant computations
    2. if > 1, files are processed in batches.
data/python/log_dir/README.md
DELETED
@@ -1 +0,0 @@
### Model storage directory for training and inference