rababa 0.1.0 → 0.1.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/.github/workflows/python.yml +81 -0
- data/.github/workflows/release.yml +36 -0
- data/.github/workflows/ruby.yml +27 -0
- data/.gitignore +3 -0
- data/.rubocop.yml +1 -1
- data/CODE_OF_CONDUCT.md +13 -13
- data/README.adoc +80 -0
- data/Rakefile +1 -1
- data/docs/{research-arabic-diacritization-06-2021.md → research-arabic-diacritization-06-2021.adoc} +52 -37
- data/exe/rababa +1 -1
- data/lib/README.adoc +95 -0
- data/lib/rababa/diacritizer.rb +16 -8
- data/lib/rababa/encoders.rb +2 -2
- data/lib/rababa/harakats.rb +1 -1
- data/lib/rababa/reconcile.rb +1 -33
- data/lib/rababa/version.rb +1 -1
- data/models-data/README.adoc +6 -0
- data/python/README.adoc +211 -0
- data/python/config/cbhg.yml +1 -1
- data/python/config/test_cbhg.yml +51 -0
- data/python/dataset.py +23 -31
- data/python/diacritization_model_to_onnx.py +216 -15
- data/python/diacritizer.py +35 -31
- data/python/log_dir/CA_MSA.base.cbhg/models/README.adoc +2 -0
- data/python/log_dir/README.adoc +1 -0
- data/python/{requirement.txt → requirements.txt} +1 -1
- data/python/setup.py +32 -0
- data/python/trainer.py +10 -4
- data/python/util/reconcile_original_plus_diacritized.py +2 -0
- data/python/util/text_cleaners.py +59 -4
- data/rababa.gemspec +1 -1
- data/test-datasets/data-arabic-pointing/{Readme.md → README.adoc} +2 -1
- metadata +22 -18
- data/.github/workflows/main.yml +0 -18
- data/README.md +0 -73
- data/lib/README.md +0 -82
- data/models-data/README.md +0 -6
- data/python/README.md +0 -163
- data/python/log_dir/CA_MSA.base.cbhg/models/Readme.md +0 -2
- data/python/log_dir/README.md +0 -1
data/models-data/README.md
DELETED
data/python/README.md
DELETED
@@ -1,163 +0,0 @@
# Diacritization Model

## Try out Rababa

* Download the torch model from /Assets at [releases](https://github.com/secryst/rababa-models/releases)
* Put the model under `python/log_dir/CA_MSA.base.cbhg/models/2000000-snapshot.pt`
* Single sentences and text files can then be diacritized as below:

```bash
python diacritize.py --model_kind "cbhg" --config config/cbhg.yml --text 'قطر'
python diacritize.py --model_kind "cbhg" --config config/cbhg.yml --text_file relative_path_to_text_file
```

The maximal string length is set in the configs at 600.
Longer lines need to be broken down, for instance using the library introduced in the Ruby "Try out" section: ../lib/README.md
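A minimal splitting sketch (an illustrative assumption, not project code; the Ruby library linked above provides proper preprocessing):

```python
# Sketch only: break long lines on whitespace so each chunk stays
# below the 600-character limit set in the configs.
MAX_LEN = 600

def split_line(line, max_len=MAX_LEN):
    chunks, current, length = [], [], 0
    for word in line.split():
        if current and length + len(word) + 1 > max_len:
            chunks.append(" ".join(current))
            current, length = [], 0
        current.append(word)
        length += len(word) + 1  # word plus one joining space
    if current:
        chunks.append(" ".join(current))
    return chunks
```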
## Core: Python Deep Learning models for recovering Arabic language diacritics

We refer here to the [code](https://github.com/almodhfer/Arabic_Diacritization) and to
[Effective Deep Learning Models for Automatic Diacritization of Arabic Text](https://ieeexplore.ieee.org/document/9274427),
which we selected for this project from the alternatives listed in the docs readme.

Out of the four models that [almodhfer](https://github.com/almodhfer) has
implemented, we selected the simplest and most performant ones:

- The baseline model (`baseline`): consists of 3 bidirectional LSTM layers with
  optional batch norm layers.

- The CBHG model (`cbhg`): uses only the encoder of the Tacotron-based model, with
  optional post-LSTM and batch norm layers.
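For orientation, the baseline's shape can be sketched in PyTorch roughly as below. This is an illustrative assumption, not the project's actual code; the layer sizes and output head are invented for the example:

```python
import torch.nn as nn

class BaselineSketch(nn.Module):
    """Illustrative sketch of a 3-layer bidirectional LSTM tagger.
    Sizes and the linear head are assumptions, not the real model."""

    def __init__(self, vocab_size, n_diacritics, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # 3 stacked bidirectional LSTM layers, as in the baseline
        self.lstm = nn.LSTM(dim, dim, num_layers=3,
                            bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * dim, n_diacritics)

    def forward(self, char_ids):
        hidden, _ = self.lstm(self.embed(char_ids))
        # one diacritic class per input character
        return self.head(hidden)
```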
### Python Version & Dependencies

- version: 3.6
- dependencies:

```bash
pip install -r requirement.txt
```

### Datasets

- We have chosen the Tashkeela corpus, ~2,800,000 sentences:
  * [sourceforge](https://sourceforge.net/projects/tashkeela-processed/)

Other datasets are discussed in the reviewed literature and in the article referenced above.

```bash
mkdir data
mkdir data/CA_MSA
```

For training, the data needs to be in the following format:

```bash
> ls data/CA_MSA/*
--> data/CA_MSA/eval.csv data/CA_MSA/train.csv data/CA_MSA/test.csv
```

For instance:

```bash
unzip data.zip
for d in tashkeela_val/*; do cat "$d" >> data/CA_MSA/eval.csv; done
for d in tashkeela_train/*; do cat "$d" >> data/CA_MSA/train.csv; done
for d in tashkeela_test/*; do cat "$d" >> data/CA_MSA/test.csv; done
```

### Load Model

Alternatively, trained CBHG models are available under
[releases](https://github.com/secryst/rababa-models).
Models are to be copied, as specified in the link just above, under:

> log_dir/CA_MSA.base.cbhg/models/2000000-snapshot.pt

### Config Files

One can adjust the model configurations in the `/config` directory.

The model configurations cover the layers, but also the dataset to be used
and various other options.

The configuration files are passed explicitly to the applications below.
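To illustrate how a run picks up these options, a config can be loaded and adjusted before being passed to the scripts below. This is a sketch: the `max_len` key and the output file name are assumptions; the authoritative option names are those in `config/cbhg.yml`.

```python
import yaml  # PyYAML

# Load the CBHG configuration and tweak one (assumed) option.
with open("config/cbhg.yml") as f:
    config = yaml.safe_load(f)

print(sorted(config))    # list the available option names
config["max_len"] = 300  # hypothetical override

with open("config/custom_cbhg.yml", "w") as f:
    yaml.safe_dump(config, f, allow_unicode=True)
```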
### Data Preprocessing

The original work cited above allows for both raw and preprocessed data.

We go for the simplest, raw version here:

- As mentioned above, the corpus must have test.csv, train.csv, and valid.csv.

- Specify in the config that the data is not preprocessed.
  In that case, each batch will be processed and the text and diacritics
  will be extracted from the original text.

- You also have to specify the text encoder and the cleaner functions.
  Two text encoders are included: BasicArabicEncoder and ArabicEncoderWithStartSymbol.

Moreover, we have one cleaning function, valid_arabic_cleaners, which removes
all characters except valid Arabic ones: Arabic letters, punctuation, and diacritics.
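A quick way to see the cleaner's effect. The import path and signature here are assumptions based on the file and function names above; check util/text_cleaners.py for the actual interface.

```python
# Sketch: strip everything except valid Arabic characters before encoding.
# Assumes valid_arabic_cleaners(text) -> str lives in util/text_cleaners.py.
from util.text_cleaners import valid_arabic_cleaners

raw = "قطر abc 123"
print(valid_arabic_cleaners(raw))  # expected: only the Arabic text remains
```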
### Training

All model configs are placed in the config directory.

```bash
python train.py --model model_name --config config/config_name.yml
```

The model will report the WER and DER while training, using the
diacritization_evaluation package. The frequency of calculating WER and
DER can be specified in the config file.

### Testing

Testing is done in the same way as training.
For instance, with the CBHG model on the data in `/data/CA_MSA/test.csv`:

```bash
python test.py --model 'cbhg' --config config/cbhg.yml
```

The model will load the last saved model unless you specify one in the config
via `test_data_path`. The test file is expected to have the correct diacritization!

If the test file name is different from `test.csv`, you
can set it in the config under `test_file_name`.

### "Diacritize" Text or Files

Single sentences or files can be processed. The code outputs the diacritized
text or lines.

```bash
python diacritize.py --model_kind "cbhg" --config config/cbhg.yml --text 'قطر'
python diacritize.py --model_kind "cbhg" --config config/cbhg.yml --text_file relative_path_to_text_file
```

### Convert the CBHG Python model to ONNX

The last model stored during training is chosen automatically, and the ONNX model
is saved to a hardcoded location: `../models-data/diacritization_model.onnx`

#### Run

```bash
python diacritization_model_to_onnx.py
```

#### Important parameters

They are hardcoded at the beginning of the script:

* `max_len`:
  * max string length; the initial model value is given in the config.
  * this parameter allows tuning the model speed and size!
  * the Ruby ../lib/README.md points to resources for preprocessing.

* `batch_size`:
  * the value is given by the original model and its training.
  * this constrains how the ONNX model can be put into production:
    1. if > 1, single lines involve redundant computations;
    2. if > 1, files are processed in batches.
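To make the batch_size constraint concrete, here is a hedged onnxruntime sketch. The single int64 input and the fixed (batch_size, max_len) shape are assumptions about the export; inspect the model for the real names and shapes.

```python
import numpy as np
import onnxruntime as ort

# Assumes one int64 input whose shape was fixed at export time.
session = ort.InferenceSession("../models-data/diacritization_model.onnx")
inp = session.get_inputs()[0]
batch_size, max_len = inp.shape

# A single encoded line must still be padded to the full fixed shape,
# which is why batch_size > 1 wastes work on single-line inputs.
batch = np.zeros((batch_size, max_len), dtype=np.int64)
# batch[0, :n] = ...  # character ids of the one real line go in row 0

outputs = session.run(None, {inp.name: batch})
print(outputs[0].shape)  # per-character diacritic scores
```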
data/python/log_dir/README.md
DELETED
@@ -1 +0,0 @@
### Model storage directory for training and inference