rababa 0.1.0 → 0.1.1

Files changed (41)
  1. checksums.yaml +4 -4
  2. data/.github/workflows/python.yml +81 -0
  3. data/.github/workflows/release.yml +36 -0
  4. data/.github/workflows/ruby.yml +27 -0
  5. data/.gitignore +3 -0
  6. data/.rubocop.yml +1 -1
  7. data/CODE_OF_CONDUCT.md +13 -13
  8. data/README.adoc +80 -0
  9. data/Rakefile +1 -1
  10. data/docs/{research-arabic-diacritization-06-2021.md → research-arabic-diacritization-06-2021.adoc} +52 -37
  11. data/exe/rababa +1 -1
  12. data/lib/README.adoc +95 -0
  13. data/lib/rababa/diacritizer.rb +16 -8
  14. data/lib/rababa/encoders.rb +2 -2
  15. data/lib/rababa/harakats.rb +1 -1
  16. data/lib/rababa/reconcile.rb +1 -33
  17. data/lib/rababa/version.rb +1 -1
  18. data/models-data/README.adoc +6 -0
  19. data/python/README.adoc +211 -0
  20. data/python/config/cbhg.yml +1 -1
  21. data/python/config/test_cbhg.yml +51 -0
  22. data/python/dataset.py +23 -31
  23. data/python/diacritization_model_to_onnx.py +216 -15
  24. data/python/diacritizer.py +35 -31
  25. data/python/log_dir/CA_MSA.base.cbhg/models/README.adoc +2 -0
  26. data/python/log_dir/README.adoc +1 -0
  27. data/python/{requirement.txt → requirements.txt} +1 -1
  28. data/python/setup.py +32 -0
  29. data/python/trainer.py +10 -4
  30. data/python/util/reconcile_original_plus_diacritized.py +2 -0
  31. data/python/util/text_cleaners.py +59 -4
  32. data/rababa.gemspec +1 -1
  33. data/test-datasets/data-arabic-pointing/{Readme.md → README.adoc} +2 -1
  34. metadata +22 -18
  35. data/.github/workflows/main.yml +0 -18
  36. data/README.md +0 -73
  37. data/lib/README.md +0 -82
  38. data/models-data/README.md +0 -6
  39. data/python/README.md +0 -163
  40. data/python/log_dir/CA_MSA.base.cbhg/models/Readme.md +0 -2
  41. data/python/log_dir/README.md +0 -1
data/models-data/README.md DELETED
@@ -1,6 +0,0 @@
- ### model data dir
-
- contains:
- ONNX data
- Pickle sample data
-
data/python/README.md DELETED
@@ -1,163 +0,0 @@
- # Diacritization Model
-
- ## Try out Rababa
- * Download the torch model under /Assets at [releases](https://github.com/secryst/rababa-models/releases)
- * Put the model under python/log_dir/CA_MSA.base.cbhg/models/2000000-snapshot.pt
- * Single sentences and text files can now be diacritized as below:
- ```bash
- python diacritize.py --model_kind "cbhg" --config config/cbhg.yml --text 'قطر'
- python diacritize.py --model_kind "cbhg" --config config/cbhg.yml --text_file relative_path_to_text_file
- ```
- The maximal string length is set in the configs at 600.
- Longer lines need to be broken down, for instance using the library introduced in the Ruby try-out section (../lib/README.md) or a chunking helper like the sketch below.
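A minimal sketch of such a chunking helper (hypothetical, not shipped with the repo; it assumes splitting on whitespace is acceptable for the input text):

```python
# chunk_text.py -- hypothetical helper, not part of the repo.
# Pre-splits text on whitespace into chunks of at most max_len
# characters, so each chunk fits the 600-character model limit.

def chunk_text(text: str, max_len: int = 600) -> list:
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= max_len:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = word[:max_len]  # degenerate case: one over-long token
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    import sys
    # Each printed line can then be written to a file and passed
    # to diacritize.py via --text_file.
    for chunk in chunk_text(sys.stdin.read()):
        print(chunk)
```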
-
- ## Core: Python Deep Learning models for recovering Arabic language diacritics
-
- We are referring here to the [code](https://github.com/almodhfer/Arabic_Diacritization) and
- [Effective Deep Learning Models for Automatic Diacritization of Arabic Text](https://ieeexplore.ieee.org/document/9274427)
- that we selected for this project from the list of alternatives in the
- docs readme.
-
- Out of the four models that [almodhfer](https://github.com/almodhfer) has
- implemented, we selected the simplest and most performant ones:
-
- - The baseline model (`baseline`): consists of 3 bidirectional LSTM layers with
- optional batch norm layers.
-
- - The CBHG model (`cbhg`): uses only the encoder of the Tacotron-based model with
- optional post-LSTM and batch norm layers.
-
- ### Python Version & Dependencies
-
- - version: 3.6
- - dependencies:
- ```bash
- pip install -r requirement.txt
- ```
-
- ### Datasets
-
- - We have chosen the Tashkeela corpus of ~2,800,000 sentences:
- * [sourceforge](https://sourceforge.net/projects/tashkeela-processed/)
-
- Other datasets are discussed in the reviewed literature or in the article referenced above.
-
- ```bash
- mkdir data
- mkdir data/CA_MSA
- ```
-
- For training, the data needs to be in the following format:
-
- ```bash
- > ls data/CA_MSA/*
- --> data/CA_MSA/eval.csv data/CA_MSA/train.csv data/CA_MSA/test.csv
- ```
-
- For instance:
-
- ```bash
- unzip data.zip
- for d in tashkeela_val/*; do cat "$d" >> data/CA_MSA/eval.csv; done
- for d in tashkeela_train/*; do cat "$d" >> data/CA_MSA/train.csv; done
- for d in tashkeela_test/*; do cat "$d" >> data/CA_MSA/test.csv; done
- ```
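The same CSV assembly can be done shell-independently (a small sketch assuming the same tashkeela_* directory layout as above):

```python
# build_csvs.py -- hypothetical helper mirroring the shell loops above.
from pathlib import Path

SPLITS = {"tashkeela_val": "eval.csv",
          "tashkeela_train": "train.csv",
          "tashkeela_test": "test.csv"}

out_dir = Path("data/CA_MSA")
out_dir.mkdir(parents=True, exist_ok=True)

for src_dir, csv_name in SPLITS.items():
    with open(out_dir / csv_name, "w", encoding="utf-8") as out:
        # Sort the shards for a deterministic concatenation order.
        for part in sorted(Path(src_dir).iterdir()):
            out.write(part.read_text(encoding="utf-8"))
```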
-
- ### Load Model
-
- Alternatively, trained CBHG models are available under
- [releases](https://github.com/secryst/rababa-models).
- Models are to be copied, as specified at the link just above, under:
- > log_dir/CA_MSA.base.cbhg/models/2000000-snapshot.pt
-
-
- ### Config Files
-
- One can adjust the model configurations in the `/config` directory.
-
- The configurations cover not only the model layers but also the dataset to be used
- and various other options.
-
- The configuration files are passed explicitly to the applications below.
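A config can also be inspected programmatically (a minimal sketch; it assumes only that the file is plain YAML, and the key names are taken from this README rather than from the actual file):

```python
# inspect_config.py -- minimal sketch; assumes config/cbhg.yml is
# plain YAML. Key names here come from this README and may differ
# from the actual file.
import yaml

with open("config/cbhg.yml", encoding="utf-8") as f:
    config = yaml.safe_load(f)

print(config.get("max_len"))                     # maximal string length
print(config.get("test_file_name", "test.csv"))  # test data file
```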
-
- ### Data Preprocessing
-
- The original work cited above allows for both raw and preprocessed data.
-
- We go for the simplest, raw version here:
- - As mentioned above, the corpus must have test.csv, train.csv, and valid.csv.
-
- - Specify in the config that the data is not preprocessed.
- In that case, each batch will be processed and the text and diacritics
- will be extracted from the original text.
-
- - You also have to specify the text encoder and the cleaner functions.
- Two text encoders are included: BasicArabicEncoder, ArabicEncoderWithStartSymbol.
-
- Moreover, we have one cleaning function, valid_arabic_cleaners, which removes
- all characters except valid Arabic ones: Arabic letters,
- punctuation, and diacritics (see the sketch below).
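To illustrate the behaviour, here is a toy re-implementation (NOT the repo's valid_arabic_cleaners; the exact character ranges it keeps are assumptions):

```python
# arabic_cleaner_sketch.py -- illustrative only; NOT the repo's
# valid_arabic_cleaners. The character ranges are assumptions.
import re

# Keep Arabic letters, tatweel, diacritics (tashkeel),
# Arabic punctuation, and spaces; drop everything else.
INVALID = re.compile(r"[^\u0621-\u063A\u0641-\u064A\u064B-\u0652\u0640،؛؟ ]")

def clean(text: str) -> str:
    # Remove disallowed characters, then collapse runs of whitespace.
    return re.sub(r"\s+", " ", INVALID.sub("", text)).strip()

print(clean("abc قِطْر 123!"))  # -> "قِطْر"
```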
-
- ### Training
-
- All model configs are placed in the config directory.
-
- ```bash
- python train.py --model model_name --config config/config_name.yml
- ```
-
- The model will report the WER and DER while training, using the
- diacritization_evaluation package. The frequency of calculating WER and
- DER can be specified in the config file.
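For intuition, DER (diacritic error rate) is the fraction of character slots whose predicted diacritic differs from the reference. A simplified toy version over pre-aligned sequences (the diacritization_evaluation package handles alignment and edge cases properly):

```python
# der_sketch.py -- toy diacritic error rate over two pre-aligned
# diacritic sequences; simplified compared to the real
# diacritization_evaluation package.
def der(ref_diacritics, hyp_diacritics):
    assert len(ref_diacritics) == len(hyp_diacritics), "must be aligned"
    errors = sum(r != h for r, h in zip(ref_diacritics, hyp_diacritics))
    return errors / len(ref_diacritics)

# 4 character slots, one wrong diacritic -> DER = 0.25
print(der(["FATHA", "KASRA", "", "SUKUN"],
          ["FATHA", "DAMMA", "", "SUKUN"]))
```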
-
- ### Testing
-
- Testing is done in the same way as training.
- For instance, with the CBHG model on the data in `/data/CA_MSA/test.csv`:
-
- ```bash
- python test.py --model 'cbhg' --config config/cbhg.yml
- ```
-
- The model will load the last saved model unless you specify one in the config
- (`test_data_path`). The test file is expected to carry the correct diacritization!
-
- If the test file name differs from `test.csv`, you
- can set it via the config key `test_file_name`.
-
- ### "Diacritize" Text or Files
-
- Single sentences or files can be processed. The output is the diacritized
- text or lines.
-
- ```bash
- python diacritize.py --model_kind "cbhg" --config config/cbhg.yml --text 'قطر'
- python diacritize.py --model_kind "cbhg" --config config/cbhg.yml --text_file relative_path_to_text_file
- ```
-
-
- ### Convert the CBHG Python model to ONNX
-
- The last model stored during training is automatically chosen, and the ONNX model
- is saved to a hardcoded location: `../models-data/diacritization_model.onnx`
-
- #### Run
-
- ```bash
- python diacritization_model_to_onnx.py
- ```
-
- #### Important parameters
-
- They are hardcoded at the beginning of the script (see the export sketch after this list):
-
- * `max_len`:
- * matches the string length; the initial model value is given in the config.
- * this parameter allows tuning the model speed and size!
- * the Ruby ../lib/README.md points to resources for preprocessing
-
- * batch_size:
- * the value is given by the original model and its training.
- * this constrains how the ONNX model can be put in production:
- 1. if > 1, single lines involve redundant computations
- 2. if > 1, files are processed in batches.
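The export step itself boils down to a torch.onnx.export call. A minimal sketch under assumed names (`TinyDiacritizer` is a stand-in, not the repo's CBHG model; the actual script, checkpoint loading, and shapes may differ):

```python
# onnx_export_sketch.py -- minimal sketch of the export step.
# TinyDiacritizer is a stand-in, NOT the repo's CBHG model; the
# batch_size/max_len values mirror the description above.
import torch
import torch.nn as nn

class TinyDiacritizer(nn.Module):
    """Stand-in: embed -> BiLSTM -> per-character diacritic logits."""
    def __init__(self, vocab=100, dims=64, n_diacritics=16):
        super().__init__()
        self.embed = nn.Embedding(vocab, dims)
        self.lstm = nn.LSTM(dims, dims, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * dims, n_diacritics)

    def forward(self, x):
        h, _ = self.lstm(self.embed(x))
        return self.out(h)

batch_size, max_len = 1, 600  # hardcoded, as described above

model = TinyDiacritizer().eval()
dummy = torch.zeros(batch_size, max_len, dtype=torch.long)
torch.onnx.export(
    model, dummy, "diacritization_model.onnx",
    input_names=["input"], output_names=["output"],
    opset_version=11,
)
```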
data/python/log_dir/CA_MSA.base.cbhg/models/Readme.md DELETED
@@ -1,2 +0,0 @@
- #### Put the model trained with CA_MSA here:
- 2000000-snapshot.pt
data/python/log_dir/README.md DELETED
@@ -1 +0,0 @@
- ### Model storage directory for training and inference