rababa 0.1.0 → 0.1.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (41)
  1. checksums.yaml +4 -4
  2. data/.github/workflows/python.yml +81 -0
  3. data/.github/workflows/release.yml +36 -0
  4. data/.github/workflows/ruby.yml +27 -0
  5. data/.gitignore +3 -0
  6. data/.rubocop.yml +1 -1
  7. data/CODE_OF_CONDUCT.md +13 -13
  8. data/README.adoc +80 -0
  9. data/Rakefile +1 -1
  10. data/docs/{research-arabic-diacritization-06-2021.md → research-arabic-diacritization-06-2021.adoc} +52 -37
  11. data/exe/rababa +1 -1
  12. data/lib/README.adoc +95 -0
  13. data/lib/rababa/diacritizer.rb +16 -8
  14. data/lib/rababa/encoders.rb +2 -2
  15. data/lib/rababa/harakats.rb +1 -1
  16. data/lib/rababa/reconcile.rb +1 -33
  17. data/lib/rababa/version.rb +1 -1
  18. data/models-data/README.adoc +6 -0
  19. data/python/README.adoc +211 -0
  20. data/python/config/cbhg.yml +1 -1
  21. data/python/config/test_cbhg.yml +51 -0
  22. data/python/dataset.py +23 -31
  23. data/python/diacritization_model_to_onnx.py +216 -15
  24. data/python/diacritizer.py +35 -31
  25. data/python/log_dir/CA_MSA.base.cbhg/models/README.adoc +2 -0
  26. data/python/log_dir/README.adoc +1 -0
  27. data/python/{requirement.txt → requirements.txt} +1 -1
  28. data/python/setup.py +32 -0
  29. data/python/trainer.py +10 -4
  30. data/python/util/reconcile_original_plus_diacritized.py +2 -0
  31. data/python/util/text_cleaners.py +59 -4
  32. data/rababa.gemspec +1 -1
  33. data/test-datasets/data-arabic-pointing/{Readme.md → README.adoc} +2 -1
  34. metadata +22 -18
  35. data/.github/workflows/main.yml +0 -18
  36. data/README.md +0 -73
  37. data/lib/README.md +0 -82
  38. data/models-data/README.md +0 -6
  39. data/python/README.md +0 -163
  40. data/python/log_dir/CA_MSA.base.cbhg/models/Readme.md +0 -2
  41. data/python/log_dir/README.md +0 -1
data/models-data/README.md DELETED
@@ -1,6 +0,0 @@
- ### model data dir
-
- contains:
- ONNX data
- Pickle sample data
-
data/python/README.md DELETED
@@ -1,163 +0,0 @@
- # Diacritization Model
-
- ## Try out Rababa
- * download the torch model under /Assets at [releases](https://github.com/secryst/rababa-models/releases)
- * Put the model under python/log_dir/CA_MSA.base.cbhg/models/2000000-snapshot.pt
- * single sentences and text files can now be diacritized as below:
- ```bash
- python diacritize.py --model_kind "cbhg" --config config/cbhg.yml --text 'قطر'
- python diacritize.py --model_kind "cbhg" --config config/cbhg.yml --text_file relative_path_to_text_file
- ```
- The maximal string length is set in the configs at 600.
- Longer lines will need to be broken down, for instance using the library introduced in the Ruby try-out section: ../lib/README.md
-
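Since the maximal string length is fixed at 600 in the configs, longer input has to be split before being diacritized. Below is a minimal sketch of such a splitter; it is a hypothetical helper, not part of the package, and only the 600-character limit is taken from the config:

```python
# Split text into chunks of at most max_len characters, preferring to
# break at whitespace so words are not cut in half.
def split_for_diacritization(text, max_len=600):
    chunks = []
    while len(text) > max_len:
        cut = text.rfind(" ", 0, max_len)  # last space within the limit
        if cut < 1:                        # no usable break point: hard cut
            cut = max_len
        chunks.append(text[:cut])
        text = text[cut:].lstrip()
    if text:
        chunks.append(text)
    return chunks

# Each resulting chunk is short enough to pass to diacritize.py --text.
print(split_for_diacritization("word " * 300))
```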
- ## Core: Python Deep Learning models for recovering Arabic language diacritics
-
- We are referring here to the [code](https://github.com/almodhfer/Arabic_Diacritization) and
- [Effective Deep Learning Models for Automatic Diacritization of Arabic Text](https://ieeexplore.ieee.org/document/9274427)
- that we have selected for this project from a list of alternatives listed in the
- docs README.
-
- Out of the four models that [almodhfer](https://github.com/almodhfer) has
- implemented, we selected the simplest and most performant ones:
-
- The baseline model (`baseline`): consists of 3 bidirectional LSTM layers with
- optional batch norm layers.
-
- The CBHG model (`cbhg`): uses only the encoder of the Tacotron-based model with
- optional post-LSTM and batch norm layers.
-
- ### Python Version & Dependencies
-
- version: 3.6
- dependencies:
- ```bash
- pip install -r requirement.txt
- ```
-
- ### Datasets
-
- We have chosen the Tashkeela corpus (~2,800,000 sentences):
- * [sourceforge](https://sourceforge.net/projects/tashkeela-processed/)
-
- Other datasets are discussed in the reviewed literature or in the article referenced above.
-
- ```bash
- mkdir data
- mkdir data/CA_MSA
- ```
-
- For training, data needs to be in the format:
-
- ```bash
- > ls data/CA_MSA/*
- --> data/CA_MSA/eval.csv data/CA_MSA/train.csv data/CA_MSA/test.csv
- ```
-
- For instance:
-
- ```bash
- unzip data.zip
- for d in tashkeela_val/*; do cat "$d" >> data/CA_MSA/eval.csv; done
- for d in tashkeela_train/*; do cat "$d" >> data/CA_MSA/train.csv; done
- for d in tashkeela_test/*; do cat "$d" >> data/CA_MSA/test.csv; done
- ```
-
- ### Load Model
-
- Alternatively, trained CBHG models are available under
- [releases](https://github.com/secryst/rababa-models).
- Models are to be copied, as specified at the link just above, under:
- > log_dir/CA_MSA.base.cbhg/models/2000000-snapshot.pt
-
-
- ### Config Files
-
- One can adjust the model configurations in the `/config` directory.
-
- The model configurations cover not only the layers, but also the dataset to be used
- and various other options.
-
- The configuration files are passed explicitly to the applications below.
-
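For orientation, the configs are plain YAML and can be inspected programmatically. A minimal sketch, assuming PyYAML is installed; the keys printed are illustrative (taken from the parameters discussed later in this README), not a guaranteed schema:

```python
import yaml

# Load the CBHG configuration shipped with the package.
with open("config/cbhg.yml") as f:
    config = yaml.safe_load(f)

# Illustrative keys only; verify the actual names in the file.
for key in ("max_len", "batch_size"):
    print(key, "=", config.get(key))
```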
- ### Data Preprocessing
-
- The original work cited above allows for both raw and preprocessed data.
-
- We go for the simplest, raw version here:
- As mentioned above, the corpus must have test.csv, train.csv, and eval.csv.
-
- Specify that the data is not preprocessed in the config.
- In that case, each batch will be processed and the text and diacritics
- will be extracted from the original text.
-
- You also have to specify the text encoder and the cleaner functions.
- Two text encoders were included: BasicArabicEncoder, ArabicEncoderWithStartSymbol.
-
- Moreover, we have one cleaning function, valid_arabic_cleaners, which cleans away
- all characters except valid Arabic characters: Arabic letters,
- punctuation, and diacritics.
-
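A minimal usage sketch of the cleaner follows; the import path is inferred from util/text_cleaners.py in this diff's file list, and the exact signature (a plain str-to-str function) is an assumption:

```python
# Import path per this diff's file list; the signature is assumed to be
# a plain str -> str cleaning function.
from util.text_cleaners import valid_arabic_cleaners

sample = "abc! قَطَرُ 123"
print(valid_arabic_cleaners(sample))
# Per the text above, everything except Arabic letters, punctuation,
# and diacritics should be cleaned away.
```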
- ### Training
-
- All model configs are placed in the `config` directory.
-
- ```bash
- python train.py --model model_name --config config/config_name.yml
- ```
-
- The model will report the WER and DER while training, using the
- diacritization_evaluation package. The frequency of calculating WER and
- DER can be specified in the config file.
-
- ### Testing
-
- Testing is done in the same way as training.
- For instance, with the CBHG model on the data in `/data/CA_MSA/test.csv`:
-
- ```bash
- python test.py --model 'cbhg' --config config/cbhg.yml
- ```
-
- The model will load the last saved model unless you specify it in the config:
- `test_data_path`. The test file is expected to have the correct diacritization!
-
- If the test file name is different from `test.csv`, you
- can set it in the config as `test_file_name`.
-
- ### "Diacritize" Text or Files
-
- Single sentences or files can be processed. The code outputs the diacritized
- text or lines.
-
- ```bash
- python diacritize.py --model_kind "cbhg" --config config/cbhg.yml --text 'قطر'
- python diacritize.py --model_kind "cbhg" --config config/cbhg.yml --text_file relative_path_to_text_file
- ```
-
-
- ### Convert the CBHG Python model to ONNX
-
- The last model stored during training is automatically chosen, and the ONNX model
- is saved into a hardcoded location: `../models-data/diacritization_model.onnx`
-
- #### Run
-
- ```bash
- python diacritization_model_to_onnx.py
- ```
-
- #### Important parameters
-
- They are hardcoded at the beginning of the script:
-
- * `max_len`:
- * max string length; the initial model value is given in the config.
- * this param allows tuning the model speed and size!
- * the Ruby ../lib/README.md points to resources for preprocessing
-
- * `batch_size`:
- * the value is given by the original model and its training.
- * this constrains how the ONNX model can be put in production:
- 1. if > 1, single lines involve redundant computations
- 2. if > 1, files are processed in batches.
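Once exported, the model can be exercised with onnxruntime. A minimal sketch, assuming the hardcoded output path above; the input name, shape, and dtype used here are assumptions, which is why the code first inspects the model's real inputs:

```python
import numpy as np
import onnxruntime as ort

# Path is the hardcoded export location mentioned above.
sess = ort.InferenceSession("../models-data/diacritization_model.onnx")

# Inspect the real input name/shape/type instead of guessing.
inp = sess.get_inputs()[0]
print(inp.name, inp.shape, inp.type)

# Hypothetical call: a batch of integer-encoded characters padded to
# max_len. batch_size and max_len must match the exported model.
batch_size, max_len = 1, 600
dummy = np.zeros((batch_size, max_len), dtype=np.int64)
outputs = sess.run(None, {inp.name: dummy})
print([o.shape for o in outputs])
```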
data/python/log_dir/CA_MSA.base.cbhg/models/Readme.md DELETED
@@ -1,2 +0,0 @@
- #### Put model trained with CA_MSA here:
- 2000000-snapshot.pt
data/python/log_dir/README.md DELETED
@@ -1 +0,0 @@
- ### Model storage directory for training and inference