RubyGems - rababa - Versions diffs - 0.1.0 → 0.1.1 - Mend

rababa 0.1.0 → 0.1.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (41)

checksums.yaml +4 -4
data/.github/workflows/python.yml +81 -0
data/.github/workflows/release.yml +36 -0
data/.github/workflows/ruby.yml +27 -0
data/.gitignore +3 -0
data/.rubocop.yml +1 -1
data/CODE_OF_CONDUCT.md +13 -13
data/README.adoc +80 -0
data/Rakefile +1 -1
data/docs/{research-arabic-diacritization-06-2021.md → research-arabic-diacritization-06-2021.adoc} +52 -37
data/exe/rababa +1 -1
data/lib/README.adoc +95 -0
data/lib/rababa/diacritizer.rb +16 -8
data/lib/rababa/encoders.rb +2 -2
data/lib/rababa/harakats.rb +1 -1
data/lib/rababa/reconcile.rb +1 -33
data/lib/rababa/version.rb +1 -1
data/models-data/README.adoc +6 -0
data/python/README.adoc +211 -0
data/python/config/cbhg.yml +1 -1
data/python/config/test_cbhg.yml +51 -0
data/python/dataset.py +23 -31
data/python/diacritization_model_to_onnx.py +216 -15
data/python/diacritizer.py +35 -31
data/python/log_dir/CA_MSA.base.cbhg/models/README.adoc +2 -0
data/python/log_dir/README.adoc +1 -0
data/python/{requirement.txt → requirements.txt} +1 -1
data/python/setup.py +32 -0
data/python/trainer.py +10 -4
data/python/util/reconcile_original_plus_diacritized.py +2 -0
data/python/util/text_cleaners.py +59 -4
data/rababa.gemspec +1 -1
data/test-datasets/data-arabic-pointing/{Readme.md → README.adoc} +2 -1
metadata +22 -18
data/.github/workflows/main.yml +0 -18
data/README.md +0 -73
data/lib/README.md +0 -82
data/models-data/README.md +0 -6
data/python/README.md +0 -163
data/python/log_dir/CA_MSA.base.cbhg/models/Readme.md +0 -2
data/python/log_dir/README.md +0 -1

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: d171ba914bf49b5ff592722ac42382737a431883afc4220dcc0a1b785c3b5273
-  data.tar.gz: fc0b1db20509b60d5bac3819705f2c8591ab1b596996190a51a56f8b1094e3a5
+  metadata.gz: 0fe110940a4f0173f919bcf2f8e9d33e1dcd21ac775e52619e11cc37860b17cb
+  data.tar.gz: 750defe96bdc852a066585c7f713daf7b17dc5f6509dfaf567bafb5797b9929b
 SHA512:
-  metadata.gz: 46ea6eb725f2460116ef229175ac3d52287a42b0b20b9e47c6802e2c7bde06a98acae23118bd338d7ad68d600a4e5d8116e4dedc3c4657e5263c5ff854a8f182
-  data.tar.gz: '019241258b1e1d346458aebd1a21c309220fe3ced90246e90852adbc012a5e21b5bd6f96a646a509d6685646aacbd7b71a26619371b2e4ff53141c07ab88db53'
+  metadata.gz: 380fa14e57e3fba948d609987e7e076c53d5e9c8492f7219b77d7b51a538122091b0c5176585b1bbf9be5bf63f0e17ff1fa0d2510063ea6b2c656fd50be02476
+  data.tar.gz: 441cc2614664238a6ba2230f4e2a13e239ee7cf8dfcc0ce8b21c8995f704ea40d985fea22a8dcea1e58d8f5ecf980101e64c45234a4a945046dfb42bbfc71b65
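
The digests above cover the two members of the `.gem` archive, `metadata.gz` and `data.tar.gz`. As a minimal sketch of how they could be checked locally, the Ruby below compares unpacked files against a parsed `checksums.yaml`. The helper name `verify_gem_checksums` and the `DIGESTS` table are hypothetical and are not part of RubyGems or rababa; only the standard `digest` library is used.

```ruby
require "digest"

# Hypothetical lookup table: checksums.yaml section name => digest class.
DIGESTS = { "SHA256" => Digest::SHA256, "SHA512" => Digest::SHA512 }.freeze

# Hypothetical helper: returns true when every file listed in `checksums`
# (the parsed content of checksums.yaml) hashes to its recorded digest.
# `dir` is the directory holding the unpacked metadata.gz and data.tar.gz.
def verify_gem_checksums(dir, checksums)
  checksums.all? do |algorithm, files|
    digest_class = DIGESTS.fetch(algorithm)
    files.all? do |name, expected|
      digest_class.file(File.join(dir, name)).hexdigest == expected
    end
  end
end
```

Assuming the gem has been unpacked with `tar -xf` and `checksums.yaml.gz` decompressed, the parsed YAML hash can be passed in directly, for example `verify_gem_checksums(".", YAML.safe_load(File.read("checksums.yaml")))` after `require "yaml"`.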

data/.github/workflows/python.yml ADDED

@@ -0,0 +1,81 @@
+name: python
+on:
+  push:
+    branches: [ main ]
+  pull_request:
+jobs:
+  infer:
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        python-version: ['3.6', '3.7', '3.8', '3.9']
+    steps:
+    - uses: actions/checkout@v2
+    - uses: actions/setup-python@v2
+      with:
+        python-version: ${{ matrix.python-version }}
+    - uses: actions/cache@v2
+      with:
+        path: ${{ env.pythonLocation }}
+        key: ${{ env.pythonLocation }}-${{ hashFiles('python/setup.py') }}-${{ hashFiles('python/requirements.txt') }}
+    - name: Install requirements
+      working-directory: ./python
+      run: |
+        pip install --upgrade --upgrade-strategy eager -r requirements.txt -e .
+    - name: Download PyTorch model
+      working-directory: ./python
+      run: |
+        curl -sSL https://github.com/secryst/rababa-models/releases/download/0.1/2000000-snapshot.pt \
+          -o log_dir/CA_MSA.base.cbhg/models/2000000-snapshot.pt
+    - name: Run diacriticization
+      working-directory: ./python
+      run: |
+        python diacritize.py --model_kind "cbhg" --config config/cbhg.yml --text 'قطر'
+  train:
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        python-version: ['3.6', '3.7', '3.8', '3.9']
+    steps:
+    - uses: actions/checkout@v2
+    - uses: actions/setup-python@v2
+      with:
+        python-version: ${{ matrix.python-version }}
+    - uses: actions/cache@v2
+      with:
+        path: ${{ env.pythonLocation }}
+        key: ${{ env.pythonLocation }}-${{ hashFiles('python/setup.py') }}-${{ hashFiles('python/requirements.txt') }}
+    - name: Install requirements
+      working-directory: ./python
+      run: |
+        pip install --upgrade --upgrade-strategy eager -r requirements.txt -e .
+    - name: Prepare dataset
+      working-directory: ./python
+      run: |
+        mkdir -p data/CA_MSA
+        touch data/CA_MSA/{eval,train,test}.csv
+        cd data
+        curl -sSL https://github.com/interscript/rababa-tashkeela/archive/refs/tags/v1.0.zip -o tashkeela.zip
+        unzip tashkeela.zip
+        for d in `ls rababa-tashkeela-1.0/tashkeela_val/*`; do cat $d >> CA_MSA/eval.csv; done
+        for d in `ls rababa-tashkeela-1.0/tashkeela_train/*`; do cat $d >> CA_MSA/train.csv; done
+        for d in `ls rababa-tashkeela-1.0/tashkeela_test/*`; do cat $d >> CA_MSA/test.csv; done
+    - name: Try training (WIP)
+      working-directory: ./python
+      run: |
+        python train.py --model "cbhg" --config config/test_cbhg.yml

data/.github/workflows/release.yml ADDED

@@ -0,0 +1,36 @@
+name: release
+on:
+  push:
+    tags:
+      - 'v*'
+jobs:
+  release:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v2
+      - uses: actions/setup-ruby@v1
+        with:
+          ruby-version: '2.7'
+          architecture: 'x64'
+      - run: bundle install --jobs 4 --retry 3
+      - name: Test the Ruby package
+        run: bundle exec rake
+      - name: Publish to rubygems.org
+        env:
+          RUBYGEMS_API_KEY: ${{secrets.INTERSCRIPT_RUBYGEMS_API_KEY}}
+        run: |
+          gem install gem-release
+          touch ~/.gem/credentials
+          cat > ~/.gem/credentials << EOF
+          ---
+          :rubygems_api_key: ${RUBYGEMS_API_KEY}
+          EOF
+          chmod 0600 ~/.gem/credentials
+          git status
+          gem release

data/.github/workflows/ruby.yml ADDED

@@ -0,0 +1,27 @@
+name: ruby
+on:
+  push:
+    branches: [ main ]
+  pull_request:
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        ruby-version: ['2.6', '2.7', '3.0']
+    steps:
+    - uses: actions/checkout@v2
+    - name: Set up Ruby
+      uses: ruby/setup-ruby@v1
+      with:
+        ruby-version: ${{ matrix.ruby-version }}
+        bundler-cache: true
+    - name: Run rake
+      run: |
+        bundle exec rake

data/.gitignore CHANGED

@@ -9,3 +9,6 @@
 # rspec failure tracking
 .rspec_status
+*.onnx
+Gemfile.lock

data/.rubocop.yml CHANGED

@@ -1,5 +1,5 @@
 AllCops:
-  TargetRubyVersion: 2.4
+  TargetRubyVersion: 2.5
 Style/StringLiterals:
   Enabled: true

data/CODE_OF_CONDUCT.md CHANGED

@@ -1,12 +1,12 @@
-# Contributor Covenant Code of Conduct
+= Contributor Covenant Code of Conduct
-## Our Pledge
+== Our Pledge
 We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.
 We pledge to act and interact in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community.
-## Our Standards
+== Our Standards
 Examples of behavior that contributes to a positive environment for our community include:
@@ -27,56 +27,56 @@ Examples of unacceptable behavior include:
 * Other conduct which could reasonably be considered inappropriate in a
   professional setting
-## Enforcement Responsibilities
+== Enforcement Responsibilities
 Community leaders are responsible for clarifying and enforcing our standards of acceptable behavior and will take appropriate and fair corrective action in response to any behavior that they deem inappropriate, threatening, offensive, or harmful.
 Community leaders have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, and will communicate reasons for moderation decisions when appropriate.
-## Scope
+== Scope
 This Code of Conduct applies within all community spaces, and also applies when an individual is officially representing the community in public spaces. Examples of representing our community include using an official e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event.
-## Enforcement
+== Enforcement
 Instances of abusive, harassing, or otherwise unacceptable behavior may be reported to the community leaders responsible for enforcement at ronald.tse@ribose.com. All complaints will be reviewed and investigated promptly and fairly.
 All community leaders are obligated to respect the privacy and security of the reporter of any incident.
-## Enforcement Guidelines
+== Enforcement Guidelines
 Community leaders will follow these Community Impact Guidelines in determining the consequences for any action they deem in violation of this Code of Conduct:
-### 1. Correction
+=== 1. Correction
 **Community Impact**: Use of inappropriate language or other behavior deemed unprofessional or unwelcome in the community.
 **Consequence**: A private, written warning from community leaders, providing clarity around the nature of the violation and an explanation of why the behavior was inappropriate. A public apology may be requested.
-### 2. Warning
+=== 2. Warning
 **Community Impact**: A violation through a single incident or series of actions.
 **Consequence**: A warning with consequences for continued behavior. No interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, for a specified period of time. This includes avoiding interactions in community spaces as well as external channels like social media. Violating these terms may lead to a temporary or permanent ban.
-### 3. Temporary Ban
+=== 3. Temporary Ban
 **Community Impact**: A serious violation of community standards, including sustained inappropriate behavior.
 **Consequence**: A temporary ban from any sort of interaction or public communication with the community for a specified period of time. No public or private interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, is allowed during this period. Violating these terms may lead to a permanent ban.
-### 4. Permanent Ban
+=== 4. Permanent Ban
 **Community Impact**: Demonstrating a pattern of violation of community standards, including sustained inappropriate behavior,  harassment of an individual, or aggression toward or disparagement of classes of individuals.
 **Consequence**: A permanent ban from any sort of public interaction within the community.
-## Attribution
+== Attribution
 This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 2.0,
 available at https://www.contributor-covenant.org/version/2/0/code_of_conduct.html.
-Community Impact Guidelines were inspired by [Mozilla's code of conduct enforcement ladder](https://github.com/mozilla/diversity).
+Community Impact Guidelines were inspired by https://github.com/mozilla/diversity[Mozilla's code of conduct enforcement ladder].
 [homepage]: https://www.contributor-covenant.org

data/README.adoc ADDED

@@ -0,0 +1,80 @@
+= رُبابَة RABABA the Arabic Diacritization Library
+Arabic diacritization is useful for several practical business cases like text
+to speech or Romanization of Arabic texts or scripts.
+== Purpose
+This repository contains everything to train a diacritization model in Python
+and run it in Python and Ruby.
+== Try out Rababa
+Rababa can be run both in Python and Ruby. Go the directory corresponding to the
+language you prefer to use.
+Please see the following README's, under the "`Try out Rababa`" section:
+* https://github.com/interscript/rababa/tree/main/python[Python]
+* https://github.com/interscript/rababa/tree/main/lib[Ruby]
+== Library
+This library was built for the
+https://www.interscript.org[Interscript project]
+(https://github.com/interscript/)[at GitHub].
+Diacritization strategy is following several steps with at heart a deep learning
+model:
+. text preprocessing
+. neural networks model prediction
+. text postprocessing
+This repository contains:
+* https://github.com/interscript/rababa/tree/main/lib[lib] is
+  the Ruby library using NNet model in ONNX format.
+* https://github.com/interscript/rababa/tree/main/docs[docs]
+  contains an application focused summary of latest (2021-06) relevant papers
+  and solutions.
+* https://github.com/interscript/rababa/tree/main/python[python]
+** A *neural network solution* for automatised diacritization based on the
+work of https://github.com/almodhfer/Arabic_Diacritization[almodhfer],
+from which we overtook the baseline and more advanced and efficient CBHG
+models only. This very recent solution allows for efficient predictions on
+CPU's with a reasonable sized model.
+** **PyTorch to ONNX** conversion of PyTorch to ONNX format
+** **Strings Pre-/Post-processing**, also from
+   https://github.com/almodhfer/Arabic_Diacritization[almodhfer]
+* https://github.com/interscript/rababa/tree/main/tests-benchmarks[tests and benchmarking utilities],
+  allowing to compare with other implementations.
+** tests are taken from
+  https://github.com/AliOsm/arabic-text-diacritization[diacritization benchmarking]
+** we have added own, realistic datasets for the problem of diacritization
+* **models-data** directory to store models and embeddings in various formats
+== About the name
+A https://en.wikipedia.org/wiki/Rebab[Rababa] is an antique string instrument.
+In a similar fashion that a Rababa produces melody from a simple strings and
+pieces of wood, our library and diacritization gives a whole palette of colour
+and meanings to arabic scripts.
+== Under development
+We are working on the following improvements:
+* Preprocessing for breaking down large sentences
+* PoS tagging and search to improve the diacritization
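
The README added above describes diacritization as three steps: text preprocessing, neural network prediction, and text postprocessing. The following Ruby is only an illustrative sketch of that flow, not the gem's implementation. All names (`preprocess`, `postprocess`, `diacritize`, `STUB_MODEL`) are hypothetical, the preprocessing rule (stripping existing marks) is an assumption, and the stub stands in for the ONNX model that the real library loads.

```ruby
# Arabic combining marks U+064B..U+0652 (tanween, harakat, shadda, sukun).
DIACRITICS = /[\u064B-\u0652]/

# Step 1 (assumed behaviour): drop any diacritics already present.
def preprocess(text)
  text.gsub(DIACRITICS, "")
end

# Step 3: re-attach one predicted mark (possibly empty) after each character.
def postprocess(chars, marks)
  chars.zip(marks).map { |char, mark| "#{char}#{mark}" }.join
end

# Runs the three steps; `model` is any callable mapping an array of
# characters to an array of marks of the same length (step 2).
def diacritize(text, model)
  chars = preprocess(text).chars
  marks = model.call(chars)
  postprocess(chars, marks)
end

# Stand-in for the neural network: predicts a fatha (U+064E) everywhere.
STUB_MODEL = ->(chars) { chars.map { "\u064E" } }
```

Keeping the model behind a plain callable means the same pre- and postprocessing code could be exercised with a stub in tests and with a real inference session in production.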

data/Rakefile CHANGED

@@ -9,4 +9,4 @@ require "rubocop/rake_task"
 RuboCop::RakeTask.new
-task default: %i[spec rubocop]
+task default: %i[spec]# rubocop]

data/docs/{research-arabic-diacritization-06-2021.md → research-arabic-diacritization-06-2021.adoc} RENAMED

@@ -1,4 +1,4 @@
-# Literature and Codes
+= Literature and Codes
 Last updated: 2021-06.
@@ -9,74 +9,89 @@ Older solutions used rules based approaches.
 Deep Learning was applied relatively to the problem of diacritization, gradually
 getting better results than rules based approaches.
+== References
 **Mishkal, Arabic text vocalization software**
-Zerrouki, T.
- rules based library, 2014
- * [code](https://github.com/linuxscout/mishkal)
+* Zerrouki, T.
+* rules based library, 2014
+* https://github.com/linuxscout/mishkal[code]
 **Automatic minimal diacritization of Arabic texts**
-Rehab Alnefaiea, Aqil M.Azmib
-11.2017
+* Rehab Alnefaiea, Aqil M.Azmib
+* 11.2017
 * MADAMIRA software
-* [paper](https://www.sciencedirect.com/science/article/pii/S1877050917321634)
+* https://www.sciencedirect.com/science/article/pii/S1877050917321634[paper]
 **An Approach for Arabic Diacritization**
- Ismail Hadjir, Mohamed Abbache, Fatma Zohra Belkredim
-06.2019
+* Ismail Hadjir, Mohamed Abbache, Fatma Zohra Belkredim
+* 06.2019
 * keywords: Hidden Markov Models, Viterbi algorithm
-* [article](https://link.springer.com/chapter/10.1007/978-3-030-23281-8_29)
+* https://link.springer.com/chapter/10.1007/978-3-030-23281-8_29[article]
 **Diacritization of Moroccan and Tunisian Arabic Dialects: A CRF Approach**
-Kareem Darwish∗, Ahmed Abdelali∗, Hamdy Mubarak∗, Younes Samih†, Mohammed Attia⋆
-2018
+* Kareem Darwish, Ahmed Abdelali, Hamdy Mubarak, Younes Samih, Mohammed Attia
+* 2018
 * keywords: Conditional Random Fields, arabic dialects...
-* [paper](http://lrec-conf.org/workshops/lrec2018/W30/pdf/20_W30.pdf)
+* http://lrec-conf.org/workshops/lrec2018/W30/pdf/20_W30.pdf[paper]
 **Arabic Text Diacritization Using Deep Neural Networks**
-Ali Fadel, Ibraheem Tuffaha, Bara' Al-Jawarneh, Mahmoud Al-Ayyoub
-**Shakkala** library, tensorflow,  04.2019
+* Ali Fadel, Ibraheem Tuffaha, Bara' Al-Jawarneh, Mahmoud Al-Ayyoub
+* **Shakkala** library, tensorflow
+* 04.2019
 * keywords: Embedding, LSTM
-*  [paper](https://arxiv.org/abs/1905.01965)
-*  [code](https://github.com/Barqawiz/Shakkala), tensorflow
-* [benchmarks&scripts](https://github.com/AliOsm/arabic-text-diacritization)
+*  https://arxiv.org/abs/1905.01965[paper]
+*  https://github.com/Barqawiz/Shakkala[code], tensorflow
+* https://github.com/AliOsm/arabic-text-diacritization[benchmarks&scripts]
 **Highly Effective Arabic Diacritization using Sequence to Sequence Modeling**
 * Hamdy Mubarak, Ahmed Abdelali, Hassan Sajjad, Younes Samih, Kareem Darwish
-06.2019
+* 06.2019
 * keywords: seq2seq(LSTM), NMT, interesting representation units, context window, voting
-* [paper](https://www.aclweb.org/anthology/N19-1248.pdf)
+* https://www.aclweb.org/anthology/N19-1248.pdf[paper]
 **Multi-components System for Automatic Arabic Diacritization**
-Hamza Abbad, Shengwu Xiong
-04.2020
+* Hamza Abbad, Shengwu Xiong
+* 04.2020
 * keywords: LSTM's, parallel layers for Shadda and Harakat (⇒ pipeline)
-* [paper](https://paperswithcode.com/paper/multi-components-system-for-automatic-arabic)
-* [code](https://github.com/Hamza5/Pipeline-diacritizer), tensorflow
+* https://paperswithcode.com/paper/multi-components-system-for-automatic-arabic[paper]
+* https://github.com/Hamza5/Pipeline-diacritizer[code], tensorflow
 **Deep Diacritization: Efficient Hierarchical Recurrence for Improved Arabic Diacritization**
-Badr AlKhamissi, Muhammad N. ElNokrashy, and Mohamed Gabr
-12.2020
+* Badr AlKhamissi, Muhammad N. ElNokrashy, and Mohamed Gabr
+* 12.2020
 * keywords: Cross-level attention, Encoder-Decoder (LSTM), Teacher forcing,
-* [paper](https://www.aclweb.org/anthology/2020.wanlp-1.4.pdf)
-* [slides](https://drive.google.com/file/d/1GzXRIddVeJRCge74QaRC67M1I-pAoGV3/view)
-* [code](https://github.com/BKHMSI/deep-diacritization), pytorch
+* https://www.aclweb.org/anthology/2020.wanlp-1.4.pdf[paper]
+* https://drive.google.com/file/d/1GzXRIddVeJRCge74QaRC67M1I-pAoGV3/view[slides]
+* https://github.com/BKHMSI/deep-diacritization[code], pytorch
 **Effective Deep Learning Models for Automatic Diacritization of Arabic Text**
-Mokthar Ali Hasan Madhfar; Ali Mustafa Qamar
-12.2020
+* Mokthar Ali Hasan Madhfar; Ali Mustafa Qamar
+* 12.2020
 * keywords: embedding, encoder-decoder (LSTM), Highway Nets, Attention, CBHG Module
-* [paper](https://paperswithcode.com/paper/effective-deep-learning-models-for-automatic)
-* [code](https://github.com/almodhfer/Arabic_Diacritization), pytorch
+* https://paperswithcode.com/paper/effective-deep-learning-models-for-automatic[paper]
+* https://github.com/almodhfer/Arabic_Diacritization[code], pytorch
 **A Deep Belief Network Classification Approach for Automatic Diacritization of Arabic Text**
-Mohammad Aref Alshraideh, Mohammad Alshraideh and Omar Alkadi
-4.2021
+* Mohammad Aref Alshraideh, Mohammad Alshraideh and Omar Alkadi
+* 4.2021
 * keywords: DBN built with Boltzmann restricted machines (restricted RBM's) superior to LSTMs, unicode encoding, Borderline-SMOTE
-* [paper](https://www.researchgate.net/publication/352226815_A_Deep_Belief_Network_Classification_Approach_for_Automatic_Diacritization_of_Arabic_Text)
+* https://www.researchgate.net/publication/352226815_A_Deep_Belief_Network_Classification_Approach_for_Automatic_Diacritization_of_Arabic_Text[paper]
+== Research ideas
-# Research ideas
 Here we just mention some 2021-ish ideas mentioned in the recent papers above:
 * Transformer-based Encoders
 * Byte-pair-encodings
 * Improve Injected Hints Method (train with semi diacritised data)
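
Most changes in the renamed documentation files follow two mechanical patterns: Markdown `#` headings become AsciiDoc `=` headings, and inline links `[text](url)` become `url[text]`. The Ruby below is a minimal sketch of those two rewrites. The function name `markdown_line_to_asciidoc` is hypothetical, and nothing in the diff indicates that the authors used such a script.

```ruby
# Hypothetical helper: converts one Markdown line to AsciiDoc for the two
# patterns seen in this diff. Reference-style links such as
# "[Contributor Covenant][homepage]" are deliberately left untouched.
def markdown_line_to_asciidoc(line)
  line
    # "## Title" -> "== Title": same number of markers, different character.
    .sub(/\A(#+)(?=\s)/) { "=" * Regexp.last_match(1).length }
    # "[text](url)" -> "url[text]"
    .gsub(/\[([^\]]+)\]\(([^)\s]+)\)/) do
      "#{Regexp.last_match(2)}[#{Regexp.last_match(1)}]"
    end
end
```

For example, `markdown_line_to_asciidoc("* [paper](https://arxiv.org/abs/1905.01965)")` produces the AsciiDoc form that appears on the `+` side of the hunk above.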