RubyGems - rababa - Versions diffs - 0.1.0 → 0.1.1 - Mend

rababa 0.1.0 → 0.1.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (41)

checksums.yaml +4 -4
data/.github/workflows/python.yml +81 -0
data/.github/workflows/release.yml +36 -0
data/.github/workflows/ruby.yml +27 -0
data/.gitignore +3 -0
data/.rubocop.yml +1 -1
data/CODE_OF_CONDUCT.md +13 -13
data/README.adoc +80 -0
data/Rakefile +1 -1
data/docs/{research-arabic-diacritization-06-2021.md → research-arabic-diacritization-06-2021.adoc} +52 -37
data/exe/rababa +1 -1
data/lib/README.adoc +95 -0
data/lib/rababa/diacritizer.rb +16 -8
data/lib/rababa/encoders.rb +2 -2
data/lib/rababa/harakats.rb +1 -1
data/lib/rababa/reconcile.rb +1 -33
data/lib/rababa/version.rb +1 -1
data/models-data/README.adoc +6 -0
data/python/README.adoc +211 -0
data/python/config/cbhg.yml +1 -1
data/python/config/test_cbhg.yml +51 -0
data/python/dataset.py +23 -31
data/python/diacritization_model_to_onnx.py +216 -15
data/python/diacritizer.py +35 -31
data/python/log_dir/CA_MSA.base.cbhg/models/README.adoc +2 -0
data/python/log_dir/README.adoc +1 -0
data/python/{requirement.txt → requirements.txt} +1 -1
data/python/setup.py +32 -0
data/python/trainer.py +10 -4
data/python/util/reconcile_original_plus_diacritized.py +2 -0
data/python/util/text_cleaners.py +59 -4
data/rababa.gemspec +1 -1
data/test-datasets/data-arabic-pointing/{Readme.md → README.adoc} +2 -1
metadata +22 -18
data/.github/workflows/main.yml +0 -18
data/README.md +0 -73
data/lib/README.md +0 -82
data/models-data/README.md +0 -6
data/python/README.md +0 -163
data/python/log_dir/CA_MSA.base.cbhg/models/Readme.md +0 -2
data/python/log_dir/README.md +0 -1

data/python/log_dir/CA_MSA.base.cbhg/models/README.adoc ADDED

@@ -0,0 +1,2 @@
+==== Put model trained with CA_MSA here:
+2000000-snapshot.pt

data/python/log_dir/README.adoc ADDED

@@ -0,0 +1 @@
+=== Model storage directory for training and inference

data/python/{requirement.txt → requirements.txt} RENAMED

@@ -1,4 +1,4 @@
-torch==1.7.0
+torch==1.9.0
 numpy==1.19.5
 matplotlib==3.3.3
 pandas==1.1.5

data/python/setup.py ADDED

@@ -0,0 +1,32 @@
+from setuptools import setup, find_packages
+setup(
+    name='rababa',
+    version='0.1.0',
+    description='Rababa for Arabic diacriticization',
+    author='Ribose',
+    author_email='open.source@ribose.com',
+    url='https://www.interscript.org',
+    # packages=find_packages(include=['exampleproject', 'exampleproject.*']),
+    python_requires='>=3.6, <4',
+    install_requires=[
+      'torch==1.9.0',
+      'numpy==1.19.5',
+      'matplotlib==3.3.3',
+      'pandas==1.1.5',
+      'ruamel.yaml==0.16.12',
+      'tensorboard==2.4.0',
+      'diacritization-evaluation==0.5',
+      'tqdm==4.56.0',
+      'onnx==1.9.0',
+      'onnxruntime==1.8.1',
+      'pyyaml==5.4.1',
+    ],
+    # extras_require={'plotting': ['matplotlib>=2.2.0', 'jupyter']},
+    setup_requires=['pytest-runner'],
+    tests_require=['pytest'],
+    # entry_points={
+    #     'console_scripts': ['my-command=exampleproject.example:main']
+    # },
+    # package_data={'exampleproject': ['data/schema.json']}
+)

data/python/trainer.py CHANGED

@@ -12,7 +12,7 @@ from tqdm import trange
 from config_manager import ConfigManager
 from dataset import load_iterators
-from diacritizer import CBHGDiacritizer
+from diacritizer import Diacritizer
 from util.learning_rates import LearningRateDecay
 from options import OptimizerType
 from util.utils import (
@@ -51,6 +51,7 @@ class GeneralTrainer(Trainer):
         self.model = self.config_manager.get_model()
         self.optimizer = self.get_optimizer()
+        self.device = "cuda" if torch.cuda.is_available() else "cpu"
         self.model = self.model.to(self.device)
         self.load_model(model_path=self.config.get("train_resume_model_path"))
@@ -78,7 +79,7 @@ class GeneralTrainer(Trainer):
     def load_diacritizer(self):
         if self.model_kind in ["cbhg", "baseline"]:
-            self.diacritizer = CBHGDiacritizer(self.config_path, self.model_kind)
+            self.diacritizer = Diacritizer(self.config_path, self.model_kind)
         else:
             print('model not found')
             exit()
@@ -195,6 +196,7 @@ class GeneralTrainer(Trainer):
         return results, summary_texts
     def run(self):
         scaler = torch.cuda.amp.GradScaler()
         train_iterator, _, validation_iterator = load_iterators(self.config_manager)
         print("data loaded")
@@ -337,9 +339,12 @@ class GeneralTrainer(Trainer):
         predictions = outputs["diacritics"].contiguous()
         targets = batch_inputs["target"].contiguous()
         predictions = predictions.view(-1, predictions.shape[-1])
         targets = targets.view(-1)
-        loss = self.criterion(predictions.to(self.device), targets.to(self.device))
+        loss = self.criterion(predictions.to(self.device),
+                              targets.to(self.device))
         outputs.update({"loss": loss})
         return outputs
@@ -361,7 +366,8 @@ class GeneralTrainer(Trainer):
             last_model_path = model_path
         print(f"loading from {last_model_path}")
-        saved_model = torch.load(last_model_path)
+        saved_model = torch.load(last_model_path) if torch.cuda.is_available() \
+            else torch.load(last_model_path, map_location=torch.device('cpu'))
         self.model.load_state_dict(saved_model["model_state_dict"])
         if load_optimizer:
             self.optimizer.load_state_dict(saved_model["optimizer_state_dict"])

data/python/util/reconcile_original_plus_diacritized.py CHANGED

@@ -26,6 +26,7 @@ def build_pivot_map(d_original, d_diacritized):
             d_diacritized: dictionary modelling diacritized as above
         return: list of ids tuple where strings match
     """
     l_map = []
     idx_dia, idx_ori = 0, 0
     while idx_dia < len(d_diacritized):
@@ -59,6 +60,7 @@ def reconcile_strings(str_original, str_diacritized):
             str_diacritized: diacritized string
         return: reconciled string
     """
     # we model the strings as dict
     d_original = dict((i,c) for i,c in
                       enumerate(list([c for c in str_original if not c in HARAQAT])))

data/python/util/text_cleaners.py CHANGED

@@ -1,6 +1,6 @@
 import re
-from util.constants import VALID_ARABIC
+from util.constants import VALID_ARABIC, BASIC_HARAQAT, ALL_POSSIBLE_HARAQAT
+from diacritization_evaluation import util
 _whitespace_re = re.compile(r"\s+")
@@ -9,13 +9,68 @@ def collapse_whitespace(text):
     text = re.sub(_whitespace_re, " ", text)
     return text
 def basic_cleaners(text):
     text = collapse_whitespace(text)
     return text.strip()
 def valid_arabic_cleaners(text):
     text = filter(lambda char: char in VALID_ARABIC, text)
     text = collapse_whitespace(''.join(list(text)))
     return text.strip()
+def extract_stack(stack, correct_reversed: bool = True):
+    """
+    Given stack, we extract its content to string, and check whether this string is
+    available at all_possible_haraqat list: if not we raise an error. When correct_reversed
+    is set, we also check the reversed order of the string, if it was not already correct.
+    """
+    char_haraqat = []
+    while len(stack) != 0:
+        char_haraqat.append(stack.pop())
+    full_haraqah = "".join(char_haraqat)
+    reversed_full_haraqah = "".join(reversed(char_haraqat))
+    if full_haraqah in ALL_POSSIBLE_HARAQAT:
+        out = full_haraqah
+    elif reversed_full_haraqah in ALL_POSSIBLE_HARAQAT and correct_reversed:
+        out = reversed_full_haraqah
+    else:
+        #raise ValueError(stack)
+        #raise ValueError(
+        #    f"""The chart has the following haraqat which are not found in
+        #all possible haraqat: {'|'.join([ALL_POSSIBLE_HARAQAT[diacritic]
+        #                                 for diacritic in full_haraqah ])}"""
+        #)
+        out = ''
+    return out
+def extract_haraqat(text: str, correct_reversed: bool = True):
+    """
+    Args:
+    text (str): text to be diacritized
+    Returns:
+    text: the original text as it comes
+    text_list: all text that are not haraqat
+    haraqat_list: all haraqat_list
+    """
+    if len(text.strip()) == 0:
+        return text, [" "] * len(text), [""] * len(text)
+    stack = []
+    haraqat_list = []
+    txt_list = []
+    for char in text:
+        # if chart is a diacritic, then extract the stack and empty it
+        if char not in BASIC_HARAQAT.keys():
+            stack_content = extract_stack(stack,
+                                          correct_reversed=correct_reversed)
+            #if stack_content != '':
+            haraqat_list.append(stack_content)
+            txt_list.append(char)
+            stack = []
+        else:
+            stack.append(char)
+    if len(haraqat_list) > 0:
+        del haraqat_list[0]
+    haraqat_list.append(extract_stack(stack))
+    return text, txt_list, haraqat_list
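
The stack-based extraction added in this hunk can be hard to follow from the diff alone. The following is a minimal, self-contained sketch of the same idea, not the package's code: the two constant tables are small hypothetical stand-ins for `BASIC_HARAQAT` and `ALL_POSSIBLE_HARAQAT` from `util.constants`, whose real contents are not shown in this diff. Diacritics are buffered on a stack until the next base letter arrives, then flushed as one combined mark; unknown combinations become an empty string, mirroring the commented-out `ValueError` above.

```python
# Hypothetical stand-ins for util.constants; the real tables may differ.
FATHA, DAMMA, KASRA = "\u064e", "\u064f", "\u0650"
SHADDA, SUKUN = "\u0651", "\u0652"
BASIC_HARAQAT = {FATHA, DAMMA, KASRA, SHADDA, SUKUN}
ALL_POSSIBLE_HARAQAT = {
    "", FATHA, DAMMA, KASRA, SUKUN, SHADDA,
    SHADDA + FATHA, SHADDA + DAMMA, SHADDA + KASRA,
}


def extract_stack(stack, correct_reversed=True):
    """Empty the stack and return its marks as one string ('' if unknown)."""
    popped = []
    while stack:
        popped.append(stack.pop())  # pops in reverse of the order pushed
    candidate = "".join(popped)
    flipped = "".join(reversed(popped))
    if candidate in ALL_POSSIBLE_HARAQAT:
        return candidate
    if correct_reversed and flipped in ALL_POSSIBLE_HARAQAT:
        return flipped
    return ""  # unknown combination: drop it instead of raising


def extract_haraqat(text, correct_reversed=True):
    """Split text into base characters and their per-character diacritics."""
    if not text.strip():
        return text, [" "] * len(text), [""] * len(text)
    stack, letters, haraqat = [], [], []
    for char in text:
        if char in BASIC_HARAQAT:
            stack.append(char)  # buffer marks until the next base character
        else:
            # Flush the marks that belonged to the previous base character.
            haraqat.append(extract_stack(stack, correct_reversed))
            letters.append(char)
    if haraqat:
        del haraqat[0]  # the first flush precedes any base character
    haraqat.append(extract_stack(stack))  # marks of the last base character
    return text, letters, haraqat


sample = "\u0628\u064e\u062a\u0651\u0650\u0643"  # three letters, two marked
_, letters, haraqat = extract_haraqat(sample)
print(list(zip(letters, haraqat)))
```

Under these assumed tables, the two returned lists stay aligned one-to-one, so each base character can be zipped with its (possibly empty) diacritic string.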

data/rababa.gemspec CHANGED

@@ -11,7 +11,7 @@ Gem::Specification.new do |spec|
   spec.summary       = "Arabic diacriticizer from Interscript."
   # spec.description   = "TODO: Write a longer description or delete this line."
   spec.homepage      = "https://www.interscript.org"
-  spec.required_ruby_version = Gem::Requirement.new(">= 2.4.0")
+  spec.required_ruby_version = Gem::Requirement.new(">= 2.5.0")
   spec.metadata["homepage_uri"] = spec.homepage
   spec.metadata["source_code_uri"] = "https://github.com/interscript/rababa"

data/test-datasets/data-arabic-pointing/{Readme.md → README.adoc} RENAMED

@@ -1,2 +1,3 @@
-# Data arabic pointing:
+= Data arabic pointing
 https://github.com/secryst/data-arabic-pointing

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: rababa
 version: !ruby/object:Gem::Version
-  version: 0.1.0
+  version: 0.1.1
 platform: ruby
 authors:
 - Ribose
-autorequire:
+autorequire:
 bindir: exe
 cert_chain: []
-date: 2021-07-26 00:00:00.000000000 Z
+date: 2021-08-15 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: onnxruntime
@@ -66,7 +66,7 @@ dependencies:
     - - ">="
       - !ruby/object:Gem::Version
         version: '0'
-description:
+description:
 email:
 - open.source@ribose.com
 executables:
@@ -74,21 +74,23 @@ executables:
 extensions: []
 extra_rdoc_files: []
 files:
-- ".github/workflows/main.yml"
+- ".github/workflows/python.yml"
+- ".github/workflows/release.yml"
+- ".github/workflows/ruby.yml"
 - ".gitignore"
 - ".rspec"
 - ".rubocop.yml"
 - CODE_OF_CONDUCT.md
 - Gemfile
-- README.md
+- README.adoc
 - Rakefile
 - bin/console
 - bin/setup
 - config/model.yml
 - data/example.txt
-- docs/research-arabic-diacritization-06-2021.md
+- docs/research-arabic-diacritization-06-2021.adoc
 - exe/rababa
-- lib/README.md
+- lib/README.adoc
 - lib/rababa.rb
 - lib/rababa/arabic_constants.rb
 - lib/rababa/diacritizer.rb
@@ -96,18 +98,19 @@ files:
 - lib/rababa/harakats.rb
 - lib/rababa/reconcile.rb
 - lib/rababa/version.rb
-- models-data/README.md
+- models-data/README.adoc
 - models-data/batch_example_data.pkl
-- python/README.md
+- python/README.adoc
 - python/config/baseline.yml
 - python/config/cbhg.yml
+- python/config/test_cbhg.yml
 - python/config_manager.py
 - python/dataset.py
 - python/diacritization_model_to_onnx.py
 - python/diacritize.py
 - python/diacritizer.py
-- python/log_dir/CA_MSA.base.cbhg/models/Readme.md
-- python/log_dir/README.md
+- python/log_dir/CA_MSA.base.cbhg/models/README.adoc
+- python/log_dir/README.adoc
 - python/models/baseline.py
 - python/models/cbhg.py
 - python/models/seq2seq.py
@@ -116,7 +119,8 @@ files:
 - python/modules/layers.py
 - python/modules/tacotron_modules.py
 - python/options.py
-- python/requirement.txt
+- python/requirements.txt
+- python/setup.py
 - python/test.py
 - python/tester.py
 - python/train.py
@@ -130,7 +134,7 @@ files:
 - python/util/utils.py
 - rababa.gemspec
 - test-datasets/business-cases/examples_with_coutrynames.txt
-- test-datasets/data-arabic-pointing/Readme.md
+- test-datasets/data-arabic-pointing/README.adoc
 - test-datasets/tashkeela/test.txt
 - test-datasets/tashkeela/train.txt
 - test-datasets/tashkeela/val.txt
@@ -140,7 +144,7 @@ metadata:
   homepage_uri: https://www.interscript.org
   source_code_uri: https://github.com/interscript/rababa
   changelog_uri: https://github.com/interscript/rababa
-post_install_message:
+post_install_message:
 rdoc_options: []
 require_paths:
 - lib
@@ -148,15 +152,15 @@ required_ruby_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
     - !ruby/object:Gem::Version
-      version: 2.4.0
+      version: 2.5.0
 required_rubygems_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubygems_version: 3.0.3
-signing_key:
+rubygems_version: 3.1.6
+signing_key:
 specification_version: 4
 summary: Arabic diacriticizer from Interscript.
 test_files: []

data/.github/workflows/main.yml DELETED

@@ -1,18 +0,0 @@
-name: Ruby
-on: [push,pull_request]
-jobs:
-  build:
-    runs-on: ubuntu-latest
-    steps:
-    - uses: actions/checkout@v2
-    - name: Set up Ruby
-      uses: ruby/setup-ruby@v1
-      with:
-        ruby-version: 2.6.6
-    - name: Run the default task
-      run: |
-        gem install bundler -v 2.2.15
-        bundle install
-        bundle exec rake

data/README.md DELETED

@@ -1,73 +0,0 @@
-# رُبابَة RABABA the Arabic Diacritization Library
-Arabic diacritization is useful for several practical business cases like text
-to speech or Romanization of Arabic texts or scripts.
-## Purpose
-This repository contains everything to train a diacritization model in Python
-and run it in Python and Ruby.
-## Try out Rababa
-Rababa can be run both in python and ruby. Go the directory corresponding to the language you prefer to use.  Indications are in the README's, under the "Try out Rababa" section:
-* [Python](https://github.com/interscript/rababa/tree/master/python)
-* [Ruby](https://github.com/interscript/rababa/tree/master/lib)
-## Library
-This library was built for the
-[Interscript project](https://www.interscript.org)
-([at GitHub](https://github.com/interscript/)).
-Diacritization strategy is following several steps with at heart a deep learning
-model:
-1. text preprocessing
-2. neural networks model prediction
-3. text postprocessing
-This repository contains:
-- [lib](https://github.com/interscript/rababa/tree/master/lib) is
-  the Ruby library using NNet model in ONNX format.
-- [docs](https://github.com/interscript/rababa/tree/master/docs)
-  contains an application focused summary of latest (2021-06) relevant papers
-  and solutions.
-- [python](https://github.com/interscript/rababa/tree/master/python)
-    - A **neural network solution** for automatised diacritization based on the
-      work of [almodhfer](https://github.com/almodhfer/Arabic_Diacritization),
-      from which we overtook the baseline and more advanced and efficient CBHG
-      models only. This very recent solution allows for efficient predictions on
-      CPU's with a reasonable sized model.
-    * **PyTorch to ONNX** conversion of PyTorch to ONNX format
-    * **Strings Pre-/Post-processing**, also from
-      [almodhfer](https://github.com/almodhfer/Arabic_Diacritization)
-- [tests and benchmarking utilities](https://github.com/interscript/rababa/tree/master/tests-benchmarks),
-  allowing to compare with other implementations.
-	* tests are are taken from
-	  [diacritization benchmarking](https://github.com/AliOsm/arabic-text-diacritization)
-	* we have added own, realistic datasets for the problem of diacritization
-- **models-data** directory to store models and embeddings in various formats
-## About the Name
-A https://en.wikipedia.org/wiki/Rebab[Rababa] is an antique string instrument.
-In a similar fashion that a Rababa produces melody from a simple strings and
-pieces of wood, our library and diacritization gives a whole palette of colour
-and meanings to arabic scripts.
-## Under development
-We are working on the following improvements:
-* Preprocessing for breaking down large sentences
-* PoS tagging and search to improve the diacritization

data/lib/README.md DELETED

@@ -1,82 +0,0 @@
-# Arabic Diacritization in Ruby with Rababa
-## Try out Rababa
-* Install the Gems listed below
-* Download a ruby model on [releases](https://github.com/secryst/rababa-models)
-### Run examples
-Prerequisite:
-* Please download the `diacritization_model_max_len_200.onnx` model file
-from https://github.com/secryst/rababa-models/releases/tag/0.1
-One can diacritize either single strings:
-```sh
-rababa -t 'قطر' -m diacritization_model_max_len_200.onnx
-# or when inside the gem directory during development
-bundle exec exe/rababa -t 'قطر' -m diacritization_model_max_len_200.onnx
-```
-Or files as `data/examples.txt` or your own Arabic file (the max string length
-is specified in the model and has to match the `max_len` parameter in
-`config/models.yaml`):
-```sh
-rababa -f data/example.txt -m diacritization_model_max_len_200.onnx
-# or when inside the gem directory during development
-bundle exec exe/rababa -f data/example.txt -m diacritization_model_max_len_200.onnx
-```
-One would have to preprocess generic arabic texts for running Rababa in general.
-This can be done on sentences beginnings running for instance
-[Hamza5](https://github.com/Hamza5/Pipeline-diacritizer):
-```
-python __main__.py preprocess source destination
-```
-### ONNX Models
-They can either be built in the `/python` repository or downloaded from the
-[releases](https://github.com/secryst/rababa-models).
-Or ONNX model can be generated running the python
-[code](https://github.com/interscript/rababa/blob/master/python/diacritization_model_to_onnx.py)
-in this library.
-It requires to go through some of the steps described in the link above.
-### Parameters
-* text to diacritize: "**-t**TEXT", "--text=TEXT",
-* path to file to diacritize: "**-f**FILE", "--text_filename=FILE",
-* path to ONNX model **Mandatory**: "-mMODEL", "--model_file=MODEL",
-* path to config file **Default:config/model.yml**: "-cCONFIG", "--config=CONFIG"
-### Config
-#### Players:
-* max_len: 200 -- 600
-	* Parameter that has to match the ONNX model built using the
-	  [code]{https://github.com/interscript/rababa/blob/master/python/diacritization_model_to_onnx.py}
-	  and following the python/Readme.md.
-	* Longer sentences will need to be preprocessed, which can be done for
-	  instance using [Hamza5](https://github.com/Hamza5)
-	  [code](https://github.com/Hamza5/Pipeline-diacritizer/blob/master/pipeline_diacritizer/pipeline_diacritizer.py).
-	* the smaller the faster the nnets code.
-* text_encoder corresponding to the [rules](https://github.com/interscript/rababa/blob/master/python/util/text_encoders.py):
-     * BasicArabicEncoder
-     * ArabicEncoderWithStartSymbol
-* text_cleaner corresponding to [logics](https://github.com/interscript/rababa/blob/master/python/util/text_cleaners.py):
-     * basic_cleaners: remove redundancy in whitespaces and strip string
-     * valid_arabic_cleaners: basic+filter of only arabic words
-### Gems
-```sh
-gem install rababa
-```