SinaTools 0.1.36__py2.py3-none-any.whl → 0.1.38__py2.py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,64 +1,62 @@
1
- Metadata-Version: 2.1
2
- Name: SinaTools
3
- Version: 0.1.36
4
- Summary: Open-source Python toolkit for Arabic Natural Understanding, allowing people to integrate it in their system workflow.
5
- Home-page: https://github.com/SinaLab/sinatools
6
- License: MIT license
7
- Keywords: sinatools
8
- Platform: UNKNOWN
9
- Description-Content-Type: text/markdown
10
- Requires-Dist: six
11
- Requires-Dist: farasapy
12
- Requires-Dist: tqdm
13
- Requires-Dist: requests
14
- Requires-Dist: regex
15
- Requires-Dist: pathlib
16
- Requires-Dist: torch (==1.13.0)
17
- Requires-Dist: transformers (==4.24.0)
18
- Requires-Dist: torchtext (==0.14.0)
19
- Requires-Dist: torchvision (==0.14.0)
20
- Requires-Dist: seqeval (==1.2.2)
21
- Requires-Dist: natsort (==7.1.1)
22
-
23
- SinaTools
24
- ======================
25
- Open Source Toolkit for Arabic NLP and NLU developed by [SinaLab](http://sina.birzeit.edu/) at Birzeit University. SinaTools is available through Python APIs, command-line tools, Colab notebooks, and online demos.
26
-
27
- See the full list of [Available Packages](https://sina.birzeit.edu/sinatools/), which includes: (1) [Morphology Tagging](https://sina.birzeit.edu/sinatools/index.html#morph), (2) [Named Entity Recognition (NER)](https://sina.birzeit.edu/sinatools/index.html#ner), (3) [Word Sense Disambiguation (WSD)](https://sina.birzeit.edu/sinatools/index.html#wsd), (4) [Semantic Relatedness](https://sina.birzeit.edu/sinatools/index.html#sr), (5) [Synonymy Extraction and Evaluation](https://sina.birzeit.edu/sinatools/index.html#se), (6) [Relation Extraction](https://sina.birzeit.edu/sinatools/index.html#re), (7) [Utilities](https://sina.birzeit.edu/sinatools/index.html#u) (diacritic-based word matching, Jaccard similarity, parser, tokenizers, corpora processing, transliteration, etc.).
28
-
29
- See [Demo Pages](https://sina.birzeit.edu/sinatools/).
30
-
31
- See the [benchmarking](https://www.jarrar.info/publications/HJK24.pdf), which shows that SinaTools outperformed all related toolkits.
32
-
33
- Installation
34
- --------
35
- To install SinaTools, ensure you are using Python version 3.10.8, then clone the [GitHub](git://github.com/SinaLab/SinaTools) repository.
36
-
37
- Alternatively, you can execute the following command:
38
-
39
- ```bash
40
- pip install sinatools
41
- ```
42
-
43
- Installing Models and Data Files
44
- --------
45
- Some SinaTools modules require additional data files and fine-tuned models. To download them, please consult the [DataDownload](https://sina.birzeit.edu/sinatools/documentation/cli_tools/DataDownload/DataDownload.html) documentation.
46
-
47
- Documentation
48
- --------
49
- For more information, please refer to the [main page](https://sina.birzeit.edu/sinatools) or the [online documentation](https://sina.birzeit.edu/sinatools/documentation).
50
-
51
- Citation
52
- -------
53
- Tymaa Hammouda, Mustafa Jarrar, Mohammed Khalilia: [SinaTools: Open Source Toolkit for Arabic Natural Language Understanding](http://www.jarrar.info/publications/HJK24.pdf). In Proceedings of the 2024 AI in Computational Linguistics (ACLing 2024), Procedia Computer Science, Dubai. ELSEVIER.
54
-
55
- License
56
- --------
57
- SinaTools is available under the MIT License. See the [LICENSE](https://github.com/SinaLab/sinatools/blob/main/LICENSE) file for more information.
58
-
59
- Reporting Issues
60
- --------
61
- To report any issues or bugs, please contact us at "sina.institute.bzu@gmail.com" or visit [SinaTools Issues](https://github.com/SinaLab/sinatools/issues).
62
-
63
-
64
-
1
+ Metadata-Version: 2.1
2
+ Name: SinaTools
3
+ Version: 0.1.38
4
+ Summary: Open-source Python toolkit for Arabic Natural Understanding, allowing people to integrate it in their system workflow.
5
+ Home-page: https://github.com/SinaLab/sinatools
6
+ License: MIT license
7
+ Keywords: sinatools
8
+ Description-Content-Type: text/markdown
9
+ License-File: LICENSE
10
+ License-File: AUTHORS.rst
11
+ Requires-Dist: six
12
+ Requires-Dist: farasapy
13
+ Requires-Dist: tqdm
14
+ Requires-Dist: requests
15
+ Requires-Dist: pathlib
16
+ Requires-Dist: torch ==1.13.0
17
+ Requires-Dist: transformers ==4.24.0
18
+ Requires-Dist: torchtext ==0.14.0
19
+ Requires-Dist: torchvision ==0.14.0
20
+ Requires-Dist: seqeval ==1.2.2
21
+ Requires-Dist: natsort ==7.1.1
22
+
23
+ SinaTools
24
+ ======================
25
+ Open Source Toolkit for Arabic NLP and NLU developed by [SinaLab](http://sina.birzeit.edu/) at Birzeit University. SinaTools is available through Python APIs, command-line tools, Colab notebooks, and online demos.
26
+
27
+ See the full list of [Available Packages](https://sina.birzeit.edu/sinatools/), which includes: (1) [Morphology Tagging](https://sina.birzeit.edu/sinatools/index.html#morph), (2) [Named Entity Recognition (NER)](https://sina.birzeit.edu/sinatools/index.html#ner), (3) [Word Sense Disambiguation (WSD)](https://sina.birzeit.edu/sinatools/index.html#wsd), (4) [Semantic Relatedness](https://sina.birzeit.edu/sinatools/index.html#sr), (5) [Synonymy Extraction and Evaluation](https://sina.birzeit.edu/sinatools/index.html#se), (6) [Relation Extraction](https://sina.birzeit.edu/sinatools/index.html#re), (7) [Utilities](https://sina.birzeit.edu/sinatools/index.html#u) (diacritic-based word matching, Jaccard similarity, parser, tokenizers, corpora processing, transliteration, etc.).
28
+
29
+ See [Demo Pages](https://sina.birzeit.edu/sinatools/).
30
+
31
+ See the [benchmarking](https://www.jarrar.info/publications/HJK24.pdf), which shows that SinaTools outperformed all related toolkits.
32
+
33
+ Installation
34
+ --------
35
+ To install SinaTools, ensure you are using Python version 3.10.8, then clone the [GitHub](git://github.com/SinaLab/SinaTools) repository.
36
+
37
+ Alternatively, you can execute the following command:
38
+
39
+ ```bash
40
+ pip install sinatools
41
+ ```
42
+
43
+ Installing Models and Data Files
44
+ --------
45
+ Some SinaTools modules require additional data files and fine-tuned models. To download them, please consult the [DataDownload](https://sina.birzeit.edu/sinatools/documentation/cli_tools/DataDownload/DataDownload.html) documentation.
46
+
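A hedged sketch of what that download step typically looks like from the command line; the `download_files` command name and its `-f` flag are assumptions taken from the DataDownload page, not confirmed by this diff, so verify them against the documentation for your installed version:

```bash
# Assumed CLI (check the DataDownload documentation for the exact name and flags):
# fetch the fine-tuned models and data files needed by a given module
download_files -f morph
download_files -f ner
```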
47
+ Documentation
48
+ --------
49
+ For more information, please refer to the [main page](https://sina.birzeit.edu/sinatools) or the [online documentation](https://sina.birzeit.edu/sinatools/documentation).
50
+
51
+ Citation
52
+ -------
53
+ Tymaa Hammouda, Mustafa Jarrar, Mohammed Khalilia: [SinaTools: Open Source Toolkit for Arabic Natural Language Understanding](http://www.jarrar.info/publications/HJK24.pdf). In Proceedings of the 2024 AI in Computational Linguistics (ACLing 2024), Procedia Computer Science, Dubai. ELSEVIER.
54
+
55
+ License
56
+ --------
57
+ SinaTools is available under the MIT License. See the [LICENSE](https://github.com/SinaLab/sinatools/blob/main/LICENSE) file for more information.
58
+
59
+ Reporting Issues
60
+ --------
61
+ To report any issues or bugs, please contact us at "sina.institute.bzu@gmail.com" or visit [SinaTools Issues](https://github.com/SinaLab/sinatools/issues).
62
+
@@ -1,5 +1,5 @@
1
- SinaTools-0.1.36.data/data/sinatools/environment.yml,sha256=OzilhLjZbo_3nU93EQNUFX-6G5O3newiSWrwxvMH2Os,7231
2
- sinatools/VERSION,sha256=4WO9ZLWQOVGEf7BUbcCdCnR4_2Fp3iJiMmtiLd4Vzo8,6
1
+ SinaTools-0.1.38.data/data/sinatools/environment.yml,sha256=OzilhLjZbo_3nU93EQNUFX-6G5O3newiSWrwxvMH2Os,7231
2
+ sinatools/VERSION,sha256=IG8zXDtajZ6W0rgxySeHulP0aoaEpnkET2yOuT5wRks,6
3
3
  sinatools/__init__.py,sha256=bEosTU1o-FSpyytS6iVP_82BXHF2yHnzpJxPLYRbeII,135
4
4
  sinatools/environment.yml,sha256=OzilhLjZbo_3nU93EQNUFX-6G5O3newiSWrwxvMH2Os,7231
5
5
  sinatools/install_env.py,sha256=EODeeE0ZzfM_rz33_JSIruX03Nc4ghyVOM5BHVhsZaQ,404
@@ -91,9 +91,9 @@ sinatools/ner/nn/BertNestedTagger.py,sha256=_fwAn1kiKmXe6m5y16Ipty3kvXIEFEmiUq74
91
91
  sinatools/ner/nn/BertSeqTagger.py,sha256=dFcBBiMw2QCWsyy7aQDe_PS3aRuNn4DOxKIHgTblFvc,504
92
92
  sinatools/ner/nn/__init__.py,sha256=UgQD_XLNzQGBNSYc_Bw1aRJZjq4PJsnMT1iZwnJemqE,170
93
93
  sinatools/ner/trainers/BaseTrainer.py,sha256=Ifz4SeTxJwVn1_uWZ3I9KbcSo2hLPN3ojsIYuoKE9wE,4050
94
- sinatools/ner/trainers/BertNestedTrainer.py,sha256=Pb4O2WeBmTvV3hHMT6DXjxrTzgtuh3OrKQZnogYy8RQ,8429
95
- sinatools/ner/trainers/BertTrainer.py,sha256=B_uVtUwfv_eFwMMPsKQvZgW_ZNLy6XEsX5ePR0s8d-k,6433
96
- sinatools/ner/trainers/__init__.py,sha256=UDok8pDDpYOpwRBBKVLKaOgSUlmqqb-zHZI1p0xPxzI,188
94
+ sinatools/ner/trainers/BertNestedTrainer.py,sha256=iJOah69tXZsAXBimqP0odEsk8SPX4A355riePzW2BFs,8632
95
+ sinatools/ner/trainers/BertTrainer.py,sha256=BtttsrHPolmK3eRDqrgVUuv6lVMuImIeskxhi02Q-44,6596
96
+ sinatools/ner/trainers/__init__.py,sha256=Xnbi_M4KKJRqV7FJe1vklyT0nEW2Q2obxgcWkbR0ZbA,190
97
97
  sinatools/relations/__init__.py,sha256=cYjsP2mlTYvAwVIEFtgA6i9gLUSkGVOuDggMs7TvG5k,272
98
98
  sinatools/relations/relation_extractor.py,sha256=UuDlaaR0ch9BFv4sBF1tr7P-P9xq8oRZF41tAze6_ok,9751
99
99
  sinatools/semantic_relatedness/__init__.py,sha256=S0xrmqtl72L02N56nbNMudPoebnYQgsaIyyX-587DsU,830
@@ -104,7 +104,7 @@ sinatools/utils/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
104
104
  sinatools/utils/charsets.py,sha256=rs82oZJqRqosZdTKXfFAJfJ5t4PxjMM_oAPsiWSWuwU,2817
105
105
  sinatools/utils/parser.py,sha256=qvHdln5R5CAv_0UOJWe0mcp8JCsGqgazoeIIkoALH88,6259
106
106
  sinatools/utils/readfile.py,sha256=xE4LEaCqXJIk9v37QUSSmWb-aY3UnCFUNb7uVdx3cpM,133
107
- sinatools/utils/similarity.py,sha256=CgKOJpRAU5UaSjOg-sdZcACCNl9tuKDRwdFAKATCL_w,10762
107
+ sinatools/utils/similarity.py,sha256=HAK6OmyVnfjPm0GWL3z9s4ZoUwpZHVKxt3CeSMfqLIQ,11990
108
108
  sinatools/utils/text_dublication_detector.py,sha256=FeSkbfWGMQluz23H4CBHXION-walZPgjueX6AL8u_Q0,5660
109
109
  sinatools/utils/text_transliteration.py,sha256=F3smhr2AEJtySE6wGQsiXXOslTvSDzLivTYu0btgc10,8769
110
110
  sinatools/utils/tokenizer.py,sha256=nyk6lh5-p38wrU62hvh4wg7ni9ammkdqqIgcjbbBxxo,6965
@@ -114,10 +114,10 @@ sinatools/wsd/__init__.py,sha256=mwmCUurOV42rsNRpIUP3luG0oEzeTfEx3oeDl93Oif8,306
114
114
  sinatools/wsd/disambiguator.py,sha256=h-3idc5rPPbMDSE_QVJAsEVkDHwzYY3L2SEPNXIdOcc,20104
115
115
  sinatools/wsd/settings.py,sha256=6XflVTFKD8SVySX9Wj7zYQtV26WDTcQ2-uW8-gDNHKE,747
116
116
  sinatools/wsd/wsd.py,sha256=gHIBUFXegoY1z3rRnIlK6TduhYq2BTa_dHakOjOlT4k,4434
117
- SinaTools-0.1.36.dist-info/AUTHORS.rst,sha256=aTWeWlIdfLi56iLJfIUAwIrmqDcgxXKLji75_Fjzjyg,174
118
- SinaTools-0.1.36.dist-info/LICENSE,sha256=uwsKYG4TayHXNANWdpfMN2lVW4dimxQjA_7vuCVhD70,1088
119
- SinaTools-0.1.36.dist-info/METADATA,sha256=vukmjuNbUETy8EMIkA64uOOwAS5WO5WuWOOMeBoR6ps,3267
120
- SinaTools-0.1.36.dist-info/WHEEL,sha256=6T3TYZE4YFi2HTS1BeZHNXAi8N52OZT4O-dJ6-ome_4,116
121
- SinaTools-0.1.36.dist-info/entry_points.txt,sha256=-YGM-r0_UtNPnI0C4UcK1ptrpwFZpUhxdy2qHkehNCo,1303
122
- SinaTools-0.1.36.dist-info/top_level.txt,sha256=8tNdPTeJKw3TQCaua8IJIx6N6WpgZZmVekf1OdBNJpE,10
123
- SinaTools-0.1.36.dist-info/RECORD,,
117
+ SinaTools-0.1.38.dist-info/AUTHORS.rst,sha256=aTWeWlIdfLi56iLJfIUAwIrmqDcgxXKLji75_Fjzjyg,174
118
+ SinaTools-0.1.38.dist-info/LICENSE,sha256=uwsKYG4TayHXNANWdpfMN2lVW4dimxQjA_7vuCVhD70,1088
119
+ SinaTools-0.1.38.dist-info/METADATA,sha256=sMasvTcuV4-3WpBTyGKHkm9nTFfXuZkf4uXTHDh5_I8,3324
120
+ SinaTools-0.1.38.dist-info/WHEEL,sha256=DZajD4pwLWue70CAfc7YaxT1wLUciNBvN_TTcvXpltE,110
121
+ SinaTools-0.1.38.dist-info/entry_points.txt,sha256=_CsRKM_tSCWV5hefBNUsWf9_6DrJnzFlxeAo1wm5XqY,1302
122
+ SinaTools-0.1.38.dist-info/top_level.txt,sha256=8tNdPTeJKw3TQCaua8IJIx6N6WpgZZmVekf1OdBNJpE,10
123
+ SinaTools-0.1.38.dist-info/RECORD,,
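Each RECORD entry above follows the wheel RECORD format: `path,sha256=<digest>,<size>`, where the digest is the urlsafe-base64-encoded SHA-256 of the file with the trailing `=` padding stripped. A small self-contained sketch for recomputing an entry; the example path is illustrative:

```python
import base64
import hashlib

def record_entry(path: str) -> str:
    """Build a wheel RECORD line ('path,sha256=<digest>,<size>') for a file on disk."""
    with open(path, "rb") as fh:
        data = fh.read()
    digest = base64.urlsafe_b64encode(hashlib.sha256(data).digest()).rstrip(b"=").decode("ascii")
    return f"{path},sha256={digest},{len(data)}"

# Example (hypothetical local path; point it at a file extracted from the wheel):
# print(record_entry("sinatools/VERSION"))
```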
@@ -1,6 +1,6 @@
1
- Wheel-Version: 1.0
2
- Generator: bdist_wheel (0.34.2)
3
- Root-Is-Purelib: true
4
- Tag: py2-none-any
5
- Tag: py3-none-any
6
-
1
+ Wheel-Version: 1.0
2
+ Generator: bdist_wheel (0.43.0)
3
+ Root-Is-Purelib: true
4
+ Tag: py2-none-any
5
+ Tag: py3-none-any
6
+
@@ -20,4 +20,3 @@ sentence_tokenizer = sinatools.CLI.utils.sentence_tokenizer:main
20
20
  text_dublication_detector = sinatools.CLI.utils.text_dublication_detector:main
21
21
  transliterate = sinatools.CLI.utils.text_transliteration:main
22
22
  wsd = sinatools.CLI.wsd.disambiguator:main
23
-
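The entry_points.txt lines above are console-script declarations: installing the wheel puts each left-hand name on PATH and wires it to the `module:function` on the right. A hedged example of what that enables; the `--help` flag assumes an argparse-style CLI and is not confirmed by this diff:

```bash
pip install sinatools
# the `wsd` command now resolves to sinatools.CLI.wsd.disambiguator:main
wsd --help
```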
sinatools/VERSION CHANGED
@@ -1 +1 @@
1
- 0.1.36
1
+ 0.1.38
@@ -1,203 +1,203 @@
1
- import os
2
- import logging
3
- import torch
4
- import numpy as np
5
- from sinatools.ner.trainers import BaseTrainer
6
- from sinatools.ner.metrics import compute_nested_metrics
7
-
8
- logger = logging.getLogger(__name__)
9
-
10
-
11
- class BertNestedTrainer(BaseTrainer):
12
- def __init__(self, **kwargs):
13
- super().__init__(**kwargs)
14
-
15
- def train(self):
16
- best_val_loss, test_loss = np.inf, np.inf
17
- num_train_batch = len(self.train_dataloader)
18
- num_labels = [len(v) for v in self.train_dataloader.dataset.vocab.tags[1:]]
19
- patience = self.patience
20
-
21
- for epoch_index in range(self.max_epochs):
22
- self.current_epoch = epoch_index
23
- train_loss = 0
24
-
25
- for batch_index, (subwords, gold_tags, tokens, valid_len, logits) in enumerate(self.tag(
26
- self.train_dataloader, is_train=True
27
- ), 1):
28
- self.current_timestep += 1
29
-
30
- # Compute losses for each output
31
- # logits = B x T x L x C
32
- losses = [self.loss(logits[:, :, i, 0:l].view(-1, logits[:, :, i, 0:l].shape[-1]),
33
- torch.reshape(gold_tags[:, i, :], (-1,)).long())
34
- for i, l in enumerate(num_labels)]
35
-
36
- torch.autograd.backward(losses)
37
-
38
- # Avoid exploding gradient by doing gradient clipping
39
- torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.clip)
40
-
41
- self.optimizer.step()
42
- self.scheduler.step()
43
- batch_loss = sum(l.item() for l in losses)
44
- train_loss += batch_loss
45
-
46
- if self.current_timestep % self.log_interval == 0:
47
- logger.info(
48
- "Epoch %d | Batch %d/%d | Timestep %d | LR %.10f | Loss %f",
49
- epoch_index,
50
- batch_index,
51
- num_train_batch,
52
- self.current_timestep,
53
- self.optimizer.param_groups[0]['lr'],
54
- batch_loss
55
- )
56
-
57
- train_loss /= num_train_batch
58
-
59
- logger.info("** Evaluating on validation dataset **")
60
- val_preds, segments, valid_len, val_loss = self.eval(self.val_dataloader)
61
- val_metrics = compute_nested_metrics(segments, self.val_dataloader.dataset.transform.vocab.tags[1:])
62
-
63
- epoch_summary_loss = {
64
- "train_loss": train_loss,
65
- "val_loss": val_loss
66
- }
67
- epoch_summary_metrics = {
68
- "val_micro_f1": val_metrics.micro_f1,
69
- "val_precision": val_metrics.precision,
70
- "val_recall": val_metrics.recall
71
- }
72
-
73
- logger.info(
74
- "Epoch %d | Timestep %d | Train Loss %f | Val Loss %f | F1 %f",
75
- epoch_index,
76
- self.current_timestep,
77
- train_loss,
78
- val_loss,
79
- val_metrics.micro_f1
80
- )
81
-
82
- if val_loss < best_val_loss:
83
- patience = self.patience
84
- best_val_loss = val_loss
85
- logger.info("** Validation improved, evaluating test data **")
86
- test_preds, segments, valid_len, test_loss = self.eval(self.test_dataloader)
87
- self.segments_to_file(segments, os.path.join(self.output_path, "predictions.txt"))
88
- test_metrics = compute_nested_metrics(segments, self.test_dataloader.dataset.transform.vocab.tags[1:])
89
-
90
- epoch_summary_loss["test_loss"] = test_loss
91
- epoch_summary_metrics["test_micro_f1"] = test_metrics.micro_f1
92
- epoch_summary_metrics["test_precision"] = test_metrics.precision
93
- epoch_summary_metrics["test_recall"] = test_metrics.recall
94
-
95
- logger.info(
96
- f"Epoch %d | Timestep %d | Test Loss %f | F1 %f",
97
- epoch_index,
98
- self.current_timestep,
99
- test_loss,
100
- test_metrics.micro_f1
101
- )
102
-
103
- self.save()
104
- else:
105
- patience -= 1
106
-
107
- # No improvements, terminating early
108
- if patience == 0:
109
- logger.info("Early termination triggered")
110
- break
111
-
112
- self.summary_writer.add_scalars("Loss", epoch_summary_loss, global_step=self.current_timestep)
113
- self.summary_writer.add_scalars("Metrics", epoch_summary_metrics, global_step=self.current_timestep)
114
-
115
- def tag(self, dataloader, is_train=True):
116
- """
117
- Given a dataloader containing segments, predict the tags
118
- :param dataloader: torch.utils.data.DataLoader
119
- :param is_train: boolean - True for training model, False for evaluation
120
- :return: Iterator
121
- subwords (B x T x NUM_LABELS)- torch.Tensor - BERT subword ID
122
- gold_tags (B x T x NUM_LABELS) - torch.Tensor - ground truth tags IDs
123
- tokens - List[arabiner.data.dataset.Token] - list of tokens
124
- valid_len (B x 1) - int - valid length of each sequence
125
- logits (B x T x NUM_LABELS) - logits for each token and each tag
126
- """
127
- for subwords, gold_tags, tokens, mask, valid_len in dataloader:
128
- self.model.train(is_train)
129
-
130
- if torch.cuda.is_available():
131
- subwords = subwords.cuda()
132
- gold_tags = gold_tags.cuda()
133
-
134
- if is_train:
135
- self.optimizer.zero_grad()
136
- logits = self.model(subwords)
137
- else:
138
- with torch.no_grad():
139
- logits = self.model(subwords)
140
-
141
- yield subwords, gold_tags, tokens, valid_len, logits
142
-
143
- def eval(self, dataloader):
144
- golds, preds, segments, valid_lens = list(), list(), list(), list()
145
- num_labels = [len(v) for v in dataloader.dataset.vocab.tags[1:]]
146
- loss = 0
147
-
148
- for _, gold_tags, tokens, valid_len, logits in self.tag(
149
- dataloader, is_train=False
150
- ):
151
- losses = [self.loss(logits[:, :, i, 0:l].view(-1, logits[:, :, i, 0:l].shape[-1]),
152
- torch.reshape(gold_tags[:, i, :], (-1,)).long())
153
- for i, l in enumerate(num_labels)]
154
- loss += sum(losses)
155
- preds += torch.argmax(logits, dim=3)
156
- segments += tokens
157
- valid_lens += list(valid_len)
158
-
159
- loss /= len(dataloader)
160
-
161
- # Update segments, attach predicted tags to each token
162
- segments = self.to_segments(segments, preds, valid_lens, dataloader.dataset.vocab)
163
-
164
- return preds, segments, valid_lens, loss
165
-
166
- def infer(self, dataloader):
167
- golds, preds, segments, valid_lens = list(), list(), list(), list()
168
-
169
- for _, gold_tags, tokens, valid_len, logits in self.tag(
170
- dataloader, is_train=False
171
- ):
172
- preds += torch.argmax(logits, dim=3)
173
- segments += tokens
174
- valid_lens += list(valid_len)
175
-
176
- segments = self.to_segments(segments, preds, valid_lens, dataloader.dataset.vocab)
177
- return segments
178
-
179
- def to_segments(self, segments, preds, valid_lens, vocab):
180
- if vocab is None:
181
- vocab = self.vocab
182
-
183
- tagged_segments = list()
184
- tokens_stoi = vocab.tokens.get_stoi()
185
- unk_id = tokens_stoi["UNK"]
186
-
187
- for segment, pred, valid_len in zip(segments, preds, valid_lens):
188
- # First, drop the [CLS] token at index 0 and the [SEP] token at index valid_len-1
189
- # Combine the tokens with their corresponding predictions
190
- segment_pred = zip(segment[1:valid_len-1], pred[1:valid_len-1])
191
-
192
- # Ignore the sub-tokens/subwords, which are identified with text being UNK
193
- segment_pred = list(filter(lambda t: tokens_stoi[t[0].text] != unk_id, segment_pred))
194
-
195
- # Attach the predicted tags to each token
196
- list(map(lambda t: setattr(t[0], 'pred_tag', [{"tag": vocab.get_itos()[tag_id]}
197
- for tag_id, vocab in zip(t[1].int().tolist(), vocab.tags[1:])]), segment_pred))
198
-
199
- # We are only interested in the tagged tokens; we no longer need the raw model predictions
200
- tagged_segment = [t for t, _ in segment_pred]
201
- tagged_segments.append(tagged_segment)
202
-
203
- return tagged_segments
1
+ import os
2
+ import logging
3
+ import torch
4
+ import numpy as np
5
+ from sinatools.ner.trainers import BaseTrainer
6
+ from sinatools.ner.metrics import compute_nested_metrics
7
+
8
+ logger = logging.getLogger(__name__)
9
+
10
+
11
+ class BertNestedTrainer(BaseTrainer):
12
+ def __init__(self, **kwargs):
13
+ super().__init__(**kwargs)
14
+
15
+ def train(self):
16
+ best_val_loss, test_loss = np.inf, np.inf
17
+ num_train_batch = len(self.train_dataloader)
18
+ num_labels = [len(v) for v in self.train_dataloader.dataset.vocab.tags[1:]]
19
+ patience = self.patience
20
+
21
+ for epoch_index in range(self.max_epochs):
22
+ self.current_epoch = epoch_index
23
+ train_loss = 0
24
+
25
+ for batch_index, (subwords, gold_tags, tokens, valid_len, logits) in enumerate(self.tag(
26
+ self.train_dataloader, is_train=True
27
+ ), 1):
28
+ self.current_timestep += 1
29
+
30
+ # Compute losses for each output
31
+ # logits = B x T x L x C
32
+ losses = [self.loss(logits[:, :, i, 0:l].view(-1, logits[:, :, i, 0:l].shape[-1]),
33
+ torch.reshape(gold_tags[:, i, :], (-1,)).long())
34
+ for i, l in enumerate(num_labels)]
35
+
36
+ torch.autograd.backward(losses)
37
+
38
+ # Avoid exploding gradient by doing gradient clipping
39
+ torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.clip)
40
+
41
+ self.optimizer.step()
42
+ self.scheduler.step()
43
+ batch_loss = sum(l.item() for l in losses)
44
+ train_loss += batch_loss
45
+
46
+ if self.current_timestep % self.log_interval == 0:
47
+ logger.info(
48
+ "Epoch %d | Batch %d/%d | Timestep %d | LR %.10f | Loss %f",
49
+ epoch_index,
50
+ batch_index,
51
+ num_train_batch,
52
+ self.current_timestep,
53
+ self.optimizer.param_groups[0]['lr'],
54
+ batch_loss
55
+ )
56
+
57
+ train_loss /= num_train_batch
58
+
59
+ logger.info("** Evaluating on validation dataset **")
60
+ val_preds, segments, valid_len, val_loss = self.eval(self.val_dataloader)
61
+ val_metrics = compute_nested_metrics(segments, self.val_dataloader.dataset.transform.vocab.tags[1:])
62
+
63
+ epoch_summary_loss = {
64
+ "train_loss": train_loss,
65
+ "val_loss": val_loss
66
+ }
67
+ epoch_summary_metrics = {
68
+ "val_micro_f1": val_metrics.micro_f1,
69
+ "val_precision": val_metrics.precision,
70
+ "val_recall": val_metrics.recall
71
+ }
72
+
73
+ logger.info(
74
+ "Epoch %d | Timestep %d | Train Loss %f | Val Loss %f | F1 %f",
75
+ epoch_index,
76
+ self.current_timestep,
77
+ train_loss,
78
+ val_loss,
79
+ val_metrics.micro_f1
80
+ )
81
+
82
+ if val_loss < best_val_loss:
83
+ patience = self.patience
84
+ best_val_loss = val_loss
85
+ logger.info("** Validation improved, evaluating test data **")
86
+ test_preds, segments, valid_len, test_loss = self.eval(self.test_dataloader)
87
+ self.segments_to_file(segments, os.path.join(self.output_path, "predictions.txt"))
88
+ test_metrics = compute_nested_metrics(segments, self.test_dataloader.dataset.transform.vocab.tags[1:])
89
+
90
+ epoch_summary_loss["test_loss"] = test_loss
91
+ epoch_summary_metrics["test_micro_f1"] = test_metrics.micro_f1
92
+ epoch_summary_metrics["test_precision"] = test_metrics.precision
93
+ epoch_summary_metrics["test_recall"] = test_metrics.recall
94
+
95
+ logger.info(
96
+ f"Epoch %d | Timestep %d | Test Loss %f | F1 %f",
97
+ epoch_index,
98
+ self.current_timestep,
99
+ test_loss,
100
+ test_metrics.micro_f1
101
+ )
102
+
103
+ self.save()
104
+ else:
105
+ patience -= 1
106
+
107
+ # No improvements, terminating early
108
+ if patience == 0:
109
+ logger.info("Early termination triggered")
110
+ break
111
+
112
+ self.summary_writer.add_scalars("Loss", epoch_summary_loss, global_step=self.current_timestep)
113
+ self.summary_writer.add_scalars("Metrics", epoch_summary_metrics, global_step=self.current_timestep)
114
+
115
+ def tag(self, dataloader, is_train=True):
116
+ """
117
+ Given a dataloader containing segments, predict the tags
118
+ :param dataloader: torch.utils.data.DataLoader
119
+ :param is_train: boolean - True for training model, False for evaluation
120
+ :return: Iterator
121
+ subwords (B x T x NUM_LABELS)- torch.Tensor - BERT subword ID
122
+ gold_tags (B x T x NUM_LABELS) - torch.Tensor - ground truth tags IDs
123
+ tokens - List[arabiner.data.dataset.Token] - list of tokens
124
+ valid_len (B x 1) - int - valid length of each sequence
125
+ logits (B x T x NUM_LABELS) - logits for each token and each tag
126
+ """
127
+ for subwords, gold_tags, tokens, mask, valid_len in dataloader:
128
+ self.model.train(is_train)
129
+
130
+ if torch.cuda.is_available():
131
+ subwords = subwords.cuda()
132
+ gold_tags = gold_tags.cuda()
133
+
134
+ if is_train:
135
+ self.optimizer.zero_grad()
136
+ logits = self.model(subwords)
137
+ else:
138
+ with torch.no_grad():
139
+ logits = self.model(subwords)
140
+
141
+ yield subwords, gold_tags, tokens, valid_len, logits
142
+
143
+ def eval(self, dataloader):
144
+ golds, preds, segments, valid_lens = list(), list(), list(), list()
145
+ num_labels = [len(v) for v in dataloader.dataset.vocab.tags[1:]]
146
+ loss = 0
147
+
148
+ for _, gold_tags, tokens, valid_len, logits in self.tag(
149
+ dataloader, is_train=False
150
+ ):
151
+ losses = [self.loss(logits[:, :, i, 0:l].view(-1, logits[:, :, i, 0:l].shape[-1]),
152
+ torch.reshape(gold_tags[:, i, :], (-1,)).long())
153
+ for i, l in enumerate(num_labels)]
154
+ loss += sum(losses)
155
+ preds += torch.argmax(logits, dim=3)
156
+ segments += tokens
157
+ valid_lens += list(valid_len)
158
+
159
+ loss /= len(dataloader)
160
+
161
+ # Update segments, attach predicted tags to each token
162
+ segments = self.to_segments(segments, preds, valid_lens, dataloader.dataset.vocab)
163
+
164
+ return preds, segments, valid_lens, loss
165
+
166
+ def infer(self, dataloader):
167
+ golds, preds, segments, valid_lens = list(), list(), list(), list()
168
+
169
+ for _, gold_tags, tokens, valid_len, logits in self.tag(
170
+ dataloader, is_train=False
171
+ ):
172
+ preds += torch.argmax(logits, dim=3)
173
+ segments += tokens
174
+ valid_lens += list(valid_len)
175
+
176
+ segments = self.to_segments(segments, preds, valid_lens, dataloader.dataset.vocab)
177
+ return segments
178
+
179
+ def to_segments(self, segments, preds, valid_lens, vocab):
180
+ if vocab is None:
181
+ vocab = self.vocab
182
+
183
+ tagged_segments = list()
184
+ tokens_stoi = vocab.tokens.get_stoi()
185
+ unk_id = tokens_stoi["UNK"]
186
+
187
+ for segment, pred, valid_len in zip(segments, preds, valid_lens):
188
+ # First, drop the [CLS] token at index 0 and the [SEP] token at index valid_len-1
189
+ # Combine the tokens with their corresponding predictions
190
+ segment_pred = zip(segment[1:valid_len-1], pred[1:valid_len-1])
191
+
192
+ # Ignore the sub-tokens/subwords, which are identified with text being UNK
193
+ segment_pred = list(filter(lambda t: tokens_stoi[t[0].text] != unk_id, segment_pred))
194
+
195
+ # Attach the predicted tags to each token
196
+ list(map(lambda t: setattr(t[0], 'pred_tag', [{"tag": vocab.get_itos()[tag_id]}
197
+ for tag_id, vocab in zip(t[1].int().tolist(), vocab.tags[1:])]), segment_pred))
198
+
199
+ # We are only interested in the tagged tokens; we no longer need the raw model predictions
200
+ tagged_segment = [t for t, _ in segment_pred]
201
+ tagged_segments.append(tagged_segment)
202
+
203
+ return tagged_segments
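As context for the per-layer loss in BertNestedTrainer above (identical in both versions shown here): a minimal, self-contained sketch of computing one cross-entropy term per tag layer from a `B x T x L x C` logits tensor and backpropagating them together. The shapes and label counts are made up, and this mirrors the list comprehension in `train()`/`eval()` rather than reproducing the package's code:

```python
import torch
import torch.nn as nn

B, T, L, C = 2, 8, 3, 5                 # batch, tokens, tag layers, max classes per layer
num_labels = [5, 4, 3]                  # usable classes in each tag layer (<= C)

logits = torch.randn(B, T, L, C, requires_grad=True)   # stand-in for the model output
gold = torch.randint(0, 3, (B, L, T))                   # gold tag ids, shaped B x L x T

loss_fn = nn.CrossEntropyLoss()
losses = [
    loss_fn(
        logits[:, :, i, :l].reshape(-1, l),   # (B*T) x l scores for layer i
        gold[:, i, :].reshape(-1).long(),     # (B*T) gold ids for layer i
    )
    for i, l in enumerate(num_labels)
]

# Backward over all layer losses at once; gradients from every layer accumulate.
torch.autograd.backward(losses)
print([round(x.item(), 3) for x in losses])
```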
@@ -1,163 +1,163 @@
1
- import os
2
- import logging
3
- import torch
4
- import numpy as np
5
- from sinatools.ner.trainers import BaseTrainer
6
- from sinatools.ner.metrics import compute_single_label_metrics
7
-
8
- logger = logging.getLogger(__name__)
9
-
10
-
11
- class BertTrainer(BaseTrainer):
12
- def __init__(self, **kwargs):
13
- super().__init__(**kwargs)
14
-
15
- def train(self):
16
- best_val_loss, test_loss = np.inf, np.inf
17
- num_train_batch = len(self.train_dataloader)
18
- patience = self.patience
19
-
20
- for epoch_index in range(self.max_epochs):
21
- self.current_epoch = epoch_index
22
- train_loss = 0
23
-
24
- for batch_index, (_, gold_tags, _, _, logits) in enumerate(self.tag(
25
- self.train_dataloader, is_train=True
26
- ), 1):
27
- self.current_timestep += 1
28
- batch_loss = self.loss(logits.view(-1, logits.shape[-1]), gold_tags.view(-1))
29
- batch_loss.backward()
30
-
31
- # Avoid exploding gradient by doing gradient clipping
32
- torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.clip)
33
-
34
- self.optimizer.step()
35
- self.scheduler.step()
36
- train_loss += batch_loss.item()
37
-
38
- if self.current_timestep % self.log_interval == 0:
39
- logger.info(
40
- "Epoch %d | Batch %d/%d | Timestep %d | LR %.10f | Loss %f",
41
- epoch_index,
42
- batch_index,
43
- num_train_batch,
44
- self.current_timestep,
45
- self.optimizer.param_groups[0]['lr'],
46
- batch_loss.item()
47
- )
48
-
49
- train_loss /= num_train_batch
50
-
51
- logger.info("** Evaluating on validation dataset **")
52
- val_preds, segments, valid_len, val_loss = self.eval(self.val_dataloader)
53
- val_metrics = compute_single_label_metrics(segments)
54
-
55
- epoch_summary_loss = {
56
- "train_loss": train_loss,
57
- "val_loss": val_loss
58
- }
59
- epoch_summary_metrics = {
60
- "val_micro_f1": val_metrics.micro_f1,
61
- "val_precision": val_metrics.precision,
62
- "val_recall": val_metrics.recall
63
- }
64
-
65
- logger.info(
66
- "Epoch %d | Timestep %d | Train Loss %f | Val Loss %f | F1 %f",
67
- epoch_index,
68
- self.current_timestep,
69
- train_loss,
70
- val_loss,
71
- val_metrics.micro_f1
72
- )
73
-
74
- if val_loss < best_val_loss:
75
- patience = self.patience
76
- best_val_loss = val_loss
77
- logger.info("** Validation improved, evaluating test data **")
78
- test_preds, segments, valid_len, test_loss = self.eval(self.test_dataloader)
79
- self.segments_to_file(segments, os.path.join(self.output_path, "predictions.txt"))
80
- test_metrics = compute_single_label_metrics(segments)
81
-
82
- epoch_summary_loss["test_loss"] = test_loss
83
- epoch_summary_metrics["test_micro_f1"] = test_metrics.micro_f1
84
- epoch_summary_metrics["test_precision"] = test_metrics.precision
85
- epoch_summary_metrics["test_recall"] = test_metrics.recall
86
-
87
- logger.info(
88
- f"Epoch %d | Timestep %d | Test Loss %f | F1 %f",
89
- epoch_index,
90
- self.current_timestep,
91
- test_loss,
92
- test_metrics.micro_f1
93
- )
94
-
95
- self.save()
96
- else:
97
- patience -= 1
98
-
99
- # No improvements, terminating early
100
- if patience == 0:
101
- logger.info("Early termination triggered")
102
- break
103
-
104
- self.summary_writer.add_scalars("Loss", epoch_summary_loss, global_step=self.current_timestep)
105
- self.summary_writer.add_scalars("Metrics", epoch_summary_metrics, global_step=self.current_timestep)
106
-
107
- def eval(self, dataloader):
108
- golds, preds, segments, valid_lens = list(), list(), list(), list()
109
- loss = 0
110
-
111
- for _, gold_tags, tokens, valid_len, logits in self.tag(
112
- dataloader, is_train=False
113
- ):
114
- loss += self.loss(logits.view(-1, logits.shape[-1]), gold_tags.view(-1))
115
- preds += torch.argmax(logits, dim=2).detach().cpu().numpy().tolist()
116
- segments += tokens
117
- valid_lens += list(valid_len)
118
-
119
- loss /= len(dataloader)
120
-
121
- # Update segments, attach predicted tags to each token
122
- segments = self.to_segments(segments, preds, valid_lens, dataloader.dataset.vocab)
123
-
124
- return preds, segments, valid_lens, loss.item()
125
-
126
- def infer(self, dataloader):
127
- golds, preds, segments, valid_lens = list(), list(), list(), list()
128
-
129
- for _, gold_tags, tokens, valid_len, logits in self.tag(
130
- dataloader, is_train=False
131
- ):
132
- preds += torch.argmax(logits, dim=2).detach().cpu().numpy().tolist()
133
- segments += tokens
134
- valid_lens += list(valid_len)
135
-
136
- segments = self.to_segments(segments, preds, valid_lens, dataloader.dataset.vocab)
137
- return segments
138
-
139
- def to_segments(self, segments, preds, valid_lens, vocab):
140
- if vocab is None:
141
- vocab = self.vocab
142
-
143
- tagged_segments = list()
144
- tokens_stoi = vocab.tokens.get_stoi()
145
- tags_itos = vocab.tags[0].get_itos()
146
- unk_id = tokens_stoi["UNK"]
147
-
148
- for segment, pred, valid_len in zip(segments, preds, valid_lens):
149
- # First, drop the [CLS] token at index 0 and the [SEP] token at index valid_len-1
150
- # Combine the tokens with their corresponding predictions
151
- segment_pred = zip(segment[1:valid_len-1], pred[1:valid_len-1])
152
-
153
- # Ignore the sub-tokens/subwords, which are identified with text being UNK
154
- segment_pred = list(filter(lambda t: tokens_stoi[t[0].text] != unk_id, segment_pred))
155
-
156
- # Attach the predicted tags to each token
157
- list(map(lambda t: setattr(t[0], 'pred_tag', [{"tag": tags_itos[t[1]]}]), segment_pred))
158
-
159
- # We are only interested in the tagged tokens; we no longer need the raw model predictions
160
- tagged_segment = [t for t, _ in segment_pred]
161
- tagged_segments.append(tagged_segment)
162
-
163
- return tagged_segments
1
+ import os
2
+ import logging
3
+ import torch
4
+ import numpy as np
5
+ from sinatools.ner.trainers import BaseTrainer
6
+ from sinatools.ner.metrics import compute_single_label_metrics
7
+
8
+ logger = logging.getLogger(__name__)
9
+
10
+
11
+ class BertTrainer(BaseTrainer):
12
+ def __init__(self, **kwargs):
13
+ super().__init__(**kwargs)
14
+
15
+ def train(self):
16
+ best_val_loss, test_loss = np.inf, np.inf
17
+ num_train_batch = len(self.train_dataloader)
18
+ patience = self.patience
19
+
20
+ for epoch_index in range(self.max_epochs):
21
+ self.current_epoch = epoch_index
22
+ train_loss = 0
23
+
24
+ for batch_index, (_, gold_tags, _, _, logits) in enumerate(self.tag(
25
+ self.train_dataloader, is_train=True
26
+ ), 1):
27
+ self.current_timestep += 1
28
+ batch_loss = self.loss(logits.view(-1, logits.shape[-1]), gold_tags.view(-1))
29
+ batch_loss.backward()
30
+
31
+ # Avoid exploding gradient by doing gradient clipping
32
+ torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.clip)
33
+
34
+ self.optimizer.step()
35
+ self.scheduler.step()
36
+ train_loss += batch_loss.item()
37
+
38
+ if self.current_timestep % self.log_interval == 0:
39
+ logger.info(
40
+ "Epoch %d | Batch %d/%d | Timestep %d | LR %.10f | Loss %f",
41
+ epoch_index,
42
+ batch_index,
43
+ num_train_batch,
44
+ self.current_timestep,
45
+ self.optimizer.param_groups[0]['lr'],
46
+ batch_loss.item()
47
+ )
48
+
49
+ train_loss /= num_train_batch
50
+
51
+ logger.info("** Evaluating on validation dataset **")
52
+ val_preds, segments, valid_len, val_loss = self.eval(self.val_dataloader)
53
+ val_metrics = compute_single_label_metrics(segments)
54
+
55
+ epoch_summary_loss = {
56
+ "train_loss": train_loss,
57
+ "val_loss": val_loss
58
+ }
59
+ epoch_summary_metrics = {
60
+ "val_micro_f1": val_metrics.micro_f1,
61
+ "val_precision": val_metrics.precision,
62
+ "val_recall": val_metrics.recall
63
+ }
64
+
65
+ logger.info(
66
+ "Epoch %d | Timestep %d | Train Loss %f | Val Loss %f | F1 %f",
67
+ epoch_index,
68
+ self.current_timestep,
69
+ train_loss,
70
+ val_loss,
71
+ val_metrics.micro_f1
72
+ )
73
+
74
+ if val_loss < best_val_loss:
75
+ patience = self.patience
76
+ best_val_loss = val_loss
77
+ logger.info("** Validation improved, evaluating test data **")
78
+ test_preds, segments, valid_len, test_loss = self.eval(self.test_dataloader)
79
+ self.segments_to_file(segments, os.path.join(self.output_path, "predictions.txt"))
80
+ test_metrics = compute_single_label_metrics(segments)
81
+
82
+ epoch_summary_loss["test_loss"] = test_loss
83
+ epoch_summary_metrics["test_micro_f1"] = test_metrics.micro_f1
84
+ epoch_summary_metrics["test_precision"] = test_metrics.precision
85
+ epoch_summary_metrics["test_recall"] = test_metrics.recall
86
+
87
+ logger.info(
88
+ f"Epoch %d | Timestep %d | Test Loss %f | F1 %f",
89
+ epoch_index,
90
+ self.current_timestep,
91
+ test_loss,
92
+ test_metrics.micro_f1
93
+ )
94
+
95
+ self.save()
96
+ else:
97
+ patience -= 1
98
+
99
+ # No improvements, terminating early
100
+ if patience == 0:
101
+ logger.info("Early termination triggered")
102
+ break
103
+
104
+ self.summary_writer.add_scalars("Loss", epoch_summary_loss, global_step=self.current_timestep)
105
+ self.summary_writer.add_scalars("Metrics", epoch_summary_metrics, global_step=self.current_timestep)
106
+
107
+ def eval(self, dataloader):
108
+ golds, preds, segments, valid_lens = list(), list(), list(), list()
109
+ loss = 0
110
+
111
+ for _, gold_tags, tokens, valid_len, logits in self.tag(
112
+ dataloader, is_train=False
113
+ ):
114
+ loss += self.loss(logits.view(-1, logits.shape[-1]), gold_tags.view(-1))
115
+ preds += torch.argmax(logits, dim=2).detach().cpu().numpy().tolist()
116
+ segments += tokens
117
+ valid_lens += list(valid_len)
118
+
119
+ loss /= len(dataloader)
120
+
121
+ # Update segments, attach predicted tags to each token
122
+ segments = self.to_segments(segments, preds, valid_lens, dataloader.dataset.vocab)
123
+
124
+ return preds, segments, valid_lens, loss.item()
125
+
126
+ def infer(self, dataloader):
127
+ golds, preds, segments, valid_lens = list(), list(), list(), list()
128
+
129
+ for _, gold_tags, tokens, valid_len, logits in self.tag(
130
+ dataloader, is_train=False
131
+ ):
132
+ preds += torch.argmax(logits, dim=2).detach().cpu().numpy().tolist()
133
+ segments += tokens
134
+ valid_lens += list(valid_len)
135
+
136
+ segments = self.to_segments(segments, preds, valid_lens, dataloader.dataset.vocab)
137
+ return segments
138
+
139
+ def to_segments(self, segments, preds, valid_lens, vocab):
140
+ if vocab is None:
141
+ vocab = self.vocab
142
+
143
+ tagged_segments = list()
144
+ tokens_stoi = vocab.tokens.get_stoi()
145
+ tags_itos = vocab.tags[0].get_itos()
146
+ unk_id = tokens_stoi["UNK"]
147
+
148
+ for segment, pred, valid_len in zip(segments, preds, valid_lens):
149
+ # First, drop the [CLS] token at index 0 and the [SEP] token at index valid_len-1
150
+ # Combine the tokens with their corresponding predictions
151
+ segment_pred = zip(segment[1:valid_len-1], pred[1:valid_len-1])
152
+
153
+ # Ignore the sub-tokens/subwords, which are identified with text being UNK
154
+ segment_pred = list(filter(lambda t: tokens_stoi[t[0].text] != unk_id, segment_pred))
155
+
156
+ # Attach the predicted tags to each token
157
+ list(map(lambda t: setattr(t[0], 'pred_tag', [{"tag": tags_itos[t[1]]}]), segment_pred))
158
+
159
+ # We are only interested in the tagged tokens; we no longer need the raw model predictions
160
+ tagged_segment = [t for t, _ in segment_pred]
161
+ tagged_segments.append(tagged_segment)
162
+
163
+ return tagged_segments
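The trainers above report micro F1, precision, and recall through the package's metrics helpers. A hedged illustration of the span-level scoring involved, using seqeval (declared above as `seqeval==1.2.2`); the tags are invented for the example and this is not the package's own `compute_single_label_metrics`:

```python
from seqeval.metrics import f1_score, precision_score, recall_score

# One "segment" of gold vs. predicted IOB2 tags (example values only).
gold = [["B-PERS", "I-PERS", "O", "B-ORG", "O"]]
pred = [["B-PERS", "I-PERS", "O", "O", "O"]]

print("precision:", precision_score(gold, pred))
print("recall:   ", recall_score(gold, pred))
print("micro f1: ", f1_score(gold, pred))
```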
@@ -1,3 +1,3 @@
1
- from sinatools.ner.trainers.BaseTrainer import BaseTrainer
2
- from sinatools.ner.trainers.BertTrainer import BertTrainer
1
+ from sinatools.ner.trainers.BaseTrainer import BaseTrainer
2
+ from sinatools.ner.trainers.BertTrainer import BertTrainer
3
3
  from sinatools.ner.trainers.BertNestedTrainer import BertNestedTrainer
@@ -101,56 +101,91 @@ def get_intersection(list1, list2, ignore_all_diacritics_but_not_shadda=False, i
101
101
 
102
102
 
103
103
 
104
- def get_union(list1, list2, ignore_all_diacritics_but_not_shadda, ignore_shadda_diacritic):
105
- """
106
- Computes the union of two sets of Arabic words, considering the differences in their diacritization. The method provides two options for handling diacritics: (i) ignore all diacritics except for shadda, and (ii) ignore the shadda diacritic as well. You can try the demo online.
104
+ # def get_union(list1, list2, ignore_all_diacritics_but_not_shadda, ignore_shadda_diacritic):
105
+ # """
106
+ # Computes the union of two sets of Arabic words, considering the differences in their diacritization. The method provides two options for handling diacritics: (i) ignore all diacritics except for shadda, and (ii) ignore the shadda diacritic as well. You can try the demo online.
107
107
 
108
- Args:
109
- list1 (:obj:`list`): The first list.
110
- list2 (:obj:`list`): The second list.
111
- ignore_all_diacritics_but_not_shadda (:obj:`bool`, optional) – A flag to ignore all diacritics except for the shadda. Defaults to False.
112
- ignore_shadda_diacritic (:obj:`bool`, optional) – A flag to ignore the shadda diacritic. Defaults to False.
108
+ # Args:
109
+ # list1 (:obj:`list`): The first list.
110
+ # list2 (:obj:`list`): The second list.
111
+ # ignore_all_diacritics_but_not_shadda (:obj:`bool`, optional) – A flag to ignore all diacritics except for the shadda. Defaults to False.
112
+ # ignore_shadda_diacritic (:obj:`bool`, optional) – A flag to ignore the shadda diacritic. Defaults to False.
113
113
 
114
- Returns:
115
- :obj:`list`: The union of the two lists, ignoring diacritics if flags are true.
114
+ # Returns:
115
+ # :obj:`list`: The union of the two lists, ignoring diacritics if flags are true.
116
116
 
117
- **Example:**
117
+ # **Example:**
118
118
 
119
- .. highlight:: python
120
- .. code-block:: python
119
+ # .. highlight:: python
120
+ # .. code-block:: python
121
121
 
122
- from sinatools.utils.similarity import get_union
123
- list1 = ["كتب","فَعل","فَعَلَ"]
124
- list2 = ["كتب","فَعّل"]
125
- print(get_union(list1, list2, False, True))
126
- #output: ["كتب" ,"فَعل" ,"فَعَلَ"]
127
- """
128
- list1 = [str(i) for i in list1 if i not in (None, ' ', '')]
122
+ # from sinatools.utils.similarity import get_union
123
+ # list1 = ["كتب","فَعل","فَعَلَ"]
124
+ # list2 = ["كتب","فَعّل"]
125
+ # print(get_union(list1, list2, False, True))
126
+ # #output: ["كتب" ,"فَعل" ,"فَعَلَ"]
127
+ # """
128
+ # list1 = [str(i) for i in list1 if i not in (None, ' ', '')]
129
129
 
130
+ # list2 = [str(i) for i in list2 if i not in (None, ' ', '')]
131
+
132
+ # union_list = []
133
+
134
+ # for list1_word in list1:
135
+ # word1 = normalize_word(list1_word, ignore_all_diacritics_but_not_shadda, ignore_shadda_diacritic)
136
+ # union_list.append(word1)
137
+
138
+ # for list2_word in list2:
139
+ # word2 = normalize_word(list2_word, ignore_all_diacritics_but_not_shadda, ignore_shadda_diacritic)
140
+ # union_list.append(word2)
141
+
142
+ # i = 0
143
+ # while i < len(union_list):
144
+ # j = i + 1
145
+ # while j < len(union_list):
146
+ # non_preferred_word = get_non_preferred_word(union_list[i], union_list[j])
147
+ # if (non_preferred_word != "#"):
148
+ # union_list.remove(non_preferred_word)
149
+ # j = j + 1
150
+ # i = i + 1
151
+
152
+ # return union_list
153
+ def get_union(list1, list2, ignore_all_diacritics_but_not_shadda, ignore_shadda_diacritic):
154
+
155
+
156
+ list1 = [str(i) for i in list1 if i not in (None, ' ', '')]
130
157
  list2 = [str(i) for i in list2 if i not in (None, ' ', '')]
131
158
 
159
+
132
160
  union_list = []
133
161
 
162
+ # Normalize and add words from list1
134
163
  for list1_word in list1:
135
164
  word1 = normalize_word(list1_word, ignore_all_diacritics_but_not_shadda, ignore_shadda_diacritic)
136
- union_list.append(word1)
165
+ if word1 not in union_list:
166
+ union_list.append(word1)
137
167
 
168
+ # Normalize and add words from list2
138
169
  for list2_word in list2:
139
170
  word2 = normalize_word(list2_word, ignore_all_diacritics_but_not_shadda, ignore_shadda_diacritic)
140
- union_list.append(word2)
171
+ if word2 not in union_list:
172
+ union_list.append(word2)
141
173
 
174
+
142
175
  i = 0
143
176
  while i < len(union_list):
144
177
  j = i + 1
145
178
  while j < len(union_list):
146
179
  non_preferred_word = get_non_preferred_word(union_list[i], union_list[j])
147
- if (non_preferred_word != "#"):
180
+ if non_preferred_word != "#":
148
181
  union_list.remove(non_preferred_word)
149
- j = j + 1
150
- i = i + 1
182
+ j -= 1
183
+ j += 1
184
+ i += 1
151
185
 
152
186
  return union_list
153
-
187
+
188
+
154
189
 
155
190
 
156
191
  def get_jaccard_similarity(list1: list, list2: list, ignore_all_diacritics_but_not_shadda: bool, ignore_shadda_diacritic: bool) -> float:
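The reworked `get_union` above now skips exact duplicates while collecting normalized words and steps the inner index back after a removal so no candidate is skipped. A minimal, self-contained sketch of that prune-in-place pattern; `non_preferred` stands in for SinaTools' `get_non_preferred_word` (which, per the code above, returns the word to drop or `"#"` when both should stay), and the rule inside it is invented purely for illustration:

```python
def non_preferred(w1: str, w2: str) -> str:
    # Illustrative rule only: treat case-insensitive duplicates as one word
    # and drop the second spelling; "#" means keep both.
    return w2 if w1.lower() == w2.lower() else "#"

def prune_union(words: list) -> list:
    union = list(dict.fromkeys(words))   # keep order, drop exact duplicates up front
    i = 0
    while i < len(union):
        j = i + 1
        while j < len(union):
            drop = non_preferred(union[i], union[j])
            if drop != "#":
                union.remove(drop)
                j -= 1                   # re-check the element that shifted into slot j
            j += 1
        i += 1
    return union

print(prune_union(["Word", "word", "other", "other"]))   # ['Word', 'other']
```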
@@ -184,7 +219,7 @@ def get_jaccard_similarity(list1: list, list2: list, ignore_all_diacritics_but_n
184
219
 
185
220
  return float(len(intersection_list)) / float(len(union_list))
186
221
 
187
- def get_jaccard(delimiter, str1, str2, selection, ignoreAllDiacriticsButNotShadda=True, ignoreShaddaDiacritic=True):
222
+ def get_jaccard(delimiter, selection, str1, str2, ignoreAllDiacriticsButNotShadda=True, ignoreShaddaDiacritic=True):
188
223
  """
189
224
  Calculates and returns the Jaccard similarity values (union, intersection, or Jaccard similarity) between two lists of Arabic words, considering the differences in their diacritization. The method provides two options for handling diacritics: (i) ignore all diacritics except for shadda, and (ii) ignore the shadda diacritic as well. You can try the demo online.
190
225
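The final hunk moves `selection` ahead of `str1` and `str2` in `get_jaccard`'s positional parameters, so positional call sites written against 0.1.36 will now pass arguments into the wrong slots. A hedged migration sketch using keyword arguments; the `selection` value shown is an assumption, not confirmed by this diff:

```python
from sinatools.utils.similarity import get_jaccard

str1 = "فَعل فَعَلَ"
str2 = "فَعّل كتب"

# 0.1.36 positional order was (delimiter, str1, str2, selection, ...).
# 0.1.38 positional order is  (delimiter, selection, str1, str2, ...).
# Keyword arguments sidestep the reordering entirely:
result = get_jaccard(delimiter=" ", selection="jaccardSimilarity",  # value is illustrative
                     str1=str1, str2=str2)
print(result)
```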