torchtextclassifiers 0.1.0__tar.gz → 1.0.1__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {torchtextclassifiers-0.1.0 → torchtextclassifiers-1.0.1}/PKG-INFO +16 -2
- {torchtextclassifiers-0.1.0 → torchtextclassifiers-1.0.1}/README.md +15 -1
- {torchtextclassifiers-0.1.0 → torchtextclassifiers-1.0.1}/pyproject.toml +12 -8
- {torchtextclassifiers-0.1.0 → torchtextclassifiers-1.0.1}/torchTextClassifiers/dataset/dataset.py +41 -3
- torchtextclassifiers-1.0.1/torchTextClassifiers/model/components/classification_head.py +61 -0
- {torchtextclassifiers-0.1.0 → torchtextclassifiers-1.0.1}/torchTextClassifiers/model/lightning.py +4 -0
- {torchtextclassifiers-0.1.0 → torchtextclassifiers-1.0.1}/torchTextClassifiers/torchTextClassifiers.py +69 -33
- {torchtextclassifiers-0.1.0 → torchtextclassifiers-1.0.1}/torchTextClassifiers/utilities/plot_explainability.py +17 -7
- torchtextclassifiers-0.1.0/torchTextClassifiers/model/components/classification_head.py +0 -43
- {torchtextclassifiers-0.1.0 → torchtextclassifiers-1.0.1}/torchTextClassifiers/__init__.py +0 -0
- {torchtextclassifiers-0.1.0 → torchtextclassifiers-1.0.1}/torchTextClassifiers/dataset/__init__.py +0 -0
- {torchtextclassifiers-0.1.0 → torchtextclassifiers-1.0.1}/torchTextClassifiers/model/__init__.py +0 -0
- {torchtextclassifiers-0.1.0 → torchtextclassifiers-1.0.1}/torchTextClassifiers/model/components/__init__.py +0 -0
- {torchtextclassifiers-0.1.0 → torchtextclassifiers-1.0.1}/torchTextClassifiers/model/components/attention.py +0 -0
- {torchtextclassifiers-0.1.0 → torchtextclassifiers-1.0.1}/torchTextClassifiers/model/components/categorical_var_net.py +0 -0
- {torchtextclassifiers-0.1.0 → torchtextclassifiers-1.0.1}/torchTextClassifiers/model/components/text_embedder.py +0 -0
- {torchtextclassifiers-0.1.0 → torchtextclassifiers-1.0.1}/torchTextClassifiers/model/model.py +0 -0
- {torchtextclassifiers-0.1.0 → torchtextclassifiers-1.0.1}/torchTextClassifiers/tokenizers/WordPiece.py +0 -0
- {torchtextclassifiers-0.1.0 → torchtextclassifiers-1.0.1}/torchTextClassifiers/tokenizers/__init__.py +0 -0
- {torchtextclassifiers-0.1.0 → torchtextclassifiers-1.0.1}/torchTextClassifiers/tokenizers/base.py +0 -0
- {torchtextclassifiers-0.1.0 → torchtextclassifiers-1.0.1}/torchTextClassifiers/tokenizers/ngram.py +0 -0
- {torchtextclassifiers-0.1.0 → torchtextclassifiers-1.0.1}/torchTextClassifiers/utilities/__init__.py +0 -0
{torchtextclassifiers-0.1.0 → torchtextclassifiers-1.0.1}/PKG-INFO
RENAMED
@@ -1,6 +1,6 @@
 Metadata-Version: 2.3
 Name: torchtextclassifiers
-Version: 0.1.0
+Version: 1.0.1
 Summary: A text classification toolkit to easily build, train and evaluate deep learning text classifiers using PyTorch.
 Keywords: fastText,text classification,NLP,automatic coding,deep learning
 Author: Cédric Couralet, Meilame Tayebjee
@@ -26,15 +26,18 @@ Description-Content-Type: text/markdown
 
 # torchTextClassifiers
 
+[](https://inseefrlab.github.io/torchTextClassifiers/)
+
 A unified, extensible framework for text classification with categorical variables built on [PyTorch](https://pytorch.org/) and [PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/).
 
 ## 🚀 Features
 
-- **
+- **Complex input support**: Handle text data alongside categorical variables seamlessly.
 - **Unified yet highly customizable**:
   - Use any tokenizer from HuggingFace or the original fastText's ngram tokenizer.
   - Manipulate the components (`TextEmbedder`, `CategoricalVariableNet`, `ClassificationHead`) to easily create custom architectures - including **self-attention**. All of them are `torch.nn.Module` !
   - The `TextClassificationModel` class combines these components and can be extended for custom behavior.
+- **Multiclass / multilabel classification support**: Support for both multiclass (only one label is true) and multi-label (several labels can be true) classification tasks.
 - **PyTorch Lightning**: Automated training with callbacks, early stopping, and logging
 - **Easy experimentation**: Simple API for training, evaluating, and predicting with minimal code:
   - The `torchTextClassifiers` wrapper class orchestrates the tokenizer and the model for you
@@ -55,6 +58,15 @@ uv sync
 pip install -e .
 ```
 
+## 📖 Documentation
+
+Full documentation is available at: **https://inseefrlab.github.io/torchTextClassifiers/**
+The documentation includes:
+- **Getting Started**: Installation and quick start guide
+- **Architecture**: Understanding the 3-layer design
+- **Tutorials**: Step-by-step guides for different use cases
+- **API Reference**: Complete API documentation
+
 ## 📝 Usage
 
 Checkout the [notebook](notebooks/example.ipynb) for a quick start.
@@ -71,3 +83,5 @@ See the [examples/](examples/) directory for:
 ## 📄 License
 
 This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
+
+
{torchtextclassifiers-0.1.0 → torchtextclassifiers-1.0.1}/README.md
RENAMED
@@ -1,14 +1,17 @@
 # torchTextClassifiers
 
+[](https://inseefrlab.github.io/torchTextClassifiers/)
+
 A unified, extensible framework for text classification with categorical variables built on [PyTorch](https://pytorch.org/) and [PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/).
 
 ## 🚀 Features
 
-- **
+- **Complex input support**: Handle text data alongside categorical variables seamlessly.
 - **Unified yet highly customizable**:
   - Use any tokenizer from HuggingFace or the original fastText's ngram tokenizer.
   - Manipulate the components (`TextEmbedder`, `CategoricalVariableNet`, `ClassificationHead`) to easily create custom architectures - including **self-attention**. All of them are `torch.nn.Module` !
   - The `TextClassificationModel` class combines these components and can be extended for custom behavior.
+- **Multiclass / multilabel classification support**: Support for both multiclass (only one label is true) and multi-label (several labels can be true) classification tasks.
 - **PyTorch Lightning**: Automated training with callbacks, early stopping, and logging
 - **Easy experimentation**: Simple API for training, evaluating, and predicting with minimal code:
   - The `torchTextClassifiers` wrapper class orchestrates the tokenizer and the model for you
@@ -29,6 +32,15 @@ uv sync
 pip install -e .
 ```
 
+## 📖 Documentation
+
+Full documentation is available at: **https://inseefrlab.github.io/torchTextClassifiers/**
+The documentation includes:
+- **Getting Started**: Installation and quick start guide
+- **Architecture**: Understanding the 3-layer design
+- **Tutorials**: Step-by-step guides for different use cases
+- **API Reference**: Complete API documentation
+
 ## 📝 Usage
 
 Checkout the [notebook](notebooks/example.ipynb) for a quick start.
@@ -45,3 +57,5 @@ See the [examples/](examples/) directory for:
 ## 📄 License
 
 This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
+
+
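The two label formats behind the new multiclass / multilabel bullet can be read off the dataset and _check_Y changes further down in this diff. A minimal sketch of the expected shapes (the variable names are illustrative, not the package's):

import numpy as np

# Multiclass (default): one integer class id per sample, as a numpy array.
y_multiclass = np.array([0, 2, 1, 2])

# Ragged multilabel (ragged_multilabel=True): a list of lists, each inner
# list holding the class ids that are true for that sample.
y_multilabel = [[0, 2], [1], [0, 1, 2]]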
{torchtextclassifiers-0.1.0 → torchtextclassifiers-1.0.1}/pyproject.toml
RENAMED
@@ -18,12 +18,12 @@ dependencies = [
     "pytorch-lightning>=2.4.0",
 ]
 requires-python = ">=3.11"
-version="0.1.0"
+version="1.0.1"
 
 
 [dependency-groups]
 dev = [
-    "pytest >=
+    "pytest >=9.0.1,<10",
     "pandas",
     "scikit-learn",
     "nltk",
@@ -35,13 +35,17 @@ dev = [
     "ipywidgets>=8.1.8",
 ]
 docs = [
-    "sphinx>=
-    "sphinx-
-    "sphinx-autodoc-typehints>=
-    "sphinxcontrib-napoleon>=0.7",
+    "sphinx>=8.1.0",
+    "pydata-sphinx-theme>=0.16.0",
+    "sphinx-autodoc-typehints>=2.0.0",
     "sphinx-copybutton>=0.5.0",
-    "myst-parser>=0.
-    "sphinx-design>=0.
+    "myst-parser>=4.0.0",
+    "sphinx-design>=0.6.0",
+    "nbsphinx>=0.9.0",
+    "ipython>=8.0.0",
+    "pandoc>=2.0.0",
+    "linkify-it-py>=2.0.0",
+    "sphinxcontrib-images>=1.0.1"
 ]
 
 [project.optional-dependencies]
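Note that docs here is a PEP 735 dependency group, not a [project.optional-dependencies] extra, so it is not installable as a pip extra. With a recent uv (assumption: a version with dependency-group support, >= 0.4.27), it can be synced with:

uv sync --group docs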
{torchtextclassifiers-0.1.0 → torchtextclassifiers-1.0.1}/torchTextClassifiers/dataset/dataset.py
RENAMED
@@ -1,3 +1,4 @@
+import logging
 import os
 from typing import List, Union
 
@@ -8,6 +9,7 @@ from torch.utils.data import DataLoader, Dataset
 from torchTextClassifiers.tokenizers import BaseTokenizer
 
 os.environ["TOKENIZERS_PARALLELISM"] = "false"
+logger = logging.getLogger(__name__)
 
 
 class TextClassificationDataset(Dataset):
@@ -16,7 +18,8 @@ class TextClassificationDataset(Dataset):
         texts: List[str],
         categorical_variables: Union[List[List[int]], np.array, None],
         tokenizer: BaseTokenizer,
-        labels: Union[List[int], None] = None,
+        labels: Union[List[int], List[List[int]], np.array, None] = None,
+        ragged_multilabel: bool = False,
     ):
         self.categorical_variables = categorical_variables
 
@@ -32,6 +35,23 @@ class TextClassificationDataset(Dataset):
         self.texts = texts
         self.tokenizer = tokenizer
         self.labels = labels
+        self.ragged_multilabel = ragged_multilabel
+
+        if self.ragged_multilabel and self.labels is not None:
+            max_value = int(max(max(row) for row in labels if row))
+            self.num_classes = max_value + 1
+
+            if max_value == 1:
+                try:
+                    labels = np.array(labels)
+                    logger.critical(
+                        """ragged_multilabel set to True but max label value is 1 and all samples have the same number of labels.
+                        If your labels are already one-hot encoded, set ragged_multilabel to False. Otherwise computations are likely to be wrong."""
+                    )
+                except ValueError:
+                    logger.warning(
+                        "ragged_multilabel set to True but max label value is 1. If your labels are already one-hot encoded, set ragged_multilabel to False. Otherwise computations are likely to be wrong."
+                    )
 
     def __len__(self):
         return len(self.texts)
@@ -59,10 +79,28 @@ class TextClassificationDataset(Dataset):
         )
 
     def collate_fn(self, batch):
-        text, *categorical_vars,
+        text, *categorical_vars, labels = zip(*batch)
 
         if self.labels is not None:
-
+            if self.ragged_multilabel:
+                # Pad labels to the max length in the batch
+                labels_padded = torch.nn.utils.rnn.pad_sequence(
+                    [torch.tensor(label) for label in labels],
+                    batch_first=True,
+                    padding_value=-1,  # use impossible class
+                ).int()
+
+                labels_tensor = torch.zeros(labels_padded.size(0), 6).float()
+                mask = labels_padded != -1
+
+                batch_size = labels_padded.size(0)
+                rows = torch.arange(batch_size).unsqueeze(1).expand_as(labels_padded)[mask]
+                cols = labels_padded[mask]
+
+                labels_tensor[rows, cols] = 1
+
+            else:
+                labels_tensor = torch.tensor(labels)
         else:
             labels_tensor = None
 
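The ragged-multilabel branch of collate_fn above turns each sample's list of class ids into a fixed-width multi-hot row. A standalone sketch of the same pad-then-scatter logic follows, with one caveat flagged: the released code hard-codes 6 output columns in torch.zeros(labels_padded.size(0), 6), where self.num_classes (computed in __init__) was presumably intended; the sketch makes the class count explicit instead.

import torch

def multi_hot_collate(labels, num_classes):
    # Pad the ragged per-sample label lists with -1, an impossible class id
    padded = torch.nn.utils.rnn.pad_sequence(
        [torch.tensor(l) for l in labels], batch_first=True, padding_value=-1
    ).long()
    target = torch.zeros(padded.size(0), num_classes).float()
    mask = padded != -1
    # Row index for every real (non-padding) label entry
    rows = torch.arange(padded.size(0)).unsqueeze(1).expand_as(padded)[mask]
    target[rows, padded[mask]] = 1.0  # set ones at the true class ids
    return target

print(multi_hot_collate([[0, 2], [1], [0, 1, 2]], num_classes=3))
# tensor([[1., 0., 1.],
#         [0., 1., 0.],
#         [1., 1., 1.]])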
torchtextclassifiers-1.0.1/torchTextClassifiers/model/components/classification_head.py
ADDED
@@ -0,0 +1,61 @@
+from typing import Optional
+
+import torch
+from torch import nn
+
+
+class ClassificationHead(nn.Module):
+    def __init__(
+        self,
+        input_dim: Optional[int] = None,
+        num_classes: Optional[int] = None,
+        net: Optional[nn.Module] = None,
+    ):
+        """
+        Classification head for text classification tasks.
+        It is a nn.Module that can either be a simple Linear layer or a custom neural network module.
+
+        Args:
+            input_dim (int, optional): Dimension of the input features. Required if net is not provided.
+            num_classes (int, optional): Number of output classes. Required if net is not provided.
+            net (nn.Module, optional): Custom neural network module to be used as the classification head.
+                If provided, input_dim and num_classes are inferred from this module.
+                Should be either an nn.Sequential with first and last layers being Linears or nn.Linear.
+        """
+        super().__init__()
+        if net is not None:
+            self.net = net
+
+            # --- Custom net should either be a Sequential or a Linear ---
+            if not (isinstance(net, nn.Sequential) or isinstance(net, nn.Linear)):
+                raise ValueError("net must be an nn.Sequential when provided.")
+
+            # --- If Sequential, Check first and last layers are Linear ---
+
+            if isinstance(net, nn.Sequential):
+                first = net[0]
+                last = net[-1]
+
+                if not isinstance(first, nn.Linear):
+                    raise TypeError(f"First layer must be nn.Linear, got {type(first).__name__}.")
+
+                if not isinstance(last, nn.Linear):
+                    raise TypeError(f"Last layer must be nn.Linear, got {type(last).__name__}.")
+
+                # --- Extract features ---
+                self.input_dim = first.in_features
+                self.num_classes = last.out_features
+            else:  # if not Sequential, it is a Linear
+                self.input_dim = net.in_features
+                self.num_classes = net.out_features
+
+        else:
+            assert (
+                input_dim is not None and num_classes is not None
+            ), "Either net or both input_dim and num_classes must be provided."
+            self.net = nn.Linear(input_dim, num_classes)
+            self.input_dim = input_dim
+            self.num_classes = num_classes
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        return self.net(x)
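Both construction paths of the rewritten ClassificationHead, as a short usage sketch (the import path is assumed from the package layout shown in this diff). Unlike the 0.1.0 version deleted at the bottom of this diff, which inferred dimensions from any nested Linear layers, a custom net must now start and end with nn.Linear.

import torch
from torch import nn
from torchTextClassifiers.model.components.classification_head import ClassificationHead

# From explicit dimensions: a plain nn.Linear head.
head = ClassificationHead(input_dim=64, num_classes=5)

# From a custom net: dimensions are inferred from the first and last
# layers, which must both be nn.Linear.
custom = ClassificationHead(net=nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 5)))

print(custom.input_dim, custom.num_classes)  # 64 5
print(head(torch.randn(2, 64)).shape)        # torch.Size([2, 5])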
{torchtextclassifiers-0.1.0 → torchtextclassifiers-1.0.1}/torchTextClassifiers/model/lightning.py
RENAMED
@@ -76,6 +76,10 @@ class TextClassificationModule(pl.LightningModule):
         targets = batch["labels"]
 
         outputs = self.forward(batch)
+
+        if isinstance(self.loss, torch.nn.BCEWithLogitsLoss):
+            targets = targets.float()
+
         loss = self.loss(outputs, targets)
         self.log("train_loss", loss, on_epoch=True, on_step=True, prog_bar=True)
         accuracy = self.accuracy_fn(outputs, targets)
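The reason for the new cast: BCEWithLogitsLoss, the natural loss for multilabel targets, requires float targets, while an integer multi-hot matrix (as produced by the collate function above) would raise a dtype error. A minimal reproduction:

import torch

logits = torch.randn(2, 3)
targets = torch.tensor([[1, 0, 1], [0, 1, 0]])  # integer multi-hot targets
bce = torch.nn.BCEWithLogitsLoss()
# bce(logits, targets) fails with a Long-vs-Float dtype error;
# the cast added above avoids it:
print(bce(logits, targets.float()))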
{torchtextclassifiers-0.1.0 → torchtextclassifiers-1.0.1}/torchTextClassifiers/torchTextClassifiers.py
RENAMED
@@ -100,6 +100,7 @@ class torchTextClassifiers:
         self,
         tokenizer: BaseTokenizer,
         model_config: ModelConfig,
+        ragged_multilabel: bool = False,
     ):
         """Initialize the torchTextClassifiers instance.
 
@@ -124,6 +125,7 @@ class torchTextClassifiers:
 
         self.model_config = model_config
         self.tokenizer = tokenizer
+        self.ragged_multilabel = ragged_multilabel
 
         if hasattr(self.tokenizer, "trained"):
             if not self.tokenizer.trained:
@@ -182,9 +184,9 @@
         self,
         X_train: np.ndarray,
         y_train: np.ndarray,
-        X_val: np.ndarray,
-        y_val: np.ndarray,
         training_config: TrainingConfig,
+        X_val: Optional[np.ndarray] = None,
+        y_val: Optional[np.ndarray] = None,
         verbose: bool = False,
     ) -> None:
         """Train the classifier using PyTorch Lightning.
@@ -196,6 +198,12 @@
         - Model training with early stopping
         - Best model loading after training
 
+        Note on Checkpoints:
+            After training, the best model checkpoint is automatically loaded.
+            This checkpoint contains the full training state (model weights,
+            optimizer, and scheduler state). Loading uses weights_only=False
+            as the checkpoint is self-generated and trusted.
+
         Args:
             X_train: Training input data
             y_train: Training labels
@@ -222,7 +230,14 @@
         """
         # Input validation
         X_train, y_train = self._check_XY(X_train, y_train)
-
+
+        if X_val is not None:
+            assert y_val is not None, "y_val must be provided if X_val is provided."
+        if y_val is not None:
+            assert X_val is not None, "X_val must be provided if y_val is provided."
+
+        if X_val is not None and y_val is not None:
+            X_val, y_val = self._check_XY(X_val, y_val)
 
         if (
             X_train["categorical_variables"] is not None
@@ -249,6 +264,11 @@
         if training_config.optimizer_params is not None:
             optimizer_params.update(training_config.optimizer_params)
 
+        if training_config.loss is torch.nn.CrossEntropyLoss and self.ragged_multilabel:
+            logger.warning(
+                "⚠️ You have set ragged_multilabel to True but are using CrossEntropyLoss. We would recommend to use torch.nn.BCEWithLogitsLoss for multilabel classification tasks."
+            )
+
         self.lightning_module = TextClassificationModule(
             model=self.pytorch_model,
             loss=training_config.loss,
@@ -270,38 +290,43 @@
             texts=X_train["text"],
             categorical_variables=X_train["categorical_variables"],  # None if no cat vars
             tokenizer=self.tokenizer,
-            labels=y_train,
-
-        val_dataset = TextClassificationDataset(
-            texts=X_val["text"],
-            categorical_variables=X_val["categorical_variables"],  # None if no cat vars
-            tokenizer=self.tokenizer,
-            labels=y_val,
+            labels=y_train.tolist(),
+            ragged_multilabel=self.ragged_multilabel,
         )
-
         train_dataloader = train_dataset.create_dataloader(
             batch_size=training_config.batch_size,
             num_workers=training_config.num_workers,
             shuffle=True,
            **training_config.dataloader_params if training_config.dataloader_params else {},
         )
-
-
-
-
-
-
+
+        if X_val is not None and y_val is not None:
+            val_dataset = TextClassificationDataset(
+                texts=X_val["text"],
+                categorical_variables=X_val["categorical_variables"],  # None if no cat vars
+                tokenizer=self.tokenizer,
+                labels=y_val,
+                ragged_multilabel=self.ragged_multilabel,
+            )
+            val_dataloader = val_dataset.create_dataloader(
+                batch_size=training_config.batch_size,
+                num_workers=training_config.num_workers,
+                shuffle=False,
+                **training_config.dataloader_params if training_config.dataloader_params else {},
+            )
+        else:
+            val_dataloader = None
 
         # Setup trainer
         callbacks = [
             ModelCheckpoint(
-                monitor="val_loss",
+                monitor="val_loss" if val_dataloader is not None else "train_loss",
                 save_top_k=1,
                 save_last=False,
                 mode="min",
             ),
             EarlyStopping(
-                monitor="val_loss",
+                monitor="val_loss" if val_dataloader is not None else "train_loss",
                 patience=training_config.patience_early_stopping,
                 mode="min",
             ),
@@ -342,6 +367,7 @@
             best_model_path,
             model=self.pytorch_model,
             loss=training_config.loss,
+            weights_only=False,  # Required: checkpoint contains optimizer/scheduler state
         )
 
         self.pytorch_model = self.lightning_module.model.to(self.device)
@@ -352,7 +378,7 @@
         X = self._check_X(X)
         Y = self._check_Y(Y)
 
-        if X["text"].shape[0] != Y
+        if X["text"].shape[0] != len(Y):
             raise ValueError("X_train and y_train must have the same number of observations.")
 
         return X, Y
@@ -422,22 +448,32 @@
         return {"text": text, "categorical_variables": categorical_variables}
 
     def _check_Y(self, Y):
-
-
-
-
+        if self.ragged_multilabel:
+            assert isinstance(
+                Y, list
+            ), "Y must be a list of lists for ragged multilabel classification."
+            for row in Y:
+                assert isinstance(row, list), "Each element of Y must be a list of labels."
 
-
-            Y = Y.astype(int)
-        except ValueError:
-            logger.error("Y must be castable in integer format.")
+            return Y
 
-
-
-
+        else:
+            assert isinstance(Y, np.ndarray), "Y must be a numpy array of shape (N,) or (N,1)."
+            assert (
+                len(Y.shape) == 1 or len(Y.shape) == 2
+            ), "Y must be a numpy array of shape (N,) or (N, num_labels)."
+
+            try:
+                Y = Y.astype(int)
+            except ValueError:
+                logger.error("Y must be castable in integer format.")
+
+            if Y.max() >= self.num_classes or Y.min() < 0:
+                raise ValueError(
+                    f"Y contains class labels outside the range [0, {self.num_classes - 1}]."
+                )
 
-
+            return Y
 
     def predict(
         self,
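Two user-facing consequences of the changes above: train() now takes training_config third, with X_val/y_val keyword-optional (positional callers written against 0.1.0 need updating, and without a validation set the callbacks monitor train_loss), and _check_Y accepts two label formats. A standalone sketch of that validation (the helper name and the explicit num_classes argument are ours; the wrapper reads the class count from its model):

import numpy as np

def check_y(y, num_classes, ragged_multilabel=False):
    if ragged_multilabel:
        # Ragged multilabel: a list of lists of class ids
        assert isinstance(y, list) and all(isinstance(row, list) for row in y)
        return y
    # Multiclass: a numpy array of integer class ids in [0, num_classes - 1]
    y = np.asarray(y).astype(int)
    if y.max() >= num_classes or y.min() < 0:
        raise ValueError(f"labels outside [0, {num_classes - 1}]")
    return y

print(check_y(np.array([0, 2, 1]), num_classes=3))                    # [0 2 1]
print(check_y([[0, 2], [1]], num_classes=3, ragged_multilabel=True))  # [[0, 2], [1]]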
{torchtextclassifiers-0.1.0 → torchtextclassifiers-1.0.1}/torchTextClassifiers/utilities/plot_explainability.py
RENAMED
@@ -53,8 +53,18 @@ def map_attributions_to_char(attributions, offsets, text):
         np.exp(attributions_per_char), axis=1, keepdims=True
     )  # softmax normalization
 
+def get_id_to_word(text, word_ids, offsets):
+    words = {}
+    for idx, word_id in enumerate(word_ids):
+        if word_id is None:
+            continue
+        start, end = offsets[idx]
+        words[int(word_id)] = text[start:end]
+
+    return words
+
 
-def map_attributions_to_word(attributions, word_ids):
+def map_attributions_to_word(attributions, text, word_ids, offsets):
     """
     Maps token-level attributions to word-level attributions based on word IDs.
     Args:
@@ -69,8 +79,9 @@ def map_attributions_to_word(attributions, word_ids):
         np.ndarray: Array of shape (top_k, num_words) containing word-level attributions.
             num_words is the number of unique words in the original text.
     """
-
+
     word_ids = np.array(word_ids)
+    words = get_id_to_word(text, word_ids, offsets)
 
     # Convert None to -1 for easier processing (PAD tokens)
     word_ids_int = np.array([x if x is not None else -1 for x in word_ids], dtype=int)
@@ -99,7 +110,7 @@
     )  # zero-out non-matching tokens and sum attributions for all tokens belonging to the same word
 
     # assert word_attributions.sum(axis=1) == attributions.sum(axis=1), "Sum of word attributions per top_k must equal sum of token attributions per top_k."
-    return np.exp(word_attributions) / np.sum(
+    return words, np.exp(word_attributions) / np.sum(
         np.exp(word_attributions), axis=1, keepdims=True
     )  # softmax normalization
 
@@ -131,7 +142,7 @@ def plot_attributions_at_char(
     fig, ax = plt.subplots(figsize=figsize)
     ax.bar(range(len(text)), attributions_per_char[i])
     ax.set_xticks(np.arange(len(text)))
-    ax.set_xticklabels(list(text), rotation=
+    ax.set_xticklabels(list(text), rotation=45)
     title = titles[i] if titles is not None else f"Attributions for Top {i+1} Prediction"
     ax.set_title(title)
     ax.set_xlabel("Characters in Text")
@@ -142,7 +153,7 @@
 
 
 def plot_attributions_at_word(
-    text, attributions_per_word, figsize=(10, 2), titles: Optional[List[str]] = None
+    text, words, attributions_per_word, figsize=(10, 2), titles: Optional[List[str]] = None
 ):
     """
     Plots word-level attributions as a heatmap.
@@ -159,14 +170,13 @@
             "matplotlib is required for plotting. Please install it to use this function."
         )
 
-    words = text.split()
     top_k = attributions_per_word.shape[0]
     all_plots = []
    for i in range(top_k):
         fig, ax = plt.subplots(figsize=figsize)
         ax.bar(range(len(words)), attributions_per_word[i])
         ax.set_xticks(np.arange(len(words)))
-        ax.set_xticklabels(words, rotation=
+        ax.set_xticklabels(words, rotation=45)
         title = titles[i] if titles is not None else f"Attributions for Top {i+1} Prediction"
         ax.set_title(title)
         ax.set_xlabel("Words in Text")
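What the new get_id_to_word contributes: a fast tokenizer returns one word id and one character-offset pair per token, and the helper maps each word id back to a span of the original text, so plot_attributions_at_word can label its ticks with real words instead of text.split(). A toy run of the same logic (the token values are illustrative):

text = "hello world"
word_ids = [0, 1, None]             # None marks special/padding tokens
offsets = [(0, 5), (6, 11), (0, 0)]

words = {}
for idx, word_id in enumerate(word_ids):
    if word_id is None:
        continue
    start, end = offsets[idx]
    words[int(word_id)] = text[start:end]

print(words)  # {0: 'hello', 1: 'world'}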
torchtextclassifiers-0.1.0/torchTextClassifiers/model/components/classification_head.py
DELETED
@@ -1,43 +0,0 @@
-from typing import Optional
-
-import torch
-from torch import nn
-
-
-class ClassificationHead(nn.Module):
-    def __init__(
-        self,
-        input_dim: Optional[int] = None,
-        num_classes: Optional[int] = None,
-        net: Optional[nn.Module] = None,
-    ):
-        super().__init__()
-        if net is not None:
-            self.net = net
-            self.input_dim = net.in_features
-            self.num_classes = net.out_features
-        else:
-            assert (
-                input_dim is not None and num_classes is not None
-            ), "Either net or both input_dim and num_classes must be provided."
-            self.net = nn.Linear(input_dim, num_classes)
-            self.input_dim, self.num_classes = self._get_linear_input_output_dims(self.net)
-
-    def forward(self, x: torch.Tensor) -> torch.Tensor:
-        return self.net(x)
-
-    @staticmethod
-    def _get_linear_input_output_dims(module: nn.Module):
-        """
-        Returns (input_dim, output_dim) for any module containing Linear layers.
-        Works for Linear, Sequential, or nested models.
-        """
-        # Collect all Linear layers recursively
-        linears = [m for m in module.modules() if isinstance(m, nn.Linear)]
-
-        if not linears:
-            raise ValueError("No Linear layers found in the given module.")
-
-        input_dim = linears[0].in_features
-        output_dim = linears[-1].out_features
-        return input_dim, output_dim
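For contrast with the 1.0.1 version earlier in this diff: this deleted helper collected every nested nn.Linear, so a head whose outer layers were not Linear was still accepted with silently inferred dimensions. An illustration of the old behavior (a standalone reimplementation of the deleted method):

from torch import nn

def old_get_dims(module):
    # Deleted 0.1.0 logic: collect all Linear layers recursively
    linears = [m for m in module.modules() if isinstance(m, nn.Linear)]
    if not linears:
        raise ValueError("No Linear layers found in the given module.")
    return linears[0].in_features, linears[-1].out_features

net = nn.Sequential(nn.Flatten(), nn.Linear(64, 5), nn.Softmax(dim=-1))
print(old_get_dims(net))  # (64, 5): accepted in 0.1.0, rejected with TypeError in 1.0.1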
{torchtextclassifiers-0.1.0 → torchtextclassifiers-1.0.1}/torchTextClassifiers/__init__.py
RENAMED
File without changes
{torchtextclassifiers-0.1.0 → torchtextclassifiers-1.0.1}/torchTextClassifiers/dataset/__init__.py
RENAMED
File without changes
{torchtextclassifiers-0.1.0 → torchtextclassifiers-1.0.1}/torchTextClassifiers/model/__init__.py
RENAMED
File without changes
{torchtextclassifiers-0.1.0 → torchtextclassifiers-1.0.1}/torchTextClassifiers/model/components/__init__.py
RENAMED
File without changes
{torchtextclassifiers-0.1.0 → torchtextclassifiers-1.0.1}/torchTextClassifiers/model/components/attention.py
RENAMED
File without changes
{torchtextclassifiers-0.1.0 → torchtextclassifiers-1.0.1}/torchTextClassifiers/model/components/categorical_var_net.py
RENAMED
File without changes
{torchtextclassifiers-0.1.0 → torchtextclassifiers-1.0.1}/torchTextClassifiers/model/components/text_embedder.py
RENAMED
File without changes
{torchtextclassifiers-0.1.0 → torchtextclassifiers-1.0.1}/torchTextClassifiers/model/model.py
RENAMED
File without changes
{torchtextclassifiers-0.1.0 → torchtextclassifiers-1.0.1}/torchTextClassifiers/tokenizers/WordPiece.py
RENAMED
File without changes
{torchtextclassifiers-0.1.0 → torchtextclassifiers-1.0.1}/torchTextClassifiers/tokenizers/__init__.py
RENAMED
File without changes
{torchtextclassifiers-0.1.0 → torchtextclassifiers-1.0.1}/torchTextClassifiers/tokenizers/base.py
RENAMED
File without changes
{torchtextclassifiers-0.1.0 → torchtextclassifiers-1.0.1}/torchTextClassifiers/tokenizers/ngram.py
RENAMED
File without changes
{torchtextclassifiers-0.1.0 → torchtextclassifiers-1.0.1}/torchTextClassifiers/utilities/__init__.py
RENAMED
File without changes