torchtextclassifiers 0.0.1__tar.gz → 0.1.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- torchtextclassifiers-0.1.0/PKG-INFO +73 -0
- torchtextclassifiers-0.1.0/README.md +47 -0
- {torchtextclassifiers-0.0.1 → torchtextclassifiers-0.1.0}/pyproject.toml +14 -10
- torchtextclassifiers-0.1.0/torchTextClassifiers/__init__.py +32 -0
- torchtextclassifiers-0.1.0/torchTextClassifiers/dataset/__init__.py +1 -0
- torchtextclassifiers-0.1.0/torchTextClassifiers/dataset/dataset.py +114 -0
- torchtextclassifiers-0.1.0/torchTextClassifiers/model/__init__.py +2 -0
- torchtextclassifiers-0.1.0/torchTextClassifiers/model/components/__init__.py +12 -0
- torchtextclassifiers-0.1.0/torchTextClassifiers/model/components/attention.py +126 -0
- torchtextclassifiers-0.1.0/torchTextClassifiers/model/components/categorical_var_net.py +128 -0
- torchtextclassifiers-0.1.0/torchTextClassifiers/model/components/classification_head.py +43 -0
- torchtextclassifiers-0.1.0/torchTextClassifiers/model/components/text_embedder.py +220 -0
- torchtextclassifiers-0.1.0/torchTextClassifiers/model/lightning.py +166 -0
- torchtextclassifiers-0.1.0/torchTextClassifiers/model/model.py +151 -0
- torchtextclassifiers-0.1.0/torchTextClassifiers/tokenizers/WordPiece.py +92 -0
- torchtextclassifiers-0.1.0/torchTextClassifiers/tokenizers/__init__.py +10 -0
- torchtextclassifiers-0.1.0/torchTextClassifiers/tokenizers/base.py +205 -0
- torchtextclassifiers-0.1.0/torchTextClassifiers/tokenizers/ngram.py +472 -0
- torchtextclassifiers-0.1.0/torchTextClassifiers/torchTextClassifiers.py +567 -0
- torchtextclassifiers-0.1.0/torchTextClassifiers/utilities/__init__.py +0 -0
- torchtextclassifiers-0.1.0/torchTextClassifiers/utilities/plot_explainability.py +184 -0
- torchtextclassifiers-0.0.1/PKG-INFO +0 -187
- torchtextclassifiers-0.0.1/README.md +0 -165
- torchtextclassifiers-0.0.1/torchTextClassifiers/__init__.py +0 -68
- torchtextclassifiers-0.0.1/torchTextClassifiers/classifiers/base.py +0 -83
- torchtextclassifiers-0.0.1/torchTextClassifiers/classifiers/fasttext/__init__.py +0 -25
- torchtextclassifiers-0.0.1/torchTextClassifiers/classifiers/fasttext/core.py +0 -269
- torchtextclassifiers-0.0.1/torchTextClassifiers/classifiers/fasttext/model.py +0 -752
- torchtextclassifiers-0.0.1/torchTextClassifiers/classifiers/fasttext/tokenizer.py +0 -346
- torchtextclassifiers-0.0.1/torchTextClassifiers/classifiers/fasttext/wrapper.py +0 -216
- torchtextclassifiers-0.0.1/torchTextClassifiers/classifiers/simple_text_classifier.py +0 -191
- torchtextclassifiers-0.0.1/torchTextClassifiers/factories.py +0 -34
- torchtextclassifiers-0.0.1/torchTextClassifiers/torchTextClassifiers.py +0 -509
- torchtextclassifiers-0.0.1/torchTextClassifiers/utilities/__init__.py +0 -3
- torchtextclassifiers-0.0.1/torchTextClassifiers/utilities/checkers.py +0 -108
- torchtextclassifiers-0.0.1/torchTextClassifiers/utilities/preprocess.py +0 -82
- torchtextclassifiers-0.0.1/torchTextClassifiers/utilities/utils.py +0 -346
torchtextclassifiers-0.1.0/PKG-INFO

@@ -0,0 +1,73 @@
+ Metadata-Version: 2.3
+ Name: torchtextclassifiers
+ Version: 0.1.0
+ Summary: A text classification toolkit to easily build, train and evaluate deep learning text classifiers using PyTorch.
+ Keywords: fastText,text classification,NLP,automatic coding,deep learning
+ Author: Cédric Couralet, Meilame Tayebjee
+ Author-email: Cédric Couralet <cedric.couralet@insee.fr>, Meilame Tayebjee <meilame.tayebjee@insee.fr>
+ Classifier: Programming Language :: Python :: 3
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Operating System :: OS Independent
+ Requires-Dist: numpy>=1.26.4
+ Requires-Dist: pytorch-lightning>=2.4.0
+ Requires-Dist: unidecode ; extra == 'explainability'
+ Requires-Dist: nltk ; extra == 'explainability'
+ Requires-Dist: captum ; extra == 'explainability'
+ Requires-Dist: tokenizers>=0.22.1 ; extra == 'huggingface'
+ Requires-Dist: transformers>=4.57.1 ; extra == 'huggingface'
+ Requires-Dist: datasets>=4.3.0 ; extra == 'huggingface'
+ Requires-Dist: unidecode ; extra == 'preprocess'
+ Requires-Dist: nltk ; extra == 'preprocess'
+ Requires-Python: >=3.11
+ Provides-Extra: explainability
+ Provides-Extra: huggingface
+ Provides-Extra: preprocess
+ Description-Content-Type: text/markdown
+
+ # torchTextClassifiers
+
+ A unified, extensible framework for text classification with categorical variables, built on [PyTorch](https://pytorch.org/) and [PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/).
+
+ ## 🚀 Features
+
+ - **Mixed input support**: Handle text data alongside categorical variables seamlessly.
+ - **Unified yet highly customizable**:
+   - Use any tokenizer from HuggingFace or the original fastText n-gram tokenizer.
+   - Combine the components (`TextEmbedder`, `CategoricalVariableNet`, `ClassificationHead`) to easily create custom architectures, including **self-attention**. All of them are `torch.nn.Module`s!
+   - The `TextClassificationModel` class combines these components and can be extended for custom behavior.
+ - **PyTorch Lightning**: Automated training with callbacks, early stopping, and logging.
+ - **Easy experimentation**: Simple API for training, evaluating, and predicting with minimal code:
+   - The `torchTextClassifiers` wrapper class orchestrates the tokenizer and the model for you.
+ - **Additional features**: explainability using Captum.
+
+
+ ## 📦 Installation
+
+ ```bash
+ # Clone the repository
+ git clone https://github.com/InseeFrLab/torchTextClassifiers.git
+ cd torchTextClassifiers
+
+ # Install with uv (recommended)
+ uv sync
+
+ # Or install with pip
+ pip install -e .
+ ```
+
+ ## 📝 Usage
+
+ Check out the [notebook](notebooks/example.ipynb) for a quick start.
+
+ ## 📚 Examples
+
+ See the [examples/](examples/) directory for:
+ - Basic text classification
+ - Multi-class classification
+ - Mixed features (text + categorical)
+ - Advanced training configurations
+ - Prediction and explainability
+
+ ## 📄 License
+
+ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
torchtextclassifiers-0.1.0/README.md

@@ -0,0 +1,47 @@
+ # torchTextClassifiers
+
+ A unified, extensible framework for text classification with categorical variables, built on [PyTorch](https://pytorch.org/) and [PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/).
+
+ ## 🚀 Features
+
+ - **Mixed input support**: Handle text data alongside categorical variables seamlessly.
+ - **Unified yet highly customizable**:
+   - Use any tokenizer from HuggingFace or the original fastText n-gram tokenizer.
+   - Combine the components (`TextEmbedder`, `CategoricalVariableNet`, `ClassificationHead`) to easily create custom architectures, including **self-attention**. All of them are `torch.nn.Module`s!
+   - The `TextClassificationModel` class combines these components and can be extended for custom behavior.
+ - **PyTorch Lightning**: Automated training with callbacks, early stopping, and logging.
+ - **Easy experimentation**: Simple API for training, evaluating, and predicting with minimal code:
+   - The `torchTextClassifiers` wrapper class orchestrates the tokenizer and the model for you.
+ - **Additional features**: explainability using Captum.
+
+
+ ## 📦 Installation
+
+ ```bash
+ # Clone the repository
+ git clone https://github.com/InseeFrLab/torchTextClassifiers.git
+ cd torchTextClassifiers
+
+ # Install with uv (recommended)
+ uv sync
+
+ # Or install with pip
+ pip install -e .
+ ```
+
+ ## 📝 Usage
+
+ Check out the [notebook](notebooks/example.ipynb) for a quick start.
+
+ ## 📚 Examples
+
+ See the [examples/](examples/) directory for:
+ - Basic text classification
+ - Multi-class classification
+ - Mixed features (text + categorical)
+ - Advanced training configurations
+ - Prediction and explainability
+
+ ## 📄 License
+
+ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
{torchtextclassifiers-0.0.1 → torchtextclassifiers-0.1.0}/pyproject.toml

@@ -1,11 +1,9 @@
  [project]
  name = "torchtextclassifiers"
- description = "
+ description = "A text classification toolkit to easily build, train and evaluate deep learning text classifiers using PyTorch."
  authors = [
-     { name = "Tom Seimandi", email = "tom.seimandi@gmail.com" },
-     { name = "Julien Pramil", email = "julien.pramil@insee.fr" },
-     { name = "Meilame Tayebjee", email = "meilame.tayebjee@insee.fr" },
      { name = "Cédric Couralet", email = "cedric.couralet@insee.fr" },
+     { name = "Meilame Tayebjee", email = "meilame.tayebjee@insee.fr" },
  ]
  readme = "README.md"
  repository = "https://github.com/InseeFrLab/torchTextClassifiers"

@@ -20,7 +18,7 @@ dependencies = [
      "pytorch-lightning>=2.4.0",
  ]
  requires-python = ">=3.11"
- version="0.0
+ version="0.1.0"


  [dependency-groups]

@@ -31,7 +29,10 @@ dev = [
      "nltk",
      "unidecode",
      "captum",
-     "pyarrow"
+     "pyarrow",
+     "pre-commit>=4.3.0",
+     "ruff>=0.14.3",
+     "ipywidgets>=8.1.8",
  ]
  docs = [
      "sphinx>=5.0.0",

@@ -46,9 +47,15 @@ docs = [
  [project.optional-dependencies]
  explainability = ["unidecode", "nltk", "captum"]
  preprocess = ["unidecode", "nltk"]
+ huggingface = [
+     "tokenizers>=0.22.1",
+     "transformers>=4.57.1",
+     "datasets>=4.3.0",
+ ]
+

  [build-system]
- requires = ["uv_build>=0.
+ requires = ["uv_build>=0.9.3,<0.10.0"]
  build-backend = "uv_build"

  [tool.ruff]

@@ -58,6 +65,3 @@ line-length = 100
  [tool.uv.build-backend]
  module-name="torchTextClassifiers"
  module-root = ""
-
-
-
torchtextclassifiers-0.1.0/torchTextClassifiers/__init__.py

@@ -0,0 +1,32 @@
+ """torchTextClassifiers: A unified framework for text classification.
+
+ This package provides a generic, extensible framework for building and training
+ different types of text classifiers. It currently supports FastText classifiers
+ with a clean API for building, training, and inference.
+
+ Key Features:
+ - Unified API for different classifier types
+ - Built-in support for FastText classifiers
+ - PyTorch Lightning integration for training
+ - Extensible architecture for adding new classifier types
+ - Support for both text-only and mixed text/categorical features
+
+ """
+
+ from .torchTextClassifiers import (
+     ModelConfig as ModelConfig,
+ )
+ from .torchTextClassifiers import (
+     TrainingConfig as TrainingConfig,
+ )
+ from .torchTextClassifiers import (
+     torchTextClassifiers as torchTextClassifiers,
+ )
+
+ __all__ = [
+     "torchTextClassifiers",
+     "ModelConfig",
+     "TrainingConfig",
+ ]
+
+ __version__ = "1.0.0"
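The `import X as X` form above is the explicit re-export idiom recognized by type checkers, so the three names in `__all__` are the intended public entry points. A minimal import check, assuming only that the package is installed:

```python
# The three names below are exactly the public exports declared in __all__ above.
from torchTextClassifiers import ModelConfig, TrainingConfig, torchTextClassifiers

print(torchTextClassifiers, ModelConfig, TrainingConfig)
```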
torchtextclassifiers-0.1.0/torchTextClassifiers/dataset/__init__.py

@@ -0,0 +1 @@
+ from .dataset import TextClassificationDataset as TextClassificationDataset
torchtextclassifiers-0.1.0/torchTextClassifiers/dataset/dataset.py

@@ -0,0 +1,114 @@
+ import os
+ from typing import List, Union
+
+ import numpy as np
+ import torch
+ from torch.utils.data import DataLoader, Dataset
+
+ from torchTextClassifiers.tokenizers import BaseTokenizer
+
+ os.environ["TOKENIZERS_PARALLELISM"] = "false"
+
+
+ class TextClassificationDataset(Dataset):
+     def __init__(
+         self,
+         texts: List[str],
+         categorical_variables: Union[List[List[int]], np.array, None],
+         tokenizer: BaseTokenizer,
+         labels: Union[List[int], None] = None,
+     ):
+         self.categorical_variables = categorical_variables
+
+         self.texts = texts
+
+         if hasattr(tokenizer, "trained") and not tokenizer.trained:
+             raise RuntimeError(
+                 f"Tokenizer {type(tokenizer)} must be trained before creating dataset."
+             )
+
+         self.tokenizer = tokenizer
+
+         self.texts = texts
+         self.tokenizer = tokenizer
+         self.labels = labels
+
+     def __len__(self):
+         return len(self.texts)
+
+     def __getitem__(self, idx):
+         if self.labels is not None:
+             return (
+                 str(self.texts[idx]),
+                 (
+                     self.categorical_variables[idx]
+                     if self.categorical_variables is not None
+                     else None
+                 ),
+                 self.labels[idx],
+             )
+         else:
+             return (
+                 str(self.texts[idx]),
+                 (
+                     self.categorical_variables[idx]
+                     if self.categorical_variables is not None
+                     else None
+                 ),
+                 None,
+             )
+
+     def collate_fn(self, batch):
+         text, *categorical_vars, y = zip(*batch)
+
+         if self.labels is not None:
+             labels_tensor = torch.tensor(y, dtype=torch.long)
+         else:
+             labels_tensor = None
+
+         tokenize_output = self.tokenizer.tokenize(list(text))
+
+         if self.categorical_variables is not None:
+             categorical_tensors = torch.stack(
+                 [
+                     torch.tensor(cat_var, dtype=torch.float32)
+                     for cat_var in categorical_vars[
+                         0
+                     ]  # Access first element since zip returns tuple
+                 ]
+             )
+         else:
+             categorical_tensors = None
+
+         return {
+             "input_ids": tokenize_output.input_ids,
+             "attention_mask": tokenize_output.attention_mask,
+             "categorical_vars": categorical_tensors,
+             "labels": labels_tensor,
+         }
+
+     def create_dataloader(
+         self,
+         batch_size: int,
+         shuffle: bool = False,
+         drop_last: bool = False,
+         num_workers: int = os.cpu_count() - 1,
+         pin_memory: bool = False,
+         persistent_workers: bool = True,
+         **kwargs,
+     ):
+         # persistent_workers requires num_workers > 0
+         if num_workers == 0:
+             persistent_workers = False
+
+         return DataLoader(
+             dataset=self,
+             batch_size=batch_size,
+             collate_fn=self.collate_fn,
+             shuffle=shuffle,
+             drop_last=drop_last,
+             pin_memory=pin_memory,
+             num_workers=num_workers,
+             persistent_workers=persistent_workers,
+             **kwargs,
+         )
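The dataset only touches a small slice of the tokenizer interface: a `trained` flag and a `tokenize()` call returning an object with `input_ids` and `attention_mask`. The sketch below wires a `TextClassificationDataset` into a `DataLoader` using a hypothetical `ToyTokenizer` stub that mimics just that slice; it is not the package's `BaseTokenizer` (defined in `tokenizers/base.py`, not shown in this section).

```python
# Minimal sketch: TextClassificationDataset -> DataLoader with a hypothetical tokenizer stub.
from types import SimpleNamespace

import torch

from torchTextClassifiers.dataset import TextClassificationDataset


class ToyTokenizer:
    trained = True  # the dataset refuses untrained tokenizers

    def tokenize(self, texts):
        # Fake fixed-length encoding: one "token id" per character, padded/truncated to 8.
        ids = torch.zeros(len(texts), 8, dtype=torch.long)
        mask = torch.zeros(len(texts), 8, dtype=torch.long)
        for row, text in enumerate(texts):
            chars = [ord(c) % 100 for c in text[:8]]
            ids[row, : len(chars)] = torch.tensor(chars)
            mask[row, : len(chars)] = 1
        return SimpleNamespace(input_ids=ids, attention_mask=mask)


dataset = TextClassificationDataset(
    texts=["first sample", "second sample"],
    categorical_variables=[[0, 1], [1, 0]],  # two categorical features per sample
    tokenizer=ToyTokenizer(),
    labels=[0, 1],
)
loader = dataset.create_dataloader(batch_size=2, num_workers=0)
batch = next(iter(loader))
# batch is the dict built by collate_fn:
# input_ids (2, 8), attention_mask (2, 8), categorical_vars (2, 2), labels (2,)
```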
torchtextclassifiers-0.1.0/torchTextClassifiers/model/components/__init__.py

@@ -0,0 +1,12 @@
+ from .attention import (
+     AttentionConfig as AttentionConfig,
+ )
+ from .categorical_var_net import (
+     CategoricalForwardType as CategoricalForwardType,
+ )
+ from .categorical_var_net import (
+     CategoricalVariableNet as CategoricalVariableNet,
+ )
+ from .classification_head import ClassificationHead as ClassificationHead
+ from .text_embedder import TextEmbedder as TextEmbedder
+ from .text_embedder import TextEmbedderConfig as TextEmbedderConfig
torchtextclassifiers-0.1.0/torchTextClassifiers/model/components/attention.py

@@ -0,0 +1,126 @@
+ """Largely inspired from Andrej Karpathy's nanochat, see here https://github.com/karpathy/nanochat/blob/master/nanochat/gpt.py"""
+
+ from dataclasses import dataclass
+ from typing import Optional
+
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+
+ ### Some utils used in text_embedder.py for the attention blocks ###
+
+
+ def apply_rotary_emb(x, cos, sin):
+     assert x.ndim == 4  # multihead attention
+
+     d = x.shape[3] // 2
+     x1, x2 = x[..., :d], x[..., d:]  # split up last dim into two halves
+     y1 = x1 * cos + x2 * sin  # rotate pairs of dims
+     y2 = x1 * (-sin) + x2 * cos
+     out = torch.cat([y1, y2], 3)  # re-assemble
+     out = out.to(x.dtype)  # ensure input/output dtypes match
+     return out
+
+
+ def norm(x):
+     # Purely functional rmsnorm with no learnable params
+     return F.rms_norm(x, (x.size(-1),))
+
+
+ #### Config #####
+ @dataclass
+ class AttentionConfig:
+     n_layers: int
+     n_head: int
+     n_kv_head: int
+     sequence_len: Optional[int] = None
+     positional_encoding: bool = True
+     aggregation_method: str = "mean"  # or 'last', or 'first'
+
+
+ #### Attention Block #####
+
+ # Composed of SelfAttentionLayer and MLP with residual connections
+
+
+ class Block(nn.Module):
+     def __init__(self, config: AttentionConfig, layer_idx: int):
+         super().__init__()
+
+         self.layer_idx = layer_idx
+         self.attn = SelfAttentionLayer(config, layer_idx)
+         self.mlp = MLP(config)
+
+     def forward(self, x, cos_sin):
+         x = x + self.attn(norm(x), cos_sin)
+         x = x + self.mlp(norm(x))
+         return x
+
+
+ ##### Components of the Block #####
+
+
+ class SelfAttentionLayer(nn.Module):
+     def __init__(self, config: AttentionConfig, layer_idx):
+         super().__init__()
+         self.layer_idx = layer_idx
+         self.n_head = config.n_head
+         self.n_kv_head = config.n_kv_head
+         self.enable_gqa = (
+             self.n_head != self.n_kv_head
+         )  # Group Query Attention (GQA): duplicate key/value heads to match query heads if desired
+         self.n_embd = config.n_embd
+         self.head_dim = self.n_embd // self.n_head
+         assert self.n_embd % self.n_head == 0
+         assert self.n_kv_head <= self.n_head and self.n_head % self.n_kv_head == 0
+         self.c_q = nn.Linear(self.n_embd, self.n_head * self.head_dim, bias=False)
+         self.c_k = nn.Linear(self.n_embd, self.n_kv_head * self.head_dim, bias=False)
+         self.c_v = nn.Linear(self.n_embd, self.n_kv_head * self.head_dim, bias=False)
+         self.c_proj = nn.Linear(self.n_embd, self.n_embd, bias=False)
+
+         self.apply_positional_encoding = config.positional_encoding
+
+     def forward(self, x, cos_sin=None):
+         B, T, C = x.size()
+
+         # Project the input to get queries, keys, and values
+         q = self.c_q(x).view(B, T, self.n_head, self.head_dim)
+         k = self.c_k(x).view(B, T, self.n_kv_head, self.head_dim)
+         v = self.c_v(x).view(B, T, self.n_kv_head, self.head_dim)
+
+         if self.apply_positional_encoding:
+             assert cos_sin is not None, "Rotary embeddings require precomputed cos/sin tensors"
+             cos, sin = cos_sin
+             q, k = (
+                 apply_rotary_emb(q, cos, sin),
+                 apply_rotary_emb(k, cos, sin),
+             )  # QK rotary embedding
+
+         q, k = norm(q), norm(k)  # QK norm
+         q, k, v = (
+             q.transpose(1, 2),
+             k.transpose(1, 2),
+             v.transpose(1, 2),
+         )  # make head be batch dim, i.e. (B, T, H, D) -> (B, H, T, D)
+
+         # is_causal=False for non-autoregressive models (BERT-like)
+         y = F.scaled_dot_product_attention(q, k, v, is_causal=False, enable_gqa=self.enable_gqa)
+
+         # Re-assemble the heads side by side and project back to residual stream
+         y = y.transpose(1, 2).contiguous().view(B, T, -1)
+         y = self.c_proj(y)
+
+         return y
+
+
+ class MLP(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd, bias=False)
+         self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd, bias=False)
+
+     def forward(self, x):
+         x = self.c_fc(x)
+         x = F.relu(x).square()
+         x = self.c_proj(x)
+         return x
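`apply_rotary_emb` expects inputs shaped `(batch, seq_len, n_heads, head_dim)` and `cos`/`sin` tensors that broadcast over the batch and head dimensions. The sketch below precomputes them with the standard RoPE inverse-frequency schedule; that schedule and the broadcast shape are assumptions, since the place where the package actually builds `cos_sin` (presumably `text_embedder.py`) is not shown in this section.

```python
# Sketch: building rotary cos/sin tensors and applying them with apply_rotary_emb.
import torch

from torchTextClassifiers.model.components.attention import apply_rotary_emb, norm

B, T, n_head, head_dim = 2, 16, 4, 32

# Standard RoPE frequencies over half the head dimension (assumed recipe).
inv_freq = 1.0 / (10000 ** (torch.arange(0, head_dim, 2).float() / head_dim))  # (head_dim/2,)
positions = torch.arange(T).float()
freqs = torch.outer(positions, inv_freq)  # (T, head_dim/2)

# Shape (1, T, 1, head_dim/2) so it broadcasts over batch and head dimensions.
cos = freqs.cos()[None, :, None, :]
sin = freqs.sin()[None, :, None, :]

q = torch.randn(B, T, n_head, head_dim)
q_rot = apply_rotary_emb(q, cos, sin)  # same shape, pairs of dims rotated
q_rot = norm(q_rot)                    # the functional RMSNorm used for QK norm
print(q_rot.shape)  # torch.Size([2, 16, 4, 32])
```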
torchtextclassifiers-0.1.0/torchTextClassifiers/model/components/categorical_var_net.py

@@ -0,0 +1,128 @@
+ from enum import Enum
+ from typing import List, Optional, Union
+
+ import torch
+ from torch import nn
+
+
+ class CategoricalForwardType(Enum):
+     SUM_TO_TEXT = "EMBEDDING_SUM_TO_TEXT"
+     AVERAGE_AND_CONCAT = "EMBEDDING_AVERAGE_AND_CONCAT"
+     CONCATENATE_ALL = "EMBEDDING_CONCATENATE_ALL"
+
+
+ class CategoricalVariableNet(nn.Module):
+     def __init__(
+         self,
+         categorical_vocabulary_sizes: List[int],
+         categorical_embedding_dims: Optional[Union[List[int], int]] = None,
+         text_embedding_dim: Optional[int] = None,
+     ):
+         super().__init__()
+
+         self.categorical_vocabulary_sizes = categorical_vocabulary_sizes
+         self.categorical_embedding_dims = categorical_embedding_dims
+         self.text_embedding_dim = text_embedding_dim
+
+         self._validate_categorical_inputs()
+         assert isinstance(
+             self.forward_type, CategoricalForwardType
+         ), "forward_type must be set after validation"
+         assert isinstance(self.output_dim, int), "output_dim must be set as int after validation"
+
+         self.categorical_embedding_layers = {}
+
+         for var_idx, num_rows in enumerate(self.categorical_vocabulary_sizes):
+             emb_layer = nn.Embedding(
+                 num_embeddings=num_rows,
+                 embedding_dim=self.categorical_embedding_dims[var_idx],
+             )
+             self.categorical_embedding_layers[var_idx] = emb_layer
+             setattr(self, f"categorical_embedding_{var_idx}", emb_layer)
+
+     def forward(self, categorical_vars_tensor: torch.Tensor) -> torch.Tensor:
+         cat_embeds = self._get_cat_embeds(categorical_vars_tensor)
+         if self.forward_type == CategoricalForwardType.SUM_TO_TEXT:
+             x_combined = torch.stack(cat_embeds, dim=0).sum(dim=0)  # (bs, text_embed_dim)
+         elif self.forward_type == CategoricalForwardType.AVERAGE_AND_CONCAT:
+             x_combined = torch.stack(cat_embeds, dim=0).mean(dim=0)  # (bs, embed_dim)
+         elif self.forward_type == CategoricalForwardType.CONCATENATE_ALL:
+             x_combined = torch.cat(cat_embeds, dim=1)  # (bs, sum of all cat embed dims)
+         else:
+             raise ValueError(f"Unknown forward type: {self.forward_type}")
+
+         assert (
+             x_combined.dim() == 2
+         ), "Output combined tensor must be 2-dimensional (batch_size, embed_dim)"
+         assert x_combined.size(1) == self.output_dim
+
+         return x_combined
+
+     def _get_cat_embeds(self, categorical_vars_tensor: torch.Tensor):
+         if categorical_vars_tensor.dtype != torch.long:
+             categorical_vars_tensor = categorical_vars_tensor.to(torch.long)
+         cat_embeds = []
+
+         for i, embed_layer in self.categorical_embedding_layers.items():
+             cat_var_tensor = categorical_vars_tensor[:, i]
+
+             # Check if categorical values are within valid range
+             vocab_size = embed_layer.num_embeddings
+             max_val = cat_var_tensor.max().item()
+             min_val = cat_var_tensor.min().item()
+
+             if max_val >= vocab_size or min_val < 0:
+                 raise ValueError(
+                     f"Categorical feature {i}: values range [{min_val}, {max_val}] exceed vocabulary size {vocab_size}."
+                 )
+
+             cat_embed = embed_layer(cat_var_tensor)
+             if cat_embed.dim() > 2:
+                 cat_embed = cat_embed.squeeze(1)
+             cat_embeds.append(cat_embed)
+
+         return cat_embeds
+
+     def _validate_categorical_inputs(self):
+         categorical_vocabulary_sizes = self.categorical_vocabulary_sizes
+         categorical_embedding_dims = self.categorical_embedding_dims
+
+         if not isinstance(categorical_vocabulary_sizes, list):
+             raise TypeError("categorical_vocabulary_sizes must be a list of int")
+
+         if isinstance(categorical_embedding_dims, list):
+             if len(categorical_vocabulary_sizes) != len(categorical_embedding_dims):
+                 raise ValueError(
+                     "Categorical vocabulary sizes and their embedding dimensions must have the same length"
+                 )
+
+         num_categorical_features = len(categorical_vocabulary_sizes)
+
+         # "Transform" embedding dims into a suitable list, or stay None
+         if categorical_embedding_dims is not None:
+             if isinstance(categorical_embedding_dims, int):
+                 self.forward_type = CategoricalForwardType.AVERAGE_AND_CONCAT
+                 self.output_dim = categorical_embedding_dims
+                 categorical_embedding_dims = [categorical_embedding_dims] * num_categorical_features
+
+             elif isinstance(categorical_embedding_dims, list):
+                 self.forward_type = CategoricalForwardType.CONCATENATE_ALL
+                 self.output_dim = sum(categorical_embedding_dims)
+             else:
+                 raise TypeError("categorical_embedding_dims must be an int, a list of int or None")
+         else:
+             if self.text_embedding_dim is None:
+                 raise ValueError(
+                     "If categorical_embedding_dims is None, text_embedding_dim must be provided"
+                 )
+             self.forward_type = CategoricalForwardType.SUM_TO_TEXT
+             self.output_dim = self.text_embedding_dim
+             categorical_embedding_dims = [self.text_embedding_dim] * num_categorical_features
+
+         assert (
+             isinstance(categorical_embedding_dims, list) or categorical_embedding_dims is None
+         ), "categorical_embedding_dims must be a list of int at this point"
+
+         self.categorical_vocabulary_sizes = categorical_vocabulary_sizes
+         self.categorical_embedding_dims = categorical_embedding_dims
+         self.num_categorical_features = num_categorical_features
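`_validate_categorical_inputs` picks the combination mode purely from how `categorical_embedding_dims` is passed: a list concatenates per-feature embeddings, a single int averages same-sized embeddings, and `None` (with `text_embedding_dim`) sums embeddings sized to match the text representation. A minimal sketch of the three modes, assuming only that the package is importable as laid out in this diff:

```python
# Sketch: the three forward modes selected by _validate_categorical_inputs above.
import torch

from torchTextClassifiers.model.components import CategoricalForwardType, CategoricalVariableNet

# Two categorical features with vocabulary sizes 10 and 5; batch of 3 samples.
cats = torch.tensor([[1, 4], [0, 2], [9, 3]])

# 1) List of dims -> one embedding size per feature, outputs concatenated.
net = CategoricalVariableNet(categorical_vocabulary_sizes=[10, 5], categorical_embedding_dims=[8, 4])
assert net.forward_type is CategoricalForwardType.CONCATENATE_ALL
print(net(cats).shape)  # torch.Size([3, 12])

# 2) Single int -> all features share that dim, embeddings averaged.
net = CategoricalVariableNet(categorical_vocabulary_sizes=[10, 5], categorical_embedding_dims=16)
assert net.forward_type is CategoricalForwardType.AVERAGE_AND_CONCAT
print(net(cats).shape)  # torch.Size([3, 16])

# 3) No dims -> embeddings sized like the text embedding and summed, so the
#    result can be added directly to the text representation.
net = CategoricalVariableNet(categorical_vocabulary_sizes=[10, 5], text_embedding_dim=64)
assert net.forward_type is CategoricalForwardType.SUM_TO_TEXT
print(net(cats).shape)  # torch.Size([3, 64])
```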
torchtextclassifiers-0.1.0/torchTextClassifiers/model/components/classification_head.py

@@ -0,0 +1,43 @@
+ from typing import Optional
+
+ import torch
+ from torch import nn
+
+
+ class ClassificationHead(nn.Module):
+     def __init__(
+         self,
+         input_dim: Optional[int] = None,
+         num_classes: Optional[int] = None,
+         net: Optional[nn.Module] = None,
+     ):
+         super().__init__()
+         if net is not None:
+             self.net = net
+             self.input_dim = net.in_features
+             self.num_classes = net.out_features
+         else:
+             assert (
+                 input_dim is not None and num_classes is not None
+             ), "Either net or both input_dim and num_classes must be provided."
+             self.net = nn.Linear(input_dim, num_classes)
+             self.input_dim, self.num_classes = self._get_linear_input_output_dims(self.net)
+
+     def forward(self, x: torch.Tensor) -> torch.Tensor:
+         return self.net(x)
+
+     @staticmethod
+     def _get_linear_input_output_dims(module: nn.Module):
+         """
+         Returns (input_dim, output_dim) for any module containing Linear layers.
+         Works for Linear, Sequential, or nested models.
+         """
+         # Collect all Linear layers recursively
+         linears = [m for m in module.modules() if isinstance(m, nn.Linear)]
+
+         if not linears:
+             raise ValueError("No Linear layers found in the given module.")
+
+         input_dim = linears[0].in_features
+         output_dim = linears[-1].out_features
+         return input_dim, output_dim
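`ClassificationHead` either builds a default `nn.Linear(input_dim, num_classes)` or wraps a user-supplied `net`. Note that the custom-net path reads `net.in_features` / `net.out_features` directly, so it expects a module exposing those attributes (a plain `nn.Linear` does), while the static helper can also inspect nested containers. A short sketch, assuming only that the package is importable:

```python
# Sketch: the two construction paths of ClassificationHead, plus the static
# dimension-inference helper on a nested module.
import torch
from torch import nn

from torchTextClassifiers.model.components import ClassificationHead

# Default path: a single Linear layer is created from the given dimensions.
head = ClassificationHead(input_dim=96, num_classes=10)
logits = head(torch.randn(4, 96))
print(logits.shape)  # torch.Size([4, 10])

# Custom-net path: the constructor reads net.in_features / net.out_features,
# so pass a module that exposes them.
head = ClassificationHead(net=nn.Linear(128, 5))
print(head.input_dim, head.num_classes)  # 128 5

# The static helper walks nested modules and reports the first/last Linear dims.
mlp = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 5))
print(ClassificationHead._get_linear_input_output_dims(mlp))  # (128, 5)
```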