expressivity 0.1.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- expressivity-0.1.0/PKG-INFO +27 -0
- expressivity-0.1.0/README.md +14 -0
- expressivity-0.1.0/expressivity/__init__.py +2 -0
- expressivity-0.1.0/expressivity/probabilistic_density.py +391 -0
- expressivity-0.1.0/expressivity/space.py +142 -0
- expressivity-0.1.0/expressivity.egg-info/PKG-INFO +27 -0
- expressivity-0.1.0/expressivity.egg-info/SOURCES.txt +21 -0
- expressivity-0.1.0/expressivity.egg-info/dependency_links.txt +1 -0
- expressivity-0.1.0/expressivity.egg-info/requires.txt +6 -0
- expressivity-0.1.0/expressivity.egg-info/top_level.txt +2 -0
- expressivity-0.1.0/pyproject.toml +22 -0
- expressivity-0.1.0/setup.cfg +4 -0
- expressivity-0.1.0/tests/cubic_transformer/attention.py +73 -0
- expressivity-0.1.0/tests/cubic_transformer/cubic_transformer_test.py +64 -0
- expressivity-0.1.0/tests/cubic_transformer/transformer.py +107 -0
- expressivity-0.1.0/tests/n_diagonal/deep_nn.py +35 -0
- expressivity-0.1.0/tests/n_diagonal/linear.py +24 -0
- expressivity-0.1.0/tests/n_diagonal/lora.py +27 -0
- expressivity-0.1.0/tests/n_diagonal/n_diagonal.py +41 -0
- expressivity-0.1.0/tests/n_diagonal/n_diagonal_test.py +90 -0
- expressivity-0.1.0/tests/split_transformer/attention.py +57 -0
- expressivity-0.1.0/tests/split_transformer/split_transformer_test.py +50 -0
- expressivity-0.1.0/tests/split_transformer/transformer.py +50 -0
|
@@ -0,0 +1,27 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: expressivity
|
|
3
|
+
Version: 0.1.0
|
|
4
|
+
Summary: A package made to objectively compare the predicting power of neural network architectures implemented with torch.
|
|
5
|
+
Author-email: Clustery <bigarnaque@gmail.com>
|
|
6
|
+
Requires-Python: >=3.11
|
|
7
|
+
Description-Content-Type: text/markdown
|
|
8
|
+
Requires-Dist: matplotlib>=3.10.0
|
|
9
|
+
Requires-Dist: torch>=2.5.1
|
|
10
|
+
Provides-Extra: dev
|
|
11
|
+
Requires-Dist: black>=24.10.0; extra == "dev"
|
|
12
|
+
Requires-Dist: ruff>=0.9.0; extra == "dev"
|
|
13
|
+
|
|
14
|
+
Pour comparer 2 architectures, il faut définir une distribution de probabilité sur les paramtres de chaque réseau.
|
|
15
|
+
Cette distribution doit être définie directement lors de la création du réseau dans la méthode init.
|
|
16
|
+
Afin de comparer 2 architectures de manière équitable, il faut s'assurer qu'il y a ait autant de paramètres pour chaque paire prise à indice égale dans les listes space.architecture.
|
|
17
|
+
Il n'y a pas ce problème lorsque l'on compare avec un réseau tier.
|
|
18
|
+
L'argument 'mesure' dans un espace sert à quantifier la 'taille' d'un réseau. Le mode 'parameter' pour l'argument 'automatic_mesurement_mode' permet de comptabiliser automatiquement le nombre paramètres apprenables dans le modèle. Le mode par défaut 'information' permet lui de prendre en compte également la précision des paramètres pour comptabiliser l'intégralité des bits néccessaires pour encoder l'information des poids apprenables. L'utilisateur pourra lui-même définir ses propres métriques en définissant manuellement l'argument 'mesure' pour l'intégralité des modèles de l'espace.
|
|
19
|
+
|
|
20
|
+
|
|
21
|
+
Tests :
|
|
22
|
+
- Le fichier split_transformers_test.py permet de comparer un transformers classique à un transformer appliquant une fonction d'activation juste avant le produit entre Q et K. Un effectue un test A/B entre les deux modèles.
|
|
23
|
+
- Le fichier n_diagonal_test.py compare l'usage d'une matrice n_diagonale, comparée à une matrice écrite sous la forme d'une LoRA. Le nombre de paramètre n'étant pas régoureusement identique dans les 2 architectures, on se propose de comparer ces dernière à une architecture tier qui englobe les 2 architectures. On compare donc à un réseau constitué de matrices de poids pleines.
|
|
24
|
+
|
|
25
|
+
TODO:
|
|
26
|
+
- In the next version it would be great to let the user define its training routine directly in the ArchitecturalSpace class
|
|
27
|
+
- Lors de l'entrainement avec une architecture tier, on recalcule 2 fois des passes avant pour le modèle tier. On pourrait opitmiser le temps de calcul en s'assurant que l'on génère une target, puis les 2 modèles concurrents doivent l'approximer (au prix d'un code plus long, à moins de repenser la structure logique)
|
|
@@ -0,0 +1,14 @@
|
|
|
1
|
+
Pour comparer 2 architectures, il faut définir une distribution de probabilité sur les paramètres de chaque réseau.
|
|
2
|
+
Cette distribution doit être définie directement lors de la création du réseau dans la méthode init.
|
|
3
|
+
Afin de comparer 2 architectures de manière équitable, il faut s'assurer qu'il y a ait autant de paramètres pour chaque paire prise à indice égale dans les listes space.architecture.
|
|
4
|
+
Il n'y a pas ce problème lorsque l'on compare avec un réseau tier.
|
|
5
|
+
L'argument 'mesure' dans un espace sert à quantifier la 'taille' d'un réseau. Le mode 'parameter' pour l'argument 'automatic_mesurement_mode' permet de comptabiliser automatiquement le nombre paramètres apprenables dans le modèle. Le mode par défaut 'information' permet lui de prendre en compte également la précision des paramètres pour comptabiliser l'intégralité des bits néccessaires pour encoder l'information des poids apprenables. L'utilisateur pourra lui-même définir ses propres métriques en définissant manuellement l'argument 'mesure' pour l'intégralité des modèles de l'espace.
|
|
6
|
+
|
|
7
|
+
|
|
8
|
+
Tests :
|
|
9
|
+
- Le fichier split_transformers_test.py permet de comparer un transformer classique à un transformer appliquant une fonction d'activation juste avant le produit entre Q et K. On effectue un test A/B entre les deux modèles.
|
|
10
|
+
- Le fichier n_diagonal_test.py compare l'usage d'une matrice n_diagonale à une matrice écrite sous la forme d'une LoRA. Le nombre de paramètres n'étant pas rigoureusement identique dans les 2 architectures, on se propose de comparer ces dernières à une architecture tierce qui englobe les 2 architectures. On compare donc à un réseau constitué de matrices de poids pleines.
|
|
11
|
+
|
|
12
|
+
TODO:
|
|
13
|
+
- In the next version it would be great to let the user define its training routine directly in the ArchitecturalSpace class
|
|
14
|
+
- Lors de l'entrainement avec une architecture tier, on recalcule 2 fois des passes avant pour le modèle tier. On pourrait opitmiser le temps de calcul en s'assurant que l'on génère une target, puis les 2 modèles concurrents doivent l'approximer (au prix d'un code plus long, à moins de repenser la structure logique)
|
|
@@ -0,0 +1,391 @@
|
|
|
1
|
+
import torch
|
|
2
|
+
from torch import nn, optim
|
|
3
|
+
from expressivity.space import ArchitecturalSpace
|
|
4
|
+
import matplotlib.pyplot as plt
|
|
5
|
+
|
|
6
|
+
|
|
7
|
+
class ArchitectureComparator:
    """
    Compare two architectural spaces by repeatedly training freshly sampled
    models of one space to imitate randomly initialized models of the other
    (or of a common base space when A and B cannot be matched
    parameter-for-parameter).
    """

    def __init__(
        self,
        A_space: ArchitecturalSpace,
        B_space: ArchitecturalSpace,
        base_space: ArchitecturalSpace = None,
        criterion=nn.MSELoss(),
        law=torch.distributions.Normal(0, 1),
    ) -> None:
        """
        Initialize the ArchitectureComparator.

        Parameters:
        - A_space (ArchitecturalSpace): The first architectural space.
        - B_space (ArchitecturalSpace): The second architectural space.
        - base_space (ArchitecturalSpace, optional): The base architectural space used for comparison.
        - criterion (nn.Module): Loss function used for training (default: nn.MSELoss()).
        - law (torch.distributions.Distribution): Distribution the input samples are drawn from (default: Normal(0, 1)).
        """
        self.A_space = A_space
        self.B_space = B_space
        self.base_space = base_space
        self.criterion = criterion
        self.law = law

        assert (
            A_space.input_size == B_space.input_size
        ), "The input size of the two models must be the same"

        self.input_size = A_space.input_size

        assert len(A_space.parameters) == len(
            B_space.parameters
        ), "The number of architectures must be the same in space A and B"
        self.count = len(A_space.parameters)

        assert (
            A_space.automatic_mesurement_mode == B_space.automatic_mesurement_mode
            or A_space.automatic_mesurement_mode is None
            or B_space.automatic_mesurement_mode is None
        ), "The automatic mesurement mode must be the same in space A and B"

        if A_space.mesurement != B_space.mesurement:
            print(
                "Warning: The mesurements of space A and B are different, you may not compare both model on an equal footing"
            )

        # Run a dummy forward pass to discover and validate the output shapes.
        try:
            test_tensor = torch.zeros((1, *self.input_size))
            A_output_size = self._create_model(A_space, 0)(test_tensor).shape
            B_output_size = self._create_model(B_space, 0)(test_tensor).shape

            assert (
                A_output_size == B_output_size
            ), "The output size of the two models must be the same"

            # Drop the batch dimension; only the per-sample shape is kept.
            self.output_size = A_output_size[1:]

            if base_space is not None:
                assert (
                    self.input_size == base_space.input_size
                ), "The input size of the two models must be the same"
                base_output_size = self._create_model(base_space, 0)(test_tensor).shape
                assert (
                    self.output_size == base_output_size[1:]
                ), "The output size of the two models must be the same"
                assert len(base_space.parameters) == self.count

        except Exception as e:
            # NOTE(review): failures here are only reported, not re-raised, so
            # a mis-configured comparator can still be constructed — behavior
            # kept as-is.
            print("The input size is not correct", e)

    def compare(
        self,
        max_iterations: int = 10,
        sub_iterations: int = 1,
        variance_threashold: float | None = None,
        plot_mode: str | None = None,
    ) -> tuple[list[float]]:
        """
        Compare architectures by fitting one to the other and evaluating performance.

        Parameters:
        - max_iterations (int): Maximum number of gradient descent iterations.
        - sub_iterations (int): Number of attempts of the source architecture to minimize error at each iteration.
        - variance_threashold (float, optional): Threshold to stop iterations based on variance.
        - plot_mode (str, optional): Plot comparison results; "min" or "mean".

        Returns:
        - tuple[list[float]]: Minimum and mean losses for architectures A and B.
        """

        self.max_iterations = max_iterations
        self.sub_iterations = sub_iterations
        # A None threshold disables early stopping (variance is never < 0).
        if variance_threashold is None:
            self.variance_threashold = 0
        else:
            self.variance_threashold = variance_threashold

        self.min_A_fit = [None for _ in range(self.count)]
        self.mean_A_fit = [None for _ in range(self.count)]
        self.min_B_fit = [None for _ in range(self.count)]
        self.mean_B_fit = [None for _ in range(self.count)]

        for i in range(self.count):
            print(f"Fitting model {i+1} out of {self.count}")
            if self.base_space is None:
                print(f"{self.A_space.name} fits {self.B_space.name}")
                self.min_A_fit[i], self.mean_A_fit[i] = self._fit_source_to_target(
                    self.A_space, self.B_space, i
                )
                print(f"{self.B_space.name} fits {self.A_space.name}")
                self.min_B_fit[i], self.mean_B_fit[i] = self._fit_source_to_target(
                    self.B_space, self.A_space, i
                )
            else:
                print(f"{self.A_space.name} fits {self.base_space.name}")
                self.min_A_fit[i], self.mean_A_fit[i] = self._fit_source_to_target(
                    self.A_space, self.base_space, i
                )
                # Bug fix: it is B that fits the base space here; the old
                # message printed the two names swapped.
                print(f"{self.B_space.name} fits {self.base_space.name}")
                self.min_B_fit[i], self.mean_B_fit[i] = self._fit_source_to_target(
                    self.B_space, self.base_space, i
                )

            if self.min_B_fit[i] > self.min_A_fit[i]:
                self.winner = "A"
                print(f"Model {self.A_space.name} is better than {self.B_space.name}")
            else:
                self.winner = "B"
                print(f"Model {self.B_space.name} is better than {self.A_space.name}")
            # Backward-compatible alias for the historical misspelled attribute.
            self.winnner = self.winner

            if self.mean_B_fit[i] > self.mean_A_fit[i]:
                if self.winner == "A":
                    print(
                        f"Model {self.A_space.name} is better than {self.B_space.name} by any mean"
                    )
                else:
                    print(
                        f"However, model {self.A_space.name} shows better convergence in mean than {self.B_space.name}"
                    )
            else:
                if self.winner == "B":
                    print(
                        f"Model {self.B_space.name} is better than {self.A_space.name} by any mean"
                    )
                else:
                    print(
                        f"However, model {self.B_space.name} shows better convergence in mean than {self.A_space.name}"
                    )

        if plot_mode is not None:
            self.plot(plot_mode)

        return self.min_A_fit, self.mean_A_fit, self.min_B_fit, self.mean_B_fit

    def _create_model(self, space: ArchitecturalSpace, index: int) -> nn.Module:
        """
        Create a model from a given architecture and a set of parameters.

        Parameters:
        - space (ArchitecturalSpace): The architectural space.
        - index (int): Index of the model within the space.

        Returns:
        - nn.Module: A freshly initialized model instance.
        """

        return space.architecture(**space.parameters[index])

    def _fit_source_to_target(
        self,
        source_space: ArchitecturalSpace,
        target_space: ArchitecturalSpace,
        model_index: int,
    ) -> tuple[float]:
        """
        Fit a source model to match the behavior of a target model.

        Parameters:
        - source_space (ArchitecturalSpace): The source architectural space.
        - target_space (ArchitecturalSpace): The target architectural space.
        - model_index (int): Index of the model being compared.

        Returns:
        - tuple[float]: Minimum and mean losses (in that order) for the source
          model fitting the target, averaged over the performed iterations.
        """

        minimum = torch.tensor([torch.inf] * self.max_iterations)
        mean = torch.zeros(self.max_iterations)

        # Initialize epochs, grad_clamp and criterion
        epochs = source_space.epoch[model_index]
        grad_clamp = source_space.grad_clamp[model_index]
        criterion = self.criterion

        # We initialize mini_batch_count with both the target_space batch size and the source_space mini batch size.
        # This takes the source space's information into account to improve convergence (it is an important hyperparameter of the source's learning process).
        # Meanwhile, using the target space batch size tells us how many samples are needed: if the target network has only one parameter, the maximum degree of freedom of its output is 1 (this is much more relevant than taking the source's, but when comparing without a base space we recommend similar mesurements for both source and target).
        mini_batch_count = (
            target_space.batch_size[model_index]
            // source_space.mini_batch_size[model_index]
        )
        mini_batch_size = source_space.mini_batch_size[model_index]
        shape = (
            mini_batch_count,
            mini_batch_size,
            *self.input_size,
        )

        for i in range(self.max_iterations):
            print(f"Iteration {i+1}/{self.max_iterations}")
            # Generate data. Bug fix: the original called `X.detach()` and
            # discarded the result (a no-op); the detached tensor is now kept.
            X = self.law.sample(shape).detach()

            # Initialize target model
            target_model = self._create_model(target_space, model_index)
            target_model.eval()

            # Forward pass through the target model
            with torch.no_grad():
                target_output = target_model(
                    X.view(mini_batch_count * mini_batch_size, *self.input_size)
                )
            target_output = target_output.view(
                mini_batch_count, mini_batch_size, *self.output_size
            )

            for j in range(self.sub_iterations):
                print(f"Sub-iteration {j+1}/{self.sub_iterations}")
                # Initialize source model
                source_model = self._create_model(source_space, model_index)
                optimizer = source_space.optimizer(
                    source_model.parameters(), source_space.lr[model_index]
                )
                # Score before training (printed for reference only).
                loss = self.test_model(
                    source_model,
                    criterion,
                    X,
                    target_output,
                )

                # Train source model to fit target model
                self.train_model(
                    source_model,
                    epochs,
                    criterion,
                    optimizer,
                    grad_clamp,
                    X,
                    target_output,
                )

                # Compute loss on the whole batch
                print("Computing score on the eval set...")
                loss = self.test_model(
                    source_model,
                    criterion,
                    X,
                    target_output,
                )

                minimum[i] = min(minimum[i], loss)
                mean[i] += loss

            mean[i] /= self.sub_iterations

            # Compute the empirical variance to decide when to stop early.
            min_var = torch.var(minimum, unbiased=True)
            mean_var = torch.var(mean, unbiased=True)
            max_var = max(min_var, mean_var)
            if max_var < self.variance_threashold:
                break

        return minimum.mean().item(), mean.mean().item()

    def test_model(
        self,
        model: nn.Module,
        criterion: nn.Module,
        X: torch.Tensor,
        y: torch.Tensor,
    ) -> float:
        """
        Evaluate a model on a given dataset.

        Parameters:
        - model (nn.Module): The model to evaluate.
        - criterion (nn.Module): Loss function.
        - X (torch.Tensor): Input mini-batches, shape (mini_batch_count, mini_batch_size, *input_size).
        - y (torch.Tensor): Target mini-batches aligned with X.

        Returns:
        - float: The loss averaged over the mini-batches.
        """

        model.eval()
        loss = 0
        # Bug fix: evaluation now runs under no_grad — the original built an
        # autograd graph across all mini-batches for nothing.
        with torch.no_grad():
            for mini_batch, target in zip(X, y):
                output = model(mini_batch)
                loss += criterion(output, target)
        loss /= X.shape[0]
        print(f"Score on the whole set, loss: {loss}")
        return loss.item()

    def train_model(
        self,
        model: nn.Module,
        epochs: int,
        criterion: nn.Module,
        optimizer: optim.Optimizer,
        grad_clamp: float,
        X: torch.Tensor,
        y: torch.Tensor,
    ) -> None:
        """
        Train a model to minimize the loss between predicted and target outputs.

        Parameters:
        - model (nn.Module): The model to train.
        - epochs (int): Number of training epochs.
        - criterion (nn.Module): Loss function.
        - optimizer (optim.Optimizer): Optimizer for gradient updates.
        - grad_clamp (float): Maximum gradient value for clipping.
        - X (torch.Tensor): Input mini-batches.
        - y (torch.Tensor): Target mini-batches aligned with X.
        """
        model.train()
        for epoch in range(epochs):
            for mini_batch, target in zip(X, y):
                optimizer.zero_grad()
                output = model(mini_batch)
                loss = criterion(output, target)
                loss.backward()
                # Clip element-wise to stabilize training.
                torch.nn.utils.clip_grad_value_(model.parameters(), grad_clamp)
                optimizer.step()

            print(f"Epoch {epoch + 1}/{epochs}, Loss: {loss.item()}")

    def plot(self, mode: str) -> None:
        """
        Plot comparison results between architectures.

        Parameters:
        - mode (str): Plot type, "min" for minimum loss or "mean" for average loss.

        Raises:
        - ValueError: If the mode is not "min" or "mean".
        """

        if mode not in ["min", "mean"]:
            raise ValueError("Mode must be 'min' or 'mean'")

        if mode == "min":
            values_A = self.min_A_fit
            values_B = self.min_B_fit
        elif mode == "mean":
            values_A = self.mean_A_fit
            values_B = self.mean_B_fit

        plt.figure(figsize=(10, 5))
        plt.plot(
            self.A_space.mesurement,
            values_A,
            label=f"Architecture {self.A_space.name} ({mode})",
            marker="o",
        )
        plt.plot(
            self.B_space.mesurement,
            values_B,
            label=f"Architecture {self.B_space.name} ({mode})",
            marker="o",
        )
        plt.xlabel("Number of Parameters")
        plt.ylabel(f"{mode.capitalize()} Value")
        plt.title(f"Comparison of {mode.capitalize()} Values for Architectures A and B")
        plt.legend()
        plt.grid(True)
        plt.show()

    def get_densities(self):
        """
        Compute and return the density of the comparison.

        Returns:
        - To be implemented if mathematically cool.
        """
        pass
|
|
@@ -0,0 +1,142 @@
|
|
|
1
|
+
from copy import deepcopy
|
|
2
|
+
import torch
|
|
3
|
+
import torch.optim as optim
|
|
4
|
+
import torch.nn as nn
|
|
5
|
+
from typing import Dict, Any
|
|
6
|
+
|
|
7
|
+
|
|
8
|
+
class ArchitecturalSpace:
|
|
9
|
+
def __init__(
|
|
10
|
+
self,
|
|
11
|
+
input_size: tuple | torch.Size,
|
|
12
|
+
name: str = None,
|
|
13
|
+
architecture: nn.Module = None,
|
|
14
|
+
parameters: Dict[str, Any] | list[Dict[str, Any]] | None = None,
|
|
15
|
+
lr: float | list[float] = 0.001,
|
|
16
|
+
epoch: int | list[int] = 3,
|
|
17
|
+
batch_size: int | list[int] | None = None,
|
|
18
|
+
automatic_batch_size_scale: float | None = 1.0,
|
|
19
|
+
mesurement: float | list[float] | None = None,
|
|
20
|
+
automatic_mesurement_mode: str | None = "information",
|
|
21
|
+
mini_batch_size: int | list[int] = 16,
|
|
22
|
+
optimizer=optim.AdamW,
|
|
23
|
+
grad_clamp: int | list[int] = 1,
|
|
24
|
+
) -> None:
|
|
25
|
+
"""
|
|
26
|
+
Initializes an instance of the ArchitecturalSpace class.
|
|
27
|
+
|
|
28
|
+
Parameters:
|
|
29
|
+
- input_size (tuple | torch.Size): The size of the input data.
|
|
30
|
+
- name (str, optional): The name of the architectural space. Defaults to None.
|
|
31
|
+
- architecture (nn.Module, optional): The neural network architecture. Defaults to None.
|
|
32
|
+
- parameters (Dict[str, Any] | list[Dict[str, Any]] | None, optional): The parameters needed when initilizing the architecture. Defaults to None.
|
|
33
|
+
- lr (float | list[float], optional): Learning rate(s) for the optimizer. Defaults to 0.001.
|
|
34
|
+
- epoch (int | list[int], optional): Number of epochs for training. Defaults to 10.
|
|
35
|
+
- batch_size (int | list[int] | None, optional): Batch size(s) for training. Defaults to None.
|
|
36
|
+
- automatic_batch_size_scale (float | None, optional): Scale factor for automatic batch size calculation. Defaults to 10.0.
|
|
37
|
+
- mesurement (float | list[float] | None, optional): Measurement(s) for the architecture. Defaults to None.
|
|
38
|
+
- automatic_mesurement_mode (str | None, optional): Mode for automatic measurement calculation. Defaults to "information".
|
|
39
|
+
- mini_batch_size (int | list[int], optional): Mini-batch size(s) for training. Defaults to 16.
|
|
40
|
+
- optimizer (optional): Optimizer for training. Defaults to optim.AdamW.
|
|
41
|
+
- grad_clamp (int | list[int], optional): Gradient clamp value(s). Defaults to 1.
|
|
42
|
+
|
|
43
|
+
Returns:
|
|
44
|
+
- None
|
|
45
|
+
"""
|
|
46
|
+
|
|
47
|
+
assert (
|
|
48
|
+
batch_size is not None or automatic_batch_size_scale is not None
|
|
49
|
+
), "Either batch_size or automatic_batch_size_scale must be defined"
|
|
50
|
+
assert (
|
|
51
|
+
mesurement is not None or automatic_mesurement_mode is not None
|
|
52
|
+
), "Either mesurement or automatic_mesurement_mode must be defined"
|
|
53
|
+
|
|
54
|
+
if mesurement is None:
|
|
55
|
+
assert automatic_mesurement_mode in [
|
|
56
|
+
"information",
|
|
57
|
+
"parameters",
|
|
58
|
+
], "automatic_mesurement_mode must be either 'information' or 'parameters'"
|
|
59
|
+
|
|
60
|
+
self.name = name
|
|
61
|
+
self.architecture = architecture
|
|
62
|
+
self.parameters = parameters
|
|
63
|
+
self.lr = lr
|
|
64
|
+
self.epoch = epoch
|
|
65
|
+
self.automatic_mesurement_mode = automatic_mesurement_mode
|
|
66
|
+
self.mini_batch_size = mini_batch_size
|
|
67
|
+
self.optimizer = optimizer
|
|
68
|
+
self.input_size = input_size
|
|
69
|
+
self.grad_clamp = grad_clamp
|
|
70
|
+
|
|
71
|
+
if type(parameters) is not list:
|
|
72
|
+
self.parameters = [parameters]
|
|
73
|
+
|
|
74
|
+
list_size = len(self.parameters)
|
|
75
|
+
|
|
76
|
+
if automatic_mesurement_mode == "information":
|
|
77
|
+
self.mesurement_method = self.count_information
|
|
78
|
+
elif automatic_mesurement_mode == "parameters":
|
|
79
|
+
self.mesurement_method = self.count_parameters
|
|
80
|
+
else:
|
|
81
|
+
self.mesurement_method = None
|
|
82
|
+
|
|
83
|
+
if mesurement is None:
|
|
84
|
+
self.mesurement = [
|
|
85
|
+
self.mesurement_method(architecture(**params))
|
|
86
|
+
for params in self.parameters
|
|
87
|
+
]
|
|
88
|
+
else:
|
|
89
|
+
self.mesurement = mesurement
|
|
90
|
+
|
|
91
|
+
if automatic_batch_size_scale is None:
|
|
92
|
+
self.batch_size = batch_size
|
|
93
|
+
else:
|
|
94
|
+
self.batch_size = [
|
|
95
|
+
int(automatic_batch_size_scale * mesure) for mesure in self.mesurement
|
|
96
|
+
]
|
|
97
|
+
|
|
98
|
+
for attr_name, attr_value in vars(self).items():
|
|
99
|
+
if type(attr_value) is list:
|
|
100
|
+
assert (
|
|
101
|
+
len(attr_value) == list_size
|
|
102
|
+
), "You should have as much elements in each list in your parameters"
|
|
103
|
+
elif attr_name in [
|
|
104
|
+
"lr",
|
|
105
|
+
"epoch",
|
|
106
|
+
"mini_batch_size",
|
|
107
|
+
"grad_clamp",
|
|
108
|
+
"batch_size",
|
|
109
|
+
"mesurement",
|
|
110
|
+
]:
|
|
111
|
+
setattr(
|
|
112
|
+
self, attr_name, [deepcopy(attr_value) for _ in range(list_size)]
|
|
113
|
+
)
|
|
114
|
+
|
|
115
|
+
def count_parameters(self, model: nn.Module) -> int:
|
|
116
|
+
"""
|
|
117
|
+
Counts the number of trainable parameters in a given neural network architecture.
|
|
118
|
+
|
|
119
|
+
Parameters:
|
|
120
|
+
- model (nn.Module): The neural network architecture for which the parameters are to be counted.
|
|
121
|
+
|
|
122
|
+
Returns:
|
|
123
|
+
- int: The total number of trainable parameters in the architecture.
|
|
124
|
+
"""
|
|
125
|
+
return sum(p.numel() for p in model.parameters() if p.requires_grad)
|
|
126
|
+
|
|
127
|
+
def count_information(self, model: nn.Module) -> int:
|
|
128
|
+
"""
|
|
129
|
+
Calculate and return the count of information in bit.
|
|
130
|
+
|
|
131
|
+
Returns:
|
|
132
|
+
int: The number of bit needed to code the list of all trainable parameters.
|
|
133
|
+
"""
|
|
134
|
+
total_bits = 0
|
|
135
|
+
|
|
136
|
+
for p in model.parameters():
|
|
137
|
+
if p.requires_grad:
|
|
138
|
+
element_size_in_bytes = p.element_size()
|
|
139
|
+
element_size_in_bits = element_size_in_bytes * 8
|
|
140
|
+
total_bits += p.numel() * element_size_in_bits
|
|
141
|
+
|
|
142
|
+
return total_bits
|
|
@@ -0,0 +1,27 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: expressivity
|
|
3
|
+
Version: 0.1.0
|
|
4
|
+
Summary: A package made to objectively compare the predicting power of neural network architectures implented with torch.
|
|
5
|
+
Author-email: Clustery <bigarnaque@gmail.com>
|
|
6
|
+
Requires-Python: >=3.11
|
|
7
|
+
Description-Content-Type: text/markdown
|
|
8
|
+
Requires-Dist: matplotlib>=3.10.0
|
|
9
|
+
Requires-Dist: torch>=2.5.1
|
|
10
|
+
Provides-Extra: dev
|
|
11
|
+
Requires-Dist: black>=24.10.0; extra == "dev"
|
|
12
|
+
Requires-Dist: ruff>=0.9.0; extra == "dev"
|
|
13
|
+
|
|
14
|
+
Pour comparer 2 architectures, il faut définir une distribution de probabilité sur les paramtres de chaque réseau.
|
|
15
|
+
Cette distribution doit être définie directement lors de la création du réseau dans la méthode init.
|
|
16
|
+
Afin de comparer 2 architectures de manière équitable, il faut s'assurer qu'il y a ait autant de paramètres pour chaque paire prise à indice égale dans les listes space.architecture.
|
|
17
|
+
Il n'y a pas ce problème lorsque l'on compare avec un réseau tier.
|
|
18
|
+
L'argument 'mesure' dans un espace sert à quantifier la 'taille' d'un réseau. Le mode 'parameter' pour l'argument 'automatic_mesurement_mode' permet de comptabiliser automatiquement le nombre paramètres apprenables dans le modèle. Le mode par défaut 'information' permet lui de prendre en compte également la précision des paramètres pour comptabiliser l'intégralité des bits néccessaires pour encoder l'information des poids apprenables. L'utilisateur pourra lui-même définir ses propres métriques en définissant manuellement l'argument 'mesure' pour l'intégralité des modèles de l'espace.
|
|
19
|
+
|
|
20
|
+
|
|
21
|
+
Tests :
|
|
22
|
+
- Le fichier split_transformers_test.py permet de comparer un transformers classique à un transformer appliquant une fonction d'activation juste avant le produit entre Q et K. Un effectue un test A/B entre les deux modèles.
|
|
23
|
+
- Le fichier n_diagonal_test.py compare l'usage d'une matrice n_diagonale, comparée à une matrice écrite sous la forme d'une LoRA. Le nombre de paramètre n'étant pas régoureusement identique dans les 2 architectures, on se propose de comparer ces dernière à une architecture tier qui englobe les 2 architectures. On compare donc à un réseau constitué de matrices de poids pleines.
|
|
24
|
+
|
|
25
|
+
TODO:
|
|
26
|
+
- In the next version it would be great to let the user define its training routine directly in the ArchitecturalSpace class
|
|
27
|
+
- Lors de l'entrainement avec une architecture tier, on recalcule 2 fois des passes avant pour le modèle tier. On pourrait opitmiser le temps de calcul en s'assurant que l'on génère une target, puis les 2 modèles concurrents doivent l'approximer (au prix d'un code plus long, à moins de repenser la structure logique)
|
|
@@ -0,0 +1,21 @@
|
|
|
1
|
+
README.md
|
|
2
|
+
pyproject.toml
|
|
3
|
+
expressivity/__init__.py
|
|
4
|
+
expressivity/probabilistic_density.py
|
|
5
|
+
expressivity/space.py
|
|
6
|
+
expressivity.egg-info/PKG-INFO
|
|
7
|
+
expressivity.egg-info/SOURCES.txt
|
|
8
|
+
expressivity.egg-info/dependency_links.txt
|
|
9
|
+
expressivity.egg-info/requires.txt
|
|
10
|
+
expressivity.egg-info/top_level.txt
|
|
11
|
+
tests/cubic_transformer/attention.py
|
|
12
|
+
tests/cubic_transformer/cubic_transformer_test.py
|
|
13
|
+
tests/cubic_transformer/transformer.py
|
|
14
|
+
tests/n_diagonal/deep_nn.py
|
|
15
|
+
tests/n_diagonal/linear.py
|
|
16
|
+
tests/n_diagonal/lora.py
|
|
17
|
+
tests/n_diagonal/n_diagonal.py
|
|
18
|
+
tests/n_diagonal/n_diagonal_test.py
|
|
19
|
+
tests/split_transformer/attention.py
|
|
20
|
+
tests/split_transformer/split_transformer_test.py
|
|
21
|
+
tests/split_transformer/transformer.py
|
|
@@ -0,0 +1 @@
|
|
|
1
|
+
|
|
@@ -0,0 +1,22 @@
|
|
|
1
|
+
[project]
|
|
2
|
+
name = "expressivity"
|
|
3
|
+
version = "0.1.0"
|
|
4
|
+
authors = [
|
|
5
|
+
{name = "Clustery", email = "bigarnaque@gmail.com"}
|
|
6
|
+
]
|
|
7
|
+
description = "A package made to objectively compare the predicting power of neural network architectures implemented with torch."
|
|
8
|
+
readme = "README.md"
|
|
9
|
+
requires-python = ">=3.11"
|
|
10
|
+
dependencies = [
|
|
11
|
+
"matplotlib>=3.10.0",
|
|
12
|
+
"torch>=2.5.1",
|
|
13
|
+
]
|
|
14
|
+
|
|
15
|
+
[project.optional-dependencies]
|
|
16
|
+
dev = [
|
|
17
|
+
"black>=24.10.0",
|
|
18
|
+
"ruff>=0.9.0",
|
|
19
|
+
]
|
|
20
|
+
|
|
21
|
+
[tool.setuptools.packages.find]
|
|
22
|
+
exclude = ["notebook", "notebook.*", "tests", "tests.*"]
|
|
@@ -0,0 +1,73 @@
|
|
|
1
|
+
import torch
|
|
2
|
+
import torch.nn as nn
|
|
3
|
+
import torch.nn.functional as F
|
|
4
|
+
import math
|
|
5
|
+
|
|
6
|
+
"""
|
|
7
|
+
Cette classe implémente l'attention multi-tête d'OpenAI (et non celle du papier originel 'Attention is all you need').
|
|
8
|
+
La différence principale est que les têtes d'attention sont sommées au lieu d'être concaténées. Cela n'implique donc pas de respecter la contrainte d_model%n_heads = 0.
|
|
9
|
+
L'autre différence est qu'il n'y a pas de couche linéaire traditionnellement appelée 'O' (pour 'output') appliquée à la fin du mécanisme d'attention.
|
|
10
|
+
"""
|
|
11
|
+
|
|
12
|
+
|
|
13
|
+
class MultiHeadAttention(nn.Module):
    """OpenAI-flavoured multi-head attention.

    Unlike the original "Attention is all you need" formulation, the head
    outputs are summed instead of concatenated, so d_model does not have to
    be divisible by n_heads.  There is also no final output ("O") projection.
    With cubic=False the raw scaled scores are used as attention weights
    (no softmax).
    """

    def __init__(
        self,
        d_model,
        n_heads,
        cubic=False,
    ):
        super(MultiHeadAttention, self).__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.cubic = cubic

        # One projection per role; each produces n_heads full-size heads.
        self.Q = nn.Linear(d_model, d_model * n_heads, False)
        self.K = nn.Linear(d_model, d_model * n_heads, False)
        self.V = nn.Linear(d_model, d_model * n_heads, False)

    def forward(
        self,
        x: torch.Tensor,
    ):
        """
        x : tensor of shape (batch_size, seq_len, d_model)
        """
        batch_size, seq_len, _ = x.size()

        # Project the input, then fold the heads into the batch dimension.
        query = self._reshape_to_batches(self.Q(x))
        key = self._reshape_to_batches(self.K(x))
        value = self._reshape_to_batches(self.V(x))

        head_dim = query.size(-1)
        scores = query.matmul(key.transpose(-2, -1)) / math.sqrt(head_dim)
        # cubic=True applies the usual softmax; otherwise the scaled scores
        # themselves act as the attention weights.
        weights = F.softmax(scores, dim=-1) if self.cubic else scores
        out = weights.matmul(value)

        # Undo the head folding and merge the heads by summation.
        out = out.reshape(batch_size, self.n_heads, seq_len, self.d_model)
        return out.sum(dim=1)

    def _reshape_to_batches(
        self,
        x: torch.Tensor,
    ) -> torch.Tensor:
        """Fold the head axis into the batch axis.

        x: input tensor of shape (batch_size, seq_len, d_model * n_heads)

        Returns:
            Tensor of shape (batch_size * n_heads, seq_len, d_model)
        """
        batch_size, seq_len, _ = x.size()
        folded = x.reshape(batch_size, seq_len, self.n_heads, self.d_model)
        folded = folded.permute(0, 2, 1, 3)
        return folded.reshape(batch_size * self.n_heads, seq_len, self.d_model)
|
|
@@ -0,0 +1,64 @@
|
|
|
1
|
+
from transformer import Transformer
from expressivity.space import ArchitecturalSpace
from expressivity.probabilistic_density import ArchitectureComparator
from torch import optim

"""
In this example we are comparing the OpenAI style Transformer architecture with the mathematically simplest network allowing attention.
We have purposely removed one fully connected layer from the original architecture to ensure both architectures hold the same number of parameters.
"""

# Shared hyper-parameters for both competing architectures.
d_model = 6
seq_length = 5
n_heads = 1
d_ff = 6
max_depth = 4  # number of (depth, epoch) configurations per space

# Create competing architectures: one configuration per depth in [4, 4 + max_depth).
cubic_transformer_params = [
    {
        "d_model": d_model,
        "n_heads": n_heads,
        "d_ff": d_ff,
        "depth": i + 4,
        "cubic": True,  # experimental attention variant
    }
    for i in range(max_depth)
]

transformer_params = [
    {
        "d_model": d_model,
        "n_heads": n_heads,
        "d_ff": d_ff,
        "depth": i + 4,
    }
    for i in range(max_depth)
]

# Create architectural spaces; deeper configurations get a few more epochs.
epoch = [i+3 for i in range(max_depth)]

cubic_transformer_space = ArchitecturalSpace(
    (seq_length, d_model),
    "Cubic Transformer",
    Transformer,
    cubic_transformer_params,
    epoch=epoch,
    optimizer=optim.AdamW
)

transformer_space = ArchitecturalSpace(
    (seq_length, d_model),
    "Transformer",
    Transformer,
    transformer_params,
    epoch=epoch,
)

# Create comparator and run the A/B comparison between the two spaces.
comparator = ArchitectureComparator(cubic_transformer_space, transformer_space)

res = comparator.compare()
print(res)
comparator.plot("min")
|
|
@@ -0,0 +1,107 @@
|
|
|
1
|
+
import sys
|
|
2
|
+
import os
|
|
3
|
+
|
|
4
|
+
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "..", "..")))
|
|
5
|
+
|
|
6
|
+
from tests.cubic_transformer.attention import MultiHeadAttention
|
|
7
|
+
import torch.nn as nn
|
|
8
|
+
import torch
|
|
9
|
+
import torch.nn.functional as F
|
|
10
|
+
|
|
11
|
+
class Transformer(nn.Module):
    """Encoder-only transformer: `depth` identical encoder blocks in sequence.

    All arguments are forwarded to TransformerEncoderLayer; `cubic` selects
    the experimental attention variant.
    """

    def __init__(
        self,
        d_model,
        n_heads,
        d_ff,
        depth,
        dropout=0.1,
        cubic=False,
    ):
        super(Transformer, self).__init__()
        self.d_model = d_model

        # Encoder stack: `depth` identical layers.
        blocks = [
            TransformerEncoderLayer(
                d_model,
                n_heads,
                d_ff,
                dropout,
                cubic,
            )
            for _ in range(depth)
        ]
        self.encoder_layers = nn.ModuleList(blocks)

    def forward(self, x):
        """
        x : tensor of shape (batch_size, seq_len, d_model)
        """
        out = x
        for block in self.encoder_layers:
            out = block(out)
        return out
|
|
45
|
+
|
|
46
|
+
|
|
47
|
+
class TransformerEncoderLayer(nn.Module):
    """Encoder block used by the "cubic" transformer experiment.

    With cubic=True the block computes `attention(x) + fc(x)` (no
    normalisation, no dropout).  Otherwise it applies attention, a residual
    L2-normalisation, and a single linear projection.

    NOTE(review): fc_1, fc_2, activation and the two LayerNorms are created
    but never used in forward() — the feed-forward sub-layer is commented
    out, presumably to equalise parameter counts between variants; confirm.
    """

    def __init__(
        self,
        d_model,
        n_heads,
        d_ff,
        dropout=0.1,
        cubic=False,
    ):
        super(TransformerEncoderLayer, self).__init__()
        # Multi-head self-attention (heads summed, no output projection).
        self.self_attention = MultiHeadAttention(
            d_model, n_heads, cubic
        )

        self.cubic = cubic

        # Projection applied after attention (or in parallel when cubic).
        self.fc = nn.Linear(d_model, d_model)
        # Feed-forward pair — currently unused in forward() (see class note).
        self.fc_1 = nn.Linear(d_model, d_ff)
        self.fc_2 = nn.Linear(d_ff, d_model)

        self.activation = nn.ReLU()
        self.layer_norm1 = nn.LayerNorm(d_model)
        self.layer_norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

        # Parameter snapshot used by check_weights() for debugging.
        self.previous_weights = None

    def forward(self, x):
        """
        x : tensor of shape (batch_size, seq_len, d_model)
        """
        # Multi-head attention
        attn_output = self.self_attention(x)
        if self.cubic:
            x = attn_output + self.fc(x)
        else:
            # x = self.layer_norm1(x + self.dropout(attn_output))
            # Residual connection followed by L2 normalisation instead of
            # LayerNorm (the commented lines are alternative experiments).
            x = F.normalize(x + self.dropout(attn_output), p=2, dim=-1)
            # x = F.normalize(x + attn_output, p=2, dim=-1)
            x = self.fc(x)

        # Feed-forward network — disabled; the identity is used instead.
        # ff_output = self.fc_2(self.activation(self.fc_1(x)))
        ff_output = x
        # x = self.layer_norm2(x + self.dropout(ff_output))
        x = F.normalize(x + self.dropout(ff_output), p=2, dim=-1)
        # x = F.normalize(x + ff_output, p=2, dim=-1)
        # self.check_weights()
        return x

    def check_weights(self):
        """Debug helper: snapshot the parameters and (optionally) compare
        them with the previous snapshot after clamping/rounding.  The
        comparison assert is commented out, so currently this only refreshes
        the snapshot."""
        current_weights = {name: param.clone() for name, param in self.named_parameters()}

        if self.previous_weights is not None:
            for name, param in current_weights.items():
                # Coarse quantisation so tiny gradient updates compare equal.
                param = torch.round(torch.clamp(param, -1, 2))
                self.previous_weights[name] = torch.round(torch.clamp(self.previous_weights[name], -1, 2))
                pass
                # assert torch.equal(param, self.previous_weights[name]), f"The weights for {name} have changed."

        self.previous_weights = current_weights
|
|
@@ -0,0 +1,35 @@
|
|
|
1
|
+
import torch.nn as nn
|
|
2
|
+
from n_diagonal import NDiagonalLayer
|
|
3
|
+
from lora import LoRALayer
|
|
4
|
+
from linear import LinearLayer
|
|
5
|
+
|
|
6
|
+
|
|
7
|
+
class DeepNetwork(nn.Module):
    """Residual MLP whose linear layers use a swappable parameterisation.

    layer_type selects the weight structure of every layer ("n_diagonal",
    "LoRA" or "fully_connected").  Each layer is followed by a ReLU, a
    residual connection, and a LayerNorm.
    """

    def __init__(
        self, layer_type="fully_connected", dim=10, depth=1, rank=1, bias=True
    ):
        super(DeepNetwork, self).__init__()
        self.layers = nn.ModuleList()
        self.norms = nn.ModuleList()
        self.activation = nn.ReLU()

        for _ in range(depth):
            # Pick the layer parameterisation under comparison.
            if layer_type == "n_diagonal":
                new_layer = NDiagonalLayer(dim, rank, bias)
            elif layer_type == "LoRA":
                new_layer = LoRALayer(dim, rank, bias)
            elif layer_type == "fully_connected":
                new_layer = LinearLayer(dim, bias)
            else:
                raise ValueError(f"Type de couche non supporté : {layer_type}")
            self.layers.append(new_layer)
            self.norms.append(nn.LayerNorm(dim))

    def forward(self, x):
        """Apply each (layer -> ReLU -> residual -> LayerNorm) block in turn."""
        out = x
        for block, norm in zip(self.layers, self.norms):
            out = norm(self.activation(block(out)) + out)
        return out
|
|
@@ -0,0 +1,24 @@
|
|
|
1
|
+
import torch
|
|
2
|
+
import torch.nn as nn
|
|
3
|
+
from math import sqrt
|
|
4
|
+
|
|
5
|
+
|
|
6
|
+
class LinearLayer(nn.Module):
    """Plain dense layer (y = x @ W + b) with an explicit weight matrix.

    The weight uses a Kaiming-style initialisation: W ~ N(0, 2/dim).
    """

    def __init__(self, dim, bias=True):
        super(LinearLayer, self).__init__()
        self.dim = dim

        std = sqrt(2 / dim)
        # std = 0.0001

        # Full (dim x dim) weight matrix.
        self.weight = nn.Parameter(torch.randn(dim, dim) * std)

        # Optional bias vector.
        self.bias = nn.Parameter(torch.randn(dim)) if bias else None

    def forward(self, x):
        """Return x @ W (+ bias when enabled)."""
        y = x.matmul(self.weight)
        return y if self.bias is None else y + self.bias
|
|
@@ -0,0 +1,27 @@
|
|
|
1
|
+
import torch
|
|
2
|
+
import torch.nn as nn
|
|
3
|
+
from math import sqrt
|
|
4
|
+
|
|
5
|
+
|
|
6
|
+
class LoRALayer(nn.Module):
    """Low-rank linear layer: y = x @ A @ B (+ bias).

    A is (dim, rank) and B is (rank, dim), so the effective weight matrix
    has rank at most `rank`.
    """

    def __init__(self, dim, rank, bias=True):
        super(LoRALayer, self).__init__()
        self.dim = dim
        self.rank = rank

        # std = sqrt(2/rank)
        std = 0.0001

        # Low-rank factor pair.
        self.down_proj = nn.Parameter(torch.randn(dim, rank) * std)
        self.up_proj = nn.Parameter(torch.randn(rank, dim) * std)

        # Optional bias vector.
        self.bias = nn.Parameter(torch.randn(dim)) if bias else None

    def forward(self, x):
        """Project down to `rank` dimensions, back up, then add the bias."""
        reduced = x @ self.down_proj
        out = reduced @ self.up_proj
        if self.bias is not None:
            out = out + self.bias
        return out
|
|
@@ -0,0 +1,41 @@
|
|
|
1
|
+
import torch
|
|
2
|
+
import torch.nn as nn
|
|
3
|
+
from math import sqrt
|
|
4
|
+
|
|
5
|
+
|
|
6
|
+
class NDiagonalLayer(nn.Module):
    """Linear layer whose weight matrix is restricted to a band.

    The (dim x dim) weight matrix carries the main diagonal plus `rank - 1`
    sub- and super-diagonals (2*rank - 1 diagonals in total).

    Bug fix: the off-diagonal weights were previously stored in plain Python
    lists, so they were invisible to ``parameters()`` — never trained by the
    optimizer and never moved by ``.to()``.  They are now registered through
    ``nn.ParameterList``.
    """

    def __init__(self, dim, rank, bias=True):
        super(NDiagonalLayer, self).__init__()
        assert rank > 0, "Le nombre de diagonales doit être positif."

        self.dim = dim
        self.rank = rank

        # std = sqrt(2/rank)
        std = 0.0001

        # Main diagonal plus rank-1 lower/upper diagonals; the i-th
        # off-diagonal has dim - (i + 1) entries.
        self.diagonal_weights = nn.Parameter(torch.randn(dim) * std)
        self.lower_weights = nn.ParameterList(
            [nn.Parameter(torch.randn(dim - (i + 1)) * std) for i in range(rank - 1)]
        )
        self.upper_weights = nn.ParameterList(
            [nn.Parameter(torch.randn(dim - (i + 1)) * std) for i in range(rank - 1)]
        )

        # Optional bias vector.
        self.bias = nn.Parameter(torch.randn(dim)) if bias else None

    def forward(self, x):
        """Assemble the band matrix and return x @ W (+ bias when enabled)."""
        # Building the matrix from the parameters (instead of a fresh
        # torch.zeros, which was always allocated on the CPU) keeps it on
        # the parameters' device and dtype.
        weights = torch.diag(self.diagonal_weights)
        for offset, (lower, upper) in enumerate(
            zip(self.lower_weights, self.upper_weights), start=1
        ):
            weights = weights + torch.diag(lower, diagonal=-offset)
            weights = weights + torch.diag(upper, diagonal=offset)

        out = x @ weights
        if self.bias is not None:
            out = out + self.bias
        return out
|
|
@@ -0,0 +1,90 @@
|
|
|
1
|
+
import sys
import os

# Make the repository root importable when running this file directly.
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "..", "..")))

from expressivity.probabilistic_density import ArchitectureComparator
from expressivity.space import ArchitecturalSpace
from tests.n_diagonal.deep_nn import DeepNetwork

"""
In this example we are comparing the reduction of parameter count when using the usual LoRA and an n-diagonal matrix instead.
As the count of parameters does not scale the same way, we compare both architectures to an architecture with fully connected layers.
"""

# Create competing architectures: one configuration per rank in [1, 5].
n_diagonal_params = [
    {
        "layer_type": "n_diagonal",
        "dim": 5,
        "depth": 4,
        "rank": i + 1,
        "bias": True,
    }
    for i in range(5)
]

lora_params = [
    {
        "layer_type": "LoRA",
        "dim": 5,
        "depth": 4,
        "rank": i + 1,
        "bias": True,
    }
    for i in range(5)
]

# The fully connected baseline has no rank; the same configuration is
# repeated so all three spaces have matching lengths.
fully_connected_params = [
    {
        "layer_type": "fully_connected",
        "dim": 5,
        "depth": 4,
        "bias": True,
    }
    for _ in range(5)
]


def compute_params(dim, depth, rank, bias):
    # Learnable-parameter count of a depth-layer band network.  Note that
    # `rank` here counts the sub/super-diagonal PAIRS, i.e. it equals the
    # layer's rank minus one: per layer there are dim (main diagonal)
    # + 2 * sum_{i=1..rank}(dim - i) off-diagonal weights + dim bias terms.
    return depth * (dim + 2 * dim * rank - rank * (rank + 1) + dim * bias)


# Create architectural spaces
n_diagonal_space = ArchitecturalSpace(
    (5,),
    "N-Diagonal",
    DeepNetwork,
    n_diagonal_params,
    epoch=10,
    lr=0.01,
    automatic_mesurement_mode=None,
    # Manual measurement: compute_params(5, 4, i, True) with i = rank - 1.
    mesurement=[compute_params(5, 4, i, True) for i in range(5)],
)

lora_space = ArchitecturalSpace(
    (5,),
    "LoRA",
    DeepNetwork,
    lora_params,
    epoch=10,
    lr=0.01,
    automatic_mesurement_mode="parameters",
)

fully_connected_space = ArchitecturalSpace(
    (5,),
    "Fully Connected",
    DeepNetwork,
    fully_connected_params,
    epoch=10,
    lr=0.01,
    automatic_mesurement_mode="parameters",
)

# Create comparator
comparator = ArchitectureComparator(n_diagonal_space, lora_space, fully_connected_space)
# comparator = ArchitectureComparator(lora_space, n_diagonal_space, fully_connected_space)
res = comparator.compare(100, 5)
print(res)
comparator.plot("min")
|
|
@@ -0,0 +1,57 @@
|
|
|
1
|
+
import torch
|
|
2
|
+
import torch.nn as nn
|
|
3
|
+
import torch.nn.functional as F
|
|
4
|
+
import math
|
|
5
|
+
|
|
6
|
+
|
|
7
|
+
class MultiHeadAttention(nn.Module):
    """Standard multi-head self-attention with a fused QKV projection.

    With split=True an activation (tanh) is applied to the queries and keys
    before their dot product.
    """

    def __init__(self, d_model, n_heads, split=False):
        super(MultiHeadAttention, self).__init__()
        assert d_model % n_heads == 0, "d_model doit être divisible par n_heads"
        self.d_model = d_model
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads

        # Fused linear projection producing Q, K and V in a single pass.
        self.qkv_proj = nn.Linear(d_model, 3 * d_model, False)
        self.fc_out = nn.Linear(d_model, d_model)
        self.scale = math.sqrt(self.head_dim)
        self.split = split

        if split:
            # self.activation = nn.ReLU()
            self.activation = nn.Tanh()

    def forward(self, query, key, value):
        """
        query, key, value : tensors of shape (batch_size, seq_len, d_model)

        NOTE(review): only `query` is actually used — Q, K and V are all
        derived from it via the fused projection, so this layer performs
        self-attention regardless of `key`/`value`.  Confirm before using
        it for cross-attention.
        """
        batch_size, seq_len, _ = query.size()

        # Project once, split into Q/K/V, and give the heads their own axis.
        def to_heads(t):
            return t.view(batch_size, -1, self.n_heads, self.head_dim).transpose(1, 2)

        query, key, value = [to_heads(t) for t in self.qkv_proj(query).chunk(3, dim=-1)]

        if self.split:
            query = self.activation(query)
            key = self.activation(key)

        # Scaled dot-product attention.
        weights = F.softmax(query.matmul(key.transpose(-2, -1)) / self.scale, dim=-1)
        context = weights.matmul(value)

        # Merge the heads back into the model dimension.
        context = (
            context.transpose(1, 2)
            .contiguous()
            .view(batch_size, seq_len, self.d_model)
        )

        # Final output projection.
        return self.fc_out(context)
|
|
@@ -0,0 +1,50 @@
|
|
|
1
|
+
import sys
import os

# Make the repository root importable when running this file directly.
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "..")))

from expressivity.probabilistic_density import ArchitectureComparator
from expressivity.space import ArchitecturalSpace
from tests.split_transformer.transformer import Transformer

"""
In this example we are comparing 2 transformers with the same general architecture.
The only difference will be that one will apply an activation function before multiplying Q and K together.
"""

# Create competing architectures: one configuration per depth in [1, 4].
transformer_params = [
    {
        "d_model": 6,
        "n_heads": 3,
        "d_ff": 6,
        "num_layers": i + 1,
    }
    for i in range(4)
]

split_transformer_params = [
    {
        "d_model": 6,
        "n_heads": 3,
        "d_ff": 6,
        "num_layers": i + 1,
        "split": True,  # apply an activation to Q and K before their product
    }
    for i in range(4)
]

# Create architectural spaces; inputs are (seq_len=5, d_model=6) tensors.
transformer_space = ArchitecturalSpace(
    (5, 6), "transformers", Transformer, transformer_params
)

split_transformer_space = ArchitecturalSpace(
    (5, 6), "split_transformers", Transformer, split_transformer_params
)

# Create comparator and run the A/B comparison between the two spaces.
comparator = ArchitectureComparator(transformer_space, split_transformer_space)
res = comparator.compare()
print(res)
comparator.plot("min")
|
|
@@ -0,0 +1,50 @@
|
|
|
1
|
+
import torch.nn as nn
|
|
2
|
+
from tests.split_transformer.attention import MultiHeadAttention
|
|
3
|
+
|
|
4
|
+
|
|
5
|
+
class Transformer(nn.Module):
    """Encoder-only transformer: `num_layers` identical encoder blocks.

    `split` is forwarded to each layer's attention module (activation on Q/K).
    """

    def __init__(self, d_model, n_heads, d_ff, num_layers, dropout=0.1, split=False):
        super(Transformer, self).__init__()
        self.d_model = d_model

        # Encoder stack.
        blocks = [
            TransformerEncoderLayer(d_model, n_heads, d_ff, dropout, split)
            for _ in range(num_layers)
        ]
        self.encoder_layers = nn.ModuleList(blocks)

    def forward(self, x):
        """
        x : tensor of shape (batch_size, seq_len, d_model)
        """
        out = x
        for encoder in self.encoder_layers:
            out = encoder(out)
        return out
|
|
25
|
+
|
|
26
|
+
|
|
27
|
+
class TransformerEncoderLayer(nn.Module):
    """Post-norm transformer encoder block.

    Self-attention then a position-wise feed-forward network, each wrapped
    with dropout, a residual connection and a LayerNorm.
    """

    def __init__(self, d_model, n_heads, d_ff, dropout=0.1, split=False):
        super(TransformerEncoderLayer, self).__init__()
        self.self_attention = MultiHeadAttention(d_model, n_heads, split)
        # Position-wise feed-forward network.
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.layer_norm1 = nn.LayerNorm(d_model)
        self.layer_norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        """
        x : tensor of shape (batch_size, seq_len, d_model)
        """
        # Self-attention sub-layer (dropout + residual + post-norm).
        x = self.layer_norm1(x + self.dropout(self.self_attention(x, x, x)))

        # Feed-forward sub-layer (dropout + residual + post-norm).
        return self.layer_norm2(x + self.dropout(self.feed_forward(x)))
|