PyPI - interpreto - Versions diffs - 0.5.0.dev0__tar.gz → 0.5.0.dev1__tar.gz - Mend

interpreto 0.5.0.dev0__tar.gz → 0.5.0.dev1__tar.gz


Files changed (135)

{interpreto-0.5.0.dev0 → interpreto-0.5.0.dev1}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: interpreto
-Version: 0.5.0.dev0
+Version: 0.5.0.dev1
 Summary: Interpretability toolbox for LLMs
 Author: FOR Team
 Author-email: fanny.jourdan@irt-saintexupery.com
@@ -57,7 +57,7 @@ License-File: LICENSE
 Requires-Dist: transformers>=4.22.0
 Requires-Dist: nltk
 Requires-Dist: torch>=2.0
-Requires-Dist: nnsight<0.6.0,>=0.5.1
+Requires-Dist: nnsight<0.8.0,>=0.7.0
 Requires-Dist: jaxtyping<=0.2.36
 Requires-Dist: beartype
 Requires-Dist: mknotebooks
@@ -120,6 +120,7 @@ Dynamic: license-file
 <p align="center">
   <a href="https://for-sight-ai.github.io/interpreto/"><strong>📚 Explore Interpreto docs &gt;&gt;</strong></a><br />
   <a href="https://for-sight-ai.github.io/interpreto-demo/"><strong>🖼️ Checkout our explanation gallery &gt;&gt;</strong></a>
+  <a href="https://arxiv.org/abs/2512.09730"><strong>📜 Read our paper &gt;&gt;</strong></a>
 </p>
 ## 🚀 Quick Start
@@ -163,41 +164,51 @@ They all work seamlessly for both classification (`...ForSequenceClassification`
 Concept-based explanations aim to provide high-level interpretations of latent model representations.
-Interpreto generalizes these methods through four core steps:
+We propose both supervised (probes and CAVs) and unsupervised (dictionary learning) approaches.
+Interpreto generalizes these methods through four core steps, the two first are common between both approaches:
 1. Split a model in two and obtain a dataset of activations
-2. Concept Discovery (e.g., from latent embeddings)
-3. Concept Interpretation (mapping discovered concepts to human-understandable elements)
-4. Concept-to-Output Attribution (assessing concept relevance to model outputs)
+2. Learn concepts (e.g., from latent embeddings)
+3. Interpret concepts (mapping discovered concepts to human-understandable elements)
+4. Estimate concepts importance (assessing concept relevance to model outputs)
 **1. Split a model in two and obtain a dataset of activations:** (mainly via [`nnsight`](https://github.com/ndif-team/nnsight)):
 Choose any layer in any HuggingFace language model with our `ModelWithSplitPoints` based on `nnsight`. Then pass a dataset through it to obtain a dataset of activations.
-**2. Dictionary Learning for Concept Discovery** (mainly via [`overcomplete`](https://github.com/KempnerInstitute/overcomplete)):
+**2. (supervised) Train probe** with the [`ProbeExplainer`](https://for-sight-ai.github.io/interpreto/api/concepts/probes/)
+We differentiate two families of probes:
+- Linear probes: [`LinearRegressionProbe`](https://for-sight-ai.github.io/interpreto/api/concepts/probes/#interpreto.concepts.probes.LinearRegressionProbe), [`LogisticRegressionProbe`](https://for-sight-ai.github.io/interpreto/api/concepts/probes/#interpreto.concepts.probes.LogisticRegressionProbe), [`LinearSVMProbe`](https://for-sight-ai.github.io/interpreto/api/concepts/probes/#interpreto.concepts.probes.LinearSVMProbe), [`MeansDiffProbe`](https://for-sight-ai.github.io/interpreto/api/concepts/probes/#interpreto.concepts.probes.MeansDiffProbe)
+- Centroid-based probes: [`CosineCentroidProbe`](https://for-sight-ai.github.io/interpreto/api/concepts/probes/#interpreto.concepts.probes.CosineCentroidProbe), [`DotProductCentroidProbe`](https://for-sight-ai.github.io/interpreto/api/concepts/probes/#interpreto.concepts.probes.DotProductCentroidProbe), [`SqL2CentroidProbe`](https://for-sight-ai.github.io/interpreto/api/concepts/probes/#interpreto.concepts.probes.SqL2CentroidProbe), [`SVDDCentroidProbe`](https://for-sight-ai.github.io/interpreto/api/concepts/probes/#interpreto.concepts.probes.SVDDCentroidProbe), [`DiagonalMahalanobisCentroidProbe`](https://for-sight-ai.github.io/interpreto/api/concepts/probes/#interpreto.concepts.probes.DiagonalMahalanobisCentroidProbe)
+Both can be tuned with `bias_calibrator` and `normalization` parameters.
+**2. (unsupervised) Dictionary Learning for Concept Discovery** (mainly via [`overcomplete`](https://github.com/KempnerInstitute/overcomplete)):
 - Interpret neurons directly via [`NeuronsAsConcepts`](https://for-sight-ai.github.io/interpreto/api/concepts/methods/neurons_as_concepts/)
 - [`NMF`](https://for-sight-ai.github.io/interpreto/api/concepts/methods/optim/#interpreto.concepts.NMFConcepts), [`Semi-NMF`](https://for-sight-ai.github.io/interpreto/api/concepts/methods/optim/#interpreto.concepts.SemiNMFConcepts), [`ConvexNMF`](https://for-sight-ai.github.io/interpreto/api/concepts/methods/optim/#interpreto.concepts.ConvexNMFConcepts)
 - [`ICA`](https://for-sight-ai.github.io/interpreto/api/concepts/methods/optim/#interpreto.concepts.ICAConcepts), [`SVD`](https://for-sight-ai.github.io/interpreto/api/concepts/methods/optim/#interpreto.concepts.SVDConcepts), [`PCA`](https://for-sight-ai.github.io/interpreto/api/concepts/methods/optim/#interpreto.concepts.PCAConcepts), [`KMeans`](https://for-sight-ai.github.io/interpreto/api/concepts/methods/optim/#interpreto.concepts.KMeansConcepts)
 - SAE variants: [`Vanilla SAE`](https://for-sight-ai.github.io/interpreto/api/concepts/methods/sae/#interpreto.concepts.VanillaSAEConcepts), [`TopK SAE`](https://for-sight-ai.github.io/interpreto/api/concepts/methods/sae/#interpreto.concepts.TopKSAEConcepts), [`JumpReLU SAE`](https://for-sight-ai.github.io/interpreto/api/concepts/methods/sae/#interpreto.concepts.JumpReLUSAEConcepts), [`BatchTopK SAE`](https://for-sight-ai.github.io/interpreto/api/concepts/methods/sae/#interpreto.concepts.BatchTopKSAEConcepts)
-**3. Available Concept Interpretation Techniques:**
+**3. (unsupervised) Available Concept Interpretation Techniques:**
 - Top-k tokens from tokenizer vocabulary via [`TopKInputs`](https://for-sight-ai.github.io/interpreto/api/concepts/concepts_interpretations/#interpreto.concepts.interpretations.TopKInputs) and `use_vocab=True`
 - Top-k tokens/words/sentences/samples from specific datasets via [`TopKInputs`](https://for-sight-ai.github.io/interpreto/api/concepts/concepts_interpretations/#interpreto.concepts.interpretations.TopKInputs)
 - Label concepts via LLMs with [`LLMLabels`](https://for-sight-ai.github.io/interpreto/api/concepts/concepts_interpretations/#interpreto.concepts.interpretations.LLMLabels) ([Bills et al. 2023](https://openai.com/index/language-models-can-explain-neurons-in-language-models/))
+- Input-to-concept attribution from dataset examples ([Concept Attributions](https://for-sight-ai.github.io/interpreto/api/concepts/interpretations/concept_attributions/)) ([Jourdan et al. 2023](https://aclanthology.org/2023.findings-acl.317/))
 <details><summary>Concept Interpretation Techniques Added in the future:</summary>
-- Input-to-concept attribution from dataset examples ([Jourdan et al. 2023](https://aclanthology.org/2023.findings-acl.317/))
-- Theme prediction via LLMs from top-k tokens/sentences
 - Aligning concepts with human labels ([Sajjad et al. 2022](https://aclanthology.org/2022.naacl-main.225/))
 - Word cloud visualizations of concepts ([Dalvi et al. 2022](https://arxiv.org/abs/2205.07237))
 - VocabProj & TokenChange ([Gur-Arieh et al. 2025](https://arxiv.org/abs/2501.08319))
 </details>
-**4. Concept-to-Output Attribution:**
+**4. (unsupervised) Concept-to-Output Attribution:**
 Estimate the contribution of each concept to the model output.
@@ -207,7 +218,6 @@ Can be obtained with any concept-based explainer via [`MethodConcepts.concept_ou
 Thanks to this generalization encompassing all concept-based methods and our highly flexible architecture, we can easily obtain a large number of concept-based methods:
-- CAV and TCAV: [Kim et al. 2018, Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)](http://proceedings.mlr.press/v80/kim18d.html)
 - ConceptSHAP: [Yeh et al. 2020, On Completeness-aware Concept-Based Explanations in Deep Neural Networks](https://proceedings.neurips.cc/paper/2020/hash/ecb287ff763c169694f682af52c1f309-Abstract.html)
 - COCKATIEL: [Jourdan et al. 2023, COCKATIEL: COntinuous Concept ranKed ATtribution with Interpretable ELements for explaining neural net classifiers on NLP](https://aclanthology.org/2023.findings-acl.317/)
 - Yun et al. 2021, [Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors](https://arxiv.org/abs/2103.15949)
@@ -264,7 +274,7 @@ Interpreto 🪄 is a project of the [FOR](https://www.irt-saintexupery.com/fr/fo
 ## 🗞️ Citation
-If you use Interpreto 🪄 as part of your workflow in a scientific publication, please consider citing 🗞️ our paper:
+If you use Interpreto 🪄 as part of your workflow in a scientific publication, please consider citing 🗞️ [our paper](https://arxiv.org/abs/2512.09730):
 ```bibtex
 @article{poche2025interpreto,

{interpreto-0.5.0.dev0 → interpreto-0.5.0.dev1}/README.md RENAMED

@@ -15,6 +15,7 @@
 <p align="center">
   <a href="https://for-sight-ai.github.io/interpreto/"><strong>📚 Explore Interpreto docs &gt;&gt;</strong></a><br />
   <a href="https://for-sight-ai.github.io/interpreto-demo/"><strong>🖼️ Checkout our explanation gallery &gt;&gt;</strong></a>
+  <a href="https://arxiv.org/abs/2512.09730"><strong>📜 Read our paper &gt;&gt;</strong></a>
 </p>
 ## 🚀 Quick Start
@@ -58,41 +59,51 @@ They all work seamlessly for both classification (`...ForSequenceClassification`
 Concept-based explanations aim to provide high-level interpretations of latent model representations.
-Interpreto generalizes these methods through four core steps:
+We propose both supervised (probes and CAVs) and unsupervised (dictionary learning) approaches.
+Interpreto generalizes these methods through four core steps, the two first are common between both approaches:
 1. Split a model in two and obtain a dataset of activations
-2. Concept Discovery (e.g., from latent embeddings)
-3. Concept Interpretation (mapping discovered concepts to human-understandable elements)
-4. Concept-to-Output Attribution (assessing concept relevance to model outputs)
+2. Learn concepts (e.g., from latent embeddings)
+3. Interpret concepts (mapping discovered concepts to human-understandable elements)
+4. Estimate concepts importance (assessing concept relevance to model outputs)
 **1. Split a model in two and obtain a dataset of activations:** (mainly via [`nnsight`](https://github.com/ndif-team/nnsight)):
 Choose any layer in any HuggingFace language model with our `ModelWithSplitPoints` based on `nnsight`. Then pass a dataset through it to obtain a dataset of activations.
-**2. Dictionary Learning for Concept Discovery** (mainly via [`overcomplete`](https://github.com/KempnerInstitute/overcomplete)):
+**2. (supervised) Train probe** with the [`ProbeExplainer`](https://for-sight-ai.github.io/interpreto/api/concepts/probes/)
+We differentiate two families of probes:
+- Linear probes: [`LinearRegressionProbe`](https://for-sight-ai.github.io/interpreto/api/concepts/probes/#interpreto.concepts.probes.LinearRegressionProbe), [`LogisticRegressionProbe`](https://for-sight-ai.github.io/interpreto/api/concepts/probes/#interpreto.concepts.probes.LogisticRegressionProbe), [`LinearSVMProbe`](https://for-sight-ai.github.io/interpreto/api/concepts/probes/#interpreto.concepts.probes.LinearSVMProbe), [`MeansDiffProbe`](https://for-sight-ai.github.io/interpreto/api/concepts/probes/#interpreto.concepts.probes.MeansDiffProbe)
+- Centroid-based probes: [`CosineCentroidProbe`](https://for-sight-ai.github.io/interpreto/api/concepts/probes/#interpreto.concepts.probes.CosineCentroidProbe), [`DotProductCentroidProbe`](https://for-sight-ai.github.io/interpreto/api/concepts/probes/#interpreto.concepts.probes.DotProductCentroidProbe), [`SqL2CentroidProbe`](https://for-sight-ai.github.io/interpreto/api/concepts/probes/#interpreto.concepts.probes.SqL2CentroidProbe), [`SVDDCentroidProbe`](https://for-sight-ai.github.io/interpreto/api/concepts/probes/#interpreto.concepts.probes.SVDDCentroidProbe), [`DiagonalMahalanobisCentroidProbe`](https://for-sight-ai.github.io/interpreto/api/concepts/probes/#interpreto.concepts.probes.DiagonalMahalanobisCentroidProbe)
+Both can be tuned with `bias_calibrator` and `normalization` parameters.
+**2. (unsupervised) Dictionary Learning for Concept Discovery** (mainly via [`overcomplete`](https://github.com/KempnerInstitute/overcomplete)):
 - Interpret neurons directly via [`NeuronsAsConcepts`](https://for-sight-ai.github.io/interpreto/api/concepts/methods/neurons_as_concepts/)
 - [`NMF`](https://for-sight-ai.github.io/interpreto/api/concepts/methods/optim/#interpreto.concepts.NMFConcepts), [`Semi-NMF`](https://for-sight-ai.github.io/interpreto/api/concepts/methods/optim/#interpreto.concepts.SemiNMFConcepts), [`ConvexNMF`](https://for-sight-ai.github.io/interpreto/api/concepts/methods/optim/#interpreto.concepts.ConvexNMFConcepts)
 - [`ICA`](https://for-sight-ai.github.io/interpreto/api/concepts/methods/optim/#interpreto.concepts.ICAConcepts), [`SVD`](https://for-sight-ai.github.io/interpreto/api/concepts/methods/optim/#interpreto.concepts.SVDConcepts), [`PCA`](https://for-sight-ai.github.io/interpreto/api/concepts/methods/optim/#interpreto.concepts.PCAConcepts), [`KMeans`](https://for-sight-ai.github.io/interpreto/api/concepts/methods/optim/#interpreto.concepts.KMeansConcepts)
 - SAE variants: [`Vanilla SAE`](https://for-sight-ai.github.io/interpreto/api/concepts/methods/sae/#interpreto.concepts.VanillaSAEConcepts), [`TopK SAE`](https://for-sight-ai.github.io/interpreto/api/concepts/methods/sae/#interpreto.concepts.TopKSAEConcepts), [`JumpReLU SAE`](https://for-sight-ai.github.io/interpreto/api/concepts/methods/sae/#interpreto.concepts.JumpReLUSAEConcepts), [`BatchTopK SAE`](https://for-sight-ai.github.io/interpreto/api/concepts/methods/sae/#interpreto.concepts.BatchTopKSAEConcepts)
-**3. Available Concept Interpretation Techniques:**
+**3. (unsupervised) Available Concept Interpretation Techniques:**
 - Top-k tokens from tokenizer vocabulary via [`TopKInputs`](https://for-sight-ai.github.io/interpreto/api/concepts/concepts_interpretations/#interpreto.concepts.interpretations.TopKInputs) and `use_vocab=True`
 - Top-k tokens/words/sentences/samples from specific datasets via [`TopKInputs`](https://for-sight-ai.github.io/interpreto/api/concepts/concepts_interpretations/#interpreto.concepts.interpretations.TopKInputs)
 - Label concepts via LLMs with [`LLMLabels`](https://for-sight-ai.github.io/interpreto/api/concepts/concepts_interpretations/#interpreto.concepts.interpretations.LLMLabels) ([Bills et al. 2023](https://openai.com/index/language-models-can-explain-neurons-in-language-models/))
+- Input-to-concept attribution from dataset examples ([Concept Attributions](https://for-sight-ai.github.io/interpreto/api/concepts/interpretations/concept_attributions/)) ([Jourdan et al. 2023](https://aclanthology.org/2023.findings-acl.317/))
 <details><summary>Concept Interpretation Techniques Added in the future:</summary>
-- Input-to-concept attribution from dataset examples ([Jourdan et al. 2023](https://aclanthology.org/2023.findings-acl.317/))
-- Theme prediction via LLMs from top-k tokens/sentences
 - Aligning concepts with human labels ([Sajjad et al. 2022](https://aclanthology.org/2022.naacl-main.225/))
 - Word cloud visualizations of concepts ([Dalvi et al. 2022](https://arxiv.org/abs/2205.07237))
 - VocabProj & TokenChange ([Gur-Arieh et al. 2025](https://arxiv.org/abs/2501.08319))
 </details>
-**4. Concept-to-Output Attribution:**
+**4. (unsupervised) Concept-to-Output Attribution:**
 Estimate the contribution of each concept to the model output.
@@ -102,7 +113,6 @@ Can be obtained with any concept-based explainer via [`MethodConcepts.concept_ou
 Thanks to this generalization encompassing all concept-based methods and our highly flexible architecture, we can easily obtain a large number of concept-based methods:
-- CAV and TCAV: [Kim et al. 2018, Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)](http://proceedings.mlr.press/v80/kim18d.html)
 - ConceptSHAP: [Yeh et al. 2020, On Completeness-aware Concept-Based Explanations in Deep Neural Networks](https://proceedings.neurips.cc/paper/2020/hash/ecb287ff763c169694f682af52c1f309-Abstract.html)
 - COCKATIEL: [Jourdan et al. 2023, COCKATIEL: COntinuous Concept ranKed ATtribution with Interpretable ELements for explaining neural net classifiers on NLP](https://aclanthology.org/2023.findings-acl.317/)
 - Yun et al. 2021, [Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors](https://arxiv.org/abs/2103.15949)
@@ -159,7 +169,7 @@ Interpreto 🪄 is a project of the [FOR](https://www.irt-saintexupery.com/fr/fo
 ## 🗞️ Citation
-If you use Interpreto 🪄 as part of your workflow in a scientific publication, please consider citing 🗞️ our paper:
+If you use Interpreto 🪄 as part of your workflow in a scientific publication, please consider citing 🗞️ [our paper](https://arxiv.org/abs/2512.09730):
 ```bibtex
 @article{poche2025interpreto,

{interpreto-0.5.0.dev0 → interpreto-0.5.0.dev1}/interpreto/__init__.py RENAMED

@@ -42,7 +42,7 @@ from .attributions import (
 from .commons import (
     Granularity,
 )
-from .model_wrapping import ModelWithSplitPoints
+from .model_wrapping import ModelWithSplitPoints, SplitSequenceClassification
 from .visualizations import (
     AttributionVisualization,
     plot_attributions,
@@ -71,6 +71,7 @@ __all__ = [
     "SquareGrad",
     "Saliency",
     "SmoothGrad",
+    "SplitSequenceClassification",
     "Sobol",
     "VarGrad",
     "get_version",

interpreto-0.5.0.dev1/interpreto/_vendor/overcomplete/base.pyi ADDED

@@ -0,0 +1,16 @@
+from abc import ABC
+import torch
+from torch import nn
+class BaseDictionaryLearning(ABC, nn.Module):
+    nb_concepts: int
+    device: str | torch.device
+    fitted: bool
+    def __init__(self, nb_concepts: int, device: str | torch.device = "cpu") -> None: ...
+    def encode(self, x: torch.Tensor) -> torch.Tensor: ...
+    def decode(self, z: torch.Tensor) -> torch.Tensor: ...
+    def fit(self, x: torch.Tensor) -> None: ...
+    def get_dictionary(self) -> torch.Tensor: ...
+    def to(self, device: str | torch.device) -> "BaseDictionaryLearning": ...  # type: ignore[override]

{interpreto-0.5.0.dev0 → interpreto-0.5.0.dev1}/interpreto/attributions/base.py RENAMED

@@ -45,13 +45,17 @@ from interpreto.attributions.perturbations.base import Perturbator
 from interpreto.commons import Granularity
 from interpreto.commons.generator_tools import split_iterator
 from interpreto.commons.granularity import GranularityAggregationStrategy
+from interpreto.concepts.base import ModelForInputsToConcepts
 from interpreto.model_wrapping.classification_inference_wrapper import ClassificationInferenceWrapper
 from interpreto.model_wrapping.generation_inference_wrapper import GenerationInferenceWrapper
 from interpreto.model_wrapping.inference_wrapper import InferenceModes, InferenceWrapper
+from interpreto.model_wrapping.inputs_to_concepts_inference_wrapper import InputsToConceptsInferenceWrapper
 from interpreto.typing import ClassificationTarget, GeneratedTarget, ModelInputs, SingleAttribution, TensorMapping
-def setup_token_ids(model: PreTrainedModel, tokenizer: PreTrainedTokenizer, require_mask_token: bool = True) -> int:
+def setup_token_ids(
+    model: PreTrainedModel | ModelForInputsToConcepts, tokenizer: PreTrainedTokenizer, require_mask_token: bool = True
+) -> int:
     """
     Setup the tokenizer and the model with the appropriate token IDs, for padding and masking.
@@ -139,6 +143,7 @@ class ModelTask(Enum):
     CLASSIFICATION = "classification"
     GENERATION = "generation"
+    CONCEPTS = "concepts"
 def clone_tensor_mapping(tm: TensorMapping, detach: bool = False) -> TensorMapping:
@@ -232,7 +237,7 @@ class AttributionExplainer:
     def __init__(
         self,
-        model: PreTrainedModel,
+        model: PreTrainedModel | ModelForInputsToConcepts,
         tokenizer: PreTrainedTokenizer,
         batch_size: int = 4,
         perturbator: Perturbator | None = None,
@@ -248,7 +253,7 @@ class AttributionExplainer:
         Initializes the AttributionExplainer.
         Args:
-            model (PreTrainedModel): The model to be explained.
+            model (PreTrainedModel | ModelForInputsToConcepts): The model to be explained.
             tokenizer (PreTrainedTokenizer): The tokenizer associated with the model.
             batch_size (int): The batch size used for model inference.
             perturbator (Perturbator, optional): Instance used to generate input perturbations.
@@ -276,7 +281,7 @@ class AttributionExplainer:
         self.tokenizer = tokenizer
         self.inference_wrapper = self._associated_inference_wrapper(
-            model,
+            model,  # type: ignore
             gradients=use_gradient,
             input_x_gradient=input_x_gradient,
             batch_size=batch_size,
@@ -846,13 +851,99 @@ class GenerationAttributionExplainer(AttributionExplainer):
         return ModelTask.GENERATION, contribution
+class InputsToConceptsAttributionsExplainer(AttributionExplainer):
+    """Attribution explainer for input-to-concept models.
+    This explainer computes how much each input token contributes to each concept
+    activation. It bridges the attribution framework with the concept framework:
+    once a concept explainer is fitted, its ``inputs_to_concepts`` property returns
+    a model that can be passed to any perturbation-based attribution method.
+    The result is a per-token attribution for each concept, revealing which parts
+    of the input are most responsible for activating a given concept.
+    Note:
+        Only perturbation-based methods (Lime, KernelShap, Occlusion, Sobol) are
+        supported. Gradient-based methods are incompatible because the
+        ``ModelForInputsToConcepts`` is based on `nnsight` and which make differentiation complex.
+    Example:
+        ```python
+        from interpreto import Occlusion, SplitSequenceClassification
+        from interpreto.concepts import SemiNMFConcepts
+        split_model = SplitSequenceClassification("model_id", device_map="cuda")
+        concept_explainer = SemiNMFConcepts(split_model, nb_concepts=20)
+        concept_explainer.fit(activations)
+        explainer = Occlusion(concept_explainer.inputs_to_concepts, split_model.tokenizer)
+        results = explainer.explain("Some input text.", targets=torch.arange(5))
+        ```
+    """
+    _associated_inference_wrapper = InputsToConceptsInferenceWrapper
+    inference_wrapper: InputsToConceptsInferenceWrapper
+    def process_inputs_to_explain_and_targets(  # type: ignore
+        self,
+        model_inputs: ModelInputs,
+        targets: Iterable[int] | None = None,
+    ) -> tuple[list[TensorMapping], list[Int[torch.Tensor, "t"]]]:
+        """
+        Processes the inputs and targets for explanation.
+        This method must be implemented by subclasses.
+        Args:
+            model_inputs (ModelInputs):
+                The inputs to the model.
+            targets (Optional[Iterable[int]]):
+                The targets to be explained.
+                If None, all concepts are explained.
+        Returns:
+            processed_inputs (list[TensorMapping]):
+                The processed inputs.
+            processed_targets (list[Int[torch.Tensor, "t"]]):
+                The processed targets.
+        """
+        sanitized_targets: list[Int[torch.Tensor, "t"]]
+        if targets is None:
+            # explain all concepts
+            input_wise_targets = torch.arange(self.inference_wrapper.model.nb_concepts)  # type: ignore
+            sanitized_targets = [input_wise_targets] * len(model_inputs)  # type: ignore
+        else:
+            # targets are concept indices, shared across all inputs
+            if isinstance(targets, torch.Tensor):
+                input_wise_targets = targets.long()
+            else:
+                input_wise_targets = torch.tensor(list(targets), dtype=torch.long)
+            sanitized_targets = [input_wise_targets] * len(model_inputs)  # type: ignore
+        return model_inputs, sanitized_targets  # type: ignore
+    def post_processing(self, contribution: Float[torch.Tensor, "t l"]):
+        """
+        Concepts specific post-processing of the attribution scores.
+        No post-processing is required for concept attributions.
+        Args:
+            contribution (Float[torch.Tensor, "t l"]): The contribution values.
+        Returns:
+            model_task (ModelTask): The model task.
+            contribution (Float[torch.Tensor, "t l"]): The post-processed contribution values.
+        """
+        return ModelTask.CONCEPTS, contribution
 class FactoryGeneratedMeta(type):
     """
     Metaclass to distinguish classes generated by the MultitaskExplainerMixin.
     """
-class MultitaskExplainerMixin(AttributionExplainer):
+class MultitaskExplainerMixin:
     """
     Mixin class to generate the appropriate Explainer based on the model type.
     """
@@ -866,6 +957,11 @@ class MultitaskExplainerMixin(AttributionExplainer):
         if model.__class__.__name__.endswith("ForCausalLM") or model.__class__.__name__.endswith("LMHeadModel"):
             t = FactoryGeneratedMeta("Generation" + cls.__name__, (cls, GenerationAttributionExplainer), {})
             return t.__new__(t, model, *args, **kwargs)  # type: ignore
+        if model.__class__.__name__.endswith("ForInputsToConcepts"):
+            t = FactoryGeneratedMeta(
+                "InputsToConcepts" + cls.__name__, (cls, InputsToConceptsAttributionsExplainer), {}
+            )
+            return t.__new__(t, model, *args, **kwargs)  # type: ignore
         raise NotImplementedError(
             "Model type not supported for Explainer. Use a ModelForSequenceClassification, a ModelForCausalLM model or a LMHeadModel model."
         )
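
The hunk above adds a third branch to `MultitaskExplainerMixin`: models whose class name ends in `ForInputsToConcepts` get a generated `InputsToConcepts...` explainer class. A minimal, self-contained sketch of that name-suffix dispatch follows. It is not the package's implementation: the explainer classes are stripped-down stand-ins, and the `_DISPATCH` table, the `task` attribute, the `ForSequenceClassification` branch, and `ToyForInputsToConcepts` are assumptions made for illustration.

```python
class AttributionExplainer:
    """Stand-in for the real base class (hypothetical, heavily simplified)."""

    task = "unknown"

    def __init__(self, model):
        self.model = model


class ClassificationAttributionExplainer(AttributionExplainer):
    task = "classification"


class GenerationAttributionExplainer(AttributionExplainer):
    task = "generation"


class InputsToConceptsAttributionsExplainer(AttributionExplainer):
    task = "concepts"


class FactoryGeneratedMeta(type):
    """Metaclass marking classes that the mixin generated itself."""


# (model class-name suffix, generated-name prefix, task-specific base class).
# The table form is an assumption; the diff shows a chain of `if` statements.
_DISPATCH = (
    ("ForSequenceClassification", "Classification", ClassificationAttributionExplainer),
    ("ForCausalLM", "Generation", GenerationAttributionExplainer),
    ("LMHeadModel", "Generation", GenerationAttributionExplainer),
    ("ForInputsToConcepts", "InputsToConcepts", InputsToConceptsAttributionsExplainer),
)


class MultitaskExplainerMixin:
    """Pick the task-specific explainer from the model's class name."""

    def __new__(cls, model, *args, **kwargs):
        # Generated classes are instantiated directly, without dispatching again.
        if isinstance(cls, FactoryGeneratedMeta):
            return super().__new__(cls)
        model_name = type(model).__name__
        for suffix, prefix, base in _DISPATCH:
            if model_name.endswith(suffix):
                # Build e.g. "InputsToConceptsOcclusion" with bases (cls, base).
                generated = FactoryGeneratedMeta(prefix + cls.__name__, (cls, base), {})
                return generated.__new__(generated, model, *args, **kwargs)
        raise NotImplementedError(f"Model type not supported: {model_name}")


class Occlusion(MultitaskExplainerMixin, AttributionExplainer):
    """Method class; the task-specific behaviour comes from the generated subclass."""


class ToyForInputsToConcepts:
    """Hypothetical model whose class name triggers the new branch."""


explainer = Occlusion(ToyForInputsToConcepts())
print(type(explainer).__name__, explainer.task)
```

Because the generated class subclasses `Occlusion`, Python still runs `__init__` on the returned instance, so callers keep writing `Occlusion(model, ...)` whatever the task.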

{interpreto-0.5.0.dev0 → interpreto-0.5.0.dev1}/interpreto/attributions/methods/kernel_shap.py RENAMED

@@ -40,6 +40,7 @@ from interpreto.attributions.aggregations.linear_regression_aggregation import (
 from interpreto.attributions.base import AttributionExplainer, MultitaskExplainerMixin, setup_token_ids
 from interpreto.attributions.perturbations.shap_perturbation import ShapTokenPerturbator
 from interpreto.commons.granularity import Granularity, GranularityAggregationStrategy
+from interpreto.concepts.base import ModelForInputsToConcepts
 from interpreto.model_wrapping.inference_wrapper import InferenceModes
@@ -68,7 +69,7 @@ class KernelShap(MultitaskExplainerMixin, AttributionExplainer):
     def __init__(
         self,
-        model: PreTrainedModel,
+        model: PreTrainedModel | ModelForInputsToConcepts,
         tokenizer: PreTrainedTokenizer,
         batch_size: int = 4,
         granularity: Granularity = Granularity.WORD,
@@ -81,7 +82,7 @@ class KernelShap(MultitaskExplainerMixin, AttributionExplainer):
         Initialize the attribution method.
         Args:
-            model (PreTrainedModel): model to explain
+            model (PreTrainedModel | ModelForInputsToConcepts): model to explain
             tokenizer (PreTrainedTokenizer): Hugging Face tokenizer associated with the model
             batch_size (int): batch size for the attribution method
             granularity (Granularity, optional): The level of granularity for the explanation.

{interpreto-0.5.0.dev0 → interpreto-0.5.0.dev1}/interpreto/attributions/methods/lime.py RENAMED

@@ -43,6 +43,7 @@ from interpreto.attributions.aggregations.linear_regression_aggregation import (
 from interpreto.attributions.base import AttributionExplainer, InferenceModes, MultitaskExplainerMixin, setup_token_ids
 from interpreto.attributions.perturbations.random_perturbation import RandomMaskedTokenPerturbator
 from interpreto.commons import Granularity, GranularityAggregationStrategy
+from interpreto.concepts.base import ModelForInputsToConcepts
 class Lime(MultitaskExplainerMixin, AttributionExplainer):
@@ -72,7 +73,7 @@ class Lime(MultitaskExplainerMixin, AttributionExplainer):
     def __init__(
         self,
-        model: PreTrainedModel,
+        model: PreTrainedModel | ModelForInputsToConcepts,
         tokenizer: PreTrainedTokenizer,
         batch_size: int = 4,
         granularity: Granularity = Granularity.WORD,
@@ -88,7 +89,7 @@ class Lime(MultitaskExplainerMixin, AttributionExplainer):
         Initialize the attribution method.
         Args:
-            model (PreTrainedModel): model to explain
+            model (PreTrainedModel | ModelForInputsToConcepts): model to explain
             tokenizer (PreTrainedTokenizer): Hugging Face tokenizer associated with the model
             batch_size (int): batch size for the attribution method
             granularity (Granularity, optional): The level of granularity for the explanation.

{interpreto-0.5.0.dev0 → interpreto-0.5.0.dev1}/interpreto/attributions/methods/occlusion.py RENAMED

@@ -29,10 +29,9 @@ Occlusion attribution method
 from __future__ import annotations
 from collections.abc import Callable
-from typing import Any
 import torch
-from transformers import PreTrainedTokenizer
+from transformers import PreTrainedModel, PreTrainedTokenizer
 from interpreto.attributions.aggregations.base import OcclusionAggregator
 from interpreto.attributions.base import (
@@ -42,6 +41,7 @@ from interpreto.attributions.base import (
 )
 from interpreto.attributions.perturbations import OcclusionPerturbator
 from interpreto.commons.granularity import Granularity, GranularityAggregationStrategy
+from interpreto.concepts.base import ModelForInputsToConcepts
 from interpreto.model_wrapping.inference_wrapper import InferenceModes
@@ -68,7 +68,7 @@ class Occlusion(MultitaskExplainerMixin, AttributionExplainer):
     def __init__(
         self,
-        model: Any,
+        model: PreTrainedModel | ModelForInputsToConcepts,
         tokenizer: PreTrainedTokenizer,
         batch_size: int = 4,
         granularity: Granularity = Granularity.WORD,
@@ -80,7 +80,7 @@ class Occlusion(MultitaskExplainerMixin, AttributionExplainer):
         Initialize the attribution method.
         Args:
-            model (PreTrainedModel): model to explain
+            model (PreTrainedModel | ModelForInputsToConcepts): model to explain
             tokenizer (PreTrainedTokenizer): Hugging Face tokenizer associated with the model
             batch_size (int): batch size for the attribution method
             granularity (Granularity, optional): The level of granularity for the explanation.

{interpreto-0.5.0.dev0 → interpreto-0.5.0.dev1}/interpreto/attributions/methods/sobol_attribution.py RENAMED

@@ -41,6 +41,7 @@ from interpreto.attributions.perturbations.sobol_perturbation import (
     SobolTokenPerturbator,
 )
 from interpreto.commons.granularity import Granularity, GranularityAggregationStrategy
+from interpreto.concepts.base import ModelForInputsToConcepts
 class Sobol(MultitaskExplainerMixin, AttributionExplainer):
@@ -73,7 +74,7 @@ class Sobol(MultitaskExplainerMixin, AttributionExplainer):
     def __init__(
         self,
-        model: PreTrainedModel,
+        model: PreTrainedModel | ModelForInputsToConcepts,
         tokenizer: PreTrainedTokenizer,
         batch_size: int = 4,
         granularity: Granularity = Granularity.WORD,
@@ -88,7 +89,7 @@ class Sobol(MultitaskExplainerMixin, AttributionExplainer):
         Initialize the attribution method.
         Args:
-            model (PreTrainedModel): model to explain
+            model (PreTrainedModel | ModelForInputsToConcepts): model to explain
             tokenizer (PreTrainedTokenizer): Hugging Face tokenizer associated with the model
             batch_size (int): batch size for the attribution method
             granularity (Granularity, optional): The level of granularity for the explanation.

{interpreto-0.5.0.dev0 → interpreto-0.5.0.dev1}/interpreto/concepts/__init__.py RENAMED

@@ -43,6 +43,18 @@ from .methods import (
     TopKSAEConcepts,
     VanillaSAEConcepts,
 )
+from .probes import (
+    CosineCentroidProbe,
+    DiagonalMahalanobisCentroidProbe,
+    DotProductCentroidProbe,
+    LinearRegressionProbe,
+    LinearSVMProbe,
+    LogisticRegressionProbe,
+    MeansDiffProbe,
+    ProbeExplainer,
+    SqL2CentroidProbe,
+    SVDDCentroidProbe,
+)
 __all__ = [
     "BatchTopKSAEConcepts",
@@ -50,19 +62,29 @@ __all__ = [
     "ConceptAutoEncoderExplainer",
     "ConceptEncoderExplainer",
     "ConvexNMFConcepts",
+    "CosineCentroidProbe",
+    "DiagonalMahalanobisCentroidProbe",
     "DictionaryLearningConcepts",
+    "DotProductCentroidProbe",
     "ICAConcepts",
     "JumpReLUSAEConcepts",
     "KMeansConcepts",
     "LLMLabels",
+    "LinearRegressionProbe",
+    "LinearSVMProbe",
+    "LogisticRegressionProbe",
+    "MeansDiffProbe",
     "MpSAEConcepts",
     "NeuronsAsConcepts",
     "NMFConcepts",
     "PCAConcepts",
+    "ProbeExplainer",
     "SAELossClasses",
     "SemiNMFConcepts",
     "SparsePCAConcepts",
+    "SqL2CentroidProbe",
     "SVDConcepts",
+    "SVDDCentroidProbe",
     "TopKInputs",
     "TopKSAEConcepts",
     "VanillaSAEConcepts",
