renard-pipeline 0.4.1__tar.gz → 0.6.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release: this version of renard-pipeline might be problematic.
- {renard_pipeline-0.4.1 → renard_pipeline-0.6.0}/PKG-INFO +45 -20
- {renard_pipeline-0.4.1 → renard_pipeline-0.6.0}/README.md +26 -2
- {renard_pipeline-0.4.1 → renard_pipeline-0.6.0}/pyproject.toml +19 -19
- {renard_pipeline-0.4.1 → renard_pipeline-0.6.0}/renard/graph_utils.py +11 -4
- {renard_pipeline-0.4.1 → renard_pipeline-0.6.0}/renard/ner_utils.py +24 -14
- {renard_pipeline-0.4.1 → renard_pipeline-0.6.0}/renard/pipeline/character_unification.py +72 -19
- {renard_pipeline-0.4.1 → renard_pipeline-0.6.0}/renard/pipeline/characters_extraction.py +3 -1
- {renard_pipeline-0.4.1 → renard_pipeline-0.6.0}/renard/pipeline/core.py +141 -26
- {renard_pipeline-0.4.1 → renard_pipeline-0.6.0}/renard/pipeline/corefs/corefs.py +32 -33
- renard_pipeline-0.6.0/renard/pipeline/graph_extraction.py +604 -0
- renard_pipeline-0.6.0/renard/pipeline/ner/__init__.py +1 -0
- {renard_pipeline-0.4.1/renard/pipeline → renard_pipeline-0.6.0/renard/pipeline/ner}/ner.py +47 -76
- renard_pipeline-0.6.0/renard/pipeline/ner/retrieval.py +375 -0
- {renard_pipeline-0.4.1 → renard_pipeline-0.6.0}/renard/pipeline/progress.py +32 -1
- {renard_pipeline-0.4.1 → renard_pipeline-0.6.0}/renard/pipeline/speaker_attribution.py +2 -3
- renard_pipeline-0.6.0/renard/pipeline/tokenization.py +84 -0
- renard_pipeline-0.6.0/renard/plot_utils.py +87 -0
- renard_pipeline-0.6.0/renard/resources/determiners/__init__.py +1 -0
- renard_pipeline-0.6.0/renard/resources/determiners/determiners.py +41 -0
- {renard_pipeline-0.4.1 → renard_pipeline-0.6.0}/renard/resources/hypocorisms/hypocorisms.py +3 -2
- {renard_pipeline-0.4.1 → renard_pipeline-0.6.0}/renard/utils.py +57 -1
- renard_pipeline-0.4.1/renard/pipeline/graph_extraction.py +0 -515
- renard_pipeline-0.4.1/renard/pipeline/tokenization.py +0 -55
- renard_pipeline-0.4.1/renard/plot_utils.py +0 -67
- {renard_pipeline-0.4.1 → renard_pipeline-0.6.0}/LICENSE +0 -0
- {renard_pipeline-0.4.1 → renard_pipeline-0.6.0}/renard/gender.py +0 -0
- {renard_pipeline-0.4.1 → renard_pipeline-0.6.0}/renard/nltk_utils.py +0 -0
- {renard_pipeline-0.4.1 → renard_pipeline-0.6.0}/renard/pipeline/__init__.py +0 -0
- {renard_pipeline-0.4.1 → renard_pipeline-0.6.0}/renard/pipeline/corefs/__init__.py +0 -0
- {renard_pipeline-0.4.1 → renard_pipeline-0.6.0}/renard/pipeline/preconfigured.py +0 -0
- {renard_pipeline-0.4.1 → renard_pipeline-0.6.0}/renard/pipeline/preprocessing.py +0 -0
- {renard_pipeline-0.4.1 → renard_pipeline-0.6.0}/renard/pipeline/quote_detection.py +0 -0
- {renard_pipeline-0.4.1 → renard_pipeline-0.6.0}/renard/pipeline/sentiment_analysis.py +0 -0
- {renard_pipeline-0.4.1 → renard_pipeline-0.6.0}/renard/pipeline/stanford_corenlp.py +0 -0
- {renard_pipeline-0.4.1 → renard_pipeline-0.6.0}/renard/py.typed +0 -0
- {renard_pipeline-0.4.1 → renard_pipeline-0.6.0}/renard/resources/hypocorisms/__init__.py +0 -0
- {renard_pipeline-0.4.1 → renard_pipeline-0.6.0}/renard/resources/hypocorisms/datas/License.txt +0 -0
- {renard_pipeline-0.4.1 → renard_pipeline-0.6.0}/renard/resources/hypocorisms/datas/hypocorisms.csv +0 -0
- {renard_pipeline-0.4.1 → renard_pipeline-0.6.0}/renard/resources/pronouns/__init__.py +0 -0
- {renard_pipeline-0.4.1 → renard_pipeline-0.6.0}/renard/resources/pronouns/pronouns.py +0 -0
- {renard_pipeline-0.4.1 → renard_pipeline-0.6.0}/renard/resources/titles/__init__.py +0 -0
- {renard_pipeline-0.4.1 → renard_pipeline-0.6.0}/renard/resources/titles/titles.py +0 -0
--- renard_pipeline-0.4.1/PKG-INFO
+++ renard_pipeline-0.6.0/PKG-INFO
@@ -1,46 +1,49 @@
 Metadata-Version: 2.1
 Name: renard-pipeline
-Version: 0.4.1
+Version: 0.6.0
 Summary: Relationships Extraction from NARrative Documents
 Home-page: https://github.com/CompNet/Renard
 License: GPL-3.0-only
 Author: Arthur Amalvy
 Author-email: arthur.amalvy@univ-avignon.fr
-Requires-Python: >=3.8,<3.
+Requires-Python: >=3.8,<3.12
 Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
 Classifier: Programming Language :: Python :: 3
 Classifier: Programming Language :: Python :: 3.8
 Classifier: Programming Language :: Python :: 3.9
 Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
 Provides-Extra: spacy
 Provides-Extra: stanza
-Requires-Dist: coreferee (>=1.4
-Requires-Dist: datasets (>=
-Requires-Dist: grimbert (>=0.1
-Requires-Dist: matplotlib (>=3.5
-Requires-Dist: more-itertools (>=10.
-Requires-Dist: nameparser (>=1.1
-Requires-Dist: networkx (>=
-Requires-Dist: nltk (>=3.
-Requires-Dist: pandas (>=2.0
-Requires-Dist: pytest (>=
-Requires-Dist:
-Requires-Dist: spacy (>=3.5
-Requires-Dist: spacy-transformers (>=1.
-Requires-Dist: stanza (>=1.3
-Requires-Dist: tibert (>=0.
+Requires-Dist: coreferee (>=1.4,<2.0) ; extra == "spacy"
+Requires-Dist: datasets (>=3.0,<4.0)
+Requires-Dist: grimbert (>=0.1,<0.2)
+Requires-Dist: matplotlib (>=3.5,<4.0)
+Requires-Dist: more-itertools (>=10.5,<11.0)
+Requires-Dist: nameparser (>=1.1,<2.0)
+Requires-Dist: networkx (>=3.0,<4.0)
+Requires-Dist: nltk (>=3.9,<4.0)
+Requires-Dist: pandas (>=2.0,<3.0)
+Requires-Dist: pytest (>=8.3.0,<9.0.0)
+Requires-Dist: rank-bm25 (>=0.2.2,<0.3.0)
+Requires-Dist: spacy (>=3.5,<4.0) ; extra == "spacy"
+Requires-Dist: spacy-transformers (>=1.3,<2.0) ; extra == "spacy"
+Requires-Dist: stanza (>=1.3,<2.0) ; extra == "stanza"
+Requires-Dist: tibert (>=0.5,<0.6)
 Requires-Dist: torch (>=2.0.0,!=2.0.1)
 Requires-Dist: tqdm (>=4.62.3,<5.0.0)
-Requires-Dist: transformers (>=4.36
+Requires-Dist: transformers (>=4.36,<5.0)
 Project-URL: Documentation, https://compnet.github.io/Renard/
 Project-URL: Repository, https://github.com/CompNet/Renard
 Description-Content-Type: text/markdown

 # Renard

-
+[](https://doi.org/10.21105/joss.06574)

-
+Renard (Relationship Extraction from NARrative Documents) is a library for creating and using custom character networks extraction pipelines. Renard can extract dynamic as well as static character networks.
+
+[image]


 # Installation
@@ -102,3 +105,25 @@ Expensive tests are disabled by default. These can be run by setting the environ

 see [the "Contributing" section of the documentation](https://compnet.github.io/Renard/contributing.html).

+
+# How to cite
+
+If you use Renard in your research project, please cite it as follows:
+
+```bibtex
+@Article{Amalvy2024,
+  doi = {10.21105/joss.06574},
+  year = {2024},
+  publisher = {The Open Journal},
+  volume = {9},
+  number = {98},
+  pages = {6574},
+  author = {Amalvy, A. and Labatut, V. and Dufour, R.},
+  title = {Renard: A Modular Pipeline for Extracting Character
+           Networks from Narrative Texts},
+  journal = {Journal of Open Source Software},
+}
+```
+
+We would be happy to hear about your usage of Renard, so don't hesitate to reach out!
+
--- renard_pipeline-0.4.1/README.md
+++ renard_pipeline-0.6.0/README.md
@@ -1,8 +1,10 @@
 # Renard

-
+[](https://doi.org/10.21105/joss.06574)

-
+Renard (Relationship Extraction from NARrative Documents) is a library for creating and using custom character networks extraction pipelines. Renard can extract dynamic as well as static character networks.
+
+[image]


 # Installation
@@ -63,3 +65,25 @@ Expensive tests are disabled by default. These can be run by setting the environ
 # Contributing

 see [the "Contributing" section of the documentation](https://compnet.github.io/Renard/contributing.html).
+
+
+# How to cite
+
+If you use Renard in your research project, please cite it as follows:
+
+```bibtex
+@Article{Amalvy2024,
+  doi = {10.21105/joss.06574},
+  year = {2024},
+  publisher = {The Open Journal},
+  volume = {9},
+  number = {98},
+  pages = {6574},
+  author = {Amalvy, A. and Labatut, V. and Dufour, R.},
+  title = {Renard: A Modular Pipeline for Extracting Character
+           Networks from Narrative Texts},
+  journal = {Journal of Open Source Software},
+}
+```
+
+We would be happy to hear about your usage of Renard, so don't hesitate to reach out!
--- renard_pipeline-0.4.1/pyproject.toml
+++ renard_pipeline-0.6.0/pyproject.toml
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "renard-pipeline"
-version = "0.4.1"
+version = "0.6.0"
 description = "Relationships Extraction from NARrative Documents"
 authors = ["Arthur Amalvy <arthur.amalvy@univ-avignon.fr>"]
 license = "GPL-3.0-only"
@@ -14,29 +14,29 @@ documentation = "https://compnet.github.io/Renard/"

 [tool.poetry.dependencies]
 # optional dependencies
-stanza = { version = "^1.3
-spacy = { version = "^3.5
-coreferee = { version = "^1.4
-spacy-transformers = {version = "^1.
+stanza = { version = "^1.3", optional = true }
+spacy = { version = "^3.5", optional = true }
+coreferee = { version = "^1.4", optional = true }
+spacy-transformers = {version = "^1.3", optional = true}
 # required dependencies
-python = "^3.8,<3.
+python = "^3.8,<3.12"
 torch = ">=2.0.0, !=2.0.1"
-transformers = "^4.36
-nltk = "^3.
+transformers = "^4.36"
+nltk = "^3.9"
 tqdm = "^4.62.3"
-networkx = "^
-more-itertools = "^10.
-nameparser = "^1.1
-matplotlib = "^3.5
-
-
-
-
-
-
+networkx = "^3.0"
+more-itertools = "^10.5"
+nameparser = "^1.1"
+matplotlib = "^3.5"
+pandas = "^2.0"
+pytest = "^8.3.0"
+tibert = "^0.5"
+grimbert = "^0.1"
+datasets = "^3.0"
+rank-bm25 = "^0.2.2"

 [tool.poetry.dev-dependencies]
-hypothesis = "^6.
+hypothesis = "^6.82"
 Sphinx = "^4.3.1"
 sphinx-rtd-theme = "^1.0.0"
 sphinx-autodoc-typehints = "^1.12.0"
--- renard_pipeline-0.4.1/renard/graph_utils.py
+++ renard_pipeline-0.6.0/renard/graph_utils.py
@@ -70,10 +70,17 @@ def graph_with_names(
     else:
         name_style_fn = name_style

-
-
-
-
+    mapping = {}
+    for character in G.nodes():
+        # NOTE: it is *possible* to have a graph where nodes are not
+        # characters (for example, simple strings). Therefore, we are
+        # lenient here
+        try:
+            mapping[character] = name_style_fn(character)
+        except AttributeError:
+            mapping[character] = character
+
+    return nx.relabel_nodes(G, mapping)


 def layout_with_names(
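In 0.6.0, `graph_with_names` relabels nodes leniently: nodes that are not character objects (for example, plain strings) are kept as-is instead of raising. A minimal sketch of the same idea outside Renard (`most_frequent_name` is a hypothetical accessor, not Renard's actual API):

```python
import networkx as nx

G = nx.Graph()
G.add_edge("Alice", "Bob")  # plain-string nodes, no character attributes

def name_style_fn(character):
    # A character object would expose naming helpers; a plain string
    # does not, so this raises AttributeError for string nodes.
    return character.most_frequent_name()  # hypothetical accessor

mapping = {}
for node in G.nodes():
    try:
        mapping[node] = name_style_fn(node)
    except AttributeError:
        mapping[node] = node  # leave non-character nodes untouched

G = nx.relabel_nodes(G, mapping)
print(list(G.nodes()))  # ['Alice', 'Bob']
```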
--- renard_pipeline-0.4.1/renard/ner_utils.py
+++ renard_pipeline-0.6.0/renard/ner_utils.py
@@ -74,7 +74,7 @@ class DataCollatorForTokenClassificationWithBatchEncoding:
 class NERDataset(Dataset):
     """
     :ivar _context_mask: for each element, a mask indicating which
-        tokens are part of the context (
+        tokens are part of the context (0 for context, 1 for text on
         which to perform inference). The mask allows to discard
         predictions made for context at inference time, even though
         the context can still be passed as input to the model.
@@ -92,11 +92,11 @@ class NERDataset(Dataset):
         assert all(
             [len(cm) == len(elt) for elt, cm in zip(self.elements, context_mask)]
         )
-        self._context_mask = context_mask or [[
+        self._context_mask = context_mask or [[1] * len(elt) for elt in self.elements]

         self.tokenizer = tokenizer

-    def __getitem__(self, index:
+    def __getitem__(self, index: int) -> BatchEncoding:
         element = self.elements[index]

         batch = self.tokenizer(
@@ -104,15 +104,18 @@ class NERDataset(Dataset):
             truncation=True,
             max_length=512,  # TODO
             is_split_into_words=True,
+            return_length=True,
         )

-        batch["
-
-
-
-
-        batch["context_mask"]
+        length = batch["length"][0]
+        del batch["length"]
+        if self.tokenizer.truncation_side == "right":
+            batch["context_mask"] = self._context_mask[index][:length]
+        else:
+            assert self.tokenizer.truncation_side == "left"
+            batch["context_mask"] = self._context_mask[index][
+                len(batch["input_ids"]) - length :
+            ]

         return batch
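`NERDataset.__getitem__` now asks the tokenizer for the post-truncation length (`return_length=True`) and cuts the context mask to match the tokenizer's `truncation_side`. A standalone sketch of just the slicing logic (a simplification: the real code indexes the mask per encoded token):

```python
from typing import List

def truncate_context_mask(
    context_mask: List[int], kept_length: int, truncation_side: str
) -> List[int]:
    """Keep the part of the mask matching the tokens the tokenizer kept:
    right truncation keeps the first kept_length positions, left
    truncation keeps the last ones."""
    if truncation_side == "right":
        return context_mask[:kept_length]
    assert truncation_side == "left"
    return context_mask[len(context_mask) - kept_length :]

# A 6-token mask when the model budget only keeps 4 tokens:
mask = [0, 0, 1, 1, 1, 1]  # 0 = context, 1 = text to predict on
print(truncate_context_mask(mask, 4, "right"))  # [0, 0, 1, 1]
print(truncate_context_mask(mask, 4, "left"))   # [1, 1, 1, 1]
```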
@@ -181,6 +184,7 @@ def load_conll2002_bio(
     path: str,
     tag_conversion_map: Optional[Dict[str, str]] = None,
     separator: str = "\t",
+    max_sent_len: Optional[int] = None,
     **kwargs,
 ) -> Tuple[List[List[str]], List[str], List[NEREntity]]:
     """Load a file under CoNLL2022 BIO format. Sentences are expected
@@ -192,7 +196,9 @@ def load_conll2002_bio(
     :param separator: separator between token and BIO tags
     :param tag_conversion_map: conversion map for tags found in the
         input file. Example : ``{'B': 'B-PER', 'I': 'I-PER'}``
-    :param
+    :param max_sent_len: if specified, maximum length, in tokens, of
+        sentences.
+    :param kwargs: additional kwargs for :func:`open` (such as
         ``encoding`` or ``newline``).

     :return: ``(sentences, tokens, entities)``
@@ -207,7 +213,9 @@ def load_conll2002_bio(
     tags = []
     for line in raw_data.split("\n"):
         line = line.strip("\n")
-        if re.fullmatch(r"\s*", line)
+        if re.fullmatch(r"\s*", line) or (
+            not max_sent_len is None and len(sent_tokens) >= max_sent_len
+        ):
             if len(sent_tokens) == 0:
                 continue
             sents.append(sent_tokens)
@@ -227,6 +235,7 @@ def hgdataset_from_conll2002(
     path: str,
     tag_conversion_map: Optional[Dict[str, str]] = None,
     separator: str = "\t",
+    max_sent_len: Optional[int] = None,
     **kwargs,
 ) -> HGDataset:
     """Load a CoNLL-2002 file as a Huggingface Dataset.
@@ -234,12 +243,13 @@ def hgdataset_from_conll2002(
     :param path: passed to :func:`.load_conll2002_bio`
     :param tag_conversion_map: passed to :func:`load_conll2002_bio`
    :param separator: passed to :func:`load_conll2002_bio`
-    :param
+    :param max_sent_len: passed to :func:`load_conll2002_bio`
+    :param kwargs: additional kwargs for :func:`open`

     :return: a :class:`datasets.Dataset` with features 'tokens' and 'labels'.
     """
     sentences, tokens, entities = load_conll2002_bio(
-        path, tag_conversion_map, separator, **kwargs
+        path, tag_conversion_map, separator, max_sent_len, **kwargs
     )

     # convert entities to labels
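The new `max_sent_len` parameter forces a sentence break once a sentence reaches the given token count, which keeps examples within a model's input budget. A usage sketch with the signatures from this diff (the file path is made up):

```python
from renard.ner_utils import load_conll2002_bio, hgdataset_from_conll2002

# Hypothetical CoNLL-2002 BIO file: one "token<TAB>tag" pair per line,
# sentences separated by blank lines.
sents, tokens, entities = load_conll2002_bio(
    "novel.conll",  # hypothetical path
    tag_conversion_map={"B": "B-PER", "I": "I-PER"},
    separator="\t",
    max_sent_len=128,  # split any sentence longer than 128 tokens
    encoding="utf-8",
)

# The same option is forwarded when loading as a Huggingface dataset:
dataset = hgdataset_from_conll2002("novel.conll", max_sent_len=128)
```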
--- renard_pipeline-0.4.1/renard/pipeline/character_unification.py
+++ renard_pipeline-0.6.0/renard/pipeline/character_unification.py
@@ -1,5 +1,5 @@
 from typing import Any, Dict, List, FrozenSet, Set, Optional, Tuple, Union, Literal
-import
+import re, sys
 from itertools import combinations
 from collections import defaultdict, Counter
 from dataclasses import dataclass
@@ -11,6 +11,7 @@ from renard.pipeline.ner import NEREntity
 from renard.pipeline.progress import ProgressReporter
 from renard.resources.hypocorisms import HypocorismGazetteer
 from renard.resources.pronouns import is_a_female_pronoun, is_a_male_pronoun
+from renard.resources.determiners import singular_determiners
 from renard.resources.titles import is_a_male_title, is_a_female_title, all_titles


@@ -61,6 +62,8 @@ def _assign_coreference_mentions(
     # we assign each chain to the character with highest name
     # occurence in it
     for chain in corefs:
+        if len(char_mentions) == 0:
+            break
         # determine the characters with the highest number of
         # occurences
         occ_counter = {}
@@ -98,8 +101,13 @@ class NaiveCharacterUnifier(PipelineStep):
             character for it to be valid
         """
         self.min_appearances = min_appearances
+        # a default value, will be set by _pipeline_init_
+        self.character_ner_tag = "PER"
         super().__init__()

+    def _pipeline_init_(self, lang: str, character_ner_tag: str, **kwargs):
+        self.character_ner_tag = character_ner_tag
+
     def __call__(
         self,
         text: str,
@@ -112,7 +120,7 @@ class NaiveCharacterUnifier(PipelineStep):
         :param tokens:
         :param entities:
         """
-        persons = [e for e in entities if e.tag ==
+        persons = [e for e in entities if e.tag == self.character_ner_tag]

         characters = defaultdict(list)
         for entity in persons:
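Both unifier steps now receive the tag marking character entities through `_pipeline_init_` instead of hard-coding `"PER"`, so pipelines whose NER step emits a different tagset (e.g. `"PERSON"`) still work. A self-contained sketch of the hook pattern (the step and entity classes here are illustrative, not Renard's):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Entity:
    tokens: List[str]
    tag: str

class CharacterFilterStep:
    """Illustrative step following the same hook pattern as the diff."""

    def __init__(self) -> None:
        # default value, overwritten by _pipeline_init_ at pipeline setup
        self.character_ner_tag = "PER"

    def _pipeline_init_(self, lang: str, character_ner_tag: str, **kwargs) -> None:
        self.character_ner_tag = character_ner_tag

    def __call__(self, entities: List[Entity]) -> List[Entity]:
        # keep only character mentions, whatever the tagset calls them
        return [e for e in entities if e.tag == self.character_ner_tag]

step = CharacterFilterStep()
step._pipeline_init_("eng", character_ner_tag="PERSON")  # OntoNotes-style tag
print(step([Entity(["Ishmael"], "PERSON"), Entity(["Pequod"], "LOC")]))
```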
@@ -159,6 +167,8 @@ class GraphRulesCharacterUnifier(PipelineStep):
         min_appearances: int = 0,
         additional_hypocorisms: Optional[List[Tuple[str, List[str]]]] = None,
         link_corefs_mentions: bool = False,
+        ignore_lone_titles: Optional[Set[str]] = None,
+        ignore_leading_determiner: bool = False,
     ) -> None:
         """
         :param min_appearances: minimum number of appearances of a
@@ -173,20 +183,32 @@ class GraphRulesCharacterUnifier(PipelineStep):
             extract a lot of spurious links. However, linking by
             coref is sometimes the only way to resolve a character
             alias.
+        :param ignore_lone_titles: a set of titles to ignore when they
+            stand on their own. This avoids extracting false
+            positive characters such as 'Mr.' or 'Miss'.
+        :param ignore_leading_determiner: if ``True``, will ignore the
+            leading determiner when applying unification rules. This
+            is useful if the NER model used in the pipeline adds
+            leading determiners as part of entities.
         """
         self.min_appearances = min_appearances
         self.additional_hypocorisms = additional_hypocorisms
         self.link_corefs_mentions = link_corefs_mentions
+        self.ignore_lone_titles = ignore_lone_titles or set()
+        self.character_ner_tag = "PER"  # a default value, will be set by _pipeline_init
+        self.ignore_leading_determiner = ignore_leading_determiner

         super().__init__()

-    def _pipeline_init_(self, lang: str,
+    def _pipeline_init_(self, lang: str, character_ner_tag: str, **kwargs):
         self.hypocorism_gazetteer = HypocorismGazetteer(lang=lang)
         if not self.additional_hypocorisms is None:
             for name, nicknames in self.additional_hypocorisms:
                 self.hypocorism_gazetteer._add_hypocorism_(name, nicknames)

-
+        self.character_ner_tag = character_ner_tag
+
+        return super()._pipeline_init_(lang, **kwargs)

     def __call__(
         self,
@@ -196,12 +218,17 @@ class GraphRulesCharacterUnifier(PipelineStep):
     ) -> Dict[str, Any]:
         import networkx as nx

-        mentions = [m for m in entities if m.tag ==
-        mentions_str =
+        mentions = [m for m in entities if m.tag == self.character_ner_tag]
+        mentions_str = set(
+            filter(
+                lambda m: not m in self.ignore_lone_titles,
+                map(lambda m: " ".join(m.tokens), mentions),
+            )
+        )

         # * create a graph where each node is a mention detected by NER
         G = nx.Graph()
-        for mention_str in
+        for mention_str in mentions_str:
             G.add_node(mention_str)

         # * HumanName local configuration - dependant on language
@@ -209,23 +236,28 @@ class GraphRulesCharacterUnifier(PipelineStep):

         # * link nodes based on several rules
         for name1, name2 in combinations(G.nodes(), 2):
+
+            # preprocess name when needed
+            pname1 = self._preprocess_name(name1)
+            pname2 = self._preprocess_name(name2)
+
             # is one name a known hypocorism of the other ? (also
             # checks if both names are the same)
-            if self.hypocorism_gazetteer.are_related(
+            if self.hypocorism_gazetteer.are_related(pname1, pname2):
                 G.add_edge(name1, name2)
                 continue

             # if we remove the title, is one name related to the other
             # ?
             if self.names_are_related_after_title_removal(
-
+                pname1, pname2, hname_constants
             ):
                 G.add_edge(name1, name2)
                 continue

             # add an edge if two characters have the same family names
-            human_name1 = HumanName(
-            human_name2 = HumanName(
+            human_name1 = HumanName(pname1, constants=hname_constants)
+            human_name2 = HumanName(pname2, constants=hname_constants)
             if (
                 len(human_name1.last) > 0
                 and human_name1.last.lower() == human_name2.last.lower()
@@ -262,10 +294,15 @@ class GraphRulesCharacterUnifier(PipelineStep):
                 pass

         for name1, name2 in combinations(G.nodes(), 2):
+
+            # preprocess names when needed
+            pname1 = self._preprocess_name(name1)
+            pname2 = self._preprocess_name(name2)
+
             # check if characters have the same last name but a
             # different first name.
-            human_name1 = HumanName(
-            human_name2 = HumanName(
+            human_name1 = HumanName(pname1, constants=hname_constants)
+            human_name2 = HumanName(pname2, constants=hname_constants)
             if (
                 len(human_name1.last) > 0
                 and len(human_name2.last) > 0
@@ -317,6 +354,17 @@ class GraphRulesCharacterUnifier(PipelineStep):

         return {"characters": characters}

+    def _preprocess_name(self, name) -> str:
+        if self.ignore_leading_determiner:
+            if not self.lang in singular_determiners:
+                print(
+                    f"[warning] can't ignore leading determiners for {self.lang}",
+                    file=sys.stderr,
+                )
+            for determiner in singular_determiners.get(self.lang, []):
+                name = re.sub(f"^{determiner} ", " ", name, flags=re.I)
+        return name
+
     def _make_hname_constants(self) -> Constants:
         if self.lang == "eng":
             return Constants()
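`_preprocess_name` strips a leading singular determiner (per language) before the unification rules compare names, so mentions like "the Beast" and "Beast" can be linked. A self-contained sketch of the same regex approach (the determiner lists below are illustrative; Renard ships its own in `renard.resources.determiners`):

```python
import re

# Illustrative determiner lists, standing in for Renard's
# singular_determiners resource.
singular_determiners = {"eng": ["the", "a", "an"], "fra": ["le", "la"]}

def strip_leading_determiner(name: str, lang: str) -> str:
    for determiner in singular_determiners.get(lang, []):
        # only a determiner followed by a space counts as "leading"
        name = re.sub(f"^{re.escape(determiner)} ", "", name, flags=re.I)
    return name

print(strip_leading_determiner("The Beast", "eng"))  # 'Beast'
print(strip_leading_determiner("Theodore", "eng"))   # 'Theodore' (untouched)
```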
@@ -345,13 +393,18 @@ class GraphRulesCharacterUnifier(PipelineStep):
             or self.hypocorism_gazetteer.are_related(raw_name1, raw_name2)
         )

-    def names_are_in_coref(
+    def names_are_in_coref(
+        self, name1: str, name2: str, corefs: List[List[Mention]]
+    ) -> bool:
+        once_together = False
         for coref_chain in corefs:
-
-
-            ):
-                return
-
+            name1_in = any([name1 == " ".join(m.tokens) for m in coref_chain])
+            name2_in = any([name2 == " ".join(m.tokens) for m in coref_chain])
+            if name1_in == (not name2_in):
+                return False
+            elif name1_in and name2_in:
+                once_together = True
+        return once_together

     def infer_name_gender(
         self,
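The rewritten `names_are_in_coref` only links two names if they always co-occur: a chain containing exactly one of them is contradictory evidence and rules the pair out, and at least one chain must contain both. A standalone sketch over plain string chains (Renard's version compares joined `Mention` tokens):

```python
from typing import List

def names_are_in_coref(name1: str, name2: str, chains: List[List[str]]) -> bool:
    """True iff some chain contains both names and no chain contains
    exactly one of them."""
    once_together = False
    for chain in chains:
        name1_in = name1 in chain
        name2_in = name2 in chain
        if name1_in != name2_in:  # exactly one present: contradictory evidence
            return False
        if name1_in and name2_in:
            once_together = True
    return once_together

chains = [["Liz", "Elizabeth", "she"], ["the king", "he"]]
print(names_are_in_coref("Liz", "Elizabeth", chains))  # True
print(names_are_in_coref("Liz", "the king", chains))   # False
```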
--- renard_pipeline-0.4.1/renard/pipeline/characters_extraction.py
+++ renard_pipeline-0.6.0/renard/pipeline/characters_extraction.py
@@ -1,7 +1,9 @@
+import sys
 import renard.pipeline.character_unification as cu

 print(
-    "[warning] the characters_extraction module is deprecated. Use character_unification instead."
+    "[warning] the characters_extraction module is deprecated. Use character_unification instead.",
+    file=sys.stderr,
 )

 Character = cu.Character