PyPI - biocypher - Versions diffs - 0.5.19__tar.gz → 0.5.21__tar.gz - Mend

biocypher 0.5.19tar.gz → 0.5.21tar.gz

Potentially problematic release.

This version of biocypher might be problematic.

Files changed (27)

{biocypher-0.5.19 → biocypher-0.5.21}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: biocypher
-Version: 0.5.19
+Version: 0.5.21
 Summary: A unifying framework for biomedical research knowledge graphs
 Home-page: https://github.com/biocypher/biocypher
 License: MIT
@@ -25,8 +25,10 @@ Requires-Dist: more_itertools
 Requires-Dist: neo4j-utils (==0.0.7)
 Requires-Dist: networkx (>=3.0,<4.0)
 Requires-Dist: pandas (>=2.0.1,<3.0.0)
+Requires-Dist: pooch (>=1.7.0,<2.0.0)
 Requires-Dist: rdflib (>=6.2.0,<7.0.0)
 Requires-Dist: stringcase (>=1.2.0,<2.0.0)
+Requires-Dist: tqdm (>=4.65.0,<5.0.0)
 Requires-Dist: treelib (>=1.6.1,<2.0.0)
 Project-URL: Bug Tracker, https://github.com/biocypher/biocypher/issues
 Project-URL: Repository, https://github.com/biocypher/biocypher
@@ -38,7 +40,9 @@ Description-Content-Type: text/markdown
 ![Python](https://img.shields.io/badge/python-3.10-blue.svg)
 [![PyPI version](https://badge.fury.io/py/biocypher.svg)](https://badge.fury.io/py/biocypher)
 [![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
-![Docs build](https://github.com/biocypher/biocypher/actions/workflows/sphinx_autodoc.yml/badge.svg)
+[![CI](https://github.com/biocypher/biocypher/actions/workflows/ci_cd.yaml/badge.svg)](https://github.com/biocypher/biocypher/actions/workflows/ci_cd.yaml)
+![Coverage](https://raw.githubusercontent.com/biocypher/biocypher/coverage/coverage.svg)
+[![Docs build](https://github.com/biocypher/biocypher/actions/workflows/sphinx_autodoc.yaml/badge.svg)](https://github.com/biocypher/biocypher/actions/workflows/sphinx_autodoc.yaml)
 [![Downloads](https://static.pepy.tech/badge/biocypher)](https://pepy.tech/project/biocypher)
 [![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit)](https://github.com/pre-commit/pre-commit)
 [![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat-square)](http://makeapullrequest.com)

{biocypher-0.5.19 → biocypher-0.5.21}/README.md RENAMED

@@ -4,7 +4,9 @@
 ![Python](https://img.shields.io/badge/python-3.10-blue.svg)
 [![PyPI version](https://badge.fury.io/py/biocypher.svg)](https://badge.fury.io/py/biocypher)
 [![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
-![Docs build](https://github.com/biocypher/biocypher/actions/workflows/sphinx_autodoc.yml/badge.svg)
+[![CI](https://github.com/biocypher/biocypher/actions/workflows/ci_cd.yaml/badge.svg)](https://github.com/biocypher/biocypher/actions/workflows/ci_cd.yaml)
+![Coverage](https://raw.githubusercontent.com/biocypher/biocypher/coverage/coverage.svg)
+[![Docs build](https://github.com/biocypher/biocypher/actions/workflows/sphinx_autodoc.yaml/badge.svg)](https://github.com/biocypher/biocypher/actions/workflows/sphinx_autodoc.yaml)
 [![Downloads](https://static.pepy.tech/badge/biocypher)](https://pepy.tech/project/biocypher)
 [![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit)](https://github.com/pre-commit/pre-commit)
 [![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat-square)](http://makeapullrequest.com)

{biocypher-0.5.19 → biocypher-0.5.21}/biocypher/_connect.py RENAMED

@@ -53,8 +53,6 @@ class _Neo4jDriver:
         increment_version (bool): Whether to increment the version number.
-        ontology (Ontology): The ontology to use for mapping.
         translator (Translator): The translator to use for mapping.
     """
@@ -66,14 +64,12 @@ class _Neo4jDriver:
         user: str,
         password: str,
         multi_db: bool,
-        ontology: Ontology,
         translator: Translator,
         wipe: bool = False,
         fetch_size: int = 1000,
         increment_version: bool = True,
     ):
-        self._ontology = ontology
-        self._translator = translator
+        self.translator = translator
         self._driver = neo4j_utils.Driver(
             db_name=database_name,
@@ -103,7 +99,7 @@ class _Neo4jDriver:
             "MATCH (v:BioCypher) " "WHERE NOT (v)-[:PRECEDES]->() " "RETURN v",
         )
         # add version node
-        self.add_biocypher_nodes(self._ontology)
+        self.add_biocypher_nodes(self.translator.ontology)
         # connect version node to previous
         if db_version[0]:
@@ -111,7 +107,7 @@ class _Neo4jDriver:
             previous_id = previous["v"]["id"]
             e_meta = BioCypherEdge(
                 previous_id,
-                self._ontology.get_dict().get("node_id"),
+                self.translator.ontology.get_dict().get("node_id"),
                 "PRECEDES",
             )
             self.add_biocypher_edges(e_meta)
@@ -142,7 +138,7 @@ class _Neo4jDriver:
         logger.info("Creating constraints for node types in config.")
         # get structure
-        for leaf in self._ontology.extended_schema.items():
+        for leaf in self.translator.ontology.mapping.extended_schema.items():
             label = _misc.sentencecase_to_pascalcase(leaf[0])
             if leaf[1]["represented_as"] == "node":
                 s = (
@@ -172,7 +168,7 @@ class _Neo4jDriver:
                 - second entry: Neo4j summary.
         """
-        bn = self._translator.translate_nodes(id_type_tuples)
+        bn = self.translator.translate_nodes(id_type_tuples)
         return self.add_biocypher_nodes(bn)
     def add_edges(self, id_src_tar_type_tuples: Iterable[tuple]) -> tuple:
@@ -204,7 +200,7 @@ class _Neo4jDriver:
                 - second entry: Neo4j summary.
         """
-        bn = self._translator.translate_edges(id_src_tar_type_tuples)
+        bn = self.translator.translate_edges(id_src_tar_type_tuples)
         return self.add_biocypher_edges(bn)
     def add_biocypher_nodes(
@@ -375,7 +371,6 @@ class _Neo4jDriver:
 def get_driver(
     dbms: str,
     translator: "Translator",
-    ontology: "Ontology",
 ):
     """
     Function to return the writer class.
@@ -394,7 +389,6 @@ def get_driver(
             user=dbms_config["user"],
             password=dbms_config["password"],
             multi_db=dbms_config["multi_db"],
-            ontology=ontology,
             translator=translator,
         )
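
The `_connect.py` hunks above drop the separate `ontology` argument and reach the ontology through `self.translator.ontology` instead. A minimal sketch of that ownership pattern follows. The three classes are illustrative stand-ins written for this note, not the real biocypher classes, and the `node_id` value is made up.

```python
# Illustrative stand-ins only; these are NOT the real biocypher classes.
class Ontology:
    def __init__(self, node_id):
        self._node_id = node_id

    def get_dict(self):
        # Mirrors the `get_dict().get("node_id")` access seen in the diff.
        return {"node_id": self._node_id}


class Translator:
    def __init__(self, ontology):
        # The translator holds the only reference to the ontology.
        self.ontology = ontology


class Driver:
    def __init__(self, translator):
        # No `ontology` parameter: it is reached via the translator.
        self.translator = translator

    def version_node_id(self):
        return self.translator.ontology.get_dict().get("node_id")


driver = Driver(Translator(Ontology("v-example")))
print(driver.version_node_id())
```

With a single owner, a caller can no longer hand the driver a translator and an ontology that disagree with each other, which is presumably the motivation for the change.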

{biocypher-0.5.19 → biocypher-0.5.21}/biocypher/_core.py RENAMED

@@ -13,8 +13,10 @@ BioCypher core module. Interfaces with the user and distributes tasks to
 submodules.
 """
 from typing import Optional
+import os
 from more_itertools import peekable
+import yaml
 import pandas as pd
@@ -22,10 +24,11 @@ from ._logger import logger
 logger.debug(f"Loading module {__name__}.")
+from ._get import Downloader
 from ._write import get_writer
 from ._config import config as _config
 from ._config import update_from_file as _file_update
-from ._create import BioCypherEdge, BioCypherNode
+from ._create import BioCypherEdge, BioCypherNode, BioCypherRelAsNode
 from ._pandas import Pandas
 from ._connect import get_driver
 from ._mapping import OntologyMapping
@@ -181,19 +184,6 @@ class BioCypher:
         return self._ontology_mapping
-    def _get_translator(self) -> Translator:
-        """
-        Create translator if not exists and return.
-        """
-        if not self._translator:
-            self._translator = Translator(
-                ontology_mapping=self._get_ontology_mapping(),
-                strict_mode=self._strict_mode,
-            )
-        return self._translator
     def _get_ontology(self) -> Ontology:
         """
         Create ontology if not exists and return.
@@ -208,17 +198,28 @@ class BioCypher:
         return self._ontology
+    def _get_translator(self) -> Translator:
+        """
+        Create translator if not exists and return.
+        """
+        if not self._translator:
+            self._translator = Translator(
+                ontology=self._get_ontology(),
+                strict_mode=self._strict_mode,
+            )
+        return self._translator
     def _get_writer(self):
         """
         Create writer if not online. Set as instance variable `self._writer`.
         """
-        # Get worker
         if self._offline:
             self._writer = get_writer(
                 dbms=self._dbms,
                 translator=self._get_translator(),
-                ontology=self._get_ontology(),
                 deduplicator=self._get_deduplicator(),
                 output_directory=self._output_directory,
                 strict_mode=self._strict_mode,
@@ -235,7 +236,6 @@ class BioCypher:
             self._driver = get_driver(
                 dbms=self._dbms,
                 translator=self._get_translator(),
-                ontology=self._get_ontology(),
                 deduplicator=self._get_deduplicator(),
             )
         else:
@@ -308,24 +308,33 @@ class BioCypher:
         return self._pd.dfs
-    def add(self, entities):
+    def add(self, entities) -> None:
         """
         Function to add entities to the in-memory database. Accepts an iterable
         of tuples (if given, translates to ``BioCypherNode`` or
         ``BioCypherEdge`` objects) or an iterable of ``BioCypherNode`` or
         ``BioCypherEdge`` objects.
+        Args:
+            entities (iterable): An iterable of entities to add to the database.
+                Can be 3-tuples (nodes) or 5-tuples (edges); also accepts
+                4-tuples for edges (deprecated).
+        Returns:
+            None
         """
         if not self._pd:
             self._pd = Pandas(
                 translator=self._get_translator(),
-                ontology=self._get_ontology(),
                 deduplicator=self._get_deduplicator(),
             )
         entities = peekable(entities)
-        if isinstance(entities.peek(), BioCypherNode) or isinstance(
-            entities.peek(), BioCypherEdge
+        if (
+            isinstance(entities.peek(), BioCypherNode)
+            or isinstance(entities.peek(), BioCypherEdge)
+            or isinstance(entities.peek(), BioCypherRelAsNode)
         ):
             tentities = entities
         elif len(entities.peek()) < 4:
@@ -335,10 +344,28 @@ class BioCypher:
         self._pd.add_tables(tentities)
-    def add_nodes(self, nodes):
+    def add_nodes(self, nodes) -> None:
+        """
+        Wrapper for ``add()`` to add nodes to the in-memory database.
+        Args:
+            nodes (iterable): An iterable of node tuples to add to the database.
+        Returns:
+            None
+        """
         self.add(nodes)
-    def add_edges(self, edges):
+    def add_edges(self, edges) -> None:
+        """
+        Wrapper for ``add()`` to add edges to the in-memory database.
+        Args:
+            edges (iterable): An iterable of edge tuples to add to the database.
+        Returns:
+            None
+        """
         self.add(edges)
     def merge_nodes(self, nodes) -> bool:
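
The `add()` hunk above peeks at the first entity and branches on its type or tuple length (`len(entities.peek()) < 4`). The sketch below shows the same peek-then-dispatch idea using only the standard library rather than `more_itertools.peekable`. The helper names `split_first` and `classify` are hypothetical, and since the hunk does not show what each branch does, the sketch only labels the input.

```python
from itertools import chain


def split_first(iterable):
    """Return the first item plus an iterator that still yields it."""
    iterator = iter(iterable)
    first = next(iterator)  # raises StopIteration on empty input
    return first, chain([first], iterator)


def classify(entities):
    """Label the input as nodes or edges based on the first tuple's length."""
    first, entities = split_first(entities)
    # Fewer than 4 fields is treated as a node tuple; 4-tuples (deprecated)
    # and 5-tuples are treated as edges, as the docstring above describes.
    kind = "nodes" if len(first) < 4 else "edges"
    return kind, list(entities)


node_kind, nodes = classify(iter([("p1", "protein", {}), ("p2", "protein", {})]))
edge_kind, edges = classify([("e1", "p1", "p2", "interacts", {})])
print(node_kind, len(nodes), edge_kind, len(edges))
```

Only the first element is inspected, so a stream that mixes node and edge tuples would be labelled by whatever comes first.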
@@ -389,6 +416,24 @@ class BioCypher:
         # write edge files
         return self._driver.add_biocypher_edges(tedges)
+    # DOWNLOAD AND CACHE MANAGEMENT METHODS ###
+    def _get_downloader(self):
+        """
+        Create downloader if not exists.
+        """
+        if not self._downloader:
+            self._downloader = Downloader()
+    def download(self, force: bool = False) -> None:
+        """
+        Use the :class:`Downloader` class to download or load from cache the
+        resources given by the adapter.
+        """
+        self._get_downloader()
     # OVERVIEW AND CONVENIENCE METHODS ###
     def log_missing_input_labels(self) -> Optional[dict[str, list[str]]]:
@@ -504,6 +549,73 @@ class BioCypher:
         self._writer.write_import_call()
+    def write_schema_info(self) -> None:
+        """
+        Write an extended schema info YAML file that extends the
+        `schema_config.yaml` with run-time information of the built KG. For
+        instance, include information on whether something present in the actual
+        knowledge graph, whether it is a relationship (which is important in the
+        case of representing relationships as nodes) and the actual sources and
+        targets of edges. Since this file can be used in place of the original
+        `schema_config.yaml` file, it indicates that it is the extended schema
+        by setting `is_schema_info` to `true`.
+        We start by using the `extended_schema` dictionary from the ontology
+        class instance, which contains all expanded entities and relationships.
+        The information of whether something is a relationship can be gathered
+        from the deduplicator instance, which keeps track of all entities that
+        have been seen.
+        """
+        if not self._offline:
+            raise NotImplementedError(
+                "Cannot write schema info in online mode."
+            )
+        ontology = self._get_ontology()
+        schema = ontology.mapping.extended_schema
+        schema["is_schema_info"] = True
+        deduplicator = self._get_deduplicator()
+        for node in deduplicator.entity_types:
+            if node in schema.keys():
+                schema[node]["present_in_knowledge_graph"] = True
+                schema[node]["is_relationship"] = False
+            else:
+                logger.info(
+                    f"Node {node} not present in extended schema. "
+                    "Skipping schema info."
+                )
+        # find 'label_as_edge' cases in schema entries
+        changed_labels = {}
+        for k, v in schema.items():
+            if not isinstance(v, dict):
+                continue
+            if "label_as_edge" in v.keys():
+                if v["label_as_edge"] in deduplicator.seen_relationships.keys():
+                    changed_labels[v["label_as_edge"]] = k
+        for edge in deduplicator.seen_relationships.keys():
+            if edge in changed_labels.keys():
+                edge = changed_labels[edge]
+            if edge in schema.keys():
+                schema[edge]["present_in_knowledge_graph"] = True
+                schema[edge]["is_relationship"] = True
+                # TODO information about source and target nodes
+            else:
+                logger.info(
+                    f"Edge {edge} not present in extended schema. "
+                    "Skipping schema info."
+                )
+        # write to output directory as YAML file
+        path = os.path.join(self._output_directory, "schema_info.yaml")
+        with open(path, "w") as f:
+            f.write(yaml.dump(schema))
+        return schema
     # TRANSLATION METHODS ###
     def translate_term(self, term: str) -> str:
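
The annotation logic of `write_schema_info` above can be sketched as a standalone function. This is a simplified illustration under stated assumptions: `annotate_schema` is a hypothetical name, the inputs are plain dictionaries and sets standing in for the ontology's `extended_schema` and the deduplicator's `entity_types` and `seen_relationships`, and both logging and the YAML write are omitted. Unlike the method in the diff, the sketch copies the schema instead of modifying it in place. The method in the diff is annotated `-> None` yet ends with `return schema`; the sketch simply returns the dictionary.

```python
def annotate_schema(schema, entity_types, seen_relationships):
    """Return a copy of `schema` annotated with run-time presence flags."""
    # Shallow-copy each entry so the caller's schema is left untouched.
    schema = {k: dict(v) if isinstance(v, dict) else v for k, v in schema.items()}
    schema["is_schema_info"] = True

    for node in entity_types:
        if node in schema:
            schema[node]["present_in_knowledge_graph"] = True
            schema[node]["is_relationship"] = False

    # Map an overridden edge label (`label_as_edge`) back to its schema key.
    changed_labels = {
        v["label_as_edge"]: k
        for k, v in schema.items()
        if isinstance(v, dict) and v.get("label_as_edge") in seen_relationships
    }

    for edge in seen_relationships:
        edge = changed_labels.get(edge, edge)
        if edge in schema:
            schema[edge]["present_in_knowledge_graph"] = True
            schema[edge]["is_relationship"] = True

    return schema


example_schema = {
    "protein": {"represented_as": "node"},
    "gene": {"represented_as": "node"},
    "protein protein interaction": {
        "represented_as": "edge",
        "label_as_edge": "INTERACTS_WITH",
    },
}
info = annotate_schema(
    example_schema,
    entity_types={"protein"},
    seen_relationships={"INTERACTS_WITH": {"p1_p2"}},
)
print(info["protein protein interaction"])
```

Entries that were never seen, such as `gene` here, receive no flags at all, so a consumer has to treat a missing `present_in_knowledge_graph` key as "not present".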

biocypher-0.5.21/biocypher/_deduplicate.py ADDED

@@ -0,0 +1,147 @@
+from ._logger import logger
+logger.debug(f"Loading module {__name__}.")
+from ._create import BioCypherEdge, BioCypherNode, BioCypherRelAsNode
+class Deduplicator:
+    """
+    Singleton class responsible of deduplicating BioCypher inputs. Maintains
+    sets/dictionaries of node and edge types and their unique identifiers.
+    Nodes identifiers should be globally unique (represented as a set), while
+    edge identifiers are only unique per edge type (represented as a dict of
+    sets, keyed by edge type).
+    Stores collection of duplicate node and edge identifiers and types for
+    troubleshooting and to avoid overloading the log.
+    """
+    def __init__(self):
+        self.seen_entity_ids = set()
+        self.duplicate_entity_ids = set()
+        self.entity_types = set()
+        self.duplicate_entity_types = set()
+        self.seen_relationships = {}
+        self.duplicate_relationship_ids = set()
+        self.duplicate_relationship_types = set()
+    def node_seen(self, entity: BioCypherNode) -> bool:
+        """
+        Adds a node to the instance and checks if it has been seen before.
+        Args:
+            node: BioCypherNode to be added.
+        Returns:
+            True if the node has been seen before, False otherwise.
+        """
+        if entity.get_label() not in self.entity_types:
+            self.entity_types.add(entity.get_label())
+        if entity.get_id() in self.seen_entity_ids:
+            self.duplicate_entity_ids.add(entity.get_id())
+            if entity.get_label() not in self.duplicate_entity_types:
+                logger.warning(
+                    f"Duplicate node type {entity.get_label()} found. "
+                )
+                self.duplicate_entity_types.add(entity.get_label())
+            return True
+        self.seen_entity_ids.add(entity.get_id())
+        return False
+    def edge_seen(self, relationship: BioCypherEdge) -> bool:
+        """
+        Adds an edge to the instance and checks if it has been seen before.
+        Args:
+            edge: BioCypherEdge to be added.
+        Returns:
+            True if the edge has been seen before, False otherwise.
+        """
+        if relationship.get_type() not in self.seen_relationships:
+            self.seen_relationships[relationship.get_type()] = set()
+        # concatenate source and target if no id is present
+        if not relationship.get_id():
+            _id = (
+                f"{relationship.get_source_id()}_{relationship.get_target_id()}"
+            )
+        else:
+            _id = relationship.get_id()
+        if _id in self.seen_relationships[relationship.get_type()]:
+            self.duplicate_relationship_ids.add(_id)
+            if relationship.get_type() not in self.duplicate_relationship_types:
+                logger.warning(
+                    f"Duplicate edge type {relationship.get_type()} found. "
+                )
+                self.duplicate_relationship_types.add(relationship.get_type())
+            return True
+        self.seen_relationships[relationship.get_type()].add(_id)
+        return False
+    def rel_as_node_seen(self, rel_as_node: BioCypherRelAsNode) -> bool:
+        """
+        Adds a rel_as_node to the instance (one entity and two relationships)
+        and checks if it has been seen before. Only the node is relevant for
+        identifying the rel_as_node as a duplicate.
+        Args:
+            rel_as_node: BioCypherRelAsNode to be added.
+        Returns:
+            True if the rel_as_node has been seen before, False otherwise.
+        """
+        node = rel_as_node.get_node()
+        if node.get_label() not in self.seen_relationships:
+            self.seen_relationships[node.get_label()] = set()
+        # rel as node always has an id
+        _id = node.get_id()
+        if _id in self.seen_relationships[node.get_type()]:
+            self.duplicate_relationship_ids.add(_id)
+            if node.get_type() not in self.duplicate_relationship_types:
+                logger.warning(f"Duplicate edge type {node.get_type()} found. ")
+                self.duplicate_relationship_types.add(node.get_type())
+            return True
+        self.seen_relationships[node.get_type()].add(_id)
+        return False
+    def get_duplicate_nodes(self):
+        """
+        Function to return a list of duplicate nodes.
+        Returns:
+            list: list of duplicate nodes
+        """
+        if self.duplicate_entity_types:
+            return (self.duplicate_entity_types, self.duplicate_entity_ids)
+        else:
+            return None
+    def get_duplicate_edges(self):
+        """
+        Function to return a list of duplicate edges.
+        Returns:
+            list: list of duplicate edges
+        """
+        if self.duplicate_relationship_types:
+            return (
+                self.duplicate_relationship_types,
+                self.duplicate_relationship_ids,
+            )
+        else:
+            return None
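
The edge branch of the `Deduplicator` above keeps one set of identifiers per edge type and falls back to a `source_target` identifier when an edge has no id of its own. A reduced sketch of just that behaviour follows. `Edge` and `EdgeDeduplicator` are hypothetical stand-ins for `BioCypherEdge` and `Deduplicator`, and the warning log calls are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Edge:
    """Hypothetical stand-in for BioCypherEdge."""
    source: str
    target: str
    type: str
    id: Optional[str] = None


class EdgeDeduplicator:
    """Reduced illustration of per-type edge deduplication."""

    def __init__(self):
        self.seen = {}  # edge type -> set of identifiers seen for that type
        self.duplicate_ids = set()
        self.duplicate_types = set()

    def edge_seen(self, edge):
        ids = self.seen.setdefault(edge.type, set())
        # Fall back to "source_target" when the edge carries no id.
        _id = edge.id or f"{edge.source}_{edge.target}"
        if _id in ids:
            self.duplicate_ids.add(_id)
            self.duplicate_types.add(edge.type)
            return True
        ids.add(_id)
        return False


dedup = EdgeDeduplicator()
first = dedup.edge_seen(Edge("a", "b", "BINDS"))
repeat = dedup.edge_seen(Edge("a", "b", "BINDS"))
other_type = dedup.edge_seen(Edge("a", "b", "INHIBITS"))
print(first, repeat, other_type)
```

Because identifiers are scoped by type, the same source and target pair counts as new under a different edge type, which matches the class docstring's statement that edge identifiers are only unique per edge type.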
