henge 0.2.1__tar.gz → 0.2.3__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
henge-0.2.3/PKG-INFO ADDED
@@ -0,0 +1,132 @@
1
+ Metadata-Version: 2.4
2
+ Name: henge
3
+ Version: 0.2.3
4
+ Summary: Storage and retrieval of object-derived, decomposable recursive unique identifiers.
5
+ Home-page: https://databio.org
6
+ Author: Nathan Sheffield
7
+ Author-email: nathan@code.databio.org
8
+ License: BSD2
9
+ Classifier: Development Status :: 4 - Beta
10
+ Classifier: License :: OSI Approved :: BSD License
11
+ Classifier: Programming Language :: Python :: 3.10
12
+ Classifier: Programming Language :: Python :: 3.11
13
+ Classifier: Programming Language :: Python :: 3.12
14
+ Classifier: Programming Language :: Python :: 3.13
15
+ Classifier: Topic :: System :: Distributed Computing
16
+ Requires-Python: >=3.10
17
+ Description-Content-Type: text/markdown
18
+ License-File: LICENSE.txt
19
+ Requires-Dist: jsonschema
20
+ Requires-Dist: ubiquerg>=0.5.2
21
+ Requires-Dist: yacman>=0.6.7
22
+ Dynamic: author
23
+ Dynamic: author-email
24
+ Dynamic: classifier
25
+ Dynamic: description
26
+ Dynamic: description-content-type
27
+ Dynamic: home-page
28
+ Dynamic: keywords
29
+ Dynamic: license
30
+ Dynamic: license-file
31
+ Dynamic: requires-dist
32
+ Dynamic: requires-python
33
+ Dynamic: summary
34
+
35
+ # Henge
36
+
37
+ Henge is a Python package for building data storage and retrieval interfaces for arbitrary data. Henge is based on the idea of **decomposable recursive unique identifiers (DRUIDs)**, which are hash-based unique identifiers for data derived from the data itself. For arbitrary data with any structure, Henge can mint unique DRUIDs to identify data, store the data in a key-value database of your choice, and provide lookup functions to retrieve the data in its original structure using its DRUID identifier.
38
+
39
+ Henge was intended as a building block for [sequence collections](https://github.com/refgenie/seqcol), but is generic enough to use for any data type that needs content-derived identifiers with database lookup capability.
40
+
41
+ ## Install
42
+
43
+ ```
44
+ pip install henge
45
+ ```
46
+
47
+ ## Quick Start
48
+
49
+ Create a Henge object by providing a database and a data schema. The database can be a Python dict or backed by persistent storage. Data schemas are [JSON-schema](https://json-schema.org/) descriptions of data types, and can be hierarchical.
50
+
51
+ ```python
52
+ import henge
53
+
54
+ schemas = ["path/to/json_schema.yaml"]
55
+ h = henge.Henge(database={}, schemas=schemas)
56
+ ```
57
+
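The `item_type="person"` used below refers to one of the loaded schemas. The actual schema file is not shown in this quick start, but since henge schemas are ordinary [JSON-schema](https://json-schema.org/) documents written in YAML, a hypothetical `person` schema consistent with the example might look like:

```yaml
# path/to/json_schema.yaml -- hypothetical example, not shipped with the package
description: person
type: object
properties:
  name:
    type: string
  age:
    type: integer
required:
  - name
```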
58
+ Insert items into the henge. Upon insert, henge returns the DRUID (digest/checksum/unique identifier) for your object:
59
+
60
+ ```python
61
+ druid = h.insert({"name": "Pat", "age": 38}, item_type="person")
62
+ ```
63
+
64
+ Retrieve the original object using the DRUID:
65
+
66
+ ```python
67
+ h.retrieve(druid)
68
+ # {'age': '38', 'name': 'Pat'}
69
+ ```
70
+
71
+ ## Tutorial
72
+
73
+ For a comprehensive walkthrough covering basic types, arrays, nested objects, and advanced features, see the [tutorial notebook](docs/tutorial.ipynb).
74
+
75
+ ## What are DRUIDs?
76
+
77
+ DRUIDs are a special type of unique identifier with two powerful properties:
78
+
79
+ - **Decomposable**: Identifiers in henge automatically retrieve structured data (tuples, arrays, objects). The structure is defined by a JSON schema, so henge can be used as a back-end for arbitrary data types.
80
+
81
+ - **Recursive**: Individual elements retrieved by henge can be tagged as recursive, meaning these attributes contain their own DRUIDs. Henge can recurse through these, allowing you to mint unique identifiers for arbitrary nested data structures.
82
+
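The recursive behavior is exercised by this release's own `tests/test_henge.py` (`test_retrieve_recurses`); a condensed sketch of that pattern, using the schema files shipped under `tests/data/`:

```python
import henge

h = henge.Henge(
    database={},
    schemas=["tests/data/sequence.yaml", "tests/data/annotated_sequence_digest.yaml"],
)
seq_digest = h.insert("ATGCAGTA", item_type="sequence")
anno = {
    "name": "seq1",
    "length": 10,
    "topology": "linear",
    "sequence_digest": seq_digest,  # embed the child DRUID
}
asd_druid = h.insert(anno, item_type="annotated_sequence_digest")
h.retrieve(asd_druid)  # recurses through sequence_digest; reclimit caps the depth
```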
83
+ A DRUID is ultimately the result of a digest operation (such as `md5` or `sha256`) on some data. Because DRUIDs are computed deterministically from the item, they represent globally unique identifiers. If you insert the same item repeatedly, it will produce the same DRUID -- this is true across henges as long as they share a data schema.
84
+
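To make the determinism concrete, here is a toy sketch of the idea; henge's real canonicalization also involves the schema (see the `canonical_str` helper), so this is illustrative only:

```python
import hashlib
import json

def toy_druid(item: dict) -> str:
    # Canonical string: stable key order, no insignificant whitespace
    canonical = json.dumps(item, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode()).hexdigest()

# Same content, different key order -> same identifier
assert toy_druid({"name": "Pat", "age": 38}) == toy_druid({"age": 38, "name": "Pat"})
```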
85
+ ## Persisting Data
86
+
87
+ ### In-memory (default)
88
+
89
+ Use a Python `dict` as the database for testing or ephemeral use:
90
+
91
+ ```python
92
+ h = henge.Henge(database={}, schemas=schemas)
93
+ ```
94
+
95
+ ### SQLite backend
96
+
97
+ For persistent storage with SQLite:
98
+
99
+ ```python
100
+ from sqlitedict import SqliteDict
101
+
102
+ mydict = SqliteDict('./my_db.sqlite', autocommit=True)
103
+ h = henge.Henge(mydict, schemas=schemas)
104
+ ```
105
+
106
+ Requires: `pip install sqlitedict`
107
+
108
+ ### MongoDB backend
109
+
110
+ For production use with MongoDB:
111
+
112
+ 1. **Start MongoDB with Docker:**
113
+
114
+ ```bash
115
+ docker run --network="host" mongo
116
+ ```
117
+
118
+ For persistent storage, mount a volume to `/data/db`:
119
+
120
+ ```bash
121
+ docker run -it --network="host" -v /path/to/data:/data/db mongo
122
+ ```
123
+
124
+ 2. **Connect henge to MongoDB:**
125
+
126
+ ```python
127
+ import henge
128
+
129
+ h = henge.Henge(henge.connect_mongo(), schemas=schemas)
130
+ ```
131
+
132
+ Requires: `pip install pymongo mongodict`
henge-0.2.3/README.md ADDED
@@ -0,0 +1,98 @@
1
+ # Henge
2
+
3
+ Henge is a Python package for building data storage and retrieval interfaces for arbitrary data. Henge is based on the idea of **decomposable recursive unique identifiers (DRUIDs)**, which are hash-based unique identifiers for data derived from the data itself. For arbitrary data with any structure, Henge can mint unique DRUIDs to identify data, store the data in a key-value database of your choice, and provide lookup functions to retrieve the data in its original structure using its DRUID identifier.
4
+
5
+ Henge was intended as a building block for [sequence collections](https://github.com/refgenie/seqcol), but is generic enough to use for any data type that needs content-derived identifiers with database lookup capability.
6
+
7
+ ## Install
8
+
9
+ ```
10
+ pip install henge
11
+ ```
12
+
13
+ ## Quick Start
14
+
15
+ Create a Henge object by providing a database and a data schema. The database can be a Python dict or backed by persistent storage. Data schemas are [JSON-schema](https://json-schema.org/) descriptions of data types, and can be hierarchical.
16
+
17
+ ```python
18
+ import henge
19
+
20
+ schemas = ["path/to/json_schema.yaml"]
21
+ h = henge.Henge(database={}, schemas=schemas)
22
+ ```
23
+
24
+ Insert items into the henge. Upon insert, henge returns the DRUID (digest/checksum/unique identifier) for your object:
25
+
26
+ ```python
27
+ druid = h.insert({"name": "Pat", "age": 38}, item_type="person")
28
+ ```
29
+
30
+ Retrieve the original object using the DRUID:
31
+
32
+ ```python
33
+ h.retrieve(druid)
34
+ # {'age': '38', 'name': 'Pat'}
35
+ ```
36
+
37
+ ## Tutorial
38
+
39
+ For a comprehensive walkthrough covering basic types, arrays, nested objects, and advanced features, see the [tutorial notebook](docs/tutorial.ipynb).
40
+
41
+ ## What are DRUIDs?
42
+
43
+ DRUIDs are a special type of unique identifier with two powerful properties:
44
+
45
+ - **Decomposable**: Identifiers in henge automatically retrieve structured data (tuples, arrays, objects). The structure is defined by a JSON schema, so henge can be used as a back-end for arbitrary data types.
46
+
47
+ - **Recursive**: Individual elements retrieved by henge can be tagged as recursive, meaning these attributes contain their own DRUIDs. Henge can recurse through these, allowing you to mint unique identifiers for arbitrary nested data structures.
48
+
49
+ A DRUID is ultimately the result of a digest operation (such as `md5` or `sha256`) on some data. Because DRUIDs are computed deterministically from the item, they represent globally unique identifiers. If you insert the same item repeatedly, it will produce the same DRUID -- this is true across henges as long as they share a data schema.
50
+
51
+ ## Persisting Data
52
+
53
+ ### In-memory (default)
54
+
55
+ Use a Python `dict` as the database for testing or ephemeral use:
56
+
57
+ ```python
58
+ h = henge.Henge(database={}, schemas=schemas)
59
+ ```
60
+
61
+ ### SQLite backend
62
+
63
+ For persistent storage with SQLite:
64
+
65
+ ```python
66
+ from sqlitedict import SqliteDict
67
+
68
+ mydict = SqliteDict('./my_db.sqlite', autocommit=True)
69
+ h = henge.Henge(mydict, schemas=schemas)
70
+ ```
71
+
72
+ Requires: `pip install sqlitedict`
73
+
74
+ ### MongoDB backend
75
+
76
+ For production use with MongoDB:
77
+
78
+ 1. **Start MongoDB with Docker:**
79
+
80
+ ```bash
81
+ docker run --network="host" mongo
82
+ ```
83
+
84
+ For persistent storage, mount a volume to `/data/db`:
85
+
86
+ ```bash
87
+ docker run -it --network="host" -v /path/to/data:/data/db mongo
88
+ ```
89
+
90
+ 2. **Connect henge to MongoDB:**
91
+
92
+ ```python
93
+ import henge
94
+
95
+ h = henge.Henge(henge.connect_mongo(), schemas=schemas)
96
+ ```
97
+
98
+ Requires: `pip install pymongo mongodict`
@@ -9,4 +9,5 @@ __all__ = __classes__ + [
9
9
  "split_schema",
10
10
  "NotFoundException",
11
11
  "canonical_str",
12
+ "sha512t24u_digest",
12
13
  ]
@@ -0,0 +1 @@
1
+ __version__ = "0.2.3"
@@ -1,10 +1,11 @@
1
- """ An interface to a database back-end for DRUIDs """
1
+ """An interface to a database back-end for DRUIDs"""
2
2
 
3
+ import base64
3
4
  import copy
4
5
  import hashlib
5
6
  import jsonschema
6
- import logging
7
7
  import json
8
+ import logging
8
9
  import os
9
10
  import sys
10
11
  import yacman
@@ -28,6 +29,13 @@ class NotFoundException(Exception):
28
29
  return self.message
29
30
 
30
31
 
32
+ def sha512t24u_digest(seq: str, offset: int = 24) -> str:
33
+ """GA4GH digest function"""
34
+ digest = hashlib.sha512(seq.encode()).digest()
35
+ tdigest_b64us = base64.urlsafe_b64encode(digest[:offset])
36
+ return tdigest_b64us.decode("ascii")
37
+
38
+
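The new `sha512t24u_digest` helper is also added to `__all__` in this release; a brief usage sketch, assuming it is importable from the package top level:

```python
from henge import sha512t24u_digest  # import path assumed from the __all__ change

digest = sha512t24u_digest("ACGT")
# SHA-512, truncated to 24 bytes, URL-safe base64-encoded -> a 32-character string
assert len(digest) == 32
```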
31
39
  def md5(seq):
32
40
  return hashlib.md5(seq.encode()).hexdigest()
33
41
 
@@ -49,14 +57,18 @@ def read_url(url):
49
57
  raise e
50
58
  data = response.read() # a `bytes` object
51
59
  text = data.decode("utf-8")
52
- print(text)
53
60
  return yaml.safe_load(text)
54
61
 
55
62
 
56
63
  class Henge(object):
57
64
  def __init__(
58
- self, database, schemas, schemas_str=[], henges=None, checksum_function=md5
59
- ):
65
+ self,
66
+ database: dict,
67
+ schemas: list[str],
68
+ schemas_str: list[str] = None,
69
+ henges: dict = None,
70
+ checksum_function: callable = md5,
71
+ ) -> None:
60
72
  """
61
73
  A user interface to insert and retrieve decomposable recursive unique
62
74
  identifiers (DRUIDs).
@@ -107,7 +119,7 @@ class Henge(object):
107
119
  )
108
120
  # populated_schemas.append(yaml.safe_load(schema_value))
109
121
 
110
- for schema_value in schemas_str:
122
+ for schema_value in schemas_str or []:
111
123
  populated_schemas.append(yaml.safe_load(schema_value))
112
124
 
113
125
  split_schemas = {}
@@ -132,7 +144,9 @@ class Henge(object):
132
144
  self.schemas[item_type] = henge.schemas[item_type]
133
145
  self.henges[item_type] = henge
134
146
 
135
- def retrieve(self, druid, reclimit=None, raw=False):
147
+ def retrieve(
148
+ self, druid: str, reclimit: int = None, raw: bool = False
149
+ ) -> dict | list:
136
150
  """
137
151
  Retrieve an item given a digest
138
152
 
@@ -194,7 +208,7 @@ class Henge(object):
194
208
  def lookup(self, druid, item_type):
195
209
  try:
196
210
  henge_to_query = self.henges[item_type]
197
- except:
211
+ except KeyError:
198
212
  _LOGGER.debug("No henges available for this item type")
199
213
  raise NotFoundException(druid)
200
214
  try:
@@ -228,7 +242,9 @@ class Henge(object):
228
242
  continue
229
243
  return valid_schemas
230
244
 
231
- def insert(self, item, item_type, reclimit=None):
245
+ def insert(
246
+ self, item: dict | list, item_type: str, reclimit: int = None
247
+ ) -> str | bool:
232
248
  """
233
249
  Add structured items of a specified type to the database.
234
250
 
@@ -243,8 +259,9 @@ class Henge(object):
243
259
 
244
260
  if item_type not in self.schemas.keys():
245
261
  _LOGGER.error(
246
- "I don't know about items of type '{}'. "
247
- "I know of: '{}'".format(item_type, list(self.schemas.keys()))
262
+ "I don't know about items of type '{}'. I know of: '{}'".format(
263
+ item_type, list(self.schemas.keys())
264
+ )
248
265
  )
249
266
  return False
250
267
 
@@ -328,8 +345,9 @@ class Henge(object):
328
345
  """
329
346
  if item_type not in self.schemas.keys():
330
347
  _LOGGER.error(
331
- "I don't know about items of type '{}'. "
332
- "I know of: '{}'".format(item_type, list(self.schemas.keys()))
348
+ "I don't know about items of type '{}'. I know of: '{}'".format(
349
+ item_type, list(self.schemas.keys())
350
+ )
333
351
  )
334
352
  return False
335
353
 
@@ -349,7 +367,7 @@ class Henge(object):
349
367
  item_type, item
350
368
  )
351
369
  )
352
- print(e)
370
+ _LOGGER.error(e)
353
371
 
354
372
  if isinstance(item, str):
355
373
  henge_to_query = self.henges[item_type]
@@ -370,7 +388,6 @@ class Henge(object):
370
388
  return item
371
389
 
372
390
  raise e
373
- return None
374
391
 
375
392
  _LOGGER.debug(f"item to insert: {item}")
376
393
  item_inherent_split = select_inherent_properties(item, valid_schema)
@@ -408,17 +425,14 @@ class Henge(object):
408
425
 
409
426
  henge_to_query = self.henges[item_type]
410
427
  # _LOGGER.debug("henge_to_query: {}".format(henge_to_query))
411
- try:
412
- henge_to_query.database[druid] = string
413
- henge_to_query.database[druid + ITEM_TYPE] = item_type
414
- henge_to_query.database[druid + "_digest_version"] = digest_version
415
- henge_to_query.database[druid + "_external_string"] = external_string
416
-
417
- if henge_to_query != self:
418
- self.database[druid + ITEM_TYPE] = item_type
419
- self.database[druid + "_digest_version"] = digest_version
420
- except Exception as e:
421
- raise e
428
+ henge_to_query.database[druid] = string
429
+ henge_to_query.database[druid + ITEM_TYPE] = item_type
430
+ henge_to_query.database[druid + "_digest_version"] = digest_version
431
+ henge_to_query.database[druid + "_external_string"] = external_string
432
+
433
+ if henge_to_query != self:
434
+ self.database[druid + ITEM_TYPE] = item_type
435
+ self.database[druid + "_digest_version"] = digest_version
422
436
 
423
437
  def clean(self):
424
438
  """
@@ -440,7 +454,21 @@ class Henge(object):
440
454
  Show all items in the database.
441
455
  """
442
456
  for k, v in self.database.items():
443
- print(k, v)
457
+ _LOGGER.info(f"{k} {v}")
458
+
459
+ def __len__(self):
460
+ return len(self.database)
461
+
462
+ def list(self, limit=1000, offset=0):
463
+ """
464
+ List all items in the database.
465
+ """
466
+ return {
467
+ "count": len(self.database),
468
+ "limit": limit,
469
+ "offset": offset,
470
+ "items": list(self.database.keys())[offset : (offset + limit)],
471
+ }
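A sketch of how the new `list()` method behaves, following directly from the code above (note that the listed keys include henge's auxiliary bookkeeping entries, such as item type and digest version, not only the DRUIDs):

```python
# Assuming an existing Henge instance `h` with some inserted items
page = h.list(limit=10, offset=0)
# page == {
#     "count": <total number of database keys>,
#     "limit": 10,
#     "offset": 0,
#     "items": [<up to 10 keys, including auxiliary bookkeeping entries>],
# }
```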
444
472
 
445
473
  def __repr__(self):
446
474
  repr = "Henge object. Item types: " + ",".join(self.item_types)
@@ -0,0 +1,359 @@
1
+ import logging
2
+ import os
3
+ import psycopg2
4
+
5
+ from collections.abc import Mapping
6
+ from psycopg2 import OperationalError, sql
7
+ from psycopg2.errors import UniqueViolation
8
+
9
+ _LOGGER = logging.getLogger(__name__)
10
+
11
+ # Use like:
12
+ # pgdb = RDBDict(...) # Open connection
13
+ # pgdb["key"] = "value" # Insert item
14
+ # pgdb["key"] # Retrieve item
15
+ # pgdb.close() # Close connection
16
+
17
+
18
+ # This was originally written in seqcolapi.
19
+ # I am moving it here in 2025, because the whole point was to enable
20
+ # interesting database back-ends to have dict-style key-value pair
21
+ # mechanisms, which enabled henge to use these various backends
22
+ # to back arbitrary databases.
23
+ # With the move to sqlmodel, I abandoned the henge backend approach,
24
+ # so intermediates are no longer important for seqcol.
25
+
26
+ # They could become relevant for other henge use cases, so they
27
+ # fit better here now.
28
+
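Tying the comments above together, a minimal sketch of using this class as a henge backend, assuming the module is importable as `henge.scconf` (per the `henge/scconf.py` entry added to SOURCES.txt in this release) and that the `POSTGRES_*` environment variables read by the constructor are set:

```python
from henge import Henge
from henge.scconf import RDBDict  # import path assumed, per henge/scconf.py in SOURCES.txt

pgdb = RDBDict(db_table="henge_kv")  # unset arguments fall back to POSTGRES_* env vars
pgdb.init_table()                    # create the key/value table if it does not exist
h = Henge(database=pgdb, schemas=["path/to/json_schema.yaml"])
druid = h.insert({"name": "Pat", "age": 38}, item_type="person")
print(h.retrieve(druid))
pgdb.close()
```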
29
+
30
+ def getenv(varname):
31
+ """Simple wrapper to make the Exception more informative for missing env var"""
32
+ try:
33
+ return os.environ[varname]
34
+ except KeyError:
35
+ raise Exception(f"Environment variable {varname} not set.")
36
+
37
+
38
+ import pipestat
39
+
40
+
41
+ class PipestatMapping(pipestat.PipestatManager):
42
+ """A wrapper class to allow using a PipestatManager as a dict-like object."""
43
+
44
+ def __getitem__(self, key):
45
+ # This little hack makes this work with `in`;
46
+ # e.g.: for x in rdbdict, which is now disabled, instead of infinite.
47
+ if isinstance(key, int):
48
+ raise IndexError
49
+ return self.retrieve(key)
50
+
51
+ def __setitem__(self, key, value):
52
+ return self.insert({key: value})
53
+
54
+ def __len__(self):
55
+ return self.count_records()
56
+
57
+ def _next_page(self):
58
+ self._buf["page_index"] += 1
59
+ limit = self._buf["page_size"]
60
+ offset = self._buf["page_index"] * limit
61
+ self._buf["keys"] = self.get_records(limit, offset)
62
+ return self._buf["keys"][0]
63
+
64
+ def __iter__(self):
65
+ _LOGGER.debug("Iterating...")
66
+ self._buf = { # buffered iterator
67
+ "current_view_index": 0,
68
+ "len": len(self),
69
+ "page_size": 100,
70
+ "page_index": -1,
71
+ "keys": self._next_page(),
72
+ }
73
+ return self
74
+
75
+ def __next__(self):
76
+ if self._buf["current_view_index"] > self._buf["len"]:
77
+ raise StopIteration
78
+
79
+ idx = (
80
+ self._buf["current_view_index"]
81
+ - self._buf["page_index"] * self._buf["page_size"]
82
+ )
83
+ if idx <= self._buf["page_size"]:
84
+ self._buf["current_view_index"] += 1
85
+ return self._buf["keys"][idx - 1]
86
+ else: # current index is beyond current page, but not beyond total
87
+ return self._next_page()
88
+
89
+
90
+ class RDBDict(Mapping):
91
+ """
92
+ A Relational DataBase Dict.
93
+
94
+ Simple database connection manager object that allows us to use a
95
+ PostgresQL database as a simple key-value store to back Python
96
+ dict-style access to database items.
97
+ """
98
+
99
+ def __init__(
100
+ self,
101
+ db_name: str = None,
102
+ db_user: str = None,
103
+ db_password: str = None,
104
+ db_host: str = None,
105
+ db_port: str = None,
106
+ db_table: str = None,
107
+ ):
108
+ self.connection = None
109
+ self.db_name = db_name or getenv("POSTGRES_DB")
110
+ self.db_user = db_user or getenv("POSTGRES_USER")
111
+ self.db_host = db_host or os.environ.get("POSTGRES_HOST") or "localhost"
112
+ self.db_port = db_port or os.environ.get("POSTGRES_PORT") or "5432"
113
+ self.db_table = db_table or os.environ.get("POSTGRES_TABLE") or "seqcol"
114
+ db_password = db_password or getenv("POSTGRES_PASSWORD")
115
+
116
+ try:
117
+ self.connection = self.create_connection(
118
+ self.db_name, self.db_user, db_password, self.db_host, self.db_port
119
+ )
120
+ if not self.connection:
121
+ raise Exception("Connection failed")
122
+ except Exception as e:
123
+ _LOGGER.info(f"{self}")
124
+ raise e
125
+ _LOGGER.info(self.connection)
126
+ self.connection.autocommit = True
127
+
128
+ def __repr__(self):
129
+ return (
130
+ "RDBD object\n"
131
+ + "db_table: {}\n".format(self.db_table)
132
+ + "db_name: {}\n".format(self.db_name)
133
+ + "db_user: {}\n".format(self.db_user)
134
+ + "db_host: {}\n".format(self.db_host)
135
+ + "db_port: {}\n".format(self.db_port)
136
+ )
137
+
138
+ def init_table(self):
139
+ # Wrap statements to prevent SQL injection attacks
140
+ stmt = sql.SQL(
141
+ """
142
+ CREATE TABLE IF NOT EXISTS {table}(
143
+ key TEXT PRIMARY KEY,
144
+ value TEXT);
145
+ """
146
+ ).format(table=sql.Identifier(self.db_table))
147
+ return self.execute_query(stmt, params=None)
148
+
149
+ def insert(self, key, value):
150
+ stmt = sql.SQL(
151
+ """
152
+ INSERT INTO {table}(key, value)
153
+ VALUES (%(key)s, %(value)s);
154
+ """
155
+ ).format(table=sql.Identifier(self.db_table))
156
+ params = {"key": key, "value": value}
157
+ return self.execute_query(stmt, params)
158
+
159
+ def update(self, key, value):
160
+ stmt = sql.SQL(
161
+ """
162
+ UPDATE {table} SET value=%(value)s WHERE key=%(key)s
163
+ """
164
+ ).format(table=sql.Identifier(self.db_table))
165
+ params = {"key": key, "value": value}
166
+ return self.execute_query(stmt, params)
167
+
168
+ def __getitem__(self, key):
169
+ # This little hack makes this work with `in`;
170
+ # e.g.: for x in rdbdict, which is now disabled, instead of infinite.
171
+ if isinstance(key, int):
172
+ raise IndexError
173
+ stmt = sql.SQL(
174
+ """
175
+ SELECT value FROM {table} WHERE key=%(key)s
176
+ """
177
+ ).format(table=sql.Identifier(self.db_table))
178
+ params = {"key": key}
179
+ res = self.execute_read_query(stmt, params)
180
+ if not res:
181
+ _LOGGER.info("Not found: {}".format(key))
182
+ return res
183
+
184
+ def __setitem__(self, key, value):
185
+ try:
186
+ return self.insert(key, value)
187
+ except UniqueViolation as e:
188
+ _LOGGER.info("Updating existing value for {}".format(key))
189
+ return self.update(key, value)
190
+
191
+ def __delitem__(self, key):
192
+ stmt = sql.SQL(
193
+ """
194
+ DELETE FROM {table} WHERE key=%(key)s
195
+ """
196
+ ).format(table=sql.Identifier(self.db_table))
197
+ params = {"key": key}
198
+ res = self.execute_query(stmt, params)
199
+ return res
200
+
201
+ def create_connection(self, db_name, db_user, db_password, db_host, db_port):
202
+ connection = None
203
+ try:
204
+ connection = psycopg2.connect(
205
+ database=db_name,
206
+ user=db_user,
207
+ password=db_password,
208
+ host=db_host,
209
+ port=db_port,
210
+ )
211
+ _LOGGER.info("Connection to PostgreSQL DB successful")
212
+ except OperationalError as e:
213
+ _LOGGER.info("Error: {e}".format(e=str(e)))
214
+ return connection
215
+
216
+ def execute_read_query(self, query, params=None):
217
+ cursor = self.connection.cursor()
218
+ result = None
219
+ try:
220
+ cursor.execute(query, params)
221
+ result = cursor.fetchone()
222
+ if result:
223
+ return result[0]
224
+ else:
225
+ _LOGGER.debug(f"Query: {query}")
226
+ _LOGGER.debug(f"Result: {result}")
227
+ return None
228
+ except OperationalError as e:
229
+ _LOGGER.info("Error: {e}".format(e=str(e)))
230
+ raise
231
+ except TypeError as e:
232
+ _LOGGER.info("TypeError: {e}, item: {q}".format(e=str(e), q=query))
233
+ raise
234
+
235
+ def execute_multi_query(self, query, params=None):
236
+ cursor = self.connection.cursor()
237
+ result = None
238
+ try:
239
+ cursor.execute(query, params)
240
+ result = cursor.fetchall()
241
+ return result
242
+ except OperationalError as e:
243
+ _LOGGER.info("Error: {e}".format(e=str(e)))
244
+ raise
245
+ except TypeError as e:
246
+ _LOGGER.info("TypeError: {e}, item: {q}".format(e=str(e), q=query))
247
+ raise
248
+
249
+ def execute_query(self, query, params=None):
250
+ cursor = self.connection.cursor()
251
+ try:
252
+ return cursor.execute(query, params)
253
+ _LOGGER.info("Query executed successfully")
254
+ except OperationalError as e:
255
+ _LOGGER.info("Error: {e}".format(e=str(e)))
256
+
257
+ def close(self):
258
+ _LOGGER.info("Closing connection")
259
+ return self.connection.close()
260
+
261
+ def __del__(self):
262
+ if self.connection:
263
+ self.close()
264
+
265
+ def __len__(self):
266
+ stmt = sql.SQL(
267
+ """
268
+ SELECT COUNT(*) FROM {table}
269
+ """
270
+ ).format(table=sql.Identifier(self.db_table))
271
+ _LOGGER.debug(stmt)
272
+ res = self.execute_read_query(stmt)
273
+ return res
274
+
275
+ def get_paged_keys(self, limit=None, offset=None):
276
+ stmt = sql.SQL("SELECT key FROM {table}").format(
277
+ table=sql.Identifier(self.db_table)
278
+ )
279
+ params = {}
280
+ if limit is not None:
281
+ stmt = sql.SQL("{} LIMIT %(limit)s").format(stmt)
282
+ params["limit"] = limit
283
+ if offset is not None:
284
+ stmt = sql.SQL("{} OFFSET %(offset)s").format(stmt)
285
+ params["offset"] = offset
286
+ res = self.execute_multi_query(stmt, params if params else None)
287
+ return res
288
+
289
+ def _next_page(self):
290
+ self._buf["page_index"] += 1
291
+ limit = self._buf["page_size"]
292
+ offset = self._buf["page_index"] * limit
293
+ self._buf["keys"] = self.get_paged_keys(limit, offset)
294
+ return self._buf["keys"][0]
295
+
296
+ def __iter__(self):
297
+ _LOGGER.debug("Iterating...")
298
+ self._buf = { # buffered iterator
299
+ "current_view_index": 0,
300
+ "len": len(self),
301
+ "page_size": 10,
302
+ "page_index": 0,
303
+ "keys": self.get_paged_keys(10, 0),
304
+ }
305
+ return self
306
+
307
+ def __next__(self):
308
+ if self._buf["current_view_index"] > self._buf["len"]:
309
+ raise StopIteration
310
+
311
+ idx = (
312
+ self._buf["current_view_index"]
313
+ - self._buf["page_index"] * self._buf["page_size"]
314
+ )
315
+ if idx <= self._buf["page_size"]:
316
+ self._buf["current_view_index"] += 1
317
+ return self._buf["keys"][idx - 1]
318
+ else: # current index is beyond current page, but not beyond total
319
+ return self._next_page()
320
+
321
+ # Old, non-paged iterator:
322
+ # def __iter__(self):
323
+ # self._current_idx = 0
324
+ # return self
325
+
326
+ # def __next__(self):
327
+ # stmt = sql.SQL(
328
+ # """
329
+ # SELECT key,value FROM {table} LIMIT 1 OFFSET %(idx)s
330
+ # """
331
+ # ).format(table=sql.Identifier(self.db_table))
332
+ # res = self.execute_read_query(stmt, {"idx": self._current_idx})
333
+ # self._current_idx += 1
334
+ # if not res:
335
+ # _LOGGER.info("Not found: {}".format(self._current_idx))
336
+ # raise StopIteration
337
+ # return res
338
+
339
+
340
+ # We don't need the full SeqColHenge,
341
+ # which also has loading capability, and requires pyfaidx, which requires
342
+ # biopython, which requires numpy, which is huge and can't compile in the
343
+ # default fastapi container.
344
+ # So, I had written the below class which provides retrieve only.
345
+ # HOWEVER, switching from alpine to slim allows install of numpy;
346
+ # This inflates the container size from 262Mb to 350Mb; perhaps that's worth paying.
347
+ # So I can avoid duplicating this and just use the full SeqColHenge from seqcol
348
+ # class SeqColHenge(refget.RefGetClient):
349
+ # def retrieve(self, druid, reclimit=None, raw=False):
350
+ # try:
351
+ # return super(SeqColHenge, self).retrieve(druid, reclimit, raw)
352
+ # except henge.NotFoundException as e:
353
+ # _LOGGER.debug(e)
354
+ # try:
355
+ # return self.refget(druid)
356
+ # except Exception as e:
357
+ # _LOGGER.debug(e)
358
+ # raise e
359
+ # return henge.NotFoundException("{} not found in database, or in refget.".format(druid))
@@ -0,0 +1,132 @@
1
+ Metadata-Version: 2.4
2
+ Name: henge
3
+ Version: 0.2.3
4
+ Summary: Storage and retrieval of object-derived, decomposable recursive unique identifiers.
5
+ Home-page: https://databio.org
6
+ Author: Nathan Sheffield
7
+ Author-email: nathan@code.databio.org
8
+ License: BSD2
9
+ Classifier: Development Status :: 4 - Beta
10
+ Classifier: License :: OSI Approved :: BSD License
11
+ Classifier: Programming Language :: Python :: 3.10
12
+ Classifier: Programming Language :: Python :: 3.11
13
+ Classifier: Programming Language :: Python :: 3.12
14
+ Classifier: Programming Language :: Python :: 3.13
15
+ Classifier: Topic :: System :: Distributed Computing
16
+ Requires-Python: >=3.10
17
+ Description-Content-Type: text/markdown
18
+ License-File: LICENSE.txt
19
+ Requires-Dist: jsonschema
20
+ Requires-Dist: ubiquerg>=0.5.2
21
+ Requires-Dist: yacman>=0.6.7
22
+ Dynamic: author
23
+ Dynamic: author-email
24
+ Dynamic: classifier
25
+ Dynamic: description
26
+ Dynamic: description-content-type
27
+ Dynamic: home-page
28
+ Dynamic: keywords
29
+ Dynamic: license
30
+ Dynamic: license-file
31
+ Dynamic: requires-dist
32
+ Dynamic: requires-python
33
+ Dynamic: summary
34
+
35
+ # Henge
36
+
37
+ Henge is a Python package for building data storage and retrieval interfaces for arbitrary data. Henge is based on the idea of **decomposable recursive unique identifiers (DRUIDs)**, which are hash-based unique identifiers for data derived from the data itself. For arbitrary data with any structure, Henge can mint unique DRUIDs to identify data, store the data in a key-value database of your choice, and provide lookup functions to retrieve the data in its original structure using its DRUID identifier.
38
+
39
+ Henge was intended as a building block for [sequence collections](https://github.com/refgenie/seqcol), but is generic enough to use for any data type that needs content-derived identifiers with database lookup capability.
40
+
41
+ ## Install
42
+
43
+ ```
44
+ pip install henge
45
+ ```
46
+
47
+ ## Quick Start
48
+
49
+ Create a Henge object by providing a database and a data schema. The database can be a Python dict or backed by persistent storage. Data schemas are [JSON-schema](https://json-schema.org/) descriptions of data types, and can be hierarchical.
50
+
51
+ ```python
52
+ import henge
53
+
54
+ schemas = ["path/to/json_schema.yaml"]
55
+ h = henge.Henge(database={}, schemas=schemas)
56
+ ```
57
+
58
+ Insert items into the henge. Upon insert, henge returns the DRUID (digest/checksum/unique identifier) for your object:
59
+
60
+ ```python
61
+ druid = h.insert({"name": "Pat", "age": 38}, item_type="person")
62
+ ```
63
+
64
+ Retrieve the original object using the DRUID:
65
+
66
+ ```python
67
+ h.retrieve(druid)
68
+ # {'age': '38', 'name': 'Pat'}
69
+ ```
70
+
71
+ ## Tutorial
72
+
73
+ For a comprehensive walkthrough covering basic types, arrays, nested objects, and advanced features, see the [tutorial notebook](docs/tutorial.ipynb).
74
+
75
+ ## What are DRUIDs?
76
+
77
+ DRUIDs are a special type of unique identifier with two powerful properties:
78
+
79
+ - **Decomposable**: Identifiers in henge automatically retrieve structured data (tuples, arrays, objects). The structure is defined by a JSON schema, so henge can be used as a back-end for arbitrary data types.
80
+
81
+ - **Recursive**: Individual elements retrieved by henge can be tagged as recursive, meaning these attributes contain their own DRUIDs. Henge can recurse through these, allowing you to mint unique identifiers for arbitrary nested data structures.
82
+
83
+ A DRUID is ultimately the result of a digest operation (such as `md5` or `sha256`) on some data. Because DRUIDs are computed deterministically from the item, they represent globally unique identifiers. If you insert the same item repeatedly, it will produce the same DRUID -- this is true across henges as long as they share a data schema.
84
+
85
+ ## Persisting Data
86
+
87
+ ### In-memory (default)
88
+
89
+ Use a Python `dict` as the database for testing or ephemeral use:
90
+
91
+ ```python
92
+ h = henge.Henge(database={}, schemas=schemas)
93
+ ```
94
+
95
+ ### SQLite backend
96
+
97
+ For persistent storage with SQLite:
98
+
99
+ ```python
100
+ from sqlitedict import SqliteDict
101
+
102
+ mydict = SqliteDict('./my_db.sqlite', autocommit=True)
103
+ h = henge.Henge(mydict, schemas=schemas)
104
+ ```
105
+
106
+ Requires: `pip install sqlitedict`
107
+
108
+ ### MongoDB backend
109
+
110
+ For production use with MongoDB:
111
+
112
+ 1. **Start MongoDB with Docker:**
113
+
114
+ ```bash
115
+ docker run --network="host" mongo
116
+ ```
117
+
118
+ For persistent storage, mount a volume to `/data/db`:
119
+
120
+ ```bash
121
+ docker run -it --network="host" -v /path/to/data:/data/db mongo
122
+ ```
123
+
124
+ 2. **Connect henge to MongoDB:**
125
+
126
+ ```python
127
+ import henge
128
+
129
+ h = henge.Henge(henge.connect_mongo(), schemas=schemas)
130
+ ```
131
+
132
+ Requires: `pip install pymongo mongodict`
@@ -6,9 +6,10 @@ henge/_version.py
6
6
  henge/const.py
7
7
  henge/deprecated.py
8
8
  henge/henge.py
9
+ henge/scconf.py
9
10
  henge.egg-info/PKG-INFO
10
11
  henge.egg-info/SOURCES.txt
11
12
  henge.egg-info/dependency_links.txt
12
- henge.egg-info/entry_points.txt
13
13
  henge.egg-info/requires.txt
14
- henge.egg-info/top_level.txt
14
+ henge.egg-info/top_level.txt
15
+ tests/test_henge.py
@@ -1,6 +1,5 @@
1
1
  #! /usr/bin/env python
2
2
 
3
- import os
4
3
  from setuptools import setup
5
4
  import sys
6
5
 
@@ -35,10 +34,10 @@ setup(
35
34
  classifiers=[
36
35
  "Development Status :: 4 - Beta",
37
36
  "License :: OSI Approved :: BSD License",
38
- "Programming Language :: Python :: 3.7",
39
- "Programming Language :: Python :: 3.8",
40
- "Programming Language :: Python :: 3.9",
41
37
  "Programming Language :: Python :: 3.10",
38
+ "Programming Language :: Python :: 3.11",
39
+ "Programming Language :: Python :: 3.12",
40
+ "Programming Language :: Python :: 3.13",
42
41
  "Topic :: System :: Distributed Computing",
43
42
  ],
44
43
  keywords="",
@@ -46,15 +45,12 @@ setup(
46
45
  author="Nathan Sheffield",
47
46
  author_email="nathan@code.databio.org",
48
47
  license="BSD2",
49
- entry_points={
50
- "console_scripts": ["packagename = packagename.packagename:main"],
51
- },
52
- package_data={"packagename": [os.path.join("packagename", "*")]},
48
+ python_requires=">=3.10",
53
49
  include_package_data=True,
54
50
  test_suite="tests",
55
51
  tests_require=(["pytest"]),
56
52
  setup_requires=(
57
53
  ["pytest-runner"] if {"test", "pytest", "ptr"} & set(sys.argv) else []
58
54
  ),
59
- **extra
55
+ **extra,
60
56
  )
@@ -0,0 +1,71 @@
1
+ import pytest
2
+ from henge import Henge
3
+ from jsonschema import ValidationError
4
+
5
+ # See conftest.py for fixtures
6
+
7
+
8
+ class TestInserting:
9
+ @pytest.mark.parametrize(
10
+ ["x", "success"],
11
+ [
12
+ ({"string_attr": "12321%@!"}, True),
13
+ ({"string_attr": "string", "integer_attr": 1}, True),
14
+ ({"string_attr": 1}, False),
15
+ ({"string_attr": ["a", "b"]}, False),
16
+ ({"string_attr": {"string_attr": "test"}}, False),
17
+ ({"string_attr": "string", "integer_attr": "1"}, False),
18
+ ({"integer_attr": 1}, False),
19
+ ],
20
+ )
21
+ def test_insert_validation_works(self, schema, x, success):
22
+ """Test whether insertion is performed only for valid objects"""
23
+ type_key = "test_item"
24
+ print(f"here's what I got for schema: {schema}")
25
+ h = Henge(database={}, schemas=["tests/data/schema.yaml"])
26
+ if success:
27
+ assert isinstance(h.insert(x, item_type=type_key), str)
28
+ else:
29
+ with pytest.raises(ValidationError):
30
+ h.insert(x, item_type=type_key)
31
+
32
+
33
+ class TestRetrieval:
34
+ @pytest.mark.parametrize(
35
+ "x",
36
+ [
37
+ ({"string_attr": "12321%@!", "integer_attr": 2}),
38
+ ({"string_attr": "string", "integer_attr": 1}),
39
+ ({"string_attr": "string"}),
40
+ ],
41
+ )
42
+ def test_retrieve_returns_inserted_obj(self, schema, x):
43
+ type_key = "test_item"
44
+ h = Henge(database={}, schemas=["tests/data/schema.yaml"])
45
+ d = h.insert(x, item_type=type_key)
46
+ # returns str versions of inserted data
47
+ assert h.retrieve(d) == {k: v for k, v in x.items()}
48
+
49
+ @pytest.mark.parametrize(
50
+ ["seq", "anno"],
51
+ [
52
+ ("ATGCAGTA", {"name": "seq1", "length": 10, "topology": "linear"}),
53
+ ("AAAAAAAA", {"name": "seq2", "length": 11, "topology": "linear"}),
54
+ ],
55
+ )
56
+ def test_retrieve_recurses(self, schema_asd, schema_sequence, seq, anno):
57
+ h = Henge(
58
+ database={},
59
+ schemas=[
60
+ "tests/data/sequence.yaml",
61
+ "tests/data/annotated_sequence_digest.yaml",
62
+ ],
63
+ )
64
+ seq_digest = h.insert(seq, item_type="sequence")
65
+ anno.update({"sequence_digest": seq_digest})
66
+ asd = h.insert(anno, item_type="annotated_sequence_digest")
67
+ res = h.retrieve(asd)
68
+ assert isinstance(res["name"], str)
69
+
70
+ def test_inherent_attributes(self, inherent):
71
+ print("test")
henge-0.2.1/PKG-INFO DELETED
@@ -1,25 +0,0 @@
1
- Metadata-Version: 2.1
2
- Name: henge
3
- Version: 0.2.1
4
- Summary: Storage and retrieval of object-derived, decomposable recursive unique identifiers.
5
- Home-page: https://databio.org
6
- Author: Nathan Sheffield
7
- Author-email: nathan@code.databio.org
8
- License: BSD2
9
- Classifier: Development Status :: 4 - Beta
10
- Classifier: License :: OSI Approved :: BSD License
11
- Classifier: Programming Language :: Python :: 3.7
12
- Classifier: Programming Language :: Python :: 3.8
13
- Classifier: Programming Language :: Python :: 3.9
14
- Classifier: Programming Language :: Python :: 3.10
15
- Classifier: Topic :: System :: Distributed Computing
16
- Description-Content-Type: text/markdown
17
- License-File: LICENSE.txt
18
-
19
- [![Build Status](https://travis-ci.com/databio/henge.svg?branch=master)](https://travis-ci.com/databio/henge)
20
-
21
- # Henge
22
-
23
- Henge is a Python package that builds backends for generic decomposable recursive unique identifiers (or, *DRUIDs*). It is intended to be used as a building block for sequence collections (see the [seqcol package](https://github.com/databio/seqcol)), and also for other data types that need content-derived identifiers.
24
-
25
- Documentation at [http://henge.databio.org](http://henge.databio.org).
henge-0.2.1/README.md DELETED
@@ -1,7 +0,0 @@
1
- [![Build Status](https://travis-ci.com/databio/henge.svg?branch=master)](https://travis-ci.com/databio/henge)
2
-
3
- # Henge
4
-
5
- Henge is a Python package that builds backends for generic decomposable recursive unique identifiers (or, *DRUIDs*). It is intended to be used as a building block for sequence collections (see the [seqcol package](https://github.com/databio/seqcol)), and also for other data types that need content-derived identifiers.
6
-
7
- Documentation at [http://henge.databio.org](http://henge.databio.org).
@@ -1 +0,0 @@
1
- __version__ = "0.2.1"
@@ -1,25 +0,0 @@
1
- Metadata-Version: 2.1
2
- Name: henge
3
- Version: 0.2.1
4
- Summary: Storage and retrieval of object-derived, decomposable recursive unique identifiers.
5
- Home-page: https://databio.org
6
- Author: Nathan Sheffield
7
- Author-email: nathan@code.databio.org
8
- License: BSD2
9
- Classifier: Development Status :: 4 - Beta
10
- Classifier: License :: OSI Approved :: BSD License
11
- Classifier: Programming Language :: Python :: 3.7
12
- Classifier: Programming Language :: Python :: 3.8
13
- Classifier: Programming Language :: Python :: 3.9
14
- Classifier: Programming Language :: Python :: 3.10
15
- Classifier: Topic :: System :: Distributed Computing
16
- Description-Content-Type: text/markdown
17
- License-File: LICENSE.txt
18
-
19
- [![Build Status](https://travis-ci.com/databio/henge.svg?branch=master)](https://travis-ci.com/databio/henge)
20
-
21
- # Henge
22
-
23
- Henge is a Python package that builds backends for generic decomposable recursive unique identifiers (or, *DRUIDs*). It is intended to be used as a building block for sequence collections (see the [seqcol package](https://github.com/databio/seqcol)), and also for other data types that need content-derived identifiers.
24
-
25
- Documentation at [http://henge.databio.org](http://henge.databio.org).
@@ -1,2 +0,0 @@
1
- [console_scripts]
2
- packagename = packagename.packagename:main