henge 0.2.1__tar.gz → 0.2.3__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
henge-0.2.3/PKG-INFO ADDED
@@ -0,0 +1,132 @@
1
+ Metadata-Version: 2.4
2
+ Name: henge
3
+ Version: 0.2.3
4
+ Summary: Storage and retrieval of object-derived, decomposable recursive unique identifiers.
5
+ Home-page: https://databio.org
6
+ Author: Nathan Sheffield
7
+ Author-email: nathan@code.databio.org
8
+ License: BSD2
9
+ Classifier: Development Status :: 4 - Beta
10
+ Classifier: License :: OSI Approved :: BSD License
11
+ Classifier: Programming Language :: Python :: 3.10
12
+ Classifier: Programming Language :: Python :: 3.11
13
+ Classifier: Programming Language :: Python :: 3.12
14
+ Classifier: Programming Language :: Python :: 3.13
15
+ Classifier: Topic :: System :: Distributed Computing
16
+ Requires-Python: >=3.10
17
+ Description-Content-Type: text/markdown
18
+ License-File: LICENSE.txt
19
+ Requires-Dist: jsonschema
20
+ Requires-Dist: ubiquerg>=0.5.2
21
+ Requires-Dist: yacman>=0.6.7
22
+ Dynamic: author
23
+ Dynamic: author-email
24
+ Dynamic: classifier
25
+ Dynamic: description
26
+ Dynamic: description-content-type
27
+ Dynamic: home-page
28
+ Dynamic: keywords
29
+ Dynamic: license
30
+ Dynamic: license-file
31
+ Dynamic: requires-dist
32
+ Dynamic: requires-python
33
+ Dynamic: summary
34
+
35
+ # Henge
36
+
37
+ Henge is a Python package for building data storage and retrieval interfaces for arbitrary data. Henge is based on the idea of **decomposable recursive unique identifiers (DRUIDs)**, which are hash-based unique identifiers for data derived from the data itself. For arbitrary data with any structure, Henge can mint unique DRUIDs to identify data, store the data in a key-value database of your choice, and provide lookup functions to retrieve the data in its original structure using its DRUID identifier.
38
+
39
+ Henge was intended as a building block for [sequence collections](https://github.com/refgenie/seqcol), but is generic enough to use for any data type that needs content-derived identifiers with database lookup capability.
40
+
41
+ ## Install
42
+
43
+ ```
44
+ pip install henge
45
+ ```
46
+
47
+ ## Quick Start
48
+
49
+ Create a Henge object by providing a database and a data schema. The database can be a Python dict or backed by persistent storage. Data schemas are [JSON-schema](https://json-schema.org/) descriptions of data types, and can be hierarchical.
50
+
51
+ ```python
52
+ import henge
53
+
54
+ schemas = ["path/to/json_schema.yaml"]
55
+ h = henge.Henge(database={}, schemas=schemas)
56
+ ```
57
+
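The `item_type="person"` used below refers to one of the loaded schemas. The actual schema file is not shown in this quick start, but since henge schemas are ordinary [JSON-schema](https://json-schema.org/) documents written in YAML, a hypothetical `person` schema consistent with the example might look like:

```yaml
# path/to/json_schema.yaml -- hypothetical example, not shipped with the package
description: person
type: object
properties:
  name:
    type: string
  age:
    type: integer
required:
  - name
```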
58
+ Insert items into the henge. Upon insert, henge returns the DRUID (digest/checksum/unique identifier) for your object:
59
+
60
+ ```python
61
+ druid = h.insert({"name": "Pat", "age": 38}, item_type="person")
62
+ ```
63
+
64
+ Retrieve the original object using the DRUID:
65
+
66
+ ```python
67
+ h.retrieve(druid)
68
+ # {'age': '38', 'name': 'Pat'}
69
+ ```
70
+
71
+ ## Tutorial
72
+
73
+ For a comprehensive walkthrough covering basic types, arrays, nested objects, and advanced features, see the [tutorial notebook](docs/tutorial.ipynb).
74
+
75
+ ## What are DRUIDs?
76
+
77
+ DRUIDs are a special type of unique identifier with two powerful properties:
78
+
79
+ - **Decomposable**: Identifiers in henge automatically retrieve structured data (tuples, arrays, objects). The structure is defined by a JSON schema, so henge can be used as a back-end for arbitrary data types.
80
+
81
+ - **Recursive**: Individual elements retrieved by henge can be tagged as recursive, meaning these attributes contain their own DRUIDs. Henge can recurse through these, allowing you to mint unique identifiers for arbitrary nested data structures.
82
+
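The recursive behavior is exercised by this release's own `tests/test_henge.py` (`test_retrieve_recurses`); a condensed sketch of that pattern, using the schema files shipped under `tests/data/`:

```python
import henge

h = henge.Henge(
    database={},
    schemas=["tests/data/sequence.yaml", "tests/data/annotated_sequence_digest.yaml"],
)
seq_digest = h.insert("ATGCAGTA", item_type="sequence")
anno = {
    "name": "seq1",
    "length": 10,
    "topology": "linear",
    "sequence_digest": seq_digest,  # embed the child DRUID
}
asd_druid = h.insert(anno, item_type="annotated_sequence_digest")
h.retrieve(asd_druid)  # recurses through sequence_digest; reclimit caps the depth
```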
83
+ A DRUID is ultimately the result of a digest operation (such as `md5` or `sha256`) on some data. Because DRUIDs are computed deterministically from the item, they represent globally unique identifiers. If you insert the same item repeatedly, it will produce the same DRUID -- this is true across henges as long as they share a data schema.
84
+
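To make the determinism concrete, here is a toy sketch of the idea; henge's real canonicalization also involves the schema (see the `canonical_str` helper), so this is illustrative only:

```python
import hashlib
import json

def toy_druid(item: dict) -> str:
    # Canonical string: stable key order, no insignificant whitespace
    canonical = json.dumps(item, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode()).hexdigest()

# Same content, different key order -> same identifier
assert toy_druid({"name": "Pat", "age": 38}) == toy_druid({"age": 38, "name": "Pat"})
```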
85
+ ## Persisting Data
86
+
87
+ ### In-memory (default)
88
+
89
+ Use a Python `dict` as the database for testing or ephemeral use:
90
+
91
+ ```python
92
+ h = henge.Henge(database={}, schemas=schemas)
93
+ ```
94
+
95
+ ### SQLite backend
96
+
97
+ For persistent storage with SQLite:
98
+
99
+ ```python
100
+ from sqlitedict import SqliteDict
101
+
102
+ mydict = SqliteDict('./my_db.sqlite', autocommit=True)
103
+ h = henge.Henge(mydict, schemas=schemas)
104
+ ```
105
+
106
+ Requires: `pip install sqlitedict`
107
+
108
+ ### MongoDB backend
109
+
110
+ For production use with MongoDB:
111
+
112
+ 1. **Start MongoDB with Docker:**
113
+
114
+ ```bash
115
+ docker run --network="host" mongo
116
+ ```
117
+
118
+ For persistent storage, mount a volume to `/data/db`:
119
+
120
+ ```bash
121
+ docker run -it --network="host" -v /path/to/data:/data/db mongo
122
+ ```
123
+
124
+ 2. **Connect henge to MongoDB:**
125
+
126
+ ```python
127
+ import henge
128
+
129
+ h = henge.Henge(henge.connect_mongo(), schemas=schemas)
130
+ ```
131
+
132
+ Requires: `pip install pymongo mongodict`
henge-0.2.3/README.md ADDED
@@ -0,0 +1,98 @@
1
+ # Henge
2
+
3
+ Henge is a Python package for building data storage and retrieval interfaces for arbitrary data. Henge is based on the idea of **decomposable recursive unique identifiers (DRUIDs)**, which are hash-based unique identifiers for data derived from the data itself. For arbitrary data with any structure, Henge can mint unique DRUIDs to identify data, store the data in a key-value database of your choice, and provide lookup functions to retrieve the data in its original structure using its DRUID identifier.
4
+
5
+ Henge was intended as a building block for [sequence collections](https://github.com/refgenie/seqcol), but is generic enough to use for any data type that needs content-derived identifiers with database lookup capability.
6
+
7
+ ## Install
8
+
9
+ ```
10
+ pip install henge
11
+ ```
12
+
13
+ ## Quick Start
14
+
15
+ Create a Henge object by providing a database and a data schema. The database can be a Python dict or backed by persistent storage. Data schemas are [JSON-schema](https://json-schema.org/) descriptions of data types, and can be hierarchical.
16
+
17
+ ```python
18
+ import henge
19
+
20
+ schemas = ["path/to/json_schema.yaml"]
21
+ h = henge.Henge(database={}, schemas=schemas)
22
+ ```
23
+
24
+ Insert items into the henge. Upon insert, henge returns the DRUID (digest/checksum/unique identifier) for your object:
25
+
26
+ ```python
27
+ druid = h.insert({"name": "Pat", "age": 38}, item_type="person")
28
+ ```
29
+
30
+ Retrieve the original object using the DRUID:
31
+
32
+ ```python
33
+ h.retrieve(druid)
34
+ # {'age': '38', 'name': 'Pat'}
35
+ ```
36
+
37
+ ## Tutorial
38
+
39
+ For a comprehensive walkthrough covering basic types, arrays, nested objects, and advanced features, see the [tutorial notebook](docs/tutorial.ipynb).
40
+
41
+ ## What are DRUIDs?
42
+
43
+ DRUIDs are a special type of unique identifier with two powerful properties:
44
+
45
+ - **Decomposable**: Identifiers in henge automatically retrieve structured data (tuples, arrays, objects). The structure is defined by a JSON schema, so henge can be used as a back-end for arbitrary data types.
46
+
47
+ - **Recursive**: Individual elements retrieved by henge can be tagged as recursive, meaning these attributes contain their own DRUIDs. Henge can recurse through these, allowing you to mint unique identifiers for arbitrary nested data structures.
48
+
49
+ A DRUID is ultimately the result of a digest operation (such as `md5` or `sha256`) on some data. Because DRUIDs are computed deterministically from the item, they represent globally unique identifiers. If you insert the same item repeatedly, it will produce the same DRUID -- this is true across henges as long as they share a data schema.
50
+
51
+ ## Persisting Data
52
+
53
+ ### In-memory (default)
54
+
55
+ Use a Python `dict` as the database for testing or ephemeral use:
56
+
57
+ ```python
58
+ h = henge.Henge(database={}, schemas=schemas)
59
+ ```
60
+
61
+ ### SQLite backend
62
+
63
+ For persistent storage with SQLite:
64
+
65
+ ```python
66
+ from sqlitedict import SqliteDict
67
+
68
+ mydict = SqliteDict('./my_db.sqlite', autocommit=True)
69
+ h = henge.Henge(mydict, schemas=schemas)
70
+ ```
71
+
72
+ Requires: `pip install sqlitedict`
73
+
74
+ ### MongoDB backend
75
+
76
+ For production use with MongoDB:
77
+
78
+ 1. **Start MongoDB with Docker:**
79
+
80
+ ```bash
81
+ docker run --network="host" mongo
82
+ ```
83
+
84
+ For persistent storage, mount a volume to `/data/db`:
85
+
86
+ ```bash
87
+ docker run -it --network="host" -v /path/to/data:/data/db mongo
88
+ ```
89
+
90
+ 2. **Connect henge to MongoDB:**
91
+
92
+ ```python
93
+ import henge
94
+
95
+ h = henge.Henge(henge.connect_mongo(), schemas=schemas)
96
+ ```
97
+
98
+ Requires: `pip install pymongo mongodict`
@@ -9,4 +9,5 @@ __all__ = __classes__ + [
9
9
  "split_schema",
10
10
  "NotFoundException",
11
11
  "canonical_str",
12
+ "sha512t24u_digest",
12
13
  ]
@@ -0,0 +1 @@
1
+ __version__ = "0.2.3"
@@ -1,10 +1,11 @@
1
- """ An interface to a database back-end for DRUIDs """
1
+ """An interface to a database back-end for DRUIDs"""
2
2
 
3
+ import base64
3
4
  import copy
4
5
  import hashlib
5
6
  import jsonschema
6
- import logging
7
7
  import json
8
+ import logging
8
9
  import os
9
10
  import sys
10
11
  import yacman
@@ -28,6 +29,13 @@ class NotFoundException(Exception):
28
29
  return self.message
29
30
 
30
31
 
32
+ def sha512t24u_digest(seq: str, offset: int = 24) -> str:
33
+ """GA4GH digest function"""
34
+ digest = hashlib.sha512(seq.encode()).digest()
35
+ tdigest_b64us = base64.urlsafe_b64encode(digest[:offset])
36
+ return tdigest_b64us.decode("ascii")
37
+
38
+
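The new `sha512t24u_digest` helper is also added to `__all__` in this release; a brief usage sketch, assuming it is importable from the package top level:

```python
from henge import sha512t24u_digest  # import path assumed from the __all__ change

digest = sha512t24u_digest("ACGT")
# SHA-512, truncated to 24 bytes, URL-safe base64-encoded -> a 32-character string
assert len(digest) == 32
```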
31
39
  def md5(seq):
32
40
  return hashlib.md5(seq.encode()).hexdigest()
33
41
 
@@ -49,14 +57,18 @@ def read_url(url):
49
57
  raise e
50
58
  data = response.read() # a `bytes` object
51
59
  text = data.decode("utf-8")
52
- print(text)
53
60
  return yaml.safe_load(text)
54
61
 
55
62
 
56
63
  class Henge(object):
57
64
  def __init__(
58
- self, database, schemas, schemas_str=[], henges=None, checksum_function=md5
59
- ):
65
+ self,
66
+ database: dict,
67
+ schemas: list[str],
68
+ schemas_str: list[str] = None,
69
+ henges: dict = None,
70
+ checksum_function: callable = md5,
71
+ ) -> None:
60
72
  """
61
73
  A user interface to insert and retrieve decomposable recursive unique
62
74
  identifiers (DRUIDs).
@@ -107,7 +119,7 @@ class Henge(object):
107
119
  )
108
120
  # populated_schemas.append(yaml.safe_load(schema_value))
109
121
 
110
- for schema_value in schemas_str:
122
+ for schema_value in schemas_str or []:
111
123
  populated_schemas.append(yaml.safe_load(schema_value))
112
124
 
113
125
  split_schemas = {}
@@ -132,7 +144,9 @@ class Henge(object):
132
144
  self.schemas[item_type] = henge.schemas[item_type]
133
145
  self.henges[item_type] = henge
134
146
 
135
- def retrieve(self, druid, reclimit=None, raw=False):
147
+ def retrieve(
148
+ self, druid: str, reclimit: int = None, raw: bool = False
149
+ ) -> dict | list:
136
150
  """
137
151
  Retrieve an item given a digest
138
152
 
@@ -194,7 +208,7 @@ class Henge(object):
194
208
  def lookup(self, druid, item_type):
195
209
  try:
196
210
  henge_to_query = self.henges[item_type]
197
- except:
211
+ except KeyError:
198
212
  _LOGGER.debug("No henges available for this item type")
199
213
  raise NotFoundException(druid)
200
214
  try:
@@ -228,7 +242,9 @@ class Henge(object):
228
242
  continue
229
243
  return valid_schemas
230
244
 
231
- def insert(self, item, item_type, reclimit=None):
245
+ def insert(
246
+ self, item: dict | list, item_type: str, reclimit: int = None
247
+ ) -> str | bool:
232
248
  """
233
249
  Add structured items of a specified type to the database.
234
250
 
@@ -243,8 +259,9 @@ class Henge(object):
243
259
 
244
260
  if item_type not in self.schemas.keys():
245
261
  _LOGGER.error(
246
- "I don't know about items of type '{}'. "
247
- "I know of: '{}'".format(item_type, list(self.schemas.keys()))
262
+ "I don't know about items of type '{}'. I know of: '{}'".format(
263
+ item_type, list(self.schemas.keys())
264
+ )
248
265
  )
249
266
  return False
250
267
 
@@ -328,8 +345,9 @@ class Henge(object):
328
345
  """
329
346
  if item_type not in self.schemas.keys():
330
347
  _LOGGER.error(
331
- "I don't know about items of type '{}'. "
332
- "I know of: '{}'".format(item_type, list(self.schemas.keys()))
348
+ "I don't know about items of type '{}'. I know of: '{}'".format(
349
+ item_type, list(self.schemas.keys())
350
+ )
333
351
  )
334
352
  return False
335
353
 
@@ -349,7 +367,7 @@ class Henge(object):
349
367
  item_type, item
350
368
  )
351
369
  )
352
- print(e)
370
+ _LOGGER.error(e)
353
371
 
354
372
  if isinstance(item, str):
355
373
  henge_to_query = self.henges[item_type]
@@ -370,7 +388,6 @@ class Henge(object):
370
388
  return item
371
389
 
372
390
  raise e
373
- return None
374
391
 
375
392
  _LOGGER.debug(f"item to insert: {item}")
376
393
  item_inherent_split = select_inherent_properties(item, valid_schema)
@@ -408,17 +425,14 @@ class Henge(object):
408
425
 
409
426
  henge_to_query = self.henges[item_type]
410
427
  # _LOGGER.debug("henge_to_query: {}".format(henge_to_query))
411
- try:
412
- henge_to_query.database[druid] = string
413
- henge_to_query.database[druid + ITEM_TYPE] = item_type
414
- henge_to_query.database[druid + "_digest_version"] = digest_version
415
- henge_to_query.database[druid + "_external_string"] = external_string
416
-
417
- if henge_to_query != self:
418
- self.database[druid + ITEM_TYPE] = item_type
419
- self.database[druid + "_digest_version"] = digest_version
420
- except Exception as e:
421
- raise e
428
+ henge_to_query.database[druid] = string
429
+ henge_to_query.database[druid + ITEM_TYPE] = item_type
430
+ henge_to_query.database[druid + "_digest_version"] = digest_version
431
+ henge_to_query.database[druid + "_external_string"] = external_string
432
+
433
+ if henge_to_query != self:
434
+ self.database[druid + ITEM_TYPE] = item_type
435
+ self.database[druid + "_digest_version"] = digest_version
422
436
 
423
437
  def clean(self):
424
438
  """
@@ -440,7 +454,21 @@ class Henge(object):
440
454
  Show all items in the database.
441
455
  """
442
456
  for k, v in self.database.items():
443
- print(k, v)
457
+ _LOGGER.info(f"{k} {v}")
458
+
459
+ def __len__(self):
460
+ return len(self.database)
461
+
462
+ def list(self, limit=1000, offset=0):
463
+ """
464
+ List all items in the database.
465
+ """
466
+ return {
467
+ "count": len(self.database),
468
+ "limit": limit,
469
+ "offset": offset,
470
+ "items": list(self.database.keys())[offset : (offset + limit)],
471
+ }
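A sketch of how the new `list()` method behaves, following directly from the code above (note that the listed keys include henge's auxiliary bookkeeping entries, such as item type and digest version, not only the DRUIDs):

```python
# Assuming an existing Henge instance `h` with some inserted items
page = h.list(limit=10, offset=0)
# page == {
#     "count": <total number of database keys>,
#     "limit": 10,
#     "offset": 0,
#     "items": [<up to 10 keys, including auxiliary bookkeeping entries>],
# }
```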
444
472
 
445
473
  def __repr__(self):
446
474
  repr = "Henge object. Item types: " + ",".join(self.item_types)
@@ -0,0 +1,359 @@
1
+ import logging
2
+ import os
3
+ import psycopg2
4
+
5
+ from collections.abc import Mapping
6
+ from psycopg2 import OperationalError, sql
7
+ from psycopg2.errors import UniqueViolation
8
+
9
+ _LOGGER = logging.getLogger(__name__)
10
+
11
+ # Use like:
12
+ # pgdb = RDBDict(...) # Open connection
13
+ # pgdb["key"] = "value" # Insert item
14
+ # pgdb["key"] # Retrieve item
15
+ # pgdb.close() # Close connection
16
+
17
+
18
+ # This was originally written in seqcolapi.
19
+ # I am moving it here in 2025, because the whole point was to enable
20
+ # interesting database back-ends to have dict-style key-value pair
21
+ # mechanisms, which enabled henge to use these various backends
22
+ # to back arbitrary databases.
23
+ # With the move to sqlmodel, I abandoned the henge backend approach,
24
+ # so intermediates are no longer important for seqcol.
25
+
26
+ # They could become relevant for other henge use cases, so they
27
+ # fit better here now.
28
+
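Tying the comments above together, a minimal sketch of using this class as a henge backend, assuming the module is importable as `henge.scconf` (per the `henge/scconf.py` entry added to SOURCES.txt in this release) and that the `POSTGRES_*` environment variables read by the constructor are set:

```python
from henge import Henge
from henge.scconf import RDBDict  # import path assumed, per henge/scconf.py in SOURCES.txt

pgdb = RDBDict(db_table="henge_kv")  # unset arguments fall back to POSTGRES_* env vars
pgdb.init_table()                    # create the key/value table if it does not exist
h = Henge(database=pgdb, schemas=["path/to/json_schema.yaml"])
druid = h.insert({"name": "Pat", "age": 38}, item_type="person")
print(h.retrieve(druid))
pgdb.close()
```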
29
+
30
+ def getenv(varname):
31
+ """Simple wrapper to make the Exception more informative for missing env var"""
32
+ try:
33
+ return os.environ[varname]
34
+ except KeyError:
35
+ raise Exception(f"Environment variable {varname} not set.")
36
+
37
+
38
+ import pipestat
39
+
40
+
41
+ class PipestatMapping(pipestat.PipestatManager):
42
+ """A wrapper class to allow using a PipestatManager as a dict-like object."""
43
+
44
+ def __getitem__(self, key):
45
+ # This little hack makes this work with `in`;
46
+ # e.g.: for x in rdbdict, which is now disabled, instead of infinite.
47
+ if isinstance(key, int):
48
+ raise IndexError
49
+ return self.retrieve(key)
50
+
51
+ def __setitem__(self, key, value):
52
+ return self.insert({key: value})
53
+
54
+ def __len__(self):
55
+ return self.count_records()
56
+
57
+ def _next_page(self):
58
+ self._buf["page_index"] += 1
59
+ limit = self._buf["page_size"]
60
+ offset = self._buf["page_index"] * limit
61
+ self._buf["keys"] = self.get_records(limit, offset)
62
+ return self._buf["keys"][0]
63
+
64
+ def __iter__(self):
65
+ _LOGGER.debug("Iterating...")
66
+ self._buf = { # buffered iterator
67
+ "current_view_index": 0,
68
+ "len": len(self),
69
+ "page_size": 100,
70
+ "page_index": -1,
71
+ "keys": self._next_page(),
72
+ }
73
+ return self
74
+
75
+ def __next__(self):
76
+ if self._buf["current_view_index"] > self._buf["len"]:
77
+ raise StopIteration
78
+
79
+ idx = (
80
+ self._buf["current_view_index"]
81
+ - self._buf["page_index"] * self._buf["page_size"]
82
+ )
83
+ if idx <= self._buf["page_size"]:
84
+ self._buf["current_view_index"] += 1
85
+ return self._buf["keys"][idx - 1]
86
+ else: # current index is beyond current page, but not beyond total
87
+ return self._next_page()
88
+
89
+
90
+ class RDBDict(Mapping):
91
+ """
92
+ A Relational DataBase Dict.
93
+
94
+ Simple database connection manager object that allows us to use a
95
+ PostgresQL database as a simple key-value store to back Python
96
+ dict-style access to database items.
97
+ """
98
+
99
+ def __init__(
100
+ self,
101
+ db_name: str = None,
102
+ db_user: str = None,
103
+ db_password: str = None,
104
+ db_host: str = None,
105
+ db_port: str = None,
106
+ db_table: str = None,
107
+ ):
108
+ self.connection = None
109
+ self.db_name = db_name or getenv("POSTGRES_DB")
110
+ self.db_user = db_user or getenv("POSTGRES_USER")
111
+ self.db_host = db_host or os.environ.get("POSTGRES_HOST") or "localhost"
112
+ self.db_port = db_port or os.environ.get("POSTGRES_PORT") or "5432"
113
+ self.db_table = db_table or os.environ.get("POSTGRES_TABLE") or "seqcol"
114
+ db_password = db_password or getenv("POSTGRES_PASSWORD")
115
+
116
+ try:
117
+ self.connection = self.create_connection(
118
+ self.db_name, self.db_user, db_password, self.db_host, self.db_port
119
+ )
120
+ if not self.connection:
121
+ raise Exception("Connection failed")
122
+ except Exception as e:
123
+ _LOGGER.info(f"{self}")
124
+ raise e
125
+ _LOGGER.info(self.connection)
126
+ self.connection.autocommit = True
127
+
128
+ def __repr__(self):
129
+ return (
130
+ "RDBD object\n"
131
+ + "db_table: {}\n".format(self.db_table)
132
+ + "db_name: {}\n".format(self.db_name)
133
+ + "db_user: {}\n".format(self.db_user)
134
+ + "db_host: {}\n".format(self.db_host)
135
+ + "db_port: {}\n".format(self.db_port)
136
+ )
137
+
138
+ def init_table(self):
139
+ # Wrap statements to prevent SQL injection attacks
140
+ stmt = sql.SQL(
141
+ """
142
+ CREATE TABLE IF NOT EXISTS {table}(
143
+ key TEXT PRIMARY KEY,
144
+ value TEXT);
145
+ """
146
+ ).format(table=sql.Identifier(self.db_table))
147
+ return self.execute_query(stmt, params=None)
148
+
149
+ def insert(self, key, value):
150
+ stmt = sql.SQL(
151
+ """
152
+ INSERT INTO {table}(key, value)
153
+ VALUES (%(key)s, %(value)s);
154
+ """
155
+ ).format(table=sql.Identifier(self.db_table))
156
+ params = {"key": key, "value": value}
157
+ return self.execute_query(stmt, params)
158
+
159
+ def update(self, key, value):
160
+ stmt = sql.SQL(
161
+ """
162
+ UPDATE {table} SET value=%(value)s WHERE key=%(key)s
163
+ """
164
+ ).format(table=sql.Identifier(self.db_table))
165
+ params = {"key": key, "value": value}
166
+ return self.execute_query(stmt, params)
167
+
168
+ def __getitem__(self, key):
169
+ # This little hack makes this work with `in`;
170
+ # e.g.: for x in rdbdict, which is now disabled, instead of infinite.
171
+ if isinstance(key, int):
172
+ raise IndexError
173
+ stmt = sql.SQL(
174
+ """
175
+ SELECT value FROM {table} WHERE key=%(key)s
176
+ """
177
+ ).format(table=sql.Identifier(self.db_table))
178
+ params = {"key": key}
179
+ res = self.execute_read_query(stmt, params)
180
+ if not res:
181
+ _LOGGER.info("Not found: {}".format(key))
182
+ return res
183
+
184
+ def __setitem__(self, key, value):
185
+ try:
186
+ return self.insert(key, value)
187
+ except UniqueViolation as e:
188
+ _LOGGER.info("Updating existing value for {}".format(key))
189
+ return self.update(key, value)
190
+
191
+ def __delitem__(self, key):
192
+ stmt = sql.SQL(
193
+ """
194
+ DELETE FROM {table} WHERE key=%(key)s
195
+ """
196
+ ).format(table=sql.Identifier(self.db_table))
197
+ params = {"key": key}
198
+ res = self.execute_query(stmt, params)
199
+ return res
200
+
201
+ def create_connection(self, db_name, db_user, db_password, db_host, db_port):
202
+ connection = None
203
+ try:
204
+ connection = psycopg2.connect(
205
+ database=db_name,
206
+ user=db_user,
207
+ password=db_password,
208
+ host=db_host,
209
+ port=db_port,
210
+ )
211
+ _LOGGER.info("Connection to PostgreSQL DB successful")
212
+ except OperationalError as e:
213
+ _LOGGER.info("Error: {e}".format(e=str(e)))
214
+ return connection
215
+
216
+ def execute_read_query(self, query, params=None):
217
+ cursor = self.connection.cursor()
218
+ result = None
219
+ try:
220
+ cursor.execute(query, params)
221
+ result = cursor.fetchone()
222
+ if result:
223
+ return result[0]
224
+ else:
225
+ _LOGGER.debug(f"Query: {query}")
226
+ _LOGGER.debug(f"Result: {result}")
227
+ return None
228
+ except OperationalError as e:
229
+ _LOGGER.info("Error: {e}".format(e=str(e)))
230
+ raise
231
+ except TypeError as e:
232
+ _LOGGER.info("TypeError: {e}, item: {q}".format(e=str(e), q=query))
233
+ raise
234
+
235
+ def execute_multi_query(self, query, params=None):
236
+ cursor = self.connection.cursor()
237
+ result = None
238
+ try:
239
+ cursor.execute(query, params)
240
+ result = cursor.fetchall()
241
+ return result
242
+ except OperationalError as e:
243
+ _LOGGER.info("Error: {e}".format(e=str(e)))
244
+ raise
245
+ except TypeError as e:
246
+ _LOGGER.info("TypeError: {e}, item: {q}".format(e=str(e), q=query))
247
+ raise
248
+
249
+ def execute_query(self, query, params=None):
250
+ cursor = self.connection.cursor()
251
+ try:
252
+ return cursor.execute(query, params)
253
+ _LOGGER.info("Query executed successfully")
254
+ except OperationalError as e:
255
+ _LOGGER.info("Error: {e}".format(e=str(e)))
256
+
257
+ def close(self):
258
+ _LOGGER.info("Closing connection")
259
+ return self.connection.close()
260
+
261
+ def __del__(self):
262
+ if self.connection:
263
+ self.close()
264
+
265
+ def __len__(self):
266
+ stmt = sql.SQL(
267
+ """
268
+ SELECT COUNT(*) FROM {table}
269
+ """
270
+ ).format(table=sql.Identifier(self.db_table))
271
+ _LOGGER.debug(stmt)
272
+ res = self.execute_read_query(stmt)
273
+ return res
274
+
275
+ def get_paged_keys(self, limit=None, offset=None):
276
+ stmt = sql.SQL("SELECT key FROM {table}").format(
277
+ table=sql.Identifier(self.db_table)
278
+ )
279
+ params = {}
280
+ if limit is not None:
281
+ stmt = sql.SQL("{} LIMIT %(limit)s").format(stmt)
282
+ params["limit"] = limit
283
+ if offset is not None:
284
+ stmt = sql.SQL("{} OFFSET %(offset)s").format(stmt)
285
+ params["offset"] = offset
286
+ res = self.execute_multi_query(stmt, params if params else None)
287
+ return res
288
+
289
+ def _next_page(self):
290
+ self._buf["page_index"] += 1
291
+ limit = self._buf["page_size"]
292
+ offset = self._buf["page_index"] * limit
293
+ self._buf["keys"] = self.get_paged_keys(limit, offset)
294
+ return self._buf["keys"][0]
295
+
296
+ def __iter__(self):
297
+ _LOGGER.debug("Iterating...")
298
+ self._buf = { # buffered iterator
299
+ "current_view_index": 0,
300
+ "len": len(self),
301
+ "page_size": 10,
302
+ "page_index": 0,
303
+ "keys": self.get_paged_keys(10, 0),
304
+ }
305
+ return self
306
+
307
+ def __next__(self):
308
+ if self._buf["current_view_index"] > self._buf["len"]:
309
+ raise StopIteration
310
+
311
+ idx = (
312
+ self._buf["current_view_index"]
313
+ - self._buf["page_index"] * self._buf["page_size"]
314
+ )
315
+ if idx <= self._buf["page_size"]:
316
+ self._buf["current_view_index"] += 1
317
+ return self._buf["keys"][idx - 1]
318
+ else: # current index is beyond current page, but not beyond total
319
+ return self._next_page()
320
+
321
+ # Old, non-paged iterator:
322
+ # def __iter__(self):
323
+ # self._current_idx = 0
324
+ # return self
325
+
326
+ # def __next__(self):
327
+ # stmt = sql.SQL(
328
+ # """
329
+ # SELECT key,value FROM {table} LIMIT 1 OFFSET %(idx)s
330
+ # """
331
+ # ).format(table=sql.Identifier(self.db_table))
332
+ # res = self.execute_read_query(stmt, {"idx": self._current_idx})
333
+ # self._current_idx += 1
334
+ # if not res:
335
+ # _LOGGER.info("Not found: {}".format(self._current_idx))
336
+ # raise StopIteration
337
+ # return res
338
+
339
+
340
+ # We don't need the full SeqColHenge,
341
+ # which also has loading capability, and requires pyfaidx, which requires
342
+ # biopython, which requires numpy, which is huge and can't compile in the
343
+ # default fastapi container.
344
+ # So, I had written the below class which provides retrieve only.
345
+ # HOWEVER, switching from alpine to slim allows install of numpy;
346
+ # This inflates the container size from 262Mb to 350Mb; perhaps that's worth paying.
347
+ # So I can avoid duplicating this and just use the full SeqColHenge from seqcol
348
+ # class SeqColHenge(refget.RefGetClient):
349
+ # def retrieve(self, druid, reclimit=None, raw=False):
350
+ # try:
351
+ # return super(SeqColHenge, self).retrieve(druid, reclimit, raw)
352
+ # except henge.NotFoundException as e:
353
+ # _LOGGER.debug(e)
354
+ # try:
355
+ # return self.refget(druid)
356
+ # except Exception as e:
357
+ # _LOGGER.debug(e)
358
+ # raise e
359
+ # return henge.NotFoundException("{} not found in database, or in refget.".format(druid))
@@ -0,0 +1,132 @@
1
+ Metadata-Version: 2.4
2
+ Name: henge
3
+ Version: 0.2.3
4
+ Summary: Storage and retrieval of object-derived, decomposable recursive unique identifiers.
5
+ Home-page: https://databio.org
6
+ Author: Nathan Sheffield
7
+ Author-email: nathan@code.databio.org
8
+ License: BSD2
9
+ Classifier: Development Status :: 4 - Beta
10
+ Classifier: License :: OSI Approved :: BSD License
11
+ Classifier: Programming Language :: Python :: 3.10
12
+ Classifier: Programming Language :: Python :: 3.11
13
+ Classifier: Programming Language :: Python :: 3.12
14
+ Classifier: Programming Language :: Python :: 3.13
15
+ Classifier: Topic :: System :: Distributed Computing
16
+ Requires-Python: >=3.10
17
+ Description-Content-Type: text/markdown
18
+ License-File: LICENSE.txt
19
+ Requires-Dist: jsonschema
20
+ Requires-Dist: ubiquerg>=0.5.2
21
+ Requires-Dist: yacman>=0.6.7
22
+ Dynamic: author
23
+ Dynamic: author-email
24
+ Dynamic: classifier
25
+ Dynamic: description
26
+ Dynamic: description-content-type
27
+ Dynamic: home-page
28
+ Dynamic: keywords
29
+ Dynamic: license
30
+ Dynamic: license-file
31
+ Dynamic: requires-dist
32
+ Dynamic: requires-python
33
+ Dynamic: summary
34
+
35
+ # Henge
36
+
37
+ Henge is a Python package for building data storage and retrieval interfaces for arbitrary data. Henge is based on the idea of **decomposable recursive unique identifiers (DRUIDs)**, which are hash-based unique identifiers for data derived from the data itself. For arbitrary data with any structure, Henge can mint unique DRUIDs to identify data, store the data in a key-value database of your choice, and provide lookup functions to retrieve the data in its original structure using its DRUID identifier.
38
+
39
+ Henge was intended as a building block for [sequence collections](https://github.com/refgenie/seqcol), but is generic enough to use for any data type that needs content-derived identifiers with database lookup capability.
40
+
41
+ ## Install
42
+
43
+ ```
44
+ pip install henge
45
+ ```
46
+
47
+ ## Quick Start
48
+
49
+ Create a Henge object by providing a database and a data schema. The database can be a Python dict or backed by persistent storage. Data schemas are [JSON-schema](https://json-schema.org/) descriptions of data types, and can be hierarchical.
50
+
51
+ ```python
52
+ import henge
53
+
54
+ schemas = ["path/to/json_schema.yaml"]
55
+ h = henge.Henge(database={}, schemas=schemas)
56
+ ```
57
+
58
+ Insert items into the henge. Upon insert, henge returns the DRUID (digest/checksum/unique identifier) for your object:
59
+
60
+ ```python
61
+ druid = h.insert({"name": "Pat", "age": 38}, item_type="person")
62
+ ```
63
+
64
+ Retrieve the original object using the DRUID:
65
+
66
+ ```python
67
+ h.retrieve(druid)
68
+ # {'age': '38', 'name': 'Pat'}
69
+ ```
70
+
71
+ ## Tutorial
72
+
73
+ For a comprehensive walkthrough covering basic types, arrays, nested objects, and advanced features, see the [tutorial notebook](docs/tutorial.ipynb).
74
+
75
+ ## What are DRUIDs?
76
+
77
+ DRUIDs are a special type of unique identifier with two powerful properties:
78
+
79
+ - **Decomposable**: Identifiers in henge automatically retrieve structured data (tuples, arrays, objects). The structure is defined by a JSON schema, so henge can be used as a back-end for arbitrary data types.
80
+
81
+ - **Recursive**: Individual elements retrieved by henge can be tagged as recursive, meaning these attributes contain their own DRUIDs. Henge can recurse through these, allowing you to mint unique identifiers for arbitrary nested data structures.
82
+
83
+ A DRUID is ultimately the result of a digest operation (such as `md5` or `sha256`) on some data. Because DRUIDs are computed deterministically from the item, they represent globally unique identifiers. If you insert the same item repeatedly, it will produce the same DRUID -- this is true across henges as long as they share a data schema.
84
+
85
+ ## Persisting Data
86
+
87
+ ### In-memory (default)
88
+
89
+ Use a Python `dict` as the database for testing or ephemeral use:
90
+
91
+ ```python
92
+ h = henge.Henge(database={}, schemas=schemas)
93
+ ```
94
+
95
+ ### SQLite backend
96
+
97
+ For persistent storage with SQLite:
98
+
99
+ ```python
100
+ from sqlitedict import SqliteDict
101
+
102
+ mydict = SqliteDict('./my_db.sqlite', autocommit=True)
103
+ h = henge.Henge(mydict, schemas=schemas)
104
+ ```
105
+
106
+ Requires: `pip install sqlitedict`
107
+
108
+ ### MongoDB backend
109
+
110
+ For production use with MongoDB:
111
+
112
+ 1. **Start MongoDB with Docker:**
113
+
114
+ ```bash
115
+ docker run --network="host" mongo
116
+ ```
117
+
118
+ For persistent storage, mount a volume to `/data/db`:
119
+
120
+ ```bash
121
+ docker run -it --network="host" -v /path/to/data:/data/db mongo
122
+ ```
123
+
124
+ 2. **Connect henge to MongoDB:**
125
+
126
+ ```python
127
+ import henge
128
+
129
+ h = henge.Henge(henge.connect_mongo(), schemas=schemas)
130
+ ```
131
+
132
+ Requires: `pip install pymongo mongodict`
@@ -6,9 +6,10 @@ henge/_version.py
6
6
  henge/const.py
7
7
  henge/deprecated.py
8
8
  henge/henge.py
9
+ henge/scconf.py
9
10
  henge.egg-info/PKG-INFO
10
11
  henge.egg-info/SOURCES.txt
11
12
  henge.egg-info/dependency_links.txt
12
- henge.egg-info/entry_points.txt
13
13
  henge.egg-info/requires.txt
14
- henge.egg-info/top_level.txt
14
+ henge.egg-info/top_level.txt
15
+ tests/test_henge.py
@@ -1,6 +1,5 @@
1
1
  #! /usr/bin/env python
2
2
 
3
- import os
4
3
  from setuptools import setup
5
4
  import sys
6
5
 
@@ -35,10 +34,10 @@ setup(
35
34
  classifiers=[
36
35
  "Development Status :: 4 - Beta",
37
36
  "License :: OSI Approved :: BSD License",
38
- "Programming Language :: Python :: 3.7",
39
- "Programming Language :: Python :: 3.8",
40
- "Programming Language :: Python :: 3.9",
41
37
  "Programming Language :: Python :: 3.10",
38
+ "Programming Language :: Python :: 3.11",
39
+ "Programming Language :: Python :: 3.12",
40
+ "Programming Language :: Python :: 3.13",
42
41
  "Topic :: System :: Distributed Computing",
43
42
  ],
44
43
  keywords="",
@@ -46,15 +45,12 @@ setup(
46
45
  author="Nathan Sheffield",
47
46
  author_email="nathan@code.databio.org",
48
47
  license="BSD2",
49
- entry_points={
50
- "console_scripts": ["packagename = packagename.packagename:main"],
51
- },
52
- package_data={"packagename": [os.path.join("packagename", "*")]},
48
+ python_requires=">=3.10",
53
49
  include_package_data=True,
54
50
  test_suite="tests",
55
51
  tests_require=(["pytest"]),
56
52
  setup_requires=(
57
53
  ["pytest-runner"] if {"test", "pytest", "ptr"} & set(sys.argv) else []
58
54
  ),
59
- **extra
55
+ **extra,
60
56
  )
@@ -0,0 +1,71 @@
1
+ import pytest
2
+ from henge import Henge
3
+ from jsonschema import ValidationError
4
+
5
+ # See conftest.py for fixtures
6
+
7
+
8
+ class TestInserting:
9
+ @pytest.mark.parametrize(
10
+ ["x", "success"],
11
+ [
12
+ ({"string_attr": "12321%@!"}, True),
13
+ ({"string_attr": "string", "integer_attr": 1}, True),
14
+ ({"string_attr": 1}, False),
15
+ ({"string_attr": ["a", "b"]}, False),
16
+ ({"string_attr": {"string_attr": "test"}}, False),
17
+ ({"string_attr": "string", "integer_attr": "1"}, False),
18
+ ({"integer_attr": 1}, False),
19
+ ],
20
+ )
21
+ def test_insert_validation_works(self, schema, x, success):
22
+ """Test whether insertion is performed only for valid objects"""
23
+ type_key = "test_item"
24
+ print(f"here's what I got for schema: {schema}")
25
+ h = Henge(database={}, schemas=["tests/data/schema.yaml"])
26
+ if success:
27
+ assert isinstance(h.insert(x, item_type=type_key), str)
28
+ else:
29
+ with pytest.raises(ValidationError):
30
+ h.insert(x, item_type=type_key)
31
+
32
+
33
+ class TestRetrieval:
34
+ @pytest.mark.parametrize(
35
+ "x",
36
+ [
37
+ ({"string_attr": "12321%@!", "integer_attr": 2}),
38
+ ({"string_attr": "string", "integer_attr": 1}),
39
+ ({"string_attr": "string"}),
40
+ ],
41
+ )
42
+ def test_retrieve_returns_inserted_obj(self, schema, x):
43
+ type_key = "test_item"
44
+ h = Henge(database={}, schemas=["tests/data/schema.yaml"])
45
+ d = h.insert(x, item_type=type_key)
46
+ # returns str versions of inserted data
47
+ assert h.retrieve(d) == {k: v for k, v in x.items()}
48
+
49
+ @pytest.mark.parametrize(
50
+ ["seq", "anno"],
51
+ [
52
+ ("ATGCAGTA", {"name": "seq1", "length": 10, "topology": "linear"}),
53
+ ("AAAAAAAA", {"name": "seq2", "length": 11, "topology": "linear"}),
54
+ ],
55
+ )
56
+ def test_retrieve_recurses(self, schema_asd, schema_sequence, seq, anno):
57
+ h = Henge(
58
+ database={},
59
+ schemas=[
60
+ "tests/data/sequence.yaml",
61
+ "tests/data/annotated_sequence_digest.yaml",
62
+ ],
63
+ )
64
+ seq_digest = h.insert(seq, item_type="sequence")
65
+ anno.update({"sequence_digest": seq_digest})
66
+ asd = h.insert(anno, item_type="annotated_sequence_digest")
67
+ res = h.retrieve(asd)
68
+ assert isinstance(res["name"], str)
69
+
70
+ def test_inherent_attributes(self, inherent):
71
+ print("test")
henge-0.2.1/PKG-INFO DELETED
@@ -1,25 +0,0 @@
1
- Metadata-Version: 2.1
2
- Name: henge
3
- Version: 0.2.1
4
- Summary: Storage and retrieval of object-derived, decomposable recursive unique identifiers.
5
- Home-page: https://databio.org
6
- Author: Nathan Sheffield
7
- Author-email: nathan@code.databio.org
8
- License: BSD2
9
- Classifier: Development Status :: 4 - Beta
10
- Classifier: License :: OSI Approved :: BSD License
11
- Classifier: Programming Language :: Python :: 3.7
12
- Classifier: Programming Language :: Python :: 3.8
13
- Classifier: Programming Language :: Python :: 3.9
14
- Classifier: Programming Language :: Python :: 3.10
15
- Classifier: Topic :: System :: Distributed Computing
16
- Description-Content-Type: text/markdown
17
- License-File: LICENSE.txt
18
-
19
- [![Build Status](https://travis-ci.com/databio/henge.svg?branch=master)](https://travis-ci.com/databio/henge)
20
-
21
- # Henge
22
-
23
- Henge is a Python package that builds backends for generic decomposable recursive unique identifiers (or, *DRUIDs*). It is intended to be used as a building block for sequence collections (see the [seqcol package](https://github.com/databio/seqcol)), and also for other data types that need content-derived identifiers.
24
-
25
- Documentation at [http://henge.databio.org](http://henge.databio.org).
henge-0.2.1/README.md DELETED
@@ -1,7 +0,0 @@
1
- [![Build Status](https://travis-ci.com/databio/henge.svg?branch=master)](https://travis-ci.com/databio/henge)
2
-
3
- # Henge
4
-
5
- Henge is a Python package that builds backends for generic decomposable recursive unique identifiers (or, *DRUIDs*). It is intended to be used as a building block for sequence collections (see the [seqcol package](https://github.com/databio/seqcol)), and also for other data types that need content-derived identifiers.
6
-
7
- Documentation at [http://henge.databio.org](http://henge.databio.org).
@@ -1 +0,0 @@
1
- __version__ = "0.2.1"
@@ -1,25 +0,0 @@
1
- Metadata-Version: 2.1
2
- Name: henge
3
- Version: 0.2.1
4
- Summary: Storage and retrieval of object-derived, decomposable recursive unique identifiers.
5
- Home-page: https://databio.org
6
- Author: Nathan Sheffield
7
- Author-email: nathan@code.databio.org
8
- License: BSD2
9
- Classifier: Development Status :: 4 - Beta
10
- Classifier: License :: OSI Approved :: BSD License
11
- Classifier: Programming Language :: Python :: 3.7
12
- Classifier: Programming Language :: Python :: 3.8
13
- Classifier: Programming Language :: Python :: 3.9
14
- Classifier: Programming Language :: Python :: 3.10
15
- Classifier: Topic :: System :: Distributed Computing
16
- Description-Content-Type: text/markdown
17
- License-File: LICENSE.txt
18
-
19
- [![Build Status](https://travis-ci.com/databio/henge.svg?branch=master)](https://travis-ci.com/databio/henge)
20
-
21
- # Henge
22
-
23
- Henge is a Python package that builds backends for generic decomposable recursive unique identifiers (or, *DRUIDs*). It is intended to be used as a building block for sequence collections (see the [seqcol package](https://github.com/databio/seqcol)), and also for other data types that need content-derived identifiers.
24
-
25
- Documentation at [http://henge.databio.org](http://henge.databio.org).
@@ -1,2 +0,0 @@
1
- [console_scripts]
2
- packagename = packagename.packagename:main