hyperion-sdk 0.2.0.dev1741815359__tar.gz

Files changed (29)
  1. hyperion_sdk-0.2.0.dev1741815359/PKG-INFO +476 -0
  2. hyperion_sdk-0.2.0.dev1741815359/README.md +445 -0
  3. hyperion_sdk-0.2.0.dev1741815359/hyperion/__init__.py +0 -0
  4. hyperion_sdk-0.2.0.dev1741815359/hyperion/asyncutils.py +79 -0
  5. hyperion_sdk-0.2.0.dev1741815359/hyperion/catalog/__init__.py +8 -0
  6. hyperion_sdk-0.2.0.dev1741815359/hyperion/catalog/catalog.py +623 -0
  7. hyperion_sdk-0.2.0.dev1741815359/hyperion/catalog/schema.py +153 -0
  8. hyperion_sdk-0.2.0.dev1741815359/hyperion/collections/__init__.py +0 -0
  9. hyperion_sdk-0.2.0.dev1741815359/hyperion/collections/asset_collection.py +285 -0
  10. hyperion_sdk-0.2.0.dev1741815359/hyperion/config.py +77 -0
  11. hyperion_sdk-0.2.0.dev1741815359/hyperion/dateutils.py +238 -0
  12. hyperion_sdk-0.2.0.dev1741815359/hyperion/entities/__init__.py +0 -0
  13. hyperion_sdk-0.2.0.dev1741815359/hyperion/entities/catalog.py +190 -0
  14. hyperion_sdk-0.2.0.dev1741815359/hyperion/infrastructure/__init__.py +0 -0
  15. hyperion_sdk-0.2.0.dev1741815359/hyperion/infrastructure/aws.py +220 -0
  16. hyperion_sdk-0.2.0.dev1741815359/hyperion/infrastructure/cache.py +396 -0
  17. hyperion_sdk-0.2.0.dev1741815359/hyperion/infrastructure/geo/__init__.py +7 -0
  18. hyperion_sdk-0.2.0.dev1741815359/hyperion/infrastructure/geo/gmaps.py +124 -0
  19. hyperion_sdk-0.2.0.dev1741815359/hyperion/infrastructure/geo/location.py +186 -0
  20. hyperion_sdk-0.2.0.dev1741815359/hyperion/infrastructure/http.py +62 -0
  21. hyperion_sdk-0.2.0.dev1741815359/hyperion/infrastructure/keyval.py +264 -0
  22. hyperion_sdk-0.2.0.dev1741815359/hyperion/infrastructure/queue.py +151 -0
  23. hyperion_sdk-0.2.0.dev1741815359/hyperion/infrastructure/secrets.py +63 -0
  24. hyperion_sdk-0.2.0.dev1741815359/hyperion/logging.py +122 -0
  25. hyperion_sdk-0.2.0.dev1741815359/hyperion/py.typed +0 -0
  26. hyperion_sdk-0.2.0.dev1741815359/hyperion/sources/__init__.py +0 -0
  27. hyperion_sdk-0.2.0.dev1741815359/hyperion/sources/base.py +105 -0
  28. hyperion_sdk-0.2.0.dev1741815359/hyperion/typeutils.py +52 -0
  29. hyperion_sdk-0.2.0.dev1741815359/pyproject.toml +101 -0
@@ -0,0 +1,476 @@
+ Metadata-Version: 2.3
+ Name: hyperion-sdk
+ Version: 0.2.0.dev1741815359
+ Summary: A headless ETL / ELT / data pipeline and integration SDK for Python.
+ License: MIT
+ Author: Tomas Votava
+ Author-email: info@tomasvotava.eu
+ Requires-Python: >=3.10,<3.13
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Requires-Dist: aioboto3 (>=13.2.0,<14.0.0)
+ Requires-Dist: aws-lambda-typing (>=2.20.0,<3.0.0)
+ Requires-Dist: boto3 (>=1.35.36,<2.0.0)
+ Requires-Dist: cachetools (>=5.5.0,<6.0.0)
+ Requires-Dist: env-proxy (>=1.1.0,<2.0.0)
+ Requires-Dist: fastavro (>=1.9.7,<2.0.0)
+ Requires-Dist: googlemaps (>=4.10.0,<5.0.0)
+ Requires-Dist: haversine (>=2.9.0,<3.0.0)
+ Requires-Dist: httpx (>=0.28.1,<0.29.0)
+ Requires-Dist: loguru (>=0.7.2,<0.8.0)
+ Requires-Dist: numpy (>=2.2.0,<3.0.0)
+ Requires-Dist: pydantic (>=2.10.2,<3.0.0)
+ Requires-Dist: python-dateutil (>=2.9.0.post0,<3.0.0)
+ Requires-Dist: python-dotenv (>=1.0.1,<2.0.0)
+ Requires-Dist: python-snappy (>=0.7.3,<0.8.0)
+ Description-Content-Type: text/markdown
+
+ # Hyperion
+
+ A headless ETL / ELT / data pipeline and integration SDK for Python.
+
+ [![pre-commit](https://github.com/tomasvotava/hyperion/actions/workflows/pre-commit.yml/badge.svg?branch=master)](https://github.com/tomasvotava/hyperion/actions/workflows/pre-commit.yml)
+ [![pytest](https://github.com/tomasvotava/hyperion/actions/workflows/pytest.yml/badge.svg?branch=master)](https://github.com/tomasvotava/hyperion/actions/workflows/pytest.yml)
+ [![codecov](https://codecov.io/gh/tomasvotava/hyperion/branch/master/graph/badge.svg?token=your-token)](https://codecov.io/gh/tomasvotava/hyperion)
+
+ ## Features
+
+ - **Data Catalog System**: Manage and organize data assets across S3 buckets
+ - **Schema Management**: Validate and store schema definitions for data assets
+ - **AWS Infrastructure Abstractions**: Simplified interfaces for S3, DynamoDB, SQS, and Secrets Manager
+ - **Source Framework**: Define data sources that extract data and store it in the catalog
+ - **Caching**: In-memory, local file, and DynamoDB caching options
+ - **Asynchronous Processing**: Utilities for async operations and task queues
+ - **Geo Utilities**: Location-based services with Google Maps integration
+
+ ## Core Concepts
+
+ ### Assets
+
+ Assets are the fundamental units of data in Hyperion. Each asset represents a dataset stored in a specific location with a defined schema. Hyperion supports three types of assets:
+
+ #### DataLakeAsset
+
+ - Represents raw, immutable data stored in a data lake
+ - Time-partitioned by date
+ - Each partition has a schema version
+ - Example use cases: raw API responses, event logs, or any source data that needs to be preserved in its original form
+
+ #### FeatureAsset
+
+ - Represents processed feature data with time resolution
+ - Used for analytics, machine learning features, and derived datasets
+ - Supports different time resolutions (seconds, minutes, hours, days, weeks, months, years)
+ - Can include additional partition keys for finer-grained organization
+ - Example use cases: aggregated metrics, processed signals, ML features
+
+ #### PersistentStoreAsset
+
+ - Represents persistent data storage without time partitioning
+ - Used for reference data, lookup tables, or any data that doesn't change frequently
+ - Example use cases: reference data, configuration settings, master data
+
+ ### Schema Management
+
+ All assets in Hyperion have associated schemas that define their structure:
+
+ - **Schema Store**: The SchemaStore manages asset schemas in Avro format
+ - **Schema Validation**: All data is validated against its schema during storage
+ - **Schema Versioning**: Assets include a schema version to support evolution over time
+ - **Schema Storage**: Schemas can be stored in the local filesystem or S3
+
+ If a schema is missing for an asset:
+
+ 1. An error will be raised when attempting to store or retrieve the asset
+ 2. You need to define the schema in Avro format and store it in the schema store
+ 3. The schema should be named according to the pattern: `{asset_type}/{asset_name}.v{version}.avro.json`
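
To illustrate the naming pattern in step 3, here is a minimal sketch that builds a schema file name from its three parts. It is not part of the package: `schema_key` is a hypothetical helper, and the `data_lake` asset type is taken from the schema path used in the README's own example.

```python
def schema_key(asset_type: str, asset_name: str, version: int) -> str:
    """Build a schema file name following the documented pattern.

    Hypothetical helper for illustration; the SDK may construct
    this name differently internally.
    """
    return f"{asset_type}/{asset_name}.v{version}.avro.json"


print(schema_key("data_lake", "customer_data", 1))
# data_lake/customer_data.v1.avro.json
```

The result is relative to wherever the schema store is rooted, whether that is a local directory or an S3 prefix.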
+
+ ### Catalog
+
+ The Catalog is the central component that manages asset storage and retrieval:
+
+ - **Storage Location**: Maps asset types to their appropriate storage buckets
+ - **Asset Retrieval**: Provides methods to retrieve assets by name, date, and schema version
+ - **Partitioning**: Handles partitioning logic for different asset types
+ - **Notifications**: Can send notifications when new assets arrive
+
+ ### Source Framework
+
+ Sources are responsible for extracting data from external systems and storing it in the catalog:
+
+ - **Standardized Interface**: All sources implement a common interface
+ - **AWS Lambda Support**: Easy integration with AWS Lambda for scheduled extraction
+ - **Backfill Capability**: Support for historical data backfill
+ - **Incremental Processing**: Extract data with date-based filtering
+
+ ## Installation
+
+ Hyperion uses [Poetry](https://python-poetry.org/) for dependency management:
+
+ ```bash
+ # Clone the repository
+ git clone https://github.com/tomasvotava/hyperion.git
+ cd hyperion
+
+ # Install dependencies
+ poetry install
+ ```
+
+ ## Configuration
+
+ Hyperion is configured through environment variables. You can use a `.env` file for local development:
+
+ ```bash
+ # Common settings
+ HYPERION_COMMON_LOG_PRETTY=True
+ HYPERION_COMMON_LOG_LEVEL=INFO
+ HYPERION_COMMON_SERVICE_NAME=my-service
+
+ # Storage settings
+ HYPERION_STORAGE_DATA_LAKE_BUCKET=my-data-lake-bucket
+ HYPERION_STORAGE_FEATURE_STORE_BUCKET=my-feature-store-bucket
+ HYPERION_STORAGE_PERSISTENT_STORE_BUCKET=my-persistent-store-bucket
+ HYPERION_STORAGE_SCHEMA_PATH=s3://my-schema-bucket/schemas
+ HYPERION_STORAGE_MAX_CONCURRENCY=5
+
+ # Queue settings
+ HYPERION_QUEUE_URL=https://sqs.us-east-1.amazonaws.com/123456789012/my-queue
+
+ # Secrets settings
+ HYPERION_SECRETS_BACKEND=AWSSecretsManager
+
+ # HTTP settings (optional)
+ HYPERION_HTTP_PROXY_HTTP=http://proxy:8080
+ HYPERION_HTTP_PROXY_HTTPS=http://proxy:8080
+
+ # Geo settings (optional)
+ HYPERION_GEO_GMAPS_API_KEY=your-google-maps-api-key
+ ```
+
+ Until fuller documentation is written, see
+ [`hyperion/config.py`](hyperion/config.py) for all available configuration options. Hyperion uses [`EnvProxy`](https://github.com/tomasvotava/env-proxy) for configuration.
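
The variable names above follow a `HYPERION_<SECTION>_<KEY>` shape. As a rough aid for checking what is set in an environment, the sketch below groups such variables by section using only the standard library. It is not how `EnvProxy` or `hyperion/config.py` read configuration: `hyperion_settings` is a hypothetical helper, and the section/key split is an assumption inferred from the example names.

```python
import os
from collections import defaultdict


def hyperion_settings(environ=None, prefix="HYPERION_"):
    """Group HYPERION_<SECTION>_<KEY> variables by lower-cased section.

    Hypothetical inspection helper; values are kept as raw strings.
    """
    environ = os.environ if environ is None else environ
    grouped = defaultdict(dict)
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue  # ignore unrelated variables such as PATH
        # The first underscore after the prefix separates section from key.
        section, _, key = name[len(prefix):].partition("_")
        grouped[section.lower()][key.lower()] = value
    return dict(grouped)


example = {
    "HYPERION_COMMON_LOG_LEVEL": "INFO",
    "HYPERION_STORAGE_MAX_CONCURRENCY": "5",
    "HYPERION_QUEUE_URL": "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue",
    "PATH": "/usr/bin",
}
settings = hyperion_settings(example)
print(settings["storage"]["max_concurrency"])  # prints 5 (as a string)
```

Calling `hyperion_settings()` with no arguments inspects the real process environment instead of the example mapping.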
+
+ ## Usage Examples
+
+ ### Working with Assets
+
+ #### Creating and Storing a DataLakeAsset
+
+ ```python
+ from hyperion.catalog import Catalog
+ from hyperion.entities.catalog import DataLakeAsset
+ from datetime import datetime, timezone
+
+ # Initialize the catalog
+ catalog = Catalog.from_config()
+
+ # Create a data lake asset
+ asset = DataLakeAsset(
+     name="customer_data",
+     date=datetime.now(timezone.utc),
+     schema_version=1
+ )
+
+ # Store data in the asset
+ data = [
+     {"id": 1, "name": "Customer 1", "timestamp": datetime.now(timezone.utc)},
+     {"id": 2, "name": "Customer 2", "timestamp": datetime.now(timezone.utc)},
+ ]
+
+ catalog.store_asset(asset, data)
+ ```
+
+ #### Working with FeatureAssets
+
+ ```python
+ from hyperion.catalog import Catalog
+ from hyperion.entities.catalog import FeatureAsset
+ from hyperion.dateutils import TimeResolution
+ from datetime import datetime, timezone
+
+ # Initialize the catalog
+ catalog = Catalog.from_config()
+
+ # Create a feature asset with daily resolution
+ resolution = TimeResolution(1, "d")  # 1 day resolution
+ asset = FeatureAsset(
+     name="customer_activity",
+     partition_date=datetime.now(timezone.utc),
+     resolution=resolution,
+     schema_version=1
+ )
+
+ # Store aggregated feature data
+ feature_data = [
+     {"customer_id": 1, "activity_score": 87.5, "date": datetime.now(timezone.utc)},
+     {"customer_id": 2, "activity_score": 92.1, "date": datetime.now(timezone.utc)},
+ ]
+
+ catalog.store_asset(asset, feature_data)
+
+ # Retrieve feature data for a specific time period
+ from_date = datetime(2023, 1, 1, tzinfo=timezone.utc)
+ to_date = datetime(2023, 1, 31, tzinfo=timezone.utc)
+
+ for feature_asset in catalog.iter_feature_store_partitions(
+     feature_name="customer_activity",
+     resolution="1d",  # Can use string format too
+     date_from=from_date,
+     date_to=to_date
+ ):
+     data = catalog.retrieve_asset(feature_asset)
+     for record in data:
+         print(record)
+ ```
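
The example above writes a resolution both as `TimeResolution(1, "d")` and as the string `"1d"`. The sketch below shows one way such a string could be split into a count and a unit and turned into a `timedelta`. It is an illustration only, not the SDK's `TimeResolution` parser: `parse_resolution` is hypothetical, only the `d` and `h` unit letters appear in the README, and the other letters in the table are assumptions. Months and years are left out because they have no fixed length.

```python
import re
from datetime import timedelta

# Only "d" and "h" appear in the README; "s", "m" and "w" are assumed here.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def parse_resolution(text: str) -> timedelta:
    """Parse strings such as "1d" or "6h" into a fixed-length timedelta."""
    match = re.fullmatch(r"(\d+)([a-z])", text)
    if match is None or match.group(2) not in _UNIT_SECONDS:
        raise ValueError(f"unsupported resolution: {text!r}")
    value, unit = int(match.group(1)), match.group(2)
    return timedelta(seconds=value * _UNIT_SECONDS[unit])


print(parse_resolution("1d"))  # 1 day, 0:00:00
print(parse_resolution("6h"))  # 6:00:00
```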
+
+ #### Working with PersistentStoreAssets
+
+ ```python
+ from hyperion.catalog import Catalog
+ from hyperion.entities.catalog import PersistentStoreAsset
+
+ # Initialize the catalog
+ catalog = Catalog.from_config()
+
+ # Create a persistent store asset
+ asset = PersistentStoreAsset(
+     name="product_catalog",
+     schema_version=1
+ )
+
+ # Store reference data
+ products = [
+     {"id": "P001", "name": "Product 1", "category": "Electronics"},
+     {"id": "P002", "name": "Product 2", "category": "Clothing"},
+ ]
+
+ catalog.store_asset(asset, products)
+
+ # Retrieve reference data
+ for product in catalog.retrieve_asset(asset):
+     print(product)
+ ```
+
+ ### Creating a Custom Source
+
+ ```python
+ import asyncio
+ from datetime import datetime, timezone
+ from typing import AsyncIterator
+
+ from hyperion.catalog import Catalog
+ from hyperion.entities.catalog import DataLakeAsset
+ from hyperion.sources.base import Source, SourceAsset
+
+
+ class MyCustomSource(Source):
+     source = "my-custom-source"
+
+     async def run(self, start_date=None, end_date=None) -> AsyncIterator[SourceAsset]:
+         # Fetch your data (this is where you'd implement your data extraction logic)
+         data = [
+             {"id": 1, "name": "Item 1", "timestamp": datetime.now(timezone.utc)},
+             {"id": 2, "name": "Item 2", "timestamp": datetime.now(timezone.utc)},
+         ]
+
+         # Create asset
+         asset = DataLakeAsset(
+             name="my-custom-data",
+             date=datetime.now(timezone.utc)
+         )
+
+         # Yield source asset
+         yield SourceAsset(asset=asset, data=data)
+
+
+ # Use with AWS Lambda
+ def lambda_handler(event, context):
+     MyCustomSource.handle_aws_lambda_event(event, context)
+
+
+ # Use standalone
+ if __name__ == "__main__":
+     asyncio.run(MyCustomSource._run(Catalog.from_config()))
+ ```
+
+ ### Working with Schemas
+
+ To create and register a schema for an asset:
+
+ ```python
+ import json
+ from pathlib import Path
+
+ # Define schema in Avro format
+ schema = {
+     "type": "record",
+     "name": "CustomerData",
+     "fields": [
+         {"name": "id", "type": "int"},
+         {"name": "name", "type": "string"},
+         {"name": "timestamp", "type": {"type": "long", "logicalType": "timestamp-millis"}}
+     ]
+ }
+
+ # Save schema to local file
+ schema_path = Path("schemas/data_lake/customer_data.v1.avro.json")
+ schema_path.parent.mkdir(parents=True, exist_ok=True)
+ with open(schema_path, "w") as f:
+     json.dump(schema, f)
+
+ # Or upload to S3 if using S3SchemaStore
+ import boto3
+ s3_client = boto3.client('s3')
+ s3_client.put_object(
+     Bucket="my-schema-bucket",
+     Key="data_lake/customer_data.v1.avro.json",
+     Body=json.dumps(schema)
+ )
+ ```
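
The `timestamp` field above uses Avro's `timestamp-millis` logical type, which stores an instant as a `long` counting milliseconds since the Unix epoch (1970-01-01T00:00:00 UTC). The sketch below shows that mapping with the standard library only. The helper names are hypothetical, and Avro libraries commonly perform this conversion for you, so it is meant to explain the encoding rather than to be required when storing assets.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_timestamp_millis(moment: datetime) -> int:
    """Convert a timezone-aware datetime to milliseconds since the epoch."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    # Integer division of timedeltas avoids floating-point rounding.
    return (moment - EPOCH) // timedelta(milliseconds=1)


def from_timestamp_millis(millis: int) -> datetime:
    """Convert milliseconds since the epoch back to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=millis)


millis = to_timestamp_millis(datetime(2023, 1, 1, tzinfo=timezone.utc))
print(millis)  # 1672531200000
print(from_timestamp_millis(millis).isoformat())  # 2023-01-01T00:00:00+00:00
```

Sub-millisecond precision is discarded by this encoding, so a round trip only preserves values to the millisecond.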
+
+ ## Advanced Features
+
+ ### Asset Collections
+
+ Asset collections provide a high-level interface for fetching and working with groups of assets. You can define a collection class that specifies the assets you need and fetch them all at once.
+
+ See [`docs/asset_collections.md`](docs/asset_collections.md) for more information.
+
+ ### Repartitioning Data
+
+ ```python
+ from hyperion.catalog import Catalog
+ from hyperion.entities.catalog import DataLakeAsset
+ from hyperion.dateutils import TimeResolutionUnit
+ from datetime import datetime, timezone
+ import asyncio
+
+ async def repartition_data():
+     catalog = Catalog.from_config()
+
+     # Original asset with day-level partitioning
+     asset = DataLakeAsset(
+         name="web_logs",
+         date=datetime.now(timezone.utc),
+         schema_version=1
+     )
+
+     # Repartition by hour
+     await catalog.repartition(
+         asset,
+         granularity=TimeResolutionUnit("h"),
+         date_attribute="timestamp"
+     )
+
+ asyncio.run(repartition_data())
+ ```
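
Conceptually, repartitioning by hour means reading each record's `date_attribute` and grouping records by the hour they fall into. The sketch below shows that grouping step in isolation, on in-memory records. It is not the implementation of `Catalog.repartition`: `group_by_hour` is a hypothetical helper, and how the SDK reads, writes, and names the resulting partitions is not covered here.

```python
from collections import defaultdict
from datetime import datetime, timezone


def group_by_hour(records, date_attribute="timestamp"):
    """Group records into buckets keyed by the start of their hour."""
    buckets = defaultdict(list)
    for record in records:
        moment = record[date_attribute]
        # Truncate the timestamp to the start of its hour.
        hour = moment.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(record)
    return dict(buckets)


logs = [
    {"path": "/a", "timestamp": datetime(2023, 1, 1, 10, 5, tzinfo=timezone.utc)},
    {"path": "/b", "timestamp": datetime(2023, 1, 1, 10, 59, tzinfo=timezone.utc)},
    {"path": "/c", "timestamp": datetime(2023, 1, 1, 11, 0, tzinfo=timezone.utc)},
]
for hour, rows in sorted(group_by_hour(logs).items()):
    print(hour.isoformat(), len(rows))
# 2023-01-01T10:00:00+00:00 2
# 2023-01-01T11:00:00+00:00 1
```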
+
+ ### Caching
+
+ ```python
+ from hyperion.infrastructure.cache import Cache
+
+ # Get cache from configuration
+ cache = Cache.from_config()
+
+ # Store data in cache
+ cache.set("my-key", "my-value")
+
+ # Retrieve data from cache
+ value = cache.get("my-key")
+ print(value)  # "my-value"
+
+ # Check if key exists
+ if cache.hit("my-key"):
+     print("Cache hit!")
+
+ # Delete key
+ cache.delete("my-key")
+ ```
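
For trying out code that depends on this interface without any configured backend, a dictionary-backed stand-in may be enough. The sketch below mirrors only the four method names used in the example above (`set`, `get`, `hit`, `delete`). It is not the SDK's in-memory cache: `InMemoryCache` is a hypothetical class, and returning `None` on a miss is an assumption, since the README does not say what `Cache.get` does for a missing key.

```python
class InMemoryCache:
    """Dictionary-backed stand-in with set/get/hit/delete methods."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        # Assumption: a missing key yields None rather than raising.
        return self._data.get(key)

    def hit(self, key):
        return key in self._data

    def delete(self, key):
        # Deleting a missing key is a no-op in this sketch.
        self._data.pop(key, None)


cache = InMemoryCache()
cache.set("my-key", "my-value")
print(cache.get("my-key"))  # my-value
print(cache.hit("my-key"))  # True
cache.delete("my-key")
print(cache.hit("my-key"))  # False
```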
+
+ ### Geo Utilities
+
+ ```python
+ from hyperion.infrastructure.geo import GoogleMaps, Location
+
+ # Initialize Google Maps client
+ gmaps = GoogleMaps.from_config()
+
+ # Geocode an address
+ with gmaps:
+     location = gmaps.geocode("1600 Amphitheatre Parkway, Mountain View, CA")
+     print(f"Latitude: {location.latitude}, Longitude: {location.longitude}")
+
+     # Reverse geocode a location
+     named_location = gmaps.reverse_geocode(location)
+     print(f"Address: {named_location.address}")
+     print(f"Country: {named_location.country}")
+ ```
+
+ ## Development
+
+ ### Setup Development Environment
+
+ ```bash
+ # Install development dependencies
+ poetry install
+
+ # Install pre-commit hooks
+ poetry run pre-commit install
+ ```
+
+ ### Running Tests
+
+ ```bash
+ # Run all tests
+ poetry run pytest
+
+ # Run with coverage
+ poetry run pytest --cov=hyperion
+
+ # Run specific test files
+ poetry run pytest tests/test_asyncutils.py
+ ```
+
+ ### Code Style
+
+ This project uses [pre-commit](https://pre-commit.com/) hooks to enforce code style:
+
+ ```bash
+ # Run pre-commit on all files
+ poetry run pre-commit run -a
+ ```
+
+ The project uses:
+
+ - [ruff](https://github.com/charliermarsh/ruff) for linting
+ - [mypy](https://mypy.readthedocs.io/) for type checking
+ - [commitizen](https://github.com/commitizen-tools/commitizen) for standardized commits
+
+ ## Architecture
+
+ ### Core Components
+
+ - **Catalog**: Manages data assets and their storage in S3
+ - **SchemaStore**: Handles schema validation and storage
+ - **Source**: Base class for implementing data sources
+ - **Infrastructure**: Abstractions for AWS services (S3, DynamoDB, SQS, etc.)
+ - **Utils**: Helper functions for dates, async operations, etc.
+
+ ### Asset Types
+
+ - **DataLakeAsset**: Raw data stored in a data lake
+ - **FeatureAsset**: Processed features with time resolution
+ - **PersistentStoreAsset**: Persistent data storage
+
+ ## Contributing
+
+ See [CONTRIBUTING.md](./CONTRIBUTING.md) for guidelines on contributing to this project.
+
+ ## License
+
+ This project is licensed under the MIT License - see the LICENSE file for details.
+