acdc_aws_etl_pipeline 0.7.2__py3-none-any.whl → 0.7.3__py3-none-any.whl
@@ -0,0 +1,75 @@
+Metadata-Version: 2.4
+Name: acdc_aws_etl_pipeline
+Version: 0.7.3
+Summary: Tools for ACDC ETL pipeline
+Author: JoshuaHarris391
+Author-email: harjo391@gmail.com
+Requires-Python: >=3.9.5,<4.0.0
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Programming Language :: Python :: 3.14
+Requires-Dist: boto3
+Requires-Dist: dbt-athena (==1.9.4)
+Requires-Dist: dbt-core (==1.9.4)
+Requires-Dist: gen3 (>=4.27.4,<5.0.0)
+Requires-Dist: gen3_validator (>=2.0.0,<3.0.0)
+Requires-Dist: numpy (<2.0.0)
+Requires-Dist: pyjwt (>=2.10.1,<3.0.0)
+Requires-Dist: pytest
+Requires-Dist: python-dotenv
+Requires-Dist: pytz (>=2025.2,<2026.0)
+Requires-Dist: pyyaml (>=6.0.2,<7.0.0)
+Requires-Dist: s3fs (==2025.10.0)
+Requires-Dist: tenacity (>=8.2,<10.0)
+Requires-Dist: tzlocal (>=5.3.1,<6.0.0)
+Description-Content-Type: text/markdown
+
+# acdc-aws-etl-pipeline
+Infrastructure and code for the ACDC ETL pipeline and data operations in AWS
+
+## Documentation
+
+- [Dictionary deployment](docs/dictionary_deployment.md)
+- [Data ingestion](docs/data_ingestion.md)
+- [Data validation](docs/data_validation.md)
+- [Data transformation (dbt)](docs/data_transformation_dbt.md)
+- [Data releases](docs/write_data_release.md)
+- [Synthetic data generation](docs/synthetic_data_generation.md)
+- [REST API upload to sheepdog](docs/rest_api_sheepdog_upload.md)
+- [Data deletion](docs/data_deletion.md)
+- [IndexD file registration](docs/indexd_registration.md)
+- [Querying Athena](docs/querying_athena.md)
+- [Writing Athena queries to JSON](docs/write_athena_queries_to_json.md)
+- [Troubleshooting](docs/troubleshooting.md)
+
+## Library and source code (`src/acdc_aws_etl_pipeline`)
+
+The Python package in [`src/acdc_aws_etl_pipeline`](src/acdc_aws_etl_pipeline) provides reusable utilities for ingestion, validation, uploads, and Athena/Glue operations used across the pipeline and services.
+
+### Modules
+
+- **`ingest/`**: ingestion helpers for loading source datasets into S3/Glue (see [`ingest/ingest.py`](src/acdc_aws_etl_pipeline/ingest/ingest.py)).
+- **`upload/`**: Gen3/Sheepdog metadata submission and deletion utilities (e.g. [`upload/metadata_submitter.py`](src/acdc_aws_etl_pipeline/upload/metadata_submitter.py)).
+- **`validate/`**: schema validation utilities and helpers for validation workflows (see [`validate/validate.py`](src/acdc_aws_etl_pipeline/validate/validate.py)).
+- **`utils/`**: shared Athena/Glue/dbt/release helpers (e.g. [`utils/athena_utils.py`](src/acdc_aws_etl_pipeline/utils/athena_utils.py), [`utils/release_writer.py`](src/acdc_aws_etl_pipeline/utils/release_writer.py)).
+
+### Local development
+
+To install dependencies and run tests:
+
+```bash
+pip install poetry
+poetry install
+source $(poetry env info --path)/bin/activate
+poetry run pytest
+```
+
+### Install from PyPI
+
+Releases are published automatically, so you can also install the package directly:
+
+```bash
+pip install acdc_aws_etl_pipeline
@@ -9,6 +9,6 @@ acdc_aws_etl_pipeline/utils/athena_utils.py,sha256=QJlBe-07Hkq-BqmcxBu6ZtAmVfZSH
 acdc_aws_etl_pipeline/utils/dbt_utils.py,sha256=5XRFOwNNIeuW2sQuor3h_OZTuXGg6xv2AUYwj9bMAAM,2054
 acdc_aws_etl_pipeline/utils/release_writer.py,sha256=vsxHJ6l-UWPpzeyEPHurX5iFgeCEQ-9FbySAbPNfTTM,7555
 acdc_aws_etl_pipeline/validate/validate.py,sha256=zLqK9i92FsRAaBOGdY-G7-vb0e6tmkoUXhY6zCfbjN8,24895
-acdc_aws_etl_pipeline-0.7.
-acdc_aws_etl_pipeline-0.7.
-acdc_aws_etl_pipeline-0.7.
+acdc_aws_etl_pipeline-0.7.3.dist-info/METADATA,sha256=4HU13kgbLqeFVwPPtjSjSgNO7M1FhxcKANJHR2_3c2k,3096
+acdc_aws_etl_pipeline-0.7.3.dist-info/WHEEL,sha256=zp0Cn7JsFoX2ATtOhtaFYIiE2rmFAD4OcMhtUki8W3U,88
+acdc_aws_etl_pipeline-0.7.3.dist-info/RECORD,,
@@ -1,82 +0,0 @@
-Metadata-Version: 2.4
-Name: acdc_aws_etl_pipeline
-Version: 0.7.2
-Summary: Tools for ACDC ETL pipeline
-Author: JoshuaHarris391
-Author-email: harjo391@gmail.com
-Requires-Python: >=3.9.5,<4.0.0
-Classifier: Programming Language :: Python :: 3
-Classifier: Programming Language :: Python :: 3.10
-Classifier: Programming Language :: Python :: 3.11
-Classifier: Programming Language :: Python :: 3.12
-Classifier: Programming Language :: Python :: 3.13
-Classifier: Programming Language :: Python :: 3.14
-Requires-Dist: boto3
-Requires-Dist: dbt-athena (==1.9.4)
-Requires-Dist: dbt-core (==1.9.4)
-Requires-Dist: gen3 (>=4.27.4,<5.0.0)
-Requires-Dist: gen3_validator (>=2.0.0,<3.0.0)
-Requires-Dist: numpy (<2.0.0)
-Requires-Dist: pyjwt (>=2.10.1,<3.0.0)
-Requires-Dist: pytest
-Requires-Dist: python-dotenv
-Requires-Dist: pytz (>=2025.2,<2026.0)
-Requires-Dist: pyyaml (>=6.0.2,<7.0.0)
-Requires-Dist: s3fs (==2025.10.0)
-Requires-Dist: tenacity (>=8.2,<10.0)
-Requires-Dist: tzlocal (>=5.3.1,<6.0.0)
-Description-Content-Type: text/markdown
-
-# acdc-aws-etl-pipeline
-Infrastructure and code for the ACDC ETL pipeline and data operations in AWS
-
-## Ingestion
-- [ingestion](docs/ingestion.md)
-- [upload_synthdata_s3](docs/upload_synthdata_s3.md)
-
-## DBT
-
-
-
-## Release Management
-- [Writing DBT Releases](docs/write_dbt_release_info.md)
-
-
-## Deploying the dictionary
-e.g. to testing
-
-```bash
-# Example
-bash services/dictionary/pull_dict.sh <raw_dictionary_url>
-bash services/dictionary/upload_dictionary.py <local_dictionary_path> <s3_target_uri>
-
-# Deploying to test
-VERSION=v0.6.3
-bash services/dictionary/pull_dict.sh "https://raw.githubusercontent.com/AustralianBioCommons/acdc-schema-json/refs/tags/${VERSION}/dictionary/prod_dict/acdc_schema.json"
-python3 services/dictionary/upload_dictionary.py "services/dictionary/schemas/acdc_schema_${VERSION}.json" s3://gen3schema-cad-uat-biocommons.org.au/cad.json
-
-
-# Deploying to staging
-VERSION=v1.0.0
-bash services/dictionary/pull_dict.sh "https://raw.githubusercontent.com/AustralianBioCommons/acdc-schema-json/refs/tags/${VERSION}/dictionary/prod_dict/acdc_schema.json"
-python3 services/dictionary/upload_dictionary.py "services/dictionary/schemas/acdc_schema_${VERSION}.json" s3://gen3schema-cad-staging-biocommons.org.au/cad.json
-```
-
-## Generating synthetic metadata
-- Run this script to generate synthetic metadata for the studies in the dictionary
-
-```bash
-# this will generate 30 samples for AusDiab_Simulated and 60 samples for Baker-Biobank_Simulated
-bash services/synthetic_data/generate_synth_metadata.sh --studies "AusDiab_Simulated,Baker-Biobank_Simulated" --permute-max-samples "30,60"
-```
-
-## uploading synthetic metadata to sheepdog
-- Run this script to upload synthetic metadata to sheepdog
-
-```python
-# to see argumments
-python3 services/synthetic_data/upload_synth_metadata_sheepdog.py -h
-
-# to upload metadata for version v0.6.3
-python3 services/synthetic_data/upload_synth_metadata_sheepdog.py --version v0.6.3
-```
File without changes