acdc_aws_etl_pipeline 0.7.1__py3-none-any.whl → 0.7.3__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,75 @@
+ Metadata-Version: 2.4
+ Name: acdc_aws_etl_pipeline
+ Version: 0.7.3
+ Summary: Tools for ACDC ETL pipeline
+ Author: JoshuaHarris391
+ Author-email: harjo391@gmail.com
+ Requires-Python: >=3.9.5,<4.0.0
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Programming Language :: Python :: 3.13
+ Classifier: Programming Language :: Python :: 3.14
+ Requires-Dist: boto3
+ Requires-Dist: dbt-athena (==1.9.4)
+ Requires-Dist: dbt-core (==1.9.4)
+ Requires-Dist: gen3 (>=4.27.4,<5.0.0)
+ Requires-Dist: gen3_validator (>=2.0.0,<3.0.0)
+ Requires-Dist: numpy (<2.0.0)
+ Requires-Dist: pyjwt (>=2.10.1,<3.0.0)
+ Requires-Dist: pytest
+ Requires-Dist: python-dotenv
+ Requires-Dist: pytz (>=2025.2,<2026.0)
+ Requires-Dist: pyyaml (>=6.0.2,<7.0.0)
+ Requires-Dist: s3fs (==2025.10.0)
+ Requires-Dist: tenacity (>=8.2,<10.0)
+ Requires-Dist: tzlocal (>=5.3.1,<6.0.0)
+ Description-Content-Type: text/markdown
+
+ # acdc-aws-etl-pipeline
+ Infrastructure and code for the ACDC ETL pipeline and data operations in AWS
+
+ ## Documentation
+
+ - [Dictionary deployment](docs/dictionary_deployment.md)
+ - [Data ingestion](docs/data_ingestion.md)
+ - [Data validation](docs/data_validation.md)
+ - [Data transformation (dbt)](docs/data_transformation_dbt.md)
+ - [Data releases](docs/write_data_release.md)
+ - [Synthetic data generation](docs/synthetic_data_generation.md)
+ - [REST API upload to sheepdog](docs/rest_api_sheepdog_upload.md)
+ - [Data deletion](docs/data_deletion.md)
+ - [IndexD file registration](docs/indexd_registration.md)
+ - [Querying Athena](docs/querying_athena.md)
+ - [Writing Athena queries to JSON](docs/write_athena_queries_to_json.md)
+ - [Troubleshooting](docs/troubleshooting.md)
+
+ ## Library and source code (`src/acdc_aws_etl_pipeline`)
+
+ The Python package in [`src/acdc_aws_etl_pipeline`](src/acdc_aws_etl_pipeline) provides reusable utilities for ingestion, validation, uploads, and Athena/Glue operations used across the pipeline and services.
+
+ ### Modules
+
+ - **`ingest/`**: ingestion helpers for loading source datasets into S3/Glue (see [`ingest/ingest.py`](src/acdc_aws_etl_pipeline/ingest/ingest.py)).
+ - **`upload/`**: Gen3/Sheepdog metadata submission and deletion utilities (e.g. [`upload/metadata_submitter.py`](src/acdc_aws_etl_pipeline/upload/metadata_submitter.py)).
+ - **`validate/`**: schema validation utilities and helpers for validation workflows (see [`validate/validate.py`](src/acdc_aws_etl_pipeline/validate/validate.py)).
+ - **`utils/`**: shared Athena/Glue/dbt/release helpers (e.g. [`utils/athena_utils.py`](src/acdc_aws_etl_pipeline/utils/athena_utils.py), [`utils/release_writer.py`](src/acdc_aws_etl_pipeline/utils/release_writer.py)).
+
+ ### Local development
+
+ To install dependencies and run tests:
+
+ ```bash
+ pip install poetry
+ poetry install
+ source $(poetry env info --path)/bin/activate
+ poetry run pytest
+ ```
+
+ ### Install from PyPI
+
+ Releases are published automatically, so you can also install the package directly:
+
+ ```bash
+ pip install acdc_aws_etl_pipeline
@@ -9,6 +9,6 @@ acdc_aws_etl_pipeline/utils/athena_utils.py,sha256=QJlBe-07Hkq-BqmcxBu6ZtAmVfZSH
  acdc_aws_etl_pipeline/utils/dbt_utils.py,sha256=5XRFOwNNIeuW2sQuor3h_OZTuXGg6xv2AUYwj9bMAAM,2054
  acdc_aws_etl_pipeline/utils/release_writer.py,sha256=vsxHJ6l-UWPpzeyEPHurX5iFgeCEQ-9FbySAbPNfTTM,7555
  acdc_aws_etl_pipeline/validate/validate.py,sha256=zLqK9i92FsRAaBOGdY-G7-vb0e6tmkoUXhY6zCfbjN8,24895
- acdc_aws_etl_pipeline-0.7.1.dist-info/METADATA,sha256=WddwCKf3KV4-JsKtsegk5dxu6dWKvXx8YANvZZKbRGs,2964
- acdc_aws_etl_pipeline-0.7.1.dist-info/WHEEL,sha256=zp0Cn7JsFoX2ATtOhtaFYIiE2rmFAD4OcMhtUki8W3U,88
- acdc_aws_etl_pipeline-0.7.1.dist-info/RECORD,,
+ acdc_aws_etl_pipeline-0.7.3.dist-info/METADATA,sha256=4HU13kgbLqeFVwPPtjSjSgNO7M1FhxcKANJHR2_3c2k,3096
+ acdc_aws_etl_pipeline-0.7.3.dist-info/WHEEL,sha256=zp0Cn7JsFoX2ATtOhtaFYIiE2rmFAD4OcMhtUki8W3U,88
+ acdc_aws_etl_pipeline-0.7.3.dist-info/RECORD,,
@@ -1,82 +0,0 @@
- Metadata-Version: 2.4
- Name: acdc_aws_etl_pipeline
- Version: 0.7.1
- Summary: Tools for ACDC ETL pipeline
- Author: JoshuaHarris391
- Author-email: harjo391@gmail.com
- Requires-Python: >=3.9.5,<4.0.0
- Classifier: Programming Language :: Python :: 3
- Classifier: Programming Language :: Python :: 3.10
- Classifier: Programming Language :: Python :: 3.11
- Classifier: Programming Language :: Python :: 3.12
- Classifier: Programming Language :: Python :: 3.13
- Classifier: Programming Language :: Python :: 3.14
- Requires-Dist: boto3
- Requires-Dist: dbt-athena (==1.9.4)
- Requires-Dist: dbt-core (==1.9.4)
- Requires-Dist: gen3 (>=4.27.4,<5.0.0)
- Requires-Dist: gen3_validator (>=2.0.0,<3.0.0)
- Requires-Dist: numpy (<2.0.0)
- Requires-Dist: pyjwt (>=2.10.1,<3.0.0)
- Requires-Dist: pytest
- Requires-Dist: python-dotenv
- Requires-Dist: pytz (>=2025.2,<2026.0)
- Requires-Dist: pyyaml (>=6.0.2,<7.0.0)
- Requires-Dist: s3fs (==2025.10.0)
- Requires-Dist: tenacity (>=8.2,<10.0)
- Requires-Dist: tzlocal (>=5.3.1,<6.0.0)
- Description-Content-Type: text/markdown
-
- # acdc-aws-etl-pipeline
- Infrastructure and code for the ACDC ETL pipeline and data operations in AWS
-
- ## Ingestion
- - [ingestion](docs/ingestion.md)
- - [upload_synthdata_s3](docs/upload_synthdata_s3.md)
-
- ## DBT
-
-
-
- ## Release Management
- - [Writing DBT Releases](docs/write_dbt_release_info.md)
-
-
- ## Deploying the dictionary
- e.g. to testing
-
- ```bash
- # Example
- bash services/dictionary/pull_dict.sh <raw_dictionary_url>
- bash services/dictionary/upload_dictionary.py <local_dictionary_path> <s3_target_uri>
-
- # Deploying to test
- VERSION=v0.6.3
- bash services/dictionary/pull_dict.sh "https://raw.githubusercontent.com/AustralianBioCommons/acdc-schema-json/refs/tags/${VERSION}/dictionary/prod_dict/acdc_schema.json"
- python3 services/dictionary/upload_dictionary.py "services/dictionary/schemas/acdc_schema_${VERSION}.json" s3://gen3schema-cad-uat-biocommons.org.au/cad.json
-
-
- # Deploying to staging
- VERSION=v1.0.0
- bash services/dictionary/pull_dict.sh "https://raw.githubusercontent.com/AustralianBioCommons/acdc-schema-json/refs/tags/${VERSION}/dictionary/prod_dict/acdc_schema.json"
- python3 services/dictionary/upload_dictionary.py "services/dictionary/schemas/acdc_schema_${VERSION}.json" s3://gen3schema-cad-staging-biocommons.org.au/cad.json
- ```
-
- ## Generating synthetic metadata
- - Run this script to generate synthetic metadata for the studies in the dictionary
-
- ```bash
- # this will generate 30 samples for AusDiab_Simulated and 60 samples for Baker-Biobank_Simulated
- bash services/synthetic_data/generate_synth_metadata.sh --studies "AusDiab_Simulated,Baker-Biobank_Simulated" --permute-max-samples "30,60"
- ```
-
- ## uploading synthetic metadata to sheepdog
- - Run this script to upload synthetic metadata to sheepdog
-
- ```python
- # to see argumments
- python3 services/synthetic_data/upload_synth_metadata_sheepdog.py -h
-
- # to upload metadata for version v0.6.3
- python3 services/synthetic_data/upload_synth_metadata_sheepdog.py --version v0.6.3
- ```
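
The `Requires-Dist` fields in the METADATA hunks above use the email-header layout of Python core metadata, so the dependency pins can be read programmatically. The following is a minimal sketch, assuming a METADATA excerpt has been pasted into a string. The names `SAMPLE_METADATA` and `exact_pins` are hypothetical and introduced only for illustration; they are not part of the package. Only the standard-library `email` parser is used.

```python
from email import message_from_string

# Excerpt of the 0.7.3 METADATA shown in the diff above.
# SAMPLE_METADATA is a hypothetical name used only for this illustration.
SAMPLE_METADATA = """\
Metadata-Version: 2.4
Name: acdc_aws_etl_pipeline
Version: 0.7.3
Requires-Dist: boto3
Requires-Dist: dbt-athena (==1.9.4)
Requires-Dist: dbt-core (==1.9.4)
Requires-Dist: numpy (<2.0.0)
Requires-Dist: s3fs (==2025.10.0)
"""


def exact_pins(metadata_text):
    """Return {name: version} for Requires-Dist entries pinned with '=='.

    Hypothetical helper: it assumes the 'name (specifier)' layout seen in
    this diff and ignores unpinned or range-constrained requirements.
    """
    message = message_from_string(metadata_text)
    pins = {}
    for requirement in message.get_all("Requires-Dist", []):
        # Split 'dbt-core (==1.9.4)' into the name and the bracketed specifier.
        name, _, spec = requirement.partition(" ")
        spec = spec.strip("()")
        if spec.startswith("=="):
            pins[name] = spec[2:]
    return pins


if __name__ == "__main__":
    for name, version in exact_pins(SAMPLE_METADATA).items():
        print(f"{name} is pinned to {version}")
```

For a distribution that is already installed, `importlib.metadata.requires("acdc_aws_etl_pipeline")` returns the `Requires-Dist` values as a list of strings, although the exact string layout may differ from the parenthesised form shown in this diff.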