PyPI - carrot-transform - Versions diffs - 0.3.3__py3-none-any.whl → 0.3.4__py3-none-any.whl - Mend

carrot-transform 0.3.3-py3-none-any.whl → 0.3.4-py3-none-any.whl


Files changed (17)

carrot_transform-0.3.4.dist-info/METADATA +106 -0
carrot_transform-0.3.4.dist-info/RECORD +24 -0
{carrot_transform-0.3.3.dist-info → carrot_transform-0.3.4.dist-info}/WHEEL +1 -1
carrot_transform-0.3.4.dist-info/entry_points.txt +3 -0
carrottransform/_version.py +6 -2
carrottransform/cli/subcommands/run.py +164 -83
carrottransform/examples/test/inputs/Covid19_test.csv +801 -0
carrottransform/examples/test/inputs/Demographics.csv +1001 -0
carrottransform/examples/test/inputs/Symptoms.csv +801 -0
carrottransform/examples/test/inputs/covid19_antibody.csv +1001 -0
carrottransform/examples/test/inputs/vaccine.csv +501 -0
carrottransform/examples/test/rules/rules_14June2021.json +300 -0
carrottransform/tools/mappingrules.py +8 -8
carrottransform/tools/omopcdm.py +9 -2
carrot_transform-0.3.3.dist-info/METADATA +0 -48
carrot_transform-0.3.3.dist-info/RECORD +0 -17
{carrot_transform-0.3.3.dist-info → carrot_transform-0.3.4.dist-info}/LICENSE +0 -0

carrottransform/examples/test/rules/rules_14June2021.json ADDED

@@ -0,0 +1,300 @@
+{
+      "metadata": {
+            "date_created": "2021-06-14T15:27:37.123947",
+            "dataset": "Test"
+      },
+      "cdm": {
+            "observation": {
+                  "observation_0": {
+                        "observation_concept_id": {
+                              "source_table": "Demographics.csv",
+                              "source_field": "ethnicity",
+                              "term_mapping": {
+                                    "Asian": 35825508
+                              }
+                        },
+                        "observation_datetime": {
+                              "source_table": "Demographics.csv",
+                              "source_field": "date_of_birth"
+                        },
+                        "observation_source_concept_id": {
+                              "source_table": "Demographics.csv",
+                              "source_field": "ethnicity",
+                              "term_mapping": {
+                                    "Asian": 35825508
+                              }
+                        },
+                        "observation_source_value": {
+                              "source_table": "Demographics.csv",
+                              "source_field": "ethnicity"
+                        },
+                        "person_id": {
+                              "source_table": "Demographics.csv",
+                              "source_field": "PersonID"
+                        }
+                  },
+                  "observation_1":{
+                        "observation_concept_id": {
+                              "source_table": "Demographics.csv",
+                              "source_field": "ethnicity",
+                              "term_mapping": {
+                                    "Bangladeshi": 35825531
+                              }
+                        },
+                        "observation_datetime": {
+                              "source_table": "Demographics.csv",
+                              "source_field": "date_of_birth"
+                        },
+                        "observation_source_concept_id": {
+                              "source_table": "Demographics.csv",
+                              "source_field": "ethnicity",
+                              "term_mapping": {
+                                    "Bangladeshi": 35825531
+                              }
+                        },
+                        "observation_source_value": {
+                              "source_table": "Demographics.csv",
+                              "source_field": "ethnicity"
+                        },
+                        "person_id": {
+                              "source_table": "Demographics.csv",
+                              "source_field": "PersonID"
+                        }
+                  },
+                  "observation_2":{
+                        "observation_concept_id": {
+                              "source_table": "Demographics.csv",
+                              "source_field": "ethnicity",
+                              "term_mapping": {
+                                    "Indian": 35826241
+                              }
+                        },
+                        "observation_datetime": {
+                              "source_table": "Demographics.csv",
+                              "source_field": "date_of_birth"
+                        },
+                        "observation_source_concept_id": {
+                              "source_table": "Demographics.csv",
+                              "source_field": "ethnicity",
+                              "term_mapping": {
+                                    "Indian": 35826241
+                              }
+                        },
+                        "observation_source_value": {
+                              "source_table": "Demographics.csv",
+                              "source_field": "ethnicity"
+                        },
+                        "person_id": {
+                              "source_table": "Demographics.csv",
+                              "source_field": "PersonID"
+                        }
+                  },
+                  "observation_3":{
+                        "observation_concept_id": {
+                              "source_table": "Demographics.csv",
+                              "source_field": "ethnicity",
+                              "term_mapping": {
+                                    "White": 35827394
+                              }
+                        },
+                        "observation_datetime": {
+                              "source_table": "Demographics.csv",
+                              "source_field": "date_of_birth"
+                        },
+                        "observation_source_concept_id": {
+                              "source_table": "Demographics.csv",
+                              "source_field": "ethnicity",
+                              "term_mapping": {
+                                    "White": 35827394
+                              }
+                        },
+                        "observation_source_value": {
+                              "source_table": "Demographics.csv",
+                              "source_field": "ethnicity"
+                        },
+                        "person_id": {
+                              "source_table": "Demographics.csv",
+                              "source_field": "PersonID"
+                        }
+                  },
+                  "observation_4":{
+                        "observation_concept_id": {
+                              "source_table": "Demographics.csv",
+                              "source_field": "ethnicity",
+                              "term_mapping": {
+                                    "Black": 35825567
+                              }
+                        },
+                        "observation_datetime": {
+                              "source_table": "Demographics.csv",
+                              "source_field": "date_of_birth"
+                        },
+                        "observation_source_concept_id": {
+                              "source_table": "Demographics.csv",
+                              "source_field": "ethnicity",
+                              "term_mapping": {
+                                    "Black": 35825567
+                              }
+                        },
+                        "observation_source_value": {
+                              "source_table": "Demographics.csv",
+                              "source_field": "ethnicity"
+                        },
+                        "person_id": {
+                              "source_table": "Demographics.csv",
+                              "source_field": "PersonID"
+                        }
+                  },
+                  "observation_5":{
+                        "observation_concept_id": {
+                              "source_table": "Demographics.csv",
+                              "source_field": "ethnicity",
+                              "term_mapping": {
+                                    "White and Asian": 35827395
+                              }
+                        },
+                        "observation_datetime": {
+                              "source_table": "Demographics.csv",
+                              "source_field": "date_of_birth"
+                        },
+                        "observation_source_concept_id": {
+                              "source_table": "Demographics.csv",
+                              "source_field": "ethnicity",
+                              "term_mapping": {
+                                    "White and Asian": 35827395
+                              }
+                        },
+                        "observation_source_value": {
+                              "source_table": "Demographics.csv",
+                              "source_field": "ethnicity"
+                        },
+                        "person_id": {
+                              "source_table": "Demographics.csv",
+                              "source_field": "PersonID"
+                        }
+                  }
+            },
+            "condition_occurrence": {
+                  "condition_occurrence_0":{
+                        "condition_concept_id": {
+                              "source_table": "Symptoms.csv",
+                              "source_field": "symptom1",
+                              "term_mapping": {
+                                    "Y": 254761
+                              }
+                        },
+                        "condition_end_datetime": {
+                              "source_table": "Symptoms.csv",
+                              "source_field": "visit_date"
+                        },
+                        "condition_source_concept_id": {
+                              "source_table": "Symptoms.csv",
+                              "source_field": "symptom1",
+                              "term_mapping": {
+                                    "Y": 254761
+                              }
+                        },
+                        "condition_source_value": {
+                              "source_table": "Symptoms.csv",
+                              "source_field": "symptom1"
+                        },
+                        "condition_start_datetime": {
+                              "source_table": "Symptoms.csv",
+                              "source_field": "visit_date"
+                        },
+                        "person_id": {
+                              "source_table": "Symptoms.csv",
+                              "source_field": "PersonID"
+                        }
+                  }
+            },
+            "person": {
+                  "female":{
+                        "birth_datetime": {
+                              "source_table": "Demographics.csv",
+                              "source_field": "date_of_birth"
+                        },
+                        "gender_concept_id": {
+                              "source_table": "Demographics.csv",
+                              "source_field": "sex",
+                              "term_mapping": {
+                                    "F": 8532
+                              }
+                        },
+                        "gender_source_concept_id": {
+                              "source_table": "Demographics.csv",
+                              "source_field": "sex",
+                              "term_mapping": {
+                                    "F": 8532
+                              }
+                        },
+                        "gender_source_value": {
+                              "source_table": "Demographics.csv",
+                              "source_field": "sex"
+                        },
+                        "person_id": {
+                              "source_table": "Demographics.csv",
+                              "source_field": "PersonID"
+                        }
+                  },
+                  "male":{
+                        "birth_datetime": {
+                              "source_table": "Demographics.csv",
+                              "source_field": "date_of_birth"
+                        },
+                        "gender_concept_id": {
+                              "source_table": "Demographics.csv",
+                              "source_field": "sex",
+                              "term_mapping": {
+                                    "M": 8507
+                              }
+                        },
+                        "gender_source_concept_id": {
+                              "source_table": "Demographics.csv",
+                              "source_field": "sex",
+                              "term_mapping": {
+                                    "M": 8507
+                              }
+                        },
+                        "gender_source_value": {
+                              "source_table": "Demographics.csv",
+                              "source_field": "sex"
+                        },
+                        "person_id": {
+                              "source_table": "Demographics.csv",
+                              "source_field": "PersonID"
+                        }
+                  }
+            },
+	    "measurement": {
+		  "covid_antibody":{
+			"value_as_number": {
+			      "source_table": "covid19_antibody.csv",
+			      "source_field": "IgG"
+			},
+			"measurement_source_value": {
+			      "source_table": "covid19_antibody.csv",
+			      "source_field": "IgG"
+			},
+			"measurement_concept_id": {
+			      "source_table": "covid19_antibody.csv",
+			      "source_field": "IgG",
+			      "term_mapping": 37398191
+			},
+			"measurement_source_concept_id": {
+			      "source_table": "covid19_antibody.csv",
+			      "source_field": "IgG",
+			      "term_mapping": 37398191
+			},
+			"measurement_datetime": {
+			      "source_table": "covid19_antibody.csv",
+			      "source_field": "date"
+			},
+			"person_id": {
+			      "source_table": "covid19_antibody.csv",
+			      "source_field": "PersonID"
+			}
+		  }
+            }
+      }
+}
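
The rules file above nests target CDM tables, then named rules, then target fields, each pointing at a `source_table` and `source_field`. Note that `term_mapping` appears in two shapes: a dict keyed by source value (for example `"F": 8532`) and a bare concept id (the `measurement` rules). A minimal sketch of walking that structure follows. The helper name `summarise_rules` and the trimmed sample document are illustrative assumptions, not part of carrot-transform, and the sketch only records a scalar mapping without claiming how the package applies it.

```python
import json

# Trimmed-down sample in the same shape as rules_14June2021.json (illustrative only).
RULES_JSON = """
{
  "metadata": {"date_created": "2021-06-14T15:27:37.123947", "dataset": "Test"},
  "cdm": {
    "person": {
      "female": {
        "gender_concept_id": {
          "source_table": "Demographics.csv",
          "source_field": "sex",
          "term_mapping": {"F": 8532}
        },
        "person_id": {"source_table": "Demographics.csv", "source_field": "PersonID"}
      }
    },
    "measurement": {
      "covid_antibody": {
        "measurement_concept_id": {
          "source_table": "covid19_antibody.csv",
          "source_field": "IgG",
          "term_mapping": 37398191
        }
      }
    }
  }
}
"""


def summarise_rules(rules):
    """Hypothetical helper: list target tables, source files and term mappings."""
    tables = list(rules["cdm"])
    sources = set()
    mappings = []
    for table, named_rules in rules["cdm"].items():
        for _rule_name, fields in named_rules.items():
            for target_field, info in fields.items():
                sources.add(info["source_table"])
                term = info.get("term_mapping")
                if isinstance(term, dict):
                    # Dict form: one concept id per source value.
                    for value, concept in term.items():
                        mappings.append((table, target_field, value, concept))
                elif term is not None:
                    # Scalar form: a single concept id with no source value key.
                    mappings.append((table, target_field, None, term))
    return tables, sorted(sources), mappings


rules = json.loads(RULES_JSON)
tables, sources, mappings = summarise_rules(rules)
print(tables)
print(sources)
for row in mappings:
    print(row)
```

Both shapes have to be handled by any consumer of the file, which is why the sketch branches on `isinstance(term, dict)`.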

carrottransform/tools/mappingrules.py CHANGED

@@ -10,6 +10,7 @@ class MappingRules:
     """
     def __init__(self, rulesfilepath, omopcdm):
+        ## just loads the json directly
         self.rules_data = tools.load_json(rulesfilepath)
         self.omopcdm = omopcdm
@@ -34,12 +35,7 @@ class MappingRules:
         return self.dataset_name
     def get_all_outfile_names(self):
-        file_list = []
-        for outfilename in self.rules_data["cdm"]:
-            file_list.append(outfilename)
-        return file_list
+        return list(self.rules_data["cdm"])
     def get_all_infile_names(self):
         file_list = []
@@ -86,9 +82,9 @@ class MappingRules:
                 for infield, outfield_list in outfield_elem.items():
                     #print("{0}, {1}, {2}".format(outfile, infield, str(outfield_list)))
                     for outfield in outfield_list:
-                        if outfield in self.omopcdm.get_omop_datetime_fields(outfile):
+                        if outfield.split('~')[0] in self.omopcdm.get_omop_datetime_fields(outfile):
                             datetime_source = infield
-                        if outfield == self.omopcdm.get_omop_person_id_field(outfile):
+                        if outfield.split('~')[0] == self.omopcdm.get_omop_person_id_field(outfile):
                             person_id_source = infield
         return datetime_source, person_id_source
@@ -101,6 +97,7 @@ class MappingRules:
         person_id_source = None
         if tgtfilename in self.rules_data["cdm"]:
             source_rules_data = self.rules_data["cdm"][tgtfilename]
+            ## this loops over all the fields in the person part of the rules, which will lead to overwriting of the source variables and unnecessary looping
             for rule_name, rule_fields in source_rules_data.items():
                 if "birth_datetime" in rule_fields:
                     birth_datetime_source = rule_fields["birth_datetime"]["source_field"]
@@ -113,6 +110,7 @@ class MappingRules:
         """
         Parse rules to produce a map of source to target data for a given input file
         """
+        ## creates a dict of dicts that has input files as keys, and infile~field~data~target as keys for the underlying keys, which contain a list of dicts of lists
         if infilename in self.outfile_names and infilename in self.parsed_rules:
             return self.outfile_names[infilename], self.parsed_rules[infilename]
         outfilenames = []
@@ -141,6 +139,7 @@ class MappingRules:
         plain_key = ""
         term_value_key = ""
+        ## iterate through the rules, looking for rules that apply to the input file.
         for outfield, source_info in rules.items():
             if source_info["source_field"] not in data:
                 data[source_info["source_field"]] = []
@@ -148,6 +147,7 @@ class MappingRules:
                 if "term_mapping" in source_info:
                     if type(source_info["term_mapping"]) is dict:
                         for inputvalue, term in source_info["term_mapping"].items():
+                            ## add a key/add to the list of data in the dict for the given input file
                             term_value_key = infilename + "~" + source_info["source_field"] + "~" + str(inputvalue) + "~" + outfilename
                             data[source_info["source_field"]].append(outfield + "~" + str(source_info["term_mapping"][str(inputvalue)]))
                     else:
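
Two of the hunks above are easy to check in isolation: `list(self.rules_data["cdm"])` replaces an explicit key-collecting loop, and the comparisons now use `outfield.split('~')[0]` so that an entry carrying a `~<term>` suffix still matches a bare OMOP field name. The sketch below reproduces both behaviours outside the class. The function names and the sample values are illustrative assumptions, and the diff does not show which entries actually carry a suffix at the point of comparison.

```python
# Stand-in for the "cdm" section of a rules file; only the keys matter here.
rules_data = {"cdm": {"observation": {}, "condition_occurrence": {}, "person": {}}}


def outfile_names_loop(rules_data):
    """The 0.3.3 form: collect the keys with an explicit loop."""
    file_list = []
    for outfilename in rules_data["cdm"]:
        file_list.append(outfilename)
    return file_list


def outfile_names_list(rules_data):
    """The 0.3.4 form: list() over a dict yields its keys in insertion order."""
    return list(rules_data["cdm"])


def make_outfield_entry(outfield, term):
    """Mirror the diff's `outfield + "~" + str(term)` composition."""
    return outfield + "~" + str(term)


def base_field(outfield_entry):
    """Mirror the diff's `outfield.split('~')[0]`: drop any '~<term>' suffix."""
    return outfield_entry.split("~")[0]


entry = make_outfield_entry("observation_concept_id", 35825508)
print(entry)                    # observation_concept_id~35825508
print(base_field(entry))        # observation_concept_id
print(base_field("person_id"))  # unchanged when there is no suffix

# The composite key built in the last hunk, with illustrative values.
term_value_key = "~".join(["Demographics.csv", "ethnicity", "Asian", "observation"])
print(term_value_key)
```

Because `split('~')[0]` returns the whole string when no `~` is present, the new comparison behaves the same as the old one for unsuffixed field names.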

carrottransform/tools/omopcdm.py CHANGED

@@ -14,7 +14,10 @@ class OmopCDM:
         self.numeric_types = ["integer", "numeric"]
         self.datetime_types = ["timestamp"]
         self.date_types = ["date"]
+        ## ddl sets the headers to go in each table, and whether or not to make it null. Also allows for more tables than we will use.
+        ## also adds additional useful keys, like 'all_columns' - before merge
         self.omop_json = self.load_ddl(omopddl)
+        ## adds fields as a dict of dicts - is this so they can get picked up by some of these get_columns?
         self.omop_json = self.merge_json(self.omop_json, omopcfg)
         self.all_columns = self.get_columns("all_columns")
         self.numeric_fields = self.get_columns("numeric_fields")
@@ -47,9 +50,13 @@ class OmopCDM:
         output_dict["datetime_fields"] = {}
         output_dict["date_fields"] = {}
+        ## matching for version number - matches '--postgres', any number of chars and some digits of the form X.Y, plus an end of string or end of line
         ver_rgx = re.compile(r'^--postgresql.*(\d+\.\d+)$')
-        start_rgx = re.compile(r'^CREATE\s*TABLE\s*(\@?[a-zA-Z]+\.)?([A-Z_]+)')
+        ## matching for table name - matches 'CREATE TABLE', an optional schema prefix ('@', some letters and a '.'), then the table name (upper or lower case letters and underscores)
+        start_rgx = re.compile(r'^CREATE\s*TABLE\s*(\@?[a-zA-Z]+\.)?([a-zA-Z_]+)')
+        ## matches some whitespace, lower case letters(or underscores), whitespace, letters (upper/lower and underscores)
         datatype_rgx = re.compile(r'^\s*([a-z_]+)\s+([a-zA-Z_]+)')
+        ## matching for end of file - matches close bracket, semi colon, end of file or line
         end_rgx = re.compile(r'.*[)];$')
         vermatched = False
         processing_table_data = False
@@ -76,7 +83,7 @@ class OmopCDM:
                     fname = idtmatch.group(1)
                     ftype = idtmatch.group(2)
-                    # Check for dictionary element presence
+                    # Check for dictionary element presence, and start an empty list if it doesn't already exist
                     if tabname not in output_dict["all_columns"]:
                         output_dict["all_columns"][tabname] = []
                     if tabname not in output_dict["numeric_fields"]:
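
The only behavioural change in this file is the widened character class in `start_rgx`: the table-name group goes from `[A-Z_]+` to `[a-zA-Z_]+`. The sketch below compares the two patterns, copied from the hunk, against a few `CREATE TABLE` lines. The sample lines and the `table_name` helper are assumptions made for illustration and are not taken from the bundled DDL file.

```python
import re

# Patterns copied from the diff: 0.3.3 (old) and 0.3.4 (new).
OLD_START_RGX = re.compile(r'^CREATE\s*TABLE\s*(\@?[a-zA-Z]+\.)?([A-Z_]+)')
NEW_START_RGX = re.compile(r'^CREATE\s*TABLE\s*(\@?[a-zA-Z]+\.)?([a-zA-Z_]+)')


def table_name(rgx, line):
    """Hypothetical helper: return the captured table name, or None if no match."""
    m = rgx.match(line)
    return m.group(2) if m else None


# Illustrative DDL lines (assumed, not quoted from the packaged SQL).
samples = [
    "CREATE TABLE PERSON (",
    "CREATE TABLE person (",
    "CREATE TABLE @cdmDatabaseSchema.person (",
]

results = [(line, table_name(OLD_START_RGX, line), table_name(NEW_START_RGX, line))
           for line in samples]
for line, old, new in results:
    print(f"{line!r}: old={old!r} new={new!r}")
```

With the old pattern, lower-case table names fail to match at all, so a DDL written that way would yield no tables; the new pattern accepts either case.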

carrot_transform-0.3.3.dist-info/METADATA DELETED

@@ -1,48 +0,0 @@
-Metadata-Version: 2.3
-Name: carrot_transform
-Version: 0.3.3
-Summary:
-Author: anwarfg
-Author-email: 913028+anwarfg@users.noreply.github.com
-Requires-Python: >=3.10,<4.0
-Classifier: Programming Language :: Python :: 3
-Classifier: Programming Language :: Python :: 3.10
-Classifier: Programming Language :: Python :: 3.11
-Classifier: Programming Language :: Python :: 3.12
-Classifier: Programming Language :: Python :: 3.13
-Requires-Dist: click (>=8.1.7,<9.0.0)
-Requires-Dist: jinja2 (>=3.1.4,<4.0.0)
-Requires-Dist: pandas (>=2.2.3,<3.0.0)
-Description-Content-Type: text/markdown
-<p align="center">
-  <a href="https://carrot.ac.uk/" target="_blank">
-  <picture>
-    <source media="(prefers-color-scheme: dark)" srcset="/images/logo-dark.png">
-    <img alt="Carrot Logo" src="/images/logo-primary.png" width="280"/>
-  </picture>
-  </a>
-</p>
-<div align="center">
-  <strong>
-  <h2>Streamlined Data Mapping to OMOP</h2>
-  <a href="https://carrot.ac.uk/">Carrot Tranform</a> executes the conversion of the data to the OMOP CDM.<br />
-  </strong>
-</div>
-TODO:
-- Document carrot-transform
-- Add more comments in-code
-- Handle capture of ddl and json config via the command-line as optional args
-Reduction in complexity over the original CaRROT-CDM version for the Transform part of _ETL_ - In practice _Extract_ is always
-performed by Data Partners, _Load_ by database bulk-load software.
-Statistics
-External libraries imported (approximate)
-carrot-cdm 61
-carrot-transform 12

carrot_transform-0.3.3.dist-info/RECORD DELETED

@@ -1,17 +0,0 @@
-carrottransform/__init__.py,sha256=cQJKTCpG2qmKxDl-VtSWQ3_WFjyzg4u_8nZacWAHFcU,73
-carrottransform/_version.py,sha256=NfGqG2TgfjxxrlCHaOtwl3BcE0f6UH0VPrQgoDPjV7Y,72
-carrottransform/cli/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
-carrottransform/cli/command.py,sha256=xYTaJsVZyRYv0CzUwrh7ZPK8hhGyC3MDfvVYxHcXYSM,508
-carrottransform/cli/subcommands/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
-carrottransform/cli/subcommands/run.py,sha256=3z5cRG4ekyPOP5tvjZOyHUxbclKfBr_Z0tQRRoKj73E,20651
-carrottransform/config/OMOPCDM_postgresql_5.3_ddl.sql,sha256=fXrPfdL3IzU5ux55ogsQKjjd-c1KzdP_N2A_JjlY3gk,18084
-carrottransform/config/omop.json,sha256=OT3jvfPjKhjsDnQcQw1OAEOHhQLoHXNxTj_MDwNbYqo,1934
-carrottransform/tools/__init__.py,sha256=b3JuCwgJVx0rqx5igB8hNNKO0ktlbQjHGHwy-vzpdo0,198
-carrottransform/tools/file_helpers.py,sha256=xlODDAUpsx0H4sweGZ81ttjJjNQGn2spNUa1Fndotw8,316
-carrottransform/tools/mappingrules.py,sha256=bV6tXHBwVeKAUgCwFTZE2-qTcxKtbs3zbJWedBSviVI,6567
-carrottransform/tools/metrics.py,sha256=LOzm80-YIVM9mvgvQXRpyArl2nSfSTTW9DikqJ5M2Yg,5700
-carrottransform/tools/omopcdm.py,sha256=ycyPGgUTUwui7MLxH8JXd-MyCRkG0xOfEoDhCXeogmQ,7623
-carrot_transform-0.3.3.dist-info/LICENSE,sha256=pqIiuuTs6Na-oFd10MMsZoZmdfhfUhHeOtQzgzSkcaw,1082
-carrot_transform-0.3.3.dist-info/METADATA,sha256=23mVHLHLXOqgXUFLoU7cSaqIr_yzl9mYf_zgZnteeoY,1474
-carrot_transform-0.3.3.dist-info/WHEEL,sha256=IYZQI976HJqqOpQU6PHkJ8fb3tMNBFjg-Cn-pwAbaFM,88
-carrot_transform-0.3.3.dist-info/RECORD,,

{carrot_transform-0.3.3.dist-info → carrot_transform-0.3.4.dist-info}/LICENSE RENAMED

File without changes
