carrot-transform 0.3.3__py3-none-any.whl → 0.3.4__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,106 @@
+ Metadata-Version: 2.3
+ Name: carrot_transform
+ Version: 0.3.4
+ Summary:
+ Author: anwarfg
+ Author-email: 913028+anwarfg@users.noreply.github.com
+ Requires-Python: >=3.10,<4.0
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Programming Language :: Python :: 3.13
+ Requires-Dist: click (>=8.1.7,<9.0.0)
+ Requires-Dist: jinja2 (>=3.1.4,<4.0.0)
+ Requires-Dist: pandas (>=2.2.3,<3.0.0)
+ Requires-Dist: pytest (>=8.3.4,<9.0.0)
+ Description-Content-Type: text/markdown
+
+ <p align="center">
+ <a href="https://carrot.ac.uk/" target="_blank">
+ <picture>
+ <source media="(prefers-color-scheme: dark)" srcset="/images/logo-dark.png">
+ <img alt="Carrot Logo" src="/images/logo-primary.png" width="280"/>
+ </picture>
+ </a>
+ </p>
+
+ <p align="center">
+
+ <a href="https://github.com/Health-Informatics-UoN/carrot-transform/releases">
+ <img src="https://img.shields.io/github/v/release/Health-Informatics-UoN/carrot-transform" alt="Release">
+ </a>
+ <a href="https://opensource.org/license/mit">
+ <img src="https://img.shields.io/badge/License-MIT-yellow.svg" alt="License">
+ </a>
+ </p>
+
+
+ <div align="center">
+ <strong>
+ <h2>Streamlined Data Transformation to OMOP</h2><br />
+ <a href="https://carrot.ac.uk/">Carrot Transform</a> automates data transformation processes and facilitates the standardisation of datasets to the OMOP vocabulary, simplifying the integration of diverse data sources.
+ <br />
+ </strong>
+ </div>
+
+ <p align="center">
+ <br />
+ <a href="https://carrot.ac.uk/transform" rel="dofollow"><strong>Explore the docs »</strong></a>
+ <br />
+ <br />
+
+ <a href="https://carrot.ac.uk/">Carrot Mapper</a> is a web app that lets the user take the metadata of a dataset (as output by [WhiteRabbit](https://github.com/OHDSI/WhiteRabbit)) and produce mapping rules to the OMOP standard, in JSON format. These rules can then be ingested by [Carrot Transform](https://carrot.ac.uk/transform/quickstart) to map the contents of the dataset to OMOP.
+
+ Carrot Transform transforms input data into tab-separated value (TSV) files of standard OMOP tables, with concepts mapped according to the provided rules (generated by Carrot Mapper).
+
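+ As a rough sketch, a run over the bundled example data might look like the following (the `carrot-transform` entry point is defined in this wheel's entry_points.txt, but the subcommand path and option names here are inferred from the `mapstream` parameters in `run.py` further down, so confirm them with `carrot-transform --help`):
+
+ ```bash
+ carrot-transform run mapstream \
+   --rules-file carrottransform/examples/test/rules/rules_14June2021.json \
+   --person-file carrottransform/examples/test/inputs/Demographics.csv \
+   --output-dir output \
+   --omop-version "5.3" \
+   --input-dir carrottransform/examples/test/inputs
+ ```
+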
+ ## Quick Start for Developers
+
+ To get the project up and running, please follow the [Quick Start Guide](https://carrot.ac.uk/transform/quickstart).
+
+ ## Release Procedure
+ To release a new version of `carrot-transform`, follow these steps:
+
+ ### 1. Prepare the repository
+ - First ensure that the repository is clean and all required changes have been merged.
+ - Pull the latest changes from `main` with `git pull origin main`.
+
+ ### 2. Create a release branch
+
+ - Now create a new feature branch named `release/v<NEW-VERSION>` (e.g. `release/v0.2.0`), as sketched below.
+
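+ A minimal sketch, assuming the new version will be 0.2.0:
+
+ ```bash
+ git checkout -b release/v0.2.0
+ ```
+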
+ ### 3. Update the version number
+ - Use poetry to bump the version. For example, for a minor version update invoke:
+ ```bash
+ poetry version minor
+ ```
+ - Commit and push the changes (to the release feature branch):
+ ```bash
+ NEW_VERSION=$(poetry version -s)
+ git add pyproject.toml
+ git commit -m "Bump version to $NEW_VERSION"
+ git push --set-upstream origin release/v$NEW_VERSION
+ ```
+
+ ### 4. Create pull request
+ - Open a pull request from `release/v$NEW_VERSION` to `main` and await approval.
+ ### 5. Merge and tag
+ - After approval, merge the feature branch to `main`.
+ - Check out `main`, pull updates, and create a tag corresponding to the new version number.
+ ```bash
+ git checkout main
+ git pull origin main
+ git tag -a "$NEW_VERSION" -m "Release $NEW_VERSION"
+ git push origin "$NEW_VERSION"
+ ```
+
+ ### 6. Create a release
+ - We must now link the tag to a release in the GitHub repository. To do this from the command line, first install the GitHub CLI (`gh`) and then invoke:
+ ```bash
+ gh release create "$NEW_VERSION" --title "$NEW_VERSION" --notes "Release $NEW_VERSION"
+ ```
+
+ - Alternatively, follow the instructions in the [GitHub documentation](https://docs.github.com/en/repositories/releasing-projects-on-github/managing-releases-in-a-repository) to manually create a release.
+
+ ## License
+
+ This repository's source code is available under the [MIT license](LICENSE).
@@ -0,0 +1,24 @@
+ carrottransform/__init__.py,sha256=cQJKTCpG2qmKxDl-VtSWQ3_WFjyzg4u_8nZacWAHFcU,73
+ carrottransform/_version.py,sha256=bm7SM-_MN0gstlNsCDO6dAajKcjQD-NxI_xpvfRx0Ts,172
+ carrottransform/cli/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
+ carrottransform/cli/command.py,sha256=xYTaJsVZyRYv0CzUwrh7ZPK8hhGyC3MDfvVYxHcXYSM,508
+ carrottransform/cli/subcommands/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
+ carrottransform/cli/subcommands/run.py,sha256=r2XanTvy4QowPbziZ5lqs-Tm8CAzCquL7DRy4lTT9Ak,23977
+ carrottransform/config/OMOPCDM_postgresql_5.3_ddl.sql,sha256=fXrPfdL3IzU5ux55ogsQKjjd-c1KzdP_N2A_JjlY3gk,18084
+ carrottransform/config/omop.json,sha256=OT3jvfPjKhjsDnQcQw1OAEOHhQLoHXNxTj_MDwNbYqo,1934
+ carrottransform/examples/test/inputs/Covid19_test.csv,sha256=d5t7Lfhkwbfe3Uk2IBqB2ZT5o0h9QaeraC8E5-IMERo,67521
+ carrottransform/examples/test/inputs/Demographics.csv,sha256=_ukUTpD4g751sL_mSL3f26T_Edd2kvH-evwm54VfXJI,85237
+ carrottransform/examples/test/inputs/Symptoms.csv,sha256=5dvGv16PNJJO_lFc0reRmQbE3m7iWfWajl51JDsqg0M,78447
+ carrottransform/examples/test/inputs/covid19_antibody.csv,sha256=SPCpyqpTbVq9987jXZ8AS4FEkrchRMAIYhTQJjfpwfY,98927
+ carrottransform/examples/test/inputs/vaccine.csv,sha256=_gcM-SIymyt2Dkkr_zGmQI9keIdmDm-gDI_QvXXLFrY,44037
+ carrottransform/examples/test/rules/rules_14June2021.json,sha256=n2OYNFhbx-NLhmqjAad6RsfXjQFknZIgQ7a5uyJF0Co,13226
+ carrottransform/tools/__init__.py,sha256=b3JuCwgJVx0rqx5igB8hNNKO0ktlbQjHGHwy-vzpdo0,198
+ carrottransform/tools/file_helpers.py,sha256=xlODDAUpsx0H4sweGZ81ttjJjNQGn2spNUa1Fndotw8,316
+ carrottransform/tools/mappingrules.py,sha256=IiZx24G27Rag-YgV-4jDxprJea9Ce7SZUbjxMm0n49k,7040
+ carrottransform/tools/metrics.py,sha256=LOzm80-YIVM9mvgvQXRpyArl2nSfSTTW9DikqJ5M2Yg,5700
+ carrottransform/tools/omopcdm.py,sha256=MwS_MwwBrypwjbFLuxoE0xlddWIi0T3BEPgN9LPkGAs,8508
+ carrot_transform-0.3.4.dist-info/LICENSE,sha256=pqIiuuTs6Na-oFd10MMsZoZmdfhfUhHeOtQzgzSkcaw,1082
+ carrot_transform-0.3.4.dist-info/METADATA,sha256=mbB8-GgOH6EnJXDr2j46Q97R3ID4Dro9IbgAFcJVAXY,4219
+ carrot_transform-0.3.4.dist-info/WHEEL,sha256=XbeZDeTWKc1w7CSIyre5aMDU_-PohRwTQceYnisIYYY,88
+ carrot_transform-0.3.4.dist-info/entry_points.txt,sha256=z7qmjTl7C8shrYiPBy6yZo9RRZ31Jcvo6L8ntdqbs2E,74
+ carrot_transform-0.3.4.dist-info/RECORD,,
@@ -1,4 +1,4 @@
  Wheel-Version: 1.0
- Generator: poetry-core 2.0.1
+ Generator: poetry-core 2.1.1
  Root-Is-Purelib: true
  Tag: py3-none-any
@@ -0,0 +1,3 @@
+ [console_scripts]
+ carrot-transform=carrottransform.cli.command:transform
+
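
The entry point above registers the `carrot-transform` console script against `carrottransform.cli.command:transform`, so once the wheel is installed the command should be callable directly. A quick check, assuming a local copy of the wheel:

```bash
pip install carrot_transform-0.3.4-py3-none-any.whl
carrot-transform --help
```
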
@@ -1,2 +1,6 @@
- # TODO - pick this up automatically when building
- __version__ = '0.3.2'
+ from importlib.metadata import version
+
+ try:
+     __version__ = version("carrot_transform")  # Defined in the pyproject.toml
+ except Exception:
+     __version__ = "unknown"
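
With `_version.py` now reading the version from the installed distribution's metadata, a one-liner to verify it after installation (a sketch, not part of the package):

```bash
python -c "import carrottransform; print(carrottransform.__version__)"
```
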
@@ -8,6 +8,9 @@ import json
  import importlib.resources
  import carrottransform
  import carrottransform.tools as tools
+ from carrottransform.tools.omopcdm import OmopCDM
+ from typing import Iterator, IO
+

  @click.group(help="Commands for mapping data to the OMOP CommonDataModel (CDM).")
  def run():
@@ -35,7 +38,7 @@ def run():
                help="File containing additional / override json config for omop outputs")
  @click.option("--omop-version",
                required=False,
-               help="Quoted string containing opmop version - eg '5.3'")
+               help="Quoted string containing omop version - eg '5.3'")
  @click.option("--saved-person-id-file",
                default=None,
                required=False,
@@ -66,28 +69,26 @@ def mapstream(rules_file, output_dir, write_mode,
      # - check for values in optional arguments
      # - read in configuration files
      # - check main directories for existence
-     # - handle saved persion ids
-     # - initialise metrics
-     if (omop_ddl_file == None) and (omop_config_file == None) and (omop_version != None):
-         omop_config_file = str(importlib.resources.files('carrottransform')) + '/' + 'config/omop.json'
-         omop_ddl_file_name = "OMOPCDM_postgresql_" + omop_version + "_ddl.sql"
-         omop_ddl_file = str(importlib.resources.files('carrottransform')) + '/' + 'config/' + omop_ddl_file_name
-
-     if os.path.isdir(input_dir[0]) == False:
-         print("Not a directory, input dir {0}".format(input_dir[0]))
-         sys.exit(1)
-
-     if os.path.isdir(output_dir) == False:
-         print("Not a directory, output dir {0}".format(output_dir))
-         sys.exit(1)
-
-     if saved_person_id_file == None:
-         saved_person_id_file = output_dir + "/" + "person_ids.tsv"
-         if os.path.exists(saved_person_id_file):
-             os.remove(saved_person_id_file)
+     # - handle saved person ids
+     # - initialise metrics
+     print(rules_file, output_dir, write_mode,
+           person_file, omop_ddl_file, omop_config_file,
+           omop_version, saved_person_id_file, use_input_person_ids,
+           last_used_ids_file, log_file_threshold, input_dir)
+
+     ## set omop filenames
+     omop_config_file, omop_ddl_file = set_omop_filenames(omop_ddl_file, omop_config_file, omop_version)
+     ## check directories are valid
+     check_dir_isvalid(input_dir)
+     check_dir_isvalid(output_dir)
+
+     saved_person_id_file = set_saved_person_id_file(saved_person_id_file, output_dir)

      starttime = time.time()
+     ## create an OmopCDM object, which contains attributes and methods for the omop data tables
      omopcdm = tools.omopcdm.OmopCDM(omop_ddl_file, omop_config_file)
+
+     ## mapping rules determine the output files, the input files and fields used from the source data, and the mappings to omop concepts
      mappingrules = tools.mappingrules.MappingRules(rules_file, omopcdm)
      metrics = tools.metrics.Metrics(mappingrules.get_dataset_name(), log_file_threshold)
      nowtime = time.time()
@@ -95,36 +96,41 @@ def mapstream(rules_file, output_dir, write_mode,
      print("--------------------------------------------------------------------------------")
      print("Loaded mapping rules from: {0} in {1:.5f} secs".format(rules_file, (nowtime - starttime)))
      output_files = mappingrules.get_all_outfile_names()
+
+     ## set record numbers
+     ## these keep track of the current record number in each output file, e.g., measurement_id, observation_id
      record_numbers = {}
      for output_file in output_files:
          record_numbers[output_file] = 1
+     if last_used_ids_file != None:
+         if os.path.isfile(last_used_ids_file):
+             record_numbers = load_last_used_ids(last_used_ids_file, record_numbers)

      fhd = {}
      tgtcolmaps = {}

+
+
      try:
-         # Saved-person-file existence test, reload if found, return last used integer
-         if os.path.isfile(saved_person_id_file):
-             person_lookup, last_used_integer = load_saved_person_ids(saved_person_id_file)
-         else:
-             person_lookup = {}
-             last_used_integer = 1
-         if last_used_ids_file != None:
-             if os.path.isfile(last_used_ids_file):
-                 record_numbers = load_last_used_ids(last_used_ids_file, record_numbers)
-
-         person_lookup, rejected_person_count = load_person_ids(person_file, person_lookup, mappingrules, use_input_person_ids, last_used_integer)
-         fhpout = open(saved_person_id_file, mode="w")
-         fhpout.write("SOURCE_SUBJECT\tTARGET_SUBJECT\n")
-         for person_id, person_assigned_id in person_lookup.items():
-             fhpout.write("{0}\t{1}\n".format(str(person_id), str(person_assigned_id)))
-         fhpout.close()
-         # Initialise output files, output a header for each
+         ## get all person_ids from the person file, either renumbering them with an int or taking them as-is, and add them to a dict
+         person_lookup, rejected_person_count = load_person_ids(saved_person_id_file, person_file, mappingrules, use_input_person_ids)
+         ## open the person_ids output file
+         with open(saved_person_id_file, mode="w") as fhpout:
+             ## write the header to the file
+             fhpout.write("SOURCE_SUBJECT\tTARGET_SUBJECT\n")
+             ## iterate through the ids and write them to the file
+             for person_id, person_assigned_id in person_lookup.items():
+                 fhpout.write("{0}\t{1}\n".format(str(person_id), str(person_assigned_id)))
+
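+         ## the saved person_ids file ends up as two tab-separated columns, e.g. (illustrative values only):
+         ##   SOURCE_SUBJECT    TARGET_SUBJECT
+         ##   ABC123            1
+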
+         ## Initialise output files (adding them to a dict), output a header for each
+         ## these are deliberately not closed here
          for tgtfile in output_files:
              fhd[tgtfile] = open(output_dir + "/" + tgtfile + ".tsv", mode=write_mode)
              if write_mode == 'w':
                  outhdr = omopcdm.get_omop_column_list(tgtfile)
                  fhd[tgtfile].write("\t".join(outhdr) + "\n")
+             ## map all omop columns for each file into a dict of column name to index,
+             ## so tgtcolmaps is a dict of dicts
              tgtcolmaps[tgtfile] = omopcdm.get_omop_column_map(tgtfile)

      except IOError as e:
@@ -133,43 +139,35 @@ def mapstream(rules_file, output_dir, write_mode,

      print("person_id stats: total loaded {0}, reject count {1}".format(len(person_lookup), rejected_person_count))

-     # Compare files found in the input_dir with those expected based on mapping rules
+     ## Compare files found in the input_dir with those expected based on mapping rules
      existing_input_files = fnmatch.filter(os.listdir(input_dir[0]), '*.csv')
      rules_input_files = mappingrules.get_all_infile_names()
-     # Log mismatches but continue
-     for infile in existing_input_files:
-         if infile not in rules_input_files:
-             msg = "ERROR: no mapping rules found for existing input file - {0}".format(infile)
-             print(msg)
-     for infile in rules_input_files:
-         if infile not in existing_input_files:
-             msg = "ERROR: no data for mapped input file - {0}".format(infile)
-             print(msg)

-     # set up overall counts
+     ## Log mismatches but continue
+     check_files_in_rules_exist(rules_input_files, existing_input_files)
+
+     ## set up overall counts
      rejidcounts = {}
      rejdatecounts = {}
      print(rules_input_files)

-     # set up per-input counts
+     ## set up per-input counts
      for srcfilename in rules_input_files:
          rejidcounts[srcfilename] = 0
          rejdatecounts[srcfilename] = 0

-     # main processing loop, for each input file
+     ## main processing loop, for each input file
      for srcfilename in rules_input_files:
          outcounts = {}
          rejcounts = {}
          rcount = 0

-         try:
-             fh = open(input_dir[0] + "/" + srcfilename, mode="r", encoding="utf-8-sig")
-             csvr = csv.reader(fh)
-         except IOError as e:
-             print("Unable to open: {0}".format(input_dir[0] + "/" + srcfilename))
-             print("I/O error({0}): {1}".format(e.errno, e.strerror))
+         fh, csvr = open_file(input_dir[0], srcfilename)
+         if fh is None:
              continue

+
+         ## look up, for this input file, the target output files and the source-to-target rules
          tgtfiles, src_to_tgt = mappingrules.parse_rules_src_to_tgt(srcfilename)
          infile_datetime_source, infile_person_id_source = mappingrules.get_infile_date_person_id(srcfilename)
          for tgtfile in tgtfiles:
@@ -185,12 +183,13 @@ def mapstream(rules_file, output_dir, write_mode,
          datetime_col = inputcolmap[infile_datetime_source]
          print("--------------------------------------------------------------------------------")
          print("Processing input: {0}".format(srcfilename))
-
+
          # for each input record
          for indata in csvr:
              key = srcfilename + "~all~all~all~"
              metrics.increment_key_count(key, "input_count")
              rcount += 1
+             # if there is a date, parse it - read it as a string and convert to YYYY-MM-DD
              strdate = indata[datetime_col].split(" ")[0]
              fulldate = parse_date(strdate)
              if fulldate != None:
@@ -214,30 +213,15 @@ def mapstream(rules_file, output_dir, write_mode,
                  for outrecord in outrecords:
                      if auto_num_col != None:
                          outrecord[tgtcolmap[auto_num_col]] = str(record_numbers[tgtfile])
+                     ## most of the rest of this section is actually to do with metrics
                      record_numbers[tgtfile] += 1
                      if (outrecord[tgtcolmap[pers_id_col]]) in person_lookup:
                          outrecord[tgtcolmap[pers_id_col]] = person_lookup[outrecord[tgtcolmap[pers_id_col]]]
                          outcounts[tgtfile] += 1
-                         key = srcfilename + "~all~all~all~"
-                         metrics.increment_key_count(key, "output_count")
-                         key = "all~all~" + tgtfile + "~all~"
-                         metrics.increment_key_count(key, "output_count")
-                         key = srcfilename + "~all~" + tgtfile + "~all~"
-                         metrics.increment_key_count(key, "output_count")
-                         if tgtfile == "person":
-                             key = srcfilename + "~all~" + tgtfile + "~" + outrecord[1] +"~"
-                             metrics.increment_key_count(key, "output_count")
-                             key = srcfilename + "~" + datacol +"~" + tgtfile + "~" + outrecord[1] + "~" + outrecord[2]
-                             metrics.increment_key_count(key, "output_count")
-                         else:
-                             key = srcfilename + "~" + datacol +"~" + tgtfile + "~" + outrecord[2] + "~"
-                             metrics.increment_key_count(key, "output_count")
-                             key = srcfilename + "~all~" + tgtfile + "~" + outrecord[2] + "~"
-                             metrics.increment_key_count(key, "output_count")
-                             key = "all~all~" + tgtfile + "~" + outrecord[2] + "~"
-                             metrics.increment_key_count(key, "output_count")
-                             key = "all~all~all~" + outrecord[2] + "~"
-                             metrics.increment_key_count(key, "output_count")
+
+                         increment_key_counts(srcfilename, metrics, tgtfile, datacol, outrecord)
+
+                         # write the line to the file
                          fhd[tgtfile].write("\t".join(outrecord) + "\n")
                  else:
                      key = srcfilename + "~all~" + tgtfile + "~all~"
@@ -266,7 +250,39 @@ def mapstream(rules_file, output_dir, write_mode,
      nowtime = time.time()
      print("Elapsed time = {0:.5f} secs".format(nowtime - starttime))

- def get_target_records(tgtfilename, tgtcolmap, rulesmap, srcfield, srcdata, srccolmap, srcfilename, omopcdm, metrics):
+ def increment_key_counts(srcfilename: str, metrics: tools.metrics.Metrics, tgtfile: str, datacol: str, outrecord: list[str]) -> None:
+     key = srcfilename + "~all~all~all~"
+     metrics.increment_key_count(key, "output_count")
+
+     key = "all~all~" + tgtfile + "~all~"
+     metrics.increment_key_count(key, "output_count")
+
+     key = srcfilename + "~all~" + tgtfile + "~all~"
+     metrics.increment_key_count(key, "output_count")
+
+     if tgtfile == "person":
+         key = srcfilename + "~all~" + tgtfile + "~" + outrecord[1] + "~"
+         metrics.increment_key_count(key, "output_count")
+
+         key = srcfilename + "~" + datacol + "~" + tgtfile + "~" + outrecord[1] + "~" + outrecord[2]
+         metrics.increment_key_count(key, "output_count")
+     else:
+         key = srcfilename + "~" + datacol + "~" + tgtfile + "~" + outrecord[2] + "~"
+         metrics.increment_key_count(key, "output_count")
+
+         key = srcfilename + "~all~" + tgtfile + "~" + outrecord[2] + "~"
+         metrics.increment_key_count(key, "output_count")
+
+         key = "all~all~" + tgtfile + "~" + outrecord[2] + "~"
+         metrics.increment_key_count(key, "output_count")
+
+         key = "all~all~all~" + outrecord[2] + "~"
+         metrics.increment_key_count(key, "output_count")
+     return
+
+
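+ ## the keys built above share a "~"-delimited shape (inferred from usage in this file, not documented in the source):
+ ##   <source file>~<source field>~<target table>~<record field, e.g. a concept id>~<optional extra>
+ ## e.g. "Symptoms.csv~all~observation~all~" counts every record mapped from Symptoms.csv into the observation table
+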
+ def get_target_records(tgtfilename: str, tgtcolmap: dict[str, dict[str, int]], rulesmap: dict[str, list[dict[str, list[str]]]], srcfield: str, srcdata: list[str], srccolmap: dict[str, int], srcfilename: str, omopcdm: OmopCDM, metrics: tools.metrics.Metrics) -> \
+         tuple[bool, list[str], tools.metrics.Metrics]:
      """
      build all target records for a given input field
      """
@@ -279,6 +295,7 @@ def get_target_records(tgtfilename, tgtcolmap, rulesmap, srcfield, srcdata, srcc
      srckey = srcfilename + "~" + srcfield + "~" + tgtfilename
      summarykey = srcfilename + "~" + srcfield + "~" + tgtfilename + "~all~"
      if valid_value(str(srcdata[srccolmap[srcfield]])):
+         ## check if either or both of the srckey and summarykey are in the rules
          srcfullkey = srcfilename + "~" + srcfield + "~" + str(srcdata[srccolmap[srcfield]]) + "~" + tgtfilename
          dictkeys = []
          if srcfullkey in rulesmap:
@@ -291,6 +308,7 @@ def get_target_records(tgtfilename, tgtcolmap, rulesmap, srcfield, srcdata, srcc
      for dictkey in dictkeys:
          for out_data_elem in rulesmap[dictkey]:
              valid_data_elem = True
+             ## create an empty list to store the data; populate not-null numeric elements with "0" instead of an empty string
              tgtarray = ['']*len(tgtcolmap)
              for req_integer in notnull_numeric_fields:
                  tgtarray[tgtcolmap[req_integer]] = "0"
@@ -302,6 +320,7 @@ def get_target_records(tgtfilename, tgtcolmap, rulesmap, srcfield, srcdata, srcc
              else:
                  tgtarray[tgtcolmap[output_col_data]] = srcdata[srccolmap[infield]]
              if output_col_data in date_component_data:
+                 ## parse the date and store it in the proper format
                  strdate = srcdata[srccolmap[infield]].split(" ")[0]
                  dt = get_datetime_value(strdate)
                  if dt != None:
@@ -453,7 +472,9 @@ def load_saved_person_ids(person_file):
      fh.close()
      return person_ids, last_int

- def load_person_ids(person_file, person_ids, mappingrules, use_input_person_ids, person_number=1, delim=","):
+ def load_person_ids(saved_person_id_file, person_file, mappingrules, use_input_person_ids, delim=","):
+     person_ids, person_number = get_person_lookup(saved_person_id_file)
+
      fh = open(person_file, mode="r", encoding="utf-8-sig")
      csvr = csv.reader(fh, delimiter=delim)
      person_columns = {}
@@ -468,23 +489,25 @@ def load_person_ids(person_file, person_ids, mappingrules, use_input_person_ids,
          person_columns[col] = person_col_in_hdr_number
          person_col_in_hdr_number += 1

+     ## check the mapping rules for person to find where to get the person data, i.e., which columns in the person file contain dob and sex
      birth_datetime_source, person_id_source = mappingrules.get_person_source_field_info("person")
      print("Load Person Data {0}, {1}".format(birth_datetime_source, person_id_source))
+     ## get the column index of the PersonID from the input file
      person_col = person_columns[person_id_source]

      for persondata in csvr:
-         if not valid_value(persondata[person_columns[person_id_source]]):
+         if not valid_value(persondata[person_columns[person_id_source]]):  # just check that the id is not an empty string
              reject_count += 1
              continue
          if not valid_date_value(persondata[person_columns[birth_datetime_source]]):
              reject_count += 1
              continue
-         if persondata[person_col] not in person_ids:
+         if persondata[person_col] not in person_ids:  # if not already in the person_ids dict, add it
              if use_input_person_ids == "N":
-                 person_ids[persondata[person_col]] = str(person_number)
+                 person_ids[persondata[person_col]] = str(person_number)  # create a new integer person_id
                  person_number += 1
              else:
-                 person_ids[persondata[person_col]] = str(persondata[person_col])
+                 person_ids[persondata[person_col]] = str(persondata[person_col])  # use the existing person_id
      fh.close()

      return person_ids, reject_count
@@ -493,4 +516,62 @@ def load_person_ids(person_file, person_ids, mappingrules, use_input_person_ids,
  def py():
      pass

+ def check_dir_isvalid(directory: str | tuple[str, ...]) -> None:
+     ## check the given dir is valid; the input dir arrives as a tuple, so take its first element
+     if type(directory) is tuple:
+         directory = directory[0]
+
+     if not os.path.isdir(directory):
+         print("Not a directory, dir {0}".format(directory))
+         sys.exit(1)
+
+ def set_saved_person_id_file(saved_person_id_file: str, output_dir: str) -> str:
+     ## if no saved person id file was given in the options, use a default path and remove any stale copy
+     if saved_person_id_file is None:
+         saved_person_id_file = output_dir + "/" + "person_ids.tsv"
+         if os.path.exists(saved_person_id_file):
+             os.remove(saved_person_id_file)
+     return saved_person_id_file
+
+
+ def check_files_in_rules_exist(rules_input_files: list[str], existing_input_files: list[str]) -> None:
+     for infile in existing_input_files:
+         if infile not in rules_input_files:
+             msg = "WARNING: no mapping rules found for existing input file - {0}".format(infile)
+             print(msg)
+     for infile in rules_input_files:
+         if infile not in existing_input_files:
+             msg = "WARNING: no data for mapped input file - {0}".format(infile)
+             print(msg)
+
+ def open_file(directory: str, filename: str) -> tuple[IO[str] | None, Iterator[list[str]] | None]:
+     try:
+         fh = open(directory + "/" + filename, mode="r", encoding="utf-8-sig")
+         csvr = csv.reader(fh)
+         return fh, csvr
+     except IOError as e:
+         print("Unable to open: {0}".format(directory + "/" + filename))
+         print("I/O error({0}): {1}".format(e.errno, e.strerror))
+         ## return a pair so the caller's "fh, csvr = open_file(...)" unpacking still works when the open fails
+         return None, None
+
+ def set_omop_filenames(omop_ddl_file: str, omop_config_file: str, omop_version: str) -> tuple[str, str]:
+     if (omop_ddl_file is None) and (omop_config_file is None) and (omop_version is not None):
+         omop_config_file = str(importlib.resources.files('carrottransform')) + '/' + 'config/omop.json'
+         omop_ddl_file_name = "OMOPCDM_postgresql_" + omop_version + "_ddl.sql"
+         omop_ddl_file = str(importlib.resources.files('carrottransform')) + '/' + 'config/' + omop_ddl_file_name
+     return omop_config_file, omop_ddl_file
+
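+ ## e.g. omop_version "5.3" resolves the ddl file to carrottransform/config/OMOPCDM_postgresql_5.3_ddl.sql, which ships in this wheel (see the RECORD above)
+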
+ def get_person_lookup(saved_person_id_file: str) -> tuple[dict[str, str], int]:
+     # Saved-person-file existence test; reload if found, return the last used integer
+     if os.path.isfile(saved_person_id_file):
+         person_lookup, last_used_integer = load_saved_person_ids(saved_person_id_file)
+     else:
+         person_lookup = {}
+         last_used_integer = 1
+     return person_lookup, last_used_integer
+
  run.add_command(mapstream,"mapstream")
+
+ if __name__ == '__main__':
+     mapstream()