data-prep-toolkit 0.0.1.dev4__tar.gz → 0.1.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/Makefile +7 -6
- {data_prep_toolkit-0.0.1.dev4/src/data_prep_toolkit.egg-info → data_prep_toolkit-0.1.0}/PKG-INFO +1 -1
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/doc/advanced-transform-tutorial.md +34 -17
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/doc/architecture.md +14 -10
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/doc/overview.md +7 -6
- data_prep_toolkit-0.1.0/doc/python-launcher-options.md +61 -0
- data_prep_toolkit-0.1.0/doc/python-runtime.md +12 -0
- data_prep_toolkit-0.0.1.dev4/doc/launcher-options.md → data_prep_toolkit-0.1.0/doc/ray-launcher-options.md +37 -42
- data_prep_toolkit-0.1.0/doc/ray-runtime.md +143 -0
- data_prep_toolkit-0.1.0/doc/simplest-transform-tutorial.md +221 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/doc/testing-e2e-transform.md +2 -1
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/doc/transform-external-resources.md +1 -0
- data_prep_toolkit-0.1.0/doc/transform-runtimes.md +9 -0
- data_prep_toolkit-0.0.1.dev4/doc/testing-transforms.md → data_prep_toolkit-0.1.0/doc/transform-standalone-testing.md +1 -0
- data_prep_toolkit-0.1.0/doc/transform-testing.md +6 -0
- data_prep_toolkit-0.1.0/doc/transform-tutorial-examples.md +15 -0
- data_prep_toolkit-0.1.0/doc/transform-tutorials.md +70 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/doc/transformer-utilities.md +3 -1
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/pyproject.toml +2 -2
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0/src/data_prep_toolkit.egg-info}/PKG-INFO +1 -1
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_prep_toolkit.egg-info/SOURCES.txt +38 -26
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/data_access/arrow_s3.py +5 -17
- data_prep_toolkit-0.1.0/src/data_processing/runtime/__init__.py +4 -0
- data_prep_toolkit-0.1.0/src/data_processing/runtime/pure_python/__init__.py +4 -0
- data_prep_toolkit-0.1.0/src/data_processing/runtime/pure_python/runtime_configuration.py +24 -0
- {data_prep_toolkit-0.0.1.dev4/src/data_processing → data_prep_toolkit-0.1.0/src/data_processing/runtime}/pure_python/transform_launcher.py +13 -14
- {data_prep_toolkit-0.0.1.dev4/src/data_processing → data_prep_toolkit-0.1.0/src/data_processing/runtime}/pure_python/transform_orchestrator.py +11 -10
- data_prep_toolkit-0.1.0/src/data_processing/runtime/pure_python/transform_table_processor.py +53 -0
- data_prep_toolkit-0.1.0/src/data_processing/runtime/ray/__init__.py +9 -0
- data_prep_toolkit-0.0.1.dev4/src/data_processing/ray/transform_orchestrator_configuration.py → data_prep_toolkit-0.1.0/src/data_processing/runtime/ray/execution_configuration.py +2 -2
- data_prep_toolkit-0.1.0/src/data_processing/runtime/ray/runtime_configuration.py +38 -0
- {data_prep_toolkit-0.0.1.dev4/src/data_processing → data_prep_toolkit-0.1.0/src/data_processing/runtime}/ray/transform_launcher.py +19 -28
- {data_prep_toolkit-0.0.1.dev4/src/data_processing → data_prep_toolkit-0.1.0/src/data_processing/runtime}/ray/transform_orchestrator.py +9 -9
- {data_prep_toolkit-0.0.1.dev4/src/data_processing → data_prep_toolkit-0.1.0/src/data_processing/runtime}/ray/transform_runtime.py +0 -52
- data_prep_toolkit-0.1.0/src/data_processing/runtime/ray/transform_table_processor.py +46 -0
- data_prep_toolkit-0.1.0/src/data_processing/runtime/runtime_configuration.py +64 -0
- data_prep_toolkit-0.1.0/src/data_processing/runtime/transform_launcher.py +79 -0
- data_prep_toolkit-0.1.0/src/data_processing/runtime/transform_table_processor.py +176 -0
- data_prep_toolkit-0.1.0/src/data_processing/test_support/launch/__init__.py +0 -0
- {data_prep_toolkit-0.0.1.dev4/src/data_processing/test_support/ray → data_prep_toolkit-0.1.0/src/data_processing/test_support/launch}/transform_test.py +13 -11
- data_prep_toolkit-0.1.0/src/data_processing/test_support/transform/__init__.py +6 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/test_support/transform/noop_transform.py +38 -25
- data_prep_toolkit-0.1.0/src/data_processing/transform/__init__.py +3 -0
- data_prep_toolkit-0.0.1.dev4/src/data_processing/transform/launcher_configuration.py → data_prep_toolkit-0.1.0/src/data_processing/transform/transform_configuration.py +45 -3
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/utils/transform_utils.py +36 -7
- data_prep_toolkit-0.1.0/test/data_processing_tests/launch/pure_python/launcher_test.py +169 -0
- data_prep_toolkit-0.1.0/test/data_processing_tests/launch/pure_python/multi_launcher_test.py +78 -0
- {data_prep_toolkit-0.0.1.dev4/test/data_processing_tests/ray → data_prep_toolkit-0.1.0/test/data_processing_tests/launch/pure_python}/test_noop_launch.py +8 -4
- {data_prep_toolkit-0.0.1.dev4/test/data_processing_tests/pure_python → data_prep_toolkit-0.1.0/test/data_processing_tests/launch/ray}/launcher_test.py +39 -76
- data_prep_toolkit-0.1.0/test/data_processing_tests/launch/ray/multi_launcher_test.py +80 -0
- {data_prep_toolkit-0.0.1.dev4/test/data_processing_tests → data_prep_toolkit-0.1.0/test/data_processing_tests/launch}/ray/ray_util_test.py +1 -1
- data_prep_toolkit-0.1.0/test/data_processing_tests/launch/ray/test_noop_launch.py +41 -0
- data_prep_toolkit-0.0.1.dev4/doc/logo-ibm-dark.png +0 -0
- data_prep_toolkit-0.0.1.dev4/doc/logo-ibm.png +0 -0
- data_prep_toolkit-0.0.1.dev4/doc/simplest-transform-tutorial.md +0 -198
- data_prep_toolkit-0.0.1.dev4/doc/transform-tutorials.md +0 -194
- data_prep_toolkit-0.0.1.dev4/src/data_processing/pure_python/__init__.py +0 -4
- data_prep_toolkit-0.0.1.dev4/src/data_processing/pure_python/python_launcher_configuration.py +0 -99
- data_prep_toolkit-0.0.1.dev4/src/data_processing/pure_python/transform_table_processor.py +0 -190
- data_prep_toolkit-0.0.1.dev4/src/data_processing/ray/__init__.py +0 -10
- data_prep_toolkit-0.0.1.dev4/src/data_processing/ray/transform_table_processor.py +0 -191
- data_prep_toolkit-0.0.1.dev4/src/data_processing/test_support/ray/__init__.py +0 -1
- data_prep_toolkit-0.0.1.dev4/src/data_processing/test_support/transform/__init__.py +0 -7
- data_prep_toolkit-0.0.1.dev4/src/data_processing/transform/__init__.py +0 -7
- data_prep_toolkit-0.0.1.dev4/test/data_processing_tests/ray/launcher_test.py +0 -287
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/.gitignore +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/README.md +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/doc/processing-architecture.jpg +0 -0
- data_prep_toolkit-0.0.1.dev4/doc/using_s3_transformers.md → data_prep_toolkit-0.1.0/doc/transform-s3-testing.md +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/setup.cfg +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_prep_toolkit.egg-info/dependency_links.txt +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_prep_toolkit.egg-info/requires.txt +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_prep_toolkit.egg-info/top_level.txt +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/__init__.py +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/data_access/__init__.py +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/data_access/data_access.py +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/data_access/data_access_factory.py +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/data_access/data_access_factory_base.py +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/data_access/data_access_local.py +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/data_access/data_access_s3.py +0 -0
- {data_prep_toolkit-0.0.1.dev4/src/data_processing/transform → data_prep_toolkit-0.1.0/src/data_processing/runtime}/execution_configuration.py +0 -0
- {data_prep_toolkit-0.0.1.dev4/src/data_processing → data_prep_toolkit-0.1.0/src/data_processing/runtime}/ray/ray_utils.py +0 -0
- {data_prep_toolkit-0.0.1.dev4/src/data_processing → data_prep_toolkit-0.1.0/src/data_processing/runtime}/ray/transform_statistics.py +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/test_support/__init__.py +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/test_support/abstract_test.py +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/test_support/data_access/__init__.py +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/test_support/data_access/data_access_factory_test.py +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/test_support/transform/transform_test.py +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/transform/table_transform.py +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/transform/transform_statistics.py +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/utils/__init__.py +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/utils/cli_utils.py +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/utils/config.py +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/utils/log.py +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/utils/params_utils.py +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/test/data_processing_tests/data_access/daf_local_test.py +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/test/data_processing_tests/data_access/data_access_local_test.py +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/test/data_processing_tests/data_access/data_access_s3_test.py +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/test/data_processing_tests/data_access/sample_input_data_test.py +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/test/data_processing_tests/transform/test_noop.py +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/test/data_processing_tests/util/transform_utils_test.py +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/test-data/data_processing/daf/input/ds1/sample1.parquet +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/test-data/data_processing/daf/input/ds1/sample2.parquet +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/test-data/data_processing/daf/input/ds2/sample3.parquet +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/test-data/data_processing/daf/output/ds1/sample1.parquet +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/test-data/data_processing/input/sample1.parquet +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/test-data/data_processing/input_multiple/sample1.parquet +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/test-data/data_processing/input_multiple/sample2.parquet +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/test-data/data_processing/input_multiple/sample3.parquet +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/test-data/data_processing/ray/noop/expected/metadata.json +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/test-data/data_processing/ray/noop/expected/sample1.parquet +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/test-data/data_processing/ray/noop/expected/subdir/test1.parquet +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/test-data/data_processing/ray/noop/input/sample1.parquet +0 -0
- {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/test-data/data_processing/ray/noop/input/subdir/test1.parquet +0 -0
Makefile

````diff
@@ -53,10 +53,11 @@ venv:: pyproject.toml
 # pytest-forked was tried, but then we get SIGABRT in pytest when running the s3 tests, some of which are skipped..
 test::
 	@# Help: Use the already-built virtual environment to run pytest on the test directory.
-	source venv/bin/activate; export PYTHONPATH=../src; cd test; $(PYTEST) data_processing_tests/data_access;
-	source venv/bin/activate; export PYTHONPATH=../src; cd test; $(PYTEST) data_processing_tests/transform;
-	source venv/bin/activate; export PYTHONPATH=../src; cd test; $(PYTEST) data_processing_tests/pure_python;
-	source venv/bin/activate; export PYTHONPATH=../src;
-	source venv/bin/activate; export PYTHONPATH=../src; cd test; $(PYTEST) data_processing_tests/ray/
-	source venv/bin/activate; export PYTHONPATH=../src; cd test; $(PYTEST) data_processing_tests/ray/
+	source venv/bin/activate; export PYTHONPATH=../src; cd test; $(PYTEST) data_processing_tests/data_access;
+	source venv/bin/activate; export PYTHONPATH=../src; cd test; $(PYTEST) data_processing_tests/transform;
+	source venv/bin/activate; export PYTHONPATH=../src; cd test; $(PYTEST) data_processing_tests/launch/pure_python/launcher_test.py;
+	source venv/bin/activate; export PYTHONPATH=../src; cd test; $(PYTEST) data_processing_tests/launch/pure_python/test_noop_launch.py;
+	source venv/bin/activate; export PYTHONPATH=../src; cd test; $(PYTEST) data_processing_tests/launch/ray/ray_util_test.py;
+	source venv/bin/activate; export PYTHONPATH=../src; cd test; $(PYTEST) data_processing_tests/launch/ray/launcher_test.py;
+	source venv/bin/activate; export PYTHONPATH=../src; cd test; $(PYTEST) data_processing_tests/launch/ray/test_noop_launch.py;
 
````
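One detail worth noting in the recipe above: Make runs every recipe line in its own shell, which is why each test line repeats the `source venv/bin/activate; export PYTHONPATH=../src; cd test` prefix. A minimal, self-contained illustration of that behavior (the `DPT_SRC` variable name is invented for the demo):

```python
import subprocess

# Make runs each recipe line in a separate shell, so state set on one line
# (an activated venv, an exported variable, the working directory) is gone
# on the next line. Two independent shells stand in for two recipe lines.
first = subprocess.run(
    ["bash", "-c", 'export DPT_SRC=../src; printf %s "$DPT_SRC"'],
    capture_output=True, text=True,
).stdout
second = subprocess.run(
    ["bash", "-c", 'printf %s "${DPT_SRC:-unset}"'],
    capture_output=True, text=True,
).stdout
print(first, second)  # ../src unset
```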
doc/advanced-transform-tutorial.md

````diff
@@ -13,6 +13,7 @@ removes duplicate documents across all files. In this tutorial, we will show the
 the operation of our _noop_ transform.
 
 The complete task involves the following:
+
 * EdedupTransform - class that implements the specific transformation
 * EdedupRuntime - class that implements custom TransformRuntime to create supporting Ray objects and enhance job output
 statistics
@@ -39,6 +40,7 @@ First, let's define the transform class. To do this we extend
 the base abstract/interface class
 [AbstractTableTransform](../src/data_processing/transform/table_transform.py),
 which requires definition of the following:
+
 * an initializer (i.e. `init()`) that accepts a dictionary of configuration
 data. For this example, the configuration data will only be defined by
 command line arguments (defined below).
@@ -56,15 +58,17 @@ from typing import Any
 
 import pyarrow as pa
 import ray
-from data_processing.data_access import
-from data_processing.ray import (
-    RayLauncherConfiguration,
+from data_processing.data_access import DataAccessFactoryBase
+from data_processing.runtime.ray import (
     DefaultTableTransformRuntimeRay,
-    RayUtils,
     RayTransformLauncher,
+    RayUtils,
+)
+from data_processing.runtime.ray.runtime_configuration import (
+    RayTransformRuntimeConfiguration,
 )
-from data_processing.transform import AbstractTableTransform
-from data_processing.utils import GB, TransformUtils
+from data_processing.transform import AbstractTableTransform, TransformConfiguration
+from data_processing.utils import GB, CLIArgumentProvider, TransformUtils, get_logger
 from ray.actor import ActorHandle
 
 
@@ -136,8 +140,9 @@ If there is no metadata then simply return an empty dictionary.
 
 First, let's define the transform runtime class. To do this we extend
 the base abstract/interface class
-[DefaultTableTransformRuntime](../src/data_processing/ray/transform_runtime.py),
+[DefaultTableTransformRuntime](../src/data_processing/runtime/ray/transform_runtime.py),
 which requires definition of the following:
+
 * an initializer (i.e. `init()`) that accepts a dictionary of configuration
 data. For this example, the configuration data will only be defined by
 command line arguments (defined below).
@@ -202,8 +207,10 @@ collected by hash actors and custom computations based on statistics data.
 
 ## EdedupTableTransformConfiguration
 
-The final class we need to implement is `
-
+The final class we need to implement is `EdedupRayTransformConfiguration` class that provides configuration for
+running our transform. Although we provide only Ray-based implementation, Ray-based configuration relies on Python-based
+configuration that we need to define first. So we first need to define `EdedupTableTransformConfiguration` class,
+defining the following:
 
 * The short name for the transform
 * The class implementing the transform - in our case EdedupTransform
@@ -216,10 +223,12 @@ First we define the class and its initializer,
 short_name = "ededup"
 cli_prefix = f"{short_name}_"
 
-class EdedupTableTransformConfiguration(
-
-
-
+class EdedupTableTransformConfiguration(TransformConfiguration):
+    def __init__(self):
+        super().__init__(
+            name=short_name,
+            transform_class=EdedupTransform,
+        )
 ```
 
 The initializer extends the DefaultTableTransformConfiguration which provides simple
@@ -253,6 +262,13 @@ and which allows us to capture the `EdedupTransform`-specific arguments and opti
 logger.info(f"exact dedup params are {self.params}")
 return True
 ```
+Now we can implement `EdedupRayTransformConfiguration` with the following code
+```python
+class EdedupRayTransformConfiguration(RayTransformConfiguration):
+    def __init__(self):
+        super().__init__(transform_config=EdedupTableTransformConfiguration(), runtime_class=EdedupRuntime)
+
+```
 
 ## main()
 
@@ -261,11 +277,12 @@ framework's `TransformLauncher` class.
 
 ```python
 if __name__ == "__main__":
-
-
+    launcher = RayTransformLauncher(EdedupRayTransformConfiguration())
+    launcher.launch()
 ```
+
 The launcher requires only an instance of DefaultTableTransformConfiguration
-(our `
+(our `EdedupRayTransformConfiguration` class).
 A single method `launch()` is then invoked to run the transform in a Ray cluster.
 
 ## Running
@@ -280,5 +297,5 @@ python ededup_transform.py --hash_cpu 0.5 --num_hashes 2 --doc_column "contents"
 --s3_config "{'input_folder': 'cos-optimal-llm-pile/test/david/input/', 'output_folder': 'cos-optimal-llm-pile/test/david/output/'}"
 ```
 This is a minimal set of options to run locally.
-See the [launcher options](launcher-options
+See the [launcher options](ray-launcher-options) for a complete list of
 transform-independent command line options.
````
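The two-layer configuration pattern introduced in this hunk (a plain `TransformConfiguration` naming the transform and its implementing class, wrapped by a Ray-specific configuration that adds the runtime class) can be sketched with toy stand-in classes. None of the classes below is the toolkit's real API; they are illustrative stubs showing the shape of the pattern:

```python
# Toy stand-ins for the toolkit's base classes (not the real API).
class TransformConfiguration:
    def __init__(self, name, transform_class):
        self.name = name
        self.transform_class = transform_class

class RayTransformConfiguration:
    def __init__(self, transform_config, runtime_class):
        self.transform_config = transform_config
        self.runtime_class = runtime_class

class EdedupTransform:   # stand-in for the real transform class
    pass

class EdedupRuntime:     # stand-in for the custom Ray runtime
    pass

# Layer 1: the Python-level configuration names the transform.
class EdedupTableTransformConfiguration(TransformConfiguration):
    def __init__(self):
        super().__init__(name="ededup", transform_class=EdedupTransform)

# Layer 2: the Ray-level configuration wraps layer 1 and adds the runtime.
class EdedupRayTransformConfiguration(RayTransformConfiguration):
    def __init__(self):
        super().__init__(
            transform_config=EdedupTableTransformConfiguration(),
            runtime_class=EdedupRuntime,
        )

config = EdedupRayTransformConfiguration()
print(config.transform_config.name)  # ededup
```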
doc/architecture.md

````diff
@@ -13,10 +13,10 @@ process many input files in parallel using a distribute network of RayWorkers.
 
 The architecture includes the following core components:
 
-* [RayLauncher](../src/data_processing/ray/transform_launcher.py) accepts and validates
+* [RayLauncher](../src/data_processing/runtime/ray/transform_launcher.py) accepts and validates
 CLI parameters to establish the Ray Orchestrator with the proper configuration.
 It uses the following components, all of which can/do define CLI configuration parameters.:
-* [Transform Orchestrator Configuration](../src/data_processing/ray/
+* [Transform Orchestrator Configuration](../src/data_processing/runtime/ray/execution_configuration.py) is responsible
 for defining and validating infrastructure parameters
 (e.g., number of workers, memory and cpu, local or remote cluster, etc.). This class has very simple state
 (several dictionaries) and is fully pickleable. As a result framework uses its instance as a
@@ -25,27 +25,29 @@ It uses the following components, all of which can/do define CLI configuration p
 configuration for the type of DataAccess to use when reading/writing the input/output data for
 the transforms. Similar to Transform Orchestrator Configuration, this is a pickleable
 instance that is passed between Launcher, Orchestrator and Workers.
-* [TransformConfiguration](../src/data_processing/ray/transform_runtime.py) - defines specifics
+* [TransformConfiguration](../src/data_processing/runtime/ray/transform_runtime.py) - defines specifics
 of the transform implementation including transform implementation class, its short name, any transform-
 specific CLI parameters, and an optional TransformRuntime class, discussed below.
 
 After all parameters are validated, the ray cluster is started and the DataAccessFactory, TransformOrchestratorConfiguraiton
 and TransformConfiguration are given to the Ray Orchestrator, via Ray remote() method invocation.
 The Launcher waits for the Ray Orchestrator to complete.
-
+
+* [Ray Orchestrator](../src/data_processing/runtime/ray/transform_orchestrator.py) is responsible for overall management of
 the data processing job. It creates the actors, determines the set of input data and distributes the
 references to the data files to be processed by the workers. More specifically, it performs the following:
+
 1. Uses the DataAccess instance created by the DataAccessFactory to determine the set of the files
 to be processed.
 2. uses the TransformConfiguration to create the TransformRuntime instance
 3. Uses the TransformRuntime to optionally apply additional configuration (ray object storage, etc) for the configuration
 and operation of the Transform.
-
+4. uses the TransformOrchestratorConfiguration to determine the set of RayWorkers to create
 to execute transformers in parallel, providing the following to each worker:
 * Ray worker configuration
 * DataAccessFactory
 * Transform class and its TransformConfiguration containing the CLI parameters and any TransformRuntime additions.
-
+5. in a load-balanced, round-robin fashion, distributes the names of the input files to the workers for them to transform/process.
 
 Additionally, to provide monitoring of long-running transforms, the orchestrator is instrumented with
 [custom metrics](https://docs.ray.io/en/latest/ray-observability/user-guides/add-app-metrics.html), that are exported to localhost:8080 (this is the endpoint that
@@ -53,12 +55,14 @@ It uses the following components, all of which can/do define CLI configuration p
 Once all data is processed, the orchestrator will collect execution statistics (from the statistics actor)
 and build and save it in the form of execution metadata (`metadata.json`). Finally, it will return the execution
 result to the Launcher.
-
+
+* [Ray worker](../src/data_processing/runtime/ray/transform_table_processor.py) is responsible for
 reading files (as [PyArrow Tables](https://levelup.gitconnected.com/deep-dive-into-pyarrow-understanding-its-features-and-benefits-2cce8b1466c8))
 assigned by the orchestrator, applying the transform to the input table and writing out the
 resulting table(s). Metadata produced by each table transformation is aggregated into
 Transform Statistics (below).
-
+
+* [Transform Statistics](../src/data_processing/runtime/ray/transform_statistics.py) is a general
 purpose data collector actor aggregating the numeric metadata from different places of
 the framework (especially metadata produced by the transform).
 These statistics are reported as metadata (`metadata.json`) by the orchestrator upon completion.
@@ -92,10 +96,10 @@ For a more complete discussion, see the [tutorials](transform-tutorials.md).
 of any transform implementation - `transform()` and `flush()` - and provides the bulk of any transform implementation
 convert one Table to 0 or more new Tables. In general, this is not tied to the above Ray infrastructure
 and so can usually be used independent of Ray.
-* [TransformRuntime ](../src/data_processing/ray/transform_runtime.py) - this class only needs to be
+* [TransformRuntime ](../src/data_processing/runtime/ray/transform_runtime.py) - this class only needs to be
 extended/implemented when additional Ray components (actors, shared memory objects, etc.) are used
 by the transform. The main method `get_transform_config()` is used to enable these extensions.
-* [TransformConfiguration](../src/data_processing/ray/transform_runtime.py) - this is the bootstrap
+* [TransformConfiguration](../src/data_processing/runtime/ray/transform_runtime.py) - this is the bootstrap
 class provided to the Launcher that enables the instantiation of the Transform and the TransformRuntime within
 the architecture. It is a CLIProvider, which allows it to define transform-specific CLI configuration
 that is made available to the Transform's initializer.
````
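Step 5 of the orchestration flow above (load-balanced, round-robin distribution of input file names to workers) can be sketched as follows. This is an illustrative reimplementation, not the orchestrator's actual code:

```python
from itertools import cycle

def distribute_round_robin(files, num_workers):
    """Assign file names to workers in round-robin order."""
    assignments = {w: [] for w in range(num_workers)}
    worker = cycle(range(num_workers))
    for f in files:
        assignments[next(worker)].append(f)
    return assignments

# Five input files spread over two workers.
work = distribute_round_robin([f"sample{i}.parquet" for i in range(5)], 2)
print(work[0])  # ['sample0.parquet', 'sample2.parquet', 'sample4.parquet']
print(work[1])  # ['sample1.parquet', 'sample3.parquet']
```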
doc/overview.md

````diff
@@ -12,16 +12,17 @@ developers of data transformation are:
 
 * [Transformation](../src/data_processing/transform/table_transform.py) - a simple, easily-implemented interface defines
 the specifics of a given data transformation.
-* [Transform Configuration](../src/data_processing/ray/transform_runtime.py) - defines
-the transform implementation and
-
-* [Transformation Runtime](../src/data_processing/ray/transform_runtime.py) - allows for customization of the Ray environment for the transformer.
-This might include provisioning of shared memory objects or creation of additional actors.
+* [Transform Configuration](../src/data_processing/runtime/ray/transform_runtime.py) - defines
+the transform short name, its implementation class, and command line configuration
+parameters.
 
 To learn more consider the following:
 
 * [Transform Tutorials](transform-tutorials.md)
-* [
+* [Transform Runtimes](transform-runtimes.md)
+* [Transform Examples](transform-tutorial-examples.md)
+* [Testing Transforms](transform-testing.md)
+* [Utilities](transformer-utilities.md)
 * [Architecture Deep Dive](architecture.md)
 * [Transform project root readme](../../transforms/README.md)
 
````
doc/python-launcher-options.md (new file)

````diff
@@ -0,0 +1,61 @@
+# Pure Python Launcher Command Line Options
+
+A number of command line options are available when launching a transform as a Python class.
+
+The following is a current --help output (a work in progress) for
+the `NOOPTransform` (note the --noop_sleep_sec option):
+
+```
+usage: noop_python_runtime.py [-h] [--noop_sleep_sec NOOP_SLEEP_SEC] [--noop_pwd NOOP_PWD] [--data_s3_cred DATA_S3_CRED] [--data_s3_config DATA_S3_CONFIG] [--data_local_config DATA_LOCAL_CONFIG] [--data_max_files DATA_MAX_FILES]
+                              [--data_checkpointing DATA_CHECKPOINTING] [--data_data_sets DATA_DATA_SETS] [--data_files_to_use DATA_FILES_TO_USE] [--data_num_samples DATA_NUM_SAMPLES] [--runtime_pipeline_id RUNTIME_PIPELINE_ID]
+                              [--runtime_job_id RUNTIME_JOB_ID] [--runtime_code_location RUNTIME_CODE_LOCATION]
+
+Driver for noop processing
+
+options:
+  -h, --help            show this help message and exit
+  --noop_sleep_sec NOOP_SLEEP_SEC
+                        Sleep actor for a number of seconds while processing the data frame, before writing the file to COS
+  --noop_pwd NOOP_PWD   A dummy password which should be filtered out of the metadata
+  --data_s3_cred DATA_S3_CRED
+                        AST string of options for s3 credentials. Only required for S3 data access.
+                        access_key: access key help text
+                        secret_key: secret key help text
+                        url: optional s3 url
+                        region: optional s3 region
+                        Example: { 'access_key': 'access', 'secret_key': 'secret',
+                        'url': 'https://s3.us-east.cloud-object-storage.appdomain.cloud',
+                        'region': 'us-east-1' }
+  --data_s3_config DATA_S3_CONFIG
+                        AST string containing input/output paths.
+                        input_folder: Path to input folder of files to be processed
+                        output_folder: Path to output folder of processed files
+                        Example: { 'input_folder': 's3-path/your-input-bucket',
+                        'output_folder': 's3-path/your-output-bucket' }
+  --data_local_config DATA_LOCAL_CONFIG
+                        ast string containing input/output folders using local fs.
+                        input_folder: Path to input folder of files to be processed
+                        output_folder: Path to output folder of processed files
+                        Example: { 'input_folder': './input', 'output_folder': '/tmp/output' }
+  --data_max_files DATA_MAX_FILES
+                        Max amount of files to process
+  --data_checkpointing DATA_CHECKPOINTING
+                        checkpointing flag
+  --data_data_sets DATA_DATA_SETS
+                        List of sub-directories of input directory to use for input. For example, ['dir1', 'dir2']
+  --data_files_to_use DATA_FILES_TO_USE
+                        list of file extensions to choose for input.
+  --data_num_samples DATA_NUM_SAMPLES
+                        number of random input files to process
+  --runtime_pipeline_id RUNTIME_PIPELINE_ID
+                        pipeline id
+  --runtime_job_id RUNTIME_JOB_ID
+                        job id
+  --runtime_code_location RUNTIME_CODE_LOCATION
+                        AST string containing code location
+                        github: Github repository URL.
+                        commit_hash: github commit hash
+                        path: Path within the repository
+                        Example: { 'github': 'https://github.com/somerepo', 'commit_hash': '1324',
+                        'path': 'transforms/universal/code' }
+```
````
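Several options above take "AST string" values. Presumably such strings can be parsed with Python's `ast.literal_eval`, which evaluates only Python literals (the exact parsing the toolkit performs is not shown in this diff); a quick sketch using the example from the `--data_local_config` help text:

```python
import ast

# Example value mirroring the --data_local_config help text above.
data_local_config = "{ 'input_folder': './input', 'output_folder': '/tmp/output' }"

# literal_eval accepts only literal expressions (dicts, lists, strings,
# numbers, ...), unlike eval(), which would execute arbitrary code.
cfg = ast.literal_eval(data_local_config)
print(cfg["input_folder"], cfg["output_folder"])  # ./input /tmp/output
```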
@@ -0,0 +1,12 @@
+## Python Runtime
+The python runtime provides a simple mechanism to run a transform on a set of input data to produce
+a set of output data, all within the python execution environment.
+
+A `PythonTransformLauncher` class is provided that enables the running of the transform. For example,
+
+```python
+launcher = PythonTransformLauncher(YourTransformConfiguration())
+launcher.launch()
+```
+The `YourTransformConfiguration` class configures your transform.
+More details can be found in the [transform tutorial](transform-tutorials.md).
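The configuration-plus-launcher pattern shown above can be sketched with self-contained stand-ins. Everything below (`MyTransformConfiguration`, `TinyLauncher`, the in-memory tables) is illustrative only and is not the toolkit's actual API; the real launcher reads and writes files via the data access layer.

```python
from argparse import ArgumentParser, Namespace

# Illustrative stand-ins; names and signatures are assumptions for this sketch.
class MyTransform:
    def __init__(self, config: dict):
        self.sleep_sec = config.get("sleep_sec", 0)

    def transform(self, table: list[dict]) -> list[dict]:
        # NOOP-style pass-through of the input table
        return table

class MyTransformConfiguration:
    """Mimics a TransformConfiguration: holds name, class, and CLI params."""
    def __init__(self):
        self.name = "my_transform"
        self.transform_class = MyTransform
        self.params = {}

    def add_input_params(self, parser: ArgumentParser) -> None:
        parser.add_argument("--my_transform_sleep_sec", type=int, default=0)

    def apply_input_params(self, args: Namespace) -> bool:
        self.params["sleep_sec"] = args.my_transform_sleep_sec
        return True

class TinyLauncher:
    """Minimal stand-in for a launcher's launch() flow."""
    def __init__(self, config: MyTransformConfiguration):
        self.config = config

    def launch(self, argv: list[str], tables: list[list[dict]]) -> list[list[dict]]:
        parser = ArgumentParser()
        self.config.add_input_params(parser)
        self.config.apply_input_params(parser.parse_args(argv))
        transform = self.config.transform_class(self.config.params)
        return [transform.transform(t) for t in tables]

out = TinyLauncher(MyTransformConfiguration()).launch(
    ["--my_transform_sleep_sec", "0"], [[{"col": 1}]]
)
```

The point of the sketch is the division of labor: the configuration owns the CLI surface and transform class, while the launcher only wires them together and drives the data through.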
@@ -1,29 +1,16 @@
-# Launcher Command Line Options
+# Ray Launcher Command Line Options
 A number of command line options are available when launching a transform.
 
 The following is a current --help output (a work in progress) for
-the `NOOPTransform` (note the --noop_sleep_sec
+the `NOOPTransform` (note the --noop_sleep_sec and --noop_pwd options):
 
 ```
-usage: noop_transform.py [-h]
-                         [--
-                         [--
-                         [--
-                         [--data_s3_config DATA_S3_CONFIG]
-                         [--data_local_config DATA_LOCAL_CONFIG]
-                         [--data_max_files DATA_MAX_FILES]
-                         [--data_checkpointing DATA_CHECKPOINTING]
-                         [--data_data_sets DATA_DATA_SETS]
-                         [--data_max_files MAX_FILES]
-                         [--data_files_to_use DATA_FILES_TO_USE]
-                         [--data_num_samples DATA_NUM_SAMPLES]
-                         [--runtime_num_workers NUM_WORKERS]
-                         [--runtime_worker_options WORKER_OPTIONS]
-                         [--runtime_pipeline_id PIPELINE_ID] [--job_id JOB_ID]
-                         [--runtime_creation_delay CREATION_DELAY]
-                         [--runtime_code_location CODE_LOCATION]
+usage: noop_transform.py [-h] [--run_locally RUN_LOCALLY] [--noop_sleep_sec NOOP_SLEEP_SEC] [--noop_pwd NOOP_PWD] [--data_s3_cred DATA_S3_CRED] [--data_s3_config DATA_S3_CONFIG] [--data_local_config DATA_LOCAL_CONFIG]
+                         [--data_max_files DATA_MAX_FILES] [--data_checkpointing DATA_CHECKPOINTING] [--data_data_sets DATA_DATA_SETS] [--data_files_to_use DATA_FILES_TO_USE] [--data_num_samples DATA_NUM_SAMPLES]
+                         [--runtime_num_workers RUNTIME_NUM_WORKERS] [--runtime_worker_options RUNTIME_WORKER_OPTIONS] [--runtime_creation_delay RUNTIME_CREATION_DELAY] [--runtime_pipeline_id RUNTIME_PIPELINE_ID]
+                         [--runtime_job_id RUNTIME_JOB_ID] [--runtime_code_location RUNTIME_CODE_LOCATION]
 
-Driver for
+Driver for noop processing
 
 options:
   -h, --help            show this help message and exit
@@ -31,35 +18,40 @@ options:
                         running ray local flag
   --noop_sleep_sec NOOP_SLEEP_SEC
                         Sleep actor for a number of seconds while processing the data frame, before writing the file to COS
-  --
-
+  --noop_pwd NOOP_PWD   A dummy password which should be filtered out of the metadata
+  --data_s3_cred DATA_S3_CRED
+                        AST string of options for s3 credentials. Only required for S3 data access.
                         access_key: access key help text
                         secret_key: secret key help text
-
-
-
+                        url: optional s3 url
+                        region: optional s3 region
+                        Example: { 'access_key': 'access', 'secret_key': 'secret',
+                        'url': 'https://s3.us-east.cloud-object-storage.appdomain.cloud',
+                        'region': 'us-east-1' }
+  --data_s3_config DATA_S3_CONFIG
                         AST string containing input/output paths.
                         input_folder: Path to input folder of files to be processed
                         output_folder: Path to output folder of processed files
-                        Example: { 'input_folder': 'your
-
+                        Example: { 'input_folder': 's3-path/your-input-bucket',
+                        'output_folder': 's3-path/your-output-bucket' }
+  --data_local_config DATA_LOCAL_CONFIG
                         ast string containing input/output folders using local fs.
                         input_folder: Path to input folder of files to be processed
                         output_folder: Path to output folder of processed files
                         Example: { 'input_folder': './input', 'output_folder': '/tmp/output' }
-  --data_max_files
+  --data_max_files DATA_MAX_FILES
                         Max amount of files to process
-  --data_checkpointing
+  --data_checkpointing DATA_CHECKPOINTING
                         checkpointing flag
-  --data_data_sets
-                        List of
+  --data_data_sets DATA_DATA_SETS
+                        List of sub-directories of input directory to use for input. For example, ['dir1', 'dir2']
   --data_files_to_use DATA_FILES_TO_USE
-
+                        list of file extensions to choose for input.
   --data_num_samples DATA_NUM_SAMPLES
-                        number of
-  --runtime_num_workers
+                        number of random input files to process
+  --runtime_num_workers RUNTIME_NUM_WORKERS
                         number of workers
-  --runtime_worker_options
+  --runtime_worker_options RUNTIME_WORKER_OPTIONS
                         AST string defining worker resource requirements.
                         num_cpus: Required number of CPUs.
                         num_gpus: Required number of GPUs
@@ -69,16 +61,19 @@ options:
                         placement_group_bundle_index, placement_group_capture_child_tasks, resources, runtime_env,
                         scheduling_strategy, _metadata, concurrency_groups, lifetime, max_concurrency, max_restarts,
                         max_task_retries, max_pending_calls, namespace, get_if_exists
-                        Example: { 'num_cpus': '8', 'num_gpus': '1',
-
-
-  --runtime_job_id JOB_ID job id
-  --runtime_creation_delay CREATION_DELAY
+                        Example: { 'num_cpus': '8', 'num_gpus': '1',
+                        'resources': '{"special_hardware": 1, "custom_label": 1}' }
+  --runtime_creation_delay RUNTIME_CREATION_DELAY
                         delay between actor' creation
-  --
+  --runtime_pipeline_id RUNTIME_PIPELINE_ID
+                        pipeline id
+  --runtime_job_id RUNTIME_JOB_ID
+                        job id
+  --runtime_code_location RUNTIME_CODE_LOCATION
                         AST string containing code location
                         github: Github repository URL.
                         commit_hash: github commit hash
                         path: Path within the repository
-                        Example: { 'github': 'https://github.com/somerepo', 'commit_hash': '
+                        Example: { 'github': 'https://github.com/somerepo', 'commit_hash': '1324',
+                        'path': 'transforms/universal/code' }
 ```
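Several of the options above (`--data_s3_config`, `--data_s3_cred`, `--runtime_worker_options`, `--runtime_code_location`) take "AST string" values, i.e. Python dict literals. Assuming they are parsed with something like `ast.literal_eval`, a config can be built in Python and round-trip checked before being passed on the command line:

```python
import ast

# Build the dict, then render it as the "AST string" the launcher expects.
# That such strings are read with ast.literal_eval is an assumption of this sketch.
s3_config = {
    "input_folder": "s3-path/your-input-bucket",
    "output_folder": "s3-path/your-output-bucket",
}
arg_value = repr(s3_config)  # would be passed as: --data_s3_config "<arg_value>"

# Round-trip check: a well-formed AST string parses back to the same dict.
parsed = ast.literal_eval(arg_value)
```

Building the string with `repr()` rather than by hand avoids quoting mistakes in nested values such as the `resources` JSON string shown in the `--runtime_worker_options` example.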
@@ -0,0 +1,143 @@
+# Ray Runtime
+The Ray runtime includes the following set of components:
+
+* [RayTransformLauncher](../src/data_processing/runtime/ray/transform_launcher.py) - this is a
+class generally used to implement `main()` that makes use of a `TransformConfiguration` to
+start the Ray runtime and execute the transform over the specified set of input files.
+The RayTransformLauncher is created using a `RayTransformConfiguration` instance.
+* [RayTransformConfiguration](../src/data_processing/runtime/ray/runtime_configuration.py) - this
+class extends the transform's base `TransformConfiguration` implementation to add an optional
+`TransformRuntime` (see next) class to be used by the transform implementation.
+* [TransformRuntime](../src/data_processing/runtime/ray/transform_runtime.py) -
+this provides the ability for the transform implementor to create additional Ray resources
+and include them in the configuration used to create a transform
+(see, for example, [exact dedup](../../transforms/universal/ededup/src/ededup_transform.py)).
+It also provides the ability to supplement the statistics collected by
+[Statistics](../src/data_processing/runtime/ray/transform_statistics.py) (see below).
+
+Roughly speaking, the following steps are completed to establish transforms in the RayWorkers:
+
+1. The Launcher parses the CLI parameters using an ArgumentParser configured with its own CLI parameters
+along with those of the Transform Configuration.
+2. The Launcher passes the Transform Configuration and CLI parameters to the [RayOrchestrator](../src/data_processing/runtime/ray/transform_orchestrator.py).
+3. The RayOrchestrator creates the Transform Runtime using the Transform Configuration and its CLI parameter values.
+4. The Transform Runtime creates the transform initialization/configuration, including the CLI parameters
+and any Ray components needed by the transform.
+5. A [RayWorker](../src/data_processing/runtime/ray/transform_table_processor.py) is started with configuration from the Transform Runtime.
+6. The RayWorker creates the Transform using the configuration provided by the Transform Runtime.
+7. Statistics is used to collect the statistics submitted by the individual transforms, which
+are used to build the execution metadata.
+
+
+
+## Ray Transform Launcher
+The [RayTransformLauncher](../src/data_processing/runtime/ray/transform_launcher.py) uses the Transform Configuration
+and provides a single method, `launch()`, that kicks off the Ray environment and the transform execution coordinated
+by the [orchestrator](../src/data_processing/runtime/ray/transform_orchestrator.py).
+For example,
+```python
+launcher = RayTransformLauncher(YourTransformConfiguration())
+launcher.launch()
+```
+Note that the launcher defines some additional CLI parameters that are used to control the operation of the
+[orchestrator and workers](../src/data_processing/runtime/ray/execution_configuration.py) and
+[data access](../src/data_processing/data_access/data_access_factory.py), such as the data access configuration,
+number of workers, worker resources, etc.
+Discussion of these options is beyond the scope of this document
+(see [Launcher Options](ray-launcher-options.md) for a list of the available options).
+
+## Ray Transform Configuration
+In general, a transform should be able to run in both the python and Ray runtimes.
+As such, we first define the python-only transform configuration, which will then
+be used by the Ray-runtime-specific transform configuration.
+The python transform configuration implements
+[TransformConfiguration](../src/data_processing/runtime/runtime_configuration.py)
+and defines the transform-specific name and implementation
+class. In addition, it is responsible for providing transform-specific
+methods to define and capture optional command line arguments.
+```python
+
+class YourTransformConfiguration(TransformConfiguration):
+
+    def __init__(self):
+        super().__init__(name="YourTransform", transform_class=YourTransform)
+        self.params = {}
+
+    def add_input_params(self, parser: ArgumentParser) -> None:
+        ...
+    def apply_input_params(self, args: Namespace) -> bool:
+        ...
+```
+Next we define the Ray-runtime-specific transform configuration as an extension of
+RayTransformConfiguration; it uses the `YourTransformConfiguration` above.
+```python
+
+class YourTransformRayConfiguration(RayTransformConfiguration):
+    def __init__(self):
+        super().__init__(YourTransformConfiguration(),
+                         runtime_class=YourTransformRuntime)
+```
+This class provides the ability to create an instance of the `YourTransformRuntime` class (see below)
+as needed by the Ray runtime. Note that not all transforms will require a `runtime_class`
+and can omit this parameter to default to an acceptable runtime class.
+Details are covered in the [advanced transform tutorial](advanced-transform-tutorial.md).
+
+## Transform Runtime
+The
+[DefaultTableTransformRuntime](../src/data_processing/runtime/ray/transform_runtime.py)
+class is provided and will be
+sufficient for many use cases, especially 1:1 table transformation.
+However, some transforms will require use of the Ray environment, for example,
+to create additional workers, establish a shared memory object, etc.
+Of course, these transforms will generally not run outside of the Ray environment.
+
+```python
+class DefaultTableTransformRuntime:
+
+    def __init__(self, params: dict[str, Any]):
+        ...
+
+    def get_transform_config(
+        self, data_access_factory: DataAccessFactory, statistics: ActorHandle, files: list[str]
+    ) -> dict[str, Any]:
+        ...
+
+    def compute_execution_stats(self, stats: dict[str, Any]) -> dict[str, Any]:
+        ...
+```
+
+The RayOrchestrator initializes the instance with the CLI parameters provided by the Transform Configuration's
+`get_input_params()` method.
+
+The `get_transform_config()` method is used by the RayOrchestrator to create the parameters
+used to initialize the Transform in the RayWorker.
+This is where additional Ray components would be added to the environment
+and references to them added, as needed, to the returned dictionary of configuration data
+that will initialize the transform.
+For those transforms that don't need this support, the default implementation
+simply returns the CLI parameters used to initialize the runtime instance.
+
+The `compute_execution_stats()` method provides an opportunity to augment the statistics
+collected and aggregated by the TransformStatistics actor. It is called by the RayOrchestrator
+after all files have been processed.
+
|
|
125
|
+
A transform may find that it needs to signal error conditions.
|
|
126
|
+
For example, if a referenced model could not be loaded or
|
|
127
|
+
a given table does not have the expected column.
|
|
128
|
+
In general, it should identify such conditions by raising an exception.
|
|
129
|
+
With this in mind, there are two types of exceptions:
|
|
130
|
+
|
|
131
|
+
1. Those that would not allow any tables to be processed (e.g. model loading problem).
|
|
132
|
+
2. Those that would not allow a specific table to be processed (e.g. missing column).
|
|
133
|
+
|
|
134
|
+
In the first situation the transform should throw an exception from the initializer, which
|
|
135
|
+
will cause the Ray framework to terminate processing of all tables.
|
|
136
|
+
In the second situation (identified in the `transform()` or `flush()` methods), the transform
|
|
137
|
+
should throw an exception from the associated method. This will cause only the
|
|
138
|
+
error-causing
|
|
139
|
+
table to be ignored and not written out, but allow continued processing of tables by
|
|
140
|
+
the transform.
|
|
141
|
+
In both cases, the framework will log the exception as an error.
|
|
142
|
+
|
|
143
|
+
|
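The two failure modes can be sketched with a toy transform; the class, the `model` config key, and the column names are made up for illustration, and the try/except loop stands in for the framework's per-table error handling.

```python
class ColumnDropTransform:
    """Toy transform illustrating the two exception types."""
    def __init__(self, config: dict):
        # Type 1: a failure here (e.g. model loading) should stop all processing.
        if config.get("model") is None:
            raise ValueError("model could not be loaded")
        self.column = config["column"]

    def transform(self, table: dict) -> dict:
        # Type 2: a problem with this table only; raising skips just this table.
        if self.column not in table:
            raise KeyError(f"table missing expected column {self.column!r}")
        return {k: v for k, v in table.items() if k != self.column}

t = ColumnDropTransform({"model": "stub", "column": "pii"})
results, skipped = [], 0
for table in [{"pii": [1], "text": ["a"]}, {"text": ["b"]}]:
    try:
        results.append(t.transform(table))
    except KeyError:
        skipped += 1  # the framework would log the error and continue
```

An exception escaping `__init__` aborts before the loop ever runs, matching the first failure mode; an exception from `transform()` costs only the offending table.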