data-prep-toolkit 0.0.1.dev4__tar.gz → 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (114)
  1. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/Makefile +7 -6
  2. {data_prep_toolkit-0.0.1.dev4/src/data_prep_toolkit.egg-info → data_prep_toolkit-0.1.0}/PKG-INFO +1 -1
  3. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/doc/advanced-transform-tutorial.md +34 -17
  4. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/doc/architecture.md +14 -10
  5. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/doc/overview.md +7 -6
  6. data_prep_toolkit-0.1.0/doc/python-launcher-options.md +61 -0
  7. data_prep_toolkit-0.1.0/doc/python-runtime.md +12 -0
  8. data_prep_toolkit-0.0.1.dev4/doc/launcher-options.md → data_prep_toolkit-0.1.0/doc/ray-launcher-options.md +37 -42
  9. data_prep_toolkit-0.1.0/doc/ray-runtime.md +143 -0
  10. data_prep_toolkit-0.1.0/doc/simplest-transform-tutorial.md +221 -0
  11. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/doc/testing-e2e-transform.md +2 -1
  12. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/doc/transform-external-resources.md +1 -0
  13. data_prep_toolkit-0.1.0/doc/transform-runtimes.md +9 -0
  14. data_prep_toolkit-0.0.1.dev4/doc/testing-transforms.md → data_prep_toolkit-0.1.0/doc/transform-standalone-testing.md +1 -0
  15. data_prep_toolkit-0.1.0/doc/transform-testing.md +6 -0
  16. data_prep_toolkit-0.1.0/doc/transform-tutorial-examples.md +15 -0
  17. data_prep_toolkit-0.1.0/doc/transform-tutorials.md +70 -0
  18. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/doc/transformer-utilities.md +3 -1
  19. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/pyproject.toml +2 -2
  20. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0/src/data_prep_toolkit.egg-info}/PKG-INFO +1 -1
  21. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_prep_toolkit.egg-info/SOURCES.txt +38 -26
  22. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/data_access/arrow_s3.py +5 -17
  23. data_prep_toolkit-0.1.0/src/data_processing/runtime/__init__.py +4 -0
  24. data_prep_toolkit-0.1.0/src/data_processing/runtime/pure_python/__init__.py +4 -0
  25. data_prep_toolkit-0.1.0/src/data_processing/runtime/pure_python/runtime_configuration.py +24 -0
  26. {data_prep_toolkit-0.0.1.dev4/src/data_processing → data_prep_toolkit-0.1.0/src/data_processing/runtime}/pure_python/transform_launcher.py +13 -14
  27. {data_prep_toolkit-0.0.1.dev4/src/data_processing → data_prep_toolkit-0.1.0/src/data_processing/runtime}/pure_python/transform_orchestrator.py +11 -10
  28. data_prep_toolkit-0.1.0/src/data_processing/runtime/pure_python/transform_table_processor.py +53 -0
  29. data_prep_toolkit-0.1.0/src/data_processing/runtime/ray/__init__.py +9 -0
  30. data_prep_toolkit-0.0.1.dev4/src/data_processing/ray/transform_orchestrator_configuration.py → data_prep_toolkit-0.1.0/src/data_processing/runtime/ray/execution_configuration.py +2 -2
  31. data_prep_toolkit-0.1.0/src/data_processing/runtime/ray/runtime_configuration.py +38 -0
  32. {data_prep_toolkit-0.0.1.dev4/src/data_processing → data_prep_toolkit-0.1.0/src/data_processing/runtime}/ray/transform_launcher.py +19 -28
  33. {data_prep_toolkit-0.0.1.dev4/src/data_processing → data_prep_toolkit-0.1.0/src/data_processing/runtime}/ray/transform_orchestrator.py +9 -9
  34. {data_prep_toolkit-0.0.1.dev4/src/data_processing → data_prep_toolkit-0.1.0/src/data_processing/runtime}/ray/transform_runtime.py +0 -52
  35. data_prep_toolkit-0.1.0/src/data_processing/runtime/ray/transform_table_processor.py +46 -0
  36. data_prep_toolkit-0.1.0/src/data_processing/runtime/runtime_configuration.py +64 -0
  37. data_prep_toolkit-0.1.0/src/data_processing/runtime/transform_launcher.py +79 -0
  38. data_prep_toolkit-0.1.0/src/data_processing/runtime/transform_table_processor.py +176 -0
  39. data_prep_toolkit-0.1.0/src/data_processing/test_support/launch/__init__.py +0 -0
  40. {data_prep_toolkit-0.0.1.dev4/src/data_processing/test_support/ray → data_prep_toolkit-0.1.0/src/data_processing/test_support/launch}/transform_test.py +13 -11
  41. data_prep_toolkit-0.1.0/src/data_processing/test_support/transform/__init__.py +6 -0
  42. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/test_support/transform/noop_transform.py +38 -25
  43. data_prep_toolkit-0.1.0/src/data_processing/transform/__init__.py +3 -0
  44. data_prep_toolkit-0.0.1.dev4/src/data_processing/transform/launcher_configuration.py → data_prep_toolkit-0.1.0/src/data_processing/transform/transform_configuration.py +45 -3
  45. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/utils/transform_utils.py +36 -7
  46. data_prep_toolkit-0.1.0/test/data_processing_tests/launch/pure_python/launcher_test.py +169 -0
  47. data_prep_toolkit-0.1.0/test/data_processing_tests/launch/pure_python/multi_launcher_test.py +78 -0
  48. {data_prep_toolkit-0.0.1.dev4/test/data_processing_tests/ray → data_prep_toolkit-0.1.0/test/data_processing_tests/launch/pure_python}/test_noop_launch.py +8 -4
  49. {data_prep_toolkit-0.0.1.dev4/test/data_processing_tests/pure_python → data_prep_toolkit-0.1.0/test/data_processing_tests/launch/ray}/launcher_test.py +39 -76
  50. data_prep_toolkit-0.1.0/test/data_processing_tests/launch/ray/multi_launcher_test.py +80 -0
  51. {data_prep_toolkit-0.0.1.dev4/test/data_processing_tests → data_prep_toolkit-0.1.0/test/data_processing_tests/launch}/ray/ray_util_test.py +1 -1
  52. data_prep_toolkit-0.1.0/test/data_processing_tests/launch/ray/test_noop_launch.py +41 -0
  53. data_prep_toolkit-0.0.1.dev4/doc/logo-ibm-dark.png +0 -0
  54. data_prep_toolkit-0.0.1.dev4/doc/logo-ibm.png +0 -0
  55. data_prep_toolkit-0.0.1.dev4/doc/simplest-transform-tutorial.md +0 -198
  56. data_prep_toolkit-0.0.1.dev4/doc/transform-tutorials.md +0 -194
  57. data_prep_toolkit-0.0.1.dev4/src/data_processing/pure_python/__init__.py +0 -4
  58. data_prep_toolkit-0.0.1.dev4/src/data_processing/pure_python/python_launcher_configuration.py +0 -99
  59. data_prep_toolkit-0.0.1.dev4/src/data_processing/pure_python/transform_table_processor.py +0 -190
  60. data_prep_toolkit-0.0.1.dev4/src/data_processing/ray/__init__.py +0 -10
  61. data_prep_toolkit-0.0.1.dev4/src/data_processing/ray/transform_table_processor.py +0 -191
  62. data_prep_toolkit-0.0.1.dev4/src/data_processing/test_support/ray/__init__.py +0 -1
  63. data_prep_toolkit-0.0.1.dev4/src/data_processing/test_support/transform/__init__.py +0 -7
  64. data_prep_toolkit-0.0.1.dev4/src/data_processing/transform/__init__.py +0 -7
  65. data_prep_toolkit-0.0.1.dev4/test/data_processing_tests/ray/launcher_test.py +0 -287
  66. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/.gitignore +0 -0
  67. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/README.md +0 -0
  68. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/doc/processing-architecture.jpg +0 -0
  69. /data_prep_toolkit-0.0.1.dev4/doc/using_s3_transformers.md → /data_prep_toolkit-0.1.0/doc/transform-s3-testing.md +0 -0
  70. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/setup.cfg +0 -0
  71. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_prep_toolkit.egg-info/dependency_links.txt +0 -0
  72. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_prep_toolkit.egg-info/requires.txt +0 -0
  73. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_prep_toolkit.egg-info/top_level.txt +0 -0
  74. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/__init__.py +0 -0
  75. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/data_access/__init__.py +0 -0
  76. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/data_access/data_access.py +0 -0
  77. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/data_access/data_access_factory.py +0 -0
  78. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/data_access/data_access_factory_base.py +0 -0
  79. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/data_access/data_access_local.py +0 -0
  80. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/data_access/data_access_s3.py +0 -0
  81. {data_prep_toolkit-0.0.1.dev4/src/data_processing/transform → data_prep_toolkit-0.1.0/src/data_processing/runtime}/execution_configuration.py +0 -0
  82. {data_prep_toolkit-0.0.1.dev4/src/data_processing → data_prep_toolkit-0.1.0/src/data_processing/runtime}/ray/ray_utils.py +0 -0
  83. {data_prep_toolkit-0.0.1.dev4/src/data_processing → data_prep_toolkit-0.1.0/src/data_processing/runtime}/ray/transform_statistics.py +0 -0
  84. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/test_support/__init__.py +0 -0
  85. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/test_support/abstract_test.py +0 -0
  86. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/test_support/data_access/__init__.py +0 -0
  87. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/test_support/data_access/data_access_factory_test.py +0 -0
  88. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/test_support/transform/transform_test.py +0 -0
  89. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/transform/table_transform.py +0 -0
  90. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/transform/transform_statistics.py +0 -0
  91. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/utils/__init__.py +0 -0
  92. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/utils/cli_utils.py +0 -0
  93. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/utils/config.py +0 -0
  94. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/utils/log.py +0 -0
  95. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/src/data_processing/utils/params_utils.py +0 -0
  96. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/test/data_processing_tests/data_access/daf_local_test.py +0 -0
  97. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/test/data_processing_tests/data_access/data_access_local_test.py +0 -0
  98. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/test/data_processing_tests/data_access/data_access_s3_test.py +0 -0
  99. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/test/data_processing_tests/data_access/sample_input_data_test.py +0 -0
  100. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/test/data_processing_tests/transform/test_noop.py +0 -0
  101. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/test/data_processing_tests/util/transform_utils_test.py +0 -0
  102. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/test-data/data_processing/daf/input/ds1/sample1.parquet +0 -0
  103. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/test-data/data_processing/daf/input/ds1/sample2.parquet +0 -0
  104. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/test-data/data_processing/daf/input/ds2/sample3.parquet +0 -0
  105. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/test-data/data_processing/daf/output/ds1/sample1.parquet +0 -0
  106. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/test-data/data_processing/input/sample1.parquet +0 -0
  107. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/test-data/data_processing/input_multiple/sample1.parquet +0 -0
  108. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/test-data/data_processing/input_multiple/sample2.parquet +0 -0
  109. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/test-data/data_processing/input_multiple/sample3.parquet +0 -0
  110. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/test-data/data_processing/ray/noop/expected/metadata.json +0 -0
  111. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/test-data/data_processing/ray/noop/expected/sample1.parquet +0 -0
  112. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/test-data/data_processing/ray/noop/expected/subdir/test1.parquet +0 -0
  113. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/test-data/data_processing/ray/noop/input/sample1.parquet +0 -0
  114. {data_prep_toolkit-0.0.1.dev4 → data_prep_toolkit-0.1.0}/test-data/data_processing/ray/noop/input/subdir/test1.parquet +0 -0
@@ -53,10 +53,11 @@ venv:: pyproject.toml
53
53
  # pytest-forked was tried, but then we get SIGABRT in pytest when running the s3 tests, some of which are skipped..
54
54
  test::
55
55
  @# Help: Use the already-built virtual environment to run pytest on the test directory.
56
- source venv/bin/activate; export PYTHONPATH=../src; cd test; $(PYTEST) data_processing_tests/data_access;
57
- source venv/bin/activate; export PYTHONPATH=../src; cd test; $(PYTEST) data_processing_tests/transform;
58
- source venv/bin/activate; export PYTHONPATH=../src; cd test; $(PYTEST) data_processing_tests/pure_python;
59
- source venv/bin/activate; export PYTHONPATH=../src; cd test; $(PYTEST) data_processing_tests/ray/ray_util_test.py;
60
- source venv/bin/activate; export PYTHONPATH=../src; cd test; $(PYTEST) data_processing_tests/ray/launcher_test.py;
61
- source venv/bin/activate; export PYTHONPATH=../src; cd test; $(PYTEST) data_processing_tests/ray/test_noop_launch.py;
56
+ source venv/bin/activate; export PYTHONPATH=../src; cd test; $(PYTEST) data_processing_tests/data_access;
57
+ source venv/bin/activate; export PYTHONPATH=../src; cd test; $(PYTEST) data_processing_tests/transform;
58
+ source venv/bin/activate; export PYTHONPATH=../src; cd test; $(PYTEST) data_processing_tests/launch/pure_python/launcher_test.py;
59
+ source venv/bin/activate; export PYTHONPATH=../src; cd test; $(PYTEST) data_processing_tests/launch/pure_python/test_noop_launch.py;
60
+ source venv/bin/activate; export PYTHONPATH=../src; cd test; $(PYTEST) data_processing_tests/launch/ray/ray_util_test.py;
61
+ source venv/bin/activate; export PYTHONPATH=../src; cd test; $(PYTEST) data_processing_tests/launch/ray/launcher_test.py;
62
+ source venv/bin/activate; export PYTHONPATH=../src; cd test; $(PYTEST) data_processing_tests/launch/ray/test_noop_launch.py;
62
63
 
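For reference, any of the relocated test modules listed in the new Makefile recipe can also be run through pytest's Python entry point; a minimal sketch, mirroring the recipe's `PYTHONPATH=../src` and `cd test` steps (everything else is illustrative):

```python
# Minimal sketch: run one of the relocated test modules via pytest's Python API,
# mirroring the Makefile recipe above (PYTHONPATH=../src, run from the test directory).
import os
import sys

import pytest

os.chdir("test")
sys.path.insert(0, "../src")  # equivalent to: export PYTHONPATH=../src
sys.exit(pytest.main(["data_processing_tests/launch/ray/test_noop_launch.py"]))
```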
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.1
2
2
  Name: data_prep_toolkit
3
- Version: 0.0.1.dev4
3
+ Version: 0.1.0
4
4
  Summary: Data Preparation Toolkit Library
5
5
  Author-email: David Wood <dawood@us.ibm.com>, Boris Lublinsky <blublinsky@ibm.com>
6
6
  License: Apache-2.0
@@ -13,6 +13,7 @@ removes duplicate documents across all files. In this tutorial, we will show the
13
13
  the operation of our _noop_ transform.
14
14
 
15
15
  The complete task involves the following:
16
+
16
17
  * EdedupTransform - class that implements the specific transformation
17
18
  * EdedupRuntime - class that implements custom TransformRuntime to create supporting Ray objects and enhance job output
18
19
  statistics
@@ -39,6 +40,7 @@ First, let's define the transform class. To do this we extend
39
40
  the base abstract/interface class
40
41
  [AbstractTableTransform](../src/data_processing/transform/table_transform.py),
41
42
  which requires definition of the following:
43
+
42
44
  * an initializer (i.e. `init()`) that accepts a dictionary of configuration
43
45
  data. For this example, the configuration data will only be defined by
44
46
  command line arguments (defined below).
@@ -56,15 +58,17 @@ from typing import Any
56
58
 
57
59
  import pyarrow as pa
58
60
  import ray
59
- from data_processing.data_access import DataAccessFactory
60
- from data_processing.ray import (
61
- RayLauncherConfiguration,
61
+ from data_processing.data_access import DataAccessFactoryBase
62
+ from data_processing.runtime.ray import (
62
63
  DefaultTableTransformRuntimeRay,
63
- RayUtils,
64
64
  RayTransformLauncher,
65
+ RayUtils,
66
+ )
67
+ from data_processing.runtime.ray.runtime_configuration import (
68
+ RayTransformRuntimeConfiguration,
65
69
  )
66
- from data_processing.transform import AbstractTableTransform
67
- from data_processing.utils import GB, TransformUtils
70
+ from data_processing.transform import AbstractTableTransform, TransformConfiguration
71
+ from data_processing.utils import GB, CLIArgumentProvider, TransformUtils, get_logger
68
72
  from ray.actor import ActorHandle
69
73
 
70
74
 
@@ -136,8 +140,9 @@ If there is no metadata then simply return an empty dictionary.
136
140
 
137
141
  First, let's define the transform runtime class. To do this we extend
138
142
  the base abstract/interface class
139
- [DefaultTableTransformRuntime](../src/data_processing/ray/transform_runtime.py),
143
+ [DefaultTableTransformRuntime](../src/data_processing/runtime/ray/transform_runtime.py),
140
144
  which requires definition of the following:
145
+
141
146
  * an initializer (i.e. `init()`) that accepts a dictionary of configuration
142
147
  data. For this example, the configuration data will only be defined by
143
148
  command line arguments (defined below).
@@ -202,8 +207,10 @@ collected by hash actors and custom computations based on statistics data.
202
207
 
203
208
  ## EdedupTableTransformConfiguration
204
209
 
205
- The final class we need to implement is `EdedupTableTransformConfiguration` class and its initializer that
206
- define the following:
210
+ The final class we need to implement is the `EdedupRayTransformConfiguration` class, which provides the configuration for
211
+ running our transform. Although we provide only a Ray-based implementation, the Ray-based configuration relies on a Python-based
212
+ configuration that we need to define first. So we first define the `EdedupTableTransformConfiguration` class,
213
+ defining the following:
207
214
 
208
215
  * The short name for the transform
209
216
  * The class implementing the transform - in our case EdedupTransform
@@ -216,10 +223,12 @@ First we define the class and its initializer,
216
223
  short_name = "ededup"
217
224
  cli_prefix = f"{short_name}_"
218
225
 
219
- class EdedupTableTransformConfiguration(DefaultTableTransformConfiguration):
220
- def __init__(self):
221
- super().__init__(name=short_name, runtime_class=EdedupRuntime, transform_class=EdedupTransform)
222
- self.params = {}
226
+ class EdedupTableTransformConfiguration(TransformConfiguration):
227
+     def __init__(self):
228
+         super().__init__(
229
+             name=short_name,
230
+             transform_class=EdedupTransform,
231
+         )
223
232
  ```
224
233
 
225
234
  The initializer extends the DefaultTableTransformConfiguration which provides simple
@@ -253,6 +262,13 @@ and which allows us to capture the `EdedupTransform`-specific arguments and opti
253
262
  logger.info(f"exact dedup params are {self.params}")
254
263
  return True
255
264
  ```
265
+ Now we can implement `EdedupRayTransformConfiguration` with the following code:
266
+ ```python
267
+ class EdedupRayTransformConfiguration(RayTransformConfiguration):
268
+     def __init__(self):
269
+         super().__init__(transform_config=EdedupTableTransformConfiguration(), runtime_class=EdedupRuntime)
270
+
271
+ ```
256
272
 
257
273
  ## main()
258
274
 
@@ -261,11 +277,12 @@ framework's `TransformLauncher` class.
261
277
 
262
278
  ```python
263
279
  if __name__ == "__main__":
264
- launcher = TransformLauncher(transform_runtime_config=EdedupTransformConfiguration())
265
- launcher.launch()
280
+ launcher = RayTransformLauncher(EdedupRayTransformConfiguration())
281
+ launcher.launch()
266
282
  ```
283
+
267
284
  The launcher requires only an instance of DefaultTableTransformConfiguration
268
- (our `EdedupTransformConfiguration` class).
285
+ (our `EdedupRayTransformConfiguration` class).
269
286
  A single method `launch()` is then invoked to run the transform in a Ray cluster.
270
287
 
271
288
  ## Running
@@ -280,5 +297,5 @@ python ededup_transform.py --hash_cpu 0.5 --num_hashes 2 --doc_column "contents"
280
297
  --s3_config "{'input_folder': 'cos-optimal-llm-pile/test/david/input/', 'output_folder': 'cos-optimal-llm-pile/test/david/output/'}"
281
298
  ```
282
299
  This is a minimal set of options to run locally.
283
- See the [launcher options](launcher-options.md) for a complete list of
300
+ See the [launcher options](ray-launcher-options.md) for a complete list of
284
301
  transform-independent command line options.
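Pulling the configuration fragments above together, the following is a hedged sketch of the transform-specific argument handling; it reuses the names defined earlier in the tutorial, and the option names are taken from the run example above while the defaults and help text are illustrative:

```python
# Sketch only: the transform-specific CLI handling discussed above.
# TransformConfiguration, EdedupTransform, short_name and logger are defined
# earlier in the tutorial; option defaults and help strings are assumptions.
from argparse import ArgumentParser, Namespace


class EdedupTableTransformConfiguration(TransformConfiguration):
    def __init__(self):
        super().__init__(name=short_name, transform_class=EdedupTransform)

    def add_input_params(self, parser: ArgumentParser) -> None:
        # Register the ededup-specific command line options.
        parser.add_argument("--hash_cpu", type=float, default=0.5, help="CPUs per hash actor")
        parser.add_argument("--num_hashes", type=int, default=2, help="number of hash actors")
        parser.add_argument("--doc_column", type=str, default="contents", help="column holding the documents")

    def apply_input_params(self, args: Namespace) -> bool:
        # Capture the parsed values for use by the runtime and the transform.
        self.params = {"hash_cpu": args.hash_cpu, "num_hashes": args.num_hashes, "doc_column": args.doc_column}
        logger.info(f"exact dedup params are {self.params}")
        return True
```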
@@ -13,10 +13,10 @@ process many input files in parallel using a distribute network of RayWorkers.
13
13
 
14
14
  The architecture includes the following core components:
15
15
 
16
- * [RayLauncher](../src/data_processing/ray/transform_launcher.py) accepts and validates
16
+ * [RayLauncher](../src/data_processing/runtime/ray/transform_launcher.py) accepts and validates
17
17
  CLI parameters to establish the Ray Orchestrator with the proper configuration.
18
18
  It uses the following components, all of which can/do define CLI configuration parameters.:
19
- * [Transform Orchestrator Configuration](../src/data_processing/ray/transform_orchestrator_configuration.py) is responsible
19
+ * [Transform Orchestrator Configuration](../src/data_processing/runtime/ray/execution_configuration.py) is responsible
20
20
  for defining and validating infrastructure parameters
21
21
  (e.g., number of workers, memory and cpu, local or remote cluster, etc.). This class has very simple state
22
22
  (several dictionaries) and is fully pickleable. As a result framework uses its instance as a
@@ -25,27 +25,29 @@ It uses the following components, all of which can/do define CLI configuration p
25
25
  configuration for the type of DataAccess to use when reading/writing the input/output data for
26
26
  the transforms. Similar to Transform Orchestrator Configuration, this is a pickleable
27
27
  instance that is passed between Launcher, Orchestrator and Workers.
28
- * [TransformConfiguration](../src/data_processing/ray/transform_runtime.py) - defines specifics
28
+ * [TransformConfiguration](../src/data_processing/runtime/ray/transform_runtime.py) - defines specifics
29
29
  of the transform implementation including transform implementation class, its short name, any transform-
30
30
  specific CLI parameters, and an optional TransformRuntime class, discussed below.
31
31
 
32
32
  After all parameters are validated, the ray cluster is started and the DataAccessFactory, TransformOrchestratorConfiguraiton
33
33
  and TransformConfiguration are given to the Ray Orchestrator, via Ray remote() method invocation.
34
34
  The Launcher waits for the Ray Orchestrator to complete.
35
- * [Ray Orchestrator](../src/data_processing/ray/transform_orchestrator.py) is responsible for overall management of
35
+
36
+ * [Ray Orchestrator](../src/data_processing/runtime/ray/transform_orchestrator.py) is responsible for overall management of
36
37
  the data processing job. It creates the actors, determines the set of input data and distributes the
37
38
  references to the data files to be processed by the workers. More specifically, it performs the following:
39
+
38
40
  1. Uses the DataAccess instance created by the DataAccessFactory to determine the set of the files
39
41
  to be processed.
40
42
  2. uses the TransformConfiguration to create the TransformRuntime instance
41
43
  3. Uses the TransformRuntime to optionally apply additional configuration (ray object storage, etc) for the configuration
42
44
  and operation of the Transform.
43
- 3. uses the TransformOrchestratorConfiguration to determine the set of RayWorkers to create
45
+ 4. uses the TransformOrchestratorConfiguration to determine the set of RayWorkers to create
44
46
  to execute transformers in parallel, providing the following to each worker:
45
47
  * Ray worker configuration
46
48
  * DataAccessFactory
47
49
  * Transform class and its TransformConfiguration containing the CLI parameters and any TransformRuntime additions.
48
- 4. in a load-balanced, round-robin fashion, distributes the names of the input files to the workers for them to transform/process.
50
+ 5. in a load-balanced, round-robin fashion, distributes the names of the input files to the workers for them to transform/process.
49
51
 
50
52
  Additionally, to provide monitoring of long-running transforms, the orchestrator is instrumented with
51
53
  [custom metrics](https://docs.ray.io/en/latest/ray-observability/user-guides/add-app-metrics.html), that are exported to localhost:8080 (this is the endpoint that
@@ -53,12 +55,14 @@ It uses the following components, all of which can/do define CLI configuration p
53
55
  Once all data is processed, the orchestrator will collect execution statistics (from the statistics actor)
54
56
  and build and save it in the form of execution metadata (`metadata.json`). Finally, it will return the execution
55
57
  result to the Launcher.
56
- * [Ray worker](../src/data_processing/ray/transform_table_processor.py) is responsible for
58
+
59
+ * [Ray worker](../src/data_processing/runtime/ray/transform_table_processor.py) is responsible for
57
60
  reading files (as [PyArrow Tables](https://levelup.gitconnected.com/deep-dive-into-pyarrow-understanding-its-features-and-benefits-2cce8b1466c8))
58
61
  assigned by the orchestrator, applying the transform to the input table and writing out the
59
62
  resulting table(s). Metadata produced by each table transformation is aggregated into
60
63
  Transform Statistics (below).
61
- * [Transform Statistics](../src/data_processing/ray/transform_statistics.py) is a general
64
+
65
+ * [Transform Statistics](../src/data_processing/runtime/ray/transform_statistics.py) is a general
62
66
  purpose data collector actor aggregating the numeric metadata from different places of
63
67
  the framework (especially metadata produced by the transform).
64
68
  These statistics are reported as metadata (`metadata.json`) by the orchestrator upon completion.
@@ -92,10 +96,10 @@ For a more complete discussion, see the [tutorials](transform-tutorials.md).
92
96
  of any transform implementation - `transform()` and `flush()` - and provides the bulk of any transform implementation
93
97
  convert one Table to 0 or more new Tables. In general, this is not tied to the above Ray infrastructure
94
98
  and so can usually be used independent of Ray.
95
- * [TransformRuntime ](../src/data_processing/ray/transform_runtime.py) - this class only needs to be
99
+ * [TransformRuntime ](../src/data_processing/runtime/ray/transform_runtime.py) - this class only needs to be
96
100
  extended/implemented when additional Ray components (actors, shared memory objects, etc.) are used
97
101
  by the transform. The main method `get_transform_config()` is used to enable these extensions.
98
- * [TransformConfiguration](../src/data_processing/ray/transform_runtime.py) - this is the bootstrap
102
+ * [TransformConfiguration](../src/data_processing/runtime/ray/transform_runtime.py) - this is the bootstrap
99
103
  class provided to the Launcher that enables the instantiation of the Transform and the TransformRuntime within
100
104
  the architecture. It is a CLIProvider, which allows it to define transform-specific CLI configuration
101
105
  that is made available to the Transform's initializer.
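The load-balanced, round-robin distribution of input files described in the orchestrator steps above can be illustrated with a small self-contained sketch (illustrative only, not the toolkit's orchestrator code; all names are hypothetical):

```python
# Illustrative only: round-robin assignment of input files to workers, as
# described for the Ray Orchestrator above. Not toolkit code.
def distribute_round_robin(files: list[str], num_workers: int) -> list[list[str]]:
    assignments: list[list[str]] = [[] for _ in range(num_workers)]
    for i, name in enumerate(files):
        # Each worker receives every num_workers-th file, keeping the load balanced.
        assignments[i % num_workers].append(name)
    return assignments


# Example: five input files spread across two workers.
print(distribute_round_robin([f"sample{i}.parquet" for i in range(5)], 2))
# [['sample0.parquet', 'sample2.parquet', 'sample4.parquet'], ['sample1.parquet', 'sample3.parquet']]
```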
@@ -12,16 +12,17 @@ developers of data transformation are:
12
12
 
13
13
  * [Transformation](../src/data_processing/transform/table_transform.py) - a simple, easily-implemented interface defines
14
14
  the specifics of a given data transformation.
15
- * [Transform Configuration](../src/data_processing/ray/transform_runtime.py) - defines
16
- the transform implementation and runtime classes, the
17
- command line arguments specific to transform, and the short name for the transform.
18
- * [Transformation Runtime](../src/data_processing/ray/transform_runtime.py) - allows for customization of the Ray environment for the transformer.
19
- This might include provisioning of shared memory objects or creation of additional actors.
15
+ * [Transform Configuration](../src/data_processing/runtime/ray/transform_runtime.py) - defines
16
+ the transform short name, its implementation class, and command line configuration
17
+ parameters.
20
18
 
21
19
  To learn more consider the following:
22
20
 
23
21
  * [Transform Tutorials](transform-tutorials.md)
24
- * [Testing transformers with S3](using_s3_transformers.md)
22
+ * [Transform Runtimes](transform-runtimes.md)
23
+ * [Transform Examples](transform-tutorial-examples.md)
24
+ * [Testing Transforms](transform-testing.md)
25
+ * [Utilities](transformer-utilities.md)
25
26
  * [Architecture Deep Dive](architecture.md)
26
27
  * [Transform project root readme](../../transforms/README.md)
27
28
 
@@ -0,0 +1,61 @@
1
+ # Pure Python Launcher Command Line Options
2
+
3
+ A number of command line options are available when launching a transform with the pure Python runtime.
4
+
5
+ The following is a current --help output (a work in progress) for
6
+ the `NOOPTransform` (note the --noop_sleep_sec option):
7
+
8
+ ```
9
+ usage: noop_python_runtime.py [-h] [--noop_sleep_sec NOOP_SLEEP_SEC] [--noop_pwd NOOP_PWD] [--data_s3_cred DATA_S3_CRED] [--data_s3_config DATA_S3_CONFIG] [--data_local_config DATA_LOCAL_CONFIG] [--data_max_files DATA_MAX_FILES]
10
+ [--data_checkpointing DATA_CHECKPOINTING] [--data_data_sets DATA_DATA_SETS] [--data_files_to_use DATA_FILES_TO_USE] [--data_num_samples DATA_NUM_SAMPLES] [--runtime_pipeline_id RUNTIME_PIPELINE_ID]
11
+ [--runtime_job_id RUNTIME_JOB_ID] [--runtime_code_location RUNTIME_CODE_LOCATION]
12
+
13
+ Driver for noop processing
14
+
15
+ options:
16
+ -h, --help show this help message and exit
17
+ --noop_sleep_sec NOOP_SLEEP_SEC
18
+ Sleep actor for a number of seconds while processing the data frame, before writing the file to COS
19
+ --noop_pwd NOOP_PWD A dummy password which should be filtered out of the metadata
20
+ --data_s3_cred DATA_S3_CRED
21
+ AST string of options for s3 credentials. Only required for S3 data access.
22
+ access_key: access key help text
23
+ secret_key: secret key help text
24
+ url: optional s3 url
25
+ region: optional s3 region
26
+ Example: { 'access_key': 'access', 'secret_key': 'secret',
27
+ 'url': 'https://s3.us-east.cloud-object-storage.appdomain.cloud',
28
+ 'region': 'us-east-1' }
29
+ --data_s3_config DATA_S3_CONFIG
30
+ AST string containing input/output paths.
31
+ input_folder: Path to input folder of files to be processed
32
+ output_folder: Path to output folder of processed files
33
+ Example: { 'input_folder': 's3-path/your-input-bucket',
34
+ 'output_folder': 's3-path/your-output-bucket' }
35
+ --data_local_config DATA_LOCAL_CONFIG
36
+ ast string containing input/output folders using local fs.
37
+ input_folder: Path to input folder of files to be processed
38
+ output_folder: Path to output folder of processed files
39
+ Example: { 'input_folder': './input', 'output_folder': '/tmp/output' }
40
+ --data_max_files DATA_MAX_FILES
41
+ Max amount of files to process
42
+ --data_checkpointing DATA_CHECKPOINTING
43
+ checkpointing flag
44
+ --data_data_sets DATA_DATA_SETS
45
+ List of sub-directories of input directory to use for input. For example, ['dir1', 'dir2']
46
+ --data_files_to_use DATA_FILES_TO_USE
47
+ list of file extensions to choose for input.
48
+ --data_num_samples DATA_NUM_SAMPLES
49
+ number of random input files to process
50
+ --runtime_pipeline_id RUNTIME_PIPELINE_ID
51
+ pipeline id
52
+ --runtime_job_id RUNTIME_JOB_ID
53
+ job id
54
+ --runtime_code_location RUNTIME_CODE_LOCATION
55
+ AST string containing code location
56
+ github: Github repository URL.
57
+ commit_hash: github commit hash
58
+ path: Path within the repository
59
+ Example: { 'github': 'https://github.com/somerepo', 'commit_hash': '1324',
60
+ 'path': 'transforms/universal/code' }
61
+ ```
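The "AST string" values accepted by the options above are Python dict literals; the following generic snippet (plain Python, not toolkit code) shows how such a value parses, which can help when debugging shell quoting:

```python
# Generic illustration (not toolkit code): the AST-string options above are
# Python dict literals and can be sanity-checked with ast.literal_eval.
import ast

local_config = "{ 'input_folder': './input', 'output_folder': '/tmp/output' }"
parsed = ast.literal_eval(local_config)
print(parsed["output_folder"])  # -> /tmp/output
```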
@@ -0,0 +1,12 @@
1
+ ## Python Runtime
2
+ The python runtime provides a simple mechanism to run a transform on a set of input data to produce
3
+ a set of output data, all within the python execution environment.
4
+
5
+ A `PythonTransformLauncher` class is provided that enables the running of the transform. For example,
6
+
7
+ ```python
8
+ launcher = PythonTransformLauncher(YourTransformConfiguration())
9
+ launcher.launch()
10
+ ```
11
+ The `YourTransformConfiguration` class configures your transform.
12
+ More details can be found in the [transform tutorial](transform-tutorials.md).
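A slightly fuller sketch of the launcher usage above, driving the bundled noop transform over a local folder; the import paths and the configuration class name are assumptions based on the file layout in this diff, not verified API:

```python
# Sketch only: module paths and the NOOP configuration class name are assumed
# from the 0.1.0 layout shown in this diff.
import sys

from data_processing.runtime.pure_python import PythonTransformLauncher
from data_processing.test_support.transform import NOOPTransformConfiguration

if __name__ == "__main__":
    # Supply the data-access options documented in python-launcher-options.md.
    sys.argv = [
        "noop_local.py",
        "--data_local_config",
        "{'input_folder': 'test-data/data_processing/input', 'output_folder': '/tmp/noop-output'}",
    ]
    PythonTransformLauncher(NOOPTransformConfiguration()).launch()
```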
@@ -1,29 +1,16 @@
1
- # Launcher Command Line Options
1
+ # Ray Launcher Command Line Options
2
2
  A number of command line options are available when launching a transform.
3
3
 
4
4
  The following is a current --help output (a work in progress) for
5
- the `NOOPTransform` (note the --noop_sleep_sec option):
5
+ the `NOOPTransform` (note the --noop_sleep_sec and --noop_pwd options):
6
6
 
7
7
  ```
8
- usage: noop_transform.py [-h]
9
- [--run_locally RUN_LOCALLY]
10
- [--noop_sleep_sec NOOP_SLEEP_SEC]
11
- [--data_s3_cred DATA_S3_CRED]
12
- [--data_s3_config DATA_S3_CONFIG]
13
- [--data_local_config DATA_LOCAL_CONFIG]
14
- [--data_max_files DATA_MAX_FILES]
15
- [--data_checkpointing DATA_CHECKPOINTING]
16
- [--data_data_sets DATA_DATA_SETS]
17
- [--data_max_files MAX_FILES]
18
- [--data_files_to_use DATA_FILES_TO_USE]
19
- [--data_num_samples DATA_NUM_SAMPLES]
20
- [--runtime_num_workers NUM_WORKERS]
21
- [--runtime_worker_options WORKER_OPTIONS]
22
- [--runtime_pipeline_id PIPELINE_ID] [--job_id JOB_ID]
23
- [--runtime_creation_delay CREATION_DELAY]
24
- [--runtime_code_location CODE_LOCATION]
8
+ usage: noop_transform.py [-h] [--run_locally RUN_LOCALLY] [--noop_sleep_sec NOOP_SLEEP_SEC] [--noop_pwd NOOP_PWD] [--data_s3_cred DATA_S3_CRED] [--data_s3_config DATA_S3_CONFIG] [--data_local_config DATA_LOCAL_CONFIG]
9
+ [--data_max_files DATA_MAX_FILES] [--data_checkpointing DATA_CHECKPOINTING] [--data_data_sets DATA_DATA_SETS] [--data_files_to_use DATA_FILES_TO_USE] [--data_num_samples DATA_NUM_SAMPLES]
10
+ [--runtime_num_workers RUNTIME_NUM_WORKERS] [--runtime_worker_options RUNTIME_WORKER_OPTIONS] [--runtime_creation_delay RUNTIME_CREATION_DELAY] [--runtime_pipeline_id RUNTIME_PIPELINE_ID]
11
+ [--runtime_job_id RUNTIME_JOB_ID] [--runtime_code_location RUNTIME_CODE_LOCATION]
25
12
 
26
- Driver for NOOP processing
13
+ Driver for noop processing
27
14
 
28
15
  options:
29
16
  -h, --help show this help message and exit
@@ -31,35 +18,40 @@ options:
31
18
  running ray local flag
32
19
  --noop_sleep_sec NOOP_SLEEP_SEC
33
20
  Sleep actor for a number of seconds while processing the data frame, before writing the file to COS
34
- --data_s3_cred S3_CRED
35
- AST string of options for cos credentials. Only required for COS or Lakehouse.
21
+ --noop_pwd NOOP_PWD A dummy password which should be filtered out of the metadata
22
+ --data_s3_cred DATA_S3_CRED
23
+ AST string of options for s3 credentials. Only required for S3 data access.
36
24
  access_key: access key help text
37
25
  secret_key: secret key help text
38
- cos_url: COS url
39
- Example: { 'access_key': 'access', 'secret_key': 'secret', 's3_url': 'https://s3.us-east.cloud-object-storage.appdomain.cloud' }
40
- --data_s3_config S3_CONFIG
26
+ url: optional s3 url
27
+ region: optional s3 region
28
+ Example: { 'access_key': 'access', 'secret_key': 'secret',
29
+ 'url': 'https://s3.us-east.cloud-object-storage.appdomain.cloud',
30
+ 'region': 'us-east-1' }
31
+ --data_s3_config DATA_S3_CONFIG
41
32
  AST string containing input/output paths.
42
33
  input_folder: Path to input folder of files to be processed
43
34
  output_folder: Path to output folder of processed files
44
- Example: { 'input_folder': 'your input folder', 'output_folder ': 'your output folder' }
45
- --data_local_config LOCAL_CONFIG
35
+ Example: { 'input_folder': 's3-path/your-input-bucket',
36
+ 'output_folder': 's3-path/your-output-bucket' }
37
+ --data_local_config DATA_LOCAL_CONFIG
46
38
  ast string containing input/output folders using local fs.
47
39
  input_folder: Path to input folder of files to be processed
48
40
  output_folder: Path to output folder of processed files
49
41
  Example: { 'input_folder': './input', 'output_folder': '/tmp/output' }
50
- --data_max_files MAX_FILES
42
+ --data_max_files DATA_MAX_FILES
51
43
  Max amount of files to process
52
- --data_checkpointing CHECKPOINTING
44
+ --data_checkpointing DATA_CHECKPOINTING
53
45
  checkpointing flag
54
- --data_data_sets DATA_SETS
55
- List of data sets
46
+ --data_data_sets DATA_DATA_SETS
47
+ List of sub-directories of input directory to use for input. For example, ['dir1', 'dir2']
56
48
  --data_files_to_use DATA_FILES_TO_USE
57
- files extensions to use, default .parquet
49
+ list of file extensions to choose for input.
58
50
  --data_num_samples DATA_NUM_SAMPLES
59
- number of randomply picked files to use
60
- --runtime_num_workers NUM_WORKERS
51
+ number of random input files to process
52
+ --runtime_num_workers RUNTIME_NUM_WORKERS
61
53
  number of workers
62
- --runtime_worker_options WORKER_OPTIONS
54
+ --runtime_worker_options RUNTIME_WORKER_OPTIONS
63
55
  AST string defining worker resource requirements.
64
56
  num_cpus: Required number of CPUs.
65
57
  num_gpus: Required number of GPUs
@@ -69,16 +61,19 @@ options:
69
61
  placement_group_bundle_index, placement_group_capture_child_tasks, resources, runtime_env,
70
62
  scheduling_strategy, _metadata, concurrency_groups, lifetime, max_concurrency, max_restarts,
71
63
  max_task_retries, max_pending_calls, namespace, get_if_exists
72
- Example: { 'num_cpus': '8', 'num_gpus': '1', 'resources': '{"special_hardware": 1, "custom_label": 1}' }
73
- --runtime_pipeline_id PIPELINE_ID
74
- pipeline id
75
- --runtime_job_id JOB_ID job id
76
- --runtime_creation_delay CREATION_DELAY
64
+ Example: { 'num_cpus': '8', 'num_gpus': '1',
65
+ 'resources': '{"special_hardware": 1, "custom_label": 1}' }
66
+ --runtime_creation_delay RUNTIME_CREATION_DELAY
77
67
  delay between actor' creation
78
- --runtime_code_location CODE_LOCATION
68
+ --runtime_pipeline_id RUNTIME_PIPELINE_ID
69
+ pipeline id
70
+ --runtime_job_id RUNTIME_JOB_ID
71
+ job id
72
+ --runtime_code_location RUNTIME_CODE_LOCATION
79
73
  AST string containing code location
80
74
  github: Github repository URL.
81
75
  commit_hash: github commit hash
82
76
  path: Path within the repository
83
- Example: { 'github': 'https://github.com/somerepo', 'commit_hash': '13241231asdfaed', 'path': 'transforms/universal/ededup' }
77
+ Example: { 'github': 'https://github.com/somerepo', 'commit_hash': '1324',
78
+ 'path': 'transforms/universal/code' }
84
79
  ```
@@ -0,0 +1,143 @@
1
+ # Ray Runtime
2
+ The Ray runtime includes the following set of components:
3
+
4
+ * [RayTransformLauncher](../src/data_processing/runtime/ray/transform_launcher.py) - this is a
5
+ class generally used to implement `main()` that makes use of a `TransformConfiguration` to
6
+ start the Ray runtime and execute the transform over the specified set of input files.
7
+ The RayTransformLauncher is created using a `RayTransformConfiguration` instance.
8
+ * [RayTransformConfiguration](../src/data_processing/runtime/ray/runtime_configuration.py) - this
9
+ class extends the transform's base TransformConfiguration implementation to add an optional
10
+ `TransformRuntime` (see next) class to be used by the transform implementation.
11
+ * [TransformRuntime](../src/data_processing/runtime/ray/transform_runtime.py) -
12
+ this provides the ability for the transform implementor to create additional Ray resources
13
+ and include them in the configuration used to create a transform
14
+ (see, for example, [exact dedup](../../transforms/universal/ededup/src/ededup_transform.py)).
15
+ This also provides the ability to supplement the statistics collected by
16
+ [Statistics](../src/data_processing/runtime/ray/transform_statistics.py) (see below).
17
+
18
+ Roughly speaking, the following steps are completed to establish transforms in the RayWorkers:
19
+
20
+ 1. Launcher parses the CLI parameters using an ArgumentParser configured with its own CLI parameters
21
+ along with those of the Transform Configuration,
22
+ 2. Launcher passes the Transform Configuration and CLI parameters to the [RayOrchestrator](../src/data_processing/runtime/ray/transform_orchestrator.py)
23
+ 3. RayOrchestrator creates the Transform Runtime using the Transform Configuration and its CLI parameter values
24
+ 4. Transform Runtime creates transform initialization/configuration including the CLI parameters,
25
+ and any Ray components needed by the transform.
26
+ 5. [RayWorker](../src/data_processing/runtime/ray/transform_table_processor.py) is started with configuration from the Transform Runtime.
27
+ 6. RayWorker creates the Transform using the configuration provided by the Transform Runtime.
28
+ 7. Statistics is used to collect the statistics submitted by the individual transforms, which
29
+ are used for building execution metadata.
30
+
31
+ ![Processing Architecture](processing-architecture.jpg)
32
+
33
+ ## Ray Transform Launcher
34
+ The [RayTransformLauncher](../src/data_processing/runtime/ray/transform_launcher.py) uses the Transform Configuration
35
+ and provides a single method, `launch()`, that kicks off the Ray environment and transform execution coordinated
36
+ by [orchestrator](../src/data_processing/runtime/ray/transform_orchestrator.py).
37
+ For example,
38
+ ```python
39
+ launcher = RayTransformLauncher(YourTransformConfiguration())
40
+ launcher.launch()
41
+ ```
42
+ Note that the launcher defines some additional CLI parameters that are used to control the operation of the
43
+ [orchestrator and workers](../src/data_processing/runtime/ray/execution_configuration.py) and
44
+ [data access](../src/data_processing/data_access/data_access_factory.py), such as the data access configuration,
45
+ number of workers, worker resources, etc.
46
+ Discussion of these options is beyond the scope of this document
47
+ (see [Launcher Options](ray-launcher-options.md) for a list of available options).
48
+
49
+ ## Ray Transform Configuration
50
+ In general, a transform should be able to run in both the python and Ray runtimes.
51
+ As such we first define the python-only transform configuration, which will then
52
+ be used by the Ray-runtime-specific transform configuration.
53
+ The python transform configuration implements
54
+ [TransformConfiguration](../src/data_processing/runtime/runtime_configuration.py)
55
+ and defines the transform-specific name and implementation
56
+ class. In addition, it is responsible for providing transform-specific
57
+ methods to define and capture optional command line arguments.
58
+ ```python
59
+
60
+ class YourTransformConfiguration(TransformConfiguration):
61
+
62
+     def __init__(self):
63
+         super().__init__(name="YourTransform", transform_class=YourTransform)
64
+         self.params = {}
65
+
66
+     def add_input_params(self, parser: ArgumentParser) -> None:
67
+         ...
68
+     def apply_input_params(self, args: Namespace) -> bool:
69
+         ...
70
+ ```
71
+ Next we define the Ray-runtime-specific transform configuration as an extension of
72
+ the RayTransformConfiguration that uses the `YourTransformConfiguration` above.
73
+ ```python
74
+
75
+ class YourRayTransformConfiguration(RayTransformConfiguration):
76
+     def __init__(self):
77
+         super().__init__(YourTransformConfiguration(),
78
+                          runtime_class=YourTransformRuntime)
79
+ ```
80
+ This class provides the ability to create the instance of `YourTransformRuntime` class (see below)
81
+ as needed by the Ray runtime. Note that not all transforms will require a `runtime_class`
82
+ and can omit this parameter to default to an acceptable runtime class.
83
+ Details are covered in the [advanced transform tutorial](advanced-transform-tutorial.md).
84
+
85
+ ## Transform Runtime
86
+ The
87
+ [DefaultTableTransformRuntime](../src/data_processing/runtime/ray/transform_runtime.py)
88
+ class is provided and will be
89
+ sufficient for many use cases, especially 1:1 table transformation.
90
+ However, some transforms will require use of the Ray environment, for example,
91
+ to create additional workers, establish a shared memory object, etc.
92
+ Of course, these transforms will generally not run outside of the Ray environment.
93
+
94
+ ```python
95
+ class DefaultTableTransformRuntime:
96
+
97
+     def __init__(self, params: dict[str, Any]):
98
+         ...
99
+
100
+     def get_transform_config(
101
+         self, data_access_factory: DataAccessFactory, statistics: ActorHandle, files: list[str]
102
+     ) -> dict[str, Any]:
103
+         ...
104
+     def compute_execution_stats(self, stats: dict[str, Any]) -> dict[str, Any]:
105
+         ...
106
+ ...
107
+ ```
108
+
109
+ The RayOrchestrator initializes the instance with the CLI parameters provided by the Transform Configuration's
110
+ `get_input_params()` method.
111
+
112
+ The `get_transform_config()` method is used by the RayOrchestrator to create the parameters
113
+ used to initialize the Transform in the RayWorker.
114
+ This is where additional Ray components would be added to the environment
115
+ and references added to them, as needed, in the returned dictionary of configuration data
116
+ that will initialize the transform.
117
+ For those transforms that don't need this support, the default implementation
118
+ simply returns the CLI parameters used to initialize the runtime instance.
119
+
120
+ The `compute_execution_stats()` method provides an opportunity to augment the statistics
121
+ collected and aggregated by the TransformStatistics actor. It is called by the RayOrchestrator
122
+ after all files have been processed.
123
+
124
+ ## Exceptions
125
+ A transform may find that it needs to signal error conditions.
126
+ For example, if a referenced model could not be loaded or
127
+ a given table does not have the expected column.
128
+ In general, it should identify such conditions by raising an exception.
129
+ With this in mind, there are two types of exceptions:
130
+
131
+ 1. Those that would not allow any tables to be processed (e.g. model loading problem).
132
+ 2. Those that would not allow a specific table to be processed (e.g. missing column).
133
+
134
+ In the first situation the transform should throw an exception from the initializer, which
135
+ will cause the Ray framework to terminate processing of all tables.
136
+ In the second situation (identified in the `transform()` or `flush()` methods), the transform
137
+ should throw an exception from the associated method. This will cause only the
138
+ error-causing
139
+ table to be ignored and not written out, but allow continued processing of tables by
140
+ the transform.
141
+ In both cases, the framework will log the exception as an error.
142
+
143
+
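To make the two exception patterns above concrete, here is a hedged sketch; the transform class, the checked column name and the `transform()` signature are illustrative (the signature is assumed from the tutorials elsewhere in this diff):

```python
# Illustrative sketch of the two failure modes described above; the class name,
# column name and method signatures are assumptions, not verified toolkit API.
from typing import Any

import pyarrow as pa

from data_processing.transform import AbstractTableTransform


class MyTransform(AbstractTableTransform):
    def __init__(self, config: dict[str, Any]):
        super().__init__(config)
        if config.get("model_path") is None:
            # Case 1: nothing can be processed, so fail fast from the initializer;
            # the framework then terminates processing of all tables.
            raise ValueError("model_path is required")

    def transform(self, table: pa.Table) -> tuple[list[pa.Table], dict[str, Any]]:
        if "contents" not in table.column_names:
            # Case 2: only this table is unusable, so raise from transform();
            # the framework logs the error, skips this table and continues.
            raise ValueError("input table has no 'contents' column")
        return [table], {}
```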