easylink 0.1.14__tar.gz → 0.1.16__tar.gz

Files changed (217)
  1. {easylink-0.1.14 → easylink-0.1.16}/CHANGELOG.rst +8 -0
  2. {easylink-0.1.14 → easylink-0.1.16}/PKG-INFO +9 -7
  3. {easylink-0.1.14 → easylink-0.1.16}/README.rst +7 -6
  4. easylink-0.1.16/docs/source/concepts/pipeline_schema/images/02_default_implementation.drawio.png +0 -0
  5. easylink-0.1.16/docs/source/concepts/pipeline_schema/images/clustering_sub_steps.drawio.png +0 -0
  6. easylink-0.1.16/docs/source/concepts/pipeline_schema/images/easylink_pipeline_schema.drawio.png +0 -0
  7. easylink-0.1.16/docs/source/concepts/pipeline_schema/images/entity_resolution_sub_steps.drawio.png +0 -0
  8. {easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/pipeline_schema/index.rst +272 -51
  9. easylink-0.1.16/docs/source/user_guide/tutorials/DAG-r-pyspark.svg +317 -0
  10. {easylink-0.1.14 → easylink-0.1.16}/docs/source/user_guide/tutorials/getting_started.rst +350 -11
  11. easylink-0.1.16/docs/source/user_guide/tutorials/impl-config-pipeline.yaml +19 -0
  12. easylink-0.1.16/docs/source/user_guide/tutorials/input_data.yaml +3 -0
  13. easylink-0.1.16/docs/source/user_guide/tutorials/input_file_1.parquet +0 -0
  14. easylink-0.1.16/docs/source/user_guide/tutorials/input_file_2.parquet +0 -0
  15. easylink-0.1.16/docs/source/user_guide/tutorials/input_file_3.parquet +0 -0
  16. easylink-0.1.16/docs/source/user_guide/tutorials/r_spark_pipeline.yaml +15 -0
  17. {easylink-0.1.14 → easylink-0.1.16}/pyproject.toml +1 -3
  18. {easylink-0.1.14 → easylink-0.1.16}/setup.py +1 -0
  19. easylink-0.1.16/src/easylink/_version.py +1 -0
  20. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/cli.py +112 -4
  21. easylink-0.1.16/src/easylink/devtools/implementation_creator.py +435 -0
  22. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/implementation_metadata.yaml +60 -61
  23. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/pipeline_schema_constants/development.py +26 -4
  24. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/utilities/spark.smk +2 -2
  25. {easylink-0.1.14 → easylink-0.1.16}/src/easylink.egg-info/PKG-INFO +9 -7
  26. {easylink-0.1.14 → easylink-0.1.16}/src/easylink.egg-info/SOURCES.txt +14 -3
  27. {easylink-0.1.14 → easylink-0.1.16}/src/easylink.egg-info/requires.txt +1 -0
  28. easylink-0.1.16/tests/unit/recipe_strings/python_pandas.txt +22 -0
  29. easylink-0.1.16/tests/unit/test_implementation_creator.py +241 -0
  30. easylink-0.1.14/docs/source/concepts/pipeline_schema/images/02_default_implementation.drawio.png +0 -0
  31. easylink-0.1.14/docs/source/concepts/pipeline_schema/images/clustering_pass_sub_steps.drawio.png +0 -0
  32. easylink-0.1.14/docs/source/concepts/pipeline_schema/images/entity_resolution_pipeline_schema.drawio.png +0 -0
  33. easylink-0.1.14/src/easylink/_version.py +0 -1
  34. {easylink-0.1.14 → easylink-0.1.16}/.bandit +0 -0
  35. {easylink-0.1.14 → easylink-0.1.16}/.flake8 +0 -0
  36. {easylink-0.1.14 → easylink-0.1.16}/.github/CODEOWNERS +0 -0
  37. {easylink-0.1.14 → easylink-0.1.16}/.github/pull_request_template.md +0 -0
  38. {easylink-0.1.14 → easylink-0.1.16}/.github/workflows/deploy.yml +0 -0
  39. {easylink-0.1.14 → easylink-0.1.16}/.github/workflows/update_readme.yml +0 -0
  40. {easylink-0.1.14 → easylink-0.1.16}/.gitignore +0 -0
  41. {easylink-0.1.14 → easylink-0.1.16}/.readthedocs.yml +0 -0
  42. {easylink-0.1.14 → easylink-0.1.16}/CONTRIBUTING.rst +0 -0
  43. {easylink-0.1.14 → easylink-0.1.16}/Jenkinsfile +0 -0
  44. {easylink-0.1.14 → easylink-0.1.16}/Makefile +0 -0
  45. {easylink-0.1.14 → easylink-0.1.16}/docs/Makefile +0 -0
  46. {easylink-0.1.14 → easylink-0.1.16}/docs/nitpick-exceptions +0 -0
  47. {easylink-0.1.14 → easylink-0.1.16}/docs/source/_static/style.css +0 -0
  48. {easylink-0.1.14 → easylink-0.1.16}/docs/source/_templates/layout.html +0 -0
  49. {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/cli.rst +0 -0
  50. {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/configuration.rst +0 -0
  51. {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/graph_components.rst +0 -0
  52. {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/implementation.rst +0 -0
  53. {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/index.rst +0 -0
  54. {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/pipeline.rst +0 -0
  55. {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/pipeline_graph.rst +0 -0
  56. {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/pipeline_schema.rst +0 -0
  57. {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/pipeline_schema_constants/development.rst +0 -0
  58. {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/pipeline_schema_constants/index.rst +0 -0
  59. {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/pipeline_schema_constants/testing.rst +0 -0
  60. {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/rule.rst +0 -0
  61. {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/runner.rst +0 -0
  62. {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/step.rst +0 -0
  63. {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/utilities/aggregator_utils.rst +0 -0
  64. {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/utilities/data_utils.rst +0 -0
  65. {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/utilities/general_utils.rst +0 -0
  66. {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/utilities/index.rst +0 -0
  67. {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/utilities/paths.rst +0 -0
  68. {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/utilities/splitter_utils.rst +0 -0
  69. {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/utilities/validation_utils.rst +0 -0
  70. {easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/index.rst +0 -0
  71. {easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/pipeline_schema/images/01_step.drawio.png +0 -0
  72. {easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/pipeline_schema/images/03_slots.drawio.png +0 -0
  73. {easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/pipeline_schema/images/04_data_dependency.drawio.png +0 -0
  74. {easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/pipeline_schema/images/05_pipeline_schema.drawio.png +0 -0
  75. {easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/pipeline_schema/images/06_default_input.drawio.png +0 -0
  76. {easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/pipeline_schema/images/07_cloneable_section.drawio.png +0 -0
  77. {easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/pipeline_schema/images/08_cloneable_section_expanded.drawio.png +0 -0
  78. {easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/pipeline_schema/images/09_loopable_section.drawio.png +0 -0
  79. {easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/pipeline_schema/images/10_loopable_section_expanded.drawio.png +0 -0
  80. {easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/pipeline_schema/images/11_cloneable_section_splitter.drawio.png +0 -0
  81. {easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/pipeline_schema/images/12_cloneable_section_splitter_expanded.drawio.png +0 -0
  82. {easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/pipeline_schema/images/13_autoparallel_section.drawio.png +0 -0
  83. {easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/pipeline_schema/images/14_choice_section.drawio.png +0 -0
  84. {easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/pipeline_schema/images/15_choice_section_expanded.drawio.png +0 -0
  85. {easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/pipeline_schema/images/16_step_hierarchy.drawio.png +0 -0
  86. {easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/pipeline_schema/images/17_draws.drawio.png +0 -0
  87. {easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/pipeline_schema/images/18_schema_to_pipeline.drawio.png +0 -0
  88. {easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/pipeline_schema/images/19_schema_to_pipeline_combined.drawio.png +0 -0
  89. {easylink-0.1.14 → easylink-0.1.16}/docs/source/conf.py +0 -0
  90. {easylink-0.1.14 → easylink-0.1.16}/docs/source/glossary.rst +0 -0
  91. {easylink-0.1.14 → easylink-0.1.16}/docs/source/index.rst +0 -0
  92. {easylink-0.1.14 → easylink-0.1.16}/docs/source/user_guide/cli.rst +0 -0
  93. {easylink-0.1.14 → easylink-0.1.16}/docs/source/user_guide/index.rst +0 -0
  94. {easylink-0.1.14 → easylink-0.1.16}/docs/source/user_guide/tutorials/DAG-common-pipeline.svg +0 -0
  95. {easylink-0.1.14 → easylink-0.1.16}/docs/source/user_guide/tutorials/DAG-e2e-pipeline-expanded.svg +0 -0
  96. {easylink-0.1.14 → easylink-0.1.16}/docs/source/user_guide/tutorials/DAG-e2e-pipeline.svg +0 -0
  97. {easylink-0.1.14/tests/specifications/examples → easylink-0.1.16/docs/source/user_guide/tutorials}/environment_slurm.yaml +0 -0
  98. {easylink-0.1.14 → easylink-0.1.16}/docs/source/user_guide/tutorials/index.rst +0 -0
  99. {easylink-0.1.14 → easylink-0.1.16}/python_versions.json +0 -0
  100. {easylink-0.1.14 → easylink-0.1.16}/pytype.cfg +0 -0
  101. {easylink-0.1.14 → easylink-0.1.16}/setup.cfg +0 -0
  102. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/__about__.py +0 -0
  103. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/__init__.py +0 -0
  104. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/configuration.py +0 -0
  105. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/graph_components.py +0 -0
  106. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/images/spark_cluster/Dockerfile +0 -0
  107. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/images/spark_cluster/README.md +0 -0
  108. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/implementation.py +0 -0
  109. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/pipeline.py +0 -0
  110. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/pipeline_graph.py +0 -0
  111. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/pipeline_schema.py +0 -0
  112. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/pipeline_schema_constants/__init__.py +0 -0
  113. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/pipeline_schema_constants/testing.py +0 -0
  114. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/rule.py +0 -0
  115. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/runner.py +0 -0
  116. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/step.py +0 -0
  117. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/steps/dev/README.md +0 -0
  118. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/steps/dev/build-containers-local.sh +0 -0
  119. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/steps/dev/build-containers-remote.sh +0 -0
  120. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/steps/dev/input_data/create_input_files.ipynb +0 -0
  121. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/steps/dev/input_data/input_file_1.csv +0 -0
  122. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/steps/dev/input_data/input_file_1.parquet +0 -0
  123. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/steps/dev/input_data/input_file_2.csv +0 -0
  124. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/steps/dev/input_data/input_file_2.parquet +0 -0
  125. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/steps/dev/python_pandas/README.md +0 -0
  126. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/steps/dev/python_pandas/dummy_step.py +0 -0
  127. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/steps/dev/python_pandas/python_pandas.def +0 -0
  128. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/steps/dev/python_pyspark/README.md +0 -0
  129. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/steps/dev/python_pyspark/dummy_step.py +0 -0
  130. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/steps/dev/python_pyspark/python_pyspark.def +0 -0
  131. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/steps/dev/r/README.md +0 -0
  132. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/steps/dev/r/dummy_step.R +0 -0
  133. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/steps/dev/r/r-image.def +0 -0
  134. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/steps/dev/test.py +0 -0
  135. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/utilities/__init__.py +0 -0
  136. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/utilities/aggregator_utils.py +0 -0
  137. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/utilities/data_utils.py +0 -0
  138. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/utilities/general_utils.py +0 -0
  139. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/utilities/paths.py +0 -0
  140. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/utilities/splitter_utils.py +0 -0
  141. {easylink-0.1.14 → easylink-0.1.16}/src/easylink/utilities/validation_utils.py +0 -0
  142. {easylink-0.1.14 → easylink-0.1.16}/src/easylink.egg-info/dependency_links.txt +0 -0
  143. {easylink-0.1.14 → easylink-0.1.16}/src/easylink.egg-info/entry_points.txt +0 -0
  144. {easylink-0.1.14 → easylink-0.1.16}/src/easylink.egg-info/not-zip-safe +0 -0
  145. {easylink-0.1.14 → easylink-0.1.16}/src/easylink.egg-info/top_level.txt +0 -0
  146. {easylink-0.1.14 → easylink-0.1.16}/tests/__init__.py +0 -0
  147. {easylink-0.1.14 → easylink-0.1.16}/tests/conftest.py +0 -0
  148. {easylink-0.1.14 → easylink-0.1.16}/tests/e2e/test_easylink_run.py +0 -0
  149. {easylink-0.1.14 → easylink-0.1.16}/tests/e2e/test_step_types.py +0 -0
  150. {easylink-0.1.14 → easylink-0.1.16}/tests/integration/test_compositions.py +0 -0
  151. {easylink-0.1.14 → easylink-0.1.16}/tests/integration/test_snakemake.py +0 -0
  152. {easylink-0.1.14 → easylink-0.1.16}/tests/integration/test_snakemake_slurm.py +0 -0
  153. {easylink-0.1.14 → easylink-0.1.16}/tests/integration/test_snakemake_spark.py +0 -0
  154. {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/common/environment_local.yaml +0 -0
  155. {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/common/input_data.yaml +0 -0
  156. {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/common/pipeline.yaml +0 -0
  157. {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/e2e/environment_slurm.yaml +0 -0
  158. {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/e2e/pipeline.yaml +0 -0
  159. {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/e2e/pipeline_expanded.yaml +0 -0
  160. {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/integration/embarrassingly_parallel/pipeline_hierarchical_step.yaml +0 -0
  161. {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/integration/embarrassingly_parallel/pipeline_loop_step.yaml +0 -0
  162. {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/integration/embarrassingly_parallel/pipeline_parallel_step.yaml +0 -0
  163. {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/integration/environment_spark_slurm.yaml +0 -0
  164. {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/integration/pipeline.yaml +0 -0
  165. {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/integration/pipeline_spark.yaml +0 -0
  166. {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/environment_minimum.yaml +0 -0
  167. {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/environment_spark_slurm.yaml +0 -0
  168. {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline.yaml +0 -0
  169. {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_bad_combined_implementations.yaml +0 -0
  170. {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_bad_implementation.yaml +0 -0
  171. {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_bad_loop_formatting.yaml +0 -0
  172. {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_bad_step.yaml +0 -0
  173. {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_bad_type_key.yaml +0 -0
  174. {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_combine_bad_implementation_names.yaml +0 -0
  175. {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_combine_bad_topology.yaml +0 -0
  176. {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_combine_two_steps.yaml +0 -0
  177. {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_combine_with_extra_node.yaml +0 -0
  178. {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_combine_with_iteration.yaml +0 -0
  179. {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_combine_with_iteration_cycle.yaml +0 -0
  180. {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_combine_with_missing_node.yaml +0 -0
  181. {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_combine_with_parallel.yaml +0 -0
  182. {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_missing_implementation_name.yaml +0 -0
  183. {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_missing_implementations.yaml +0 -0
  184. {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_missing_loop_nodes.yaml +0 -0
  185. {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_missing_step.yaml +0 -0
  186. {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_missing_substeps.yaml +0 -0
  187. {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_missing_type_key.yaml +0 -0
  188. {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_nested_templated_steps.yaml +0 -0
  189. {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_out_of_order.yaml +0 -0
  190. {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_spark.yaml +0 -0
  191. {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_type_config_mismatch.yaml +0 -0
  192. {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_wrong_parallel_split_keys.yaml +0 -0
  193. {easylink-0.1.14 → easylink-0.1.16}/tests/unit/__init__.py +0 -0
  194. {easylink-0.1.14 → easylink-0.1.16}/tests/unit/conftest.py +0 -0
  195. {easylink-0.1.14 → easylink-0.1.16}/tests/unit/rule_strings/aggregation_rule.txt +0 -0
  196. {easylink-0.1.14 → easylink-0.1.16}/tests/unit/rule_strings/checkpoint_rule.txt +0 -0
  197. {easylink-0.1.14 → easylink-0.1.16}/tests/unit/rule_strings/embarrassingly_parallel_rule.txt +0 -0
  198. {easylink-0.1.14 → easylink-0.1.16}/tests/unit/rule_strings/implemented_rule_local.txt +0 -0
  199. {easylink-0.1.14 → easylink-0.1.16}/tests/unit/rule_strings/implemented_rule_slurm.txt +0 -0
  200. {easylink-0.1.14 → easylink-0.1.16}/tests/unit/rule_strings/pipeline_local.txt +0 -0
  201. {easylink-0.1.14 → easylink-0.1.16}/tests/unit/rule_strings/pipeline_slurm.txt +0 -0
  202. {easylink-0.1.14 → easylink-0.1.16}/tests/unit/rule_strings/target_rule.txt +0 -0
  203. {easylink-0.1.14 → easylink-0.1.16}/tests/unit/rule_strings/validation_rule.txt +0 -0
  204. {easylink-0.1.14 → easylink-0.1.16}/tests/unit/test_cli.py +0 -0
  205. {easylink-0.1.14 → easylink-0.1.16}/tests/unit/test_config.py +0 -0
  206. {easylink-0.1.14 → easylink-0.1.16}/tests/unit/test_data_utils.py +0 -0
  207. {easylink-0.1.14 → easylink-0.1.16}/tests/unit/test_general_utils.py +0 -0
  208. {easylink-0.1.14 → easylink-0.1.16}/tests/unit/test_graph_components.py +0 -0
  209. {easylink-0.1.14 → easylink-0.1.16}/tests/unit/test_implementation.py +0 -0
  210. {easylink-0.1.14 → easylink-0.1.16}/tests/unit/test_pipeline.py +0 -0
  211. {easylink-0.1.14 → easylink-0.1.16}/tests/unit/test_pipeline_graph.py +0 -0
  212. {easylink-0.1.14 → easylink-0.1.16}/tests/unit/test_pipeline_schema.py +0 -0
  213. {easylink-0.1.14 → easylink-0.1.16}/tests/unit/test_rule.py +0 -0
  214. {easylink-0.1.14 → easylink-0.1.16}/tests/unit/test_runner.py +0 -0
  215. {easylink-0.1.14 → easylink-0.1.16}/tests/unit/test_step.py +0 -0
  216. {easylink-0.1.14 → easylink-0.1.16}/tests/unit/test_validations.py +0 -0
  217. {easylink-0.1.14 → easylink-0.1.16}/update_readme.py +0 -0
@@ -1,3 +1,11 @@
1
+ **0.1.16 - 5/13/25**
2
+
3
+ - Implement new cli command to simplify implementation creation
4
+
5
+ **0.1.15 - 5/5/25**
6
+
7
+ - Fix SyntaxWarning for unescaped backslashes
8
+
1
9
  **0.1.14 - 5/1/25**
2
10
 
3
11
  - Add support for EmbarrassinglyParallelSteps to accept sections (i.e. non-leaf steps)
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: easylink
3
- Version: 0.1.14
3
+ Version: 0.1.16
4
4
  Summary: Research repository for the EasyLink ER ecosystem project.
5
5
  Home-page: https://github.com/ihmeuw/easylink
6
6
  Author: The EasyLink developers
@@ -21,6 +21,7 @@ Requires-Dist: snakemake-interface-executor-plugins<9.0.0
21
21
  Requires-Dist: snakemake-executor-plugin-slurm
22
22
  Requires-Dist: pandas-stubs
23
23
  Requires-Dist: pyarrow-stubs
24
+ Requires-Dist: types-PyYAML
24
25
  Provides-Extra: docs
25
26
  Requires-Dist: sphinx<8.2.0; extra == "docs"
26
27
  Requires-Dist: sphinx-rtd-theme; extra == "docs"
@@ -78,15 +79,16 @@ There are a few things to install in order to use this package:
78
79
  - Install singularity.
79
80
 
80
81
  You may need to request it from your system admin.
81
- Refer to https://docs.sylabs.io/guides/4.1/admin-guide/installation.html.
82
- You can check if you already have singularity installed by running the command ``singularity --version``. For an
83
- existing installation, your singularity version number is printed.
82
+ Refer to https://docs.sylabs.io/guides/4.1/admin-guide/installation.html.
83
+ You can check if you already have singularity installed by running the command
84
+ ``singularity --version``. For an existing installation, your singularity version
85
+ number is printed.
84
86
 
85
87
  - Install conda.
86
88
 
87
- We recommend `miniforge <https://github.com/conda-forge/miniforge>`_. You can check if you already
88
- have conda installed by running the command ``conda --version``. For an existing installation, a version
89
- will be displayed.
89
+ We recommend `miniforge <https://github.com/conda-forge/miniforge>`_. You can
90
+ check if you already have conda installed by running the command ``conda --version``.
91
+ For an existing installation, a version will be displayed.
90
92
 
91
93
  - Install easylink, python and graphviz in a conda environment.
92
94
 
@@ -21,15 +21,16 @@ There are a few things to install in order to use this package:
21
21
  - Install singularity.
22
22
 
23
23
  You may need to request it from your system admin.
24
- Refer to https://docs.sylabs.io/guides/4.1/admin-guide/installation.html.
25
- You can check if you already have singularity installed by running the command ``singularity --version``. For an
26
- existing installation, your singularity version number is printed.
24
+ Refer to https://docs.sylabs.io/guides/4.1/admin-guide/installation.html.
25
+ You can check if you already have singularity installed by running the command
26
+ ``singularity --version``. For an existing installation, your singularity version
27
+ number is printed.
27
28
 
28
29
  - Install conda.
29
30
 
30
- We recommend `miniforge <https://github.com/conda-forge/miniforge>`_. You can check if you already
31
- have conda installed by running the command ``conda --version``. For an existing installation, a version
32
- will be displayed.
31
+ We recommend `miniforge <https://github.com/conda-forge/miniforge>`_. You can
32
+ check if you already have conda installed by running the command ``conda --version``.
33
+ For an existing installation, a version will be displayed.
33
34
 
34
35
  - Install easylink, python and graphviz in a conda environment.
35
36
 
@@ -89,6 +89,7 @@ Default Implementations
89
89
  A step with a check mark on its top right corner has a default implementation.
90
90
  Therefore, the user doesn't *have* to specify anything.
91
91
  If the user wants to, they can override the default implementation.
92
+ We draw these steps in gray.
92
93
 
93
94
  .. image:: images/02_default_implementation.drawio.png
94
95
  :alt: Diagram showing a step with a default implementation
@@ -384,6 +385,9 @@ Because this can get so complicated, we don't show all the hierarchical levels i
384
385
  as we've done above with the dotted line "insert."
385
386
  Instead, we make a separate diagram with the title "Step 2"
386
387
  that represents the step graph contained within Step 2.
388
+ In this diagram, we show a little "mini-map" of the levels of hierarchy above,
389
+ highlighting in red the step that we are diagramming the inside of.
390
+ Think of this like a "You are Here!" label.
387
391
 
388
392
  At the top level of the step hierarchy,
389
393
  the pipeline schema splits the entity resolution task into very coarse steps,
@@ -486,10 +490,12 @@ Data dependencies *between* these steps are removed, and then the step nodes are
486
490
  :alt: Diagram of the two conceptual steps transforming a pipeline schema into a particular
487
491
  pipeline graph which includes a combined implementation
488
492
 
489
- Entity resolution pipeline schema
490
- ---------------------------------
493
+ .. _easylink_pipeline_schema:
491
494
 
492
- .. image:: images/entity_resolution_pipeline_schema.drawio.png
495
+ EasyLink pipeline schema
496
+ ------------------------
497
+
498
+ .. image:: images/easylink_pipeline_schema.drawio.png
493
499
 
494
500
  Input datasets
495
501
  ^^^^^^^^^^^^^^
@@ -522,6 +528,15 @@ but one of them must be called “Record ID” and it must have unique values.
522
528
  - Allen
523
529
  - 456 Other Drive, Anytown WA, 99999
524
530
 
531
+ Known clusters
532
+ ^^^^^^^^^^^^^^
533
+
534
+ **Interpretation:**
535
+ If any clusters are already known, they can be provided here
536
+ (format described in "Clusters" sub-section).
537
+ This is typically empty, which is the default,
538
+ representing that there is no prior knowledge of clusters (all records are unresolved).
539
+
525
540
  Clusters
526
541
  ^^^^^^^^
527
542
 
@@ -531,9 +546,11 @@ which indicates that records assigned the same cluster ID are observations of th
531
546
  and records with different cluster IDs are observations of different entities.
532
547
  Records without a cluster ID are unresolved
533
548
  (they may or may not be part of one of the existing clusters).
534
- Pre-assigned clusters can be used to indicate any prior knowledge of clustering.
535
- If there is no prior knowledge, this can be an empty table (which is the default),
536
- indicating that all records are unresolved.
549
+
550
+ Clusters are similar to pairwise *links* (described in more detail :ref:`below <clustering_sub_steps>`)
551
+ but inherently enforce the logical consistency of *transitivity* --
552
+ if A and B are in the same cluster, and B and C are in the same cluster,
553
+ then A and C are in the same cluster by definition.
537
554
 
538
555
  **Specification:**
539
556
  A file in a tabular format with two columns: "Input Record ID" and "Cluster ID".
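As an illustration of the transitivity point above, the following is a minimal sketch (not EasyLink's implementation) of how pairwise links can be collapsed into the two-column cluster layout. The function name `links_to_clusters`, the integer cluster IDs, and the toy record IDs are assumptions made for this example only.

```python
def links_to_clusters(links):
    """Group record IDs into clusters from pairwise links using union-find.

    Hypothetical helper for illustration; not part of the easylink package.
    """
    parent = {}

    def find(x):
        # Walk up to the representative record, shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Emit rows in the two-column layout: (Input Record ID, Cluster ID).
    cluster_ids = {}
    rows = []
    for record in sorted(parent):
        root = find(record)
        if root not in cluster_ids:
            cluster_ids[root] = len(cluster_ids) + 1
        rows.append((record, cluster_ids[root]))
    return rows


# A-B and B-C are linked; A-C is never stated, but clustering implies it.
links = [("A", "B"), ("B", "C"), ("D", "E")]
cluster_rows = links_to_clusters(links)
cluster_of = dict(cluster_rows)
```

Here `cluster_of["A"]` equals `cluster_of["C"]` even though no A-C link was supplied, which is the consistency that a cluster table guarantees and a bare list of links does not.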
@@ -570,23 +587,29 @@ Lastly, input_file_4 and input_file_5 are considered duplicates
570
587
  (records, from the same data source, referring to the same entity)
571
588
  and are also a match to reference_file_2.
572
589
 
573
- Clustering pass
574
- ^^^^^^^^^^^^^^^
590
+ .. _entity_resolution_step:
591
+
592
+ Entity resolution
593
+ ^^^^^^^^^^^^^^^^^
575
594
 
576
595
  **Interpretation:**
577
- A pass at clustering (some) records.
578
- This may take into account already-found clusters as it sees fit:
596
+ Resolving (some) records to correspond to particular entities.
597
+ A set of records corresponding to the same entity is called a "cluster."
598
+
599
+ This step may take into account already-known clusters as it sees fit:
579
600
  anything from using them as a starting point for optimization to treating those clusters as set-in-stone and unchangeable.
580
601
 
581
- Going from one of these passes to the next is
582
- one kind of *cascading*, an iterative approach to entity resolution
602
+ Typically, this would only be performed once, but the red dashed box
603
+ in the diagram above indicates that it *may* be looped, with the clusters
604
+ found in each iteration passed on to the next.
605
+ This allows for one kind of *cascading*, an iterative approach to entity resolution
583
606
  used by the US Census Bureau (and possibly other organizations too)
584
607
  to deal with the computational challenge of linking billions of records.
585
608
  In cascading, multiple passes are made to find clusters, starting with
586
609
  faster techniques (such as exact matching) that
587
610
  can solve some "easy" cases and make the problem smaller.
588
611
  As the focus narrows to only the records that
589
- are hardest to cluster/link, making the size of the problem smaller,
612
+ are hardest to cluster, making the size of the problem smaller,
590
613
  more sophisticated and computationally expensive
591
614
  techniques can be used.
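The cascading idea described above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not EasyLink or PVS code: the pass functions, the record fields `name` and `address`, and the rule that each pass sees only still-unresolved records are all hypothetical simplifications (the step itself may use known clusters in other ways).

```python
def exact_match_pass(records):
    """Cheap pass: group records whose name and address match exactly."""
    groups = {}
    for record_id, rec in records.items():
        groups.setdefault((rec["name"], rec["address"]), []).append(record_id)
    return [ids for ids in groups.values() if len(ids) > 1]


def normalized_name_pass(records):
    """Stand-in for a costlier pass: compare names ignoring case and spaces."""
    groups = {}
    for record_id, rec in records.items():
        key = rec["name"].replace(" ", "").lower()
        groups.setdefault(key, []).append(record_id)
    return [ids for ids in groups.values() if len(ids) > 1]


def cascade(records, passes):
    """Run passes in order, each on the records no earlier pass clustered."""
    clusters = {}  # record ID -> cluster ID
    next_cluster_id = 1
    for run_pass in passes:
        unresolved = {
            rid: rec for rid, rec in records.items() if rid not in clusters
        }
        for ids in run_pass(unresolved):
            for rid in ids:
                clusters[rid] = next_cluster_id
            next_cluster_id += 1
    return clusters


records = {
    1: {"name": "Ann Lee", "address": "1 Main St"},
    2: {"name": "Ann Lee", "address": "1 Main St"},
    3: {"name": "Bo Chan", "address": "2 Oak Ave"},
    4: {"name": "bochan", "address": "2 Oak Avenue"},
}
clusters = cascade(records, [exact_match_pass, normalized_name_pass])
```

The first pass resolves records 1 and 2 cheaply, so the second pass only has to examine records 3 and 4; that shrinking of the problem is what makes the more expensive technique affordable.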
592
615
 
@@ -594,20 +617,22 @@ techniques can be used.
594
617
 
595
618
  Give cascading its own documentation page?
596
619
 
597
- The sort of cascading represented by clustering passes is
598
- the kind in which a consistent clustering
599
- which satisfies transitivity (as opposed to pairwise comparisons)
620
+ The sort of cascading represented by the looping section in this diagram is
621
+ the kind in which a *clustering* (guaranteed to satisfy transitivity)
600
622
  is confirmed before moving to the next iteration.
601
- See the sub-steps of clustering for the other kind of cascading.
623
+ There is another kind of cascading, in which *pairwise links* are confirmed
624
+ but transitivity is not enforced.
625
+ That kind of cascading is represented by the looping section in :ref:`the sub-steps of clustering <clustering_sub_steps>`,
626
+ which nests within this entity resolution step.
602
627
 
603
- This step :ref:`has sub-steps <clustering_pass_sub_steps>`, which may be expanded for more detail.
628
+ This step :ref:`has sub-steps <entity_resolution_sub_steps>`, which may be expanded for more detail.
604
629
 
605
630
  **Examples:**
606
631
 
607
632
  - The US Census Bureau's Person Identification and Validation System (PVS)
608
- *modules* are considered clustering passes, since clusters
633
+ *modules* are considered entity resolution passes, since full *clusters*
609
634
  -- called "protected identification keys" (PIKs) in that system --
610
- are resolved in between modules.
635
+ are resolved in between modules (not only pairwise links!).
611
636
  As described below, each module only considers records not already clustered.
612
637
  - In `FIRLA <https://www.sciencedirect.com/science/article/pii/S1532046422001101>`_
613
638
  and similar incremental methods, the already-found clusters would be used directly
@@ -620,7 +645,7 @@ Canonicalizing and downstream analysis
620
645
  Everything else you want to do, after determining which records belong to the same entity and which don't.
621
646
  This definition is a little fuzzy.
622
647
  The downstream task is only included in the pipeline schema at all
623
- so that combined implementations can jointly do part of the ER task with the downstream task,
648
+ so that combined implementations can jointly do part of the entity resolution task with the downstream task,
624
649
  each informing the other.
625
650
  If this kind of joint model isn't necessary,
626
651
  this step can simply output entire datasets
@@ -647,12 +672,39 @@ May contain multiple draws in different files or subdirectories, or not.
647
672
  **Specification:**
648
673
  None. May take any form.
649
674
 
650
- .. _clustering_pass_sub_steps:
675
+ .. _entity_resolution_sub_steps:
651
676
 
652
- Clustering pass sub-steps
653
- -------------------------
677
+ Entity resolution sub-steps
678
+ ---------------------------
654
679
 
655
- .. image:: images/clustering_pass_sub_steps.drawio.png
680
+ The direct sub-steps of entity resolution mostly have to do with
681
+ *cascading* and *incorporating already-known clusters*,
682
+ both of which are rare situations.
683
+ All of the steps except for **clustering** have default implementations
684
+ and are not relevant in the common situation of starting from scratch
685
+ (no known clusters) and clustering in one pass (no cascading).
686
+ For this reason, clustering is described first below.
687
+
688
+ .. image:: images/entity_resolution_sub_steps.drawio.png
689
+
690
+ Clustering
691
+ ^^^^^^^^^^
692
+
693
+ **Interpretation:**
694
+ Assigning cluster IDs to (some) records to indicate which correspond to the same entity.
695
+ *May* use information about "old" clusters as a starting point.
696
+
697
+ This step :ref:`has sub-steps <clustering_sub_steps>`, which may be expanded for more detail
698
+ *by pairwise methods.*
699
+ Methods that are not pairwise should implement this step directly.
700
+
701
+ **Examples:**
702
+
703
+ - The core part of a PVS module
704
+ - `dblink <https://github.com/cleanzr/dblink>`_
705
+ (would ignore "old" clusters, since there is no way for it to update)
706
+ - In Splink, this step would correspond to estimating parameters, making pairwise
707
+ predictions, and then clustering entities with connected components or similar
656
708
 
657
709
  Eliminating records
658
710
  ^^^^^^^^^^^^^^^^^^^
@@ -664,8 +716,12 @@ Usually these will be records that have already been clustered sufficiently well
664
716
  (whatever that means as defined by the implementation of this step)
665
717
  that we don't need to look at them anymore.
666
718
 
719
+ **Default implementation:**
720
+ Throws an error if there are any known clusters.
721
+ Otherwise, returns an empty list (no records to eliminate).
722
+
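The default behavior described above can be sketched roughly as follows. This is only an illustration: the function name `default_eliminate_records` and the shape of its argument (a pandas table of known clusters) are assumptions, not part of EasyLink's actual API.

```python
import pandas as pd


def default_eliminate_records(known_clusters: pd.DataFrame) -> list:
    """Hypothetical default for the "eliminating records" step.

    Refuses to run when known clusters are supplied, because deciding which
    already-clustered records to drop requires a real implementation.
    """
    if len(known_clusters) > 0:
        raise ValueError(
            "Known clusters were provided, but the default implementation "
            "of 'eliminating records' cannot use them."
        )
    # Starting from scratch: there are no records to eliminate.
    return []
```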
667
723
  **Example:**
668
- As mentioned above, our main example of clustering passes is PVS *modules*
724
+ As mentioned above, our main example of entity resolution passes is PVS *modules*
669
725
  such as NameSearch, DOBSearch, etc.
670
726
  In those modules, the implementation of this step would be to eliminate
671
727
  all input-file records that are already linked to at least one reference-file
@@ -701,30 +757,6 @@ Pandas code dropping records with matching record IDs.
701
757
  Note that if the default implementation is used,
702
758
  input and output data specifications do not need to be checked.
703
759
 
704
- Datasets for pass
705
- ^^^^^^^^^^^^^^^^^
706
-
707
- **Interpretation:**
708
- The input datasets to consider for the purposes of this clustering pass.
709
-
710
- **Specification:**
711
- See specification for "Input datasets."
712
-
713
- Clustering
714
- ^^^^^^^^^^
715
-
716
- **Interpretation:**
717
- Assigning cluster IDs to (some) records to indicate which correspond to the same entity.
718
- May use information about "old" clusters as a starting point.
719
-
720
- **Examples:**
721
-
722
- - The core part of a PVS module
723
- - `dblink <https://github.com/cleanzr/dblink>`_
724
- (would ignore "old" clusters, since there is no way for it to update)
725
- - In Splink, this step would correspond to estimating parameters, making pairwise
726
- predictions, and then clustering entities with connected components or similar
727
-
728
760
  New clusters
729
761
  ^^^^^^^^^^^^
730
762
 
@@ -741,6 +773,10 @@ Updating clusters
741
773
  **Interpretation:**
742
774
  Updating/reconciling previously-found clusters with newly-found clusters.
743
775
 
776
+ **Default implementation:**
777
+ Throws an error if there are any known clusters.
778
+ Otherwise, returns the new clusters unchanged.
779
+
744
780
  **Examples:**
745
781
 
746
782
  - In PVS, simply appending PIKs found in this module to those found in previous
@@ -748,4 +784,189 @@ Updating/reconciling previously-found clusters with newly-found clusters.
748
784
  Because of the "eliminating records" strategy used in PVS, these are guaranteed
749
785
  to not include any of the same input file records.
750
786
  - A simple approach would be to make each set of clusters into a graph of records,
751
- merge the graphs, and take the connected components as the updated clusters.
787
+ merge the graphs, and take the connected components as the updated clusters.
788
+
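The graph-merging approach in the last example could look something like the sketch below. It assumes each set of clusters is given as a plain mapping from cluster ID to record IDs; the name `merge_clusters` and that input shape are illustrative choices, not something the schema prescribes. A union-find structure stands in for building the graph explicitly.

```python
from collections import defaultdict


def merge_clusters(old_clusters, new_clusters):
    """Merge two cluster assignments by taking connected components.

    Each argument maps a cluster ID to an iterable of record IDs (assumed shape).
    Returns a sorted list of sorted record-ID lists, one per merged cluster.
    """
    parent = {}

    def find(record):
        # Register unseen records as their own root, then walk up to the root.
        parent.setdefault(record, record)
        while parent[record] != record:
            parent[record] = parent[parent[record]]  # path halving
            record = parent[record]
        return record

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Records sharing a cluster in either assignment end up in one component.
    for clusters in (old_clusters, new_clusters):
        for record_ids in clusters.values():
            record_ids = list(record_ids)
            if not record_ids:
                continue
            find(record_ids[0])
            for other in record_ids[1:]:
                union(record_ids[0], other)

    components = defaultdict(list)
    for record in parent:
        components[find(record)].append(record)
    return sorted(sorted(members) for members in components.values())
```

Because two old clusters are joined whenever a new cluster bridges them, this approach can only make clusters larger; it never splits a previously-found cluster.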
789
+ .. _clustering_sub_steps:
790
+
791
+ Clustering sub-steps
792
+ --------------------
793
+
794
+ As mentioned above, the sub-steps of clustering are designed for *pairwise* methods --
795
+ models of entity resolution that only consider *pairs* of records at a time.
796
+ Breaking down the entity resolution task into a binary classification problem
797
+ about whether or not each pair of records belongs to the same entity simplifies
798
+ it enormously, and traditional methods going back to `Fellegi and Sunter (1969) <https://courses.cs.washington.edu/courses/cse590q/04au/papers/Felligi69.pdf>`_
799
+ take this approach.
800
+
801
+ Methods that are not pairwise will need to implement the "clustering" step as a whole,
802
+ as they are not composed of parts that align with these sub-steps.
803
+
804
+ .. image:: images/clustering_sub_steps.drawio.png
805
+
806
+ Clusters to links
807
+ ^^^^^^^^^^^^^^^^^
808
+
809
+ **Interpretation:**
810
+ Converting *clusters* (sets of records that are all mutually linked)
811
+ to *links* (pairs of records that are linked).
812
+
813
+ **Default implementation:**
814
+ Pandas code that gets a list of Record IDs for each Cluster ID,
815
+ then generates all the unique (unordered) pairs of records,
816
+ and assigns each pair a probability of 1.
817
+
818
+ Here is a rough draft of the code for this default implementation:
819
+
820
+ .. code::
821
+
822
+ import pandas as pd
823
+ from itertools import combinations
824
+
825
+ def clusters_to_links(clusters_df):
826
+     # Group by Cluster ID and collect Record IDs for each cluster
827
+     grouped = clusters_df.groupby("Cluster ID")["Input Record ID"].apply(list)
828
+
829
+     # Generate all unique pairs of Record IDs within each cluster
830
+     links = []
831
+     for record_ids in grouped:
832
+         links.extend(combinations(sorted(record_ids), 2))
833
+
834
+     # Create a DataFrame for the links
835
+     links_df = pd.DataFrame(links, columns=["Left Record ID", "Right Record ID"])
836
+     links_df["Probability"] = 1.0
837
+     return links_df
838
+
839
+ Links
840
+ ^^^^^
841
+
842
+ **Interpretation:**
843
+ Pairs of records that are linked with some probability.
844
+
845
+ Links can be seen as another way to represent
846
+ the same information as *clusters*,
847
+ but links are not conducive to enforcing the structural constraint
848
+ of *transitivity*: that if A links to B
849
+ and B links to C, A must link to C.
850
+ This lack of structural awareness is inherent to pairwise methods,
851
+ and the loss of information this represents is a tradeoff with the
852
+ benefits of the simplicity of the pairwise approach to entity resolution.
853
+
854
+ Assigning a probability to each pair is a more efficient system for
855
+ representing uncertainty than draws,
856
+ when the statistical dependence structure between the pairwise links
857
+ is unknown.
858
+ Draws may be used in addition to pairwise
859
+ probabilities when (some information about) the dependence
860
+ structure is known.
861
+ It is up to downstream steps to interpret/assume the dependence structure between pairwise probabilities.
862
+ If a method doesn't represent uncertainty, it can set
863
+ all probabilities to 1 (or another constant).
864
+
865
+ **Specification:**
866
+ A table with three columns, "Left Record ID", "Right Record ID", and "Probability".
867
+ Every value in both Record ID columns should exist in one of the input datasets.
868
+ Left Record ID and Right Record ID are not permitted to be equal to one another in any given row.
869
+ Rows should be unique (i.e. multiple rows with the same Left Record ID *and* Right Record ID would not be permitted).
870
+ The Left Record ID value should be alphabetically before the Right Record ID
871
+ value in each row.
872
+ (This ensures each pair is truly unique, and not
873
+ a mirror image of another.)
874
+ Each value in the Probability column must be between
875
+ 0 and 1 (inclusive).
876
+
877
+ **Example:**
878
+
879
+ .. list-table::
880
+ :header-rows: 1
881
+
882
+ * - Left Record ID
883
+ - Right Record ID
884
+ - Probability
885
+ * - input_file_2
886
+ - reference_file_3
887
+ - 0.9
888
+ * - input_file_2
889
+ - reference_file_4
890
+ - 0.8
891
+ * - input_file_3
892
+ - reference_file_6
893
+ - 0.4
894
+
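The rules in the specification above can be checked mechanically. The following is a minimal sketch, assuming the links are held in a pandas DataFrame and the set of valid record IDs is available separately; `validate_links` and its return convention (a list of problem descriptions) are hypothetical and not part of EasyLink.

```python
import pandas as pd


def validate_links(links_df, known_record_ids):
    """Return a list of problems found; an empty list means the table is valid."""
    required = ["Left Record ID", "Right Record ID", "Probability"]
    missing = [col for col in required if col not in links_df.columns]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    left = links_df["Left Record ID"]
    right = links_df["Right Record ID"]
    known = set(known_record_ids)

    if not (left.isin(known) & right.isin(known)).all():
        problems.append("some record IDs do not exist in the input datasets")
    if (left == right).any():
        problems.append("a record is linked to itself")
    if links_df.duplicated(subset=["Left Record ID", "Right Record ID"]).any():
        problems.append("duplicate (left, right) pairs")
    if (left > right).any():
        problems.append("Left Record ID must sort before Right Record ID")
    # Series.between is inclusive on both ends by default; NaN fails the check.
    if not links_df["Probability"].between(0, 1).all():
        problems.append("probabilities must lie in [0, 1]")
    return problems
```

Applied to the example table above, with all five record IDs treated as known, this sketch reports no problems.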
895
+ Linking
896
+ ^^^^^^^
897
+
898
+ **Interpretation:**
899
+ Finding pairs of records that should
900
+ be considered links (correspond to the same entity).
901
+
902
+ Typically, this would only be performed once, but the red dashed box
903
+ in the diagram above indicates that it *may* be looped, with the links
904
+ found in each iteration passed on to the next.
905
+ This allows for the other kind of *cascading*, an iterative approach
906
+ described :ref:`above <entity_resolution_step>`.
907
+
908
+ The sort of cascading represented by the looping section in this diagram is
909
+ the kind in which *links*
910
+ are confirmed before moving to the next iteration.
911
+ There is another kind of cascading, in which *clusters* are confirmed
912
+ and transitivity is enforced.
913
+ That kind of cascading is represented by the looping section in :ref:`the top-level pipeline schema <easylink_pipeline_schema>`.
914
+
915
+ **Examples:**
916
+
917
+ - A single PVS pass *within* a module, such as the first pass
918
+ of GeoSearch, which `as of 2014 <https://www.census.gov/content/dam/Census/library/working-papers/2014/adrm/carra-wp-2014-02.pdf>`_
919
+ used blocking on the Master Address File (MAF) ID.
920
+ - In Splink, this step would correspond to estimating parameters and making pairwise predictions (possibly with a threshold)
921
+
922
+ Links to clusters
923
+ ^^^^^^^^^^^^^^^^^
924
+
925
+ **Interpretation:**
926
+ Converting *links* (pairs of records that are linked) to *clusters* (sets of records that are all mutually linked).
927
+
928
+ This implies resolving issues with transitivity: if A links to B
929
+ and B links to C, A must link to C.
930
+ Resolving these issues requires making after-the-fact corrections
931
+ to some of the links found, taking advantage of the context provided
932
+ by other links.
933
+ Making these corrections outside the linkage model is not ideal,
934
+ but this is the price paid in return for the simplicity of the pairwise approach.
935
+
936
+ Clusters are also much more conducive to representing *other* structural
937
+ constraints the analyst may have, such as a one-to-one link between two files.
938
+ We expect that these constraints will typically be enforced during this step.
939
+
940
+ **Examples:**
941
+
942
+ - The simplest algorithm is finding the
943
+ `components <https://en.wikipedia.org/wiki/Component_(graph_theory)>`_
944
+ (also called "connected components")
945
+ of the graph created by giving every record a node
946
+ and every pair (with probability above a threshold) an edge.
947
+ This is implemented `in Splink <https://moj-analytical-services.github.io/splink/api_docs/clustering.html>`_.
948
+ - In PVS, the algorithm incorporates the restriction
949
+ that multiple records from the *reference* file
950
+ should never be in the same cluster.
951
+ Therefore, the links are filtered before going
952
+ into connected components:
953
+ only the link with the highest probability for
954
+ each input file record is kept, and if there are
955
+ ties for the highest probability, no links
956
+ involving that input file record are kept.
957
+ This is described `here <https://www.census.gov/content/dam/Census/library/working-papers/2014/adrm/carra-wp-2014-02.pdf>`_
958
+ as a "post-search program."
959
+ - In other Census Bureau processes such as the linkage of
960
+ the Post Enumeration Survey (PES) to the Census,
961
+ there is a 1-to-1 restriction: there can only be one record
962
+ from each file in a cluster.
963
+ This is achieved by finding the matching such that the
964
+ sum of the (logit) probabilities of the accepted matches
965
+ is maximized, as described in `Jaro (1989) <https://www.jstor.org/stable/2289924?seq=4>`_.
966
+
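As a rough illustration of the first example (connected components over thresholded links), here is a plain-Python sketch. It is not Splink's implementation; the function name, the tuple-based input, and the default threshold of 0.5 are all assumptions made for the example.

```python
from collections import defaultdict


def links_to_clusters(links, record_ids, threshold=0.5):
    """Assign a cluster ID to every record via connected components.

    links: iterable of (left_id, right_id, probability) tuples.
    record_ids: all records to cluster, including those with no links.
    Returns a dict mapping record ID -> integer cluster ID.
    """
    # Build an undirected graph containing only links above the threshold.
    neighbors = defaultdict(set)
    for left, right, probability in links:
        if probability > threshold:
            neighbors[left].add(right)
            neighbors[right].add(left)

    cluster_of = {}
    next_cluster_id = 0
    for start in sorted(record_ids):
        if start in cluster_of:
            continue
        # Depth-first traversal labels everything reachable from `start`.
        cluster_of[start] = next_cluster_id
        stack = [start]
        while stack:
            current = stack.pop()
            for neighbor in neighbors.get(current, ()):
                if neighbor not in cluster_of:
                    cluster_of[neighbor] = next_cluster_id
                    stack.append(neighbor)
        next_cluster_id += 1
    return cluster_of
```

Note how transitivity is imposed after the fact: if A-B and B-C pass the threshold, A and C share a cluster even when the A-C link itself was weak or absent.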
967
+ .. note::
968
+
969
+ None of the methods in this list are able to
970
+ propagate the uncertainty represented by the pairwise probabilities
971
+ through this step, e.g. by *sampling* clusters somehow.
972
+ Further research is needed in this area.
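
The PVS-style filter described in the list above (keep only the highest-probability link per input file record, and drop the record entirely when there is a tie) might be sketched as below. This is not the Census Bureau's code. It assumes a links table in the format specified earlier, and that input file records appear in the "Left Record ID" column, as they do in the example table; the function name is hypothetical.

```python
import pandas as pd


def keep_best_link_per_input_record(links_df, input_col="Left Record ID"):
    """Keep each input record's single best link; drop records with tied best links."""
    # Highest probability among each input record's candidate links.
    best = links_df.groupby(input_col)["Probability"].transform("max")
    top = links_df[links_df["Probability"] == best]

    # If more than one link shares the top probability, the record is ambiguous
    # and none of its links are kept.
    n_top = top.groupby(input_col)["Probability"].transform("count")
    return top[n_top == 1].reset_index(drop=True)
```

The filtered links could then be passed to a connected components routine. Since each input file record keeps at most one link, no cluster can join two reference file records through a shared input record.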