easylink 0.1.14__tar.gz → 0.1.16__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {easylink-0.1.14 → easylink-0.1.16}/CHANGELOG.rst +8 -0
- {easylink-0.1.14 → easylink-0.1.16}/PKG-INFO +9 -7
- {easylink-0.1.14 → easylink-0.1.16}/README.rst +7 -6
- easylink-0.1.16/docs/source/concepts/pipeline_schema/images/02_default_implementation.drawio.png +0 -0
- easylink-0.1.16/docs/source/concepts/pipeline_schema/images/clustering_sub_steps.drawio.png +0 -0
- easylink-0.1.16/docs/source/concepts/pipeline_schema/images/easylink_pipeline_schema.drawio.png +0 -0
- easylink-0.1.16/docs/source/concepts/pipeline_schema/images/entity_resolution_sub_steps.drawio.png +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/pipeline_schema/index.rst +272 -51
- easylink-0.1.16/docs/source/user_guide/tutorials/DAG-r-pyspark.svg +317 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/user_guide/tutorials/getting_started.rst +350 -11
- easylink-0.1.16/docs/source/user_guide/tutorials/impl-config-pipeline.yaml +19 -0
- easylink-0.1.16/docs/source/user_guide/tutorials/input_data.yaml +3 -0
- easylink-0.1.16/docs/source/user_guide/tutorials/input_file_1.parquet +0 -0
- easylink-0.1.16/docs/source/user_guide/tutorials/input_file_2.parquet +0 -0
- easylink-0.1.16/docs/source/user_guide/tutorials/input_file_3.parquet +0 -0
- easylink-0.1.16/docs/source/user_guide/tutorials/r_spark_pipeline.yaml +15 -0
- {easylink-0.1.14 → easylink-0.1.16}/pyproject.toml +1 -3
- {easylink-0.1.14 → easylink-0.1.16}/setup.py +1 -0
- easylink-0.1.16/src/easylink/_version.py +1 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/cli.py +112 -4
- easylink-0.1.16/src/easylink/devtools/implementation_creator.py +435 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/implementation_metadata.yaml +60 -61
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/pipeline_schema_constants/development.py +26 -4
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/utilities/spark.smk +2 -2
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink.egg-info/PKG-INFO +9 -7
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink.egg-info/SOURCES.txt +14 -3
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink.egg-info/requires.txt +1 -0
- easylink-0.1.16/tests/unit/recipe_strings/python_pandas.txt +22 -0
- easylink-0.1.16/tests/unit/test_implementation_creator.py +241 -0
- easylink-0.1.14/docs/source/concepts/pipeline_schema/images/02_default_implementation.drawio.png +0 -0
- easylink-0.1.14/docs/source/concepts/pipeline_schema/images/clustering_pass_sub_steps.drawio.png +0 -0
- easylink-0.1.14/docs/source/concepts/pipeline_schema/images/entity_resolution_pipeline_schema.drawio.png +0 -0
- easylink-0.1.14/src/easylink/_version.py +0 -1
- {easylink-0.1.14 → easylink-0.1.16}/.bandit +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/.flake8 +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/.github/CODEOWNERS +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/.github/pull_request_template.md +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/.github/workflows/deploy.yml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/.github/workflows/update_readme.yml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/.gitignore +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/.readthedocs.yml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/CONTRIBUTING.rst +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/Jenkinsfile +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/Makefile +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/Makefile +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/nitpick-exceptions +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/_static/style.css +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/_templates/layout.html +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/cli.rst +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/configuration.rst +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/graph_components.rst +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/implementation.rst +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/index.rst +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/pipeline.rst +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/pipeline_graph.rst +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/pipeline_schema.rst +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/pipeline_schema_constants/development.rst +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/pipeline_schema_constants/index.rst +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/pipeline_schema_constants/testing.rst +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/rule.rst +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/runner.rst +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/step.rst +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/utilities/aggregator_utils.rst +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/utilities/data_utils.rst +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/utilities/general_utils.rst +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/utilities/index.rst +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/utilities/paths.rst +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/utilities/splitter_utils.rst +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/api_reference/utilities/validation_utils.rst +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/index.rst +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/pipeline_schema/images/01_step.drawio.png +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/pipeline_schema/images/03_slots.drawio.png +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/pipeline_schema/images/04_data_dependency.drawio.png +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/pipeline_schema/images/05_pipeline_schema.drawio.png +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/pipeline_schema/images/06_default_input.drawio.png +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/pipeline_schema/images/07_cloneable_section.drawio.png +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/pipeline_schema/images/08_cloneable_section_expanded.drawio.png +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/pipeline_schema/images/09_loopable_section.drawio.png +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/pipeline_schema/images/10_loopable_section_expanded.drawio.png +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/pipeline_schema/images/11_cloneable_section_splitter.drawio.png +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/pipeline_schema/images/12_cloneable_section_splitter_expanded.drawio.png +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/pipeline_schema/images/13_autoparallel_section.drawio.png +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/pipeline_schema/images/14_choice_section.drawio.png +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/pipeline_schema/images/15_choice_section_expanded.drawio.png +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/pipeline_schema/images/16_step_hierarchy.drawio.png +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/pipeline_schema/images/17_draws.drawio.png +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/pipeline_schema/images/18_schema_to_pipeline.drawio.png +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/pipeline_schema/images/19_schema_to_pipeline_combined.drawio.png +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/conf.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/glossary.rst +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/index.rst +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/user_guide/cli.rst +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/user_guide/index.rst +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/user_guide/tutorials/DAG-common-pipeline.svg +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/user_guide/tutorials/DAG-e2e-pipeline-expanded.svg +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/user_guide/tutorials/DAG-e2e-pipeline.svg +0 -0
- {easylink-0.1.14/tests/specifications/examples → easylink-0.1.16/docs/source/user_guide/tutorials}/environment_slurm.yaml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/docs/source/user_guide/tutorials/index.rst +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/python_versions.json +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/pytype.cfg +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/setup.cfg +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/__about__.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/__init__.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/configuration.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/graph_components.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/images/spark_cluster/Dockerfile +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/images/spark_cluster/README.md +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/implementation.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/pipeline.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/pipeline_graph.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/pipeline_schema.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/pipeline_schema_constants/__init__.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/pipeline_schema_constants/testing.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/rule.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/runner.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/step.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/steps/dev/README.md +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/steps/dev/build-containers-local.sh +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/steps/dev/build-containers-remote.sh +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/steps/dev/input_data/create_input_files.ipynb +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/steps/dev/input_data/input_file_1.csv +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/steps/dev/input_data/input_file_1.parquet +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/steps/dev/input_data/input_file_2.csv +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/steps/dev/input_data/input_file_2.parquet +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/steps/dev/python_pandas/README.md +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/steps/dev/python_pandas/dummy_step.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/steps/dev/python_pandas/python_pandas.def +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/steps/dev/python_pyspark/README.md +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/steps/dev/python_pyspark/dummy_step.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/steps/dev/python_pyspark/python_pyspark.def +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/steps/dev/r/README.md +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/steps/dev/r/dummy_step.R +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/steps/dev/r/r-image.def +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/steps/dev/test.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/utilities/__init__.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/utilities/aggregator_utils.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/utilities/data_utils.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/utilities/general_utils.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/utilities/paths.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/utilities/splitter_utils.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink/utilities/validation_utils.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink.egg-info/dependency_links.txt +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink.egg-info/entry_points.txt +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink.egg-info/not-zip-safe +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/src/easylink.egg-info/top_level.txt +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/__init__.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/conftest.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/e2e/test_easylink_run.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/e2e/test_step_types.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/integration/test_compositions.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/integration/test_snakemake.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/integration/test_snakemake_slurm.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/integration/test_snakemake_spark.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/common/environment_local.yaml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/common/input_data.yaml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/common/pipeline.yaml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/e2e/environment_slurm.yaml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/e2e/pipeline.yaml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/e2e/pipeline_expanded.yaml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/integration/embarrassingly_parallel/pipeline_hierarchical_step.yaml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/integration/embarrassingly_parallel/pipeline_loop_step.yaml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/integration/embarrassingly_parallel/pipeline_parallel_step.yaml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/integration/environment_spark_slurm.yaml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/integration/pipeline.yaml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/integration/pipeline_spark.yaml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/environment_minimum.yaml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/environment_spark_slurm.yaml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline.yaml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_bad_combined_implementations.yaml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_bad_implementation.yaml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_bad_loop_formatting.yaml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_bad_step.yaml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_bad_type_key.yaml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_combine_bad_implementation_names.yaml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_combine_bad_topology.yaml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_combine_two_steps.yaml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_combine_with_extra_node.yaml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_combine_with_iteration.yaml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_combine_with_iteration_cycle.yaml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_combine_with_missing_node.yaml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_combine_with_parallel.yaml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_missing_implementation_name.yaml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_missing_implementations.yaml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_missing_loop_nodes.yaml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_missing_step.yaml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_missing_substeps.yaml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_missing_type_key.yaml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_nested_templated_steps.yaml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_out_of_order.yaml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_spark.yaml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_type_config_mismatch.yaml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/specifications/unit/pipeline_wrong_parallel_split_keys.yaml +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/unit/__init__.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/unit/conftest.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/unit/rule_strings/aggregation_rule.txt +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/unit/rule_strings/checkpoint_rule.txt +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/unit/rule_strings/embarrassingly_parallel_rule.txt +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/unit/rule_strings/implemented_rule_local.txt +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/unit/rule_strings/implemented_rule_slurm.txt +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/unit/rule_strings/pipeline_local.txt +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/unit/rule_strings/pipeline_slurm.txt +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/unit/rule_strings/target_rule.txt +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/unit/rule_strings/validation_rule.txt +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/unit/test_cli.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/unit/test_config.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/unit/test_data_utils.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/unit/test_general_utils.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/unit/test_graph_components.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/unit/test_implementation.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/unit/test_pipeline.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/unit/test_pipeline_graph.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/unit/test_pipeline_schema.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/unit/test_rule.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/unit/test_runner.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/unit/test_step.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/tests/unit/test_validations.py +0 -0
- {easylink-0.1.14 → easylink-0.1.16}/update_readme.py +0 -0
@@ -1,3 +1,11 @@
+**0.1.16 - 5/13/25**
+
+- Implement new cli command to simplify implementation creation
+
+**0.1.15 - 5/5/25**
+
+- Fix SyntaxWarning for unescaped backslashes
+
 **0.1.14 - 5/1/25**
 
 - Add support for EmbarrassinglyParallelSteps to accept sections (i.e. non-leaf steps)
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: easylink
-Version: 0.1.
+Version: 0.1.16
 Summary: Research repository for the EasyLink ER ecosystem project.
 Home-page: https://github.com/ihmeuw/easylink
 Author: The EasyLink developers
@@ -21,6 +21,7 @@ Requires-Dist: snakemake-interface-executor-plugins<9.0.0
 Requires-Dist: snakemake-executor-plugin-slurm
 Requires-Dist: pandas-stubs
 Requires-Dist: pyarrow-stubs
+Requires-Dist: types-PyYAML
 Provides-Extra: docs
 Requires-Dist: sphinx<8.2.0; extra == "docs"
 Requires-Dist: sphinx-rtd-theme; extra == "docs"
@@ -78,15 +79,16 @@ There are a few things to install in order to use this package:
 - Install singularity.
 
 You may need to request it from your system admin.
-Refer to https://docs.sylabs.io/guides/4.1/admin-guide/installation.html.
-You can check if you already have singularity installed by running the command
-existing installation, your singularity version
+Refer to https://docs.sylabs.io/guides/4.1/admin-guide/installation.html.
+You can check if you already have singularity installed by running the command
+``singularity --version``. For an existing installation, your singularity version
+number is printed.
 
 - Install conda.
 
-We recommend `miniforge <https://github.com/conda-forge/miniforge>`_. You can
-have conda installed by running the command ``conda --version``.
-will be displayed.
+We recommend `miniforge <https://github.com/conda-forge/miniforge>`_. You can
+check if you already have conda installed by running the command ``conda --version``.
+For an existing installation, a version will be displayed.
 
 - Install easylink, python and graphviz in a conda environment.
 
@@ -21,15 +21,16 @@ There are a few things to install in order to use this package:
 - Install singularity.
 
 You may need to request it from your system admin.
-Refer to https://docs.sylabs.io/guides/4.1/admin-guide/installation.html.
-You can check if you already have singularity installed by running the command
-existing installation, your singularity version
+Refer to https://docs.sylabs.io/guides/4.1/admin-guide/installation.html.
+You can check if you already have singularity installed by running the command
+``singularity --version``. For an existing installation, your singularity version
+number is printed.
 
 - Install conda.
 
-We recommend `miniforge <https://github.com/conda-forge/miniforge>`_. You can
-have conda installed by running the command ``conda --version``.
-will be displayed.
+We recommend `miniforge <https://github.com/conda-forge/miniforge>`_. You can
+check if you already have conda installed by running the command ``conda --version``.
+For an existing installation, a version will be displayed.
 
 - Install easylink, python and graphviz in a conda environment.
 
easylink-0.1.16/docs/source/concepts/pipeline_schema/images/02_default_implementation.drawio.png
ADDED
Binary file
Binary file
easylink-0.1.16/docs/source/concepts/pipeline_schema/images/easylink_pipeline_schema.drawio.png
ADDED
Binary file
easylink-0.1.16/docs/source/concepts/pipeline_schema/images/entity_resolution_sub_steps.drawio.png
ADDED
Binary file
@@ -89,6 +89,7 @@ Default Implementations
 A step with a check mark on its top right corner has a default implementation.
 Therefore, the user doesn't *have* to specify anything.
 If the user wants to, they can override the default implementation.
+We draw these steps in gray.
 
 .. image:: images/02_default_implementation.drawio.png
 :alt: Diagram showing a step with a default implementation
@@ -384,6 +385,9 @@ Because this can get so complicated, we don't show all the hierarchical levels i
 as we've done above with the dotted line "insert."
 Instead, we make a separate diagram with the title "Step 2"
 that represents the step graph contained within Step 2.
+In this diagram, we show a little "mini-map" of the levels of hierarchy above,
+highlighting in red the step that we are diagramming the inside of.
+Think of this like a "You are Here!" label.
 
 At the top level of the step hierarchy,
 the pipeline schema splits the entity resolution task into very coarse steps,
@@ -486,10 +490,12 @@ Data dependencies *between* these steps are removed, and then the step nodes are
 :alt: Diagram of the two conceptual steps transforming a pipeline schema into a particular
 pipeline graph which includes a combined implementation
 
-
----------------------------------
+.. _easylink_pipeline_schema:
 
-
+EasyLink pipeline schema
+------------------------
+
+.. image:: images/easylink_pipeline_schema.drawio.png
 
 Input datasets
 ^^^^^^^^^^^^^^
@@ -522,6 +528,15 @@ but one of them must be called “Record ID” and it must have unique values.
 - Allen
 - 456 Other Drive, Anytown WA, 99999
 
+Known clusters
+^^^^^^^^^^^^^^
+
+**Interpretation:**
+If any clusters are already known, they can be provided here
+(format described in "Clusters" sub-section).
+This is typically empty, which is the default,
+representing that there is no prior knowledge of clusters (all records are unresolved).
+
 Clusters
 ^^^^^^^^
 
@@ -531,9 +546,11 @@ which indicates that records assigned the same cluster ID are observations of th
 and records with different cluster IDs are observations of different entities.
 Records without a cluster ID are unresolved
 (they may or may not be part of one of the existing clusters).
-
-
-
+
+Clusters are similar to pairwise *links* (described in more detail :ref:`below <clustering_sub_steps>`)
+but inherently enforce the logical consistency of *transitivity* --
+if A and B are in the same cluster, and B and C are in the same cluster,
+then A and C are in the same cluster by definition.
 
 **Specification:**
 A file in a tabular format with two columns: "Input Record ID" and "Cluster ID".
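The transitivity property described in the hunk above can be illustrated with a small sketch. This is not EasyLink code: the function name `links_to_clusters` and the choice to leave unlinked records without a cluster ID are assumptions made for illustration. It merges pairwise links with a union-find structure and emits rows using the two column names from the specification.

```python
from collections import defaultdict


def links_to_clusters(record_ids, links):
    """Hypothetical helper: turn pairwise links into clusters via union-find."""
    parent = {rid: rid for rid in record_ids}

    def find(rid):
        # Walk up to the root, shortening the path as we go.
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]
            rid = parent[rid]
        return rid

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    members = defaultdict(list)
    for rid in record_ids:
        members[find(rid)].append(rid)

    # Assumption: records with no links get no cluster ID (they stay unresolved).
    rows = []
    cluster_id = 0
    for group in members.values():
        if len(group) > 1:
            cluster_id += 1
            rows.extend(
                {"Input Record ID": rid, "Cluster ID": cluster_id} for rid in group
            )
    return rows


# A-B and B-C are linked, so A, B, and C share one cluster; D stays unresolved.
rows = links_to_clusters(["A", "B", "C", "D"], [("A", "B"), ("B", "C")])
```

Because the grouping is done on connected components, A and C end up together even though no direct A-C link was supplied, which is the consistency that pairwise links alone do not guarantee.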
@@ -570,23 +587,29 @@ Lastly, input_file_4 and input_file_5 are considered duplicates
 (records, from the same data source, referring to the same entity)
 and are also a match to reference_file_2.
 
-
-
+.. _entity_resolution_step:
+
+Entity resolution
+^^^^^^^^^^^^^^^^^
 
 **Interpretation:**
-
-
+Resolving (some) records to correspond to particular entities.
+A set of records corresponding to the same entity is called a "cluster."
+
+This step may take into account already-known clusters as it sees fit:
 anything from using them as a starting point for optimization to treating those clusters as set-in-stone and unchangeable.
 
-
-
+Typically, this would only be be performed once, but the red dashed box
+in the diagram above indicates that it *may* be looped, with the clusters
+found in each iteration passed on to the next.
+This allows for one kind of *cascading*, an iterative approach to entity resolution
 used by the US Census Bureau (and possibly other organizations too)
 to deal with the computational challenge of linking billions of records.
 In cascading, multiple passes are made to find clusters, starting with
 faster techniques (such as exact matching) that
 can solve some "easy" cases and make the problem smaller.
 As the focus narrows to only the records that
-are hardest to cluster
+are hardest to cluster, making the size of the problem smaller,
 more sophisticated and computationally expensive
 techniques can be used.
 
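The cascading idea in the hunk above can be sketched as a loop over passes, each of which only sees the records that earlier passes left unclustered. Everything here is hypothetical and simplified: `cascade`, `group_by`, the `name` field, and the two matching rules are illustrative assumptions, not EasyLink or PVS code.

```python
from collections import defaultdict


def group_by(key):
    """Build a hypothetical pass that clusters records sharing the same key."""

    def match(records):
        groups = defaultdict(list)
        for rid, rec in records.items():
            groups[key(rec)].append(rid)
        # Cluster ID is the smallest record ID in the group; singletons stay unresolved.
        return {rid: min(ids) for ids in groups.values() if len(ids) > 1 for rid in ids}

    return match


def cascade(records, passes):
    """Run passes in order; confirmed clusters are removed before the next pass."""
    clusters = {}  # record ID -> cluster ID
    remaining = dict(records)
    for match in passes:
        found = match(remaining)
        clusters.update(found)
        remaining = {rid: rec for rid, rec in remaining.items() if rid not in found}
    return clusters, remaining


# Fast exact matching first, then a looser (assumed) normalized comparison.
exact = group_by(lambda rec: rec["name"])
normalized = group_by(lambda rec: rec["name"].lower().replace(".", ""))

records = {
    1: {"name": "Ann Lee"},
    2: {"name": "Ann Lee"},
    3: {"name": "Bob Roy"},
    4: {"name": "bob roy."},
    5: {"name": "Cy Poe"},
}
clusters, unresolved = cascade(records, [exact, normalized])
```

The first pass resolves records 1 and 2, so the second pass works on a smaller problem and pairs records 3 and 4; record 5 remains unresolved. A real cascade would typically also compare remaining records against already-found clusters, which this sketch omits.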
@@ -594,20 +617,22 @@ techniques can be used.
 
 Give cascading its own documentation page?
 
-The sort of cascading represented by
-the kind in which a
-which satisfies transitivity (as opposed to pairwise comparisons)
+The sort of cascading represented by the looping section in this diagram is
+the kind in which a *clustering* (guaranteed to satisfy transitivity)
 is confirmed before moving to the next iteration.
-
+There is another kind of cascading, in which *pairwise links* are confirmed
+but transitivity is not enforced.
+That kind of cascading is represented by the looping section in :ref:`the sub-steps of clustering <clustering_sub_steps>`,
+which nests within this entity resolution step.
 
-This step :ref:`has sub-steps <
+This step :ref:`has sub-steps <entity_resolution_sub_steps>`, which may be expanded for more detail.
 
 **Examples:**
 
 - The US Census Bureau's Person Identification and Validation System (PVS)
-*modules* are considered
+*modules* are considered entity resolution passes, since full *clusters*
 -- called "protected identification keys" (PIKs) in that system --
-are resolved in between modules.
+are resolved in between modules (not only pairwise links!).
 As described below, each module only considers records not already clustered.
 - In `FIRLA <https://www.sciencedirect.com/science/article/pii/S1532046422001101>`_
 and similar incremental methods, the already-found clusters would be used directly
@@ -620,7 +645,7 @@ Canonicalizing and downstream analysis
 Everything else you want to do, after determining which records belong to the same entity and which don't.
 This definition is a little fuzzy.
 The downstream task is only included in the pipeline schema at all
-so that combined implementations can jointly do part of the
+so that combined implementations can jointly do part of the entity resolution task with the downstream task,
 each informing the other.
 If this kind of joint model isn't necessary,
 this step can simply output entire datasets
@@ -647,12 +672,39 @@ May contain multiple draws in different files or subdirectories, or not.
 **Specification:**
 None. May take any form.
 
-..
+.. _entity_resolution_sub_steps:
 
-
-
+Entity resolution sub-steps
+---------------------------
 
-
+The direct sub-steps of entity resolution mostly have to do with
+*cascading* and *incorporating already-known clusters*,
+both of which are rare situations.
+All of the steps except for **clustering** have default implementations
+and are not relevant in the common situation of starting from scratch
+(no known clusters) and clustering in one pass (no cascading).
+For this reason, clustering is described first below.
+
+.. image:: images/entity_resolution_sub_steps.drawio.png
+
+Clustering
+^^^^^^^^^^
+
+**Interpretation:**
+Assigning cluster IDs to (some) records to indicate which correspond to the same entity.
+*May* use information about "old" clusters as a starting point.
+
+This step :ref:`has sub-steps <clustering_sub_steps>`, which may be expanded for more detail
+*by pairwise methods.*
+Methods that are not pairwise should implement this step directly.
+
+**Examples:**
+
+- The core part of a PVS module
+- `dblink <https://github.com/cleanzr/dblink>`_
+(would ignore "old" clusters, since there is no way for it to update)
+- In Splink, this step would correspond to estimating parameters, making pairwise
+predictions, and then clustering entities with connected components or similar
 
 Eliminating records
 ^^^^^^^^^^^^^^^^^^^
@@ -664,8 +716,12 @@ Usually these will be records that have already been clustered sufficiently well
|
|
664
716
|
(whatever that means as defined by the implementation of this step)
|
665
717
|
that we don't need to look at them anymore.
|
666
718
|
|
719
|
+
**Default implementation:**
|
720
|
+
Throws an error if there are any known clusters.
|
721
|
+
Otherwise, returns an empty list (no records to eliminate).
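The default behaviour described above can be sketched as follows. This is only an illustration, not EasyLink's actual code: the function name ``default_eliminate_records`` and the assumption that known clusters arrive as a pandas DataFrame are both hypothetical.

```python
import pandas as pd


def default_eliminate_records(known_clusters: pd.DataFrame) -> list:
    """Hypothetical sketch of the default "eliminating records" behaviour.

    Assumes known clusters are passed as a DataFrame with one row per
    already-clustered record (an assumption, not the documented interface).
    """
    # The default cannot decide which already-clustered records to drop,
    # so it refuses to run when any known clusters are present.
    if len(known_clusters) > 0:
        raise ValueError(
            "Known clusters were provided; a non-default implementation "
            "of 'eliminating records' is required."
        )
    # Starting from scratch: there are no records to eliminate.
    return []
```

With an empty known-clusters table the function returns ``[]``; with any rows it raises ``ValueError``.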
|
722
|
+
|
667
723
|
**Example:**
|
668
|
-
As mentioned above, our main example of
|
724
|
+
As mentioned above, our main example of entity resolution passes is PVS *modules*
|
669
725
|
such as NameSearch, DOBSearch, etc.
|
670
726
|
In those modules, the implementation of this step would be to eliminate
|
671
727
|
all input-file records that are already linked to at least one reference-file
|
@@ -701,30 +757,6 @@ Pandas code dropping records with matching record IDs.
|
|
701
757
|
Note that if the default implementation is used,
|
702
758
|
input and output data specifications do not need to be checked.
|
703
759
|
|
704
|
-
Datasets for pass
|
705
|
-
^^^^^^^^^^^^^^^^^
|
706
|
-
|
707
|
-
**Interpretation:**
|
708
|
-
The input datasets to consider for the purposes of this clustering pass.
|
709
|
-
|
710
|
-
**Specification:**
|
711
|
-
See specification for "Input datasets."
|
712
|
-
|
713
|
-
Clustering
|
714
|
-
^^^^^^^^^^
|
715
|
-
|
716
|
-
**Interpretation:**
|
717
|
-
Assigning cluster IDs to (some) records to indicate which correspond to the same entity.
|
718
|
-
May use information about "old" clusters as a starting point.
|
719
|
-
|
720
|
-
**Examples:**
|
721
|
-
|
722
|
-
- The core part of a PVS module
|
723
|
-
- `dblink <https://github.com/cleanzr/dblink>`_
|
724
|
-
(would ignore "old" clusters, since there is no way for it to update)
|
725
|
-
- In Splink, this step would correspond to estimating parameters, making pairwise
|
726
|
-
predictions, and then clustering entities with connected components or similar
|
727
|
-
|
728
760
|
New clusters
|
729
761
|
^^^^^^^^^^^^
|
730
762
|
|
@@ -741,6 +773,10 @@ Updating clusters
|
|
741
773
|
**Interpretation:**
|
742
774
|
Updating/reconciling previously-found clusters with newly-found clusters.
|
743
775
|
|
776
|
+
**Default implementation:**
|
777
|
+
Throws an error if there are any known clusters.
|
778
|
+
Otherwise, returns the new clusters unchanged.
|
779
|
+
|
744
780
|
**Examples:**
|
745
781
|
|
746
782
|
- In PVS, simply appending PIKs found in this module to those found in previous
|
@@ -748,4 +784,189 @@ Updating/reconciling previously-found clusters with newly-found clusters.
|
|
748
784
|
Because of the "eliminating records" strategy used in PVS, these are guaranteed
|
749
785
|
to not include any of the same input file records.
|
750
786
|
- A simple approach would be to make each set of clusters into a graph of records,
|
751
|
-
merge the graphs, and take the connected components as the updated clusters.
|
787
|
+
merge the graphs, and take the connected components as the updated clusters.
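The graph-merging approach in the last example can be sketched with a small union-find structure, which yields the same result as building the merged graph and taking its connected components. The function name ``merge_clusters`` is hypothetical, and the column names "Input Record ID" and "Cluster ID" are assumed to match the clusters tables used elsewhere in this document. Record IDs are assumed to be mutually comparable (for example, all strings).

```python
import pandas as pd


def merge_clusters(old_clusters: pd.DataFrame, new_clusters: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch: reconcile two sets of clusters via connected components."""
    parent = {}  # union-find forest: record ID -> parent record ID

    def find(record_id):
        # Register unseen records as their own root, then walk up to the root.
        parent.setdefault(record_id, record_id)
        while parent[record_id] != record_id:
            parent[record_id] = parent[parent[record_id]]  # path halving
            record_id = parent[record_id]
        return record_id

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller record ID as the root, so output IDs are deterministic.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    # Records sharing a cluster in either table are joined in the merged graph.
    for clusters in (old_clusters, new_clusters):
        for _, record_ids in clusters.groupby("Cluster ID")["Input Record ID"]:
            ids = list(record_ids)
            find(ids[0])  # make sure singleton clusters are registered
            for other in ids[1:]:
                union(ids[0], other)

    records = sorted(parent)
    # Each connected component is labelled by its smallest record ID.
    return pd.DataFrame(
        {"Input Record ID": records, "Cluster ID": [find(r) for r in records]}
    )
```

Note that this sketch never splits a cluster: any record shared between an old and a new cluster causes the two to be merged in full.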
|
788
|
+
|
789
|
+
.. _clustering_sub_steps:
|
790
|
+
|
791
|
+
Clustering sub-steps
|
792
|
+
--------------------
|
793
|
+
|
794
|
+
As mentioned above, the sub-steps of clustering are designed for *pairwise* methods --
|
795
|
+
models of entity resolution that only consider *pairs* of records at a time.
|
796
|
+
Breaking down the entity resolution task into a binary classification problem
|
797
|
+
about whether or not each pair of records belongs to the same entity simplifies
|
798
|
+
it enormously, and traditional methods going back to `Fellegi and Sunter (1969) <https://courses.cs.washington.edu/courses/cse590q/04au/papers/Felligi69.pdf>`_
|
799
|
+
take this approach.
|
800
|
+
|
801
|
+
Methods that are not pairwise will need to implement the "clustering" step as a whole,
|
802
|
+
as they are not composed of parts that align with these sub-steps.
|
803
|
+
|
804
|
+
.. image:: images/clustering_sub_steps.drawio.png
|
805
|
+
|
806
|
+
Clusters to links
|
807
|
+
^^^^^^^^^^^^^^^^^
|
808
|
+
|
809
|
+
**Interpretation:**
|
810
|
+
Converting *clusters* (sets of records that are all mutually linked)
|
811
|
+
to *links* (pairs of records that are linked).
|
812
|
+
|
813
|
+
**Default implementation:**
|
814
|
+
Pandas code that gets a list of Record IDs for each Cluster ID,
|
815
|
+
then generates all the unique (unordered) pairs of records,
|
816
|
+
and assigns each pair a probability of 1.
|
817
|
+
|
818
|
+
Here is a rough draft of the code for this default implementation:
|
819
|
+
|
820
|
+
.. code::
|
821
|
+
|
822
|
+
import pandas as pd
|
823
|
+
from itertools import combinations
|
824
|
+
|
825
|
+
def clusters_to_links(clusters_df):
|
826
|
+
    # Group by Cluster ID and collect Record IDs for each cluster
|
827
|
+
    grouped = clusters_df.groupby("Cluster ID")["Input Record ID"].apply(list)
|
828
|
+
|
829
|
+
    # Generate all unique pairs of Record IDs within each cluster
|
830
|
+
    links = []
|
831
|
+
    for record_ids in grouped:
|
832
|
+
        links.extend(combinations(sorted(record_ids), 2))
|
833
|
+
|
834
|
+
    # Create a DataFrame for the links
|
835
|
+
    links_df = pd.DataFrame(links, columns=["Left Record ID", "Right Record ID"])
|
836
|
+
    links_df["Probability"] = 1.0
|
837
|
+
    return links_df
|
838
|
+
|
839
|
+
Links
|
840
|
+
^^^^^
|
841
|
+
|
842
|
+
**Interpretation:**
|
843
|
+
Pairs of records that are linked with some probability.
|
844
|
+
|
845
|
+
Links can be seen as another way to represent
|
846
|
+
the same information as *clusters*,
|
847
|
+
but links are not conducive to enforcing the structural constraint
|
848
|
+
of *transitivity*: that if A links to B
|
849
|
+
and B links to C, A must link to C.
|
850
|
+
This lack of structural awareness is inherent to pairwise methods,
|
851
|
+
and the loss of information this represents is a tradeoff with the
|
852
|
+
benefits of the simplicity of the pairwise approach to entity resolution.
|
853
|
+
|
854
|
+
Assigning a probability to each pair is a more efficient system for
|
855
|
+
representing uncertainty than draws,
|
856
|
+
when the statistical dependence structure between the pairwise links
|
857
|
+
is unknown.
|
858
|
+
Draws may be used in addition to pairwise
|
859
|
+
probabilities when (some information about) the dependence
|
860
|
+
structure is known.
|
861
|
+
It is up to downstream steps to interpret/assume the dependence structure between pairwise probabilities.
|
862
|
+
If a method doesn't represent uncertainty, it can set
|
863
|
+
all probabilities to 1 (or another constant).
|
864
|
+
|
865
|
+
**Specification:**
|
866
|
+
A table with three columns, "Left Record ID", "Right Record ID", and "Probability".
|
867
|
+
Every value in both Record ID columns should exist in one of the input datasets.
|
868
|
+
Left Record ID and Right Record ID are not permitted to be equal to one another in any given row.
|
869
|
+
Rows should be unique (i.e. multiple rows with the same Left Record ID *and* Right Record ID would not be permitted).
|
870
|
+
The Left Record ID value should be alphabetically before the Right Record ID
|
871
|
+
value in each row.
|
872
|
+
(This ensures each pair is truly unique, and not
|
873
|
+
a mirror image of another.)
|
874
|
+
Each value in the Probability column must be between
|
875
|
+
0 and 1 (inclusive).
|
876
|
+
|
877
|
+
**Example:**
|
878
|
+
|
879
|
+
.. list-table::
|
880
|
+
:header-rows: 1
|
881
|
+
|
882
|
+
* - Left Record ID
|
883
|
+
- Right Record ID
|
884
|
+
- Probability
|
885
|
+
* - input_file_2
|
886
|
+
- reference_file_3
|
887
|
+
- 0.9
|
888
|
+
* - input_file_2
|
889
|
+
- reference_file_4
|
890
|
+
- 0.8
|
891
|
+
* - input_file_3
|
892
|
+
- reference_file_6
|
893
|
+
- 0.4
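The specification above can be checked mechanically. The following is a minimal sketch of such a check, not part of EasyLink itself: the function name ``validate_links`` and the ``known_record_ids`` argument (the collection of all Record IDs present in the input datasets) are hypothetical. It also assumes Record IDs are strings and that "alphabetically before" means Python's default string ordering.

```python
import pandas as pd


def validate_links(links: pd.DataFrame, known_record_ids) -> None:
    """Hypothetical sketch: raise ValueError if `links` violates the specification."""
    expected = ["Left Record ID", "Right Record ID", "Probability"]
    if list(links.columns) != expected:
        raise ValueError(f"Expected exactly the columns {expected}")

    left, right = links["Left Record ID"], links["Right Record ID"]
    known = list(known_record_ids)

    # Every Record ID must exist in one of the input datasets.
    if not (left.isin(known) & right.isin(known)).all():
        raise ValueError("Unknown Record ID found")

    # Left must sort strictly before Right; this also rules out self-links.
    if not (left < right).all():
        raise ValueError("Left Record ID must sort before Right Record ID")

    # No pair may appear more than once.
    if links.duplicated(subset=["Left Record ID", "Right Record ID"]).any():
        raise ValueError("Duplicate pair found")

    # Probabilities must lie in [0, 1]; missing values fail this check too.
    if not links["Probability"].between(0, 1).all():
        raise ValueError("Probability must be between 0 and 1 (inclusive)")
```

Because the ordering check is strict, a row whose two Record IDs are equal is rejected by the same test that rejects mirror-image pairs.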
|
894
|
+
|
895
|
+
Linking
|
896
|
+
^^^^^^^
|
897
|
+
|
898
|
+
**Interpretation:**
|
899
|
+
Finding pairs of records that should
|
900
|
+
be considered links (correspond to the same entity).
|
901
|
+
|
902
|
+
Typically, this would only be performed once, but the red dashed box
|
903
|
+
in the diagram above indicates that it *may* be looped, with the links
|
904
|
+
found in each iteration passed on to the next.
|
905
|
+
This allows for the other kind of *cascading*, an iterative approach
|
906
|
+
described :ref:`above <entity_resolution_step>`.
|
907
|
+
|
908
|
+
The sort of cascading represented by the looping section in this diagram is
|
909
|
+
the kind in which *links*
|
910
|
+
are confirmed before moving to the next iteration.
|
911
|
+
There is another kind of cascading, in which *clusters* are confirmed
|
912
|
+
and transitivity is enforced.
|
913
|
+
That kind of cascading is represented by the looping section in :ref:`the top-level pipeline schema <easylink_pipeline_schema>`.
|
914
|
+
|
915
|
+
**Examples:**
|
916
|
+
|
917
|
+
- A single PVS pass *within* a module, such as the first pass
|
918
|
+
of GeoSearch, which `as of 2014 <https://www.census.gov/content/dam/Census/library/working-papers/2014/adrm/carra-wp-2014-02.pdf>`_
|
919
|
+
used blocking on the Master Address File (MAF) ID.
|
920
|
+
- In Splink, this step would correspond to estimating parameters and making pairwise predictions (possibly with a threshold)
|
921
|
+
|
922
|
+
Links to clusters
|
923
|
+
^^^^^^^^^^^^^^^^^
|
924
|
+
|
925
|
+
**Interpretation:**
|
926
|
+
Converting *links* (pairs of records that are linked) to *clusters* (sets of records that are all mutually linked).
|
927
|
+
|
928
|
+
This implies resolving issues with transitivity: if A links to B
|
929
|
+
and B links to C, A must link to C.
|
930
|
+
Resolving these issues requires making after-the-fact corrections
|
931
|
+
to some of the links found, taking advantage of the context provided
|
932
|
+
by other links.
|
933
|
+
Making these corrections outside the linkage model is not ideal,
|
934
|
+
but this is the price paid in return for the simplicity of the pairwise approach.
|
935
|
+
|
936
|
+
Clusters are also much more conducive to representing *other* structural
|
937
|
+
constraints the analyst may have, such as a one-to-one link between two files.
|
938
|
+
We expect that these constraints will typically be enforced during this step.
|
939
|
+
|
940
|
+
**Examples:**
|
941
|
+
|
942
|
+
- The simplest algorithm is finding the
|
943
|
+
`components <https://en.wikipedia.org/wiki/Component_(graph_theory)>`_
|
944
|
+
(also called "connected components")
|
945
|
+
of the graph created by giving every record a node
|
946
|
+
and every pair (with probability above a threshold) an edge.
|
947
|
+
This is implemented `in Splink <https://moj-analytical-services.github.io/splink/api_docs/clustering.html>`_.
|
948
|
+
- In PVS, the algorithm incorporates the restriction
|
949
|
+
that multiple records from the *reference* file
|
950
|
+
should never be in the same cluster.
|
951
|
+
Therefore, the links are filtered before going
|
952
|
+
into connected components:
|
953
|
+
only the link with the highest probability for
|
954
|
+
each input file record is kept, and if there are
|
955
|
+
ties for the highest probability, no links
|
956
|
+
involving that input file record are kept.
|
957
|
+
This is described `here <https://www.census.gov/content/dam/Census/library/working-papers/2014/adrm/carra-wp-2014-02.pdf>`_
|
958
|
+
as a "post-search program."
|
959
|
+
- In other Census Bureau processes such as the linkage of
|
960
|
+
the Post Enumeration Survey (PES) to the Census,
|
961
|
+
there is a 1-to-1 restriction: there can only be one record
|
962
|
+
from each file in a cluster.
|
963
|
+
This is achieved by finding the matching such that the
|
964
|
+
sum of the (logit) probabilities of the accepted matches
|
965
|
+
is maximized, as described in `Jaro (1989) <https://www.jstor.org/stable/2289924?seq=4>`_.
|
966
|
+
|
967
|
+
.. note::
|
968
|
+
|
969
|
+
None of the methods in this list are able to
|
970
|
+
propagate the uncertainty represented by the pairwise probabilities
|
971
|
+
through this step, e.g. by *sampling* clusters somehow.
|
972
|
+
Further research is needed in this area.
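As an illustration of the simplest approach listed in the examples above, connected components over thresholded links could look roughly like this. The function name ``links_to_clusters``, the ``record_ids`` argument, and the default threshold of 0.5 are assumptions made for the sketch; it does not enforce any additional structural constraint (such as the PVS or 1-to-1 restrictions), and it discards the pairwise probabilities once the threshold is applied.

```python
from collections import defaultdict

import pandas as pd


def links_to_clusters(links: pd.DataFrame, record_ids, threshold: float = 0.5) -> pd.DataFrame:
    """Hypothetical sketch: cluster records by connected components of accepted links."""
    # Build an undirected graph with an edge for every link above the threshold.
    adjacency = defaultdict(set)
    kept = links[links["Probability"] > threshold]
    for left, right in zip(kept["Left Record ID"], kept["Right Record ID"]):
        adjacency[left].add(right)
        adjacency[right].add(left)

    # Depth-first search from each unvisited record; every record reached
    # from a starting record shares that record's ID as its Cluster ID.
    cluster_of = {}
    for start in sorted(record_ids):
        if start in cluster_of:
            continue
        cluster_of[start] = start
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbor in adjacency[node]:
                if neighbor not in cluster_of:
                    cluster_of[neighbor] = start
                    stack.append(neighbor)

    return pd.DataFrame(
        {
            "Input Record ID": list(cluster_of),
            "Cluster ID": list(cluster_of.values()),
        }
    )
```

Records with no accepted links end up as singleton clusters, so every record passed in ``record_ids`` receives a Cluster ID.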
|