PyPI - easylink - Versions diffs - 0.1.14__tar.gz → 0.1.16__tar.gz - Mend

easylink 0.1.14tar.gz → 0.1.16tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (217) hide show

{easylink-0.1.14 → easylink-0.1.16}/CHANGELOG.rst RENAMED Viewed

@@ -1,3 +1,11 @@
+**0.1.16 - 5/13/25**
+ - Implement new cli command to simplify implementation creation
+**0.1.15 - 5/5/25**
+ - Fix SyntaxWarning for unescaped backslashes
 **0.1.14 - 5/1/25**
  - Add support for EmbarrassinglyParallelSteps to accept sections (i.e. non-leaf steps)

{easylink-0.1.14 → easylink-0.1.16}/PKG-INFO RENAMED Viewed

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: easylink
-Version: 0.1.14
+Version: 0.1.16
 Summary: Research repository for the EasyLink ER ecosystem project.
 Home-page: https://github.com/ihmeuw/easylink
 Author: The EasyLink developers
@@ -21,6 +21,7 @@ Requires-Dist: snakemake-interface-executor-plugins<9.0.0
 Requires-Dist: snakemake-executor-plugin-slurm
 Requires-Dist: pandas-stubs
 Requires-Dist: pyarrow-stubs
+Requires-Dist: types-PyYAML
 Provides-Extra: docs
 Requires-Dist: sphinx<8.2.0; extra == "docs"
 Requires-Dist: sphinx-rtd-theme; extra == "docs"
@@ -78,15 +79,16 @@ There are a few things to install in order to use this package:
 - Install singularity.
   You may need to request it from your system admin.
-  Refer to https://docs.sylabs.io/guides/4.1/admin-guide/installation.html.
-  You can check if you already have singularity installed by running the command ``singularity --version``. For an
-  existing installation, your singularity version number is printed.
+  Refer to https://docs.sylabs.io/guides/4.1/admin-guide/installation.html.
+  You can check if you already have singularity installed by running the command
+  ``singularity --version``. For an existing installation, your singularity version
+  number is printed.
 - Install conda.
-  We recommend `miniforge <https://github.com/conda-forge/miniforge>`_. You can check if you already
-  have conda installed by running the command ``conda --version``. For an existing installation, a version
-  will be displayed.
+  We recommend `miniforge <https://github.com/conda-forge/miniforge>`_. You can
+  check if you already have conda installed by running the command ``conda --version``.
+  For an existing installation, a version will be displayed.
 - Install easylink, python and graphviz in a conda environment.

{easylink-0.1.14 → easylink-0.1.16}/README.rst RENAMED Viewed

@@ -21,15 +21,16 @@ There are a few things to install in order to use this package:
 - Install singularity.
   You may need to request it from your system admin.
-  Refer to https://docs.sylabs.io/guides/4.1/admin-guide/installation.html.
-  You can check if you already have singularity installed by running the command ``singularity --version``. For an
-  existing installation, your singularity version number is printed.
+  Refer to https://docs.sylabs.io/guides/4.1/admin-guide/installation.html.
+  You can check if you already have singularity installed by running the command
+  ``singularity --version``. For an existing installation, your singularity version
+  number is printed.
 - Install conda.
-  We recommend `miniforge <https://github.com/conda-forge/miniforge>`_. You can check if you already
-  have conda installed by running the command ``conda --version``. For an existing installation, a version
-  will be displayed.
+  We recommend `miniforge <https://github.com/conda-forge/miniforge>`_. You can
+  check if you already have conda installed by running the command ``conda --version``.
+  For an existing installation, a version will be displayed.
 - Install easylink, python and graphviz in a conda environment.

easylink-0.1.16/docs/source/concepts/pipeline_schema/images/02_default_implementation.drawio.png ADDED Viewed

Binary file

easylink-0.1.16/docs/source/concepts/pipeline_schema/images/clustering_sub_steps.drawio.png ADDED Viewed

Binary file

easylink-0.1.16/docs/source/concepts/pipeline_schema/images/easylink_pipeline_schema.drawio.png ADDED Viewed

Binary file

easylink-0.1.16/docs/source/concepts/pipeline_schema/images/entity_resolution_sub_steps.drawio.png ADDED Viewed

Binary file

{easylink-0.1.14 → easylink-0.1.16}/docs/source/concepts/pipeline_schema/index.rst RENAMED Viewed

@@ -89,6 +89,7 @@ Default Implementations
 A step with a check mark on its top right corner has a default implementation.
 Therefore, the user doesn't *have* to specify anything.
 If the user wants to, they can override the default implementation.
+We draw these steps in gray.
 .. image:: images/02_default_implementation.drawio.png
    :alt: Diagram showing a step with a default implementation
@@ -384,6 +385,9 @@ Because this can get so complicated, we don't show all the hierarchical levels i
 as we've done above with the dotted line "insert."
 Instead, we make a separate diagram with the title "Step 2"
 that represents the step graph contained within Step 2.
+In this diagram, we show a little "mini-map" of the levels of hierarchy above,
+highlighting in red the step that we are diagramming the inside of.
+Think of this like a "You are Here!" label.
 At the top level of the step hierarchy,
 the pipeline schema splits the entity resolution task into very coarse steps,
@@ -486,10 +490,12 @@ Data dependencies *between* these steps are removed, and then the step nodes are
    :alt: Diagram of the two conceptual steps transforming a pipeline schema into a particular
       pipeline graph which includes a combined implementation
-Entity resolution pipeline schema
----------------------------------
+.. _easylink_pipeline_schema:
-.. image:: images/entity_resolution_pipeline_schema.drawio.png
+EasyLink pipeline schema
+------------------------
+.. image:: images/easylink_pipeline_schema.drawio.png
 Input datasets
 ^^^^^^^^^^^^^^
@@ -522,6 +528,15 @@ but one of them must be called “Record ID” and it must have unique values.
      - Allen
      - 456 Other Drive, Anytown WA, 99999
+Known clusters
+^^^^^^^^^^^^^^
+**Interpretation:**
+If any clusters are already known, they can be provided here
+(format described in "Clusters" sub-section).
+This is typically empty, which is the default,
+representing that there is no prior knowledge of clusters (all records are unresolved).
 Clusters
 ^^^^^^^^
@@ -531,9 +546,11 @@ which indicates that records assigned the same cluster ID are observations of th
 and records with different cluster IDs are observations of different entities.
 Records without a cluster ID are unresolved
 (they may or may not be part of one of the existing clusters).
-Pre-assigned clusters can be used to indicate any prior knowledge of clustering.
-If there is no prior knowledge, this can be an empty table (which is the default),
-indicating that all records are unresolved.
+Clusters are similar to pairwise *links* (described in more detail :ref:`below <clustering_sub_steps>`)
+but inherently enforce the logical consistency of *transitivity* --
+if A and B are in the same cluster, and B and C are in the same cluster,
+then A and C are in the same cluster by definition.
 **Specification:**
 A file in a tabular format with two columns: "Input Record ID" and "Cluster ID".
@@ -570,23 +587,29 @@ Lastly, input_file_4 and input_file_5 are considered duplicates
 (records, from the same data source, referring to the same entity)
 and are also a match to reference_file_2.
-Clustering pass
-^^^^^^^^^^^^^^^
+.. _entity_resolution_step:
+Entity resolution
+^^^^^^^^^^^^^^^^^
 **Interpretation:**
-A pass at clustering (some) records.
-This may take into account already-found clusters as it sees fit:
+Resolving (some) records to correspond to particular entities.
+A set of records corresponding to the same entity is called a "cluster."
+This step may take into account already-known clusters as it sees fit:
 anything from using them as a starting point for optimization to treating those clusters as set-in-stone and unchangeable.
-Going from one of these passes to the next is
-one kind of *cascading*, an iterative approach to entity resolution
+Typically, this would only be be performed once, but the red dashed box
+in the diagram above indicates that it *may* be looped, with the clusters
+found in each iteration passed on to the next.
+This allows for one kind of *cascading*, an iterative approach to entity resolution
 used by the US Census Bureau (and possibly other organizations too)
 to deal with the computational challenge of linking billions of records.
 In cascading, multiple passes are made to find clusters, starting with
 faster techniques (such as exact matching) that
 can solve some "easy" cases and make the problem smaller.
 As the focus narrows to only the records that
-are hardest to cluster/link, making the size of the problem smaller,
+are hardest to cluster, making the size of the problem smaller,
 more sophisticated and computationally expensive
 techniques can be used.
@@ -594,20 +617,22 @@ techniques can be used.
    Give cascading its own documentation page?
-The sort of cascading represented by clustering passes is
-the kind in which a consistent clustering
-which satisfies transitivity (as opposed to pairwise comparisons)
+The sort of cascading represented by the looping section in this diagram is
+the kind in which a *clustering* (guaranteed to satisfy transitivity)
 is confirmed before moving to the next iteration.
-See the sub-steps of clustering for the other kind of cascading.
+There is another kind of cascading, in which *pairwise links* are confirmed
+but transitivity is not enforced.
+That kind of cascading is represented by the looping section in :ref:`the sub-steps of clustering <clustering_sub_steps>`,
+which nests within this entity resolution step.
-This step :ref:`has sub-steps <clustering_pass_sub_steps>`, which may be expanded for more detail.
+This step :ref:`has sub-steps <entity_resolution_sub_steps>`, which may be expanded for more detail.
 **Examples:**
 - The US Census Bureau's Person Identification and Validation System (PVS)
-  *modules* are considered clustering passes, since clusters
+  *modules* are considered entity resolution passes, since full *clusters*
   -- called "protected identification keys" (PIKs) in that system --
-  are resolved in between modules.
+  are resolved in between modules (not only pairwise links!).
   As described below, each module only considers records not already clustered.
 - In `FIRLA <https://www.sciencedirect.com/science/article/pii/S1532046422001101>`_
   and similar incremental methods, the already-found clusters would be used directly
@@ -620,7 +645,7 @@ Canonicalizing and downstream analysis
 Everything else you want to do, after determining which records belong to the same entity and which don't.
 This definition is a little fuzzy.
 The downstream task is only included in the pipeline schema at all
-so that combined implementations can jointly do part of the ER task with the downstream task,
+so that combined implementations can jointly do part of the entity resolution task with the downstream task,
 each informing the other.
 If this kind of joint model isn't necessary,
 this step can simply output entire datasets
@@ -647,12 +672,39 @@ May contain multiple draws in different files or subdirectories, or not.
 **Specification:**
 None. May take any form.
-.. _clustering_pass_sub_steps:
+.. _entity_resolution_sub_steps:
-Clustering pass sub-steps
--------------------------
+Entity resolution sub-steps
+---------------------------
-.. image:: images/clustering_pass_sub_steps.drawio.png
+The direct sub-steps of entity resolution mostly have to do with
+*cascading* and *incorporating already-known clusters*,
+both of which are rare situations.
+All of the steps except for **clustering** have default implementations
+and are not relevant in the common situation of starting from scratch
+(no known clusters) and clustering in one pass (no cascading).
+For this reason, clustering is described first below.
+.. image:: images/entity_resolution_sub_steps.drawio.png
+Clustering
+^^^^^^^^^^
+**Interpretation:**
+Assigning cluster IDs to (some) records to indicate which correspond to the same entity.
+*May* use information about "old" clusters as a starting point.
+This step :ref:`has sub-steps <clustering_sub_steps>`, which may be expanded for more detail
+*by pairwise methods.*
+Methods that are not pairwise should implement this step directly.
+**Examples:**
+- The core part of a PVS module
+- `dblink <https://github.com/cleanzr/dblink>`_
+  (would ignore "old" clusters, since there is no way for it to update)
+- In Splink, this step would correspond to estimating parameters, making pairwise
+  predictions, and then clustering entities with connected components or similar
 Eliminating records
 ^^^^^^^^^^^^^^^^^^^
@@ -664,8 +716,12 @@ Usually these will be records that have already been clustered sufficiently well
 (whatever that means as defined by the implementation of this step)
 that we don't need to look at them anymore.
+**Default implementation:**
+Throws an error if there are any known clusters.
+Otherwise, returns an empty list (no records to eliminate).
 **Example:**
-As mentioned above, our main example of clustering passes is PVS *modules*
+As mentioned above, our main example of entity resolution passes is PVS *modules*
 such as NameSearch, DOBSearch, etc.
 In those modules, the implementation of this step would be to eliminate
 all input-file records that are already linked to at least one reference-file
@@ -701,30 +757,6 @@ Pandas code dropping records with matching record IDs.
 Note that if the default implementation is used,
 input and output data specifications do not need to be checked.
-Datasets for pass
-^^^^^^^^^^^^^^^^^
-**Interpretation:**
-The input datasets to consider for the purposes of this clustering pass.
-**Specification:**
-See specification for "Input datasets."
-Clustering
-^^^^^^^^^^
-**Interpretation:**
-Assigning cluster IDs to (some) records to indicate which correspond to the same entity.
-May use information about "old" clusters as a starting point.
-**Examples:**
-- The core part of a PVS module
-- `dblink <https://github.com/cleanzr/dblink>`_
-  (would ignore "old" clusters, since there is no way for it to update)
-- In Splink, this step would correspond to estimating parameters, making pairwise
-  predictions, and then clustering entities with connected components or similar
 New clusters
 ^^^^^^^^^^^^
@@ -741,6 +773,10 @@ Updating clusters
 **Interpretation:**
 Updating/reconciling previously-found clusters with newly-found clusters.
+**Default implementation:**
+Throws an error if there are any known clusters.
+Otherwise, returns the new clusters unchanged.
 **Examples:**
 - In PVS, simply appending PIKs found in this module to those found in previous
@@ -748,4 +784,189 @@ Updating/reconciling previously-found clusters with newly-found clusters.
   Because of the "eliminating records" strategy used in PVS, these are guaranteed
   to not include any of the same input file records.
 - A simple approach would be to make each set of clusters into a graph of records,
-  merge the graphs, and take the connected components as the updated clusters.
+  merge the graphs, and take the connected components as the updated clusters.
+.. _clustering_sub_steps:
+Clustering sub-steps
+--------------------
+As mentioned above, the sub-steps of clustering are designed for *pairwise* methods --
+models of entity resolution that only consider *pairs* of records at a time.
+Breaking down the entity resolution task into a binary classification problem
+about whether or not each pair of two records belong to the same entity simplifies
+it enormously, and traditional methods going back to `Fellegi and Sunter (1969) <https://courses.cs.washington.edu/courses/cse590q/04au/papers/Felligi69.pdf>`_
+take this approach.
+Methods that are not pairwise will need to implement the "clustering" step as a whole,
+as they are not composed of parts that align with these sub-steps.
+.. image:: images/clustering_sub_steps.drawio.png
+Clusters to links
+^^^^^^^^^^^^^^^^^
+**Interpretation:**
+Converting *clusters* (sets of records that are all mutually linked)
+to *links* (pairs of records that are linked).
+**Default implementation:**
+Pandas code that gets of list of Record IDs for each Cluster ID,
+then generates all the unique (unordered) pairs of records,
+and pairs them with probability 1.
+Here is a rough draft of the code for this default implementation:
+.. code::
+   import pandas as pd
+   from itertools import combinations
+   def clusters_to_links(clusters_df):
+      # Group by Cluster ID and collect Record IDs for each cluster
+      grouped = clusters_df.groupby("Cluster ID")["Input Record ID"].apply(list)
+      # Generate all unique pairs of Record IDs within each cluster
+      links = []
+      for record_ids in grouped:
+         links.extend(combinations(sorted(record_ids), 2))
+      # Create a DataFrame for the links
+      links_df = pd.DataFrame(links, columns=["Left Record ID", "Right Record ID"])
+      links_df["Probability"] = 1.0
+      return links_df
+Links
+^^^^^
+**Interpretation:**
+Pairs of records that are linked with some probability.
+Links can be seen as another way to represent
+the same information as *clusters*,
+but links are not conducive to enforcing the structural constraint
+of *transitivity*: that if A links to B
+and B links to C, A must link to C.
+This lack of structural awareness is inherent to pairwise methods,
+and the loss of information this represents is a tradeoff with the
+benefits of the simplicity of the pairwise approach to entity resolution.
+Assigning a probability to each pair is a more efficient system for
+representing uncertainty than draws,
+when the statistical dependence structure between the pairwise links
+is unknown.
+Draws may be used in addition to pairwise
+probabilities when (some information about) the dependence
+structure is known.
+It is up to downstream steps to interpret/assume the dependence structure between pairwise probabilities.
+If a method doesn't represent uncertainty, it can set
+all probabilities to 1 (or another constant).
+**Specification:**
+A table with three columns, "Left Record ID", "Right Record ID", and "Probability".
+Every value in both Record ID columns should exist in one of the input datasets.
+Left Record ID and Right Record ID are not permitted to be equal to one another in any given row.
+Rows should be unique (i.e. multiple rows with the same Left Record ID *and* Right Record ID would not be permitted).
+The Left Record ID value should be alphabetically before the Right Record ID
+value in each row.
+(This ensures each pair is truly unique, and not
+a mirror image of another.)
+Each value in the Probability column must be between
+0 and 1 (inclusive).
+**Example:**
+.. list-table::
+   :header-rows: 1
+   * - Left Record ID
+     - Right Record ID
+     - Probability
+   * - input_file_2
+     - reference_file_3
+     - 0.9
+   * - input_file_2
+     - reference_file_4
+     - 0.8
+   * - input_file_3
+     - reference_file_6
+     - 0.4
+Linking
+^^^^^^^
+**Interpretation:**
+Finding pairs of records that should
+be considered links (correspond to the same entity).
+Typically, this would only be be performed once, but the red dashed box
+in the diagram above indicates that it *may* be looped, with the links
+found in each iteration passed on to the next.
+This allows for the other kind of *cascading*, an iterative approach
+described :ref:`above <entity_resolution_step>`.
+The sort of cascading represented by the looping section in this diagram is
+the kind in which *links*
+are confirmed before moving to the next iteration.
+There is another kind of cascading, in which *clusters* are confirmed
+and transitivity is enforced.
+That kind of cascading is represented by the looping section in :ref:`the top-level pipeline schema <easylink_pipeline_schema>`.
+**Examples:**
+- A single PVS pass *within* a module, such as the first pass
+  of GeoSearch, which `as of 2014 <https://www.census.gov/content/dam/Census/library/working-papers/2014/adrm/carra-wp-2014-02.pdf>`_
+  used blocking on the Master Address File (MAF) ID.
+- In Splink, this step would correspond to estimating parameters and making pairwise predictions (possibly with a threshold)
+Links to clusters
+^^^^^^^^^^^^^^^^^
+**Interpretation:**
+Converting *links* (pairs of records that are linked) to *clusters* (sets of records that are all mutually linked).
+This implies resolving issues with transitivity: if A links to B
+and B links to C, A must link to C.
+Resolving these issues requires making after-the-fact corrections
+to some of the links found, taking advantage of the context provided
+by other links.
+Making these corrections outside the linkage model is not ideal,
+but this is the price paid in return for the simplicity of the pairwise approach.
+Clusters are also much more conducive to representing *other* structural
+constraints the analyst may have, such as a one-to-one link between two files.
+We expect that these constraints will typically be enforced during this step.
+**Examples:**
+- The simplest algorithm is finding the
+  `components <https://en.wikipedia.org/wiki/Component_(graph_theory)>`_
+  (also called "connected components")
+  of the graph created by giving every record a node
+  and every pair (with probability above a threshold) an edge.
+  This is implemented `in Splink <https://moj-analytical-services.github.io/splink/api_docs/clustering.html>`_.
+- In PVS, the algorithm incorporates the restriction
+  that multiple records from the *reference* file
+  should never be in the same cluster.
+  Therefore, the links are filtered before going
+  into connected components:
+  only the link with the highest probability for
+  each input file record is kept, and if there are
+  ties for the highest probability, no links
+  involving that input file record are kept.
+  This is described `here <https://www.census.gov/content/dam/Census/library/working-papers/2014/adrm/carra-wp-2014-02.pdf>`_
+  as a "post-search program."
+- In other Census Bureau processes such as the linkage of
+  the Post Enumeration Survey (PES) to the Census,
+  there is a 1-to-1 restriction: there can only be one record
+  from each file in a cluster.
+  This is achieved by finding the matching such that the
+  sum of the (logit) probabilities of the accepted matches
+  is maximized, as described in `Jaro (1989) <https://www.jstor.org/stable/2289924?seq=4>`_.
+.. note::
+   None of the methods in this list are able to
+   propagate the uncertainty represented by the pairwise probabilities
+   through this step, e.g. by *sampling* clusters somehow.
+   Further research is needed in this area.

easylink 0.1.14__tar.gz → 0.1.16__tar.gz

easylink 0.1.14tar.gz → 0.1.16tar.gz