PyPI - crossref-matcher - Versions diffs - 0.0.dev2317948835__tar.gz - Mend

crossref-matcher 0.0.dev2317948835__tar.gz


Files changed (129)

crossref_matcher-0.0.dev2317948835/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2025 Crossref
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

crossref_matcher-0.0.dev2317948835/MANIFEST.in ADDED

@@ -0,0 +1,5 @@
+include crossref_matcher/resources/plugins.json
+include crossref_matcher/resources/VERSION.txt
+recursive-include crossref_matcher/resources/data/models *.pkl *.gz *.local
+recursive-include crossref_matcher/resources/data/common_ror_endings *.txt
+recursive-include crossref_matcher/resources/data/countries *.txt

crossref_matcher-0.0.dev2317948835/PKG-INFO ADDED

@@ -0,0 +1,34 @@
+Metadata-Version: 2.4
+Name: crossref-matcher
+Version: 0.0.dev2317948835
+Summary: Crossref's matching API
+Home-page: https://gitlab.com/crossref/marple
+Author: Crossref
+License-File: LICENSE
+Requires-Dist: fastapi>=0.100.0
+Requires-Dist: uvicorn>=0.30.0
+Requires-Dist: pydantic>=2.0.0
+Requires-Dist: crossref-example-matching-strategy>=0.2.1
+Provides-Extra: evaluation
+Requires-Dist: scikit-learn>=1.7.0; extra == "evaluation"
+Provides-Extra: strategies-core
+Requires-Dist: boto3>=1.30.0; extra == "strategies-core"
+Requires-Dist: opensearch-py>=2.0.0; extra == "strategies-core"
+Requires-Dist: opensearch-dsl>=2.0.0; extra == "strategies-core"
+Requires-Dist: rapidfuzz>=3.0.0; extra == "strategies-core"
+Requires-Dist: unidecode>=1.3.0; extra == "strategies-core"
+Requires-Dist: fuzzywuzzy>=0.17.0; extra == "strategies-core"
+Requires-Dist: geonamescache>=1.3.0; extra == "strategies-core"
+Requires-Dist: requests>=2.30.0; extra == "strategies-core"
+Requires-Dist: ratelimit==2.2.0; extra == "strategies-core"
+Provides-Extra: test
+Requires-Dist: pytest>=8.4.2; extra == "test"
+Requires-Dist: anyio>=4.10.0; extra == "test"
+Requires-Dist: httpx>=0.28.1; extra == "test"
+Requires-Dist: jsonschema>=4.25.1; extra == "test"
+Dynamic: author
+Dynamic: home-page
+Dynamic: license-file
+Dynamic: provides-extra
+Dynamic: requires-dist
+Dynamic: summary

crossref_matcher-0.0.dev2317948835/README.md ADDED

@@ -0,0 +1,107 @@
+# Marple
+Crossref's matching service—codenamed Marple—provides the functionality for metadata matching:
+* run multiple matching tasks
+* implement multiple matching strategies
+* create and populate backend indexes for the matching strategies
+* evaluate the strategies against ground truth datasets
+## Terminology
+In short, matching is the task or process of finding an identifier of an item based on a structured or unstructured “description” of it. Examples include:
+* finding the DOI based on a bibliographic reference,
+* finding the ROR ID based on an affiliation string,
+* finding the grant DOI based on the acknowledgment section of a paper.
+There are also tasks that are technically not matching but are closely related, and in practice, we often treat them as matching tasks:
+* finding a duplicate of a journal article,
+* linking preprints to journal articles.
+And there are a few tasks that are often included in matching conversations, but are definitely not matching, for example:
+* retrieving the metadata of a work based on its DOI,
+* retrieving all works that contain the phrase “citation parsing” in the title.
+A matching task defines the nature of matching. Example matching tasks are bibliographic reference matching, preprint matching, and affiliation matching. A matching task has input and output:
+* Input is all the data needed for the matching, for example a structured record, a list of structured records, or unstructured text.
+* Output is simply the matched identifiers. Within a specific matching task, output identifiers are usually of a specific type (e.g. we match to ROR IDs, not ORCID iDs). In some cases, there can be a specific target database as well (e.g. we match only to DataCite DOIs). The cardinality of the output depends on the task: some matching tasks allow zero, one, or more identifiers as the result of matching a single input.
+A matching strategy defines how the matching is actually done. Multiple strategies can exist for a specific matching task. Some strategies can even run other strategies and combine their outcomes.
+## Evaluation
+We can evaluate a matching strategy using an evaluation set. See [Evaluation](crossref_matcher/evaluation/README.md) for more about this.
+## Using `crossref-matcher` as a library
+### Installation
+Install with pip: `pip install crossref-matcher`
+### Usage
+With the `crossref-matcher` library installed in your own project, you can import the `Strategy` class to develop your own matching strategies:
+```python
+from crossref_matcher import Strategy, MatchTask
+class MyMatchingStrategy(Strategy):
+    id = "my-matching-strategy"
+    task = MatchTask.REFERENCE
+    description = "Example strategy."
+    def __init__(self):
+        pass
+    def match(self, input_data):
+        # matching logic goes here
+        return []
+```
+Find more information on the `Strategy` class in [`crossref_matcher/strategies/__init__.py`](crossref_matcher/strategies/__init__.py).
+## Importing external strategies as plugins
+After installing an external strategy, you can add it to [`plugins.json`](crossref_matcher/resources/plugins.json) to use it.
+To see this in action, you can add the following entry to the `crossref_matcher.plugins.enabled` array in `plugins.json`:
+```json
+        "crossref_example_matching_strategy"
+```
+This will make available the [example matching strategy](https://gitlab.com/crossref/example-matching-strategy), which should already be installed (via `setup.py` and/or `requirements.txt`).
+## Development
+### Indexes
+Some strategies need an OpenSearch index to match. The [indexes directory](crossref_matcher/indexes/) contains scripts for creating and populating the indexes.
+Set the `ES_HOST` environment variable to point Marple to the OpenSearch cluster.
+### How to run
+Run
+```sh
+python -m crossref_matcher.app --host 0.0.0.0 --port 8000
+```
+and then visit http://localhost:8000/docs
+### Strategies
+Strategies are located in the [strategies directory](crossref_matcher/strategies/). They inherit from the [`Strategy` abstract base class](crossref_matcher/strategies/__init__.py).
+Strategies must be added to the "enabled" list in [plugins.json](crossref_matcher/resources/plugins.json) before they can be used.
+### Dependencies
+Requirements are defined in `setup.py`. Requirements files with exact versions can be updated with `pip-compile`, which is a part of `pip-tools`. First run `pip install pip-tools`. Then:
+* To update `requirements.txt` file: `pip-compile setup.py --extra strategies-core`
+* To update `requirements-tests.txt` file: `pip-compile setup.py --extra strategies-core --extra evaluation --extra test -o requirements-tests.txt`
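
The `Strategy` example in the README above leaves `match` empty. As a rough illustration of what a filled-in strategy could look like, here is a self-contained sketch. It does not import `crossref_matcher`: the `Strategy` base class below is a hypothetical stand-in, the lookup table and the `https://ror.org/example` identifier are made up, and the shape of the returned items (a list of dicts with an `"id"` key) is an assumption, not the package's documented schema.

```python
from abc import ABC, abstractmethod


class Strategy(ABC):
    """Hypothetical stand-in for crossref_matcher.Strategy (not the real class)."""

    @abstractmethod
    def match(self, input_data):
        ...


class ExactStringStrategy(Strategy):
    id = "exact-string-strategy"
    description = "Toy strategy: exact lookup in an in-memory table."

    def __init__(self):
        # Made-up lookup table; a real strategy would typically query an index.
        self.known = {"example university": "https://ror.org/example"}

    def match(self, input_data):
        # Normalize the input before the lookup.
        key = input_data.strip().lower()
        identifier = self.known.get(key)
        # Zero matches is a valid outcome, so return an empty list in that case.
        return [{"id": identifier}] if identifier else []


strategy = ExactStringStrategy()
print(strategy.match("  Example University "))
print(strategy.match("Unknown Institute"))
```

The second call returns an empty list, which mirrors the README's point that some matching tasks allow zero identifiers for a given input.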

crossref_matcher-0.0.dev2317948835/crossref_matcher/MatchTask.py ADDED

@@ -0,0 +1,36 @@
+from enum import Enum
+class MatchTask(Enum):
+    REFERENCE = (
+        "reference",
+        "Matching bibliographic references to works, such as "
+        + "journal articles, conference papers, etc.",
+    )
+    PREPRINT = ("preprint", "Matching journal articles to preprints.")
+    AFFILIATION = ("affiliation", "Matching affiliations to ROR IDs.")
+    FUNDER = ("funder", "Matching funders to ROR IDs.")
+    OTHER = ("other", "A generic matching task.")
+    HEALTHCHECK = (
+        "healthcheck",
+        "used internally to check that things are working properly",
+    )
+    def __new__(cls, value, description):
+        obj = object.__new__(cls)
+        obj._value_ = value
+        obj.description = description
+        return obj
+    @classmethod
+    def get_tasks(cls):
+        return [{"id": t.value, "description": t.description} for t in cls]
+    @classmethod
+    def get_ids(cls):
+        return [t.value for t in cls]
+    @property
+    def id(self):
+        # alias for "value"
+        return self.value
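
The `MatchTask` enum above attaches a description to each member by overriding `__new__` and setting `_value_` explicitly, so members can still be looked up by their short string value. A minimal self-contained sketch of the same pattern follows; `DemoTask` is a hypothetical two-member enum used only for illustration, not the package's class.

```python
from enum import Enum


class DemoTask(Enum):
    # Each member is declared as a (value, description) tuple.
    REFERENCE = ("reference", "Matching bibliographic references to works.")
    AFFILIATION = ("affiliation", "Matching affiliations to ROR IDs.")

    def __new__(cls, value, description):
        obj = object.__new__(cls)
        # Only the first tuple element becomes the enum value,
        # so lookup by value uses the short string.
        obj._value_ = value
        obj.description = description
        return obj


task = DemoTask("affiliation")  # lookup by value
print(task.name, task.value, task.description)

try:
    DemoTask("no-such-task")
except ValueError as exc:
    # Unknown values raise ValueError, which callers can map to an error response.
    print("lookup failed:", exc)
```

Because an unknown value raises `ValueError`, a caller can translate a failed lookup into an error such as a 404 response.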

crossref_matcher-0.0.dev2317948835/crossref_matcher/__init__.py ADDED

@@ -0,0 +1,6 @@
+from .MatchTask import MatchTask as MatchTask
+from .strategies import Strategy as Strategy
+from crossref_matcher.matching.utils import get_resource_path
+with open(get_resource_path("crossref_matcher.resources", "VERSION.txt"), "r") as f:
+    __version__ = f.read().strip()

crossref_matcher-0.0.dev2317948835/crossref_matcher/app.py ADDED

@@ -0,0 +1,248 @@
+import json
+import time
+from datetime import datetime
+from starlette.responses import JSONResponse
+from typing import Any
+from contextlib import asynccontextmanager
+from crossref_matcher import MatchTask, __version__
+from crossref_matcher.evaluation.api_routes import router as evaluation_router
+from crossref_matcher.matching import schemas
+from crossref_matcher.strategies import (
+    Strategy,
+    get_default_strategy,
+    list_strategies,
+    get_strategy,
+)
+from fastapi import FastAPI, HTTPException, Request, Query
+from typing import Union, Annotated
+import logging
+import uvicorn
+import uvicorn.logging
+logger = logging.getLogger(__name__)
+# logging.getLogger("matching").setLevel(logging.DEBUG)
+# handler = logging.StreamHandler()
+# handler.setFormatter(
+#     logging.Formatter(
+#         fmt="%(asctime)s %(name)s.%(lineno)d %(levelname)s : %(message)s",
+#         datefmt="%H:%M:%S",
+#     )
+# )
+# logging.getLogger("matching").addHandler(handler)
+class AsciiJSONResponse(JSONResponse):
+    def render(self, content: Any) -> bytes:
+        return json.dumps(content, ensure_ascii=True).encode("utf-8")
+Strategy.load_plugins()
+tags_metadata = [
+    {"name": "Health", "description": "Health check endpoints."},
+    {
+        "name": "Tasks and strategies",
+        "description": "Endpoints to list available matching tasks and strategies.",
+    },
+    {"name": "Matching", "description": "Endpoints to perform matching."},
+    {
+        "name": "Evaluation",
+        "description": "Endpoints to evaluate matching strategies on datasets.",
+    },
+]
+@asynccontextmanager
+async def lifespan(app: FastAPI):
+    # startup code goes here
+    yield
+    # shutdown code goes here
+app = FastAPI(
+    title="Crossref Matching API",
+    description="The Matching API matches structured "
+    "and unstructured data to identifiers.",
+    lifespan=lifespan,
+    openapi_tags=tags_metadata,
+)
+@app.middleware("http")
+async def add_headers(request: Request, call_next):
+    start_time = time.perf_counter()
+    start_time_iso = datetime.fromtimestamp(time.time()).isoformat()
+    response = await call_next(request)
+    process_time = time.perf_counter() - start_time
+    response.headers["X-Start-Time"] = start_time_iso
+    response.headers["X-Process-Time"] = str(process_time)
+    response.headers["X-Service-Version"] = __version__
+    return response
+app.include_router(evaluation_router)
+@app.get(
+    "/heartbeat",
+    response_model=schemas.HeartbeatResponse,
+    tags=["Health"],
+    name="",
+    description="Heartbeat",
+    response_class=AsciiJSONResponse,
+)
+async def heartbeat():
+    return {"status": "ok"}
+@app.get(
+    "/tasks",
+    response_model=schemas.TasksResponse,
+    tags=["Tasks and strategies"],
+    name="",
+    description="The list of supported matching tasks.",
+    response_class=AsciiJSONResponse,
+)
+async def list_tasks_endpoint():
+    from crossref_matcher.strategies import (
+        NoDefaultStrategyError,
+        MultipleDefaultStrategiesError,
+    )
+    tasks = MatchTask.get_tasks()
+    for task in tasks:
+        try:
+            default_strategy = get_default_strategy(task["id"])
+            task["default_strategy"] = default_strategy.id
+        except NoDefaultStrategyError:
+            if task["id"] == "other":
+                task["default_strategy"] = "N/A"
+            else:
+                task["default_strategy"] = "WARNING: No default strategy set"
+        except MultipleDefaultStrategiesError:
+            task["default_strategy"] = "WARNING: Multiple default strategies set"
+    return schemas.TasksResponse(message=schemas.TasksMessage(items=tasks))
+@app.get(
+    "/tasks/{task_id}/strategies",
+    response_model=schemas.StrategiesResponse,
+    tags=["Tasks and strategies"],
+    name="",
+    description="The list of strategies available for a given matching task.",
+    response_class=AsciiJSONResponse,
+)
+async def list_strategies_endpoint(task_id: str, include_disabled: bool = False):
+    task_id = task_id.replace("-matching", "")
+    try:
+        match_task = MatchTask(task_id)
+    except ValueError:
+        raise HTTPException(status_code=404, detail="No such matching task")
+    strategies = list_strategies(task=match_task.id, include_disabled=include_disabled)
+    strategies = [
+        {
+            "id": s.id,
+            "description": s.description,
+            "default": s.is_default(),
+            "disabled": s.is_disabled(),
+        }
+        for s in strategies
+    ]
+    return schemas.StrategiesResponse(
+        message=schemas.StrategiesMessage(items=strategies)
+    )
+def match(
+    task: str,
+    input_data: str,
+    strategy: Union[str, None] = None,
+):
+    task = task.replace("-matching", "")
+    try:
+        match_task = MatchTask(task)
+    except ValueError:
+        raise HTTPException(status_code=404, detail="No such matching task")
+    try:
+        strategy_class = get_strategy(strategy, match_task)
+    except ValueError as e:
+        raise HTTPException(status_code=404, detail=str(e))
+    try:
+        strategy_id = strategy_class.id
+        s = strategy_class()
+        items = s.match(input_data)
+        target_data = getattr(s, "target_data", None)
+        items = [schemas.MatchedItem.model_validate(i) for i in items]
+        return schemas.MatchedItemsResponse(
+            message=schemas.MatchedItemsMessage(
+                items=items, strategy=strategy_id, target_data=target_data
+            )
+        )
+    except NotImplementedError:
+        raise HTTPException(
+            status_code=400,
+            detail="Strategy is not implemented",
+        )
+@app.get(
+    "/match",
+    response_model=schemas.MatchedItemsResponse,
+    tags=["Matching"],
+    name="",
+    description="Match input to identifiers.",
+    response_class=AsciiJSONResponse,
+)
+async def match_get(
+    task: str,
+    input_data: Annotated[str, Query(alias="input")],
+    strategy: Union[str, None] = None,
+):
+    return match(task, input_data, strategy)
+@app.post(
+    "/match",
+    response_model=schemas.MatchedItemsResponse,
+    tags=["Matching"],
+    name="",
+    description="Match input to identifiers.",
+    response_class=AsciiJSONResponse,
+)
+async def match_post(
+    request: Request,
+    task: str,
+    strategy: Union[str, None] = None,
+):
+    input_data = (await request.body()).decode("UTF-8")
+    return match(task, input_data, strategy)
+def run_app(host="0.0.0.0", port=8000):
+    uvicorn.run(
+        "crossref_matcher.app:app",
+        host=host,
+        port=port,
+        reload=True,
+        log_level="debug",
+    )
+if __name__ == "__main__":
+    import argparse
+    parser = argparse.ArgumentParser(description="Run Crossref Matching API server")
+    parser.add_argument(
+        "--host", help="Host to run the server on", type=str, default="0.0.0.0"
+    )
+    parser.add_argument(
+        "--port", help="Port to run the server on", type=int, default=8000
+    )
+    args = parser.parse_args()
+    run_app(host=args.host, port=args.port)
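
Going by the route definitions above, `GET /match` reads the `task` and `input` query parameters plus an optional `strategy`, and the handler strips a `-matching` suffix from the task name before looking it up. The sketch below only builds such a request URL and does not send it. `BASE_URL`, `normalize_task`, and `build_match_url` are hypothetical names introduced for illustration, and the base URL assumes a server started locally on port 8000 as in the README.

```python
from urllib.parse import urlencode

# Assumption: the server runs locally on the port used in the README.
BASE_URL = "http://localhost:8000"


def normalize_task(task):
    # Mirrors the suffix stripping done in app.py, so that
    # "affiliation-matching" and "affiliation" name the same task.
    return task.replace("-matching", "")


def build_match_url(task, input_text, strategy=None):
    # "input" is the query-parameter alias declared for input_data.
    params = {"task": task, "input": input_text}
    if strategy is not None:
        params["strategy"] = strategy
    return f"{BASE_URL}/match?{urlencode(params)}"


print(normalize_task("affiliation-matching"))
print(build_match_url("affiliation", "Example University, Springfield"))
```

For longer inputs, the `POST /match` route above takes the input from the request body instead of the query string, while `task` and `strategy` remain query parameters.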

crossref_matcher-0.0.dev2317948835/crossref_matcher/evaluation/__init__.py ADDED

@@ -0,0 +1,13 @@
+import warnings
+from .api_routes import router as router
+try:
+    from .evaluation import (
+        get_matched_alternate as get_matched_alternate,
+        EvaluateSklearn as EvaluateSklearn,
+        run_strategy_on_eval_data as run_strategy_on_eval_data,
+    )
+except ImportError as e:
+    warnings.warn(
+        f"Error when importing evaluation module. Some functionality may be unavailable: {e}"
+    )

crossref_matcher-0.0.dev2317948835/crossref_matcher/evaluation/all_links.py ADDED

@@ -0,0 +1,185 @@
+from copy import deepcopy
+from typing import Iterable
+from crossref_matcher.evaluation.evaluation import get_matched_alternate
+def res_cat(true_link, matched_link):
+    if matched_link == "no_match":
+        if true_link == "no_match":
+            return "TN"
+        else:
+            return "FN"
+    else:
+        if matched_link == true_link:
+            return "TP"
+        else:
+            return "FP"
+def get_candidate_position(candidates, target) -> int | None:
+    if target == "no_match":
+        return None
+    if not candidates:
+        return None
+    for i, candidate in enumerate(candidates):
+        if candidate["id"] == target:
+            return i
+    return -1
+def construct_true_relaxed_list(this_result_true, alternates_negative):
+    for t in this_result_true:
+        if t in alternates_negative:
+            yield "no_match"
+        else:
+            yield t
+def get_matched_name(item, matched_id):
+    for match in item["extra"].get("matching_result", []):
+        if match["id"] == matched_id:
+            return match.get("matched_name")
+    return None
+def construct_matched_list(this_result_true, matched, alternates_negative=None):
+    # get a "matched" list the same length as the true list
+    _matched = deepcopy(matched)
+    this_result_matched = []
+    if len(_matched) == 0:
+        _matched = ["no_match"]
+    for t in this_result_true:
+        if t in _matched:
+            this_result_matched.append(_matched.pop(_matched.index(t)))
+        else:
+            this_result_matched.append(False)
+    for idx, val in enumerate(this_result_matched):
+        # replace False values with either the false positive match or "no_match"
+        if val is False:
+            if len(_matched) > 0:
+                item = _matched.pop()
+                if alternates_negative is not None and item in alternates_negative:
+                    # This will end up changing a true positive into a true negative.
+                    # This is a bug, but it's a tough one to fix.
+                    this_result_matched[idx] = "no_match"
+                else:
+                    this_result_matched[idx] = item
+            else:
+                this_result_matched[idx] = "no_match"
+    return this_result_matched
+def get_all_links(items: Iterable[dict]) -> list[dict]:
+    """
+    Convert evaluation items into a detailed flat list of link records for analysis.
+    These records can then be used to create a pandas DataFrame for further evaluation.
+    This function transforms evaluation data items into individual link records,
+    creating one record per true link in each item. Each record contains both
+    the expected (true) link and the actual (matched) link, along with metadata
+    for evaluation analysis.
+    The function handles both strict and relaxed matching scenarios:
+    - Strict matching: Direct comparison between true and matched links
+    - Relaxed matching: Accounts for alternate matches
+    Some of the fields included in the output records require the use of an extra_fn
+    function during evaluation to populate necessary metadata. See the docstring
+    of the `run_strategy_on_eval_data` function for details on how to implement this.
+    Args:
+        items: Iterable of evaluation data items. These can be obtained from a
+        list of Result objects by calling `r.model_dump() for r in results` or
+        from a ResultSet object by calling `r.model_dump() for r in result_set.results`
+    Returns:
+        List of link record dicts. Each record contains:
+            - "unique_id": Formatted unique identifier (seq_no + true_link)
+            - "seq_no": Original sequence number from input item
+            - "input": Original input text
+            - "true_link": The expected/true link for this record
+            - "matched": The link that was matched by the strategy
+            - "matched_relaxed": The matched link under relaxed criteria
+            - "res_cat": Result category (TP/TN/FP/FN) for strict matching
+            - "res_cat_relaxed": Result category for relaxed matching
+            - "matched_name": Human-readable name of the matched entity
+            - "target_in_initial_candidates": Position of true link in candidates
+            - "target_es_score": Elasticsearch score for the true link
+            - "weight": Weight value from original item
+    Note:
+        - Items with no expected output are treated as expecting "no_match"
+        - The function pairs true links with matched links, handling cases where
+          the counts don't match by using "no_match" or false positives
+        - Relaxed matching applies alternate matches as specified in the item's alternates
+    Example usage:
+        >>> # result_set is an instance of ResultSet obtained from evaluation.run_strategy_on_eval_data()
+        >>> data = get_all_links(r.model_dump() for r in result_set.results)
+        >>> import pandas as pd
+        >>> df = pd.DataFrame(data).set_index("unique_id")
+    """
+    data = []
+    for item in items:
+        if len(item["output"]) == 0:
+            this_result_true = ["no_match"]
+        else:
+            this_result_true = item["output"]
+        alternates_negative = [
+            x[1] for x in item.get("alternates", []) if x[0] == "no_match"
+        ]
+        this_result_true_relaxed = list(
+            construct_true_relaxed_list(this_result_true, alternates_negative)
+        )
+        this_result_matched = construct_matched_list(this_result_true, item["matched"])
+        this_result_matched_relaxed = construct_matched_list(
+            this_result_true_relaxed,
+            get_matched_alternate(item),
+            alternates_negative=alternates_negative,
+        )
+        for i, this_row_true in enumerate(this_result_true):
+            if this_row_true in this_result_matched:
+                this_row_matched = this_row_true
+            else:
+                this_row_matched = this_result_matched[0]
+            this_result_matched.remove(this_row_matched)
+            if this_result_true_relaxed[i] in this_result_matched_relaxed:
+                this_row_matched_relaxed = this_result_true_relaxed[i]
+            else:
+                this_row_matched_relaxed = this_result_matched_relaxed[0]
+            this_result_matched_relaxed.remove(this_row_matched_relaxed)
+            unique_id = f"""{item["seq_no"]:06}_{this_row_true}"""
+            target_in_initial_candidates = get_candidate_position(
+                item["extra"].get("initial_candidates", None), this_row_true
+            )
+            if (
+                target_in_initial_candidates is not None
+                and target_in_initial_candidates >= 0
+            ):
+                target_es_score = item["extra"]["initial_candidates"][
+                    target_in_initial_candidates
+                ]["elasticsearch_score"]
+            else:
+                target_es_score = None
+            matched_name = get_matched_name(item, this_row_matched)
+            data.append(
+                {
+                    "unique_id": unique_id,
+                    "seq_no": item["seq_no"],
+                    "input": item["input"],
+                    "true_link": this_row_true,
+                    "matched": this_row_matched,
+                    "matched_relaxed": this_row_matched_relaxed,
+                    "res_cat": res_cat(this_row_true, this_row_matched),
+                    "res_cat_relaxed": res_cat(
+                        this_result_true_relaxed[i], this_row_matched_relaxed
+                    ),
+                    "matched_name": matched_name,
+                    "target_in_initial_candidates": target_in_initial_candidates,
+                    "target_es_score": target_es_score,
+                    "weight": item.get("weight"),
+                }
+            )
+    return data
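
To make the result categories concrete, the sketch below copies the `res_cat` logic from the file above into a standalone function and tallies the categories for a handful of made-up (true link, matched link) pairs. The identifiers `A` to `E` are placeholders, and the tally is only an illustration, not the package's own evaluation routine.

```python
from collections import Counter


def res_cat(true_link, matched_link):
    # Same decision logic as res_cat in all_links.py.
    if matched_link == "no_match":
        return "TN" if true_link == "no_match" else "FN"
    return "TP" if matched_link == true_link else "FP"


# Made-up (true_link, matched_link) pairs.
pairs = [
    ("A", "A"),                 # correct match
    ("B", "no_match"),          # missed link
    ("no_match", "no_match"),   # correctly left unmatched
    ("C", "D"),                 # wrong identifier
    ("no_match", "E"),          # spurious match
]

counts = Counter(res_cat(true, matched) for true, matched in pairs)
print(dict(counts))
```

Note that a wrong identifier and a spurious match both count as `FP` under this scheme.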

crossref_matcher-0.0.dev2317948835/crossref_matcher/evaluation/api_routes.py ADDED

@@ -0,0 +1,20 @@
+import json
+from starlette.responses import JSONResponse
+from typing import Any
+from fastapi import APIRouter
+class AsciiJSONResponse(JSONResponse):
+    def render(self, content: Any) -> bytes:
+        return json.dumps(content, ensure_ascii=True).encode("utf-8")
+router = APIRouter(
+    prefix="/evaluate",
+    tags=["Evaluation"],
+)
+@router.get("/", tags=["Evaluation"], response_class=AsciiJSONResponse)
+async def evaluation_coming_soon():
+    return {"message": "Evaluation endpoints coming soon!"}
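
The `AsciiJSONResponse` class above serializes with `ensure_ascii=True`, so non-ASCII characters leave the server as `\uXXXX` escapes. A small standard-library sketch of that effect, using a made-up payload:

```python
import json

# Made-up payload containing non-ASCII characters.
content = {"name": "Universität Zürich"}

# Same call shape as AsciiJSONResponse.render.
payload = json.dumps(content, ensure_ascii=True).encode("utf-8")
print(payload)

# The escapes are plain JSON, so any JSON parser restores the original text.
print(json.loads(payload)["name"])
```

The response body is pure ASCII on the wire, and clients recover the original characters when they parse the JSON.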