PyPI - endnote-utils - Versions diffs - 0.1.3__tar.gz → 0.2.0__tar.gz - Mend

endnote-utils 0.1.3__tar.gz → 0.2.0__tar.gz

Files changed (19)

endnote_utils-0.2.0/PKG-INFO ADDED

@@ -0,0 +1,223 @@
+Metadata-Version: 2.4
+Name: endnote-utils
+Version: 0.2.0
+Summary: Convert EndNote XML to CSV/JSON/XLSX with streaming parse and TXT report.
+Author-email: Minh Quach <minhquach8@gmail.com>
+License: MIT
+Keywords: endnote,xml,csv,bibliography,research
+Classifier: Programming Language :: Python :: 3
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Operating System :: OS Independent
+Requires-Python: >=3.8
+Description-Content-Type: text/markdown
+Requires-Dist: openpyxl>=3.1.0
+# EndNote Utils
+Convert **EndNote XML files** into clean CSV/JSON/XLSX with automatic TXT reports.
+Supports both **Python API** and **command-line interface (CLI)**.
+---
+## Features
+- ✅ Parse one XML file (`--xml`) or an entire folder of `*.xml` (`--folder`)
+- ✅ Streams `<record>` elements using `iterparse` (low memory usage)
+- ✅ Extracts fields:
+  `database, ref_type, title, journal, authors, year, volume, number, abstract, doi, urls, keywords, publisher, isbn, language, extracted_date`
+- ✅ Adds a `database` column from the XML filename stem (`IEEE.xml → IEEE`)
+- ✅ Normalizes DOI (`10.xxxx` → `https://doi.org/...`)
+- ✅ Supports **multiple output formats**: CSV, JSON, XLSX
+- ✅ Generates a **TXT report** by default (path: `<out>_report.txt`; disable with `--no-report`) with:
+  - per-file counts (exported/skipped)
+  - totals, files processed
+  - run timestamp & duration
+  - **duplicate table** per database (Origin / Retractions / Duplicates / Remaining)
+  - optional duplicate key list (top-N)
+  - optional summary stats (year, ref_type, journal, top authors)
+- ✅ Auto-creates output folders if missing
+- ✅ Deduplication:
+  - `--dedupe doi` (unique by DOI)
+  - `--dedupe title-year` (unique by normalized title + year)
+  - `--dedupe-keep first|last` (keep first or last occurrence within each file)
+- ✅ Summary stats (`--stats`) with optional JSON export (`--stats-json`)
+- ✅ CLI options for CSV formatting, filters, verbosity
+- ✅ Importable Python API for scripting & integration
+---
+## Installation
+### From PyPI
+```bash
+pip install endnote-utils
+```
+Requires **Python 3.8+**.
+---
+## Usage
+### Command Line
+#### Single file
+```bash
+endnote-utils --xml data/IEEE.xml --out output/ieee.csv
+```
+#### Folder with multiple files
+```bash
+endnote-utils --folder data/xmls --out output/all_records.csv
+```
+#### Custom report path
+```bash
+endnote-utils \
+  --xml data/Scopus.xml \
+  --out output/scopus.csv \
+  --report reports/scopus_run.txt \
+  --stats \
+  --verbose
+```
+If `--report` is not provided, it defaults to `<out>_report.txt`.
+Use `--no-report` to disable report generation.
+---
+### CLI Options
+| Option          | Description                                         | Default            |
+| --------------- | --------------------------------------------------- | ------------------ |
+| `--xml`         | Path to a single EndNote XML file                   | –                  |
+| `--folder`      | Path to a folder containing multiple `*.xml` files  | –                  |
+| `--csv`         | (Legacy) Output CSV path                            | –                  |
+| `--out`         | Generic output path (`.csv`, `.json`, `.xlsx`)      | –                  |
+| `--format`      | Explicit format (`csv`, `json`, `xlsx`)             | inferred           |
+| `--report`      | Output TXT report path                              | `<out>_report.txt` |
+| `--no-report`   | Disable TXT report completely                       | –                  |
+| `--delimiter`   | CSV delimiter                                       | `,`                |
+| `--quoting`     | CSV quoting: `minimal`, `all`, `nonnumeric`, `none` | `minimal`          |
+| `--no-header`   | Suppress CSV header row                             | –                  |
+| `--encoding`    | Output text encoding                                | `utf-8`            |
+| `--ref-type`    | Only include records with this `ref_type` name      | –                  |
+| `--year`        | Only include records with this year                 | –                  |
+| `--max-records` | Stop after N records per file (for testing)         | –                  |
+| `--dedupe`      | Deduplicate mode: `none`, `doi`, `title-year`       | `none`             |
+| `--dedupe-keep` | Deduplication strategy: `first`, `last`             | `first`            |
+| `--stats`       | Include summary stats in TXT report                 | –                  |
+| `--stats-json`  | Path to JSON file to save stats & duplicate info    | –                  |
+| `--verbose`     | Verbose logging with debug details                  | –                  |
+---
+### Example Report (snippet)
+```
+========================================
+EndNote Export Report
+========================================
+Run started : 2025-09-11 14:30:22
+Files       : 4
+Duration    : 0.47 seconds
+Per-file results
+----------------------------------------
+GGScholar.xml    : 13 exported, 0 skipped
+IEEE.xml         : 2147 exported, 0 skipped
+PubMed.xml       : 504 exported, 0 skipped
+Scopus.xml       : 847 exported, 0 skipped
+TOTAL exported: 3511
+Duplicates table (by database)
+----------------------------------------
+Database        Origin   Retractions  Duplicates  Remaining
+------------------------------------------------------------
+GGScholar           179            0         27        152
+IEEE               1900            0        589       1311
+PubMed              320            0        225         95
+Scopus             1999            1        511       1489
+TOTAL              4410            1       1352       3047
+Duplicate keys (top)
+----------------------------------------
+Mode   : doi
+Keep   : first
+Removed: 1352
+Details (top):
+  10.1109/SPMB55497.2022.10014965 : 3 duplicate(s)
+  10.1109/TSSA63730.2024.10864368 : 2 duplicate(s)
+Summary stats
+----------------------------------------
+By year:
+   2022 : 569
+   2023 : 684
+   2024 : 1148
+   2025 : 1108
+By ref_type (top):
+  Journal Article: 2037
+  Conference Proceedings: 1470
+  Book Section: 4
+By journal (top 20):
+  IEEE Access: 175
+  IEEE Journal of Biomedical and Health Informatics: 67
+  ...
+Top authors (top 10):
+  Y. Wang: 50
+  X. Wang: 35
+  ...
+```
+---
+## Python API
+```python
+from pathlib import Path
+from endnote_utils import export, export_folder
+# Single file
+total, out_file, report_file = export(
+    Path("data/IEEE.xml"),
+    Path("output/ieee.csv"),
+    dedupe="doi", stats=True
+)
+# Folder
+total, out_file, report_file = export_folder(
+    Path("data/xmls"),
+    Path("output/all.csv"),
+    ref_type="Conference Proceedings",
+    year="2024",
+    dedupe="title-year",
+    dedupe_keep="last",
+    stats=True,
+    stats_json=Path("output/stats.json"),
+)
+```
+---
+## Development Notes
+* Pure Python; the core relies on the standard library (`argparse`, `csv`, `xml.etree.ElementTree`, `logging`, `pathlib`, `json`).
+* Runtime dependency: `openpyxl>=3.1.0` (used for Excel `.xlsx` export), declared in `pyproject.toml`.
+* Streaming XML parsing avoids high memory usage.
+* Deduplication strategies configurable (`doi` / `title-year`).
+* Report includes per-database table and optional JSON snapshot.
+* Follows [PEP 621](https://peps.python.org/pep-0621/) packaging (`pyproject.toml`).
+---
+## License
+MIT License © 2025 Minh Quach
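
The feature list above mentions DOI normalization (`10.xxxx` → `https://doi.org/...`) and deduplication by DOI or by normalized title + year, but this part of the diff does not show how `core` implements them. The following is a minimal sketch of one way those rules could work, not the package's actual code. The names `normalize_doi`, `title_year_key`, and `dedupe`, the prefix-stripping regex, and the choice to never merge records that lack a key are all assumptions made for illustration.

```python
import re
from typing import Callable, Dict, Hashable, Iterable, List, Optional, Tuple

DOI_PREFIX = "https://doi.org/"


def normalize_doi(raw: Optional[str]) -> str:
    """Return the DOI as an https://doi.org/ URL, or '' when none is usable."""
    if not raw:
        return ""
    doi = raw.strip()
    # Strip common prefixes so bare DOIs and URL forms compare equal.
    doi = re.sub(r"^(https?://(dx\.)?doi\.org/|doi:\s*)", "", doi, flags=re.IGNORECASE)
    if not doi.startswith("10."):
        return ""
    return DOI_PREFIX + doi


def title_year_key(title: str, year: str) -> Tuple[str, str]:
    """Build a dedupe key that ignores case and punctuation in the title."""
    normalized = re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()
    return (normalized, year.strip())


def dedupe(
    rows: Iterable[Dict[str, str]],
    key_fn: Callable[[Dict[str, str]], Hashable],
    keep: str = "first",
) -> List[Dict[str, str]]:
    """Keep one row per key; rows with an empty key are never merged (assumption)."""
    kept: Dict[Hashable, Dict[str, str]] = {}
    for index, row in enumerate(rows):
        # An empty key (e.g. missing DOI) gets a unique placeholder instead.
        key = key_fn(row) or ("__no_key__", index)
        if keep == "last" or key not in kept:
            kept[key] = row
    # Output order follows the first appearance of each key.
    return list(kept.values())
```

With `keep="last"`, the sketch retains the last row seen for each key but still emits it at the position where that key first appeared; whether the real `--dedupe-keep last` preserves order this way is not stated in the README.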

endnote_utils-0.2.0/README.md ADDED

@@ -0,0 +1,209 @@
+# EndNote Utils
+Convert **EndNote XML files** into clean CSV/JSON/XLSX with automatic TXT reports.
+Supports both **Python API** and **command-line interface (CLI)**.
+---
+## Features
+- ✅ Parse one XML file (`--xml`) or an entire folder of `*.xml` (`--folder`)
+- ✅ Streams `<record>` elements using `iterparse` (low memory usage)
+- ✅ Extracts fields:
+  `database, ref_type, title, journal, authors, year, volume, number, abstract, doi, urls, keywords, publisher, isbn, language, extracted_date`
+- ✅ Adds a `database` column from the XML filename stem (`IEEE.xml → IEEE`)
+- ✅ Normalizes DOI (`10.xxxx` → `https://doi.org/...`)
+- ✅ Supports **multiple output formats**: CSV, JSON, XLSX
+- ✅ Generates a **TXT report** by default (path: `<out>_report.txt`; disable with `--no-report`) with:
+  - per-file counts (exported/skipped)
+  - totals, files processed
+  - run timestamp & duration
+  - **duplicate table** per database (Origin / Retractions / Duplicates / Remaining)
+  - optional duplicate key list (top-N)
+  - optional summary stats (year, ref_type, journal, top authors)
+- ✅ Auto-creates output folders if missing
+- ✅ Deduplication:
+  - `--dedupe doi` (unique by DOI)
+  - `--dedupe title-year` (unique by normalized title + year)
+  - `--dedupe-keep first|last` (keep first or last occurrence within each file)
+- ✅ Summary stats (`--stats`) with optional JSON export (`--stats-json`)
+- ✅ CLI options for CSV formatting, filters, verbosity
+- ✅ Importable Python API for scripting & integration
+---
+## Installation
+### From PyPI
+```bash
+pip install endnote-utils
+```
+Requires **Python 3.8+**.
+---
+## Usage
+### Command Line
+#### Single file
+```bash
+endnote-utils --xml data/IEEE.xml --out output/ieee.csv
+```
+#### Folder with multiple files
+```bash
+endnote-utils --folder data/xmls --out output/all_records.csv
+```
+#### Custom report path
+```bash
+endnote-utils \
+  --xml data/Scopus.xml \
+  --out output/scopus.csv \
+  --report reports/scopus_run.txt \
+  --stats \
+  --verbose
+```
+If `--report` is not provided, it defaults to `<out>_report.txt`.
+Use `--no-report` to disable report generation.
+---
+### CLI Options
+| Option          | Description                                         | Default            |
+| --------------- | --------------------------------------------------- | ------------------ |
+| `--xml`         | Path to a single EndNote XML file                   | –                  |
+| `--folder`      | Path to a folder containing multiple `*.xml` files  | –                  |
+| `--csv`         | (Legacy) Output CSV path                            | –                  |
+| `--out`         | Generic output path (`.csv`, `.json`, `.xlsx`)      | –                  |
+| `--format`      | Explicit format (`csv`, `json`, `xlsx`)             | inferred           |
+| `--report`      | Output TXT report path                              | `<out>_report.txt` |
+| `--no-report`   | Disable TXT report completely                       | –                  |
+| `--delimiter`   | CSV delimiter                                       | `,`                |
+| `--quoting`     | CSV quoting: `minimal`, `all`, `nonnumeric`, `none` | `minimal`          |
+| `--no-header`   | Suppress CSV header row                             | –                  |
+| `--encoding`    | Output text encoding                                | `utf-8`            |
+| `--ref-type`    | Only include records with this `ref_type` name      | –                  |
+| `--year`        | Only include records with this year                 | –                  |
+| `--max-records` | Stop after N records per file (for testing)         | –                  |
+| `--dedupe`      | Deduplicate mode: `none`, `doi`, `title-year`       | `none`             |
+| `--dedupe-keep` | Deduplication strategy: `first`, `last`             | `first`            |
+| `--stats`       | Include summary stats in TXT report                 | –                  |
+| `--stats-json`  | Path to JSON file to save stats & duplicate info    | –                  |
+| `--verbose`     | Verbose logging with debug details                  | –                  |
+---
+### Example Report (snippet)
+```
+========================================
+EndNote Export Report
+========================================
+Run started : 2025-09-11 14:30:22
+Files       : 4
+Duration    : 0.47 seconds
+Per-file results
+----------------------------------------
+GGScholar.xml    : 13 exported, 0 skipped
+IEEE.xml         : 2147 exported, 0 skipped
+PubMed.xml       : 504 exported, 0 skipped
+Scopus.xml       : 847 exported, 0 skipped
+TOTAL exported: 3511
+Duplicates table (by database)
+----------------------------------------
+Database        Origin   Retractions  Duplicates  Remaining
+------------------------------------------------------------
+GGScholar           179            0         27        152
+IEEE               1900            0        589       1311
+PubMed              320            0        225         95
+Scopus             1999            1        511       1489
+TOTAL              4410            1       1352       3047
+Duplicate keys (top)
+----------------------------------------
+Mode   : doi
+Keep   : first
+Removed: 1352
+Details (top):
+  10.1109/SPMB55497.2022.10014965 : 3 duplicate(s)
+  10.1109/TSSA63730.2024.10864368 : 2 duplicate(s)
+Summary stats
+----------------------------------------
+By year:
+   2022 : 569
+   2023 : 684
+   2024 : 1148
+   2025 : 1108
+By ref_type (top):
+  Journal Article: 2037
+  Conference Proceedings: 1470
+  Book Section: 4
+By journal (top 20):
+  IEEE Access: 175
+  IEEE Journal of Biomedical and Health Informatics: 67
+  ...
+Top authors (top 10):
+  Y. Wang: 50
+  X. Wang: 35
+  ...
+```
+---
+## Python API
+```python
+from pathlib import Path
+from endnote_utils import export, export_folder
+# Single file
+total, out_file, report_file = export(
+    Path("data/IEEE.xml"),
+    Path("output/ieee.csv"),
+    dedupe="doi", stats=True
+)
+# Folder
+total, out_file, report_file = export_folder(
+    Path("data/xmls"),
+    Path("output/all.csv"),
+    ref_type="Conference Proceedings",
+    year="2024",
+    dedupe="title-year",
+    dedupe_keep="last",
+    stats=True,
+    stats_json=Path("output/stats.json"),
+)
+```
+---
+## Development Notes
+* Pure Python; the core relies on the standard library (`argparse`, `csv`, `xml.etree.ElementTree`, `logging`, `pathlib`, `json`).
+* Runtime dependency: `openpyxl>=3.1.0` (used for Excel `.xlsx` export), declared in `pyproject.toml`.
+* Streaming XML parsing avoids high memory usage.
+* Deduplication strategies configurable (`doi` / `title-year`).
+* Report includes per-database table and optional JSON snapshot.
+* Follows [PEP 621](https://peps.python.org/pep-0621/) packaging (`pyproject.toml`).
+---
+## License
+MIT License © 2025 Minh Quach
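
The README states that `<record>` elements are streamed with `iterparse` to keep memory usage low. A minimal sketch of that technique follows, using `xml.etree.ElementTree.iterparse` from the standard library. The function name `iter_records`, the helper `_text`, and the element paths `titles/title` and `dates/year` are assumptions about the EndNote XML layout made for illustration; the package's real extraction covers more fields and may differ.

```python
import io
import xml.etree.ElementTree as ET
from typing import Dict, Iterator


def _text(elem: ET.Element, path: str) -> str:
    """Join all text under `path` (EndNote often wraps text in <style> tags)."""
    node = elem.find(path)
    return "".join(node.itertext()).strip() if node is not None else ""


def iter_records(source) -> Iterator[Dict[str, str]]:
    """Yield one dict per <record>, without building the whole tree first."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "record":
            continue
        yield {
            "title": _text(elem, "titles/title"),  # assumed path
            "year": _text(elem, "dates/year"),     # assumed path
        }
        # Drop the record's children once it has been consumed.
        elem.clear()


if __name__ == "__main__":
    sample = (
        b"<xml><records>"
        b"<record><titles><title><style>A study</style></title></titles>"
        b"<dates><year>2024</year></dates></record>"
        b"</records></xml>"
    )
    for record in iter_records(io.BytesIO(sample)):
        print(record)
```

`source` can be a file path or a binary file object. Calling `elem.clear()` after each record is what keeps the working set small; without it, `iterparse` still accumulates the full tree in memory.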

{endnote_utils-0.1.3 → endnote_utils-0.2.0}/pyproject.toml RENAMED

@@ -4,8 +4,8 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "endnote-utils"
-version = "0.1.3"
-description = "Convert EndNote XML to CSV with streaming parse and TXT report."
+version = "0.2.0"
+description = "Convert EndNote XML to CSV/JSON/XLSX with streaming parse and TXT report."
 readme = { file = "README.md", content-type = "text/markdown" }
 requires-python = ">=3.8"
 license = {text = "MIT"}
@@ -17,7 +17,9 @@ classifiers = [
   "Operating System :: OS Independent",
 ]
-dependencies = []  # stdlib only
+dependencies = [
+    "openpyxl>=3.1.0"
+]
 [project.scripts]
 endnote-utils = "endnote_utils.cli:main"
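
The hunk above turns `openpyxl` into a required dependency, so every install pulls it in even when only CSV or JSON output is used. If the dependency were instead meant to be needed only for `.xlsx` export, one common pattern is to import it lazily at the point of use and fail with a clear message. The sketch below illustrates that pattern; `require_module` is a hypothetical helper and is not part of `endnote_utils`.

```python
import importlib
from types import ModuleType


def require_module(name: str, feature: str) -> ModuleType:
    """Import `name` on demand, raising a readable error if it is missing."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        raise RuntimeError(
            f"{feature} requires the '{name}' package; "
            f"install it with: pip install {name}"
        ) from exc
```

A writer for `.xlsx` output could then call `require_module("openpyxl", "XLSX export")` only when that format is selected, leaving CSV and JSON export free of third-party imports.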

endnote_utils-0.2.0/src/endnote_utils/cli.py ADDED

@@ -0,0 +1,186 @@
+from __future__ import annotations
+import argparse
+import logging
+import sys
+from pathlib import Path
+from typing import List, Optional, Tuple
+from .core import (
+    DEFAULT_FIELDNAMES,
+    export_files_with_report,  # generic writer: csv/json/xlsx
+)
+SUPPORTED_FORMATS = ("csv", "json", "xlsx")
+EXT_TO_FORMAT = {".csv": "csv", ".json": "json", ".xlsx": "xlsx"}
+def build_parser() -> argparse.ArgumentParser:
+    p = argparse.ArgumentParser(
+        description="Export EndNote XML (file or folder) to CSV/JSON/XLSX with a TXT report."
+    )
+    # Input source (mutually exclusive)
+    g = p.add_mutually_exclusive_group(required=True)
+    g.add_argument("--xml", help="Path to a single EndNote XML file.")
+    g.add_argument("--folder", help="Path to a folder containing *.xml files.")
+    # Output selection (CSV legacy flag + new generic flags)
+    p.add_argument(
+        "--csv",
+        required=False,
+        help="(Legacy) Output CSV path. Prefer --out for csv/json/xlsx.",
+    )
+    p.add_argument(
+        "--out",
+        required=False,
+        help="Generic output path; format inferred from file extension if --format not provided. "
+             "Supported extensions: .csv, .json, .xlsx",
+    )
+    p.add_argument(
+        "--format",
+        choices=SUPPORTED_FORMATS,
+        help="Output format. If omitted, inferred from --out extension or --csv.",
+    )
+    # Report controls
+    p.add_argument("--report", required=False, help="Path to TXT report (default: <output>_report.txt).")
+    p.add_argument(
+        "--no-report",
+        action="store_true",
+        help="Disable writing the TXT report (by default, a report is always generated).",
+    )
+    # CSV-specific formatting options (ignored for JSON/XLSX except delimiter/quoting/header)
+    p.add_argument("--delimiter", default=",", help="CSV delimiter (default: ',').")
+    p.add_argument(
+        "--quoting",
+        default="minimal",
+        choices=["minimal", "all", "nonnumeric", "none"],
+        help="CSV quoting mode (default: minimal).",
+    )
+    p.add_argument("--no-header", action="store_true", help="Do not write CSV header row.")
+    p.add_argument("--encoding", default="utf-8", help="Output text encoding (default: utf-8).")
+    # Filters / limits
+    p.add_argument("--ref-type", default=None, help="Filter by ref_type name.")
+    p.add_argument("--year", default=None, help="Filter by year.")
+    p.add_argument("--max-records", type=int, default=None, help="Max records per file (testing).")
+    # Deduplication & Stats
+    p.add_argument("--dedupe", choices=["none", "doi", "title-year"], default="none",
+                help="Deduplicate records by key. Default: none.")
+    p.add_argument("--dedupe-keep", choices=["first", "last"], default="first",
+                help="When duplicates found, keep the first or last occurrence. Default: first.")
+    p.add_argument("--stats", action="store_true",
+                help="Compute summary stats and include them in the TXT report.")
+    p.add_argument("--stats-json",
+                help="Optional JSON file path to write detailed stats (when --stats is used).")
+    p.add_argument("--top-authors", type=int, default=10,
+                help="How many top authors to list in the report/stats JSON. Default: 10.")
+    # Verbosity
+    p.add_argument("--verbose", action="store_true", help="Verbose logging.")
+    return p
+def _resolve_inputs(args: argparse.Namespace) -> List[Path]:
+    if args.xml:
+        xml_path = Path(args.xml)
+        if not xml_path.is_file():
+            raise FileNotFoundError(xml_path)
+        return [xml_path]
+    folder = Path(args.folder)
+    if not folder.is_dir():
+        raise FileNotFoundError(folder)
+    inputs = sorted(p for p in folder.glob("*.xml") if p.is_file())
+    if not inputs:
+        raise FileNotFoundError(f"No *.xml files found in folder: {folder}")
+    return inputs
+def _resolve_output_and_format(args: argparse.Namespace) -> Tuple[Path, str, Optional[Path]]:
+    """
+    Decide final out_path, out_format, and report_path using:
+      - Prefer --out/--format if provided
+      - Fallback to --csv (legacy) which implies CSV
+      - If --no-report, return report_path=None
+    """
+    target_path: Optional[Path] = None
+    out_format: Optional[str] = None
+    if args.out:
+        target_path = Path(args.out)
+        out_format = args.format
+        if not out_format:
+            # infer from extension
+            out_format = EXT_TO_FORMAT.get(target_path.suffix.lower())
+            if not out_format:
+                raise SystemExit(
+                    "Cannot infer output format from extension. "
+                    "Use --format {csv,json,xlsx} or set a supported extension."
+                )
+    elif args.csv:
+        target_path = Path(args.csv)
+        out_format = args.format or "csv"
+        if out_format != "csv":
+            # user asked for non-csv but used --csv path
+            raise SystemExit("When using --csv, --format must be 'csv'. Use --out for json/xlsx.")
+    else:
+        raise SystemExit("You must provide either --out (preferred) or --csv (legacy).")
+    # Report path defaults next to chosen output file (unless disabled)
+    if args.no_report:
+        report_path: Optional[Path] = None
+    else:
+        report_path = Path(args.report) if args.report else target_path.with_name(target_path.stem + "_report.txt")
+    return target_path, out_format, report_path
+def main() -> None:
+    args = build_parser().parse_args()
+    logging.basicConfig(
+        level=logging.DEBUG if args.verbose else logging.INFO,
+        format="%(levelname)s: %(message)s",
+        stream=sys.stderr,
+    )
+    try:
+        inputs = _resolve_inputs(args)
+        out_path, out_format, report_path = _resolve_output_and_format(args)
+        total, final_out, final_report = export_files_with_report(
+            inputs=inputs,
+            out_path=out_path,
+            out_format=out_format,
+            fieldnames=DEFAULT_FIELDNAMES,
+            delimiter=args.delimiter,
+            quoting=args.quoting,
+            include_header=not args.no_header,
+            encoding=args.encoding,
+            ref_type=args.ref_type,
+            year=args.year,
+            max_records_per_file=args.max_records,
+            dedupe=args.dedupe,
+            dedupe_keep=args.dedupe_keep,
+            stats=args.stats,
+            stats_json=Path(args.stats_json) if args.stats_json else None,
+            top_authors=args.top_authors,
+            report_path=report_path,  # may be None → core should skip writing report
+        )
+        logging.info("Exported %d record(s) → %s", total, final_out)
+        if report_path is None:
+            logging.info("Report disabled by --no-report.")
+        else:
+            logging.info("Report → %s", final_report)
+    except FileNotFoundError as e:
+        logging.error("File/folder not found: %s", e)
+        sys.exit(1)
+    except Exception as e:
+        logging.error("Unexpected error: %s", e)
+        sys.exit(2)
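
The output-resolution logic in `_resolve_output_and_format` can be exercised in isolation. The sketch below restates its two rules as standalone functions: format inference from the file extension, and the default report path next to the output file. `EXT_TO_FORMAT` and the `with_name(stem + "_report.txt")` expression are taken from the diff; the function names `infer_format` and `default_report_path`, and the use of `ValueError` instead of `SystemExit`, are assumptions made so the example can be tested without `argparse`.

```python
from pathlib import Path
from typing import Optional

EXT_TO_FORMAT = {".csv": "csv", ".json": "json", ".xlsx": "xlsx"}


def infer_format(out_path: Path, explicit: Optional[str] = None) -> str:
    """Return the explicit format if given, else infer it from the extension."""
    if explicit:
        # As in the CLI, an explicit format wins and the extension is not checked.
        return explicit
    fmt = EXT_TO_FORMAT.get(out_path.suffix.lower())
    if fmt is None:
        raise ValueError(f"Cannot infer output format from {out_path.suffix!r}")
    return fmt


def default_report_path(out_path: Path) -> Path:
    """Place the report beside the output file as <stem>_report.txt."""
    return out_path.with_name(out_path.stem + "_report.txt")
```

One consequence visible in the CLI code: when `--out` and `--format` are both given, the extension is never compared with the format, so `--out results.csv --format json` would write JSON into a file named `results.csv`.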
