ref-management 1.0.3__tar.gz

@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2026 Akira Imamoto
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
@@ -0,0 +1,13 @@
+ Metadata-Version: 2.4
+ Name: ref-management
+ Version: 1.0.3
+ Summary: Manuscript Reference Toolkit (ARM) — extract, verify, and format references in research manuscripts
+ Author-email: Akira Imamoto <aimamoto@uchicago.edu>
+ License-File: LICENSE
+ Requires-Dist: bibtexparser>=1.4
+ Requires-Dist: python-docx>=1.0
+ Requires-Dist: biopython>=1.80
+ Requires-Dist: rapidfuzz>=3.0
+ Requires-Dist: requests>=2.28
+ Requires-Dist: citeproc-py>=0.6
+ Dynamic: license-file
@@ -0,0 +1,161 @@
+ # Manuscript Reference Toolkit ARM (Another Reference Manager v1-Revision 3)
+
+ ![Python Version](https://img.shields.io/badge/python-3.x-blue) ![License](https://img.shields.io/badge/license-MIT-green)
+
+ A comprehensive Python toolkit designed to extract, verify, correct, and format references in research manuscripts.
+
+ This **Revision 3** toolkit bridges the gap between rough drafts (which often contain raw references, incomplete metadata, or AI hallucinations) and a **finalized, submission-ready Word document**. It features a new **Universal CSL Formatting Engine**, an **Author-Year Bridge** (allowing you to draft with `(Author, Year)` and automatically convert to numeric formats if needed), and text-replacement algorithms that preserve your document's native fonts and formatting.
+
+ ## Key Features (r3 Updates)
+
+ * **Universal CSL Engine:** Powered by `citeproc-py`. Provide any Citation Style Language (`.csl`) file (e.g., from the Zotero Style Repository) to format your manuscript to a specific journal's requirements (e.g., Nature, Cell, APA).
+ * **The Author-Year Bridge:** Draft naturally with `(Smith, 2024)` in text. The pipeline fuzzy-matches the cited authors against your bibliography and converts the citations to whatever your CSL style requires (e.g., superscript `1–3`).
+ * **MDPI & Online Journal Preprocessor:** Automatically extracts article numbers from DOIs (e.g., isolating `903` from `genes15070903`) so that articles in online-only journals print with the correct article number as their page number.
+ * **Smart Bibliography Placement & Pagination:** Automatically detects trailing sections (like "Figure Legends" or "Tables") and inserts the formatted References between the main text and those sections, with clean page breaks.
+ * **Advanced Number Collapsing:** Collapses consecutive citation numbers into typographic ranges (e.g., converting `1, 2, 3` into `1–3`) in its own code, independent of CSL engine quirks.
+ * **Dual-Database Verification:** Falls back to **Crossref** if a DOI is not found in **PubMed** (useful for statistics journals or older journals).
+ * **Smart Shields:** Protects $CV^2$, $R^2$, `Tyr530`, and $1 \times 10^5$ from being misread as citations.
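To make two of the features above concrete, here is a minimal sketch of article-number extraction and number collapsing. It is not taken from the package: the function names `mdpi_article_number` and `collapse_ranges` are hypothetical, and the DOI rule assumes that the last four digits of an MDPI suffix are the article number, which would not hold for article numbers of five or more digits.

```python
import re

# Assumption: an MDPI DOI suffix is <journal><volume><issue><4-digit article number>.
MDPI_SUFFIX = re.compile(r"^(?:10\.3390/)?[a-z]+\d*?(\d{4})$", re.IGNORECASE)


def mdpi_article_number(doi):
    """Return the article number from an MDPI-style DOI, or None if it does not fit."""
    match = MDPI_SUFFIX.match(doi.strip())
    if not match:
        return None
    # Drop leading zeros: "0903" becomes "903".
    return str(int(match.group(1)))


def collapse_ranges(numbers):
    """Collapse runs of three or more consecutive integers into en-dash ranges."""
    nums = sorted(set(numbers))
    parts = []
    i = 0
    while i < len(nums):
        j = i
        # Extend j to the end of the current run of consecutive numbers.
        while j + 1 < len(nums) and nums[j + 1] == nums[j] + 1:
            j += 1
        if j - i >= 2:
            parts.append(f"{nums[i]}\u2013{nums[j]}")
        else:
            parts.extend(str(n) for n in nums[i:j + 1])
        i = j + 1
    return ",".join(parts)
```

Pairs such as `1, 2` are left uncollapsed in this sketch; whether the package treats two-item runs the same way is not stated in the README.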
+
+ ## Why ARM? (Advantages over Traditional Reference Managers)
+
+ While conventional reference managers (e.g., Zotero, Mendeley, EndNote) are highly effective for personal library curation, they frequently introduce friction during multi-author manuscript preparation. ARM is specifically designed to resolve these collaborative bottlenecks:
+
+ * **Decentralized Collaborative Drafting:** Traditional tools require all co-authors to synchronize a centralized library database or install proprietary Word plugins. ARM removes the dependency on a personal library. Co-authors can draft references in plain text (e.g., typing `(Author, Year)` or pasting raw, unformatted references at the bottom of the document), and the pipeline resolves and formats them.
+ * **Post-Hoc Resolution of Messy Drafts:** Instead of forcing authors to use a strict GUI to "insert" citations while writing, ARM acts as a post-processing compiler. It takes rough drafts (often containing incomplete metadata, inconsistent formatting, or AI-hallucinated citations) and standardizes them against records from the PubMed and Crossref APIs.
+ * **Intelligent Text & Math Protection:** Standard Word plugins often override native typography or mangle inline mathematics (mistaking superscript numbers for citations). ARM uses NLP-driven "Smart Shields" to protect scientific nomenclature and statistical notation (e.g., $R^2$, $CV^2$, $1 \times 10^5$).
+ * **Native Algorithmic Formatting:** Unlike traditional plugins that rely heavily on hidden Word XML field codes (which can corrupt documents when shared across different operating systems), ARM executes clean text-replacement algorithms that preserve your document's native fonts, margins, and layout.
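As an illustration of what a shield might check, the sketch below uses plain regular expressions in place of the NLP-driven approach the README describes, so it is a swapped-in simplification and not the package's method. The names `looks_like_exponent` and `is_residue_label`, and the token lists inside the patterns, are hypothetical.

```python
import re

# Hypothetical rule: text that ends in a statistic symbol, a power-of-ten base,
# or a unit suggests that a following superscript number is an exponent.
EXPONENT_CONTEXT = re.compile(
    r"(?:\b(?:R|r|CV|SD|SE)|\b10|\b(?:mm|cm|km|m|s|kg|g))$"
)

# Hypothetical rule: a three-letter amino acid code followed by a position.
RESIDUE_LABEL = re.compile(
    r"^(?:Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val)\d+$"
)


def looks_like_exponent(text_before):
    """True if the text right before a superscript number suggests math, not a citation."""
    return bool(EXPONENT_CONTEXT.search(text_before))


def is_residue_label(token):
    """True for residue labels such as Tyr530, whose digits are not citation numbers."""
    return bool(RESIDUE_LABEL.match(token))
```

A rule set like this will misjudge some cases (for example, a citation placed directly after "n = 10"), which is one reason a final visual review of the output remains necessary.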
+
+ ## Configuration (Important)
+
+ To query PubMed efficiently without hitting rate limits, you should configure your NCBI credentials.
+
+ **Option A: Environment Variables (Recommended)**
+ * **Mac/Linux:**
+ ```bash
+ export NCBI_EMAIL="your_email@example.com"
+ export NCBI_API_KEY="your_api_key"
+ ```
+ * **Windows (CMD):** In PowerShell, use `$env:NCBI_EMAIL = "your_email@example.com"` (and likewise for `NCBI_API_KEY`) instead of `set`.
+ ```cmd
+ set NCBI_EMAIL=your_email@example.com
+ set NCBI_API_KEY=your_api_key
+ ```
+
+ **Option B: Hardcoding**
+ You can edit the `Entrez.email` and `Entrez.api_key` lines directly at the top of the `ref_management/verify_bib.py` module.
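For readers wondering how Option A typically reaches Biopython, the following sketch reads the two variables and returns them in a form that can be assigned to `Bio.Entrez`. The helper name `ncbi_credentials` is hypothetical; the source of `verify_bib.py` is not shown in this diff, so this is an assumption about the general pattern, not a copy of the module's code.

```python
import os


def ncbi_credentials(env=os.environ):
    """Read NCBI_EMAIL and NCBI_API_KEY from a mapping; unset or blank values become None."""
    email = env.get("NCBI_EMAIL", "").strip()
    api_key = env.get("NCBI_API_KEY", "").strip()
    return {"email": email or None, "api_key": api_key or None}


# Typical use with Biopython (requires the biopython package):
#   from Bio import Entrez
#   creds = ncbi_credentials()
#   Entrez.email = creds["email"]
#   Entrez.api_key = creds["api_key"]
```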
+
+ ---
+
+ ## 📦 Installation
+
+ Install directly from PyPI with a single command:
+
+ ```bash
+ pip install ref-management
+ ```
+
+ This automatically installs all required dependencies (`bibtexparser`, `python-docx`, `biopython`, `rapidfuzz`, `requests`, `citeproc-py`) and creates the following CLI commands:
+
+ | Command | Description |
+ | :--- | :--- |
+ | `arm-format` | End-to-end pipeline wrapper |
+ | `arm-scan` | Scan & extract references from a `.docx` |
+ | `arm-verify` | Enrich a `.bib` file via PubMed / Crossref |
+ | `arm-apply` | Apply CSL formatting to the Word document |
+ | `arm-add-dois` | Append missing DOIs to an intermediate draft |
+ | `arm-report` | Generate a plain-text reference list from a `.bib` |
+
+ ---
+
+ ## 🚀 Workflow 1: Fully Automated Pipeline (Recommended)
+
+ Use this wrapper command to execute the entire extraction, verification, and formatting process automatically.
+
+ ```bash
+ arm-format "MyDraft.docx" --csl "nature"
+ ```
+
+ > **💡 Pro-Tip (Default Directory):** You can create a folder at `~/citation_styles/` and store all your downloaded `.csl` files from Zotero there. The pipeline will automatically search this folder, meaning you can simply type `--csl cell` instead of providing the full file path.
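The lookup described in the tip could work along these lines. This is a sketch under stated assumptions, not the package's code: `resolve_csl` is a hypothetical name, and the order of checks (explicit path first, then the default folder) is a guess at reasonable behavior.

```python
from pathlib import Path


def resolve_csl(name, style_dir=None):
    """Resolve a --csl value to a file path, or return None if nothing is found."""
    base = Path(style_dir) if style_dir else Path.home() / "citation_styles"
    # 1. Accept the value as-is when it already points at a file.
    candidate = Path(name).expanduser()
    if candidate.is_file():
        return candidate
    # 2. Otherwise treat it as a style name inside the default folder.
    filename = name if name.endswith(".csl") else f"{name}.csl"
    fallback = base / filename
    return fallback if fallback.is_file() else None
```

With a file at `~/citation_styles/cell.csl`, both `resolve_csl("cell")` and `resolve_csl("cell.csl")` would return that path in this sketch.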
+
+ **What it does:**
+ 1. Loads your desired journal style via the provided `.csl` file.
+ 2. Extracts raw references from your document.
+ 3. Downloads the missing metadata (Volume, Issue, Pages) from PubMed/Crossref.
+ 4. Rewrites your in-text citations natively.
+ 5. Inserts the formatted bibliography, applying page breaks so that your Tables and Figure Legends start cleanly on the next page.
+
+ * **Output:** `MyDraft_final_nature.docx`, plus diagnostic CSV/BibTeX files.
+
+ ---
+
+ ## Workflow 2: Partial / Step-by-Step Pipeline
+
+ If you want to manually inspect or edit the references between steps, you can run the modules individually.
+
+ ### Step 1: Scan & Extract
+ Reads the raw reference list at the bottom of your draft and maps each entry to a PMID/DOI.
+ ```bash
+ arm-scan "MyDraft.docx"
+ ```
+
+ ### Step 2: Verify & Enrich
+ Takes the extracted `.bib` file, queries PubMed/Crossref, and fills in missing Journal names, Volumes, and Authors.
+ ```bash
+ arm-verify "MyDraft_extracted.bib"
+ ```
+
+ ### Step 3: Apply to Manuscript via CSL Engine
+ Takes your verified references and applies them to the document using your target CSL style.
+ ```bash
+ arm-apply "MyDraft_extracted_verified.bib" "MyDraft.docx" --csl "nature.csl"
+ ```
+
+ ---
+
+ ## Troubleshooting
+
+ ### "Dependent" CSL Style Error
+ If the script aborts with an error stating that your `.csl` file is a **dependent style**, it means the file you downloaded from the Zotero Style Repository is just a lightweight link to a "parent" publisher style (e.g., *The EMBO Journal* uses *EMBO Press*).
+
+ **Solution:**
+ 1. Read the terminal error message. The script scans the XML and reports the exact name and URL of the parent style you need.
+ 2. Download that parent `.csl` file and place it in your `~/citation_styles/` folder.
+ 3. Rerun the script using the parent style.
+
+ *Example:* `arm-format "MyDraft.docx" --csl embo-press`
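A dependent style can be recognized by a `<link rel="independent-parent">` element inside its `<info>` block. The sketch below shows one way to detect it with the standard library; `find_parent_style` is a hypothetical name and the package's own check may differ.

```python
import xml.etree.ElementTree as ET

# XML namespace used by CSL style files.
CSL_NS = {"csl": "http://purl.org/net/xbiblio/csl"}


def find_parent_style(csl_xml):
    """Return the parent style URL if the CSL document is a dependent style, else None."""
    root = ET.fromstring(csl_xml)
    for link in root.findall("csl:info/csl:link", CSL_NS):
        if link.get("rel") == "independent-parent":
            return link.get("href")
    return None
```

For a file on disk, passing `Path("style.csl").read_bytes()` to the function works as well as passing a string.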
+
+ ---
+
+ ## Extra Tools
+
+ ### 1. Inject DOIs into an Intermediate Draft
+ To append clickable DOIs to the raw references of an intermediate draft (so co-authors can open the papers easily) *without* fully reformatting the document or changing in-text citations:
+ ```bash
+ arm-add-dois "MyDraft_extracted_verified.bib" "MyDraft.docx"
+ ```
+ * **Output:** `MyDraft_with_DOIs.docx` (your original draft, with `https://doi.org/...` appended to each reference that was missing it).
+
+ ### 2. Generate a Text Report
+ If you just want a clean text file of your references (without modifying a Word document), you can use the reporter command on any verified `.bib` file:
+ ```bash
+ arm-report "MyDraft_extracted_verified.bib"
+ ```
+ * **Output:** `MyDraft_extracted_verified_list.txt`
+
+ ---
+
+ ## Module Overview
+
+ | Module / Command | Purpose |
+ | :--- | :--- |
+ | **`arm-format`** | **The Wrapper:** Runs all steps automatically using the CSL Engine. |
+ | **`arm-scan`** | **The Auditor:** Scans `.docx` for raw refs, outputs CSV report and a raw `.bib` mapping. |
+ | **`arm-verify`** | **The Enrichment Engine:** Queries PubMed/Crossref to enrich missing metadata. |
+ | **`arm-apply`** | **The CSL Formatter:** Updates inline citations, protects math/fonts, and smartly paginates the Bibliography. |
+ | **`arm-add-dois`** | **The Linker:** Appends DOIs to raw reference lists for intermediate co-author drafts. |
+ | **`arm-report`** | **The Reporter:** Converts `.bib` files into clean `.txt` lists. |
+
+ ---
+
+ ## Disclaimer
+ *While this toolkit uses fuzzy matching, NLP shields, and official APIs to verify and map data, always perform a final visual review of the generated manuscript before submitting to a journal.*
@@ -0,0 +1,35 @@
+ [build-system]
+ requires = ["setuptools>=61.0"]
+ build-backend = "setuptools.build_meta"
+
+ [project]
+ name = "ref-management"
+ version = "1.0.3"
+ description = "Manuscript Reference Toolkit (ARM) — extract, verify, and format references in research manuscripts"
+ authors = [
+     { name = "Akira Imamoto", email = "aimamoto@uchicago.edu" }
+ ]
+ dependencies = [
+     "bibtexparser>=1.4",
+     "python-docx>=1.0",
+     "biopython>=1.80",
+     "rapidfuzz>=3.0",
+     "requests>=2.28",
+     "citeproc-py>=0.6",
+ ]
+
+ [project.urls]
+ Homepage = "https://github.com/aimamoto/ref_management"
+ Repository = "https://github.com/aimamoto/ref_management"
+
+ [project.scripts]
+ arm-format = "ref_management.auto_format:main"
+ arm-scan = "ref_management.scan_raw_refs:main"
+ arm-verify = "ref_management.verify_bib:main"
+ arm-apply = "ref_management.apply_citations:main"
+ arm-add-dois = "ref_management.add_dois:main"
+ arm-report = "ref_management.generate_report:main"
+
+ [tool.setuptools.packages.find]
+ where = ["."]
+ include = ["ref_management*"]
@@ -0,0 +1,3 @@
+ """ref_management – Manuscript Reference Toolkit (ARM)."""
+
+ __version__ = "1.0.3"  # keep in sync with the version in pyproject.toml
@@ -0,0 +1,132 @@
+ import sys
+ import re
+ import argparse
+ from pathlib import Path
+
+ # --- MONKEY PATCH FOR PYPARSING/BIBTEXPARSER COMPATIBILITY ---
+ # Older pyparsing releases lack DelimitedList; alias it to the name they provide.
+ import pyparsing
+ if not hasattr(pyparsing, 'DelimitedList'):
+     if hasattr(pyparsing, 'delimited_list'):
+         setattr(pyparsing, 'DelimitedList', pyparsing.delimited_list)
+     elif hasattr(pyparsing, 'delimitedList'):
+         setattr(pyparsing, 'DelimitedList', pyparsing.delimitedList)
+
+ import bibtexparser
+ from docx import Document
+ from rapidfuzz import fuzz
+
+ REF_HEADER_PATTERN = re.compile(r'^\s*(?:[0-9]+\.?\s*)?(?:REFERENCES|BIBLIOGRAPHY|LITERATURE CITED|WORKS CITED)\s*$', re.IGNORECASE)
+ POST_REF_PATTERN = re.compile(r'^\s*(?:Tables?|Figures?|Figure Legends?|Supplementary.*?|Appendices|Data Availability|Acknowledgements?|Author Contributions?|Funding|Conflict(?:s)? of Interest|Competing Interests?|(?:Table|Figure|Fig\.?)\s*\d+.*)$', re.IGNORECASE)
+
+ def clean_for_match(text: str) -> str:
+     """Removes punctuation and normalizes spacing for accurate fuzzy matching."""
+     if not text:
+         return ""
+     text = text.replace('{', '').replace('}', '')
+     return re.sub(r'[^\w\s]', '', text.lower()).strip()
+
+ def process_document(bib_path: Path, docx_path: Path, output_path: Path):
+     """Appends DOI URLs from bib_path to matching references in docx_path and saves to output_path."""
+     print(f"\nReading verified BibTeX: {bib_path.name}...")
+     try:
+         with open(bib_path, 'r', encoding='utf-8') as f:
+             bib_db = bibtexparser.load(f)
+     except Exception as e:
+         print(f"❌ ERROR reading BibTeX: {e}")
+         sys.exit(1)
+
+     # Build an index of cleaned titles to DOIs
+     doi_map = {}
+     for entry in bib_db.entries:
+         doi = entry.get('doi', '').strip()
+         title = entry.get('title', '').strip()
+         if doi and title:
+             # Strip a leading resolver URL (http/https, doi.org or dx.doi.org) or "doi:" prefix
+             clean_doi = re.sub(r'(?i)^(?:https?://(?:dx\.)?doi\.org/|doi:)\s*', '', doi).strip()
+             doi_map[clean_for_match(title)] = clean_doi
+
+     print(f"Loaded {len(doi_map)} DOIs from BibTeX.")
+     print(f"Scanning document: {docx_path.name}...")
+     doc = Document(str(docx_path))
+
+     # 1. Find the boundaries of the References section
+     ref_start_idx = -1
+     for i, p in enumerate(doc.paragraphs):
+         if REF_HEADER_PATTERN.match(p.text):
+             ref_start_idx = i
+             break
+
+     if ref_start_idx == -1:
+         print("❌ ERROR: Could not locate 'References' header in the document.")
+         sys.exit(1)
+
+     ref_end_idx = len(doc.paragraphs)
+     for i in range(ref_start_idx + 1, len(doc.paragraphs)):
+         text = doc.paragraphs[i].text.strip()
+         if text and POST_REF_PATTERN.match(text):
+             ref_end_idx = i
+             break
+
+     # 2. Iterate through the references and append DOIs
+     added_count = 0
+     already_had_count = 0
+
+     for i in range(ref_start_idx + 1, ref_end_idx):
+         para = doc.paragraphs[i]
+         text = para.text.strip()
+
+         # Skip empty lines or very short fragments
+         if len(text) < 20:
+             continue
+
+         # Check if a DOI is already present in this paragraph
+         if re.search(r'(?i)\bhttps?://(?:dx\.)?doi\.org\b', text) or re.search(r'(?i)\bdoi:', text):
+             already_had_count += 1
+             continue
+
+         # Fuzzy match the paragraph text against our BibTeX titles
+         best_match_doi = None
+         best_score = 85  # Minimum strictness threshold; a match must score above it
+
+         para_clean = clean_for_match(text)
+         for bib_title, doi in doi_map.items():
+             # partial_ratio suits this case because the title is a substring of the full reference paragraph
+             score = fuzz.partial_ratio(bib_title, para_clean)
+             if score > best_score:
+                 best_score = score
+                 best_match_doi = doi
+
+         if best_match_doi:
+             # Append the DOI natively to the paragraph
+             if not text.endswith('.'):
+                 para.add_run('.')
+
+             # Add the URL as a plain-text run; no hyperlink field or extra styling is applied
+             para.add_run(f" https://doi.org/{best_match_doi}")
+             added_count += 1
+
+     # 3. Save the patched draft
+     doc.save(str(output_path))
+     print(f"\nSuccess! Saved to {output_path.name}")
+     print(f" -> Found {already_had_count} references that already had DOIs.")
+     print(f" -> Dynamically matched and injected {added_count} missing DOIs.")
+
+ def main():
+     parser = argparse.ArgumentParser(description="Appends DOIs to the References section of an intermediate draft.")
+     parser.add_argument("bib", type=Path, help="The verified .bib file containing the DOIs")
+     parser.add_argument("doc", type=Path, help="The intermediate .docx file")
+     args = parser.parse_args()
+
+     if not args.bib.exists():
+         print(f"❌ ERROR: BibTeX file '{args.bib}' not found.")
+         sys.exit(1)
+     if not args.doc.exists():
+         print(f"❌ ERROR: Document '{args.doc}' not found.")
+         sys.exit(1)
+
+     output = args.doc.with_name(f"{args.doc.stem}_with_DOIs.docx")
+     process_document(args.bib, args.doc, output)
+
+ if __name__ == "__main__":
+     main()
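The section-boundary logic in `process_document` can be tried without a Word file. The sketch below applies the same two-pass idea to a plain list of paragraph strings. `reference_span` is a hypothetical helper, and `POST_REF` is a shortened copy of the module's `POST_REF_PATTERN` for readability.

```python
import re

REF_HEADER = re.compile(
    r'^\s*(?:[0-9]+\.?\s*)?(?:REFERENCES|BIBLIOGRAPHY|LITERATURE CITED|WORKS CITED)\s*$',
    re.IGNORECASE,
)
# Shortened for illustration; the module's pattern lists more section names.
POST_REF = re.compile(
    r'^\s*(?:Tables?|Figures?|Figure Legends?|Supplementary.*?|Acknowledgements?'
    r'|(?:Table|Figure|Fig\.?)\s*\d+.*)$',
    re.IGNORECASE,
)


def reference_span(paragraphs):
    """Return (first, end) indices of the reference entries, or None without a header."""
    start = next((i for i, t in enumerate(paragraphs) if REF_HEADER.match(t)), None)
    if start is None:
        return None
    end = len(paragraphs)
    # The first trailing-section heading after the header closes the span.
    for i in range(start + 1, len(paragraphs)):
        text = paragraphs[i].strip()
        if text and POST_REF.match(text):
            end = i
            break
    return start + 1, end
```

The returned pair follows Python slicing conventions, so `paragraphs[first:end]` yields the candidate reference entries.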