litsync-0.0.2-py3-none-any.whl

@@ -0,0 +1,125 @@
+ Metadata-Version: 2.4
+ Name: litsync
+ Version: 0.0.2
+ Summary: Incremental mirror for PubMed, PMC, FDA, and ClinicalTrials.gov
+ Author: Literature Downloader Contributors
+ Author-email: Rahul Brahma <rahul.brahma@uni-greifswald.de>
+ License: MIT
+ Project-URL: Homepage, https://github.com/takshan/litsync
+ Project-URL: Repository, https://github.com/takshan/litsync
+ Keywords: pubmed,pmc,fda,clinicaltrials,biomedical,mirror
+ Classifier: Development Status :: 4 - Beta
+ Classifier: Intended Audience :: Science/Research
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Programming Language :: Python :: 3.13
+ Requires-Python: >=3.10
+ Description-Content-Type: text/markdown
+ Requires-Dist: requests>=2.31
+ Requires-Dist: rich>=13.0
+
+ # litsync — incremental PubMed + PMC + FDA + ClinicalTrials.gov mirror
+
+ A modern, daily-runnable CLI for mirroring bulk biomedical datasets. It tracks every
+ file in a SQLite state DB so re-runs do the minimum work: already-verified immutable
+ files are skipped with no network request beyond the directory/manifest listing.
+
+ ## Install
+
+ ```bash
+ pip install -e .
+ ```
+
+ Or use the Makefile:
+
+ ```bash
+ make install
+ make dev
+ ```
+
+ ## Quick start
+
+ ```bash
+ litsync --data-root /data/literature --email you@institute.org
+ ```
+
+ Common options:
+
+ ```bash
+ litsync --data-root /data/literature --email you@institute.org \
+     --sources pubmed pmc fda clinicaltrials \
+     --fda-endpoints drug/event drug/label
+ ```
+
+ ```bash
+ --sources pubmed pmc fda clinicaltrials    # which corpora (default: all four)
+ --fda-endpoints drug/event drug/label      # default: all openFDA endpoints
+ --pmc-groups oa_comm oa_noncomm oa_other
+ --pmc-formats xml txt                      # default: xml
+ --workers 4                                # concurrent downloads (keep modest; be polite)
+ --dry-run                                  # plan only, download nothing
+ --reverify                                 # re-download local files (integrity audit)
+ --prune                                    # delete local files no longer on the server
+ --count-articles                           # count articles in already-downloaded files (no network)
+ --no-rich                                  # disable Rich progress bars / tables
+ ```
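
The flags listed above could be declared with `argparse` roughly as follows. This is only a sketch: the flag names and the stated defaults (all four sources, `xml` for PMC formats) come from the README, while `build_parser`, the `required=True` settings, the `--workers` default of 4, and the use of `argparse` at all are assumptions. The real `litsync/cli.py` is not shown in this diff.

```python
import argparse

ALL_SOURCES = ["pubmed", "pmc", "fda", "clinicaltrials"]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags documented in the README."""
    p = argparse.ArgumentParser(prog="litsync")
    p.add_argument("--data-root", required=True)  # assumed mandatory
    p.add_argument("--email", required=True)      # assumed mandatory
    p.add_argument("--sources", nargs="+", default=ALL_SOURCES)
    p.add_argument("--fda-endpoints", nargs="+", default=None)  # None means "all endpoints"
    p.add_argument("--pmc-groups", nargs="+", default=None)     # README states no default
    p.add_argument("--pmc-formats", nargs="+", default=["xml"])
    p.add_argument("--workers", type=int, default=4)            # assumed default
    for flag in ("--dry-run", "--reverify", "--prune", "--count-articles", "--no-rich"):
        p.add_argument(flag, action="store_true")
    return p


args = build_parser().parse_args(
    ["--data-root", "/data/literature", "--email", "you@institute.org",
     "--sources", "pubmed", "pmc", "--dry-run"]
)
print(args.sources, args.pmc_formats, args.dry_run)
```

Multi-value flags use `nargs="+"` so that `--sources pubmed pmc` parses into a list, matching the space-separated form used in the README examples.
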
+
+ ## On-disk layout
+
+ ```
+ /data/literature/
+   pubmed/baseline/                   pubmed26nXXXX.xml.gz (+ .md5 verified)
+   pubmed/updatefiles/                daily citation deltas
+   pmc/oa_bulk/<group>/<fmt>/         baseline + dated incremental .tar.gz
+   pmc/oa_file_list.csv               PMCID <-> PMID id map
+   fda/<category>/<endpoint>/         openFDA bulk snapshot zips + extracted JSON
+   clinicaltrials/ctg-public-xml.zip  ClinicalTrials.gov full XML dump
+   clinicaltrials/ctg-public-xml/     extracted study XML files
+   _state/state.sqlite                file ledger (status, size, mtime, md5, etag, attempts)
+   _state/logs/                       dated run logs
+   _state/litsync.lock                run lock (prevents overlapping cron runs)
+ ```
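
The ledger columns named in the layout (status, size, mtime, md5, etag, attempts) suggest a single table keyed per file. The sketch below shows one way such a ledger could work with the standard `sqlite3` module. The table name `files`, the `url` key, the status values, and the helper names are all hypothetical; the actual schema in `litsync/state.py` is not visible in this diff.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    url      TEXT PRIMARY KEY,            -- hypothetical key; the real ledger may key on path
    status   TEXT NOT NULL,               -- assumed values: 'pending', 'verified', 'failed'
    size     INTEGER,
    mtime    REAL,
    md5      TEXT,
    etag     TEXT,
    attempts INTEGER NOT NULL DEFAULT 0
)
"""


def open_ledger(path: str) -> sqlite3.Connection:
    """Open (or create) the ledger database and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def mark_verified(conn: sqlite3.Connection, url: str, size: int, md5: str) -> None:
    """Insert or update a row once a file has passed verification."""
    conn.execute(
        "INSERT INTO files (url, status, size, md5, attempts) "
        "VALUES (?, 'verified', ?, ?, 1) "
        "ON CONFLICT(url) DO UPDATE SET status = 'verified', size = excluded.size, "
        "md5 = excluded.md5, attempts = files.attempts + 1",
        (url, size, md5),
    )
    conn.commit()


def needs_download(conn: sqlite3.Connection, url: str) -> bool:
    """A file is skipped only when the ledger already records it as verified."""
    row = conn.execute("SELECT status FROM files WHERE url = ?", (url,)).fetchone()
    return row is None or row[0] != "verified"


conn = open_ledger(":memory:")
print(needs_download(conn, "pubmed/baseline/example.xml.gz"))  # unknown file
mark_verified(conn, "pubmed/baseline/example.xml.gz", 1234, "0" * 32)
print(needs_download(conn, "pubmed/baseline/example.xml.gz"))  # now recorded as verified
```

Checking the ledger before issuing any request is what would let a re-run skip already-verified immutable files without touching the network.
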
+
+ ## Cron (daily 02:30)
+
+ ```cron
+ 30 2 * * * /path/to/venv/bin/litsync --data-root /data/literature --email you@institute.org >> /data/literature/_state/cron.log 2>&1
+ ```
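
Because the cron entry fires every day regardless of whether the previous run has finished, the layout's `_state/litsync.lock` matters. One common POSIX approach to such a run lock is a non-blocking advisory lock via `fcntl.flock`, sketched below. Whether litsync uses `flock`, a PID file, or something else is not shown in this diff, and `acquire_run_lock` is a hypothetical name.

```python
import fcntl
import os
import tempfile


def acquire_run_lock(lock_path: str):
    """Return an open file holding an exclusive lock, or None if another run holds it."""
    fh = open(lock_path, "w")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting for the holder.
        fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        fh.close()
        return None
    return fh


lock_path = os.path.join(tempfile.mkdtemp(), "litsync.lock")
first = acquire_run_lock(lock_path)
second = acquire_run_lock(lock_path)  # simulates an overlapping run
print(first is not None, second is None)
first.close()  # closing the file releases the lock
```

The lock is tied to the open file, so it is released automatically if the process crashes, which avoids stale-lock cleanup.
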
+
+ ## Extract corpus to sharded JSONL
+
+ ```bash
+ litsync-extract --data-root /data/literature --out /data/corpus \
+     --sources pubmed pmc fda clinicaltrials
+ ```
+
+ Or with Make:
+
+ ```bash
+ make extract DATA_ROOT=/data/literature CORPUS_OUT=/data/corpus
+ make extract-test DATA_ROOT=/data/literature
+ ```
+
+ ## Integrity model
+
+ - **PubMed**: every `.xml.gz` is verified against its NCBI `.md5` sidecar.
+ - **PMC**: bulk packages have no md5 sidecar, so they are verified by `Content-Length`
+   and an `ETag` is recorded for change detection.
+ - **openFDA / ClinicalTrials.gov**: these sources publish full snapshots. The downloader
+   detects changed snapshots via `ETag` / `Last-Modified` / `Content-Length` and only
+   re-downloads when the snapshot changes. When a snapshot changes it is extracted
+   again next to the zip file.
+ - Downloads are atomic (`.part` -> rename) and resumable via HTTP Range.
+ - Exit code is non-zero if any file failed, so cron/monitoring can alert.
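
The MD5 sidecar check and the `.part` -> rename step from the list above can be combined so that a file only appears under its final name after it has been verified. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, the sidecar is assumed to look like either `MD5(name)= <hex>` or `<hex>  name`, and the package's own logic in `litsync/http.py` and `litsync/sync.py` is not visible here.

```python
import hashlib
import os
import tempfile


def md5_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file through MD5 so large archives are never loaded whole."""
    h = hashlib.md5(usedforsecurity=False)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def parse_md5_sidecar(text: str) -> str:
    """Extract the hex digest from a sidecar line (assumed formats, see lead-in)."""
    text = text.strip()
    if "=" in text:
        return text.split("=", 1)[1].strip().lower()
    return text.split()[0].lower()


def commit_if_valid(part_path: str, final_path: str, expected_md5: str) -> bool:
    """Promote a finished .part file only if its checksum matches."""
    if md5_of(part_path) != expected_md5:
        return False
    os.replace(part_path, final_path)  # atomic when both paths share a filesystem
    return True


workdir = tempfile.mkdtemp()
part = os.path.join(workdir, "example.xml.gz.part")
final = os.path.join(workdir, "example.xml.gz")
with open(part, "wb") as fh:
    fh.write(b"hello")
sidecar = "MD5(example.xml.gz)= " + hashlib.md5(b"hello").hexdigest()
print(commit_if_valid(part, final, parse_md5_sidecar(sidecar)))
```

Resuming an interrupted download would additionally send a `Range: bytes=<current .part size>-` request header and append to the `.part` file; that network step is left out so the sketch stays runnable offline.
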
+
+ ## Notes on sources
+
+ - **openFDA** bulk data is zipped JSON. The manifest is fetched from `https://api.fda.gov/download.json`.
+   Each endpoint partition becomes one downloaded/extracted unit.
+ - **ClinicalTrials.gov** bulk data is the full public XML dump from
+   `https://clinicaltrials.gov/api/legacy/public-xml?format=zip`. One XML file per study.
+ - Both sources are snapshots, not daily deltas. Daily runs are still cheap because unchanged
+   snapshots are skipped; changed snapshots are replaced in full.
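
Skipping unchanged snapshots comes down to comparing the validators recorded on the previous run with those reported by the server now. A possible decision function is sketched below; `snapshot_changed` and its conservative fallback are assumptions, not the package's actual logic. Plain dicts are used, so header names must match exactly (the `requests` library's response headers are case-insensitive, plain dicts are not).

```python
from typing import Mapping, Optional

VALIDATORS = ("ETag", "Last-Modified", "Content-Length")


def snapshot_changed(stored: Optional[Mapping[str, str]],
                     remote: Mapping[str, str]) -> bool:
    """Return True when the remote snapshot should be downloaded again.

    `stored` holds the validator values saved on the previous run (None if the
    snapshot was never downloaded); `remote` holds the current response headers.
    """
    if stored is None:
        return True
    compared = False
    for name in VALIDATORS:
        old, new = stored.get(name), remote.get(name)
        if old is None or new is None:
            continue  # validator missing on one side, so it cannot be compared
        compared = True
        if old != new:
            return True
    # With nothing comparable, err on the side of downloading again.
    return not compared


previous = {"ETag": '"abc"', "Content-Length": "1024"}
print(snapshot_changed(previous, {"ETag": '"abc"', "Content-Length": "1024"}))
print(snapshot_changed(previous, {"ETag": '"def"', "Content-Length": "1024"}))
```
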
+
@@ -0,0 +1,20 @@
+ litsync/__init__.py,sha256=6fgq3evCz5TI_zaDMgQlEtNAIIM4LZZI2gQg80nvirk,106
+ litsync/__main__.py,sha256=b3d0VEGsuMqtiRbeWxB4gBfpjDz-FGvBB-1JG4BCT0g,55
+ litsync/cli.py,sha256=J77IzGia27nJHkqJt32fMQsiEIgx1S0heKQq7oVGHHg,5617
+ litsync/config.py,sha256=C0zxfSByjy19pMQHRkLoaEXXIidoCw652EneywQ2qDY,934
+ litsync/extract.py,sha256=1Adrno9byh1Nwzbi9GiLEW3eiIC2jz5htQKVEEBNsX4,14034
+ litsync/http.py,sha256=ZQtciMiLsm6JORCeXCy4zFW4lT7PytfMD6FPnZIqbFY,4458
+ litsync/state.py,sha256=ZbAAyIJ0O3i-AZ75JsdzYs7eZ20fN_wTCNfqUXd82Iw,5536
+ litsync/sync.py,sha256=ZWBp6QkyI-z5Yo9kYoWhFkK43A42o2u45EIcALVLZB8,13439
+ litsync/ui.py,sha256=Gq9kFgDfbqbvY0sMwK-ailojGzZ3YqIx99VP3yk7kvo,8124
+ litsync/utils.py,sha256=c61EOnr2yjZFZQkW56l6mKjPGjILcsQEhf9-hpKzl3o,3788
+ litsync/sources/__init__.py,sha256=WGjjAqJTH38kzIKCW_sz8-LjeqN-Rchz_ir0lDDitSQ,312
+ litsync/sources/clinicaltrials.py,sha256=4ApojOWxTJqrMmZYrD7xpivLdXD3Ei6yZDmDxTlJgtk,804
+ litsync/sources/fda.py,sha256=bVXsnMedMjfpEZBvjnVfWITJZ6-OoSvE1gPsgzG4gAg,1689
+ litsync/sources/pmc.py,sha256=SYNrthBskxW-wryVmLAQR7HbBQRhXbsgvLtPN1KIahI,1719
+ litsync/sources/pubmed.py,sha256=Dd0up_7G0DReHTSQHa8GD3g7avsdMYPjUcbB13vs2oc,1157
+ litsync-0.0.2.dist-info/METADATA,sha256=HwndfiNJ4X2XTn8tCgPJVNbjjUQ3IPkBfUwhLDYZ1g8,4756
+ litsync-0.0.2.dist-info/WHEEL,sha256=aeYiig01lYGDzBgS8HxWXOg3uV61G9ijOsup-k9o1sk,91
+ litsync-0.0.2.dist-info/entry_points.txt,sha256=ymrMbrSu3BMLjArnxnjrxAs8f3-HTPRsGY6Yzo5JO7E,91
+ litsync-0.0.2.dist-info/top_level.txt,sha256=YaYYXYOLc9hpIJY40kKrD22kHKBbxKN8T_XE6oJfyLw,8
+ litsync-0.0.2.dist-info/RECORD,,
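
Each entry in the listing above has the form `path,sha256=<digest>,<size>`, where the digest is the URL-safe base64 encoding of the SHA-256 hash with trailing `=` padding removed, as defined by the wheel format. A reviewer who has the unpacked wheel could check an entry with something like the sketch below; the helper names are illustrative and not part of any packaging tool.

```python
import base64
import csv
import hashlib


def record_digest(data: bytes) -> str:
    """URL-safe base64 of the SHA-256 digest, without '=' padding."""
    digest = hashlib.sha256(data).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def parse_record_line(line: str):
    """Split one RECORD row into (path, algorithm, digest, size)."""
    path, hash_field, size = next(csv.reader([line]))
    algo, _, value = hash_field.partition("=")
    return path, algo, value, int(size) if size else None


def verify_entry(line: str, data: bytes) -> bool:
    """Check file contents against their RECORD row."""
    _path, algo, value, size = parse_record_line(line)
    if not value:
        return True  # RECORD's own row carries no hash or size; nothing to check
    return algo == "sha256" and size == len(data) and record_digest(data) == value


# An empty file hashes to a well-known value, which makes a convenient self-check.
print(verify_entry("pkg/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0", b""))
```
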
@@ -0,0 +1,5 @@
+ Wheel-Version: 1.0
+ Generator: setuptools (82.0.1)
+ Root-Is-Purelib: true
+ Tag: py3-none-any
+
@@ -0,0 +1,3 @@
+ [console_scripts]
+ litsync = litsync.cli:main
+ litsync-extract = litsync.cli:extract_command
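
The entry points file is INI-formatted: each line in `[console_scripts]` maps a command name to a `module:function` target, so both commands here resolve into `litsync.cli`. A short sketch of reading that mapping with the standard `configparser` module follows; `console_scripts` is an illustrative helper, and installers use their own parsing code.

```python
import configparser

ENTRY_POINTS_TXT = """\
[console_scripts]
litsync = litsync.cli:main
litsync-extract = litsync.cli:extract_command
"""


def console_scripts(text: str) -> dict:
    """Map each command name to its (module, attribute) target."""
    parser = configparser.ConfigParser(delimiters=("=",))
    parser.optionxform = str  # keep command names case-sensitive
    parser.read_string(text)
    scripts = {}
    for name, target in parser["console_scripts"].items():
        module, _, attr = target.partition(":")
        scripts[name] = (module.strip(), attr.strip())
    return scripts


for command, (module, attr) in console_scripts(ENTRY_POINTS_TXT).items():
    print(f"{command} -> {module}.{attr}()")
```
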
@@ -0,0 +1 @@
+ litsync