openrxiv 0.0.0 → 0.0.2
- package/README.md +224 -0
- package/dist/cli.cjs +8082 -35849
- package/package.json +8 -32
- package/dist/cli/src/api/api-client.d.ts +0 -96
- package/dist/cli/src/api/api-client.d.ts.map +0 -1
- package/dist/cli/src/api/api-client.js +0 -257
- package/dist/cli/src/aws/bucket-explorer.d.ts +0 -26
- package/dist/cli/src/aws/bucket-explorer.d.ts.map +0 -1
- package/dist/cli/src/aws/bucket-explorer.js +0 -220
- package/dist/cli/src/aws/config.d.ts +0 -18
- package/dist/cli/src/aws/config.d.ts.map +0 -1
- package/dist/cli/src/aws/config.js +0 -191
- package/dist/cli/src/aws/downloader.d.ts +0 -13
- package/dist/cli/src/aws/downloader.d.ts.map +0 -1
- package/dist/cli/src/aws/downloader.js +0 -115
- package/dist/cli/src/aws/month-lister.d.ts +0 -18
- package/dist/cli/src/aws/month-lister.d.ts.map +0 -1
- package/dist/cli/src/aws/month-lister.js +0 -90
- package/dist/cli/src/commands/batch-process.d.ts +0 -3
- package/dist/cli/src/commands/batch-process.d.ts.map +0 -1
- package/dist/cli/src/commands/batch-process.js +0 -557
- package/dist/cli/src/commands/config.d.ts +0 -3
- package/dist/cli/src/commands/config.d.ts.map +0 -1
- package/dist/cli/src/commands/config.js +0 -42
- package/dist/cli/src/commands/download.d.ts +0 -3
- package/dist/cli/src/commands/download.d.ts.map +0 -1
- package/dist/cli/src/commands/download.js +0 -76
- package/dist/cli/src/commands/list.d.ts +0 -3
- package/dist/cli/src/commands/list.d.ts.map +0 -1
- package/dist/cli/src/commands/list.js +0 -18
- package/dist/cli/src/commands/month-info.d.ts +0 -3
- package/dist/cli/src/commands/month-info.d.ts.map +0 -1
- package/dist/cli/src/commands/month-info.js +0 -213
- package/dist/cli/src/commands/summary.d.ts +0 -3
- package/dist/cli/src/commands/summary.d.ts.map +0 -1
- package/dist/cli/src/commands/summary.js +0 -249
- package/dist/cli/src/index.d.ts +0 -3
- package/dist/cli/src/index.d.ts.map +0 -1
- package/dist/cli/src/index.js +0 -35
- package/dist/cli/src/utils/batches.d.ts +0 -9
- package/dist/cli/src/utils/batches.d.ts.map +0 -1
- package/dist/cli/src/utils/batches.js +0 -61
- package/dist/cli/src/utils/batches.test.d.ts +0 -2
- package/dist/cli/src/utils/batches.test.d.ts.map +0 -1
- package/dist/cli/src/utils/batches.test.js +0 -119
- package/dist/cli/src/utils/default-server.d.ts +0 -3
- package/dist/cli/src/utils/default-server.d.ts.map +0 -1
- package/dist/cli/src/utils/default-server.js +0 -20
- package/dist/cli/src/utils/index.d.ts +0 -5
- package/dist/cli/src/utils/index.d.ts.map +0 -1
- package/dist/cli/src/utils/index.js +0 -5
- package/dist/cli/src/utils/meca-processor.d.ts +0 -28
- package/dist/cli/src/utils/meca-processor.d.ts.map +0 -1
- package/dist/cli/src/utils/meca-processor.js +0 -503
- package/dist/cli/src/utils/meca-processor.test.d.ts +0 -2
- package/dist/cli/src/utils/meca-processor.test.d.ts.map +0 -1
- package/dist/cli/src/utils/meca-processor.test.js +0 -123
- package/dist/cli/src/utils/months.d.ts +0 -36
- package/dist/cli/src/utils/months.d.ts.map +0 -1
- package/dist/cli/src/utils/months.js +0 -135
- package/dist/cli/src/utils/months.test.d.ts +0 -2
- package/dist/cli/src/utils/months.test.d.ts.map +0 -1
- package/dist/cli/src/utils/months.test.js +0 -209
- package/dist/cli/src/utils/requester-pays-error.d.ts +0 -6
- package/dist/cli/src/utils/requester-pays-error.d.ts.map +0 -1
- package/dist/cli/src/utils/requester-pays-error.js +0 -20
- package/dist/cli/src/version.d.ts +0 -3
- package/dist/cli/src/version.d.ts.map +0 -1
- package/dist/cli/src/version.js +0 -2
- package/dist/utils/src/biorxiv-parser.d.ts +0 -51
- package/dist/utils/src/biorxiv-parser.d.ts.map +0 -1
- package/dist/utils/src/biorxiv-parser.js +0 -126
- package/dist/utils/src/folder-structure.d.ts +0 -44
- package/dist/utils/src/folder-structure.d.ts.map +0 -1
- package/dist/utils/src/folder-structure.js +0 -207
- package/dist/utils/src/index.d.ts +0 -3
- package/dist/utils/src/index.d.ts.map +0 -1
- package/dist/utils/src/index.js +0 -3
package/README.md
ADDED
@@ -0,0 +1,224 @@

# openRxiv MECA Downloader CLI

A comprehensive command-line interface (CLI) tool to download, process, and manage openRxiv MECA (Manuscript Exchange Common Approach) files from AWS S3 for text and data mining purposes.

## Features

- **Multi-Server Support**: Works with both bioRxiv and medRxiv servers
- **AWS S3 Integration**: Connect to S3 buckets with requester-pays support
- **Content Exploration**: List, search, and browse available content by month or batch
- **Individual Downloads**: Download MECA files by DOI with API integration
- **Batch Processing**: Process large amounts of data with configurable concurrency
- **Content Summaries**: Get detailed information about preprints
- **Month/Batch Analysis**: Detailed metadata for specific time periods
- **XML Processing**: Robust handling of openRxiv XML files with entity replacement

## Installation

### Prerequisites

- Node.js 18.0.0 or higher
- AWS account with access to bioRxiv/medRxiv S3 buckets (`--requester-pays`)
- API key for bioRxiv API access

### Global Installation

```bash
npm install -g openrxiv
```

## Quick Start

### Summary

Get a summary of a bioRxiv/medRxiv preprint from a URL or DOI.

**Arguments:**

- `<url-or-doi>`: bioRxiv/medRxiv URL or DOI to summarize

**Options:**

- `-m, --more`: Show additional details and full abstract
- `-s, --server <server>`: Server to use: "biorxiv" or "medrxiv"

**Examples:**

```bash
openrxiv summary "10.1101/2024.05.08.593085"
openrxiv summary -m "10.1101/2024.05.08.593085"
openrxiv summary -s medrxiv "10.1101/2020.03.19.20039131" --more
```
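
Since `summary` accepts either a URL or a DOI while `download` takes a DOI only, it can help to reduce a preprint URL to its bare DOI first. The sketch below is one way to do that in the shell. `extract_doi` is a hypothetical helper, not part of the CLI, and it assumes the usual `.../content/<doi>v<N>` URL shape with an optional `.full` suffix.

```bash
# Hypothetical helper (not part of the openrxiv CLI): reduce a preprint URL,
# or an already bare DOI, to the bare DOI form used in the examples above.
# Assumes URLs shaped like https://www.biorxiv.org/content/<doi>v<N>[.full].
extract_doi() {
  # Drop any query string or fragment, then keep the "10.<registrant>/<suffix>" part.
  doi=$(printf '%s\n' "$1" |
    sed -n -e 's/[?#].*$//' -e 's|^.*\(10\.[0-9][0-9]*/[^/]*\).*$|\1|p')
  # Fail when nothing DOI-like was found.
  [ -n "$doi" ] || return 1
  # Strip a ".full..." page suffix and a trailing version marker such as "v1".
  printf '%s\n' "$doi" | sed -e 's/\.full.*$//' -e 's/v[0-9][0-9]*$//'
}
```

For example, `openrxiv summary "$(extract_doi "$url")"` would pass the bare DOI whichever form `$url` holds.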

### Download Files

Download MECA files from the bioRxiv/medRxiv S3 buckets by DOI.

**Arguments:**

- `<doi>`: DOI of the paper (e.g., "10.1101/2024.05.08.593085")

**Options:**

- `-o, --output <dir>`: Output directory for downloaded files (default: "./downloads")
- `-a, --api-url <url>`: API base URL
- `--requester-pays`: Enable requester-pays for S3 bucket access

**Examples:**

```bash
openrxiv --requester-pays download "10.1101/2024.05.08.593085"
openrxiv --requester-pays download "10.1101/2024.05.08.593085" --output "./papers"
openrxiv --requester-pays download "10.1101/2024.05.08.593085" --api-url "https://custom-api.com"
```
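
The `download` command takes one DOI per invocation. To fetch a short list of papers, a small wrapper loop is one option. The sketch below uses only the flags documented above. `download_all` and the input file name are hypothetical, not something the package provides.

```bash
# Hypothetical wrapper (not part of the CLI): download every DOI read from
# standard input, one per line, skipping blank lines and "#" comments.
download_all() {
  out_dir=${1:-./downloads}   # same default as the documented --output option
  while IFS= read -r doi || [ -n "$doi" ]; do
    case $doi in
      ''|'#'*) continue ;;
    esac
    # </dev/null keeps the CLI from consuming the remaining list of DOIs.
    openrxiv --requester-pays download "$doi" --output "$out_dir" </dev/null ||
      printf 'failed: %s\n' "$doi" >&2
  done
}
```

Usage would look like `download_all ./papers < dois.txt`. Failures are reported on standard error and the loop continues with the next DOI.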

### List Bucket Contents

List available content in the bioRxiv or medRxiv S3 bucket.

**Options:**

- `-m, --month <month>`: Filter by specific month (e.g., "2024-01" or "January_2024")
- `-b, --batch <batch>`: Filter by specific batch (e.g., "Batch_01")
- `-l, --limit <number>`: Limit the number of results (default: 50)
- `-s, --server <server>`: Server to use: "biorxiv" or "medrxiv"

**Examples:**

```bash
# List bucket contents, optionally filtered by month or batch
openrxiv list
openrxiv list --month "2024-01"
openrxiv list --batch 1 --limit 100 --server medrxiv
```
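
The `--month` option is documented with two spellings, "2024-01" and "January_2024". When months come from mixed sources, normalizing them to one form before calling the CLI keeps scripts predictable. The following is a minimal sketch. `normalize_month` is a hypothetical helper and does not reflect how the CLI parses months internally.

```bash
# Hypothetical helper (not part of the CLI): convert "January_2024" to
# "2024-01" and pass values already in YYYY-MM form through unchanged.
normalize_month() {
  case $1 in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]) printf '%s\n' "$1"; return 0 ;;
  esac
  name=${1%%_*}   # text before the first underscore, e.g. "January"
  year=${1#*_}    # text after the first underscore, e.g. "2024"
  case $name in
    January) mm=01 ;;  February) mm=02 ;;  March) mm=03 ;;
    April) mm=04 ;;    May) mm=05 ;;       June) mm=06 ;;
    July) mm=07 ;;     August) mm=08 ;;    September) mm=09 ;;
    October) mm=10 ;;  November) mm=11 ;;  December) mm=12 ;;
    *) return 1 ;;
  esac
  # Reject anything that is not a four-digit year.
  case $year in
    [0-9][0-9][0-9][0-9]) ;;
    *) return 1 ;;
  esac
  printf '%s-%s\n' "$year" "$mm"
}
```

For example, `openrxiv list --month "$(normalize_month January_2024)"` passes `2024-01`.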

### Month/Batch Info

List detailed metadata for all files in a specific month or batch.

**Options:**

- `-m, --month <month>`: Month to list (e.g., "January_2024" or "2024-01")
- `-b, --batch <batch>`: Batch to list (e.g., "1", "batch-1", "Batch_01")
- `-s, --server <server>`: Server to use: "biorxiv" or "medrxiv"

**Examples:**

```bash
openrxiv batch-info --month "2024-01"
openrxiv batch-info --batch "1"
openrxiv batch-info --server medrxiv --month "2024-01"
```
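
The `--batch` option is documented as accepting several spellings ("1", "batch-1", "Batch_01"). If you keep your own records of processed batches, storing a single spelling avoids duplicates. The sketch below assumes `Batch_01` (two digits, zero-padded) as the canonical form, which the README does not confirm. `normalize_batch` is a hypothetical helper.

```bash
# Hypothetical helper (not part of the CLI): map "1", "batch-1" or "Batch_01"
# to one spelling. The "Batch_NN" target form is an assumption.
normalize_batch() {
  # Keep only the digits of the argument (a deliberately loose parse).
  num=$(printf '%s' "$1" | tr -cd '0-9')
  [ -n "$num" ] || return 1
  # Strip leading zeros so values such as "08" are not read as octal.
  num=$(printf '%s' "$num" | sed 's/^0*//')
  printf 'Batch_%02d\n' "${num:-0}"
}
```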

## Global Options

### `--requester-pays`

Enable requester-pays mode for S3 requests. The bioRxiv/medRxiv S3 buckets require requester pays for external access, so request and data-transfer costs are billed to your AWS account.

## Batch Processing

Batch process MECA files for a given month or batch.

**Options:**

**Time Selection:**

- `-m, --month <month>`: Month(s) to process. Supports: YYYY-MM, comma-separated list, or wildcard patterns
- `-b, --batch <batch>`: Batch to process. Supports: single batch, range, or comma-separated list

**Processing Control:**

- `-l, --limit <number>`: Maximum number of files to process
- `-c, --concurrency <number>`: Number of files to process concurrently (default: 1)
- `--force`: Force reprocessing of existing files
- `--dry-run`: List files without processing them

**Output Control:**

- `-o, --output <dir>`: Output directory for extracted files (default: "./batch-extracted")
- `--keep`: Keep MECA files after processing
- `--full-extract`: Extract entire MECA file instead of selective extraction
- `--max-file-size <size>`: Skip files larger than this size (e.g. 1GB)

**API Configuration:**

- `-a, --api-url <url>`: API base URL (default: "https://openrxiv.csf.now")
- `-k, --api-key <key>`: API key for authentication (or use OPENRXIV_BATCH_PROCESSING_API_KEY env var)
- `-s, --server <server>`: Server to use: "biorxiv" or "medrxiv"

**Examples:**

```bash
# Process specific month
openrxiv batch-process --month "2025-08" --requester-pays

# Process multiple months
openrxiv batch-process --month "2024-01,2024-02,2024-03" --requester-pays

# Dry run to see what would be processed
openrxiv batch-process --month "2025-08" --dry-run

# Process all of 2025
openrxiv batch-process --month "2025-*" --requester-pays

# Process with concurrency
openrxiv batch-process --month "2025-08" --concurrency 5 --requester-pays
```
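
Wildcards cover whole years, but a range that crosses a year boundary has to be written as a comma-separated list. The sketch below builds such a list for `--month`. `month_list` is a hypothetical helper, not part of the CLI, and it assumes both arguments are in YYYY-MM form.

```bash
# Hypothetical helper (not part of the CLI): print every month from START to
# END inclusive as a comma-separated YYYY-MM list suitable for --month.
month_list() {
  y=${1%-*};     m=${1#*-}
  end_y=${2%-*}; end_m=${2#*-}
  # Drop a leading zero so "08" and "09" are not treated as octal numbers.
  m=${m#0}; end_m=${end_m#0}
  list=
  while [ "$y" -lt "$end_y" ] || { [ "$y" -eq "$end_y" ] && [ "$m" -le "$end_m" ]; }; do
    list="${list:+$list,}$(printf '%04d-%02d' "$y" "$m")"
    m=$((m + 1))
    # Roll over into January of the next year.
    if [ "$m" -gt 12 ]; then m=1; y=$((y + 1)); fi
  done
  printf '%s\n' "$list"
}
```

A call such as `openrxiv batch-process --month "$(month_list 2024-11 2025-02)" --dry-run` would then preview four months in one run.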

## Configuration

If available, the tool reads AWS credentials from the default profile in your home directory (typically `~/.aws/credentials`).

### Environment Variables

You can also supply the API key and AWS credentials via environment variables:

```bash
export OPENRXIV_BATCH_PROCESSING_API_KEY="your-api-key"
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
```
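
Before starting a long batch run, it can be worth checking that the variables you rely on are actually set. The sketch below prints the names of any that are empty or unset. `missing_env` is a hypothetical helper. Note that missing AWS variables are not necessarily an error, because credentials may come from the default profile instead.

```bash
# Hypothetical helper (not part of the CLI): print the name of every listed
# environment variable that is unset or empty, one per line.
missing_env() {
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    [ -n "$value" ] || printf '%s\n' "$name"
  done
}
```

For example, `missing_env OPENRXIV_BATCH_PROCESSING_API_KEY AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY` prints nothing when all three are set.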

## Development

### Local Development

```bash
git clone https://github.com/continuous-foundation/openrxiv
cd openrxiv
npm install
```

### Building

```bash
npm run build
```

### Testing

```bash
npm test
npm run test:watch
```

### Linting & Formatting

```bash
npm run lint
npm run lint:format
```

## License

MIT License - see LICENSE file for details.

## Compliance

This tool is designed to comply with bioRxiv's and medRxiv's fair use policies:

- No content redistribution
- Link back to bioRxiv/medRxiv for indexing services
- Respect author copyright and licensing
- Intended for legitimate text and data mining purposes