openrxiv 0.0.0 → 0.0.2

Files changed (78)
  1. package/README.md +224 -0
  2. package/dist/cli.cjs +8082 -35849
  3. package/package.json +8 -32
  4. package/dist/cli/src/api/api-client.d.ts +0 -96
  5. package/dist/cli/src/api/api-client.d.ts.map +0 -1
  6. package/dist/cli/src/api/api-client.js +0 -257
  7. package/dist/cli/src/aws/bucket-explorer.d.ts +0 -26
  8. package/dist/cli/src/aws/bucket-explorer.d.ts.map +0 -1
  9. package/dist/cli/src/aws/bucket-explorer.js +0 -220
  10. package/dist/cli/src/aws/config.d.ts +0 -18
  11. package/dist/cli/src/aws/config.d.ts.map +0 -1
  12. package/dist/cli/src/aws/config.js +0 -191
  13. package/dist/cli/src/aws/downloader.d.ts +0 -13
  14. package/dist/cli/src/aws/downloader.d.ts.map +0 -1
  15. package/dist/cli/src/aws/downloader.js +0 -115
  16. package/dist/cli/src/aws/month-lister.d.ts +0 -18
  17. package/dist/cli/src/aws/month-lister.d.ts.map +0 -1
  18. package/dist/cli/src/aws/month-lister.js +0 -90
  19. package/dist/cli/src/commands/batch-process.d.ts +0 -3
  20. package/dist/cli/src/commands/batch-process.d.ts.map +0 -1
  21. package/dist/cli/src/commands/batch-process.js +0 -557
  22. package/dist/cli/src/commands/config.d.ts +0 -3
  23. package/dist/cli/src/commands/config.d.ts.map +0 -1
  24. package/dist/cli/src/commands/config.js +0 -42
  25. package/dist/cli/src/commands/download.d.ts +0 -3
  26. package/dist/cli/src/commands/download.d.ts.map +0 -1
  27. package/dist/cli/src/commands/download.js +0 -76
  28. package/dist/cli/src/commands/list.d.ts +0 -3
  29. package/dist/cli/src/commands/list.d.ts.map +0 -1
  30. package/dist/cli/src/commands/list.js +0 -18
  31. package/dist/cli/src/commands/month-info.d.ts +0 -3
  32. package/dist/cli/src/commands/month-info.d.ts.map +0 -1
  33. package/dist/cli/src/commands/month-info.js +0 -213
  34. package/dist/cli/src/commands/summary.d.ts +0 -3
  35. package/dist/cli/src/commands/summary.d.ts.map +0 -1
  36. package/dist/cli/src/commands/summary.js +0 -249
  37. package/dist/cli/src/index.d.ts +0 -3
  38. package/dist/cli/src/index.d.ts.map +0 -1
  39. package/dist/cli/src/index.js +0 -35
  40. package/dist/cli/src/utils/batches.d.ts +0 -9
  41. package/dist/cli/src/utils/batches.d.ts.map +0 -1
  42. package/dist/cli/src/utils/batches.js +0 -61
  43. package/dist/cli/src/utils/batches.test.d.ts +0 -2
  44. package/dist/cli/src/utils/batches.test.d.ts.map +0 -1
  45. package/dist/cli/src/utils/batches.test.js +0 -119
  46. package/dist/cli/src/utils/default-server.d.ts +0 -3
  47. package/dist/cli/src/utils/default-server.d.ts.map +0 -1
  48. package/dist/cli/src/utils/default-server.js +0 -20
  49. package/dist/cli/src/utils/index.d.ts +0 -5
  50. package/dist/cli/src/utils/index.d.ts.map +0 -1
  51. package/dist/cli/src/utils/index.js +0 -5
  52. package/dist/cli/src/utils/meca-processor.d.ts +0 -28
  53. package/dist/cli/src/utils/meca-processor.d.ts.map +0 -1
  54. package/dist/cli/src/utils/meca-processor.js +0 -503
  55. package/dist/cli/src/utils/meca-processor.test.d.ts +0 -2
  56. package/dist/cli/src/utils/meca-processor.test.d.ts.map +0 -1
  57. package/dist/cli/src/utils/meca-processor.test.js +0 -123
  58. package/dist/cli/src/utils/months.d.ts +0 -36
  59. package/dist/cli/src/utils/months.d.ts.map +0 -1
  60. package/dist/cli/src/utils/months.js +0 -135
  61. package/dist/cli/src/utils/months.test.d.ts +0 -2
  62. package/dist/cli/src/utils/months.test.d.ts.map +0 -1
  63. package/dist/cli/src/utils/months.test.js +0 -209
  64. package/dist/cli/src/utils/requester-pays-error.d.ts +0 -6
  65. package/dist/cli/src/utils/requester-pays-error.d.ts.map +0 -1
  66. package/dist/cli/src/utils/requester-pays-error.js +0 -20
  67. package/dist/cli/src/version.d.ts +0 -3
  68. package/dist/cli/src/version.d.ts.map +0 -1
  69. package/dist/cli/src/version.js +0 -2
  70. package/dist/utils/src/biorxiv-parser.d.ts +0 -51
  71. package/dist/utils/src/biorxiv-parser.d.ts.map +0 -1
  72. package/dist/utils/src/biorxiv-parser.js +0 -126
  73. package/dist/utils/src/folder-structure.d.ts +0 -44
  74. package/dist/utils/src/folder-structure.d.ts.map +0 -1
  75. package/dist/utils/src/folder-structure.js +0 -207
  76. package/dist/utils/src/index.d.ts +0 -3
  77. package/dist/utils/src/index.d.ts.map +0 -1
  78. package/dist/utils/src/index.js +0 -3
package/README.md ADDED
@@ -0,0 +1,224 @@
1
+ # openRxiv MECA Downloader CLI
2
+
3
+ A comprehensive command-line interface (CLI) tool to download, process, and manage openRxiv MECA (Manuscript Exchange Common Approach) files from AWS S3 for text and data mining purposes.
4
+
5
+ ## Features
6
+
7
+ - **Multi-Server Support**: Works with both bioRxiv and medRxiv servers
8
+ - **AWS S3 Integration**: Connect to S3 buckets with requester-pays support
9
+ - **Content Exploration**: List, search, and browse available content by month or batch
10
+ - **Individual Downloads**: Download MECA files by DOI with API integration
11
+ - **Batch Processing**: Process large amounts of data with configurable concurrency
12
+ - **Content Summaries**: Get detailed information about preprints
13
+ - **Month/Batch Analysis**: Detailed metadata for specific time periods
14
+ - **XML Processing**: Robust handling of openRxiv XML files with entity replacement
15
+
16
+ ## Installation
17
+
18
+ ### Prerequisites
19
+
20
+ - Node.js 18.0.0 or higher
21
+ - AWS account with access to the bioRxiv/medRxiv S3 buckets (the buckets are requester-pays, so pass `--requester-pays`)
22
+ - API key for bioRxiv API access
23
+
24
+ ### Global Installation
25
+
26
+ ```bash
27
+ npm install -g openrxiv
28
+ ```
29
+
30
+ ## Quick Start
31
+
32
+ ### Summary
33
+
34
+ Get a summary of a bioRxiv/medRxiv preprint from a URL or DOI.
35
+
36
+ **Arguments:**
37
+
38
+ - `<url-or-doi>`: bioRxiv/medRxiv URL or DOI to summarize
39
+
40
+ **Options:**
41
+
42
+ - `-m, --more`: Show additional details and full abstract
43
+ - `-s, --server <server>`: Server to use: "biorxiv" or "medrxiv"
44
+
45
+ **Examples:**
46
+
47
+ ```bash
48
+ openrxiv summary "10.1101/2024.05.08.593085"
49
+ openrxiv summary -m "10.1101/2024.05.08.593085"
50
+ openrxiv summary -s medrxiv "10.1101/2020.03.19.20039131" --more
51
+ ```
52
+
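Because `summary` accepts either a URL or a bare DOI, it can help to see how a DOI might be pulled out of a URL before calling the CLI. The following is a minimal sketch, not the tool's own parser: the function name `extract_doi` is hypothetical, and it assumes the `10.1101/` DOI prefix used in the examples above and a URL of the form `https://www.biorxiv.org/content/<doi>v<version>`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of openrxiv): extract a DOI from a URL or DOI string.
# Assumes the 10.1101/ prefix shown in the README examples.
extract_doi() {
  local input="$1"
  # Digits and dots after the prefix, ending in a digit, so a trailing
  # version suffix such as "v1" or ".full" is not captured.
  local doi_re='(10\.1101/[0-9.]+[0-9])'
  if [[ "$input" =~ $doi_re ]]; then
    printf '%s\n' "${BASH_REMATCH[1]}"
    return 0
  fi
  echo "no DOI found in: $input" >&2
  return 1
}

extract_doi "https://www.biorxiv.org/content/10.1101/2024.05.08.593085v1"  # prints 10.1101/2024.05.08.593085
extract_doi "10.1101/2020.03.19.20039131"                                  # prints 10.1101/2020.03.19.20039131
```

The result could then be passed on, for example `openrxiv summary "$(extract_doi "$url")"`.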
53
+ ### Download Files
54
+
55
+ Download MECA files from the bioRxiv/medRxiv S3 buckets by DOI.
56
+
57
+ **Arguments:**
58
+
59
+ - `<doi>`: DOI of the paper (e.g., "10.1101/2024.05.08.593085")
60
+
61
+ **Options:**
62
+
63
+ - `-o, --output <dir>`: Output directory for downloaded files (default: "./downloads")
64
+ - `-a, --api-url <url>`: API base URL
65
+ - `--requester-pays`: Enable requester-pays for S3 bucket access
66
+
67
+ **Examples:**
68
+
69
+ ```bash
70
+ openrxiv --requester-pays download "10.1101/2024.05.08.593085"
71
+ openrxiv --requester-pays download "10.1101/2024.05.08.593085" --output "./papers"
72
+ openrxiv --requester-pays download "10.1101/2024.05.08.593085" --api-url "https://custom-api.com"
73
+ ```
74
+
75
+ ### List Bucket Contents
76
+
77
+ List available content in the bioRxiv or medRxiv S3 bucket.
78
+
79
+ **Options:**
80
+
81
+ - `-m, --month <month>`: Filter by specific month (e.g., "2024-01" or "January_2024")
82
+ - `-b, --batch <batch>`: Filter by specific batch (e.g., "Batch_01")
83
+ - `-l, --limit <number>`: Limit the number of results (default: 50)
84
+ - `-s, --server <server>`: Server to use: "biorxiv" or "medrxiv"
85
+
86
+ **Examples:**
87
+
88
+ ```bash
89
+ # List available content, optionally filtered by month or batch
90
+ openrxiv list
91
+ openrxiv list --month "2024-01"
92
+ openrxiv list --batch 1 --limit 100 --server medrxiv
93
+ ```
94
+
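The `--month` option accepts two spellings, "2024-01" and "January_2024". As an illustration of how the two relate, here is a minimal sketch that normalizes either one to `YYYY-MM`. The function name `normalize_month` is hypothetical and the logic is an assumption about the mapping, not code taken from the package.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of openrxiv): normalize a month argument to YYYY-MM.
# Accepts "2024-01" or "January_2024" (full English month names, case-sensitive).
normalize_month() {
  local input="$1"
  local iso_re='^[0-9]{4}-(0[1-9]|1[0-2])$'
  local named_re='^([A-Za-z]+)_([0-9]{4})$'
  # Already in YYYY-MM form: pass through unchanged.
  if [[ "$input" =~ $iso_re ]]; then
    printf '%s\n' "$input"
    return 0
  fi
  # Month_YYYY form: look the month name up in a fixed table.
  if [[ "$input" =~ $named_re ]]; then
    local name="${BASH_REMATCH[1]}" year="${BASH_REMATCH[2]}"
    local months=(January February March April May June July
                  August September October November December)
    local i
    for i in "${!months[@]}"; do
      if [[ "${months[$i]}" == "$name" ]]; then
        printf '%s-%02d\n' "$year" "$((i + 1))"
        return 0
      fi
    done
  fi
  echo "unrecognized month: $input" >&2
  return 1
}

normalize_month "January_2024"  # prints 2024-01
normalize_month "2024-01"       # prints 2024-01
```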
95
+ ### Batch Info
96
+
97
+ List detailed metadata for all files in a specific month or batch.
98
+
99
+ **Options:**
100
+
101
+ - `-m, --month <month>`: Month to list (e.g., "January_2024" or "2024-01")
102
+ - `-b, --batch <batch>`: Batch to list (e.g., "1", "batch-1", "Batch_01")
103
+ - `-s, --server <server>`: Server to use: "biorxiv" or "medrxiv"
104
+
105
+ **Examples:**
106
+
107
+ ```bash
108
+ openrxiv batch-info --month "2024-01"
109
+ openrxiv batch-info --batch "1"
110
+ openrxiv batch-info --server medrxiv --month "2024-01"
111
+ ```
112
+
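The `--batch` option lists three equivalent spellings: "1", "batch-1", and "Batch_01". A minimal sketch of how such inputs could be reduced to a single form follows. The function name `normalize_batch` is hypothetical, and treating the zero-padded `Batch_NN` spelling as canonical is an assumption based only on the examples above.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of openrxiv): normalize a batch argument to Batch_NN.
# Accepts "1", "batch-1", "Batch_01" and similar spellings.
normalize_batch() {
  local input="$1"
  # Optional "batch" prefix with "-" or "_", optional leading zeros, then the number.
  local batch_re='^([Bb]atch[-_]?)?0*([0-9]+)$'
  if [[ "$input" =~ $batch_re ]]; then
    # 10# forces base-10 so the digits are never read as octal.
    printf 'Batch_%02d\n' "$((10#${BASH_REMATCH[2]}))"
    return 0
  fi
  echo "unrecognized batch: $input" >&2
  return 1
}

normalize_batch "1"         # prints Batch_01
normalize_batch "batch-1"   # prints Batch_01
normalize_batch "Batch_01"  # prints Batch_01
```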
113
+ ## Global Options
114
+
115
+ ### `--requester-pays`
116
+
117
+ Enable requester-pays mode. The S3 buckets require requester pays for external access, which means request and data transfer costs are billed to your AWS account rather than to the bucket owner.
118
+
119
+ ## Batch Processing
120
+
121
+ Batch process MECA files for a given month or batch.
122
+
123
+ **Options:**
124
+
125
+ **Time Selection:**
126
+
127
+ - `-m, --month <month>`: Month(s) to process. Supports: YYYY-MM, comma-separated list, or wildcard patterns
128
+ - `-b, --batch <batch>`: Batch to process. Supports: single batch, range, or comma-separated list
129
+
130
+ **Processing Control:**
131
+
132
+ - `-l, --limit <number>`: Maximum number of files to process
133
+ - `-c, --concurrency <number>`: Number of files to process concurrently (default: 1)
134
+ - `--force`: Force reprocessing of existing files
135
+ - `--dry-run`: List files without processing them
136
+
137
+ **Output Control:**
138
+
139
+ - `-o, --output <dir>`: Output directory for extracted files (default: "./batch-extracted")
140
+ - `--keep`: Keep MECA files after processing
141
+ - `--full-extract`: Extract entire MECA file instead of selective extraction
142
+ - `--max-file-size <size>`: Skip files larger than this size (e.g. 1GB)
143
+
144
+ **API Configuration:**
145
+
146
+ - `-a, --api-url <url>`: API base URL (default: "https://openrxiv.csf.now")
147
+ - `-k, --api-key <key>`: API key for authentication (or use OPENRXIV_BATCH_PROCESSING_API_KEY env var)
148
+ - `-s, --server <server>`: Server type: "biorxiv" or "medrxiv"
149
+
150
+ **Examples:**
151
+
152
+ ```bash
153
+ # Process specific month
154
+ openrxiv batch-process --month "2025-08" --requester-pays
155
+
156
+ # Process multiple months
157
+ openrxiv batch-process --month "2024-01,2024-02,2024-03" --requester-pays
158
+
159
+ # Dry run to see what would be processed
160
+ openrxiv batch-process --month "2025-08" --dry-run
161
+
162
+ # Process all of 2025
163
+ openrxiv batch-process --month "2025-*" --requester-pays
164
+
165
+ # Process with concurrency
166
+ openrxiv batch-process --month "2025-08" --concurrency 5 --requester-pays
167
+ ```
168
+
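The `--month` option for `batch-process` accepts a single month, a comma-separated list, or a wildcard pattern. The sketch below shows one way such a specification could be expanded into individual months. It is an illustration under assumptions, not the package's logic: the function name `expand_months` is hypothetical, only the `YYYY-*` wildcard form from the example above is handled, and it expands to all twelve calendar months without checking which ones exist in the bucket.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of openrxiv): expand a --month specification
# into one YYYY-MM value per line.
expand_months() {
  local spec="$1" part m
  local -a parts
  local wildcard_re='^([0-9]{4})-\*$'
  # Split the comma-separated list into an array.
  IFS=',' read -r -a parts <<< "$spec"
  for part in "${parts[@]}"; do
    if [[ "$part" =~ $wildcard_re ]]; then
      # "YYYY-*" becomes YYYY-01 through YYYY-12.
      for m in 01 02 03 04 05 06 07 08 09 10 11 12; do
        printf '%s-%s\n' "${BASH_REMATCH[1]}" "$m"
      done
    else
      # Anything else is passed through as a single month.
      printf '%s\n' "$part"
    fi
  done
}

# Prints 2024-01, then 2025-01 through 2025-12, one per line.
expand_months "2024-01,2025-*"
```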
169
+ ## Configuration
170
+
171
+ The tool reads AWS credentials from the default profile in your home directory (typically `~/.aws/credentials`), if available.
172
+
173
+ ### Environment Variables
174
+
175
+ You can also set the API key and AWS credentials via environment variables:
176
+
177
+ ```bash
178
+ export OPENRXIV_BATCH_PROCESSING_API_KEY="your-api-key"
179
+ export AWS_ACCESS_KEY_ID="your-access-key"
180
+ export AWS_SECRET_ACCESS_KEY="your-secret-key"
181
+ ```
182
+
183
+ ## Development
184
+
185
+ ### Local Development
186
+
187
+ ```bash
188
+ git clone https://github.com/continuous-foundation/openrxiv
189
+ cd openrxiv
190
+ npm install
191
+ ```
192
+
193
+ ### Building
194
+
195
+ ```bash
196
+ npm run build
197
+ ```
198
+
199
+ ### Testing
200
+
201
+ ```bash
202
+ npm test
203
+ npm run test:watch
204
+ ```
205
+
206
+ ### Linting & Formatting
207
+
208
+ ```bash
209
+ npm run lint
210
+ npm run lint:format
211
+ ```
212
+
213
+ ## License
214
+
215
+ MIT License - see LICENSE file for details.
216
+
217
+ ## Compliance
218
+
219
+ This tool is designed to comply with bioRxiv's and medRxiv's fair use policies:
220
+
221
+ - No content redistribution
222
+ - Link back to bioRxiv/medRxiv for indexing services
223
+ - Respect author copyright and licensing
224
+ - Intended for legitimate text and data mining purposes