wp2txt 1.1.3 → 2.1.0
- checksums.yaml +4 -4
- data/.dockerignore +12 -0
- data/.github/workflows/ci.yml +13 -13
- data/.gitignore +14 -0
- data/CHANGELOG.md +284 -0
- data/DEVELOPMENT.md +415 -0
- data/DEVELOPMENT_ja.md +415 -0
- data/Dockerfile +19 -10
- data/Gemfile +2 -8
- data/README.md +259 -123
- data/README_ja.md +375 -0
- data/Rakefile +4 -0
- data/bin/wp2txt +863 -161
- data/lib/wp2txt/article.rb +98 -13
- data/lib/wp2txt/bz2_validator.rb +239 -0
- data/lib/wp2txt/category_cache.rb +313 -0
- data/lib/wp2txt/cli.rb +319 -0
- data/lib/wp2txt/cli_ui.rb +428 -0
- data/lib/wp2txt/config.rb +158 -0
- data/lib/wp2txt/constants.rb +134 -0
- data/lib/wp2txt/data/html_entities.json +2135 -0
- data/lib/wp2txt/data/language_metadata.json +4769 -0
- data/lib/wp2txt/data/language_tiers.json +59 -0
- data/lib/wp2txt/data/mediawiki_aliases.json +12366 -0
- data/lib/wp2txt/data/template_aliases.json +193 -0
- data/lib/wp2txt/data/wikipedia_entities.json +12 -0
- data/lib/wp2txt/extractor.rb +545 -0
- data/lib/wp2txt/file_utils.rb +91 -0
- data/lib/wp2txt/formatter.rb +352 -0
- data/lib/wp2txt/global_data_cache.rb +353 -0
- data/lib/wp2txt/index_cache.rb +258 -0
- data/lib/wp2txt/magic_words.rb +353 -0
- data/lib/wp2txt/memory_monitor.rb +236 -0
- data/lib/wp2txt/multistream.rb +1383 -0
- data/lib/wp2txt/output_writer.rb +182 -0
- data/lib/wp2txt/parser_functions.rb +606 -0
- data/lib/wp2txt/ractor_worker.rb +215 -0
- data/lib/wp2txt/regex.rb +396 -12
- data/lib/wp2txt/section_extractor.rb +354 -0
- data/lib/wp2txt/stream_processor.rb +271 -0
- data/lib/wp2txt/template_expander.rb +830 -0
- data/lib/wp2txt/text_processing.rb +337 -0
- data/lib/wp2txt/utils.rb +629 -270
- data/lib/wp2txt/version.rb +1 -1
- data/lib/wp2txt.rb +53 -26
- data/scripts/benchmark_regex.rb +161 -0
- data/scripts/fetch_html_entities.rb +94 -0
- data/scripts/fetch_language_metadata.rb +180 -0
- data/scripts/fetch_mediawiki_data.rb +334 -0
- data/scripts/fetch_template_data.rb +186 -0
- data/scripts/profile_memory.rb +139 -0
- data/spec/article_spec.rb +402 -0
- data/spec/auto_download_spec.rb +314 -0
- data/spec/bz2_validator_spec.rb +193 -0
- data/spec/category_cache_spec.rb +226 -0
- data/spec/category_fetcher_spec.rb +504 -0
- data/spec/cleanup_spec.rb +197 -0
- data/spec/cli_options_spec.rb +678 -0
- data/spec/cli_spec.rb +876 -0
- data/spec/config_spec.rb +194 -0
- data/spec/constants_spec.rb +138 -0
- data/spec/file_utils_spec.rb +170 -0
- data/spec/fixtures/samples.rb +181 -0
- data/spec/formatter_sections_spec.rb +382 -0
- data/spec/global_data_cache_spec.rb +186 -0
- data/spec/index_cache_spec.rb +210 -0
- data/spec/integration_spec.rb +543 -0
- data/spec/magic_words_spec.rb +261 -0
- data/spec/markers_spec.rb +476 -0
- data/spec/memory_monitor_spec.rb +192 -0
- data/spec/multistream_spec.rb +690 -0
- data/spec/output_writer_spec.rb +400 -0
- data/spec/parser_functions_spec.rb +455 -0
- data/spec/ractor_worker_spec.rb +197 -0
- data/spec/regex_spec.rb +281 -0
- data/spec/section_extractor_spec.rb +397 -0
- data/spec/spec_helper.rb +63 -0
- data/spec/stream_processor_spec.rb +579 -0
- data/spec/template_data_spec.rb +246 -0
- data/spec/template_expander_spec.rb +472 -0
- data/spec/template_processing_spec.rb +217 -0
- data/spec/text_processing_spec.rb +312 -0
- data/spec/utils_spec.rb +195 -16
- data/spec/wp2txt_spec.rb +510 -0
- data/wp2txt.gemspec +5 -3
- metadata +146 -18
- data/.rubocop.yml +0 -80
- data/data/output_samples/testdata_en.txt +0 -23002
- data/data/output_samples/testdata_en_category.txt +0 -132
- data/data/output_samples/testdata_en_summary.txt +0 -1376
- data/data/output_samples/testdata_ja.txt +0 -22774
- data/data/output_samples/testdata_ja_category.txt +0 -206
- data/data/output_samples/testdata_ja_summary.txt +0 -1560
- data/data/testdata_en.bz2 +0 -0
- data/data/testdata_ja.bz2 +0 -0
- data/image/screenshot.png +0 -0
data/README.md CHANGED
@@ -2,211 +2,347 @@
 
 A command-line toolkit to extract text content and category data from Wikipedia dump files
 
-
+English | [日本語](README_ja.md)
 
-
+## Quick Start
 
-
+```bash
+# Install
+gem install wp2txt
 
-
+# Extract text from Japanese Wikipedia (auto-download)
+wp2txt --lang=ja -o ./output
 
-
+# Extract specific articles
+wp2txt --lang=ja --articles="東京,京都" -o ./articles
 
-
+# Extract articles from a category
+wp2txt --lang=ja --from-category="日本の都市" -o ./cities
+```
 
-
+## About
 
-
+WP2TXT extracts plain text and category information from Wikipedia dump files. It processes XML dumps (compressed with bzip2), removes MediaWiki markup, and outputs clean text suitable for corpus linguistics, text mining, and other research purposes.
 
-
-- Code cleanup introducing Rubocop
+## Key Features
 
-**
+- **Auto-download** - Automatically download dumps by language code
+- **Article extraction by title** - Extract specific articles without downloading full dumps
+- **Category-based extraction** - Extract all articles from a specific Wikipedia category
+- **Category metadata extraction** - Preserves article category information in output
+- **Template expansion** - Expands common templates (dates, units, coordinates) to readable text
+- **Multilingual support** - Category and redirect detection for 350+ Wikipedia languages
+- **Streaming processing** - Process large dumps without intermediate files
+- **JSON output** - Machine-readable JSONL format for data pipelines
 
-
+## Use Cases
 
-
+wp2txt is particularly suited for:
 
--
+- Building domain-specific corpora using category information
+- Comparative linguistic research across topic areas
+- Extracting Wikipedia text with metadata for NLP tasks
+- Cross-linguistic studies using parallel category structures
 
-
+## Data Access
 
-
-- A new option `--summary-only` has been added. If this option is enabled, only the title, category information, and opening paragraphs of the article will be extracted.
-- Text conversion with the current version of WP2TXT is *more than 2x times faster* than the previous version due to parallel processing of multiple files (the rate of speedup depends on the CPU cores used for processing).
+wp2txt uses [official Wikipedia dump files](https://meta.wikimedia.org/wiki/Data_dumps), the recommended method for bulk data access. This approach respects Wikimedia's infrastructure guidelines.
 
-##
+## Installation
 
-
+### Install wp2txt
 
-
+    $ gem install wp2txt
 
-
-- MacBook Pro (2021 Apple M1 Pro)
-- enwiki-20220720-pages-articles.xml.bz2 (19.98 GB)
+### System Requirements
 
-
+WP2TXT requires one of the following commands to decompress `bz2` files:
 
-
+- `lbzip2` (recommended - uses multiple CPU cores)
+- `pbzip2`
+- `bzip2` (pre-installed on most systems)
 
-
-- Creates output files of specified size
-- Allows specifying ext elements (page titles, section headers, paragraphs, list items) to be extracted
-- Allows extracting category information of the article
-- Allows extracting opening paragraphs of the article
+On macOS with Homebrew:
 
-
+    $ brew install lbzip2
 
-
+On Windows: Install [Bzip2 for Windows](http://gnuwin32.sourceforge.net/packages/bzip2.htm) and add to PATH.
 
-
-2. Execute `docker` command in a terminal:
+### Docker (Alternative)
 
 ```shell
-docker run -it -v /
+docker run -it -v /path/to/localdata:/data yohasebe/wp2txt
 ```
 
-
+The `wp2txt` command is available inside the container. Use `/data` for input/output files.
 
-
-4. The `wp2txt` command will be avalable anywhare in the Docker container. Use the `/data` directory as the location of the input dump files and the output text files.
+## Basic Usage
 
-
+### Auto-download and process (Recommended)
 
-
-- When running the `wp2txt` command inside a Docker container, be sure to set the output directory to somewhere in the mounted local directory specified by the `docker run` command.
+    $ wp2txt --lang=ja -o ./text
 
-
+This automatically downloads the Japanese Wikipedia dump and extracts plain text. Downloads are cached in `~/.wp2txt/cache/`.
 
-
+### Extract specific articles by title
 
--
-- `pbzip2`
-- `bzip2`
+    $ wp2txt --lang=ja --articles="認知言語学,生成文法" -o ./articles
 
-
+Only the index file and necessary data streams are downloaded, making it much faster than processing the full dump.
 
-
+### Extract articles from a category
 
-$
+    $ wp2txt --lang=ja --from-category="日本の都市" -o ./cities
 
-
+Include subcategories with `--depth`:
 
-
+    $ wp2txt --lang=ja --from-category="日本の都市" --depth=2 -o ./cities
 
-
+Preview without downloading (shows article counts):
 
-
+    $ wp2txt --lang=ja --from-category="日本の都市" --dry-run
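
One way to picture what `--depth` controls is a breadth-first walk over the category graph that stops descending after the given number of levels. The sketch below is only an illustration under that assumption and is not taken from wp2txt's source; the `CATEGORY_TREE` hash, its contents, and the `collect_articles` helper are hypothetical.

```ruby
require "set"

# Hypothetical in-memory category graph (not real Wikipedia data).
# Each category lists its direct articles and direct subcategories.
CATEGORY_TREE = {
  "Cities"       => { articles: ["Tokyo", "Kyoto"], subcats: ["Port cities"] },
  "Port cities"  => { articles: ["Yokohama", "Kobe"], subcats: ["Former ports"] },
  "Former ports" => { articles: ["Sakai"], subcats: [] }
}.freeze

# Collect article titles reachable from `root`, descending at most
# `depth` subcategory levels (depth 0 means the root category only).
def collect_articles(tree, root, depth)
  seen = Set.new([root])
  queue = [[root, 0]]
  articles = []
  until queue.empty?
    name, level = queue.shift
    node = tree[name]
    next unless node
    articles.concat(node[:articles])
    next if level >= depth
    node[:subcats].each do |sub|
      next if seen.include?(sub) # guard against category cycles
      seen << sub
      queue << [sub, level + 1]
    end
  end
  articles.uniq
end

p collect_articles(CATEGORY_TREE, "Cities", 0)
p collect_articles(CATEGORY_TREE, "Cities", 2)
```

The `seen` set matters because category graphs can contain cycles; without it a deep traversal could loop indefinitely.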
 
-
+### Process local dump file
 
-
+    $ wp2txt -i ./enwiki-20220801-pages-articles.xml.bz2 -o ./text
 
-
+### Other extraction modes
 
-
+    # Category info only (title + categories)
+    $ wp2txt -g --lang=ja -o ./category
 
-
+    # Summary only (title + categories + opening paragraphs)
+    $ wp2txt -s --lang=ja -o ./summary
 
-
+    # Metadata only (title + section headings + categories)
+    $ wp2txt -M --lang=ja --format json -o ./metadata
 
-
+    # Extract specific sections (comma-separated, 'summary' for lead text)
+    $ wp2txt --lang=en --sections="summary,Plot,Reception" --format json -o ./sections
 
-
+    # Section heading statistics
+    $ wp2txt --lang=ja --section-stats -o ./stats
 
-
+    # JSON/JSONL output
+    $ wp2txt --format json --lang=ja -o ./json
 
-
+## Sample Output
+
+### Text Output
 
 ```
-
-
-
-
-
-└── /summary
+[[Article Title]]
+
+Article content goes here with sections and paragraphs...
+
+CATEGORIES: Category1, Category2, Category3
 ```
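
For downstream processing, the text layout in the sample above can be split back into records. The following is a minimal sketch that assumes every article begins with a `[[Title]]` line and that categories sit on a single line starting with `CATEGORIES: `, exactly as in the sample; `parse_text_output` is a hypothetical helper, not part of wp2txt.

```ruby
# Split wp2txt-style text output into one hash per article.
# Assumes the layout of the sample: "[[Title]]", body, "CATEGORIES: a, b".
def parse_text_output(text)
  records = []
  current = nil
  text.each_line(chomp: true) do |line|
    if (m = line.match(/\A\[\[(.+)\]\]\z/))
      # A line consisting only of [[...]] starts a new article.
      current = { title: m[1], body: [], categories: [] }
      records << current
    elsif current && line.start_with?("CATEGORIES: ")
      current[:categories] = line.delete_prefix("CATEGORIES: ").split(", ")
    elsif current
      current[:body] << line
    end
  end
  # Join the collected body lines and trim surrounding blank lines.
  records.each { |r| r[:body] = r[:body].join("\n").strip }
end

sample = <<~TEXT
  [[Article Title]]

  Article content goes here with sections and paragraphs...

  CATEGORIES: Category1, Category2, Category3
TEXT

p parse_text_output(sample)
```

This heuristic misreads a body line that consists solely of `[[...]]` and splits category names that themselves contain `", "`. When structure matters, the JSON output format is likely the safer choice.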
 
-###
+### JSON/JSONL Output
 
-
+Each line contains one JSON object:
 
-
+```json
+{"title": "Article Title", "categories": ["Cat1", "Cat2"], "text": "...", "redirect": null}
+```
 
-
+For redirect articles:
 
-
+```json
+{"title": "NYC", "categories": [], "text": "", "redirect": "New York City"}
+```
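
Because each line is an independent JSON object, the output can be consumed one line at a time without loading the whole file. The sketch below uses only the four fields shown in the samples above (`title`, `categories`, `text`, `redirect`); the `partition_jsonl` helper is an assumption made for illustration.

```ruby
require "json"

# Read JSONL (one JSON object per line) and separate regular articles
# from redirects, relying only on the fields shown in the samples.
def partition_jsonl(lines)
  articles = []
  redirects = {}
  lines.each do |line|
    line = line.strip
    next if line.empty?
    record = JSON.parse(line)
    if record["redirect"]
      redirects[record["title"]] = record["redirect"]
    else
      articles << record
    end
  end
  [articles, redirects]
end

sample = [
  '{"title": "Article Title", "categories": ["Cat1", "Cat2"], "text": "...", "redirect": null}',
  '{"title": "NYC", "categories": [], "text": "", "redirect": "New York City"}'
]

articles, redirects = partition_jsonl(sample)
puts articles.map { |a| a["title"] }.inspect
puts redirects.inspect
```

To process a real output file, `File.foreach(path)` can be passed in place of the sample array, so that lines are read lazily.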
 
-
+## Cache Management
 
+    $ wp2txt --cache-status           # Show cache status
+    $ wp2txt --cache-clear            # Clear all cache
+    $ wp2txt --cache-clear --lang=ja  # Clear cache for Japanese only
+    $ wp2txt --update-cache           # Force fresh download
 
-
+When cache exceeds the expiry period (default: 30 days), wp2txt displays a warning but allows using cached data.
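
The warn-but-continue behaviour described here amounts to comparing a cached file's age with the expiry period. This is a hypothetical sketch of such a check, not wp2txt's implementation; `cache_stale?` and its parameters are names invented for the example.

```ruby
SECONDS_PER_DAY = 86_400

# A cached file counts as stale once its age exceeds the expiry period.
# Stale data stays usable; the caller only emits a warning.
def cache_stale?(modified_at, expiry_days: 30, now: Time.now)
  (now - modified_at) > expiry_days * SECONDS_PER_DAY
end

now = Time.utc(2026, 2, 1)
fresh = Time.utc(2026, 1, 20) # 12 days old
old = Time.utc(2025, 12, 1)   # 62 days old

[fresh, old].each do |mtime|
  warn "warning: cached dump is older than 30 days" if cache_stale?(mtime, now: now)
end
```

Passing `now:` explicitly keeps the check deterministic in tests. In real use the modification time would presumably come from something like `File.mtime` on the cached file.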
 
-
+## Advanced Options
 
-###
+### Content Type Markers
 
-
+Special content is replaced with marker placeholders by default:
 
-
+**Inline markers** (appear within sentences):
 
-
+| Marker | Content Type |
+|--------|--------------|
+| `[MATH]` | Mathematical formulas |
+| `[CODE]` | Inline code |
+| `[CHEM]` | Chemical formulas |
+| `[IPA]` | IPA phonetic notation |
 
-
+**Block markers** (standalone content):
 
-
+| Marker | Content Type |
+|--------|--------------|
+| `[CODEBLOCK]` | Source code blocks |
+| `[TABLE]` | Wiki tables |
+| `[INFOBOX]` | Information boxes |
+| `[NAVBOX]` | Navigation boxes |
+| `[GALLERY]` | Image galleries |
+| `[REFERENCES]` | Reference lists |
+| `[SCORE]` | Musical scores |
+| `[TIMELINE]` | Timeline graphics |
+| `[GRAPH]` | Graphs/charts |
+| `[SIDEBAR]` | Sidebar templates |
+| `[MAPFRAME]` | Interactive maps |
+| `[IMAGEMAP]` | Clickable image maps |
 
-
+Configure with `--markers`:
 
-$ wp2txt
+    $ wp2txt --lang=en --markers=all -o ./text        # All markers (default)
+    $ wp2txt --lang=en --markers=math,code -o ./text  # Only MATH and CODE
 
-
-- [Japanese Wikipedia](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja.txt)
+**Note**: `--markers=none` is deprecated as removing special content can make surrounding text nonsensical.
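
Since the placeholders are plain bracketed tokens, a corpus can be profiled by counting them. The marker names below come from the two tables above; the `count_markers` helper is a hypothetical illustration rather than a wp2txt feature.

```ruby
# Marker names as listed in the inline and block marker tables.
MARKERS = %w[
  MATH CODE CHEM IPA CODEBLOCK TABLE INFOBOX NAVBOX GALLERY REFERENCES
  SCORE TIMELINE GRAPH SIDEBAR MAPFRAME IMAGEMAP
].freeze

# Matches a known marker name wrapped in square brackets.
MARKER_PATTERN = /\[(#{MARKERS.join("|")})\]/

# Return a hash of marker name => number of occurrences in the text.
def count_markers(text)
  text.scan(MARKER_PATTERN).flatten.tally
end

p count_markers("Energy is [MATH], momentum is [MATH]; see [TABLE] and [CODEBLOCK].")
```

Bracketed words that are not in the list are ignored, so ordinary text such as `[sic]` is not counted.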
 
-
+### Template Expansion
 
-
+Common MediaWiki templates are automatically expanded (enabled by default):
 
-
-
+| Template | Output |
+|----------|--------|
+| `{{birth date\|1990\|5\|15}}` | May 15, 1990 |
+| `{{convert\|100\|km\|mi}}` | 100 km (62 mi) |
+| `{{coord\|35\|41\|N\|139\|41\|E}}` | 35°41′N 139°41′E |
+| `{{lang\|ja\|日本語}}` | 日本語 |
+| `{{nihongo\|Tokyo\|東京\|Tōkyō}}` | Tokyo (東京, Tōkyō) |
+| `{{frac\|1\|2}}` | 1/2 |
+| `{{circa\|1900}}` | c. 1900 |
 
-
+Supported: date/age templates, unit conversion, coordinates, language tags, quotes, fractions, and more. Parser functions (`{{#if:}}`, `{{#switch:}}`) and magic words (`{{PAGENAME}}`, `{{CURRENTYEAR}}`) are also supported.
 
-
+Disable with `--no-expand-templates`.
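
To make the idea of template expansion concrete, here is a toy expander for three of the simpler rows in the table above (`frac`, `circa`, `lang`). It is emphatically not wp2txt's implementation: the `EXPANDERS` table and `expand_simple_templates` are assumptions for illustration, and the sketch ignores nested templates, named parameters, and parser functions.

```ruby
# Toy mapping from template name to a lambda producing plain text.
EXPANDERS = {
  "frac"  => ->(args) { args.join("/") },
  "circa" => ->(args) { "c. #{args.first}" },
  "lang"  => ->(args) { args.last.to_s }
}.freeze

# Replace non-nested "{{name|arg1|arg2}}" occurrences using EXPANDERS.
# Unknown templates are left untouched.
def expand_simple_templates(text)
  text.gsub(/\{\{([^{}|]+)((?:\|[^{}|]*)*)\}\}/) do
    m = Regexp.last_match
    name = m[1].strip.downcase
    args = m[2].split("|").drop(1) # drop the empty string before the first "|"
    expander = EXPANDERS[name]
    expander ? expander.call(args) : m[0]
  end
end

puts expand_simple_templates("{{frac|1|2}} cup, {{circa|1900}}, {{lang|ja|日本語}}")
```

A lookup table of small lambdas keeps each template's rule isolated. Real templates such as `convert` or `birth date` need considerably more logic (unit tables, date formatting) than this sketch suggests.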
 
-
-- [Japanese Wikipedia](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja_summary.txt)
+### Citation Extraction
 
-
+By default, citation templates are removed. Use `--extract-citations` to extract formatted citations:
 
-
+    $ wp2txt --lang=en --extract-citations -o ./text
+
+Supported: `{{cite book}}`, `{{cite web}}`, `{{cite news}}`, `{{cite journal}}`, `{{Citation}}`, etc.
+
+## Command Line Options
 
     Usage: wp2txt [options]
-
-
--
--
--
--
--
--
-
--
-
-
--
--
--
-
-
--
--
+
+    Input source (one of --input or --lang required):
+      -i, --input=<s>                Path to compressed file (bz2) or XML file
+      -L, --lang=<s>                 Wikipedia language code (e.g., ja, en, de)
+      -A, --articles=<s>             Specific article titles (comma-separated)
+      -G, --from-category=<s>        Extract articles from Wikipedia category
+      -D, --depth=<i>                Subcategory recursion depth (default: 0)
+      -y, --yes                      Skip confirmation prompt
+          --dry-run                  Preview category extraction
+      -U, --update-cache             Force refresh of cached files
+
+    Output options:
+      -o, --output-dir=<s>           Output directory (default: current)
+      -j, --format=<s>               Output format: text or json (default: text)
+      -f, --file-size=<i>            Output file size in MB (default: 10, 0=single)
+
+    Cache management:
+          --cache-dir=<s>            Cache directory (default: ~/.wp2txt/cache)
+          --cache-status             Show cache status and exit
+          --cache-clear              Clear cache and exit
+
+    Configuration:
+          --config-init              Create default config (~/.wp2txt/config.yml)
+          --config-path=<s>          Path to configuration file
+
+    Extraction modes (mutually exclusive):
+      -g, --category-only            Extract only title and categories
+      -s, --summary-only             Extract title, categories, and summary
+      -M, --metadata-only            Extract only title, headings, and categories
+
+    Section extraction:
+      -S, --sections=<s>             Extract specific sections (comma-separated)
+          --section-output=<s>       Output mode: structured or combined (default: structured)
+          --min-section-length=<i>   Minimum section length in characters (default: 0)
+          --skip-empty               Skip articles with no matching sections
+          --alias-file=<s>           Custom section alias definitions file (YAML)
+          --no-section-aliases       Disable section alias matching (exact match only)
+          --section-stats            Collect and output section heading statistics (JSON)
+          --show-matched-sections    Include matched_sections field in JSON output
+
+    Content filtering:
+      -a, --category, --no-category  Show category info (default: true)
+      -t, --title, --no-title        Keep page titles (default: true)
+      -d, --heading, --no-heading    Keep section titles (default: true)
+      -l, --list                     Keep list items (default: false)
+          --table                    Keep wiki table content (default: false)
+      -p, --pre                      Keep preformatted text blocks (default: false)
+      -r, --ref                      Keep references as [ref]...[/ref] (default: false)
+          --multiline                Keep multi-line templates (default: false)
+      -e, --redirect                 Show redirect destination (default: false)
+      -m, --marker, --no-marker      Show list markers (default: true)
+      -k, --markers=<s>              Content markers (default: all)
+      -C, --extract-citations        Extract formatted citations
+      -E, --expand-templates         Expand templates (default: true)
+          --no-expand-templates      Disable template expansion
+
+    Performance:
+      -n, --num-procs=<i>            Parallel processes (default: auto)
+          --no-turbo                 Disable turbo mode (saves disk space, slower)
+      -R, --ractor                   Use Ractor parallelism (Ruby 4.0+, streaming only)
+      -b, --bz2-gem                  Use bzip2-ruby gem instead of system command
+
+    Output control:
+      -q, --quiet                    Suppress progress output (errors only)
+          --no-color                 Disable colored output
+
+    Info:
+      -v, --version                  Print version
+      -h, --help                     Show help
+
+## Configuration File
+
+Create persistent settings with:
+
+    $ wp2txt --config-init
+
+This creates `~/.wp2txt/config.yml`:
+
+```yaml
+cache:
+  dump_expiry_days: 30      # Days before dumps are stale (1-365)
+  category_expiry_days: 7   # Category cache expiry (1-90)
+  directory: ~/.wp2txt/cache
+
+defaults:
+  format: text              # Default output format
+  depth: 0                  # Default subcategory depth
+```
+
+Command-line options override configuration file settings.
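
The precedence rule stated here (command line over configuration file) is a layered merge. The sketch below assumes a three-level order of built-in defaults, then the `defaults` section of the YAML file, then explicit command-line values; `BUILT_IN` and `effective_settings` are hypothetical names, and wp2txt's real option handling may differ.

```ruby
require "yaml"

# Built-in defaults mirroring the sample config; illustrative only.
BUILT_IN = { "format" => "text", "depth" => 0 }.freeze

# Merge settings so that later layers win:
# built-in defaults < config file "defaults" section < command line.
def effective_settings(config_yaml, cli_options)
  config = YAML.safe_load(config_yaml) || {}
  file_defaults = config.fetch("defaults", {}) || {}
  # compact drops options the user did not pass (nil values).
  BUILT_IN.merge(file_defaults).merge(cli_options.compact)
end

config_yaml = <<~CONFIG
  defaults:
    format: json
    depth: 1
CONFIG

p effective_settings(config_yaml, { "depth" => 2, "format" => nil })
```

Here `format` comes from the file because the command line left it unset, while `depth` comes from the command line.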
+
+## Performance
+
+Benchmark results on MacBook Air M4 (7 parallel processes, turbo mode, excluding download time):
+
+| Wikipedia | Dump Size | Articles | Processing Time | Output |
+|-----------|-----------|----------|-----------------|--------|
+| Japanese | 4.37 GB | 1,485,937 | ~27 min | 463 files (4.5 GB) |
+| English | 24.2 GB | ~6.8M | ~2 hours | 2,000 files (20 GB) |
+
+Turbo mode (default) splits bz2 into XML chunks first, then processes in parallel. Use `--no-turbo` to save disk space at the cost of slower processing.
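
The table's figures can be turned into a rough throughput estimate. The arithmetic below uses only the Japanese row; because the processing time is given as "~27 min", the results are approximate. The `per_second` helper is introduced for this example only.

```ruby
# Items processed per second, given a count and a duration in minutes.
def per_second(count, minutes)
  count / (minutes * 60.0)
end

articles = 1_485_937 # Japanese Wikipedia row
minutes = 27         # "~27 min", so treat the results as estimates
dump_gb = 4.37

puts format("about %d articles per second", per_second(articles, minutes).round)
puts format("about %.2f GB of compressed input per minute", dump_gb / minutes)
```

This works out to roughly 917 articles per second and about 0.16 GB of compressed input per minute on the stated hardware.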
 
 ## Caveats
 
-*
-* Some text
-
+* Special content (math, code, etc.) is marked with placeholders by default.
+* Some text may not be extracted correctly due to markup variations or language-specific formatting.
+
+## Changelog
+
+See [CHANGELOG.md](CHANGELOG.md) for detailed release notes.
+
+**v2.1.0 (February 2026)**: SQLite caching, Ractor parallelism (Ruby 4.0+), template expansion, content markers, Docker image update.
+
+**v2.0.0 (January 2026)**: Auto-download mode, category-based extraction, article extraction by title, JSON output, streaming processing, Ruby 4.0 support.
 
 ## Useful Links
 
@@ -223,14 +359,14 @@ The author will appreciate your mentioning one of these in your research.
 * Yoichiro HASEBE. 2006. [Method for using Wikipedia as Japanese corpus.](http://ci.nii.ac.jp/naid/110006226727) _Doshisha Studies in Language and Culture_ 9(2), 373-403.
 * 長谷部陽一郎. 2006. [Wikipedia日本語版をコーパスとして用いた言語研究の手法](http://ci.nii.ac.jp/naid/110006226727). 『言語文化』9(2), 373-403.
 
-
+BibTeX:
 
 ```
-@misc{
+@misc{wp2txt_2026,
 author = {Yoichiro Hasebe},
 title = {WP2TXT: A command-line toolkit to extract text content and category data from Wikipedia dump files},
 url = {https://github.com/yohasebe/wp2txt},
-year = {
+year = {2026}
 }
 ```
 