wp2txt 1.1.3 → 2.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (96)
  1. checksums.yaml +4 -4
  2. data/.dockerignore +12 -0
  3. data/.github/workflows/ci.yml +13 -13
  4. data/.gitignore +14 -0
  5. data/CHANGELOG.md +284 -0
  6. data/DEVELOPMENT.md +415 -0
  7. data/DEVELOPMENT_ja.md +415 -0
  8. data/Dockerfile +19 -10
  9. data/Gemfile +2 -8
  10. data/README.md +259 -123
  11. data/README_ja.md +375 -0
  12. data/Rakefile +4 -0
  13. data/bin/wp2txt +863 -161
  14. data/lib/wp2txt/article.rb +98 -13
  15. data/lib/wp2txt/bz2_validator.rb +239 -0
  16. data/lib/wp2txt/category_cache.rb +313 -0
  17. data/lib/wp2txt/cli.rb +319 -0
  18. data/lib/wp2txt/cli_ui.rb +428 -0
  19. data/lib/wp2txt/config.rb +158 -0
  20. data/lib/wp2txt/constants.rb +134 -0
  21. data/lib/wp2txt/data/html_entities.json +2135 -0
  22. data/lib/wp2txt/data/language_metadata.json +4769 -0
  23. data/lib/wp2txt/data/language_tiers.json +59 -0
  24. data/lib/wp2txt/data/mediawiki_aliases.json +12366 -0
  25. data/lib/wp2txt/data/template_aliases.json +193 -0
  26. data/lib/wp2txt/data/wikipedia_entities.json +12 -0
  27. data/lib/wp2txt/extractor.rb +545 -0
  28. data/lib/wp2txt/file_utils.rb +91 -0
  29. data/lib/wp2txt/formatter.rb +352 -0
  30. data/lib/wp2txt/global_data_cache.rb +353 -0
  31. data/lib/wp2txt/index_cache.rb +258 -0
  32. data/lib/wp2txt/magic_words.rb +353 -0
  33. data/lib/wp2txt/memory_monitor.rb +236 -0
  34. data/lib/wp2txt/multistream.rb +1383 -0
  35. data/lib/wp2txt/output_writer.rb +182 -0
  36. data/lib/wp2txt/parser_functions.rb +606 -0
  37. data/lib/wp2txt/ractor_worker.rb +215 -0
  38. data/lib/wp2txt/regex.rb +396 -12
  39. data/lib/wp2txt/section_extractor.rb +354 -0
  40. data/lib/wp2txt/stream_processor.rb +271 -0
  41. data/lib/wp2txt/template_expander.rb +830 -0
  42. data/lib/wp2txt/text_processing.rb +337 -0
  43. data/lib/wp2txt/utils.rb +629 -270
  44. data/lib/wp2txt/version.rb +1 -1
  45. data/lib/wp2txt.rb +53 -26
  46. data/scripts/benchmark_regex.rb +161 -0
  47. data/scripts/fetch_html_entities.rb +94 -0
  48. data/scripts/fetch_language_metadata.rb +180 -0
  49. data/scripts/fetch_mediawiki_data.rb +334 -0
  50. data/scripts/fetch_template_data.rb +186 -0
  51. data/scripts/profile_memory.rb +139 -0
  52. data/spec/article_spec.rb +402 -0
  53. data/spec/auto_download_spec.rb +314 -0
  54. data/spec/bz2_validator_spec.rb +193 -0
  55. data/spec/category_cache_spec.rb +226 -0
  56. data/spec/category_fetcher_spec.rb +504 -0
  57. data/spec/cleanup_spec.rb +197 -0
  58. data/spec/cli_options_spec.rb +678 -0
  59. data/spec/cli_spec.rb +876 -0
  60. data/spec/config_spec.rb +194 -0
  61. data/spec/constants_spec.rb +138 -0
  62. data/spec/file_utils_spec.rb +170 -0
  63. data/spec/fixtures/samples.rb +181 -0
  64. data/spec/formatter_sections_spec.rb +382 -0
  65. data/spec/global_data_cache_spec.rb +186 -0
  66. data/spec/index_cache_spec.rb +210 -0
  67. data/spec/integration_spec.rb +543 -0
  68. data/spec/magic_words_spec.rb +261 -0
  69. data/spec/markers_spec.rb +476 -0
  70. data/spec/memory_monitor_spec.rb +192 -0
  71. data/spec/multistream_spec.rb +690 -0
  72. data/spec/output_writer_spec.rb +400 -0
  73. data/spec/parser_functions_spec.rb +455 -0
  74. data/spec/ractor_worker_spec.rb +197 -0
  75. data/spec/regex_spec.rb +281 -0
  76. data/spec/section_extractor_spec.rb +397 -0
  77. data/spec/spec_helper.rb +63 -0
  78. data/spec/stream_processor_spec.rb +579 -0
  79. data/spec/template_data_spec.rb +246 -0
  80. data/spec/template_expander_spec.rb +472 -0
  81. data/spec/template_processing_spec.rb +217 -0
  82. data/spec/text_processing_spec.rb +312 -0
  83. data/spec/utils_spec.rb +195 -16
  84. data/spec/wp2txt_spec.rb +510 -0
  85. data/wp2txt.gemspec +5 -3
  86. metadata +146 -18
  87. data/.rubocop.yml +0 -80
  88. data/data/output_samples/testdata_en.txt +0 -23002
  89. data/data/output_samples/testdata_en_category.txt +0 -132
  90. data/data/output_samples/testdata_en_summary.txt +0 -1376
  91. data/data/output_samples/testdata_ja.txt +0 -22774
  92. data/data/output_samples/testdata_ja_category.txt +0 -206
  93. data/data/output_samples/testdata_ja_summary.txt +0 -1560
  94. data/data/testdata_en.bz2 +0 -0
  95. data/data/testdata_ja.bz2 +0 -0
  96. data/image/screenshot.png +0 -0
data/README.md CHANGED
@@ -2,211 +2,347 @@
 
  A command-line toolkit to extract text content and category data from Wikipedia dump files
 
- ## About
+ English | [日本語](README_ja.md)
 
- WP2TXT extracts text and category data from Wikipedia dump files (encoded in XML / compressed with Bzip2), removing MediaWiki markup and other metadata.
+ ## Quick Start
 
- ## Changelog
+ ```bash
+ # Install
+ gem install wp2txt
 
- **May 2023**
+ # Extract text from Japanese Wikipedia (auto-download)
+ wp2txt --lang=ja -o ./output
 
- - Problems caused by too many parallel processes are addressed by setting the upper limit on the number of processes to 8.
+ # Extract specific articles
+ wp2txt --lang=ja --articles="東京,京都" -o ./articles
 
- **April 2023**
+ # Extract articles from a category
+ wp2txt --lang=ja --from-category="日本の都市" -o ./cities
+ ```
 
- - File split/delete issues fixed
+ ## About
 
- **January 2023**
+ WP2TXT extracts plain text and category information from Wikipedia dump files. It processes XML dumps (compressed with bzip2), removes MediaWiki markup, and outputs clean text suitable for corpus linguistics, text mining, and other research purposes.
 
- - Bug related to command-line arguments fixed
- - Code cleanup introducing RuboCop
+ ## Key Features
 
- **December 2022**
+ - **Auto-download** - Automatically download dumps by language code
+ - **Article extraction by title** - Extract specific articles without downloading full dumps
+ - **Category-based extraction** - Extract all articles from a specific Wikipedia category
+ - **Category metadata extraction** - Preserves article category information in output
+ - **Template expansion** - Expands common templates (dates, units, coordinates) to readable text
+ - **Multilingual support** - Category and redirect detection for 350+ Wikipedia languages
+ - **Streaming processing** - Process large dumps without intermediate files
+ - **JSON output** - Machine-readable JSONL format for data pipelines
 
- - Docker images available via Docker Hub
+ ## Use Cases
 
- **November 2022**
+ wp2txt is particularly suited for:
 
- - Code added to suppress "Invalid byte sequence error" when an illegal UTF-8 character is input.
+ - Building domain-specific corpora using category information
+ - Comparative linguistic research across topic areas
+ - Extracting Wikipedia text with metadata for NLP tasks
+ - Cross-linguistic studies using parallel category structures
 
- **August 2022**
+ ## Data Access
 
- - A new option `--category-only` has been added. When this option is enabled, only the title and category information of the article is extracted.
- - A new option `--summary-only` has been added. If this option is enabled, only the title, category information, and opening paragraphs of the article will be extracted.
- - Text conversion with the current version of WP2TXT is *more than 2x faster* than the previous version due to parallel processing of multiple files (the rate of speedup depends on the CPU cores used for processing).
+ wp2txt uses [official Wikipedia dump files](https://meta.wikimedia.org/wiki/Data_dumps), the recommended method for bulk data access. This approach respects Wikimedia's infrastructure guidelines.
 
- ## Screenshot
+ ## Installation
 
- <img src='https://raw.githubusercontent.com/yohasebe/wp2txt/master/image/screenshot.png' width="800" />
+ ### Install wp2txt
 
- **Environment**
+ $ gem install wp2txt
 
- - WP2TXT 1.0.1
- - MacBook Pro (2021 Apple M1 Pro)
- - enwiki-20220720-pages-articles.xml.bz2 (19.98 GB)
+ ### System Requirements
 
- In the above environment, the process (decompression, splitting, extraction, and conversion) to obtain the plain text data of the English Wikipedia takes less than 1.5 hours.
+ WP2TXT requires one of the following commands to decompress `bz2` files:
 
- ## Features
+ - `lbzip2` (recommended - uses multiple CPU cores)
+ - `pbzip2`
+ - `bzip2` (pre-installed on most systems)
 
- - Converts Wikipedia dump files in various languages
- - Creates output files of specified size
- - Allows specifying text elements (page titles, section headers, paragraphs, list items) to be extracted
- - Allows extracting category information of the article
- - Allows extracting opening paragraphs of the article
+ On macOS with Homebrew:
 
- ## Setting Up
+ $ brew install lbzip2
 
- ### WP2TXT on Docker
+ On Windows: Install [Bzip2 for Windows](http://gnuwin32.sourceforge.net/packages/bzip2.htm) and add it to PATH.
 
- 1. Install [Docker Desktop](https://www.docker.com/products/docker-desktop/) (Mac/Windows/Linux)
- 2. Execute the `docker` command in a terminal:
+ ### Docker (Alternative)
 
  ```shell
- docker run -it -v /Users/me/localdata:/data yohasebe/wp2txt
+ docker run -it -v /path/to/localdata:/data yohasebe/wp2txt
  ```
 
- - Make sure to replace `/Users/me/localdata` with the full path to the data directory on your local computer
+ The `wp2txt` command is available inside the container. Use `/data` for input/output files.
 
- 3. The Docker image will begin downloading and a bash prompt will appear when finished.
- 4. The `wp2txt` command will be available anywhere in the Docker container. Use the `/data` directory as the location of the input dump files and the output text files.
+ ## Basic Usage
 
- **IMPORTANT:**
+ ### Auto-download and process (Recommended)
 
- - Configure Docker Desktop resource settings (number of cores, amount of memory, etc.) to get the best performance possible.
- - When running the `wp2txt` command inside a Docker container, be sure to set the output directory to somewhere in the mounted local directory specified by the `docker run` command.
+ $ wp2txt --lang=ja -o ./text
 
- ### WP2TXT on macOS and Linux
+ This automatically downloads the Japanese Wikipedia dump and extracts plain text. Downloads are cached in `~/.wp2txt/cache/`.
 
- WP2TXT requires that one of the following commands be installed on the system in order to decompress `bz2` files:
+ ### Extract specific articles by title
 
- - `lbzip2` (recommended)
- - `pbzip2`
- - `bzip2`
+ $ wp2txt --lang=ja --articles="認知言語学,生成文法" -o ./articles
 
- In most cases, the `bzip2` command is pre-installed on the system. However, since `lbzip2` can use multiple CPU cores and is faster than `bzip2`, it is recommended that you install it additionally. WP2TXT will attempt to find the decompression command available on your system in the order listed above.
+ Only the index file and necessary data streams are downloaded, making it much faster than processing the full dump.
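+
+ As a rough illustration of why this is fast (a minimal sketch, not wp2txt's internal API; the index filename is an assumption), each line of a Wikipedia multistream index has the form `byte_offset:page_id:title`, so a title can be mapped to the single bz2 block that contains it:
+
+ ```ruby
+ # Sketch: find the bz2 block offset for one article via the multistream index.
+ # Index lines look like "byte_offset:page_id:title"; titles may contain colons,
+ # hence the 3-field split. The filename below is hypothetical.
+ target = "認知言語学"
+ offset = nil
+ File.foreach("jawiki-latest-pages-articles-multistream-index.txt") do |line|
+   off, _id, title = line.chomp.split(":", 3)
+   if title == target
+     offset = Integer(off)
+     break
+   end
+ end
+ # `offset` marks the start of a bz2 block holding a small batch of pages;
+ # only that block needs to be fetched and decompressed, not the whole dump.
+ ```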
 
- If you are using macOS with Homebrew installed, you can install `lbzip2` with the following command:
+ ### Extract articles from a category
 
- $ brew install lbzip2
+ $ wp2txt --lang=ja --from-category="日本の都市" -o ./cities
 
- ### WP2TXT on Windows
+ Include subcategories with `--depth`:
 
- Install [Bzip2 for Windows](http://gnuwin32.sourceforge.net/packages/bzip2.htm) and set the path so that WP2TXT can use the bunzip2.exe command. Alternatively, you can extract the Wikipedia dump file in your own way and process the resulting XML file with WP2TXT.
+ $ wp2txt --lang=ja --from-category="日本の都市" --depth=2 -o ./cities
 
- ## Installation
+ Preview without downloading (shows article counts):
 
- ### WP2TXT command
+ $ wp2txt --lang=ja --from-category="日本の都市" --dry-run
 
- $ gem install wp2txt
+ ### Process local dump file
 
- ## Wikipedia Dump File
+ $ wp2txt -i ./enwiki-20220801-pages-articles.xml.bz2 -o ./text
 
- Download the latest Wikipedia dump file for the desired language at a URL such as
+ ### Other extraction modes
 
- https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
+ # Category info only (title + categories)
+ $ wp2txt -g --lang=ja -o ./category
 
- Here, `enwiki` refers to the English Wikipedia. To get the Japanese Wikipedia dump file, for instance, change this to `jawiki` (Japanese). In doing so, note that there are two instances of `enwiki` in the URL above.
+ # Summary only (title + categories + opening paragraphs)
+ $ wp2txt -s --lang=ja -o ./summary
 
- Alternatively, you can also select Wikipedia dump files created on a specific date from [here](http://dumps.wikimedia.org/backup-index.html). Make sure to download a file named in the following format:
+ # Metadata only (title + section headings + categories)
+ $ wp2txt -M --lang=ja --format json -o ./metadata
 
- xxwiki-yyyymmdd-pages-articles.xml.bz2
+ # Extract specific sections (comma-separated, 'summary' for lead text)
+ $ wp2txt --lang=en --sections="summary,Plot,Reception" --format json -o ./sections
 
- where `xx` is a language code such as `en` (English) or `ja` (Japanese), and `yyyymmdd` is the date of creation (e.g. `20220801`).
+ # Section heading statistics
+ $ wp2txt --lang=ja --section-stats -o ./stats
 
- ## Basic Usage
+ # JSON/JSONL output
+ $ wp2txt --format json --lang=ja -o ./json
 
- Suppose you have a folder with a Wikipedia dump file and empty subfolders organized as follows:
+ ## Sample Output
+
+ ### Text Output
 
  ```
- .
- ├── enwiki-20220801-pages-articles.xml.bz2
- ├── /xml
- ├── /text
- ├── /category
- └── /summary
+ [[Article Title]]
+
+ Article content goes here with sections and paragraphs...
+
+ CATEGORIES: Category1, Category2, Category3
  ```
 
- ### Decompress and Split
+ ### JSON/JSONL Output
 
- The following command will decompress the entire Wikipedia data and split it into many small (approximately 10 MB) XML files.
+ Each line contains one JSON object:
 
- $ wp2txt --no-convert -i ./enwiki-20220801-pages-articles.xml.bz2 -o ./xml
+ ```json
+ {"title": "Article Title", "categories": ["Cat1", "Cat2"], "text": "...", "redirect": null}
+ ```
 
- **Note**: The resulting files are not well-formed XML. They contain part of the original XML extracted from the Wikipedia dump file, taking care to ensure that the content within the <page> tag is not split into multiple files.
+ For redirect articles:
 
- ### Extract plain text from MediaWiki XML
+ ```json
+ {"title": "NYC", "categories": [], "text": "", "redirect": "New York City"}
+ ```
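+
+ A minimal sketch of consuming this JSONL output in a Ruby pipeline (the input path is hypothetical; the fields are exactly those shown above):
+
+ ```ruby
+ require "json"
+
+ # Stream a wp2txt JSONL file line by line; each line is one article object.
+ File.foreach("./json/articles.json") do |line|
+   article = JSON.parse(line)
+   next if article["redirect"]   # redirect stubs carry no body text
+   puts "#{article["title"]} (#{article["categories"].size} categories)"
+ end
+ ```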
 
- $ wp2txt -i ./xml -o ./text
+ ## Cache Management
 
+ $ wp2txt --cache-status # Show cache status
+ $ wp2txt --cache-clear # Clear all cache
+ $ wp2txt --cache-clear --lang=ja # Clear cache for Japanese only
+ $ wp2txt --update-cache # Force fresh download
 
- ### Extract only category info from MediaWiki XML
+ When the cache exceeds the expiry period (default: 30 days), wp2txt displays a warning but still allows using the cached data.
 
- $ wp2txt -g -i ./xml -o ./category
+ ## Advanced Options
 
- ### Extract opening paragraphs from MediaWiki XML
+ ### Content Type Markers
 
- $ wp2txt -s -i ./xml -o ./summary
+ Special content is replaced with marker placeholders by default:
 
- ### Extract directly from bz2 compressed file
+ **Inline markers** (appear within sentences):
 
- It is possible (though not recommended) to 1) decompress the dump files, 2) split the data into files, and 3) extract the text with just one command. You can automatically remove all the intermediate XML files with the `-x` option.
+ | Marker | Content Type |
+ |--------|--------------|
+ | `[MATH]` | Mathematical formulas |
+ | `[CODE]` | Inline code |
+ | `[CHEM]` | Chemical formulas |
+ | `[IPA]` | IPA phonetic notation |
 
- $ wp2txt -i ./enwiki-20220801-pages-articles.xml.bz2 -o ./text -x
+ **Block markers** (standalone content):
 
- ## Sample Output
+ | Marker | Content Type |
+ |--------|--------------|
+ | `[CODEBLOCK]` | Source code blocks |
+ | `[TABLE]` | Wiki tables |
+ | `[INFOBOX]` | Information boxes |
+ | `[NAVBOX]` | Navigation boxes |
+ | `[GALLERY]` | Image galleries |
+ | `[REFERENCES]` | Reference lists |
+ | `[SCORE]` | Musical scores |
+ | `[TIMELINE]` | Timeline graphics |
+ | `[GRAPH]` | Graphs/charts |
+ | `[SIDEBAR]` | Sidebar templates |
+ | `[MAPFRAME]` | Interactive maps |
+ | `[IMAGEMAP]` | Clickable image maps |
 
- Output contains title, category info, and paragraphs
+ Configure with `--markers`:
 
- $ wp2txt -i ./input -o /output
+ $ wp2txt --lang=en --markers=all -o ./text # All markers (default)
+ $ wp2txt --lang=en --markers=math,code -o ./text # Only MATH and CODE
 
- - [English Wikipedia](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en.txt)
- - [Japanese Wikipedia](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja.txt)
+ **Note**: `--markers=none` is deprecated, as removing special content can make the surrounding text nonsensical.
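+
+ For example, with the default settings an inline formula is reduced to its marker in place (an illustrative input/output pair, not verbatim tool output):
+
+ ```
+ Wikitext: The relation <math>E = mc^2</math> follows from special relativity.
+ Output:   The relation [MATH] follows from special relativity.
+ ```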
 
- Output containing title and category only
+ ### Template Expansion
 
- $ wp2txt -g -i ./input -o /output
+ Common MediaWiki templates are automatically expanded (enabled by default):
 
- - [English Wikipedia](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en_category.txt)
- - [Japanese Wikipedia](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja_category.txt)
+ | Template | Output |
+ |----------|--------|
+ | `{{birth date\|1990\|5\|15}}` | May 15, 1990 |
+ | `{{convert\|100\|km\|mi}}` | 100 km (62 mi) |
+ | `{{coord\|35\|41\|N\|139\|41\|E}}` | 35°41′N 139°41′E |
+ | `{{lang\|ja\|日本語}}` | 日本語 |
+ | `{{nihongo\|Tokyo\|東京\|Tōkyō}}` | Tokyo (東京, Tōkyō) |
+ | `{{frac\|1\|2}}` | 1/2 |
+ | `{{circa\|1900}}` | c. 1900 |
 
- Output containing title, category, and summary
+ Supported: date/age templates, unit conversion, coordinates, language tags, quotes, fractions, and more. Parser functions (`{{#if:}}`, `{{#switch:}}`) and magic words (`{{PAGENAME}}`, `{{CURRENTYEAR}}`) are also supported.
 
- $ wp2txt -s -i ./input -o /output
+ Disable with `--no-expand-templates`.
 
- - [English Wikipedia](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en_summary.txt)
- - [Japanese Wikipedia](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja_summary.txt)
+ ### Citation Extraction
 
- ## Command Line Options
+ By default, citation templates are removed. Use `--extract-citations` to extract formatted citations:
 
- Command line options are as follows:
+ $ wp2txt --lang=en --extract-citations -o ./text
+
+ Supported: `{{cite book}}`, `{{cite web}}`, `{{cite news}}`, `{{cite journal}}`, `{{Citation}}`, etc.
+
+ ## Command Line Options
 
  Usage: wp2txt [options]
- where [options] are:
- -i, --input Path to compressed file (bz2) or decompressed file (xml), or path to a directory containing files of the latter format
- -o, --output-dir=<s> Path to output directory
- -c, --convert, --no-convert Output in plain text (converting from XML) (default: true)
- -a, --category, --no-category Show article category information (default: true)
- -g, --category-only Extract only article title and categories
- -s, --summary-only Extract only article title, categories, and summary text before first heading
- -f, --file-size=<i> Approximate size (in MB) of each output file (default: 10)
- -n, --num-procs Number of processes (up to 8) to be run concurrently (default: max number of available CPU cores minus two)
- -x, --del-interfile Delete intermediate XML files from output dir
- -t, --title, --no-title Keep page titles in output (default: true)
- -d, --heading, --no-heading Keep section titles in output (default: true)
- -l, --list Keep unprocessed list items in output
- -r, --ref Keep reference notations in the format [ref]...[/ref]
- -e, --redirect Show redirect destination
- -m, --marker, --no-marker Show symbols prefixed to list items, definitions, etc. (default: true)
- -b, --bz2-gem Use Ruby's bzip2-ruby gem instead of a system command
- -v, --version Print version and exit
- -h, --help Show this message
+
+ Input source (one of --input or --lang required):
+ -i, --input=<s> Path to compressed file (bz2) or XML file
+ -L, --lang=<s> Wikipedia language code (e.g., ja, en, de)
+ -A, --articles=<s> Specific article titles (comma-separated)
+ -G, --from-category=<s> Extract articles from Wikipedia category
+ -D, --depth=<i> Subcategory recursion depth (default: 0)
+ -y, --yes Skip confirmation prompt
+ --dry-run Preview category extraction
+ -U, --update-cache Force refresh of cached files
+
+ Output options:
+ -o, --output-dir=<s> Output directory (default: current)
+ -j, --format=<s> Output format: text or json (default: text)
+ -f, --file-size=<i> Output file size in MB (default: 10, 0=single)
+
+ Cache management:
+ --cache-dir=<s> Cache directory (default: ~/.wp2txt/cache)
+ --cache-status Show cache status and exit
+ --cache-clear Clear cache and exit
+
+ Configuration:
+ --config-init Create default config (~/.wp2txt/config.yml)
+ --config-path=<s> Path to configuration file
+
+ Extraction modes (mutually exclusive):
+ -g, --category-only Extract only title and categories
+ -s, --summary-only Extract title, categories, and summary
+ -M, --metadata-only Extract only title, headings, and categories
+
+ Section extraction:
+ -S, --sections=<s> Extract specific sections (comma-separated)
+ --section-output=<s> Output mode: structured or combined (default: structured)
+ --min-section-length=<i> Minimum section length in characters (default: 0)
+ --skip-empty Skip articles with no matching sections
+ --alias-file=<s> Custom section alias definitions file (YAML)
+ --no-section-aliases Disable section alias matching (exact match only)
+ --section-stats Collect and output section heading statistics (JSON)
+ --show-matched-sections Include matched_sections field in JSON output
+
+ Content filtering:
+ -a, --category, --no-category Show category info (default: true)
+ -t, --title, --no-title Keep page titles (default: true)
+ -d, --heading, --no-heading Keep section titles (default: true)
+ -l, --list Keep list items (default: false)
+ --table Keep wiki table content (default: false)
+ -p, --pre Keep preformatted text blocks (default: false)
+ -r, --ref Keep references as [ref]...[/ref] (default: false)
+ --multiline Keep multi-line templates (default: false)
+ -e, --redirect Show redirect destination (default: false)
+ -m, --marker, --no-marker Show list markers (default: true)
+ -k, --markers=<s> Content markers (default: all)
+ -C, --extract-citations Extract formatted citations
+ -E, --expand-templates Expand templates (default: true)
+ --no-expand-templates Disable template expansion
+
+ Performance:
+ -n, --num-procs=<i> Parallel processes (default: auto)
+ --no-turbo Disable turbo mode (saves disk space, slower)
+ -R, --ractor Use Ractor parallelism (Ruby 4.0+, streaming only)
+ -b, --bz2-gem Use bzip2-ruby gem instead of system command
+
+ Output control:
+ -q, --quiet Suppress progress output (errors only)
+ --no-color Disable colored output
+
+ Info:
+ -v, --version Print version
+ -h, --help Show help
+
+ ## Configuration File
+
+ Create persistent settings with:
+
+ $ wp2txt --config-init
+
+ This creates `~/.wp2txt/config.yml`:
+
+ ```yaml
+ cache:
+   dump_expiry_days: 30     # Days before dumps are stale (1-365)
+   category_expiry_days: 7  # Category cache expiry (1-90)
+   directory: ~/.wp2txt/cache
+
+ defaults:
+   format: text             # Default output format
+   depth: 0                 # Default subcategory depth
+ ```
+
+ Command-line options override configuration file settings.
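+
+ For example, with `format: text` set in the configuration file, the following run still produces JSON because the command-line flag takes precedence:
+
+ $ wp2txt --lang=ja --format json -o ./json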
+
+ ## Performance
+
+ Benchmark results on a MacBook Air M4 (7 parallel processes, turbo mode, excluding download time):
+
+ | Wikipedia | Dump Size | Articles | Processing Time | Output |
+ |-----------|-----------|----------|-----------------|--------|
+ | Japanese | 4.37 GB | 1,485,937 | ~27 min | 463 files (4.5 GB) |
+ | English | 24.2 GB | ~6.8M | ~2 hours | 2,000 files (20 GB) |
+
+ Turbo mode (default) splits the bz2 dump into XML chunks first, then processes them in parallel. Use `--no-turbo` to save disk space at the cost of slower processing.
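+
+ For example, on a disk-constrained machine:
+
+ $ wp2txt --lang=ja --no-turbo -o ./text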
 
  ## Caveats
 
- * Some data, such as mathematical formulas and computer source code, will not be converted correctly.
- * Some text data may not be extracted correctly for various reasons (incorrect matching of begin/end tags, language-specific formatting rules, etc.).
- * The conversion process can take longer than expected. When dealing with a huge data set such as the English Wikipedia in a low-spec environment, it can take several hours or more.
+ * Special content (math, code, etc.) is marked with placeholders by default.
+ * Some text may not be extracted correctly due to markup variations or language-specific formatting.
+
+ ## Changelog
+
+ See [CHANGELOG.md](CHANGELOG.md) for detailed release notes.
+
+ **v2.1.0 (February 2026)**: SQLite caching, Ractor parallelism (Ruby 4.0+), template expansion, content markers, Docker image update.
+
+ **v2.0.0 (January 2026)**: Auto-download mode, category-based extraction, article extraction by title, JSON output, streaming processing, Ruby 4.0 support.
 
  ## Useful Links
 
@@ -223,14 +359,14 @@ The author will appreciate your mentioning one of these in your research.
  * Yoichiro HASEBE. 2006. [Method for using Wikipedia as Japanese corpus.](http://ci.nii.ac.jp/naid/110006226727) _Doshisha Studies in Language and Culture_ 9(2), 373-403.
  * 長谷部陽一郎. 2006. [Wikipedia日本語版をコーパスとして用いた言語研究の手法](http://ci.nii.ac.jp/naid/110006226727). 『言語文化』9(2), 373-403.
 
- Or use this BibTeX entry:
+ BibTeX:
 
  ```
- @misc{wp2txt_2023,
+ @misc{wp2txt_2026,
  author = {Yoichiro Hasebe},
  title = {WP2TXT: A command-line toolkit to extract text content and category data from Wikipedia dump files},
  url = {https://github.com/yohasebe/wp2txt},
- year = {2023}
+ year = {2026}
  }
 ```