wp2txt 1.1.2 → 2.1.0

Files changed (96)
  1. checksums.yaml +4 -4
  2. data/.dockerignore +12 -0
  3. data/.github/workflows/ci.yml +13 -13
  4. data/.gitignore +14 -0
  5. data/CHANGELOG.md +284 -0
  6. data/DEVELOPMENT.md +415 -0
  7. data/DEVELOPMENT_ja.md +415 -0
  8. data/Dockerfile +19 -10
  9. data/Gemfile +2 -8
  10. data/README.md +261 -121
  11. data/README_ja.md +375 -0
  12. data/Rakefile +4 -0
  13. data/bin/wp2txt +863 -159
  14. data/lib/wp2txt/article.rb +98 -13
  15. data/lib/wp2txt/bz2_validator.rb +239 -0
  16. data/lib/wp2txt/category_cache.rb +313 -0
  17. data/lib/wp2txt/cli.rb +319 -0
  18. data/lib/wp2txt/cli_ui.rb +428 -0
  19. data/lib/wp2txt/config.rb +158 -0
  20. data/lib/wp2txt/constants.rb +134 -0
  21. data/lib/wp2txt/data/html_entities.json +2135 -0
  22. data/lib/wp2txt/data/language_metadata.json +4769 -0
  23. data/lib/wp2txt/data/language_tiers.json +59 -0
  24. data/lib/wp2txt/data/mediawiki_aliases.json +12366 -0
  25. data/lib/wp2txt/data/template_aliases.json +193 -0
  26. data/lib/wp2txt/data/wikipedia_entities.json +12 -0
  27. data/lib/wp2txt/extractor.rb +545 -0
  28. data/lib/wp2txt/file_utils.rb +91 -0
  29. data/lib/wp2txt/formatter.rb +352 -0
  30. data/lib/wp2txt/global_data_cache.rb +353 -0
  31. data/lib/wp2txt/index_cache.rb +258 -0
  32. data/lib/wp2txt/magic_words.rb +353 -0
  33. data/lib/wp2txt/memory_monitor.rb +236 -0
  34. data/lib/wp2txt/multistream.rb +1383 -0
  35. data/lib/wp2txt/output_writer.rb +182 -0
  36. data/lib/wp2txt/parser_functions.rb +606 -0
  37. data/lib/wp2txt/ractor_worker.rb +215 -0
  38. data/lib/wp2txt/regex.rb +396 -12
  39. data/lib/wp2txt/section_extractor.rb +354 -0
  40. data/lib/wp2txt/stream_processor.rb +271 -0
  41. data/lib/wp2txt/template_expander.rb +830 -0
  42. data/lib/wp2txt/text_processing.rb +337 -0
  43. data/lib/wp2txt/utils.rb +629 -270
  44. data/lib/wp2txt/version.rb +1 -1
  45. data/lib/wp2txt.rb +53 -26
  46. data/scripts/benchmark_regex.rb +161 -0
  47. data/scripts/fetch_html_entities.rb +94 -0
  48. data/scripts/fetch_language_metadata.rb +180 -0
  49. data/scripts/fetch_mediawiki_data.rb +334 -0
  50. data/scripts/fetch_template_data.rb +186 -0
  51. data/scripts/profile_memory.rb +139 -0
  52. data/spec/article_spec.rb +402 -0
  53. data/spec/auto_download_spec.rb +314 -0
  54. data/spec/bz2_validator_spec.rb +193 -0
  55. data/spec/category_cache_spec.rb +226 -0
  56. data/spec/category_fetcher_spec.rb +504 -0
  57. data/spec/cleanup_spec.rb +197 -0
  58. data/spec/cli_options_spec.rb +678 -0
  59. data/spec/cli_spec.rb +876 -0
  60. data/spec/config_spec.rb +194 -0
  61. data/spec/constants_spec.rb +138 -0
  62. data/spec/file_utils_spec.rb +170 -0
  63. data/spec/fixtures/samples.rb +181 -0
  64. data/spec/formatter_sections_spec.rb +382 -0
  65. data/spec/global_data_cache_spec.rb +186 -0
  66. data/spec/index_cache_spec.rb +210 -0
  67. data/spec/integration_spec.rb +543 -0
  68. data/spec/magic_words_spec.rb +261 -0
  69. data/spec/markers_spec.rb +476 -0
  70. data/spec/memory_monitor_spec.rb +192 -0
  71. data/spec/multistream_spec.rb +690 -0
  72. data/spec/output_writer_spec.rb +400 -0
  73. data/spec/parser_functions_spec.rb +455 -0
  74. data/spec/ractor_worker_spec.rb +197 -0
  75. data/spec/regex_spec.rb +281 -0
  76. data/spec/section_extractor_spec.rb +397 -0
  77. data/spec/spec_helper.rb +63 -0
  78. data/spec/stream_processor_spec.rb +579 -0
  79. data/spec/template_data_spec.rb +246 -0
  80. data/spec/template_expander_spec.rb +472 -0
  81. data/spec/template_processing_spec.rb +217 -0
  82. data/spec/text_processing_spec.rb +312 -0
  83. data/spec/utils_spec.rb +195 -16
  84. data/spec/wp2txt_spec.rb +510 -0
  85. data/wp2txt.gemspec +5 -3
  86. metadata +146 -18
  87. data/.rubocop.yml +0 -80
  88. data/data/output_samples/testdata_en.txt +0 -23002
  89. data/data/output_samples/testdata_en_category.txt +0 -132
  90. data/data/output_samples/testdata_en_summary.txt +0 -1376
  91. data/data/output_samples/testdata_ja.txt +0 -22774
  92. data/data/output_samples/testdata_ja_category.txt +0 -206
  93. data/data/output_samples/testdata_ja_summary.txt +0 -1560
  94. data/data/testdata_en.bz2 +0 -0
  95. data/data/testdata_ja.bz2 +0 -0
  96. data/image/screenshot.png +0 -0
data/README.md CHANGED
@@ -2,207 +2,347 @@
2
2
 
3
3
  A command-line toolkit to extract text content and category data from Wikipedia dump files
4
4
 
5
- ## About
5
+ English | [日本語](README_ja.md)
6
6
 
7
- WP2TXT extracts text and category data from Wikipedia dump files (encoded in XML / compressed with Bzip2), removing MediaWiki markup and other metadata.
7
+ ## Quick Start
8
8
 
9
- ## Changelog
9
+ ```bash
10
+ # Install
11
+ gem install wp2txt
12
+
13
+ # Extract text from Japanese Wikipedia (auto-download)
14
+ wp2txt --lang=ja -o ./output
10
15
 
11
- **April 2023**
16
+ # Extract specific articles
17
+ wp2txt --lang=ja --articles="東京,京都" -o ./articles
18
+
19
+ # Extract articles from a category
20
+ wp2txt --lang=ja --from-category="日本の都市" -o ./cities
21
+ ```
12
22
 
13
- - File split/delete issues fixed
23
+ ## About
14
24
 
15
- **January 2023**
25
+ WP2TXT extracts plain text and category information from Wikipedia dump files. It processes XML dumps (compressed with bzip2), removes MediaWiki markup, and outputs clean text suitable for corpus linguistics, text mining, and other research purposes.
16
26
 
17
- - Bug related to command line arguments fixed
18
- - Code cleanup introducing Rubocop
27
+ ## Key Features
19
28
 
20
- **December 2022**
29
+ - **Auto-download** - Automatically download dumps by language code
30
+ - **Article extraction by title** - Extract specific articles without downloading full dumps
31
+ - **Category-based extraction** - Extract all articles from a specific Wikipedia category
32
+ - **Category metadata extraction** - Preserves article category information in output
33
+ - **Template expansion** - Expands common templates (dates, units, coordinates) to readable text
34
+ - **Multilingual support** - Category and redirect detection for 350+ Wikipedia languages
35
+ - **Streaming processing** - Process large dumps without intermediate files
36
+ - **JSON output** - Machine-readable JSONL format for data pipelines
21
37
 
22
- - Docker images available via Docker Hub
38
+ ## Use Cases
23
39
 
24
- **November 2022**
40
+ wp2txt is particularly suited for:
25
41
 
26
- - Code added to suppress "Invalid byte sequence error" when an ilegal UTF-8 character is input.
42
+ - Building domain-specific corpora using category information
43
+ - Comparative linguistic research across topic areas
44
+ - Extracting Wikipedia text with metadata for NLP tasks
45
+ - Cross-linguistic studies using parallel category structures
27
46
 
28
- **August 2022**
47
+ ## Data Access
29
48
 
30
- - A new option `--category-only` has been added. When this option is enabled, only the title and category information of the article is extracted.
31
- - A new option `--summary-only` has been added. If this option is enabled, only the title, category information, and opening paragraphs of the article will be extracted.
32
- - Text conversion with the current version of WP2TXT is *more than 2x times faster* than the previous version due to parallel processing of multiple files (the rate of speedup depends on the CPU cores used for processing).
49
+ wp2txt uses [official Wikipedia dump files](https://meta.wikimedia.org/wiki/Data_dumps), the recommended method for bulk data access. This approach respects Wikimedia's infrastructure guidelines.
33
50
 
34
- ## Screenshot
51
+ ## Installation
35
52
 
36
- <img src='https://raw.githubusercontent.com/yohasebe/wp2txt/master/image/screenshot.png' width="800" />
53
+ ### Install wp2txt
37
54
 
38
- **Environment**
55
+ $ gem install wp2txt
39
56
 
40
- - WP2TXT 1.0.1
41
- - MacBook Pro (2021 Apple M1 Pro)
42
- - enwiki-20220720-pages-articles.xml.bz2 (19.98 GB)
57
+ ### System Requirements
43
58
 
44
- In the above environment, the process (decompression, splitting, extraction, and conversion) to obtain the plain text data of the English Wikipedia takes less than 1.5 hours.
59
+ WP2TXT requires one of the following commands to decompress `bz2` files:
45
60
 
46
- ## Features
61
+ - `lbzip2` (recommended - uses multiple CPU cores)
62
+ - `pbzip2`
63
+ - `bzip2` (pre-installed on most systems)
47
64
 
48
- - Converts Wikipedia dump files in various languages
49
- - Creates output files of specified size
50
- - Allows specifying ext elements (page titles, section headers, paragraphs, list items) to be extracted
51
- - Allows extracting category information of the article
52
- - Allows extracting opening paragraphs of the article
65
+ On macOS with Homebrew:
53
66
 
54
- ## Setting Up
67
+ $ brew install lbzip2
55
68
 
56
- ### WP2TXT on Docker
69
+ On Windows: Install [Bzip2 for Windows](http://gnuwin32.sourceforge.net/packages/bzip2.htm) and add to PATH.
57
70
 
58
- 1. Install [Docker Desktop](https://www.docker.com/products/docker-desktop/) (Mac/Windows/Linux)
59
- 2. Execute `docker` command in a terminal:
71
+ ### Docker (Alternative)
60
72
 
61
73
  ```shell
62
- docker run -it -v /Users/me/localdata:/data yohasebe/wp2txt
74
+ docker run -it -v /path/to/localdata:/data yohasebe/wp2txt
63
75
  ```
64
76
 
65
- - Make sure to Replace `/Users/me/localdata` with the full path to the data directory in your local computer
77
+ The `wp2txt` command is available inside the container. Use `/data` for input/output files.
66
78
 
67
- 3. The Docker image will begin downloading and a bash prompt will appear when finished.
68
- 4. The `wp2txt` command will be avalable anywhare in the Docker container. Use the `/data` directory as the location of the input dump files and the output text files.
79
+ ## Basic Usage
69
80
 
70
- **IMPORTANT:**
81
+ ### Auto-download and process (Recommended)
71
82
 
72
- - Configure Docker Desktop resource settings (number of cores, amount of memory, etc.) to get the best performance possible.
73
- - When running the `wp2txt` command inside a Docker container, be sure to set the output directory to somewhere in the mounted local directory specified by the `docker run` command.
83
+ $ wp2txt --lang=ja -o ./text
74
84
 
75
- ### WP2TXT on MacOS and Linux
85
+ This automatically downloads the Japanese Wikipedia dump and extracts plain text. Downloads are cached in `~/.wp2txt/cache/`.
76
86
 
77
- WP2TXT requires that one of the following commands be installed on the system in order to decompress `bz2` files:
87
+ ### Extract specific articles by title
78
88
 
79
- - `lbzip2` (recommended)
80
- - `pbzip2`
81
- - `bzip2`
89
+ $ wp2txt --lang=ja --articles="認知言語学,生成文法" -o ./articles
82
90
 
83
- In most cases, the `bzip2` command is pre-installed on the system. However, since `lbzip2` can use multiple CPU cores and is faster than `bzip2`, it is recommended that you install it additionally. WP2TXT will attempt to find the decompression command available on your system in the order listed above.
91
+ Only the index file and necessary data streams are downloaded, making it much faster than processing the full dump.
84
92
 
85
- If you are using MacOS with Homebrew installed, you can install `lbzip2` with the following command:
93
+ ### Extract articles from a category
86
94
 
87
- $ brew install lbzip2
95
+ $ wp2txt --lang=ja --from-category="日本の都市" -o ./cities
88
96
 
89
- ### WP2TXT on Windows
97
+ Include subcategories with `--depth`:
90
98
 
91
- Install [Bzip2 for Windows](http://gnuwin32.sourceforge.net/packages/bzip2.htm) and set the path so that WP2TXT can use the bunzip2.exe command. Alternatively, you can extract the Wikipedia dump file in your own way and process the resulting XML file with WP2TXT.
99
+ $ wp2txt --lang=ja --from-category="日本の都市" --depth=2 -o ./cities
92
100
 
93
- ## Installation
101
+ Preview without downloading (shows article counts):
94
102
 
95
- ### WP2TXT command
103
+ $ wp2txt --lang=ja --from-category="日本の都市" --dry-run
96
104
 
97
- $ gem install wp2txt
105
+ ### Process local dump file
98
106
 
99
- ## Wikipedia Dump File
107
+ $ wp2txt -i ./enwiki-20220801-pages-articles.xml.bz2 -o ./text
100
108
 
101
- Download the latest Wikipedia dump file for the desired language at a URL such as
109
+ ### Other extraction modes
102
110
 
103
- https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
111
+ # Category info only (title + categories)
112
+ $ wp2txt -g --lang=ja -o ./category
104
113
 
105
- Here, `enwiki` refers to the English Wikipedia. To get the Japanese Wikipedia dump file, for instance, change this to `jawiki` (Japanese). In doing so, note that there are two instances of `enwiki` in the URL above.
114
+ # Summary only (title + categories + opening paragraphs)
115
+ $ wp2txt -s --lang=ja -o ./summary
106
116
 
107
- Alternatively, you can also select Wikipedia dump files created on a specific date from [here](http://dumps.wikimedia.org/backup-index.html). Make sure to download a file named in the following format:
117
+ # Metadata only (title + section headings + categories)
118
+ $ wp2txt -M --lang=ja --format json -o ./metadata
108
119
 
109
- xxwiki-yyyymmdd-pages-articles.xml.bz2
120
+ # Extract specific sections (comma-separated, 'summary' for lead text)
121
+ $ wp2txt --lang=en --sections="summary,Plot,Reception" --format json -o ./sections
110
122
 
111
- where `xx` is language code such as `en` (English)" or `ja` (japanese), and `yyyymmdd` is the date of creation (e.g. `20220801`).
123
+ # Section heading statistics
124
+ $ wp2txt --lang=ja --section-stats -o ./stats
112
125
 
113
- ## Basic Usage
126
+ # JSON/JSONL output
127
+ $ wp2txt --format json --lang=ja -o ./json
128
+
129
+ ## Sample Output
114
130
 
115
- Suppose you have a folder with a wikipedia dump file and empty subfolders organized as follows:
131
+ ### Text Output
116
132
 
117
133
  ```
118
- .
119
- ├── enwiki-20220801-pages-articles.xml.bz2
120
- ├── /xml
121
- ├── /text
122
- ├── /category
123
- └── /summary
134
+ [[Article Title]]
135
+
136
+ Article content goes here with sections and paragraphs...
137
+
138
+ CATEGORIES: Category1, Category2, Category3
124
139
  ```
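Reviewer sketch (not part of the diff): one way a downstream script might read the text layout shown above. The helper name `parse_text_article` is hypothetical, and the parsing rules (title on a `[[...]]` line, categories on a comma-separated `CATEGORIES:` line) are inferred only from the sample, not from the wp2txt source.

```ruby
# Hypothetical parser for one article in the sample text layout.
# Assumption: category names do not themselves contain commas.
def parse_text_article(chunk)
  title = nil
  categories = []
  body = []
  chunk.each_line do |raw|
    line = raw.chomp
    if title.nil? && (m = line.match(/\A\[\[(.+)\]\]\z/))
      # First "[[...]]" line is treated as the article title.
      title = m[1]
    elsif line.start_with?("CATEGORIES:")
      categories = line.delete_prefix("CATEGORIES:")
                       .split(",").map(&:strip).reject(&:empty?)
    else
      body << line
    end
  end
  { title: title, categories: categories, text: body.join("\n").strip }
end

sample = <<~TEXT
  [[Article Title]]

  Article content goes here with sections and paragraphs...

  CATEGORIES: Category1, Category2, Category3
TEXT

article = parse_text_article(sample)
puts article[:title]
puts article[:categories].inspect
```

For anything beyond quick inspection, the JSON output format is likely the safer input, since it does not depend on guessing delimiters.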
125
140
 
126
- ### Decompress and Split
141
+ ### JSON/JSONL Output
127
142
 
128
- The following command will decompress the entire wikipedia data and split it into many small (approximately 10 MB) XML files.
143
+ Each line contains one JSON object:
129
144
 
130
- $ wp2txt --no-convert -i ./enwiki-20220801-pages-articles.xml.bz2 -o ./xml
145
+ ```json
146
+ {"title": "Article Title", "categories": ["Cat1", "Cat2"], "text": "...", "redirect": null}
147
+ ```
131
148
 
132
- **Note**: The resulting files are not well-formed XML. They contain part of the orignal XML extracted from the Wikipedia dump file, taking care to ensure that the content within the <page> tag is not split into multiple files.
149
+ For redirect articles:
133
150
 
134
- ### Extract plain text from MediaWiki XML
151
+ ```json
152
+ {"title": "NYC", "categories": [], "text": "", "redirect": "New York City"}
153
+ ```
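Reviewer sketch (not part of the diff): a minimal way to consume the JSONL output, assuming only the four fields shown in the two examples above (`title`, `categories`, `text`, `redirect`). The helper name `partition_jsonl` is hypothetical; it separates redirect records from regular articles.

```ruby
require "json"

# Hypothetical helper: split JSONL records into articles and redirects.
# Assumes each non-empty line holds exactly one JSON object.
def partition_jsonl(lines)
  articles = []
  redirects = {}
  lines.each do |line|
    line = line.strip
    next if line.empty?

    record = JSON.parse(line)
    if record["redirect"]
      # Redirect records map a title to its target title.
      redirects[record["title"]] = record["redirect"]
    else
      articles << record
    end
  end
  [articles, redirects]
end

sample = [
  '{"title": "Article Title", "categories": ["Cat1", "Cat2"], "text": "...", "redirect": null}',
  '{"title": "NYC", "categories": [], "text": "", "redirect": "New York City"}'
]

articles, redirects = partition_jsonl(sample)
puts articles.map { |a| a["title"] }.inspect
puts redirects.inspect
```

With real output, `lines` could be `File.foreach(path)` so that large files are read one record at a time.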
135
154
 
136
- $ wp2txt -i ./xml -o ./text
155
+ ## Cache Management
137
156
 
157
+ $ wp2txt --cache-status # Show cache status
158
+ $ wp2txt --cache-clear # Clear all cache
159
+ $ wp2txt --cache-clear --lang=ja # Clear cache for Japanese only
160
+ $ wp2txt --update-cache # Force fresh download
138
161
 
139
- ### Extract only category info from MediaWiki XML
162
+ When cache exceeds the expiry period (default: 30 days), wp2txt displays a warning but allows using cached data.
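Reviewer sketch (not part of the diff): the warn-but-continue behaviour described above, expressed as a small staleness check. The function names and the use of a single timestamp per cached item are assumptions; the diff does not show how wp2txt actually tracks cache age.

```ruby
SECONDS_PER_DAY = 24 * 60 * 60

# Hypothetical check: true when a cached item is older than the expiry period.
def cache_stale?(cached_at, now: Time.now, expiry_days: 30)
  (now - cached_at) > expiry_days * SECONDS_PER_DAY
end

# Mirrors the documented behaviour: a stale cache triggers a warning,
# but the cached data is still used.
def cache_status_message(cached_at, now: Time.now, expiry_days: 30)
  if cache_stale?(cached_at, now: now, expiry_days: expiry_days)
    "warning: cache is older than #{expiry_days} days; using cached data anyway"
  else
    "cache is fresh"
  end
end

now = Time.utc(2026, 2, 1)
puts cache_status_message(Time.utc(2025, 12, 1), now: now)
puts cache_status_message(Time.utc(2026, 1, 20), now: now)
```

A forced refresh would correspond to the `--update-cache` flag listed in the README rather than to anything in this sketch.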
140
163
 
141
- $ wp2txt -g -i ./xml -o ./category
164
+ ## Advanced Options
142
165
 
143
- ### Extract opening paragraphs from MediaWiki XML
166
+ ### Content Type Markers
144
167
 
145
- $ wp2txt -s -i ./xml -o ./summary
168
+ Special content is replaced with marker placeholders by default:
146
169
 
147
- ### Extract directly from bz2 compressed file
170
+ **Inline markers** (appear within sentences):
148
171
 
149
- It is possible (though not recommended) to 1) decompress the dump files, 2) split the data into files, and 3) extract the text just one line of command. You can automatically remove all the intermediate XML files with `-x` option.
172
+ | Marker | Content Type |
173
+ |--------|--------------|
174
+ | `[MATH]` | Mathematical formulas |
175
+ | `[CODE]` | Inline code |
176
+ | `[CHEM]` | Chemical formulas |
177
+ | `[IPA]` | IPA phonetic notation |
150
178
 
151
- $ wp2txt -i ./enwiki-20220801-pages-articles.xml.bz2 -o ./text -x
179
+ **Block markers** (standalone content):
152
180
 
153
- ## Sample Output
181
+ | Marker | Content Type |
182
+ |--------|--------------|
183
+ | `[CODEBLOCK]` | Source code blocks |
184
+ | `[TABLE]` | Wiki tables |
185
+ | `[INFOBOX]` | Information boxes |
186
+ | `[NAVBOX]` | Navigation boxes |
187
+ | `[GALLERY]` | Image galleries |
188
+ | `[REFERENCES]` | Reference lists |
189
+ | `[SCORE]` | Musical scores |
190
+ | `[TIMELINE]` | Timeline graphics |
191
+ | `[GRAPH]` | Graphs/charts |
192
+ | `[SIDEBAR]` | Sidebar templates |
193
+ | `[MAPFRAME]` | Interactive maps |
194
+ | `[IMAGEMAP]` | Clickable image maps |
154
195
 
155
- Output contains title, category info, paragraphs
196
+ Configure with `--markers`:
156
197
 
157
- $ wp2txt -i ./input -o /output
198
+ $ wp2txt --lang=en --markers=all -o ./text # All markers (default)
199
+ $ wp2txt --lang=en --markers=math,code -o ./text # Only MATH and CODE
158
200
 
159
- - [English Wikipedia](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en.txt)
160
- - [Japanese Wikipedia](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja.txt)
201
+ **Note**: `--markers=none` is deprecated as removing special content can make surrounding text nonsensical.
161
202
 
162
- Output containing title and category only
203
+ ### Template Expansion
163
204
 
164
- $ wp2txt -g -i ./input -o /output
205
+ Common MediaWiki templates are automatically expanded (enabled by default):
165
206
 
166
- - [English Wikipedia](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en_category.txt)
167
- - [Japanese Wikipedia](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja_category.txt)
207
+ | Template | Output |
208
+ |----------|--------|
209
+ | `{{birth date\|1990\|5\|15}}` | May 15, 1990 |
210
+ | `{{convert\|100\|km\|mi}}` | 100 km (62 mi) |
211
+ | `{{coord\|35\|41\|N\|139\|41\|E}}` | 35°41′N 139°41′E |
212
+ | `{{lang\|ja\|日本語}}` | 日本語 |
213
+ | `{{nihongo\|Tokyo\|東京\|Tōkyō}}` | Tokyo (東京, Tōkyō) |
214
+ | `{{frac\|1\|2}}` | 1/2 |
215
+ | `{{circa\|1900}}` | c. 1900 |
168
216
 
169
- Output containing title, category, and summary
217
+ Supported: date/age templates, unit conversion, coordinates, language tags, quotes, fractions, and more. Parser functions (`{{#if:}}`, `{{#switch:}}`) and magic words (`{{PAGENAME}}`, `{{CURRENTYEAR}}`) are also supported.
170
218
 
171
- $ wp2txt -s -i ./input -o /output
219
+ Disable with `--no-expand-templates`.
172
220
 
173
- - [English Wikipedia](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en_summary.txt)
174
- - [Japanese Wikipedia](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja_summary.txt)
221
+ ### Citation Extraction
175
222
 
176
- ## Command Line Options
223
+ By default, citation templates are removed. Use `--extract-citations` to extract formatted citations:
224
+
225
+ $ wp2txt --lang=en --extract-citations -o ./text
177
226
 
178
- Command line options are as follows:
227
+ Supported: `{{cite book}}`, `{{cite web}}`, `{{cite news}}`, `{{cite journal}}`, `{{Citation}}`, etc.
228
+
229
+ ## Command Line Options
179
230
 
180
231
  Usage: wp2txt [options]
181
- where [options] are:
182
- -i, --input Path to compressed file (bz2) or decompressed file (xml), or path to directory containing files of the latter format
183
- -o, --output-dir=<s> Path to output directory
184
- -c, --convert, --no-convert Output in plain text (converting from XML) (default: true)
185
- -a, --category, --no-category Show article category information (default: true)
186
- -g, --category-only Extract only article title and categories
187
- -s, --summary-only Extract only article title, categories, and summary text before first heading
188
- -f, --file-size=<i> Approximate size (in MB) of each output file (default: 10)
189
- -n, --num-procs Number of proccesses to be run concurrently (default: max num of available CPU cores minus two)
190
- -x, --del-interfile Delete intermediate XML files from output dir
191
- -t, --title, --no-title Keep page titles in output (default: true)
192
- -d, --heading, --no-heading Keep section titles in output (default: true)
193
- -l, --list Keep unprocessed list items in output
194
- -r, --ref Keep reference notations in the format [ref]...[/ref]
195
- -e, --redirect Show redirect destination
196
- -m, --marker, --no-marker Show symbols prefixed to list items, definitions, etc. (Default: true)
197
- -b, --bz2-gem Use Ruby's bzip2-ruby gem instead of a system command
198
- -v, --version Print version and exit
199
- -h, --help Show this message
232
+
233
+ Input source (one of --input or --lang required):
234
+ -i, --input=<s> Path to compressed file (bz2) or XML file
235
+ -L, --lang=<s> Wikipedia language code (e.g., ja, en, de)
236
+ -A, --articles=<s> Specific article titles (comma-separated)
237
+ -G, --from-category=<s> Extract articles from Wikipedia category
238
+ -D, --depth=<i> Subcategory recursion depth (default: 0)
239
+ -y, --yes Skip confirmation prompt
240
+ --dry-run Preview category extraction
241
+ -U, --update-cache Force refresh of cached files
242
+
243
+ Output options:
244
+ -o, --output-dir=<s> Output directory (default: current)
245
+ -j, --format=<s> Output format: text or json (default: text)
246
+ -f, --file-size=<i> Output file size in MB (default: 10, 0=single)
247
+
248
+ Cache management:
249
+ --cache-dir=<s> Cache directory (default: ~/.wp2txt/cache)
250
+ --cache-status Show cache status and exit
251
+ --cache-clear Clear cache and exit
252
+
253
+ Configuration:
254
+ --config-init Create default config (~/.wp2txt/config.yml)
255
+ --config-path=<s> Path to configuration file
256
+
257
+ Extraction modes (mutually exclusive):
258
+ -g, --category-only Extract only title and categories
259
+ -s, --summary-only Extract title, categories, and summary
260
+ -M, --metadata-only Extract only title, headings, and categories
261
+
262
+ Section extraction:
263
+ -S, --sections=<s> Extract specific sections (comma-separated)
264
+ --section-output=<s> Output mode: structured or combined (default: structured)
265
+ --min-section-length=<i> Minimum section length in characters (default: 0)
266
+ --skip-empty Skip articles with no matching sections
267
+ --alias-file=<s> Custom section alias definitions file (YAML)
268
+ --no-section-aliases Disable section alias matching (exact match only)
269
+ --section-stats Collect and output section heading statistics (JSON)
270
+ --show-matched-sections Include matched_sections field in JSON output
271
+
272
+ Content filtering:
273
+ -a, --category, --no-category Show category info (default: true)
274
+ -t, --title, --no-title Keep page titles (default: true)
275
+ -d, --heading, --no-heading Keep section titles (default: true)
276
+ -l, --list Keep list items (default: false)
277
+ --table Keep wiki table content (default: false)
278
+ -p, --pre Keep preformatted text blocks (default: false)
279
+ -r, --ref Keep references as [ref]...[/ref] (default: false)
280
+ --multiline Keep multi-line templates (default: false)
281
+ -e, --redirect Show redirect destination (default: false)
282
+ -m, --marker, --no-marker Show list markers (default: true)
283
+ -k, --markers=<s> Content markers (default: all)
284
+ -C, --extract-citations Extract formatted citations
285
+ -E, --expand-templates Expand templates (default: true)
286
+ --no-expand-templates Disable template expansion
287
+
288
+ Performance:
289
+ -n, --num-procs=<i> Parallel processes (default: auto)
290
+ --no-turbo Disable turbo mode (saves disk space, slower)
291
+ -R, --ractor Use Ractor parallelism (Ruby 4.0+, streaming only)
292
+ -b, --bz2-gem Use bzip2-ruby gem instead of system command
293
+
294
+ Output control:
295
+ -q, --quiet Suppress progress output (errors only)
296
+ --no-color Disable colored output
297
+
298
+ Info:
299
+ -v, --version Print version
300
+ -h, --help Show help
301
+
302
+ ## Configuration File
303
+
304
+ Create persistent settings with:
305
+
306
+ $ wp2txt --config-init
307
+
308
+ This creates `~/.wp2txt/config.yml`:
309
+
310
+ ```yaml
311
+ cache:
312
+ dump_expiry_days: 30 # Days before dumps are stale (1-365)
313
+ category_expiry_days: 7 # Category cache expiry (1-90)
314
+ directory: ~/.wp2txt/cache
315
+
316
+ defaults:
317
+ format: text # Default output format
318
+ depth: 0 # Default subcategory depth
319
+ ```
320
+
321
+ Command-line options override configuration file settings.
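Reviewer sketch (not part of the diff): the precedence rule stated above, written as a layered merge. `resolve_options` and `BUILTIN_DEFAULTS` are hypothetical names; only the `defaults` keys and values come from the sample config, and the real option handling in wp2txt may differ.

```ruby
require "yaml"

# Fallback values, taken from the sample config in the README.
BUILTIN_DEFAULTS = { "format" => "text", "depth" => 0 }.freeze

# Hypothetical resolver: built-in defaults < config file < command line.
def resolve_options(config_yaml, cli_options)
  config = YAML.safe_load(config_yaml) || {}
  file_defaults = config.fetch("defaults", {}) || {}
  BUILTIN_DEFAULTS.merge(file_defaults).merge(cli_options)
end

config_yaml = <<~CONFIG
  cache:
    dump_expiry_days: 30
    category_expiry_days: 7
    directory: ~/.wp2txt/cache

  defaults:
    format: json
    depth: 0
CONFIG

# Only a depth override was given on the command line,
# so the format still comes from the config file.
puts resolve_options(config_yaml, { "depth" => 2 }).inspect
```

Merging in this order means a value is taken from the command line when present, then from the file, and only then from the built-in default.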
322
+
323
+ ## Performance
324
+
325
+ Benchmark results on MacBook Air M4 (7 parallel processes, turbo mode, excluding download time):
326
+
327
+ | Wikipedia | Dump Size | Articles | Processing Time | Output |
328
+ |-----------|-----------|----------|-----------------|--------|
329
+ | Japanese | 4.37 GB | 1,485,937 | ~27 min | 463 files (4.5 GB) |
330
+ | English | 24.2 GB | ~6.8M | ~2 hours | 2,000 files (20 GB) |
331
+
332
+ Turbo mode (default) splits bz2 into XML chunks first, then processes in parallel. Use `--no-turbo` to save disk space at the cost of slower processing.
200
333
 
201
334
  ## Caveats
202
335
 
203
- * Some data, such as mathematical formulas and computer source code, will not be converted correctly.
204
- * Some text data may not be extracted correctly for various reasons (incorrect matching of begin/end tags, language-specific formatting rules, etc.).
205
- * The conversion process can take longer than expected. When dealing with a huge data set such as the English Wikipedia on a low-spec environment, it can take several hours or more.
336
+ * Special content (math, code, etc.) is marked with placeholders by default.
337
+ * Some text may not be extracted correctly due to markup variations or language-specific formatting.
338
+
339
+ ## Changelog
340
+
341
+ See [CHANGELOG.md](CHANGELOG.md) for detailed release notes.
342
+
343
+ **v2.1.0 (February 2026)**: SQLite caching, Ractor parallelism (Ruby 4.0+), template expansion, content markers, Docker image update.
344
+
345
+ **v2.0.0 (January 2026)**: Auto-download mode, category-based extraction, article extraction by title, JSON output, streaming processing, Ruby 4.0 support.
206
346
 
207
347
  ## Useful Links
208
348
 
@@ -219,14 +359,14 @@ The author will appreciate your mentioning one of these in your research.
219
359
  * Yoichiro HASEBE. 2006. [Method for using Wikipedia as Japanese corpus.](http://ci.nii.ac.jp/naid/110006226727) _Doshisha Studies in Language and Culture_ 9(2), 373-403.
220
360
  * 長谷部陽一郎. 2006. [Wikipedia日本語版をコーパスとして用いた言語研究の手法](http://ci.nii.ac.jp/naid/110006226727). 『言語文化』9(2), 373-403.
221
361
 
222
- Or use this BibTeX entry:
362
+ BibTeX:
223
363
 
224
364
  ```
225
- @misc{wp2txt_2023,
365
+ @misc{wp2txt_2026,
226
366
  author = {Yoichiro Hasebe},
227
367
  title = {WP2TXT: A command-line toolkit to extract text content and category data from Wikipedia dump files},
228
368
  url = {https://github.com/yohasebe/wp2txt},
229
- year = {2023}
369
+ year = {2026}
230
370
  }
231
371
  ```
232
372