RubyGems - wp2txt - Versions diffs - 1.1.3 → 2.1.0 - Mend

wp2txt 1.1.3 → 2.1.0

Files changed (96)

checksums.yaml +4 -4
data/.dockerignore +12 -0
data/.github/workflows/ci.yml +13 -13
data/.gitignore +14 -0
data/CHANGELOG.md +284 -0
data/DEVELOPMENT.md +415 -0
data/DEVELOPMENT_ja.md +415 -0
data/Dockerfile +19 -10
data/Gemfile +2 -8
data/README.md +259 -123
data/README_ja.md +375 -0
data/Rakefile +4 -0
data/bin/wp2txt +863 -161
data/lib/wp2txt/article.rb +98 -13
data/lib/wp2txt/bz2_validator.rb +239 -0
data/lib/wp2txt/category_cache.rb +313 -0
data/lib/wp2txt/cli.rb +319 -0
data/lib/wp2txt/cli_ui.rb +428 -0
data/lib/wp2txt/config.rb +158 -0
data/lib/wp2txt/constants.rb +134 -0
data/lib/wp2txt/data/html_entities.json +2135 -0
data/lib/wp2txt/data/language_metadata.json +4769 -0
data/lib/wp2txt/data/language_tiers.json +59 -0
data/lib/wp2txt/data/mediawiki_aliases.json +12366 -0
data/lib/wp2txt/data/template_aliases.json +193 -0
data/lib/wp2txt/data/wikipedia_entities.json +12 -0
data/lib/wp2txt/extractor.rb +545 -0
data/lib/wp2txt/file_utils.rb +91 -0
data/lib/wp2txt/formatter.rb +352 -0
data/lib/wp2txt/global_data_cache.rb +353 -0
data/lib/wp2txt/index_cache.rb +258 -0
data/lib/wp2txt/magic_words.rb +353 -0
data/lib/wp2txt/memory_monitor.rb +236 -0
data/lib/wp2txt/multistream.rb +1383 -0
data/lib/wp2txt/output_writer.rb +182 -0
data/lib/wp2txt/parser_functions.rb +606 -0
data/lib/wp2txt/ractor_worker.rb +215 -0
data/lib/wp2txt/regex.rb +396 -12
data/lib/wp2txt/section_extractor.rb +354 -0
data/lib/wp2txt/stream_processor.rb +271 -0
data/lib/wp2txt/template_expander.rb +830 -0
data/lib/wp2txt/text_processing.rb +337 -0
data/lib/wp2txt/utils.rb +629 -270
data/lib/wp2txt/version.rb +1 -1
data/lib/wp2txt.rb +53 -26
data/scripts/benchmark_regex.rb +161 -0
data/scripts/fetch_html_entities.rb +94 -0
data/scripts/fetch_language_metadata.rb +180 -0
data/scripts/fetch_mediawiki_data.rb +334 -0
data/scripts/fetch_template_data.rb +186 -0
data/scripts/profile_memory.rb +139 -0
data/spec/article_spec.rb +402 -0
data/spec/auto_download_spec.rb +314 -0
data/spec/bz2_validator_spec.rb +193 -0
data/spec/category_cache_spec.rb +226 -0
data/spec/category_fetcher_spec.rb +504 -0
data/spec/cleanup_spec.rb +197 -0
data/spec/cli_options_spec.rb +678 -0
data/spec/cli_spec.rb +876 -0
data/spec/config_spec.rb +194 -0
data/spec/constants_spec.rb +138 -0
data/spec/file_utils_spec.rb +170 -0
data/spec/fixtures/samples.rb +181 -0
data/spec/formatter_sections_spec.rb +382 -0
data/spec/global_data_cache_spec.rb +186 -0
data/spec/index_cache_spec.rb +210 -0
data/spec/integration_spec.rb +543 -0
data/spec/magic_words_spec.rb +261 -0
data/spec/markers_spec.rb +476 -0
data/spec/memory_monitor_spec.rb +192 -0
data/spec/multistream_spec.rb +690 -0
data/spec/output_writer_spec.rb +400 -0
data/spec/parser_functions_spec.rb +455 -0
data/spec/ractor_worker_spec.rb +197 -0
data/spec/regex_spec.rb +281 -0
data/spec/section_extractor_spec.rb +397 -0
data/spec/spec_helper.rb +63 -0
data/spec/stream_processor_spec.rb +579 -0
data/spec/template_data_spec.rb +246 -0
data/spec/template_expander_spec.rb +472 -0
data/spec/template_processing_spec.rb +217 -0
data/spec/text_processing_spec.rb +312 -0
data/spec/utils_spec.rb +195 -16
data/spec/wp2txt_spec.rb +510 -0
data/wp2txt.gemspec +5 -3
metadata +146 -18
data/.rubocop.yml +0 -80
data/data/output_samples/testdata_en.txt +0 -23002
data/data/output_samples/testdata_en_category.txt +0 -132
data/data/output_samples/testdata_en_summary.txt +0 -1376
data/data/output_samples/testdata_ja.txt +0 -22774
data/data/output_samples/testdata_ja_category.txt +0 -206
data/data/output_samples/testdata_ja_summary.txt +0 -1560
data/data/testdata_en.bz2 +0 -0
data/data/testdata_ja.bz2 +0 -0
data/image/screenshot.png +0 -0

data/README.md CHANGED

@@ -2,211 +2,347 @@
 A command-line toolkit to extract text content and category data from Wikipedia dump files
-## About
+English | [日本語](README_ja.md)
-WP2TXT extracts text and category data from Wikipedia dump files (encoded in XML / compressed with Bzip2), removing MediaWiki markup and other metadata.
+## Quick Start
-## Changelog
+```bash
+# Install
+gem install wp2txt
-**May 2023**
+# Extract text from Japanese Wikipedia (auto-download)
+wp2txt --lang=ja -o ./output
-- Problems caused by too many parallel processors are addressed by setting the upper limit on the number of processors to 8.
+# Extract specific articles
+wp2txt --lang=ja --articles="東京,京都" -o ./articles
-**April 2023**
+# Extract articles from a category
+wp2txt --lang=ja --from-category="日本の都市" -o ./cities
+```
-- File split/delete issues fixed
+## About
-**January 2023**
+WP2TXT extracts plain text and category information from Wikipedia dump files. It processes XML dumps (compressed with bzip2), removes MediaWiki markup, and outputs clean text suitable for corpus linguistics, text mining, and other research purposes.
-- Bug related to command line arguments fixed
-- Code cleanup introducing Rubocop
+## Key Features
-**December 2022**
+- **Auto-download** - Automatically download dumps by language code
+- **Article extraction by title** - Extract specific articles without downloading full dumps
+- **Category-based extraction** - Extract all articles from a specific Wikipedia category
+- **Category metadata extraction** - Preserves article category information in output
+- **Template expansion** - Expands common templates (dates, units, coordinates) to readable text
+- **Multilingual support** - Category and redirect detection for 350+ Wikipedia languages
+- **Streaming processing** - Process large dumps without intermediate files
+- **JSON output** - Machine-readable JSONL format for data pipelines
-- Docker images available via Docker Hub
+## Use Cases
-**November 2022**
+wp2txt is particularly suited for:
-- Code added to suppress "Invalid byte sequence error" when an ilegal UTF-8 character is input.
+- Building domain-specific corpora using category information
+- Comparative linguistic research across topic areas
+- Extracting Wikipedia text with metadata for NLP tasks
+- Cross-linguistic studies using parallel category structures
-**August 2022**
+## Data Access
-- A new option `--category-only` has been added. When this option is enabled, only the title and category information of the article is extracted.
-- A new option `--summary-only` has been added. If this option is enabled, only the title, category information, and opening paragraphs of the article will be extracted.
-- Text conversion with the current version of WP2TXT is *more than 2x times faster* than the previous version due to parallel processing of multiple files (the rate of speedup depends on the CPU cores used for processing).
+wp2txt uses [official Wikipedia dump files](https://meta.wikimedia.org/wiki/Data_dumps), the recommended method for bulk data access. This approach respects Wikimedia's infrastructure guidelines.
-## Screenshot
+## Installation
-<img src='https://raw.githubusercontent.com/yohasebe/wp2txt/master/image/screenshot.png' width="800" />
+### Install wp2txt
-**Environment**
+    $ gem install wp2txt
-- WP2TXT 1.0.1
-- MacBook Pro (2021 Apple M1 Pro)
-- enwiki-20220720-pages-articles.xml.bz2 (19.98 GB)
+### System Requirements
-In the above environment, the process (decompression, splitting, extraction, and conversion) to obtain the plain text data of the English Wikipedia takes less than 1.5 hours.
+WP2TXT requires one of the following commands to decompress `bz2` files:
-## Features
+- `lbzip2` (recommended - uses multiple CPU cores)
+- `pbzip2`
+- `bzip2` (pre-installed on most systems)
-- Converts Wikipedia dump files in various languages
-- Creates output files of specified size
-- Allows specifying ext elements (page titles, section headers, paragraphs, list items) to be extracted
-- Allows extracting category information of the article
-- Allows extracting opening paragraphs of the article
+On macOS with Homebrew:
-## Setting Up
+    $ brew install lbzip2
-### WP2TXT on Docker
+On Windows: Install [Bzip2 for Windows](http://gnuwin32.sourceforge.net/packages/bzip2.htm) and add to PATH.
-1. Install [Docker Desktop](https://www.docker.com/products/docker-desktop/) (Mac/Windows/Linux)
-2. Execute `docker` command in a terminal:
+### Docker (Alternative)
 ```shell
-docker run -it -v /Users/me/localdata:/data yohasebe/wp2txt
+docker run -it -v /path/to/localdata:/data yohasebe/wp2txt
 ```
-- Make sure to Replace `/Users/me/localdata` with the full path to the data directory in your local computer
+The `wp2txt` command is available inside the container. Use `/data` for input/output files.
-3. The Docker image will begin downloading and a bash prompt will appear when finished.
-4. The `wp2txt` command will be avalable anywhare in the Docker container. Use the `/data` directory as the location of the input dump files and the output text files.
+## Basic Usage
-**IMPORTANT:**
+### Auto-download and process (Recommended)
-- Configure Docker Desktop resource settings (number of cores, amount of memory, etc.) to get the best performance possible.
-- When running the `wp2txt` command inside a Docker container, be sure to set the output directory to somewhere in the mounted local directory specified by the `docker run` command.
+    $ wp2txt --lang=ja -o ./text
-### WP2TXT on MacOS and Linux
+This automatically downloads the Japanese Wikipedia dump and extracts plain text. Downloads are cached in `~/.wp2txt/cache/`.
-WP2TXT requires that one of the following commands be installed on the system in order to decompress `bz2` files:
+### Extract specific articles by title
-- `lbzip2` (recommended)
-- `pbzip2`
-- `bzip2`
+    $ wp2txt --lang=ja --articles="認知言語学,生成文法" -o ./articles
-In most cases, the `bzip2` command is pre-installed on the system. However, since `lbzip2` can use multiple CPU cores and is faster than `bzip2`, it is recommended that you install it additionally. WP2TXT will attempt to find the decompression command available on your system in the order listed above.
+Only the index file and necessary data streams are downloaded, making it much faster than processing the full dump.
-If you are using MacOS with Homebrew installed, you can install `lbzip2` with the following command:
+### Extract articles from a category
-    $ brew install lbzip2
+    $ wp2txt --lang=ja --from-category="日本の都市" -o ./cities
-### WP2TXT on Windows
+Include subcategories with `--depth`:
-Install [Bzip2 for Windows](http://gnuwin32.sourceforge.net/packages/bzip2.htm) and set the path so that WP2TXT can use the bunzip2.exe command. Alternatively, you can extract the Wikipedia dump file in your own way and process the resulting XML file with WP2TXT.
+    $ wp2txt --lang=ja --from-category="日本の都市" --depth=2 -o ./cities
-## Installation
+Preview without downloading (shows article counts):
-### WP2TXT command
+    $ wp2txt --lang=ja --from-category="日本の都市" --dry-run
-    $ gem install wp2txt
+### Process local dump file
-## Wikipedia Dump File
+    $ wp2txt -i ./enwiki-20220801-pages-articles.xml.bz2 -o ./text
-Download the latest Wikipedia dump file for the desired language at a URL such as
+### Other extraction modes
-    https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
+    # Category info only (title + categories)
+    $ wp2txt -g --lang=ja -o ./category
-Here, `enwiki` refers to the English Wikipedia. To get the Japanese Wikipedia dump file, for instance, change this to `jawiki` (Japanese). In doing so, note that there are two instances of `enwiki` in the URL above.
+    # Summary only (title + categories + opening paragraphs)
+    $ wp2txt -s --lang=ja -o ./summary
-Alternatively, you can also select Wikipedia dump files created on a specific date from [here](http://dumps.wikimedia.org/backup-index.html). Make sure to download a file named in the following format:
+    # Metadata only (title + section headings + categories)
+    $ wp2txt -M --lang=ja --format json -o ./metadata
-    xxwiki-yyyymmdd-pages-articles.xml.bz2
+    # Extract specific sections (comma-separated, 'summary' for lead text)
+    $ wp2txt --lang=en --sections="summary,Plot,Reception" --format json -o ./sections
-where `xx` is language code such as `en` (English)" or `ja` (japanese), and  `yyyymmdd` is the date of creation (e.g. `20220801`).
+    # Section heading statistics
+    $ wp2txt --lang=ja --section-stats -o ./stats
-## Basic Usage
+    # JSON/JSONL output
+    $ wp2txt --format json --lang=ja -o ./json
-Suppose you have a folder with a wikipedia dump file and empty subfolders organized as follows:
+## Sample Output
+### Text Output
 ```
-.
-├── enwiki-20220801-pages-articles.xml.bz2
-├── /xml
-├── /text
-├── /category
-└── /summary
+[[Article Title]]
+Article content goes here with sections and paragraphs...
+CATEGORIES: Category1, Category2, Category3
 ```
-### Decompress and Split
+### JSON/JSONL Output
-The following command will decompress the entire wikipedia data and split it into many small (approximately 10 MB) XML files.
+Each line contains one JSON object:
-    $ wp2txt --no-convert -i ./enwiki-20220801-pages-articles.xml.bz2 -o ./xml
+```json
+{"title": "Article Title", "categories": ["Cat1", "Cat2"], "text": "...", "redirect": null}
+```
-**Note**: The resulting files are not well-formed XML. They contain part of the orignal XML extracted from the Wikipedia dump file, taking care to ensure that the content within the <page> tag is not split into multiple files.
+For redirect articles:
-### Extract plain text from MediaWiki XML
+```json
+{"title": "NYC", "categories": [], "text": "", "redirect": "New York City"}
+```
-    $ wp2txt -i ./xml -o ./text
+## Cache Management
+    $ wp2txt --cache-status           # Show cache status
+    $ wp2txt --cache-clear            # Clear all cache
+    $ wp2txt --cache-clear --lang=ja  # Clear cache for Japanese only
+    $ wp2txt --update-cache           # Force fresh download
-### Extract only category info from MediaWiki XML
+When cache exceeds the expiry period (default: 30 days), wp2txt displays a warning but allows using cached data.
-    $ wp2txt -g -i ./xml -o ./category
+## Advanced Options
-### Extract opening paragraphs from MediaWiki XML
+### Content Type Markers
-    $ wp2txt -s -i ./xml -o ./summary
+Special content is replaced with marker placeholders by default:
-### Extract directly from bz2 compressed file
+**Inline markers** (appear within sentences):
-It is possible (though not recommended) to 1) decompress the dump files, 2) split the data into files, and 3) extract the text just one line of command. You can automatically remove all the intermediate XML files with `-x` option.
+| Marker | Content Type |
+|--------|--------------|
+| `[MATH]` | Mathematical formulas |
+| `[CODE]` | Inline code |
+| `[CHEM]` | Chemical formulas |
+| `[IPA]` | IPA phonetic notation |
-    $ wp2txt -i ./enwiki-20220801-pages-articles.xml.bz2 -o ./text -x
+**Block markers** (standalone content):
-## Sample Output
+| Marker | Content Type |
+|--------|--------------|
+| `[CODEBLOCK]` | Source code blocks |
+| `[TABLE]` | Wiki tables |
+| `[INFOBOX]` | Information boxes |
+| `[NAVBOX]` | Navigation boxes |
+| `[GALLERY]` | Image galleries |
+| `[REFERENCES]` | Reference lists |
+| `[SCORE]` | Musical scores |
+| `[TIMELINE]` | Timeline graphics |
+| `[GRAPH]` | Graphs/charts |
+| `[SIDEBAR]` | Sidebar templates |
+| `[MAPFRAME]` | Interactive maps |
+| `[IMAGEMAP]` | Clickable image maps |
-Output contains title, category info, paragraphs
+Configure with `--markers`:
-    $ wp2txt -i ./input -o /output
+    $ wp2txt --lang=en --markers=all -o ./text        # All markers (default)
+    $ wp2txt --lang=en --markers=math,code -o ./text  # Only MATH and CODE
-- [English Wikipedia](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en.txt)
-- [Japanese Wikipedia](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja.txt)
+**Note**: `--markers=none` is deprecated as removing special content can make surrounding text nonsensical.
-Output containing title and category only
+### Template Expansion
-    $ wp2txt -g -i ./input -o /output
+Common MediaWiki templates are automatically expanded (enabled by default):
-- [English Wikipedia](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en_category.txt)
-- [Japanese Wikipedia](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja_category.txt)
+| Template | Output |
+|----------|--------|
+| `{{birth date\|1990\|5\|15}}` | May 15, 1990 |
+| `{{convert\|100\|km\|mi}}` | 100 km (62 mi) |
+| `{{coord\|35\|41\|N\|139\|41\|E}}` | 35°41′N 139°41′E |
+| `{{lang\|ja\|日本語}}` | 日本語 |
+| `{{nihongo\|Tokyo\|東京\|Tōkyō}}` | Tokyo (東京, Tōkyō) |
+| `{{frac\|1\|2}}` | 1/2 |
+| `{{circa\|1900}}` | c. 1900 |
-Output containing title, category, and summary
+Supported: date/age templates, unit conversion, coordinates, language tags, quotes, fractions, and more. Parser functions (`{{#if:}}`, `{{#switch:}}`) and magic words (`{{PAGENAME}}`, `{{CURRENTYEAR}}`) are also supported.
-    $ wp2txt -s -i ./input -o /output
+Disable with `--no-expand-templates`.
-- [English Wikipedia](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en_summary.txt)
-- [Japanese Wikipedia](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja_summary.txt)
+### Citation Extraction
-## Command Line Options
+By default, citation templates are removed. Use `--extract-citations` to extract formatted citations:
-Command line options are as follows:
+    $ wp2txt --lang=en --extract-citations -o ./text
+Supported: `{{cite book}}`, `{{cite web}}`, `{{cite news}}`, `{{cite journal}}`, `{{Citation}}`, etc.
+## Command Line Options
     Usage: wp2txt [options]
-    where [options] are:
-      -i, --input                      Path to compressed file (bz2) or decompressed file (xml), or path to directory containing files of the latter format
-      -o, --output-dir=<s>             Path to output directory
-      -c, --convert, --no-convert      Output in plain text (converting from XML) (default: true)
-      -a, --category, --no-category    Show article category information (default: true)
-      -g, --category-only              Extract only article title and categories
-      -s, --summary-only               Extract only article title, categories, and summary text before first heading
-      -f, --file-size=<i>              Approximate size (in MB) of each output file (default: 10)
-      -n, --num-procs                  Number of proccesses (up to 8) to be run concurrently (default: max num of available CPU cores minus two)
-      -x, --del-interfile              Delete intermediate XML files from output dir
-      -t, --title, --no-title          Keep page titles in output (default: true)
-      -d, --heading, --no-heading      Keep section titles in output (default: true)
-      -l, --list                       Keep unprocessed list items in output
-      -r, --ref                        Keep reference notations in the format [ref]...[/ref]
-      -e, --redirect                   Show redirect destination
-      -m, --marker, --no-marker        Show symbols prefixed to list items, definitions, etc. (Default: true)
-      -b, --bz2-gem                    Use Ruby's bzip2-ruby gem instead of a system command
-      -v, --version                    Print version and exit
-      -h, --help                       Show this message
+    Input source (one of --input or --lang required):
+      -i, --input=<s>                  Path to compressed file (bz2) or XML file
+      -L, --lang=<s>                   Wikipedia language code (e.g., ja, en, de)
+      -A, --articles=<s>               Specific article titles (comma-separated)
+      -G, --from-category=<s>          Extract articles from Wikipedia category
+      -D, --depth=<i>                  Subcategory recursion depth (default: 0)
+      -y, --yes                        Skip confirmation prompt
+      --dry-run                        Preview category extraction
+      -U, --update-cache               Force refresh of cached files
+    Output options:
+      -o, --output-dir=<s>             Output directory (default: current)
+      -j, --format=<s>                 Output format: text or json (default: text)
+      -f, --file-size=<i>              Output file size in MB (default: 10, 0=single)
+    Cache management:
+      --cache-dir=<s>                  Cache directory (default: ~/.wp2txt/cache)
+      --cache-status                   Show cache status and exit
+      --cache-clear                    Clear cache and exit
+    Configuration:
+      --config-init                    Create default config (~/.wp2txt/config.yml)
+      --config-path=<s>                Path to configuration file
+    Extraction modes (mutually exclusive):
+      -g, --category-only              Extract only title and categories
+      -s, --summary-only               Extract title, categories, and summary
+      -M, --metadata-only              Extract only title, headings, and categories
+    Section extraction:
+      -S, --sections=<s>               Extract specific sections (comma-separated)
+      --section-output=<s>             Output mode: structured or combined (default: structured)
+      --min-section-length=<i>         Minimum section length in characters (default: 0)
+      --skip-empty                     Skip articles with no matching sections
+      --alias-file=<s>                 Custom section alias definitions file (YAML)
+      --no-section-aliases             Disable section alias matching (exact match only)
+      --section-stats                  Collect and output section heading statistics (JSON)
+      --show-matched-sections          Include matched_sections field in JSON output
+    Content filtering:
+      -a, --category, --no-category    Show category info (default: true)
+      -t, --title, --no-title          Keep page titles (default: true)
+      -d, --heading, --no-heading      Keep section titles (default: true)
+      -l, --list                       Keep list items (default: false)
+      --table                          Keep wiki table content (default: false)
+      -p, --pre                        Keep preformatted text blocks (default: false)
+      -r, --ref                        Keep references as [ref]...[/ref] (default: false)
+      --multiline                      Keep multi-line templates (default: false)
+      -e, --redirect                   Show redirect destination (default: false)
+      -m, --marker, --no-marker        Show list markers (default: true)
+      -k, --markers=<s>                Content markers (default: all)
+      -C, --extract-citations          Extract formatted citations
+      -E, --expand-templates           Expand templates (default: true)
+          --no-expand-templates        Disable template expansion
+    Performance:
+      -n, --num-procs=<i>              Parallel processes (default: auto)
+      --no-turbo                       Disable turbo mode (saves disk space, slower)
+      -R, --ractor                     Use Ractor parallelism (Ruby 4.0+, streaming only)
+      -b, --bz2-gem                    Use bzip2-ruby gem instead of system command
+    Output control:
+      -q, --quiet                      Suppress progress output (errors only)
+      --no-color                       Disable colored output
+    Info:
+      -v, --version                    Print version
+      -h, --help                       Show help
+## Configuration File
+Create persistent settings with:
+    $ wp2txt --config-init
+This creates `~/.wp2txt/config.yml`:
+```yaml
+cache:
+  dump_expiry_days: 30      # Days before dumps are stale (1-365)
+  category_expiry_days: 7   # Category cache expiry (1-90)
+  directory: ~/.wp2txt/cache
+defaults:
+  format: text              # Default output format
+  depth: 0                  # Default subcategory depth
+```
+Command-line options override configuration file settings.
+## Performance
+Benchmark results on MacBook Air M4 (7 parallel processes, turbo mode, excluding download time):
+| Wikipedia | Dump Size | Articles | Processing Time | Output |
+|-----------|-----------|----------|-----------------|--------|
+| Japanese  | 4.37 GB   | 1,485,937 | ~27 min        | 463 files (4.5 GB) |
+| English   | 24.2 GB   | ~6.8M    | ~2 hours        | 2,000 files (20 GB) |
+Turbo mode (default) splits bz2 into XML chunks first, then processes in parallel. Use `--no-turbo` to save disk space at the cost of slower processing.
 ## Caveats
-* Some data, such as mathematical formulas and computer source code, will not be converted correctly.
-* Some text data may not be extracted correctly for various reasons (incorrect matching of begin/end tags, language-specific formatting rules, etc.).
-* The conversion process can take longer than expected. When dealing with a huge data set such as the English Wikipedia on a low-spec environment, it can take several hours or more.
+* Special content (math, code, etc.) is marked with placeholders by default.
+* Some text may not be extracted correctly due to markup variations or language-specific formatting.
+## Changelog
+See [CHANGELOG.md](CHANGELOG.md) for detailed release notes.
+**v2.1.0 (February 2026)**: SQLite caching, Ractor parallelism (Ruby 4.0+), template expansion, content markers, Docker image update.
+**v2.0.0 (January 2026)**: Auto-download mode, category-based extraction, article extraction by title, JSON output, streaming processing, Ruby 4.0 support.
 ## Useful Links
@@ -223,14 +359,14 @@ The author will appreciate your mentioning one of these in your research.
 * Yoichiro HASEBE. 2006. [Method for using Wikipedia as Japanese corpus.](http://ci.nii.ac.jp/naid/110006226727) _Doshisha Studies in Language and Culture_ 9(2), 373-403.
 * 長谷部陽一郎. 2006. [Wikipedia日本語版をコーパスとして用いた言語研究の手法](http://ci.nii.ac.jp/naid/110006226727). 『言語文化』9(2), 373-403.
-Or use this BibTeX entry:
+BibTeX:
 ```
-@misc{wp2txt_2023,
+@misc{wp2txt_2026,
   author = {Yoichiro Hasebe},
   title = {WP2TXT: A command-line toolkit to extract text content and category data from Wikipedia dump files},
   url = {https://github.com/yohasebe/wp2txt},
-  year = {2023}
+  year = {2026}
 }
 ```
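
The JSON/JSONL section added to the README in this diff describes one JSON object per line with the fields `title`, `categories`, `text`, and `redirect`. As a rough illustration of how such output might be consumed downstream, here is a minimal Ruby sketch. The helper names `each_article` and `partition_redirects` are hypothetical and not part of wp2txt; the sample lines are copied from the README, and only Ruby's standard `json` library is used.

```ruby
require "json"

# Hypothetical helper (not part of wp2txt): parse JSONL lines into hashes,
# skipping blank lines. Accepts any object that responds to #each and
# yields strings (an IO, an array of lines, String#each_line, ...).
def each_article(lines)
  return enum_for(:each_article, lines) unless block_given?

  lines.each do |line|
    line = line.strip
    next if line.empty?

    yield JSON.parse(line)
  end
end

# Hypothetical helper: split records into [regular articles, redirects],
# treating a null "redirect" field as a regular article.
def partition_redirects(records)
  records.partition { |rec| rec["redirect"].nil? }
end

# Sample lines copied from the README's JSON/JSONL Output section.
SAMPLE = <<~JSONL
  {"title": "Article Title", "categories": ["Cat1", "Cat2"], "text": "...", "redirect": null}
  {"title": "NYC", "categories": [], "text": "", "redirect": "New York City"}
JSONL

articles, redirects = partition_redirects(each_article(SAMPLE.each_line))
articles.each { |a| puts "#{a['title']}: #{a['categories'].join(', ')}" }
redirects.each { |r| puts "#{r['title']} -> #{r['redirect']}" }
```

To read a real output file instead of the inline sample, `File.foreach(path)` can be passed in place of `SAMPLE.each_line`. The names of the files that wp2txt writes are not shown in this diff, so `path` is left to the reader.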