RubyGems - wp2txt - Versions diffs - 1.1.3 → 2.1.0 - Mend

wp2txt 1.1.3 → 2.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (96)

checksums.yaml +4 -4
data/.dockerignore +12 -0
data/.github/workflows/ci.yml +13 -13
data/.gitignore +14 -0
data/CHANGELOG.md +284 -0
data/DEVELOPMENT.md +415 -0
data/DEVELOPMENT_ja.md +415 -0
data/Dockerfile +19 -10
data/Gemfile +2 -8
data/README.md +259 -123
data/README_ja.md +375 -0
data/Rakefile +4 -0
data/bin/wp2txt +863 -161
data/lib/wp2txt/article.rb +98 -13
data/lib/wp2txt/bz2_validator.rb +239 -0
data/lib/wp2txt/category_cache.rb +313 -0
data/lib/wp2txt/cli.rb +319 -0
data/lib/wp2txt/cli_ui.rb +428 -0
data/lib/wp2txt/config.rb +158 -0
data/lib/wp2txt/constants.rb +134 -0
data/lib/wp2txt/data/html_entities.json +2135 -0
data/lib/wp2txt/data/language_metadata.json +4769 -0
data/lib/wp2txt/data/language_tiers.json +59 -0
data/lib/wp2txt/data/mediawiki_aliases.json +12366 -0
data/lib/wp2txt/data/template_aliases.json +193 -0
data/lib/wp2txt/data/wikipedia_entities.json +12 -0
data/lib/wp2txt/extractor.rb +545 -0
data/lib/wp2txt/file_utils.rb +91 -0
data/lib/wp2txt/formatter.rb +352 -0
data/lib/wp2txt/global_data_cache.rb +353 -0
data/lib/wp2txt/index_cache.rb +258 -0
data/lib/wp2txt/magic_words.rb +353 -0
data/lib/wp2txt/memory_monitor.rb +236 -0
data/lib/wp2txt/multistream.rb +1383 -0
data/lib/wp2txt/output_writer.rb +182 -0
data/lib/wp2txt/parser_functions.rb +606 -0
data/lib/wp2txt/ractor_worker.rb +215 -0
data/lib/wp2txt/regex.rb +396 -12
data/lib/wp2txt/section_extractor.rb +354 -0
data/lib/wp2txt/stream_processor.rb +271 -0
data/lib/wp2txt/template_expander.rb +830 -0
data/lib/wp2txt/text_processing.rb +337 -0
data/lib/wp2txt/utils.rb +629 -270
data/lib/wp2txt/version.rb +1 -1
data/lib/wp2txt.rb +53 -26
data/scripts/benchmark_regex.rb +161 -0
data/scripts/fetch_html_entities.rb +94 -0
data/scripts/fetch_language_metadata.rb +180 -0
data/scripts/fetch_mediawiki_data.rb +334 -0
data/scripts/fetch_template_data.rb +186 -0
data/scripts/profile_memory.rb +139 -0
data/spec/article_spec.rb +402 -0
data/spec/auto_download_spec.rb +314 -0
data/spec/bz2_validator_spec.rb +193 -0
data/spec/category_cache_spec.rb +226 -0
data/spec/category_fetcher_spec.rb +504 -0
data/spec/cleanup_spec.rb +197 -0
data/spec/cli_options_spec.rb +678 -0
data/spec/cli_spec.rb +876 -0
data/spec/config_spec.rb +194 -0
data/spec/constants_spec.rb +138 -0
data/spec/file_utils_spec.rb +170 -0
data/spec/fixtures/samples.rb +181 -0
data/spec/formatter_sections_spec.rb +382 -0
data/spec/global_data_cache_spec.rb +186 -0
data/spec/index_cache_spec.rb +210 -0
data/spec/integration_spec.rb +543 -0
data/spec/magic_words_spec.rb +261 -0
data/spec/markers_spec.rb +476 -0
data/spec/memory_monitor_spec.rb +192 -0
data/spec/multistream_spec.rb +690 -0
data/spec/output_writer_spec.rb +400 -0
data/spec/parser_functions_spec.rb +455 -0
data/spec/ractor_worker_spec.rb +197 -0
data/spec/regex_spec.rb +281 -0
data/spec/section_extractor_spec.rb +397 -0
data/spec/spec_helper.rb +63 -0
data/spec/stream_processor_spec.rb +579 -0
data/spec/template_data_spec.rb +246 -0
data/spec/template_expander_spec.rb +472 -0
data/spec/template_processing_spec.rb +217 -0
data/spec/text_processing_spec.rb +312 -0
data/spec/utils_spec.rb +195 -16
data/spec/wp2txt_spec.rb +510 -0
data/wp2txt.gemspec +5 -3
metadata +146 -18
data/.rubocop.yml +0 -80
data/data/output_samples/testdata_en.txt +0 -23002
data/data/output_samples/testdata_en_category.txt +0 -132
data/data/output_samples/testdata_en_summary.txt +0 -1376
data/data/output_samples/testdata_ja.txt +0 -22774
data/data/output_samples/testdata_ja_category.txt +0 -206
data/data/output_samples/testdata_ja_summary.txt +0 -1560
data/data/testdata_en.bz2 +0 -0
data/data/testdata_ja.bz2 +0 -0
data/image/screenshot.png +0 -0

data/DEVELOPMENT.md ADDED

@@ -0,0 +1,415 @@
+# WP2TXT Development Guide
+This document provides guidance for developers working on WP2TXT. For user documentation, see [README.md](README.md).
+English | [日本語](DEVELOPMENT_ja.md)
+## Quick Start
+```bash
+# Install dependencies
+bundle install
+# Run tests
+bundle exec rspec
+# Run tests with coverage
+bundle exec rspec  # Coverage report at coverage/index.html
+```
+## Architecture Overview
+### Processing Pipeline
+WP2TXT uses a streaming architecture to process Wikipedia dumps:
+```
+Input (bz2/xml) → StreamProcessor → Article Parser → OutputWriter → Output files
+```
+1. **StreamProcessor** (`lib/wp2txt/stream_processor.rb`): Decompresses bz2 and streams XML pages
+2. **Article** (`lib/wp2txt/article.rb`): Parses MediaWiki text into typed elements
+3. **Utils** (`lib/wp2txt/utils.rb`): Provides text formatting and cleanup functions
+4. **OutputWriter** (`lib/wp2txt/output_writer.rb`): Writes output in text or JSON format
+### Core Classes
+| Class | File | Purpose |
+|-------|------|---------|
+| `StreamProcessor` | `lib/wp2txt/stream_processor.rb` | Streams pages from compressed dumps with adaptive buffering |
+| `Article` | `lib/wp2txt/article.rb` | Parses MediaWiki markup |
+| `OutputWriter` | `lib/wp2txt/output_writer.rb` | Manages output file rotation |
+| `DumpManager` | `lib/wp2txt/multistream.rb` | Downloads and caches dumps |
+| `MultistreamIndex` | `lib/wp2txt/multistream.rb` | Indexes articles for random access |
+| `MultistreamReader` | `lib/wp2txt/multistream.rb` | Extracts articles (supports parallel extraction) |
+| `CategoryFetcher` | `lib/wp2txt/multistream.rb` | Fetches category members from Wikipedia API |
+| `MemoryMonitor` | `lib/wp2txt/memory_monitor.rb` | Cross-platform memory monitoring |
+| `Bz2Validator` | `lib/wp2txt/bz2_validator.rb` | Validates bz2 file integrity |
+| `CLI` | `lib/wp2txt/cli.rb` | Command-line option parsing |
+### Cache Classes
+| Class | File | Purpose |
+|-------|------|---------|
+| `GlobalDataCache` | `lib/wp2txt/global_data_cache.rb` | SQLite cache for parsed JSON data files |
+| `CategoryCache` | `lib/wp2txt/category_cache.rb` | SQLite cache for Wikipedia category hierarchy |
+| `IndexCache` | `lib/wp2txt/index_cache.rb` | SQLite cache for multistream index entries |
+### Element Types
+The `Article` class parses MediaWiki text into typed elements:
+| Type | Description |
+|------|-------------|
+| `:mw_heading` | Section headings (`== Title ==`) |
+| `:mw_paragraph` | Regular text paragraphs |
+| `:mw_table` | Wiki tables (`{\| ... \|}`) |
+| `:mw_quote` | Block quotes |
+| `:mw_pre` | Preformatted text |
+| `:mw_unordered` | Unordered list items |
+| `:mw_ordered` | Ordered list items |
+| `:mw_definition` | Definition list items |
+| `:mw_link` | Single-line links |
+| `:mw_ml_link` | Multi-line links |
+| `:mw_redirect` | Redirect pages |
+| `:mw_template` | Templates |
+| `:mw_isolated_tag` | HTML tags |
+### Marker System
+Content type markers replace special content (math, code, etc.) with placeholders:
+```ruby
+# In utils.rb
+MARKER_TYPES = %i[math code chem table score timeline graph ipa].freeze
+# Processing flow:
+# 1. Content detected → Replace with placeholder («« MATH »»)
+# 2. Text processing continues (placeholders protected from cleanup)
+# 3. finalize_markers() converts placeholders to [MARKER] format
+```
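The three-step flow in the comments above can be sketched as follows. This is a minimal illustration of the placeholder technique, not wp2txt's actual code: `protect_math`, `strip_tags`, and `finalize_placeholders` are hypothetical names, and only the `math` marker is handled.

```ruby
# Hypothetical sketch of the placeholder technique; not wp2txt's implementation.
MATH_TAG = %r{<math[^>]*>.*?</math>}m
PLACEHOLDER = /«« ([A-Z]+) »»/

# Step 1: swap protected content for a placeholder.
def protect_math(text)
  text.gsub(MATH_TAG, "«« MATH »»")
end

# Step 2: stand-in for the cleanup passes; strips leftover HTML-like tags.
def strip_tags(text)
  text.gsub(/<[^>]+>/, "")
end

# Step 3: convert placeholders to the final [MARKER] form.
def finalize_placeholders(text)
  text.gsub(PLACEHOLDER, '[\1]')
end

source = "Energy is <math>E = mc^2</math> in <b>physics</b>."
result = finalize_placeholders(strip_tags(protect_math(source)))
puts result
```

Because the math content is replaced before the tag-stripping pass runs, the cleanup step cannot damage it, and the placeholder survives until the final conversion.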
+### Template Expansion
+The `TemplateExpander` class (`lib/wp2txt/template_expander.rb`) expands common Wikipedia templates to readable text:
+| Template Type | Example | Output |
+|---------------|---------|--------|
+| Birth/death dates | `{{birth date\|1990\|5\|15}}` | "May 15, 1990" |
+| Unit conversion | `{{convert\|100\|km\|mi}}` | "100 km (62 mi)" |
+| Coordinates | `{{coord\|35\|41\|N\|139\|41\|E}}` | "35°41′N 139°41′E" |
+| Language tags | `{{lang\|ja\|日本語}}` | "日本語" |
+| Nihongo | `{{nihongo\|Tokyo\|東京\|Tōkyō}}` | "Tokyo (東京, Tōkyō)" |
+Template expansion is enabled by default. Disable with `--no-expand-templates` or `expand_templates: false`.
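As a rough illustration of what expansion involves, the sketch below handles two of the template types from the table. It assumes a hypothetical `expand_template` helper that receives one complete `{{...}}` call; the real `TemplateExpander` is likely to differ in structure and coverage.

```ruby
require "date"

# Hypothetical mini-expander; names and behavior are assumptions for illustration.
KM_TO_MI = 0.621371

def expand_template(markup)
  inner = markup[/\A\{\{(.*)\}\}\z/m, 1]
  return markup unless inner

  name, *args = inner.split("|").map(&:strip)
  return markup if name.nil?

  case name.downcase
  when "birth date", "death date"
    year, month, day = args.first(3).map(&:to_i)
    "#{Date::MONTHNAMES[month]} #{day}, #{year}"
  when "convert"
    value, from, to = args
    return markup unless from == "km" && to == "mi" # only one unit pair in this sketch

    "#{value} km (#{(value.to_f * KM_TO_MI).round} mi)"
  else
    markup # unknown templates pass through unchanged
  end
end

puts expand_template("{{birth date|1990|5|15}}")
puts expand_template("{{convert|100|km|mi}}")
```

Unknown templates are returned unchanged here, which keeps the sketch safe to run over arbitrary input.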
+### Magic Word Expansion
+The `MagicWordExpander` class (`lib/wp2txt/magic_words.rb`) expands MediaWiki magic words to their actual values:
+| Category | Magic Words | Example |
+|----------|-------------|---------|
+| Page context | `PAGENAME`, `FULLPAGENAME`, `BASEPAGENAME`, `ROOTPAGENAME`, `SUBPAGENAME`, `NAMESPACE`, `TALKPAGENAME` | `{{PAGENAME}}` → "Article Title" |
+| Date/time | `CURRENTYEAR`, `CURRENTMONTH`, `CURRENTDAY`, `CURRENTDAYNAME`, `CURRENTTIME`, `CURRENTTIMESTAMP` | `{{CURRENTYEAR}}` → "2024" |
+| String functions | `lc`, `uc`, `lcfirst`, `ucfirst`, `urlencode`, `anchorencode`, `padleft`, `padright` | `{{uc:hello}}` → "HELLO" |
+| Parser functions | `#titleparts` | `{{#titleparts:A/B/C\|2}}` → "A/B" |
+Magic words are expanded early in the `format_wiki()` pipeline when a title is provided in the config:
+```ruby
+result = format_wiki(text, title: "Article Name", dump_date: Time.now)
+```
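The sketch below shows one way such an expansion step could work for a few of the magic words listed above. `expand_magic_words` is a hypothetical function written for this example; it only assumes that a title and a dump date are available, as in the `format_wiki` call.

```ruby
# Hypothetical sketch; the real MagicWordExpander may work differently.
def expand_magic_words(text, title:, dump_date: Time.now)
  variables = {
    "PAGENAME" => title,
    "CURRENTYEAR" => dump_date.year.to_s
  }

  # Variable-style magic words: {{PAGENAME}}, {{CURRENTYEAR}}
  expanded = text.gsub(/\{\{(PAGENAME|CURRENTYEAR)\}\}/) { variables[Regexp.last_match(1)] }

  # Function-style magic words: {{uc:...}}, {{lc:...}}
  expanded.gsub(/\{\{(uc|lc):([^{}|]*)\}\}/) do
    function = Regexp.last_match(1)
    argument = Regexp.last_match(2)
    function == "uc" ? argument.upcase : argument.downcase
  end
end

sample = "{{PAGENAME}} ({{CURRENTYEAR}}): {{uc:hello}}"
puts expand_magic_words(sample, title: "Tokyo", dump_date: Time.new(2024, 1, 1))
```

Taking the year from the dump date rather than from the wall clock keeps output reproducible for a given dump.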
+## Caching Infrastructure
+WP2TXT uses SQLite-based caching to improve performance for repeated operations. All caches are stored in `~/.wp2txt/cache/`.
+### GlobalDataCache
+Caches parsed JSON data files (templates, MediaWiki aliases, HTML entities) to eliminate parsing overhead:
+```ruby
+# Automatic - data loading methods use cache transparently
+data = Wp2txt.load_mediawiki_data  # Uses cache if valid
+# Manual cache operations
+Wp2txt::GlobalDataCache.clear!     # Clear all cached data
+Wp2txt::GlobalDataCache.stats      # Get cache statistics
+```
+Cache validation compares each source file's modification time and size with the values stored in the cache, so the cache is invalidated automatically when a source file changes.
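A validity check based on modification time and size can be sketched like this. The helper names `source_fingerprint` and `cache_valid?` are hypothetical, and the stored fingerprint is kept in a plain Hash rather than in SQLite.

```ruby
require "tempfile"

# Hypothetical fingerprint helper; wp2txt's own cache schema is not shown here.
def source_fingerprint(path)
  stat = File.stat(path)
  { mtime: stat.mtime.to_f, size: stat.size }
end

# The cache is valid only while the source file still matches the stored fingerprint.
def cache_valid?(stored_fingerprint, path)
  File.exist?(path) && stored_fingerprint == source_fingerprint(path)
end

source = Tempfile.new(["aliases", ".json"])
source.write('{"redirect": ["#REDIRECT"]}')
source.flush

stored = source_fingerprint(source.path)
valid_before = cache_valid?(stored, source.path)

source.write(" ") # the size changes, so the fingerprint no longer matches
source.flush
valid_after = cache_valid?(stored, source.path)

puts valid_before
puts valid_after
source.close!
```

Comparing metadata instead of hashing file contents keeps the check cheap, at the cost of missing an edit that preserves both size and timestamp.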
+### CategoryCache
+Caches Wikipedia category hierarchy from API for faster category-based article extraction:
+```ruby
+cache = Wp2txt::CategoryCache.new("en", cache_dir: "/path/to/cache")
+# Save category data
+cache.save("Category Name", ["Article1", "Article2"], ["Subcategory1"])
+# Retrieve category data
+data = cache.get("Category Name")  # { pages: [...], subcats: [...] }
+# Get all pages in category tree
+pages = cache.get_all_pages("Root Category", max_depth: 2)
+# Statistics and maintenance
+cache.stats              # Cache statistics
+cache.cleanup_expired!   # Remove stale entries
+cache.clear!             # Clear all data
+```
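To illustrate what a depth-limited walk such as `get_all_pages` has to do, here is a sketch over an in-memory Hash. The `TREE` data and the `all_pages` function are hypothetical. The depth semantics are also an assumption: the root counts as depth 0 and `max_depth` caps how far below it the walk descends.

```ruby
require "set"

# Hypothetical in-memory stand-in for cached category rows.
TREE = {
  "Cities" => { pages: ["Tokyo"], subcats: ["Ports"] },
  "Ports" => { pages: ["Osaka", "Tokyo"], subcats: ["Docks"] },
  "Docks" => { pages: ["Pier 1"], subcats: ["Cities"] } # cycle back to the root
}.freeze

# Breadth-first walk that collects unique pages down to max_depth.
def all_pages(tree, root, max_depth:)
  pages = Set.new
  seen = Set.new([root])
  queue = [[root, 0]]

  until queue.empty?
    name, depth = queue.shift
    entry = tree[name]
    next unless entry

    pages.merge(entry[:pages])
    next if depth >= max_depth

    entry[:subcats].each do |sub|
      queue << [sub, depth + 1] if seen.add?(sub) # add? returns nil for repeats
    end
  end

  pages.to_a
end

p all_pages(TREE, "Cities", max_depth: 1)
```

The `seen` set matters because Wikipedia's category graph can contain cycles; without it the walk would revisit categories until the depth limit.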
+### IndexCache
+Caches parsed multistream index entries for fast article lookup:
+```ruby
+cache = Wp2txt::IndexCache.new("/path/to/index.txt", cache_dir: "/path/to/cache")
+# Check cache validity
+cache.valid?  # true if cache exists and matches source file
+# Save/load operations (used internally by MultistreamIndex)
+cache.save(entries_by_title, stream_offsets)
+data = cache.load  # { entries_by_title: {}, entries_by_id: {}, stream_offsets: [] }
+# Batch lookup
+results = cache.find_by_titles(["Article1", "Article2"])
+```
+### Cache Location
+The cache directory is laid out as follows:
+```
+~/.wp2txt/cache/
+├── global_data.sqlite3           # GlobalDataCache
+├── categories_en.sqlite3         # CategoryCache (English)
+├── categories_ja.sqlite3         # CategoryCache (Japanese)
+└── enwiki_*_index.sqlite3        # IndexCache (per dump file)
+```
+## Test System
+### Test Structure
+```
+spec/
+├── spec_helper.rb          # RSpec configuration
+├── article_spec.rb         # Article parsing tests
+├── utils_spec.rb           # Text processing tests
+├── markers_spec.rb         # Marker functionality tests
+├── auto_download_spec.rb   # CLI and download tests
+├── multilingual_spec.rb    # Language-specific tests
+├── streaming_spec.rb       # Streaming architecture tests
+└── testdata/               # Static test data
+```
+### Running Tests
+```bash
+# Run all tests
+bundle exec rspec
+# Run specific test file
+bundle exec rspec spec/utils_spec.rb
+# Run with documentation format
+bundle exec rspec --format documentation
+# Run specific test by line number
+bundle exec rspec spec/utils_spec.rb:42
+```
+## Multistream Support
+WP2TXT supports Wikipedia's multistream format for efficient article extraction.
+### How Multistream Works
+1. **Index file** (`-multistream-index.txt.bz2`): Maps article titles to byte offsets
+2. **Multistream file** (`-multistream.xml.bz2`): Concatenated bz2 streams
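Each line of the decompressed index has the form `offset:page_id:title`, where the offset is the byte position of the bz2 stream that contains the page. The sketch below parses such lines into a title lookup. The sample lines are illustrative values, and `parse_index_line` is a hypothetical helper, not part of `MultistreamIndex`.

```ruby
# Parse one index line; limiting the split to 3 parts keeps colons inside titles.
def parse_index_line(line)
  offset, page_id, title = line.chomp.split(":", 3)
  { offset: Integer(offset), page_id: Integer(page_id), title: title }
end

# Illustrative sample lines; real offsets depend on the dump.
lines = [
  "597:10:AccessibleComputing",
  "597:12:Anarchism",
  "683215:615:Wikipedia:Sample title"
]

entries = lines.map { |line| parse_index_line(line) }
by_title = entries.to_h { |entry| [entry[:title], entry] }

p by_title["Anarchism"]
```

Several pages share the same offset because each stream holds a batch of pages, which is what makes grouping by offset worthwhile during extraction.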
+### Parallel Extraction
+`MultistreamReader` supports parallel article extraction for improved performance:
+```ruby
+reader = MultistreamReader.new(multistream_path, index_path)
+# Extract multiple articles in parallel (4 processes by default)
+results = reader.extract_articles_parallel(["Tokyo", "Kyoto", "Osaka"], num_processes: 4)
+# Iterate with parallel processing
+reader.each_article_parallel(entries, num_processes: 4) do |page|
+  process(page)
+end
+```
+Articles are grouped by stream offset to minimize bz2 decompression overhead.
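The grouping step can be sketched as below. The entries and offsets are made up, and the counter stands in for the expensive "seek, decompress one stream, pick pages" work, which is not implemented here.

```ruby
# Illustrative index entries; the offsets are made up.
entries = [
  { title: "Tokyo", offset: 4096 },
  { title: "Kyoto", offset: 9216 },
  { title: "Osaka", offset: 4096 }
]

# One group per stream, ordered by offset so reads move forward through the file.
groups = entries.group_by { |entry| entry[:offset] }.sort_by { |offset, _| offset }

# Stand-in for decompression: each stream is touched once, however many
# requested articles it contains.
decompressions = 0
extracted = groups.flat_map do |offset, wanted|
  decompressions += 1
  wanted.map { |entry| "#{entry[:title]}@#{offset}" }
end

p extracted
puts decompressions
```

Three requested articles cost two decompressions here because Tokyo and Osaka live in the same stream.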
+### Partial Downloads
+When extracting specific articles, WP2TXT downloads only the data it needs:
+```ruby
+# Only download first N streams
+manager.download_multistream(max_streams: 10)
+# Download only needed byte range
+download_file_range(url, path, start_byte, end_byte)
+```
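A byte-range download relies on the HTTP `Range` header. The sketch below only builds the request, so it runs without network access. `build_range_request` and the example URL are hypothetical, and this is not the body of `download_file_range`.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: builds (but does not send) a ranged GET request.
def build_range_request(url, start_byte, end_byte)
  uri = URI(url)
  request = Net::HTTP::Get.new(uri)
  request["Range"] = "bytes=#{start_byte}-#{end_byte}" # both ends are inclusive
  request
end

request = build_range_request("https://example.org/dump-multistream.xml.bz2", 0, 1023)
puts request["Range"]
```

A server that honors the header answers with `206 Partial Content`; a `200 OK` response means the range was ignored and the full file is being sent, so a caller should check the status before writing the body to disk.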
+### Incremental Downloads
+When a partial dump exists, `download_multistream_full` can resume the download:
+```ruby
+manager = DumpManager.new("ja")
+# Check for existing partial dump
+partial = manager.find_any_partial_cache
+# => { path: "...", dump_date: "20260101", stream_count: 100, size: 1000000, mtime: ... }
+# Check if incremental download is possible
+resume_info = manager.can_resume_from_partial?(partial)
+# => { possible: true, current_streams: 100, total_streams: 5000, current_size: 1000000 }
+# => { possible: false, reason: :date_mismatch, partial_date: "20250101", latest_date: "20260101" }
+# Download full dump with incremental support (interactive prompts)
+path = manager.download_multistream_full(interactive: true)
+# Non-interactive mode (skips user prompts, always downloads fresh if needed)
+path = manager.download_multistream_full(interactive: false)
+```
+User prompts for incremental downloads:
+1. **Same date partial exists:**
+   - `[Y]` Resume download (download only remaining data)
+   - `[n]` Use existing partial as-is
+   - `[f]` Download fresh full dump
+2. **Outdated partial exists:**
+   - `[D]` Delete old partial and download latest (recommended)
+   - `[k]` Keep old partial, download latest separately
+   - `[u]` Use old partial as-is (may have outdated content)
+### Article Extraction Flow
+```
+1. Download index file (~500MB for en)
+2. Load index into hash (O(1) lookup)
+3. Find article offsets
+4. Group by stream offset
+5. Download only needed streams
+6. Extract specific articles
+```
+## Memory Management
+WP2TXT includes adaptive memory management for processing large dumps:
+### MemoryMonitor
+Cross-platform memory monitoring in `lib/wp2txt/memory_monitor.rb`:
+```ruby
+# Check current memory usage
+stats = Wp2txt::MemoryMonitor.memory_stats
+# => { current: 256000000, available: 8000000000, ... }
+# Get optimal buffer size based on available memory
+buffer_size = Wp2txt::MemoryMonitor.optimal_buffer_size
+# => 10485760 (10 MB)
+# Check if memory is low and trigger GC if needed
+Wp2txt::MemoryMonitor.gc_if_needed
+```
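One plausible way to derive a buffer size from available memory is to take a fixed share and clamp it to sane bounds. The constants, the 1% share, and `pick_buffer_size` below are assumptions made for this sketch; the actual policy in `MemoryMonitor` is not shown in this document.

```ruby
# Assumed bounds for illustration only.
MIN_BUFFER = 1 * 1024 * 1024   # 1 MiB
MAX_BUFFER = 10 * 1024 * 1024  # 10 MiB (10_485_760 bytes)

# Use 1/divisor of available memory, clamped to [MIN_BUFFER, MAX_BUFFER].
def pick_buffer_size(available_bytes, divisor: 100)
  (available_bytes / divisor).clamp(MIN_BUFFER, MAX_BUFFER)
end

puts pick_buffer_size(8_000_000_000) # plenty of memory: capped at MAX_BUFFER
puts pick_buffer_size(500_000_000)   # mid-range: proportional share
puts pick_buffer_size(50_000_000)    # low memory: raised to MIN_BUFFER
```

Integer division avoids floating-point rounding surprises, and the clamp keeps the buffer useful on small machines without letting it grow unbounded on large ones.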
+### StreamProcessor Adaptive Buffering
+`StreamProcessor` adjusts buffer size dynamically:
+```ruby
+processor = Wp2txt::StreamProcessor.new(input_path, adaptive_buffer: true)
+processor.each_page { |title, text| ... }
+# Monitor processing stats
+processor.stats
+# => { pages_processed: 1000, bytes_read: 50000000, buffer_size: 10485760, ... }
+```
+## bz2 Validation
+The `Bz2Validator` module validates bz2 files before processing:
+```ruby
+# Full validation (header + decompression test)
+result = Wp2txt::Bz2Validator.validate("/path/to/file.bz2")
+result.valid?      # => true/false
+result.error_type  # => :invalid_magic, :too_small, etc.
+result.message     # => "Invalid bz2 header..."
+# Quick validation (header only)
+result = Wp2txt::Bz2Validator.validate_quick("/path/to/file.bz2")
+# Get file info
+info = Wp2txt::Bz2Validator.file_info("/path/to/file.bz2")
+# => { path: "...", size: 1000000, valid_header: true, version: "h", block_size: 9, ... }
+```
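A header-only check follows from the bzip2 file format: a file starts with the bytes `BZ`, the version byte `h`, and a block-size digit from `1` to `9`. The sketch below applies that rule to the first four bytes. `bz2_header_info` is a hypothetical function whose return shape loosely mirrors the examples above; it is not the `Bz2Validator` implementation.

```ruby
# Hypothetical header-only check based on the bzip2 file format.
def bz2_header_info(bytes)
  return { valid_header: false, error_type: :too_small } if bytes.bytesize < 4

  magic = bytes[0, 2]
  version = bytes[2]
  level = bytes[3]

  unless magic == "BZ" && version == "h" && level.between?("1", "9")
    return { valid_header: false, error_type: :invalid_magic }
  end

  { valid_header: true, version: version, block_size: level.to_i }
end

good = bz2_header_info("BZh9\x31\x41".b)
bad = bz2_header_info("PK\x03\x04".b)
short = bz2_header_info("BZ".b)

p good
p bad
p short
```

In practice the bytes would come from something like `File.binread(path, 4)`. A header check is fast but cannot detect truncation or corruption later in the file, which is why a full validation also needs a decompression test.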
+## Adding New Features
+### Adding a New Marker Type
+1. Add to `MARKER_TYPES` in `lib/wp2txt/utils.rb`
+2. Add detection pattern in `apply_markers()`
+3. Add tests in `spec/markers_spec.rb`
+### Adding a New CLI Option
+1. Add option definition in `lib/wp2txt/cli.rb`
+2. Add validation in `validate_options!()`
+3. Handle option in `bin/wp2txt`
+4. Add tests in `spec/auto_download_spec.rb`
+5. Update README.md
+### Adding Language Support
+1. Category keywords: `data/language_categories.json`
+2. Redirect keywords: `data/language_redirects.json`
+3. Scripts: `scripts/generate_language_data.rb`
+## Code Style
+- Ruby 2.6+ compatibility
+- Frozen string literals (`# frozen_string_literal: true`)
+- RuboCop configuration in `.rubocop.yml`
+- UTF-8 encoding throughout
+## Docker
+Build and push Docker images:
+```bash
+rake push  # Builds multi-arch and pushes to Docker Hub
+```
+## Release Process
+1. Update version in `lib/wp2txt/version.rb`
+2. Update CHANGELOG.md
+3. Run full test suite: `bundle exec rspec`
+4. Build gem: `gem build wp2txt.gemspec`
+5. Push to RubyGems: `gem push wp2txt-*.gem`
+6. Push Docker image: `rake push`
+7. Create GitHub release
+## Useful Links
+- [MediaWiki Markup Reference](https://www.mediawiki.org/wiki/Help:Formatting)
+- [Wikipedia Dump Downloads](https://dumps.wikimedia.org/)
+- [Multistream Format](https://meta.wikimedia.org/wiki/Data_dumps/FAQ#Why_are_there_multiple_files_for_a_single_dump?)