wp2txt 1.1.3 → 2.1.0

Files changed (96)
  1. checksums.yaml +4 -4
  2. data/.dockerignore +12 -0
  3. data/.github/workflows/ci.yml +13 -13
  4. data/.gitignore +14 -0
  5. data/CHANGELOG.md +284 -0
  6. data/DEVELOPMENT.md +415 -0
  7. data/DEVELOPMENT_ja.md +415 -0
  8. data/Dockerfile +19 -10
  9. data/Gemfile +2 -8
  10. data/README.md +259 -123
  11. data/README_ja.md +375 -0
  12. data/Rakefile +4 -0
  13. data/bin/wp2txt +863 -161
  14. data/lib/wp2txt/article.rb +98 -13
  15. data/lib/wp2txt/bz2_validator.rb +239 -0
  16. data/lib/wp2txt/category_cache.rb +313 -0
  17. data/lib/wp2txt/cli.rb +319 -0
  18. data/lib/wp2txt/cli_ui.rb +428 -0
  19. data/lib/wp2txt/config.rb +158 -0
  20. data/lib/wp2txt/constants.rb +134 -0
  21. data/lib/wp2txt/data/html_entities.json +2135 -0
  22. data/lib/wp2txt/data/language_metadata.json +4769 -0
  23. data/lib/wp2txt/data/language_tiers.json +59 -0
  24. data/lib/wp2txt/data/mediawiki_aliases.json +12366 -0
  25. data/lib/wp2txt/data/template_aliases.json +193 -0
  26. data/lib/wp2txt/data/wikipedia_entities.json +12 -0
  27. data/lib/wp2txt/extractor.rb +545 -0
  28. data/lib/wp2txt/file_utils.rb +91 -0
  29. data/lib/wp2txt/formatter.rb +352 -0
  30. data/lib/wp2txt/global_data_cache.rb +353 -0
  31. data/lib/wp2txt/index_cache.rb +258 -0
  32. data/lib/wp2txt/magic_words.rb +353 -0
  33. data/lib/wp2txt/memory_monitor.rb +236 -0
  34. data/lib/wp2txt/multistream.rb +1383 -0
  35. data/lib/wp2txt/output_writer.rb +182 -0
  36. data/lib/wp2txt/parser_functions.rb +606 -0
  37. data/lib/wp2txt/ractor_worker.rb +215 -0
  38. data/lib/wp2txt/regex.rb +396 -12
  39. data/lib/wp2txt/section_extractor.rb +354 -0
  40. data/lib/wp2txt/stream_processor.rb +271 -0
  41. data/lib/wp2txt/template_expander.rb +830 -0
  42. data/lib/wp2txt/text_processing.rb +337 -0
  43. data/lib/wp2txt/utils.rb +629 -270
  44. data/lib/wp2txt/version.rb +1 -1
  45. data/lib/wp2txt.rb +53 -26
  46. data/scripts/benchmark_regex.rb +161 -0
  47. data/scripts/fetch_html_entities.rb +94 -0
  48. data/scripts/fetch_language_metadata.rb +180 -0
  49. data/scripts/fetch_mediawiki_data.rb +334 -0
  50. data/scripts/fetch_template_data.rb +186 -0
  51. data/scripts/profile_memory.rb +139 -0
  52. data/spec/article_spec.rb +402 -0
  53. data/spec/auto_download_spec.rb +314 -0
  54. data/spec/bz2_validator_spec.rb +193 -0
  55. data/spec/category_cache_spec.rb +226 -0
  56. data/spec/category_fetcher_spec.rb +504 -0
  57. data/spec/cleanup_spec.rb +197 -0
  58. data/spec/cli_options_spec.rb +678 -0
  59. data/spec/cli_spec.rb +876 -0
  60. data/spec/config_spec.rb +194 -0
  61. data/spec/constants_spec.rb +138 -0
  62. data/spec/file_utils_spec.rb +170 -0
  63. data/spec/fixtures/samples.rb +181 -0
  64. data/spec/formatter_sections_spec.rb +382 -0
  65. data/spec/global_data_cache_spec.rb +186 -0
  66. data/spec/index_cache_spec.rb +210 -0
  67. data/spec/integration_spec.rb +543 -0
  68. data/spec/magic_words_spec.rb +261 -0
  69. data/spec/markers_spec.rb +476 -0
  70. data/spec/memory_monitor_spec.rb +192 -0
  71. data/spec/multistream_spec.rb +690 -0
  72. data/spec/output_writer_spec.rb +400 -0
  73. data/spec/parser_functions_spec.rb +455 -0
  74. data/spec/ractor_worker_spec.rb +197 -0
  75. data/spec/regex_spec.rb +281 -0
  76. data/spec/section_extractor_spec.rb +397 -0
  77. data/spec/spec_helper.rb +63 -0
  78. data/spec/stream_processor_spec.rb +579 -0
  79. data/spec/template_data_spec.rb +246 -0
  80. data/spec/template_expander_spec.rb +472 -0
  81. data/spec/template_processing_spec.rb +217 -0
  82. data/spec/text_processing_spec.rb +312 -0
  83. data/spec/utils_spec.rb +195 -16
  84. data/spec/wp2txt_spec.rb +510 -0
  85. data/wp2txt.gemspec +5 -3
  86. metadata +146 -18
  87. data/.rubocop.yml +0 -80
  88. data/data/output_samples/testdata_en.txt +0 -23002
  89. data/data/output_samples/testdata_en_category.txt +0 -132
  90. data/data/output_samples/testdata_en_summary.txt +0 -1376
  91. data/data/output_samples/testdata_ja.txt +0 -22774
  92. data/data/output_samples/testdata_ja_category.txt +0 -206
  93. data/data/output_samples/testdata_ja_summary.txt +0 -1560
  94. data/data/testdata_en.bz2 +0 -0
  95. data/data/testdata_ja.bz2 +0 -0
  96. data/image/screenshot.png +0 -0
data/DEVELOPMENT.md ADDED
@@ -0,0 +1,415 @@
1
+ # WP2TXT Development Guide
2
+
3
+ This document provides guidance for developers working on WP2TXT. For user documentation, see [README.md](README.md).
4
+
5
+ English | [日本語](DEVELOPMENT_ja.md)
6
+
7
+ ## Quick Start
8
+
9
+ ```bash
10
+ # Install dependencies
11
+ bundle install
12
+
13
+ # Run tests
14
+ bundle exec rspec
15
+
16
+ # Run tests with coverage
17
+ bundle exec rspec # Coverage report at coverage/index.html
18
+ ```
19
+
20
+ ## Architecture Overview
21
+
22
+ ### Processing Pipeline
23
+
24
+ WP2TXT uses a streaming architecture to process Wikipedia dumps:
25
+
26
+ ```
27
+ Input (bz2/xml) → StreamProcessor → Article Parser → OutputWriter → Output files
28
+ ```
29
+
30
+ 1. **StreamProcessor** (`lib/wp2txt/stream_processor.rb`): Decompresses bz2 and streams XML pages
31
+ 2. **Article** (`lib/wp2txt/article.rb`): Parses MediaWiki text into typed elements
32
+ 3. **Utils** (`lib/wp2txt/utils.rb`): Provides text formatting and cleanup functions
33
+ 4. **OutputWriter** (`lib/wp2txt/output_writer.rb`): Writes output in text or JSON format
34
+
35
+ ### Core Classes
36
+
37
+ | Class | File | Purpose |
38
+ |-------|------|---------|
39
+ | `StreamProcessor` | `lib/wp2txt/stream_processor.rb` | Streams pages from compressed dumps with adaptive buffering |
40
+ | `Article` | `lib/wp2txt/article.rb` | Parses MediaWiki markup |
41
+ | `OutputWriter` | `lib/wp2txt/output_writer.rb` | Manages output file rotation |
42
+ | `DumpManager` | `lib/wp2txt/multistream.rb` | Downloads and caches dumps |
43
+ | `MultistreamIndex` | `lib/wp2txt/multistream.rb` | Indexes articles for random access |
44
+ | `MultistreamReader` | `lib/wp2txt/multistream.rb` | Extracts articles (supports parallel extraction) |
45
+ | `CategoryFetcher` | `lib/wp2txt/multistream.rb` | Fetches category members from Wikipedia API |
46
+ | `MemoryMonitor` | `lib/wp2txt/memory_monitor.rb` | Cross-platform memory monitoring |
47
+ | `Bz2Validator` | `lib/wp2txt/bz2_validator.rb` | Validates bz2 file integrity |
48
+ | `CLI` | `lib/wp2txt/cli.rb` | Command-line option parsing |
49
+
50
+ ### Cache Classes
51
+
52
+ | Class | File | Purpose |
53
+ |-------|------|---------|
54
+ | `GlobalDataCache` | `lib/wp2txt/global_data_cache.rb` | SQLite cache for parsed JSON data files |
55
+ | `CategoryCache` | `lib/wp2txt/category_cache.rb` | SQLite cache for Wikipedia category hierarchy |
56
+ | `IndexCache` | `lib/wp2txt/index_cache.rb` | SQLite cache for multistream index entries |
57
+
58
+ ### Element Types
59
+
60
+ The `Article` class parses MediaWiki text into typed elements:
61
+
62
+ | Type | Description |
63
+ |------|-------------|
64
+ | `:mw_heading` | Section headings (`== Title ==`) |
65
+ | `:mw_paragraph` | Regular text paragraphs |
66
+ | `:mw_table` | Wiki tables (`{| ... |}`) |
67
+ | `:mw_quote` | Block quotes |
68
+ | `:mw_pre` | Preformatted text |
69
+ | `:mw_unordered` | Unordered list items |
70
+ | `:mw_ordered` | Ordered list items |
71
+ | `:mw_definition` | Definition list items |
72
+ | `:mw_link` | Single-line links |
73
+ | `:mw_ml_link` | Multi-line links |
74
+ | `:mw_redirect` | Redirect pages |
75
+ | `:mw_template` | Templates |
76
+ | `:mw_isolated_tag` | HTML tags |
77
+
78
+ ### Marker System
79
+
80
+ Content type markers replace special content (math, code, etc.) with placeholders:
81
+
82
+ ```ruby
83
+ # In utils.rb
84
+ MARKER_TYPES = %i[math code chem table score timeline graph ipa].freeze
85
+
86
+ # Processing flow:
87
+ # 1. Content detected → Replace with placeholder («« MATH »»)
88
+ # 2. Text processing continues (placeholders protected from cleanup)
89
+ # 3. finalize_markers() converts placeholders to [MARKER] format
90
+ ```
91
+
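The three-step marker flow described above can be sketched roughly as follows. This is an illustration only, not WP2TXT's actual code: `SKETCH_MARKER_TYPES`, `apply_markers_sketch`, and `finalize_markers_sketch` are hypothetical names, and the sketch assumes the special content arrives as simple `<math>`/`<code>` tag pairs.

```ruby
# Hypothetical sketch of the placeholder flow; names and regexes are assumptions.
SKETCH_MARKER_TYPES = %i[math code].freeze
SKETCH_MARKER_TAG = %r{<(#{SKETCH_MARKER_TYPES.join("|")})>.*?</\1>}m

# Step 1: swap special content for a placeholder that later cleanup will not touch.
def apply_markers_sketch(text)
  text.gsub(SKETCH_MARKER_TAG) { "«« #{Regexp.last_match(1).upcase} »»" }
end

# Step 3: convert the protected placeholders into the final [MARKER] form.
def finalize_markers_sketch(text)
  text.gsub(/«« ([A-Z]+) »»/) { "[#{Regexp.last_match(1)}]" }
end

protected_text = apply_markers_sketch("Energy: <math>E = mc^2</math>.")
cleaned = protected_text.gsub(/<[^>]+>/, "") # Step 2: stand-in for the cleanup pass
puts finalize_markers_sketch(cleaned)        # prints "Energy: [MATH]."
```

The placeholder uses characters that the tag-stripping step does not match, which is why it survives until the final conversion.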
92
+ ### Template Expansion
93
+
94
+ The `TemplateExpander` class (`lib/wp2txt/template_expander.rb`) expands common Wikipedia templates to readable text:
95
+
96
+ | Template Type | Example | Output |
97
+ |---------------|---------|--------|
98
+ | Birth/death dates | `{{birth date|1990|5|15}}` | "May 15, 1990" |
99
+ | Unit conversion | `{{convert|100|km|mi}}` | "100 km (62 mi)" |
100
+ | Coordinates | `{{coord|35|41|N|139|41|E}}` | "35°41′N 139°41′E" |
101
+ | Language tags | `{{lang|ja|日本語}}` | "日本語" |
102
+ | Nihongo | `{{nihongo|Tokyo|東京|Tōkyō}}` | "Tokyo (東京, Tōkyō)" |
103
+
104
+ Template expansion is enabled by default. Disable with `--no-expand-templates` or `expand_templates: false`.
105
+
106
+ ### Magic Word Expansion
107
+
108
+ The `MagicWordExpander` class (`lib/wp2txt/magic_words.rb`) expands MediaWiki magic words to their actual values:
109
+
110
+ | Category | Magic Words | Example |
111
+ |----------|-------------|---------|
112
+ | Page context | `PAGENAME`, `FULLPAGENAME`, `BASEPAGENAME`, `ROOTPAGENAME`, `SUBPAGENAME`, `NAMESPACE`, `TALKPAGENAME` | `{{PAGENAME}}` → "Article Title" |
113
+ | Date/time | `CURRENTYEAR`, `CURRENTMONTH`, `CURRENTDAY`, `CURRENTDAYNAME`, `CURRENTTIME`, `CURRENTTIMESTAMP` | `{{CURRENTYEAR}}` → "2024" |
114
+ | String functions | `lc`, `uc`, `lcfirst`, `ucfirst`, `urlencode`, `anchorencode`, `padleft`, `padright` | `{{uc:hello}}` → "HELLO" |
115
+ | Parser functions | `#titleparts` | `{{#titleparts:A/B/C\|2}}` → "A/B" |
116
+
117
+ Magic words are expanded early in the `format_wiki()` pipeline when a title is provided in the config:
118
+
119
+ ```ruby
120
+ result = format_wiki(text, title: "Article Name", dump_date: Time.now)
121
+ ```
122
+
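To make the table above concrete, here is a minimal sketch of how a page-context word and a few string functions could be expanded. It is not the `MagicWordExpander` implementation: `SKETCH_STRING_FUNCTIONS` and `expand_magic_words_sketch` are hypothetical, and only `PAGENAME`, `lc`, `uc`, and `ucfirst` are handled.

```ruby
# Hypothetical sketch; the real expander supports many more magic words.
SKETCH_STRING_FUNCTIONS = {
  "lc" => ->(s) { s.downcase },
  "uc" => ->(s) { s.upcase },
  "ucfirst" => ->(s) { s[0].to_s.upcase + s[1..-1].to_s }
}.freeze

SKETCH_FUNCTION_PATTERN = /\{\{(#{SKETCH_STRING_FUNCTIONS.keys.join("|")}):([^{}|]*)\}\}/

def expand_magic_words_sketch(text, title:)
  # Page-context words are substituted first so that string functions can wrap them.
  with_title = text.gsub("{{PAGENAME}}") { title }
  with_title.gsub(SKETCH_FUNCTION_PATTERN) do
    name, argument = Regexp.last_match.captures
    SKETCH_STRING_FUNCTIONS.fetch(name).call(argument)
  end
end

puts expand_magic_words_sketch("{{uc:hello}} from {{PAGENAME}}", title: "Article Title")
# prints "HELLO from Article Title"
```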
123
+ ## Caching Infrastructure
124
+
125
+ WP2TXT uses SQLite-based caching to improve performance for repeated operations. All caches are stored in `~/.wp2txt/cache/`.
126
+
127
+ ### GlobalDataCache
128
+
129
+ Caches parsed JSON data files (templates, MediaWiki aliases, HTML entities) to eliminate parsing overhead:
130
+
131
+ ```ruby
132
+ # Automatic - data loading methods use cache transparently
133
+ data = Wp2txt.load_mediawiki_data # Uses cache if valid
134
+
135
+ # Manual cache operations
136
+ Wp2txt::GlobalDataCache.clear! # Clear all cached data
137
+ Wp2txt::GlobalDataCache.stats # Get cache statistics
138
+ ```
139
+
140
+ Cache validation: the cache records each source file's modification time and size, and is invalidated automatically when either value changes.
141
+
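A check based on modification time and size might look like the sketch below. The helper names `source_fingerprint` and `cache_valid?` are hypothetical and do not come from `GlobalDataCache`. The stored fingerprint would normally live in the SQLite cache, not in a local variable.

```ruby
require "tempfile"

# Hypothetical helper: a fingerprint built from the two attributes the text mentions.
def source_fingerprint(path)
  stat = File.stat(path)
  { mtime: stat.mtime.to_f, size: stat.size }
end

# A cached entry is usable only while the stored fingerprint still matches the file.
def cache_valid?(stored_fingerprint, path)
  File.exist?(path) && stored_fingerprint == source_fingerprint(path)
end

Tempfile.create("data.json") do |file|
  file.write("{}")
  file.flush
  stored = source_fingerprint(file.path)
  puts cache_valid?(stored, file.path) # unchanged file: true

  file.write(" ")                      # size changes, so the cache is stale
  file.flush
  puts cache_valid?(stored, file.path) # false
end
```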
142
+ ### CategoryCache
143
+
144
+ Caches the Wikipedia category hierarchy fetched from the API to speed up category-based article extraction:
145
+
146
+ ```ruby
147
+ cache = Wp2txt::CategoryCache.new("en", cache_dir: "/path/to/cache")
148
+
149
+ # Save category data
150
+ cache.save("Category Name", ["Article1", "Article2"], ["Subcategory1"])
151
+
152
+ # Retrieve category data
153
+ data = cache.get("Category Name") # { pages: [...], subcats: [...] }
154
+
155
+ # Get all pages in category tree
156
+ pages = cache.get_all_pages("Root Category", max_depth: 2)
157
+
158
+ # Statistics and maintenance
159
+ cache.stats # Cache statistics
160
+ cache.cleanup_expired! # Remove stale entries
161
+ cache.clear! # Clear all data
162
+ ```
163
+
164
+ ### IndexCache
165
+
166
+ Caches parsed multistream index entries for fast article lookup:
167
+
168
+ ```ruby
169
+ cache = Wp2txt::IndexCache.new("/path/to/index.txt", cache_dir: "/path/to/cache")
170
+
171
+ # Check cache validity
172
+ cache.valid? # true if cache exists and matches source file
173
+
174
+ # Save/load operations (used internally by MultistreamIndex)
175
+ cache.save(entries_by_title, stream_offsets)
176
+ data = cache.load # { entries_by_title: {}, entries_by_id: {}, stream_offsets: [] }
177
+
178
+ # Batch lookup
179
+ results = cache.find_by_titles(["Article1", "Article2"])
180
+ ```
181
+
182
+ ### Cache Location
183
+
184
+ All caches are stored in `~/.wp2txt/cache/`:
185
+
186
+ ```
187
+ ~/.wp2txt/cache/
188
+ ├── global_data.sqlite3 # GlobalDataCache
189
+ ├── categories_en.sqlite3 # CategoryCache (English)
190
+ ├── categories_ja.sqlite3 # CategoryCache (Japanese)
191
+ └── enwiki_*_index.sqlite3 # IndexCache (per dump file)
192
+ ```
193
+
194
+ ## Test System
195
+
196
+ ### Test Structure
197
+
198
+ ```
199
+ spec/
200
+ ├── spec_helper.rb # RSpec configuration
201
+ ├── article_spec.rb # Article parsing tests
202
+ ├── utils_spec.rb # Text processing tests
203
+ ├── markers_spec.rb # Marker functionality tests
204
+ ├── auto_download_spec.rb # CLI and download tests
205
+ ├── multilingual_spec.rb # Language-specific tests
206
+ ├── streaming_spec.rb # Streaming architecture tests
207
+ └── testdata/ # Static test data
208
+ ```
209
+
210
+ ### Running Tests
211
+
212
+ ```bash
213
+ # Run all tests
214
+ bundle exec rspec
215
+
216
+ # Run specific test file
217
+ bundle exec rspec spec/utils_spec.rb
218
+
219
+ # Run with documentation format
220
+ bundle exec rspec --format documentation
221
+
222
+ # Run specific test by line number
223
+ bundle exec rspec spec/utils_spec.rb:42
224
+ ```
225
+
226
+ ## Multistream Support
227
+
228
+ WP2TXT supports Wikipedia's multistream format for efficient article extraction.
229
+
230
+ ### How Multistream Works
231
+
232
+ 1. **Index file** (`-multistream-index.txt.bz2`): Maps article titles to byte offsets
233
+ 2. **Multistream file** (`-multistream.xml.bz2`): Concatenated bz2 streams
234
+
235
+ ### Parallel Extraction
236
+
237
+ `MultistreamReader` supports parallel article extraction for improved performance:
238
+
239
+ ```ruby
240
+ reader = MultistreamReader.new(multistream_path, index_path)
241
+
242
+ # Extract multiple articles in parallel (4 processes by default)
243
+ results = reader.extract_articles_parallel(["Tokyo", "Kyoto", "Osaka"], num_processes: 4)
244
+
245
+ # Iterate with parallel processing
246
+ reader.each_article_parallel(entries, num_processes: 4) do |page|
247
+ process(page)
248
+ end
249
+ ```
250
+
251
+ Articles are grouped by stream offset to minimize bz2 decompression overhead.
252
+
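The grouping step can be illustrated with a small sketch. It assumes the decompressed index uses the `offset:page_id:title` line layout of Wikimedia's multistream index files. `parse_index_line` and `group_by_stream` are hypothetical helpers, not methods of `MultistreamIndex` or `MultistreamReader`.

```ruby
# Hypothetical sketch. Titles may themselves contain colons, hence the split limit of 3.
def parse_index_line(line)
  offset, page_id, title = line.chomp.split(":", 3)
  { offset: Integer(offset), page_id: Integer(page_id), title: title }
end

# Build a title => entry hash for O(1) lookup, then bucket the wanted titles by
# stream offset so that each bz2 stream only has to be decompressed once.
def group_by_stream(index_lines, wanted_titles)
  by_title = {}
  index_lines.each do |line|
    entry = parse_index_line(line)
    by_title[entry[:title]] = entry
  end
  wanted_titles.map { |title| by_title[title] }.compact.group_by { |entry| entry[:offset] }
end

lines = ["600:10:Tokyo", "600:12:Kyoto", "9000:25:Osaka", "9000:30:Help:Contents"]
group_by_stream(lines, %w[Tokyo Kyoto Osaka]).each do |offset, entries|
  puts "#{offset}: #{entries.map { |e| e[:title] }.join(', ')}"
end
```

With the sample lines, Tokyo and Kyoto share the stream at offset 600, so a reader would open that stream once for both.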
253
+ ### Partial Downloads
254
+
255
+ When extracting specific articles, WP2TXT downloads only the data it needs:
256
+
257
+ ```ruby
258
+ # Only download first N streams
259
+ manager.download_multistream(max_streams: 10)
260
+
261
+ # Download only needed byte range
262
+ download_file_range(url, path, start_byte, end_byte)
263
+ ```
264
+
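A byte-range download is typically expressed with the standard HTTP `Range` header. The sketch below only builds such a request with Ruby's `net/http` and does not send it. `build_range_request` is a hypothetical helper and the URL is a placeholder. How `download_file_range` is actually implemented is not shown in this diff.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: build (but do not send) a GET request for an inclusive byte range.
def build_range_request(url, start_byte, end_byte)
  raise ArgumentError, "end_byte must be >= start_byte" if end_byte < start_byte

  request = Net::HTTP::Get.new(URI(url))
  request["Range"] = "bytes=#{start_byte}-#{end_byte}"
  request
end

# Placeholder URL, used for illustration only.
request = build_range_request("https://dumps.wikimedia.org/example.xml.bz2", 600, 8999)
puts request["Range"] # prints "bytes=600-8999"
```

A server that honors the header answers with `206 Partial Content` and only the requested bytes. A caller should be prepared for a plain `200` response containing the whole file.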
265
+ ### Incremental Downloads
266
+
267
+ When a partial dump exists, `download_multistream_full` can resume the download:
268
+
269
+ ```ruby
270
+ manager = DumpManager.new("ja")
271
+
272
+ # Check for existing partial dump
273
+ partial = manager.find_any_partial_cache
274
+ # => { path: "...", dump_date: "20260101", stream_count: 100, size: 1000000, mtime: ... }
275
+
276
+ # Check if incremental download is possible
277
+ resume_info = manager.can_resume_from_partial?(partial)
278
+ # => { possible: true, current_streams: 100, total_streams: 5000, current_size: 1000000 }
279
+ # => { possible: false, reason: :date_mismatch, partial_date: "20250101", latest_date: "20260101" }
280
+
281
+ # Download full dump with incremental support (interactive prompts)
282
+ path = manager.download_multistream_full(interactive: true)
283
+
284
+ # Non-interactive mode (skips user prompts, always downloads fresh if needed)
285
+ path = manager.download_multistream_full(interactive: false)
286
+ ```
287
+
288
+ User prompts for incremental downloads:
289
+
290
+ 1. **Same-date partial exists:**
291
+ - `[Y]` Resume download (download only remaining data)
292
+ - `[n]` Use existing partial as-is
293
+ - `[f]` Download fresh full dump
294
+
295
+ 2. **Outdated partial exists:**
296
+ - `[D]` Delete old partial and download latest (recommended)
297
+ - `[k]` Keep old partial, download latest separately
298
+ - `[u]` Use old partial as-is (may have outdated content)
299
+
300
+ ### Article Extraction Flow
301
+
302
+ ```
303
+ 1. Download index file (~500MB for en)
304
+ 2. Load index into hash (O(1) lookup)
305
+ 3. Find article offsets
306
+ 4. Group by stream offset
307
+ 5. Download only needed streams
308
+ 6. Extract specific articles
309
+ ```
310
+
311
+ ## Memory Management
312
+
313
+ WP2TXT includes adaptive memory management for processing large dumps:
314
+
315
+ ### MemoryMonitor
316
+
317
+ Cross-platform memory monitoring in `lib/wp2txt/memory_monitor.rb`:
318
+
319
+ ```ruby
320
+ # Check current memory usage
321
+ stats = Wp2txt::MemoryMonitor.memory_stats
322
+ # => { current: 256000000, available: 8000000000, ... }
323
+
324
+ # Get optimal buffer size based on available memory
325
+ buffer_size = Wp2txt::MemoryMonitor.optimal_buffer_size
326
+ # => 10485760 (10 MB)
327
+
328
+ # Check if memory is low and trigger GC if needed
329
+ Wp2txt::MemoryMonitor.gc_if_needed
330
+ ```
331
+
332
+ ### StreamProcessor Adaptive Buffering
333
+
334
+ `StreamProcessor` adjusts buffer size dynamically:
335
+
336
+ ```ruby
337
+ processor = Wp2txt::StreamProcessor.new(input_path, adaptive_buffer: true)
338
+ processor.each_page { |title, text| ... }
339
+
340
+ # Monitor processing stats
341
+ processor.stats
342
+ # => { pages_processed: 1000, bytes_read: 50000000, buffer_size: 10485760, ... }
343
+ ```
344
+
345
+ ## bz2 Validation
346
+
347
+ The `Bz2Validator` module validates bz2 files before processing:
348
+
349
+ ```ruby
350
+ # Full validation (header + decompression test)
351
+ result = Wp2txt::Bz2Validator.validate("/path/to/file.bz2")
352
+ result.valid? # => true/false
353
+ result.error_type # => :invalid_magic, :too_small, etc.
354
+ result.message # => "Invalid bz2 header..."
355
+
356
+ # Quick validation (header only)
357
+ result = Wp2txt::Bz2Validator.validate_quick("/path/to/file.bz2")
358
+
359
+ # Get file info
360
+ info = Wp2txt::Bz2Validator.file_info("/path/to/file.bz2")
361
+ # => { path: "...", size: 1000000, valid_header: true, version: "h", block_size: 9, ... }
362
+ ```
363
+
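For orientation, a header-only check can be sketched as below. A bzip2 file starts with the bytes `BZ`, the version letter `h`, and a block-size digit from `1` to `9`. `bz2_header_info` is a hypothetical function and not the `Bz2Validator` API. The `:too_small` and `:invalid_magic` symbols simply mirror the error types mentioned above.

```ruby
require "stringio"

# Hypothetical header-only check; it reads just the first four bytes of the stream.
def bz2_header_info(io)
  header = io.read(4)
  return { valid_header: false, reason: :too_small } if header.nil? || header.bytesize < 4

  magic = header[0, 2]
  version = header[2]
  level = header[3]
  unless magic == "BZ" && version == "h" && level.between?("1", "9")
    return { valid_header: false, reason: :invalid_magic }
  end

  { valid_header: true, version: version, block_size: level.to_i }
end

info = bz2_header_info(StringIO.new("BZh9rest-of-stream"))
puts "valid=#{info[:valid_header]} block_size=#{info[:block_size]}"
puts bz2_header_info(StringIO.new("PK"))[:reason]
```

A check like this is cheap but cannot detect truncated or corrupted data further into the file, which is presumably why the full validation also runs a decompression test.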
364
+ ## Adding New Features
365
+
366
+ ### Adding a New Marker Type
367
+
368
+ 1. Add to `MARKER_TYPES` in `lib/wp2txt/utils.rb`
369
+ 2. Add detection pattern in `apply_markers()`
370
+ 3. Add tests in `spec/markers_spec.rb`
371
+
372
+ ### Adding a New CLI Option
373
+
374
+ 1. Add option definition in `lib/wp2txt/cli.rb`
375
+ 2. Add validation in `validate_options!()`
376
+ 3. Handle option in `bin/wp2txt`
377
+ 4. Add tests in `spec/auto_download_spec.rb`
378
+ 5. Update README.md
379
+
380
+ ### Adding Language Support
381
+
382
+ 1. Category keywords: `data/language_categories.json`
383
+ 2. Redirect keywords: `data/language_redirects.json`
384
+ 3. Scripts: `scripts/generate_language_data.rb`
385
+
386
+ ## Code Style
387
+
388
+ - Ruby 2.6+ compatibility
389
+ - Frozen string literals (`# frozen_string_literal: true`)
390
+ - RuboCop configuration in `.rubocop.yml`
391
+ - UTF-8 encoding throughout
392
+
393
+ ## Docker
394
+
395
+ Build and push Docker images:
396
+
397
+ ```bash
398
+ rake push # Builds multi-arch and pushes to Docker Hub
399
+ ```
400
+
401
+ ## Release Process
402
+
403
+ 1. Update version in `lib/wp2txt/version.rb`
404
+ 2. Update CHANGELOG.md
405
+ 3. Run full test suite: `bundle exec rspec`
406
+ 4. Build gem: `gem build wp2txt.gemspec`
407
+ 5. Push to RubyGems: `gem push wp2txt-*.gem`
408
+ 6. Push Docker image: `rake push`
409
+ 7. Create GitHub release
410
+
411
+ ## Useful Links
412
+
413
+ - [MediaWiki Markup Reference](https://www.mediawiki.org/wiki/Help:Formatting)
414
+ - [Wikipedia Dump Downloads](https://dumps.wikimedia.org/)
415
+ - [Multistream Format](https://meta.wikimedia.org/wiki/Data_dumps/FAQ#Why_are_there_multiple_files_for_a_single_dump?)