wp2txt 2.1.0 → 2.1.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (10)

checksums.yaml +4 -4
data/CHANGELOG.md +6 -0
data/README.md +27 -20
data/README_ja.md +11 -4
data/lib/wp2txt/extractor.rb +2 -2
data/lib/wp2txt/section_extractor.rb +36 -15
data/lib/wp2txt/version.rb +1 -1
data/spec/multistream_spec.rb +14 -14
data/spec/section_extractor_spec.rb +70 -0
metadata +1 -1

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 6542679abdbb9ac3e8c00581ce7c82b583c742ef0425f9f1ccd3eab619598c1b
-  data.tar.gz: d822011ec24cd6d512cb9725880b4780daec5b9ce401caafbae9e8df5e8593a5
+  metadata.gz: 464bf436280592e916e565d24cacdbde13c925ceaa9b390a2b36d4835053a323
+  data.tar.gz: 205a30ed5e9193d974f3a93d04f42ca55f9a241c0215a2755567c949776a9c3b
 SHA512:
-  metadata.gz: 8ffad99cceab4797a03e857203ebe0cd4f5df8e592d30920c975cb89e1d079709356810466bec32027b8be8de026138c1c579ffe97f0da47d6a90d799ba60222
-  data.tar.gz: 312f68371040f86384cb2bd01a68e178f2adbd97c3bdb71ec75ccaf7cf8a47c3fec733ca3874abcb16d30776222c313b1b04f393a24b62f119dad0217c35ed3c
+  metadata.gz: d6cf5dbef0e429802a5f66450e43598b94615c92fb144bc6630d44469ba73030002ec539049babd56271ef9c50d84ac22d7b90172e47d112b78676b68aca3324
+  data.tar.gz: 04c368cb623116823bd1a036fe760f38bb1bce78fa3ba6d15ea1db13f00ad7224a1e67b3904a5e1a9b1f329bcc7a8805255ad811ddd2925299c2d58a5968a8ac
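
The digests above can be checked locally against the files unpacked from the `.gem` archive (a tar file containing `metadata.gz`, `data.tar.gz`, and `checksums.yaml`). The following is a minimal sketch, not part of wp2txt or RubyGems: the helper name `checksums_match?` and its arguments are assumptions made for illustration.

```ruby
require "digest"
require "yaml"

# Hypothetical helper (not a RubyGems API): returns true when the SHA256 and
# SHA512 digests of file_path both equal the values recorded for entry_name
# in a checksums.yaml-style document.
def checksums_match?(checksums_yaml, entry_name, file_path)
  expected = YAML.safe_load(checksums_yaml)
  {
    "SHA256" => Digest::SHA256,
    "SHA512" => Digest::SHA512
  }.all? do |algo, digest_class|
    # to_s guards against YAML parsing an all-digit value as a number;
    # such a value would then simply fail to match.
    expected.dig(algo, entry_name).to_s == digest_class.file(file_path).hexdigest
  end
end
```

Usage would look like `checksums_match?(File.read("checksums.yaml"), "data.tar.gz", "data.tar.gz")` after extracting the archive into the current directory.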

data/CHANGELOG.md CHANGED

@@ -5,6 +5,12 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [2.1.1] - 2026-02-21
+- **Bidirectional alias matching**: Section extraction now supports reverse alias lookup - specifying an alias name (e.g., "Synopsis") as target matches the canonical heading ("Plot") and vice versa
+- **Expanded default section aliases**: Increased from 2 to 12 alias groups covering common English Wikipedia sections (Plot, Reception, References, Bibliography, Awards, Legacy, Early life, Career, etc.)
+- **Config forwarding fix**: `--pre`, `--ref`, `--expand-templates`, and `--metadata-only` options now correctly forwarded in `--articles` and `--from-category` modes
 ## [2.1.0] - 2026-02-19
 - **SQLite-based caching infrastructure**: New high-performance caching using SQLite for faster startup and repeated operations:

data/README.md CHANGED

@@ -10,14 +10,14 @@ English | [日本語](README_ja.md)
 # Install
 gem install wp2txt
-# Extract text from Japanese Wikipedia (auto-download)
-wp2txt --lang=ja -o ./output
+# Extract text from English Wikipedia (auto-download)
+wp2txt --lang=en -o ./output
 # Extract specific articles
-wp2txt --lang=ja --articles="東京,京都" -o ./articles
+wp2txt --lang=en --articles="Tokyo,Kyoto" -o ./articles
 # Extract articles from a category
-wp2txt --lang=ja --from-category="日本の都市" -o ./cities
+wp2txt --lang=en --from-category="Cities in Japan" -o ./cities
 ```
 ## About
@@ -80,27 +80,27 @@ The `wp2txt` command is available inside the container. Use `/data` for input/ou
 ### Auto-download and process (Recommended)
-    $ wp2txt --lang=ja -o ./text
+    $ wp2txt --lang=en -o ./text
-This automatically downloads the Japanese Wikipedia dump and extracts plain text. Downloads are cached in `~/.wp2txt/cache/`.
+This automatically downloads the English Wikipedia dump and extracts plain text. Downloads are cached in `~/.wp2txt/cache/`.
 ### Extract specific articles by title
-    $ wp2txt --lang=ja --articles="認知言語学,生成文法" -o ./articles
+    $ wp2txt --lang=en --articles="Cognitive linguistics,Generative grammar" -o ./articles
 Only the index file and necessary data streams are downloaded, making it much faster than processing the full dump.
 ### Extract articles from a category
-    $ wp2txt --lang=ja --from-category="日本の都市" -o ./cities
+    $ wp2txt --lang=en --from-category="Cities in Japan" -o ./cities
 Include subcategories with `--depth`:
-    $ wp2txt --lang=ja --from-category="日本の都市" --depth=2 -o ./cities
+    $ wp2txt --lang=en --from-category="Cities in Japan" --depth=2 -o ./cities
 Preview without downloading (shows article counts):
-    $ wp2txt --lang=ja --from-category="日本の都市" --dry-run
+    $ wp2txt --lang=en --from-category="Cities in Japan" --dry-run
 ### Process local dump file
@@ -109,22 +109,29 @@ Preview without downloading (shows article counts):
 ### Other extraction modes
     # Category info only (title + categories)
-    $ wp2txt -g --lang=ja -o ./category
+    $ wp2txt -g --lang=en -o ./category
     # Summary only (title + categories + opening paragraphs)
-    $ wp2txt -s --lang=ja -o ./summary
+    $ wp2txt -s --lang=en -o ./summary
     # Metadata only (title + section headings + categories)
-    $ wp2txt -M --lang=ja --format json -o ./metadata
+    $ wp2txt -M --lang=en --format json -o ./metadata
-    # Extract specific sections (comma-separated, 'summary' for lead text)
-    $ wp2txt --lang=en --sections="summary,Plot,Reception" --format json -o ./sections
+    # Extract specific sections from particular articles (fast)
+    # Section names are case-insensitive; alias matching is enabled by default
+    $ wp2txt --lang=en --articles="Tokyo" --sections="summary,history,geography" --format json -o ./sections
-    # Section heading statistics
-    $ wp2txt --lang=ja --section-stats -o ./stats
+    # Extract specific sections from a category (moderate)
+    $ wp2txt --lang=en --from-category="Cities in Japan" --sections="summary,history" --format json -o ./sections
+    # Extract specific sections from full dump (slow - processes all articles)
+    $ wp2txt --lang=en --sections="summary,plot,reception" --format json -o ./sections
+    # Section heading statistics (useful for discovering section names before extraction)
+    $ wp2txt --lang=en --section-stats -o ./stats
     # JSON/JSONL output
-    $ wp2txt --format json --lang=ja -o ./json
+    $ wp2txt --format json --lang=en -o ./json
 ## Sample Output
@@ -156,7 +163,7 @@ For redirect articles:
     $ wp2txt --cache-status           # Show cache status
     $ wp2txt --cache-clear            # Clear all cache
-    $ wp2txt --cache-clear --lang=ja  # Clear cache for Japanese only
+    $ wp2txt --cache-clear --lang=en  # Clear cache for English only
     $ wp2txt --update-cache           # Force fresh download
 When cache exceeds the expiry period (default: 30 days), wp2txt displays a warning but allows using cached data.
@@ -260,7 +267,7 @@ Supported: `{{cite book}}`, `{{cite web}}`, `{{cite news}}`, `{{cite journal}}`,
       -M, --metadata-only              Extract only title, headings, and categories
     Section extraction:
-      -S, --sections=<s>               Extract specific sections (comma-separated)
+      -S, --sections=<s>               Extract specific sections (comma-separated, case-insensitive)
       --section-output=<s>             Output mode: structured or combined (default: structured)
       --min-section-length=<i>         Minimum section length in characters (default: 0)
       --skip-empty                     Skip articles with no matching sections

data/README_ja.md CHANGED

@@ -117,10 +117,17 @@ docker run -it -v /path/to/localdata:/data yohasebe/wp2txt
     # メタデータのみ（タイトル + セクション見出し + カテゴリ）
     $ wp2txt -M --lang=ja --format json -o ./metadata
-    # 特定セクションを抽出（カンマ区切り、'summary'で冒頭テキスト）
-    $ wp2txt --lang=ja --sections="概要,歴史,関連項目" --format json -o ./sections
+    # 特定記事から特定セクションを抽出（高速）
+    # セクション名は大文字小文字を区別しません。エイリアスマッチングもデフォルトで有効です
+    $ wp2txt --lang=ja --articles="東京" --sections="summary,概要,歴史" --format json -o ./sections
-    # セクション見出しの統計
+    # カテゴリ内の記事から特定セクションを抽出（中速）
+    $ wp2txt --lang=ja --from-category="日本の都市" --sections="summary,概要,歴史" --format json -o ./sections
+    # フルダンプから特定セクションを抽出（低速 - 全記事を処理）
+    $ wp2txt --lang=ja --sections="summary,概要,歴史,関連項目" --format json -o ./sections
+    # セクション見出しの統計（抽出前のセクション名の調査に便利）
     $ wp2txt --lang=ja --section-stats -o ./stats
     # JSON/JSONL出力
@@ -260,7 +267,7 @@ CATEGORIES: カテゴリ1, カテゴリ2, カテゴリ3
       -M, --metadata-only              タイトル、見出し、カテゴリのみ抽出
     セクション抽出:
-      -S, --sections=<s>               特定セクションを抽出（カンマ区切り）
+      -S, --sections=<s>               特定セクションを抽出（カンマ区切り、大文字小文字区別なし）
       --section-output=<s>             出力モード: structured または combined（デフォルト: structured）
       --min-section-length=<i>         最小セクション長（文字数）（デフォルト: 0）
       --skip-empty                     該当セクションのない記事をスキップ

data/lib/wp2txt/extractor.rb CHANGED

@@ -253,8 +253,8 @@ module Wp2txt
         bz2_gem: opts[:bz2_gem]
       }
-      %i[title list heading table redirect multiline category category_only
-         summary_only marker extract_citations].each do |opt|
+      %i[title list heading table pre ref redirect multiline category category_only
+         summary_only metadata_only marker extract_citations expand_templates].each do |opt|
         config[opt] = opts[opt]
       end
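
The hunk above widens an allowlist: only keys named in the symbol array are copied from `opts` into `config`, so an option missing from the list is silently dropped. A minimal standalone sketch of that pattern follows. The names `FORWARDED_OPTS` and `build_config` are hypothetical and do not appear in wp2txt.

```ruby
# Hypothetical names for illustration; wp2txt inlines this list in extractor.rb.
FORWARDED_OPTS = %i[title list heading table pre ref redirect multiline category category_only
                    summary_only metadata_only marker extract_citations expand_templates].freeze

# Copy only allowlisted keys from opts into a fresh config hash.
# Keys absent from opts are forwarded as nil; keys not in the list are ignored.
def build_config(opts)
  FORWARDED_OPTS.each_with_object({}) { |opt, config| config[opt] = opts[opt] }
end

config = build_config(pre: true, expand_templates: true, unrelated: 1)
puts config[:pre].inspect              # forwarded because :pre is in the list
puts config.key?(:unrelated).inspect   # not forwarded
```

With the 2.1.0 list, `:pre`, `:ref`, `:metadata_only`, and `:expand_templates` were not in the array, which is consistent with the changelog entry about options not being forwarded.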

data/lib/wp2txt/section_extractor.rb CHANGED

@@ -11,9 +11,21 @@ module Wp2txt
     SUMMARY_KEY = "summary"
     # Default section aliases (canonical name => array of aliases)
+    # These cover common variations found across English Wikipedia articles.
+    # Users can add custom aliases via --alias-file for other languages or domains.
     DEFAULT_ALIASES = {
-      "Plot" => ["Synopsis"],
-      "Reception" => ["Critical reception"]
+      "Plot" => ["Synopsis", "Plot summary", "Story"],
+      "Reception" => ["Critical reception", "Reviews", "Critical response"],
+      "References" => ["Notes", "Footnotes", "Citations", "Notes and references"],
+      "External links" => ["External sources"],
+      "See also" => ["Related articles", "Related pages"],
+      "Bibliography" => ["Works", "Publications", "Selected works", "Selected bibliography"],
+      "Awards" => ["Awards and nominations", "Honors", "Accolades"],
+      "Legacy" => ["Impact", "Influence", "Cultural impact", "Cultural legacy"],
+      "Early life" => ["Early life and education", "Childhood", "Early years"],
+      "Career" => ["Professional career"],
+      "Filmography" => ["Films"],
+      "Discography" => ["Discography and videography"]
     }.freeze
     # Track which actual headings matched which requested sections
@@ -230,19 +242,22 @@ module Wp2txt
     end
     # Find canonical name for a heading (handles aliases)
+    # Supports bidirectional alias matching:
+    #   - Target is canonical name, heading matches an alias (e.g., target="plot" matches "Synopsis")
+    #   - Target is an alias, heading matches canonical or another alias in the same group
+    #     (e.g., target="synopsis" matches "Plot" or "Plot summary")
     # @param heading [String] The actual heading text from the article
     # @param record_match [Boolean] Whether to record the match for tracking
-    # @return [String, nil] The canonical (requested) section name, or nil
+    # @return [String, nil] The target section name as specified by the user, or nil
     def find_canonical_name(heading, record_match: true)
       return nil if heading.nil? || heading.empty?
       return nil if @targets.nil?
       heading_lower = heading.downcase.strip
-      # Direct match
+      # Direct match (target name == heading name)
       @targets.each do |target|
         if target.downcase == heading_lower
-          # Record direct match (only if heading differs in case)
           if @track_matches && record_match && target != heading
             @matched_sections[target] = heading
           end
@@ -250,20 +265,26 @@ module Wp2txt
         end
       end
-      # Alias match
       return nil unless @use_aliases
-      @aliases.each do |canonical, alias_list|
-        next unless @targets.any? { |t| t.downcase == canonical.downcase }
+      @targets.each do |target|
+        target_lower = target.downcase
-        if alias_list.any? { |a| a.downcase == heading_lower }
-          # Return the target that matches canonical
-          target = @targets.find { |t| t.downcase == canonical.downcase }
-          # Record alias match
-          if @track_matches && record_match && target
-            @matched_sections[target] = heading
+        @aliases.each do |canonical, alias_list|
+          canonical_lower = canonical.downcase
+          aliases_lower = alias_list.map(&:downcase)
+          all_names = [canonical_lower] + aliases_lower
+          # Check if this target belongs to this alias group
+          next unless all_names.include?(target_lower)
+          # Check if the heading matches any name in the same group
+          if all_names.include?(heading_lower) && target_lower != heading_lower
+            if @track_matches && record_match
+              @matched_sections[target] = heading
+            end
+            return target
           end
-          return target
         end
       end
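
The lookup introduced above can be read as: treat each canonical name plus its aliases as one group, and accept a heading when the target and the heading fall in the same group. The sketch below restates that logic as a standalone function so it can be run in isolation. `match_target` and the two-entry `ALIASES` table are assumptions for illustration, and match tracking (`@matched_sections`) is left out.

```ruby
# Abbreviated alias table for the example; the gem's DEFAULT_ALIASES is larger.
ALIASES = {
  "Plot" => ["Synopsis", "Plot summary", "Story"],
  "Reception" => ["Critical reception", "Reviews", "Critical response"]
}.freeze

# Hypothetical standalone version of the lookup. Returns the target as the
# caller spelled it, or nil when nothing matches.
def match_target(heading, targets, aliases: ALIASES, use_aliases: true)
  return nil if heading.nil? || heading.empty?

  heading_lower = heading.downcase.strip

  # Direct, case-insensitive match between a target and the heading.
  direct = targets.find { |t| t.downcase == heading_lower }
  return direct if direct
  return nil unless use_aliases

  # Group match: target and heading both belong to the same alias group.
  targets.each do |target|
    target_lower = target.downcase
    aliases.each do |canonical, alias_list|
      group = [canonical, *alias_list].map(&:downcase)
      return target if group.include?(target_lower) && group.include?(heading_lower)
    end
  end
  nil
end

puts match_target("Synopsis", ["Plot"]).inspect
puts match_target("Plot", ["Synopsis"]).inspect
puts match_target("Critical reception", ["Reviews"]).inspect
```

Returning the target rather than the canonical name keeps output keys identical to what the user passed in `--sections`, which matches the updated `@return` comment in the diff.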

data/lib/wp2txt/version.rb CHANGED

@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module Wp2txt
-  VERSION = "2.1.0"
+  VERSION = "2.1.1"
 end

data/spec/multistream_spec.rb CHANGED

@@ -239,13 +239,13 @@ RSpec.describe "Wp2txt Multistream" do
     describe "#initialize" do
       it "loads the index file" do
-        index = described_class.new(index_path)
+        index = described_class.new(index_path, cache_dir: temp_dir)
         expect(index.size).to eq(4)
       end
     end
     describe "#find_by_title" do
-      let(:index) { described_class.new(index_path) }
+      let(:index) { described_class.new(index_path, cache_dir: temp_dir) }
       it "finds article by exact title" do
         result = index.find_by_title("Article One")
@@ -268,7 +268,7 @@ RSpec.describe "Wp2txt Multistream" do
     end
     describe "#find_by_id" do
-      let(:index) { described_class.new(index_path) }
+      let(:index) { described_class.new(index_path, cache_dir: temp_dir) }
       it "finds article by page ID" do
         result = index.find_by_id(2)
@@ -283,7 +283,7 @@ RSpec.describe "Wp2txt Multistream" do
     end
     describe "#articles_in_stream" do
-      let(:index) { described_class.new(index_path) }
+      let(:index) { described_class.new(index_path, cache_dir: temp_dir) }
       it "returns articles at given byte offset" do
         articles = index.articles_in_stream(100)
@@ -298,7 +298,7 @@ RSpec.describe "Wp2txt Multistream" do
     end
     describe "#stream_offset_for" do
-      let(:index) { described_class.new(index_path) }
+      let(:index) { described_class.new(index_path, cache_dir: temp_dir) }
       it "returns byte offset for article" do
         offset = index.stream_offset_for("Article Three")
@@ -312,7 +312,7 @@ RSpec.describe "Wp2txt Multistream" do
     end
     describe "#random_articles" do
-      let(:index) { described_class.new(index_path) }
+      let(:index) { described_class.new(index_path, cache_dir: temp_dir) }
       it "returns requested number of random articles" do
         articles = index.random_articles(2)
@@ -326,7 +326,7 @@ RSpec.describe "Wp2txt Multistream" do
     end
     describe "#first_articles" do
-      let(:index) { described_class.new(index_path) }
+      let(:index) { described_class.new(index_path, cache_dir: temp_dir) }
       it "returns first N articles" do
         articles = index.first_articles(2)
@@ -335,7 +335,7 @@ RSpec.describe "Wp2txt Multistream" do
     end
     describe "#stream_offsets" do
-      let(:index) { described_class.new(index_path) }
+      let(:index) { described_class.new(index_path, cache_dir: temp_dir) }
       it "returns unique sorted offsets" do
         offsets = index.stream_offsets
@@ -447,7 +447,7 @@ RSpec.describe "Wp2txt Multistream" do
     describe "#initialize" do
       it "creates reader with paths" do
-        reader = described_class.new(multistream_path, index_path)
+        reader = described_class.new(multistream_path, index_path, cache_dir: temp_dir)
         expect(reader.multistream_path).to eq(multistream_path)
         expect(reader.index).to be_a(Wp2txt::MultistreamIndex)
       end
@@ -456,7 +456,7 @@ RSpec.describe "Wp2txt Multistream" do
     describe "#extract_article" do
       it "returns nil for non-existent article" do
         # Without actual bz2 file, can't extract, but should handle gracefully
-        reader = described_class.new(multistream_path, index_path)
+        reader = described_class.new(multistream_path, index_path, cache_dir: temp_dir)
         # Will return nil because file doesn't exist
         expect { reader.extract_article("Non Existent") }.not_to raise_error
       end
@@ -464,13 +464,13 @@ RSpec.describe "Wp2txt Multistream" do
     describe "#extract_articles_parallel" do
       it "handles empty titles array" do
-        reader = described_class.new(multistream_path, index_path)
+        reader = described_class.new(multistream_path, index_path, cache_dir: temp_dir)
         result = reader.extract_articles_parallel([], num_processes: 2)
         expect(result).to eq({})
       end
       it "handles titles not in index" do
-        reader = described_class.new(multistream_path, index_path)
+        reader = described_class.new(multistream_path, index_path, cache_dir: temp_dir)
         result = reader.extract_articles_parallel(["Non Existent"], num_processes: 2)
         expect(result).to eq({})
       end
@@ -478,13 +478,13 @@ RSpec.describe "Wp2txt Multistream" do
     describe "#each_article_parallel" do
       it "returns an enumerator when no block given" do
-        reader = described_class.new(multistream_path, index_path)
+        reader = described_class.new(multistream_path, index_path, cache_dir: temp_dir)
         result = reader.each_article_parallel([], num_processes: 2)
         expect(result).to be_an(Enumerator)
       end
       it "handles empty entries array" do
-        reader = described_class.new(multistream_path, index_path)
+        reader = described_class.new(multistream_path, index_path, cache_dir: temp_dir)
         pages = []
         reader.each_article_parallel([], num_processes: 2) { |page| pages << page }
         expect(pages).to eq([])
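
Every constructor call in this spec now passes `cache_dir: temp_dir`. The diff does not state the reason; one plausible motivation is keeping test runs from reading or writing a shared on-disk cache. The sketch below shows that isolation pattern with a stand-in class. `CachingIndex` and its cache-file naming are hypothetical and are not wp2txt's `MultistreamIndex`.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical stand-in for a class that persists a cache next to its data.
class CachingIndex
  attr_reader :cache_path

  # cache_dir defaults to a shared location; tests can override it.
  def initialize(index_path, cache_dir: File.join(Dir.tmpdir, "example_shared_cache"))
    FileUtils.mkdir_p(cache_dir)
    @cache_path = File.join(cache_dir, "#{File.basename(index_path)}.cache")
    File.write(@cache_path, "cached") unless File.exist?(@cache_path)
  end
end

# Dir.mktmpdir with a block removes the directory when the block exits,
# so cache files created during the run do not leak into later runs.
Dir.mktmpdir do |temp_dir|
  index = CachingIndex.new("/path/to/index.txt", cache_dir: temp_dir)
  puts index.cache_path.start_with?(temp_dir)
end
```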

data/spec/section_extractor_spec.rb CHANGED

@@ -161,6 +161,76 @@ RSpec.describe Wp2txt::SectionExtractor do
     end
   end
+  describe "bidirectional alias matching" do
+    let(:wiki_with_plot) do
+      <<~WIKI
+        Summary.
+        == Plot ==
+        The story begins...
+      WIKI
+    end
+    let(:plot_article) { Wp2txt::Article.new(wiki_with_plot, "Film") }
+    let(:wiki_with_synopsis) do
+      <<~WIKI
+        Summary.
+        == Synopsis ==
+        The story follows...
+      WIKI
+    end
+    let(:synopsis_article) { Wp2txt::Article.new(wiki_with_synopsis, "Movie") }
+    let(:wiki_with_reviews) do
+      <<~WIKI
+        Summary.
+        == Reviews ==
+        Critics praised...
+      WIKI
+    end
+    let(:reviews_article) { Wp2txt::Article.new(wiki_with_reviews, "Album") }
+    context "when target is canonical name" do
+      let(:extractor) { described_class.new(["Plot"]) }
+      it "matches alias heading (Synopsis)" do
+        sections = extractor.extract_sections(synopsis_article)
+        expect(sections["Plot"]).to include("story follows")
+      end
+    end
+    context "when target is an alias name" do
+      let(:extractor) { described_class.new(["Synopsis"]) }
+      it "matches canonical heading (Plot)" do
+        sections = extractor.extract_sections(plot_article)
+        expect(sections["Synopsis"]).to include("story begins")
+      end
+    end
+    context "when target is one alias and heading is another alias in the same group" do
+      let(:extractor) { described_class.new(["Reviews"]) }
+      it "matches Critical reception heading via shared alias group" do
+        wiki = "== Critical reception ==\nWell received."
+        art = Wp2txt::Article.new(wiki, "Work")
+        sections = extractor.extract_sections(art)
+        expect(sections["Reviews"]).to include("Well received")
+      end
+    end
+    context "when aliases are disabled" do
+      let(:extractor) { described_class.new(["Synopsis"], use_aliases: false) }
+      it "does not match canonical name (Plot)" do
+        sections = extractor.extract_sections(plot_article)
+        expect(sections["Synopsis"]).to be_nil
+      end
+    end
+  end
   describe "case-insensitive matching" do
     let(:extractor) { described_class.new(["early life", "CAREER"]) }

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: wp2txt
 version: !ruby/object:Gem::Version
-  version: 2.1.0
+  version: 2.1.1
 platform: ruby
 authors:
 - Yoichiro Hasebe