RubyGems - wp2txt - Versions diffs - 1.0.0 → 1.0.2 - Mend

wp2txt 1.0.0 → 1.0.2


Files changed (10)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: a15462742cc2912a4dca9e0e4e42e90af4b8f9e09ea29584da94946d0a563872
-  data.tar.gz: 0c63c91b90883b4ed69199ef569c7bd467aece538bb1de1f8e7d632e710d6964
+  metadata.gz: bb540f4f17f7825786d110245c235ac556e3e64cedb17efae3e0591887425801
+  data.tar.gz: 479c357f7ba117ae10d9a5a04d24ce3aca2e54d942a156b02eb932c1aab55c8b
 SHA512:
-  metadata.gz: 22f5c61c0ff6d11cd2c0155ad77940e9b618aea1354826a7b8fc5155289b42daff159be6c48f3f038c8df08753731cad623561cbd8055a10a12ce7feae0566ca
-  data.tar.gz: 9b286a09211576f5a397e3e2e46fefbedbf9e95d200f3393b030ede106c9b543fb800c73d3d958ddc5dccad1ba2a30f0b99700af05eef88b142e90c8603e9699
+  metadata.gz: 940d47d2c8bce06029fe76e3b3744563d089e26e297e5224b36e65d815295da57117eae84cbb43abeddf2f2c052e2a987d668cba52c7af6148e935b571b6d403
+  data.tar.gz: 8ce76523a3bf181ac7a5da11f088dd14cfb1e1d7ac0d5239832db52968d183db16a3ece6074513b634eebe0e5ca28ceea945eaef6542ecb1933266caf4e89a3c
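
The recorded digests can be compared against a locally unpacked gem. The following is a minimal sketch, assuming `data.tar.gz` has already been extracted from the `.gem` archive into the working directory; the helper name `checksum_matches?` is hypothetical and not part of RubyGems.

```ruby
require "digest"

# Hypothetical helper: compare a file's SHA256 digest with an expected hex string.
def checksum_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex.downcase
end

# SHA256 value listed for data.tar.gz in the 1.0.2 checksums.yaml above.
expected = "479c357f7ba117ae10d9a5a04d24ce3aca2e54d942a156b02eb932c1aab55c8b"

# Assumption: data.tar.gz sits in the current directory; skip quietly otherwise.
if File.exist?("data.tar.gz")
  puts checksum_matches?("data.tar.gz", expected) ? "SHA256 OK" : "SHA256 mismatch"
end
```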

data/README.md CHANGED

@@ -1,26 +1,34 @@
 <img src='https://raw.githubusercontent.com/yohasebe/wp2txt/master/image/wp2txt-logo.svg' width="400" />
-Text conversion tool to extract content and category data from Wikipedia dump files
+A command-line toolkit to extract text content and category data from Wikipedia dump files
 ## About
-WP2TXT extracts plain text data from Wikipedia dump files (encoded in XML / compressed with Bzip2), removing all MediaWiki markup and other metadata.
+WP2TXT extracts text and category data from Wikipedia dump files (encoded in XML / compressed with Bzip2), removing MediaWiki markup and other metadata.
-**UPDATE (August 2022)**
+## Changelog
-1. A new option `--category-only` has been added. When this option is enabled, only the title and category information of the article is extracted.
-2. A new option `--summary-only` has been added. If this option is enabled, only the title and text data from the opening paragraphs of the article (= summary) will be extracted.
-3. The current WP2TXT is *several times faster* than the previous version due to parallel processing of multiple files (the rate of speedup depends on the CPU cores used for processing).
+**November 2022**
+- Code added to suppress "Invalid byte sequence error" when an illegal UTF-8 character is input.
+**August 2022**
+- A new option `--category-only` has been added. When this option is enabled, only the title and category information of the article is extracted.
+- A new option `--summary-only` has been added. If this option is enabled, only the title, category information, and opening paragraphs of the article will be extracted.
+- Text conversion with the current version of WP2TXT is *more than twice as fast* as with the previous version due to parallel processing of multiple files (the rate of speedup depends on the number of CPU cores used for processing).
 ## Screenshot
-<img src='https://raw.githubusercontent.com/yohasebe/wp2txt/master/image/screenshot.png' width="700" />
+<img src='https://raw.githubusercontent.com/yohasebe/wp2txt/master/image/screenshot.png' width="800" />
-- WP2TXT 1.0.0
-- MacBook Pro (2019) 2.3GHz 8Core Intel Core i9
-- enwiki-20220802-pages-articles.xml.bz2 (approx. 20GB)
+**Environment**
-In the above environment, the process (decompression, splitting, extraction, and conversion) to obtain the plain text data of the English Wikipedia takes a little over two hours.
+- WP2TXT 1.0.1
+- MacBook Pro (2021 Apple M1 Pro)
+- enwiki-20220720-pages-articles.xml.bz2 (19.98 GB)
+In the above environment, the process (decompression, splitting, extraction, and conversion) to obtain the plain text data of the English Wikipedia takes less than 1.5 hours.
 ## Features
@@ -30,23 +38,45 @@ In the above environment, the process (decompression, splitting, extraction, and
 - Allows extracting category information of the article
 - Allows extracting opening paragraphs of the article
+## Preparation
+### For macOS and Linux
+WP2TXT requires that one of the following commands be installed on the system in order to decompress `bz2` files:
+- `lbzip2` (recommended)
+- `pbzip2`
+- `bzip2`
+In most cases, the `bzip2` command is pre-installed on the system. However, since `lbzip2` can use multiple CPU cores and is faster than `bzip2`, it is recommended that you install it additionally. WP2TXT will attempt to find the decompression command available on your system in the order listed above.
+If you are using macOS with Homebrew installed, you can install `lbzip2` with the following command:
+    $ brew install lbzip2
+### For Windows
+Install [Bzip2 for Windows](http://gnuwin32.sourceforge.net/packages/bzip2.htm) and set the path so that WP2TXT can use the bunzip2.exe command. Alternatively, you can extract the Wikipedia dump file in your own way and process the resulting XML file with WP2TXT.
 ## Installation
+### WP2TXT command
     $ gem install wp2txt
-## Preparation
+## Wikipedia Dump File
-First, download the latest Wikipedia dump file for the language of your choice.
+Download the latest Wikipedia dump file for the desired language at a URL such as
-    https://dumps.wikimedia.org/xxwiki/latest/xxwiki-latest-pages-articles.xml.bz2
+    https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
-where `xx` is language code such as `en` (English) or `zh` (Chinese). Change it to `ja`, for instance, if you want the latest Japanese Wikipedia dump file.
+Here, `enwiki` refers to the English Wikipedia. To get the Japanese Wikipedia dump file, for instance, change this to `jawiki`. Note that `enwiki` appears twice in the URL above, and both instances must be changed.
 Alternatively, you can also select Wikipedia dump files created on a specific date from [here](http://dumps.wikimedia.org/backup-index.html). Make sure to download a file named in the following format:
     xxwiki-yyyymmdd-pages-articles.xml.bz2
-where `xx` is language code such as `en` (English)" or `ko` (Korean), and  `yyyymmdd` is the date of creation (e.g. `20220801`).
+where `xx` is a language code such as `en` (English) or `ja` (Japanese), and `yyyymmdd` is the date of creation (e.g. `20220801`).
 ## Basic Usage
@@ -124,7 +154,7 @@ Command line options are as follows:
       -g, --category-only              Extract only article title and categories
       -s, --summary-only               Extract only article title, categories, and summary text before first heading
       -f, --file-size=<i>              Approximate size (in MB) of each output file (default: 10)
-      -n, --num-procs                  Number of processes to be run concurrently (default: max num of CPU cores minus two)
+      -n, --num-procs                  Number of processes to be run concurrently (default: max num of available CPU cores minus two)
       -x, --del-interfile              Delete intermediate XML files from output dir
       -t, --title, --no-title          Keep page titles in output (default: true)
       -d, --heading, --no-heading      Keep section titles in output (default: true)
@@ -132,6 +162,7 @@ Command line options are as follows:
       -r, --ref                        Keep reference notations in the format [ref]...[/ref]
       -e, --redirect                   Show redirect destination
       -m, --marker, --no-marker        Show symbols prefixed to list items, definitions, etc. (Default: true)
+      -b, --bz2-gem                    Use Ruby's bzip2-ruby gem instead of a system command
       -v, --version                    Print version and exit
       -h, --help                       Show this message
@@ -156,6 +187,17 @@ The author will appreciate your mentioning one of these in your research.
 * Yoichiro HASEBE. 2006. [Method for using Wikipedia as Japanese corpus.](http://ci.nii.ac.jp/naid/110006226727) _Doshisha Studies in Language and Culture_ 9(2), 373-403.
 * 長谷部陽一郎. 2006. [Wikipedia日本語版をコーパスとして用いた言語研究の手法](http://ci.nii.ac.jp/naid/110006226727). 『言語文化』9(2), 373-403.
+Or use this BibTeX entry:
+```
+@misc{wp2txt_2022,
+  author = {Yoichiro Hasebe},
+  title = {WP2TXT: A command-line toolkit to extract text content and category data from Wikipedia dump files},
+  url = {https://github.com/yohasebe/wp2txt},
+  year = {2022}
+}
+```
 ## License
 This software is distributed under the MIT License. Please see the LICENSE file.
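
The README's note that the wiki name appears twice in the "latest" dump URL can be illustrated with a short sketch. The helper name `dump_url` is hypothetical and is not part of WP2TXT; it only reproduces the URL pattern quoted in the README.

```ruby
# Hypothetical helper: build the "latest" dump URL for a wiki name such as
# "enwiki" or "jawiki". The name is substituted in both places it occurs.
def dump_url(wiki)
  "https://dumps.wikimedia.org/#{wiki}/latest/#{wiki}-latest-pages-articles.xml.bz2"
end

puts dump_url("jawiki")
```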

data/bin/wp2txt CHANGED

@@ -43,6 +43,7 @@ EOS
   opt :ref, "Keep reference notations in the format [ref]...[/ref]", :default => false, :short => "-r"
   opt :redirect, "Show redirect destination", :default => false, :short => "-e"
   opt :marker, "Show symbols prefixed to list items, definitions, etc.", :default => true, :short => "-m"
+  opt :bz2_gem, "Use Ruby's bzip2-ruby gem instead of a system command", :default => false, :short => "-b"
 end
 Optimist::die :size, "must be larger than 0" unless opts[:file_size] >= 0
@@ -72,7 +73,8 @@ opt_array = [:title,
              :category,
              :category_only,
              :summary_only,
-             :del_interfile]
+             :del_interfile,
+             :bz2_gem ]
 $leave_inline_template = true if opts[:inline]
 $leave_ref = true if opts[:ref]
@@ -90,11 +92,15 @@ else
   puts "Decompressing and splitting the original dump file."
   puts pastel.underline("This may take a while. Please be patient!")
+  time_start = Time.now.to_i
+  wpsplitter = Wp2txt::Splitter.new(input_file, output_dir, tfile_size)
   spinner = TTY::Spinner.new(":spinner", format: :arrow_pulse, hide_cursor: true, interval: 5)
   spinner.auto_spin
-  wpsplitter = Wp2txt::Splitter.new(input_file, output_dir, tfile_size)
   wpsplitter.split_file
-  spinner.stop(pastel.blue.bold("Done!")) # Stop animation
+  time_finish = Time.now.to_i
+  spinner.stop("Time: #{sec_to_str(time_finish - time_start)}")# Stop animation
+  puts pastel.blue.bold("Complete!")
   exit if !convert
   input_files = Dir.glob("#{output_dir}/*.xml")
 end
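
The hunk above times the split step with `Time.now.to_i` and prints the result through `sec_to_str`, whose body is not shown in this diff. As a rough sketch of the same idea, the code below measures a block with the monotonic clock and formats the result; `format_elapsed` and `timed` are hypothetical names, and the `HH:MM:SS` format is an assumption rather than the gem's actual output.

```ruby
# Hypothetical stand-in for sec_to_str: format whole seconds as HH:MM:SS.
def format_elapsed(seconds)
  hours, rest = seconds.divmod(3600)
  minutes, secs = rest.divmod(60)
  format("%02d:%02d:%02d", hours, minutes, secs)
end

# Run a block and return its result together with the elapsed whole seconds.
# The monotonic clock is unaffected by system clock adjustments.
def timed
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  elapsed = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - start).round
  [result, elapsed]
end

result, elapsed = timed { (1..1000).sum }
puts "Result: #{result}"
puts "Time: #{format_elapsed(elapsed)}"
```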

data/image/screenshot.png CHANGED

Binary file

data/lib/wp2txt/utils.rb CHANGED

@@ -41,7 +41,7 @@ $in_table_regex2 = Regexp.new('^\|\}.*?$')
 $in_unordered_regex  = Regexp.new('^\*')
 $in_ordered_regex    = Regexp.new('^\#')
 $in_pre_regex = Regexp.new('^ ')
-$in_definition_regex  = Regexp.new('^[\;\:]')
+$in_definition_regex  = Regexp.new('^[\;\:]')
 $blank_line_regex = Regexp.new('^\s*$')
 $redirect_regex = Regexp.new('#(?:REDIRECT|転送)\s+\[\[(.+)\]\]', Regexp::IGNORECASE)
 $remove_tag_regex = Regexp.new("\<[^\<\>]*\>")
@@ -98,11 +98,12 @@ $cleanup_regex_08 = Regexp.new('\n\n\n+', Regexp::MULTILINE)
 module Wp2txt
   def convert_characters!(text, has_retried = false)
-    begin
-      text << ""
+    begin
+      text << ""
       chrref_to_utf!(text)
       special_chr!(text)
+      text.encode!("UTF-8", "UTF-8", invalid: :replace, replace: "")
     rescue # detect invalid byte sequence in UTF-8
       if has_retried
         puts "invalid byte sequence detected"
@@ -112,20 +113,20 @@ module Wp2txt
         end
         exit
       else
-        text.encode!("UTF-16")
-        text.encode!("UTF-8")
+        text.encode!("UTF-16", "UTF-16", invalid: :replace, replace: "")
+        text.encode!("UTF-16", "UTF-16", invalid: :replace, replace: "")
         convert_characters!(text, true)
       end
     end
   end
   def format_wiki!(text, has_retried = false)
     remove_complex!(text)
     escape_nowiki!(text)
     process_interwiki_links!(text)
     process_external_links!(text)
-    unescape_nowiki!(text)
+    unescape_nowiki!(text)
     remove_directive!(text)
     remove_emphasis!(text)
     mndash!(text)
@@ -135,7 +136,7 @@ module Wp2txt
     remove_templates!(text) unless $leave_inline_template
     remove_table!(text) unless $leave_table
   end
   def cleanup!(text)
     text.gsub!($cleanup_regex_01){""}
     text.gsub!($cleanup_regex_02){""}
@@ -150,7 +151,7 @@ module Wp2txt
   end
   #################### parser for nested structure ####################
   def process_nested_structure(scanner, left, right, &block)
     test = false
     buffer = ""
@@ -195,7 +196,7 @@ module Wp2txt
     rescue => e
       return scanner.string
     end
-  end
+  end
   #################### methods used from format_wiki ####################
   def escape_nowiki!(str)
@@ -218,11 +219,11 @@ module Wp2txt
       @nowikis[obj_id]
     end
   end
   def process_interwiki_links!(str)
     scanner = StringScanner.new(str)
     result = process_nested_structure(scanner, "[[", "]]") do |contents|
-      parts = contents.split("|")
+      parts = contents.split("|")
       case parts.size
       when 1
         parts.first || ""
@@ -265,7 +266,7 @@ module Wp2txt
     end
     str.replace(result)
   end
   def remove_table!(str)
     scanner = StringScanner.new(str)
     result = process_nested_structure(scanner, "{|", "|}") do |contents|
@@ -273,7 +274,7 @@ module Wp2txt
     end
     str.replace(result)
   end
   def special_chr!(str)
     str.replace $html_decoder.decode(str)
   end
@@ -316,7 +317,7 @@ module Wp2txt
     end
     return true
   end
   def mndash!(str)
     str.gsub!($mndash_regex, "–")
   end
@@ -347,7 +348,7 @@ module Wp2txt
     str.gsub!($complex_regex_04){""}
     str.gsub!($complex_regex_05){""}
   end
   def make_reference!(str)
     str.gsub!($make_reference_regex_a){"\n"}
     str.gsub!($make_reference_regex_b){""}
@@ -413,7 +414,7 @@ module Wp2txt
     File.rename(file_path, file_path + ".bak")
     File.rename("temp", file_path)
     File.unlink(file_path + ".bak") unless backup
-  end
+  end
   # modify files under a directry (recursive)
   def batch_file_mod(dir_path, &block)
@@ -421,7 +422,7 @@ module Wp2txt
       collect_files(dir_path).each do |file|
         yield file if FileTest.file?(file)
       end
-    else
+    else
       yield dir_path if FileTest.file?(dir_path)
     end
   end
@@ -445,9 +446,9 @@ module Wp2txt
     end
   end
-  def rename(files, ext = "txt")
+  def rename(files, ext = "txt")
     # num of digits necessary to name the last file generated
-    maxwidth = 0
+    maxwidth = 0
     files.each do |f|
       width = f.slice(/\-(\d+)\z/, 1).to_s.length.to_i
@@ -476,8 +477,4 @@ module Wp2txt
     return str
   end
-  def decimal_format(i)
-    str = i.to_s.reverse
-    return str.scan(/.?.?./).join(',').reverse
-  end
 end
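
The `convert_characters!` change above drops invalid byte sequences by re-encoding with `invalid: :replace`. A comparable effect can be sketched with `String#scrub` (available since Ruby 2.1). This is an illustration of the technique, not the gem's code, and the helper name `strip_invalid_utf8` is hypothetical.

```ruby
# Hypothetical helper: return a copy of the text, tagged as UTF-8,
# with every invalid byte sequence removed.
def strip_invalid_utf8(text)
  text.dup.force_encoding(Encoding::UTF_8).scrub("")
end

# A binary string holding valid UTF-8 plus one stray byte (\xFF).
raw = "caf\xC3\xA9 \xFFbar".b
clean = strip_invalid_utf8(raw)

puts clean
puts clean.valid_encoding?
```

Unlike the in-place `encode!` call in the diff, this sketch leaves the input string untouched and returns a new one.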

data/lib/wp2txt/version.rb CHANGED

@@ -1,3 +1,3 @@
 module Wp2txt
-  VERSION = "1.0.0"
+  VERSION = "1.0.2"
 end

data/lib/wp2txt.rb CHANGED

@@ -7,26 +7,22 @@ require "nokogiri"
 require "wp2txt/article"
 require "wp2txt/utils"
-begin
-  require "bzip2-ruby"
-  NO_BZ2 = false
-rescue LoadError
-  # in case bzip2-ruby gem is not available
-  NO_BZ2 = true
-end
 module Wp2txt
   class Splitter
     include Wp2txt
-    def initialize(input_file, output_dir = ".", tfile_size = 10)
+    def initialize(input_file, output_dir = ".", tfile_size = 10, bz2_gem = false)
       @fp = nil
       @input_file = input_file
       @output_dir = output_dir
       @tfile_size = tfile_size
-      prepare
+      if bz2_gem
+        require "bzip2-ruby"
+      end
+      @bz2_gem = bz2_gem
+      prepare
     end
-    def file_size(file)
+    def file_size(file)
       origin = Time.now
       size = 0;  unit = 10485760; star = 0; before = Time.now.to_f
       error_count = 10
@@ -36,7 +32,7 @@ module Wp2txt
         rescue => e
           a = nil
         end
-        break unless a
+        break unless a
         present = Time.now.to_f
         size += a.size
@@ -44,12 +40,29 @@ module Wp2txt
           star = 0 if star > 10
           star += 1
           before = present
-        end
+        end
       end
       time_elapsed = Time.now - origin
       size
     end
+    # check if a given command exists: return the path if it does, return false if not
+    def command_exist?(command)
+      basename = File.basename(command)
+      path = ""
+      print "Checking #{basename}: "
+      if open("| which #{command} 2>/dev/null"){ |f| path = f.gets.strip }
+        puts "detected [#{path}]"
+        return path.strip
+      elsif open("| which #{basename} 2>/dev/null"){ |f| path = f.gets.strip }
+        puts "detected [#{path}]"
+        return path.strip
+      else
+        puts "not found"
+        return false
+      end
+    end
     # check the size of input file (bz2 or plain xml) when decompressed
     def prepare
       # if output_dir is not specified, output in the same directory
@@ -58,31 +71,31 @@ module Wp2txt
         @output_dir = File.dirname(@input_file)
       end
-      # if input file is bz2 compressed, use bz2-ruby if available,
-      # use command line bzip2 program otherwise.
       if /.bz2$/ =~ @input_file
-        unless NO_BZ2
+        if @bz2_gem
           file = Bzip2::Reader.new File.open(@input_file, "r:UTF-8")
+        elsif RUBY_PLATFORM.index("win32")
+          file = IO.popen("bunzip2.exe -c #{@input_file}")
         else
-          if RUBY_PLATFORM.index("win32")
-            file = IO.popen("bunzip2.exe -c #{@input_file}")
-          else
-            file = IO.popen("bzip2 -c -d #{@input_file}")
+          if bzpath = command_exist?("lbzip2") ||
+                      command_exist?("pbzip2") ||
+                      command_exist?("bzip2")
+            file = IO.popen("#{bzpath} -c -d #{@input_file}")
           end
-        end
+        end
       else # meaning that it is a text file
         @infile_size = File.stat(@input_file).size
         file = open(@input_file)
       end
       #create basename of output file
-      @outfile_base = File.basename(@input_file, ".*") + "-"
+      @outfile_base = File.basename(@input_file, ".*") + "-"
       @total_size = 0
       @file_index = 1
       outfilename = File.join(@output_dir, @outfile_base + @file_index.to_s)
       @outfiles = []
       @outfiles << outfilename
-      @fp = File.open(outfilename, "w")
+      @fp = File.open(outfilename, "w")
       @file_pointer = file
       return true
     end
@@ -100,7 +113,7 @@ module Wp2txt
         # temp_buf is filled with text split by "\n"
         temp_buf = []
         ss = StringScanner.new(new_lines)
-        while ss.scan(/.*?\n/m)
+        while ss.scan(/.*?\n/m)
           temp_buf << ss[0]
         end
         temp_buf << ss.rest unless ss.eos?
@@ -122,16 +135,16 @@ module Wp2txt
     end
     def get_newline
-      @buffer ||= [""]
+      @buffer ||= [""]
       if @buffer.size == 1
         return nil unless fill_buffer
       end
       if @buffer.empty?
         return nil
-      else
+      else
         new_line = @buffer.shift
         return new_line
-      end
+      end
     end
     def split_file
@@ -145,7 +158,7 @@ module Wp2txt
         output_text << text
         end_flag = true if @total_size > (@tfile_size * 1024 * 1024)
         # never close the file until the end of the page even if end_flag is on
-        if end_flag && /<\/page/ =~ text
+        if end_flag && /<\/page/ =~ text
           @fp.puts(output_text)
           output_text = ""
           @total_size = 0
@@ -159,15 +172,15 @@ module Wp2txt
         end
       end
       @fp.puts(output_text) if output_text != ""
-      @fp.close
+      @fp.close
       if File.size(outfilename) == 0
-        File.delete(outfilename)
+        File.delete(outfilename)
         @outfiles.delete(outfilename)
       end
-      rename(@outfiles, "xml")
-    end
+      rename(@outfiles, "xml")
+    end
   end
   class Runner
@@ -181,7 +194,7 @@ module Wp2txt
       @del_interfile = del_interfile
       prepare
     end
     def prepare
       @infile_size = File.stat(@input_file).size
       file = open(@input_file)
@@ -203,7 +216,7 @@ module Wp2txt
         # temp_buf is filled with text split by "\n"
         temp_buf = []
         ss = StringScanner.new(new_lines)
-        while ss.scan(/.*?\n/m)
+        while ss.scan(/.*?\n/m)
           temp_buf << ss[0]
         end
         temp_buf << ss.rest unless ss.eos?
@@ -225,16 +238,16 @@ module Wp2txt
     end
     def get_newline
-      @buffer ||= [""]
+      @buffer ||= [""]
       if @buffer.size == 1
         return nil unless fill_buffer
       end
       if @buffer.empty?
         return nil
-      else
+      else
         new_line = @buffer.shift
         return new_line
-      end
+      end
     end
     def get_page
@@ -270,7 +283,7 @@ module Wp2txt
       pages = []
       data_empty = false
-      while !data_empty
+      while !data_empty
         page = get_page
         if page
           pages << page

data/tags ADDED

@@ -0,0 +1,58 @@
+!_TAG_FILE_FORMAT	2	/extended format; --format=1 will not append ;" to lines/
+!_TAG_FILE_SORTED	1	/0=unsorted, 1=sorted, 2=foldcase/
+!_TAG_PROGRAM_AUTHOR	Darren Hiebert	/dhiebert@users.sourceforge.net/
+!_TAG_PROGRAM_NAME	Exuberant Ctags	//
+!_TAG_PROGRAM_URL	http://ctags.sourceforge.net	/official site/
+!_TAG_PROGRAM_VERSION	5.8	//
+Article	lib/wp2txt/article.rb	/^  class Article$/;"	c	class:Wp2txt
+Runner	lib/wp2txt.rb	/^  class Runner$/;"	c	class:Wp2txt.Splitter.file_size
+Splitter	lib/wp2txt.rb	/^  class Splitter$/;"	c	class:Wp2txt
+Wp2txt	lib/wp2txt.rb	/^module Wp2txt$/;"	m
+Wp2txt	lib/wp2txt/article.rb	/^module Wp2txt$/;"	m
+Wp2txt	lib/wp2txt/utils.rb	/^module Wp2txt$/;"	m
+Wp2txt	lib/wp2txt/version.rb	/^module Wp2txt$/;"	m
+batch_file_mod	lib/wp2txt/utils.rb	/^  def batch_file_mod(dir_path, &block)$/;"	f
+chrref_to_utf!	lib/wp2txt/utils.rb	/^  def chrref_to_utf!(num_str)$/;"	f
+cleanup!	lib/wp2txt/utils.rb	/^  def cleanup!(text)$/;"	f
+collect_files	lib/wp2txt/utils.rb	/^  def collect_files(str, regex = nil)$/;"	f
+command_exist?	lib/wp2txt.rb	/^    def command_exist?(command)$/;"	f	class:Wp2txt.Splitter.file_size
+convert_characters!	lib/wp2txt/utils.rb	/^  def convert_characters!(text, has_retried = false)$/;"	f	class:Wp2txt
+correct_inline_template!	lib/wp2txt/utils.rb	/^  def correct_inline_template!(str)$/;"	f
+correct_separator	lib/wp2txt/utils.rb	/^  def correct_separator(input)$/;"	f
+create_element	lib/wp2txt/article.rb	/^    def create_element(tp, text)$/;"	f	class:Wp2txt.Article
+escape_nowiki!	lib/wp2txt/utils.rb	/^  def escape_nowiki!(str)$/;"	f
+extract_text	lib/wp2txt.rb	/^    def extract_text(&block)$/;"	f	class:Wp2txt.Splitter.file_size.Runner.fill_buffer
+file_mod	lib/wp2txt/utils.rb	/^  def file_mod(file_path, backup = false, &block)$/;"	f
+file_size	lib/wp2txt.rb	/^    def file_size(file)$/;"	f	class:Wp2txt.Splitter
+fill_buffer	lib/wp2txt.rb	/^    def fill_buffer$/;"	f	class:Wp2txt.Splitter.file_size
+fill_buffer	lib/wp2txt.rb	/^    def fill_buffer$/;"	f	class:Wp2txt.Splitter.file_size.Runner
+format_wiki!	lib/wp2txt/utils.rb	/^  def format_wiki!(text, has_retried = false)$/;"	f
+get_newline	lib/wp2txt.rb	/^    def get_newline$/;"	f	class:Wp2txt.Splitter.file_size.Runner.fill_buffer
+get_newline	lib/wp2txt.rb	/^    def get_newline$/;"	f	class:Wp2txt.Splitter.file_size.fill_buffer
+get_page	lib/wp2txt.rb	/^    def get_page$/;"	f	class:Wp2txt.Splitter.file_size.Runner.fill_buffer
+initialize	lib/wp2txt.rb	/^    def initialize(input_file, output_dir = ".", strip_tmarker = false, del_interfile = true)$/;"	f	class:Wp2txt.Splitter.file_size.Runner
+initialize	lib/wp2txt.rb	/^    def initialize(input_file, output_dir = ".", tfile_size = 10, bz2_gem = false)$/;"	f	class:Wp2txt.Splitter
+initialize	lib/wp2txt/article.rb	/^    def initialize(text, title = "", strip_tmarker = false)$/;"	f	class:Wp2txt.Article
+make_reference!	lib/wp2txt/utils.rb	/^  def make_reference!(str)$/;"	f
+mndash!	lib/wp2txt/utils.rb	/^  def mndash!(str)$/;"	f
+parse	lib/wp2txt/article.rb	/^    def parse(source)$/;"	f	class:Wp2txt.Article
+prepare	lib/wp2txt.rb	/^    def prepare$/;"	f	class:Wp2txt.Splitter.file_size
+prepare	lib/wp2txt.rb	/^    def prepare$/;"	f	class:Wp2txt.Splitter.file_size.Runner
+process_external_links!	lib/wp2txt/utils.rb	/^  def process_external_links!(str)$/;"	f
+process_interwiki_links!	lib/wp2txt/utils.rb	/^  def process_interwiki_links!(str)$/;"	f
+process_nested_structure	lib/wp2txt/utils.rb	/^  def process_nested_structure(scanner, left, right, &block)$/;"	f
+remove_complex!	lib/wp2txt/utils.rb	/^  def remove_complex!(str)$/;"	f
+remove_directive!	lib/wp2txt/utils.rb	/^  def remove_directive!(str)$/;"	f
+remove_emphasis!	lib/wp2txt/utils.rb	/^  def remove_emphasis!(str)$/;"	f
+remove_hr!	lib/wp2txt/utils.rb	/^  def remove_hr!(str)$/;"	f
+remove_html!	lib/wp2txt/utils.rb	/^  def remove_html!(str)$/;"	f
+remove_inbetween!	lib/wp2txt/utils.rb	/^  def remove_inbetween!(str, tagset = ['<', '>'])$/;"	f
+remove_ref!	lib/wp2txt/utils.rb	/^  def remove_ref!(str)$/;"	f
+remove_table!	lib/wp2txt/utils.rb	/^  def remove_table!(str)$/;"	f
+remove_tag!	lib/wp2txt/utils.rb	/^  def remove_tag!(str)$/;"	f
+remove_templates!	lib/wp2txt/utils.rb	/^  def remove_templates!(str)$/;"	f
+rename	lib/wp2txt/utils.rb	/^  def rename(files, ext = "txt")$/;"	f
+sec_to_str	lib/wp2txt/utils.rb	/^  def sec_to_str(int)$/;"	f
+special_chr!	lib/wp2txt/utils.rb	/^  def special_chr!(str)$/;"	f
+split_file	lib/wp2txt.rb	/^    def split_file$/;"	f	class:Wp2txt.Splitter.file_size.fill_buffer
+unescape_nowiki!	lib/wp2txt/utils.rb	/^  def unescape_nowiki!(str)$/;"	f

data/wp2txt.gemspec CHANGED

@@ -7,9 +7,9 @@ Gem::Specification.new do |s|
   s.version     = Wp2txt::VERSION
   s.authors     = ["Yoichiro Hasebe"]
   s.email       = ["yohasebe@gmail.com"]
-  s.homepage    = "http://github.com/yohasebe/wp2txt"
-  s.summary     = %q{Wikipedia dump to text converter}
-  s.description = %q{WP2TXT extracts plain text data from Wikipedia dump file (encoded in XML/compressed with Bzip2) stripping all the MediaWiki markups and other metadata.}
+  s.homepage    = "https://github.com/yohasebe/wp2txt"
+  s.summary     = %q{A command-line toolkit to extract text content and category data from Wikipedia dump files}
+  s.description = %q{WP2TXT extracts text and category data from Wikipedia dump files (encoded in XML / compressed with Bzip2), removing MediaWiki markup and other metadata.}
   s.rubyforge_project = "wp2txt"

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: wp2txt
 version: !ruby/object:Gem::Version
-  version: 1.0.0
+  version: 1.0.2
 platform: ruby
 authors:
 - Yoichiro Hasebe
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2022-08-09 00:00:00.000000000 Z
+date: 2022-11-25 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: nokogiri
@@ -108,8 +108,8 @@ dependencies:
     - - ">="
       - !ruby/object:Gem::Version
         version: '0'
-description: WP2TXT extracts plain text data from Wikipedia dump file (encoded in
-  XML/compressed with Bzip2) stripping all the MediaWiki markups and other metadata.
+description: WP2TXT extracts text and category data from Wikipedia dump files (encoded
+  in XML / compressed with Bzip2), removing MediaWiki markup and other metadata.
 email:
 - yohasebe@gmail.com
 executables:
@@ -140,8 +140,9 @@ files:
 - lib/wp2txt/version.rb
 - spec/spec_helper.rb
 - spec/utils_spec.rb
+- tags
 - wp2txt.gemspec
-homepage: http://github.com/yohasebe/wp2txt
+homepage: https://github.com/yohasebe/wp2txt
 licenses: []
 metadata: {}
 post_install_message:
@@ -162,7 +163,8 @@ requirements: []
 rubygems_version: 3.3.3
 signing_key:
 specification_version: 4
-summary: Wikipedia dump to text converter
+summary: A command-line toolkit to extract text content and category data from Wikipedia
+  dump files
 test_files:
 - spec/spec_helper.rb
 - spec/utils_spec.rb