wp2txt 1.0.0 → 1.0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (9)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: a15462742cc2912a4dca9e0e4e42e90af4b8f9e09ea29584da94946d0a563872
-  data.tar.gz: 0c63c91b90883b4ed69199ef569c7bd467aece538bb1de1f8e7d632e710d6964
+  metadata.gz: d33a41cf46688679a14eb8c3eb16f6ed33ce9175c7f5b566c9f87998ba2c8401
+  data.tar.gz: 7371e0f7b06b2f0846f01d66f461c7e106778adc6e686919302f0f29b1f80a9e
 SHA512:
-  metadata.gz: 22f5c61c0ff6d11cd2c0155ad77940e9b618aea1354826a7b8fc5155289b42daff159be6c48f3f038c8df08753731cad623561cbd8055a10a12ce7feae0566ca
-  data.tar.gz: 9b286a09211576f5a397e3e2e46fefbedbf9e95d200f3393b030ede106c9b543fb800c73d3d958ddc5dccad1ba2a30f0b99700af05eef88b142e90c8603e9699
+  metadata.gz: cab8d9c27989387acc6dbbe052029d2205508ce10e38b8eedc111c822328d8eba551d603020684cbb3844a87b747f261a5959f711267acd96a3b97ccef4f6834
+  data.tar.gz: 4de59be37d57ef3d14ae2304660e8dde069bdf645a7cff862026562b26327984f1be13840e9d6ec1f25110222367f71c84a0286b649d71fec0c13805c6b0a647
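
The values above are the digests recorded for the `metadata.gz` and `data.tar.gz` members of the gem archive. As a minimal sketch of how such a value could be recomputed locally with Ruby's standard `digest` library (the helper name `sha256_of` and the local path `data.tar.gz` are illustrative assumptions, not part of wp2txt or RubyGems):

```ruby
require "digest"

# Hypothetical helper: hex-encoded SHA256 digest of the file at `path`
def sha256_of(path)
  Digest::SHA256.file(path).hexdigest
end

# SHA256 value for data.tar.gz as listed for 1.0.1 in the diff above
expected = "7371e0f7b06b2f0846f01d66f461c7e106778adc6e686919302f0f29b1f80a9e"

# Assumed: data.tar.gz was extracted from the .gem file (a tar archive) beforehand
path = "data.tar.gz"
if File.exist?(path)
  puts sha256_of(path) == expected ? "checksum matches" : "checksum mismatch"
end
```

The same pattern should work for the SHA512 entries by substituting `Digest::SHA512`.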

data/README.md CHANGED

@@ -1,26 +1,28 @@
 <img src='https://raw.githubusercontent.com/yohasebe/wp2txt/master/image/wp2txt-logo.svg' width="400" />
-Text conversion tool to extract content and category data from Wikipedia dump files
+A command-line toolkit to extract text content and category data from Wikipedia dump files
 ## About
-WP2TXT extracts plain text data from Wikipedia dump files (encoded in XML / compressed with Bzip2), removing all MediaWiki markup and other metadata.
+WP2TXT extracts text and category data from Wikipedia dump files (encoded in XML / compressed with Bzip2), removing MediaWiki markup and other metadata.
 **UPDATE (August 2022)**
 1. A new option `--category-only` has been added. When this option is enabled, only the title and category information of the article is extracted.
-2. A new option `--summary-only` has been added. If this option is enabled, only the title and text data from the opening paragraphs of the article (= summary) will be extracted.
-3. The current WP2TXT is *several times faster* than the previous version due to parallel processing of multiple files (the rate of speedup depends on the CPU cores used for processing).
+2. A new option `--summary-only` has been added. If this option is enabled, only the title, category information, and opening paragraphs of the article will be extracted.
+3. Text conversion with the current version of WP2TXT is *more than 2x times faster* than the previous version due to parallel processing of multiple files (the rate of speedup depends on the CPU cores used for processing).
 ## Screenshot
 <img src='https://raw.githubusercontent.com/yohasebe/wp2txt/master/image/screenshot.png' width="700" />
-- WP2TXT 1.0.0
-- MacBook Pro (2019) 2.3GHz 8Core Intel Core i9
-- enwiki-20220802-pages-articles.xml.bz2 (approx. 20GB)
+**Environment**
-In the above environment, the process (decompression, splitting, extraction, and conversion) to obtain the plain text data of the English Wikipedia takes a little over two hours.
+- WP2TXT 1.0.1
+- MacBook Pro (2021 Apple M1 Pro)
+- enwiki-20220720-pages-articles.xml.bz2 (19.98 GB)
+In the above environment, the process (decompression, splitting, extraction, and conversion) to obtain the plain text data of the English Wikipedia takes less than 1.5 hours.
 ## Features
@@ -30,23 +32,45 @@ In the above environment, the process (decompression, splitting, extraction, and
 - Allows extracting category information of the article
 - Allows extracting opening paragraphs of the article
+## Preparation
+### For MacOS / Linux/ WSL2
+WP2TXT requires that one of the following commands be installed on the system in order to decompress `bz2` files:
+- `lbzip2` (recommended)
+- `pbzip2`
+- `bzip2`
+In most cases, the `bzip2` command is pre-installed on the system. However, since `lbzip2` can use multiple CPU cores and is faster than `bzip2`, it is recommended that you install it additionally. WP2TXT will attempt to find the decompression command available on your system in the order listed above.
+If you are using MacOS with Homebrew installed, you can install `lbzip2` with the following command:
+    $ brew install lbzip2
+### For Windows
+Install [Bzip2 for Windows](http://gnuwin32.sourceforge.net/packages/bzip2.htm) and set the path so that WP2TXT can use the bunzip2.exe command. Alternatively, you can extract the Wikipedia dump file in your own way and process the resulting XML file with WP2TXT.
 ## Installation
+### WP2TXT command
     $ gem install wp2txt
-## Preparation
+## Wikipedia Dump File
-First, download the latest Wikipedia dump file for the language of your choice.
+Download the latest Wikipedia dump file for the desired language at a URL such as
-    https://dumps.wikimedia.org/xxwiki/latest/xxwiki-latest-pages-articles.xml.bz2
+    https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
-where `xx` is language code such as `en` (English) or `zh` (Chinese). Change it to `ja`, for instance, if you want the latest Japanese Wikipedia dump file.
+Here, `enwiki` refers to the English Wikipedia. To get the Japanese Wikipedia dump file, for instance, change this to jawiki (Japanese). In doing so, note that there are two instances of `enwiki` in the URL above.
 Alternatively, you can also select Wikipedia dump files created on a specific date from [here](http://dumps.wikimedia.org/backup-index.html). Make sure to download a file named in the following format:
     xxwiki-yyyymmdd-pages-articles.xml.bz2
-where `xx` is language code such as `en` (English)" or `ko` (Korean), and  `yyyymmdd` is the date of creation (e.g. `20220801`).
+where `xx` is language code such as `en` (English)" or `ja` (japanese), and  `yyyymmdd` is the date of creation (e.g. `20220801`).
 ## Basic Usage
@@ -124,7 +148,7 @@ Command line options are as follows:
       -g, --category-only              Extract only article title and categories
       -s, --summary-only               Extract only article title, categories, and summary text before first heading
       -f, --file-size=<i>              Approximate size (in MB) of each output file (default: 10)
-      -n, --num-procs                  Number of proccesses to be run concurrently (default: max num of CPU cores minus two)
+      -n, --num-procs                  Number of proccesses to be run concurrently (default: max num of available CPU cores minus two)
       -x, --del-interfile              Delete intermediate XML files from output dir
       -t, --title, --no-title          Keep page titles in output (default: true)
       -d, --heading, --no-heading      Keep section titles in output (default: true)
@@ -132,6 +156,7 @@ Command line options are as follows:
       -r, --ref                        Keep reference notations in the format [ref]...[/ref]
       -e, --redirect                   Show redirect destination
       -m, --marker, --no-marker        Show symbols prefixed to list items, definitions, etc. (Default: true)
+      -b, --bz2-gem                    Use Ruby's bzip2-ruby gem instead of a system command
       -v, --version                    Print version and exit
       -h, --help                       Show this message
@@ -156,6 +181,17 @@ The author will appreciate your mentioning one of these in your research.
 * Yoichiro HASEBE. 2006. [Method for using Wikipedia as Japanese corpus.](http://ci.nii.ac.jp/naid/110006226727) _Doshisha Studies in Language and Culture_ 9(2), 373-403.
 * 長谷部陽一郎. 2006. [Wikipedia日本語版をコーパスとして用いた言語研究の手法](http://ci.nii.ac.jp/naid/110006226727). 『言語文化』9(2), 373-403.
+Or use this BibTeX entry:
+```
+@misc{WP2TXT_2022,
+  author = {Yoichiro Hasebe},
+  title = {WP2TXT: A command-line toolkit to extract text content and category data from Wikipedia dump files},
+  url = {https://github.com/yohasebe/wp2txt}
+  year = {2022},
+}
+```
 ## License
 This software is distributed under the MIT License. Please see the LICENSE file.
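
The revised README notes that the wiki name appears twice in the "latest" dump URL. A small sketch of building that URL from a language code, so both occurrences stay in sync (the helper name `latest_dump_url` is a hypothetical illustration, not something wp2txt provides):

```ruby
# Hypothetical helper: build the "latest" pages-articles dump URL for a language code.
# The URL pattern is the one quoted in the README hunk above.
def latest_dump_url(lang)
  wiki = "#{lang}wiki"
  "https://dumps.wikimedia.org/#{wiki}/latest/#{wiki}-latest-pages-articles.xml.bz2"
end

puts latest_dump_url("ja")
# prints https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2
```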

data/bin/wp2txt CHANGED

@@ -43,6 +43,7 @@ EOS
   opt :ref, "Keep reference notations in the format [ref]...[/ref]", :default => false, :short => "-r"
   opt :redirect, "Show redirect destination", :default => false, :short => "-e"
   opt :marker, "Show symbols prefixed to list items, definitions, etc.", :default => true, :short => "-m"
+  opt :bz2_gem, "Use Ruby's bzip2-ruby gem instead of a system command", :default => false, :short => "-b"
 end
 Optimist::die :size, "must be larger than 0" unless opts[:file_size] >= 0
@@ -72,7 +73,8 @@ opt_array = [:title,
              :category,
              :category_only,
              :summary_only,
-             :del_interfile]
+             :del_interfile,
+             :bz2_gem ]
 $leave_inline_template = true if opts[:inline]
 $leave_ref = true if opts[:ref]
@@ -90,11 +92,15 @@ else
   puts "Decompressing and splitting the original dump file."
   puts pastel.underline("This may take a while. Please be patient!")
+  time_start = Time.now.to_i
+  wpsplitter = Wp2txt::Splitter.new(input_file, output_dir, tfile_size)
   spinner = TTY::Spinner.new(":spinner", format: :arrow_pulse, hide_cursor: true, interval: 5)
   spinner.auto_spin
-  wpsplitter = Wp2txt::Splitter.new(input_file, output_dir, tfile_size)
   wpsplitter.split_file
-  spinner.stop(pastel.blue.bold("Done!")) # Stop animation
+  time_finish = Time.now.to_i
+  spinner.stop("Time: #{sec_to_str(time_finish - time_start)}")# Stop animation
+  puts pastel.blue.bold("Complete!")
   exit if !convert
   input_files = Dir.glob("#{output_dir}/*.xml")
 end
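
The hunk above times the split step with `Time.now.to_i` and passes the difference to `sec_to_str`, whose definition is not part of this diff. As a rough sketch of what such a formatter might look like, assuming an `HH:MM:SS` output (the name `format_elapsed` and the output format are assumptions, not the gem's actual implementation):

```ruby
# Hypothetical stand-in for a seconds-to-string formatter; renders HH:MM:SS
def format_elapsed(seconds)
  hours, rest = seconds.divmod(3600)
  minutes, secs = rest.divmod(60)
  format("%02d:%02d:%02d", hours, minutes, secs)
end

time_start = Time.now.to_i
# ... long-running work would go here ...
time_finish = Time.now.to_i
puts "Time: #{format_elapsed(time_finish - time_start)}"
```

For example, `format_elapsed(3725)` returns `"01:02:05"`.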

data/image/screenshot.png CHANGED

Binary file

data/lib/wp2txt/utils.rb CHANGED

@@ -476,8 +476,4 @@ module Wp2txt
     return str
   end
-  def decimal_format(i)
-    str = i.to_s.reverse
-    return str.scan(/.?.?./).join(',').reverse
-  end
 end

data/lib/wp2txt/version.rb CHANGED

@@ -1,3 +1,3 @@
 module Wp2txt
-  VERSION = "1.0.0"
+  VERSION = "1.0.1"
 end

data/lib/wp2txt.rb CHANGED

@@ -7,26 +7,22 @@ require "nokogiri"
 require "wp2txt/article"
 require "wp2txt/utils"
-begin
-  require "bzip2-ruby"
-  NO_BZ2 = false
-rescue LoadError
-  # in case bzip2-ruby gem is not available
-  NO_BZ2 = true
-end
 module Wp2txt
   class Splitter
     include Wp2txt
-    def initialize(input_file, output_dir = ".", tfile_size = 10)
+    def initialize(input_file, output_dir = ".", tfile_size = 10, bz2_gem = false)
       @fp = nil
       @input_file = input_file
       @output_dir = output_dir
       @tfile_size = tfile_size
-      prepare
+      if bz2_gem
+        require "bzip2-ruby"
+      end
+      @bz2_gem = bz2_gem
+      prepare
     end
-    def file_size(file)
+    def file_size(file)
       origin = Time.now
       size = 0;  unit = 10485760; star = 0; before = Time.now.to_f
       error_count = 10
@@ -36,7 +32,7 @@ module Wp2txt
         rescue => e
           a = nil
         end
-        break unless a
+        break unless a
         present = Time.now.to_f
         size += a.size
@@ -44,12 +40,29 @@ module Wp2txt
           star = 0 if star > 10
           star += 1
           before = present
-        end
+        end
       end
       time_elapsed = Time.now - origin
       size
     end
+    # check if a given command exists: return the path if it does, return false if not
+    def command_exist?(command)
+      basename = File.basename(command)
+      path = ""
+      print "Checking #{basename}: "
+      if open("| which #{command} 2>/dev/null"){ |f| path = f.gets.strip }
+        puts "detected [#{path}]"
+        return path.strip
+      elsif open("| which #{basename} 2>/dev/null"){ |f| path = f.gets.strip }
+        puts "detected [#{path}]"
+        return path.strip
+      else
+        puts "not found"
+        return false
+      end
+    end
     # check the size of input file (bz2 or plain xml) when decompressed
     def prepare
       # if output_dir is not specified, output in the same directory
@@ -58,31 +71,31 @@ module Wp2txt
         @output_dir = File.dirname(@input_file)
       end
-      # if input file is bz2 compressed, use bz2-ruby if available,
-      # use command line bzip2 program otherwise.
       if /.bz2$/ =~ @input_file
-        unless NO_BZ2
+        if @bz2_gem
           file = Bzip2::Reader.new File.open(@input_file, "r:UTF-8")
+        elsif RUBY_PLATFORM.index("win32")
+          file = IO.popen("bunzip2.exe -c #{@input_file}")
         else
-          if RUBY_PLATFORM.index("win32")
-            file = IO.popen("bunzip2.exe -c #{@input_file}")
-          else
-            file = IO.popen("bzip2 -c -d #{@input_file}")
+          if bzpath = command_exist?("lbzip2") ||
+                      command_exist?("pbzip2") ||
+                      command_exist?("bzip2")
+            file = IO.popen("#{bzpath} -c -d #{@input_file}")
           end
-        end
+        end
       else # meaning that it is a text file
         @infile_size = File.stat(@input_file).size
         file = open(@input_file)
       end
       #create basename of output file
-      @outfile_base = File.basename(@input_file, ".*") + "-"
+      @outfile_base = File.basename(@input_file, ".*") + "-"
       @total_size = 0
       @file_index = 1
       outfilename = File.join(@output_dir, @outfile_base + @file_index.to_s)
       @outfiles = []
       @outfiles << outfilename
-      @fp = File.open(outfilename, "w")
+      @fp = File.open(outfilename, "w")
       @file_pointer = file
       return true
     end
@@ -100,7 +113,7 @@ module Wp2txt
         # temp_buf is filled with text split by "\n"
         temp_buf = []
         ss = StringScanner.new(new_lines)
-        while ss.scan(/.*?\n/m)
+        while ss.scan(/.*?\n/m)
           temp_buf << ss[0]
         end
         temp_buf << ss.rest unless ss.eos?
@@ -122,16 +135,16 @@ module Wp2txt
     end
     def get_newline
-      @buffer ||= [""]
+      @buffer ||= [""]
       if @buffer.size == 1
         return nil unless fill_buffer
       end
       if @buffer.empty?
         return nil
-      else
+      else
         new_line = @buffer.shift
         return new_line
-      end
+      end
     end
     def split_file
@@ -145,7 +158,7 @@ module Wp2txt
         output_text << text
         end_flag = true if @total_size > (@tfile_size * 1024 * 1024)
         # never close the file until the end of the page even if end_flag is on
-        if end_flag && /<\/page/ =~ text
+        if end_flag && /<\/page/ =~ text
           @fp.puts(output_text)
           output_text = ""
           @total_size = 0
@@ -159,15 +172,15 @@ module Wp2txt
         end
       end
       @fp.puts(output_text) if output_text != ""
-      @fp.close
+      @fp.close
       if File.size(outfilename) == 0
-        File.delete(outfilename)
+        File.delete(outfilename)
         @outfiles.delete(outfilename)
       end
-      rename(@outfiles, "xml")
-    end
+      rename(@outfiles, "xml")
+    end
   end
   class Runner
@@ -181,7 +194,7 @@ module Wp2txt
       @del_interfile = del_interfile
       prepare
     end
     def prepare
       @infile_size = File.stat(@input_file).size
       file = open(@input_file)
@@ -203,7 +216,7 @@ module Wp2txt
         # temp_buf is filled with text split by "\n"
         temp_buf = []
         ss = StringScanner.new(new_lines)
-        while ss.scan(/.*?\n/m)
+        while ss.scan(/.*?\n/m)
           temp_buf << ss[0]
         end
         temp_buf << ss.rest unless ss.eos?
@@ -225,16 +238,16 @@ module Wp2txt
     end
     def get_newline
-      @buffer ||= [""]
+      @buffer ||= [""]
       if @buffer.size == 1
         return nil unless fill_buffer
       end
       if @buffer.empty?
         return nil
-      else
+      else
         new_line = @buffer.shift
         return new_line
-      end
+      end
     end
     def get_page
@@ -270,7 +283,7 @@ module Wp2txt
       pages = []
       data_empty = false
-      while !data_empty
+      while !data_empty
         page = get_page
         if page
           pages << page

data/wp2txt.gemspec CHANGED

@@ -7,9 +7,9 @@ Gem::Specification.new do |s|
   s.version     = Wp2txt::VERSION
   s.authors     = ["Yoichiro Hasebe"]
   s.email       = ["yohasebe@gmail.com"]
-  s.homepage    = "http://github.com/yohasebe/wp2txt"
-  s.summary     = %q{Wikipedia dump to text converter}
-  s.description = %q{WP2TXT extracts plain text data from Wikipedia dump file (encoded in XML/compressed with Bzip2) stripping all the MediaWiki markups and other metadata.}
+  s.homepage    = "https://github.com/yohasebe/wp2txt"
+  s.summary     = %q{A command-line toolkit to extract text content and category data from Wikipedia dump files}
+  s.description = %q{WP2TXT extracts text and category data from Wikipedia dump files (encoded in XML / compressed with Bzip2), removing MediaWiki markup and other metadata.}
   s.rubyforge_project = "wp2txt"

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: wp2txt
 version: !ruby/object:Gem::Version
-  version: 1.0.0
+  version: 1.0.1
 platform: ruby
 authors:
 - Yoichiro Hasebe
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2022-08-09 00:00:00.000000000 Z
+date: 2022-08-11 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: nokogiri
@@ -108,8 +108,8 @@ dependencies:
     - - ">="
       - !ruby/object:Gem::Version
         version: '0'
-description: WP2TXT extracts plain text data from Wikipedia dump file (encoded in
-  XML/compressed with Bzip2) stripping all the MediaWiki markups and other metadata.
+description: WP2TXT extracts text and category data from Wikipedia dump files (encoded
+  in XML / compressed with Bzip2), removing MediaWiki markup and other metadata.
 email:
 - yohasebe@gmail.com
 executables:
@@ -141,7 +141,7 @@ files:
 - spec/spec_helper.rb
 - spec/utils_spec.rb
 - wp2txt.gemspec
-homepage: http://github.com/yohasebe/wp2txt
+homepage: https://github.com/yohasebe/wp2txt
 licenses: []
 metadata: {}
 post_install_message:
@@ -159,10 +159,11 @@ required_rubygems_version: !ruby/object:Gem::Requirement
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubygems_version: 3.3.3
+rubygems_version: 3.3.7
 signing_key:
 specification_version: 4
-summary: Wikipedia dump to text converter
+summary: A command-line toolkit to extract text content and category data from Wikipedia
+  dump files
 test_files:
 - spec/spec_helper.rb
 - spec/utils_spec.rb