RubyGems - wp2txt - Versions diffs - 0.8.0 → 0.9.3 - Mend

wp2txt 0.8.0 → 0.9.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (17)

checksums.yaml +5 -5
data/README.md +48 -21
data/bin/benchmark.rb +5 -4
data/bin/wp2txt +67 -67
data/data/output_samples/testdata_en.txt +49076 -0
data/data/output_samples/testdata_en_categories.txt +824 -0
data/data/output_samples/testdata_ja.txt +9382 -0
data/data/output_samples/testdata_ja_categories.txt +188 -0
data/data/testdata_en.bz2 +0 -0
data/data/{testdata.bz2 → testdata_ja.bz2} +0 -0
data/lib/wp2txt/article.rb +33 -3
data/lib/wp2txt/utils.rb +44 -49
data/lib/wp2txt/version.rb +1 -1
data/lib/wp2txt.rb +67 -42
data/spec/utils_spec.rb +28 -16
data/wp2txt.gemspec +2 -1
metadata +27 -9

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
-SHA1:
-  metadata.gz: d0610b7e28e04c4cd9c3a1401c88e15f6ddb16ec
-  data.tar.gz: b866915631fdc956395c005735b089ddff7956e5
+SHA256:
+  metadata.gz: 32966949db257b30be7a5c044965ce08426bdede1f1fa0dbb0a276361d1c69c2
+  data.tar.gz: ee0b08031ae75b9d08fd1f07e5d08e8e10135dba8b15bd44692f7c434b220262
 SHA512:
-  metadata.gz: e9fbef3de5ed866de0b3c7fadd96bdf0ff501b71c2d9f6f282eed538194bdfff8d9659cf53aedf95062c9aadf2ec90393075158ef9c8ae78f3e53ce84119f764
-  data.tar.gz: 36ad316986d94a6be89ccb591dec510dc4695bede188448fde0745702faad8039d0df4a84d2f0730dd11749a63d58e6b63b51b2bce585d8f8ccb3ff02553c3c8
+  metadata.gz: 9dee99ed39d2da01c9aeda462291645533773ed537de06a1f9127d626c91bc421e92ef805c52764fd0a668cf104a3733ab5981e59bb35d4abebe2f2909c63e3f
+  data.tar.gz: 8c6d7a5841a47fa4643a229050be652e38879d03477568175e8b29fb6f77971633731ec2d5dbd429dc2499ecfc2bfbb58917389d89b9cd57743b86b97ceaa4af
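Note on the hunk above: the SHA1 entries are replaced by SHA256 ones, while SHA512 is kept. As a minimal sketch (not the RubyGems implementation), a digest of this kind could be checked with Ruby's standard `digest` library. The helper name `digest_matches?` is hypothetical and is not part of RubyGems or wp2txt.

```ruby
require 'digest'

# Hypothetical helper: true when the SHA256 hex digest of `bytes`
# equals `expected_hex` (hex case is ignored).
def digest_matches?(bytes, expected_hex)
  Digest::SHA256.hexdigest(bytes) == expected_hex.downcase
end

# Possible usage, assuming the archive has been unpacked locally and
# `expected` holds the SHA256 value listed for data.tar.gz:
#   digest_matches?(File.binread("data.tar.gz"), expected)
```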

data/README.md CHANGED

@@ -2,33 +2,54 @@
 Wikipedia dump file to text converter
-**Important: This is a project *work in progress* and it could be slow, unstable, and even destructive! Please use it with caution!**
+**IMPORTANT:** This is a project still work in progress and it could be slow, unstable, and even destructive! It should be used with caution.
-### About ###
+## About
 WP2TXT extracts plain text data from Wikipedia dump file (encoded in XML/compressed with Bzip2) stripping all the MediaWiki markups and other metadata. It is originally intended to be useful for researchers who look for an easy way to obtain open-source multi-lingual corpora, but may be handy for other purposes.
-### Features ###
+**UPDATE:** Version 0.9.3 has added a new option `category_only`. With this option enabled, wp2txt extracts article title and category info only. Please see output examples below.
-* Convert dump files of Wikipedia of various languages (I hope).
+## Features
+* Convert dump files of Wikipedia of various languages
 * Create output files of specified size.
-* Allow users to specify text elements to be extracted/converted (page titles, section titles, lists, and tables).
+* Allow users to specify text elements to be extracted/converted (page titles, section titles, lists, and tables)
+* Extract category information of each article
-### Installation
+## Installation
     $ gem install wp2txt
-### Usage
+## Usage
 Obtain a Wikipedia dump file (from [here](http://dumps.wikimedia.org/backup-index.html)) with a file name such as:
     xxwiki-yyyymmdd-pages-articles.xml.bz2
-where `xx` is language code such as "en (English)" or "ja (Japanese)", and  `yyyymmdd` is the date of creation (e.g. 20120601).
+where `xx` is language code such as "en (English)" or "", and  `yyyymmdd` is the date of creation (e.g. 20120601).
-Command line options are as follows:
+### Example 1
+The following extracts text data, including list items and excluding tables.
+    $ wp2txt -i xxwiki-yyyymmdd-pages-articles.xml.bz2 -o /output_dir
+- [Output example (English)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en.txt)
+- [Output example (Japanese)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja.txt)
+### Example 2
-**Important** Command line options in the current version have been drastically changed from previous versions.
+The following will extract only article titles and the categories to which each article belongs:
+    $ wp2txt -i xxwiki-yyyymmdd-pages-articles.xml.bz2 -o /output_dir
+- [Output example (English)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en_categories.txt)
+- [Output example (Japanese)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja_categories.txt)
+## Options
+Command line options are as follows:
     Usage: wp2txt [options]
     where [options] are:
@@ -40,39 +61,45 @@ Command line options are as follows:
           --list, --no-list, -l:   Show list items in output (default: true)
     --heading, --no-heading, -d:   Show section titles in output (default: true)
         --title, --no-title, -t:   Show page titles in output (default: true)
-                    --table, -a:   Show table source code in output
-                 --template, -e:   leave inline template notations unmodified
-                      --ref, -r:   leave reference notations in the format
+                    --table, -a:   Show table source code in output (default: false)
+                   --inline, -n:   leave inline template notations unmodified (default: false)
+                --multiline, -m:   leave multiline template notations unmodified (default: false)
+                      --ref, -r:   leave reference notations in the format (default: false)
                                    [ref]...[/ref]
-                     --redirect:   Show redirect destination
-      --marker, --no-marker, -m:   Show symbols prefixed to list items,
+                 --redirect, -e:   Show redirect destination (default: false)
+      --marker, --no-marker, -k:   Show symbols prefixed to list items,
                                    definitions, etc. (Default: true)
-                 --category, -g:   Show article category information
+                 --category, -g:   Show article category information (default: true)
+            --category-only, -y:   Extract only article title and categories (default: false)
             --file-size, -f <i>:   Approximate size (in MB) of each output file
                                    (default: 10)
+          -u, --num-threads=<i>:   Number of threads to be spawned (capped to the number of CPU cores;
+                                   set 99 to spawn max num of threads) (default: 4)
                   --version, -v:   Print version and exit
                      --help, -h:   Show this message
-### Caveats ###
+## Caveats
 * Certain types of data such as mathematical equations and computer source code are not be properly converted.  Please remember this software is originally intended for correcting “sentences” for linguistic studies.
 * Extraction of normal text data could sometimes fail for various reasons (e.g. illegal matching of begin/end tags, language-specific conventions of formatting, etc).
 * Conversion process can take far more than you would expect. It could take several hours or more when dealing with a huge data set such as the English Wikipedia on a low-spec environments.
 * Because of nature of the task, WP2TXT needs much machine power and consumes a lot of memory/storage resources. The process thus could halt unexpectedly. It may even get stuck, in the worst case, without getting gracefully terminated. Please understand this and use the software __at your own risk__.
-### Useful Link ###
+### Useful Links
 * [Wikipedia Database backup dumps](http://dumps.wikimedia.org/backup-index.html)
-### Author ###
+### Author
 * Yoichiro Hasebe (<yohasebe@gmail.com>)
-### References ###
+### References
+The author will appreciate your mentioning one of these in your research.
 * Yoichiro HASEBE. 2006. [Method for using Wikipedia as Japanese corpus.](http://ci.nii.ac.jp/naid/110006226727) _Doshisha Studies in Language and Culture_ 9(2), 373-403.
 * 長谷部陽一郎. 2006. [Wikipedia日本語版をコーパスとして用いた言語研究の手法](http://ci.nii.ac.jp/naid/110006226727). 『言語文化』9(2), 373-403.
-### License ###
+### License
 This software is distributed under the MIT License. Please see the LICENSE file.
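Note on the new `--num-threads` option in the README hunk above: it is described as "capped to the number of CPU cores", with 99 as a way to request the maximum. The gem's actual capping code is not part of this section, so the following is only a sketch of what such a cap might look like. The helper name `effective_threads` and the lower bound of one thread are assumptions.

```ruby
require 'etc'

# Hypothetical helper: cap a requested thread count to the number of cores.
# `cores` defaults to what the OS reports; at least one thread is always used
# (the lower bound is an assumption, not taken from wp2txt).
def effective_threads(requested, cores = Etc.nprocessors)
  [[requested, cores].min, 1].max
end
```

Under this sketch, asking for 99 threads on an 8-core machine yields 8, which matches the README's description of 99 as "spawn max num of threads".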

data/bin/benchmark.rb CHANGED

@@ -12,15 +12,16 @@ require 'benchmark'
 data_dir = File.join(File.dirname(__FILE__), '..', "data")
 parent = Wp2txt::CmdProgbar.new
-input_file = File.join(data_dir, "testdata.bz2")
+input_file = File.join(data_dir, "testdata_ja.bz2")
 output_dir = data_dir
 tfile_size = 10
+num_threads = 1
 convert = true
 strip_tmarker = true
 Benchmark.bm do |x|
   x.report do
-    wpconv = Wp2txt::Runner.new(parent, input_file, output_dir, tfile_size, convert, strip_tmarker)
+    wpconv = Wp2txt::Runner.new(parent, input_file, output_dir, tfile_size, num_threads, convert, strip_tmarker)
     wpconv.extract_text do |article|
       format_wiki!(article.title)
       title = "[[#{article.title}]]\n"
@@ -58,11 +59,11 @@ Benchmark.bm do |x|
         end
         contents << line
       end
-      format_article!(contents)
+      format_wiki!(contents)
       convert_characters!(contents)
       ##### cleanup #####
-      if /\A\s*\z/m =~ contents
+      if /\A[\s　]*\z/m =~ contents
         result = ""
       else
         result = title + "\n" + contents
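Note on the cleanup regex in the hunk above: in Ruby, `\s` matches only ASCII whitespace, so the old pattern did not treat content made of full-width (ideographic) spaces as blank. The new pattern adds that character to the class. The sketch below illustrates the difference. It writes the character as `\u3000` for readability, and the constant names are chosen here for illustration only.

```ruby
# Old and new blank-content checks, as in the hunk above; \u3000 is the
# ideographic (full-width) space that the new pattern adds.
OLD_BLANK = /\A\s*\z/m
NEW_BLANK = /\A[\s\u3000]*\z/m

sample = "\u3000\n" # a full-width space followed by a newline

old_hit = !!(OLD_BLANK =~ sample) # old pattern does not see this as blank
new_hit = !!(NEW_BLANK =~ sample) # new pattern does
```

This matters mainly for the Japanese test data, where full-width spaces are common.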

data/bin/wp2txt CHANGED

@@ -4,18 +4,18 @@
 $: << File.join(File.dirname(__FILE__))
 $: << File.join(File.dirname(__FILE__), '..', 'lib')
-DEBUG_MODE = true
+$DEBUG_MODE = false
 SHAREDIR = File.join(File.dirname(__FILE__), '..', 'share')
 DOCDIR = File.join(File.dirname(__FILE__), '..', 'doc')
 require 'wp2txt'
 require 'wp2txt/utils'
 require 'wp2txt/version'
-require 'trollop'
+require 'optimist'
 include Wp2txt
-opts = Trollop::options do
+opts = Optimist::options do
 	version Wp2txt::VERSION
 	banner <<-EOS
 WP2TXT extracts plain text data from Wikipedia dump file (encoded in XML/compressed with Bzip2) stripping all the MediaWiki markups and other metadata.
@@ -31,37 +31,40 @@ EOS
   opt :heading, "Show section titles in output", :default => true, :short => "-d"
   opt :title,   "Show page titles in output", :default => true
   opt :table,   "Show table source code in output", :default => false
-  opt :template, "leave inline template notations unmodified", :default => false
+  opt :inline, "leave inline template notations as they are", :default => false
+  opt :multiline, "leave multiline template notations as they are", :default => false
   opt :ref, "leave reference notations in the format [ref]...[/ref]", :default => false
   opt :redirect, "Show redirect destination", :default => false
   opt :marker, "Show symbols prefixed to list items, definitions, etc.", :default => true
-  opt :category, "Show article category information", :default => false
+  opt :category, "Show article category information", :default => true
+  opt :category_only, "Extract only article title and categories", :default => false
   opt :file_size,   "Approximate size (in MB) of each output file", :default => 10
+  opt :num_threads,   "Number of threads to be spawned (capped to the number of CPU cores; set 99 to spawn max num of threads)", :default => 4
 end
-Trollop::die :size, "must be larger than 0" unless opts[:file_size] >= 0
-Trollop::die :output_dir, "must exist" unless File.exist?(opts[:output_dir])
+Optimist::die :size, "must be larger than 0" unless opts[:file_size] >= 0
+Optimist::die :output_dir, "must exist" unless File.exist?(opts[:output_dir])
 input_file = ARGV[0]
 output_dir = opts[:output_dir]
 tfile_size = opts[:file_size]
+num_threads = opts[:num_threads]
 convert = opts[:convert]
 strip_tmarker = opts[:marker] ? false : true
-opt_array = [:title, :list, :heading, :table, :redirect]
-$leave_template = true if opts[:template]
-$leave_table = true if opts[:table]
+opt_array = [:title, :list, :heading, :table, :redirect, :multiline]
+$leave_inline_template = true if opts[:inline]
 $leave_ref = true if opts[:ref]
+# $leave_table = true if opts[:table]
 config = {}
 opt_array.each do |opt|
   config[opt] = opts[opt]
 end
 parent = Wp2txt::CmdProgbar.new
-wpconv = Wp2txt::Runner.new(parent, input_file, output_dir, tfile_size, convert, strip_tmarker)
+wpconv = Wp2txt::Runner.new(parent, input_file, output_dir, tfile_size, num_threads, convert, strip_tmarker)
 wpconv.extract_text do |article|
   format_wiki!(article.title)
   title = "[[#{article.title}]]\n"
-  convert_characters!(title)
   if opts[:category] && !article.categories.empty?
     contents = "\nCATEGORIES: "
@@ -71,67 +74,64 @@ wpconv.extract_text do |article|
     contents = ""
   end
-  article.elements.each do |e|
-    case e.first
-    when :mw_heading
-      next if !config[:heading]
-      format_wiki!(e.last)
-      format_article!(e.last)
-      line = e.last
-      line << "+HEADING+" if $DEBUG_MODE
-    when :mw_paragraph
-      # next if !config[:paragraph]
-      format_wiki!(e.last)
-      format_article!(e.last)
-      line = e.last + "\n"
-      line << "+PARAGRAPH+" if $DEBUG_MODE
-    when :mw_table, :mw_htable
-      next if !config[:table]
-      # format_wiki!(e.last)
-      line = e.last
-      line << "+TABLE+" if $DEBUG_MODE
-    when :mw_pre
-      next if !config[:pre]
-      line = e.last
-      line << "+PRE+" if $DEBUG_MODE
-    when :mw_quote
-      # next if !config[:quote]
-      # format_wiki!(e.last)
-      line = e.last
-      line << "+QUOTE+" if $DEBUG_MODE
-    when :mw_unordered, :mw_ordered, :mw_definition
-      next if !config[:list]
-      # format_wiki!(e.last)
-      line = e.last
-      line << "+LIST+" if $DEBUG_MODE
-    when :mw_redirect
-      next if !config[:redirect]
-      # format_wiki!(e.last)
-      line = e.last
-      line << "+REDIRECT+" if $DEBUG_MODE
-      line << "\n\n"
-    else
-      if $DEBUG_MODE
-        # format_wiki!(e.last)
+  unless opts[:category_only]
+    article.elements.each do |e|
+      case e.first
+      when :mw_heading
+        next if !config[:heading]
+        format_wiki!(e.last)
         line = e.last
-        line << "+OTHER+"
-      else
+        line << "+HEADING+" if $DEBUG_MODE
+      when :mw_paragraph
+        format_wiki!(e.last)
+        line = e.last + "\n"
+        line << "+PARAGRAPH+" if $DEBUG_MODE
+      when :mw_table, :mw_htable
+        next if !config[:table]
+        line = e.last
+        line << "+TABLE+" if $DEBUG_MODE
+      when :mw_pre
+        next if !config[:pre]
+        line = e.last
+        line << "+PRE+" if $DEBUG_MODE
+      when :mw_quote
+        line = e.last
+        line << "+QUOTE+" if $DEBUG_MODE
+      when :mw_unordered, :mw_ordered, :mw_definition
+        next if !config[:list]
+        line = e.last
+        line << "+LIST+" if $DEBUG_MODE
+      when :mw_ml_template
+        next if !config[:multiline]
+        line = e.last
+        line << "+MLTEMPLATE+" if $DEBUG_MODE
+      when :mw_redirect
+        next if !config[:redirect]
+        line = e.last
+        line << "+REDIRECT+" if $DEBUG_MODE
+        line << "\n\n"
+      when :mw_isolated_template
+        next if !config[:multiline]
+        line = e.last
+        line << "+ISOLATED_TEMPLATE+" if $DEBUG_MODE
+      when :mw_isolated_tag
         next
+      else
+        if $DEBUG_MODE
+          # format_wiki!(e.last)
+          line = e.last
+          line << "+OTHER+"
+        else
+          next
+        end
       end
+      contents << line << "\n"
     end
-    contents << line
   end
-  convert_characters!(contents)
-  remove_table!(contents) unless $leave_table
-  remove_ref!(contents) unless $leave_ref
-  ##### cleanup #####
-  if /\A\s*\z/m =~ contents
+  if /\A[\s　]*\z/m =~ contents
     result = ""
   else
-    result = config[:title] ? title + "\n" << contents : contents
+    result = config[:title] ? "\n#{title}\n" << contents : contents
   end
-  result.gsub!(/\[ref\]\s*\[\/ref\]/m){""}
-  result.gsub!(/\n\n\n+/m){"\n\n"}
-  result << "\n"
-end
+end
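Note on the rewritten loop above: each element is a `[type, text]` pair, most types are gated by a config flag, and the whole body is skipped when `category_only` is set. The following is a simplified, self-contained sketch of that gating, not the gem's code. The names `GATES`, `ALWAYS`, and `render` are hypothetical. It leaves out the `format_wiki!` calls, the debug markers, the `:mw_pre` branch, the extra newlines added for paragraphs and redirects, and the category line that the real script still emits.

```ruby
# Element types that are kept only when the mapped config flag is truthy.
GATES = {
  mw_heading: :heading, mw_table: :table, mw_htable: :table,
  mw_unordered: :list, mw_ordered: :list, mw_definition: :list,
  mw_ml_template: :multiline, mw_isolated_template: :multiline,
  mw_redirect: :redirect
}.freeze

# Element types that are always kept (no flag is checked in the diff).
ALWAYS = [:mw_paragraph, :mw_quote].freeze

# Build the article body from [type, text] pairs; unknown types are dropped,
# mirroring the non-debug `else` branch.
def render(elements, config, category_only: false)
  return "" if category_only
  elements.each_with_object(+"") do |(type, text), out|
    keep = ALWAYS.include?(type) || (GATES.key?(type) && config[GATES[type]])
    out << text << "\n" if keep
  end
end
```

A table of gates is used here only to keep the sketch short; the script itself uses an explicit `case` statement.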