RubyGems - wp2txt - Versions diffs - 0.7.8 → 0.9.2 - Mend

wp2txt 0.7.8 → 0.9.2

Files changed (16)

checksums.yaml +5 -5
data/README.md +11 -6
data/bin/benchmark.rb +5 -4
data/bin/wp2txt +29 -30
data/data/output_samples/testdata_en.txt +49076 -0
data/data/output_samples/testdata_ja.txt +9382 -0
data/data/testdata_en.bz2 +0 -0
data/data/{testdata.bz2 → testdata_ja.bz2} +0 -0
data/lib/wp2txt/article.rb +34 -4
data/lib/wp2txt/utils.rb +50 -53
data/lib/wp2txt/version.rb +1 -1
data/lib/wp2txt.rb +69 -75
data/spec/utils_spec.rb +28 -16
data/wp2txt.gemspec +2 -1
metadata +25 -10
data/error_log.txt +0 -1

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
-SHA1:
-  metadata.gz: ee8448d2dc341c9f26a613522c0b9a225b62a7df
-  data.tar.gz: 036aa5184a19b4351c65af605f2ebc23b9e73398
+SHA256:
+  metadata.gz: 3ed3d7e29a8f1c6b5f97ca0da646ddfb53ae88add38f647eae0bdc03e626269e
+  data.tar.gz: '009188addebcd908f449f2ce4cf39036406f3816cafeeb61beba097fe036e890'
 SHA512:
-  metadata.gz: 05dd0bd2462bc72f030c0bd03233e359d1febdb4b30ad1309f4baf35ab6241684d164269ae1bae527163da787188d915ccb7ab460d83cd83732fbf9627d7ada1
-  data.tar.gz: 2bc83d1854656a4b3a83e6a2e1b9cfe86c86163d27a64582f994fc997b8104e4ab28d8d28881c054e323fd69934c53b63909cd7458a8d2ed0243c95702f8a14e
+  metadata.gz: d91531685df204222ab7bae9b3153653d61ccd36270e36f14575cabc3c2b1d6009bfa15f9033cb8eeb837f7c1a97fdb6303611166ec62ca96b9e4c8fc1e1ec15
+  data.tar.gz: 19183feee7eb8f7c03d3f7bf60eebb7e75ffeb6c6eec6967a8c3e480f82f2b48b6e171d2aa22c7aa44a9336b981ad51dfd37ab423c3db2fe1a0d854860c37231
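
Note on the hunk above: the checksum file replaces its SHA1 entries with SHA256 ones and keeps SHA512. As a minimal sketch of how such a hex digest can be compared against a payload, the snippet below uses Ruby's standard `digest` library. The helper name `sha256_matches?` and the demo constant are assumptions made for illustration; they are not part of wp2txt or RubyGems.

```ruby
require "digest"

# Hypothetical helper (not part of wp2txt or RubyGems): compare a payload
# against an expected SHA256 hex digest such as those in checksums.yaml.
def sha256_matches?(bytes, expected_hex)
  Digest::SHA256.hexdigest(bytes) == expected_hex.downcase
end

# SHA256 of the empty string, a well-known constant used here only as a demo.
EMPTY_SHA256 = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"

puts sha256_matches?("", EMPTY_SHA256)          # matching payload
puts sha256_matches?("tampered", EMPTY_SHA256)  # altered payload
```

For a file on disk, `Digest::SHA256.file(path).hexdigest` yields the same kind of hex string without reading the whole file into memory first.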

data/README.md CHANGED

@@ -2,12 +2,14 @@
 Wikipedia dump file to text converter
-**Important** This is a project *work in progress* and it could be slow, unstable, and even destructive! Please use it with caution
+**IMPORTANT:** This is a project still work in progress and it could be slow, unstable, and even destructive! It should be used with caution.
 ### About ###
 WP2TXT extracts plain text data from Wikipedia dump file (encoded in XML/compressed with Bzip2) stripping all the MediaWiki markups and other metadata. It is originally intended to be useful for researchers who look for an easy way to obtain open-source multi-lingual corpora, but may be handy for other purposes.
+**UPDATE:** Version 0.9.1 has added a new option `num-threads`, which improves the performance significantly . Note also that `--category` option is enabled by default, resulting with output format somewhat different from previous versions. Check out the new format using test data in `data/output_samples` folder before going on to convert a huge wikipedia dump.
 ### Features ###
 * Convert dump files of Wikipedia of various languages (I hope).
@@ -28,8 +30,6 @@ where `xx` is language code such as "en (English)" or "ja (Japanese)", and  `yyy
 Command line options are as follows:
-**Important** Command line options in the current version have been drastically changed from previous versions.
     Usage: wp2txt [options]
     where [options] are:
                --input-file, -i:   Wikipedia dump file with .bz2 (compressed) or
@@ -41,15 +41,18 @@ Command line options are as follows:
     --heading, --no-heading, -d:   Show section titles in output (default: true)
         --title, --no-title, -t:   Show page titles in output (default: true)
                     --table, -a:   Show table source code in output
-                 --template, -e:   leave inline template notations unmodified
+                   --inline, -n:   leave inline template notations unmodified
+                --multiline, -m:   leave multiline template notations unmodified
                       --ref, -r:   leave reference notations in the format
                                    [ref]...[/ref]
-                     --redirect:   Show redirect destination
-      --marker, --no-marker, -m:   Show symbols prefixed to list items,
+                 --redirect, -e:   Show redirect destination
+      --marker, --no-marker, -k:   Show symbols prefixed to list items,
                                    definitions, etc. (Default: true)
                  --category, -g:   Show article category information
             --file-size, -f <i>:   Approximate size (in MB) of each output file
                                    (default: 10)
+          -u, --num-threads=<i>:   Number of threads to be spawned (capped to the number of CPU cores;
+                                   set 99 to spawn max num of threads) (default: 4)
                   --version, -v:   Print version and exit
                      --help, -h:   Show this message
@@ -70,6 +73,8 @@ Command line options are as follows:
 ### References ###
+The author will appreciate your mentioning one of these in your research.
 * Yoichiro HASEBE. 2006. [Method for using Wikipedia as Japanese corpus.](http://ci.nii.ac.jp/naid/110006226727) _Doshisha Studies in Language and Culture_ 9(2), 373-403.
 * 長谷部陽一郎. 2006. [Wikipedia日本語版をコーパスとして用いた言語研究の手法](http://ci.nii.ac.jp/naid/110006226727). 『言語文化』9(2), 373-403.
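
Note on the hunk above: the README describes `--num-threads` as "capped to the number of CPU cores", with 99 as a way to request the maximum. The code that enforces the cap is not shown in this part of the diff, so the following is only a sketch of one way such clamping could work. The helper name `effective_threads` and the lower bound of 1 are assumptions.

```ruby
require "etc"

# Hypothetical helper: clamp a requested thread count to the range [1, cores].
# Etc.nprocessors reports the number of processors available to this process.
def effective_threads(requested, cores = Etc.nprocessors)
  [[requested, 1].max, cores].min
end

puts effective_threads(4, 8)   # a request below the core count is kept
puts effective_threads(99, 8)  # an oversized request is capped at the core count
puts effective_threads(4, 2)   # the default of 4 is capped on a 2-core machine
```

Passing the core count as a parameter keeps the sketch testable on any machine; omitting it falls back to the local processor count.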

data/bin/benchmark.rb CHANGED

@@ -12,15 +12,16 @@ require 'benchmark'
 data_dir = File.join(File.dirname(__FILE__), '..', "data")
 parent = Wp2txt::CmdProgbar.new
-input_file = File.join(data_dir, "testdata.bz2")
+input_file = File.join(data_dir, "testdata_ja.bz2")
 output_dir = data_dir
 tfile_size = 10
+num_threads = 1
 convert = true
 strip_tmarker = true
 Benchmark.bm do |x|
   x.report do
-    wpconv = Wp2txt::Runner.new(parent, input_file, output_dir, tfile_size, convert, strip_tmarker)
+    wpconv = Wp2txt::Runner.new(parent, input_file, output_dir, tfile_size, num_threads, convert, strip_tmarker)
     wpconv.extract_text do |article|
       format_wiki!(article.title)
       title = "[[#{article.title}]]\n"
@@ -58,11 +59,11 @@ Benchmark.bm do |x|
         end
         contents << line
       end
-      format_article!(contents)
+      format_wiki!(contents)
       convert_characters!(contents)
       ##### cleanup #####
-      if /\A\s*\z/m =~ contents
+      if /\A[\s　]*\z/m =~ contents
         result = ""
       else
         result = title + "\n" + contents
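
Note on the hunk above: the cleanup regex gains the ideographic space (U+3000) in its character class. A likely reason is that Ruby's `\s` matches only ASCII whitespace, so an article body consisting solely of full-width spaces, which is plausible in Japanese text, would not count as blank under the old pattern. The sketch below illustrates the difference; the constant names and the sample string are assumptions for illustration.

```ruby
# U+3000 IDEOGRAPHIC SPACE, common in Japanese text.
IDEOGRAPHIC_SPACE = "\u3000"

OLD_BLANK = /\A\s*\z/m           # pattern before the change
NEW_BLANK = /\A[\s\u3000]*\z/m   # pattern after the change (U+3000 written as an escape)

sample = "\n#{IDEOGRAPHIC_SPACE}\n"

puts OLD_BLANK.match?(sample)  # old pattern: does the sample count as blank?
puts NEW_BLANK.match?(sample)  # new pattern: does the sample count as blank?
```

A broader alternative would be the POSIX bracket class `[[:space:]]`, which in Ruby also covers Unicode whitespace; the diff itself adds only U+3000.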

data/bin/wp2txt CHANGED

@@ -11,11 +11,11 @@ DOCDIR = File.join(File.dirname(__FILE__), '..', 'doc')
 require 'wp2txt'
 require 'wp2txt/utils'
 require 'wp2txt/version'
-require 'trollop'
+require 'optimist'
 include Wp2txt
-opts = Trollop::options do
+opts = Optimist::options do
 	version Wp2txt::VERSION
 	banner <<-EOS
 WP2TXT extracts plain text data from Wikipedia dump file (encoded in XML/compressed with Bzip2) stripping all the MediaWiki markups and other metadata.
@@ -31,37 +31,39 @@ EOS
   opt :heading, "Show section titles in output", :default => true, :short => "-d"
   opt :title,   "Show page titles in output", :default => true
   opt :table,   "Show table source code in output", :default => false
-  opt :template, "leave inline template notations unmodified", :default => false
+  opt :inline, "leave inline template notations as they are", :default => false
+  opt :multiline, "leave multiline template notations as they are", :default => false
   opt :ref, "leave reference notations in the format [ref]...[/ref]", :default => false
   opt :redirect, "Show redirect destination", :default => false
   opt :marker, "Show symbols prefixed to list items, definitions, etc.", :default => true
-  opt :category, "Show article category information", :default => false
+  opt :category, "Show article category information", :default => true
   opt :file_size,   "Approximate size (in MB) of each output file", :default => 10
+  opt :num_threads,   "Number of threads to be spawned (capped to the number of CPU cores; set 99 to spawn max num of threads)", :default => 4
 end
-Trollop::die :size, "must be larger than 0" unless opts[:file_size] >= 0
-Trollop::die :output_dir, "must exist" unless File.exist?(opts[:output_dir])
+Optimist::die :size, "must be larger than 0" unless opts[:file_size] >= 0
+Optimist::die :output_dir, "must exist" unless File.exist?(opts[:output_dir])
 input_file = ARGV[0]
 output_dir = opts[:output_dir]
 tfile_size = opts[:file_size]
+num_threads = opts[:num_threads]
 convert = opts[:convert]
 strip_tmarker = opts[:marker] ? false : true
-opt_array = [:title, :list, :heading, :table, :redirect]
-$leave_template = true if opts[:template]
-$leave_table = true if opts[:table]
+opt_array = [:title, :list, :heading, :table, :redirect, :multiline]
+$leave_inline_template = true if opts[:inline]
 $leave_ref = true if opts[:ref]
+# $leave_table = true if opts[:table]
 config = {}
 opt_array.each do |opt|
   config[opt] = opts[opt]
 end
 parent = Wp2txt::CmdProgbar.new
-wpconv = Wp2txt::Runner.new(parent, input_file, output_dir, tfile_size, convert, strip_tmarker)
+wpconv = Wp2txt::Runner.new(parent, input_file, output_dir, tfile_size, num_threads, convert, strip_tmarker)
 wpconv.extract_text do |article|
   format_wiki!(article.title)
   title = "[[#{article.title}]]\n"
-  convert_characters!(title)
   if opts[:category] && !article.categories.empty?
     contents = "\nCATEGORIES: "
@@ -79,13 +81,11 @@ wpconv.extract_text do |article|
       line = e.last
       line << "+HEADING+" if $DEBUG_MODE
     when :mw_paragraph
-      # next if !config[:paragraph]
       format_wiki!(e.last)
-      line = e.last
+      line = e.last + "\n"
       line << "+PARAGRAPH+" if $DEBUG_MODE
     when :mw_table, :mw_htable
       next if !config[:table]
-      format_wiki!(e.last)
       line = e.last
       line << "+TABLE+" if $DEBUG_MODE
     when :mw_pre
@@ -93,43 +93,42 @@ wpconv.extract_text do |article|
       line = e.last
       line << "+PRE+" if $DEBUG_MODE
     when :mw_quote
-      # next if !config[:quote]
-      format_wiki!(e.last)
       line = e.last
       line << "+QUOTE+" if $DEBUG_MODE
     when :mw_unordered, :mw_ordered, :mw_definition
       next if !config[:list]
-      format_wiki!(e.last)
       line = e.last
       line << "+LIST+" if $DEBUG_MODE
+    when :mw_ml_template
+      next if !config[:multiline]
+      line = e.last
+      line << "+MLTEMPLATE+" if $DEBUG_MODE
     when :mw_redirect
       next if !config[:redirect]
-      format_wiki!(e.last)
       line = e.last
       line << "+REDIRECT+" if $DEBUG_MODE
       line << "\n\n"
+    when :mw_isolated_template
+      next if !config[:multiline]
+      line = e.last
+      line << "+ISOLATED_TEMPLATE+" if $DEBUG_MODE
+    when :mw_isolated_tag
+      next
     else
       if $DEBUG_MODE
-        format_wiki!(e.last)
+        # format_wiki!(e.last)
         line = e.last
         line << "+OTHER+"
       else
         next
       end
     end
-    contents << line
+    contents << line << "\n"
   end
-  format_article!(contents)
-  convert_characters!(contents)
-  ##### cleanup #####
-  if /\A\s*\z/m =~ contents
+  if /\A[\s　]*\z/m =~ contents
     result = ""
   else
-    result = config[:title] ? title + "\n" << contents : contents
+    result = config[:title] ? "\n#{title}\n" << contents : contents
   end
-  result.gsub!(/\[ref\]\s*\[\/ref\]/m){""}
-  result.gsub!(/^[\s\W]+$/)
-  result.gsub!(/\n\n\n+/m){"\n\n"}
-  result << "\n"
-end
+end