RubyGems - wp2txt - Versions diffs - 0.6.1 → 0.7.0 - Mend

wp2txt 0.6.1 → 0.7.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (12)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: bcfd6986e262e455c100d664583b099d57ff4428
-  data.tar.gz: 68c922b43951b7f326b0136681981a166208d0a9
+  metadata.gz: 80f68e6c1ac855160575f85f4d78ca378f0a1c2b
+  data.tar.gz: 16bbac80e7139ea63dd46baf54fb5deaf0840e59
 SHA512:
-  metadata.gz: 294e0f8e1d2b37534ad885c617cfbd72ad72144dca6fb01231f6e2cf691a86bf58690f5dd1b2b410f8ee23eb3c74fa3f40e4ca8bbf3f3921ea78295783da5f2e
-  data.tar.gz: 71a1b8feca5c3067ff534f0239c4c937485ff3ed8c0a6de793be0b71befd440cb5b1c2c465dc6684c558a53209ab6481dd2e60edfe7fb7cbdd9c6f07416efd24
+  metadata.gz: 004d26fa39aae4eb194858cf85ae8aad33f65dc556a08bbfc499ead05d49e70af4f5ba5e708354aa816cd6b38d8e9860866cefa7d6c0730058e9a186ff9eec31
+  data.tar.gz: c2523b8afeab165c37de028eedff36e719a2472f9440469e4041c342b08463d439351a89523d959ff28d53364c76a2af44502113bb2084eacbbc8ac14306f8a4

data/README.md CHANGED

@@ -8,7 +8,7 @@ WP2TXT extracts plain text data from Wikipedia dump file (encoded in XML/compres
 ### Features ###
-* Convert dump files of Wikipedia of multiple languages (I hope).
+* Convert dump files of Wikipedia of various languages (I hope).
 * Create output files of specified size.
 * Allow users to specify text elements to be extracted/converted (page titles, section titles, lists, and tables).
@@ -16,12 +16,6 @@ WP2TXT extracts plain text data from Wikipedia dump file (encoded in XML/compres
     $ gem install wp2txt
-It is highly recommended you also install bz2-ruby gem. See the following for the details about bz2-ruby gem:
-[https://github.com/brianmario/bzip2-ruby](https://github.com/brianmario/bzip2-ruby)
-When the above gem is not found, wp2txt will try to use bzip2 program in your command line environment.  Supposedly he former option is more reliable as well as fast.
 ### Usage
 Obtain a Wikipedia dump file (from [here](http://dumps.wikimedia.org/backup-index.html)) with a file name such as:
@@ -32,10 +26,10 @@ where `xx` is language code such as "en (English)" or "ja (Japanese)", and  `yyy
 Command line options are as follows:
-*CAUTION:* command line options in the current version have been drastically changed from those in versions 0.5!
+*CAUTION:* Command line options in the current version have been drastically changed from previous versions.
-      Usage: wp2txt [options]
-      where [options] are:
+    Usage: wp2txt [options]
+    where [options] are:
                --input-file, -i:   Wikipedia dump file with .bz2 (compressed) or
                                    .txt (uncompressed) format
            --output-dir, -o <s>:   Output directory (default:
@@ -46,13 +40,15 @@ Command line options are as follows:
     --heading, --no-heading, -d:   Show section titles in output (default: true)
         --title, --no-title, -t:   Show page titles in output (default: true)
                     --table, -a:   Show table source code in output
-                 --template, -e:   Show template specifications in output
+                 --template, -e:   leave inline template notations unmodified
                  --redirect, -r:   Show redirect destination
       --marker, --no-marker, -m:   Show symbols prefixed to list items,
                                    definitions, etc. (Default: true)
                  --category, -g:   Show article category information
             --file-size, -f <i>:   Approximate size (in MB) of each output file
                                    (default: 10)
+          --limit-recur, -u <i>:   Max number of recursive call (0 to 10)
+                                   (default: 10)
                   --version, -v:   Print version and exit
                      --help, -h:   Show this message
@@ -71,6 +67,11 @@ Command line options are as follows:
 * Yoichiro Hasebe (<yohasebe@gmail.com>)
+### References ###
+* Yoichiro HASEBE. 2006. [Method for using Wikipedia as Japanese corpus.](http://ci.nii.ac.jp/naid/110006226727) _Doshisha Studies in Language and Culture_ 9(2), 373-403.
+* 長谷部陽一郎. 2006. [Wikipedia日本語版をコーパスとして用いた言語研究の手法](http://ci.nii.ac.jp/naid/110006226727). 『言語文化』9(2), 373-403.
 ### License ###
 This software is distributed under the MIT License. Please see the LICENSE file.

data/bin/benchmark.rb CHANGED

@@ -18,13 +18,11 @@ tfile_size = 10
 convert = true
 strip_tmarker = true
 Benchmark.bm do |x|
   x.report do
     wpconv = Wp2txt::Runner.new(parent, input_file, output_dir, tfile_size, convert, strip_tmarker)
     wpconv.extract_text do |article|
-      title = format_wiki article.title
+      title = format_wiki! article.title
       title = "[[#{title}]]\n"
         contents = "\nCATEGORIES: "
@@ -34,25 +32,31 @@ Benchmark.bm do |x|
       article.elements.each do |e|
         case e.first
         when :mw_heading
-          line = format_wiki(e.last)
+          format_wiki!(e.last)
+          line = e.last
         when :mw_paragraph
-          line = format_wiki(e.last)
+          format_wiki!(e.last)
+          line = e.last
         when :mw_table, :mw_htable
-          line = format_wiki(e.last)
+          format_wiki!(e.last)
+          line = e.last
         when :mw_pre
           line = e.last
         when :mw_quote
-          line = format_wiki(e.last)
+          format_wiki!(e.last)
+          line = e.last
         when :mw_unordered, :mw_ordered, :mw_definition
-          line = format_wiki(e.last)
+          format_wiki!(e.last)
+          line = e.last
         when :mw_redirect
-          line = format_wiki(e.last)
+          format_wiki!(e.last)
+          line = e.last
           line += "\n\n"
         else
           next
         end
         contents += line
-        contents = remove_templates(contents)
+        remove_templates!(contents)
       end
       ##### cleanup #####
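
The hunks above switch from `format_wiki` (return a new string) to `format_wiki!` (modify the argument in place). The bodies of those methods are not part of this diff, so the following is only a minimal sketch of the two calling conventions. `strip_emphasis` and `strip_emphasis!` are hypothetical stand-ins, not methods of the gem.

```ruby
# Hypothetical helpers for illustration only; wp2txt's real
# format_wiki / format_wiki! bodies are not shown in this diff.

# Non-destructive style: returns a new String, leaves the argument alone.
def strip_emphasis(str)
  str.gsub(/'{2,}/, "")
end

# Destructive ("bang") style: edits the argument in place.
# String#gsub! returns nil when nothing was replaced, so the receiver
# is returned explicitly to give callers a usable value either way.
def strip_emphasis!(str)
  str.gsub!(/'{2,}/, "")
  str
end

text = +"''Ruby'' is '''fun'''"

copy = strip_emphasis(text)   # text is unchanged, copy is the cleaned string
strip_emphasis!(text)         # text itself is now cleaned
```

Because `String#gsub!` returns `nil` when no substitution happens, reading the mutated object back afterwards (as the new code does with `line = e.last`) avoids depending on the bang method's return value.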

data/bin/wp2txt CHANGED

@@ -31,39 +31,41 @@ EOS
   opt :heading, "Show section titles in output", :default => true, :short => "-d"
   opt :title,   "Show page titles in output", :default => true
   opt :table,   "Show table source code in output", :default => false
-  opt :template, "Show template specifications in output", :default => false
+  opt :template, "leave inline template notations unmodified", :default => false
   opt :redirect, "Show redirect destination", :default => false
   opt :marker, "Show symbols prefixed to list items, definitions, etc.", :default => true
   opt :category, "Show article category information", :default => false
-  opt :file_size,   "Approximate size (in MB) of each output file", :default => 10
+  opt :file_size,   "Approximate size (in MB) of each output file", :default => 10
+  opt :limit_recur, "Max number of recursive call (0 to 10)", :default => 10
 end
 Trollop::die :size, "must be larger than 0" unless opts[:file_size] >= 0
 Trollop::die :output_dir, "must exist" unless File.exist?(opts[:output_dir])
+Trollop::die :limit_recur, "must be 10 or smaller" if opts[:limit_recur] > 10
 input_file = ARGV[0]
 output_dir = opts[:output_dir]
 tfile_size = opts[:file_size]
+limit_recur = opts[:limit_recur]
 convert = opts[:convert]
 strip_tmarker = opts[:marker] ? false : true
-opt_array = [:title, :list, :heading, :table, :template, :redirect]
+opt_array = [:title, :list, :heading, :table, :redirect]
+$leave_template = true if opts[:template]
 config = {}
 opt_array.each do |opt|
   config[opt] = opts[opt]
 end
-# a "parent" is either commandline progress bar or
-# a gui window (not available for now)
 parent = Wp2txt::CmdProgbar.new
-wpconv = Wp2txt::Runner.new(parent, input_file, output_dir, tfile_size, convert, strip_tmarker)
+wpconv = Wp2txt::Runner.new(parent, input_file, output_dir, tfile_size, convert, strip_tmarker, limit_recur)
 wpconv.extract_text do |article|
-  title = format_wiki article.title
-  title = "[[#{title}]]\n"
+  format_wiki!(article.title)
+  title = "[[#{article.title}]]\n"
   if opts[:category] && !article.categories.empty?
     contents = "\nCATEGORIES: "
-    contents += article.categories.join(", ")
-    contents += "\n\n"
+    contents << article.categories.join(", ")
+    contents << "\n\n"
   else
     contents = ""
   end
@@ -72,44 +74,62 @@ wpconv.extract_text do |article|
     case e.first
     when :mw_heading
       next if !config[:heading]
-      line = format_wiki(e.last)
-      line += "+HEADING+" if $DEBUG_MODE
+      format_wiki!(e.last)
+      line = e.last
+      line << "+HEADING+" if $DEBUG_MODE
     when :mw_paragraph
       # next if !config[:paragraph]
-      line = format_wiki(e.last)
-      line += "+PARAGRAPH+" if $DEBUG_MODE
+      format_wiki!(e.last)
+      line = e.last
+      line << "+PARAGRAPH+" if $DEBUG_MODE
     when :mw_table, :mw_htable
       next if !config[:table]
-      line = format_wiki(e.last)
-      line += "+TABLE+" if $DEBUG_MODE
+      format_wiki!(e.last)
+      line = e.last
+      line << "+TABLE+" if $DEBUG_MODE
     when :mw_pre
       next if !config[:pre]
       line = e.last
-      line += "+PRE+" if $DEBUG_MODE
+      line << "+PRE+" if $DEBUG_MODE
     when :mw_quote
       # next if !config[:quote]
-      line = format_wiki(e.last)
-      line += "+QUOTE+" if $DEBUG_MODE
+      format_wiki!(e.last)
+      line = e.last
+      line << "+QUOTE+" if $DEBUG_MODE
     when :mw_unordered, :mw_ordered, :mw_definition
       next if !config[:list]
-      line = format_wiki(e.last)
-      line += "+LIST+" if $DEBUG_MODE
+      format_wiki!(e.last)
+      line = e.last
+      line << "+LIST+" if $DEBUG_MODE
     when :mw_redirect
       next if !config[:redirect]
-      line = format_wiki(e.last)
-      line += "+REDIRECT+" if $DEBUG_MODE
-      line += "\n\n"
+      format_wiki!(e.last)
+      line = e.last
+      line << "+REDIRECT+" if $DEBUG_MODE
+      line << "\n\n"
     else
       if $DEBUG_MODE
-        line = format_wiki(e.last)
-        line += "+OTHER+"
+        format_wiki!(e.last)
+        line = e.last
+        line << "+OTHER+"
       else
         next
       end
     end
-    contents += line
-    contents = remove_templates(contents) unless config[:template]
+    contents << line
   end
+  remove_directive!(contents)
+  remove_emphasis!(contents)
+  mndash!(contents)
+  make_reference!(contents)
+  format_ref!(contents)
+  remove_hr!(contents)
+  remove_tag!(contents)
+  special_chr!(contents)
+  correct_inline_template!(contents) unless $leave_template
+  remove_templates!(contents) unless $leave_template
   ##### cleanup #####
   if /\A\s*\z/m =~ contents
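
This hunk also replaces `+=` with `<<` for string building. The sketch below shows the behavioural difference, using a made-up element array in the `[TYPE, string]` shape described in `article.rb`; the sample text is an assumption for illustration.

```ruby
# An element in the [TYPE, string] shape; the content is invented.
element = [:mw_heading, +"== History =="]

# `+=` allocates a new String, so the element keeps its original text.
line = element.last
line += "+HEADING+"
after_plus = element.last.dup      # "== History =="

# `<<` appends to the existing object: `line` and `element.last`
# are the same String, so the element's text changes as well.
line = element.last
line << "+HEADING+"
after_shovel = element.last.dup    # "== History ==+HEADING+"

# When the element must stay intact, append to a copy instead.
safe = element.last.dup
safe << "\n\n"
```

`<<` avoids allocating an intermediate string on every append, at the cost of mutating whatever object the variable points to.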

data/lib/wp2txt.rb CHANGED

@@ -3,14 +3,18 @@
 $: << File.join(File.dirname(__FILE__))
-require "rubygems"
-require "bundler/setup"
-require "nokogiri"
+# require "rubygems"
+# require "bundler/setup"
+require "Nokogiri"
+# require "oga"
+# require "ox"
+require 'pp'
 require "wp2txt/article"
 require "wp2txt/utils"
-require "wp2txt/mw_api"
 require "wp2txt/progressbar"
+# require "wp2txt/mw_api"
 begin
   require "bzip2-ruby"
@@ -25,9 +29,7 @@ module Wp2txt
     include Wp2txt
-    # attr_accessor :pause_flag, :stop_flag, :outfiles, :convert
-    def initialize(parent, input_file, output_dir = ".", tfile_size = 10, convert = true, strip_tmarker = false)
+    def initialize(parent, input_file, output_dir = ".", tfile_size = 10, convert = true, strip_tmarker = false, limit_recur = 10)
       @parent = parent
       @fp = nil
@@ -36,6 +38,9 @@ module Wp2txt
       @tfile_size = tfile_size
       @convert = convert
       @strip_tmarker = strip_tmarker
+      #max number of recursive calls (global variable)
+      $limit_recur = limit_recur
     end
     def file_size(file)
@@ -111,7 +116,9 @@ module Wp2txt
           else
             file = IO.popen("bzip2 -c -d #{@input_file}")
           end
+          @parent.msg("Preparing ... This may take several minutes or more ", 0)
           @infile_size = file_size(file)
+          @parent.msg("... Done.", 1)
           file.close  # try to reopen since rewind method is unavailable
           if RUBY_PLATFORM.index("win32")
             file = IO.popen("bunzip2.exe -c #{@input_file}")
@@ -237,13 +244,41 @@ module Wp2txt
       while page = get_page
         xmlns = '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.5/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.5/ http://www.mediawiki.org/xml/export-0.5.xsd" version="0.5" xml:lang="en">' + "\n"
         xml = xmlns + page + "</mediawiki>"
         input = Nokogiri::XML(xml, nil, 'UTF-8')
-        page = input.xpath("//xmlns:text").first
+        page = input.xpath("//xmlns:text").first
         pp_title = page.parent.parent.at_css "title"
         title = pp_title.content
-        next if /\:/ =~ title
+        next if /\:/ =~ title
         text = page.content
+        # input = Oga.parse_xml(xml)
+        # page = input.xpath("//xmlns:text").first
+        # title = page.parent.parent.xpath("//xmlns:title").first.text
+        # next if /\:/ =~ title
+        # text = page.text
+        # input = Ox.load(xml, :encoding => "UTF-8")
+        # title = ""
+        # text  = ""
+        # input.nodes.first.nodes.each do |n|
+        #   if n.name == "title"
+        #     title = n.nodes.first
+        #     if /\:/ =~ title
+        #       title = ""
+        #       break
+        #     end
+        #   elsif n.name == "revision"
+        #     n.nodes.each do |o|
+        #       if o.name == "text"
+        #         text = o.nodes.first
+        #         break
+        #       end
+        #     end
+        #   end
+        # end
+        # next if title == "" || text == ""
         # remove all comment texts
         # and insert as many number of new line chars included in
         # each comment instead
@@ -256,7 +291,7 @@ module Wp2txt
           end
         end
-        @count ||= 0;@count += 1;
+        @count ||= 0;@count += 1;
         article = Article.new(text, title, @strip_tmarker)
         output_text += block.call(article)
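
The context comment in this hunk says comments are removed and replaced with as many newline characters as each comment contained. The actual implementation lies outside the lines shown, so this is only a sketch of that idea; `blank_out_comments` is a hypothetical name.

```ruby
# Hypothetical helper: replace each <!-- ... --> comment with the same
# number of newlines it contained, so line counts stay unchanged.
def blank_out_comments(text)
  text.gsub(/<!--.*?-->/m) { |comment| "\n" * comment.count("\n") }
end

source  = "intro<!-- note\nspanning\nlines -->\nbody"
cleaned = blank_out_comments(source)
```

Keeping the newline count stable means later line-by-line processing sees the same line positions as in the original text.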

data/lib/wp2txt/article.rb CHANGED

@@ -3,77 +3,37 @@
 $: << File.join(File.dirname(__FILE__))
 require 'strscan'
 require 'utils'
 module Wp2txt
   # possible element type, which could be later chosen to print or not to print
-  # :mw_heading
-  # :mw_htable
-  # :mw_quote
-  # :mw_unordered
-  # :mw_ordered
-  # :mw_definition
-  # :mw_pre
-  # :mw_paragraph
-  # :mw_comment
-  # :mw_math
-  # :mw_source
-  # :mw_inputbox
-  # :mw_template
-  # :mw_link
-  # :mw_summary
-  # :mw_blank
-  # :mw_redirect
+    # :mw_heading
+    # :mw_htable
+    # :mw_quote
+    # :mw_unordered
+    # :mw_ordered
+    # :mw_definition
+    # :mw_pre
+    # :mw_paragraph
+    # :mw_comment
+    # :mw_math
+    # :mw_source
+    # :mw_inputbox
+    # :mw_template
+    # :mw_link
+    # :mw_summary
+    # :mw_blank
+    # :mw_redirect
   # an article contains elements, each of which is [TYPE, string]
   class Article
     include Wp2txt
     attr_accessor :elements, :title, :categories
-    # class varialbes to save resource for generating regexps
-    # those with a trailing number 1 represent opening tag/markup
-    # those with a trailing number 2 represent closing tag/markup
-    # those without a trailing number contain both opening/closing tags/markups
-    @@in_template_regex = Regexp.new('^\s*\{\{[^\}]+\}\}\s*$')
-    @@in_link_regex = Regexp.new('^\s*\[.*\]\s*$')
-    @@in_inputbox_regex  = Regexp.new('<inputbox>.*?<\/inputbox>')
-    @@in_inputbox_regex1  = Regexp.new('<inputbox>')
-    @@in_inputbox_regex2  = Regexp.new('<\/inputbox>')
-    @@in_source_regex  = Regexp.new('<source.*?>.*?<\/source>')
-    @@in_source_regex1  = Regexp.new('<source.*?>')
-    @@in_source_regex2  = Regexp.new('<\/source>')
-    @@in_math_regex  = Regexp.new('<math.*?>.*?<\/math>')
-    @@in_math_regex1  = Regexp.new('<math.*?>')
-    @@in_math_regex2  = Regexp.new('<\/math>')
-    @@in_heading_regex  = Regexp.new('^=+.*?=+$')
-    @@in_html_table_regex = Regexp.new('<table.*?><\/table>')
-    @@in_html_table_regex1 = Regexp.new('<table\b')
-    @@in_html_table_regex2 = Regexp.new('<\/\s*table>')
-    @@in_table_regex1 = Regexp.new('^\s*\{\|')
-    @@in_table_regex2 = Regexp.new('^\|\}.*?$')
-    @@in_unordered_regex  = Regexp.new('^\*')
-    @@in_ordered_regex    = Regexp.new('^\#')
-    @@in_pre_regex = Regexp.new('^ ')
-    @@in_definition_regex  = Regexp.new('^[\;\:]')
-    @@blank_line_regex = Regexp.new('^\s*$')
-    @@redirect_regex = Regexp.new('#(?:REDIRECT|転送)\s+\[\[(.+)\]\]', Regexp::IGNORECASE)
-    category_patterns = ["Category", "Categoria"].join("|")
-    @@category_regex = Regexp.new('[\{\[\|\b](?:' + category_patterns + ')\:(.*?)[\}\]\|\b]', Regexp::IGNORECASE)
     def initialize(text, title = "", strip_tmarker = false)
       @title = title.strip
       @strip_tmarker = strip_tmarker
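
The hunk above removes the precompiled `@@..._regex` class variables; the following hunks refer to `$..._regex` globals whose definitions are not part of this excerpt. As a rough sketch of the same "compile once, match per line" idea, the example below uses module constants instead of globals (a swapped-in choice, not what the gem does). The patterns are copied from the removed lines; `LineKinds` and `classify` are hypothetical names.

```ruby
# Sketch only: constants stand in for the gem's global regexes.
module LineKinds
  BLANK      = /^\s*$/
  HEADING    = /^=+.*?=+$/
  UNORDERED  = /^\*/
  ORDERED    = /^\#/
  PRE        = /^ /
  DEFINITION = /^[;:]/

  # Return an element type for a single line of wiki text.
  # `case` uses Regexp#===, so the first matching pattern wins.
  def self.classify(line)
    case line
    when BLANK      then :mw_blank
    when HEADING    then :mw_heading
    when UNORDERED  then :mw_unordered
    when ORDERED    then :mw_ordered
    when PRE        then :mw_pre
    when DEFINITION then :mw_definition
    else                 :mw_paragraph
    end
  end
end

kinds = ["== Title ==", "* item", "# step", "Plain text."].map do |l|
  LineKinds.classify(l)
end
```

The order of the `when` branches matters: the blank-line test must come before the preformatted-text test, since a line of spaces would otherwise be classified as `:mw_pre`.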
@@ -91,39 +51,39 @@ module Wp2txt
       open_stack  = []
       close_stack = []
       source.each_line do |line|
-        matched = line.scan(@@category_regex)
+        matched = line.scan($category_regex)
         if matched && !matched.empty?
           @categories += matched
-          @categories = @categories.uniq
+          @categories.uniq!
         end
         case mode
         when :mw_table
-          if @@in_table_regex2 =~ line
+          if $in_table_regex2 =~ line
             mode = nil
           end
           @elements.last.last << line
           next
         when :mw_inputbox
-          if @@in_inputbox_regex2 =~ line
+          if $in_inputbox_regex2 =~ line
             mode = nil
           end
           @elements.last.last << line
           next
         when :mw_source
-          if @@in_source_regex2 =~ line
+          if $in_source_regex2 =~ line
             mode = nil
           end
           @elements.last.last << line
           next
         when :mw_math
-          if @@in_math_regex2 =~ line
+          if $in_math_regex2 =~ line
             mode = nil
           end
           @elements.last.last << line
           next
         when :mw_htable
-          if @@in_html_table_regex2 =~ line
+          if $in_html_table_regex2 =~ line
             mode = nil
           end
           @elements.last.last << line
@@ -131,51 +91,51 @@ module Wp2txt
         end
         case line
-        when @@blank_line_regex
+        when $blank_line_regex
           @elements << create_element(:mw_blank, "\n")
-        when @@redirect_regex
+        when $redirect_regex
           @elements << create_element(:mw_redirect, line)
-        when @@in_template_regex
+        when $in_template_regex
           @elements << create_element(:mw_template, line)
-        when @@in_heading_regex
-          line = line.sub(/^(\=+)\s+/){$1}.sub(/\s+(\=+)$/){$1}
+        when $in_heading_regex
+          line = line.sub($heading_onset_regex){$1}.sub($heading_coda_regex){$1}
           @elements << create_element(:mw_heading, "\n" + line + "\n")
-        when @@in_inputbox_regex
+        when $in_inputbox_regex
           @elements << create_element(:mw_inputbox, line)
-        when @@in_inputbox_regex1
+        when $in_inputbox_regex1
           mode = :mw_inputbox
           @elements << create_element(:mw_inputbox, line)
-        when @@in_source_regex
+        when $in_source_regex
         @elements << create_element(:mw_source, line)
-        when @@in_source_regex1
+        when $in_source_regex1
           mode = :mw_source
           @elements << create_element(:mw_source, line)
-        when @@in_math_regex
+        when $in_math_regex
           @elements << create_element(:mw_math, line)
-        when @@in_math_regex1
+        when $in_math_regex1
           mode = :mw_math
           @elements << create_element(:mw_math, line)
-        when @@in_html_table_regex
+        when $in_html_table_regex
           @elements << create_element(:mw_htable, line)
-        when @@in_html_table_regex1
+        when $in_html_table_regex1
           mode = :mw_htable
           @elements << create_element(:mw_htable, line)
-        when @@in_table_regex1
+        when $in_table_regex1
           mode = :mw_table
           @elements << create_element(:mw_table, line)
-        when @@in_unordered_regex
-          line = line.sub(/\A[\*\#\;\:\ ]+/, "") if @strip_tmarker
+        when $in_unordered_regex
+          line = line.sub($list_marks_regex, "") if @strip_tmarker
           @elements << create_element(:mw_unordered, line)
-        when @@in_ordered_regex
-          line = line.sub(/\A[\*\#\;\:\ ]+/, "") if @strip_tmarker
+        when $in_ordered_regex
+          line = line.sub($list_marks_regex, "") if @strip_tmarker
           @elements << create_element(:mw_ordered, line)
-        when @@in_pre_regex
-          line = line.sub(/\A\^\ /, "") if @strip_tmarker
+        when $in_pre_regex
+          line = line.sub($pre_marks_regex, "") if @strip_tmarker
           @elements << create_element(:mw_pre, line)
-        when @@in_definition_regex
-          line = line.sub(/\A[\;\:\ ]+/, "") if @strip_tmarker
+        when $in_definition_regex
+          line = line.sub($def_marks_regex, "") if @strip_tmarker
           @elements << create_element(:mw_definition, line)
-        when @@in_link_regex
+        when $in_link_regex
           @elements << create_element(:mw_link, line)
         else
           @elements << create_element(:mw_paragraph, line)