RubyGems - wp2txt - Versions diffs - 0.4.1 - Mend

wp2txt 0.4.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (17)

data/.gitignore ADDED

@@ -0,0 +1,20 @@
+*.gem
+*.rbc
+.bundle
+.config
+.yardoc
+Gemfile.lock
+InstalledFiles
+_yardoc
+coverage
+doc/
+lib/bundler/man
+pkg
+rdoc
+spec/reports
+test/tmp
+test/version_tmp
+tmp
+.DS_Store
+*.bak
+*.~

data/Gemfile ADDED

@@ -0,0 +1,4 @@
+source "http://rubygems.org"
+# Specify your gem's dependencies in wp2txt.gemspec
+gemspec

data/LICENSE ADDED

@@ -0,0 +1,22 @@
+Copyright (c) 2012 Yoichiro Hasebe
+MIT License
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,65 @@
+# WP2TXT
+Wikipedia dump file to text converter
+### About ###
+WP2TXT extracts plain text data from a Wikipedia dump file (encoded in XML and compressed with Bzip2), stripping all MediaWiki markup and other metadata. It was originally intended for researchers looking for an easy way to obtain open-source multilingual corpora, but it may be handy for other purposes.
+### Features ###
+* Convert dump files of Wikipedia of multiple languages (I hope).
+* Create output files of specified size.
+* Allow users to specify text elements to be extracted/converted (page titles, section titles, lists, and tables).
+WP2TXT before version 0.4.0 came with a Mac/Windows GUI. It is now a pure command-line application. Sorry, GUI folks, but there seems to be more demand for an easy-to-hack CUI package than for a not-very-flexible GUI app.
+### Installation
+The `gem install` method will become available soon.  In the meantime, use the source code on GitHub.
+    $ gem install wp2txt
+### Usage
+Obtain a Wikipedia dump file (see the link below) with a file name such as:
+    xxwiki-yyyymmdd-pages-articles.xml.bz2
+where `xx` is a language code such as "en" (English) or "ja" (Japanese), and `yyyymmdd` is the date of creation (e.g. 20120601).
+Command line options are as follows:
+    Usage: wp2txt [options]
+    where [options] are:
+          --input-file, -i:   Wikipedia dump file with .bz2 (compressed) or .txt (uncompressed) format
+      --output-dir, -o <s>:   Output directory (default: current directory)
+         --convert-off, -c:   Output XML (without converting to plain text)
+            --list-off, -l:   Exclude list items from output
+         --heading-off, -d:   Exclude section titles from output
+           --title-off, -t:   Exclude page titles from output
+           --table-off, -a:   Exclude tables from output (default: true)
+        --template-off, -e:   Remove multi-line template notations from output
+        --strip-marker, -s:   Remove symbols prefixed to list items, definitions, etc.
+       --file-size, -f <i>:   Approximate size (in MB) of each output file (default: 10)
+             --version, -v:   Print version and exit
+                --help, -h:   Show this message
+### Limitations ###
+* Certain types of data such as mathematical equations and computer source code are not properly converted.  Please remember this software was originally intended for collecting “sentences” for linguistic studies.
+* Extraction of normal text data could sometimes fail for various reasons (e.g. illegal matching of begin/end tags, language-specific conventions of formatting, etc).
+* The conversion process can take far more time than you would expect. It could take several hours or more when dealing with a huge data set such as the English Wikipedia in a low-spec environment.
+* Because of the nature of the task, WP2TXT needs a lot of processing power and consumes a lot of memory and storage. The process could thus halt unexpectedly. In the worst case, it may even get stuck without terminating gracefully. Please understand this and use the software __at your own risk__.
+### Useful Link ###
+* [Wikipedia Database backup dumps](http://dumps.wikimedia.org/backup-index.html)
+### Author ###
+* Yoichiro Hasebe (<yohasebe@gmail.com>)
+### License ###
+This software is distributed under the MIT License. Please see the LICENSE file.
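
Reviewer note, not part of the released package: the dump file naming convention described in the README's Usage section can be checked with a small script. This is a minimal sketch under stated assumptions. The `DUMP_NAME` pattern and the `parse_dump_name` helper are hypothetical and do not exist in wp2txt, and the pattern assumes lowercase language codes and an optional `.bz2` suffix.

```ruby
require "date"

# Hypothetical pattern (an assumption, not taken from wp2txt) for names such as
# xxwiki-yyyymmdd-pages-articles.xml.bz2
DUMP_NAME = /\A(?<lang>[a-z_]+)wiki-(?<date>\d{8})-pages-articles\.xml(?:\.bz2)?\z/

# Hypothetical helper: returns the language code and creation date encoded in
# a dump file name, or nil when the name does not follow the convention.
def parse_dump_name(path)
  m = DUMP_NAME.match(File.basename(path))
  return nil unless m

  { lang: m[:lang], date: Date.strptime(m[:date], "%Y%m%d") }
end

info = parse_dump_name("jawiki-20120601-pages-articles.xml.bz2")
puts info[:lang]  # prints "ja"
puts info[:date]  # prints "2012-06-01"
```

A name that does not match, such as `notes.txt`, yields `nil`, which a caller could use to warn before starting a long conversion.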

data/Rakefile ADDED

@@ -0,0 +1,9 @@
+require "bundler/gem_tasks"
+require 'rspec/core'
+require 'rspec/core/rake_task'
+RSpec::Core::RakeTask.new(:spec) do |spec|
+  spec.pattern = FileList['spec/**/*_spec.rb']
+end
+task :default => :spec

data/bin/wp2txt ADDED

@@ -0,0 +1,112 @@
+#!/usr/bin/env ruby
+# -*- coding: utf-8 -*-
+$: << File.join(File.dirname(__FILE__))
+$: << File.join(File.dirname(__FILE__), '..', 'lib')
+$DEBUG_MODE = false
+SHAREDIR = File.join(File.dirname(__FILE__), '..', 'share')
+DOCDIR = File.join(File.dirname(__FILE__), '..', 'doc')
+require 'wp2txt'
+require 'wp2txt/utils'
+require 'wp2txt/version'
+require 'trollop'
+include Wp2txt
+opts = Trollop::options do
+	version Wp2txt::VERSION
+	banner <<-EOS
+WP2TXT extracts plain text data from a Wikipedia dump file (encoded in XML and compressed with Bzip2), stripping all MediaWiki markup and other metadata.
+Usage: wp2txt [options]
+where [options] are:
+EOS
+  opt :input_file,  "Wikipedia dump file with .bz2 (compressed) or .txt (uncompressed) format", :required => true
+  opt :output_dir,  "Output directory", :default => Dir::pwd, :type => String
+  opt :convert_off, "Output XML (without converting to plain text)", :default => false
+  opt :list_off,    "Exclude list items from output", :default => false
+  opt :heading_off, "Exclude section titles from output", :default => false, :short => "-d"
+  opt :title_off,   "Exclude page titles from output", :default => false
+  opt :table_off,   "Exclude tables from output", :default => true
+  opt :template_off, "Remove template notations from output", :default => true
+  opt :redirect_off, "Do not show redirect destination", :default => false
+  opt :strip_marker, "Remove symbols prefixed to list items, definitions, etc.", :default => false
+  opt :file_size,   "Approximate size (in MB) of each output file", :default => 10
+end
+Trollop::die :file_size, "must be larger than 0" unless opts[:file_size] > 0
+Trollop::die :output_dir, "must exist" unless File.exist?(opts[:output_dir])
+input_file = ARGV[0]
+output_dir = opts[:output_dir]
+tfile_size = opts[:file_size]
+convert_off = opts[:convert_off]
+strip_tmarker = opts[:strip_marker]
+opt_array = [:title_off, :list_off, :heading_off, :table_off, :template_off, :redirect_off]
+config = {}
+opt_array.each do |opt|
+  config[opt] = opts[opt]
+end
+# a "parent" is either commandline progress bar or
+# a gui window (not available for now)
+parent = Wp2txt::CmdProgbar.new
+wpconv = Wp2txt::Runner.new(parent, input_file, output_dir, tfile_size, convert_off, strip_tmarker)
+wpconv.extract_text do |article|
+  title = format_wiki article.title
+  title = "[[#{title}]]\n"
+  contents = ""
+  article.elements.each do |e|
+    case e.first
+    when :mw_heading
+      next if config[:heading_off]
+      line = format_wiki(e.last)
+      line += "+HEADING+" if $DEBUG_MODE
+    when :mw_paragraph
+      next if config[:paragraph_off]
+      line = format_wiki(e.last)
+      line += "+PARAGRAPH+" if $DEBUG_MODE
+    when :mw_table, :mw_htable
+      next if config[:table_off]
+      line = format_wiki(e.last)
+      line += "+TABLE+" if $DEBUG_MODE
+    when :mw_pre
+      next if config[:pre_off]
+      line = e.last
+      line += "+PRE+" if $DEBUG_MODE
+    when :mw_quote
+      next if config[:quote_off]
+      line = format_wiki(e.last)
+      line += "+QUOTE+" if $DEBUG_MODE
+    when :mw_unordered, :mw_ordered, :mw_definition
+      next if config[:list_off]
+      line = format_wiki(e.last)
+      line += "+LIST+" if $DEBUG_MODE
+    when :mw_redirect
+      next if config[:redirect_off]
+      line = format_wiki(e.last)
+      line += "+REDIRECT+" if $DEBUG_MODE
+      line += "\n\n"
+    else
+      if $DEBUG_MODE
+        line = format_wiki(e.last)
+        line += "+OTHER+"
+      else
+        next
+      end
+    end
+    contents += line
+    contents = remove_templates(contents) if config[:template_off]
+  end
+  if /\A\s*\z/m =~ contents
+    result = ""
+  else
+    result = config[:title_off] ? contents : title + "\n" + contents
+  end
+  result = result.gsub(/\[ref\]\s*\[\/ref\]/m){""}
+  result = result.gsub(/\n\n\n+/m){"\n\n"} + "\n"
+end
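
Reviewer note, not part of the released package: the last two `gsub` calls in the block above drop empty `[ref][/ref]` pairs and collapse runs of three or more newlines into a single blank line. The sketch below isolates that cleanup step so it can be tried on a small string. The `tidy` helper name is hypothetical, and `/\n{3,}/` is used here as an equivalent spelling of the original `/\n\n\n+/m`.

```ruby
# Hypothetical helper mirroring the final cleanup step of bin/wp2txt.
def tidy(text)
  text
    .gsub(/\[ref\]\s*\[\/ref\]/m, "") # drop reference markers left empty
    .gsub(/\n{3,}/, "\n\n") + "\n"    # keep at most one blank line in a row
end

sample = "[[Title]]\n\n\n\nBody text.[ref] [/ref]\n\n\nMore."
cleaned = tidy(sample)
puts cleaned.inspect
```

Because the helper always appends a trailing newline, articles written one after another stay separated in the output file.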

data/data/testdata.bz2 ADDED

Binary file

data/lib/wp2txt.rb ADDED

@@ -0,0 +1,323 @@
+#!/usr/bin/env ruby
+# -*- coding: utf-8 -*-
+$: << File.join(File.dirname(__FILE__))
+require "rubygems"
+require "bundler/setup"
+require "nokogiri"
+require "wp2txt/article"
+require "wp2txt/utils"
+require "wp2txt/mw_api"
+require "wp2txt/progressbar"
+begin
+  require "bzip2-ruby"
+  NO_BZ2 = false
+rescue LoadError
+  # in case bzip2-ruby gem is not available
+  NO_BZ2 = true
+end
+module Wp2txt
+  class Runner
+    include Wp2txt
+    # attr_accessor :pause_flag, :stop_flag, :outfiles, :convert_off
+    def initialize(parent, input_file, output_dir = ".", tfile_size = 10, convert_off = false, strip_tmarker = false)
+      @parent = parent
+      @fp = nil
+      @input_file = input_file
+      @output_dir = output_dir
+      @tfile_size = tfile_size
+      @convert_off = convert_off
+      @strip_tmarker = strip_tmarker
+    end
+    def file_size(file)
+      origin = Time.now
+      size = 0;  unit = 10485760; star = 0; before = Time.now.to_f
+      error_count = 10
+      while true do
+        begin
+          a = file.read(unit)
+        rescue => e
+          a = nil
+        end
+        break unless a
+        present = Time.now.to_f
+        size += a.size
+        if present - before > 0.3
+          star = 0 if star > 10
+          star += 1
+          before = present
+        end
+      end
+      time_elapsed = Time.now - origin
+      size
+    end
+    # control the display of command line progressbar (or gui which is not available for now)
+    def notify_parent(last = false)
+      @last_time ||= Time.now.to_f
+      @elapsed_sum ||= 0
+      time_now = Time.now.to_f
+      elapsed_from_last = (time_now - @last_time).to_i
+      if elapsed_from_last > 0.3 || last
+        @last_time = time_now
+        @elapsed_sum += elapsed_from_last
+        gvalue = (@size_read.to_f / @infile_size.to_f * 100 * 100).to_i
+        elt_str = sec_to_str(@elapsed_sum)
+        if last
+          eta_str = "00:00:00"
+        else
+          lines_persec = @size_read / @elapsed_sum if @elapsed_sum > 0
+          eta_sec = (@infile_size - @size_read) / lines_persec
+          eta_str = sec_to_str(eta_sec)
+        end
+        @parent.prg_update(gvalue, elt_str, eta_str)
+      end
+    end
+    # check the size of input file (bz2 or plain xml) when uncompressed
+    def prepare
+      # if output_dir is not specified, output in the same directory
+      # as the input file
+      if !@output_dir && @input_file
+        @output_dir = File.dirname(@input_file)
+      end
+      # if input file is bz2 compressed, use the bzip2-ruby gem if available,
+      # and the command line bzip2 program otherwise.
+      if /\.bz2\z/ =~ @input_file
+        unless NO_BZ2
+          file = Bzip2::Reader.new File.open(@input_file, "r:UTF-8")
+          @parent.msg("Preparing ... This may take several minutes or more ", 0)
+          @infile_size = file_size(file)
+          @parent.msg("... Done.", 1)
+          file.close
+          file = Bzip2::Reader.new File.open(@input_file, "r:UTF-8")
+        else
+          if RUBY_PLATFORM.index("win32")
+            file = IO.popen("bunzip2.exe -c #{@input_file}")
+          else
+            file = IO.popen("bzip2 -c -d #{@input_file}")
+          end
+          @infile_size = file_size(file)
+          file.close  # try to reopen since rewind method is unavailable
+          if RUBY_PLATFORM.index("win32")
+            file = IO.popen("bunzip2.exe -c #{@input_file}")
+          else
+            file = IO.popen("bzip2 -c -d #{@input_file}")
+          end
+        end
+      else # meaning that it is a text file
+        @infile_size = File.stat(@input_file).size
+        file = open(@input_file)
+      end
+      #create basename of output file
+      @outfile_base = File.basename(@input_file, ".*") + "-"
+      @total_size = 0
+      @file_index = 1
+      outfilename = File.join(@output_dir, @outfile_base + @file_index.to_s)
+      @outfiles = []
+      @outfiles << outfilename
+      @fp = File.open(outfilename, "w")
+      @parent.before
+      @parent.data_set(@input_file, 100 * 100)
+      @file_pointer = file
+      return true
+    end
+    # read text data from the input file in chunks of 10 megabytes
+    def fill_buffer
+      while true do
+        begin
+          new_lines = @file_pointer.read(10485760)
+        rescue => e
+          return nil
+        end
+        return nil unless new_lines
+        # temp_buf is filled with text split by "\n"
+        temp_buf = []
+        ss = StringScanner.new(new_lines)
+        while ss.scan(/.*?\n/m)
+          temp_buf << ss[0]
+        end
+        temp_buf << ss.rest unless ss.eos?
+        new_first_line = temp_buf.shift
+        if new_first_line[-1, 1] == "\n" # new_first_line.index("\n")
+          @buffer.last <<  new_first_line
+          @buffer << ""
+        else
+          @buffer.last << new_first_line
+        end
+        @buffer += temp_buf unless temp_buf.empty?
+        if @buffer.last[-1, 1] == "\n" # @buffer.last.index("\n")
+          @buffer << ""
+        end
+        break if @buffer.size > 1
+      end
+      return true
+    end
+    def get_newline
+      @buffer ||= [""]
+      if @buffer.size == 1
+        return nil unless fill_buffer
+      end
+      if @buffer.empty?
+        return nil
+      else
+        new_line = @buffer.shift
+        return new_line
+      end
+    end
+    def get_page
+      inside_page = false
+      page = ""
+      while line = get_newline
+        notify_parent
+        @size_read ||=0; @size_read += line.bytesize
+        if /<page>/ =~ line #
+          page << line
+          inside_page = true
+          next
+        elsif  /<\/page>/ =~ line #
+          page << line
+          inside_page = false
+          break
+        end
+        page << line if inside_page
+      end
+      if page.empty?
+        return false
+      else
+        return page.force_encoding("utf-8")
+      end
+    end
+    # call this method to do the job
+    def extract_text(&block)
+      prepare
+      # output the original xml only split to files of the specified size
+      if @convert_off
+        extract
+        # convert xml to plain text
+      else
+        if block
+          extract_and_convert(&block)
+        else
+          extract_and_convert
+        end
+      end
+    end
+    def extract_and_convert(&block)
+      in_text = false
+      in_message = false
+      result_text = ""
+      title = nil
+      end_flag = false
+      terminal_round = false
+      output_text = ""
+      while page = get_page
+        xmlns = '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.5/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.5/ http://www.mediawiki.org/xml/export-0.5.xsd" version="0.5" xml:lang="en">' + "\n"
+        xml = xmlns + page + "</mediawiki>"
+        input = Nokogiri::XML(xml, nil, 'UTF-8')
+        page = input.xpath("//xmlns:text").first
+        pp_title = page.parent.parent.at_css "title"
+        title = pp_title.content
+        next if /\:/ =~ title
+        text = page.content
+        # remove all comment texts
+        # and replace each comment with as many newline
+        # characters as it contained
+        text.gsub!(/\<\!\-\-(.*?)\-\-\>/m) do |content|
+          num_of_newlines = content.count("\n")
+          if num_of_newlines == 0
+            ""
+          else
+            "\n" * num_of_newlines
+          end
+        end
+        @count ||= 0;@count += 1;
+        article = Article.new(text, title, @strip_tmarker)
+        output_text += block.call(article)
+        @total_size = output_text.bytesize
+        # flagged when data exceeds the size of output file
+        end_flag = true if @total_size > (@tfile_size * 1024 * 1024)
+        #close the present file, then open a new one
+        if end_flag
+          @fp.puts(output_text)
+          output_text = ""
+          @total_size = 0
+          end_flag = false
+          @fp.close
+          @file_index += 1
+          outfilename = File.join(@output_dir, @outfile_base + @file_index.to_s)
+          @outfiles << outfilename
+          @fp = File.open(outfilename, "w")
+          next
+        end
+      end
+      @fp.puts(output_text) if output_text != ""
+      notify_parent(true)
+      @parent.after
+      @fp.close
+      rename(@outfiles)
+      @parent.msg("Processing finished", 1)
+    end
+    def extract
+      output_text = ""
+      end_flag = false
+      while text = get_newline
+        @count ||= 0;@count += 1;
+        @size_read ||=0;@size_read += text.bytesize
+        @total_size += text.bytesize
+        output_text << text
+        end_flag = true if @total_size > (@tfile_size * 1024 * 1024)
+        notify_parent
+        # never close the file until the end of the page even if end_flag is on
+        if end_flag && /<\/page/ =~ text
+          @fp.puts(output_text)
+          output_text = ""
+          @total_size = 0
+          end_flag = false
+          @fp.close
+          @file_index += 1
+          outfilename = File.join(@output_dir, @outfile_base + @file_index.to_s)
+          @outfiles << outfilename
+          @fp = File.open(outfilename, "w")
+          next
+        end
+      end
+      @fp.puts(output_text) if output_text != ""
+      notify_parent(true)
+      @parent.after
+      @fp.close
+      rename(@outfiles)
+      @parent.msg("Processing finished", 1)
+    end
+  end
+end
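
Reviewer note, not part of the released package: `fill_buffer` and `get_newline` read the input in fixed-size chunks and must rejoin a line that is split across two chunks. The following is a minimal stand-alone sketch of that idea, not the gem's implementation. The `each_line_chunked` helper is hypothetical, and it uses `String#index` and `String#slice!` rather than the `StringScanner` approach in the diff.

```ruby
require "stringio"

# Hypothetical helper: reads io in chunks of chunk_size bytes and yields
# complete lines, carrying an unfinished trailing line into the next chunk.
def each_line_chunked(io, chunk_size)
  carry = +""
  while (chunk = io.read(chunk_size))
    carry << chunk
    # emit every complete line currently held in the buffer
    while (idx = carry.index("\n"))
      yield carry.slice!(0..idx)
    end
  end
  # whatever remains is a final line without a trailing newline
  yield carry unless carry.empty?
end

io = StringIO.new("<page>\nfirst line\nsecond line\n</page>")
lines = []
each_line_chunked(io, 8) { |line| lines << line }
puts lines.inspect
```

With an 8-byte chunk size, `first line` straddles a chunk boundary and is still yielded as one line. The 10 MB chunk size in the diff serves the same purpose at a larger scale.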