wp2txt 0.9.5 → 1.0.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: bf8270b3488c0045a067f71c155db8d9ac6366a94d825eed9bc6d05c95598345
- data.tar.gz: 8802949a232c60d8b5ae6f93726154f7a6b40436b478919657f58f4bdc54add3
+ metadata.gz: d33a41cf46688679a14eb8c3eb16f6ed33ce9175c7f5b566c9f87998ba2c8401
+ data.tar.gz: 7371e0f7b06b2f0846f01d66f461c7e106778adc6e686919302f0f29b1f80a9e
  SHA512:
- metadata.gz: 0a10804d78c33e035aaf429dd4613f84f3db0c6f22a6c36617a1fda25f03c0fd8fac224ec9e6009ab6ddddb475d73e6eda4c21606f89ef94950bc3749ce4f452
- data.tar.gz: 4a8ea2f0900c6f97d3dcaf6c6387b3543d23962ea064cf1b18a8c293b4664c16fcabca9230aa83b0f5685eacffed74968fe82529566080b7821fd944c7bf275d
+ metadata.gz: cab8d9c27989387acc6dbbe052029d2205508ce10e38b8eedc111c822328d8eba551d603020684cbb3844a87b747f261a5959f711267acd96a3b97ccef4f6834
+ data.tar.gz: 4de59be37d57ef3d14ae2304660e8dde069bdf645a7cff862026562b26327984f1be13840e9d6ec1f25110222367f71c84a0286b649d71fec0c13805c6b0a647
data/README.md CHANGED
@@ -1,104 +1,170 @@
- # WP2TXT
+ <img src='https://raw.githubusercontent.com/yohasebe/wp2txt/master/image/wp2txt-logo.svg' width="400" />

- Wikipedia dump file to text converter that extracts both content and category data
+ A command-line toolkit to extract text content and category data from Wikipedia dump files

  ## About

- WP2TXT extracts plain text data from a Wikipedia dump file (encoded in XML / compressed with Bzip2), removing all MediaWiki markup and other metadata. It was developed for researchers who want easy access to open-source multilingual corpora, but may be used for other purposes as well.
+ WP2TXT extracts text and category data from Wikipedia dump files (encoded in XML / compressed with Bzip2), removing MediaWiki markup and other metadata.

- **UPDATE (July 2022)**: Version 0.9.3 adds a new option `category_only`. When this option is enabled, wp2txt will extract only the title and category information of the article. See output examples below.
+ **UPDATE (August 2022)**

+ 1. A new option `--category-only` has been added. When this option is enabled, only the title and category information of each article are extracted.
+ 2. A new option `--summary-only` has been added. When this option is enabled, only the title, category information, and opening paragraphs of each article are extracted.
+ 3. Text conversion with the current version of WP2TXT is *more than twice as fast* as with the previous version, thanks to parallel processing of multiple files (the exact speedup depends on the number of CPU cores used).
+
+ ## Screenshot
+
+ <img src='https://raw.githubusercontent.com/yohasebe/wp2txt/master/image/screenshot.png' width="700" />
+
+ **Environment**
+
+ - WP2TXT 1.0.1
+ - MacBook Pro (2021 Apple M1 Pro)
+ - enwiki-20220720-pages-articles.xml.bz2 (19.98 GB)
+
+ In the above environment, the entire process (decompression, splitting, extraction, and conversion) needed to obtain the plain text data of the English Wikipedia takes less than 1.5 hours.
 
  ## Features

- * Converts Wikipedia dump files in various languages
- * Creates output files of specified size
- * Can specify text elements to be extracted and converted (page titles, section titles, lists, tables)
- * Can extract category information for each article
+ - Converts Wikipedia dump files in various languages
+ - Creates output files of specified size
+ - Allows specifying text elements (page titles, section headers, paragraphs, list items) to be extracted
+ - Allows extracting category information of each article
+ - Allows extracting opening paragraphs of each article
+
+ ## Preparation
+
+ ### For macOS / Linux / WSL2
+
+ WP2TXT requires that one of the following commands be installed on the system in order to decompress `bz2` files:

+ - `lbzip2` (recommended)
+ - `pbzip2`
+ - `bzip2`
+
+ In most cases, the `bzip2` command is pre-installed on the system. However, since `lbzip2` can use multiple CPU cores and is faster than `bzip2`, it is recommended that you install it additionally. WP2TXT will attempt to find the decompression command available on your system in the order listed above.
+
+ If you are using macOS with Homebrew installed, you can install `lbzip2` with the following command:
+
+     $ brew install lbzip2
+
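To check which of these decompression commands are already available on your system (a quick shell check, not a WP2TXT feature), you can run:

    $ which lbzip2 pbzip2 bzip2

Each command that is found is printed with its full path.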
+ ### For Windows
+
+ Install [Bzip2 for Windows](http://gnuwin32.sourceforge.net/packages/bzip2.htm) and set the path so that WP2TXT can use the `bunzip2.exe` command. Alternatively, you can decompress the Wikipedia dump file in your own way and process the resulting XML file with WP2TXT.
 
  ## Installation

+ ### WP2TXT command
+
  $ gem install wp2txt

- ## Usage
+ ## Wikipedia Dump File
+
+ Download the latest Wikipedia dump file for the desired language at a URL such as:
+
+     https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

- Obtain a Wikipedia dump file (from [here](http://dumps.wikimedia.org/backup-index.html)) with a file name such as:
+ Here, `enwiki` refers to the English Wikipedia. To get the Japanese Wikipedia dump file, for instance, change this to `jawiki`. In doing so, note that there are two instances of `enwiki` in the URL above.
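For example, assuming the same URL pattern holds, the latest Japanese dump would be:

    https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2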
 
28
- > `xxwiki-yyyymmdd-pages-articles.xml.bz2`
69
+ Alternatively, you can also select Wikipedia dump files created on a specific date from [here](http://dumps.wikimedia.org/backup-index.html). Make sure to download a file named in the following format:
29
70
 
30
- where `xx` is language code such as "en (English)" or "ja (Japanese)", and `yyyymmdd` is the date of creation (e.g. 20220720).
71
+ xxwiki-yyyymmdd-pages-articles.xml.bz2
31
72
 
32
- ### Example 1: Basic
73
+ where `xx` is language code such as `en` (English)" or `ja` (japanese), and `yyyymmdd` is the date of creation (e.g. `20220801`).
33
74
 
- The following extracts text data, including list items and excluding tables.
+ ## Basic Usage

-     $ wp2txt -i xxwiki-yyyymmdd-pages-articles.xml.bz2 -o /output_dir
+ Suppose you have a folder containing a Wikipedia dump file and empty subfolders organized as follows:

- - [Output example (English)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en.txt)
- - [Output example (Japanese)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja.txt)
+ ```
+ .
+ ├── enwiki-20220801-pages-articles.xml.bz2
+ ├── /xml
+ ├── /text
+ ├── /category
+ └── /summary
+ ```
 
- ### Example 2: Title and category information only
+ ### Decompress and Split

- The following will extract only article titles and the categories to which each article belongs:
+ The following command will decompress the entire Wikipedia data set and split it into many small (approximately 10 MB) XML files:

-     $ wp2txt --category-only -i xxwiki-yyyymmdd-pages-articles.xml.bz2 -o /output_dir
+     $ wp2txt --no-convert -i ./enwiki-20220801-pages-articles.xml.bz2 -o ./xml

- Each line of the output data contains the title and the categories of an article:
+ **Note**: The resulting files are not well-formed XML. They contain parts of the original XML extracted from the Wikipedia dump file, with care taken to ensure that the content within a `<page>` tag is never split across multiple files.
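As a rough sketch of the split-file layout (assumed from the note above, not verbatim WP2TXT output), each split file holds a run of complete `<page>` elements without the enclosing root element of the original dump:

    <page>
      <title>...</title>
      ...
    </page>
    <page>
      ...
    </page>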
 
- > title `TAB` category1`,` category2`,` category3`,` ...
+ ### Extract plain text from MediaWiki XML

- - [Output example (English)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en_categories.txt)
- - [Output example (Japanese)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja_categories.txt)
+     $ wp2txt -i ./xml -o ./text

- ### Example 3: Title, category, and summary text only

- The following will extract only article titles, the categories to which each article belongs, and text blocks before the first heading of the article:
+ ### Extract only category info from MediaWiki XML

-     $ wp2txt --summary-only -i xxwiki-yyyymmdd-pages-articles.xml.bz2 -o /output_dir
+     $ wp2txt -g -i ./xml -o ./category

- - [Output example (English)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en_summary.txt)
- - [Output example (Japanese)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja_summary.txt)
+ ### Extract opening paragraphs from MediaWiki XML

+     $ wp2txt -s -i ./xml -o ./summary
 
- ## Options
+ ### Extract directly from bz2 compressed file
+
+ It is possible (though not recommended) to 1) decompress the dump file, 2) split the data into files, and 3) extract the text, all with a single command. You can automatically remove all the intermediate XML files with the `-x` option.
+
+     $ wp2txt -i ./enwiki-20220801-pages-articles.xml.bz2 -o ./text -x
+
+ ## Sample Output
+
+ Output containing title, category info, and paragraphs:
+
+     $ wp2txt -i ./input -o ./output
+
+ - [English Wikipedia](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en.txt)
+ - [Japanese Wikipedia](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja.txt)
+
+ Output containing title and category only:
+
+     $ wp2txt -g -i ./input -o ./output
+
+ - [English Wikipedia](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en_category.txt)
+ - [Japanese Wikipedia](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja_category.txt)
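Per the format documented in the previous version of this README, each line of the category output pairs an article title with its comma-separated categories:

    title `TAB` category1, category2, category3, ...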
+
+ Output containing title, category, and summary:
+
+     $ wp2txt -s -i ./input -o ./output
+
+ - [English Wikipedia](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en_summary.txt)
+ - [Japanese Wikipedia](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja_summary.txt)
+
+ ## Command Line Options
 
  Command line options are as follows:

  Usage: wp2txt [options]
  where [options] are:
- --input-file, -i: Wikipedia dump file with .bz2 (compressed) or
-                   .txt (uncompressed) format
- --output-dir, -o <s>: Output directory (default: current directory)
- --convert, --no-convert, -c: Output in plain text (converting from XML)
-                   (default: true)
- --list, --no-list, -l: Show list items in output (default: true)
- --heading, --no-heading, -d: Show section titles in output (default: true)
- --title, --no-title, -t: Show page titles in output (default: true)
- --table, -a: Show table source code in output (default: false)
- --inline, -n: leave inline template notations unmodified (default: false)
- --multiline, -m: leave multiline template notations unmodified (default: false)
- --ref, -r: leave reference notations in the format (default: false)
-                   [ref]...[/ref]
- --redirect, -e: Show redirect destination (default: false)
- --marker, --no-marker, -k: Show symbols prefixed to list items,
-                   definitions, etc. (default: true)
- --category, -g: Show article category information (default: true)
- --category-only, -y: Extract only article title and categories (default: false)
- -s, --summary-only: Extract only article title, categories, and summary text before first heading
- --file-size, -f <i>: Approximate size (in MB) of each output file
-                   (default: 10)
- -u, --num-threads=<i>: Number of threads to be spawned (capped to the number of CPU cores;
-                   set 99 to spawn max num of threads) (default: 4)
- --version, -v: Print version and exit
- --help, -h: Show this message
+ -i, --input                    Path to compressed file (bz2) or decompressed file (xml), or path to directory containing files of the latter format
+ -o, --output-dir=<s>           Path to output directory
+ -c, --convert, --no-convert    Output in plain text (converting from XML) (default: true)
+ -a, --category, --no-category  Show article category information (default: true)
+ -g, --category-only            Extract only article title and categories
+ -s, --summary-only             Extract only article title, categories, and summary text before first heading
+ -f, --file-size=<i>            Approximate size (in MB) of each output file (default: 10)
+ -n, --num-procs                Number of processes to be run concurrently (default: max num of available CPU cores minus two)
+ -x, --del-interfile            Delete intermediate XML files from output dir
+ -t, --title, --no-title        Keep page titles in output (default: true)
+ -d, --heading, --no-heading    Keep section titles in output (default: true)
+ -l, --list                     Keep unprocessed list items in output
+ -r, --ref                      Keep reference notations in the format [ref]...[/ref]
+ -e, --redirect                 Show redirect destination
+ -m, --marker, --no-marker      Show symbols prefixed to list items, definitions, etc. (default: true)
+ -b, --bz2-gem                  Use Ruby's bzip2-ruby gem instead of a system command
+ -v, --version                  Print version and exit
+ -h, --help                     Show this message
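As a usage sketch combining several of these options (the paths here are hypothetical), the following would convert pre-split XML files with four worker processes and 20 MB output files, suppressing section titles and list-item markers:

    $ wp2txt -i ./xml -o ./text -f 20 -n 4 --no-heading --no-marker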
 
  ## Caveats

- * Some data, such as mathematical formulas and computer source code, will not be converted correctly.
+ * Some data, such as mathematical formulas and computer source code, will not be converted correctly.
  * Some text data may not be extracted correctly for various reasons (incorrect matching of begin/end tags, language-specific formatting rules, etc.).
  * The conversion process can take longer than expected. When dealing with a huge data set such as the English Wikipedia on a low-spec environment, it can take several hours or more.
- * WP2TXT, by the nature of its task, requires a lot of machine power and consumes a large amount of memory/storage resources. Therefore, there is a possibility that the process may stop unexpectedly. In the worst case, the process may even freeze without terminating successfully. Please understand this and use at your own risk.

  ## Useful Links
 
@@ -115,6 +181,17 @@ The author will appreciate your mentioning one of these in your research.
  * Yoichiro HASEBE. 2006. [Method for using Wikipedia as Japanese corpus.](http://ci.nii.ac.jp/naid/110006226727) _Doshisha Studies in Language and Culture_ 9(2), 373-403.
  * 長谷部陽一郎. 2006. [Wikipedia日本語版をコーパスとして用いた言語研究の手法](http://ci.nii.ac.jp/naid/110006226727). 『言語文化』9(2), 373-403.

+ Or use this BibTeX entry:
+
+ ```
+ @misc{WP2TXT_2022,
+   author = {Yoichiro Hasebe},
+   title = {WP2TXT: A command-line toolkit to extract text content and category data from Wikipedia dump files},
+   url = {https://github.com/yohasebe/wp2txt},
+   year = {2022},
+ }
+ ```
+
  ## License

  This software is distributed under the MIT License. Please see the LICENSE file.
data/bin/wp2txt CHANGED
@@ -11,133 +11,187 @@ DOCDIR = File.join(File.dirname(__FILE__), '..', 'doc')
  require 'wp2txt'
  require 'wp2txt/utils'
  require 'wp2txt/version'
+ require 'etc'
  require 'optimist'
+ require 'parallel'
+ require 'pastel'
+ require 'tty-spinner'

  include Wp2txt

  opts = Optimist::options do
-   version Wp2txt::VERSION
-   banner <<-EOS
+   version Wp2txt::VERSION
+   banner <<-EOS
  WP2TXT extracts plain text data from Wikipedia dump file (encoded in XML/compressed with Bzip2) stripping all the MediaWiki markups and other metadata.

  Usage: wp2txt [options]
  where [options] are:
  EOS

-   opt :input_file, "Wikipedia dump file with .bz2 (compressed) or .txt (uncompressed) format", :required => true
-   opt :output_dir, "Output directory", :default => Dir::pwd, :type => String
-   opt :convert, "Output in plain text (converting from XML)", :default => true
-   opt :list, "Show list items in output", :default => false
-   opt :heading, "Show section titles in output", :default => true, :short => "-d"
-   opt :title, "Show page titles in output", :default => true
-   opt :table, "Show table source code in output", :default => false
-   opt :inline, "leave inline template notations as they are", :default => false
-   opt :multiline, "leave multiline template notations as they are", :default => false
-   opt :ref, "leave reference notations in the format [ref]...[/ref]", :default => false
-   opt :redirect, "Show redirect destination", :default => false
-   opt :marker, "Show symbols prefixed to list items, definitions, etc.", :default => true
-   opt :category, "Show article category information", :default => true
-   opt :category_only, "Extract only article title and categories", :default => false
-   opt :summary_only, "Extract only article title, categories, and summary text before first heading", :default => false
-   opt :file_size, "Approximate size (in MB) of each output file", :default => 10
-   opt :num_threads, "Number of threads to be spawned (capped to the number of CPU cores; set 99 to spawn max num of threads)", :default => 4
+   opt :input, "Path to compressed file (bz2) or decompressed file (xml), or path to directory containing files of the latter format", :required => true, :short => "-i"
+   opt :output_dir, "Path to output directory", :default => Dir::pwd, :type => String, :short => "-o"
+   opt :convert, "Output in plain text (converting from XML)", :default => true, :short => "-c"
+   opt :category, "Show article category information", :default => true, :short => "-a"
+   opt :category_only, "Extract only article title and categories", :default => false, :short => "-g"
+   opt :summary_only, "Extract only article title, categories, and summary text before first heading", :default => false, :short => "-s"
+   opt :file_size, "Approximate size (in MB) of each output file", :default => 10, :short => "-f"
+   opt :num_procs, "Number of processes to be run concurrently (default: max num of CPU cores minus two)", :short => "-n"
+   opt :del_interfile, "Delete intermediate XML files from output dir", :short => "-x", :default => false
+   opt :title, "Keep page titles in output", :default => true, :short => "-t"
+   opt :heading, "Keep section titles in output", :default => true, :short => "-d"
+   opt :list, "Keep unprocessed list items in output", :default => false, :short => "-l"
+   opt :ref, "Keep reference notations in the format [ref]...[/ref]", :default => false, :short => "-r"
+   opt :redirect, "Show redirect destination", :default => false, :short => "-e"
+   opt :marker, "Show symbols prefixed to list items, definitions, etc.", :default => true, :short => "-m"
+   opt :bz2_gem, "Use Ruby's bzip2-ruby gem instead of a system command", :default => false, :short => "-b"
  end
+
  Optimist::die :size, "must be larger than 0" unless opts[:file_size] >= 0
  Optimist::die :output_dir, "must exist" unless File.exist?(opts[:output_dir])

+ pastel = Pastel.new
+
  input_file = ARGV[0]
  output_dir = opts[:output_dir]
  tfile_size = opts[:file_size]
- num_threads = opts[:num_threads]
+ num_processors = Etc.nprocessors
+ if opts[:num_procs] && opts[:num_procs].to_i <= num_processors
+   num_processes = opts[:num_procs]
+ else
+   num_processes = num_processors - 2
+ end
+ num_processes = 1 if num_processes < 1
+
  convert = opts[:convert]
  strip_tmarker = opts[:marker] ? false : true
- opt_array = [:title, :list, :heading, :table, :redirect, :multiline, :category, :category_only, :summary_only]
+ opt_array = [:title,
+              :list,
+              :heading,
+              :table,
+              :redirect,
+              :multiline,
+              :category,
+              :category_only,
+              :summary_only,
+              :del_interfile,
+              :bz2_gem]
+
  $leave_inline_template = true if opts[:inline]
  $leave_ref = true if opts[:ref]
+
  config = {}
  opt_array.each do |opt|
    config[opt] = opts[opt]
  end
 
- parent = Wp2txt::CmdProgbar.new
- wpconv = Wp2txt::Runner.new(parent, input_file, output_dir, tfile_size, num_threads, convert, strip_tmarker)
-
- wpconv.extract_text do |article|
-   format_wiki!(article.title)
-
-   if config[:category_only]
-     title = "#{article.title}\t"
-     contents = article.categories.join(", ")
-     contents << "\n"
-   elsif config[:category] && !article.categories.empty?
-     title = "\n[[#{article.title}]]\n\n"
-     contents = "\nCATEGORIES: "
-     contents << article.categories.join(", ")
-     contents << "\n\n"
-   else
-     title = "\n[[#{article.title}]]\n\n"
-     contents = ""
-   end
+ if File::ftype(input_file) == "directory"
+   input_files = Dir.glob("#{input_file}/*.xml")
+ else
+   puts ""
+   puts pastel.green.bold("Preprocessing")
+   puts "Decompressing and splitting the original dump file."
+   puts pastel.underline("This may take a while. Please be patient!")
+
+   time_start = Time.now.to_i
+   wpsplitter = Wp2txt::Splitter.new(input_file, output_dir, tfile_size)
+   spinner = TTY::Spinner.new(":spinner", format: :arrow_pulse, hide_cursor: true, interval: 5)
+   spinner.auto_spin
+   wpsplitter.split_file
+   time_finish = Time.now.to_i
+
+   spinner.stop("Time: #{sec_to_str(time_finish - time_start)}") # Stop animation
+   puts pastel.blue.bold("Complete!")
+   exit if !convert
+   input_files = Dir.glob("#{output_dir}/*.xml")
+ end
+
+ puts ""
+ puts pastel.red.bold("Converting")
+ puts "Number of files being processed: " + pastel.bold("#{input_files.size}")
+ puts "Number of CPU cores being used: " + pastel.bold("#{num_processes}")
+
+ Parallel.map(input_files, progress: pastel.magenta.bold("WP2TXT"), in_processes: num_processes) do |input_file|
+   wpconv = Wp2txt::Runner.new(input_file, output_dir, strip_tmarker, config[:del_interfile])
+   wpconv.extract_text do |article|
+     format_wiki!(article.title)
+
+     if config[:category_only]
+       title = "#{article.title}\t"
+       contents = article.categories.join(", ")
+       contents << "\n"
+     elsif config[:category] && !article.categories.empty?
+       title = "\n[[#{article.title}]]\n\n"
+       contents = "\nCATEGORIES: "
+       contents << article.categories.join(", ")
+       contents << "\n\n"
+     else
+       title = "\n[[#{article.title}]]\n\n"
+       contents = ""
+     end

-   unless config[:category_only]
-     article.elements.each do |e|
-       case e.first
-       when :mw_heading
-         break if config[:summary_only]
-         next if !config[:heading]
-         format_wiki!(e.last)
-         line = e.last
-         line << "+HEADING+" if $DEBUG_MODE
-       when :mw_paragraph
-         format_wiki!(e.last)
-         line = e.last + "\n"
-         line << "+PARAGRAPH+" if $DEBUG_MODE
-       when :mw_table, :mw_htable
-         next if !config[:table]
-         line = e.last
-         line << "+TABLE+" if $DEBUG_MODE
-       when :mw_pre
-         next if !config[:pre]
-         line = e.last
-         line << "+PRE+" if $DEBUG_MODE
-       when :mw_quote
-         line = e.last
-         line << "+QUOTE+" if $DEBUG_MODE
-       when :mw_unordered, :mw_ordered, :mw_definition
-         next if !config[:list]
-         line = e.last
-         line << "+LIST+" if $DEBUG_MODE
-       when :mw_ml_template
-         next if !config[:multiline]
-         line = e.last
-         line << "+MLTEMPLATE+" if $DEBUG_MODE
-       when :mw_redirect
-         next if !config[:redirect]
-         line = e.last
-         line << "+REDIRECT+" if $DEBUG_MODE
-         line << "\n\n"
-       when :mw_isolated_template
-         next if !config[:multiline]
-         line = e.last
-         line << "+ISOLATED_TEMPLATE+" if $DEBUG_MODE
-       when :mw_isolated_tag
-         next
-       else
-         if $DEBUG_MODE
-           # format_wiki!(e.last)
-           line = e.last
-           line << "+OTHER+"
-         else
-           next
-         end
-       end
-       contents << line << "\n"
-     end
-   end
-
-   if /\A[\s ]*\z/m =~ contents
-     result = ""
-   else
-     result = config[:title] ? title << contents : contents
-   end
- end
+     unless config[:category_only]
+       article.elements.each do |e|
+         case e.first
+         when :mw_heading
+           break if config[:summary_only]
+           next if !config[:heading]
+           format_wiki!(e.last)
+           line = e.last
+           line << "+HEADING+" if $DEBUG_MODE
+         when :mw_paragraph
+           format_wiki!(e.last)
+           line = e.last + "\n"
+           line << "+PARAGRAPH+" if $DEBUG_MODE
+         when :mw_table, :mw_htable
+           next if !config[:table]
+           line = e.last
+           line << "+TABLE+" if $DEBUG_MODE
+         when :mw_pre
+           next if !config[:pre]
+           line = e.last
+           line << "+PRE+" if $DEBUG_MODE
+         when :mw_quote
+           line = e.last
+           line << "+QUOTE+" if $DEBUG_MODE
+         when :mw_unordered, :mw_ordered, :mw_definition
+           next if !config[:list]
+           line = e.last
+           line << "+LIST+" if $DEBUG_MODE
+         when :mw_ml_template
+           next if !config[:multiline]
+           line = e.last
+           line << "+MLTEMPLATE+" if $DEBUG_MODE
+         when :mw_redirect
+           next if !config[:redirect]
+           line = e.last
+           line << "+REDIRECT+" if $DEBUG_MODE
+           line << "\n\n"
+         when :mw_isolated_template
+           next if !config[:multiline]
+           line = e.last
+           line << "+ISOLATED_TEMPLATE+" if $DEBUG_MODE
+         when :mw_isolated_tag
+           next
+         else
+           if $DEBUG_MODE
+             # format_wiki!(e.last)
+             line = e.last
+             line << "+OTHER+"
+           else
+             next
+           end
+         end
+         contents << line << "\n"
+       end
+     end
+
+     if /\A[\s ]*\z/m =~ contents
+       result = ""
+     else
+       result = config[:title] ? title << contents : contents
+     end
+   end
+ end
+
+ puts pastel.blue.bold("Complete!")