wp2txt 0.9.4 → 1.0.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 50f291332872b0e3cd0b651662d7494ec9edd823fdb6ba6a928f501a37ea06c3
- data.tar.gz: ec4891f6a30c7bc2f8f0a6fd3ec56618c9f706ea277207e7f955347417959f7e
+ metadata.gz: a15462742cc2912a4dca9e0e4e42e90af4b8f9e09ea29584da94946d0a563872
+ data.tar.gz: 0c63c91b90883b4ed69199ef569c7bd467aece538bb1de1f8e7d632e710d6964
  SHA512:
- metadata.gz: afa3770c47bc25252993bfddf6da6e99a7bca87d4d899b3f8ce44d8a6298d29a19ce06fe9b64166316a31672a76d7d4530887e77d98212bc8f17a350c0e1598a
- data.tar.gz: ef6f5b11b8a7d2ae5eeb640b0f2319bea9ee1209b0ab1dd78833f3cde41149fb8468871d90ba75b00c62876fccfa5c5f7cca6fc2420d4769c3c35a7bd9aa8786
+ metadata.gz: 22f5c61c0ff6d11cd2c0155ad77940e9b618aea1354826a7b8fc5155289b42daff159be6c48f3f038c8df08753731cad623561cbd8055a10a12ce7feae0566ca
+ data.tar.gz: 9b286a09211576f5a397e3e2e46fefbedbf9e95d200f3393b030ede106c9b543fb800c73d3d958ddc5dccad1ba2a30f0b99700af05eef88b142e90c8603e9699
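These digests cover the metadata.gz and data.tar.gz members packaged inside the gem. One way to reproduce them locally, assuming GNU tar and coreutils are available and the gem is saved as wp2txt-1.0.0.gem:

    $ gem fetch wp2txt -v 1.0.0
    $ tar -xf wp2txt-1.0.0.gem metadata.gz data.tar.gz
    $ sha256sum metadata.gz data.tar.gz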
data/.gitignore CHANGED
@@ -18,3 +18,4 @@ tmp
  .DS_Store
  *.bak
  *.~
+
data/README.md CHANGED
@@ -1,103 +1,161 @@
- # WP2TXT
+ <img src='https://raw.githubusercontent.com/yohasebe/wp2txt/master/image/wp2txt-logo.svg' width="400" />

- Wikipedia dump file to text converter that extracts both content and category data
+ Text conversion tool to extract content and category data from Wikipedia dump files

  ## About

- WP2TXT extracts plain text data from Wikipedia dump file (encoded in XML/compressed with Bzip2) stripping all the MediaWiki markups and other metadata. It is originally intended to be useful for researchers who look for an easy way to obtain open-source multi-lingual corpora, but may be handy for other purposes.
+ WP2TXT extracts plain text data from Wikipedia dump files (encoded in XML / compressed with Bzip2), removing all MediaWiki markup and other metadata.

- **UPDATE (July 2022)**: Version 0.9.3 has added a new option `category_only`. With this option enabled, wp2txt extracts article title and category info only. Please see output examples below.
+ **UPDATE (August 2022)**
+
+ 1. A new option `--category-only` has been added. When this option is enabled, only the title and category information of the article is extracted.
+ 2. A new option `--summary-only` has been added. When this option is enabled, only the title and the text of the opening paragraphs of the article (= summary) are extracted.
+ 3. The current WP2TXT is *several times faster* than the previous version thanks to parallel processing of multiple files (the rate of speedup depends on the number of CPU cores used).
+
+ ## Screenshot
+
+ <img src='https://raw.githubusercontent.com/yohasebe/wp2txt/master/image/screenshot.png' width="700" />
+
+ - WP2TXT 1.0.0
+ - MacBook Pro (2019) 2.3 GHz 8-Core Intel Core i9
+ - enwiki-20220802-pages-articles.xml.bz2 (approx. 20GB)
+
+ In the above environment, the process of obtaining the plain text data of the English Wikipedia (decompression, splitting, extraction, and conversion) takes a little over two hours.

  ## Features

- * Convert dump files of Wikipedia of various languages
- * Create output files of specified size.
- * Allow users to specify text elements to be extracted/converted (page titles, section titles, lists, and tables)
- * Extract category information of each article
+ - Converts Wikipedia dump files in various languages
+ - Creates output files of specified size
+ - Allows specifying text elements (page titles, section headers, paragraphs, list items) to be extracted
+ - Allows extracting category information of the article
+ - Allows extracting opening paragraphs of the article

  ## Installation

  $ gem install wp2txt

- ## Usage
+ ## Preparation
+
+ First, download the latest Wikipedia dump file for the language of your choice:
+
+ https://dumps.wikimedia.org/xxwiki/latest/xxwiki-latest-pages-articles.xml.bz2

- Obtain a Wikipedia dump file (from [here](http://dumps.wikimedia.org/backup-index.html)) with a file name such as:
+ where `xx` is a language code such as `en` (English) or `zh` (Chinese). Change it to `ja`, for instance, if you want the latest Japanese Wikipedia dump file.
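For example, assuming you want the latest English dump, it can be fetched with a standard downloader such as curl:

    $ curl -O https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2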
+
+ Alternatively, you can select a Wikipedia dump file created on a specific date from [here](http://dumps.wikimedia.org/backup-index.html). Make sure to download a file named in the following format:

  xxwiki-yyyymmdd-pages-articles.xml.bz2

- where `xx` is language code such as "en (English)" or "ja (Japanese)", and `yyyymmdd` is the date of creation (e.g. 20220720).
+ where `xx` is a language code such as `en` (English) or `ko` (Korean), and `yyyymmdd` is the date of creation (e.g. `20220801`).
+
+ ## Basic Usage
+
+ Suppose you have a folder containing a Wikipedia dump file and empty subfolders organized as follows:
+
+ ```
+ .
+ ├── enwiki-20220801-pages-articles.xml.bz2
+ ├── /xml
+ ├── /text
+ ├── /category
+ └── /summary
+ ```
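The empty subfolders can be created in advance, for example:

    $ mkdir xml text category summary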
+
+ ### Decompress and Split
+
+ The following command will decompress the entire Wikipedia data and split it into many small (approximately 10 MB) XML files.
+
+ $ wp2txt --no-convert -i ./enwiki-20220801-pages-articles.xml.bz2 -o ./xml
+
+ **Note**: The resulting files are not well-formed XML. They contain parts of the original XML extracted from the Wikipedia dump file, split so that the content within a `<page>` tag is never divided across files.
+
+ ### Extract plain text from MediaWiki XML
+
+ $ wp2txt -i ./xml -o ./text
+
+
+ ### Extract only category info from MediaWiki XML
+
+ $ wp2txt -g -i ./xml -o ./category
+
+ ### Extract opening paragraphs from MediaWiki XML
+
+ $ wp2txt -s -i ./xml -o ./summary
+
+ ### Extract directly from bz2 compressed file
+
+ It is possible (though not recommended) to 1) decompress the dump file, 2) split the data into files, and 3) extract the text with a single command. You can automatically remove all the intermediate XML files with the `-x` option.
+
+ $ wp2txt -i ./enwiki-20220801-pages-articles.xml.bz2 -o ./text -x
+
+ ## Sample Output
+
+ Output containing title, category info, and paragraphs

- ### Example 1
+ $ wp2txt -i ./input -o /output

- The following extracts text data, including list items and excluding tables.
+ - [English Wikipedia](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en.txt)
+ - [Japanese Wikipedia](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja.txt)

- $ wp2txt -i xxwiki-yyyymmdd-pages-articles.xml.bz2 -o /output_dir
+ Output containing title and category only

- - [Output example (English)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en.txt)
- - [Output example (Japanese)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja.txt)
+ $ wp2txt -g -i ./input -o /output

- ### Example 2
+ - [English Wikipedia](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en_category.txt)
+ - [Japanese Wikipedia](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja_category.txt)

- The following will extract only article titles and the categories to which each article belongs:
+ Output containing title, category, and summary

- $ wp2txt --category-only -i xxwiki-yyyymmdd-pages-articles.xml.bz2 -o /output_dir
+ $ wp2txt -s -i ./input -o /output

- - [Output example (English)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en_categories.txt)
- - [Output example (Japanese)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja_categories.txt)
+ - [English Wikipedia](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en_summary.txt)
+ - [Japanese Wikipedia](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja_summary.txt)

- ## Options
+ ## Command Line Options

  Command line options are as follows:

  Usage: wp2txt [options]
  where [options] are:
- --input-file, -i: Wikipedia dump file with .bz2 (compressed) or
- .txt (uncompressed) format
- --output-dir, -o <s>: Output directory (default: current directory)
- --convert, --no-convert, -c: Output in plain text (converting from XML)
- (default: true)
- --list, --no-list, -l: Show list items in output (default: true)
- --heading, --no-heading, -d: Show section titles in output (default: true)
- --title, --no-title, -t: Show page titles in output (default: true)
- --table, -a: Show table source code in output (default: false)
- --inline, -n: leave inline template notations unmodified (default: false)
- --multiline, -m: leave multiline template notations unmodified (default: false)
- --ref, -r: leave reference notations in the format (default: false)
- [ref]...[/ref]
- --redirect, -e: Show redirect destination (default: false)
- --marker, --no-marker, -k: Show symbols prefixed to list items,
- definitions, etc. (Default: true)
- --category, -g: Show article category information (default: true)
- --category-only, -y: Extract only article title and categories (default: false)
- --file-size, -f <i>: Approximate size (in MB) of each output file
- (default: 10)
- -u, --num-threads=<i>: Number of threads to be spawned (capped to the number of CPU cores;
- set 99 to spawn max num of threads) (default: 4)
- --version, -v: Print version and exit
- --help, -h: Show this message
+ -i, --input Path to compressed file (bz2) or decompressed file (xml), or path to directory containing files of the latter format
+ -o, --output-dir=<s> Path to output directory
+ -c, --convert, --no-convert Output in plain text (converting from XML) (default: true)
+ -a, --category, --no-category Show article category information (default: true)
+ -g, --category-only Extract only article title and categories
+ -s, --summary-only Extract only article title, categories, and summary text before first heading
+ -f, --file-size=<i> Approximate size (in MB) of each output file (default: 10)
+ -n, --num-procs Number of processes to be run concurrently (default: max num of CPU cores minus two)
+ -x, --del-interfile Delete intermediate XML files from output dir
+ -t, --title, --no-title Keep page titles in output (default: true)
+ -d, --heading, --no-heading Keep section titles in output (default: true)
+ -l, --list Keep unprocessed list items in output
+ -r, --ref Keep reference notations in the format [ref]...[/ref]
+ -e, --redirect Show redirect destination
+ -m, --marker, --no-marker Show symbols prefixed to list items, definitions, etc. (default: true)
+ -v, --version Print version and exit
+ -h, --help Show this message
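For example, to convert the split XML files while capping WP2TXT at four concurrent worker processes (an illustrative combination of the options above):

    $ wp2txt -i ./xml -o ./text -n 4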

  ## Caveats

- * Certain types of data such as mathematical equations and computer source code are not be properly converted. Please remember this software is originally intended for correcting “sentences” for linguistic studies.
- * Extraction of normal text data could sometimes fail for various reasons (e.g. illegal matching of begin/end tags, language-specific conventions of formatting, etc).
- * Conversion process can take far more than you would expect. It could take several hours or more when dealing with a huge data set such as the English Wikipedia on a low-spec environments.
- * Because of nature of the task, WP2TXT needs much machine power and consumes a lot of memory/storage resources. The process thus could halt unexpectedly. It may even get stuck, in the worst case, without getting gracefully terminated. Please understand this and use the software __at your own risk__.
+ * Some data, such as mathematical formulas and computer source code, will not be converted correctly.
+ * Some text data may not be extracted correctly for various reasons (incorrect matching of begin/end tags, language-specific formatting rules, etc.).
+ * The conversion process can take longer than expected. When dealing with a huge data set such as the English Wikipedia on a low-spec environment, it can take several hours or more.

- ### Useful Links
+ ## Useful Links

  * [Wikipedia Database backup dumps](http://dumps.wikimedia.org/backup-index.html)

- ### Author
+ ## Author

  * Yoichiro Hasebe (<yohasebe@gmail.com>)

- ### References
+ ## References

  The author will appreciate your mentioning one of these in your research.

  * Yoichiro HASEBE. 2006. [Method for using Wikipedia as Japanese corpus.](http://ci.nii.ac.jp/naid/110006226727) _Doshisha Studies in Language and Culture_ 9(2), 373-403.
  * 長谷部陽一郎. 2006. [Wikipedia日本語版をコーパスとして用いた言語研究の手法](http://ci.nii.ac.jp/naid/110006226727). 『言語文化』9(2), 373-403.

- ### License
+ ## License

  This software is distributed under the MIT License. Please see the LICENSE file.
data/bin/wp2txt CHANGED
@@ -11,132 +11,181 @@ DOCDIR = File.join(File.dirname(__FILE__), '..', 'doc')
  require 'wp2txt'
  require 'wp2txt/utils'
  require 'wp2txt/version'
+ require 'etc'
  require 'optimist'
+ require 'parallel'
+ require 'pastel'
+ require 'tty-spinner'

  include Wp2txt

  opts = Optimist::options do
- version Wp2txt::VERSION
- banner <<-EOS
+ version Wp2txt::VERSION
+ banner <<-EOS
  WP2TXT extracts plain text data from Wikipedia dump file (encoded in XML/compressed with Bzip2) stripping all the MediaWiki markups and other metadata.

  Usage: wp2txt [options]
  where [options] are:
  EOS

- opt :input_file, "Wikipedia dump file with .bz2 (compressed) or .txt (uncompressed) format", :required => true
- opt :output_dir, "Output directory", :default => Dir::pwd, :type => String
- opt :convert, "Output in plain text (converting from XML)", :default => true
- opt :list, "Show list items in output", :default => true
- opt :heading, "Show section titles in output", :default => true, :short => "-d"
- opt :title, "Show page titles in output", :default => true
- opt :table, "Show table source code in output", :default => false
- opt :inline, "leave inline template notations as they are", :default => false
- opt :multiline, "leave multiline template notations as they are", :default => false
- opt :ref, "leave reference notations in the format [ref]...[/ref]", :default => false
- opt :redirect, "Show redirect destination", :default => false
- opt :marker, "Show symbols prefixed to list items, definitions, etc.", :default => true
- opt :category, "Show article category information", :default => true
- opt :category_only, "Extract only article title and categories", :default => false
- opt :file_size, "Approximate size (in MB) of each output file", :default => 10
- opt :num_threads, "Number of threads to be spawned (capped to the number of CPU cores; set 99 to spawn max num of threads)", :default => 4
+ opt :input, "Path to compressed file (bz2) or decompressed file (xml), or path to directory containing files of the latter format", :required => true, :short => "-i"
+ opt :output_dir, "Path to output directory", :default => Dir::pwd, :type => String, :short => "-o"
+ opt :convert, "Output in plain text (converting from XML)", :default => true, :short => "-c"
+ opt :category, "Show article category information", :default => true, :short => "-a"
+ opt :category_only, "Extract only article title and categories", :default => false, :short => "-g"
+ opt :summary_only, "Extract only article title, categories, and summary text before first heading", :default => false, :short => "-s"
+ opt :file_size, "Approximate size (in MB) of each output file", :default => 10, :short => "-f"
+ opt :num_procs, "Number of proccesses to be run concurrently (default: max num of CPU cores minus two)", :short => "-n"
+ opt :del_interfile, "Delete intermediate XML files from output dir", :short => "-x", :default => false
+ opt :title, "Keep page titles in output", :default => true, :short => "-t"
+ opt :heading, "Keep section titles in output", :default => true, :short => "-d"
+ opt :list, "Keep unprocessed list items in output", :default => false, :short => "-l"
+ opt :ref, "Keep reference notations in the format [ref]...[/ref]", :default => false, :short => "-r"
+ opt :redirect, "Show redirect destination", :default => false, :short => "-e"
+ opt :marker, "Show symbols prefixed to list items, definitions, etc.", :default => true, :short => "-m"
  end
+
  Optimist::die :size, "must be larger than 0" unless opts[:file_size] >= 0
  Optimist::die :output_dir, "must exist" unless File.exist?(opts[:output_dir])

+ pastel = Pastel.new
+
  input_file = ARGV[0]
  output_dir = opts[:output_dir]
  tfile_size = opts[:file_size]
- num_threads = opts[:num_threads]
+ num_processors = Etc.nprocessors
+ if opts[:num_procs] && opts[:num_procs].to_i <= num_processors
+ num_processes = opts[:num_procs]
+ else
+ num_processes = num_processors - 2
+ end
+ num_processes = 1 if num_processes < 1
+
  convert = opts[:convert]
  strip_tmarker = opts[:marker] ? false : true
- opt_array = [:title, :list, :heading, :table, :redirect, :multiline]
+ opt_array = [:title,
+ :list,
+ :heading,
+ :table,
+ :redirect,
+ :multiline,
+ :category,
+ :category_only,
+ :summary_only,
+ :del_interfile]
+
  $leave_inline_template = true if opts[:inline]
  $leave_ref = true if opts[:ref]
- # $leave_table = true if opts[:table]
+
  config = {}
  opt_array.each do |opt|
  config[opt] = opts[opt]
  end

- parent = Wp2txt::CmdProgbar.new
- wpconv = Wp2txt::Runner.new(parent, input_file, output_dir, tfile_size, num_threads, convert, strip_tmarker)
-
- wpconv.extract_text do |article|
- format_wiki!(article.title)
-
- if opts[:category_only]
- title = "#{article.title}\t"
- contents = article.categories.join(", ")
- contents << "\n"
- elsif opts[:category] && !article.categories.empty?
- title = "\n[[#{article.title}]]\n\n"
- contents = "\nCATEGORIES: "
- contents << article.categories.join(", ")
- contents << "\n\n"
- else
- title = "\n[[#{article.title}]]\n\n"
- contents = ""
- end
+ if File::ftype(input_file) == "directory"
+ input_files = Dir.glob("#{input_file}/*.xml")
+ else
+ puts ""
+ puts pastel.green.bold("Preprocessing")
+ puts "Decompressing and splitting the original dump file."
+ puts pastel.underline("This may take a while. Please be patient!")

- unless opts[:category_only]
- article.elements.each do |e|
- case e.first
- when :mw_heading
- next if !config[:heading]
- format_wiki!(e.last)
- line = e.last
- line << "+HEADING+" if $DEBUG_MODE
- when :mw_paragraph
- format_wiki!(e.last)
- line = e.last + "\n"
- line << "+PARAGRAPH+" if $DEBUG_MODE
- when :mw_table, :mw_htable
- next if !config[:table]
- line = e.last
- line << "+TABLE+" if $DEBUG_MODE
- when :mw_pre
- next if !config[:pre]
- line = e.last
- line << "+PRE+" if $DEBUG_MODE
- when :mw_quote
- line = e.last
- line << "+QUOTE+" if $DEBUG_MODE
- when :mw_unordered, :mw_ordered, :mw_definition
- next if !config[:list]
- line = e.last
- line << "+LIST+" if $DEBUG_MODE
- when :mw_ml_template
- next if !config[:multiline]
- line = e.last
- line << "+MLTEMPLATE+" if $DEBUG_MODE
- when :mw_redirect
- next if !config[:redirect]
- line = e.last
- line << "+REDIRECT+" if $DEBUG_MODE
- line << "\n\n"
- when :mw_isolated_template
- next if !config[:multiline]
- line = e.last
- line << "+ISOLATED_TEMPLATE+" if $DEBUG_MODE
- when :mw_isolated_tag
- next
- else
- if $DEBUG_MODE
- # format_wiki!(e.last)
+ spinner = TTY::Spinner.new(":spinner", format: :arrow_pulse, hide_cursor: true, interval: 5)
+ spinner.auto_spin
+ wpsplitter = Wp2txt::Splitter.new(input_file, output_dir, tfile_size)
+ wpsplitter.split_file
+ spinner.stop(pastel.blue.bold("Done!")) # Stop animation
+ exit if !convert
+ input_files = Dir.glob("#{output_dir}/*.xml")
+ end
+
+ puts ""
+ puts pastel.red.bold("Converting")
+ puts "Number of files being processed: " + pastel.bold("#{input_files.size}")
+ puts "Number of CPU cores being used: " + pastel.bold("#{num_processes}")
+
+ Parallel.map(input_files, progress: pastel.magenta.bold("WP2TXT"), in_processes: num_processes) do |input_file|
+ wpconv = Wp2txt::Runner.new(input_file, output_dir, strip_tmarker, config[:del_interfile])
+ wpconv.extract_text do |article|
+ format_wiki!(article.title)
+
+ if config[:category_only]
+ title = "#{article.title}\t"
+ contents = article.categories.join(", ")
+ contents << "\n"
+ elsif config[:category] && !article.categories.empty?
+ title = "\n[[#{article.title}]]\n\n"
+ contents = "\nCATEGORIES: "
+ contents << article.categories.join(", ")
+ contents << "\n\n"
+ else
+ title = "\n[[#{article.title}]]\n\n"
+ contents = ""
+ end
+
+ unless config[:category_only]
+ article.elements.each do |e|
+ case e.first
+ when :mw_heading
+ break if config[:summary_only]
+ next if !config[:heading]
+ format_wiki!(e.last)
  line = e.last
- line << "+OTHER+"
- else
+ line << "+HEADING+" if $DEBUG_MODE
+ when :mw_paragraph
+ format_wiki!(e.last)
+ line = e.last + "\n"
+ line << "+PARAGRAPH+" if $DEBUG_MODE
+ when :mw_table, :mw_htable
+ next if !config[:table]
+ line = e.last
+ line << "+TABLE+" if $DEBUG_MODE
+ when :mw_pre
+ next if !config[:pre]
+ line = e.last
+ line << "+PRE+" if $DEBUG_MODE
+ when :mw_quote
+ line = e.last
+ line << "+QUOTE+" if $DEBUG_MODE
+ when :mw_unordered, :mw_ordered, :mw_definition
+ next if !config[:list]
+ line = e.last
+ line << "+LIST+" if $DEBUG_MODE
+ when :mw_ml_template
+ next if !config[:multiline]
+ line = e.last
+ line << "+MLTEMPLATE+" if $DEBUG_MODE
+ when :mw_redirect
+ next if !config[:redirect]
+ line = e.last
+ line << "+REDIRECT+" if $DEBUG_MODE
+ line << "\n\n"
+ when :mw_isolated_template
+ next if !config[:multiline]
+ line = e.last
+ line << "+ISOLATED_TEMPLATE+" if $DEBUG_MODE
+ when :mw_isolated_tag
  next
+ else
+ if $DEBUG_MODE
+ # format_wiki!(e.last)
+ line = e.last
+ line << "+OTHER+"
+ else
+ next
+ end
  end
+ contents << line << "\n"
  end
- contents << line << "\n"
  end
- end
-
- if /\A[\s ]*\z/m =~ contents
- result = ""
- else
- result = config[:title] ? title << contents : contents
+
+ if /\A[\s ]*\z/m =~ contents
+ result = ""
+ else
+ result = config[:title] ? title << contents : contents
+ end
  end
  end
+
+ puts pastel.blue.bold("Complete!")
+