wp2txt 0.9.5 → 1.0.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: bf8270b3488c0045a067f71c155db8d9ac6366a94d825eed9bc6d05c95598345
- data.tar.gz: 8802949a232c60d8b5ae6f93726154f7a6b40436b478919657f58f4bdc54add3
+ metadata.gz: d33a41cf46688679a14eb8c3eb16f6ed33ce9175c7f5b566c9f87998ba2c8401
+ data.tar.gz: 7371e0f7b06b2f0846f01d66f461c7e106778adc6e686919302f0f29b1f80a9e
  SHA512:
- metadata.gz: 0a10804d78c33e035aaf429dd4613f84f3db0c6f22a6c36617a1fda25f03c0fd8fac224ec9e6009ab6ddddb475d73e6eda4c21606f89ef94950bc3749ce4f452
- data.tar.gz: 4a8ea2f0900c6f97d3dcaf6c6387b3543d23962ea064cf1b18a8c293b4664c16fcabca9230aa83b0f5685eacffed74968fe82529566080b7821fd944c7bf275d
+ metadata.gz: cab8d9c27989387acc6dbbe052029d2205508ce10e38b8eedc111c822328d8eba551d603020684cbb3844a87b747f261a5959f711267acd96a3b97ccef4f6834
+ data.tar.gz: 4de59be37d57ef3d14ae2304660e8dde069bdf645a7cff862026562b26327984f1be13840e9d6ec1f25110222367f71c84a0286b649d71fec0c13805c6b0a647
data/README.md CHANGED
@@ -1,104 +1,170 @@
- # WP2TXT
+ <img src='https://raw.githubusercontent.com/yohasebe/wp2txt/master/image/wp2txt-logo.svg' width="400" />

- Wikipedia dump file to text converter that extracts both content and category data
+ A command-line toolkit to extract text content and category data from Wikipedia dump files

  ## About

- WP2TXT extracts plain text data from a Wikipedia dump file (encoded in XML / compressed with Bzip2), removing all MediaWiki markup and other metadata. It was developed for researchers who want easy access to open-source multilingual corpora, but may be used for other purposes as well.
+ WP2TXT extracts text and category data from Wikipedia dump files (encoded in XML / compressed with Bzip2), removing MediaWiki markup and other metadata.

- **UPDATE (July 2022)**: Version 0.9.3 adds a new option `category_only`. When this option is enabled, wp2txt will extract only the title and category information of the article. See output examples below.
+ **UPDATE (August 2022)**

+ 1. A new option `--category-only` has been added. When this option is enabled, only the title and category information of each article are extracted.
+ 2. A new option `--summary-only` has been added. When this option is enabled, only the title, category information, and opening paragraphs of each article are extracted.
+ 3. Text conversion with the current version of WP2TXT is *more than twice as fast* as with the previous version, thanks to parallel processing of multiple files (the exact speedup depends on the number of CPU cores used).
+
+ ## Screenshot
+
+ <img src='https://raw.githubusercontent.com/yohasebe/wp2txt/master/image/screenshot.png' width="700" />
+
+ **Environment**
+
+ - WP2TXT 1.0.1
+ - MacBook Pro (2021 Apple M1 Pro)
+ - enwiki-20220720-pages-articles.xml.bz2 (19.98 GB)
+
+ In the above environment, the entire process (decompression, splitting, extraction, and conversion) needed to obtain the plain text data of the English Wikipedia takes less than 1.5 hours.
 
  ## Features

- * Converts Wikipedia dump files in various languages
- * Creates output files of specified size
- * Can specify text elements to be extracted and converted (page titles, section titles, lists, tables)
- * Can extract category information for each article
+ - Converts Wikipedia dump files in various languages
+ - Creates output files of specified size
+ - Allows specifying text elements (page titles, section headers, paragraphs, list items) to be extracted
+ - Allows extracting category information of each article
+ - Allows extracting opening paragraphs of each article
+
+ ## Preparation
+
+ ### For macOS / Linux / WSL2
+
+ WP2TXT requires that one of the following commands be installed on the system in order to decompress `bz2` files:

+ - `lbzip2` (recommended)
+ - `pbzip2`
+ - `bzip2`
+
+ In most cases, the `bzip2` command is pre-installed on the system. However, since `lbzip2` can use multiple CPU cores and is faster than `bzip2`, it is recommended that you install it additionally. WP2TXT will attempt to find the decompression command available on your system in the order listed above.
+
+ If you are using macOS with Homebrew installed, you can install `lbzip2` with the following command:
+
+     $ brew install lbzip2
+
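To check which of these decompression commands are already available on your system (a quick shell check, not a WP2TXT feature), you can run:

    $ which lbzip2 pbzip2 bzip2

Each command that is found is printed with its full path.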
+ ### For Windows
+
+ Install [Bzip2 for Windows](http://gnuwin32.sourceforge.net/packages/bzip2.htm) and set the path so that WP2TXT can use the `bunzip2.exe` command. Alternatively, you can decompress the Wikipedia dump file in your own way and process the resulting XML file with WP2TXT.
 
  ## Installation

+ ### WP2TXT command
+
  $ gem install wp2txt

- ## Usage
+ ## Wikipedia Dump File
+
+ Download the latest Wikipedia dump file for the desired language at a URL such as:
+
+     https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

- Obtain a Wikipedia dump file (from [here](http://dumps.wikimedia.org/backup-index.html)) with a file name such as:
+ Here, `enwiki` refers to the English Wikipedia. To get the Japanese Wikipedia dump file, for instance, change this to `jawiki`. In doing so, note that there are two instances of `enwiki` in the URL above.
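For example, assuming the same URL pattern holds, the latest Japanese dump would be:

    https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2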
 
28
- > `xxwiki-yyyymmdd-pages-articles.xml.bz2`
69
+ Alternatively, you can also select Wikipedia dump files created on a specific date from [here](http://dumps.wikimedia.org/backup-index.html). Make sure to download a file named in the following format:
29
70
 
30
- where `xx` is language code such as "en (English)" or "ja (Japanese)", and `yyyymmdd` is the date of creation (e.g. 20220720).
71
+ xxwiki-yyyymmdd-pages-articles.xml.bz2
31
72
 
32
- ### Example 1: Basic
73
+ where `xx` is language code such as `en` (English)" or `ja` (japanese), and `yyyymmdd` is the date of creation (e.g. `20220801`).
33
74
 
- The following extracts text data, including list items and excluding tables.
+ ## Basic Usage

-     $ wp2txt -i xxwiki-yyyymmdd-pages-articles.xml.bz2 -o /output_dir
+ Suppose you have a folder containing a Wikipedia dump file and empty subfolders organized as follows:

- - [Output example (English)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en.txt)
- - [Output example (Japanese)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja.txt)
+ ```
+ .
+ ├── enwiki-20220801-pages-articles.xml.bz2
+ ├── /xml
+ ├── /text
+ ├── /category
+ └── /summary
+ ```
 
- ### Example 2: Title and category information only
+ ### Decompress and Split

- The following will extract only article titles and the categories to which each article belongs:
+ The following command will decompress the entire Wikipedia data set and split it into many small (approximately 10 MB) XML files:

-     $ wp2txt --category-only -i xxwiki-yyyymmdd-pages-articles.xml.bz2 -o /output_dir
+     $ wp2txt --no-convert -i ./enwiki-20220801-pages-articles.xml.bz2 -o ./xml

- Each line of the output data contains the title and the categories of an article:
+ **Note**: The resulting files are not well-formed XML. They contain parts of the original XML extracted from the Wikipedia dump file, with care taken to ensure that the content within a `<page>` tag is never split across multiple files.
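As a rough sketch of the split-file layout (assumed from the note above, not verbatim WP2TXT output), each split file holds a run of complete `<page>` elements without the enclosing root element of the original dump:

    <page>
      <title>...</title>
      ...
    </page>
    <page>
      ...
    </page>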
 
- > title `TAB` category1`,` category2`,` category3`,` ...
+ ### Extract plain text from MediaWiki XML

- - [Output example (English)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en_categories.txt)
- - [Output example (Japanese)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja_categories.txt)
+     $ wp2txt -i ./xml -o ./text

- ### Example 3: Title, category, and summary text only

- The following will extract only article titles, the categories to which each article belongs, and text blocks before the first heading of the article:
+ ### Extract only category info from MediaWiki XML

-     $ wp2txt --summary-only -i xxwiki-yyyymmdd-pages-articles.xml.bz2 -o /output_dir
+     $ wp2txt -g -i ./xml -o ./category

- - [Output example (English)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en_summary.txt)
- - [Output example (Japanese)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja_summary.txt)
+ ### Extract opening paragraphs from MediaWiki XML

+     $ wp2txt -s -i ./xml -o ./summary
 
- ## Options
+ ### Extract directly from bz2 compressed file
+
+ It is possible (though not recommended) to 1) decompress the dump file, 2) split the data into files, and 3) extract the text, all with a single command. You can automatically remove all the intermediate XML files with the `-x` option.
+
+     $ wp2txt -i ./enwiki-20220801-pages-articles.xml.bz2 -o ./text -x
+
+ ## Sample Output
+
+ Output containing title, category info, and paragraphs:
+
+     $ wp2txt -i ./input -o ./output
+
+ - [English Wikipedia](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en.txt)
+ - [Japanese Wikipedia](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja.txt)
+
+ Output containing title and category only:
+
+     $ wp2txt -g -i ./input -o ./output
+
+ - [English Wikipedia](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en_category.txt)
+ - [Japanese Wikipedia](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja_category.txt)
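Per the format documented in the previous version of this README, each line of the category output pairs an article title with its comma-separated categories:

    title `TAB` category1, category2, category3, ...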
+
+ Output containing title, category, and summary:
+
+     $ wp2txt -s -i ./input -o ./output
+
+ - [English Wikipedia](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en_summary.txt)
+ - [Japanese Wikipedia](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja_summary.txt)
+
+ ## Command Line Options
 
  Command line options are as follows:

  Usage: wp2txt [options]
  where [options] are:
- --input-file, -i: Wikipedia dump file with .bz2 (compressed) or
-                   .txt (uncompressed) format
- --output-dir, -o <s>: Output directory (default: current directory)
- --convert, --no-convert, -c: Output in plain text (converting from XML)
-                   (default: true)
- --list, --no-list, -l: Show list items in output (default: true)
- --heading, --no-heading, -d: Show section titles in output (default: true)
- --title, --no-title, -t: Show page titles in output (default: true)
- --table, -a: Show table source code in output (default: false)
- --inline, -n: leave inline template notations unmodified (default: false)
- --multiline, -m: leave multiline template notations unmodified (default: false)
- --ref, -r: leave reference notations in the format (default: false)
-                   [ref]...[/ref]
- --redirect, -e: Show redirect destination (default: false)
- --marker, --no-marker, -k: Show symbols prefixed to list items,
-                   definitions, etc. (default: true)
- --category, -g: Show article category information (default: true)
- --category-only, -y: Extract only article title and categories (default: false)
- -s, --summary-only: Extract only article title, categories, and summary text before first heading
- --file-size, -f <i>: Approximate size (in MB) of each output file
-                   (default: 10)
- -u, --num-threads=<i>: Number of threads to be spawned (capped to the number of CPU cores;
-                   set 99 to spawn max num of threads) (default: 4)
- --version, -v: Print version and exit
- --help, -h: Show this message
+ -i, --input                    Path to compressed file (bz2) or decompressed file (xml), or path to directory containing files of the latter format
+ -o, --output-dir=<s>           Path to output directory
+ -c, --convert, --no-convert    Output in plain text (converting from XML) (default: true)
+ -a, --category, --no-category  Show article category information (default: true)
+ -g, --category-only            Extract only article title and categories
+ -s, --summary-only             Extract only article title, categories, and summary text before first heading
+ -f, --file-size=<i>            Approximate size (in MB) of each output file (default: 10)
+ -n, --num-procs                Number of processes to be run concurrently (default: max num of available CPU cores minus two)
+ -x, --del-interfile            Delete intermediate XML files from output dir
+ -t, --title, --no-title        Keep page titles in output (default: true)
+ -d, --heading, --no-heading    Keep section titles in output (default: true)
+ -l, --list                     Keep unprocessed list items in output
+ -r, --ref                      Keep reference notations in the format [ref]...[/ref]
+ -e, --redirect                 Show redirect destination
+ -m, --marker, --no-marker      Show symbols prefixed to list items, definitions, etc. (default: true)
+ -b, --bz2-gem                  Use Ruby's bzip2-ruby gem instead of a system command
+ -v, --version                  Print version and exit
+ -h, --help                     Show this message
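As a usage sketch combining several of these options (the paths here are hypothetical), the following would convert pre-split XML files with four worker processes and 20 MB output files, suppressing section titles and list-item markers:

    $ wp2txt -i ./xml -o ./text -f 20 -n 4 --no-heading --no-marker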
 
  ## Caveats

- * Some data, such as mathematical formulas and computer source code, will not be converted correctly.
+ * Some data, such as mathematical formulas and computer source code, will not be converted correctly.
  * Some text data may not be extracted correctly for various reasons (incorrect matching of begin/end tags, language-specific formatting rules, etc.).
  * The conversion process can take longer than expected. When dealing with a huge data set such as the English Wikipedia on a low-spec environment, it can take several hours or more.
- * WP2TXT, by the nature of its task, requires a lot of machine power and consumes a large amount of memory/storage resources. Therefore, there is a possibility that the process may stop unexpectedly. In the worst case, the process may even freeze without terminating successfully. Please understand this and use at your own risk.

  ## Useful Links
 
@@ -115,6 +181,17 @@ The author will appreciate your mentioning one of these in your research.
  * Yoichiro HASEBE. 2006. [Method for using Wikipedia as Japanese corpus.](http://ci.nii.ac.jp/naid/110006226727) _Doshisha Studies in Language and Culture_ 9(2), 373-403.
  * 長谷部陽一郎. 2006. [Wikipedia日本語版をコーパスとして用いた言語研究の手法](http://ci.nii.ac.jp/naid/110006226727). 『言語文化』9(2), 373-403.

+ Or use this BibTeX entry:
+
+ ```
+ @misc{WP2TXT_2022,
+   author = {Yoichiro Hasebe},
+   title = {WP2TXT: A command-line toolkit to extract text content and category data from Wikipedia dump files},
+   url = {https://github.com/yohasebe/wp2txt},
+   year = {2022},
+ }
+ ```
+
  ## License

  This software is distributed under the MIT License. Please see the LICENSE file.
data/bin/wp2txt CHANGED
@@ -11,133 +11,187 @@ DOCDIR = File.join(File.dirname(__FILE__), '..', 'doc')
  require 'wp2txt'
  require 'wp2txt/utils'
  require 'wp2txt/version'
+ require 'etc'
  require 'optimist'
+ require 'parallel'
+ require 'pastel'
+ require 'tty-spinner'

  include Wp2txt

  opts = Optimist::options do
-   version Wp2txt::VERSION
-   banner <<-EOS
+   version Wp2txt::VERSION
+   banner <<-EOS
  WP2TXT extracts plain text data from Wikipedia dump file (encoded in XML/compressed with Bzip2) stripping all the MediaWiki markups and other metadata.

  Usage: wp2txt [options]
  where [options] are:
  EOS

-   opt :input_file, "Wikipedia dump file with .bz2 (compressed) or .txt (uncompressed) format", :required => true
-   opt :output_dir, "Output directory", :default => Dir::pwd, :type => String
-   opt :convert, "Output in plain text (converting from XML)", :default => true
-   opt :list, "Show list items in output", :default => false
-   opt :heading, "Show section titles in output", :default => true, :short => "-d"
-   opt :title, "Show page titles in output", :default => true
-   opt :table, "Show table source code in output", :default => false
-   opt :inline, "leave inline template notations as they are", :default => false
-   opt :multiline, "leave multiline template notations as they are", :default => false
-   opt :ref, "leave reference notations in the format [ref]...[/ref]", :default => false
-   opt :redirect, "Show redirect destination", :default => false
-   opt :marker, "Show symbols prefixed to list items, definitions, etc.", :default => true
-   opt :category, "Show article category information", :default => true
-   opt :category_only, "Extract only article title and categories", :default => false
-   opt :summary_only, "Extract only article title, categories, and summary text before first heading", :default => false
-   opt :file_size, "Approximate size (in MB) of each output file", :default => 10
-   opt :num_threads, "Number of threads to be spawned (capped to the number of CPU cores; set 99 to spawn max num of threads)", :default => 4
+   opt :input, "Path to compressed file (bz2) or decompressed file (xml), or path to directory containing files of the latter format", :required => true, :short => "-i"
+   opt :output_dir, "Path to output directory", :default => Dir::pwd, :type => String, :short => "-o"
+   opt :convert, "Output in plain text (converting from XML)", :default => true, :short => "-c"
+   opt :category, "Show article category information", :default => true, :short => "-a"
+   opt :category_only, "Extract only article title and categories", :default => false, :short => "-g"
+   opt :summary_only, "Extract only article title, categories, and summary text before first heading", :default => false, :short => "-s"
+   opt :file_size, "Approximate size (in MB) of each output file", :default => 10, :short => "-f"
+   opt :num_procs, "Number of processes to be run concurrently (default: max num of CPU cores minus two)", :short => "-n"
+   opt :del_interfile, "Delete intermediate XML files from output dir", :short => "-x", :default => false
+   opt :title, "Keep page titles in output", :default => true, :short => "-t"
+   opt :heading, "Keep section titles in output", :default => true, :short => "-d"
+   opt :list, "Keep unprocessed list items in output", :default => false, :short => "-l"
+   opt :ref, "Keep reference notations in the format [ref]...[/ref]", :default => false, :short => "-r"
+   opt :redirect, "Show redirect destination", :default => false, :short => "-e"
+   opt :marker, "Show symbols prefixed to list items, definitions, etc.", :default => true, :short => "-m"
+   opt :bz2_gem, "Use Ruby's bzip2-ruby gem instead of a system command", :default => false, :short => "-b"
  end
+
  Optimist::die :size, "must be larger than 0" unless opts[:file_size] >= 0
  Optimist::die :output_dir, "must exist" unless File.exist?(opts[:output_dir])

+ pastel = Pastel.new
+
  input_file = ARGV[0]
  output_dir = opts[:output_dir]
  tfile_size = opts[:file_size]
- num_threads = opts[:num_threads]
+ num_processors = Etc.nprocessors
+ if opts[:num_procs] && opts[:num_procs].to_i <= num_processors
+   num_processes = opts[:num_procs]
+ else
+   num_processes = num_processors - 2
+ end
+ num_processes = 1 if num_processes < 1
+
  convert = opts[:convert]
  strip_tmarker = opts[:marker] ? false : true
- opt_array = [:title, :list, :heading, :table, :redirect, :multiline, :category, :category_only, :summary_only]
+ opt_array = [:title,
+              :list,
+              :heading,
+              :table,
+              :redirect,
+              :multiline,
+              :category,
+              :category_only,
+              :summary_only,
+              :del_interfile,
+              :bz2_gem]
+
  $leave_inline_template = true if opts[:inline]
  $leave_ref = true if opts[:ref]
+
  config = {}
  opt_array.each do |opt|
    config[opt] = opts[opt]
  end
 
- parent = Wp2txt::CmdProgbar.new
- wpconv = Wp2txt::Runner.new(parent, input_file, output_dir, tfile_size, num_threads, convert, strip_tmarker)
-
- wpconv.extract_text do |article|
-   format_wiki!(article.title)
-
-   if config[:category_only]
-     title = "#{article.title}\t"
-     contents = article.categories.join(", ")
-     contents << "\n"
-   elsif config[:category] && !article.categories.empty?
-     title = "\n[[#{article.title}]]\n\n"
-     contents = "\nCATEGORIES: "
-     contents << article.categories.join(", ")
-     contents << "\n\n"
-   else
-     title = "\n[[#{article.title}]]\n\n"
-     contents = ""
-   end
+ if File::ftype(input_file) == "directory"
+   input_files = Dir.glob("#{input_file}/*.xml")
+ else
+   puts ""
+   puts pastel.green.bold("Preprocessing")
+   puts "Decompressing and splitting the original dump file."
+   puts pastel.underline("This may take a while. Please be patient!")
+
+   time_start = Time.now.to_i
+   wpsplitter = Wp2txt::Splitter.new(input_file, output_dir, tfile_size)
+   spinner = TTY::Spinner.new(":spinner", format: :arrow_pulse, hide_cursor: true, interval: 5)
+   spinner.auto_spin
+   wpsplitter.split_file
+   time_finish = Time.now.to_i
+
+   spinner.stop("Time: #{sec_to_str(time_finish - time_start)}") # Stop animation
+   puts pastel.blue.bold("Complete!")
+   exit if !convert
+   input_files = Dir.glob("#{output_dir}/*.xml")
+ end
+
+ puts ""
+ puts pastel.red.bold("Converting")
+ puts "Number of files being processed: " + pastel.bold("#{input_files.size}")
+ puts "Number of CPU cores being used: " + pastel.bold("#{num_processes}")
+
+ Parallel.map(input_files, progress: pastel.magenta.bold("WP2TXT"), in_processes: num_processes) do |input_file|
+   wpconv = Wp2txt::Runner.new(input_file, output_dir, strip_tmarker, config[:del_interfile])
+   wpconv.extract_text do |article|
+     format_wiki!(article.title)
+
+     if config[:category_only]
+       title = "#{article.title}\t"
+       contents = article.categories.join(", ")
+       contents << "\n"
+     elsif config[:category] && !article.categories.empty?
+       title = "\n[[#{article.title}]]\n\n"
+       contents = "\nCATEGORIES: "
+       contents << article.categories.join(", ")
+       contents << "\n\n"
+     else
+       title = "\n[[#{article.title}]]\n\n"
+       contents = ""
+     end

-   unless config[:category_only]
-     article.elements.each do |e|
-       case e.first
-       when :mw_heading
-         break if config[:summary_only]
-         next if !config[:heading]
-         format_wiki!(e.last)
-         line = e.last
-         line << "+HEADING+" if $DEBUG_MODE
-       when :mw_paragraph
-         format_wiki!(e.last)
-         line = e.last + "\n"
-         line << "+PARAGRAPH+" if $DEBUG_MODE
-       when :mw_table, :mw_htable
-         next if !config[:table]
-         line = e.last
-         line << "+TABLE+" if $DEBUG_MODE
-       when :mw_pre
-         next if !config[:pre]
-         line = e.last
-         line << "+PRE+" if $DEBUG_MODE
-       when :mw_quote
-         line = e.last
-         line << "+QUOTE+" if $DEBUG_MODE
-       when :mw_unordered, :mw_ordered, :mw_definition
-         next if !config[:list]
-         line = e.last
-         line << "+LIST+" if $DEBUG_MODE
-       when :mw_ml_template
-         next if !config[:multiline]
-         line = e.last
-         line << "+MLTEMPLATE+" if $DEBUG_MODE
-       when :mw_redirect
-         next if !config[:redirect]
-         line = e.last
-         line << "+REDIRECT+" if $DEBUG_MODE
-         line << "\n\n"
-       when :mw_isolated_template
-         next if !config[:multiline]
-         line = e.last
-         line << "+ISOLATED_TEMPLATE+" if $DEBUG_MODE
-       when :mw_isolated_tag
-         next
-       else
-         if $DEBUG_MODE
-           # format_wiki!(e.last)
-           line = e.last
-           line << "+OTHER+"
-         else
-           next
-         end
-       end
-       contents << line << "\n"
-     end
-   end
-
-   if /\A[\s ]*\z/m =~ contents
-     result = ""
-   else
-     result = config[:title] ? title << contents : contents
-   end
- end
+     unless config[:category_only]
+       article.elements.each do |e|
+         case e.first
+         when :mw_heading
+           break if config[:summary_only]
+           next if !config[:heading]
+           format_wiki!(e.last)
+           line = e.last
+           line << "+HEADING+" if $DEBUG_MODE
+         when :mw_paragraph
+           format_wiki!(e.last)
+           line = e.last + "\n"
+           line << "+PARAGRAPH+" if $DEBUG_MODE
+         when :mw_table, :mw_htable
+           next if !config[:table]
+           line = e.last
+           line << "+TABLE+" if $DEBUG_MODE
+         when :mw_pre
+           next if !config[:pre]
+           line = e.last
+           line << "+PRE+" if $DEBUG_MODE
+         when :mw_quote
+           line = e.last
+           line << "+QUOTE+" if $DEBUG_MODE
+         when :mw_unordered, :mw_ordered, :mw_definition
+           next if !config[:list]
+           line = e.last
+           line << "+LIST+" if $DEBUG_MODE
+         when :mw_ml_template
+           next if !config[:multiline]
+           line = e.last
+           line << "+MLTEMPLATE+" if $DEBUG_MODE
+         when :mw_redirect
+           next if !config[:redirect]
+           line = e.last
+           line << "+REDIRECT+" if $DEBUG_MODE
+           line << "\n\n"
+         when :mw_isolated_template
+           next if !config[:multiline]
+           line = e.last
+           line << "+ISOLATED_TEMPLATE+" if $DEBUG_MODE
+         when :mw_isolated_tag
+           next
+         else
+           if $DEBUG_MODE
+             # format_wiki!(e.last)
+             line = e.last
+             line << "+OTHER+"
+           else
+             next
+           end
+         end
+         contents << line << "\n"
+       end
+     end
+
+     if /\A[\s ]*\z/m =~ contents
+       result = ""
+     else
+       result = config[:title] ? title << contents : contents
+     end
+   end
+ end
+
+ puts pastel.blue.bold("Complete!")