wp2txt 0.8.0 → 0.9.3
- checksums.yaml +5 -5
- data/README.md +48 -21
- data/bin/benchmark.rb +5 -4
- data/bin/wp2txt +67 -67
- data/data/output_samples/testdata_en.txt +49076 -0
- data/data/output_samples/testdata_en_categories.txt +824 -0
- data/data/output_samples/testdata_ja.txt +9382 -0
- data/data/output_samples/testdata_ja_categories.txt +188 -0
- data/data/testdata_en.bz2 +0 -0
- data/data/{testdata.bz2 → testdata_ja.bz2} +0 -0
- data/lib/wp2txt/article.rb +33 -3
- data/lib/wp2txt/utils.rb +44 -49
- data/lib/wp2txt/version.rb +1 -1
- data/lib/wp2txt.rb +67 -42
- data/spec/utils_spec.rb +28 -16
- data/wp2txt.gemspec +2 -1
- metadata +27 -9
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
-
-metadata.gz:
-data.tar.gz:
+SHA256:
+metadata.gz: 32966949db257b30be7a5c044965ce08426bdede1f1fa0dbb0a276361d1c69c2
+data.tar.gz: ee0b08031ae75b9d08fd1f07e5d08e8e10135dba8b15bd44692f7c434b220262
 SHA512:
-metadata.gz:
-data.tar.gz:
+metadata.gz: 9dee99ed39d2da01c9aeda462291645533773ed537de06a1f9127d626c91bc421e92ef805c52764fd0a668cf104a3733ab5981e59bb35d4abebe2f2909c63e3f
+data.tar.gz: 8c6d7a5841a47fa4643a229050be652e38879d03477568175e8b29fb6f77971633731ec2d5dbd429dc2499ecfc2bfbb58917389d89b9cd57743b86b97ceaa4af
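The SHA256 and SHA512 entries above are hex digests of the `metadata.gz` and `data.tar.gz` archives packed inside the `.gem` file. A minimal Ruby sketch for recomputing such digests follows, assuming the gem has already been unpacked so that both files exist on disk. The helper name `file_digests` is hypothetical and not part of RubyGems or wp2txt.

```ruby
require 'digest'

# Hypothetical helper: returns the two hex digests that checksums.yaml
# records for a single file (for example metadata.gz or data.tar.gz).
def file_digests(path)
  {
    'SHA256' => Digest::SHA256.file(path).hexdigest,
    'SHA512' => Digest::SHA512.file(path).hexdigest
  }
end

# Usage (path is an assumption): file_digests('data.tar.gz')['SHA256']
```

Comparing the returned strings with the `+` lines of the hunk is enough to confirm that a locally unpacked copy matches the published 0.9.3 release.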
data/README.md
CHANGED
@@ -2,33 +2,54 @@
 
 Wikipedia dump file to text converter
 
-**
+**IMPORTANT:** This is a project still work in progress and it could be slow, unstable, and even destructive! It should be used with caution.
 
-
+## About
 
 WP2TXT extracts plain text data from Wikipedia dump file (encoded in XML/compressed with Bzip2) stripping all the MediaWiki markups and other metadata. It is originally intended to be useful for researchers who look for an easy way to obtain open-source multi-lingual corpora, but may be handy for other purposes.
 
-
+**UPDATE:** Version 0.9.3 has added a new option `category_only`. With this option enabled, wp2txt extracts article title and category info only. Please see output examples below.
 
-
+## Features
+
+* Convert dump files of Wikipedia of various languages
 * Create output files of specified size.
-* Allow users to specify text elements to be extracted/converted (page titles, section titles, lists, and tables)
+* Allow users to specify text elements to be extracted/converted (page titles, section titles, lists, and tables)
+* Extract category information of each article
 
-
+## Installation
 
 $ gem install wp2txt
 
-
+## Usage
 
 Obtain a Wikipedia dump file (from [here](http://dumps.wikimedia.org/backup-index.html)) with a file name such as:
 
 xxwiki-yyyymmdd-pages-articles.xml.bz2
 
-where `xx` is language code such as "en (English)" or "
+where `xx` is language code such as "en (English)" or "", and `yyyymmdd` is the date of creation (e.g. 20120601).
 
-
+### Example 1
+
+The following extracts text data, including list items and excluding tables.
+
+$ wp2txt -i xxwiki-yyyymmdd-pages-articles.xml.bz2 -o /output_dir
+
+- [Output example (English)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en.txt)
+- [Output example (Japanese)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja.txt)
+
+### Example 2
 
-
+The following will extract only article titles and the categories to which each article belongs:
+
+$ wp2txt -i xxwiki-yyyymmdd-pages-articles.xml.bz2 -o /output_dir
+
+- [Output example (English)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en_categories.txt)
+- [Output example (Japanese)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja_categories.txt)
+
+## Options
+
+Command line options are as follows:
 
 Usage: wp2txt [options]
 where [options] are:
@@ -40,39 +61,45 @@ Command line options are as follows:
 --list, --no-list, -l: Show list items in output (default: true)
 --heading, --no-heading, -d: Show section titles in output (default: true)
 --title, --no-title, -t: Show page titles in output (default: true)
---table, -a: Show table source code in output
-
-
+--table, -a: Show table source code in output (default: false)
+--inline, -n: leave inline template notations unmodified (default: false)
+--multiline, -m: leave multiline template notations unmodified (default: false)
+--ref, -r: leave reference notations in the format (default: false)
 [ref]...[/ref]
-
---marker, --no-marker, -
+--redirect, -e: Show redirect destination (default: false)
+--marker, --no-marker, -k: Show symbols prefixed to list items,
 definitions, etc. (Default: true)
---category, -g: Show article category information
+--category, -g: Show article category information (default: true)
+--category-only, -y: Extract only article title and categories (default: false)
 --file-size, -f <i>: Approximate size (in MB) of each output file
 (default: 10)
+-u, --num-threads=<i>: Number of threads to be spawned (capped to the number of CPU cores;
+set 99 to spawn max num of threads) (default: 4)
 --version, -v: Print version and exit
 --help, -h: Show this message
 
-
+## Caveats
 
 * Certain types of data such as mathematical equations and computer source code are not be properly converted. Please remember this software is originally intended for correcting “sentences” for linguistic studies.
 * Extraction of normal text data could sometimes fail for various reasons (e.g. illegal matching of begin/end tags, language-specific conventions of formatting, etc).
 * Conversion process can take far more than you would expect. It could take several hours or more when dealing with a huge data set such as the English Wikipedia on a low-spec environments.
 * Because of nature of the task, WP2TXT needs much machine power and consumes a lot of memory/storage resources. The process thus could halt unexpectedly. It may even get stuck, in the worst case, without getting gracefully terminated. Please understand this and use the software __at your own risk__.
 
-### Useful
+### Useful Links
 
 * [Wikipedia Database backup dumps](http://dumps.wikimedia.org/backup-index.html)
 
-### Author
+### Author
 
 * Yoichiro Hasebe (<yohasebe@gmail.com>)
 
-### References
+### References
+
+The author will appreciate your mentioning one of these in your research.
 
 * Yoichiro HASEBE. 2006. [Method for using Wikipedia as Japanese corpus.](http://ci.nii.ac.jp/naid/110006226727) _Doshisha Studies in Language and Culture_ 9(2), 373-403.
 * 長谷部陽一郎. 2006. [Wikipedia日本語版をコーパスとして用いた言語研究の手法](http://ci.nii.ac.jp/naid/110006226727). 『言語文化』9(2), 373-403.
 
-### License
+### License
 
 This software is distributed under the MIT License. Please see the LICENSE file.
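The `--num-threads` help text in the README says the requested count is capped to the number of CPU cores, with 99 as a shorthand for "as many as possible". The diff for `lib/wp2txt.rb` is not shown in this section, so the following is only a sketch of what such a cap could look like, not the gem's actual code. `capped_threads` is a hypothetical name, and the lower bound of 1 is an added assumption; `Etc.nprocessors` is Ruby's standard call for the core count.

```ruby
require 'etc'

# Hypothetical helper: clamp a requested thread count to the range
# 1..cores, so a large value such as 99 simply means "use every core".
# The floor of 1 is an assumption, not something the README states.
def capped_threads(requested, cores = Etc.nprocessors)
  requested.clamp(1, cores)
end

puts capped_threads(4)   # at most 4, fewer on machines with fewer cores
puts capped_threads(99)  # the number of cores on this machine
```

Passing `cores` explicitly keeps the helper easy to check on any machine, for example `capped_threads(99, 8)` for an eight-core host.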
data/bin/benchmark.rb
CHANGED
@@ -12,15 +12,16 @@ require 'benchmark'
 data_dir = File.join(File.dirname(__FILE__), '..', "data")
 
 parent = Wp2txt::CmdProgbar.new
-input_file = File.join(data_dir, "
+input_file = File.join(data_dir, "testdata_ja.bz2")
 output_dir = data_dir
 tfile_size = 10
+num_threads = 1
 convert = true
 strip_tmarker = true
 
 Benchmark.bm do |x|
 x.report do
-wpconv = Wp2txt::Runner.new(parent, input_file, output_dir, tfile_size, convert, strip_tmarker)
+wpconv = Wp2txt::Runner.new(parent, input_file, output_dir, tfile_size, num_threads, convert, strip_tmarker)
 wpconv.extract_text do |article|
 format_wiki!(article.title)
 title = "[[#{article.title}]]\n"
@@ -58,11 +59,11 @@ Benchmark.bm do |x|
 end
 contents << line
 end
-
+format_wiki!(contents)
 convert_characters!(contents)
 
 ##### cleanup #####
-if /\A\s*\z/m =~ contents
+if /\A[\s ]*\z/m =~ contents
 result = ""
 else
 result = title + "\n" + contents
data/bin/wp2txt
CHANGED
@@ -4,18 +4,18 @@
 $: << File.join(File.dirname(__FILE__))
 $: << File.join(File.dirname(__FILE__), '..', 'lib')
 
-DEBUG_MODE =
+$DEBUG_MODE = false
 SHAREDIR = File.join(File.dirname(__FILE__), '..', 'share')
 DOCDIR = File.join(File.dirname(__FILE__), '..', 'doc')
 
 require 'wp2txt'
 require 'wp2txt/utils'
 require 'wp2txt/version'
-require '
+require 'optimist'
 
 include Wp2txt
 
-opts =
+opts = Optimist::options do
 version Wp2txt::VERSION
 banner <<-EOS
 WP2TXT extracts plain text data from Wikipedia dump file (encoded in XML/compressed with Bzip2) stripping all the MediaWiki markups and other metadata.
@@ -31,37 +31,40 @@ EOS
 opt :heading, "Show section titles in output", :default => true, :short => "-d"
 opt :title, "Show page titles in output", :default => true
 opt :table, "Show table source code in output", :default => false
-opt :
+opt :inline, "leave inline template notations as they are", :default => false
+opt :multiline, "leave multiline template notations as they are", :default => false
 opt :ref, "leave reference notations in the format [ref]...[/ref]", :default => false
 opt :redirect, "Show redirect destination", :default => false
 opt :marker, "Show symbols prefixed to list items, definitions, etc.", :default => true
-opt :category, "Show article category information", :default =>
+opt :category, "Show article category information", :default => true
+opt :category_only, "Extract only article title and categories", :default => false
 opt :file_size, "Approximate size (in MB) of each output file", :default => 10
+opt :num_threads, "Number of threads to be spawned (capped to the number of CPU cores; set 99 to spawn max num of threads)", :default => 4
 end
-
-
+Optimist::die :size, "must be larger than 0" unless opts[:file_size] >= 0
+Optimist::die :output_dir, "must exist" unless File.exist?(opts[:output_dir])
 
 input_file = ARGV[0]
 output_dir = opts[:output_dir]
 tfile_size = opts[:file_size]
+num_threads = opts[:num_threads]
 convert = opts[:convert]
 strip_tmarker = opts[:marker] ? false : true
-opt_array = [:title, :list, :heading, :table, :redirect]
-$
-$leave_table = true if opts[:table]
+opt_array = [:title, :list, :heading, :table, :redirect, :multiline]
+$leave_inline_template = true if opts[:inline]
 $leave_ref = true if opts[:ref]
+# $leave_table = true if opts[:table]
 config = {}
 opt_array.each do |opt|
 config[opt] = opts[opt]
 end
 
 parent = Wp2txt::CmdProgbar.new
-wpconv = Wp2txt::Runner.new(parent, input_file, output_dir, tfile_size, convert, strip_tmarker)
+wpconv = Wp2txt::Runner.new(parent, input_file, output_dir, tfile_size, num_threads, convert, strip_tmarker)
 
 wpconv.extract_text do |article|
 format_wiki!(article.title)
 title = "[[#{article.title}]]\n"
-convert_characters!(title)
 
 if opts[:category] && !article.categories.empty?
 contents = "\nCATEGORIES: "
@@ -71,67 +74,64 @@ wpconv.extract_text do |article|
 contents = ""
 end
 
-
-
-
-
-
-
-line = e.last
-line << "+HEADING+" if $DEBUG_MODE
-when :mw_paragraph
-# next if !config[:paragraph]
-format_wiki!(e.last)
-format_article!(e.last)
-line = e.last + "\n"
-line << "+PARAGRAPH+" if $DEBUG_MODE
-when :mw_table, :mw_htable
-next if !config[:table]
-# format_wiki!(e.last)
-line = e.last
-line << "+TABLE+" if $DEBUG_MODE
-when :mw_pre
-next if !config[:pre]
-line = e.last
-line << "+PRE+" if $DEBUG_MODE
-when :mw_quote
-# next if !config[:quote]
-# format_wiki!(e.last)
-line = e.last
-line << "+QUOTE+" if $DEBUG_MODE
-when :mw_unordered, :mw_ordered, :mw_definition
-next if !config[:list]
-# format_wiki!(e.last)
-line = e.last
-line << "+LIST+" if $DEBUG_MODE
-when :mw_redirect
-next if !config[:redirect]
-# format_wiki!(e.last)
-line = e.last
-line << "+REDIRECT+" if $DEBUG_MODE
-line << "\n\n"
-else
-if $DEBUG_MODE
-# format_wiki!(e.last)
+unless opts[:category_only]
+article.elements.each do |e|
+case e.first
+when :mw_heading
+next if !config[:heading]
+format_wiki!(e.last)
 line = e.last
-line << "+
-
+line << "+HEADING+" if $DEBUG_MODE
+when :mw_paragraph
+format_wiki!(e.last)
+line = e.last + "\n"
+line << "+PARAGRAPH+" if $DEBUG_MODE
+when :mw_table, :mw_htable
+next if !config[:table]
+line = e.last
+line << "+TABLE+" if $DEBUG_MODE
+when :mw_pre
+next if !config[:pre]
+line = e.last
+line << "+PRE+" if $DEBUG_MODE
+when :mw_quote
+line = e.last
+line << "+QUOTE+" if $DEBUG_MODE
+when :mw_unordered, :mw_ordered, :mw_definition
+next if !config[:list]
+line = e.last
+line << "+LIST+" if $DEBUG_MODE
+when :mw_ml_template
+next if !config[:multiline]
+line = e.last
+line << "+MLTEMPLATE+" if $DEBUG_MODE
+when :mw_redirect
+next if !config[:redirect]
+line = e.last
+line << "+REDIRECT+" if $DEBUG_MODE
+line << "\n\n"
+when :mw_isolated_template
+next if !config[:multiline]
+line = e.last
+line << "+ISOLATED_TEMPLATE+" if $DEBUG_MODE
+when :mw_isolated_tag
 next
+else
+if $DEBUG_MODE
+# format_wiki!(e.last)
+line = e.last
+line << "+OTHER+"
+else
+next
+end
 end
+contents << line << "\n"
 end
-contents << line
 end
-convert_characters!(contents)
-remove_table!(contents) unless $leave_table
-remove_ref!(contents) unless $leave_ref
 
-
-if /\A\s*\z/m =~ contents
+if /\A[\s ]*\z/m =~ contents
 result = ""
 else
-result = config[:title] ?
+result = config[:title] ? "\n#{title}\n" << contents : contents
 end
-
-result.gsub!(/\n\n\n+/m){"\n\n"}
-result << "\n"
-end
+end
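The cleanup checks in `bin/benchmark.rb` and `bin/wp2txt` both change `\s` into a character class with a second character, which appears in this listing as a plain space. One plausible reading, and only an assumption since the original character cannot be recovered here, is that it was a full-width (ideographic) space, U+3000, which is common in Japanese text. Ruby's `\s` matches ASCII whitespace only, so the two patterns would behave differently on such input. The sketch below illustrates that difference; `blank_ascii?` and `blank_wide?` are hypothetical names, not functions from wp2txt.

```ruby
# Ruby's \s covers only ASCII whitespace: space, \t, \r, \n, \f, \v.
def blank_ascii?(text)
  !!(/\A\s*\z/m =~ text)
end

# Assumed variant: additionally treat U+3000 (ideographic space) as blank.
def blank_wide?(text)
  !!(/\A[\s\u3000]*\z/m =~ text)
end

sample = " \n\u3000\n"
puts blank_ascii?(sample)  # false
puts blank_wide?(sample)   # true
```

Under that assumption, an article whose converted body contains nothing but line breaks and full-width spaces would be treated as empty by the new check and skipped, where the old check would have emitted it.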