wp2txt 1.0.0 → 1.0.1
- checksums.yaml +4 -4
- data/README.md +50 -14
- data/bin/wp2txt +9 -3
- data/image/screenshot.png +0 -0
- data/lib/wp2txt/utils.rb +0 -4
- data/lib/wp2txt/version.rb +1 -1
- data/lib/wp2txt.rb +52 -39
- data/wp2txt.gemspec +3 -3
- metadata +8 -7
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: d33a41cf46688679a14eb8c3eb16f6ed33ce9175c7f5b566c9f87998ba2c8401
+  data.tar.gz: 7371e0f7b06b2f0846f01d66f461c7e106778adc6e686919302f0f29b1f80a9e
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: cab8d9c27989387acc6dbbe052029d2205508ce10e38b8eedc111c822328d8eba551d603020684cbb3844a87b747f261a5959f711267acd96a3b97ccef4f6834
+  data.tar.gz: 4de59be37d57ef3d14ae2304660e8dde069bdf645a7cff862026562b26327984f1be13840e9d6ec1f25110222367f71c84a0286b649d71fec0c13805c6b0a647
data/README.md
CHANGED
@@ -1,26 +1,28 @@
 <img src='https://raw.githubusercontent.com/yohasebe/wp2txt/master/image/wp2txt-logo.svg' width="400" />
 
-
+A command-line toolkit to extract text content and category data from Wikipedia dump files
 
 ## About
 
-WP2TXT extracts
+WP2TXT extracts text and category data from Wikipedia dump files (encoded in XML / compressed with Bzip2), removing MediaWiki markup and other metadata.
 
 **UPDATE (August 2022)**
 
 1. A new option `--category-only` has been added. When this option is enabled, only the title and category information of the article is extracted.
-2. A new option `--summary-only` has been added. If this option is enabled, only the title
-3.
+2. A new option `--summary-only` has been added. If this option is enabled, only the title, category information, and opening paragraphs of the article will be extracted.
+3. Text conversion with the current version of WP2TXT is *more than 2x times faster* than the previous version due to parallel processing of multiple files (the rate of speedup depends on the CPU cores used for processing).
 
 ## Screenshot
 
 <img src='https://raw.githubusercontent.com/yohasebe/wp2txt/master/image/screenshot.png' width="700" />
 
-
-- MacBook Pro (2019) 2.3GHz 8Core Intel Core i9
-- enwiki-20220802-pages-articles.xml.bz2 (approx. 20GB)
+**Environment**
 
-
+- WP2TXT 1.0.1
+- MacBook Pro (2021 Apple M1 Pro)
+- enwiki-20220720-pages-articles.xml.bz2 (19.98 GB)
+
+In the above environment, the process (decompression, splitting, extraction, and conversion) to obtain the plain text data of the English Wikipedia takes less than 1.5 hours.
 
 ## Features
 
@@ -30,23 +32,45 @@ In the above environment, the process (decompression, splitting, extraction, and
 - Allows extracting category information of the article
 - Allows extracting opening paragraphs of the article
 
+## Preparation
+
+### For MacOS / Linux/ WSL2
+
+WP2TXT requires that one of the following commands be installed on the system in order to decompress `bz2` files:
+
+- `lbzip2` (recommended)
+- `pbzip2`
+- `bzip2`
+
+In most cases, the `bzip2` command is pre-installed on the system. However, since `lbzip2` can use multiple CPU cores and is faster than `bzip2`, it is recommended that you install it additionally. WP2TXT will attempt to find the decompression command available on your system in the order listed above.
+
+If you are using MacOS with Homebrew installed, you can install `lbzip2` with the following command:
+
+    $ brew install lbzip2
+
+### For Windows
+
+Install [Bzip2 for Windows](http://gnuwin32.sourceforge.net/packages/bzip2.htm) and set the path so that WP2TXT can use the bunzip2.exe command. Alternatively, you can extract the Wikipedia dump file in your own way and process the resulting XML file with WP2TXT.
+
 ## Installation
 
+### WP2TXT command
+
     $ gem install wp2txt
 
-##
+## Wikipedia Dump File
 
-
+Download the latest Wikipedia dump file for the desired language at a URL such as
 
-    https://dumps.wikimedia.org/
+    https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
 
-
+Here, `enwiki` refers to the English Wikipedia. To get the Japanese Wikipedia dump file, for instance, change this to jawiki (Japanese). In doing so, note that there are two instances of `enwiki` in the URL above.
 
 Alternatively, you can also select Wikipedia dump files created on a specific date from [here](http://dumps.wikimedia.org/backup-index.html). Make sure to download a file named in the following format:
 
     xxwiki-yyyymmdd-pages-articles.xml.bz2
 
-where `xx` is language code such as `en` (English)" or `
+where `xx` is language code such as `en` (English)" or `ja` (japanese), and `yyyymmdd` is the date of creation (e.g. `20220801`).
 
 ## Basic Usage
 
@@ -124,7 +148,7 @@ Command line options are as follows:
     -g, --category-only            Extract only article title and categories
     -s, --summary-only             Extract only article title, categories, and summary text before first heading
     -f, --file-size=<i>            Approximate size (in MB) of each output file (default: 10)
-    -n, --num-procs                Number of proccesses to be run concurrently (default: max num of CPU cores minus two)
+    -n, --num-procs                Number of proccesses to be run concurrently (default: max num of available CPU cores minus two)
     -x, --del-interfile            Delete intermediate XML files from output dir
     -t, --title, --no-title        Keep page titles in output (default: true)
     -d, --heading, --no-heading    Keep section titles in output (default: true)
@@ -132,6 +156,7 @@ Command line options are as follows:
     -r, --ref                      Keep reference notations in the format [ref]...[/ref]
     -e, --redirect                 Show redirect destination
     -m, --marker, --no-marker      Show symbols prefixed to list items, definitions, etc. (Default: true)
+    -b, --bz2-gem                  Use Ruby's bzip2-ruby gem instead of a system command
     -v, --version                  Print version and exit
     -h, --help                     Show this message
 
@@ -156,6 +181,17 @@ The author will appreciate your mentioning one of these in your research.
 * Yoichiro HASEBE. 2006. [Method for using Wikipedia as Japanese corpus.](http://ci.nii.ac.jp/naid/110006226727) _Doshisha Studies in Language and Culture_ 9(2), 373-403.
 * 長谷部陽一郎. 2006. [Wikipedia日本語版をコーパスとして用いた言語研究の手法](http://ci.nii.ac.jp/naid/110006226727). 『言語文化』9(2), 373-403.
 
+Or use this BibTeX entry:
+
+```
+@misc{WP2TXT_2022,
+    author = {Yoichiro Hasebe},
+    title = {WP2TXT: A command-line toolkit to extract text content and category data from Wikipedia dump files},
+    url = {https://github.com/yohasebe/wp2txt}
+    year = {2022},
+}
+```
+
 ## License
 
 This software is distributed under the MIT License. Please see the LICENSE file.
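Review note on the README's Preparation section: the lookup order it describes (`lbzip2`, then `pbzip2`, then `bzip2`) can be illustrated with a small, self-contained Ruby sketch. This is not the code shipped in the gem; `find_decompressor` and its parameters are hypothetical names chosen for illustration, and the sketch assumes a POSIX-style `PATH` in which executables carry no file extension.

```ruby
# Hypothetical helper, not part of WP2TXT: return the full path of the first
# decompressor found on PATH, trying the candidates in the order the README
# lists them. Returns nil when none of the candidates is installed.
def find_decompressor(candidates = %w[lbzip2 pbzip2 bzip2],
                      path_env = ENV.fetch("PATH", ""))
  dirs = path_env.split(File::PATH_SEPARATOR).reject(&:empty?)
  candidates.each do |cmd|
    dirs.each do |dir|
      full = File.join(dir, cmd)
      # Accept only regular files that the current user may execute.
      return full if File.file?(full) && File.executable?(full)
    end
  end
  nil
end

if $PROGRAM_NAME == __FILE__
  found = find_decompressor
  puts(found ? "Using #{found}" : "No bzip2-compatible command found")
end
```

The candidate loop is the outer loop on purpose: a preferred command found late in `PATH` still wins over a less preferred one found early, which matches the preference order stated in the README.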
data/bin/wp2txt
CHANGED
@@ -43,6 +43,7 @@ EOS
   opt :ref, "Keep reference notations in the format [ref]...[/ref]", :default => false, :short => "-r"
   opt :redirect, "Show redirect destination", :default => false, :short => "-e"
   opt :marker, "Show symbols prefixed to list items, definitions, etc.", :default => true, :short => "-m"
+  opt :bz2_gem, "Use Ruby's bzip2-ruby gem instead of a system command", :default => false, :short => "-b"
 end
 
 Optimist::die :size, "must be larger than 0" unless opts[:file_size] >= 0
@@ -72,7 +73,8 @@ opt_array = [:title,
              :category,
              :category_only,
              :summary_only,
-             :del_interfile
+             :del_interfile,
+             :bz2_gem ]
 
 $leave_inline_template = true if opts[:inline]
 $leave_ref = true if opts[:ref]
@@ -90,11 +92,15 @@ else
   puts "Decompressing and splitting the original dump file."
   puts pastel.underline("This may take a while. Please be patient!")
 
+  time_start = Time.now.to_i
+  wpsplitter = Wp2txt::Splitter.new(input_file, output_dir, tfile_size)
   spinner = TTY::Spinner.new(":spinner", format: :arrow_pulse, hide_cursor: true, interval: 5)
   spinner.auto_spin
-  wpsplitter = Wp2txt::Splitter.new(input_file, output_dir, tfile_size)
   wpsplitter.split_file
-
+  time_finish = Time.now.to_i
+
+  spinner.stop("Time: #{sec_to_str(time_finish - time_start)}")# Stop animation
+  puts pastel.blue.bold("Complete!")
   exit if !convert
   input_files = Dir.glob("#{output_dir}/*.xml")
 end
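Review note on the timing added in this hunk: the elapsed time is computed from two `Time.now.to_i` readings, which follow the wall clock and can shift if the system clock is adjusted during a long run. A possible alternative, sketched here under assumptions, is to read the monotonic clock instead. `timed` and `format_elapsed` are hypothetical names introduced only for this example; `format_elapsed` is a stand-in and is not claimed to match the output of the gem's own `sec_to_str`.

```ruby
# Hypothetical helper: run the block and return [block_result, elapsed_seconds],
# measuring with the monotonic clock so wall-clock adjustments do not matter.
def timed
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  [result, Process.clock_gettime(Process::CLOCK_MONOTONIC) - start]
end

# Hypothetical formatter: render a number of seconds as HH:MM:SS.
def format_elapsed(seconds)
  hours, rest = seconds.to_i.divmod(3600)
  minutes, secs = rest.divmod(60)
  format("%02d:%02d:%02d", hours, minutes, secs)
end

if $PROGRAM_NAME == __FILE__
  _, elapsed = timed { sleep 0.1 }
  puts "Time: #{format_elapsed(elapsed)}"
end
```

In the script above, the call to `wpsplitter.split_file` would be the body of the block passed to `timed`.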
data/image/screenshot.png
CHANGED
Binary file
|
data/lib/wp2txt/utils.rb
CHANGED
data/lib/wp2txt/version.rb
CHANGED
data/lib/wp2txt.rb
CHANGED
@@ -7,26 +7,22 @@ require "nokogiri"
 require "wp2txt/article"
 require "wp2txt/utils"
 
-begin
-  require "bzip2-ruby"
-  NO_BZ2 = false
-rescue LoadError
-  # in case bzip2-ruby gem is not available
-  NO_BZ2 = true
-end
-
 module Wp2txt
   class Splitter
     include Wp2txt
-    def initialize(input_file, output_dir = ".", tfile_size = 10)
+    def initialize(input_file, output_dir = ".", tfile_size = 10, bz2_gem = false)
       @fp = nil
       @input_file = input_file
       @output_dir = output_dir
       @tfile_size = tfile_size
-
+      if bz2_gem
+        require "bzip2-ruby"
+      end
+      @bz2_gem = bz2_gem
+      prepare
     end
-
-    def file_size(file)
+
+    def file_size(file)
       origin = Time.now
       size = 0; unit = 10485760; star = 0; before = Time.now.to_f
       error_count = 10
@@ -36,7 +32,7 @@ module Wp2txt
         rescue => e
           a = nil
         end
-        break unless a
+        break unless a
 
         present = Time.now.to_f
         size += a.size
@@ -44,12 +40,29 @@ module Wp2txt
           star = 0 if star > 10
           star += 1
           before = present
-        end
+        end
       end
       time_elapsed = Time.now - origin
       size
     end
 
+    # check if a given command exists: return the path if it does, return false if not
+    def command_exist?(command)
+      basename = File.basename(command)
+      path = ""
+      print "Checking #{basename}: "
+      if open("| which #{command} 2>/dev/null"){ |f| path = f.gets.strip }
+        puts "detected [#{path}]"
+        return path.strip
+      elsif open("| which #{basename} 2>/dev/null"){ |f| path = f.gets.strip }
+        puts "detected [#{path}]"
+        return path.strip
+      else
+        puts "not found"
+        return false
+      end
+    end
+
     # check the size of input file (bz2 or plain xml) when decompressed
     def prepare
       # if output_dir is not specified, output in the same directory
@@ -58,31 +71,31 @@ module Wp2txt
         @output_dir = File.dirname(@input_file)
       end
 
-      # if input file is bz2 compressed, use bz2-ruby if available,
-      # use command line bzip2 program otherwise.
       if /.bz2$/ =~ @input_file
-
+        if @bz2_gem
           file = Bzip2::Reader.new File.open(@input_file, "r:UTF-8")
+        elsif RUBY_PLATFORM.index("win32")
+          file = IO.popen("bunzip2.exe -c #{@input_file}")
         else
-          if
-
-
-            file = IO.popen("
+          if bzpath = command_exist?("lbzip2") ||
+            command_exist?("pbzip2") ||
+            command_exist?("bzip2")
+            file = IO.popen("#{bzpath} -c -d #{@input_file}")
           end
-        end
+        end
       else # meaning that it is a text file
         @infile_size = File.stat(@input_file).size
         file = open(@input_file)
       end
 
       #create basename of output file
-      @outfile_base = File.basename(@input_file, ".*") + "-"
+      @outfile_base = File.basename(@input_file, ".*") + "-"
       @total_size = 0
       @file_index = 1
       outfilename = File.join(@output_dir, @outfile_base + @file_index.to_s)
       @outfiles = []
       @outfiles << outfilename
-      @fp = File.open(outfilename, "w")
+      @fp = File.open(outfilename, "w")
       @file_pointer = file
       return true
     end
@@ -100,7 +113,7 @@ module Wp2txt
         # temp_buf is filled with text split by "\n"
         temp_buf = []
         ss = StringScanner.new(new_lines)
-        while ss.scan(/.*?\n/m)
+        while ss.scan(/.*?\n/m)
           temp_buf << ss[0]
         end
         temp_buf << ss.rest unless ss.eos?
@@ -122,16 +135,16 @@ module Wp2txt
     end
 
     def get_newline
-      @buffer ||= [""]
+      @buffer ||= [""]
       if @buffer.size == 1
         return nil unless fill_buffer
       end
       if @buffer.empty?
         return nil
-      else
+      else
         new_line = @buffer.shift
         return new_line
-      end
+      end
     end
 
     def split_file
@@ -145,7 +158,7 @@ module Wp2txt
         output_text << text
         end_flag = true if @total_size > (@tfile_size * 1024 * 1024)
         # never close the file until the end of the page even if end_flag is on
-        if end_flag && /<\/page/ =~ text
+        if end_flag && /<\/page/ =~ text
           @fp.puts(output_text)
           output_text = ""
           @total_size = 0
@@ -159,15 +172,15 @@ module Wp2txt
         end
       end
       @fp.puts(output_text) if output_text != ""
-      @fp.close
+      @fp.close
 
       if File.size(outfilename) == 0
-        File.delete(outfilename)
+        File.delete(outfilename)
         @outfiles.delete(outfilename)
       end
 
-      rename(@outfiles, "xml")
-    end
+      rename(@outfiles, "xml")
+    end
   end
 
   class Runner
@@ -181,7 +194,7 @@ module Wp2txt
       @del_interfile = del_interfile
       prepare
     end
-
+
     def prepare
       @infile_size = File.stat(@input_file).size
       file = open(@input_file)
@@ -203,7 +216,7 @@ module Wp2txt
         # temp_buf is filled with text split by "\n"
         temp_buf = []
         ss = StringScanner.new(new_lines)
-        while ss.scan(/.*?\n/m)
+        while ss.scan(/.*?\n/m)
           temp_buf << ss[0]
         end
         temp_buf << ss.rest unless ss.eos?
@@ -225,16 +238,16 @@ module Wp2txt
     end
 
     def get_newline
-      @buffer ||= [""]
+      @buffer ||= [""]
       if @buffer.size == 1
         return nil unless fill_buffer
       end
       if @buffer.empty?
         return nil
-      else
+      else
         new_line = @buffer.shift
         return new_line
-      end
+      end
     end
 
     def get_page
@@ -270,7 +283,7 @@ module Wp2txt
       pages = []
       data_empty = false
 
-      while !data_empty
+      while !data_empty
         page = get_page
         if page
           pages << page
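Review note on `command_exist?` and the `IO.popen` calls added in this file: `f.gets` appears to return `nil` when `which` prints nothing, in which case `.strip` would raise before the `elsif` and `else` branches are reached. Interpolating `@input_file` into a shell string also looks fragile for paths that contain spaces or shell metacharacters. The sketch below shows one way to avoid both by passing argument vectors to `IO.popen`, which bypasses the shell. `which_path`, `decompress_argv`, and `open_decompressed` are hypothetical names, not part of WP2TXT, and the sketch assumes a `which` command is available on the system.

```ruby
# Hypothetical helper: ask `which` for a command's location. Returns the path
# as a String, or nil when the command (or `which` itself) is not available.
def which_path(command)
  out = IO.popen(["which", command], err: File::NULL, &:read)
  path = out.to_s.strip
  path.empty? ? nil : path
rescue Errno::ENOENT
  nil
end

# Hypothetical helper: build the argument vector for a bzip2-compatible
# decompressor that writes the decompressed stream to stdout.
def decompress_argv(bzpath, input_file)
  [bzpath, "-c", "-d", input_file]
end

# Hypothetical helper: open a readable IO on the decompressed stream. Each
# array element reaches the child process as one argument, so a file name
# containing spaces needs no quoting.
def open_decompressed(bzpath, input_file)
  IO.popen(decompress_argv(bzpath, input_file))
end

if $PROGRAM_NAME == __FILE__ && ARGV[0]
  bzpath = %w[lbzip2 pbzip2 bzip2].lazy.map { |c| which_path(c) }.find(&:itself)
  abort "no bzip2-compatible command found" unless bzpath
  io = open_decompressed(bzpath, ARGV[0])
  puts io.gets # first line of the decompressed dump
  io.close
end
```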
data/wp2txt.gemspec
CHANGED
@@ -7,9 +7,9 @@ Gem::Specification.new do |s|
   s.version = Wp2txt::VERSION
   s.authors = ["Yoichiro Hasebe"]
   s.email = ["yohasebe@gmail.com"]
-  s.homepage = "
-  s.summary = %q{
-  s.description = %q{WP2TXT extracts
+  s.homepage = "https://github.com/yohasebe/wp2txt"
+  s.summary = %q{A command-line toolkit to extract text content and category data from Wikipedia dump files}
+  s.description = %q{WP2TXT extracts text and category data from Wikipedia dump files (encoded in XML / compressed with Bzip2), removing MediaWiki markup and other metadata.}
 
   s.rubyforge_project = "wp2txt"
 
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: wp2txt
 version: !ruby/object:Gem::Version
-  version: 1.0.
+  version: 1.0.1
 platform: ruby
 authors:
 - Yoichiro Hasebe
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2022-08-
+date: 2022-08-11 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: nokogiri
@@ -108,8 +108,8 @@ dependencies:
     - - ">="
       - !ruby/object:Gem::Version
         version: '0'
-description: WP2TXT extracts
-  XML/compressed with Bzip2)
+description: WP2TXT extracts text and category data from Wikipedia dump files (encoded
+  in XML / compressed with Bzip2), removing MediaWiki markup and other metadata.
 email:
 - yohasebe@gmail.com
 executables:
@@ -141,7 +141,7 @@ files:
 - spec/spec_helper.rb
 - spec/utils_spec.rb
 - wp2txt.gemspec
-homepage:
+homepage: https://github.com/yohasebe/wp2txt
 licenses: []
 metadata: {}
 post_install_message:
@@ -159,10 +159,11 @@ required_rubygems_version: !ruby/object:Gem::Requirement
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubygems_version: 3.3.
+rubygems_version: 3.3.7
 signing_key:
 specification_version: 4
-summary:
+summary: A command-line toolkit to extract text content and category data from Wikipedia
+  dump files
 test_files:
 - spec/spec_helper.rb
 - spec/utils_spec.rb