wp2txt 1.0.0 → 1.0.2
- checksums.yaml +4 -4
- data/README.md +59 -17
- data/bin/wp2txt +9 -3
- data/image/screenshot.png +0 -0
- data/lib/wp2txt/utils.rb +22 -25
- data/lib/wp2txt/version.rb +1 -1
- data/lib/wp2txt.rb +52 -39
- data/tags +58 -0
- data/wp2txt.gemspec +3 -3
- metadata +8 -6
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: bb540f4f17f7825786d110245c235ac556e3e64cedb17efae3e0591887425801
|
4
|
+
data.tar.gz: 479c357f7ba117ae10d9a5a04d24ce3aca2e54d942a156b02eb932c1aab55c8b
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 940d47d2c8bce06029fe76e3b3744563d089e26e297e5224b36e65d815295da57117eae84cbb43abeddf2f2c052e2a987d668cba52c7af6148e935b571b6d403
|
7
|
+
data.tar.gz: 8ce76523a3bf181ac7a5da11f088dd14cfb1e1d7ac0d5239832db52968d183db16a3ece6074513b634eebe0e5ca28ceea945eaef6542ecb1933266caf4e89a3c
|
data/README.md
CHANGED
@@ -1,26 +1,34 @@
|
|
1
1
|
<img src='https://raw.githubusercontent.com/yohasebe/wp2txt/master/image/wp2txt-logo.svg' width="400" />
|
2
2
|
|
3
|
-
|
3
|
+
A command-line toolkit to extract text content and category data from Wikipedia dump files
|
4
4
|
|
5
5
|
## About
|
6
6
|
|
7
|
-
WP2TXT extracts
|
7
|
+
WP2TXT extracts text and category data from Wikipedia dump files (encoded in XML / compressed with Bzip2), removing MediaWiki markup and other metadata.
|
8
8
|
|
9
|
-
|
9
|
+
## Changelog
|
10
10
|
|
11
|
-
|
12
|
-
|
13
|
-
|
11
|
+
**November 2022**
|
12
|
+
|
13
|
+
- Code added to suppress the "invalid byte sequence" error when an illegal UTF-8 character is input.
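As a rough illustration of what dropping invalid bytes means, here is a minimal Ruby sketch. It is not the code used by WP2TXT: `drop_invalid_utf8` is a hypothetical helper, and it relies on Ruby's standard `String#scrub` as one possible way to discard bytes that are not valid UTF-8.

```ruby
# Hypothetical helper (not part of WP2TXT): returns a copy of the string
# with every byte sequence that is invalid in its encoding removed.
def drop_invalid_utf8(text)
  # String#scrub replaces invalid byte sequences with the given string;
  # passing "" simply deletes them.
  text.scrub("")
end

broken = "abc\xFFdef"                       # \xFF can never appear in valid UTF-8
puts broken.valid_encoding?                 # the raw string is not valid UTF-8
puts drop_invalid_utf8(broken)              # the offending byte is dropped
```

Removing bytes silently loses information, so a tool may prefer a visible replacement character where that matters.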
|
14
|
+
|
15
|
+
**August 2022**
|
16
|
+
|
17
|
+
- A new option `--category-only` has been added. When this option is enabled, only the title and category information of the article is extracted.
|
18
|
+
- A new option `--summary-only` has been added. If this option is enabled, only the title, category information, and opening paragraphs of the article will be extracted.
|
19
|
+
- Text conversion with the current version of WP2TXT is *more than twice as fast* as with the previous version due to parallel processing of multiple files (the rate of speedup depends on the CPU cores used for processing).
|
14
20
|
|
15
21
|
## Screenshot
|
16
22
|
|
17
|
-
<img src='https://raw.githubusercontent.com/yohasebe/wp2txt/master/image/screenshot.png' width="
|
23
|
+
<img src='https://raw.githubusercontent.com/yohasebe/wp2txt/master/image/screenshot.png' width="800" />
|
18
24
|
|
19
|
-
|
20
|
-
- MacBook Pro (2019) 2.3GHz 8Core Intel Core i9
|
21
|
-
- enwiki-20220802-pages-articles.xml.bz2 (approx. 20GB)
|
25
|
+
**Environment**
|
22
26
|
|
23
|
-
|
27
|
+
- WP2TXT 1.0.1
|
28
|
+
- MacBook Pro (2021 Apple M1 Pro)
|
29
|
+
- enwiki-20220720-pages-articles.xml.bz2 (19.98 GB)
|
30
|
+
|
31
|
+
In the above environment, the process (decompression, splitting, extraction, and conversion) to obtain the plain text data of the English Wikipedia takes less than 1.5 hours.
|
24
32
|
|
25
33
|
## Features
|
26
34
|
|
@@ -30,23 +38,45 @@ In the above environment, the process (decompression, splitting, extraction, and
|
|
30
38
|
- Allows extracting category information of the article
|
31
39
|
- Allows extracting opening paragraphs of the article
|
32
40
|
|
41
|
+
## Preparation
|
42
|
+
|
43
|
+
### For MacOS and Linux
|
44
|
+
|
45
|
+
WP2TXT requires that one of the following commands be installed on the system in order to decompress `bz2` files:
|
46
|
+
|
47
|
+
- `lbzip2` (recommended)
|
48
|
+
- `pbzip2`
|
49
|
+
- `bzip2`
|
50
|
+
|
51
|
+
In most cases, the `bzip2` command is pre-installed on the system. However, since `lbzip2` can use multiple CPU cores and is faster than `bzip2`, it is recommended that you install it additionally. WP2TXT will attempt to find the decompression command available on your system in the order listed above.
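The lookup order described above could be sketched roughly as follows. This is an illustrative Ruby sketch rather than WP2TXT's own implementation: `find_decompressor` is a hypothetical helper, and it scans the directories in `PATH` directly instead of shelling out.

```ruby
# Hypothetical helper (not part of WP2TXT): return the full path of the first
# decompression command found, trying the candidates in preference order.
def find_decompressor(candidates = %w[lbzip2 pbzip2 bzip2],
                      path_env = ENV.fetch("PATH", ""))
  dirs = path_env.split(File::PATH_SEPARATOR)
  candidates.each do |name|
    dirs.each do |dir|
      full = File.join(dir, name)
      # Accept only regular files that the current user may execute.
      return full if File.file?(full) && File.executable?(full)
    end
  end
  nil # none of the candidates is installed
end

puts find_decompressor || "no bzip2-compatible command found"
```

A `nil` result means none of the three commands was found on the search path, in which case one of them needs to be installed first.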
|
52
|
+
|
53
|
+
If you are using MacOS with Homebrew installed, you can install `lbzip2` with the following command:
|
54
|
+
|
55
|
+
$ brew install lbzip2
|
56
|
+
|
57
|
+
### For Windows
|
58
|
+
|
59
|
+
Install [Bzip2 for Windows](http://gnuwin32.sourceforge.net/packages/bzip2.htm) and set the path so that WP2TXT can use the bunzip2.exe command. Alternatively, you can extract the Wikipedia dump file in your own way and process the resulting XML file with WP2TXT.
|
60
|
+
|
33
61
|
## Installation
|
34
62
|
|
63
|
+
### WP2TXT command
|
64
|
+
|
35
65
|
$ gem install wp2txt
|
36
66
|
|
37
|
-
##
|
67
|
+
## Wikipedia Dump File
|
38
68
|
|
39
|
-
|
69
|
+
Download the latest Wikipedia dump file for the desired language at a URL such as
|
40
70
|
|
41
|
-
https://dumps.wikimedia.org/
|
71
|
+
https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
|
42
72
|
|
43
|
-
|
73
|
+
Here, `enwiki` refers to the English Wikipedia. To get the Japanese Wikipedia dump file, for instance, change this to `jawiki`. Note that `enwiki` occurs twice in the URL above, so both instances must be changed.
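The substitution can be made mechanical with a small helper. The following Ruby sketch is only an illustration: `latest_dump_url` is a hypothetical function, and it assumes the URL layout shown above applies to the chosen language.

```ruby
# Hypothetical helper (not part of WP2TXT): build the "latest" dump URL
# for a language code, replacing both occurrences of the wiki name.
def latest_dump_url(lang = "en")
  wiki = "#{lang}wiki"
  "https://dumps.wikimedia.org/#{wiki}/latest/#{wiki}-latest-pages-articles.xml.bz2"
end

puts latest_dump_url("ja")
```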
|
44
74
|
|
45
75
|
Alternatively, you can also select Wikipedia dump files created on a specific date from [here](http://dumps.wikimedia.org/backup-index.html). Make sure to download a file named in the following format:
|
46
76
|
|
47
77
|
xxwiki-yyyymmdd-pages-articles.xml.bz2
|
48
78
|
|
49
|
-
where `xx` is language code such as `en` (English)" or `
|
79
|
+
where `xx` is a language code such as `en` (English) or `ja` (Japanese), and `yyyymmdd` is the date of creation (e.g. `20220801`).
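To check that a downloaded file follows this naming format, a pattern match along these lines may help. This is a sketch under assumptions: `parse_dump_name` and `DUMP_NAME_PATTERN` are hypothetical names, and the pattern assumes language codes consist of lowercase letters and underscores.

```ruby
# Hypothetical pattern for names such as enwiki-20220801-pages-articles.xml.bz2.
# Assumption: the language code uses lowercase letters and underscores only.
DUMP_NAME_PATTERN = /\A([a-z_]+)wiki-(\d{8})-pages-articles\.xml\.bz2\z/

# Returns a hash with the language code and creation date,
# or nil when the name does not follow the format.
def parse_dump_name(filename)
  m = DUMP_NAME_PATTERN.match(File.basename(filename))
  return nil unless m

  { lang: m[1], date: m[2] }
end

p parse_dump_name("enwiki-20220801-pages-articles.xml.bz2")
```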
|
50
80
|
|
51
81
|
## Basic Usage
|
52
82
|
|
@@ -124,7 +154,7 @@ Command line options are as follows:
|
|
124
154
|
-g, --category-only Extract only article title and categories
|
125
155
|
-s, --summary-only Extract only article title, categories, and summary text before first heading
|
126
156
|
-f, --file-size=<i> Approximate size (in MB) of each output file (default: 10)
|
127
|
-
-n, --num-procs Number of proccesses to be run concurrently (default: max num of CPU cores minus two)
|
157
|
+
-n, --num-procs Number of proccesses to be run concurrently (default: max num of available CPU cores minus two)
|
128
158
|
-x, --del-interfile Delete intermediate XML files from output dir
|
129
159
|
-t, --title, --no-title Keep page titles in output (default: true)
|
130
160
|
-d, --heading, --no-heading Keep section titles in output (default: true)
|
@@ -132,6 +162,7 @@ Command line options are as follows:
|
|
132
162
|
-r, --ref Keep reference notations in the format [ref]...[/ref]
|
133
163
|
-e, --redirect Show redirect destination
|
134
164
|
-m, --marker, --no-marker Show symbols prefixed to list items, definitions, etc. (Default: true)
|
165
|
+
-b, --bz2-gem Use Ruby's bzip2-ruby gem instead of a system command
|
135
166
|
-v, --version Print version and exit
|
136
167
|
-h, --help Show this message
|
137
168
|
|
@@ -156,6 +187,17 @@ The author will appreciate your mentioning one of these in your research.
|
|
156
187
|
* Yoichiro HASEBE. 2006. [Method for using Wikipedia as Japanese corpus.](http://ci.nii.ac.jp/naid/110006226727) _Doshisha Studies in Language and Culture_ 9(2), 373-403.
|
157
188
|
* 長谷部陽一郎. 2006. [Wikipedia日本語版をコーパスとして用いた言語研究の手法](http://ci.nii.ac.jp/naid/110006226727). 『言語文化』9(2), 373-403.
|
158
189
|
|
190
|
+
Or use this BibTeX entry:
|
191
|
+
|
192
|
+
```
|
193
|
+
@misc{wp2txt_2022,
|
194
|
+
author = {Yoichiro Hasebe},
|
195
|
+
title = {WP2TXT: A command-line toolkit to extract text content and category data from Wikipedia dump files},
|
196
|
+
url = {https://github.com/yohasebe/wp2txt},
|
197
|
+
year = {2022}
|
198
|
+
}
|
199
|
+
```
|
200
|
+
|
159
201
|
## License
|
160
202
|
|
161
203
|
This software is distributed under the MIT License. Please see the LICENSE file.
|
data/bin/wp2txt
CHANGED
@@ -43,6 +43,7 @@ EOS
|
|
43
43
|
opt :ref, "Keep reference notations in the format [ref]...[/ref]", :default => false, :short => "-r"
|
44
44
|
opt :redirect, "Show redirect destination", :default => false, :short => "-e"
|
45
45
|
opt :marker, "Show symbols prefixed to list items, definitions, etc.", :default => true, :short => "-m"
|
46
|
+
opt :bz2_gem, "Use Ruby's bzip2-ruby gem instead of a system command", :default => false, :short => "-b"
|
46
47
|
end
|
47
48
|
|
48
49
|
Optimist::die :size, "must be larger than 0" unless opts[:file_size] >= 0
|
@@ -72,7 +73,8 @@ opt_array = [:title,
|
|
72
73
|
:category,
|
73
74
|
:category_only,
|
74
75
|
:summary_only,
|
75
|
-
:del_interfile
|
76
|
+
:del_interfile,
|
77
|
+
:bz2_gem ]
|
76
78
|
|
77
79
|
$leave_inline_template = true if opts[:inline]
|
78
80
|
$leave_ref = true if opts[:ref]
|
@@ -90,11 +92,15 @@ else
|
|
90
92
|
puts "Decompressing and splitting the original dump file."
|
91
93
|
puts pastel.underline("This may take a while. Please be patient!")
|
92
94
|
|
95
|
+
time_start = Time.now.to_i
|
96
|
+
wpsplitter = Wp2txt::Splitter.new(input_file, output_dir, tfile_size)
|
93
97
|
spinner = TTY::Spinner.new(":spinner", format: :arrow_pulse, hide_cursor: true, interval: 5)
|
94
98
|
spinner.auto_spin
|
95
|
-
wpsplitter = Wp2txt::Splitter.new(input_file, output_dir, tfile_size)
|
96
99
|
wpsplitter.split_file
|
97
|
-
|
100
|
+
time_finish = Time.now.to_i
|
101
|
+
|
102
|
+
spinner.stop("Time: #{sec_to_str(time_finish - time_start)}")# Stop animation
|
103
|
+
puts pastel.blue.bold("Complete!")
|
98
104
|
exit if !convert
|
99
105
|
input_files = Dir.glob("#{output_dir}/*.xml")
|
100
106
|
end
|
data/image/screenshot.png
CHANGED
Binary file
|
data/lib/wp2txt/utils.rb
CHANGED
@@ -41,7 +41,7 @@ $in_table_regex2 = Regexp.new('^\|\}.*?$')
|
|
41
41
|
$in_unordered_regex = Regexp.new('^\*')
|
42
42
|
$in_ordered_regex = Regexp.new('^\#')
|
43
43
|
$in_pre_regex = Regexp.new('^ ')
|
44
|
-
$in_definition_regex = Regexp.new('^[\;\:]')
|
44
|
+
$in_definition_regex = Regexp.new('^[\;\:]')
|
45
45
|
$blank_line_regex = Regexp.new('^\s*$')
|
46
46
|
$redirect_regex = Regexp.new('#(?:REDIRECT|転送)\s+\[\[(.+)\]\]', Regexp::IGNORECASE)
|
47
47
|
$remove_tag_regex = Regexp.new("\<[^\<\>]*\>")
|
@@ -98,11 +98,12 @@ $cleanup_regex_08 = Regexp.new('\n\n\n+', Regexp::MULTILINE)
|
|
98
98
|
module Wp2txt
|
99
99
|
|
100
100
|
def convert_characters!(text, has_retried = false)
|
101
|
-
begin
|
102
|
-
text << ""
|
101
|
+
begin
|
102
|
+
text << ""
|
103
103
|
chrref_to_utf!(text)
|
104
104
|
special_chr!(text)
|
105
|
-
|
105
|
+
text.encode!("UTF-8", "UTF-8", invalid: :replace, replace: "")
|
106
|
+
|
106
107
|
rescue # detect invalid byte sequence in UTF-8
|
107
108
|
if has_retried
|
108
109
|
puts "invalid byte sequence detected"
|
@@ -112,20 +113,20 @@ module Wp2txt
|
|
112
113
|
end
|
113
114
|
exit
|
114
115
|
else
|
115
|
-
text.encode!("UTF-16")
|
116
|
-
text.encode!("UTF-
|
116
|
+
text.encode!("UTF-16", "UTF-16", invalid: :replace, replace: "")
|
117
|
+
text.encode!("UTF-16", "UTF-16", invalid: :replace, replace: "")
|
117
118
|
convert_characters!(text, true)
|
118
119
|
end
|
119
120
|
end
|
120
121
|
end
|
121
|
-
|
122
|
+
|
122
123
|
def format_wiki!(text, has_retried = false)
|
123
124
|
remove_complex!(text)
|
124
125
|
|
125
126
|
escape_nowiki!(text)
|
126
127
|
process_interwiki_links!(text)
|
127
128
|
process_external_links!(text)
|
128
|
-
unescape_nowiki!(text)
|
129
|
+
unescape_nowiki!(text)
|
129
130
|
remove_directive!(text)
|
130
131
|
remove_emphasis!(text)
|
131
132
|
mndash!(text)
|
@@ -135,7 +136,7 @@ module Wp2txt
|
|
135
136
|
remove_templates!(text) unless $leave_inline_template
|
136
137
|
remove_table!(text) unless $leave_table
|
137
138
|
end
|
138
|
-
|
139
|
+
|
139
140
|
def cleanup!(text)
|
140
141
|
text.gsub!($cleanup_regex_01){""}
|
141
142
|
text.gsub!($cleanup_regex_02){""}
|
@@ -150,7 +151,7 @@ module Wp2txt
|
|
150
151
|
end
|
151
152
|
|
152
153
|
#################### parser for nested structure ####################
|
153
|
-
|
154
|
+
|
154
155
|
def process_nested_structure(scanner, left, right, &block)
|
155
156
|
test = false
|
156
157
|
buffer = ""
|
@@ -195,7 +196,7 @@ module Wp2txt
|
|
195
196
|
rescue => e
|
196
197
|
return scanner.string
|
197
198
|
end
|
198
|
-
end
|
199
|
+
end
|
199
200
|
|
200
201
|
#################### methods used from format_wiki ####################
|
201
202
|
def escape_nowiki!(str)
|
@@ -218,11 +219,11 @@ module Wp2txt
|
|
218
219
|
@nowikis[obj_id]
|
219
220
|
end
|
220
221
|
end
|
221
|
-
|
222
|
+
|
222
223
|
def process_interwiki_links!(str)
|
223
224
|
scanner = StringScanner.new(str)
|
224
225
|
result = process_nested_structure(scanner, "[[", "]]") do |contents|
|
225
|
-
parts = contents.split("|")
|
226
|
+
parts = contents.split("|")
|
226
227
|
case parts.size
|
227
228
|
when 1
|
228
229
|
parts.first || ""
|
@@ -265,7 +266,7 @@ module Wp2txt
|
|
265
266
|
end
|
266
267
|
str.replace(result)
|
267
268
|
end
|
268
|
-
|
269
|
+
|
269
270
|
def remove_table!(str)
|
270
271
|
scanner = StringScanner.new(str)
|
271
272
|
result = process_nested_structure(scanner, "{|", "|}") do |contents|
|
@@ -273,7 +274,7 @@ module Wp2txt
|
|
273
274
|
end
|
274
275
|
str.replace(result)
|
275
276
|
end
|
276
|
-
|
277
|
+
|
277
278
|
def special_chr!(str)
|
278
279
|
str.replace $html_decoder.decode(str)
|
279
280
|
end
|
@@ -316,7 +317,7 @@ module Wp2txt
|
|
316
317
|
end
|
317
318
|
return true
|
318
319
|
end
|
319
|
-
|
320
|
+
|
320
321
|
def mndash!(str)
|
321
322
|
str.gsub!($mndash_regex, "–")
|
322
323
|
end
|
@@ -347,7 +348,7 @@ module Wp2txt
|
|
347
348
|
str.gsub!($complex_regex_04){""}
|
348
349
|
str.gsub!($complex_regex_05){""}
|
349
350
|
end
|
350
|
-
|
351
|
+
|
351
352
|
def make_reference!(str)
|
352
353
|
str.gsub!($make_reference_regex_a){"\n"}
|
353
354
|
str.gsub!($make_reference_regex_b){""}
|
@@ -413,7 +414,7 @@ module Wp2txt
|
|
413
414
|
File.rename(file_path, file_path + ".bak")
|
414
415
|
File.rename("temp", file_path)
|
415
416
|
File.unlink(file_path + ".bak") unless backup
|
416
|
-
end
|
417
|
+
end
|
417
418
|
|
418
419
|
# modify files under a directory (recursive)
|
419
420
|
def batch_file_mod(dir_path, &block)
|
@@ -421,7 +422,7 @@ module Wp2txt
|
|
421
422
|
collect_files(dir_path).each do |file|
|
422
423
|
yield file if FileTest.file?(file)
|
423
424
|
end
|
424
|
-
else
|
425
|
+
else
|
425
426
|
yield dir_path if FileTest.file?(dir_path)
|
426
427
|
end
|
427
428
|
end
|
@@ -445,9 +446,9 @@ module Wp2txt
|
|
445
446
|
end
|
446
447
|
end
|
447
448
|
|
448
|
-
def rename(files, ext = "txt")
|
449
|
+
def rename(files, ext = "txt")
|
449
450
|
# num of digits necessary to name the last file generated
|
450
|
-
maxwidth = 0
|
451
|
+
maxwidth = 0
|
451
452
|
|
452
453
|
files.each do |f|
|
453
454
|
width = f.slice(/\-(\d+)\z/, 1).to_s.length.to_i
|
@@ -476,8 +477,4 @@ module Wp2txt
|
|
476
477
|
return str
|
477
478
|
end
|
478
479
|
|
479
|
-
def decimal_format(i)
|
480
|
-
str = i.to_s.reverse
|
481
|
-
return str.scan(/.?.?./).join(',').reverse
|
482
|
-
end
|
483
480
|
end
|
data/lib/wp2txt/version.rb
CHANGED
data/lib/wp2txt.rb
CHANGED
@@ -7,26 +7,22 @@ require "nokogiri"
|
|
7
7
|
require "wp2txt/article"
|
8
8
|
require "wp2txt/utils"
|
9
9
|
|
10
|
-
begin
|
11
|
-
require "bzip2-ruby"
|
12
|
-
NO_BZ2 = false
|
13
|
-
rescue LoadError
|
14
|
-
# in case bzip2-ruby gem is not available
|
15
|
-
NO_BZ2 = true
|
16
|
-
end
|
17
|
-
|
18
10
|
module Wp2txt
|
19
11
|
class Splitter
|
20
12
|
include Wp2txt
|
21
|
-
def initialize(input_file, output_dir = ".", tfile_size = 10)
|
13
|
+
def initialize(input_file, output_dir = ".", tfile_size = 10, bz2_gem = false)
|
22
14
|
@fp = nil
|
23
15
|
@input_file = input_file
|
24
16
|
@output_dir = output_dir
|
25
17
|
@tfile_size = tfile_size
|
26
|
-
|
18
|
+
if bz2_gem
|
19
|
+
require "bzip2-ruby"
|
20
|
+
end
|
21
|
+
@bz2_gem = bz2_gem
|
22
|
+
prepare
|
27
23
|
end
|
28
|
-
|
29
|
-
def file_size(file)
|
24
|
+
|
25
|
+
def file_size(file)
|
30
26
|
origin = Time.now
|
31
27
|
size = 0; unit = 10485760; star = 0; before = Time.now.to_f
|
32
28
|
error_count = 10
|
@@ -36,7 +32,7 @@ module Wp2txt
|
|
36
32
|
rescue => e
|
37
33
|
a = nil
|
38
34
|
end
|
39
|
-
break unless a
|
35
|
+
break unless a
|
40
36
|
|
41
37
|
present = Time.now.to_f
|
42
38
|
size += a.size
|
@@ -44,12 +40,29 @@ module Wp2txt
|
|
44
40
|
star = 0 if star > 10
|
45
41
|
star += 1
|
46
42
|
before = present
|
47
|
-
end
|
43
|
+
end
|
48
44
|
end
|
49
45
|
time_elapsed = Time.now - origin
|
50
46
|
size
|
51
47
|
end
|
52
48
|
|
49
|
+
# check if a given command exists: return the path if it does, return false if not
|
50
|
+
def command_exist?(command)
|
51
|
+
basename = File.basename(command)
|
52
|
+
path = ""
|
53
|
+
print "Checking #{basename}: "
|
54
|
+
if open("| which #{command} 2>/dev/null"){ |f| path = f.gets.strip }
|
55
|
+
puts "detected [#{path}]"
|
56
|
+
return path.strip
|
57
|
+
elsif open("| which #{basename} 2>/dev/null"){ |f| path = f.gets.strip }
|
58
|
+
puts "detected [#{path}]"
|
59
|
+
return path.strip
|
60
|
+
else
|
61
|
+
puts "not found"
|
62
|
+
return false
|
63
|
+
end
|
64
|
+
end
|
65
|
+
|
53
66
|
# check the size of input file (bz2 or plain xml) when decompressed
|
54
67
|
def prepare
|
55
68
|
# if output_dir is not specified, output in the same directory
|
@@ -58,31 +71,31 @@ module Wp2txt
|
|
58
71
|
@output_dir = File.dirname(@input_file)
|
59
72
|
end
|
60
73
|
|
61
|
-
# if input file is bz2 compressed, use bz2-ruby if available,
|
62
|
-
# use command line bzip2 program otherwise.
|
63
74
|
if /.bz2$/ =~ @input_file
|
64
|
-
|
75
|
+
if @bz2_gem
|
65
76
|
file = Bzip2::Reader.new File.open(@input_file, "r:UTF-8")
|
77
|
+
elsif RUBY_PLATFORM.index("win32")
|
78
|
+
file = IO.popen("bunzip2.exe -c #{@input_file}")
|
66
79
|
else
|
67
|
-
if
|
68
|
-
|
69
|
-
|
70
|
-
file = IO.popen("
|
80
|
+
if bzpath = command_exist?("lbzip2") ||
|
81
|
+
command_exist?("pbzip2") ||
|
82
|
+
command_exist?("bzip2")
|
83
|
+
file = IO.popen("#{bzpath} -c -d #{@input_file}")
|
71
84
|
end
|
72
|
-
end
|
85
|
+
end
|
73
86
|
else # meaning that it is a text file
|
74
87
|
@infile_size = File.stat(@input_file).size
|
75
88
|
file = open(@input_file)
|
76
89
|
end
|
77
90
|
|
78
91
|
#create basename of output file
|
79
|
-
@outfile_base = File.basename(@input_file, ".*") + "-"
|
92
|
+
@outfile_base = File.basename(@input_file, ".*") + "-"
|
80
93
|
@total_size = 0
|
81
94
|
@file_index = 1
|
82
95
|
outfilename = File.join(@output_dir, @outfile_base + @file_index.to_s)
|
83
96
|
@outfiles = []
|
84
97
|
@outfiles << outfilename
|
85
|
-
@fp = File.open(outfilename, "w")
|
98
|
+
@fp = File.open(outfilename, "w")
|
86
99
|
@file_pointer = file
|
87
100
|
return true
|
88
101
|
end
|
@@ -100,7 +113,7 @@ module Wp2txt
|
|
100
113
|
# temp_buf is filled with text split by "\n"
|
101
114
|
temp_buf = []
|
102
115
|
ss = StringScanner.new(new_lines)
|
103
|
-
while ss.scan(/.*?\n/m)
|
116
|
+
while ss.scan(/.*?\n/m)
|
104
117
|
temp_buf << ss[0]
|
105
118
|
end
|
106
119
|
temp_buf << ss.rest unless ss.eos?
|
@@ -122,16 +135,16 @@ module Wp2txt
|
|
122
135
|
end
|
123
136
|
|
124
137
|
def get_newline
|
125
|
-
@buffer ||= [""]
|
138
|
+
@buffer ||= [""]
|
126
139
|
if @buffer.size == 1
|
127
140
|
return nil unless fill_buffer
|
128
141
|
end
|
129
142
|
if @buffer.empty?
|
130
143
|
return nil
|
131
|
-
else
|
144
|
+
else
|
132
145
|
new_line = @buffer.shift
|
133
146
|
return new_line
|
134
|
-
end
|
147
|
+
end
|
135
148
|
end
|
136
149
|
|
137
150
|
def split_file
|
@@ -145,7 +158,7 @@ module Wp2txt
|
|
145
158
|
output_text << text
|
146
159
|
end_flag = true if @total_size > (@tfile_size * 1024 * 1024)
|
147
160
|
# never close the file until the end of the page even if end_flag is on
|
148
|
-
if end_flag && /<\/page/ =~ text
|
161
|
+
if end_flag && /<\/page/ =~ text
|
149
162
|
@fp.puts(output_text)
|
150
163
|
output_text = ""
|
151
164
|
@total_size = 0
|
@@ -159,15 +172,15 @@ module Wp2txt
|
|
159
172
|
end
|
160
173
|
end
|
161
174
|
@fp.puts(output_text) if output_text != ""
|
162
|
-
@fp.close
|
175
|
+
@fp.close
|
163
176
|
|
164
177
|
if File.size(outfilename) == 0
|
165
|
-
File.delete(outfilename)
|
178
|
+
File.delete(outfilename)
|
166
179
|
@outfiles.delete(outfilename)
|
167
180
|
end
|
168
181
|
|
169
|
-
rename(@outfiles, "xml")
|
170
|
-
end
|
182
|
+
rename(@outfiles, "xml")
|
183
|
+
end
|
171
184
|
end
|
172
185
|
|
173
186
|
class Runner
|
@@ -181,7 +194,7 @@ module Wp2txt
|
|
181
194
|
@del_interfile = del_interfile
|
182
195
|
prepare
|
183
196
|
end
|
184
|
-
|
197
|
+
|
185
198
|
def prepare
|
186
199
|
@infile_size = File.stat(@input_file).size
|
187
200
|
file = open(@input_file)
|
@@ -203,7 +216,7 @@ module Wp2txt
|
|
203
216
|
# temp_buf is filled with text split by "\n"
|
204
217
|
temp_buf = []
|
205
218
|
ss = StringScanner.new(new_lines)
|
206
|
-
while ss.scan(/.*?\n/m)
|
219
|
+
while ss.scan(/.*?\n/m)
|
207
220
|
temp_buf << ss[0]
|
208
221
|
end
|
209
222
|
temp_buf << ss.rest unless ss.eos?
|
@@ -225,16 +238,16 @@ module Wp2txt
|
|
225
238
|
end
|
226
239
|
|
227
240
|
def get_newline
|
228
|
-
@buffer ||= [""]
|
241
|
+
@buffer ||= [""]
|
229
242
|
if @buffer.size == 1
|
230
243
|
return nil unless fill_buffer
|
231
244
|
end
|
232
245
|
if @buffer.empty?
|
233
246
|
return nil
|
234
|
-
else
|
247
|
+
else
|
235
248
|
new_line = @buffer.shift
|
236
249
|
return new_line
|
237
|
-
end
|
250
|
+
end
|
238
251
|
end
|
239
252
|
|
240
253
|
def get_page
|
@@ -270,7 +283,7 @@ module Wp2txt
|
|
270
283
|
pages = []
|
271
284
|
data_empty = false
|
272
285
|
|
273
|
-
while !data_empty
|
286
|
+
while !data_empty
|
274
287
|
page = get_page
|
275
288
|
if page
|
276
289
|
pages << page
|
data/tags
ADDED
@@ -0,0 +1,58 @@
|
|
1
|
+
!_TAG_FILE_FORMAT 2 /extended format; --format=1 will not append ;" to lines/
|
2
|
+
!_TAG_FILE_SORTED 1 /0=unsorted, 1=sorted, 2=foldcase/
|
3
|
+
!_TAG_PROGRAM_AUTHOR Darren Hiebert /dhiebert@users.sourceforge.net/
|
4
|
+
!_TAG_PROGRAM_NAME Exuberant Ctags //
|
5
|
+
!_TAG_PROGRAM_URL http://ctags.sourceforge.net /official site/
|
6
|
+
!_TAG_PROGRAM_VERSION 5.8 //
|
7
|
+
Article lib/wp2txt/article.rb /^ class Article$/;" c class:Wp2txt
|
8
|
+
Runner lib/wp2txt.rb /^ class Runner$/;" c class:Wp2txt.Splitter.file_size
|
9
|
+
Splitter lib/wp2txt.rb /^ class Splitter$/;" c class:Wp2txt
|
10
|
+
Wp2txt lib/wp2txt.rb /^module Wp2txt$/;" m
|
11
|
+
Wp2txt lib/wp2txt/article.rb /^module Wp2txt$/;" m
|
12
|
+
Wp2txt lib/wp2txt/utils.rb /^module Wp2txt$/;" m
|
13
|
+
Wp2txt lib/wp2txt/version.rb /^module Wp2txt$/;" m
|
14
|
+
batch_file_mod lib/wp2txt/utils.rb /^ def batch_file_mod(dir_path, &block)$/;" f
|
15
|
+
chrref_to_utf! lib/wp2txt/utils.rb /^ def chrref_to_utf!(num_str)$/;" f
|
16
|
+
cleanup! lib/wp2txt/utils.rb /^ def cleanup!(text)$/;" f
|
17
|
+
collect_files lib/wp2txt/utils.rb /^ def collect_files(str, regex = nil)$/;" f
|
18
|
+
command_exist? lib/wp2txt.rb /^ def command_exist?(command)$/;" f class:Wp2txt.Splitter.file_size
|
19
|
+
convert_characters! lib/wp2txt/utils.rb /^ def convert_characters!(text, has_retried = false)$/;" f class:Wp2txt
|
20
|
+
correct_inline_template! lib/wp2txt/utils.rb /^ def correct_inline_template!(str)$/;" f
|
21
|
+
correct_separator lib/wp2txt/utils.rb /^ def correct_separator(input)$/;" f
|
22
|
+
create_element lib/wp2txt/article.rb /^ def create_element(tp, text)$/;" f class:Wp2txt.Article
|
23
|
+
escape_nowiki! lib/wp2txt/utils.rb /^ def escape_nowiki!(str)$/;" f
|
24
|
+
extract_text lib/wp2txt.rb /^ def extract_text(&block)$/;" f class:Wp2txt.Splitter.file_size.Runner.fill_buffer
|
25
|
+
file_mod lib/wp2txt/utils.rb /^ def file_mod(file_path, backup = false, &block)$/;" f
|
26
|
+
file_size lib/wp2txt.rb /^ def file_size(file)$/;" f class:Wp2txt.Splitter
|
27
|
+
fill_buffer lib/wp2txt.rb /^ def fill_buffer$/;" f class:Wp2txt.Splitter.file_size
|
28
|
+
fill_buffer lib/wp2txt.rb /^ def fill_buffer$/;" f class:Wp2txt.Splitter.file_size.Runner
|
29
|
+
format_wiki! lib/wp2txt/utils.rb /^ def format_wiki!(text, has_retried = false)$/;" f
|
30
|
+
get_newline lib/wp2txt.rb /^ def get_newline$/;" f class:Wp2txt.Splitter.file_size.Runner.fill_buffer
|
31
|
+
get_newline lib/wp2txt.rb /^ def get_newline$/;" f class:Wp2txt.Splitter.file_size.fill_buffer
|
32
|
+
get_page lib/wp2txt.rb /^ def get_page$/;" f class:Wp2txt.Splitter.file_size.Runner.fill_buffer
|
33
|
+
initialize lib/wp2txt.rb /^ def initialize(input_file, output_dir = ".", strip_tmarker = false, del_interfile = true)$/;" f class:Wp2txt.Splitter.file_size.Runner
|
34
|
+
initialize lib/wp2txt.rb /^ def initialize(input_file, output_dir = ".", tfile_size = 10, bz2_gem = false)$/;" f class:Wp2txt.Splitter
|
35
|
+
initialize lib/wp2txt/article.rb /^ def initialize(text, title = "", strip_tmarker = false)$/;" f class:Wp2txt.Article
|
36
|
+
make_reference! lib/wp2txt/utils.rb /^ def make_reference!(str)$/;" f
|
37
|
+
mndash! lib/wp2txt/utils.rb /^ def mndash!(str)$/;" f
|
38
|
+
parse lib/wp2txt/article.rb /^ def parse(source)$/;" f class:Wp2txt.Article
|
39
|
+
prepare lib/wp2txt.rb /^ def prepare$/;" f class:Wp2txt.Splitter.file_size
|
40
|
+
prepare lib/wp2txt.rb /^ def prepare$/;" f class:Wp2txt.Splitter.file_size.Runner
|
41
|
+
process_external_links! lib/wp2txt/utils.rb /^ def process_external_links!(str)$/;" f
|
42
|
+
process_interwiki_links! lib/wp2txt/utils.rb /^ def process_interwiki_links!(str)$/;" f
|
43
|
+
process_nested_structure lib/wp2txt/utils.rb /^ def process_nested_structure(scanner, left, right, &block)$/;" f
|
44
|
+
remove_complex! lib/wp2txt/utils.rb /^ def remove_complex!(str)$/;" f
|
45
|
+
remove_directive! lib/wp2txt/utils.rb /^ def remove_directive!(str)$/;" f
|
46
|
+
remove_emphasis! lib/wp2txt/utils.rb /^ def remove_emphasis!(str)$/;" f
|
47
|
+
remove_hr! lib/wp2txt/utils.rb /^ def remove_hr!(str)$/;" f
|
48
|
+
remove_html! lib/wp2txt/utils.rb /^ def remove_html!(str)$/;" f
|
49
|
+
remove_inbetween! lib/wp2txt/utils.rb /^ def remove_inbetween!(str, tagset = ['<', '>'])$/;" f
|
50
|
+
remove_ref! lib/wp2txt/utils.rb /^ def remove_ref!(str)$/;" f
|
51
|
+
remove_table! lib/wp2txt/utils.rb /^ def remove_table!(str)$/;" f
|
52
|
+
remove_tag! lib/wp2txt/utils.rb /^ def remove_tag!(str)$/;" f
|
53
|
+
remove_templates! lib/wp2txt/utils.rb /^ def remove_templates!(str)$/;" f
|
54
|
+
rename lib/wp2txt/utils.rb /^ def rename(files, ext = "txt")$/;" f
|
55
|
+
sec_to_str lib/wp2txt/utils.rb /^ def sec_to_str(int)$/;" f
|
56
|
+
special_chr! lib/wp2txt/utils.rb /^ def special_chr!(str)$/;" f
|
57
|
+
split_file lib/wp2txt.rb /^ def split_file$/;" f class:Wp2txt.Splitter.file_size.fill_buffer
|
58
|
+
unescape_nowiki! lib/wp2txt/utils.rb /^ def unescape_nowiki!(str)$/;" f
|
data/wp2txt.gemspec
CHANGED
@@ -7,9 +7,9 @@ Gem::Specification.new do |s|
|
|
7
7
|
s.version = Wp2txt::VERSION
|
8
8
|
s.authors = ["Yoichiro Hasebe"]
|
9
9
|
s.email = ["yohasebe@gmail.com"]
|
10
|
-
s.homepage = "
|
11
|
-
s.summary = %q{
|
12
|
-
s.description = %q{WP2TXT extracts
|
10
|
+
s.homepage = "https://github.com/yohasebe/wp2txt"
|
11
|
+
s.summary = %q{A command-line toolkit to extract text content and category data from Wikipedia dump files}
|
12
|
+
s.description = %q{WP2TXT extracts text and category data from Wikipedia dump files (encoded in XML / compressed with Bzip2), removing MediaWiki markup and other metadata.}
|
13
13
|
|
14
14
|
s.rubyforge_project = "wp2txt"
|
15
15
|
|
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: wp2txt
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 1.0.
|
4
|
+
version: 1.0.2
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Yoichiro Hasebe
|
8
8
|
autorequire:
|
9
9
|
bindir: bin
|
10
10
|
cert_chain: []
|
11
|
-
date: 2022-
|
11
|
+
date: 2022-11-25 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: nokogiri
|
@@ -108,8 +108,8 @@ dependencies:
|
|
108
108
|
- - ">="
|
109
109
|
- !ruby/object:Gem::Version
|
110
110
|
version: '0'
|
111
|
-
description: WP2TXT extracts
|
112
|
-
XML/compressed with Bzip2)
|
111
|
+
description: WP2TXT extracts text and category data from Wikipedia dump files (encoded
|
112
|
+
in XML / compressed with Bzip2), removing MediaWiki markup and other metadata.
|
113
113
|
email:
|
114
114
|
- yohasebe@gmail.com
|
115
115
|
executables:
|
@@ -140,8 +140,9 @@ files:
|
|
140
140
|
- lib/wp2txt/version.rb
|
141
141
|
- spec/spec_helper.rb
|
142
142
|
- spec/utils_spec.rb
|
143
|
+
- tags
|
143
144
|
- wp2txt.gemspec
|
144
|
-
homepage:
|
145
|
+
homepage: https://github.com/yohasebe/wp2txt
|
145
146
|
licenses: []
|
146
147
|
metadata: {}
|
147
148
|
post_install_message:
|
@@ -162,7 +163,8 @@ requirements: []
|
|
162
163
|
rubygems_version: 3.3.3
|
163
164
|
signing_key:
|
164
165
|
specification_version: 4
|
165
|
-
summary:
|
166
|
+
summary: A command-line toolkit to extract text content and category data from Wikipedia
|
167
|
+
dump files
|
166
168
|
test_files:
|
167
169
|
- spec/spec_helper.rb
|
168
170
|
- spec/utils_spec.rb
|