wp2txt 1.0.0 → 1.0.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: a15462742cc2912a4dca9e0e4e42e90af4b8f9e09ea29584da94946d0a563872
-  data.tar.gz: 0c63c91b90883b4ed69199ef569c7bd467aece538bb1de1f8e7d632e710d6964
+  metadata.gz: d33a41cf46688679a14eb8c3eb16f6ed33ce9175c7f5b566c9f87998ba2c8401
+  data.tar.gz: 7371e0f7b06b2f0846f01d66f461c7e106778adc6e686919302f0f29b1f80a9e
 SHA512:
-  metadata.gz: 22f5c61c0ff6d11cd2c0155ad77940e9b618aea1354826a7b8fc5155289b42daff159be6c48f3f038c8df08753731cad623561cbd8055a10a12ce7feae0566ca
-  data.tar.gz: 9b286a09211576f5a397e3e2e46fefbedbf9e95d200f3393b030ede106c9b543fb800c73d3d958ddc5dccad1ba2a30f0b99700af05eef88b142e90c8603e9699
+  metadata.gz: cab8d9c27989387acc6dbbe052029d2205508ce10e38b8eedc111c822328d8eba551d603020684cbb3844a87b747f261a5959f711267acd96a3b97ccef4f6834
+  data.tar.gz: 4de59be37d57ef3d14ae2304660e8dde069bdf645a7cff862026562b26327984f1be13840e9d6ec1f25110222367f71c84a0286b649d71fec0c13805c6b0a647
data/README.md CHANGED
@@ -1,26 +1,28 @@
 <img src='https://raw.githubusercontent.com/yohasebe/wp2txt/master/image/wp2txt-logo.svg' width="400" />
 
-Text conversion tool to extract content and category data from Wikipedia dump files
+A command-line toolkit to extract text content and category data from Wikipedia dump files
 
 ## About
 
-WP2TXT extracts plain text data from Wikipedia dump files (encoded in XML / compressed with Bzip2), removing all MediaWiki markup and other metadata.
+WP2TXT extracts text and category data from Wikipedia dump files (encoded in XML / compressed with Bzip2), removing MediaWiki markup and other metadata.
 
 **UPDATE (August 2022)**
 
 1. A new option `--category-only` has been added. When this option is enabled, only the title and category information of the article is extracted.
-2. A new option `--summary-only` has been added. If this option is enabled, only the title and text data from the opening paragraphs of the article (= summary) will be extracted.
-3. The current WP2TXT is *several times faster* than the previous version due to parallel processing of multiple files (the rate of speedup depends on the CPU cores used for processing).
+2. A new option `--summary-only` has been added. If this option is enabled, only the title, category information, and opening paragraphs of the article will be extracted.
+3. Text conversion with the current version of WP2TXT is *more than 2x times faster* than the previous version due to parallel processing of multiple files (the rate of speedup depends on the CPU cores used for processing).
 
 ## Screenshot
 
 <img src='https://raw.githubusercontent.com/yohasebe/wp2txt/master/image/screenshot.png' width="700" />
 
-- WP2TXT 1.0.0
-- MacBook Pro (2019) 2.3GHz 8Core Intel Core i9
-- enwiki-20220802-pages-articles.xml.bz2 (approx. 20GB)
+**Environment**
 
-In the above environment, the process (decompression, splitting, extraction, and conversion) to obtain the plain text data of the English Wikipedia takes a little over two hours.
+- WP2TXT 1.0.1
+- MacBook Pro (2021 Apple M1 Pro)
+- enwiki-20220720-pages-articles.xml.bz2 (19.98 GB)
+
+In the above environment, the process (decompression, splitting, extraction, and conversion) to obtain the plain text data of the English Wikipedia takes less than 1.5 hours.
 
 ## Features
 
@@ -30,23 +32,45 @@ In the above environment, the process (decompression, splitting, extraction, and
 - Allows extracting category information of the article
 - Allows extracting opening paragraphs of the article
 
+## Preparation
+
+### For MacOS / Linux/ WSL2
+
+WP2TXT requires that one of the following commands be installed on the system in order to decompress `bz2` files:
+
+- `lbzip2` (recommended)
+- `pbzip2`
+- `bzip2`
+
+In most cases, the `bzip2` command is pre-installed on the system. However, since `lbzip2` can use multiple CPU cores and is faster than `bzip2`, it is recommended that you install it additionally. WP2TXT will attempt to find the decompression command available on your system in the order listed above.
+
+If you are using MacOS with Homebrew installed, you can install `lbzip2` with the following command:
+
+    $ brew install lbzip2
+
+### For Windows
+
+Install [Bzip2 for Windows](http://gnuwin32.sourceforge.net/packages/bzip2.htm) and set the path so that WP2TXT can use the bunzip2.exe command. Alternatively, you can extract the Wikipedia dump file in your own way and process the resulting XML file with WP2TXT.
+
 ## Installation
 
+### WP2TXT command
+
     $ gem install wp2txt
 
-## Preparation
+## Wikipedia Dump File
 
-First, download the latest Wikipedia dump file for the language of your choice.
+Download the latest Wikipedia dump file for the desired language at a URL such as
 
-    https://dumps.wikimedia.org/xxwiki/latest/xxwiki-latest-pages-articles.xml.bz2
+    https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
 
-where `xx` is language code such as `en` (English) or `zh` (Chinese). Change it to `ja`, for instance, if you want the latest Japanese Wikipedia dump file.
+Here, `enwiki` refers to the English Wikipedia. To get the Japanese Wikipedia dump file, for instance, change this to jawiki (Japanese). In doing so, note that there are two instances of `enwiki` in the URL above.
 
 Alternatively, you can also select Wikipedia dump files created on a specific date from [here](http://dumps.wikimedia.org/backup-index.html). Make sure to download a file named in the following format:
 
     xxwiki-yyyymmdd-pages-articles.xml.bz2
 
-where `xx` is language code such as `en` (English)" or `ko` (Korean), and `yyyymmdd` is the date of creation (e.g. `20220801`).
+where `xx` is language code such as `en` (English)" or `ja` (japanese), and `yyyymmdd` is the date of creation (e.g. `20220801`).
 
 ## Basic Usage
 
@@ -124,7 +148,7 @@ Command line options are as follows:
     -g, --category-only Extract only article title and categories
     -s, --summary-only Extract only article title, categories, and summary text before first heading
     -f, --file-size=<i> Approximate size (in MB) of each output file (default: 10)
-    -n, --num-procs Number of proccesses to be run concurrently (default: max num of CPU cores minus two)
+    -n, --num-procs Number of proccesses to be run concurrently (default: max num of available CPU cores minus two)
     -x, --del-interfile Delete intermediate XML files from output dir
     -t, --title, --no-title Keep page titles in output (default: true)
     -d, --heading, --no-heading Keep section titles in output (default: true)
@@ -132,6 +156,7 @@ Command line options are as follows:
     -r, --ref Keep reference notations in the format [ref]...[/ref]
     -e, --redirect Show redirect destination
     -m, --marker, --no-marker Show symbols prefixed to list items, definitions, etc. (Default: true)
+    -b, --bz2-gem Use Ruby's bzip2-ruby gem instead of a system command
     -v, --version Print version and exit
     -h, --help Show this message
 
@@ -156,6 +181,17 @@ The author will appreciate your mentioning one of these in your research.
 * Yoichiro HASEBE. 2006. [Method for using Wikipedia as Japanese corpus.](http://ci.nii.ac.jp/naid/110006226727) _Doshisha Studies in Language and Culture_ 9(2), 373-403.
 * 長谷部陽一郎. 2006. [Wikipedia日本語版をコーパスとして用いた言語研究の手法](http://ci.nii.ac.jp/naid/110006226727). 『言語文化』9(2), 373-403.
 
+Or use this BibTeX entry:
+
+```
+@misc{WP2TXT_2022,
+    author = {Yoichiro Hasebe},
+    title = {WP2TXT: A command-line toolkit to extract text content and category data from Wikipedia dump files},
+    url = {https://github.com/yohasebe/wp2txt}
+    year = {2022},
+}
+```
+
 ## License
 
 This software is distributed under the MIT License. Please see the LICENSE file.
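
The dump naming convention described in the README's "Wikipedia Dump File" section can be sketched in Ruby as follows. The helper names `dump_url` and `dump_filename?` are hypothetical and are not part of WP2TXT. The accepted character set for the wiki name (lowercase letters and underscores) is also an assumption made for illustration.

```ruby
# Hypothetical helpers (not part of WP2TXT) that restate the dump naming
# convention quoted in the README.

# Build the "latest" dump URL for a language code such as "en" or "ja".
# The wiki name (e.g. "jawiki") appears twice in the URL.
def dump_url(lang)
  wiki = "#{lang}wiki"
  "https://dumps.wikimedia.org/#{wiki}/latest/#{wiki}-latest-pages-articles.xml.bz2"
end

# Check that a file name follows xxwiki-yyyymmdd-pages-articles.xml.bz2.
# Assumption: the wiki name uses only lowercase letters and underscores.
def dump_filename?(name)
  name.match?(/\A[a-z_]+wiki-\d{8}-pages-articles\.xml\.bz2\z/)
end

puts dump_url("ja")
puts dump_filename?("enwiki-20220801-pages-articles.xml.bz2")
```

Deriving both occurrences of the wiki name from a single variable avoids the mistake the README warns about, where only one of the two `enwiki` strings is changed.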
data/bin/wp2txt CHANGED
@@ -43,6 +43,7 @@ EOS
   opt :ref, "Keep reference notations in the format [ref]...[/ref]", :default => false, :short => "-r"
   opt :redirect, "Show redirect destination", :default => false, :short => "-e"
   opt :marker, "Show symbols prefixed to list items, definitions, etc.", :default => true, :short => "-m"
+  opt :bz2_gem, "Use Ruby's bzip2-ruby gem instead of a system command", :default => false, :short => "-b"
 end
 
 Optimist::die :size, "must be larger than 0" unless opts[:file_size] >= 0
@@ -72,7 +73,8 @@ opt_array = [:title,
              :category,
              :category_only,
              :summary_only,
-             :del_interfile]
+             :del_interfile,
+             :bz2_gem ]
 
 $leave_inline_template = true if opts[:inline]
 $leave_ref = true if opts[:ref]
@@ -90,11 +92,15 @@ else
   puts "Decompressing and splitting the original dump file."
   puts pastel.underline("This may take a while. Please be patient!")
 
+  time_start = Time.now.to_i
+  wpsplitter = Wp2txt::Splitter.new(input_file, output_dir, tfile_size)
   spinner = TTY::Spinner.new(":spinner", format: :arrow_pulse, hide_cursor: true, interval: 5)
   spinner.auto_spin
-  wpsplitter = Wp2txt::Splitter.new(input_file, output_dir, tfile_size)
   wpsplitter.split_file
-  spinner.stop(pastel.blue.bold("Done!")) # Stop animation
+  time_finish = Time.now.to_i
+
+  spinner.stop("Time: #{sec_to_str(time_finish - time_start)}")# Stop animation
+  puts pastel.blue.bold("Complete!")
   exit if !convert
   input_files = Dir.glob("#{output_dir}/*.xml")
 end
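
The hunk above times the split step and prints the result through `sec_to_str`, whose definition is not among the hunks shown. As a rough illustration of the same idea, the sketch below measures a block and formats the elapsed seconds. `format_elapsed` and `timed` are hypothetical names, the `HH:MM:SS` layout is an assumption rather than the gem's actual output format, and the monotonic clock is a substitution for the `Time.now.to_i` calls used in the diff.

```ruby
# Hypothetical stand-in for sec_to_str (the real helper is not shown in
# this diff). Formats a duration in seconds as HH:MM:SS.
def format_elapsed(seconds)
  total = seconds.to_i
  hours, rest = total.divmod(3600)
  minutes, secs = rest.divmod(60)
  format("%02d:%02d:%02d", hours, minutes, secs)
end

# Run a block and return [result, elapsed_seconds]. The monotonic clock is
# used here so that wall-clock adjustments do not affect the measurement.
def timed
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
  [result, elapsed]
end

_result, elapsed = timed { sleep 0.01 }
puts "Time: #{format_elapsed(elapsed)}"
```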
data/image/screenshot.png CHANGED
Binary file
data/lib/wp2txt/utils.rb CHANGED
@@ -476,8 +476,4 @@ module Wp2txt
     return str
   end
 
-  def decimal_format(i)
-    str = i.to_s.reverse
-    return str.scan(/.?.?./).join(',').reverse
-  end
 end
@@ -1,3 +1,3 @@
 module Wp2txt
-  VERSION = "1.0.0"
+  VERSION = "1.0.1"
 end
data/lib/wp2txt.rb CHANGED
@@ -7,26 +7,22 @@ require "nokogiri"
 require "wp2txt/article"
 require "wp2txt/utils"
 
-begin
-  require "bzip2-ruby"
-  NO_BZ2 = false
-rescue LoadError
-  # in case bzip2-ruby gem is not available
-  NO_BZ2 = true
-end
-
 module Wp2txt
   class Splitter
     include Wp2txt
-    def initialize(input_file, output_dir = ".", tfile_size = 10)
+    def initialize(input_file, output_dir = ".", tfile_size = 10, bz2_gem = false)
       @fp = nil
       @input_file = input_file
       @output_dir = output_dir
       @tfile_size = tfile_size
-      prepare
+      if bz2_gem
+        require "bzip2-ruby"
+      end
+      @bz2_gem = bz2_gem
+      prepare
     end
-
-    def file_size(file)
+
+    def file_size(file)
       origin = Time.now
       size = 0; unit = 10485760; star = 0; before = Time.now.to_f
       error_count = 10
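
In this hunk the `require "bzip2-ruby"` call moves from load time, where a `LoadError` was rescued, into the constructor, where it runs only when `bz2_gem` is true and is no longer rescued. One possible way to keep the on-demand loading while still reporting a missing library is sketched below. `load_optional` is a hypothetical helper and is not part of WP2TXT.

```ruby
# Hypothetical helper (not part of WP2TXT): load a library on demand and
# report whether it is available instead of letting LoadError propagate.
def load_optional(feature)
  require feature
  true
rescue LoadError
  false
end

# "set" ships with Ruby; the second name is deliberately nonexistent.
puts load_optional("set")
puts load_optional("wp2txt_no_such_library_xyz")
```

A caller could then fall back to a system command, or exit with a clear message, when `load_optional("bzip2-ruby")` returns false.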
@@ -36,7 +32,7 @@ module Wp2txt
         rescue => e
           a = nil
         end
-        break unless a
+        break unless a
 
         present = Time.now.to_f
         size += a.size
@@ -44,12 +40,29 @@ module Wp2txt
           star = 0 if star > 10
           star += 1
           before = present
-        end
+        end
       end
       time_elapsed = Time.now - origin
       size
     end
 
+    # check if a given command exists: return the path if it does, return false if not
+    def command_exist?(command)
+      basename = File.basename(command)
+      path = ""
+      print "Checking #{basename}: "
+      if open("| which #{command} 2>/dev/null"){ |f| path = f.gets.strip }
+        puts "detected [#{path}]"
+        return path.strip
+      elsif open("| which #{basename} 2>/dev/null"){ |f| path = f.gets.strip }
+        puts "detected [#{path}]"
+        return path.strip
+      else
+        puts "not found"
+        return false
+      end
+    end
+
     # check the size of input file (bz2 or plain xml) when decompressed
     def prepare
       # if output_dir is not specified, output in the same directory
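
The `command_exist?` helper added in this hunk shells out to `which` and calls `strip` on the result of `gets`. When `which` prints nothing, `IO#gets` returns `nil`, so the "not found" branch may not be reached as written. As one possible alternative, and not the gem's implementation, the lookup can be done by scanning `PATH` directly. The names `find_executable` and `first_available` are hypothetical.

```ruby
# Hypothetical alternative to shelling out to `which` (not WP2TXT's code).
# Returns the full path of the first match on the given PATH, or nil.
def find_executable(name, path_env = ENV.fetch("PATH", ""))
  path_env.split(File::PATH_SEPARATOR).each do |dir|
    next if dir.empty?
    candidate = File.join(dir, name)
    # Require a regular file with the executable bit set (directories are skipped).
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

# Return the path of the first available command among the candidates,
# tried in the order given, or nil if none is found.
def first_available(candidates, path_env = ENV.fetch("PATH", ""))
  candidates.each do |name|
    path = find_executable(name, path_env)
    return path if path
  end
  nil
end

# Same preference order as the README: lbzip2, then pbzip2, then bzip2.
decompressor = first_available(%w[lbzip2 pbzip2 bzip2])
puts decompressor ? "detected [#{decompressor}]" : "not found"
```

This sketch returns `nil` rather than `false` when nothing is found; both are falsy, so a chained `||` lookup like the one in the `prepare` hunk would behave the same way.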
@@ -58,31 +71,31 @@ module Wp2txt
         @output_dir = File.dirname(@input_file)
       end
 
-      # if input file is bz2 compressed, use bz2-ruby if available,
-      # use command line bzip2 program otherwise.
       if /.bz2$/ =~ @input_file
-        unless NO_BZ2
+        if @bz2_gem
           file = Bzip2::Reader.new File.open(@input_file, "r:UTF-8")
+        elsif RUBY_PLATFORM.index("win32")
+          file = IO.popen("bunzip2.exe -c #{@input_file}")
         else
-          if RUBY_PLATFORM.index("win32")
-            file = IO.popen("bunzip2.exe -c #{@input_file}")
-          else
-            file = IO.popen("bzip2 -c -d #{@input_file}")
+          if bzpath = command_exist?("lbzip2") ||
+                      command_exist?("pbzip2") ||
+                      command_exist?("bzip2")
+            file = IO.popen("#{bzpath} -c -d #{@input_file}")
           end
-        end
+        end
       else # meaning that it is a text file
         @infile_size = File.stat(@input_file).size
         file = open(@input_file)
       end
 
       #create basename of output file
-      @outfile_base = File.basename(@input_file, ".*") + "-"
+      @outfile_base = File.basename(@input_file, ".*") + "-"
       @total_size = 0
       @file_index = 1
       outfilename = File.join(@output_dir, @outfile_base + @file_index.to_s)
       @outfiles = []
       @outfiles << outfilename
-      @fp = File.open(outfilename, "w")
+      @fp = File.open(outfilename, "w")
       @file_pointer = file
       return true
     end
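
The `prepare` hunk above passes the decompression command to `IO.popen` as a single interpolated string, which is run through the shell, so an input path containing spaces or shell metacharacters would need quoting. A minimal sketch of the array form of `IO.popen`, which bypasses the shell, is shown below. `decompress_command` and `open_decompressed` are hypothetical names. The `-c -d` flags are the ones used in the diff.

```ruby
# Hypothetical helpers (not part of WP2TXT).

# Build the decompression command as an argument vector. Each element is
# passed to the program verbatim, so no shell quoting is needed.
def decompress_command(bzpath, input_file)
  [bzpath, "-c", "-d", input_file]
end

# Open a read pipe on the decompressed stream without invoking a shell.
# With a block, the pipe is closed automatically when the block returns.
def open_decompressed(bzpath, input_file, &block)
  IO.popen(decompress_command(bzpath, input_file), "r", &block)
end

p decompress_command("/usr/bin/lbzip2", "my dump.xml.bz2")
```

Usage would look like `open_decompressed(bzpath, path) { |io| io.each_line { |line| ... } }`, assuming `bzpath` points to an installed `lbzip2`, `pbzip2`, or `bzip2` binary.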
@@ -100,7 +113,7 @@ module Wp2txt
       # temp_buf is filled with text split by "\n"
       temp_buf = []
       ss = StringScanner.new(new_lines)
-      while ss.scan(/.*?\n/m)
+      while ss.scan(/.*?\n/m)
         temp_buf << ss[0]
       end
       temp_buf << ss.rest unless ss.eos?
@@ -122,16 +135,16 @@ module Wp2txt
     end
 
     def get_newline
-      @buffer ||= [""]
+      @buffer ||= [""]
       if @buffer.size == 1
         return nil unless fill_buffer
       end
       if @buffer.empty?
         return nil
-      else
+      else
         new_line = @buffer.shift
         return new_line
-      end
+      end
     end
 
     def split_file
@@ -145,7 +158,7 @@ module Wp2txt
         output_text << text
         end_flag = true if @total_size > (@tfile_size * 1024 * 1024)
         # never close the file until the end of the page even if end_flag is on
-        if end_flag && /<\/page/ =~ text
+        if end_flag && /<\/page/ =~ text
           @fp.puts(output_text)
           output_text = ""
           @total_size = 0
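
The context lines in this hunk show the splitting rule: once the accumulated size exceeds the target file size, the current output file is closed only at the next `</page` line, so no page is cut in half. A simplified, in-memory sketch of that rule follows. `split_at_page_boundaries` is a hypothetical function that returns strings instead of writing files, and it tracks size with `bytesize` rather than the gem's own counters.

```ruby
# Simplified sketch (not WP2TXT's code) of size-limited splitting that only
# cuts at page boundaries. Returns an array of chunk strings.
def split_at_page_boundaries(lines, limit_bytes)
  chunks = []
  current = +""
  lines.each do |line|
    current << line
    # Cut only when the chunk is over the limit AND the line closes a page.
    if current.bytesize > limit_bytes && line.include?("</page")
      chunks << current
      current = +""
    end
  end
  # Flush whatever remains after the last cut.
  chunks << current unless current.empty?
  chunks
end

lines = ["<page>\n", "aaaa\n", "</page>\n", "<page>\n", "bb\n", "</page>\n"]
p split_at_page_boundaries(lines, 10).length
```

Because a cut waits for the end of the current page, a chunk can exceed the limit by up to one page, which matches the "approximate size" wording of the `--file-size` option.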
@@ -159,15 +172,15 @@ module Wp2txt
         end
       end
       @fp.puts(output_text) if output_text != ""
-      @fp.close
+      @fp.close
 
       if File.size(outfilename) == 0
-        File.delete(outfilename)
+        File.delete(outfilename)
         @outfiles.delete(outfilename)
       end
 
-      rename(@outfiles, "xml")
-    end
+      rename(@outfiles, "xml")
+    end
   end
 
   class Runner
@@ -181,7 +194,7 @@ module Wp2txt
       @del_interfile = del_interfile
       prepare
     end
-
+
     def prepare
       @infile_size = File.stat(@input_file).size
       file = open(@input_file)
@@ -203,7 +216,7 @@ module Wp2txt
       # temp_buf is filled with text split by "\n"
       temp_buf = []
       ss = StringScanner.new(new_lines)
-      while ss.scan(/.*?\n/m)
+      while ss.scan(/.*?\n/m)
         temp_buf << ss[0]
       end
       temp_buf << ss.rest unless ss.eos?
@@ -225,16 +238,16 @@ module Wp2txt
     end
 
     def get_newline
-      @buffer ||= [""]
+      @buffer ||= [""]
       if @buffer.size == 1
         return nil unless fill_buffer
       end
       if @buffer.empty?
         return nil
-      else
+      else
         new_line = @buffer.shift
         return new_line
-      end
+      end
     end
 
     def get_page
@@ -270,7 +283,7 @@ module Wp2txt
       pages = []
       data_empty = false
 
-      while !data_empty
+      while !data_empty
         page = get_page
         if page
           pages << page
data/wp2txt.gemspec CHANGED
@@ -7,9 +7,9 @@ Gem::Specification.new do |s|
   s.version = Wp2txt::VERSION
   s.authors = ["Yoichiro Hasebe"]
   s.email = ["yohasebe@gmail.com"]
-  s.homepage = "http://github.com/yohasebe/wp2txt"
-  s.summary = %q{Wikipedia dump to text converter}
-  s.description = %q{WP2TXT extracts plain text data from Wikipedia dump file (encoded in XML/compressed with Bzip2) stripping all the MediaWiki markups and other metadata.}
+  s.homepage = "https://github.com/yohasebe/wp2txt"
+  s.summary = %q{A command-line toolkit to extract text content and category data from Wikipedia dump files}
+  s.description = %q{WP2TXT extracts text and category data from Wikipedia dump files (encoded in XML / compressed with Bzip2), removing MediaWiki markup and other metadata.}
 
   s.rubyforge_project = "wp2txt"
 
metadata CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: wp2txt
 version: !ruby/object:Gem::Version
-  version: 1.0.0
+  version: 1.0.1
 platform: ruby
 authors:
 - Yoichiro Hasebe
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2022-08-09 00:00:00.000000000 Z
+date: 2022-08-11 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: nokogiri
@@ -108,8 +108,8 @@ dependencies:
     - - ">="
       - !ruby/object:Gem::Version
         version: '0'
-description: WP2TXT extracts plain text data from Wikipedia dump file (encoded in
-  XML/compressed with Bzip2) stripping all the MediaWiki markups and other metadata.
+description: WP2TXT extracts text and category data from Wikipedia dump files (encoded
+  in XML / compressed with Bzip2), removing MediaWiki markup and other metadata.
 email:
 - yohasebe@gmail.com
 executables:
@@ -141,7 +141,7 @@ files:
 - spec/spec_helper.rb
 - spec/utils_spec.rb
 - wp2txt.gemspec
-homepage: http://github.com/yohasebe/wp2txt
+homepage: https://github.com/yohasebe/wp2txt
 licenses: []
 metadata: {}
 post_install_message:
@@ -159,10 +159,11 @@ required_rubygems_version: !ruby/object:Gem::Requirement
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubygems_version: 3.3.3
+rubygems_version: 3.3.7
 signing_key:
 specification_version: 4
-summary: Wikipedia dump to text converter
+summary: A command-line toolkit to extract text content and category data from Wikipedia
+  dump files
 test_files:
 - spec/spec_helper.rb
 - spec/utils_spec.rb