wp2txt 1.0.0 → 1.0.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: a15462742cc2912a4dca9e0e4e42e90af4b8f9e09ea29584da94946d0a563872
4
- data.tar.gz: 0c63c91b90883b4ed69199ef569c7bd467aece538bb1de1f8e7d632e710d6964
3
+ metadata.gz: bb540f4f17f7825786d110245c235ac556e3e64cedb17efae3e0591887425801
4
+ data.tar.gz: 479c357f7ba117ae10d9a5a04d24ce3aca2e54d942a156b02eb932c1aab55c8b
5
5
  SHA512:
6
- metadata.gz: 22f5c61c0ff6d11cd2c0155ad77940e9b618aea1354826a7b8fc5155289b42daff159be6c48f3f038c8df08753731cad623561cbd8055a10a12ce7feae0566ca
7
- data.tar.gz: 9b286a09211576f5a397e3e2e46fefbedbf9e95d200f3393b030ede106c9b543fb800c73d3d958ddc5dccad1ba2a30f0b99700af05eef88b142e90c8603e9699
6
+ metadata.gz: 940d47d2c8bce06029fe76e3b3744563d089e26e297e5224b36e65d815295da57117eae84cbb43abeddf2f2c052e2a987d668cba52c7af6148e935b571b6d403
7
+ data.tar.gz: 8ce76523a3bf181ac7a5da11f088dd14cfb1e1d7ac0d5239832db52968d183db16a3ece6074513b634eebe0e5ca28ceea945eaef6542ecb1933266caf4e89a3c
data/README.md CHANGED
@@ -1,26 +1,34 @@
1
1
  <img src='https://raw.githubusercontent.com/yohasebe/wp2txt/master/image/wp2txt-logo.svg' width="400" />
2
2
 
3
- Text conversion tool to extract content and category data from Wikipedia dump files
3
+ A command-line toolkit to extract text content and category data from Wikipedia dump files
4
4
 
5
5
  ## About
6
6
 
7
- WP2TXT extracts plain text data from Wikipedia dump files (encoded in XML / compressed with Bzip2), removing all MediaWiki markup and other metadata.
7
+ WP2TXT extracts text and category data from Wikipedia dump files (encoded in XML / compressed with Bzip2), removing MediaWiki markup and other metadata.
8
8
 
9
- **UPDATE (August 2022)**
9
+ ## Changelog
10
10
 
11
- 1. A new option `--category-only` has been added. When this option is enabled, only the title and category information of the article is extracted.
12
- 2. A new option `--summary-only` has been added. If this option is enabled, only the title and text data from the opening paragraphs of the article (= summary) will be extracted.
13
- 3. The current WP2TXT is *several times faster* than the previous version due to parallel processing of multiple files (the rate of speedup depends on the CPU cores used for processing).
11
+ **November 2022**
12
+
13
+ - Code added to suppress the "invalid byte sequence" error raised when the input contains an illegal UTF-8 character.
14
+
15
+ **August 2022**
16
+
17
+ - A new option `--category-only` has been added. When this option is enabled, only the title and category information of the article is extracted.
18
+ - A new option `--summary-only` has been added. If this option is enabled, only the title, category information, and opening paragraphs of the article will be extracted.
19
+ - Text conversion with the current version of WP2TXT is *more than twice as fast* as with the previous version, thanks to parallel processing of multiple files (the speedup depends on the number of CPU cores used for processing).
14
20
 
15
21
  ## Screenshot
16
22
 
17
- <img src='https://raw.githubusercontent.com/yohasebe/wp2txt/master/image/screenshot.png' width="700" />
23
+ <img src='https://raw.githubusercontent.com/yohasebe/wp2txt/master/image/screenshot.png' width="800" />
18
24
 
19
- - WP2TXT 1.0.0
20
- - MacBook Pro (2019) 2.3GHz 8Core Intel Core i9
21
- - enwiki-20220802-pages-articles.xml.bz2 (approx. 20GB)
25
+ **Environment**
22
26
 
23
- In the above environment, the process (decompression, splitting, extraction, and conversion) to obtain the plain text data of the English Wikipedia takes a little over two hours.
27
+ - WP2TXT 1.0.1
28
+ - MacBook Pro (2021 Apple M1 Pro)
29
+ - enwiki-20220720-pages-articles.xml.bz2 (19.98 GB)
30
+
31
+ In the above environment, the process (decompression, splitting, extraction, and conversion) to obtain the plain text data of the English Wikipedia takes less than 1.5 hours.
24
32
 
25
33
  ## Features
26
34
 
@@ -30,23 +38,45 @@ In the above environment, the process (decompression, splitting, extraction, and
30
38
  - Allows extracting category information of the article
31
39
  - Allows extracting opening paragraphs of the article
32
40
 
41
+ ## Preparation
42
+
43
+ ### For MacOS and Linux
44
+
45
+ WP2TXT requires that one of the following commands be installed on the system in order to decompress `bz2` files:
46
+
47
+ - `lbzip2` (recommended)
48
+ - `pbzip2`
49
+ - `bzip2`
50
+
51
+ In most cases, the `bzip2` command is pre-installed on the system. However, since `lbzip2` can use multiple CPU cores and is faster than `bzip2`, it is recommended that you install it additionally. WP2TXT will attempt to find the decompression command available on your system in the order listed above.
52
+
53
+ If you are using MacOS with Homebrew installed, you can install `lbzip2` with the following command:
54
+
55
+ $ brew install lbzip2
56
+
57
+ ### For Windows
58
+
59
+ Install [Bzip2 for Windows](http://gnuwin32.sourceforge.net/packages/bzip2.htm) and set the path so that WP2TXT can use the bunzip2.exe command. Alternatively, you can extract the Wikipedia dump file in your own way and process the resulting XML file with WP2TXT.
60
+
33
61
  ## Installation
34
62
 
63
+ ### WP2TXT command
64
+
35
65
  $ gem install wp2txt
36
66
 
37
- ## Preparation
67
+ ## Wikipedia Dump File
38
68
 
39
- First, download the latest Wikipedia dump file for the language of your choice.
69
+ Download the latest Wikipedia dump file for the desired language at a URL such as
40
70
 
41
- https://dumps.wikimedia.org/xxwiki/latest/xxwiki-latest-pages-articles.xml.bz2
71
+ https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
42
72
 
43
- where `xx` is language code such as `en` (English) or `zh` (Chinese). Change it to `ja`, for instance, if you want the latest Japanese Wikipedia dump file.
73
+ Here, `enwiki` refers to the English Wikipedia. To get the Japanese Wikipedia dump file, for instance, change both instances of `enwiki` in the URL above to `jawiki`.
44
74
 
45
75
  Alternatively, you can also select Wikipedia dump files created on a specific date from [here](http://dumps.wikimedia.org/backup-index.html). Make sure to download a file named in the following format:
46
76
 
47
77
  xxwiki-yyyymmdd-pages-articles.xml.bz2
48
78
 
49
- where `xx` is language code such as `en` (English)" or `ko` (Korean), and `yyyymmdd` is the date of creation (e.g. `20220801`).
79
+ where `xx` is a language code such as `en` (English) or `ja` (Japanese), and `yyyymmdd` is the date of creation (e.g. `20220801`).
50
80
 
51
81
  ## Basic Usage
52
82
 
@@ -124,7 +154,7 @@ Command line options are as follows:
124
154
  -g, --category-only Extract only article title and categories
125
155
  -s, --summary-only Extract only article title, categories, and summary text before first heading
126
156
  -f, --file-size=<i> Approximate size (in MB) of each output file (default: 10)
127
- -n, --num-procs Number of proccesses to be run concurrently (default: max num of CPU cores minus two)
157
+ -n, --num-procs Number of proccesses to be run concurrently (default: max num of available CPU cores minus two)
128
158
  -x, --del-interfile Delete intermediate XML files from output dir
129
159
  -t, --title, --no-title Keep page titles in output (default: true)
130
160
  -d, --heading, --no-heading Keep section titles in output (default: true)
@@ -132,6 +162,7 @@ Command line options are as follows:
132
162
  -r, --ref Keep reference notations in the format [ref]...[/ref]
133
163
  -e, --redirect Show redirect destination
134
164
  -m, --marker, --no-marker Show symbols prefixed to list items, definitions, etc. (Default: true)
165
+ -b, --bz2-gem Use Ruby's bzip2-ruby gem instead of a system command
135
166
  -v, --version Print version and exit
136
167
  -h, --help Show this message
137
168
 
@@ -156,6 +187,17 @@ The author will appreciate your mentioning one of these in your research.
156
187
  * Yoichiro HASEBE. 2006. [Method for using Wikipedia as Japanese corpus.](http://ci.nii.ac.jp/naid/110006226727) _Doshisha Studies in Language and Culture_ 9(2), 373-403.
157
188
  * 長谷部陽一郎. 2006. [Wikipedia日本語版をコーパスとして用いた言語研究の手法](http://ci.nii.ac.jp/naid/110006226727). 『言語文化』9(2), 373-403.
158
189
 
190
+ Or use this BibTeX entry:
191
+
192
+ ```
193
+ @misc{wp2txt_2022,
194
+ author = {Yoichiro Hasebe},
195
+ title = {WP2TXT: A command-line toolkit to extract text content and category data from Wikipedia dump files},
196
+ url = {https://github.com/yohasebe/wp2txt},
197
+ year = {2022}
198
+ }
199
+ ```
200
+
159
201
  ## License
160
202
 
161
203
  This software is distributed under the MIT License. Please see the LICENSE file.
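
Note on the new Preparation section: the README says WP2TXT looks for `lbzip2`, then `pbzip2`, then `bzip2`. A minimal sketch of that kind of ordered lookup follows. It assumes a plain scan of `PATH` rather than the `which` call used in the gem, and `find_decompressor` is a hypothetical name, not part of WP2TXT.

```ruby
# Hypothetical helper, not part of WP2TXT: returns the full path of the first
# candidate found as an executable file on the search path, or nil.
# Candidates are tried in priority order, so lbzip2 wins over pbzip2 and bzip2.
def find_decompressor(candidates = %w[lbzip2 pbzip2 bzip2],
                      search_path = ENV.fetch("PATH", ""))
  dirs = search_path.split(File::PATH_SEPARATOR).reject(&:empty?)
  candidates.each do |name|
    dirs.each do |dir|
      full = File.join(dir, name)
      return full if File.file?(full) && File.executable?(full)
    end
  end
  nil
end

path = find_decompressor
puts(path ? "Using #{path}" : "No bzip2-compatible command found")
```

Scanning `PATH` directly avoids spawning a subprocess per candidate, but it ignores Windows executable extensions, so treat it as an illustration for macOS and Linux only.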
data/bin/wp2txt CHANGED
@@ -43,6 +43,7 @@ EOS
43
43
  opt :ref, "Keep reference notations in the format [ref]...[/ref]", :default => false, :short => "-r"
44
44
  opt :redirect, "Show redirect destination", :default => false, :short => "-e"
45
45
  opt :marker, "Show symbols prefixed to list items, definitions, etc.", :default => true, :short => "-m"
46
+ opt :bz2_gem, "Use Ruby's bzip2-ruby gem instead of a system command", :default => false, :short => "-b"
46
47
  end
47
48
 
48
49
  Optimist::die :size, "must be larger than 0" unless opts[:file_size] >= 0
@@ -72,7 +73,8 @@ opt_array = [:title,
72
73
  :category,
73
74
  :category_only,
74
75
  :summary_only,
75
- :del_interfile]
76
+ :del_interfile,
77
+ :bz2_gem ]
76
78
 
77
79
  $leave_inline_template = true if opts[:inline]
78
80
  $leave_ref = true if opts[:ref]
@@ -90,11 +92,15 @@ else
90
92
  puts "Decompressing and splitting the original dump file."
91
93
  puts pastel.underline("This may take a while. Please be patient!")
92
94
 
95
+ time_start = Time.now.to_i
96
+ wpsplitter = Wp2txt::Splitter.new(input_file, output_dir, tfile_size)
93
97
  spinner = TTY::Spinner.new(":spinner", format: :arrow_pulse, hide_cursor: true, interval: 5)
94
98
  spinner.auto_spin
95
- wpsplitter = Wp2txt::Splitter.new(input_file, output_dir, tfile_size)
96
99
  wpsplitter.split_file
97
- spinner.stop(pastel.blue.bold("Done!")) # Stop animation
100
+ time_finish = Time.now.to_i
101
+
102
+ spinner.stop("Time: #{sec_to_str(time_finish - time_start)}")# Stop animation
103
+ puts pastel.blue.bold("Complete!")
98
104
  exit if !convert
99
105
  input_files = Dir.glob("#{output_dir}/*.xml")
100
106
  end
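
Note on the timing change in `bin/wp2txt`: the hunk above records `Time.now.to_i` before and after the split and passes the difference to `sec_to_str`, whose body is not shown in this diff. The sketch below shows one way to measure and format elapsed time. `format_elapsed` is a hypothetical stand-in, not the gem's `sec_to_str`, and it uses the monotonic clock instead of `Time.now`.

```ruby
# Hypothetical formatter (not the gem's sec_to_str): seconds -> "HH:MM:SS".
def format_elapsed(seconds)
  total = seconds.to_i
  hours, rest = total.divmod(3600)
  minutes, secs = rest.divmod(60)
  format("%02d:%02d:%02d", hours, minutes, secs)
end

# The monotonic clock is unaffected by system clock adjustments.
started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
sleep 0.01 # stand-in for the long-running split step
finished = Process.clock_gettime(Process::CLOCK_MONOTONIC)
puts "Time: #{format_elapsed(finished - started)}"
```

For a job that runs for more than an hour, wall-clock time from `Time.now` can jump if the system clock is synchronized mid-run. The monotonic clock avoids that.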
data/image/screenshot.png CHANGED
Binary file
data/lib/wp2txt/utils.rb CHANGED
@@ -41,7 +41,7 @@ $in_table_regex2 = Regexp.new('^\|\}.*?$')
41
41
  $in_unordered_regex = Regexp.new('^\*')
42
42
  $in_ordered_regex = Regexp.new('^\#')
43
43
  $in_pre_regex = Regexp.new('^ ')
44
- $in_definition_regex = Regexp.new('^[\;\:]')
44
+ $in_definition_regex = Regexp.new('^[\;\:]')
45
45
  $blank_line_regex = Regexp.new('^\s*$')
46
46
  $redirect_regex = Regexp.new('#(?:REDIRECT|転送)\s+\[\[(.+)\]\]', Regexp::IGNORECASE)
47
47
  $remove_tag_regex = Regexp.new("\<[^\<\>]*\>")
@@ -98,11 +98,12 @@ $cleanup_regex_08 = Regexp.new('\n\n\n+', Regexp::MULTILINE)
98
98
  module Wp2txt
99
99
 
100
100
  def convert_characters!(text, has_retried = false)
101
- begin
102
- text << ""
101
+ begin
102
+ text << ""
103
103
  chrref_to_utf!(text)
104
104
  special_chr!(text)
105
-
105
+ text.encode!("UTF-8", "UTF-8", invalid: :replace, replace: "")
106
+
106
107
  rescue # detect invalid byte sequence in UTF-8
107
108
  if has_retried
108
109
  puts "invalid byte sequence detected"
@@ -112,20 +113,20 @@ module Wp2txt
112
113
  end
113
114
  exit
114
115
  else
115
- text.encode!("UTF-16")
116
- text.encode!("UTF-8")
116
+ text.encode!("UTF-16", "UTF-16", invalid: :replace, replace: "")
117
+ text.encode!("UTF-16", "UTF-16", invalid: :replace, replace: "")
117
118
  convert_characters!(text, true)
118
119
  end
119
120
  end
120
121
  end
121
-
122
+
122
123
  def format_wiki!(text, has_retried = false)
123
124
  remove_complex!(text)
124
125
 
125
126
  escape_nowiki!(text)
126
127
  process_interwiki_links!(text)
127
128
  process_external_links!(text)
128
- unescape_nowiki!(text)
129
+ unescape_nowiki!(text)
129
130
  remove_directive!(text)
130
131
  remove_emphasis!(text)
131
132
  mndash!(text)
@@ -135,7 +136,7 @@ module Wp2txt
135
136
  remove_templates!(text) unless $leave_inline_template
136
137
  remove_table!(text) unless $leave_table
137
138
  end
138
-
139
+
139
140
  def cleanup!(text)
140
141
  text.gsub!($cleanup_regex_01){""}
141
142
  text.gsub!($cleanup_regex_02){""}
@@ -150,7 +151,7 @@ module Wp2txt
150
151
  end
151
152
 
152
153
  #################### parser for nested structure ####################
153
-
154
+
154
155
  def process_nested_structure(scanner, left, right, &block)
155
156
  test = false
156
157
  buffer = ""
@@ -195,7 +196,7 @@ module Wp2txt
195
196
  rescue => e
196
197
  return scanner.string
197
198
  end
198
- end
199
+ end
199
200
 
200
201
  #################### methods used from format_wiki ####################
201
202
  def escape_nowiki!(str)
@@ -218,11 +219,11 @@ module Wp2txt
218
219
  @nowikis[obj_id]
219
220
  end
220
221
  end
221
-
222
+
222
223
  def process_interwiki_links!(str)
223
224
  scanner = StringScanner.new(str)
224
225
  result = process_nested_structure(scanner, "[[", "]]") do |contents|
225
- parts = contents.split("|")
226
+ parts = contents.split("|")
226
227
  case parts.size
227
228
  when 1
228
229
  parts.first || ""
@@ -265,7 +266,7 @@ module Wp2txt
265
266
  end
266
267
  str.replace(result)
267
268
  end
268
-
269
+
269
270
  def remove_table!(str)
270
271
  scanner = StringScanner.new(str)
271
272
  result = process_nested_structure(scanner, "{|", "|}") do |contents|
@@ -273,7 +274,7 @@ module Wp2txt
273
274
  end
274
275
  str.replace(result)
275
276
  end
276
-
277
+
277
278
  def special_chr!(str)
278
279
  str.replace $html_decoder.decode(str)
279
280
  end
@@ -316,7 +317,7 @@ module Wp2txt
316
317
  end
317
318
  return true
318
319
  end
319
-
320
+
320
321
  def mndash!(str)
321
322
  str.gsub!($mndash_regex, "–")
322
323
  end
@@ -347,7 +348,7 @@ module Wp2txt
347
348
  str.gsub!($complex_regex_04){""}
348
349
  str.gsub!($complex_regex_05){""}
349
350
  end
350
-
351
+
351
352
  def make_reference!(str)
352
353
  str.gsub!($make_reference_regex_a){"\n"}
353
354
  str.gsub!($make_reference_regex_b){""}
@@ -413,7 +414,7 @@ module Wp2txt
413
414
  File.rename(file_path, file_path + ".bak")
414
415
  File.rename("temp", file_path)
415
416
  File.unlink(file_path + ".bak") unless backup
416
- end
417
+ end
417
418
 
418
419
  # modify files under a directry (recursive)
419
420
  def batch_file_mod(dir_path, &block)
@@ -421,7 +422,7 @@ module Wp2txt
421
422
  collect_files(dir_path).each do |file|
422
423
  yield file if FileTest.file?(file)
423
424
  end
424
- else
425
+ else
425
426
  yield dir_path if FileTest.file?(dir_path)
426
427
  end
427
428
  end
@@ -445,9 +446,9 @@ module Wp2txt
445
446
  end
446
447
  end
447
448
 
448
- def rename(files, ext = "txt")
449
+ def rename(files, ext = "txt")
449
450
  # num of digits necessary to name the last file generated
450
- maxwidth = 0
451
+ maxwidth = 0
451
452
 
452
453
  files.each do |f|
453
454
  width = f.slice(/\-(\d+)\z/, 1).to_s.length.to_i
@@ -476,8 +477,4 @@ module Wp2txt
476
477
  return str
477
478
  end
478
479
 
479
- def decimal_format(i)
480
- str = i.to_s.reverse
481
- return str.scan(/.?.?./).join(',').reverse
482
- end
483
480
  end
data/lib/wp2txt/version.rb CHANGED
@@ -1,3 +1,3 @@
1
1
  module Wp2txt
2
- VERSION = "1.0.0"
2
+ VERSION = "1.0.2"
3
3
  end
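
Note on the `convert_characters!` change in `utils.rb`: the new code calls `encode!` with `invalid: :replace, replace: ""` to drop illegal bytes. The sketch below shows the same idea with `String#scrub` (Ruby 2.1+), which is a substitute technique and not what the diff uses. `strip_invalid_utf8` is a hypothetical helper name.

```ruby
# Hypothetical helper, not WP2TXT's convert_characters!: removes bytes that
# are not valid UTF-8 so later regex operations do not raise ArgumentError.
def strip_invalid_utf8(text)
  utf8 = text.dup.force_encoding(Encoding::UTF_8)
  utf8.valid_encoding? ? utf8 : utf8.scrub("")
end

dirty = "caf\xC3\xA9 \xFFok".b # binary string containing one stray 0xFF byte
clean = strip_invalid_utf8(dirty)
puts clean
puts clean.valid_encoding?
```

Working on a copy leaves the caller's string untouched. The gem's method mutates its argument in place instead, as the `!` suffix indicates.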
data/lib/wp2txt.rb CHANGED
@@ -7,26 +7,22 @@ require "nokogiri"
7
7
  require "wp2txt/article"
8
8
  require "wp2txt/utils"
9
9
 
10
- begin
11
- require "bzip2-ruby"
12
- NO_BZ2 = false
13
- rescue LoadError
14
- # in case bzip2-ruby gem is not available
15
- NO_BZ2 = true
16
- end
17
-
18
10
  module Wp2txt
19
11
  class Splitter
20
12
  include Wp2txt
21
- def initialize(input_file, output_dir = ".", tfile_size = 10)
13
+ def initialize(input_file, output_dir = ".", tfile_size = 10, bz2_gem = false)
22
14
  @fp = nil
23
15
  @input_file = input_file
24
16
  @output_dir = output_dir
25
17
  @tfile_size = tfile_size
26
- prepare
18
+ if bz2_gem
19
+ require "bzip2-ruby"
20
+ end
21
+ @bz2_gem = bz2_gem
22
+ prepare
27
23
  end
28
-
29
- def file_size(file)
24
+
25
+ def file_size(file)
30
26
  origin = Time.now
31
27
  size = 0; unit = 10485760; star = 0; before = Time.now.to_f
32
28
  error_count = 10
@@ -36,7 +32,7 @@ module Wp2txt
36
32
  rescue => e
37
33
  a = nil
38
34
  end
39
- break unless a
35
+ break unless a
40
36
 
41
37
  present = Time.now.to_f
42
38
  size += a.size
@@ -44,12 +40,29 @@ module Wp2txt
44
40
  star = 0 if star > 10
45
41
  star += 1
46
42
  before = present
47
- end
43
+ end
48
44
  end
49
45
  time_elapsed = Time.now - origin
50
46
  size
51
47
  end
52
48
 
49
+ # check if a given command exists: return the path if it does, return false if not
50
+ def command_exist?(command)
51
+ basename = File.basename(command)
52
+ path = ""
53
+ print "Checking #{basename}: "
54
+ if open("| which #{command} 2>/dev/null"){ |f| path = f.gets.strip }
55
+ puts "detected [#{path}]"
56
+ return path.strip
57
+ elsif open("| which #{basename} 2>/dev/null"){ |f| path = f.gets.strip }
58
+ puts "detected [#{path}]"
59
+ return path.strip
60
+ else
61
+ puts "not found"
62
+ return false
63
+ end
64
+ end
65
+
53
66
  # check the size of input file (bz2 or plain xml) when decompressed
54
67
  def prepare
55
68
  # if output_dir is not specified, output in the same directory
@@ -58,31 +71,31 @@ module Wp2txt
58
71
  @output_dir = File.dirname(@input_file)
59
72
  end
60
73
 
61
- # if input file is bz2 compressed, use bz2-ruby if available,
62
- # use command line bzip2 program otherwise.
63
74
  if /.bz2$/ =~ @input_file
64
- unless NO_BZ2
75
+ if @bz2_gem
65
76
  file = Bzip2::Reader.new File.open(@input_file, "r:UTF-8")
77
+ elsif RUBY_PLATFORM.index("win32")
78
+ file = IO.popen("bunzip2.exe -c #{@input_file}")
66
79
  else
67
- if RUBY_PLATFORM.index("win32")
68
- file = IO.popen("bunzip2.exe -c #{@input_file}")
69
- else
70
- file = IO.popen("bzip2 -c -d #{@input_file}")
80
+ if bzpath = command_exist?("lbzip2") ||
81
+ command_exist?("pbzip2") ||
82
+ command_exist?("bzip2")
83
+ file = IO.popen("#{bzpath} -c -d #{@input_file}")
71
84
  end
72
- end
85
+ end
73
86
  else # meaning that it is a text file
74
87
  @infile_size = File.stat(@input_file).size
75
88
  file = open(@input_file)
76
89
  end
77
90
 
78
91
  #create basename of output file
79
- @outfile_base = File.basename(@input_file, ".*") + "-"
92
+ @outfile_base = File.basename(@input_file, ".*") + "-"
80
93
  @total_size = 0
81
94
  @file_index = 1
82
95
  outfilename = File.join(@output_dir, @outfile_base + @file_index.to_s)
83
96
  @outfiles = []
84
97
  @outfiles << outfilename
85
- @fp = File.open(outfilename, "w")
98
+ @fp = File.open(outfilename, "w")
86
99
  @file_pointer = file
87
100
  return true
88
101
  end
@@ -100,7 +113,7 @@ module Wp2txt
100
113
  # temp_buf is filled with text split by "\n"
101
114
  temp_buf = []
102
115
  ss = StringScanner.new(new_lines)
103
- while ss.scan(/.*?\n/m)
116
+ while ss.scan(/.*?\n/m)
104
117
  temp_buf << ss[0]
105
118
  end
106
119
  temp_buf << ss.rest unless ss.eos?
@@ -122,16 +135,16 @@ module Wp2txt
122
135
  end
123
136
 
124
137
  def get_newline
125
- @buffer ||= [""]
138
+ @buffer ||= [""]
126
139
  if @buffer.size == 1
127
140
  return nil unless fill_buffer
128
141
  end
129
142
  if @buffer.empty?
130
143
  return nil
131
- else
144
+ else
132
145
  new_line = @buffer.shift
133
146
  return new_line
134
- end
147
+ end
135
148
  end
136
149
 
137
150
  def split_file
@@ -145,7 +158,7 @@ module Wp2txt
145
158
  output_text << text
146
159
  end_flag = true if @total_size > (@tfile_size * 1024 * 1024)
147
160
  # never close the file until the end of the page even if end_flag is on
148
- if end_flag && /<\/page/ =~ text
161
+ if end_flag && /<\/page/ =~ text
149
162
  @fp.puts(output_text)
150
163
  output_text = ""
151
164
  @total_size = 0
@@ -159,15 +172,15 @@ module Wp2txt
159
172
  end
160
173
  end
161
174
  @fp.puts(output_text) if output_text != ""
162
- @fp.close
175
+ @fp.close
163
176
 
164
177
  if File.size(outfilename) == 0
165
- File.delete(outfilename)
178
+ File.delete(outfilename)
166
179
  @outfiles.delete(outfilename)
167
180
  end
168
181
 
169
- rename(@outfiles, "xml")
170
- end
182
+ rename(@outfiles, "xml")
183
+ end
171
184
  end
172
185
 
173
186
  class Runner
@@ -181,7 +194,7 @@ module Wp2txt
181
194
  @del_interfile = del_interfile
182
195
  prepare
183
196
  end
184
-
197
+
185
198
  def prepare
186
199
  @infile_size = File.stat(@input_file).size
187
200
  file = open(@input_file)
@@ -203,7 +216,7 @@ module Wp2txt
203
216
  # temp_buf is filled with text split by "\n"
204
217
  temp_buf = []
205
218
  ss = StringScanner.new(new_lines)
206
- while ss.scan(/.*?\n/m)
219
+ while ss.scan(/.*?\n/m)
207
220
  temp_buf << ss[0]
208
221
  end
209
222
  temp_buf << ss.rest unless ss.eos?
@@ -225,16 +238,16 @@ module Wp2txt
225
238
  end
226
239
 
227
240
  def get_newline
228
- @buffer ||= [""]
241
+ @buffer ||= [""]
229
242
  if @buffer.size == 1
230
243
  return nil unless fill_buffer
231
244
  end
232
245
  if @buffer.empty?
233
246
  return nil
234
- else
247
+ else
235
248
  new_line = @buffer.shift
236
249
  return new_line
237
- end
250
+ end
238
251
  end
239
252
 
240
253
  def get_page
@@ -270,7 +283,7 @@ module Wp2txt
270
283
  pages = []
271
284
  data_empty = false
272
285
 
273
- while !data_empty
286
+ while !data_empty
274
287
  page = get_page
275
288
  if page
276
289
  pages << page
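
Note on the decompression call in `Splitter#prepare`: the hunk above interpolates `@input_file` into a command string passed to `IO.popen`, which goes through the shell, so a path containing spaces or shell metacharacters may be misread. A minimal sketch of the array form of `IO.popen`, which bypasses the shell, follows. `decompress_argv` is a hypothetical helper, and the Ruby interpreter stands in for the decompressor so the example runs without `bzip2` installed.

```ruby
require "rbconfig"

# Hypothetical helper: build an argv array instead of an interpolated string.
def decompress_argv(command, input_file)
  [command, "-c", "-d", input_file]
end

# Demonstration: each array element reaches the child process as one
# argument, so the space in the file name is preserved.
echo = [RbConfig.ruby, "-e", "print ARGV.last", "my dump.xml.bz2"]
puts IO.popen(echo, &:read)

# With a real decompressor this would look like (assumes bzip2 is on PATH):
#   IO.popen(decompress_argv("bzip2", "my dump.xml.bz2")) do |io|
#     io.each_line { |line| handle(line) } # handle is a placeholder
#   end
```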
data/tags ADDED
@@ -0,0 +1,58 @@
1
+ !_TAG_FILE_FORMAT 2 /extended format; --format=1 will not append ;" to lines/
2
+ !_TAG_FILE_SORTED 1 /0=unsorted, 1=sorted, 2=foldcase/
3
+ !_TAG_PROGRAM_AUTHOR Darren Hiebert /dhiebert@users.sourceforge.net/
4
+ !_TAG_PROGRAM_NAME Exuberant Ctags //
5
+ !_TAG_PROGRAM_URL http://ctags.sourceforge.net /official site/
6
+ !_TAG_PROGRAM_VERSION 5.8 //
7
+ Article lib/wp2txt/article.rb /^ class Article$/;" c class:Wp2txt
8
+ Runner lib/wp2txt.rb /^ class Runner$/;" c class:Wp2txt.Splitter.file_size
9
+ Splitter lib/wp2txt.rb /^ class Splitter$/;" c class:Wp2txt
10
+ Wp2txt lib/wp2txt.rb /^module Wp2txt$/;" m
11
+ Wp2txt lib/wp2txt/article.rb /^module Wp2txt$/;" m
12
+ Wp2txt lib/wp2txt/utils.rb /^module Wp2txt$/;" m
13
+ Wp2txt lib/wp2txt/version.rb /^module Wp2txt$/;" m
14
+ batch_file_mod lib/wp2txt/utils.rb /^ def batch_file_mod(dir_path, &block)$/;" f
15
+ chrref_to_utf! lib/wp2txt/utils.rb /^ def chrref_to_utf!(num_str)$/;" f
16
+ cleanup! lib/wp2txt/utils.rb /^ def cleanup!(text)$/;" f
17
+ collect_files lib/wp2txt/utils.rb /^ def collect_files(str, regex = nil)$/;" f
18
+ command_exist? lib/wp2txt.rb /^ def command_exist?(command)$/;" f class:Wp2txt.Splitter.file_size
19
+ convert_characters! lib/wp2txt/utils.rb /^ def convert_characters!(text, has_retried = false)$/;" f class:Wp2txt
20
+ correct_inline_template! lib/wp2txt/utils.rb /^ def correct_inline_template!(str)$/;" f
21
+ correct_separator lib/wp2txt/utils.rb /^ def correct_separator(input)$/;" f
22
+ create_element lib/wp2txt/article.rb /^ def create_element(tp, text)$/;" f class:Wp2txt.Article
23
+ escape_nowiki! lib/wp2txt/utils.rb /^ def escape_nowiki!(str)$/;" f
24
+ extract_text lib/wp2txt.rb /^ def extract_text(&block)$/;" f class:Wp2txt.Splitter.file_size.Runner.fill_buffer
25
+ file_mod lib/wp2txt/utils.rb /^ def file_mod(file_path, backup = false, &block)$/;" f
26
+ file_size lib/wp2txt.rb /^ def file_size(file)$/;" f class:Wp2txt.Splitter
27
+ fill_buffer lib/wp2txt.rb /^ def fill_buffer$/;" f class:Wp2txt.Splitter.file_size
28
+ fill_buffer lib/wp2txt.rb /^ def fill_buffer$/;" f class:Wp2txt.Splitter.file_size.Runner
29
+ format_wiki! lib/wp2txt/utils.rb /^ def format_wiki!(text, has_retried = false)$/;" f
30
+ get_newline lib/wp2txt.rb /^ def get_newline$/;" f class:Wp2txt.Splitter.file_size.Runner.fill_buffer
31
+ get_newline lib/wp2txt.rb /^ def get_newline$/;" f class:Wp2txt.Splitter.file_size.fill_buffer
32
+ get_page lib/wp2txt.rb /^ def get_page$/;" f class:Wp2txt.Splitter.file_size.Runner.fill_buffer
33
+ initialize lib/wp2txt.rb /^ def initialize(input_file, output_dir = ".", strip_tmarker = false, del_interfile = true)$/;" f class:Wp2txt.Splitter.file_size.Runner
34
+ initialize lib/wp2txt.rb /^ def initialize(input_file, output_dir = ".", tfile_size = 10, bz2_gem = false)$/;" f class:Wp2txt.Splitter
35
+ initialize lib/wp2txt/article.rb /^ def initialize(text, title = "", strip_tmarker = false)$/;" f class:Wp2txt.Article
36
+ make_reference! lib/wp2txt/utils.rb /^ def make_reference!(str)$/;" f
37
+ mndash! lib/wp2txt/utils.rb /^ def mndash!(str)$/;" f
38
+ parse lib/wp2txt/article.rb /^ def parse(source)$/;" f class:Wp2txt.Article
39
+ prepare lib/wp2txt.rb /^ def prepare$/;" f class:Wp2txt.Splitter.file_size
40
+ prepare lib/wp2txt.rb /^ def prepare$/;" f class:Wp2txt.Splitter.file_size.Runner
41
+ process_external_links! lib/wp2txt/utils.rb /^ def process_external_links!(str)$/;" f
42
+ process_interwiki_links! lib/wp2txt/utils.rb /^ def process_interwiki_links!(str)$/;" f
43
+ process_nested_structure lib/wp2txt/utils.rb /^ def process_nested_structure(scanner, left, right, &block)$/;" f
44
+ remove_complex! lib/wp2txt/utils.rb /^ def remove_complex!(str)$/;" f
45
+ remove_directive! lib/wp2txt/utils.rb /^ def remove_directive!(str)$/;" f
46
+ remove_emphasis! lib/wp2txt/utils.rb /^ def remove_emphasis!(str)$/;" f
47
+ remove_hr! lib/wp2txt/utils.rb /^ def remove_hr!(str)$/;" f
48
+ remove_html! lib/wp2txt/utils.rb /^ def remove_html!(str)$/;" f
49
+ remove_inbetween! lib/wp2txt/utils.rb /^ def remove_inbetween!(str, tagset = ['<', '>'])$/;" f
50
+ remove_ref! lib/wp2txt/utils.rb /^ def remove_ref!(str)$/;" f
51
+ remove_table! lib/wp2txt/utils.rb /^ def remove_table!(str)$/;" f
52
+ remove_tag! lib/wp2txt/utils.rb /^ def remove_tag!(str)$/;" f
53
+ remove_templates! lib/wp2txt/utils.rb /^ def remove_templates!(str)$/;" f
54
+ rename lib/wp2txt/utils.rb /^ def rename(files, ext = "txt")$/;" f
55
+ sec_to_str lib/wp2txt/utils.rb /^ def sec_to_str(int)$/;" f
56
+ special_chr! lib/wp2txt/utils.rb /^ def special_chr!(str)$/;" f
57
+ split_file lib/wp2txt.rb /^ def split_file$/;" f class:Wp2txt.Splitter.file_size.fill_buffer
58
+ unescape_nowiki! lib/wp2txt/utils.rb /^ def unescape_nowiki!(str)$/;" f
data/wp2txt.gemspec CHANGED
@@ -7,9 +7,9 @@ Gem::Specification.new do |s|
7
7
  s.version = Wp2txt::VERSION
8
8
  s.authors = ["Yoichiro Hasebe"]
9
9
  s.email = ["yohasebe@gmail.com"]
10
- s.homepage = "http://github.com/yohasebe/wp2txt"
11
- s.summary = %q{Wikipedia dump to text converter}
12
- s.description = %q{WP2TXT extracts plain text data from Wikipedia dump file (encoded in XML/compressed with Bzip2) stripping all the MediaWiki markups and other metadata.}
10
+ s.homepage = "https://github.com/yohasebe/wp2txt"
11
+ s.summary = %q{A command-line toolkit to extract text content and category data from Wikipedia dump files}
12
+ s.description = %q{WP2TXT extracts text and category data from Wikipedia dump files (encoded in XML / compressed with Bzip2), removing MediaWiki markup and other metadata.}
13
13
 
14
14
  s.rubyforge_project = "wp2txt"
15
15
 
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: wp2txt
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.0.0
4
+ version: 1.0.2
5
5
  platform: ruby
6
6
  authors:
7
7
  - Yoichiro Hasebe
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2022-08-09 00:00:00.000000000 Z
11
+ date: 2022-11-25 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: nokogiri
@@ -108,8 +108,8 @@ dependencies:
108
108
  - - ">="
109
109
  - !ruby/object:Gem::Version
110
110
  version: '0'
111
- description: WP2TXT extracts plain text data from Wikipedia dump file (encoded in
112
- XML/compressed with Bzip2) stripping all the MediaWiki markups and other metadata.
111
+ description: WP2TXT extracts text and category data from Wikipedia dump files (encoded
112
+ in XML / compressed with Bzip2), removing MediaWiki markup and other metadata.
113
113
  email:
114
114
  - yohasebe@gmail.com
115
115
  executables:
@@ -140,8 +140,9 @@ files:
140
140
  - lib/wp2txt/version.rb
141
141
  - spec/spec_helper.rb
142
142
  - spec/utils_spec.rb
143
+ - tags
143
144
  - wp2txt.gemspec
144
- homepage: http://github.com/yohasebe/wp2txt
145
+ homepage: https://github.com/yohasebe/wp2txt
145
146
  licenses: []
146
147
  metadata: {}
147
148
  post_install_message:
@@ -162,7 +163,8 @@ requirements: []
162
163
  rubygems_version: 3.3.3
163
164
  signing_key:
164
165
  specification_version: 4
165
- summary: Wikipedia dump to text converter
166
+ summary: A command-line toolkit to extract text content and category data from Wikipedia
167
+ dump files
166
168
  test_files:
167
169
  - spec/spec_helper.rb
168
170
  - spec/utils_spec.rb