wp2txt 0.8.0 → 0.9.3

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
- SHA1:
3
- metadata.gz: d0610b7e28e04c4cd9c3a1401c88e15f6ddb16ec
4
- data.tar.gz: b866915631fdc956395c005735b089ddff7956e5
2
+ SHA256:
3
+ metadata.gz: 32966949db257b30be7a5c044965ce08426bdede1f1fa0dbb0a276361d1c69c2
4
+ data.tar.gz: ee0b08031ae75b9d08fd1f07e5d08e8e10135dba8b15bd44692f7c434b220262
5
5
  SHA512:
6
- metadata.gz: e9fbef3de5ed866de0b3c7fadd96bdf0ff501b71c2d9f6f282eed538194bdfff8d9659cf53aedf95062c9aadf2ec90393075158ef9c8ae78f3e53ce84119f764
7
- data.tar.gz: 36ad316986d94a6be89ccb591dec510dc4695bede188448fde0745702faad8039d0df4a84d2f0730dd11749a63d58e6b63b51b2bce585d8f8ccb3ff02553c3c8
6
+ metadata.gz: 9dee99ed39d2da01c9aeda462291645533773ed537de06a1f9127d626c91bc421e92ef805c52764fd0a668cf104a3733ab5981e59bb35d4abebe2f2909c63e3f
7
+ data.tar.gz: 8c6d7a5841a47fa4643a229050be652e38879d03477568175e8b29fb6f77971633731ec2d5dbd429dc2499ecfc2bfbb58917389d89b9cd57743b86b97ceaa4af
data/README.md CHANGED
@@ -2,33 +2,54 @@
2
2
 
3
3
  Wikipedia dump file to text converter
4
4
 
5
- **Important: This is a project *work in progress* and it could be slow, unstable, and even destructive! Please use it with caution!**
5
+ **IMPORTANT:** This is a project still work in progress and it could be slow, unstable, and even destructive! It should be used with caution.
6
6
 
7
- ### About ###
7
+ ## About
8
8
 
9
9
  WP2TXT extracts plain text data from Wikipedia dump file (encoded in XML/compressed with Bzip2) stripping all the MediaWiki markups and other metadata. It is originally intended to be useful for researchers who look for an easy way to obtain open-source multi-lingual corpora, but may be handy for other purposes.
10
10
 
11
- ### Features ###
11
+ **UPDATE:** Version 0.9.3 has added a new option `category_only`. With this option enabled, wp2txt extracts article title and category info only. Please see output examples below.
12
12
 
13
- * Convert dump files of Wikipedia of various languages (I hope).
13
+ ## Features
14
+
15
+ * Convert dump files of Wikipedia of various languages
14
16
  * Create output files of specified size.
15
- * Allow users to specify text elements to be extracted/converted (page titles, section titles, lists, and tables).
17
+ * Allow users to specify text elements to be extracted/converted (page titles, section titles, lists, and tables)
18
+ * Extract category information of each article
16
19
 
17
- ### Installation
20
+ ## Installation
18
21
 
19
22
  $ gem install wp2txt
20
23
 
21
- ### Usage
24
+ ## Usage
22
25
 
23
26
  Obtain a Wikipedia dump file (from [here](http://dumps.wikimedia.org/backup-index.html)) with a file name such as:
24
27
 
25
28
  xxwiki-yyyymmdd-pages-articles.xml.bz2
26
29
 
27
- where `xx` is language code such as "en (English)" or "ja (Japanese)", and `yyyymmdd` is the date of creation (e.g. 20120601).
30
+ where `xx` is language code such as "en (English)" or "", and `yyyymmdd` is the date of creation (e.g. 20120601).
28
31
 
29
- Command line options are as follows:
32
+ ### Example 1
33
+
34
+ The following extracts text data, including list items and excluding tables.
35
+
36
+ $ wp2txt -i xxwiki-yyyymmdd-pages-articles.xml.bz2 -o /output_dir
37
+
38
+ - [Output example (English)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en.txt)
39
+ - [Output example (Japanese)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja.txt)
40
+
41
+ ### Example 2
30
42
 
31
- **Important** Command line options in the current version have been drastically changed from previous versions.
43
+ The following will extract only article titles and the categories to which each article belongs:
44
+
45
+ $ wp2txt -i xxwiki-yyyymmdd-pages-articles.xml.bz2 -o /output_dir
46
+
47
+ - [Output example (English)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en_categories.txt)
48
+ - [Output example (Japanese)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja_categories.txt)
49
+
50
+ ## Options
51
+
52
+ Command line options are as follows:
32
53
 
33
54
  Usage: wp2txt [options]
34
55
  where [options] are:
@@ -40,39 +61,45 @@ Command line options are as follows:
40
61
  --list, --no-list, -l: Show list items in output (default: true)
41
62
  --heading, --no-heading, -d: Show section titles in output (default: true)
42
63
  --title, --no-title, -t: Show page titles in output (default: true)
43
- --table, -a: Show table source code in output
44
- --template, -e: leave inline template notations unmodified
45
- --ref, -r: leave reference notations in the format
64
+ --table, -a: Show table source code in output (default: false)
65
+ --inline, -n: leave inline template notations unmodified (default: false)
66
+ --multiline, -m: leave multiline template notations unmodified (default: false)
67
+ --ref, -r: leave reference notations in the format (default: false)
46
68
  [ref]...[/ref]
47
- --redirect: Show redirect destination
48
- --marker, --no-marker, -m: Show symbols prefixed to list items,
69
+ --redirect, -e: Show redirect destination (default: false)
70
+ --marker, --no-marker, -k: Show symbols prefixed to list items,
49
71
  definitions, etc. (Default: true)
50
- --category, -g: Show article category information
72
+ --category, -g: Show article category information (default: true)
73
+ --category-only, -y: Extract only article title and categories (default: false)
51
74
  --file-size, -f <i>: Approximate size (in MB) of each output file
52
75
  (default: 10)
76
+ -u, --num-threads=<i>: Number of threads to be spawned (capped to the number of CPU cores;
77
+ set 99 to spawn max num of threads) (default: 4)
53
78
  --version, -v: Print version and exit
54
79
  --help, -h: Show this message
55
80
 
56
- ### Caveats ###
81
+ ## Caveats
57
82
 
58
83
  * Certain types of data such as mathematical equations and computer source code are not be properly converted. Please remember this software is originally intended for correcting “sentences” for linguistic studies.
59
84
  * Extraction of normal text data could sometimes fail for various reasons (e.g. illegal matching of begin/end tags, language-specific conventions of formatting, etc).
60
85
  * Conversion process can take far more than you would expect. It could take several hours or more when dealing with a huge data set such as the English Wikipedia on a low-spec environments.
61
86
  * Because of nature of the task, WP2TXT needs much machine power and consumes a lot of memory/storage resources. The process thus could halt unexpectedly. It may even get stuck, in the worst case, without getting gracefully terminated. Please understand this and use the software __at your own risk__.
62
87
 
63
- ### Useful Link ###
88
+ ### Useful Links
64
89
 
65
90
  * [Wikipedia Database backup dumps](http://dumps.wikimedia.org/backup-index.html)
66
91
 
67
- ### Author ###
92
+ ### Author
68
93
 
69
94
  * Yoichiro Hasebe (<yohasebe@gmail.com>)
70
95
 
71
- ### References ###
96
+ ### References
97
+
98
+ The author will appreciate your mentioning one of these in your research.
72
99
 
73
100
  * Yoichiro HASEBE. 2006. [Method for using Wikipedia as Japanese corpus.](http://ci.nii.ac.jp/naid/110006226727) _Doshisha Studies in Language and Culture_ 9(2), 373-403.
74
101
  * 長谷部陽一郎. 2006. [Wikipedia日本語版をコーパスとして用いた言語研究の手法](http://ci.nii.ac.jp/naid/110006226727). 『言語文化』9(2), 373-403.
75
102
 
76
- ### License ###
103
+ ### License
77
104
 
78
105
  This software is distributed under the MIT License. Please see the LICENSE file.
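
Note on `--num-threads`: the README text above says the value is "capped to the number of CPU cores" and that 99 requests the maximum. The diff does not show how the cap is applied, so the following is only a minimal sketch of one way to do it. `effective_threads` is a hypothetical helper, not a wp2txt function, and the lower bound of 1 is an assumption.

```ruby
require 'etc'

# Hypothetical helper, not part of wp2txt: clamp a requested thread count
# to the range 1..cores, so an oversized value such as 99 means "use every core".
# The lower bound of 1 is an assumption; the README only describes the upper cap.
def effective_threads(requested, cores = Etc.nprocessors)
  requested.clamp(1, cores)
end

puts effective_threads(4)   # at most 4, fewer on machines with fewer cores
puts effective_threads(99)  # the number of cores reported for this machine
```

Passing `cores` explicitly makes the behaviour easy to check without depending on the host machine.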
data/bin/benchmark.rb CHANGED
@@ -12,15 +12,16 @@ require 'benchmark'
12
12
  data_dir = File.join(File.dirname(__FILE__), '..', "data")
13
13
 
14
14
  parent = Wp2txt::CmdProgbar.new
15
- input_file = File.join(data_dir, "testdata.bz2")
15
+ input_file = File.join(data_dir, "testdata_ja.bz2")
16
16
  output_dir = data_dir
17
17
  tfile_size = 10
18
+ num_threads = 1
18
19
  convert = true
19
20
  strip_tmarker = true
20
21
 
21
22
  Benchmark.bm do |x|
22
23
  x.report do
23
- wpconv = Wp2txt::Runner.new(parent, input_file, output_dir, tfile_size, convert, strip_tmarker)
24
+ wpconv = Wp2txt::Runner.new(parent, input_file, output_dir, tfile_size, num_threads, convert, strip_tmarker)
24
25
  wpconv.extract_text do |article|
25
26
  format_wiki!(article.title)
26
27
  title = "[[#{article.title}]]\n"
@@ -58,11 +59,11 @@ Benchmark.bm do |x|
58
59
  end
59
60
  contents << line
60
61
  end
61
- format_article!(contents)
62
+ format_wiki!(contents)
62
63
  convert_characters!(contents)
63
64
 
64
65
  ##### cleanup #####
65
- if /\A\s*\z/m =~ contents
66
+ if /\A[\s ]*\z/m =~ contents
66
67
  result = ""
67
68
  else
68
69
  result = title + "\n" + contents
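
Note on the blank-content check: the hunk above widens `\s` to a character class with one extra character. In this rendering the extra character appears as a plain space, which `\s` already covers, so it was presumably a non-ASCII space (for example U+3000, the ideographic space common in Japanese text) in the original file. That is an assumption. Under that assumption, the sketch below shows why the change matters in Ruby, and a stdlib alternative using the POSIX bracket `[[:space:]]`. The constant names are illustrative only.

```ruby
# U+3000 IDEOGRAPHIC SPACE on both sides of a newline.
IDEOGRAPHIC_ONLY = "\u3000\n\u3000"

ASCII_BLANK   = /\A\s*\z/m           # \s covers ASCII whitespace only
UNICODE_BLANK = /\A[[:space:]]*\z/m  # POSIX bracket also covers Unicode spaces in UTF-8 strings

puts ASCII_BLANK.match?(IDEOGRAPHIC_ONLY)    # false
puts UNICODE_BLANK.match?(IDEOGRAPHIC_ONLY)  # true
puts UNICODE_BLANK.match?("  \n\t")          # true
```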
data/bin/wp2txt CHANGED
@@ -4,18 +4,18 @@
4
4
  $: << File.join(File.dirname(__FILE__))
5
5
  $: << File.join(File.dirname(__FILE__), '..', 'lib')
6
6
 
7
- DEBUG_MODE = true
7
+ $DEBUG_MODE = false
8
8
  SHAREDIR = File.join(File.dirname(__FILE__), '..', 'share')
9
9
  DOCDIR = File.join(File.dirname(__FILE__), '..', 'doc')
10
10
 
11
11
  require 'wp2txt'
12
12
  require 'wp2txt/utils'
13
13
  require 'wp2txt/version'
14
- require 'trollop'
14
+ require 'optimist'
15
15
 
16
16
  include Wp2txt
17
17
 
18
- opts = Trollop::options do
18
+ opts = Optimist::options do
19
19
  version Wp2txt::VERSION
20
20
  banner <<-EOS
21
21
  WP2TXT extracts plain text data from Wikipedia dump file (encoded in XML/compressed with Bzip2) stripping all the MediaWiki markups and other metadata.
@@ -31,37 +31,40 @@ EOS
31
31
  opt :heading, "Show section titles in output", :default => true, :short => "-d"
32
32
  opt :title, "Show page titles in output", :default => true
33
33
  opt :table, "Show table source code in output", :default => false
34
- opt :template, "leave inline template notations unmodified", :default => false
34
+ opt :inline, "leave inline template notations as they are", :default => false
35
+ opt :multiline, "leave multiline template notations as they are", :default => false
35
36
  opt :ref, "leave reference notations in the format [ref]...[/ref]", :default => false
36
37
  opt :redirect, "Show redirect destination", :default => false
37
38
  opt :marker, "Show symbols prefixed to list items, definitions, etc.", :default => true
38
- opt :category, "Show article category information", :default => false
39
+ opt :category, "Show article category information", :default => true
40
+ opt :category_only, "Extract only article title and categories", :default => false
39
41
  opt :file_size, "Approximate size (in MB) of each output file", :default => 10
42
+ opt :num_threads, "Number of threads to be spawned (capped to the number of CPU cores; set 99 to spawn max num of threads)", :default => 4
40
43
  end
41
- Trollop::die :size, "must be larger than 0" unless opts[:file_size] >= 0
42
- Trollop::die :output_dir, "must exist" unless File.exist?(opts[:output_dir])
44
+ Optimist::die :size, "must be larger than 0" unless opts[:file_size] >= 0
45
+ Optimist::die :output_dir, "must exist" unless File.exist?(opts[:output_dir])
43
46
 
44
47
  input_file = ARGV[0]
45
48
  output_dir = opts[:output_dir]
46
49
  tfile_size = opts[:file_size]
50
+ num_threads = opts[:num_threads]
47
51
  convert = opts[:convert]
48
52
  strip_tmarker = opts[:marker] ? false : true
49
- opt_array = [:title, :list, :heading, :table, :redirect]
50
- $leave_template = true if opts[:template]
51
- $leave_table = true if opts[:table]
53
+ opt_array = [:title, :list, :heading, :table, :redirect, :multiline]
54
+ $leave_inline_template = true if opts[:inline]
52
55
  $leave_ref = true if opts[:ref]
56
+ # $leave_table = true if opts[:table]
53
57
  config = {}
54
58
  opt_array.each do |opt|
55
59
  config[opt] = opts[opt]
56
60
  end
57
61
 
58
62
  parent = Wp2txt::CmdProgbar.new
59
- wpconv = Wp2txt::Runner.new(parent, input_file, output_dir, tfile_size, convert, strip_tmarker)
63
+ wpconv = Wp2txt::Runner.new(parent, input_file, output_dir, tfile_size, num_threads, convert, strip_tmarker)
60
64
 
61
65
  wpconv.extract_text do |article|
62
66
  format_wiki!(article.title)
63
67
  title = "[[#{article.title}]]\n"
64
- convert_characters!(title)
65
68
 
66
69
  if opts[:category] && !article.categories.empty?
67
70
  contents = "\nCATEGORIES: "
@@ -71,67 +74,64 @@ wpconv.extract_text do |article|
71
74
  contents = ""
72
75
  end
73
76
 
74
- article.elements.each do |e|
75
- case e.first
76
- when :mw_heading
77
- next if !config[:heading]
78
- format_wiki!(e.last)
79
- format_article!(e.last)
80
- line = e.last
81
- line << "+HEADING+" if $DEBUG_MODE
82
- when :mw_paragraph
83
- # next if !config[:paragraph]
84
- format_wiki!(e.last)
85
- format_article!(e.last)
86
- line = e.last + "\n"
87
- line << "+PARAGRAPH+" if $DEBUG_MODE
88
- when :mw_table, :mw_htable
89
- next if !config[:table]
90
- # format_wiki!(e.last)
91
- line = e.last
92
- line << "+TABLE+" if $DEBUG_MODE
93
- when :mw_pre
94
- next if !config[:pre]
95
- line = e.last
96
- line << "+PRE+" if $DEBUG_MODE
97
- when :mw_quote
98
- # next if !config[:quote]
99
- # format_wiki!(e.last)
100
- line = e.last
101
- line << "+QUOTE+" if $DEBUG_MODE
102
- when :mw_unordered, :mw_ordered, :mw_definition
103
- next if !config[:list]
104
- # format_wiki!(e.last)
105
- line = e.last
106
- line << "+LIST+" if $DEBUG_MODE
107
- when :mw_redirect
108
- next if !config[:redirect]
109
- # format_wiki!(e.last)
110
- line = e.last
111
- line << "+REDIRECT+" if $DEBUG_MODE
112
- line << "\n\n"
113
- else
114
- if $DEBUG_MODE
115
- # format_wiki!(e.last)
77
+ unless opts[:category_only]
78
+ article.elements.each do |e|
79
+ case e.first
80
+ when :mw_heading
81
+ next if !config[:heading]
82
+ format_wiki!(e.last)
116
83
  line = e.last
117
- line << "+OTHER+"
118
- else
84
+ line << "+HEADING+" if $DEBUG_MODE
85
+ when :mw_paragraph
86
+ format_wiki!(e.last)
87
+ line = e.last + "\n"
88
+ line << "+PARAGRAPH+" if $DEBUG_MODE
89
+ when :mw_table, :mw_htable
90
+ next if !config[:table]
91
+ line = e.last
92
+ line << "+TABLE+" if $DEBUG_MODE
93
+ when :mw_pre
94
+ next if !config[:pre]
95
+ line = e.last
96
+ line << "+PRE+" if $DEBUG_MODE
97
+ when :mw_quote
98
+ line = e.last
99
+ line << "+QUOTE+" if $DEBUG_MODE
100
+ when :mw_unordered, :mw_ordered, :mw_definition
101
+ next if !config[:list]
102
+ line = e.last
103
+ line << "+LIST+" if $DEBUG_MODE
104
+ when :mw_ml_template
105
+ next if !config[:multiline]
106
+ line = e.last
107
+ line << "+MLTEMPLATE+" if $DEBUG_MODE
108
+ when :mw_redirect
109
+ next if !config[:redirect]
110
+ line = e.last
111
+ line << "+REDIRECT+" if $DEBUG_MODE
112
+ line << "\n\n"
113
+ when :mw_isolated_template
114
+ next if !config[:multiline]
115
+ line = e.last
116
+ line << "+ISOLATED_TEMPLATE+" if $DEBUG_MODE
117
+ when :mw_isolated_tag
119
118
  next
119
+ else
120
+ if $DEBUG_MODE
121
+ # format_wiki!(e.last)
122
+ line = e.last
123
+ line << "+OTHER+"
124
+ else
125
+ next
126
+ end
120
127
  end
128
+ contents << line << "\n"
121
129
  end
122
- contents << line
123
130
  end
124
- convert_characters!(contents)
125
- remove_table!(contents) unless $leave_table
126
- remove_ref!(contents) unless $leave_ref
127
131
 
128
- ##### cleanup #####
129
- if /\A\s*\z/m =~ contents
132
+ if /\A[\s ]*\z/m =~ contents
130
133
  result = ""
131
134
  else
132
- result = config[:title] ? title + "\n" << contents : contents
135
+ result = config[:title] ? "\n#{title}\n" << contents : contents
133
136
  end
134
- result.gsub!(/\[ref\]\s*\[\/ref\]/m){""}
135
- result.gsub!(/\n\n\n+/m){"\n\n"}
136
- result << "\n"
137
- end
137
+ end
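
Note on option validation in `data/bin/wp2txt`: the `die` call for the file size names `:size`, while the option is declared as `:file_size`, and the condition `>= 0` accepts 0 even though the message says "must be larger than 0". Both lines carry over unchanged from 0.8.0 apart from the Trollop to Optimist rename. The sketch below shows one possible tightening in plain Ruby. `option_errors` is a hypothetical helper, not part of wp2txt or Optimist, and it returns messages instead of exiting so that it stays easy to test.

```ruby
# Hypothetical helper, not part of wp2txt or Optimist: collect validation
# problems for the parsed options hash instead of aborting the process.
def option_errors(opts)
  errors = []
  # Strictly positive, matching the wording of the original error message.
  errors << "--file-size must be larger than 0" unless opts[:file_size].to_i > 0
  dir = opts[:output_dir]
  errors << "--output-dir must be an existing directory" unless dir && File.directory?(dir)
  errors
end

p option_errors(file_size: 10, output_dir: Dir.pwd)  # no errors
p option_errors(file_size: 0, output_dir: Dir.pwd)   # one error, about --file-size
```

In the real script each message could be handed to `Optimist::die` with the matching option name (`:file_size` or `:output_dir`).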