wp2txt 0.9.2 → 0.9.5
- checksums.yaml +4 -4
- data/.gitignore +1 -0
- data/README.md +68 -31
- data/bin/wp2txt +62 -53
- data/data/output_samples/testdata_en.txt +11923 -36921
- data/data/output_samples/testdata_en_categories.txt +132 -0
- data/data/output_samples/testdata_en_summary.txt +1368 -0
- data/data/output_samples/testdata_ja.txt +24812 -4686
- data/data/output_samples/testdata_ja_categories.txt +206 -0
- data/data/output_samples/testdata_ja_summary.txt +1684 -0
- data/data/testdata_en.bz2 +0 -0
- data/data/testdata_ja.bz2 +0 -0
- data/lib/wp2txt/article.rb +3 -2
- data/lib/wp2txt/utils.rb +51 -27
- data/lib/wp2txt/version.rb +1 -1
- data/lib/wp2txt.rb +2 -2
- metadata +7 -3
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: bf8270b3488c0045a067f71c155db8d9ac6366a94d825eed9bc6d05c95598345
+  data.tar.gz: 8802949a232c60d8b5ae6f93726154f7a6b40436b478919657f58f4bdc54add3
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 0a10804d78c33e035aaf429dd4613f84f3db0c6f22a6c36617a1fda25f03c0fd8fac224ec9e6009ab6ddddb475d73e6eda4c21606f89ef94950bc3749ce4f452
+  data.tar.gz: 4a8ea2f0900c6f97d3dcaf6c6387b3543d23962ea064cf1b18a8c293b4664c16fcabca9230aa83b0f5685eacffed74968fe82529566080b7821fd944c7bf275d
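The SHA256 values in this file can be compared against a locally unpacked gem. The following is a minimal sketch, assuming `data.tar.gz` has already been extracted from the `.gem` archive into the current directory; the helper name `sha256_matches?` and the file location are illustrative and not part of wp2txt or RubyGems.

```ruby
require "digest/sha2"

# Hypothetical helper: true when the file's SHA256 hex digest equals the expected value.
def sha256_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex.strip.downcase
end

# Expected value copied from the new checksums.yaml shown in this diff.
expected = "8802949a232c60d8b5ae6f93726154f7a6b40436b478919657f58f4bdc54add3"

# Assumption: data.tar.gz sits in the current directory; skip quietly otherwise.
if File.exist?("data.tar.gz")
  puts sha256_matches?("data.tar.gz", expected) ? "checksum OK" : "checksum MISMATCH"
end
```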
data/.gitignore
CHANGED
data/README.md
CHANGED
@@ -1,32 +1,67 @@
 # WP2TXT
 
-Wikipedia dump file to text converter
+Wikipedia dump file to text converter that extracts both content and category data
 
-
+## About
 
-
+WP2TXT extracts plain text data from a Wikipedia dump file (encoded in XML / compressed with Bzip2), removing all MediaWiki markup and other metadata. It was developed for researchers who want easy access to open-source multilingual corpora, but may be used for other purposes as well.
 
-
+**UPDATE (July 2022)**: Version 0.9.3 adds a new option `category_only`. When this option is enabled, wp2txt will extract only the title and category information of the article. See output examples below.
 
-**UPDATE:** Version 0.9.1 has added a new option `num-threads`, which improves the performance significantly . Note also that `--category` option is enabled by default, resulting with output format somewhat different from previous versions. Check out the new format using test data in `data/output_samples` folder before going on to convert a huge wikipedia dump.
 
-
+## Features
 
-*
-*
-*
+* Converts Wikipedia dump files in various languages
+* Creates output files of specified size
+* Can specify text elements to be extracted and converted (page titles, section titles, lists, tables)
+* Can extract category information for each article
+
+
+## Installation
 
-### Installation
-
 $ gem install wp2txt
 
-
+## Usage
 
 Obtain a Wikipedia dump file (from [here](http://dumps.wikimedia.org/backup-index.html)) with a file name such as:
 
-
+> `xxwiki-yyyymmdd-pages-articles.xml.bz2`
+
+where `xx` is language code such as "en (English)" or "ja (Japanese)", and `yyyymmdd` is the date of creation (e.g. 20220720).
+
+### Example 1: Basic
+
+The following extracts text data, including list items and excluding tables.
+
+$ wp2txt -i xxwiki-yyyymmdd-pages-articles.xml.bz2 -o /output_dir
+
+- [Output example (English)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en.txt)
+- [Output example (Japanese)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja.txt)
+
+### Example 2: Title and category information only
+
+The following will extract only article titles and the categories to which each article belongs:
+
+$ wp2txt --category-only -i xxwiki-yyyymmdd-pages-articles.xml.bz2 -o /output_dir
 
-
+Each line of the output data contains the title and the categories of an article:
+
+> title `TAB` category1`,` category2`,` category3`,` ...
+
+- [Output example (English)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en_categories.txt)
+- [Output example (Japanese)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja_categories.txt)
+
+### Example 3: Title, category, and summary text only
+
+The following will extract only article titles, the categories to which each article belongs, and text blocks before the first heading of the article:
+
+$ wp2txt --summary-only -i xxwiki-yyyymmdd-pages-articles.xml.bz2 -o /output_dir
+
+- [Output example (English)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en_summary.txt)
+- [Output example (Japanese)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja_summary.txt)
+
+
+## Options
 
 Command line options are as follows:
 
@@ -40,44 +75,46 @@ Command line options are as follows:
 --list, --no-list, -l: Show list items in output (default: true)
 --heading, --no-heading, -d: Show section titles in output (default: true)
 --title, --no-title, -t: Show page titles in output (default: true)
---table, -a: Show table source code in output
---inline, -n: leave inline template notations unmodified
---multiline, -m: leave multiline template notations unmodified
---ref, -r: leave reference notations in the format
+--table, -a: Show table source code in output (default: false)
+--inline, -n: leave inline template notations unmodified (default: false)
+--multiline, -m: leave multiline template notations unmodified (default: false)
+--ref, -r: leave reference notations in the format (default: false)
 [ref]...[/ref]
---redirect, -e: Show redirect destination
+--redirect, -e: Show redirect destination (default: false)
 --marker, --no-marker, -k: Show symbols prefixed to list items,
 definitions, etc. (Default: true)
---category, -g: Show article category information
+--category, -g: Show article category information (default: true)
+--category-only, -y: Extract only article title and categories (default: false)
+-s, --summary-only: Extract only article title, categories, and summary text before first heading
 --file-size, -f <i>: Approximate size (in MB) of each output file
 (default: 10)
--u, --num-threads=<i>: Number of threads to be spawned (capped to the number of CPU cores;
+-u, --num-threads=<i>: Number of threads to be spawned (capped to the number of CPU cores;
 set 99 to spawn max num of threads) (default: 4)
 --version, -v: Print version and exit
 --help, -h: Show this message
 
-
+## Caveats
 
-*
-*
-*
-*
+* Some data, such as mathematical formulas and computer source code, will not be converted correctly.
+* Some text data may not be extracted correctly for various reasons (incorrect matching of begin/end tags, language-specific formatting rules, etc.).
+* The conversion process can take longer than expected. When dealing with a huge data set such as the English Wikipedia on a low-spec environment, it can take several hours or more.
+* WP2TXT, by the nature of its task, requires a lot of machine power and consumes a large amount of memory/storage resources. Therefore, there is a possibility that the process may stop unexpectedly. In the worst case, the process may even freeze without terminating successfully. Please understand this and use at your own risk.
 
-
+## Useful Links
 
 * [Wikipedia Database backup dumps](http://dumps.wikimedia.org/backup-index.html)
-
-
+
+## Author
 
 * Yoichiro Hasebe (<yohasebe@gmail.com>)
 
-
+## References
 
 The author will appreciate your mentioning one of these in your research.
 
 * Yoichiro HASEBE. 2006. [Method for using Wikipedia as Japanese corpus.](http://ci.nii.ac.jp/naid/110006226727) _Doshisha Studies in Language and Culture_ 9(2), 373-403.
 * 長谷部陽一郎. 2006. [Wikipedia日本語版をコーパスとして用いた言語研究の手法](http://ci.nii.ac.jp/naid/110006226727). 『言語文化』9(2), 373-403.
 
-
+## License
 
 This software is distributed under the MIT License. Please see the LICENSE file.
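The `--category-only` output format described in the README changes above (title, a TAB, then comma-separated categories) lends itself to simple line-based parsing. The following is a rough sketch; `parse_category_line` is a hypothetical helper, not part of wp2txt, and the sample line is made up. It assumes categories are joined with a comma followed by a space, so a category name that itself contains ", " would be split incorrectly.

```ruby
# Hypothetical helper (not part of wp2txt): split one line of --category-only
# output into a title and an array of category names.
def parse_category_line(line)
  # Split on the first TAB only, in case a category contains a TAB.
  title, cats = line.chomp.split("\t", 2)
  categories = cats.to_s.split(", ").map(&:strip).reject(&:empty?)
  [title, categories]
end

# Made-up sample line, for illustration only.
sample = "Example article\tFirst category, Second category\n"
title, categories = parse_category_line(sample)
puts "#{title}: #{categories.inspect}"
```

Reading a whole output file would then be a matter of calling the helper inside `File.foreach`.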
data/bin/wp2txt
CHANGED
@@ -27,7 +27,7 @@ EOS
   opt :input_file, "Wikipedia dump file with .bz2 (compressed) or .txt (uncompressed) format", :required => true
   opt :output_dir, "Output directory", :default => Dir::pwd, :type => String
   opt :convert, "Output in plain text (converting from XML)", :default => true
-  opt :list, "Show list items in output", :default =>
+  opt :list, "Show list items in output", :default => false
   opt :heading, "Show section titles in output", :default => true, :short => "-d"
   opt :title, "Show page titles in output", :default => true
   opt :table, "Show table source code in output", :default => false
@@ -37,6 +37,8 @@ EOS
   opt :redirect, "Show redirect destination", :default => false
   opt :marker, "Show symbols prefixed to list items, definitions, etc.", :default => true
   opt :category, "Show article category information", :default => true
+  opt :category_only, "Extract only article title and categories", :default => false
+  opt :summary_only, "Extract only article title, categories, and summary text before first heading", :default => false
   opt :file_size, "Approximate size (in MB) of each output file", :default => 10
   opt :num_threads, "Number of threads to be spawned (capped to the number of CPU cores; set 99 to spawn max num of threads)", :default => 4
 end
@@ -49,10 +51,9 @@ tfile_size = opts[:file_size]
 num_threads = opts[:num_threads]
 convert = opts[:convert]
 strip_tmarker = opts[:marker] ? false : true
-opt_array = [:title, :list, :heading, :table, :redirect, :multiline]
+opt_array = [:title, :list, :heading, :table, :redirect, :multiline, :category, :category_only, :summary_only]
 $leave_inline_template = true if opts[:inline]
 $leave_ref = true if opts[:ref]
-# $leave_table = true if opts[:table]
 config = {}
 opt_array.each do |opt|
   config[opt] = opts[opt]
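The hunk above copies a fixed set of option keys into `config`, and the `extract_text` block later in this diff checks `config[:category_only]` before it ever reaches the `:summary_only` test. A small sketch of that precedence follows; `build_config` and `output_mode` are hypothetical names used only for illustration, and the `opts` hash stands in for the parsed command-line options.

```ruby
# Keys copied into config, as listed in opt_array in the diff.
OPT_KEYS = %i[title list heading table redirect multiline
              category category_only summary_only].freeze

# Hypothetical helper: copy only the known keys from the parsed options.
def build_config(opts)
  OPT_KEYS.each_with_object({}) { |key, config| config[key] = opts[key] }
end

# Hypothetical helper: name the effective mode. category_only is tested first,
# mirroring the order of the checks in the extract_text block.
def output_mode(config)
  return :category_only if config[:category_only]
  return :summary_only if config[:summary_only]
  :full
end

# Stand-in for parsed options; :inline is deliberately not copied into config.
opts = { category_only: true, summary_only: true, inline: true }
config = build_config(opts)
puts output_mode(config)
```

Under this reading, passing both `--category-only` and `--summary-only` behaves like `--category-only` alone.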
@@ -63,72 +64,80 @@ wpconv = Wp2txt::Runner.new(parent, input_file, output_dir, tfile_size, num_thre
 
 wpconv.extract_text do |article|
   format_wiki!(article.title)
-  title = "[[#{article.title}]]\n"
 
-  if
+  if config[:category_only]
+    title = "#{article.title}\t"
+    contents = article.categories.join(", ")
+    contents << "\n"
+  elsif config[:category] && !article.categories.empty?
+    title = "\n[[#{article.title}]]\n\n"
     contents = "\nCATEGORIES: "
     contents << article.categories.join(", ")
     contents << "\n\n"
   else
+    title = "\n[[#{article.title}]]\n\n"
     contents = ""
   end
 
-
-
-
-
-
-
-
-    when :mw_paragraph
-      format_wiki!(e.last)
-      line = e.last + "\n"
-      line << "+PARAGRAPH+" if $DEBUG_MODE
-    when :mw_table, :mw_htable
-      next if !config[:table]
-      line = e.last
-      line << "+TABLE+" if $DEBUG_MODE
-    when :mw_pre
-      next if !config[:pre]
-      line = e.last
-      line << "+PRE+" if $DEBUG_MODE
-    when :mw_quote
-      line = e.last
-      line << "+QUOTE+" if $DEBUG_MODE
-    when :mw_unordered, :mw_ordered, :mw_definition
-      next if !config[:list]
-      line = e.last
-      line << "+LIST+" if $DEBUG_MODE
-    when :mw_ml_template
-      next if !config[:multiline]
-      line = e.last
-      line << "+MLTEMPLATE+" if $DEBUG_MODE
-    when :mw_redirect
-      next if !config[:redirect]
-      line = e.last
-      line << "+REDIRECT+" if $DEBUG_MODE
-      line << "\n\n"
-    when :mw_isolated_template
-      next if !config[:multiline]
-      line = e.last
-      line << "+ISOLATED_TEMPLATE+" if $DEBUG_MODE
-    when :mw_isolated_tag
-      next
-    else
-      if $DEBUG_MODE
-        # format_wiki!(e.last)
+  unless config[:category_only]
+    article.elements.each do |e|
+      case e.first
+      when :mw_heading
+        break if config[:summary_only]
+        next if !config[:heading]
+        format_wiki!(e.last)
         line = e.last
-        line << "+
-
+        line << "+HEADING+" if $DEBUG_MODE
+      when :mw_paragraph
+        format_wiki!(e.last)
+        line = e.last + "\n"
+        line << "+PARAGRAPH+" if $DEBUG_MODE
+      when :mw_table, :mw_htable
+        next if !config[:table]
+        line = e.last
+        line << "+TABLE+" if $DEBUG_MODE
+      when :mw_pre
+        next if !config[:pre]
+        line = e.last
+        line << "+PRE+" if $DEBUG_MODE
+      when :mw_quote
+        line = e.last
+        line << "+QUOTE+" if $DEBUG_MODE
+      when :mw_unordered, :mw_ordered, :mw_definition
+        next if !config[:list]
+        line = e.last
+        line << "+LIST+" if $DEBUG_MODE
+      when :mw_ml_template
+        next if !config[:multiline]
+        line = e.last
+        line << "+MLTEMPLATE+" if $DEBUG_MODE
+      when :mw_redirect
+        next if !config[:redirect]
+        line = e.last
+        line << "+REDIRECT+" if $DEBUG_MODE
+        line << "\n\n"
+      when :mw_isolated_template
+        next if !config[:multiline]
+        line = e.last
+        line << "+ISOLATED_TEMPLATE+" if $DEBUG_MODE
+      when :mw_isolated_tag
         next
+      else
+        if $DEBUG_MODE
+          # format_wiki!(e.last)
+          line = e.last
+          line << "+OTHER+"
+        else
+          next
+        end
       end
+      contents << line << "\n"
     end
-    contents << line << "\n"
   end
 
   if /\A[\s ]*\z/m =~ contents
     result = ""
   else
-    result = config[:title] ?
+    result = config[:title] ? title << contents : contents
   end
 end
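The new `break if config[:summary_only]` line stops element processing at the first `:mw_heading`, so only the lead section of an article is kept. A simplified sketch of that idea follows; `summary_text` is a hypothetical function, the element pairs are made-up data shaped like the `[type, text]` pairs the diff accesses via `e.first` and `e.last`, and only paragraphs are collected here, whereas the real loop also handles lists, tables, and other element types.

```ruby
# Hypothetical, simplified illustration of the summary-only behaviour:
# collect paragraph text until the first heading is reached.
def summary_text(elements)
  out = +""
  elements.each do |type, text|
    break if type == :mw_heading        # first heading ends the summary
    next unless type == :mw_paragraph   # simplification: paragraphs only
    out << text << "\n"
  end
  out
end

# Made-up article elements, for illustration only.
elements = [
  [:mw_paragraph, "Lead paragraph."],
  [:mw_heading,   "History"],
  [:mw_paragraph, "Body paragraph."]
]
puts summary_text(elements)
```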