wp2txt 0.9.3 → 0.9.5.1
- checksums.yaml +4 -4
- data/.gitignore +1 -0
- data/README.md +40 -25
- data/bin/wp2txt +13 -7
- data/data/output_samples/testdata_en.txt +11923 -36921
- data/data/output_samples/testdata_en_categories.txt +131 -823
- data/data/output_samples/testdata_en_summary.txt +1368 -0
- data/data/output_samples/testdata_ja.txt +24812 -4686
- data/data/output_samples/testdata_ja_categories.txt +205 -187
- data/data/output_samples/testdata_ja_summary.txt +1684 -0
- data/data/testdata_en.bz2 +0 -0
- data/data/testdata_ja.bz2 +0 -0
- data/lib/wp2txt/article.rb +3 -2
- data/lib/wp2txt/utils.rb +82 -54
- data/lib/wp2txt/version.rb +1 -1
- metadata +5 -3
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 2aa3c73ab9202aa22974bbb60dad95f10b8abb434cd923fe5f2f6e917f89ac18
+  data.tar.gz: 790d280ee298ff08c5dde80e355f69a1803b949abe14c81912ec6119f3371d59
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 39f16e5df3c22f60ef4c0f3c9fe05c5f9ee0732fa90dd9916dd7bf6ffdc05e991afd67425fa6fdb9661cd206e4e16e0db032c131cb59c9d71b7fd2b668635429
+  data.tar.gz: b7c700c667220e11b39fd25a91c76609d3f2608599223f8525e0c8b4b03e29fd1c9547ec2bf30117ca4d65aa0cb09db15f9841ce4790fbfe73a16bfeb5cebfc3
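These checksums are what RubyGems uses to verify the packaged `metadata.gz` and `data.tar.gz` inside the `.gem` file. Below is a minimal Ruby sketch (not part of the gem) of that verification, assuming `checksums.yaml` and the two archives have been extracted into the current directory; the paths are illustrative.

```ruby
# Minimal sketch: compare local files against the SHA256 values in checksums.yaml.
# Assumes metadata.gz and data.tar.gz were extracted from the .gem into the
# current directory; adjust the paths as needed.
require "digest"
require "yaml"

checksums = YAML.load_file("checksums.yaml")
%w[metadata.gz data.tar.gz].each do |name|
  expected = checksums["SHA256"][name]
  actual   = Digest::SHA256.file(name).hexdigest
  puts "#{name}: #{actual == expected ? 'OK' : 'MISMATCH'}"
end
```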
data/.gitignore
CHANGED
data/README.md
CHANGED
@@ -1,35 +1,35 @@
 # WP2TXT
 
-Wikipedia dump file to text converter
-
-**IMPORTANT:** This is a project still work in progress and it could be slow, unstable, and even destructive! It should be used with caution.
+Wikipedia dump file to text converter that extracts both content and category data
 
 ## About
 
-WP2TXT extracts plain text data from Wikipedia dump file (encoded in XML/compressed with Bzip2)
+WP2TXT extracts plain text data from a Wikipedia dump file (encoded in XML / compressed with Bzip2), removing all MediaWiki markup and other metadata. It was developed for researchers who want easy access to open-source multilingual corpora, but may be used for other purposes as well.
+
+**UPDATE (July 2022)**: Version 0.9.3 adds a new option `category_only`. When this option is enabled, wp2txt will extract only the title and category information of the article. See output examples below.
 
-**UPDATE:** Version 0.9.3 has added a new option `category_only`. With this option enabled, wp2txt extracts article title and category info only. Please see output examples below.
 
 ## Features
 
-*
-*
-*
-*
+* Converts Wikipedia dump files in various languages
+* Creates output files of specified size
+* Can specify text elements to be extracted and converted (page titles, section titles, lists, tables)
+* Can extract category information for each article
+
 
 ## Installation
-
+
     $ gem install wp2txt
 
 ## Usage
 
 Obtain a Wikipedia dump file (from [here](http://dumps.wikimedia.org/backup-index.html)) with a file name such as:
 
-
+> `xxwiki-yyyymmdd-pages-articles.xml.bz2`
 
-where `xx` is language code such as "en (English)" or "", and `yyyymmdd` is the date of creation (e.g.
+where `xx` is language code such as "en (English)" or "ja (Japanese)", and `yyyymmdd` is the date of creation (e.g. 20220720).
 
-### Example 1
+### Example 1: Basic
 
 The following extracts text data, including list items and excluding tables.
 
@@ -38,15 +38,29 @@ The following extracts text data, including list items and excluding tables.
 - [Output example (English)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en.txt)
 - [Output example (Japanese)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja.txt)
 
-### Example 2
+### Example 2: Title and category information only
 
 The following will extract only article titles and the categories to which each article belongs:
 
-    $ wp2txt -i xxwiki-yyyymmdd-pages-articles.xml.bz2 -o /output_dir
+    $ wp2txt --category-only -i xxwiki-yyyymmdd-pages-articles.xml.bz2 -o /output_dir
+
+Each line of the output data contains the title and the categories of an article:
+
+> title `TAB` category1`,` category2`,` category3`,` ...
 
 - [Output example (English)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en_categories.txt)
 - [Output example (Japanese)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja_categories.txt)
 
+### Example 3: Title, category, and summary text only
+
+The following will extract only article titles, the categories to which each article belongs, and text blocks before the first heading of the article:
+
+    $ wp2txt --summary-only -i xxwiki-yyyymmdd-pages-articles.xml.bz2 -o /output_dir
+
+- [Output example (English)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en_summary.txt)
+- [Output example (Japanese)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja_summary.txt)
+
+
 ## Options
 
 Command line options are as follows:
@@ -71,35 +85,36 @@ Command line options are as follows:
   definitions, etc. (Default: true)
   --category, -g: Show article category information (default: true)
   --category-only, -y: Extract only article title and categories (default: false)
+  -s, --summary-only: Extract only article title, categories, and summary text before first heading
   --file-size, -f <i>: Approximate size (in MB) of each output file
   (default: 10)
-  -u, --num-threads=<i>: Number of threads to be spawned (capped to the number of CPU cores;
+  -u, --num-threads=<i>: Number of threads to be spawned (capped to the number of CPU cores;
   set 99 to spawn max num of threads) (default: 4)
   --version, -v: Print version and exit
   --help, -h: Show this message
 
 ## Caveats
 
-*
-*
-*
-*
+* Some data, such as mathematical formulas and computer source code, will not be converted correctly.
+* Some text data may not be extracted correctly for various reasons (incorrect matching of begin/end tags, language-specific formatting rules, etc.).
+* The conversion process can take longer than expected. When dealing with a huge data set such as the English Wikipedia on a low-spec environment, it can take several hours or more.
+* WP2TXT, by the nature of its task, requires a lot of machine power and consumes a large amount of memory/storage resources. Therefore, there is a possibility that the process may stop unexpectedly. In the worst case, the process may even freeze without terminating successfully. Please understand this and use at your own risk.
 
-
+## Useful Links
 
 * [Wikipedia Database backup dumps](http://dumps.wikimedia.org/backup-index.html)
-
-
+
+## Author
 
 * Yoichiro Hasebe (<yohasebe@gmail.com>)
 
-
+## References
 
 The author will appreciate your mentioning one of these in your research.
 
 * Yoichiro HASEBE. 2006. [Method for using Wikipedia as Japanese corpus.](http://ci.nii.ac.jp/naid/110006226727) _Doshisha Studies in Language and Culture_ 9(2), 373-403.
 * 長谷部陽一郎. 2006. [Wikipedia日本語版をコーパスとして用いた言語研究の手法](http://ci.nii.ac.jp/naid/110006226727). 『言語文化』9(2), 373-403.
 
-
+## License
 
 This software is distributed under the MIT License. Please see the LICENSE file.
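To make the `--category-only` output format described in Example 2 concrete, here is a small Ruby sketch (not part of the gem) that reads such a file back in. The input path refers to the sample output file shipped with the gem; any file produced by the command would work the same way.

```ruby
# Minimal sketch: parse --category-only output ("title<TAB>category1, category2, ...").
# The input path below is illustrative; use any file produced by the command.
File.foreach("testdata_en_categories.txt") do |line|
  title, categories = line.chomp.split("\t", 2)
  next if title.nil? || title.empty?
  category_list = categories.to_s.split(",").map(&:strip)
  puts "#{title}: #{category_list.size} categories"
end
```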
data/bin/wp2txt
CHANGED
@@ -27,7 +27,7 @@ EOS
   opt :input_file, "Wikipedia dump file with .bz2 (compressed) or .txt (uncompressed) format", :required => true
   opt :output_dir, "Output directory", :default => Dir::pwd, :type => String
   opt :convert, "Output in plain text (converting from XML)", :default => true
-  opt :list, "Show list items in output", :default =>
+  opt :list, "Show list items in output", :default => false
   opt :heading, "Show section titles in output", :default => true, :short => "-d"
   opt :title, "Show page titles in output", :default => true
   opt :table, "Show table source code in output", :default => false
@@ -38,6 +38,7 @@ EOS
   opt :marker, "Show symbols prefixed to list items, definitions, etc.", :default => true
   opt :category, "Show article category information", :default => true
   opt :category_only, "Extract only article title and categories", :default => false
+  opt :summary_only, "Extract only article title, categories, and summary text before first heading", :default => false
   opt :file_size, "Approximate size (in MB) of each output file", :default => 10
   opt :num_threads, "Number of threads to be spawned (capped to the number of CPU cores; set 99 to spawn max num of threads)", :default => 4
 end
@@ -50,10 +51,9 @@ tfile_size = opts[:file_size]
 num_threads = opts[:num_threads]
 convert = opts[:convert]
 strip_tmarker = opts[:marker] ? false : true
-opt_array = [:title, :list, :heading, :table, :redirect, :multiline]
+opt_array = [:title, :list, :heading, :table, :redirect, :multiline, :category, :category_only, :summary_only]
 $leave_inline_template = true if opts[:inline]
 $leave_ref = true if opts[:ref]
-# $leave_table = true if opts[:table]
 config = {}
 opt_array.each do |opt|
   config[opt] = opts[opt]
@@ -64,20 +64,26 @@ wpconv = Wp2txt::Runner.new(parent, input_file, output_dir, tfile_size, num_thre
 
 wpconv.extract_text do |article|
   format_wiki!(article.title)
-  title = "[[#{article.title}]]\n"
 
-  if
+  if config[:category_only]
+    title = "#{article.title}\t"
+    contents = article.categories.join(", ")
+    contents << "\n"
+  elsif config[:category] && !article.categories.empty?
+    title = "\n[[#{article.title}]]\n\n"
     contents = "\nCATEGORIES: "
     contents << article.categories.join(", ")
     contents << "\n\n"
   else
+    title = "\n[[#{article.title}]]\n\n"
     contents = ""
   end
 
-  unless
+  unless config[:category_only]
     article.elements.each do |e|
       case e.first
       when :mw_heading
+        break if config[:summary_only]
        next if !config[:heading]
        format_wiki!(e.last)
        line = e.last
@@ -132,6 +138,6 @@ wpconv.extract_text do |article|
   if /\A[\s ]*\z/m =~ contents
     result = ""
   else
-    result = config[:title] ?
+    result = config[:title] ? title << contents : contents
   end
 end
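Putting the bin/wp2txt changes above together: `--category-only` emits one tab-separated line per article, `--summary-only` stops at the first section heading, and the default mode keeps the `[[Title]]` / `CATEGORIES:` layout. The simplified, self-contained sketch below illustrates that dispatch; `render_article` is a hypothetical helper, not the actual code in the gem.

```ruby
# Simplified illustration of the output modes added in this release.
# render_article is a hypothetical helper; the real logic lives in bin/wp2txt.
def render_article(title, categories, elements, config)
  if config[:category_only]
    # one line per article: "title<TAB>category1, category2, ..."
    return "#{title}\t#{categories.join(', ')}\n"
  end

  out = "\n[[#{title}]]\n\n"
  out << "\nCATEGORIES: #{categories.join(', ')}\n\n" if config[:category] && !categories.empty?

  elements.each do |type, text|
    # --summary-only keeps only the text blocks before the first section heading
    break if type == :mw_heading && config[:summary_only]
    out << text << "\n"
  end
  out
end
```

For example, `render_article("Tokyo", ["Cities in Japan"], [], category_only: true)` would return `"Tokyo\tCities in Japan\n"`, matching the `--category-only` format shown in the README diff.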