wp2txt 0.9.3 → 0.9.4
- checksums.yaml +4 -4
- data/README.md +8 -10
- data/bin/wp2txt +8 -3
- data/data/output_samples/testdata_en.txt +1 -1
- data/data/output_samples/testdata_en_categories.txt +206 -823
- data/data/output_samples/testdata_ja_categories.txt +47 -187
- data/lib/wp2txt/version.rb +1 -1
- metadata +2 -2
checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 50f291332872b0e3cd0b651662d7494ec9edd823fdb6ba6a928f501a37ea06c3
+  data.tar.gz: ec4891f6a30c7bc2f8f0a6fd3ec56618c9f706ea277207e7f955347417959f7e
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: afa3770c47bc25252993bfddf6da6e99a7bca87d4d899b3f8ce44d8a6298d29a19ce06fe9b64166316a31672a76d7d4530887e77d98212bc8f17a350c0e1598a
+  data.tar.gz: ef6f5b11b8a7d2ae5eeb640b0f2319bea9ee1209b0ab1dd78833f3cde41149fb8468871d90ba75b00c62876fccfa5c5f7cca6fc2420d4769c3c35a7bd9aa8786
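A `.gem` package is an ordinary tar archive, and the `checksums.yaml` shown above (stored gzipped as `checksums.yaml.gz` in the standard gem layout) records digests for the other two members, `metadata.gz` and `data.tar.gz`. A minimal Ruby sketch for re-checking the SHA256 entries, assuming the archive has already been unpacked with `tar -xf wp2txt-0.9.4.gem` into the current directory:

    # Minimal sketch: verify the SHA256 entries of an unpacked .gem archive.
    # Assumes metadata.gz, data.tar.gz, and checksums.yaml.gz sit in the
    # current directory (the standard members of a .gem tarball).
    require "digest"
    require "yaml"
    require "zlib"

    checksums = YAML.safe_load(Zlib::GzipReader.open("checksums.yaml.gz", &:read))

    %w[metadata.gz data.tar.gz].each do |file|
      actual   = Digest::SHA256.file(file).hexdigest
      expected = checksums["SHA256"][file]
      puts format("%-12s %s", file, actual == expected ? "OK" : "MISMATCH")
    end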
data/README.md CHANGED

@@ -1,14 +1,12 @@
 # WP2TXT
 
-Wikipedia dump file to text converter
-
-**IMPORTANT:** This is a project still work in progress and it could be slow, unstable, and even destructive! It should be used with caution.
+Wikipedia dump file to text converter that extracts both content and category data
 
 ## About
 
 WP2TXT extracts plain text data from Wikipedia dump file (encoded in XML/compressed with Bzip2) stripping all the MediaWiki markups and other metadata. It is originally intended to be useful for researchers who look for an easy way to obtain open-source multi-lingual corpora, but may be handy for other purposes.
 
-**UPDATE
+**UPDATE (July 2022)**: Version 0.9.3 has added a new option `category_only`. With this option enabled, wp2txt extracts article title and category info only. Please see output examples below.
 
 ## Features
 
@@ -18,7 +16,7 @@ WP2TXT extracts plain text data from Wikipedia dump file (encoded in XML/compres
 * Extract category information of each article
 
 ## Installation
-
+
     $ gem install wp2txt
 
 ## Usage
@@ -27,7 +25,7 @@ Obtain a Wikipedia dump file (from [here](http://dumps.wikimedia.org/backup-inde
 
     xxwiki-yyyymmdd-pages-articles.xml.bz2
 
-where `xx` is language code such as "en (English)" or "", and `yyyymmdd` is the date of creation (e.g.
+where `xx` is language code such as "en (English)" or "ja (Japanese)", and `yyyymmdd` is the date of creation (e.g. 20220720).
 
 ### Example 1
 
@@ -42,7 +40,7 @@ The following extracts text data, including list items and excluding tables.
 
 The following will extract only article titles and the categories to which each article belongs:
 
-    $ wp2txt -i xxwiki-yyyymmdd-pages-articles.xml.bz2 -o /output_dir
+    $ wp2txt --category-only -i xxwiki-yyyymmdd-pages-articles.xml.bz2 -o /output_dir
 
 - [Output example (English)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en_categories.txt)
 - [Output example (Japanese)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja_categories.txt)
@@ -73,7 +71,7 @@ Command line options are as follows:
   --category-only, -y: Extract only article title and categories (default: false)
   --file-size, -f <i>: Approximate size (in MB) of each output file
                        (default: 10)
-  -u, --num-threads=<i>: Number of threads to be spawned (capped to the number of CPU cores;
+  -u, --num-threads=<i>: Number of threads to be spawned (capped to the number of CPU cores;
                        set 99 to spawn max num of threads) (default: 4)
   --version, -v: Print version and exit
   --help, -h: Show this message
@@ -81,14 +79,14 @@ Command line options are as follows:
 ## Caveats
 
 * Certain types of data such as mathematical equations and computer source code are not properly converted. Please remember this software is originally intended for collecting “sentences” for linguistic studies.
-* Extraction of normal text data could sometimes fail for various reasons (e.g. illegal matching of begin/end tags, language-specific conventions of formatting, etc).
+* Extraction of normal text data could sometimes fail for various reasons (e.g. illegal matching of begin/end tags, language-specific conventions of formatting, etc).
 * The conversion process can take far more time than you would expect; it could take several hours or more when dealing with a huge data set such as the English Wikipedia in a low-spec environment.
 * Because of the nature of the task, WP2TXT needs much machine power and consumes a lot of memory/storage resources. The process thus could halt unexpectedly. It may even get stuck, in the worst case, without getting gracefully terminated. Please understand this and use the software __at your own risk__.
 
 ### Useful Links
 
 * [Wikipedia Database backup dumps](http://dumps.wikimedia.org/backup-index.html)
-
+
 ### Author
 
 * Yoichiro Hasebe (<yohasebe@gmail.com>)
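Judging from the `bin/wp2txt` changes below, each record in `--category-only` output is a single line: the article title, a tab, then the comma-separated category list. For the Arabic language article that appears in the sample data further down, the record would look something like this (a hypothetical rendering, not copied from the sample file):

    Arabic language	Arabic language, Central Semitic languages, Fusional languages, ...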
data/bin/wp2txt CHANGED

@@ -64,13 +64,18 @@ wpconv = Wp2txt::Runner.new(parent, input_file, output_dir, tfile_size, num_thre
 
 wpconv.extract_text do |article|
   format_wiki!(article.title)
-  title = "[[#{article.title}]]\n"
 
-  if opts[:category] && !article.categories.empty?
+  if opts[:category_only]
+    title = "#{article.title}\t"
+    contents = article.categories.join(", ")
+    contents << "\n"
+  elsif opts[:category] && !article.categories.empty?
+    title = "\n[[#{article.title}]]\n\n"
     contents = "\nCATEGORIES: "
     contents << article.categories.join(", ")
     contents << "\n\n"
   else
+    title = "\n[[#{article.title}]]\n\n"
     contents = ""
   end
 
@@ -132,6 +137,6 @@ wpconv.extract_text do |article|
   if /\A[\s ]*\z/m =~ contents
     result = ""
   else
-    result = config[:title] ?
+    result = config[:title] ? title << contents : contents
   end
 end
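Pieced together from the two hunks above, the patched part of the `extract_text` block reads roughly as follows. This is a reconstruction from the diff, not verbatim source: the markup-stripping code between the two hunks is elided, and the truncated pre-0.9.4 lines are restored from context.

    wpconv.extract_text do |article|
      format_wiki!(article.title)

      if opts[:category_only]
        # New in 0.9.4: one line per article -- title, tab, category list
        title = "#{article.title}\t"
        contents = article.categories.join(", ")
        contents << "\n"
      elsif opts[:category] && !article.categories.empty?
        title = "\n[[#{article.title}]]\n\n"
        contents = "\nCATEGORIES: "
        contents << article.categories.join(", ")
        contents << "\n\n"
      else
        title = "\n[[#{article.title}]]\n\n"
        contents = ""
      end

      # ... MediaWiki markup stripping elided in the diff ...

      if /\A[\s ]*\z/m =~ contents
        result = ""
      else
        result = config[:title] ? title << contents : contents
      end
    end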
data/data/output_samples/testdata_en.txt CHANGED

@@ -28704,7 +28704,7 @@ File:Halkbank.jpg|Halkbank Tower (1993) designed by Doğan Tekeli and Sami Sisa
 
 * [http://www.esenbogaairport.com/ Esenboğa International Airport]
 
-[[
+[[Arabic language]]
 
 CATEGORIES: Arabic language, Central Semitic languages, Fusional languages, Languages of Algeria, Languages of Bahrain, Languages of Chad, Languages of Comoros, Languages of Djibouti, Languages of Eritrea, Languages of Gibraltar, Languages of Iraq, Languages of Israel, Languages of Jordan, Languages of Kuwait, Languages of Lebanon, Languages of Libya, Languages of Mauritania, Languages of Morocco, Languages of Oman, Languages of Qatar, Languages of Saudi Arabia, Languages of Somalia, Languages of Somaliland, Languages of Sudan, Languages of Syria, Languages of the United Arab Emirates, Languages of Tunisia, Languages of Yemen, Languages of Trinidad and Tobago, Requests for audio pronunciation (Arabic), Stress-timed languages, Subject–verb–object languages, Languages of Palestine
 