wp2txt 0.9.3 → 0.9.4

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 32966949db257b30be7a5c044965ce08426bdede1f1fa0dbb0a276361d1c69c2
-  data.tar.gz: ee0b08031ae75b9d08fd1f07e5d08e8e10135dba8b15bd44692f7c434b220262
+  metadata.gz: 50f291332872b0e3cd0b651662d7494ec9edd823fdb6ba6a928f501a37ea06c3
+  data.tar.gz: ec4891f6a30c7bc2f8f0a6fd3ec56618c9f706ea277207e7f955347417959f7e
 SHA512:
-  metadata.gz: 9dee99ed39d2da01c9aeda462291645533773ed537de06a1f9127d626c91bc421e92ef805c52764fd0a668cf104a3733ab5981e59bb35d4abebe2f2909c63e3f
-  data.tar.gz: 8c6d7a5841a47fa4643a229050be652e38879d03477568175e8b29fb6f77971633731ec2d5dbd429dc2499ecfc2bfbb58917389d89b9cd57743b86b97ceaa4af
+  metadata.gz: afa3770c47bc25252993bfddf6da6e99a7bca87d4d899b3f8ce44d8a6298d29a19ce06fe9b64166316a31672a76d7d4530887e77d98212bc8f17a350c0e1598a
+  data.tar.gz: ef6f5b11b8a7d2ae5eeb640b0f2319bea9ee1209b0ab1dd78833f3cde41149fb8468871d90ba75b00c62876fccfa5c5f7cca6fc2420d4769c3c35a7bd9aa8786
data/README.md CHANGED
@@ -1,14 +1,12 @@
 # WP2TXT
 
-Wikipedia dump file to text converter
-
-**IMPORTANT:** This is a project still work in progress and it could be slow, unstable, and even destructive! It should be used with caution.
+Wikipedia dump file to text converter that extracts both content and category data
 
 ## About
 
 WP2TXT extracts plain text data from a Wikipedia dump file (encoded in XML and compressed with Bzip2), stripping all the MediaWiki markup and other metadata. It is originally intended to be useful for researchers looking for an easy way to obtain open-source multilingual corpora, but it may be handy for other purposes.
 
-**UPDATE:** Version 0.9.3 has added a new option `category_only`. With this option enabled, wp2txt extracts article title and category info only. Please see output examples below.
+**UPDATE (July 2022):** Version 0.9.3 added a new option, `category_only`. With this option enabled, wp2txt extracts article titles and category information only. Please see the output examples below.
 
 ## Features
 
@@ -18,7 +16,7 @@ WP2TXT extracts plain text data from Wikipedia dump file (encoded in XML/compres
 * Extract category information of each article
 
 ## Installation
-
+
     $ gem install wp2txt
 
 ## Usage
@@ -27,7 +25,7 @@ Obtain a Wikipedia dump file (from [here](http://dumps.wikimedia.org/backup-inde
 
     xxwiki-yyyymmdd-pages-articles.xml.bz2
 
-where `xx` is language code such as "en (English)" or "", and `yyyymmdd` is the date of creation (e.g. 20120601).
+where `xx` is a language code such as "en (English)" or "ja (Japanese)", and `yyyymmdd` is the date of creation (e.g. 20220720).
 
 ### Example 1
 
@@ -42,7 +40,7 @@ The following extracts text data, including list items and excluding tables.
 
 The following will extract only article titles and the categories to which each article belongs:
 
-    $ wp2txt -i xxwiki-yyyymmdd-pages-articles.xml.bz2 -o /output_dir
+    $ wp2txt --category-only -i xxwiki-yyyymmdd-pages-articles.xml.bz2 -o /output_dir
 
 - [Output example (English)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en_categories.txt)
 - [Output example (Japanese)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja_categories.txt)
@@ -73,7 +71,7 @@ Command line options are as follows:
     --category-only, -y: Extract only article title and categories (default: false)
     --file-size, -f <i>: Approximate size (in MB) of each output file
                          (default: 10)
-    -u, --num-threads=<i>: Number of threads to be spawned (capped to the number of CPU cores;
+    -u, --num-threads=<i>: Number of threads to be spawned (capped to the number of CPU cores;
                          set 99 to spawn max num of threads) (default: 4)
     --version, -v: Print version and exit
     --help, -h: Show this message
@@ -81,14 +79,14 @@ Command line options are as follows:
 ## Caveats
 
 * Certain types of data, such as mathematical equations and computer source code, are not properly converted. Please remember that this software is originally intended for collecting “sentences” for linguistic studies.
-* Extraction of normal text data could sometimes fail for various reasons (e.g. illegal matching of begin/end tags, language-specific formatting conventions, etc.).
+* Extraction of normal text data could sometimes fail for various reasons (e.g. illegal matching of begin/end tags, language-specific formatting conventions, etc.).
 * The conversion process can take far longer than you would expect. It could take several hours or more when dealing with a huge data set such as the English Wikipedia in a low-spec environment.
 * Because of the nature of the task, WP2TXT needs a lot of machine power and consumes a lot of memory/storage resources. The process could thus halt unexpectedly; in the worst case it may even get stuck without terminating gracefully. Please understand this and use the software __at your own risk__.
 
 ### Useful Links
 
 * [Wikipedia Database backup dumps](http://dumps.wikimedia.org/backup-index.html)
-
+
 ### Author
 
 * Yoichiro Hasebe (<yohasebe@gmail.com>)
data/bin/wp2txt CHANGED
@@ -64,13 +64,18 @@ wpconv = Wp2txt::Runner.new(parent, input_file, output_dir, tfile_size, num_thre
 
 wpconv.extract_text do |article|
   format_wiki!(article.title)
-  title = "[[#{article.title}]]\n"
 
-  if opts[:category] && !article.categories.empty?
+  if opts[:category_only]
+    title = "#{article.title}\t"
+    contents = article.categories.join(", ")
+    contents << "\n"
+  elsif opts[:category] && !article.categories.empty?
+    title = "\n[[#{article.title}]]\n\n"
     contents = "\nCATEGORIES: "
     contents << article.categories.join(", ")
     contents << "\n\n"
   else
+    title = "\n[[#{article.title}]]\n\n"
     contents = ""
   end
 
@@ -132,6 +137,6 @@ wpconv.extract_text do |article|
   if /\A[\s ]*\z/m =~ contents
     result = ""
   else
-    result = config[:title] ? "\n#{title}\n" << contents : contents
+    result = config[:title] ? title << contents : contents
   end
 end
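
The two hunks above change how each article's `title` and `contents` strings are built: the new `category_only` branch emits one tab-separated record per article, and the newlines around `[[title]]` are now attached when `title` is first built, so the second hunk no longer wraps `title` in extra newlines. As a minimal sketch (not part of the gem), the logic reduces to the following; the `Article` struct and `render` helper are hypothetical stand-ins for the real article objects yielded by `Wp2txt::Runner#extract_text`:

    # Hypothetical stand-in for the article objects yielded by extract_text.
    Article = Struct.new(:title, :categories)

    def render(article, opts)
      if opts[:category_only]
        # New in 0.9.4: one tab-separated record per article.
        title = "#{article.title}\t"
        contents = article.categories.join(", ") + "\n"
      elsif opts[:category] && !article.categories.empty?
        title = "\n[[#{article.title}]]\n\n"
        contents = "\nCATEGORIES: " + article.categories.join(", ") + "\n\n"
      else
        title = "\n[[#{article.title}]]\n\n"
        contents = ""
      end
      title + contents  # as in the second hunk when config[:title] is set
    end

    article = Article.new("Arabic language", ["Arabic language", "Central Semitic languages"])
    puts render(article, { category_only: true })
    # prints: Arabic language<TAB>Arabic language, Central Semitic languages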
@@ -28704,7 +28704,7 @@ File:Halkbank.jpg|Halkbank Tower (1993) designed by Doğan Tekeli and Sami Sisa
 
 * [http://www.esenbogaairport.com/ Esenboğa International Airport]
 
-[[Anaconda]]
+[[Arabic language]]
 
 CATEGORIES: Arabic language, Central Semitic languages, Fusional languages, Languages of Algeria, Languages of Bahrain, Languages of Chad, Languages of Comoros, Languages of Djibouti, Languages of Eritrea, Languages of Gibraltar, Languages of Iraq, Languages of Israel, Languages of Jordan, Languages of Kuwait, Languages of Lebanon, Languages of Libya, Languages of Mauritania, Languages of Morocco, Languages of Oman, Languages of Qatar, Languages of Saudi Arabia, Languages of Somalia, Languages of Somaliland, Languages of Sudan, Languages of Syria, Languages of the United Arab Emirates, Languages of Tunisia, Languages of Yemen, Languages of Trinidad and Tobago, Requests for audio pronunciation (Arabic), Stress-timed languages, Subject–verb–object languages, Languages of Palestine
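
For downstream use, the tab-separated records produced by `--category-only` (title, tab, comma-separated category list) can be read back with a few lines of Ruby. A minimal sketch; the input file name is a hypothetical placeholder:

    # Load wp2txt --category-only output into a Hash of title => categories.
    # "xxwiki-yyyymmdd-categories.txt" is a hypothetical file name.
    categories_by_title = {}
    File.foreach("xxwiki-yyyymmdd-categories.txt") do |line|
      title, cats = line.chomp.split("\t", 2)
      categories_by_title[title] = cats.to_s.split(", ")
    end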