wp2txt 0.9.3 → 0.9.4
- checksums.yaml +4 -4
- data/README.md +8 -10
- data/bin/wp2txt +8 -3
- data/data/output_samples/testdata_en.txt +1 -1
- data/data/output_samples/testdata_en_categories.txt +206 -823
- data/data/output_samples/testdata_ja_categories.txt +47 -187
- data/lib/wp2txt/version.rb +1 -1
- metadata +2 -2
checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 50f291332872b0e3cd0b651662d7494ec9edd823fdb6ba6a928f501a37ea06c3
+  data.tar.gz: ec4891f6a30c7bc2f8f0a6fd3ec56618c9f706ea277207e7f955347417959f7e
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: afa3770c47bc25252993bfddf6da6e99a7bca87d4d899b3f8ce44d8a6298d29a19ce06fe9b64166316a31672a76d7d4530887e77d98212bc8f17a350c0e1598a
+  data.tar.gz: ef6f5b11b8a7d2ae5eeb640b0f2319bea9ee1209b0ab1dd78833f3cde41149fb8468871d90ba75b00c62876fccfa5c5f7cca6fc2420d4769c3c35a7bd9aa8786
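A `.gem` package is an ordinary tar archive, and the `checksums.yaml` shown above (stored gzipped as `checksums.yaml.gz` in the standard gem layout) records digests for the other two members, `metadata.gz` and `data.tar.gz`. A minimal Ruby sketch for re-checking the SHA256 entries, assuming the archive has already been unpacked with `tar -xf wp2txt-0.9.4.gem` into the current directory:

    # Minimal sketch: verify the SHA256 entries of an unpacked .gem archive.
    # Assumes metadata.gz, data.tar.gz, and checksums.yaml.gz sit in the
    # current directory (the standard members of a .gem tarball).
    require "digest"
    require "yaml"
    require "zlib"

    checksums = YAML.safe_load(Zlib::GzipReader.open("checksums.yaml.gz", &:read))

    %w[metadata.gz data.tar.gz].each do |file|
      actual   = Digest::SHA256.file(file).hexdigest
      expected = checksums["SHA256"][file]
      puts format("%-12s %s", file, actual == expected ? "OK" : "MISMATCH")
    end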
data/README.md CHANGED

@@ -1,14 +1,12 @@
 # WP2TXT
 
-Wikipedia dump file to text converter
-
-**IMPORTANT:** This is a project still work in progress and it could be slow, unstable, and even destructive! It should be used with caution.
+Wikipedia dump file to text converter that extracts both content and category data
 
 ## About
 
 WP2TXT extracts plain text data from Wikipedia dump file (encoded in XML/compressed with Bzip2) stripping all the MediaWiki markups and other metadata. It is originally intended to be useful for researchers who look for an easy way to obtain open-source multi-lingual corpora, but may be handy for other purposes.
 
-**UPDATE
+**UPDATE (July 2022)**: Version 0.9.3 has added a new option `category_only`. With this option enabled, wp2txt extracts article title and category info only. Please see output examples below.
 
 ## Features
 
@@ -18,7 +16,7 @@ WP2TXT extracts plain text data from Wikipedia dump file (encoded in XML/compres
 * Extract category information of each article
 
 ## Installation
-
+
     $ gem install wp2txt
 
 ## Usage
@@ -27,7 +25,7 @@ Obtain a Wikipedia dump file (from [here](http://dumps.wikimedia.org/backup-inde
 
     xxwiki-yyyymmdd-pages-articles.xml.bz2
 
-where `xx` is language code such as "en (English)" or "", and `yyyymmdd` is the date of creation (e.g.
+where `xx` is language code such as "en (English)" or "ja (Japanese)", and `yyyymmdd` is the date of creation (e.g. 20220720).
 
 ### Example 1
 
@@ -42,7 +40,7 @@ The following extracts text data, including list items and excluding tables.
 
 The following will extract only article titles and the categories to which each article belongs:
 
-    $ wp2txt -i xxwiki-yyyymmdd-pages-articles.xml.bz2 -o /output_dir
+    $ wp2txt --category-only -i xxwiki-yyyymmdd-pages-articles.xml.bz2 -o /output_dir
 
 - [Output example (English)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en_categories.txt)
 - [Output example (Japanese)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja_categories.txt)
@@ -73,7 +71,7 @@ Command line options are as follows:
   --category-only, -y: Extract only article title and categories (default: false)
   --file-size, -f <i>: Approximate size (in MB) of each output file
                        (default: 10)
-  -u, --num-threads=<i>: Number of threads to be spawned (capped to the number of CPU cores;
+  -u, --num-threads=<i>: Number of threads to be spawned (capped to the number of CPU cores;
                        set 99 to spawn max num of threads) (default: 4)
   --version, -v: Print version and exit
   --help, -h: Show this message
@@ -81,14 +79,14 @@ Command line options are as follows:
 ## Caveats
 
 * Certain types of data such as mathematical equations and computer source code are not properly converted. Please remember this software is originally intended for collecting “sentences” for linguistic studies.
-* Extraction of normal text data could sometimes fail for various reasons (e.g. illegal matching of begin/end tags, language-specific conventions of formatting, etc).
+* Extraction of normal text data could sometimes fail for various reasons (e.g. illegal matching of begin/end tags, language-specific conventions of formatting, etc).
 * The conversion process can take far more time than you would expect; it could take several hours or more when dealing with a huge data set such as the English Wikipedia in a low-spec environment.
 * Because of the nature of the task, WP2TXT needs much machine power and consumes a lot of memory/storage resources. The process thus could halt unexpectedly. It may even get stuck, in the worst case, without getting gracefully terminated. Please understand this and use the software __at your own risk__.
 
 ### Useful Links
 
 * [Wikipedia Database backup dumps](http://dumps.wikimedia.org/backup-index.html)
-
+
 ### Author
 
 * Yoichiro Hasebe (<yohasebe@gmail.com>)
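Judging from the `bin/wp2txt` changes below, each record in `--category-only` output is a single line: the article title, a tab, then the comma-separated category list. For the Arabic language article that appears in the sample data further down, the record would look something like this (a hypothetical rendering, not copied from the sample file):

    Arabic language	Arabic language, Central Semitic languages, Fusional languages, ...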
data/bin/wp2txt CHANGED

@@ -64,13 +64,18 @@ wpconv = Wp2txt::Runner.new(parent, input_file, output_dir, tfile_size, num_thre
 
 wpconv.extract_text do |article|
   format_wiki!(article.title)
-  title = "[[#{article.title}]]\n"
 
-  if opts[:category] && !article.categories.empty?
+  if opts[:category_only]
+    title = "#{article.title}\t"
+    contents = article.categories.join(", ")
+    contents << "\n"
+  elsif opts[:category] && !article.categories.empty?
+    title = "\n[[#{article.title}]]\n\n"
     contents = "\nCATEGORIES: "
     contents << article.categories.join(", ")
     contents << "\n\n"
   else
+    title = "\n[[#{article.title}]]\n\n"
     contents = ""
   end
 
@@ -132,6 +137,6 @@ wpconv.extract_text do |article|
   if /\A[\s ]*\z/m =~ contents
     result = ""
   else
-    result = config[:title] ?
+    result = config[:title] ? title << contents : contents
   end
 end
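Pieced together from the two hunks above, the patched part of the `extract_text` block reads roughly as follows. This is a reconstruction from the diff, not verbatim source: the markup-stripping code between the two hunks is elided, and the truncated pre-0.9.4 lines are restored from context.

    wpconv.extract_text do |article|
      format_wiki!(article.title)

      if opts[:category_only]
        # New in 0.9.4: one line per article -- title, tab, category list
        title = "#{article.title}\t"
        contents = article.categories.join(", ")
        contents << "\n"
      elsif opts[:category] && !article.categories.empty?
        title = "\n[[#{article.title}]]\n\n"
        contents = "\nCATEGORIES: "
        contents << article.categories.join(", ")
        contents << "\n\n"
      else
        title = "\n[[#{article.title}]]\n\n"
        contents = ""
      end

      # ... MediaWiki markup stripping elided in the diff ...

      if /\A[\s ]*\z/m =~ contents
        result = ""
      else
        result = config[:title] ? title << contents : contents
      end
    end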
data/data/output_samples/testdata_en.txt CHANGED

@@ -28704,7 +28704,7 @@ File:Halkbank.jpg|Halkbank Tower (1993) designed by Doğan Tekeli and Sami Sisa
 
 * [http://www.esenbogaairport.com/ Esenboğa International Airport]
 
-[[
+[[Arabic language]]
 
 CATEGORIES: Arabic language, Central Semitic languages, Fusional languages, Languages of Algeria, Languages of Bahrain, Languages of Chad, Languages of Comoros, Languages of Djibouti, Languages of Eritrea, Languages of Gibraltar, Languages of Iraq, Languages of Israel, Languages of Jordan, Languages of Kuwait, Languages of Lebanon, Languages of Libya, Languages of Mauritania, Languages of Morocco, Languages of Oman, Languages of Qatar, Languages of Saudi Arabia, Languages of Somalia, Languages of Somaliland, Languages of Sudan, Languages of Syria, Languages of the United Arab Emirates, Languages of Tunisia, Languages of Yemen, Languages of Trinidad and Tobago, Requests for audio pronunciation (Arabic), Stress-timed languages, Subject–verb–object languages, Languages of Palestine
 