wp2txt 0.9.3 → 0.9.5.1
- checksums.yaml +4 -4
- data/.gitignore +1 -0
- data/README.md +40 -25
- data/bin/wp2txt +13 -7
- data/data/output_samples/testdata_en.txt +11923 -36921
- data/data/output_samples/testdata_en_categories.txt +131 -823
- data/data/output_samples/testdata_en_summary.txt +1368 -0
- data/data/output_samples/testdata_ja.txt +24812 -4686
- data/data/output_samples/testdata_ja_categories.txt +205 -187
- data/data/output_samples/testdata_ja_summary.txt +1684 -0
- data/data/testdata_en.bz2 +0 -0
- data/data/testdata_ja.bz2 +0 -0
- data/lib/wp2txt/article.rb +3 -2
- data/lib/wp2txt/utils.rb +82 -54
- data/lib/wp2txt/version.rb +1 -1
- metadata +5 -3
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 2aa3c73ab9202aa22974bbb60dad95f10b8abb434cd923fe5f2f6e917f89ac18
+  data.tar.gz: 790d280ee298ff08c5dde80e355f69a1803b949abe14c81912ec6119f3371d59
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 39f16e5df3c22f60ef4c0f3c9fe05c5f9ee0732fa90dd9916dd7bf6ffdc05e991afd67425fa6fdb9661cd206e4e16e0db032c131cb59c9d71b7fd2b668635429
+  data.tar.gz: b7c700c667220e11b39fd25a91c76609d3f2608599223f8525e0c8b4b03e29fd1c9547ec2bf30117ca4d65aa0cb09db15f9841ce4790fbfe73a16bfeb5cebfc3
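These checksums are what RubyGems uses to verify the packaged `metadata.gz` and `data.tar.gz` inside the `.gem` file. Below is a minimal Ruby sketch (not part of the gem) of that verification, assuming `checksums.yaml` and the two archives have been extracted into the current directory; the paths are illustrative.

```ruby
# Minimal sketch: compare local files against the SHA256 values in checksums.yaml.
# Assumes metadata.gz and data.tar.gz were extracted from the .gem into the
# current directory; adjust the paths as needed.
require "digest"
require "yaml"

checksums = YAML.load_file("checksums.yaml")
%w[metadata.gz data.tar.gz].each do |name|
  expected = checksums["SHA256"][name]
  actual   = Digest::SHA256.file(name).hexdigest
  puts "#{name}: #{actual == expected ? 'OK' : 'MISMATCH'}"
end
```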
data/.gitignore
CHANGED
data/README.md
CHANGED
@@ -1,35 +1,35 @@
 # WP2TXT
 
-Wikipedia dump file to text converter
-
-**IMPORTANT:** This is a project still work in progress and it could be slow, unstable, and even destructive! It should be used with caution.
+Wikipedia dump file to text converter that extracts both content and category data
 
 ## About
 
-WP2TXT extracts plain text data from Wikipedia dump file (encoded in XML/compressed with Bzip2)
+WP2TXT extracts plain text data from a Wikipedia dump file (encoded in XML / compressed with Bzip2), removing all MediaWiki markup and other metadata. It was developed for researchers who want easy access to open-source multilingual corpora, but may be used for other purposes as well.
+
+**UPDATE (July 2022)**: Version 0.9.3 adds a new option `category_only`. When this option is enabled, wp2txt will extract only the title and category information of the article. See output examples below.
 
-**UPDATE:** Version 0.9.3 has added a new option `category_only`. With this option enabled, wp2txt extracts article title and category info only. Please see output examples below.
 
 ## Features
 
-*
-*
-*
-*
+* Converts Wikipedia dump files in various languages
+* Creates output files of specified size
+* Can specify text elements to be extracted and converted (page titles, section titles, lists, tables)
+* Can extract category information for each article
+
 
 ## Installation
-
+
     $ gem install wp2txt
 
 ## Usage
 
 Obtain a Wikipedia dump file (from [here](http://dumps.wikimedia.org/backup-index.html)) with a file name such as:
 
-
+> `xxwiki-yyyymmdd-pages-articles.xml.bz2`
 
-where `xx` is language code such as "en (English)" or "", and `yyyymmdd` is the date of creation (e.g.
+where `xx` is language code such as "en (English)" or "ja (Japanese)", and `yyyymmdd` is the date of creation (e.g. 20220720).
 
-### Example 1
+### Example 1: Basic
 
 The following extracts text data, including list items and excluding tables.
 
@@ -38,15 +38,29 @@ The following extracts text data, including list items and excluding tables.
 - [Output example (English)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en.txt)
 - [Output example (Japanese)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja.txt)
 
-### Example 2
+### Example 2: Title and category information only
 
 The following will extract only article titles and the categories to which each article belongs:
 
-    $ wp2txt -i xxwiki-yyyymmdd-pages-articles.xml.bz2 -o /output_dir
+    $ wp2txt --category-only -i xxwiki-yyyymmdd-pages-articles.xml.bz2 -o /output_dir
+
+Each line of the output data contains the title and the categories of an article:
+
+> title `TAB` category1`,` category2`,` category3`,` ...
 
 - [Output example (English)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en_categories.txt)
 - [Output example (Japanese)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja_categories.txt)
 
+### Example 3: Title, category, and summary text only
+
+The following will extract only article titles, the categories to which each article belongs, and text blocks before the first heading of the article:
+
+    $ wp2txt --summary-only -i xxwiki-yyyymmdd-pages-articles.xml.bz2 -o /output_dir
+
+- [Output example (English)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en_summary.txt)
+- [Output example (Japanese)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja_summary.txt)
+
+
 ## Options
 
 Command line options are as follows:
@@ -71,35 +85,36 @@ Command line options are as follows:
   definitions, etc. (Default: true)
   --category, -g: Show article category information (default: true)
   --category-only, -y: Extract only article title and categories (default: false)
+  -s, --summary-only: Extract only article title, categories, and summary text before first heading
   --file-size, -f <i>: Approximate size (in MB) of each output file
   (default: 10)
-  -u, --num-threads=<i>: Number of threads to be spawned (capped to the number of CPU cores;
+  -u, --num-threads=<i>: Number of threads to be spawned (capped to the number of CPU cores;
   set 99 to spawn max num of threads) (default: 4)
   --version, -v: Print version and exit
   --help, -h: Show this message
 
 ## Caveats
 
-*
-*
-*
-*
+* Some data, such as mathematical formulas and computer source code, will not be converted correctly.
+* Some text data may not be extracted correctly for various reasons (incorrect matching of begin/end tags, language-specific formatting rules, etc.).
+* The conversion process can take longer than expected. When dealing with a huge data set such as the English Wikipedia on a low-spec environment, it can take several hours or more.
+* WP2TXT, by the nature of its task, requires a lot of machine power and consumes a large amount of memory/storage resources. Therefore, there is a possibility that the process may stop unexpectedly. In the worst case, the process may even freeze without terminating successfully. Please understand this and use at your own risk.
 
-
+## Useful Links
 
 * [Wikipedia Database backup dumps](http://dumps.wikimedia.org/backup-index.html)
-
-
+
+## Author
 
 * Yoichiro Hasebe (<yohasebe@gmail.com>)
 
-
+## References
 
 The author will appreciate your mentioning one of these in your research.
 
 * Yoichiro HASEBE. 2006. [Method for using Wikipedia as Japanese corpus.](http://ci.nii.ac.jp/naid/110006226727) _Doshisha Studies in Language and Culture_ 9(2), 373-403.
 * 長谷部陽一郎. 2006. [Wikipedia日本語版をコーパスとして用いた言語研究の手法](http://ci.nii.ac.jp/naid/110006226727). 『言語文化』9(2), 373-403.
 
-
+## License
 
 This software is distributed under the MIT License. Please see the LICENSE file.
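To make the `--category-only` output format described in Example 2 concrete, here is a small Ruby sketch (not part of the gem) that reads such a file back in. The input path refers to the sample output file shipped with the gem; any file produced by the command would work the same way.

```ruby
# Minimal sketch: parse --category-only output ("title<TAB>category1, category2, ...").
# The input path below is illustrative; use any file produced by the command.
File.foreach("testdata_en_categories.txt") do |line|
  title, categories = line.chomp.split("\t", 2)
  next if title.nil? || title.empty?
  category_list = categories.to_s.split(",").map(&:strip)
  puts "#{title}: #{category_list.size} categories"
end
```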
data/bin/wp2txt
CHANGED
@@ -27,7 +27,7 @@ EOS
   opt :input_file, "Wikipedia dump file with .bz2 (compressed) or .txt (uncompressed) format", :required => true
   opt :output_dir, "Output directory", :default => Dir::pwd, :type => String
   opt :convert, "Output in plain text (converting from XML)", :default => true
-  opt :list, "Show list items in output", :default =>
+  opt :list, "Show list items in output", :default => false
   opt :heading, "Show section titles in output", :default => true, :short => "-d"
   opt :title, "Show page titles in output", :default => true
   opt :table, "Show table source code in output", :default => false
@@ -38,6 +38,7 @@ EOS
   opt :marker, "Show symbols prefixed to list items, definitions, etc.", :default => true
   opt :category, "Show article category information", :default => true
   opt :category_only, "Extract only article title and categories", :default => false
+  opt :summary_only, "Extract only article title, categories, and summary text before first heading", :default => false
   opt :file_size, "Approximate size (in MB) of each output file", :default => 10
   opt :num_threads, "Number of threads to be spawned (capped to the number of CPU cores; set 99 to spawn max num of threads)", :default => 4
 end
@@ -50,10 +51,9 @@ tfile_size = opts[:file_size]
 num_threads = opts[:num_threads]
 convert = opts[:convert]
 strip_tmarker = opts[:marker] ? false : true
-opt_array = [:title, :list, :heading, :table, :redirect, :multiline]
+opt_array = [:title, :list, :heading, :table, :redirect, :multiline, :category, :category_only, :summary_only]
 $leave_inline_template = true if opts[:inline]
 $leave_ref = true if opts[:ref]
-# $leave_table = true if opts[:table]
 config = {}
 opt_array.each do |opt|
   config[opt] = opts[opt]
@@ -64,20 +64,26 @@ wpconv = Wp2txt::Runner.new(parent, input_file, output_dir, tfile_size, num_thre
 
 wpconv.extract_text do |article|
   format_wiki!(article.title)
-  title = "[[#{article.title}]]\n"
 
-  if
+  if config[:category_only]
+    title = "#{article.title}\t"
+    contents = article.categories.join(", ")
+    contents << "\n"
+  elsif config[:category] && !article.categories.empty?
+    title = "\n[[#{article.title}]]\n\n"
     contents = "\nCATEGORIES: "
     contents << article.categories.join(", ")
     contents << "\n\n"
   else
+    title = "\n[[#{article.title}]]\n\n"
     contents = ""
   end
 
-  unless
+  unless config[:category_only]
     article.elements.each do |e|
       case e.first
       when :mw_heading
+        break if config[:summary_only]
        next if !config[:heading]
        format_wiki!(e.last)
        line = e.last
@@ -132,6 +138,6 @@ wpconv.extract_text do |article|
   if /\A[\s ]*\z/m =~ contents
     result = ""
   else
-    result = config[:title] ?
+    result = config[:title] ? title << contents : contents
   end
 end
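Putting the bin/wp2txt changes above together: `--category-only` emits one tab-separated line per article, `--summary-only` stops at the first section heading, and the default mode keeps the `[[Title]]` / `CATEGORIES:` layout. The simplified, self-contained sketch below illustrates that dispatch; `render_article` is a hypothetical helper, not the actual code in the gem.

```ruby
# Simplified illustration of the output modes added in this release.
# render_article is a hypothetical helper; the real logic lives in bin/wp2txt.
def render_article(title, categories, elements, config)
  if config[:category_only]
    # one line per article: "title<TAB>category1, category2, ..."
    return "#{title}\t#{categories.join(', ')}\n"
  end

  out = "\n[[#{title}]]\n\n"
  out << "\nCATEGORIES: #{categories.join(', ')}\n\n" if config[:category] && !categories.empty?

  elements.each do |type, text|
    # --summary-only keeps only the text blocks before the first section heading
    break if type == :mw_heading && config[:summary_only]
    out << text << "\n"
  end
  out
end
```

For example, `render_article("Tokyo", ["Cities in Japan"], [], category_only: true)` would return `"Tokyo\tCities in Japan\n"`, matching the `--category-only` format shown in the README diff.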