RubyGems - wp2txt - Versions diffs - 0.9.2 → 0.9.5 - Mend

wp2txt 0.9.2 → 0.9.5

Files changed (17)

checksums.yaml +4 -4
data/.gitignore +1 -0
data/README.md +68 -31
data/bin/wp2txt +62 -53
data/data/output_samples/testdata_en.txt +11923 -36921
data/data/output_samples/testdata_en_categories.txt +132 -0
data/data/output_samples/testdata_en_summary.txt +1368 -0
data/data/output_samples/testdata_ja.txt +24812 -4686
data/data/output_samples/testdata_ja_categories.txt +206 -0
data/data/output_samples/testdata_ja_summary.txt +1684 -0
data/data/testdata_en.bz2 +0 -0
data/data/testdata_ja.bz2 +0 -0
data/lib/wp2txt/article.rb +3 -2
data/lib/wp2txt/utils.rb +51 -27
data/lib/wp2txt/version.rb +1 -1
data/lib/wp2txt.rb +2 -2
metadata +7 -3

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 3ed3d7e29a8f1c6b5f97ca0da646ddfb53ae88add38f647eae0bdc03e626269e
-  data.tar.gz: '009188addebcd908f449f2ce4cf39036406f3816cafeeb61beba097fe036e890'
+  metadata.gz: bf8270b3488c0045a067f71c155db8d9ac6366a94d825eed9bc6d05c95598345
+  data.tar.gz: 8802949a232c60d8b5ae6f93726154f7a6b40436b478919657f58f4bdc54add3
 SHA512:
-  metadata.gz: d91531685df204222ab7bae9b3153653d61ccd36270e36f14575cabc3c2b1d6009bfa15f9033cb8eeb837f7c1a97fdb6303611166ec62ca96b9e4c8fc1e1ec15
-  data.tar.gz: 19183feee7eb8f7c03d3f7bf60eebb7e75ffeb6c6eec6967a8c3e480f82f2b48b6e171d2aa22c7aa44a9336b981ad51dfd37ab423c3db2fe1a0d854860c37231
+  metadata.gz: 0a10804d78c33e035aaf429dd4613f84f3db0c6f22a6c36617a1fda25f03c0fd8fac224ec9e6009ab6ddddb475d73e6eda4c21606f89ef94950bc3749ce4f452
+  data.tar.gz: 4a8ea2f0900c6f97d3dcaf6c6387b3543d23962ea064cf1b18a8c293b4664c16fcabca9230aa83b0f5685eacffed74968fe82529566080b7821fd944c7bf275d

data/.gitignore CHANGED

@@ -18,3 +18,4 @@ tmp
 .DS_Store
 *.bak
 *.~

data/README.md CHANGED

@@ -1,32 +1,67 @@
 # WP2TXT
-Wikipedia dump file to text converter
+Wikipedia dump file to text converter that extracts both content and category data
-**IMPORTANT:** This is a project still work in progress and it could be slow, unstable, and even destructive! It should be used with caution.
+## About
-### About ###
+WP2TXT extracts plain text data from a Wikipedia dump file (encoded in XML / compressed with Bzip2), removing all MediaWiki markup and other metadata. It was developed for researchers who want easy access to open-source multilingual corpora, but may be used for other purposes as well.
-WP2TXT extracts plain text data from Wikipedia dump file (encoded in XML/compressed with Bzip2) stripping all the MediaWiki markups and other metadata. It is originally intended to be useful for researchers who look for an easy way to obtain open-source multi-lingual corpora, but may be handy for other purposes.
+**UPDATE (July 2022)**: Version 0.9.3 adds a new option `category_only`. When this option is enabled, wp2txt will extract only the title and category information of the article. See output examples below.
-**UPDATE:** Version 0.9.1 has added a new option `num-threads`, which improves the performance significantly . Note also that `--category` option is enabled by default, resulting with output format somewhat different from previous versions. Check out the new format using test data in `data/output_samples` folder before going on to convert a huge wikipedia dump.
-### Features ###
+## Features
-* Convert dump files of Wikipedia of various languages (I hope).
-* Create output files of specified size.
-* Allow users to specify text elements to be extracted/converted (page titles, section titles, lists, and tables).
+* Converts Wikipedia dump files in various languages
+* Creates output files of specified size
+* Can specify text elements to be extracted and converted (page titles, section titles, lists, tables)
+* Can extract category information for each article
+## Installation
-### Installation
     $ gem install wp2txt
-### Usage
+## Usage
 Obtain a Wikipedia dump file (from [here](http://dumps.wikimedia.org/backup-index.html)) with a file name such as:
-    xxwiki-yyyymmdd-pages-articles.xml.bz2
+> `xxwiki-yyyymmdd-pages-articles.xml.bz2`
+where `xx` is language code such as "en (English)" or "ja (Japanese)", and  `yyyymmdd` is the date of creation (e.g. 20220720).
+### Example 1: Basic
+The following extracts text data, including list items and excluding tables.
+    $ wp2txt -i xxwiki-yyyymmdd-pages-articles.xml.bz2 -o /output_dir
+- [Output example (English)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en.txt)
+- [Output example (Japanese)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja.txt)
+### Example 2: Title and category information only
+The following will extract only article titles and the categories to which each article belongs:
+    $ wp2txt --category-only -i xxwiki-yyyymmdd-pages-articles.xml.bz2 -o /output_dir
-where `xx` is language code such as "en (English)" or "ja (Japanese)", and  `yyyymmdd` is the date of creation (e.g. 20120601).
+Each line of the output data contains the title and the categories of an article:
+> title `TAB` category1`,` category2`,` category3`,` ...
+- [Output example (English)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en_categories.txt)
+- [Output example (Japanese)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja_categories.txt)
+### Example 3: Title, category, and summary text only
+The following will extract only article titles, the categories to which each article belongs, and text blocks before the first heading of the article:
+    $ wp2txt --summary-only -i xxwiki-yyyymmdd-pages-articles.xml.bz2 -o /output_dir
+- [Output example (English)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en_summary.txt)
+- [Output example (Japanese)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja_summary.txt)
+## Options
 Command line options are as follows:
@@ -40,44 +75,46 @@ Command line options are as follows:
           --list, --no-list, -l:   Show list items in output (default: true)
     --heading, --no-heading, -d:   Show section titles in output (default: true)
         --title, --no-title, -t:   Show page titles in output (default: true)
-                    --table, -a:   Show table source code in output
-                   --inline, -n:   leave inline template notations unmodified
-                --multiline, -m:   leave multiline template notations unmodified
-                      --ref, -r:   leave reference notations in the format
+                    --table, -a:   Show table source code in output (default: false)
+                   --inline, -n:   leave inline template notations unmodified (default: false)
+                --multiline, -m:   leave multiline template notations unmodified (default: false)
+                      --ref, -r:   leave reference notations in the format (default: false)
                                    [ref]...[/ref]
-                 --redirect, -e:   Show redirect destination
+                 --redirect, -e:   Show redirect destination (default: false)
       --marker, --no-marker, -k:   Show symbols prefixed to list items,
                                    definitions, etc. (Default: true)
-                 --category, -g:   Show article category information
+                 --category, -g:   Show article category information (default: true)
+            --category-only, -y:   Extract only article title and categories (default: false)
+             -s, --summary-only:   Extract only article title, categories, and summary text before first heading
             --file-size, -f <i>:   Approximate size (in MB) of each output file
                                    (default: 10)
-          -u, --num-threads=<i>:   Number of threads to be spawned (capped to the number of CPU cores;
+          -u, --num-threads=<i>:   Number of threads to be spawned (capped to the number of CPU cores;
                                    set 99 to spawn max num of threads) (default: 4)
                   --version, -v:   Print version and exit
                      --help, -h:   Show this message
-### Caveats ###
+## Caveats
-* Certain types of data such as mathematical equations and computer source code are not be properly converted.  Please remember this software is originally intended for correcting “sentences” for linguistic studies.
-* Extraction of normal text data could sometimes fail for various reasons (e.g. illegal matching of begin/end tags, language-specific conventions of formatting, etc).
-* Conversion process can take far more than you would expect. It could take several hours or more when dealing with a huge data set such as the English Wikipedia on a low-spec environments.
-* Because of nature of the task, WP2TXT needs much machine power and consumes a lot of memory/storage resources. The process thus could halt unexpectedly. It may even get stuck, in the worst case, without getting gracefully terminated. Please understand this and use the software __at your own risk__.
+* Some data, such as mathematical formulas and computer source code, will not be converted correctly.
+* Some text data may not be extracted correctly for various reasons (incorrect matching of begin/end tags, language-specific formatting rules, etc.).
+* The conversion process can take longer than expected. When dealing with a huge data set such as the English Wikipedia on a low-spec environment, it can take several hours or more.
+* WP2TXT, by the nature of its task, requires a lot of machine power and consumes a large amount of memory/storage resources. Therefore, there is a possibility that the process may stop unexpectedly. In the worst case, the process may even freeze without terminating successfully. Please understand this and use at your own risk.
-### Useful Link ###
+## Useful Links
 * [Wikipedia Database backup dumps](http://dumps.wikimedia.org/backup-index.html)
-### Author ###
+## Author
 * Yoichiro Hasebe (<yohasebe@gmail.com>)
-### References ###
+## References
 The author will appreciate your mentioning one of these in your research.
 * Yoichiro HASEBE. 2006. [Method for using Wikipedia as Japanese corpus.](http://ci.nii.ac.jp/naid/110006226727) _Doshisha Studies in Language and Culture_ 9(2), 373-403.
 * 長谷部陽一郎. 2006. [Wikipedia日本語版をコーパスとして用いた言語研究の手法](http://ci.nii.ac.jp/naid/110006226727). 『言語文化』9(2), 373-403.
-### License ###
+## License
 This software is distributed under the MIT License. Please see the LICENSE file.
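
Note on the `--category-only` output format described in the README diff above: each line holds a title, a TAB, and a comma-separated category list. A minimal sketch of reading such a line back in Ruby might look like the following. The helper name `parse_category_line` and the sample line are hypothetical and not part of wp2txt. The sketch also assumes that no category name itself contains the sequence `, `, which the format does not guarantee.

```ruby
# Hypothetical helper (not part of wp2txt): parse one line of
# --category-only output, i.e. "title<TAB>category1, category2, ...".
def parse_category_line(line)
  # Split only on the first TAB so a TAB inside the category part is preserved.
  title, joined = line.chomp.split("\t", 2)
  # Assumption: categories are separated by ", " as in the README example.
  categories = joined.to_s.split(", ")
  { title: title, categories: categories }
end

# Made-up sample line for illustration only.
sample = "Example article\tFirst category, Second category\n"
record = parse_category_line(sample)
puts record[:title]
puts record[:categories].inspect
```

A line whose article has no categories (a title followed by a bare TAB) yields an empty category array in this sketch.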

data/bin/wp2txt CHANGED

@@ -27,7 +27,7 @@ EOS
   opt :input_file,  "Wikipedia dump file with .bz2 (compressed) or .txt (uncompressed) format", :required => true
   opt :output_dir,  "Output directory", :default => Dir::pwd, :type => String
   opt :convert,     "Output in plain text (converting from XML)", :default => true
-  opt :list,    "Show list items in output", :default => true
+  opt :list,    "Show list items in output", :default => false
   opt :heading, "Show section titles in output", :default => true, :short => "-d"
   opt :title,   "Show page titles in output", :default => true
   opt :table,   "Show table source code in output", :default => false
@@ -37,6 +37,8 @@ EOS
   opt :redirect, "Show redirect destination", :default => false
   opt :marker, "Show symbols prefixed to list items, definitions, etc.", :default => true
   opt :category, "Show article category information", :default => true
+  opt :category_only, "Extract only article title and categories", :default => false
+  opt :summary_only, "Extract only article title, categories, and summary text before first heading", :default => false
   opt :file_size,   "Approximate size (in MB) of each output file", :default => 10
   opt :num_threads,   "Number of threads to be spawned (capped to the number of CPU cores; set 99 to spawn max num of threads)", :default => 4
 end
@@ -49,10 +51,9 @@ tfile_size = opts[:file_size]
 num_threads = opts[:num_threads]
 convert = opts[:convert]
 strip_tmarker = opts[:marker] ? false : true
-opt_array = [:title, :list, :heading, :table, :redirect, :multiline]
+opt_array = [:title, :list, :heading, :table, :redirect, :multiline, :category, :category_only, :summary_only]
 $leave_inline_template = true if opts[:inline]
 $leave_ref = true if opts[:ref]
-# $leave_table = true if opts[:table]
 config = {}
 opt_array.each do |opt|
   config[opt] = opts[opt]
@@ -63,72 +64,80 @@ wpconv = Wp2txt::Runner.new(parent, input_file, output_dir, tfile_size, num_thre
 wpconv.extract_text do |article|
   format_wiki!(article.title)
-  title = "[[#{article.title}]]\n"
-  if opts[:category] && !article.categories.empty?
+  if config[:category_only]
+    title = "#{article.title}\t"
+    contents = article.categories.join(", ")
+    contents << "\n"
+  elsif config[:category] && !article.categories.empty?
+    title = "\n[[#{article.title}]]\n\n"
     contents = "\nCATEGORIES: "
     contents << article.categories.join(", ")
     contents << "\n\n"
   else
+    title = "\n[[#{article.title}]]\n\n"
     contents = ""
   end
-  article.elements.each do |e|
-    case e.first
-    when :mw_heading
-      next if !config[:heading]
-      format_wiki!(e.last)
-      line = e.last
-      line << "+HEADING+" if $DEBUG_MODE
-    when :mw_paragraph
-      format_wiki!(e.last)
-      line = e.last + "\n"
-      line << "+PARAGRAPH+" if $DEBUG_MODE
-    when :mw_table, :mw_htable
-      next if !config[:table]
-      line = e.last
-      line << "+TABLE+" if $DEBUG_MODE
-    when :mw_pre
-      next if !config[:pre]
-      line = e.last
-      line << "+PRE+" if $DEBUG_MODE
-    when :mw_quote
-      line = e.last
-      line << "+QUOTE+" if $DEBUG_MODE
-    when :mw_unordered, :mw_ordered, :mw_definition
-      next if !config[:list]
-      line = e.last
-      line << "+LIST+" if $DEBUG_MODE
-    when :mw_ml_template
-      next if !config[:multiline]
-      line = e.last
-      line << "+MLTEMPLATE+" if $DEBUG_MODE
-    when :mw_redirect
-      next if !config[:redirect]
-      line = e.last
-      line << "+REDIRECT+" if $DEBUG_MODE
-      line << "\n\n"
-    when :mw_isolated_template
-      next if !config[:multiline]
-      line = e.last
-      line << "+ISOLATED_TEMPLATE+" if $DEBUG_MODE
-    when :mw_isolated_tag
-      next
-    else
-      if $DEBUG_MODE
-        # format_wiki!(e.last)
+  unless config[:category_only]
+    article.elements.each do |e|
+      case e.first
+      when :mw_heading
+        break if config[:summary_only]
+        next if !config[:heading]
+        format_wiki!(e.last)
         line = e.last
-        line << "+OTHER+"
-      else
+        line << "+HEADING+" if $DEBUG_MODE
+      when :mw_paragraph
+        format_wiki!(e.last)
+        line = e.last + "\n"
+        line << "+PARAGRAPH+" if $DEBUG_MODE
+      when :mw_table, :mw_htable
+        next if !config[:table]
+        line = e.last
+        line << "+TABLE+" if $DEBUG_MODE
+      when :mw_pre
+        next if !config[:pre]
+        line = e.last
+        line << "+PRE+" if $DEBUG_MODE
+      when :mw_quote
+        line = e.last
+        line << "+QUOTE+" if $DEBUG_MODE
+      when :mw_unordered, :mw_ordered, :mw_definition
+        next if !config[:list]
+        line = e.last
+        line << "+LIST+" if $DEBUG_MODE
+      when :mw_ml_template
+        next if !config[:multiline]
+        line = e.last
+        line << "+MLTEMPLATE+" if $DEBUG_MODE
+      when :mw_redirect
+        next if !config[:redirect]
+        line = e.last
+        line << "+REDIRECT+" if $DEBUG_MODE
+        line << "\n\n"
+      when :mw_isolated_template
+        next if !config[:multiline]
+        line = e.last
+        line << "+ISOLATED_TEMPLATE+" if $DEBUG_MODE
+      when :mw_isolated_tag
         next
+      else
+        if $DEBUG_MODE
+          # format_wiki!(e.last)
+          line = e.last
+          line << "+OTHER+"
+        else
+          next
+        end
       end
+      contents << line << "\n"
     end
-    contents << line << "\n"
   end
   if /\A[\s　]*\z/m =~ contents
     result = ""
   else
-    result = config[:title] ? "\n#{title}\n" << contents : contents
+    result = config[:title] ? title << contents : contents
   end
 end
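
Note on the `--summary-only` change in the hunk above: the new `break if config[:summary_only]` stops the element loop at the first heading, so only text before that heading is kept. The standalone sketch below illustrates that control flow under simplifying assumptions. Elements are modeled as `[type, text]` pairs, which is inferred from the `e.first` / `e.last` calls in the diff. The function name `summary_text` is hypothetical, and only paragraphs are kept, whereas the real script also handles lists, quotes, and other element types and applies `format_wiki!`.

```ruby
# Illustrative sketch only, not wp2txt code.
# Elements are modeled as [type, text] pairs, as e.first / e.last suggest.
def summary_text(elements)
  contents = +""
  elements.each do |type, text|
    break if type == :mw_heading       # stop at the first heading (summary mode)
    next unless type == :mw_paragraph  # simplification: keep paragraphs only
    contents << text << "\n"
  end
  contents
end

# Made-up element list for illustration.
elements = [
  [:mw_paragraph, "Lead paragraph."],
  [:mw_heading, "== History =="],
  [:mw_paragraph, "Body text."]
]
puts summary_text(elements)
```

Using `break` rather than `next` is what distinguishes summary mode from `--no-heading`: the latter skips heading lines but still emits the body text that follows them.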