RubyGems - wp2txt - Versions diffs - 0.9.4 → 0.9.5 - Mend

wp2txt 0.9.4 → 0.9.5

Files changed (16)

checksums.yaml +4 -4
data/.gitignore +1 -0
data/README.md +34 -17
data/bin/wp2txt +7 -6
data/data/output_samples/testdata_en.txt +11923 -36921
data/data/output_samples/testdata_en_categories.txt +107 -182
data/data/output_samples/testdata_en_summary.txt +1368 -0
data/data/output_samples/testdata_ja.txt +24812 -4686
data/data/output_samples/testdata_ja_categories.txt +202 -44
data/data/output_samples/testdata_ja_summary.txt +1684 -0
data/data/testdata_en.bz2 +0 -0
data/data/testdata_ja.bz2 +0 -0
data/lib/wp2txt/article.rb +3 -2
data/lib/wp2txt/utils.rb +51 -27
data/lib/wp2txt/version.rb +1 -1
metadata +4 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 50f291332872b0e3cd0b651662d7494ec9edd823fdb6ba6a928f501a37ea06c3
-  data.tar.gz: ec4891f6a30c7bc2f8f0a6fd3ec56618c9f706ea277207e7f955347417959f7e
+  metadata.gz: bf8270b3488c0045a067f71c155db8d9ac6366a94d825eed9bc6d05c95598345
+  data.tar.gz: 8802949a232c60d8b5ae6f93726154f7a6b40436b478919657f58f4bdc54add3
 SHA512:
-  metadata.gz: afa3770c47bc25252993bfddf6da6e99a7bca87d4d899b3f8ce44d8a6298d29a19ce06fe9b64166316a31672a76d7d4530887e77d98212bc8f17a350c0e1598a
-  data.tar.gz: ef6f5b11b8a7d2ae5eeb640b0f2319bea9ee1209b0ab1dd78833f3cde41149fb8468871d90ba75b00c62876fccfa5c5f7cca6fc2420d4769c3c35a7bd9aa8786
+  metadata.gz: 0a10804d78c33e035aaf429dd4613f84f3db0c6f22a6c36617a1fda25f03c0fd8fac224ec9e6009ab6ddddb475d73e6eda4c21606f89ef94950bc3749ce4f452
+  data.tar.gz: 4a8ea2f0900c6f97d3dcaf6c6387b3543d23962ea064cf1b18a8c293b4664c16fcabca9230aa83b0f5685eacffed74968fe82529566080b7821fd944c7bf275d
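
A minimal Ruby sketch of how one of the digests above could be checked locally. It assumes the `.gem` archive has already been unpacked so that `data.tar.gz` (or `metadata.gz`) exists as a plain file; the helper name `checksum_matches?` is hypothetical and not part of RubyGems or wp2txt.

```ruby
require "digest"

# Hypothetical helper (not a RubyGems API): returns true when the SHA256
# digest of the file at `path` equals the expected hex string.
def checksum_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex.strip.downcase
end

# Example call, assuming data.tar.gz from the 0.9.5 gem is in the current directory:
# checksum_matches?("data.tar.gz",
#   "8802949a232c60d8b5ae6f93726154f7a6b40436b478919657f58f4bdc54add3")
```

The same approach should work for the SHA512 entries by swapping in `Digest::SHA512`.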

data/.gitignore CHANGED

@@ -18,3 +18,4 @@ tmp
 .DS_Store
 *.bak
 *.~

data/README.md CHANGED

@@ -4,16 +4,18 @@ Wikipedia dump file to text converter that extracts both content and category da
 ## About
-WP2TXT extracts plain text data from Wikipedia dump file (encoded in XML/compressed with Bzip2) stripping all the MediaWiki markups and other metadata. It is originally intended to be useful for researchers who look for an easy way to obtain open-source multi-lingual corpora, but may be handy for other purposes.
+WP2TXT extracts plain text data from a Wikipedia dump file (encoded in XML / compressed with Bzip2), removing all MediaWiki markup and other metadata. It was developed for researchers who want easy access to open-source multilingual corpora, but may be used for other purposes as well.
+**UPDATE (July 2022)**: Version 0.9.3 adds a new option `category_only`. When this option is enabled, wp2txt will extract only the title and category information of the article. See output examples below.
-**UPDATE (July 2022)**: Version 0.9.3 has added a new option `category_only`. With this option enabled, wp2txt extracts article title and category info only. Please see output examples below.
 ## Features
-* Convert dump files of Wikipedia of various languages
-* Create output files of specified size.
-* Allow users to specify text elements to be extracted/converted (page titles, section titles, lists, and tables)
-* Extract category information of each article
+* Converts Wikipedia dump files in various languages
+* Creates output files of specified size
+* Can specify text elements to be extracted and converted (page titles, section titles, lists, tables)
+* Can extract category information for each article
 ## Installation
@@ -23,11 +25,11 @@ WP2TXT extracts plain text data from Wikipedia dump file (encoded in XML/compres
 Obtain a Wikipedia dump file (from [here](http://dumps.wikimedia.org/backup-index.html)) with a file name such as:
-    xxwiki-yyyymmdd-pages-articles.xml.bz2
+> `xxwiki-yyyymmdd-pages-articles.xml.bz2`
 where `xx` is language code such as "en (English)" or "ja (Japanese)", and  `yyyymmdd` is the date of creation (e.g. 20220720).
-### Example 1
+### Example 1: Basic
 The following extracts text data, including list items and excluding tables.
@@ -36,15 +38,29 @@ The following extracts text data, including list items and excluding tables.
 - [Output example (English)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en.txt)
 - [Output example (Japanese)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja.txt)
-### Example 2
+### Example 2: Title and category information only
 The following will extract only article titles and the categories to which each article belongs:
     $ wp2txt --category-only -i xxwiki-yyyymmdd-pages-articles.xml.bz2 -o /output_dir
+Each line of the output data contains the title and the categories of an article:
+> title `TAB` category1`,` category2`,` category3`,` ...
 - [Output example (English)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en_categories.txt)
 - [Output example (Japanese)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja_categories.txt)
+### Example 3: Title, category, and summary text only
+The following will extract only article titles, the categories to which each article belongs, and text blocks before the first heading of the article:
+    $ wp2txt --summary-only -i xxwiki-yyyymmdd-pages-articles.xml.bz2 -o /output_dir
+- [Output example (English)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en_summary.txt)
+- [Output example (Japanese)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja_summary.txt)
 ## Options
 Command line options are as follows:
@@ -69,6 +85,7 @@ Command line options are as follows:
                                    definitions, etc. (Default: true)
                  --category, -g:   Show article category information (default: true)
             --category-only, -y:   Extract only article title and categories (default: false)
+             -s, --summary-only:   Extract only article title, categories, and summary text before first heading
             --file-size, -f <i>:   Approximate size (in MB) of each output file
                                    (default: 10)
           -u, --num-threads=<i>:   Number of threads to be spawned (capped to the number of CPU cores;
@@ -78,26 +95,26 @@ Command line options are as follows:
 ## Caveats
-* Certain types of data such as mathematical equations and computer source code are not be properly converted.  Please remember this software is originally intended for correcting “sentences” for linguistic studies.
-* Extraction of normal text data could sometimes fail for various reasons (e.g. illegal matching of begin/end tags, language-specific conventions of formatting, etc).
-* Conversion process can take far more than you would expect. It could take several hours or more when dealing with a huge data set such as the English Wikipedia on a low-spec environments.
-* Because of nature of the task, WP2TXT needs much machine power and consumes a lot of memory/storage resources. The process thus could halt unexpectedly. It may even get stuck, in the worst case, without getting gracefully terminated. Please understand this and use the software __at your own risk__.
+* Some data, such as mathematical formulas and computer source code, will not be converted correctly.
+* Some text data may not be extracted correctly for various reasons (incorrect matching of begin/end tags, language-specific formatting rules, etc.).
+* The conversion process can take longer than expected. When dealing with a huge data set such as the English Wikipedia on a low-spec environment, it can take several hours or more.
+* WP2TXT, by the nature of its task, requires a lot of machine power and consumes a large amount of memory/storage resources. Therefore, there is a possibility that the process may stop unexpectedly. In the worst case, the process may even freeze without terminating successfully. Please understand this and use at your own risk.
-### Useful Links
+## Useful Links
 * [Wikipedia Database backup dumps](http://dumps.wikimedia.org/backup-index.html)
-### Author
+## Author
 * Yoichiro Hasebe (<yohasebe@gmail.com>)
-### References
+## References
 The author will appreciate your mentioning one of these in your research.
 * Yoichiro HASEBE. 2006. [Method for using Wikipedia as Japanese corpus.](http://ci.nii.ac.jp/naid/110006226727) _Doshisha Studies in Language and Culture_ 9(2), 373-403.
 * 長谷部陽一郎. 2006. [Wikipedia日本語版をコーパスとして用いた言語研究の手法](http://ci.nii.ac.jp/naid/110006226727). 『言語文化』9(2), 373-403.
-### License
+## License
 This software is distributed under the MIT License. Please see the LICENSE file.
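
The `--category-only` output format described in the README diff above (title, a TAB, then comma-separated categories) can be read back with a few lines of Ruby. This is a sketch only: `parse_category_line` is a hypothetical helper, not part of wp2txt, and it assumes category names do not themselves contain commas.

```ruby
# Hypothetical helper (not part of wp2txt): splits one line of
# --category-only output into the article title and its categories.
# Assumes "title<TAB>cat1, cat2, ..." and that no category contains a comma.
def parse_category_line(line)
  title, cats = line.chomp.split("\t", 2)
  categories = cats.to_s.split(",").map(&:strip).reject(&:empty?)
  [title, categories]
end

# Example call on a made-up line:
# parse_category_line("Ruby\tProgramming languages, Scripting languages\n")
```

A title with no categories yields an empty array rather than `nil`, which keeps downstream code free of nil checks.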

data/bin/wp2txt CHANGED

@@ -27,7 +27,7 @@ EOS
   opt :input_file,  "Wikipedia dump file with .bz2 (compressed) or .txt (uncompressed) format", :required => true
   opt :output_dir,  "Output directory", :default => Dir::pwd, :type => String
   opt :convert,     "Output in plain text (converting from XML)", :default => true
-  opt :list,    "Show list items in output", :default => true
+  opt :list,    "Show list items in output", :default => false
   opt :heading, "Show section titles in output", :default => true, :short => "-d"
   opt :title,   "Show page titles in output", :default => true
   opt :table,   "Show table source code in output", :default => false
@@ -38,6 +38,7 @@ EOS
   opt :marker, "Show symbols prefixed to list items, definitions, etc.", :default => true
   opt :category, "Show article category information", :default => true
   opt :category_only, "Extract only article title and categories", :default => false
+  opt :summary_only, "Extract only article title, categories, and summary text before first heading", :default => false
   opt :file_size,   "Approximate size (in MB) of each output file", :default => 10
   opt :num_threads,   "Number of threads to be spawned (capped to the number of CPU cores; set 99 to spawn max num of threads)", :default => 4
 end
@@ -50,10 +51,9 @@ tfile_size = opts[:file_size]
 num_threads = opts[:num_threads]
 convert = opts[:convert]
 strip_tmarker = opts[:marker] ? false : true
-opt_array = [:title, :list, :heading, :table, :redirect, :multiline]
+opt_array = [:title, :list, :heading, :table, :redirect, :multiline, :category, :category_only, :summary_only]
 $leave_inline_template = true if opts[:inline]
 $leave_ref = true if opts[:ref]
-# $leave_table = true if opts[:table]
 config = {}
 opt_array.each do |opt|
   config[opt] = opts[opt]
@@ -65,11 +65,11 @@ wpconv = Wp2txt::Runner.new(parent, input_file, output_dir, tfile_size, num_thre
 wpconv.extract_text do |article|
   format_wiki!(article.title)
-  if opts[:category_only]
+  if config[:category_only]
     title = "#{article.title}\t"
     contents = article.categories.join(", ")
     contents << "\n"
-  elsif opts[:category] && !article.categories.empty?
+  elsif config[:category] && !article.categories.empty?
     title = "\n[[#{article.title}]]\n\n"
     contents = "\nCATEGORIES: "
     contents << article.categories.join(", ")
@@ -79,10 +79,11 @@ wpconv.extract_text do |article|
     contents = ""
   end
-  unless opts[:category_only]
+  unless config[:category_only]
     article.elements.each do |e|
       case e.first
       when :mw_heading
+        break if config[:summary_only]
         next if !config[:heading]
         format_wiki!(e.last)
         line = e.last