wp2txt 0.9.2 → 0.9.5

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 3ed3d7e29a8f1c6b5f97ca0da646ddfb53ae88add38f647eae0bdc03e626269e
-   data.tar.gz: '009188addebcd908f449f2ce4cf39036406f3816cafeeb61beba097fe036e890'
+   metadata.gz: bf8270b3488c0045a067f71c155db8d9ac6366a94d825eed9bc6d05c95598345
+   data.tar.gz: 8802949a232c60d8b5ae6f93726154f7a6b40436b478919657f58f4bdc54add3
  SHA512:
-   metadata.gz: d91531685df204222ab7bae9b3153653d61ccd36270e36f14575cabc3c2b1d6009bfa15f9033cb8eeb837f7c1a97fdb6303611166ec62ca96b9e4c8fc1e1ec15
-   data.tar.gz: 19183feee7eb8f7c03d3f7bf60eebb7e75ffeb6c6eec6967a8c3e480f82f2b48b6e171d2aa22c7aa44a9336b981ad51dfd37ab423c3db2fe1a0d854860c37231
+   metadata.gz: 0a10804d78c33e035aaf429dd4613f84f3db0c6f22a6c36617a1fda25f03c0fd8fac224ec9e6009ab6ddddb475d73e6eda4c21606f89ef94950bc3749ce4f452
+   data.tar.gz: 4a8ea2f0900c6f97d3dcaf6c6387b3543d23962ea064cf1b18a8c293b4664c16fcabca9230aa83b0f5685eacffed74968fe82529566080b7821fd944c7bf275d
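
These checksum changes simply record the digests of the rebuilt gem archives. As a hedged aside, a downloaded archive could be checked against the new SHA256 value in Ruby like this (a minimal sketch, not part of wp2txt; the local file path is hypothetical, and the expected digest is the new `data.tar.gz` value from above):

    require "digest"

    # Compare a local data.tar.gz against the SHA256 recorded in checksums.yaml.
    expected = "8802949a232c60d8b5ae6f93726154f7a6b40436b478919657f58f4bdc54add3"
    actual = Digest::SHA256.file("data.tar.gz").hexdigest  # hypothetical local path
    puts(actual == expected ? "checksum OK" : "checksum mismatch")
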
data/.gitignore CHANGED
@@ -18,3 +18,4 @@ tmp
  .DS_Store
  *.bak
  *.~
+
data/README.md CHANGED
@@ -1,32 +1,67 @@
  # WP2TXT

- Wikipedia dump file to text converter
+ Wikipedia dump file to text converter that extracts both content and category data

- **IMPORTANT:** This project is still a work in progress and it could be slow, unstable, and even destructive! It should be used with caution.
+ ## About

- ### About ###
+ WP2TXT extracts plain text data from a Wikipedia dump file (encoded in XML / compressed with Bzip2), removing all MediaWiki markup and other metadata. It was developed for researchers who want easy access to open-source multilingual corpora, but may be used for other purposes as well.

- WP2TXT extracts plain text data from a Wikipedia dump file (encoded in XML / compressed with Bzip2), stripping all MediaWiki markup and other metadata. It is originally intended to be useful for researchers who look for an easy way to obtain open-source multilingual corpora, but may be handy for other purposes.
+ **UPDATE (July 2022)**: Version 0.9.3 adds a new option `category_only`. When this option is enabled, wp2txt will extract only the title and category information of the article. See output examples below.

- **UPDATE:** Version 0.9.1 added a new option `num-threads`, which improves performance significantly. Note also that the `--category` option is enabled by default, resulting in an output format somewhat different from previous versions. Check the new format using the test data in the `data/output_samples` folder before going on to convert a huge Wikipedia dump.

- ### Features ###
+ ## Features

- * Convert dump files of Wikipedia of various languages (I hope).
- * Create output files of specified size.
- * Allow users to specify text elements to be extracted/converted (page titles, section titles, lists, and tables).
+ * Converts Wikipedia dump files in various languages
+ * Creates output files of specified size
+ * Can specify text elements to be extracted and converted (page titles, section titles, lists, tables)
+ * Can extract category information for each article
+
+
+ ## Installation

- ### Installation
-
      $ gem install wp2txt

- ### Usage
+ ## Usage

  Obtain a Wikipedia dump file (from [here](http://dumps.wikimedia.org/backup-index.html)) with a file name such as:

-     xxwiki-yyyymmdd-pages-articles.xml.bz2
+ > `xxwiki-yyyymmdd-pages-articles.xml.bz2`
+
+ where `xx` is a language code such as "en (English)" or "ja (Japanese)", and `yyyymmdd` is the date of creation (e.g. 20220720).
+
+ ### Example 1: Basic
+
+ The following extracts text data, including list items and excluding tables.
+
+     $ wp2txt -i xxwiki-yyyymmdd-pages-articles.xml.bz2 -o /output_dir
+
+ - [Output example (English)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en.txt)
+ - [Output example (Japanese)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja.txt)
+
+ ### Example 2: Title and category information only
+
+ The following will extract only article titles and the categories to which each article belongs:
+
+     $ wp2txt --category-only -i xxwiki-yyyymmdd-pages-articles.xml.bz2 -o /output_dir

- where `xx` is a language code such as "en (English)" or "ja (Japanese)", and `yyyymmdd` is the date of creation (e.g. 20120601).
+ Each line of the output data contains the title and the categories of an article:
+
+ > title `TAB` category1`,` category2`,` category3`,` ...
+
+ - [Output example (English)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en_categories.txt)
+ - [Output example (Japanese)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja_categories.txt)
+
+ ### Example 3: Title, category, and summary text only
+
+ The following will extract only article titles, the categories to which each article belongs, and the text blocks before the first heading of the article:
+
+     $ wp2txt --summary-only -i xxwiki-yyyymmdd-pages-articles.xml.bz2 -o /output_dir
+
+ - [Output example (English)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en_summary.txt)
+ - [Output example (Japanese)](https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_ja_summary.txt)
+
+
+ ## Options

  Command line options are as follows:

@@ -40,44 +75,46 @@ Command line options are as follows:
  --list, --no-list, -l: Show list items in output (default: true)
  --heading, --no-heading, -d: Show section titles in output (default: true)
  --title, --no-title, -t: Show page titles in output (default: true)
- --table, -a: Show table source code in output
- --inline, -n: Leave inline template notations unmodified
- --multiline, -m: Leave multiline template notations unmodified
- --ref, -r: Leave reference notations in the format
+ --table, -a: Show table source code in output (default: false)
+ --inline, -n: Leave inline template notations unmodified (default: false)
+ --multiline, -m: Leave multiline template notations unmodified (default: false)
+ --ref, -r: Leave reference notations in the format (default: false)
  [ref]...[/ref]
- --redirect, -e: Show redirect destination
+ --redirect, -e: Show redirect destination (default: false)
  --marker, --no-marker, -k: Show symbols prefixed to list items,
  definitions, etc. (default: true)
- --category, -g: Show article category information
+ --category, -g: Show article category information (default: true)
+ --category-only, -y: Extract only article title and categories (default: false)
+ -s, --summary-only: Extract only article title, categories, and summary text before first heading (default: false)
  --file-size, -f <i>: Approximate size (in MB) of each output file
  (default: 10)
  -u, --num-threads=<i>: Number of threads to be spawned (capped to the number of CPU cores;
  set 99 to spawn max num of threads) (default: 4)
  --version, -v: Print version and exit
  --help, -h: Show this message

- ### Caveats ###
+ ## Caveats

- * Certain types of data such as mathematical equations and computer source code are not properly converted. Please remember this software is originally intended for collecting "sentences" for linguistic studies.
- * Extraction of normal text data can sometimes fail for various reasons (e.g. incorrect matching of begin/end tags, language-specific formatting conventions, etc.).
- * The conversion process can take far longer than you would expect. It can take several hours or more when dealing with a huge data set such as the English Wikipedia on a low-spec environment.
- * Because of the nature of the task, WP2TXT needs much machine power and consumes a lot of memory/storage resources. The process thus could halt unexpectedly. In the worst case it may even get stuck without terminating gracefully. Please understand this and use the software __at your own risk__.
+ * Some data, such as mathematical formulas and computer source code, will not be converted correctly.
+ * Some text data may not be extracted correctly for various reasons (incorrect matching of begin/end tags, language-specific formatting rules, etc.).
+ * The conversion process can take longer than expected. When dealing with a huge data set such as the English Wikipedia on a low-spec environment, it can take several hours or more.
+ * WP2TXT, by the nature of its task, requires a lot of machine power and consumes a large amount of memory/storage resources. The process may therefore stop unexpectedly; in the worst case, it may even freeze without terminating successfully. Please understand this and use it at your own risk.

- ### Useful Link ###
+ ## Useful Links

  * [Wikipedia Database backup dumps](http://dumps.wikimedia.org/backup-index.html)
-
- ### Author ###
+
+ ## Author

  * Yoichiro Hasebe (<yohasebe@gmail.com>)

- ### References ###
+ ## References

  The author will appreciate your mentioning one of these in your research.

  * Yoichiro HASEBE. 2006. [Method for using Wikipedia as Japanese corpus.](http://ci.nii.ac.jp/naid/110006226727) _Doshisha Studies in Language and Culture_ 9(2), 373-403.
  * 長谷部陽一郎. 2006. [Wikipedia日本語版をコーパスとして用いた言語研究の手法](http://ci.nii.ac.jp/naid/110006226727). 『言語文化』9(2), 373-403.

- ### License ###
+ ## License

  This software is distributed under the MIT License. Please see the LICENSE file.
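
The tab-separated layout described in Example 2 of the new README (`title TAB category1, category2, ...`) is easy to post-process. As a minimal sketch (not part of the gem; the input file name is hypothetical), the output could be loaded in Ruby like this:

    # Build a title => [categories] mapping from a --category-only output file.
    # The path below is hypothetical; each line is "title\tcat1, cat2, ...".
    categories_by_title = {}

    File.foreach("testdata_en_categories.txt") do |line|
      title, cats = line.chomp.split("\t", 2)
      next if title.nil? || title.empty?
      categories_by_title[title] = cats.to_s.split(",").map(&:strip)
    end

    # Example use: print the titles of all articles in a given category.
    categories_by_title.each do |title, cats|
      puts title if cats.include?("Linguistics")
    end
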
data/bin/wp2txt CHANGED
@@ -27,7 +27,7 @@ EOS
  opt :input_file, "Wikipedia dump file with .bz2 (compressed) or .txt (uncompressed) format", :required => true
  opt :output_dir, "Output directory", :default => Dir::pwd, :type => String
  opt :convert, "Output in plain text (converting from XML)", :default => true
- opt :list, "Show list items in output", :default => true
+ opt :list, "Show list items in output", :default => false
  opt :heading, "Show section titles in output", :default => true, :short => "-d"
  opt :title, "Show page titles in output", :default => true
  opt :table, "Show table source code in output", :default => false
@@ -37,6 +37,8 @@ EOS
  opt :redirect, "Show redirect destination", :default => false
  opt :marker, "Show symbols prefixed to list items, definitions, etc.", :default => true
  opt :category, "Show article category information", :default => true
+ opt :category_only, "Extract only article title and categories", :default => false
+ opt :summary_only, "Extract only article title, categories, and summary text before first heading", :default => false
  opt :file_size, "Approximate size (in MB) of each output file", :default => 10
  opt :num_threads, "Number of threads to be spawned (capped to the number of CPU cores; set 99 to spawn max num of threads)", :default => 4
  end
@@ -49,10 +51,9 @@ tfile_size = opts[:file_size]
  num_threads = opts[:num_threads]
  convert = opts[:convert]
  strip_tmarker = opts[:marker] ? false : true
- opt_array = [:title, :list, :heading, :table, :redirect, :multiline]
+ opt_array = [:title, :list, :heading, :table, :redirect, :multiline, :category, :category_only, :summary_only]
  $leave_inline_template = true if opts[:inline]
  $leave_ref = true if opts[:ref]
- # $leave_table = true if opts[:table]
  config = {}
  opt_array.each do |opt|
    config[opt] = opts[opt]
@@ -63,72 +64,80 @@ wpconv = Wp2txt::Runner.new(parent, input_file, output_dir, tfile_size, num_thre

  wpconv.extract_text do |article|
    format_wiki!(article.title)
-   title = "[[#{article.title}]]\n"

-   if opts[:category] && !article.categories.empty?
+   if config[:category_only]
+     title = "#{article.title}\t"
+     contents = article.categories.join(", ")
+     contents << "\n"
+   elsif config[:category] && !article.categories.empty?
+     title = "\n[[#{article.title}]]\n\n"
      contents = "\nCATEGORIES: "
      contents << article.categories.join(", ")
      contents << "\n\n"
    else
+     title = "\n[[#{article.title}]]\n\n"
      contents = ""
    end

-   article.elements.each do |e|
-     case e.first
-     when :mw_heading
-       next if !config[:heading]
-       format_wiki!(e.last)
-       line = e.last
-       line << "+HEADING+" if $DEBUG_MODE
-     when :mw_paragraph
-       format_wiki!(e.last)
-       line = e.last + "\n"
-       line << "+PARAGRAPH+" if $DEBUG_MODE
-     when :mw_table, :mw_htable
-       next if !config[:table]
-       line = e.last
-       line << "+TABLE+" if $DEBUG_MODE
-     when :mw_pre
-       next if !config[:pre]
-       line = e.last
-       line << "+PRE+" if $DEBUG_MODE
-     when :mw_quote
-       line = e.last
-       line << "+QUOTE+" if $DEBUG_MODE
-     when :mw_unordered, :mw_ordered, :mw_definition
-       next if !config[:list]
-       line = e.last
-       line << "+LIST+" if $DEBUG_MODE
-     when :mw_ml_template
-       next if !config[:multiline]
-       line = e.last
-       line << "+MLTEMPLATE+" if $DEBUG_MODE
-     when :mw_redirect
-       next if !config[:redirect]
-       line = e.last
-       line << "+REDIRECT+" if $DEBUG_MODE
-       line << "\n\n"
-     when :mw_isolated_template
-       next if !config[:multiline]
-       line = e.last
-       line << "+ISOLATED_TEMPLATE+" if $DEBUG_MODE
-     when :mw_isolated_tag
-       next
-     else
-       if $DEBUG_MODE
-         # format_wiki!(e.last)
+   unless config[:category_only]
+     article.elements.each do |e|
+       case e.first
+       when :mw_heading
+         break if config[:summary_only]
+         next if !config[:heading]
+         format_wiki!(e.last)
          line = e.last
-         line << "+OTHER+"
-       else
+         line << "+HEADING+" if $DEBUG_MODE
+       when :mw_paragraph
+         format_wiki!(e.last)
+         line = e.last + "\n"
+         line << "+PARAGRAPH+" if $DEBUG_MODE
+       when :mw_table, :mw_htable
+         next if !config[:table]
+         line = e.last
+         line << "+TABLE+" if $DEBUG_MODE
+       when :mw_pre
+         next if !config[:pre]
+         line = e.last
+         line << "+PRE+" if $DEBUG_MODE
+       when :mw_quote
+         line = e.last
+         line << "+QUOTE+" if $DEBUG_MODE
+       when :mw_unordered, :mw_ordered, :mw_definition
+         next if !config[:list]
+         line = e.last
+         line << "+LIST+" if $DEBUG_MODE
+       when :mw_ml_template
+         next if !config[:multiline]
+         line = e.last
+         line << "+MLTEMPLATE+" if $DEBUG_MODE
+       when :mw_redirect
+         next if !config[:redirect]
+         line = e.last
+         line << "+REDIRECT+" if $DEBUG_MODE
+         line << "\n\n"
+       when :mw_isolated_template
+         next if !config[:multiline]
+         line = e.last
+         line << "+ISOLATED_TEMPLATE+" if $DEBUG_MODE
+       when :mw_isolated_tag
          next
+       else
+         if $DEBUG_MODE
+           # format_wiki!(e.last)
+           line = e.last
+           line << "+OTHER+"
+         else
+           next
+         end
        end
+       contents << line << "\n"
      end
-     contents << line << "\n"
    end

    if /\A[\s ]*\z/m =~ contents
      result = ""
    else
-     result = config[:title] ? "\n#{title}\n" << contents : contents
+     result = config[:title] ? title << contents : contents
    end
  end
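
To make the effect of the two new options concrete: with `--category-only` the element loop is skipped entirely, and with `--summary-only` it stops at the first heading, so only the lead text is kept. Below is a minimal, self-contained Ruby sketch of that control flow (the `render` helper and the sample data are illustrative stand-ins, not part of wp2txt):

    # Simplified stand-in for the element loop above: elements are
    # [type, text] pairs, mirroring the shape of article.elements.
    def render(elements, config)
      return "" if config[:category_only]  # body text is skipped entirely
      contents = ""
      elements.each do |type, text|
        case type
        when :mw_heading
          break if config[:summary_only]   # keep only text before the first heading
          next unless config[:heading]
        when :mw_table
          next unless config[:table]
        end
        contents << text << "\n"
      end
      contents
    end

    elements = [
      [:mw_paragraph, "Lead paragraph."],
      [:mw_heading, "History"],
      [:mw_paragraph, "Body paragraph."]
    ]

    puts render(elements, { summary_only: true, heading: true })
    # => prints only "Lead paragraph."
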