chupa-text 1.0.0

Files changed (67)
  1. checksums.yaml +7 -0
  2. data/.yardopts +5 -0
  3. data/Gemfile +21 -0
  4. data/LICENSE.txt +502 -0
  5. data/README.md +91 -0
  6. data/Rakefile +46 -0
  7. data/bin/chupa-text +21 -0
  8. data/bin/chupa-text-generate-decomposer +21 -0
  9. data/chupa-text.gemspec +58 -0
  10. data/data/chupa-text.conf +5 -0
  11. data/data/mime-types.conf +19 -0
  12. data/doc/text/command-line.md +136 -0
  13. data/doc/text/decomposer.md +343 -0
  14. data/doc/text/library.md +72 -0
  15. data/doc/text/news.md +5 -0
  16. data/lib/chupa-text.rb +37 -0
  17. data/lib/chupa-text/command.rb +18 -0
  18. data/lib/chupa-text/command/chupa-text-generate-decomposer.rb +324 -0
  19. data/lib/chupa-text/command/chupa-text.rb +102 -0
  20. data/lib/chupa-text/configuration-loader.rb +95 -0
  21. data/lib/chupa-text/configuration.rb +49 -0
  22. data/lib/chupa-text/data.rb +149 -0
  23. data/lib/chupa-text/decomposer-registry.rb +37 -0
  24. data/lib/chupa-text/decomposer.rb +37 -0
  25. data/lib/chupa-text/decomposers.rb +59 -0
  26. data/lib/chupa-text/decomposers/csv.rb +44 -0
  27. data/lib/chupa-text/decomposers/gzip.rb +51 -0
  28. data/lib/chupa-text/decomposers/tar.rb +42 -0
  29. data/lib/chupa-text/decomposers/xml.rb +55 -0
  30. data/lib/chupa-text/extractor.rb +91 -0
  31. data/lib/chupa-text/file-content.rb +35 -0
  32. data/lib/chupa-text/formatters.rb +17 -0
  33. data/lib/chupa-text/formatters/json.rb +60 -0
  34. data/lib/chupa-text/input-data.rb +58 -0
  35. data/lib/chupa-text/mime-type-registry.rb +41 -0
  36. data/lib/chupa-text/mime-type.rb +36 -0
  37. data/lib/chupa-text/text-data.rb +26 -0
  38. data/lib/chupa-text/version.rb +19 -0
  39. data/lib/chupa-text/virtual-content.rb +91 -0
  40. data/lib/chupa-text/virtual-file-data.rb +46 -0
  41. data/test/command/test-chupa-text.rb +178 -0
  42. data/test/decomposers/test-csv.rb +48 -0
  43. data/test/decomposers/test-gzip.rb +113 -0
  44. data/test/decomposers/test-tar.rb +78 -0
  45. data/test/decomposers/test-xml.rb +58 -0
  46. data/test/fixture/command/chupa-text/hello.txt +1 -0
  47. data/test/fixture/command/chupa-text/hello.txt.gz +0 -0
  48. data/test/fixture/command/chupa-text/no-decomposer.conf +3 -0
  49. data/test/fixture/extractor/hello.txt +1 -0
  50. data/test/fixture/gzip/hello.tar.gz +0 -0
  51. data/test/fixture/gzip/hello.tgz +0 -0
  52. data/test/fixture/gzip/hello.txt.gz +0 -0
  53. data/test/fixture/tar/directory.tar +0 -0
  54. data/test/fixture/tar/top-level.tar +0 -0
  55. data/test/helper.rb +25 -0
  56. data/test/run-test.rb +35 -0
  57. data/test/test-configuration-loader.rb +54 -0
  58. data/test/test-data.rb +85 -0
  59. data/test/test-decomposer-registry.rb +30 -0
  60. data/test/test-decomposer.rb +41 -0
  61. data/test/test-decomposers.rb +59 -0
  62. data/test/test-extractor.rb +125 -0
  63. data/test/test-file-content.rb +51 -0
  64. data/test/test-mime-type-registry.rb +48 -0
  65. data/test/test-text-data.rb +36 -0
  66. data/test/test-virtual-content.rb +103 -0
  67. metadata +183 -0
@@ -0,0 +1,91 @@
1
+ # README
2
+
3
+ ## Name
4
+
5
+ ChupaText
6
+
7
+ ## Description
8
+
9
+ ChupaText is an extensible text extractor. You can plug your custom
10
+ text extractor into ChupaText. You can write your plugin in Ruby.
11
+
12
+ ## Overview
13
+
14
+ ChupaText applies registered decomposers to input data
15
+ recursively. Finally, the input data is decomposed to text data.
16
+
17
+ Here is ASCII art that describes the process flow:
18
+
19
+ ```
20
+ input data
21
+ |
22
+ \|/
23
+ |decomposer|
24
+ |
25
+ \|/
26
+ other data
27
+ |
28
+ \|/
29
+ |decomposer|
30
+ |
31
+ \|/
32
+ ...
33
+ |
34
+ \|/
35
+ |decomposer|
36
+ |
37
+ \|/
38
+ text data
39
+ ```
40
+
41
+ A decomposer is a module that decomposes input data into other data. The
42
+ decomposed data may not be text data. If the decomposed data is not
43
+ text data, ChupaText applies a decomposer again. Finally, the
44
+ decomposed data will be text data.
45
+
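The recursive flow described above can be sketched in a few lines of Ruby. This is a toy model only: `ToyData`, `ToyWrapperDecomposer`, the `extract` method and the `application/x-wrapped` type are hypothetical stand-ins, not ChupaText's real classes or API.

```ruby
# Toy model of the flow described above. ToyData and
# ToyWrapperDecomposer are hypothetical stand-ins, not ChupaText classes.
ToyData = Struct.new(:mime_type, :body)

class ToyWrapperDecomposer
  # Handles a made-up "application/x-wrapped" type whose body is other data.
  def target?(data)
    data.mime_type == "application/x-wrapped"
  end

  def decompose(data)
    yield(data.body)
  end
end

# Applies the first matching decomposer and recurses on each result
# until only text data remains.
def extract(data, decomposers, &block)
  if data.mime_type == "text/plain"
    yield(data)
    return
  end
  decomposer = decomposers.find { |candidate| candidate.target?(data) }
  return if decomposer.nil? # no matching decomposer: nothing to extract
  decomposer.decompose(data) do |decomposed|
    extract(decomposed, decomposers, &block)
  end
end

text = ToyData.new("text/plain", "Hello\n")
wrapped_twice = ToyData.new("application/x-wrapped",
                            ToyData.new("application/x-wrapped", text))
texts = []
extract(wrapped_twice, [ToyWrapperDecomposer.new]) { |data| texts << data.body }
p texts
```

The input is wrapped twice, so the toy decomposer is applied twice before text data comes out, which mirrors the repeated `|decomposer|` steps in the diagram.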
46
+ A decomposer module is a plugin. You can add supported data types by
47
+ installing decomposer modules, or you can create your own custom
48
+ decomposer. A decomposer is a simple Ruby object, so it is easy to
49
+ create. See "How to create a decomposer" below.
50
+
51
+ ## Install
52
+
53
+ Install `chupa-text` gem:
54
+
55
+ ```
56
+ % gem install chupa-text
57
+ ```
58
+
59
+ Now you can use the `chupa-text` command:
60
+
61
+ ```
62
+ % chupa-text --version
63
+ chupa-text 1.0.0
64
+ ```
65
+
66
+ ## How to use
67
+
68
+ You can use ChupaText as a command line tool or as a Ruby library. See
69
+ the following documents for details:
70
+
71
+ * [doc/text/command-line.md](http://rubydoc.info/gems/chupa-text/file/doc/text/command-line.md)
72
+ describes how to use ChupaText as a command line tool.
73
+ * [doc/text/library.md](http://rubydoc.info/gems/chupa-text/file/doc/text/library.md)
74
+ describes how to use ChupaText as a Ruby library.
75
+
76
+ ## How to create a decomposer
77
+
78
+ See
79
+ [doc/text/decomposer.md](http://rubydoc.info/gems/chupa-text/file/doc/text/decomposer.md)
80
+ for how to write a decomposer.
81
+
82
+ ## Author
83
+
84
+ * Kouhei Sutou `<kou@clear-code.com>`
85
+
86
+ ## License
87
+
88
+ LGPL 2.1 or later.
89
+
90
+ (Kouhei Sutou has a right to change the license including contributed
91
+ patches.)
@@ -0,0 +1,46 @@
1
+ # -*- mode: ruby; coding: utf-8 -*-
2
+ #
3
+ # Copyright (C) 2013 Kouhei Sutou <kou@clear-code.com>
4
+ #
5
+ # This library is free software; you can redistribute it and/or
6
+ # modify it under the terms of the GNU Lesser General Public
7
+ # License as published by the Free Software Foundation; either
8
+ # version 2.1 of the License, or (at your option) any later version.
9
+ #
10
+ # This library is distributed in the hope that it will be useful,
11
+ # but WITHOUT ANY WARRANTY; without even the implied warranty of
12
+ # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
13
+ # Lesser General Public License for more details.
14
+ #
15
+ # You should have received a copy of the GNU Lesser General Public
16
+ # License along with this library; if not, write to the Free Software
17
+ # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
18
+
19
+ task :default => :test
20
+
21
+ require "pathname"
22
+
23
+ require "rubygems"
24
+ require "bundler/gem_helper"
25
+ require "packnga"
26
+
27
+ base_dir = Pathname(__FILE__).dirname
28
+
29
+ helper = Bundler::GemHelper.new(base_dir.to_s)
30
+ def helper.version_tag
31
+ version
32
+ end
33
+
34
+ helper.install
35
+ spec = helper.gemspec
36
+
37
+ Packnga::DocumentTask.new(spec) do
38
+ end
39
+
40
+ Packnga::ReleaseTask.new(spec) do
41
+ end
42
+
43
+ desc "Run tests"
44
+ task :test do
45
+ ruby("test/run-test.rb")
46
+ end
@@ -0,0 +1,21 @@
1
+ #!/usr/bin/env ruby
2
+ #
3
+ # Copyright (C) 2013 Kouhei Sutou <kou@clear-code.com>
4
+ #
5
+ # This library is free software; you can redistribute it and/or
6
+ # modify it under the terms of the GNU Lesser General Public
7
+ # License as published by the Free Software Foundation; either
8
+ # version 2.1 of the License, or (at your option) any later version.
9
+ #
10
+ # This library is distributed in the hope that it will be useful,
11
+ # but WITHOUT ANY WARRANTY; without even the implied warranty of
12
+ # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
13
+ # Lesser General Public License for more details.
14
+ #
15
+ # You should have received a copy of the GNU Lesser General Public
16
+ # License along with this library; if not, write to the Free Software
17
+ # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
18
+
19
+ require "chupa-text"
20
+
21
+ exit(ChupaText::Command::ChupaText.run(*ARGV))
@@ -0,0 +1,21 @@
1
+ #!/usr/bin/env ruby
2
+ #
3
+ # Copyright (C) 2013 Kouhei Sutou <kou@clear-code.com>
4
+ #
5
+ # This library is free software; you can redistribute it and/or
6
+ # modify it under the terms of the GNU Lesser General Public
7
+ # License as published by the Free Software Foundation; either
8
+ # version 2.1 of the License, or (at your option) any later version.
9
+ #
10
+ # This library is distributed in the hope that it will be useful,
11
+ # but WITHOUT ANY WARRANTY; without even the implied warranty of
12
+ # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
13
+ # Lesser General Public License for more details.
14
+ #
15
+ # You should have received a copy of the GNU Lesser General Public
16
+ # License along with this library; if not, write to the Free Software
17
+ # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
18
+
19
+ require "chupa-text"
20
+
21
+ exit(ChupaText::Command::ChupaTextGenerateDecomposer.run(*ARGV))
@@ -0,0 +1,58 @@
1
+ # -*- mode: ruby; coding: utf-8 -*-
2
+ #
3
+ # Copyright (C) 2013 Kouhei Sutou <kou@clear-code.com>
4
+ #
5
+ # This library is free software; you can redistribute it and/or
6
+ # modify it under the terms of the GNU Lesser General Public
7
+ # License as published by the Free Software Foundation; either
8
+ # version 2.1 of the License, or (at your option) any later version.
9
+ #
10
+ # This library is distributed in the hope that it will be useful,
11
+ # but WITHOUT ANY WARRANTY; without even the implied warranty of
12
+ # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
13
+ # Lesser General Public License for more details.
14
+ #
15
+ # You should have received a copy of the GNU Lesser General Public
16
+ # License along with this library; if not, write to the Free Software
17
+ # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
18
+
19
+ require "pathname"
20
+
21
+ base_dir = Pathname(__FILE__).dirname
22
+ lib_dir = base_dir + "lib"
23
+ $LOAD_PATH.unshift(lib_dir.to_s)
24
+
25
+ require "chupa-text/version"
26
+
27
+ clean_white_space = lambda do |entry|
28
+ entry.gsub(/(\A\n+|\n+\z)/, '') + "\n"
29
+ end
30
+
31
+ Gem::Specification.new do |spec|
32
+ spec.name = "chupa-text"
33
+ spec.version = ChupaText::VERSION
34
+ spec.homepage = "http://ranguba.org/#about-chupa-text"
35
+ spec.authors = ["Kouhei Sutou"]
36
+ spec.email = ["kou@clear-code.com"]
37
+ readme = File.read("README.md", :encoding => "UTF-8")
38
+ entries = readme.split(/^\#\#\s(.*)$/)
39
+ description = clean_white_space.call(entries[entries.index("Description") + 1])
40
+ spec.summary, spec.description, = description.split(/\n\n+/, 3)
41
+ spec.license = "LGPLv2.1 or later"
42
+ spec.files = ["#{spec.name}.gemspec"]
43
+ spec.files += ["README.md", "LICENSE.txt", "Rakefile", "Gemfile"]
44
+ spec.files += [".yardopts"]
45
+ spec.files += Dir.glob("data/*.conf")
46
+ spec.files += Dir.glob("lib/**/*.rb")
47
+ spec.files += Dir.glob("doc/text/*")
48
+ spec.files += Dir.glob("test/**/*")
49
+ Dir.chdir("bin") do
50
+ spec.executables = Dir.glob("*")
51
+ end
52
+
53
+ spec.add_development_dependency("bundler")
54
+ spec.add_development_dependency("rake")
55
+ spec.add_development_dependency("test-unit")
56
+ spec.add_development_dependency("packnga")
57
+ spec.add_development_dependency("redcarpet")
58
+ end
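The gemspec above derives `spec.summary` and `spec.description` from the README's "Description" section. The following sketch runs the same splitting steps on a made-up README string. The sample text and the local variable names `summary` and `long_description` are assumptions for illustration only.

```ruby
# Same helper as in the gemspec: trim leading/trailing blank lines.
clean_white_space = lambda do |entry|
  entry.gsub(/(\A\n+|\n+\z)/, '') + "\n"
end

# Hypothetical README content, not the real file.
readme = <<-README
# README

## Name

Sample

## Description

One-line summary.

Longer description paragraph.

## Install
README

# Splitting on "## <title>" keeps each captured title as its own element,
# so a section body is the element right after its title.
entries = readme.split(/^\#\#\s(.*)$/)
description = clean_white_space.call(entries[entries.index("Description") + 1])
summary, long_description, = description.split(/\n\n+/, 3)

puts summary
puts long_description
```

If the "Description" section has only one paragraph, the split returns a single element, so the second variable ends up `nil`.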
@@ -0,0 +1,5 @@
1
+ # -*- ruby -*-
2
+
3
+ load("mime-types.conf")
4
+
5
+ decomposer.names = ["*"]
@@ -0,0 +1,19 @@
1
+ # -*- ruby -*-
2
+
3
+ mime_type["txt"] = "text/plain"
4
+
5
+ mime_type["gz"] = "application/x-gzip"
6
+ mime_type["tgz"] = "application/x-gtar-compressed"
7
+
8
+ mime_type["tar"] = "application/x-tar"
9
+
10
+ mime_type["htm"] = "text/html"
11
+ mime_type["html"] = "text/html"
12
+ mime_type["xhtml"] = "application/xhtml+xml"
13
+
14
+ mime_type["xml"] = "text/xml"
15
+
16
+ mime_type["css"] = "text/css"
17
+
18
+ mime_type["csv"] = "text/csv"
19
+ mime_type["tsv"] = "text/tab-separated-values"
@@ -0,0 +1,136 @@
1
+ # How to use ChupaText as a command line tool
2
+
3
+ You can extract text and meta-data from an input with the `chupa-text`
4
+ command. `chupa-text` prints the extracted text and meta-data as JSON.
5
+
6
+ ## Input
7
+
8
+ The `chupa-text` command accepts a local file path or a URI.
9
+
10
+ Here is a local file path example:
11
+
12
+ ```
13
+ % chupa-text hello.txt.gz
14
+ ```
15
+
16
+ Here is a URI example:
17
+
18
+ ```
19
+ % chupa-text https://github.com/ranguba/chupa-text/raw/master/test/fixture/gzip/hello.txt.gz
20
+ ```
21
+
22
+ ## Output
23
+
24
+ `chupa-text` command prints the extracted result as JSON:
25
+
26
+ ```
27
+ % chupa-text hello.txt.gz
28
+ {
29
+ "mime-type": "application/x-gzip",
30
+ "uri": "hello.txt.gz",
31
+ "size": 36,
32
+ "texts": [
33
+ {
34
+ "mime-type": "text/plain",
35
+ "uri": "hello.txt",
36
+ "size": 6,
37
+ "body": "Hello\n"
38
+ }
39
+ ]
40
+ }
41
+ ```
42
+
43
+ The JSON uses the following data structure:
44
+
45
+ ```txt
46
+ {
47
+ "mime-type": "<MIME type of the input>",
48
+ "uri": "<URI or path of the input>",
49
+ "size": <Byte size of the input data>,
50
+ "other-meta-data1": <Other meta-data value1>,
51
+ "other-meta-data2": <Other meta-data value2>,
52
+ "...": <...>,
53
+ "texts": [
54
+ {
55
+ "mime-type": "<MIME type of the extracted data1>",
56
+ "uri": "<URI or path of the extracted data1>",
57
+ "size": <Byte size of the text of the extracted data1>,
58
+ "body": "<The text of the extracted data1>",
59
+ "other-meta-data1": <Other meta-data value1 of the extracted data1>,
60
+ "other-meta-data2": <Other meta-data value2 of the extracted data1>,
61
+ "...": <...>
62
+ },
63
+ {
64
+ <The information of the extracted data2>
65
+ },
66
+ {
67
+ <The information of the extracted data3>
68
+ },
69
+ <...>
70
+ ]
71
+ }
72
+ ```
73
+
74
+ You can find extracted texts in `texts[0].body`, `texts[1].body` and
75
+ so on. One input may produce more than one text because ChupaText
76
+ supports archive files such as `tar`.
77
+
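As a small illustration of reading this structure from Ruby, the sketch below parses the sample output shown earlier with the standard `json` library and collects each `body`. The JSON is pasted in as a string here. In practice it would come from the command's standard output.

```ruby
require "json"

# Sample output copied from the example above. A single-quoted heredoc
# keeps the JSON escape sequence \n untouched for the parser.
output = <<-'JSON'
{
  "mime-type": "application/x-gzip",
  "uri": "hello.txt.gz",
  "size": 36,
  "texts": [
    {
      "mime-type": "text/plain",
      "uri": "hello.txt",
      "size": 6,
      "body": "Hello\n"
    }
  ]
}
JSON

result = JSON.parse(output)
# Each element of "texts" is one extracted text with its own meta-data.
bodies = result["texts"].collect { |text| text["body"] }
p bodies
```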
78
+ ## Command line options
79
+
80
+ You can customize `chupa-text` command behavior. Here are the command line
81
+ options:
82
+
83
+ `--configuration=FILE`
84
+
85
+ It reads configuration from `FILE`. See the next section for
86
+ configuration file details.
87
+
88
+ ChupaText provides the default configuration file. It has suitable
89
+ configurations. Normally, you don't need to use your custom
90
+ configuration file.
91
+
92
+ `--help`
93
+
94
+ It shows available command line options and exits.
95
+
96
+ ## Configuration
97
+
98
+ A ChupaText configuration file is a Ruby script, but the syntax is
99
+ simple enough to read and write even for users who don't know
100
+ Ruby.
101
+
102
+ The basic syntax is the following:
103
+
104
+ ```
105
+ category.name = value
106
+ ```
107
+
108
+ Here is an example that sets the `names` variable in the `decomposer`
109
+ category to `["tar", "gzip"]`:
110
+
111
+ ```
112
+ decomposer.names = ["tar", "gzip"]
113
+ ```
114
+
115
+ Here are configuration parameters:
116
+
117
+ `decomposer.names = ["<decomposer name1>", "<decomposer name2>", "..."]`
118
+
119
+ It specifies an array of decomposer names to be used by the `chupa-text`
120
+ command. You can use a glob pattern for a decomposer name, such as
121
+ `"*zip"`. `"*zip"` matches `"zip"`, `"gzip"` and so on.
122
+
123
+ The default is `["*"]`. It means that all installed decomposers are
124
+ used.
125
+
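The glob matching described above behaves like Ruby's `File.fnmatch`. The sketch below assumes that kind of matching, since the source does not say which matcher ChupaText uses internally. The `installed` list is hypothetical.

```ruby
# Hypothetical list of installed decomposer names.
installed = ["zip", "gzip", "tar", "csv", "xml"]

# Keep the names that match at least one configured pattern.
def select_names(installed, patterns)
  installed.select do |name|
    patterns.any? { |pattern| File.fnmatch(pattern, name) }
  end
end

zip_like = select_names(installed, ["*zip"])
everything = select_names(installed, ["*"])
p zip_like
p everything
```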
126
+ `mime_type["<extension>"] = "<MIME type>"`
127
+
128
+ It maps a path extension to a MIME type.
129
+
130
+ Here is an example that maps `"html"` to `"text/html"`:
131
+
132
+ ```
133
+ mime_type["html"] = "text/html"
134
+ ```
135
+
136
+ The default configuration file registers popular MIME types.
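To show why a plain Ruby script can serve as the configuration format, here is a minimal sketch of one way such a file could be evaluated. `ToyConfiguration`, `Category` and `load_string` are hypothetical names for illustration and are not taken from ChupaText's own configuration loader.

```ruby
# Hypothetical configuration object. It is not ChupaText's real class.
class ToyConfiguration
  # One "category" holding a writable "names" variable.
  class Category
    attr_accessor :names

    def initialize
      @names = ["*"] # same default as described above
    end
  end

  attr_reader :decomposer, :mime_type

  def initialize
    @decomposer = Category.new
    @mime_type = {} # extension => MIME type
  end

  # Evaluates the script with this object as self, so "decomposer" and
  # "mime_type" in the script resolve to the readers above.
  def load_string(script)
    instance_eval(script)
    self
  end
end

script = <<-'CONF'
decomposer.names = ["tar", "gzip"]
mime_type["html"] = "text/html"
CONF

configuration = ToyConfiguration.new.load_string(script)
p configuration.decomposer.names
p configuration.mime_type
```

Because the script is ordinary Ruby, `category.name = value` is just a method call on an object that the loader exposes.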
@@ -0,0 +1,343 @@
1
+ # How to create a decomposer
2
+
3
+ You can extend ChupaText in Ruby. You can add a supported input type by
4
+ writing a decomposer module.
5
+
6
+ ## Overview
7
+
8
+ A decomposer is a Ruby class. It needs the following two methods:
9
+
10
+ * `target?`
11
+ * `decompose`
12
+
13
+ Both of them accept a single argument, `data`, which is the input
14
+ data.
15
+
16
+ First, ChupaText calls the `target?` method of your decomposer. If your
17
+ decomposer can decompose the input data, your `target?` method should
18
+ return `true`.
19
+
20
+ If your decomposer's `target?` method returns `true`, ChupaText calls
21
+ the `decompose` method of your decomposer. Your decomposer needs to
22
+ decompose the input data and `yield` extracted text data or data in
23
+ another format that will be decomposed by other decomposers. Your
24
+ decomposer can `yield` multiple times.
25
+
26
+ If your decomposer decomposes an archive file such as tar and zip
27
+ archives, your `decompose` method will `yield` other format data. If
28
+ your decomposer extracts text and meta-data from an input such as
29
+ HTML, your `decompose` method will `yield` text data.
30
+
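The two-method protocol above, including a `decompose` that yields more than once, can be tried without ChupaText installed. In this sketch `ToyEntry` and `ToyLinesDecomposer` are hypothetical stand-ins, and the `"lines"` extension is made up.

```ruby
# Hypothetical stand-in for input data: just an extension and a body.
ToyEntry = Struct.new(:extension, :body)

class ToyLinesDecomposer
  # Returns true only for inputs this decomposer knows how to handle.
  def target?(data)
    data.extension == "lines"
  end

  # Yields once per line, so one input produces several results.
  def decompose(data)
    data.body.each_line do |line|
      yield(ToyEntry.new("txt", line))
    end
  end
end

decomposer = ToyLinesDecomposer.new
data = ToyEntry.new("lines", "first\nsecond\n")
yielded = []
if decomposer.target?(data)
  decomposer.decompose(data) { |entry| yielded << entry.body }
end
p yielded
```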
31
+ ## Example
32
+
33
+ Let's create a simple XML decomposer as an example. It extracts text
34
+ data from input XML.
35
+
36
+ For example, here is an input XML:
37
+
38
+ ```xml
39
+ <root>
40
+ Hello <em>&amp;</em> World!
41
+ </root>
42
+ ```
43
+
44
+ The XML decomposer extracts the following text:
45
+
46
+ ```text
47
+ Hello & World!
48
+ ```
49
+
50
+ ChupaText provides the `chupa-text-generate-decomposer` command. It
51
+ generates skeleton code for a new decomposer. Let's use it.
52
+
53
+ `chupa-text-generate-decomposer` accepts the required information from
54
+ command line options or from standard input. You can check
55
+ the required information with the `--help` option:
56
+
57
+ ```text
58
+ % chupa-text-generate-decomposer --help
59
+ Usage: chupa-text-generate-decomposer [options]
60
+ --name=NAME Decomposer name
61
+ (e.g.: html)
62
+ --extensions=EXTENSION1,EXTENSION2,...
63
+ Target file extensions
64
+ (e.g.: htm,html,xhtml)
65
+ --mime-types=TYPE1,TYPE2,... Target MIME types
66
+ (e.g.: text/html,application/xhtml+xml)
67
+ --author=AUTHOR Author
68
+ (e.g.: 'Your Name')
69
+ (default: Kouhei Sutou)
70
+ --email=EMAIL Author E-mail
71
+ (e.g.: your@email.address)
72
+ (default: kou@clear-code.com)
73
+ --license=LICENSE License
74
+ (e.g.: MIT)
75
+ (default: LGPLv2.1 or later)
76
+ ```
77
+
78
+ Some options have default values. In the above case,
79
+ `--author`, `--email` and `--license` have default values.
80
+
81
+ The XML decomposer uses the following information:
82
+
83
+ * `--name`: `xml`
84
+ * `--extensions`: `xml`
85
+ * `--mime-types`: `text/xml`
86
+
87
+ Run with the above information:
88
+
89
+ ```text
90
+ % chupa-text-generate-decomposer --name xml --extensions xml --mime-types text/xml
91
+ Creating directory: chupa-text-decomposer-xml
92
+ Creating file: chupa-text-decomposer-xml/chupa-text-decomposer-xml.gemspec
93
+ Creating file: chupa-text-decomposer-xml/Gemfile
94
+ Creating file: chupa-text-decomposer-xml/Rakefile
95
+ Creating file: chupa-text-decomposer-xml/LICENSE.txt
96
+ Creating directory: chupa-text-decomposer-xml/lib/chupa-text/decomposers
97
+ Creating file: chupa-text-decomposer-xml/lib/chupa-text/decomposers/xml.rb
98
+ Creating directory: chupa-text-decomposer-xml/test
99
+ Creating file: chupa-text-decomposer-xml/test/test-xml.rb
100
+ Creating file: chupa-text-decomposer-xml/test/helper.rb
101
+ Creating file: chupa-text-decomposer-xml/test/run-test.rb
102
+ ```
103
+
104
+ `chupa-text-generate-decomposer` generates a directory named
105
+ `chupa-text-decomposer-#{name}/`.
106
+
107
+ Look at `lib/chupa-text/decomposers/xml.rb`:
108
+
109
+ ```
110
+ module ChupaText
111
+ module Decomposers
112
+ class Xml < Decomposer
113
+ def target?(data)
114
+ ["xml"].include?(data.extension) or
115
+ ["text/xml"].include?(data.mime_type)
116
+ end
117
+
118
+ def decompose(data)
119
+ raise NotImplementedError, "#{self.class}##{__method__} isn't implemented yet."
120
+ text = "IMPLEMENTED ME"
121
+ text_data = TextData.new(text)
122
+ yield(text_data)
123
+ end
124
+ end
125
+ end
126
+ end
127
+ ```
128
+
129
+ The generated code implements the `target?` method but doesn't fully
130
+ implement the `decompose` method. Let's implement the `decompose` method:
131
+
132
+ ```
133
+ require "cgi"
134
+
135
+ # ...
136
+ def decompose(data)
137
+ text = CGI.unescapeHTML(untag(data.body).strip)
138
+ text_data = TextData.new(text)
139
+ yield(text_data)
140
+ end
141
+
142
+ private
143
+ def untag(xml)
144
+ xml.gsub(/<.+?>/m, "")
145
+ end
146
+ # ...
147
+ ```
148
+
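The extraction logic above can be checked on its own, outside the decomposer class. This standalone sketch reuses the `untag` regular expression and `CGI.unescapeHTML` call from the snippet and feeds them the sample XML. The variable names are chosen here for illustration.

```ruby
require "cgi"

# Removes anything that looks like a tag. /m lets "." match newlines,
# so tags that span lines are removed too.
def untag(xml)
  xml.gsub(/<.+?>/m, "")
end

input_body = <<-XML
<root>
  Hello <em>&amp;</em> World!
</root>
XML

# Strip tags first, then unescape entities, so that escaped markup such
# as "&lt;b&gt;" is kept as text instead of being removed as a tag.
text = CGI.unescapeHTML(untag(input_body).strip)
escaped_markup = CGI.unescapeHTML(untag("a &lt;b&gt; c"))
puts text
puts escaped_markup
```

A regular expression is enough for this simple example, but it is not an XML parser. Input with comments or CDATA sections containing `>` would likely need a real parser.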
149
+ `chupa-text-generate-decomposer` also generates a test. Run the test:
150
+
151
+ ```
152
+ % bundle install
153
+ % rake
154
+ /usr/bin/ruby2.0 test/run-test.rb
155
+ Loaded suite .
156
+ Started
157
+ F
158
+ ===============================================================================
159
+ Failure:
160
+ test_body(decompose)
161
+ /tmp/chupa-text-decomposer-xml/test/test-xml.rb:24:in `test_body'
162
+ 21: def test_body
163
+ 22: input_body = "TODO (input)"
164
+ 23: expected_text = "TODO (extracted)"
165
+ => 24: assert_equal([expected_text],
166
+ 25: decompose(input_body).collect(&:body))
167
+ 26: end
168
+ 27: end
169
+ <["TODO (extracted)"]> expected but was
170
+ <["TODO (input)"]>
171
+
172
+ diff:
173
+ ? ["TODO (ex tracted)"]
174
+ ? inpu
175
+ ===============================================================================
176
+
177
+
178
+ Finished in 0.013355116 seconds.
179
+
180
+ 1 tests, 1 assertions, 1 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications
181
+ 0% passed
182
+
183
+ 74.88 tests/s, 74.88 assertions/s
184
+ rake aborted!
185
+ Command failed with status (1): [/usr/bin/ruby2.0 test/run-test.rb...]
186
+ /tmp/chupa-text-decomposer-xml/Rakefile:9:in `block in <top (required)>'
187
+ ```
188
+
189
+ The generated test fails because the test has placeholders. Look at the
190
+ generated test:
191
+
192
+ ```
193
+ class TestXml < Test::Unit::TestCase
194
+ include Helper
195
+
196
+ def setup
197
+ @decomposer = ChupaText::Decomposers::Xml.new({})
198
+ end
199
+
200
+ sub_test_case("decompose") do
201
+ def decompose(input_body)
202
+ data = ChupaText::Data.new
203
+ data.mime_type = "text/xml"
204
+ data.body = input_body
205
+
206
+ decomposed = []
207
+ @decomposer.decompose(data) do |decomposed_data|
208
+ decomposed << decomposed_data
209
+ end
210
+ decomposed
211
+ end
212
+
213
+ def test_body
214
+ input_body = "TODO (input)"
215
+ expected_text = "TODO (extracted)"
216
+ assert_equal([expected_text],
217
+ decompose(input_body).collect(&:body))
218
+ end
219
+ end
220
+ end
221
+ ```
222
+
223
+ `test_body` has TODO values as placeholders:
224
+
225
+ ```
226
+ # ...
227
+ def test_body
228
+ input_body = "TODO (input)"
229
+ expected_text = "TODO (extracted)"
230
+ assert_equal([expected_text],
231
+ decompose(input_body).collect(&:body))
232
+ end
233
+ # ...
234
+ ```
235
+
236
+ Fill in the TODOs with test XML and the expected result:
237
+
238
+ ```
239
+ # ...
240
+ def test_body
241
+ input_body = <<-XML
242
+ <root>
243
+ Hello <em>&amp;</em> World!
244
+ </root>
245
+ XML
246
+ expected_text = "Hello & World!"
247
+ assert_equal([expected_text],
248
+ decompose(input_body).collect(&:body))
249
+ end
250
+ # ...
251
+ ```
252
+
253
+ Run the test again:
254
+
255
+ ```
256
+ % rake
257
+ /usr/bin/ruby2.0 test/run-test.rb
258
+ Loaded suite .
259
+ Started
260
+ .
261
+
262
+ Finished in 0.000915172 seconds.
263
+
264
+ 1 tests, 1 assertions, 0 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications
265
+ 100% passed
266
+
267
+ 1092.69 tests/s, 1092.69 assertions/s
268
+ ```
269
+
270
+ The test passes!
271
+
272
+ You can release the decomposer with the following command. It requires an
273
+ account on https://rubygems.org/.
274
+
275
+ ```
276
+ % rake release
277
+ ```
278
+
279
+ Now you know how to create a new decomposer.
280
+
281
+ ## API reference
282
+
283
+ ### `data`
284
+
285
+ Both `target?` and `decompose` receive an argument `data`. It is a
286
+ {ChupaText::Data} instance or an instance of one of its subclasses. You
287
+ only need to read the API reference for {ChupaText::Data}. Don't use
288
+ subclass-specific API because it is not portable.
289
+
290
+ ### `target?`
291
+
292
+ `target?` should return `true` or `false`. The decomposer should
293
+ return `true` if the decomposer can decompose received `data`, `false`
294
+ otherwise.
295
+
296
+ ### `decompose`
297
+
298
+ `decompose` decomposes input `data` and `yield`s extracted text data or
299
+ decomposed data of another type. `decompose` can `yield` zero or more
300
+ times.
301
+
302
+ Here is template code that `yield`s extracted text data:
303
+
304
+ ```
305
+ def decompose(data)
306
+ text = extract_text(data)
307
+ text_data = ChupaText::TextData.new(text)
308
+ # text_data["meta-data1"] = meta_data_value1
309
+ # text_data["meta-data2"] = meta_data_value2
310
+ # ...
311
+ yield(text_data)
312
+ end
313
+ ```
314
+
315
+ See
316
+ [lib/chupa-text/decomposers/csv.rb](https://github.com/ranguba/chupa-text/blob/master/lib/chupa-text/decomposers/csv.rb)
317
+ as an example of extracting text data.
318
+
319
+ Here is template code that `yield`s data of another type:
320
+
321
+ ```
322
+ def decompose(data)
323
+ entries = decompose_archive(data)
324
+ entries.each do |entry|
325
+ path = entry.path
326
+ if entry.respond_to?(:read)
327
+ # The input must have "read" method.
328
+ input = entry
329
+ else
330
+ # If the entry doesn't have "read" method, wrap String data
331
+ # by StringIO.
332
+ input = StringIO.new(entry.data)
333
+ end
334
+ decomposed_data = ChupaText::VirtualFileData.new(path, input)
335
+ decomposed_data.source = data
336
+ yield(decomposed_data)
337
+ end
338
+ end
339
+ ```
340
+
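The `respond_to?(:read)` branch in the template above can be exercised in isolation. In this sketch `ToyStringEntry` and `readable_input` are hypothetical names. Note that `StringIO` needs `require "stringio"`, which the template leaves out.

```ruby
require "stringio"

# Hypothetical archive entry that only exposes its content as a String.
ToyStringEntry = Struct.new(:path, :data)

# Returns an object with a "read" method: the entry itself when it is
# already readable, otherwise its String data wrapped in StringIO.
def readable_input(entry)
  if entry.respond_to?(:read)
    entry
  else
    StringIO.new(entry.data)
  end
end

string_entry = ToyStringEntry.new("hello.txt", "Hello\n")
io_entry = StringIO.new("World\n")

wrapped = readable_input(string_entry)
passed_through = readable_input(io_entry)
wrapped_content = wrapped.read
puts wrapped_content
puts passed_through.read
```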
341
+ See
342
+ [lib/chupa-text/decomposers/tar.rb](https://github.com/ranguba/chupa-text/blob/master/lib/chupa-text/decomposers/tar.rb)
343
+ as an example of decomposing to other type data.