chupa-text 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +7 -0
- data/.yardopts +5 -0
- data/Gemfile +21 -0
- data/LICENSE.txt +502 -0
- data/README.md +91 -0
- data/Rakefile +46 -0
- data/bin/chupa-text +21 -0
- data/bin/chupa-text-generate-decomposer +21 -0
- data/chupa-text.gemspec +58 -0
- data/data/chupa-text.conf +5 -0
- data/data/mime-types.conf +19 -0
- data/doc/text/command-line.md +136 -0
- data/doc/text/decomposer.md +343 -0
- data/doc/text/library.md +72 -0
- data/doc/text/news.md +5 -0
- data/lib/chupa-text.rb +37 -0
- data/lib/chupa-text/command.rb +18 -0
- data/lib/chupa-text/command/chupa-text-generate-decomposer.rb +324 -0
- data/lib/chupa-text/command/chupa-text.rb +102 -0
- data/lib/chupa-text/configuration-loader.rb +95 -0
- data/lib/chupa-text/configuration.rb +49 -0
- data/lib/chupa-text/data.rb +149 -0
- data/lib/chupa-text/decomposer-registry.rb +37 -0
- data/lib/chupa-text/decomposer.rb +37 -0
- data/lib/chupa-text/decomposers.rb +59 -0
- data/lib/chupa-text/decomposers/csv.rb +44 -0
- data/lib/chupa-text/decomposers/gzip.rb +51 -0
- data/lib/chupa-text/decomposers/tar.rb +42 -0
- data/lib/chupa-text/decomposers/xml.rb +55 -0
- data/lib/chupa-text/extractor.rb +91 -0
- data/lib/chupa-text/file-content.rb +35 -0
- data/lib/chupa-text/formatters.rb +17 -0
- data/lib/chupa-text/formatters/json.rb +60 -0
- data/lib/chupa-text/input-data.rb +58 -0
- data/lib/chupa-text/mime-type-registry.rb +41 -0
- data/lib/chupa-text/mime-type.rb +36 -0
- data/lib/chupa-text/text-data.rb +26 -0
- data/lib/chupa-text/version.rb +19 -0
- data/lib/chupa-text/virtual-content.rb +91 -0
- data/lib/chupa-text/virtual-file-data.rb +46 -0
- data/test/command/test-chupa-text.rb +178 -0
- data/test/decomposers/test-csv.rb +48 -0
- data/test/decomposers/test-gzip.rb +113 -0
- data/test/decomposers/test-tar.rb +78 -0
- data/test/decomposers/test-xml.rb +58 -0
- data/test/fixture/command/chupa-text/hello.txt +1 -0
- data/test/fixture/command/chupa-text/hello.txt.gz +0 -0
- data/test/fixture/command/chupa-text/no-decomposer.conf +3 -0
- data/test/fixture/extractor/hello.txt +1 -0
- data/test/fixture/gzip/hello.tar.gz +0 -0
- data/test/fixture/gzip/hello.tgz +0 -0
- data/test/fixture/gzip/hello.txt.gz +0 -0
- data/test/fixture/tar/directory.tar +0 -0
- data/test/fixture/tar/top-level.tar +0 -0
- data/test/helper.rb +25 -0
- data/test/run-test.rb +35 -0
- data/test/test-configuration-loader.rb +54 -0
- data/test/test-data.rb +85 -0
- data/test/test-decomposer-registry.rb +30 -0
- data/test/test-decomposer.rb +41 -0
- data/test/test-decomposers.rb +59 -0
- data/test/test-extractor.rb +125 -0
- data/test/test-file-content.rb +51 -0
- data/test/test-mime-type-registry.rb +48 -0
- data/test/test-text-data.rb +36 -0
- data/test/test-virtual-content.rb +103 -0
- metadata +183 -0
data/README.md
ADDED
@@ -0,0 +1,91 @@
|
|
1
|
+
# README
|
2
|
+
|
3
|
+
## Name
|
4
|
+
|
5
|
+
ChupaText
|
6
|
+
|
7
|
+
## Description
|
8
|
+
|
9
|
+
ChupaText is an extensible text extractor. You can plug your custom
|
10
|
+
text extractor into ChupaText. You can write your plugin in Ruby.
|
11
|
+
|
12
|
+
## Overview
|
13
|
+
|
14
|
+
ChupaText applies registered decomposers to input data
|
15
|
+
recursively. Finally, the input data is decomposed to text data.
|
16
|
+
|
17
|
+
Here is ASCII art that describes the process flow:
|
18
|
+
|
19
|
+
```
|
20
|
+
input data
|
21
|
+
|
|
22
|
+
\|/
|
23
|
+
|decomposer|
|
24
|
+
|
|
25
|
+
\|/
|
26
|
+
other data
|
27
|
+
|
|
28
|
+
\|/
|
29
|
+
|decomposer|
|
30
|
+
|
|
31
|
+
\|/
|
32
|
+
...
|
33
|
+
|
|
34
|
+
\|/
|
35
|
+
|decomposer|
|
36
|
+
|
|
37
|
+
\|/
|
38
|
+
text data
|
39
|
+
```
|
40
|
+
|
41
|
+
A decomposer is a module that decomposes input data into other data. The
|
42
|
+
decomposed data may not be text data. If the decomposed data is not
|
43
|
+
text data, ChupaText applies a decomposer again. Finally, the
|
44
|
+
decomposed data will be text data.
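
The recursion described above can be sketched in plain Ruby. This is only an illustration under stated assumptions, not ChupaText's implementation: `Datum`, `DECOMPOSERS` and `extract_texts` are hypothetical names, and the two lambdas stand in for real decomposers.

```ruby
require "zlib"

# Stand-in for input and decomposed data: a MIME type plus a body.
Datum = Struct.new(:mime_type, :body)

# Hypothetical decomposers, keyed by the MIME type they accept.
DECOMPOSERS = {
  "application/x-gzip" => ->(datum) { [Datum.new("text/xml", Zlib.gunzip(datum.body))] },
  "text/xml" => ->(datum) { [Datum.new("text/plain", datum.body.gsub(/<.+?>/m, "").strip)] },
}

# Apply decomposers recursively until only text data remains.
def extract_texts(datum)
  return [datum] if datum.mime_type == "text/plain"
  decomposer = DECOMPOSERS[datum.mime_type]
  return [] if decomposer.nil? # no decomposer for this type: nothing to extract
  decomposer.call(datum).flat_map { |decomposed| extract_texts(decomposed) }
end

input = Datum.new("application/x-gzip", Zlib.gzip("<p>Hello</p>"))
texts = extract_texts(input)
puts texts.map(&:body).inspect
```

The gzip stand-in produces XML data, which is not text data yet, so the XML stand-in is applied next. The recursion stops when every piece of data is text data.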
|
45
|
+
|
46
|
+
A decomposer module is a plugin. You can add supported data types by
|
47
|
+
installing decomposer modules. Or you can create your custom
|
48
|
+
decomposer. A decomposer is a simple Ruby object, so it is easy to
|
49
|
+
create. It is described later.
|
50
|
+
|
51
|
+
## Install
|
52
|
+
|
53
|
+
Install `chupa-text` gem:
|
54
|
+
|
55
|
+
```
|
56
|
+
% gem install chupa-text
|
57
|
+
```
|
58
|
+
|
59
|
+
Now, you can use `chupa-text` command:
|
60
|
+
|
61
|
+
```
|
62
|
+
% chupa-text --version
|
63
|
+
chupa-text 1.0.0
|
64
|
+
```
|
65
|
+
|
66
|
+
## How to use
|
67
|
+
|
68
|
+
You can use ChupaText as a command line tool or a Ruby library. See the
|
69
|
+
following documents for details:
|
70
|
+
|
71
|
+
* [doc/text/command-line.md](http://rubydoc.info/gems/chupa-text/file/doc/text/command-line.md)
|
72
|
+
describes how to use ChupaText as a command line tool.
|
73
|
+
* [doc/text/library.md](http://rubydoc.info/gems/chupa-text/file/doc/text/library.md)
|
74
|
+
describes how to use ChupaText as a Ruby library.
|
75
|
+
|
76
|
+
## How to create a decomposer
|
77
|
+
|
78
|
+
See
|
79
|
+
[doc/text/decomposer.md](http://rubydoc.info/gems/chupa-text/file/doc/text/decomposer.md)
|
80
|
+
for how to write a decomposer.
|
81
|
+
|
82
|
+
## Author
|
83
|
+
|
84
|
+
* Kouhei Sutou `<kou@clear-code.com>`
|
85
|
+
|
86
|
+
## License
|
87
|
+
|
88
|
+
LGPL 2.1 or later.
|
89
|
+
|
90
|
+
(Kouhei Sutou has a right to change the license including contributed
|
91
|
+
patches.)
|
data/Rakefile
ADDED
@@ -0,0 +1,46 @@
|
|
1
|
+
# -*- mode: ruby; coding: utf-8 -*-
|
2
|
+
#
|
3
|
+
# Copyright (C) 2013 Kouhei Sutou <kou@clear-code.com>
|
4
|
+
#
|
5
|
+
# This library is free software; you can redistribute it and/or
|
6
|
+
# modify it under the terms of the GNU Lesser General Public
|
7
|
+
# License as published by the Free Software Foundation; either
|
8
|
+
# version 2.1 of the License, or (at your option) any later version.
|
9
|
+
#
|
10
|
+
# This library is distributed in the hope that it will be useful,
|
11
|
+
# but WITHOUT ANY WARRANTY; without even the implied warranty of
|
12
|
+
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
|
13
|
+
# Lesser General Public License for more details.
|
14
|
+
#
|
15
|
+
# You should have received a copy of the GNU Lesser General Public
|
16
|
+
# License along with this library; if not, write to the Free Software
|
17
|
+
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
|
18
|
+
|
19
|
+
task :default => :test
|
20
|
+
|
21
|
+
require "pathname"
|
22
|
+
|
23
|
+
require "rubygems"
|
24
|
+
require "bundler/gem_helper"
|
25
|
+
require "packnga"
|
26
|
+
|
27
|
+
base_dir = Pathname(__FILE__).dirname
|
28
|
+
|
29
|
+
helper = Bundler::GemHelper.new(base_dir.to_s)
|
30
|
+
def helper.version_tag
|
31
|
+
version
|
32
|
+
end
|
33
|
+
|
34
|
+
helper.install
|
35
|
+
spec = helper.gemspec
|
36
|
+
|
37
|
+
Packnga::DocumentTask.new(spec) do
|
38
|
+
end
|
39
|
+
|
40
|
+
Packnga::ReleaseTask.new(spec) do
|
41
|
+
end
|
42
|
+
|
43
|
+
desc "Run tests"
|
44
|
+
task :test do
|
45
|
+
ruby("test/run-test.rb")
|
46
|
+
end
|
data/bin/chupa-text
ADDED
@@ -0,0 +1,21 @@
|
|
1
|
+
#!/usr/bin/env ruby
|
2
|
+
#
|
3
|
+
# Copyright (C) 2013 Kouhei Sutou <kou@clear-code.com>
|
4
|
+
#
|
5
|
+
# This library is free software; you can redistribute it and/or
|
6
|
+
# modify it under the terms of the GNU Lesser General Public
|
7
|
+
# License as published by the Free Software Foundation; either
|
8
|
+
# version 2.1 of the License, or (at your option) any later version.
|
9
|
+
#
|
10
|
+
# This library is distributed in the hope that it will be useful,
|
11
|
+
# but WITHOUT ANY WARRANTY; without even the implied warranty of
|
12
|
+
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
|
13
|
+
# Lesser General Public License for more details.
|
14
|
+
#
|
15
|
+
# You should have received a copy of the GNU Lesser General Public
|
16
|
+
# License along with this library; if not, write to the Free Software
|
17
|
+
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
|
18
|
+
|
19
|
+
require "chupa-text"
|
20
|
+
|
21
|
+
exit(ChupaText::Command::ChupaText.run(*ARGV))
|
data/bin/chupa-text-generate-decomposer
ADDED
@@ -0,0 +1,21 @@
|
|
1
|
+
#!/usr/bin/env ruby
|
2
|
+
#
|
3
|
+
# Copyright (C) 2013 Kouhei Sutou <kou@clear-code.com>
|
4
|
+
#
|
5
|
+
# This library is free software; you can redistribute it and/or
|
6
|
+
# modify it under the terms of the GNU Lesser General Public
|
7
|
+
# License as published by the Free Software Foundation; either
|
8
|
+
# version 2.1 of the License, or (at your option) any later version.
|
9
|
+
#
|
10
|
+
# This library is distributed in the hope that it will be useful,
|
11
|
+
# but WITHOUT ANY WARRANTY; without even the implied warranty of
|
12
|
+
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
|
13
|
+
# Lesser General Public License for more details.
|
14
|
+
#
|
15
|
+
# You should have received a copy of the GNU Lesser General Public
|
16
|
+
# License along with this library; if not, write to the Free Software
|
17
|
+
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
|
18
|
+
|
19
|
+
require "chupa-text"
|
20
|
+
|
21
|
+
exit(ChupaText::Command::ChupaTextGenerateDecomposer.run(*ARGV))
|
data/chupa-text.gemspec
ADDED
@@ -0,0 +1,58 @@
|
|
1
|
+
# -*- mode: ruby; coding: utf-8 -*-
|
2
|
+
#
|
3
|
+
# Copyright (C) 2013 Kouhei Sutou <kou@clear-code.com>
|
4
|
+
#
|
5
|
+
# This library is free software; you can redistribute it and/or
|
6
|
+
# modify it under the terms of the GNU Lesser General Public
|
7
|
+
# License as published by the Free Software Foundation; either
|
8
|
+
# version 2.1 of the License, or (at your option) any later version.
|
9
|
+
#
|
10
|
+
# This library is distributed in the hope that it will be useful,
|
11
|
+
# but WITHOUT ANY WARRANTY; without even the implied warranty of
|
12
|
+
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
|
13
|
+
# Lesser General Public License for more details.
|
14
|
+
#
|
15
|
+
# You should have received a copy of the GNU Lesser General Public
|
16
|
+
# License along with this library; if not, write to the Free Software
|
17
|
+
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
|
18
|
+
|
19
|
+
require "pathname"
|
20
|
+
|
21
|
+
base_dir = Pathname(__FILE__).dirname
|
22
|
+
lib_dir = base_dir + "lib"
|
23
|
+
$LOAD_PATH.unshift(lib_dir.to_s)
|
24
|
+
|
25
|
+
require "chupa-text/version"
|
26
|
+
|
27
|
+
clean_white_space = lambda do |entry|
|
28
|
+
entry.gsub(/(\A\n+|\n+\z)/, '') + "\n"
|
29
|
+
end
|
30
|
+
|
31
|
+
Gem::Specification.new do |spec|
|
32
|
+
spec.name = "chupa-text"
|
33
|
+
spec.version = ChupaText::VERSION
|
34
|
+
spec.homepage = "http://ranguba.org/#about-chupa-text"
|
35
|
+
spec.authors = ["Kouhei Sutou"]
|
36
|
+
spec.email = ["kou@clear-code.com"]
|
37
|
+
readme = File.read("README.md", :encoding => "UTF-8")
|
38
|
+
entries = readme.split(/^\#\#\s(.*)$/)
|
39
|
+
description = clean_white_space.call(entries[entries.index("Description") + 1])
|
40
|
+
spec.summary, spec.description, = description.split(/\n\n+/, 3)
|
41
|
+
spec.license = "LGPLv2.1 or later"
|
42
|
+
spec.files = ["#{spec.name}.gemspec"]
|
43
|
+
spec.files += ["README.md", "LICENSE.txt", "Rakefile", "Gemfile"]
|
44
|
+
spec.files += [".yardopts"]
|
45
|
+
spec.files += Dir.glob("data/*.conf")
|
46
|
+
spec.files += Dir.glob("lib/**/*.rb")
|
47
|
+
spec.files += Dir.glob("doc/text/*")
|
48
|
+
spec.files += Dir.glob("test/**/*")
|
49
|
+
Dir.chdir("bin") do
|
50
|
+
spec.executables = Dir.glob("*")
|
51
|
+
end
|
52
|
+
|
53
|
+
spec.add_development_dependency("bundler")
|
54
|
+
spec.add_development_dependency("rake")
|
55
|
+
spec.add_development_dependency("test-unit")
|
56
|
+
spec.add_development_dependency("packnga")
|
57
|
+
spec.add_development_dependency("redcarpet")
|
58
|
+
end
|
data/data/mime-types.conf
ADDED
@@ -0,0 +1,19 @@
|
|
1
|
+
# -*- ruby -*-
|
2
|
+
|
3
|
+
mime_type["txt"] = "text/plain"
|
4
|
+
|
5
|
+
mime_type["gz"] = "application/x-gzip"
|
6
|
+
mime_type["tgz"] = "application/x-gtar-compressed"
|
7
|
+
|
8
|
+
mime_type["tar"] = "application/x-tar"
|
9
|
+
|
10
|
+
mime_type["htm"] = "text/html"
|
11
|
+
mime_type["html"] = "text/html"
|
12
|
+
mime_type["xhtml"] = "application/xhtml+xml"
|
13
|
+
|
14
|
+
mime_type["xml"] = "text/xml"
|
15
|
+
|
16
|
+
mime_type["css"] = "text/css"
|
17
|
+
|
18
|
+
mime_type["csv"] = "text/csv"
|
19
|
+
mime_type["tsv"] = "text/tab-separated-values"
|
data/doc/text/command-line.md
ADDED
@@ -0,0 +1,136 @@
|
|
1
|
+
# How to use ChupaText as a command line tool
|
2
|
+
|
3
|
+
You can extract text and meta-data from an input with the `chupa-text`
|
4
|
+
command. `chupa-text` prints extracted text and meta-data as JSON.
|
5
|
+
|
6
|
+
## Input
|
7
|
+
|
8
|
+
The `chupa-text` command accepts a local file path or a URI.
|
9
|
+
|
10
|
+
Here is a local file path example:
|
11
|
+
|
12
|
+
```
|
13
|
+
% chupa-text hello.txt.gz
|
14
|
+
```
|
15
|
+
|
16
|
+
Here is a URI example:
|
17
|
+
|
18
|
+
```
|
19
|
+
% chupa-text https://github.com/ranguba/chupa-text/raw/master/test/fixture/gzip/hello.txt.gz
|
20
|
+
```
|
21
|
+
|
22
|
+
## Output
|
23
|
+
|
24
|
+
The `chupa-text` command prints the extracted result as JSON:
|
25
|
+
|
26
|
+
```
|
27
|
+
% chupa-text hello.txt.gz
|
28
|
+
{
|
29
|
+
"mime-type": "application/x-gzip",
|
30
|
+
"uri": "hello.txt.gz",
|
31
|
+
"size": 36,
|
32
|
+
"texts": [
|
33
|
+
{
|
34
|
+
"mime-type": "text/plain",
|
35
|
+
"uri": "hello.txt",
|
36
|
+
"size": 6,
|
37
|
+
"body": "Hello\n"
|
38
|
+
}
|
39
|
+
]
|
40
|
+
}
|
41
|
+
```
|
42
|
+
|
43
|
+
JSON uses the following data structure:
|
44
|
+
|
45
|
+
```txt
|
46
|
+
{
|
47
|
+
"mime-type": "<MIME type of the input>",
|
48
|
+
"uri": "<URI or path of the input>",
|
49
|
+
"size": <Byte size of the input data>,
|
50
|
+
"other-meta-data1": <Other meta-data value1>,
|
51
|
+
"other-meta-data2": <Other meta-data value2>,
|
52
|
+
"...": <...>,
|
53
|
+
"texts": [
|
54
|
+
{
|
55
|
+
"mime-type": "<MIME type of the extracted data1>",
|
56
|
+
"uri": "<URI or path of the extracted data1>",
|
57
|
+
"size": <Byte size of the text of the extracted data1>,
|
58
|
+
"body": "<The text of the extracted data1>",
|
59
|
+
"other-meta-data1": <Other meta-data value1 of the extracted data1>,
|
60
|
+
"other-meta-data2": <Other meta-data value2 of the extracted data1>,
|
61
|
+
"...": <...>
|
62
|
+
},
|
63
|
+
{
|
64
|
+
<The information of the extracted data2>
|
65
|
+
},
|
66
|
+
{
|
67
|
+
<The information of the extracted data3>
|
68
|
+
},
|
69
|
+
<...>
|
70
|
+
]
|
71
|
+
}
|
72
|
+
```
|
73
|
+
|
74
|
+
You can find extracted texts in `texts[0].body`, `texts[1].body` and
|
75
|
+
so on. You may extract one or more texts from one input because
|
76
|
+
ChupaText supports archive files such as `tar`.
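
As a rough sketch of consuming this output from Ruby, the following parses the JSON with the standard `json` library and collects every `body`. The JSON text is copied from the `hello.txt.gz` example above. In real use it would be whatever `chupa-text` printed, and the variable names are only illustrative.

```ruby
require "json"

# Output copied from the hello.txt.gz example. The quoted heredoc keeps
# the JSON escape sequence \n as-is instead of turning it into a newline.
output = <<~'JSON'
  {
    "mime-type": "application/x-gzip",
    "uri": "hello.txt.gz",
    "size": 36,
    "texts": [
      {
        "mime-type": "text/plain",
        "uri": "hello.txt",
        "size": 6,
        "body": "Hello\n"
      }
    ]
  }
JSON

result = JSON.parse(output)

# One input may produce several texts, so collect all bodies.
bodies = result["texts"].map { |text| text["body"] }
puts bodies.inspect
```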
|
77
|
+
|
78
|
+
## Command line options
|
79
|
+
|
80
|
+
You can customize the `chupa-text` command's behavior. Here are the command line
|
81
|
+
options:
|
82
|
+
|
83
|
+
`--configuration=FILE`
|
84
|
+
|
85
|
+
It reads configuration from `FILE`. See the next section for
|
86
|
+
configuration file details.
|
87
|
+
|
88
|
+
ChupaText provides the default configuration file. It has suitable
|
89
|
+
configurations. Normally, you don't need to use your custom
|
90
|
+
configuration file.
|
91
|
+
|
92
|
+
`--help`
|
93
|
+
|
94
|
+
It shows available command line options and exits.
|
95
|
+
|
96
|
+
## Configuration
|
97
|
+
|
98
|
+
A ChupaText configuration file is a Ruby script, but it is easy to read
|
99
|
+
and write a ChupaText configuration file even for users who don't know
|
100
|
+
Ruby.
|
101
|
+
|
102
|
+
The basic syntax is the following:
|
103
|
+
|
104
|
+
```
|
105
|
+
category.name = value
|
106
|
+
```
|
107
|
+
|
108
|
+
Here is an example that sets `["tar", "gzip"]` as the `value` of the `names`
|
109
|
+
variable in the `decomposer` category:
|
110
|
+
|
111
|
+
```
|
112
|
+
decomposer.names = ["tar", "gzip"]
|
113
|
+
```
|
114
|
+
|
115
|
+
Here are configuration parameters:
|
116
|
+
|
117
|
+
`decomposer.names = ["<decomposer name1>", "<decomposer name2>", "..."]`
|
118
|
+
|
119
|
+
It specifies an array of decomposer names to be used in the `chupa-text`
|
120
|
+
command. You can use a glob pattern for a decomposer name, such as
|
121
|
+
`"*zip"`. `"*zip"` matches `"zip"`, `"gzip"` and so on.
|
122
|
+
|
123
|
+
The default is `["*"]`. It means that all installed decomposers are
|
124
|
+
used.
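
To get a feel for how such glob patterns select names, here is a small sketch that uses Ruby's `File.fnmatch`. It assumes the matching behaves like `File.fnmatch`, which this document does not confirm, and the `installed` list (including `"zip"`) is hypothetical.

```ruby
# Hypothetical list of installed decomposer names.
installed = ["csv", "gzip", "tar", "xml", "zip"]

# Select the names that match at least one glob pattern.
select_names = lambda do |patterns|
  installed.select do |name|
    patterns.any? { |pattern| File.fnmatch(pattern, name) }
  end
end

zip_like = select_names.call(["*zip"])
everything = select_names.call(["*"])
puts zip_like.inspect
puts everything.inspect
```

With `["*zip"]` only `"gzip"` and `"zip"` are selected, and with `["*"]` every name is selected.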
|
125
|
+
|
126
|
+
`mime_type["<extension>"] = "<MIME type>"`
|
127
|
+
|
128
|
+
It maps a path extension to a MIME type.
|
129
|
+
|
130
|
+
Here is an example that maps `"html"` to `"text/html"`:
|
131
|
+
|
132
|
+
```
|
133
|
+
mime_type["html"] = "text/html"
|
134
|
+
```
|
135
|
+
|
136
|
+
The default configuration file registers popular MIME types.
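
The idea of the extension map can be illustrated with a plain Ruby `Hash`. This is a sketch, not ChupaText's registry: `mime_type` here is an ordinary local variable and `guess` is a hypothetical helper.

```ruby
# A plain Hash that plays the role of the extension-to-MIME-type map.
mime_type = {}
mime_type["html"] = "text/html"
mime_type["gz"] = "application/x-gzip"

# Look up the MIME type by the last extension of a path.
# Returns nil when the extension is not registered.
guess = ->(path) { mime_type[File.extname(path).delete_prefix(".")] }

puts guess.call("index.html").inspect
puts guess.call("hello.txt.gz").inspect
puts guess.call("README").inspect
```

Only the last extension is used in this sketch, so `hello.txt.gz` is looked up as `gz`.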
|
data/doc/text/decomposer.md
ADDED
@@ -0,0 +1,343 @@
|
|
1
|
+
# How to create a decomposer
|
2
|
+
|
3
|
+
You can extend ChupaText in Ruby. You can add a supported input type by
|
4
|
+
writing a decomposer module.
|
5
|
+
|
6
|
+
## Overview
|
7
|
+
|
8
|
+
A decomposer is a Ruby class. It needs the following two methods:
|
9
|
+
|
10
|
+
* `target?`
|
11
|
+
* `decompose`
|
12
|
+
|
13
|
+
Both of them accept only one argument `data`. `data` is an input
|
14
|
+
data.
|
15
|
+
|
16
|
+
First, ChupaText calls the `target?` method of your decomposer. If your
|
17
|
+
decomposer can decompose the input data, your `target?` method should
|
18
|
+
return `true`.
|
19
|
+
|
20
|
+
If your decomposer's `target?` method returns `true`, ChupaText calls
|
21
|
+
the `decompose` method of your decomposer. Your decomposer needs to
|
22
|
+
decompose the input data and `yield` extracted text data or other
|
23
|
+
format data that will be decomposed by other decomposers. Your
|
24
|
+
decomposer can `yield` multiple times.
|
25
|
+
|
26
|
+
If your decomposer decomposes an archive file such as tar and zip
|
27
|
+
archives, your `decompose` method will `yield` other format data. If
|
28
|
+
your decomposer extracts text and meta-data from an input such as
|
29
|
+
HTML, your `decompose` method will `yield` text data.
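
The calling convention can be sketched without ChupaText itself. In this minimal illustration, `SketchData` is a stand-in for the real data class and `LinesDecomposer` handles a made-up `lines` format. Both are assumptions for this sketch only.

```ruby
# Stand-in for the input data, reduced to what this sketch needs.
SketchData = Struct.new(:extension, :body)

# Hypothetical decomposer for a made-up "lines" format.
class LinesDecomposer
  # Return true only for data this decomposer can handle.
  def target?(data)
    data.extension == "lines"
  end

  # Yield once per line, so one input produces multiple data.
  def decompose(data)
    data.body.each_line do |line|
      yield(SketchData.new("txt", line.chomp))
    end
  end
end

decomposer = LinesDecomposer.new
data = SketchData.new("lines", "first\nsecond\n")

decomposed = []
if decomposer.target?(data)
  decomposer.decompose(data) { |datum| decomposed << datum }
end
puts decomposed.map(&:body).inspect
```

The caller checks `target?` first and calls `decompose` only when it returns `true`. The block receives each yielded data in order.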
|
30
|
+
|
31
|
+
## Example
|
32
|
+
|
33
|
+
Let's create a simple XML decomposer as an example. It extracts text
|
34
|
+
data from input XML.
|
35
|
+
|
36
|
+
For example, here is an input XML:
|
37
|
+
|
38
|
+
```xml
|
39
|
+
<root>
|
40
|
+
Hello <em>&amp;</em> World!
|
41
|
+
</root>
|
42
|
+
```
|
43
|
+
|
44
|
+
The XML decomposer extracts the following text:
|
45
|
+
|
46
|
+
```text
|
47
|
+
Hello & World!
|
48
|
+
```
|
49
|
+
|
50
|
+
ChupaText provides the `chupa-text-generate-decomposer` command. It
|
51
|
+
generates skeleton code for a new decomposer. Let's use it.
|
52
|
+
|
53
|
+
`chupa-text-generate-decomposer` accepts the required information through
|
54
|
+
command line options or reading from standard input. You can confirm
|
55
|
+
the required information with the `--help` option:
|
56
|
+
|
57
|
+
```text
|
58
|
+
% chupa-text-generate-decomposer --help
|
59
|
+
Usage: chupa-text-generate-decomposer [options]
|
60
|
+
--name=NAME Decomposer name
|
61
|
+
(e.g.: html)
|
62
|
+
--extensions=EXTENSION1,EXTENSION2,...
|
63
|
+
Target file extensions
|
64
|
+
(e.g.: htm,html,xhtml)
|
65
|
+
--mime-types=TYPE1,TYPE2,... Target MIME types
|
66
|
+
(e.g.: text/html,application/xhtml+xml)
|
67
|
+
--author=AUTHOR Author
|
68
|
+
(e.g.: 'Your Name')
|
69
|
+
(default: Kouhei Sutou)
|
70
|
+
--email=EMAIL Author E-mail
|
71
|
+
(e.g.: your@email.address)
|
72
|
+
(default: kou@clear-code.com)
|
73
|
+
--license=LICENSE License
|
74
|
+
(e.g.: MIT)
|
75
|
+
(default: LGPLv2.1 or later)
|
76
|
+
```
|
77
|
+
|
78
|
+
Some pieces of information have default values. In the above case,
|
79
|
+
`--author`, `--email` and `--license` have default values.
|
80
|
+
|
81
|
+
The XML decomposer uses the following information:
|
82
|
+
|
83
|
+
* `--name`: `xml`
|
84
|
+
* `--extensions`: `xml`
|
85
|
+
* `--mime-types`: `text/xml`
|
86
|
+
|
87
|
+
Run with the above information:
|
88
|
+
|
89
|
+
```text
|
90
|
+
% chupa-text-generate-decomposer --name xml --extensions xml --mime-types text/xml
|
91
|
+
Creating directory: chupa-text-decomposer-xml
|
92
|
+
Creating file: chupa-text-decomposer-xml/chupa-text-decomposer-xml.gemspec
|
93
|
+
Creating file: chupa-text-decomposer-xml/Gemfile
|
94
|
+
Creating file: chupa-text-decomposer-xml/Rakefile
|
95
|
+
Creating file: chupa-text-decomposer-xml/LICENSE.txt
|
96
|
+
Creating directory: chupa-text-decomposer-xml/lib/chupa-text/decomposers
|
97
|
+
Creating file: chupa-text-decomposer-xml/lib/chupa-text/decomposers/xml.rb
|
98
|
+
Creating directory: chupa-text-decomposer-xml/test
|
99
|
+
Creating file: chupa-text-decomposer-xml/test/test-xml.rb
|
100
|
+
Creating file: chupa-text-decomposer-xml/test/helper.rb
|
101
|
+
Creating file: chupa-text-decomposer-xml/test/run-test.rb
|
102
|
+
```
|
103
|
+
|
104
|
+
`chupa-text-generate-decomposer` generates a directory that is named
|
105
|
+
`chupa-text-decomposer-#{name}/`.
|
106
|
+
|
107
|
+
Look at `lib/chupa-text/decomposers/xml.rb`:
|
108
|
+
|
109
|
+
```
|
110
|
+
module ChupaText
|
111
|
+
module Decomposers
|
112
|
+
class Xml < Decomposer
|
113
|
+
def target?(data)
|
114
|
+
["xml"].include?(data.extension) or
|
115
|
+
["text/xml"].include?(data.mime_type)
|
116
|
+
end
|
117
|
+
|
118
|
+
def decompose(data)
|
119
|
+
raise NotImplementedError, "#{self.class}##{__method__} isn't implemented yet."
|
120
|
+
text = "IMPLEMENTED ME"
|
121
|
+
text_data = TextData.new(text)
|
122
|
+
yield(text_data)
|
123
|
+
end
|
124
|
+
end
|
125
|
+
end
|
126
|
+
end
|
127
|
+
```
|
128
|
+
|
129
|
+
The generated code implements the `target?` method but doesn't implement
|
130
|
+
the `decompose` method completely. Let's implement the `decompose` method:
|
131
|
+
|
132
|
+
```
|
133
|
+
require "cgi"
|
134
|
+
|
135
|
+
# ...
|
136
|
+
def decompose(data)
|
137
|
+
text = CGI.unescapeHTML(untag(data.body).strip)
|
138
|
+
text_data = TextData.new(text)
|
139
|
+
yield(text_data)
|
140
|
+
end
|
141
|
+
|
142
|
+
private
|
143
|
+
def untag(xml)
|
144
|
+
xml.gsub(/<.+?>/m, "")
|
145
|
+
end
|
146
|
+
# ...
|
147
|
+
```
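
To see what this implementation does to the example XML, the same two steps can be run outside a decomposer. This standalone sketch assumes the input escapes the ampersand as `&amp;`. Note that the simple regular expression in `untag` is not a full XML parser. For example, a comment that contains `>` would not be removed cleanly.

```ruby
require "cgi"

# Same helper as in the decomposer: remove anything that looks like a tag.
def untag(xml)
  xml.gsub(/<.+?>/m, "")
end

xml = <<~XML
  <root>
    Hello <em>&amp;</em> World!
  </root>
XML

# Remove tags, trim surrounding white space, then decode entities.
text = CGI.unescapeHTML(untag(xml).strip)
puts text
```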
|
148
|
+
|
149
|
+
`chupa-text-generate-decomposer` also generates a test. Run the test:
|
150
|
+
|
151
|
+
```
|
152
|
+
% bundle install
|
153
|
+
% rake
|
154
|
+
/usr/bin/ruby2.0 test/run-test.rb
|
155
|
+
Loaded suite .
|
156
|
+
Started
|
157
|
+
F
|
158
|
+
===============================================================================
|
159
|
+
Failure:
|
160
|
+
test_body(decompose)
|
161
|
+
/tmp/chupa-text-decomposer-xml/test/test-xml.rb:24:in `test_body'
|
162
|
+
21: def test_body
|
163
|
+
22: input_body = "TODO (input)"
|
164
|
+
23: expected_text = "TODO (extracted)"
|
165
|
+
=> 24: assert_equal([expected_text],
|
166
|
+
25: decompose(input_body).collect(&:body))
|
167
|
+
26: end
|
168
|
+
27: end
|
169
|
+
<["TODO (extracted)"]> expected but was
|
170
|
+
<["TODO (input)"]>
|
171
|
+
|
172
|
+
diff:
|
173
|
+
? ["TODO (ex tracted)"]
|
174
|
+
? inpu
|
175
|
+
===============================================================================
|
176
|
+
|
177
|
+
|
178
|
+
Finished in 0.013355116 seconds.
|
179
|
+
|
180
|
+
1 tests, 1 assertions, 1 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications
|
181
|
+
0% passed
|
182
|
+
|
183
|
+
74.88 tests/s, 74.88 assertions/s
|
184
|
+
rake aborted!
|
185
|
+
Command failed with status (1): [/usr/bin/ruby2.0 test/run-test.rb...]
|
186
|
+
/tmp/chupa-text-decomposer-xml/Rakefile:9:in `block in <top (required)>'
|
187
|
+
```
|
188
|
+
|
189
|
+
The generated test fails because the test has placeholders. Look at the
|
190
|
+
generated test:
|
191
|
+
|
192
|
+
```
|
193
|
+
class TestXml < Test::Unit::TestCase
|
194
|
+
include Helper
|
195
|
+
|
196
|
+
def setup
|
197
|
+
@decomposer = ChupaText::Decomposers::Xml.new({})
|
198
|
+
end
|
199
|
+
|
200
|
+
sub_test_case("decompose") do
|
201
|
+
def decompose(input_body)
|
202
|
+
data = ChupaText::Data.new
|
203
|
+
data.mime_type = "text/xml"
|
204
|
+
data.body = input_body
|
205
|
+
|
206
|
+
decomposed = []
|
207
|
+
@decomposer.decompose(data) do |decomposed_data|
|
208
|
+
decomposed << decomposed_data
|
209
|
+
end
|
210
|
+
decomposed
|
211
|
+
end
|
212
|
+
|
213
|
+
def test_body
|
214
|
+
input_body = "TODO (input)"
|
215
|
+
expected_text = "TODO (extracted)"
|
216
|
+
assert_equal([expected_text],
|
217
|
+
decompose(input_body).collect(&:body))
|
218
|
+
end
|
219
|
+
end
|
220
|
+
end
|
221
|
+
```
|
222
|
+
|
223
|
+
`test_body` has TODO code as placeholders:
|
224
|
+
|
225
|
+
```
|
226
|
+
# ...
|
227
|
+
def test_body
|
228
|
+
input_body = "TODO (input)"
|
229
|
+
expected_text = "TODO (extracted)"
|
230
|
+
assert_equal([expected_text],
|
231
|
+
decompose(input_body).collect(&:body))
|
232
|
+
end
|
233
|
+
# ...
|
234
|
+
```
|
235
|
+
|
236
|
+
Fill in the TODOs with test XML and the expected result:
|
237
|
+
|
238
|
+
```
|
239
|
+
# ...
|
240
|
+
def test_body
|
241
|
+
input_body = <<-XML
|
242
|
+
<root>
|
243
|
+
Hello <em>&amp;</em> World!
|
244
|
+
</root>
|
245
|
+
XML
|
246
|
+
expected_text = "Hello & World!"
|
247
|
+
assert_equal([expected_text],
|
248
|
+
decompose(input_body).collect(&:body))
|
249
|
+
end
|
250
|
+
# ...
|
251
|
+
```
|
252
|
+
|
253
|
+
Run the test again:
|
254
|
+
|
255
|
+
```
|
256
|
+
% rake
|
257
|
+
/usr/bin/ruby2.0 test/run-test.rb
|
258
|
+
Loaded suite .
|
259
|
+
Started
|
260
|
+
.
|
261
|
+
|
262
|
+
Finished in 0.000915172 seconds.
|
263
|
+
|
264
|
+
1 tests, 1 assertions, 0 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications
|
265
|
+
100% passed
|
266
|
+
|
267
|
+
1092.69 tests/s, 1092.69 assertions/s
|
268
|
+
```
|
269
|
+
|
270
|
+
The test passes!
|
271
|
+
|
272
|
+
You can release the decomposer with the following command. It requires an
|
273
|
+
account on https://rubygems.org/.
|
274
|
+
|
275
|
+
```
|
276
|
+
% rake release
|
277
|
+
```
|
278
|
+
|
279
|
+
Now you know how to create a new decomposer.
|
280
|
+
|
281
|
+
## API reference
|
282
|
+
|
283
|
+
### `data`
|
284
|
+
|
285
|
+
Both `target?` and `decompose` receive an argument, `data`. It is a
|
286
|
+
{ChupaText::Data} instance or an instance of its subclass. You need
|
287
|
+
to read only the API reference manual for {ChupaText::Data}. You shouldn't
|
288
|
+
use subclass-specific API. It is not portable.
|
289
|
+
|
290
|
+
### `target?`
|
291
|
+
|
292
|
+
`target?` should return `true` or `false`. The decomposer should
|
293
|
+
return `true` if the decomposer can decompose received `data`, `false`
|
294
|
+
otherwise.
|
295
|
+
|
296
|
+
### `decompose`
|
297
|
+
|
298
|
+
`decompose` decomposes input `data` and `yield`s extracted text data or
|
299
|
+
decomposed data of another type. `decompose` can `yield` zero or more
|
300
|
+
times.
|
301
|
+
|
302
|
+
Here is template code that `yield`s extracted text data:
|
303
|
+
|
304
|
+
```
|
305
|
+
def decompose(data)
|
306
|
+
text = extract_text(data)
|
307
|
+
text_data = ChupaText::TextData.new(text)
|
308
|
+
# text_data["meta-data1"] = meta_data_value1
|
309
|
+
# text_data["meta-data2"] = meta_data_value2
|
310
|
+
# ...
|
311
|
+
yield(text_data)
|
312
|
+
end
|
313
|
+
```
|
314
|
+
|
315
|
+
See
|
316
|
+
[lib/chupa-text/decomposers/csv.rb](https://github.com/ranguba/chupa-text/blob/master/lib/chupa-text/decomposers/csv.rb)
|
317
|
+
as an example of extracting text data.
|
318
|
+
|
319
|
+
Here is template code that `yield`s data of another type:
|
320
|
+
|
321
|
+
```
|
322
|
+
def decompose(data)
|
323
|
+
entries = decompose_archive(data)
|
324
|
+
entries.each do |entry|
|
325
|
+
path = entry.path
|
326
|
+
if entry.respond_to?(:read)
|
327
|
+
# The input must have "read" method.
|
328
|
+
input = entry
|
329
|
+
else
|
330
|
+
# If the entry doesn't have "read" method, wrap String data
|
331
|
+
# by StringIO.
|
332
|
+
input = StringIO.new(entry.data)
|
333
|
+
end
|
334
|
+
decomposed_data = ChupaText::VirtualFileData.new(path, input)
|
335
|
+
decomposed_data.source = data
|
336
|
+
yield(decomposed_data)
|
337
|
+
end
|
338
|
+
end
|
339
|
+
```
|
340
|
+
|
341
|
+
See
|
342
|
+
[lib/chupa-text/decomposers/tar.rb](https://github.com/ranguba/chupa-text/blob/master/lib/chupa-text/decomposers/tar.rb)
|
343
|
+
as an example of decomposing to data of another type.
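
The `respond_to?(:read)` branch in the template can be tried on its own. In this sketch, `StringEntry` and `readable_input` are hypothetical names. The only point is that both kinds of entry end up as an object with a `read` method.

```ruby
require "stringio"

# Hypothetical archive entry that only exposes its content as a String.
StringEntry = Struct.new(:path, :data)

# Return an object that responds to "read" for the given entry.
def readable_input(entry)
  if entry.respond_to?(:read)
    entry
  else
    # Wrap String data with StringIO so that it can be read like a file.
    StringIO.new(entry.data)
  end
end

io_entry = StringIO.new("from IO")
string_entry = StringEntry.new("hello.txt", "from String")

puts readable_input(io_entry).read
puts readable_input(string_entry).read
```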
|