rika 1.11.1-java → 2.0.0-java

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 2964b22b0e32e770c6ace90cf6b3ad1b05a54899b7838696c037c95645f1e73a
-   data.tar.gz: 3557bba0a54a62f00c9c4c148be307cc7132d806fb13178deb95dbb8f566eb33
+   metadata.gz: 70dc9cbc6d2da17d3bcff5cdeb3a905fdbcf2e2b9d24c131558566cbfc19eada
+   data.tar.gz: fb81f98476322d2291488b2bd2cb9c89a1544b4ee1f85ce22ab773c30ac4765c
  SHA512:
-   metadata.gz: aae34480ff9bf6ee7e9a00221a27fb6780cc60c8425644bf11bbcbf9875d2684a964a3bd88fccc28cc9c42b20d075548d531d485dada7ffd50a3d3eddc83294e
-   data.tar.gz: 530de9844daa28dddb9b149a0a671eaaa84d6e4dfc4aa840af609121e89a8d2a6a5f2f70e78523e9d62c0eef6472291b50e44a0899c9e464057b76dc89cfed3c
+   metadata.gz: 780650b18df09662b8b67e7321641c2cf36d78335c5cf1f59f5d81a23cf262dc478958ad7aab306f80deb235e9c9910332086b3ea5231888ea96c26abd2d2505
+   data.tar.gz: afa841ebfe9ebb1bee5ccb0550fedde43edab6e1d6b86eed671537604d64b318c0abc7e41ffee054443f6bfd4a7b3c9423926e137fba4fdd95da66775e35d695
data/.gitignore CHANGED
@@ -1,7 +1,10 @@
  *.gem
  *.rbc
+ .DS_Store
  .bundle
  .config
+ coverage/
+ .idea/
  .yardoc
  Gemfile.lock
  InstalledFiles
@@ -10,13 +13,11 @@ coverage
  doc/
  lib/bundler/man
  pkg
+ projectFilesBackup/
  rdoc
  spec/reports
+ spec/rspec-failed-tests-control-file.txt
  target/
  test/tmp
  test/version_tmp
  tmp
-
- .DS_Store
- projectFilesBackup/
- .idea/
data/.rubocop.yml ADDED
@@ -0,0 +1,49 @@
+ require: rubocop-rspec
+ AllCops:
+   NewCops: enable
+   Include:
+     - '**/*.rb'
+     - '*.gemspec'
+     - '**/Rakefile'
+     - '**/Gemfile'
+     - 'bin/rika'
+ Gemspec/RequiredRubyVersion:
+   Enabled: false
+ Layout/HashAlignment:
+   Enabled: false
+ Metrics/AbcSize:
+   Enabled: false
+ Metrics/BlockLength:
+   Enabled: false
+ Metrics/MethodLength:
+   Enabled: false
+ RSpec/ExampleLength:
+   Enabled: false
+ RSpec/ExpectOutput:
+   Enabled: false
+ RSpec/InstanceVariable:
+   Enabled: false
+ RSpec/MultipleExpectations:
+   Enabled: false
+ RSpec/MultipleMemoizedHelpers:
+   Enabled: false
+ Style/AccessModifierDeclarations:
+   Enabled: false
+ Style/FetchEnvVar:
+   Enabled: false
+ Style/GuardClause:
+   Enabled: false
+ Style/IfUnlessModifier:
+   Enabled: false
+ Style/Lambda:
+   Enabled: false
+ Style/LambdaCall:
+   Enabled: false
+ Style/NumericLiterals:
+   Enabled: false
+ Style/PercentLiteralDelimiters:
+   Enabled: false
+ Style/StderrPuts:
+   Enabled: false
+ Style/TrailingUnderscoreVariable:
+   Enabled: false
data/Gemfile CHANGED
@@ -1,4 +1,16 @@
+ # frozen_string_literal: true
+
  source 'https://rubygems.org'

  # Specify your gem's dependencies in rika.gemspec
  gemspec
+
+ group :development do
+   gem 'pry'
+   gem 'rake', '~> 13.0'
+   gem 'rspec', '~> 3.9'
+   gem 'rubocop'
+   gem 'rubocop-rspec'
+   gem 'simplecov', require: false
+   gem 'webrick', '~> 1.6'
+ end
data/README.md CHANGED
@@ -1,125 +1,262 @@
-
  # Rika

- Rika is a [JRuby](https://www.jruby.org) wrapper for the [Apache Tika](http://tika.apache.org/) Java library, which extracts text and metadata from files and resources of [many different formats](https://tika.apache.org/1.24.1/formats.html).
-
- _Caution: This gem only works with [JRuby](https://www.jruby.org)._
+ [Rika](https://github.com/keithrbennett/rika) is a [JRuby](https://www.jruby.org) wrapper for
+ the [Apache Tika](http://tika.apache.org/) Java library, which extracts text and metadata from files and resources
+ of [many different formats](https://tika.apache.org/1.24.1/formats.html).

- Rika currently supports some basic and commonly used functions of Tika. Future development may add Ruby support for more Tika functionality, and perhaps a command line interface as well. See the [Other Tika Resources](#other-tika-resources) section for alternatives to Rika that may suit more demanding needs.
+ Rika can be used as a library in your Ruby code, or on the command line.

- [![Code Climate](https://codeclimate.com/github/keithrbennett/rika.png)](https://codeclimate.com/github/keithrbennett/rika)
- [![Build Status](https://travis-ci.org/keithrbennett/rika.png?branch=master)](https://travis-ci.org/keithrbennett/rika)
+ For class and method level documentation, please use [YARD](https://rubydoc.info/gems/yard).
+ You can `gem install yard`, then run `yard doc` from the project root,
+ and then open the `doc/index.html` file in a browser.

- ## Usage

- For a quick start with the simplest use cases, the following functions are provided to get what you need in a single function call, for your convenience:
-
- ```ruby
- require 'rika'
-
- content = Rika.parse_content('x.pdf') # string containing all content text
- metadata = Rika.parse_metadata('x.pdf') # hash containing the document metadata
- content, metadata = Rika.parse_content_and_metadata('x.pdf') # both of the above
- ```
+ ### Requirements

- A URL can be used instead of a filespec wherever a data source is specified:
+ * This gem only works with [JRuby](https://www.jruby.org).
+ * The [Apache Tika](http://tika.apache.org/) jar file must be installed on your system.
+   See the [Installation](#installation) section below for more information.

- ```ruby
- content, metadata = Rika.parse_content_and_metadata('https://github.com/keithrbennett/rika')
- ```
+ Rika currently supports some basic and commonly used functions of Tika.
+ Since it runs on JRuby, the Tika library's Java methods can be called directly from Ruby code
+ for more advanced needs.
+ See the [Other Tika Resources](#other-tika-resources) section of this document for alternatives to
+ Rika that may suit more demanding needs.

- For other use cases and finer control, you can work directly with the Rika::Parser object:
+ Rika can be used either as a gem in your own Ruby project, or on the command line using the provided executable.

- ```ruby
- require 'rika'
+ ## Usage in Your Ruby Code

- parser = Rika::Parser.new('x.pdf')
+ > [!IMPORTANT]
+ > **It is necessary to call `Rika.init` before using Rika.** This is because the loading of the Tika library
+ has been put in an init method, rather than at load time, so that 'jar file not found or specified' errors
+ do not prevent your application from loading. If you forget to call `Rika.init`, you may see seemingly unrelated
+ error messages.

- # Return the content of the document:
- parser.content
+ As a convenience, the `Rika.init` method is called automatically when you call the Rika module methods. However,
+ if you access other Rika classes and methods, `init` may not have been called yet, so you should call it yourself.

- # Return the metadata of the document:
- parser.metadata
+ ----

- # Return the media type for the document, e.g. "application/pdf":
- parser.media_type
+ The Rika `parse` method returns a `Rika::ParseResult` object that contains the parsed text and
+ various pieces of metadata. The `ParseResult` class' main methods are:

- # Return only the first 10000 chars of the content:
- parser = Rika::Parser.new('x.pdf', 10000)
- parser.content # 10000 first chars returned
+ * `content` - the parsed text
+ * `metadata` - a hash of metadata key/value pairs
+ * `content_type` - the content type of the parsed data, e.g. "text/plain; charset=UTF-8"
+ * `language` - the language of the parsed data, e.g. "en"
+ * `data_source` - the data source, either a filespec or a URL

- # Return content from URL
- parser = Rika::Parser.new('http://example.com/x.pdf', 200)
- parser.content
+ For example:

- # Return the language for the content
- parser = Rika::Parser.new('german-document.pdf')
- parser.language
- => "de"
+ ```ruby
+ require 'rika'

- # Check whether the language identification is certain enough to be trusted
- parser.language_is_reasonably_certain?
-
+ parse_result = Rika.parse('x.pdf')  # returns a Rika::ParseResult object
+ parse_result.content                # string containing all content text
+ parse_result.text                   # 'text' is an alias for 'content'
+ parse_result.metadata               # hash containing the document metadata
+ parse_result.content_type           # e.g. "application/pdf"
+ parse_result.language               # e.g. "en"
+ parse_result.data_source            # e.g. "x.pdf"
  ```

- #### Simple Command Line Use
-
- Since Ruby supports the `-r` option to require a library, and the `-e` option to evaluate a string of code, you can easily do simple parsing on the command line, such as:
+ A URL can be used instead of a filespec wherever a data source is specified:

- ```
- ruby -r rika -e 'puts Rika.parse_content("x.pdf")'
+ ```ruby
+ parse_result = Rika.parse('https://github.com/keithrbennett/rika')
  ```

- You could also parse the metadata and output it as JSON as follows:
+ The Rika module also has the following methods:

- ```
- ruby -r rika -r json -e 'puts Rika.parse_metadata("x.pdf").to_json'
+ ```ruby
+ Rika.language("magnifique")  # => "fr"
+ Rika.tika_version            # => "2.9.0"
  ```

- If you want to get both content and metadata in JSON format, this would do that:
+ ## Command Line Executable Usage
+
+ Rika can also be used on the command line using the `rika` executable. The simplest form is to
+ specify one or more filespecs or URLs as arguments:

+ ```bash
+ rika x.pdf https://github.com/keithrbennett/rika
  ```
- ruby -r rika -r json -e 'c,m = Rika.parse_content_and_metadata("tw.pdf"); puts({ c: c, m: m }.to_json)'
- ```
-
- Using the [rexe](https://github.com/keithrbennett/rexe) gem, that can be made much more concise:
+ Here is the help text:

  ```
- rexe -r rika -oj 'c,m = Rika.parse_content_and_metadata("x.pdf"); { c: c, m: m }'
+ Rika v2.0.0 (Tika v2.9.0) - https://github.com/keithrbennett/rika
+
+ Usage: rika [options] <file or url> [...file or url...]
+ Output formats are: [a]wesome_print, [t]o_s, [i]nspect, [j]son, [J] for pretty json, and [y]aml.
+ If a format contains two letters, the first will be used for metadata, the second for text.
+ Values for the text, metadata, and as_array boolean options may be specified as follows:
+ Enable: +, true, yes, [empty]
+ Disable: -, false, no, [long form option with no- prefix, e.g. --no-metadata]
+
+     -f, --format FORMAT              Output format (default: at)
+     -m, --[no-]metadata [FLAG]       Output metadata (default: true)
+     -t, --[no-]text [FLAG]           Output text (default: true)
+     -k, --[no-]key-sort [FLAG]       Sort metadata keys case insensitively (default: true)
+     -s, --[no-]source [FLAG]         Document source file or URL
+     -a, --[no-]as-array [FLAG]       Output all parsed results as an array (default: false)
+     -v, --version                    Output version
+     -h, --help                       Output help
+ ```
+
+ ### Outputting Only Metadata or Only Parsed Text
+
+ The default setting is to output both metadata and text. To disable either, use the `-m` or `-t` options
+ with a disabling flag, e.g. `-m-`, `-m false`, `-m no`, or `--no-metadata` to disable metadata.
+
+ ### Outputting the Document Source Identifier (Filespec or URL)
+
+ There are many times when it is useful to know the source of the document. For example, if you are processing
+ a large number of documents, you may want to know which document a particular piece of output came from.
+
+ The document source identifier is output by default. To disable it, use the `-s` option with a disabling flag, e.g. `-s-`,
+ `-s false`, `-s no`, or `--no-source`.
+
+ ### Output Formats
+
+ The `-f` option can be used to specify the output format. The default is `at`, which means that the metadata will be
+ output in awesome_print format, and the text will be output using `to_s`
+ (i.e. without any changes to the parsed string).
+
+ If a single argument to `-f` is specified, it will be used for both metadata and text. If two arguments are specified,
+ the first will be used for metadata and the second for the parsed text.
+
+ ### Sorting of Metadata Keys
+
+ By default, metadata keys will be sorted case insensitively. To disable this, use the `-k` option
+ with a disabling flag, i.e. `-k-`, `-k false`, `-k no`, or `--no-key-sort`.
+
+ The case insensitivity is implemented by using `String#downcase`.
+ This may not sort correctly on some non-English systems.
+
+ ### Specifying Command Line Options in the RIKA_OPTIONS Environment Variable
+
+ If you find yourself using the same options over and over again, you can put them in the `RIKA_OPTIONS` environment
+ variable. For example, if the default behavior of sorting keys does not work for your language, you can disable it
+ for all invocations of the `rika` command by specifying `-k-` in the `RIKA_OPTIONS` environment variable.
+
+ ### Machine Readable Data Support
+
+ If both metadata and text are output, and the same output format is used for both, and that format is JSON
+ (plain or "pretty") or YAML, then the output per document will be a single JSON or YAML hash representation
+ containing both the metadata and the text (whose keys are "metadata" and "text"). This enables piping
+ the results of multiple documents to a file or to another program that can use it as a data source.
+ In addition, when processing multiple files, this streaming approach will be more efficient
+ than calling Rika separately for each file, since each invocation of the rika command requires starting up
+ a Java Virtual Machine.
+
+ If the `-a` (`--as-array`) option is specified, then the output will be an array of such hashes, one for each file.
+ This enables the output to be used as a data source for programs that can process an array of hashes, e.g. for analysis.
+
+ For example, here is how to use Rika and [rexe](https://github.com/keithrbennett/rexe) to get a tally
+ of content types for a set of documents, sorted by content type:
+
+ ```bash
+ $ rika -t- -s- -fy -a spec/fixtures/* | \
+   rexe -iy -oa -mb "map { |r| r['metadata']['Content-Type'] }.tally.sort.to_h"
+ {
+     "application/msword" => 1,
+     "application/octet-stream" => 1,
+     "application/pdf" => 1,
+     "application/vnd.openxmlformats-officedocument.wordprocessingml.document" => 1,
+     "image/jpeg" => 2,
+     "text/plain; charset=ISO-8859-1" => 1,
+     "text/plain; charset=UTF-8" => 6,
+     "text/x-matlab; charset=ISO-8859-1" => 1
+ }
+ ```
+ Here is a breakdown of the above command:
+
+ * `rika`
+   * `-t-` suppresses the output of text
+   * `-s-` suppresses the output of the source identifier
+   * `-fy` outputs the data in YAML format.
+   * `-a` option causes the output to be an array of hashes, one for each file
+ * `rexe`
+   * `-iy` indicates that the input is YAML
+   * `-oa` indicates that the output should be done using awesome_print/amazing_print
+   * `-mb` indicates that all input should be ingested as a single string ("b" for "big string", as opposed to streamed)
+
+ * Ruby code passed to `rexe`
+   * `map` is called on the array to extract the content type from each parsed document hash
+   * `tally` is called on the resulting array to get the count of each content type
+   * `sort` is called on the hash to sort it by key (content type) and return an array of 2-element arrays
+   * `to_h` is called on the array of 2-element arrays to convert it back to a hash
+
+ Here is another example that prints out the 5 most common words in all the parsed text, and their counts,
+ as "pretty" JSON:
+
+ ```bash
+ $ rika -m- spec/fixtures/* | \
+   rexe -in -oJ -mb 'downcase \
+     .split \
+     .tally \
+     .sort_by { |word, count| [-count, word] }
+     .first(5) \
+     .to_h'
+
+ {
+   "the": 35,
+   "to": 30,
+   "woods": 25,
+   "i": 25,
+   "and": 25
+ }
  ```
-
- ...and changing the `-oj` option gives you access to other output formats such as "Pretty JSON", YAML, and AwesomePrint (a very human readable format).
-

  ## Installation

- Add this line to your application's Gemfile. Use `gem` or `jgem` depending on your JRuby installation:
+ * Install [JRuby](https://www.jruby.org) if you don't already have it. Ruby version managers such as
+   [rvm](https://rvm.io/) and [rbenv](https://github.com/rbenv) can simplify this process.
+ * Download the [Apache Tika](http://tika.apache.org/) jar file from
+   http://tika.apache.org/download.html (look for the "tika-app" jar file).
+   Put it in a place that makes sense for your system, such as `/usr/local/lib`.
+ * Configure the `TIKA_JAR_FILESPEC` environment variable to point to the Tika jar file.
+   For example, if you are using tika-app-2.9.0.jar, and put the jar file in `/opt/jars`,
+   then the setting of the environment variable should look like this:

-     gem 'rika' # or: jgem 'rika'
+   ```bash
+   export TIKA_JAR_FILESPEC=/opt/jars/tika-app-2.9.0.jar
+   ```

- And then execute:
+   You can put this in your `.bashrc` or `.zshrc` file to make it persistent.

-     $ bundle
+ * Install the gem:

- Or install it yourself as:
+   ```bash
+   gem install rika
+   ```

-     $ gem install rika # or: jgem install rika
+   or, if you're using [bundler](https://bundler.io/), add this to your Gemfile:

- ## Other Tika Resources
-
- * For more sophisticated use of Tika, you can use the Tika jar file directly in your JRuby code. After installing the `rika` gem, the Tika jar file will be located in `$GEM_HOME/gems/rika-[rika-version]-java/target/dependency/tika-core-[tika-version].jar`.
+   ```ruby
+   gem 'rika'
+   ```

- * Tika also provides another jar file containing a RESTful server that you can run on the command line. You can download this server jar from http://tika.apache.org/download.html.
- See the "Running the Tika Server as a Jar file" section of https://cwiki.apache.org/confluence/display/TIKA/TikaServer for more information.
+   and then run `bundle install`.
+ * Verify that it works by running (as an example) `rika -m https://www.github.com`.
+   You should see key/value pairs representing the metadata of the GitHub home page.

- * @chrismattman and others have provided a [Python library and CLI](https://github.com/chrismattmann/tika-python) that interfaces with the Tika server.
+ This gem has been tested with JRuby managed by rvm. It should work with other Ruby version managers and
+ without any version manager at all, but those configurations have not been tested.

- * A general Tika wiki is at https://cwiki.apache.org/confluence/display/tika.
+ ## Other Tika Resources

+ * The Apache Tika wiki is at https://cwiki.apache.org/confluence/display/tika.

- ## Credits
+ * Tika also provides another jar file containing a RESTful server that you can run on the command line.
+   You can download this server jar from http://tika.apache.org/download.html (look for the "tika-server-standard" jar
+   file).
+   See the "Running the Tika Server as a Jar file" section of https://cwiki.apache.org/confluence/display/TIKA/TikaServer
+   for more information.

- Richard Nyström (@ricn) is the original author of Rika, but has not been able to maintain it since 2015. In July 2020, Richard transferred the project to Keith Bennett (@keithrbennett), who had made made some contributions back in 2013.
+ * @chrismattmann and others have provided a ["tika_python" Python library and CLI](https://github.com/chrismattmann/tika-python)
+   that interfaces with the Tika server.

  ## Contributing
 
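The two `rexe` one-liners in the README diff above reduce to ordinary `Enumerable` chains. The following is a minimal sketch in plain Ruby, assuming the output of `rika` has already been read into memory. `SAMPLE_DOCUMENTS`, `content_type_tally`, and `top_words` are illustrative names that belong to neither Rika nor rexe, and the sample data is made up.

```ruby
# frozen_string_literal: true

# Made-up stand-in for the array of hashes that `rika -t- -s- -fy -a` would
# yield once its YAML output has been parsed (an assumption, not real output).
SAMPLE_DOCUMENTS = [
  { 'metadata' => { 'Content-Type' => 'text/plain; charset=UTF-8' } },
  { 'metadata' => { 'Content-Type' => 'application/pdf' } },
  { 'metadata' => { 'Content-Type' => 'text/plain; charset=UTF-8' } }
].freeze

# Counts documents per content type and orders the result by content type.
def content_type_tally(documents)
  documents.map { |doc| doc['metadata']['Content-Type'] }.tally.sort.to_h
end

# Returns the `limit` most frequent words; ties are broken alphabetically.
def top_words(text, limit = 5)
  text.downcase.split.tally.sort_by { |word, count| [-count, word] }.first(limit).to_h
end

p content_type_tally(SAMPLE_DOCUMENTS)
p top_words('The woods and the road and the snow', 2)
```

`Enumerable#tally` needs Ruby 2.7 or later. Sorting on `[-count, word]` gives descending counts with an alphabetical tie-break, which keeps the result deterministic when several words share a count.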
data/RELEASE_NOTES.md CHANGED
@@ -1,5 +1,31 @@
  ## Release Notes

+ #### v2.0.0
+
+ * Add features:
+   * command line interface
+   * support for JSON, Pretty JSON, YAML, AwesomePrint, to_s, and inspect output formats
+   * optional array mode (previously only nonarray streaming mode).
+   * more persistent options can be specified in an environment variable, `RIKA_OPTIONS`.
+   * metadata keys can optionally be sorted alphabetically (not all languages though).
+   * properties added by Rika to the metadata: data-source, language
+   * Filespec or URL data source identifier can optionally be output with metadata and text.
+ * Add support for Tika 2.8.0, breaks compatibility with Tika 1.x.
+ * Remove tika-app-1.24.1.jar from code base and gem (but it is still in git history).
+ * Tika jar file is now downloaded by the user and found via environment variable `TIKA_JAR_FILESPEC`.
+ * New class ParseResult created to simplify result access and Parser class.
+ * Add `Rika.tika_version`.
+ * Add `webrick` dependency, needed for current versions of Ruby.
+ * Remove deprecated methods `Parser#available_metadata` and `Parser#metadata_exists?`.
+ * Move `Parser#language` to `Rika.language`.
+ * Remove `Parser#language_is_reasonably_certain?`, no longer supported by Tika.
+ * Remove obsolete `LanguageIdentifier` import. Otherwise updated language detection.
+ * Various refactorings and improvements.
+ * Add SimpleCov test coverage and Rubocop linting tools to project.
+ * Set up RSpec configuration to enable --only-failures and --next-failure options.
+
+
+
  #### v1.11.1

  * Add Apache-2.0 license to gemspec.
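The v2.0.0 notes and the README describe case-insensitive sorting of metadata keys based on `String#downcase`. A minimal sketch of that idea follows; `sort_keys_case_insensitively` is a hypothetical helper written for illustration, not Rika's actual method, and the sample metadata is invented.

```ruby
# frozen_string_literal: true

# Sorts a metadata hash by key, ignoring case, by comparing downcased keys.
def sort_keys_case_insensitively(metadata)
  metadata.sort_by { |key, _value| key.downcase }.to_h
end

# Invented metadata, for illustration only.
sample = { 'dc:title' => 'Sample', 'Content-Type' => 'text/plain', 'author' => 'Anon' }

p sample.sort.to_h.keys                     # plain sort puts uppercase keys first
p sort_keys_case_insensitively(sample).keys # downcase-based order
```

Because the comparison relies on `downcase`, keys containing characters whose ordering depends on locale rules may not sort as a native speaker would expect, which matches the caveat in the README.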
data/Rakefile CHANGED
@@ -1,11 +1,8 @@
- require "bundler/gem_tasks"
+ # frozen_string_literal: true
+
+ require 'bundler/gem_tasks'
  require 'rspec/core/rake_task'

  RSpec::Core::RakeTask.new(:spec)

- task :default => :spec
-
- desc 'Download jars'
- task :download_jars do
-   system "mvn dependency:copy-dependencies"
- end
+ task default: :spec
data/bin/rika ADDED
@@ -0,0 +1,13 @@
+ #!/usr/bin/env ruby
+ # frozen_string_literal: true
+
+ require 'rika/cli/rika_command'
+
+ begin
+   Rika.init
+ rescue Rika::TikaLoadError => e
+   $stderr.puts e.message
+   exit 1
+ end
+
+ RikaCommand.new.call
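The executable above calls `Rika.init` explicitly and turns a load failure into a clean exit. The README gives the reason: locating the Tika jar is deferred to an init method so that a missing jar cannot break `require`. Below is a minimal sketch of that deferred-loading pattern in plain Ruby. `JarBackedLibrary` and `JarLoadError` are hypothetical names, and the body only checks that the file exists; it does not reproduce Rika's implementation.

```ruby
# frozen_string_literal: true

# Hypothetical library module; it mimics the deferred-loading idea only.
module JarBackedLibrary
  # Raised when the jar cannot be located (illustrative error class).
  class JarLoadError < StandardError; end

  @initialized = false

  class << self
    # Locates the jar lazily, so that merely requiring this file cannot fail.
    # @param env [#[]] environment to read from (injectable for testing)
    def init(env = ENV)
      return true if @initialized

      filespec = env['TIKA_JAR_FILESPEC'].to_s
      raise JarLoadError, 'Environment variable TIKA_JAR_FILESPEC is not set.' if filespec.empty?
      raise JarLoadError, "Jar file not found: #{filespec}" unless File.file?(filespec)

      # A JRuby implementation would load the jar at this point.
      @initialized = true
    end

    def initialized?
      @initialized
    end
  end
end

begin
  JarBackedLibrary.init({})
rescue JarBackedLibrary::JarLoadError => e
  warn e.message # the caller decides how to report the problem
end
```

Keeping the check out of load time means an application can require the library unconditionally and handle a missing jar only on the code paths that need it.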
@@ -0,0 +1,131 @@
+ # frozen_string_literal: true
+
+ # Processes the array of arguments (ARGV by default) and returns the options, targets, and help string.
+ class ArgsParser
+   attr_reader :args, :options, :option_parser
+   private :args, :options, :option_parser
+
+   DEFAULT_OPTIONS =
+     {
+       as_array: false,
+       format: 'at', # AwesomePrint for metadata, to_s for text content
+       metadata: true,
+       text: true,
+       source: true,
+       key_sort: true
+     }.freeze
+
+   # Parses the command line arguments.
+   # Shorthand for ArgsParser.new.call. This call is recommended to protect the caller in case
+   # this functionality is repackaged as a Module or otherwise modified.
+   # @param [Array] args the command line arguments (overridable for testing, etc.)
+   # @return [Array<Hash,String>] [options, targets, help_string],
+   # or exits if help or version requested or no targets specified.
+   def self.call(args = ARGV)
+     new.call(args)
+   end
+
+   # Parses the command line arguments.
+   # @param [Array] args the command line arguments (overridable for testing, etc.)
+   # @return [Array<Hash,String>] [options, targets, help_string],
+   # or exits if help or version requested or no targets specified.
+   def call(args = ARGV)
+     @args = args
+     @options = DEFAULT_OPTIONS.dup
+     prepend_environment_args
+     @option_parser = create_option_parser
+     option_parser.parse!(args)
+     postprocess_format_options
+     targets = create_target_array
+     [options, targets, option_parser.help]
+   end
+
+   # @return [OptionParser]
+   private def create_option_parser
+     OptionParser.new do |opts|
+       opts.banner = <<~BANNER
+         Rika v#{Rika::VERSION} (Tika v#{Rika.tika_version}) - #{Rika::PROJECT_URL}
+
+         Usage: rika [options] <file or url> [...file or url...]
+         Output formats are: [a]wesome_print, [t]o_s, [i]nspect, [j]son, [J] for pretty json, and [y]aml.
+         If a format contains two letters, the first will be used for metadata, the second for text.
+         Values for the text, metadata, and as_array boolean options may be specified as follows:
+         Enable: +, true, yes, [empty]
+         Disable: -, false, no, [long form option with no- prefix, e.g. --no-metadata]
+
+       BANNER
+
+       format_message = 'Output format (default: at)'
+       opts.on('-f', '--format FORMAT', format_message) do |format|
+         options[:format] = format
+       end
+
+       opts.on('-m', '--[no-]metadata [FLAG]', TrueClass, 'Output metadata (default: true)') do |v|
+         options[:metadata] = (v.nil? ? true : v)
+       end
+
+       opts.on('-t', '--[no-]text [FLAG]', TrueClass, 'Output text (default: true)') do |v|
+         options[:text] = (v.nil? ? true : v)
+       end
+
+       opts.on('-k', '--[no-]key-sort [FLAG]', TrueClass, 'Sort metadata keys case insensitively (default: true)') do |v|
+         options[:key_sort] = (v.nil? ? true : v)
+       end
+
+       opts.on('-s', '--[no-]source [FLAG]', TrueClass, 'Document source file or URL') do |v|
+         options[:source] = (v.nil? ? true : v)
+       end
+
+       opts.on('-a', '--[no-]as-array [FLAG]', TrueClass,
+               'Output all parsed results as an array (default: false)') do |v|
+         options[:as_array] = (v.nil? ? true : v)
+       end
+
+       opts.on('-v', '--version', 'Output version') do
+         puts versions_string
+         exit
+       end
+
+       opts.on('-h', '--help', 'Output help') do
+         puts opts
+         exit
+       end
+     end
+   end
+
+   # @return [Array] the targets specified on the command line, possibly expanded by the shell,
+   # and with any directories removed.
+   private def create_target_array
+     targets = args.dup.reject { |arg| File.directory?(arg) }.freeze # reject dirs to handle **/* globbing
+     targets.map(&:freeze)
+   end
+
+   # Fills in the second format option character if absent, and removes any excess characters
+   # @return [String] format options 2-character value, e.g. 'at'
+   private def postprocess_format_options
+     # If only one format letter is specified, use it for both metadata and text.
+     options[:format] *= 2 if options[:format].length == 1
+
+     # Ignore and remove extra characters after the first two format characters.
+     options[:format] = options[:format][0..1]
+   end
+
+   # If the user wants to specify options in an environment variable ("RIKA_OPTIONS"),
+   # then this method will insert those options at the beginning of the `args` array,
+   # where they can be overridden by command line arguments.
+   private def prepend_environment_args
+     env_opt_string = environment_options
+     args_to_prepend = Shellwords.shellsplit(env_opt_string)
+     args.unshift(args_to_prepend).flatten!
+   end
+
+   # @return [String] the value of the RIKA_OPTIONS environment variable if present, else ''.
+   private def environment_options
+     ENV['RIKA_OPTIONS'] || ''
+   end
+
+   # @return [String] string containing versions of Rika and Tika, with labels
+   private def versions_string
+     "Versions: Rika: #{Rika::VERSION}, Tika: #{Rika.tika_version}"
+   end
+ end
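Two details of `ArgsParser` are easy to miss: options from `RIKA_OPTIONS` are placed ahead of the command line arguments so that the command line wins, and the format string is normalized to exactly two letters. The sketch below isolates both behaviors using only the standard library. The helper names `prepend_env_args`, `normalize_format`, and `parse_format` are made up for this example, and it deliberately handles only the `-f` option.

```ruby
# frozen_string_literal: true

require 'optparse'
require 'shellwords'

# Splits an options string the way a shell would and puts the result ahead
# of the command line arguments, so that later (command line) values win.
def prepend_env_args(env_string, args)
  Shellwords.shellsplit(env_string.to_s) + args
end

# Expands a one-letter format to two letters and drops anything past two.
def normalize_format(format)
  format *= 2 if format.length == 1
  format[0..1]
end

# Parses only a -f/--format option; returns [options, remaining targets].
def parse_format(args)
  options = { format: 'at' }
  parser = OptionParser.new do |opts|
    opts.on('-f', '--format FORMAT') { |value| options[:format] = value }
  end
  remaining = parser.parse(args) # non-destructive; returns non-option arguments
  options[:format] = normalize_format(options[:format])
  [options, remaining]
end

combined = prepend_env_args('-f y', ['-fJ', 'x.pdf'])
p combined
p parse_format(combined)
```

Since `OptionParser` invokes the handler once per occurrence, the last `-f` seen determines the final value, which is why prepending the environment options gives the command line the final say.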