rika 2.0.4-java → 2.2.0-java

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 1a4cc1c56edb22131c3409bbaee3617febac8d40ba3aed28600cde997d93d465
4
- data.tar.gz: 8b38c319ca598ab107762222bc0b097bcf5001aaf53fc1c33e912cff83ff7997
3
+ metadata.gz: bd45b4d1119a5372921bf75ba95f8b42c4b75b3299065c77c5a1809f30d21a93
4
+ data.tar.gz: 62f53d53f1b55d08de9f427fa2c7891baa13b0a6406d304fdf6d4fd7673b239c
5
5
  SHA512:
6
- metadata.gz: 1cfea7c20c8c2b294ff6a5d29877cc4670034ac6c58ac299bd5b3ac716d55bc4f1df39843d7c3a33776186ca73d7652747e8853fc914f114e2ecef16e8b26278
7
- data.tar.gz: c6c5aeef86b5e9b2cba4b31f395602f94b8c1094975f5543c388175cb619c34c87258ade2043cc0faa47cff7f5bbb87f85e298179faaa4465e96e47647b1bfe0
6
+ metadata.gz: 7c66fb4cc8fde420e7a25f35dea66ed3e88ce6b3bc4d79df9c26425569feb72281e9321230ac5c99905cb50bb6acb1724428ffea2f3cd41a7ab59e647b73918a
7
+ data.tar.gz: 8ad4a5ea3da65eec6a590871855a99699c9cc890ed5f28f331679383cd3fb4531022d695aee112153cba2859a3bc63c8a6daee63fec45dc1a596beb923a3c20b
data/.gitignore CHANGED
@@ -1,17 +1,20 @@
1
- *.gem
2
- *.rbc
3
- .DS_Store
1
+ _yardoc
4
2
  .bundle
5
3
  .config
6
- coverage/
4
+ .DS_Store
7
5
  .idea/
6
+ .ruby-version
7
+ .rvmrc
8
8
  .yardoc
9
- Gemfile.lock
10
- InstalledFiles
11
- _yardoc
9
+ *.gem
10
+ *.rbc
12
11
  coverage
12
+ coverage/
13
13
  doc/
14
+ Gemfile.lock
15
+ InstalledFiles
14
16
  lib/bundler/man
17
+ pdf/
15
18
  pkg
16
19
  projectFilesBackup/
17
20
  rdoc
data/.rspec CHANGED
@@ -1,2 +1,2 @@
1
+ --format documentation
1
2
  --color
2
- --format progress
data/README.md CHANGED
@@ -2,11 +2,11 @@
2
2
 
3
3
  [Rika](https://github.com/keithrbennett/rika) is a [JRuby](https://www.jruby.org) wrapper for
4
4
  the [Apache Tika](http://tika.apache.org/) Java library, which extracts text and metadata from files and resources
5
- of [many different formats](https://tika.apache.org/1.24.1/formats.html).
5
+ of [many different formats](https://tika.apache.org/3.1.0/formats.html).
6
6
 
7
- Rika can be used as a library in your Ruby code, or on the command line.
7
+ Rika can be used as a library in your Ruby code or on the command line using the provided executable.
8
8
 
9
- For class and method level documentation, please use [YARD](https://rubydoc.info/gems/yard).
9
+ For class and method level documentation, please use [YARD](https://yardoc.org/).
10
10
  You can `gem install yard`, then run `yard doc` from the project root,
11
11
  and then open the `doc/index.html` file in a browser.
12
12
 
@@ -23,12 +23,10 @@ for more advanced needs.
23
23
  See the [Other Tika Resources](#other-tika-resources) section of this document for alternatives to
24
24
  Rika that may suit more demanding needs.
25
25
 
26
- Rika can be used either as a gem in your own Ruby project, or on the command line using the provided executable.
27
26
 
28
27
  ## Usage in Your Ruby Code
29
28
 
30
- > [!IMPORTANT]
31
- > **It is necessary to call `Rika.init` before using Rika.** This is because the loading of the Tika library
29
+ > **⚠️ IMPORTANT:** It is necessary to call `Rika.init` before using Rika. This is because the loading of the Tika library
32
30
  has been put in an init method, rather than at load time, so that 'jar file not found or specified' errors
33
31
  do not prevent your application from loading. If you forget to call `Rika.init`, you may see seemingly unrelated
34
32
  error messages.
@@ -80,16 +78,16 @@ Rika can also be used on the command line using the `rika` executable. For exam
80
78
  specify one or more filespecs or URL's as arguments:
81
79
 
82
80
  ```bash
83
- rika x.pdf https://github.com/keithrbennett/rika
81
+ rika x.pdf https://www.google.com
84
82
  ```
85
-
83
+
86
84
  > [!NOTE]
87
85
  > If running `rika` produces an error indicating that the JRuby interpreter cannot be found, try preceding it with `jruby`, e.g. `jruby rika x.pdf`.
88
86
 
89
87
  Here is the help text:
90
88
 
91
89
  ```
92
- Rika v2.0.2 (Tika v2.9.0) - https://github.com/keithrbennett/rika
90
+ Rika v2.2.0 (Tika v3.1.0) - https://github.com/keithrbennett/rika
93
91
 
94
92
  Usage: rika [options] <file or url> [...file or url...]
95
93
  Output formats are: [a]wesome_print, [t]o_s, [i]nspect, [j]son), [J] for pretty json, and [y]aml.
@@ -98,13 +96,19 @@ Values for the text, metadata, and as_array boolean options may be specified as
98
96
  Enable: +, true, yes, [empty]
99
97
  Disable: -, false, no, [long form option with no- prefix, e.g. --no-metadata]
100
98
 
99
+ IMPORTANT: Always quote wildcard patterns when files might contain special characters:
100
+ - Double quotes: "*.pdf" (allows variable expansion)
101
+ - Single quotes: '*.pdf' (prevents all shell interpretation)
102
+ Use -n/--dry-run to preview command execution and check for issues.
103
+
101
104
  -f, --format FORMAT Output format (default: at)
102
105
  -m, --[no-]metadata [FLAG] Output metadata (default: true)
103
106
  -t, --[no-]text [FLAG] Output text (default: true)
104
107
  -k, --[no-]key-sort [FLAG] Sort metadata keys case insensitively (default: true)
105
- -s, --[no-]source [FLAG] Output document source file or URL (default: false)
108
+ -s, --[no-]source [FLAG] Output document source file or URL (default: true)
106
109
  -a, --[no-]as-array [FLAG] Output all parsed results as an array (default: false)
107
- -v, --version Output version
110
+ -n, --[no-]dry-run [FLAG] Show what would be done without executing (default: false)
111
+ -v, --version Output software versions
108
112
  -h, --help Output help
109
113
  ```
110
114
 
@@ -144,6 +148,72 @@ If you find yourself using the same options over and over again, you can put the
144
148
  variable. For example, if the default behavior of sorting keys does not work for your language, you can disable it
145
149
  for all invocations of the `rika` command by specifying `-k-` in the RIKA_OPTIONS environment variable.
146
150
 
151
+ ### Using Wildcards for File Specification
152
+
153
+ Rika now supports in-app expansion of wildcard patterns for file specification. This means you can quote wildcard patterns
154
+ to prevent the shell from expanding them, and Rika will handle the expansion internally:
155
+
156
+ ```bash
157
+ # Let Rika handle the expansion (no practical limit on number of files)
158
+ rika '**/*.pdf'
159
+
160
+ # Shell expands wildcards (limited by shell's maximum argument length)
161
+ rika **/*.pdf
162
+ ```
163
+
164
+ > **⚠️ IMPORTANT:** Always quote wildcard patterns (using either single or double quotes) when they might match files containing special characters!
165
+ >
166
+ > When unquoted, the shell may misinterpret filenames containing spaces, $, *, ?, [], (), {}, &, |, <, >, ;,
167
+ > backticks, quotes, and other shell metacharacters, causing unpredictable behavior:
168
+ >
169
+ > ```bash
170
+ > # PROBLEMATIC - Shell breaks/misinterprets files with special characters
171
+ > rika pdf/*
172
+ >
173
+ > # CORRECT - Both single and double quotes work to preserve filenames
174
+ > rika "pdf/*" # Double quotes allow variable expansion within the pattern
175
+ > rika 'pdf/*.pdf' # Single quotes prevent all shell interpretation
176
+ > ```
177
+ >
178
+ > Use the `-n` (dry-run) option to preview how your command will be processed and to check for issues.
179
+
180
+ This is particularly useful when dealing with large numbers of files, as shell expansion may hit command line length limits.
181
+ In-app expansion has no practical limit on the number of files that can be processed.
182
+
183
+ Supported wildcard patterns:
184
+ - `*` - Match any number of characters
185
+ - `?` - Match a single character
186
+ - `[abc]` - Match one character from the set
187
+ - `{a,b,c}` - Match any of the patterns a, b, or c
188
+ - `**` - Recursive directory matching (match all files in all subdirectories)
189
+
190
+ ### Dry Run Mode
191
+
192
+ You can use the `-n` or `--dry-run` option to see what would happen when running a command without actually executing it:
193
+
194
+ ```bash
195
+ rika -n -f jy README.md
196
+ ```
197
+
198
+ Like other boolean options, dry-run mode can be disabled with various syntax options:
199
+ ```bash
200
+ rika -n- README.md # Hyphen suffix
201
+ rika -n false README.md # "false" value
202
+ rika -n no README.md # "no" value
203
+ rika --no-dry-run README.md # --no- prefix
204
+ ```
205
+
206
+ This will display:
207
+ - All the options that would be used, with human-readable descriptions
208
+ - A list of files that would be processed
209
+ - Any issues that were detected (like non-existent files)
210
+
211
+ This is useful for:
212
+ - Debugging complex commands
213
+ - Checking what files would be processed when using wildcards
214
+ - Verifying options before running on large sets of files
215
+ - Understanding how different options would affect processing
216
+
147
217
  ### Machine Readable Data Support
148
218
 
149
219
  If both metadata and text are output, and the same output format is used for both, and that format is JSON
data/RELEASE_NOTES.md CHANGED
@@ -1,5 +1,21 @@
1
1
  ## Release Notes
2
2
 
3
+ ### v2.2.0
4
+
5
+ * Switched to TikaInputStream for better memory management and performance
6
+ * Added dry-run mode (-n/--dry-run) to preview command execution
7
+ * Improved error handling and reporting (empty files, symlinks, invalid URLs)
8
+ * Enhanced URL validation and processing
9
+ * Added comprehensive integration tests
10
+ * Moved gem executable from bin/ to exe/, made 'exe' the bindir
11
+ * Changed executable shebang from ruby to jruby
12
+ * Added prominent warning about quoting wildcards
13
+ * Added support for UTF-8 encoding in environment options
14
+
15
+ ### v2.1.0
16
+
17
+ * Add ability to specify quoted wildcard filespecs on command line so Ruby will expand them.
18
+
3
19
  ### v2.0.4
4
20
 
5
21
  * Fix uninitialized constant StringIO error (issue #16).
data/{bin → exe}/rika RENAMED
@@ -1,4 +1,4 @@
1
- #!/usr/bin/env ruby
1
+ #!/usr/bin/env jruby
2
2
  # frozen_string_literal: true
3
3
 
4
4
  require 'rika/cli/rika_command'
@@ -1,5 +1,11 @@
1
1
  # frozen_string_literal: true
2
2
 
3
+ require 'optparse'
4
+ require 'shellwords'
5
+ require 'uri'
6
+ require 'awesome_print'
7
+ require_relative 'rika_command'
8
+
3
9
  # Processes the array of arguments (ARGV by default) and returns the options, targets, and help string.
4
10
  class ArgsParser
5
11
  attr_reader :args, :options, :option_parser
@@ -12,14 +18,15 @@ class ArgsParser
12
18
  metadata: true,
13
19
  text: true,
14
20
  source: true,
15
- key_sort: true
21
+ key_sort: true,
22
+ dry_run: false,
16
23
  }.freeze
17
24
 
18
25
  # Parses the command line arguments.
19
- # Shorthand for ArgsParser.new.call. This call is recommended to pro tect the caller in case
26
+ # Shorthand for ArgsParser.new.call. This call is recommended to protect the caller in case
20
27
  # this functionality is repackaged as a Module or otherwise modified.
21
28
  # @param [Array] args the command line arguments (overridable for testing, etc.)
22
- # @return [Array<Hash,String>] [options, targets, help_string],
29
+ # @return [Array<Hash,Array,String,Hash>] [options, targets, help_string, issues],
23
30
  # or exits if help or version requested or no targets specified.
24
31
  def self.call(args = ARGV)
25
32
  new.call(args)
@@ -27,7 +34,7 @@ class ArgsParser
27
34
 
28
35
  # Parses the command line arguments.
29
36
  # @param [Array] args the command line arguments (overridable for testing, etc.)
30
- # @return [Array<Hash,String>] [options, targets, help_string],
37
+ # @return [Array<Hash,Array,String,Hash>] [options, targets, help_string, issues],
31
38
  # or exits if help or version requested or no targets specified.
32
39
  def call(args = ARGV)
33
40
  @args = args
@@ -36,12 +43,16 @@ class ArgsParser
36
43
  @option_parser = create_option_parser
37
44
  option_parser.parse!(args)
38
45
  postprocess_format_options
39
- targets = create_target_array
40
- [options, targets, option_parser.help]
46
+ targets, issues = process_args_for_targets
47
+ [options, targets, option_parser.help, issues]
41
48
  end
42
49
 
50
+ # -------------------------------------------------------
51
+ private
52
+ # -------------------------------------------------------
53
+
43
54
  # @return [OptionParser]
44
- private def create_option_parser
55
+ def create_option_parser
45
56
  OptionParser.new do |opts|
46
57
  opts.banner = <<~BANNER
47
58
  Rika v#{Rika::VERSION} (Tika v#{Rika.tika_version}) - #{Rika::PROJECT_URL}
@@ -53,6 +64,11 @@ class ArgsParser
53
64
  Enable: +, true, yes, [empty]
54
65
  Disable: -, false, no, [long form option with no- prefix, e.g. --no-metadata]
55
66
 
67
+ ⚠️ IMPORTANT: Always quote wildcard patterns when files might contain special characters!
68
+ - Double quotes: "*.pdf" (allows variable expansion)
69
+ - Single quotes: '*.pdf' (prevents all shell interpretation)
70
+ Use -n/--dry-run to preview command execution and check for issues.
71
+
56
72
  BANNER
57
73
 
58
74
  format_message = 'Output format (default: at)'
@@ -72,7 +88,7 @@ class ArgsParser
72
88
  options[:key_sort] = (v.nil? ? true : v)
73
89
  end
74
90
 
75
- opts.on('-s', '--[no-]source [FLAG]', TrueClass, 'Output document source file or URL (default: false)') do |v|
91
+ opts.on('-s', '--[no-]source [FLAG]', TrueClass, 'Output document source file or URL (default: true)') do |v|
76
92
  options[:source] = (v.nil? ? true : v)
77
93
  end
78
94
 
@@ -81,52 +97,138 @@ class ArgsParser
81
97
  options[:as_array] = (v.nil? ? true : v)
82
98
  end
83
99
 
84
- opts.on('-v', '--version', 'Output version') do
100
+ opts.on('-n', '--[no-]dry-run [FLAG]', TrueClass, 'Show what would be done without executing (default: false)') do |v|
101
+ options[:dry_run] = (v.nil? ? true : v)
102
+ end
103
+
104
+ opts.on('-v', '--version', 'Output software versions') do
85
105
  puts versions_string
86
106
  exit
87
107
  end
88
108
 
89
109
  opts.on('-h', '--help', 'Output help') do
90
- puts opts
110
+ RikaCommand.output_help_text(opts)
91
111
  exit
92
112
  end
93
113
  end
94
114
  end
95
115
 
96
- # @return [Array] the targets specified on the command line, possibly expanded by the shell,
97
- # and with any directories removed.
98
- private def create_target_array
99
- targets = args.dup.reject { |arg| File.directory?(arg) }.freeze # reject dirs to handle **/* globbing
100
- targets.map(&:freeze)
101
- end
102
-
103
116
  # Fills in the second format option character if absent, and removes any excess characters
104
117
  # @return [String] format options 2-character value, e.g. 'at'
105
- private def postprocess_format_options
118
+ def postprocess_format_options
106
119
  # If only one format letter is specified, use it for both metadata and text.
107
120
  options[:format] *= 2 if options[:format].length == 1
108
121
 
109
122
  # Ignore and remove extra characters after the first two format characters.
110
123
  options[:format] = options[:format][0..1]
124
+
125
+ # Validate format characters
126
+ valid_formats = %w[a i j J t y]
127
+ format_chars = options[:format].chars
128
+
129
+ if options[:format].strip.empty? || format_chars.any? { |c| !valid_formats.include?(c) }
130
+ $stderr.puts "Error: Invalid format characters in '#{options[:format]}'. Valid characters are: #{valid_formats.join(', ')}"
131
+ exit 1
132
+ end
111
133
  end
112
134
 
113
135
  # If the user wants to specify options in an environment variable ("RIKA_OPTIONS"),
114
136
  # then this method will insert those options at the beginning of the `args` array,
115
137
  # where they can be overridden by command line arguments.
116
- private def prepend_environment_args
138
+ def prepend_environment_args
117
139
  env_opt_string = environment_options
118
140
  args_to_prepend = Shellwords.shellsplit(env_opt_string)
119
141
  args.unshift(args_to_prepend).flatten!
120
142
  end
121
143
 
122
144
  # @return [String] the value of the RIKA_OPTIONS environment variable if present, else ''.
123
- private def environment_options
124
- ENV['RIKA_OPTIONS'] || ''
145
+ def environment_options
146
+ env_value = ENV['RIKA_OPTIONS'] || ''
147
+ # Necessary to handle escaped spaces and other special characters consistently:
148
+ env_value.dup.force_encoding('UTF-8')
125
149
  end
126
150
 
127
151
  # @return [String] string containing versions of Rika and Tika, with labels
128
- private def versions_string
152
+ def versions_string
129
153
  java_version = Java::java.lang.System.getProperty("java.version")
130
154
  "Versions: Rika: #{Rika::VERSION}, Tika: #{Rika.tika_version}, Java: #{java_version}"
131
155
  end
156
+
157
+ # Process the command line arguments to find URLs and file specifications
158
+ # @return [Array<Array,Hash>] [targets, issues] where targets is an array of valid URLs and filespecs
159
+ # and issues is a hash of categories to arrays of problematic targets
160
+ def process_args_for_targets
161
+ targets = []
162
+ issues = Hash.new { |hash, key| hash[key] = [] }
163
+
164
+ args.each do |arg|
165
+ if arg.include?('://')
166
+ if File.exist?(arg)
167
+ # Files containing "://" are highly unusual in normal filesystems.
168
+ # This is a defensive check to prevent misinterpreting valid files as URLs
169
+ # just because they contain URL-like patterns, which could happen in test
170
+ # environments or with specially crafted filenames.
171
+ issues[:file_with_url_characters] << arg
172
+ else
173
+ # Otherwise treat it as a URL candidate
174
+ process_url_candidate(arg, targets, issues)
175
+ end
176
+ else
177
+ process_filespec_candidate(arg, targets, issues)
178
+ end
179
+ end
180
+
181
+ [targets, issues]
182
+ end
183
+
184
+ # Determines if a string looks like a URL based on the presence of "://"
185
+ # @param [String] arg string to check
186
+ # @return [Boolean] true if the string appears to be a URL
187
+ def looks_like_url?(arg)
188
+ arg.include?('://')
189
+ end
190
+
191
+ # Process a candidate URL
192
+ # @param [String] arg the URL to process
193
+ # @param [Array] targets array to add valid URLs to
194
+ # @param [Hash] issues hash to collect issues
195
+ # @return [void]
196
+ def process_url_candidate(arg, targets, issues)
197
+ begin
198
+ uri = URI.parse(arg)
199
+ if ['http', 'https'].include?(uri.scheme.downcase)
200
+ targets << arg
201
+ else
202
+ issues[:bad_url_scheme] << arg
203
+ end
204
+ rescue URI::InvalidURIError
205
+ issues[:invalid_url] << arg
206
+ end
207
+ end
208
+
209
+ # Process a candidate file specification
210
+ # @param [String] arg the filespec to process
211
+ # @param [Array] targets array to add valid filespecs to
212
+ # @param [Hash] issues hash to collect issues
213
+ # @return [void]
214
+ def process_filespec_candidate(arg, targets, issues)
215
+ matching_filespecs = Dir.glob(arg)
216
+
217
+ if matching_filespecs.empty?
218
+ issues[:non_existent_file] << arg
219
+ return
220
+ end
221
+
222
+ matching_filespecs.each do |file|
223
+ if File.symlink?(file)
224
+ issues[:is_symlink_wont_process] << file
225
+ elsif File.directory?(file)
226
+ # ignore
227
+ elsif File.empty?(file)
228
+ issues[:empty_file] << file
229
+ else
230
+ targets << file
231
+ end
232
+ end
233
+ end
132
234
  end
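Note on the target-processing methods above: the sketch below condenses the same classification steps into one standalone function so the `[targets, issues]` return shape can be tried outside the gem. `classify_targets` is a hypothetical name, not part of Rika. Two simplifications are assumptions of this sketch: it omits the `File.exist?` check for file names containing `://`, and it uses `&.` so that a string parsed without a scheme would be reported under `:bad_url_scheme` instead of raising on `nil`.

```ruby
require 'uri'

# Hypothetical standalone condensation of the classification logic shown in
# the diff. Returns [targets, issues], where issues maps a category symbol
# to the arguments (or files) that fell into it.
def classify_targets(args)
  targets = []
  issues = Hash.new { |hash, key| hash[key] = [] }

  args.each do |arg|
    if arg.include?('://')
      begin
        scheme = URI.parse(arg).scheme
        # Assumption of this sketch: a nil scheme counts as a bad scheme.
        if %w[http https].include?(scheme&.downcase)
          targets << arg
        else
          issues[:bad_url_scheme] << arg
        end
      rescue URI::InvalidURIError
        issues[:invalid_url] << arg
      end
      next
    end

    matches = Dir.glob(arg)
    if matches.empty?
      issues[:non_existent_file] << arg
      next
    end

    matches.each do |file|
      if File.symlink?(file)
        issues[:is_symlink_wont_process] << file
      elsif File.directory?(file)
        next # directories are skipped silently, as in the diff
      elsif File.empty?(file)
        issues[:empty_file] << file
      else
        targets << file
      end
    end
  end

  [targets, issues]
end

targets, issues = classify_targets(['https://example.com', 'ftp://example.com', 'no_such_file.pdf'])
puts "targets: #{targets.inspect}"
puts "issues:  #{issues.inspect}"
```

Collecting problems in a hash keyed by category, instead of failing on the first bad argument, is what lets dry-run mode list every detected issue in one pass.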