simple_text_extract 0.2.0 → 1.1.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: fa34a31195f18156df695d2fb860a2beb9562f6a8ae76d6f2b6a7d4585e2e306
4
- data.tar.gz: 3af31eaa54ed98bf8d0355cdf2bdf29f525a97e1ae66573b1c7317cbfcd3c951
3
+ metadata.gz: 35eee7886cc545693f85facc69bbe8b99ea141f0af3ad520f6f75d2bc61eddd7
4
+ data.tar.gz: 5a4b8e4a4dfe54535805f56e986fe36b5470889a96960b2bf8acd1ee4c94b084
5
5
  SHA512:
6
- metadata.gz: 3d37e232dd959b4c0897439a29b46f64d598f024b61bba2edc2cdad1f0d14461db77b93a227326697d4913e83ab87b4393234943aef964285b253c652849436d
7
- data.tar.gz: 44255a841598321a97559ef3a77c3db3e6f6a7d22cb78fc78819aaecef643b8b67c85484b4f10ac979923f28c8f727fcaca69ebec694b3df9fe5f6a9c37cc1f9
6
+ metadata.gz: 9ace8965b9567c8d9e85f2d52912c8081da83a2dc0ff9c463f6ac28450e39ca874125e501196f546930615b4d6271abad88dc3fe8b91f0ade1253c0468652eca
7
+ data.tar.gz: 05d9819fd5835b8307ae6ece9b0f4efa8b00d615eaf6fcc054dc38959780761ed4cf40711d7b91de38a9e3390904dead968c0a4a0ccc9221a8ec5122a6f9f0b8
@@ -1 +1 @@
1
- 2.5.3
1
+ 2.6.5
data/Gemfile CHANGED
@@ -4,6 +4,4 @@ source "https://rubygems.org"
4
4
 
5
5
  git_source(:github) { |repo_name| "https://github.com/#{repo_name}" }
6
6
 
7
- gem "pry"
8
-
9
7
  gemspec
@@ -1,44 +1,35 @@
1
1
  PATH
2
2
  remote: .
3
3
  specs:
4
- simple_text_extract (0.1.3)
5
- roo (~> 2.8)
4
+ simple_text_extract (1.0.0)
5
+ roo (~> 2.8.2)
6
6
  spreadsheet (~> 1.1.8)
7
7
 
8
8
  GEM
9
9
  remote: https://rubygems.org/
10
10
  specs:
11
- coderay (1.1.2)
12
- metaclass (0.0.4)
13
- method_source (0.9.2)
14
11
  mini_portile2 (2.4.0)
15
- minitest (5.11.3)
16
- mocha (1.8.0)
17
- metaclass (~> 0.0.1)
18
- nokogiri (1.10.1)
12
+ minitest (5.14.0)
13
+ mocha (1.11.2)
14
+ nokogiri (1.10.9)
19
15
  mini_portile2 (~> 2.4.0)
20
- pry (0.12.2)
21
- coderay (~> 1.1.0)
22
- method_source (~> 0.9.0)
23
- rake (10.5.0)
24
- roo (2.8.1)
16
+ rake (13.0.1)
17
+ roo (2.8.3)
25
18
  nokogiri (~> 1)
26
- rubyzip (>= 1.2.1, < 2.0.0)
27
- ruby-ole (1.2.12.1)
28
- rubyzip (1.2.2)
29
- spreadsheet (1.1.8)
19
+ rubyzip (>= 1.3.0, < 3.0.0)
20
+ ruby-ole (1.2.12.2)
21
+ rubyzip (2.3.0)
22
+ spreadsheet (1.1.9)
30
23
  ruby-ole (>= 1.0)
31
24
 
32
25
  PLATFORMS
33
26
  ruby
34
27
 
35
28
  DEPENDENCIES
36
- bundler (~> 1.17)
37
29
  minitest (~> 5.0)
38
30
  mocha
39
- pry
40
- rake (~> 10.0)
31
+ rake (~> 13.0)
41
32
  simple_text_extract!
42
33
 
43
34
  BUNDLED WITH
44
- 1.17.2
35
+ 2.0.2
data/README.md CHANGED
@@ -9,6 +9,7 @@ SimpleTextExtract handles parsing text from:
9
9
  - `.doc`
10
10
  - `.xlsx`
11
11
  - `.xls`
12
+ - `.csv`
12
13
  - `.txt` 😜
13
14
 
14
15
  If no text is parsed (for `pdf`), or a file format is not supported (like images), then `nil` is returned and you can move on to the heavy-duty tools like [Henkei](https://github.com/abrom/henkei) 💪.
@@ -34,11 +35,13 @@ Or install it yourself as:
34
35
  Text can be parsed from raw file content or files in the filesystem by calling `SimpleTextExtract.extract`:
35
36
 
36
37
  ```ruby
37
- # raw file content using ActiveStorage
38
- SimpleTextExtract.extract(filename: attachment.blob.filename, raw: attachment.download)
38
+ # using ActiveStorage >= 6
39
+ extract = attachment.open { |tmp| SimpleTextExtract.extract(tempfile: tmp) }
40
+ # raw file content or when ActiveStorage < 6
41
+ extract = SimpleTextExtract.extract(filename: attachment.blob.filename, raw: attachment.download)
39
42
 
40
43
  # filesystem
41
- SimpleTextExtract.extract(filepath: "path_to_file.pdf")
44
+ extract = SimpleTextExtract.extract(filepath: "path_to_file.pdf")
42
45
  ```
43
46
 
44
47
  ### Usage Dependencies
@@ -51,6 +54,9 @@ You can choose to use SimpleTextExtract without the following dependencies, but
51
54
  `doc` parsing requires `antiword`
52
55
  - `brew install antiword`
53
56
 
57
+ `xlsx` and `xls` parsing requires `ssconvert` which is part of `gnumeric`
58
+ - `brew install gnumeric`
59
+
54
60
  ### Usage on Heroku
55
61
 
56
62
  To use on Heroku you'll have to add some custom buildpacks.
@@ -64,13 +70,16 @@ If not, you can either add that buildpack, or add `poppler-utils` to your `Aptfi
64
70
 
65
71
  ##### heroku-buildpack-apt
66
72
 
67
- To add `antiword` as a dependency on Heroku, install the [heroku-buildpack-apt](https://elements.heroku.com/buildpacks/heroku/heroku-buildpack-apt) buildpack and follow the install instructions.
73
+ To add `antiword` and/or `gnumeric`* as a dependency on Heroku, install the [heroku-buildpack-apt](https://elements.heroku.com/buildpacks/heroku/heroku-buildpack-apt) buildpack and follow the install instructions.
68
74
 
69
75
  In your `Aptfile`, add:
70
76
  ```
71
77
  antiword
78
+ gnumeric
72
79
  ```
73
80
 
81
+ * There is currently an [issue](https://github.com/heroku/heroku-buildpack-google-chrome/issues/59) with the heroku-18 stack that requires additional dependencies added to the Aptfile to get `gnumeric` to work properly. You can reference the linked issue above to figure out those dependencies, or downgrade to heroku-16 until it is fixed.
82
+
74
83
  ## Benchmarks
75
84
 
76
85
  *Benchmarks test extracting text from the same file 50 times (Macbook pro)*
@@ -80,8 +89,9 @@ antiword
80
89
  | .doc | 1.40s | 74.27s |
81
90
  | .docx | 0.78s | 71.44s |
82
91
  | .pdf* | 1.73s | 82.86s |
83
- | .xlsx | 21.99s | 51.89s |
84
- | .txt | 0.036s | 39.25s |
92
+ | .xlsx | 1.16s | 51.89s |
93
+ | .xls | 0.80s | 67.88s |
94
+ | .txt | 0.04s | 39.25s |
85
95
 
86
96
  * SimpleTextExtract is limited in its text extraction from pdfs, as Tika can also perform OCR on pdfs with Tesseract
87
97
 
data/Rakefile CHANGED
@@ -7,4 +7,4 @@ Rake::TestTask.new(:test) do |t|
7
7
  t.test_files = FileList["test/**/*_test.rb"]
8
8
  end
9
9
 
10
- task :default => :test
10
+ task default: :test
@@ -6,8 +6,5 @@ require "simple_text_extract"
6
6
  # You can add fixtures and/or initialization code here to make experimenting
7
7
  # with your gem easier. You can also use a different console, if you like.
8
8
 
9
- require "pry"
10
- Pry.start
11
-
12
9
  require "irb"
13
10
  IRB.start(__FILE__)
@@ -2,14 +2,18 @@
2
2
 
3
3
  require "simple_text_extract/version"
4
4
  require "simple_text_extract/text_extractor"
5
- require "simple_text_extract/file_extractor"
6
- require "simple_text_extract/tempfile_extractor"
7
5
  require "simple_text_extract/format_extractor_factory"
8
6
 
9
7
  module SimpleTextExtract
8
+ SUPPORTED_FILETYPES = ["xls", "xlsx", "doc", "docx", "txt", "pdf", "csv"].freeze
9
+
10
10
  class Error < StandardError; end
11
11
 
12
- def self.extract(filename: nil, raw: nil, filepath: nil)
13
- TextExtractor.call(filename: filename, raw: raw, filepath: filepath).to_s
12
+ def self.extract(filename: nil, raw: nil, filepath: nil, tempfile: nil)
13
+ TextExtractor.new(filename: filename, raw: raw, filepath: filepath, tempfile: tempfile).to_s
14
+ end
15
+
16
+ def self.supports?(filename: nil)
17
+ SUPPORTED_FILETYPES.include?(filename.split(".").last)
14
18
  end
15
19
  end
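
The `supports?` helper added in this hunk compares whatever follows the last dot against `SUPPORTED_FILETYPES`. The following is a minimal standalone sketch of that check: the constant and method body are copied from the hunk, while the `SupportCheck` module name is an assumption used only for illustration and is not part of the gem. It suggests that the comparison is case-sensitive.

```ruby
# Hypothetical stand-in for SimpleTextExtract; the constant and the method
# body are copied from the hunk above, the module name is illustrative only.
module SupportCheck
  SUPPORTED_FILETYPES = ["xls", "xlsx", "doc", "docx", "txt", "pdf", "csv"].freeze

  def self.supports?(filename: nil)
    # Takes the text after the last "." and looks it up verbatim (no downcasing).
    SUPPORTED_FILETYPES.include?(filename.split(".").last)
  end
end

puts SupportCheck.supports?(filename: "report.csv") # lowercase extension
puts SupportCheck.supports?(filename: "REPORT.PDF") # uppercase extension
puts SupportCheck.supports?(filename: "photo.png")  # type not in the list
```

Because the lookup is verbatim, a caller that may receive upper-case extensions would probably want to downcase the filename first. Note also that the default `filename: nil` would raise `NoMethodError` on `split`, so the argument is effectively required.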
@@ -4,7 +4,7 @@ module SimpleTextExtract
4
4
  module FormatExtractor
5
5
  class Doc < Base
6
6
  def extract
7
- return nil if missing_dependency?('antiword')
7
+ return nil if missing_dependency?("antiword")
8
8
 
9
9
  `antiword #{Shellwords.escape(file.path)}`
10
10
  end
@@ -8,7 +8,6 @@ module SimpleTextExtract
8
8
 
9
9
  spreadsheet = Spreadsheet.open(file)
10
10
  text = []
11
-
12
11
  spreadsheet.worksheets.each do |sheet|
13
12
  text << sheet.name
14
13
  text << sheet.rows
@@ -6,7 +6,7 @@ module SimpleTextExtract
6
6
  def extract
7
7
  require "roo"
8
8
 
9
- spreadsheet = Roo::Spreadsheet.open(file)
9
+ spreadsheet = Roo::Spreadsheet.open(file, only_visible_sheets: true)
10
10
 
11
11
  text = []
12
12
 
@@ -10,9 +10,9 @@ require "simple_text_extract/format_extractor/doc"
10
10
 
11
11
  module SimpleTextExtract
12
12
  class FormatExtractorFactory
13
- def self.call(file) # rubocop:disable Metrics/MethodLength
13
+ def self.call(file) # rubocop:disable Metrics/MethodLength, Metrics/CyclomaticComplexity
14
14
  case file.path
15
- when /.txt$/i
15
+ when /(.txt$|.csv$)/i
16
16
  FormatExtractor::PlainText.new(file)
17
17
  when /.pdf$/i
18
18
  FormatExtractor::PDF.new(file)
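
The `when` patterns in this hunk leave the dot unescaped, so `.` matches any character rather than a literal period. The sketch below compares the pattern as it appears in the diff with a stricter variant; `FROM_DIFF` and `ESCAPED` are names invented for this illustration, and `ESCAPED` is an assumption about what a tighter pattern could look like, not something the gem ships.

```ruby
FROM_DIFF = /(.txt$|.csv$)/i # pattern as it appears in the hunk above
ESCAPED   = /\.(txt|csv)\z/i # hypothetical stricter variant, not from the gem

# The last two paths have no real extension; they only end in the letters "txt"/"csv".
["notes.txt", "DATA.CSV", "notes_txt", "mycsv"].each do |path|
  puts format("%-10s from_diff=%-5s escaped=%s",
              path, FROM_DIFF.match?(path), ESCAPED.match?(path))
end
```

The escaped variant also uses `\z` (end of string) where the original uses `$`, which in Ruby matches at the end of any line.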
@@ -2,24 +2,55 @@
2
2
 
3
3
  module SimpleTextExtract
4
4
  class TextExtractor
5
- def self.call(filename: nil, raw: nil, filepath: nil)
6
- if !filename.nil? && !raw.nil?
7
- TempfileExtractor.new(filename: filename.to_s, raw: raw).extract
8
- elsif !filepath.nil? && File.exist?(filepath)
9
- FileExtractor.new(filepath: filepath).extract
10
- end
11
- end
5
+ attr_reader :file
12
6
 
13
- def extract
14
- text = FormatExtractorFactory.call(file).extract
15
- cleanup
7
+ def initialize(filename: nil, raw: nil, filepath: nil, tempfile: nil)
8
+ @file = get_file(filename: filename, raw: raw, filepath: filepath, tempfile: tempfile)
9
+ end
16
10
 
17
- text
11
+ def to_s
12
+ @to_s ||= extract.to_s
18
13
  end
19
14
 
20
15
  private
21
16
 
17
+ def get_file(filename:, raw:, filepath:, tempfile:)
18
+ if tempfile&.class == Tempfile
19
+ tempfile
20
+ elsif !filename.nil? && !raw.nil?
21
+ write_tempfile(filename: filename.to_s, raw: raw)
22
+ elsif !filepath.nil? && File.exist?(filepath)
23
+ File.new(filepath)
24
+ end
25
+ end
26
+
27
+ def extract
28
+ return unless file
29
+ return unless file
30
+
31
+ begin
32
+ FormatExtractorFactory.call(file).extract
33
+ rescue StandardError
34
+ nil
35
+ ensure
36
+ cleanup
37
+ end
38
+ end
39
+
22
40
  def cleanup
41
+ return unless file.class == Tempfile
42
+
43
+ file.close
44
+ file.unlink
45
+ end
46
+
47
+ def write_tempfile(filename:, raw:)
48
+ filename = filename.split(".").yield_self { |parts| [parts[0], ".#{parts[1]}"] }
49
+ file = Tempfile.new(filename)
50
+ raw = String.new(raw, encoding: Encoding::UTF_8)
51
+
52
+ file.write(raw)
53
+ file.tap(&:rewind)
23
54
  end
24
55
  end
25
56
  end
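
`write_tempfile` builds the `[basename, extension]` pair passed to `Tempfile.new` by splitting the filename on `.` and keeping only the first two parts. Since `FormatExtractorFactory` dispatches on `file.path`, the extension that survives this split matters. The sketch below wraps the same expression in a hypothetical helper, `split_like_diff`, and sets it beside a hypothetical alternative, `split_with_extname`; neither helper name exists in the gem, and the alternative is only one possible approach.

```ruby
# Same expression as in write_tempfile above, wrapped in a hypothetical helper.
def split_like_diff(filename)
  filename.split(".").yield_self { |parts| [parts[0], ".#{parts[1]}"] }
end

# Hypothetical alternative (not from the gem) built on File.extname/File.basename.
def split_with_extname(filename)
  ext = File.extname(filename)
  [File.basename(filename, ext), ext]
end

p split_like_diff("report.pdf")       # single dot
p split_like_diff("q3.report.pdf")    # two dots: the second part is taken as the extension
p split_with_extname("q3.report.pdf") # extension taken from the last dot
```

For a filename with more than one dot, the split-based version keeps the middle segment as the suffix, so the resulting tempfile path would presumably not end in `.pdf` and the factory would not select the PDF extractor.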
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module SimpleTextExtract
4
- VERSION = "0.2.0"
4
+ VERSION = "1.1.0"
5
5
  end
@@ -28,11 +28,10 @@ Gem::Specification.new do |spec|
28
28
  spec.requirements << "pdftotext/poppler"
29
29
  spec.required_ruby_version = ">= 2.5"
30
30
 
31
- spec.add_runtime_dependency "roo", "~> 2.8"
31
+ spec.add_runtime_dependency "roo", "~> 2.8.2"
32
32
  spec.add_runtime_dependency "spreadsheet", "~> 1.1.8"
33
33
 
34
- spec.add_development_dependency "bundler", "~> 1.17"
35
- spec.add_development_dependency "rake", "~> 10.0"
34
+ spec.add_development_dependency "rake", "~> 13.0"
36
35
  spec.add_development_dependency "minitest", "~> 5.0"
37
36
  spec.add_development_dependency "mocha"
38
37
  end
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: simple_text_extract
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.2.0
4
+ version: 1.1.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Nick Weiland
8
8
  autorequire:
9
9
  bindir: exe
10
10
  cert_chain: []
11
- date: 2019-01-25 00:00:00.000000000 Z
11
+ date: 2020-07-29 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: roo
@@ -16,14 +16,14 @@ dependencies:
16
16
  requirements:
17
17
  - - "~>"
18
18
  - !ruby/object:Gem::Version
19
- version: '2.8'
19
+ version: 2.8.2
20
20
  type: :runtime
21
21
  prerelease: false
22
22
  version_requirements: !ruby/object:Gem::Requirement
23
23
  requirements:
24
24
  - - "~>"
25
25
  - !ruby/object:Gem::Version
26
- version: '2.8'
26
+ version: 2.8.2
27
27
  - !ruby/object:Gem::Dependency
28
28
  name: spreadsheet
29
29
  requirement: !ruby/object:Gem::Requirement
@@ -38,34 +38,20 @@ dependencies:
38
38
  - - "~>"
39
39
  - !ruby/object:Gem::Version
40
40
  version: 1.1.8
41
- - !ruby/object:Gem::Dependency
42
- name: bundler
43
- requirement: !ruby/object:Gem::Requirement
44
- requirements:
45
- - - "~>"
46
- - !ruby/object:Gem::Version
47
- version: '1.17'
48
- type: :development
49
- prerelease: false
50
- version_requirements: !ruby/object:Gem::Requirement
51
- requirements:
52
- - - "~>"
53
- - !ruby/object:Gem::Version
54
- version: '1.17'
55
41
  - !ruby/object:Gem::Dependency
56
42
  name: rake
57
43
  requirement: !ruby/object:Gem::Requirement
58
44
  requirements:
59
45
  - - "~>"
60
46
  - !ruby/object:Gem::Version
61
- version: '10.0'
47
+ version: '13.0'
62
48
  type: :development
63
49
  prerelease: false
64
50
  version_requirements: !ruby/object:Gem::Requirement
65
51
  requirements:
66
52
  - - "~>"
67
53
  - !ruby/object:Gem::Version
68
- version: '10.0'
54
+ version: '13.0'
69
55
  - !ruby/object:Gem::Dependency
70
56
  name: minitest
71
57
  requirement: !ruby/object:Gem::Requirement
@@ -114,7 +100,6 @@ files:
114
100
  - bin/console
115
101
  - bin/setup
116
102
  - lib/simple_text_extract.rb
117
- - lib/simple_text_extract/file_extractor.rb
118
103
  - lib/simple_text_extract/format_extractor/base.rb
119
104
  - lib/simple_text_extract/format_extractor/doc.rb
120
105
  - lib/simple_text_extract/format_extractor/doc_x.rb
@@ -123,9 +108,9 @@ files:
123
108
  - lib/simple_text_extract/format_extractor/xls.rb
124
109
  - lib/simple_text_extract/format_extractor/xls_x.rb
125
110
  - lib/simple_text_extract/format_extractor_factory.rb
126
- - lib/simple_text_extract/tempfile_extractor.rb
127
111
  - lib/simple_text_extract/text_extractor.rb
128
112
  - lib/simple_text_extract/version.rb
113
+ - simple_text_extract-1.0.2.gem
129
114
  - simple_text_extract.gemspec
130
115
  - tags
131
116
  homepage: https://github.com/weilandia/simple_text_extract
@@ -149,8 +134,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
149
134
  requirements:
150
135
  - antiword
151
136
  - pdftotext/poppler
152
- rubyforge_project:
153
- rubygems_version: 2.7.6
137
+ rubygems_version: 3.0.3
154
138
  signing_key:
155
139
  specification_version: 4
156
140
  summary: Attempts to quickly extract text from various file types before resorting
@@ -1,17 +0,0 @@
1
- # frozen_string_literal: true
2
-
3
- module SimpleTextExtract
4
- class FileExtractor < TextExtractor
5
- attr_reader :filepath
6
-
7
- def initialize(filepath:)
8
- @filepath = filepath
9
- end
10
-
11
- private
12
-
13
- def file
14
- @file ||= File.new(filepath)
15
- end
16
- end
17
- end
@@ -1,34 +0,0 @@
1
- # frozen_string_literal: true
2
-
3
- module SimpleTextExtract
4
- class TempfileExtractor < TextExtractor
5
- attr_reader :filename, :raw
6
-
7
- def initialize(filename:, raw:)
8
- @filename = filename
9
- @raw = String.new(raw, encoding: Encoding::UTF_8)
10
-
11
- write_raw
12
- end
13
-
14
- private
15
-
16
- def file
17
- @file ||= Tempfile.new(filepath)
18
- end
19
-
20
- def write_raw
21
- file.write(raw)
22
- file.rewind
23
- end
24
-
25
- def cleanup
26
- file.close
27
- file.unlink
28
- end
29
-
30
- def filepath
31
- @filepath ||= filename.split(".").yield_self { |parts| [parts[0], ".#{parts[1]}"] }
32
- end
33
- end
34
- end