simple_text_extract 0.1.3 → 1.0.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: bd06e0bd11dd74c71adb01918b474b714ebd3762785931d14d294052aa3301e5
-  data.tar.gz: 8fbfff63e6403e4abfc980abd03fce52a1653010c6d401c2507ae253d9391916
+  metadata.gz: 67a6daab9ba3d33ea757384fda1407875c1451cb2be0bb636ffea9b32384c12d
+  data.tar.gz: e5077817daf69f20d5ad54ae82b55465cf3727f5acb20df7382cc54403ca3e43
 SHA512:
-  metadata.gz: 6d2cc814d3b419e9540800752f097b2f9d860e4756e7bd6b8e62f6178ad6cd7967c0f0dd9912a6f900564a686e78564c2c2acb7cee412c9e857ae6ed48cc906e
-  data.tar.gz: dbb00c6da2de38f9d254486adb98bbce2c607a2c54c8637e2b8c9e08efae746ff8bd2ef384a9b1f7ffc2d4edaabda7527498c63726525aa29ba005f24db03770
+  metadata.gz: 7ef181da803d55ba917a5051402a3ac8527deb8886c68c417b4eabc677523a87fd011840944018c3c485a48c8dd2098b60960fc639b716481310c3ccc30f87a3
+  data.tar.gz: f73297a615714bbf29b48b87f0437f10c565078dd97d7ab0a07171f422233e5170f54e97eb19ea81f1e85b4777d52482807cc6efcc8c1737d31e52c41c59d778
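The digests listed above can be compared against a locally unpacked archive. The following is a minimal sketch, not part of the gem or of RubyGems: `checksum_matches?` is a hypothetical helper name, and it only assumes the expected value is a SHA256 hex string copied from `checksums.yaml`.

```ruby
require "digest"
require "tempfile"

# Hypothetical helper (not part of simple_text_extract or RubyGems):
# returns true when the SHA256 hex digest of the file at `path`
# equals `expected_hex`, e.g. a value copied from checksums.yaml.
def checksum_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex.strip.downcase
end

# Demo on a throwaway temp file so the example runs without a downloaded gem.
demo = Tempfile.new(["data", ".tar.gz"])
demo.write("abc")
demo.flush
puts checksum_matches?(demo.path, Digest::SHA256.hexdigest("abc"))
puts checksum_matches?(demo.path, "0" * 64)
demo.close
demo.unlink
```

For a real check, `path` would point at the `data.tar.gz` extracted from the `.gem` file.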
@@ -1 +1 @@
-2.5.3
+2.6.5
data/Gemfile CHANGED
@@ -4,6 +4,4 @@ source "https://rubygems.org"
 
 git_source(:github) { |repo_name| "https://github.com/#{repo_name}" }
 
-gem "pry"
-
 gemspec
data/Gemfile.lock CHANGED
@@ -1,32 +1,35 @@
 PATH
   remote: .
   specs:
-    simple_text_extract (0.1.1)
+    simple_text_extract (1.0.0)
+      roo (~> 2.8.2)
+      spreadsheet (~> 1.1.8)
 
 GEM
   remote: https://rubygems.org/
   specs:
-    coderay (1.1.2)
-    metaclass (0.0.4)
-    method_source (0.9.2)
-    minitest (5.11.3)
-    mocha (1.8.0)
-      metaclass (~> 0.0.1)
-    pry (0.12.2)
-      coderay (~> 1.1.0)
-      method_source (~> 0.9.0)
-    rake (10.5.0)
+    mini_portile2 (2.4.0)
+    minitest (5.14.0)
+    mocha (1.11.2)
+    nokogiri (1.10.9)
+      mini_portile2 (~> 2.4.0)
+    rake (13.0.1)
+    roo (2.8.3)
+      nokogiri (~> 1)
+      rubyzip (>= 1.3.0, < 3.0.0)
+    ruby-ole (1.2.12.2)
+    rubyzip (2.3.0)
+    spreadsheet (1.1.9)
+      ruby-ole (>= 1.0)
 
 PLATFORMS
   ruby
 
 DEPENDENCIES
-  bundler (~> 1.17)
   minitest (~> 5.0)
   mocha
-  pry
-  rake (~> 10.0)
+  rake (~> 13.0)
   simple_text_extract!
 
 BUNDLED WITH
-   1.17.2
+   2.0.2
data/README.md CHANGED
@@ -9,6 +9,7 @@ SimpleTextExtract handles parsing text from:
 - `.doc`
 - `.xlsx`
 - `.xls`
+- `.csv`
 - `.txt` 😜
 
 If no text is parsed (for `pdf`), or a file format is not supported (like images), then `nil` is returned and you can move on to the heavy-duty tools like [Henkei](https://github.com/abrom/henkei) 💪.
@@ -34,11 +35,13 @@ Or install it yourself as:
 Text can be parsed from raw file content or files in the filesystem t by calling `SimpleTextExtract.extract`:
 
 ```ruby
-# raw file content using ActiveStorage
-SimpleTextExtract.extract(filename: attachment.blob.filename, raw: attachment.download)
+# using ActiveStorage >= 6
+extract = attachment.open { |tmp| SimpleTextExtract.extract(tempfile: tmp) }
+# raw file content or when ActiveStorage < 6
+extract = SimpleTextExtract.extract(filename: attachment.blob.filename, raw: attachment.download)
 
 # filesystem
-SimpleTextExtract.extract(filepath: "path_to_file.pdf")
+extract = SimpleTextExtract.extract(filepath: "path_to_file.pdf")
 ```
 
 ### Usage Dependencies
@@ -67,7 +70,7 @@ If not, you can either add that buildpack, or add `poppler-utils` to your `Aptfi
 
 ##### heroku-buildpack-apt
 
-To add `antiword` as a dependency on Heroku, install the [heroku-buildpack-apt](https://elements.heroku.com/buildpacks/heroku/heroku-buildpack-apt) buildpack and follow the install instructions.
+To add `antiword` and/or `gnumeric`* as a dependency on Heroku, install the [heroku-buildpack-apt](https://elements.heroku.com/buildpacks/heroku/heroku-buildpack-apt) buildpack and follow the install instructions.
 
 In your `Aptfile`, add:
 ```
@@ -75,6 +78,8 @@ antiword
 gnumeric
 ```
 
+* There is currently an [issue](https://github.com/heroku/heroku-buildpack-google-chrome/issues/59) with the heroku-18 stack that requires additional dependencies added to the Aptfile to get `gnumeric` to work properly. You can reference the linked issue above to figure out those dependencies, or downgrade to heroku-16 until it is fixed.
+
 ## Benchmarks
 
 *Benchmarks test extracting text from the same file 50 times (Macbook pro)*
@@ -84,8 +89,9 @@ gnumeric
 | .doc | 1.40s | 74.27s |
 | .docx | 0.78s | 71.44s |
 | .pdf* | 1.73s | 82.86s |
-| .xlsx | 21.99s | 51.89s |
-| .txt | 0.036s | 39.25s |
+| .xlsx | 1.16s | 51.89s |
+| .xls | 0.80s | 67.88s |
+| .txt | 0.04s | 39.25s |
 
 * SimpleTextExtract is limited in its text extraction from pdfs, as Tika can also perform OCR on pdfs with Tesseract
 
data/Rakefile CHANGED
@@ -7,4 +7,4 @@ Rake::TestTask.new(:test) do |t|
   t.test_files = FileList["test/**/*_test.rb"]
 end
 
-task :default => :test
+task default: :test
data/bin/console CHANGED
@@ -6,8 +6,5 @@ require "simple_text_extract"
 # You can add fixtures and/or initialization code here to make experimenting
 # with your gem easier. You can also use a different console, if you like.
 
-require "pry"
-Pry.start
-
 require "irb"
 IRB.start(__FILE__)
data/lib/simple_text_extract.rb CHANGED
@@ -2,14 +2,18 @@
 
 require "simple_text_extract/version"
 require "simple_text_extract/text_extractor"
-require "simple_text_extract/file_extractor"
-require "simple_text_extract/tempfile_extractor"
 require "simple_text_extract/format_extractor_factory"
 
 module SimpleTextExtract
+  SUPPORTED_FILETYPES = ["xls", "xlsx", "doc", "docx", "txt", "pdf", "csv"].freeze
+
   class Error < StandardError; end
 
-  def self.extract(filename: nil, raw: nil, filepath: nil)
-    TextExtractor.call(filename: filename, raw: raw, filepath: filepath).to_s
+  def self.extract(filename: nil, raw: nil, filepath: nil, tempfile: nil)
+    TextExtractor.new(filename: filename, raw: raw, filepath: filepath, tempfile: tempfile).to_s
+  end
+
+  def self.supports?(filename: nil)
+    SUPPORTED_FILETYPES.include?(filename.split(".").last)
   end
 end
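The new `supports?` check compares the last dot-separated segment of the filename against `SUPPORTED_FILETYPES`. The sketch below reproduces that logic in a stand-in module (`SupportsSketch`, a name invented here) to show a few edge cases. `lenient_supports?` is a hypothetical variant for comparison and is not part of the gem.

```ruby
# Stand-in module mirroring the 1.0.2 `supports?` logic from the hunk above.
module SupportsSketch
  SUPPORTED_FILETYPES = ["xls", "xlsx", "doc", "docx", "txt", "pdf", "csv"].freeze

  # Same check as in the diff: last dot-separated segment, case-sensitive.
  def self.supports?(filename: nil)
    SUPPORTED_FILETYPES.include?(filename.split(".").last)
  end

  # Hypothetical variant (not in the gem): nil-safe and case-insensitive,
  # and it requires an actual extension rather than a bare name like "pdf".
  def self.lenient_supports?(filename: nil)
    ext = File.extname(filename.to_s).delete_prefix(".").downcase
    SUPPORTED_FILETYPES.include?(ext)
  end
end

puts SupportsSketch.supports?(filename: "report.pdf")         # supported extension
puts SupportsSketch.supports?(filename: "report.PDF")         # uppercase is not matched
puts SupportsSketch.lenient_supports?(filename: "report.PDF") # variant ignores case
```

As written in the diff, calling `supports?` without a filename raises `NoMethodError`, because `nil` has no `split`.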
data/lib/simple_text_extract/format_extractor/xls.rb ADDED
@@ -0,0 +1,20 @@
+# frozen_string_literal: true
+
+module SimpleTextExtract
+  module FormatExtractor
+    class Xls < Base
+      def extract
+        require "spreadsheet"
+
+        spreadsheet = Spreadsheet.open(file)
+        text = []
+        spreadsheet.worksheets.each do |sheet|
+          text << sheet.name
+          text << sheet.rows
+        end
+
+        text.flatten.join(" ")
+      end
+    end
+  end
+end
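The new `Xls` extractor collects each sheet name followed by its rows, then flattens everything into one space-separated string. The sketch below illustrates that collection step on plain arrays. `FakeSheet` and `join_sheets` are hypothetical names, and the sketch assumes rows behave like arrays of cell values, as they do when flattened in the hunk above.

```ruby
# Stand-in for a worksheet: only the two methods the extractor calls.
# (FakeSheet is hypothetical; the real objects come from the spreadsheet gem.)
FakeSheet = Struct.new(:name, :rows)

# Mirrors the collection logic in Xls#extract: sheet name, then all rows,
# flattened into a single space-separated string.
def join_sheets(sheets)
  text = []
  sheets.each do |sheet|
    text << sheet.name
    text << sheet.rows
  end
  text.flatten.join(" ")
end

sheets = [
  FakeSheet.new("Q1", [["region", "total"], ["north", 12.5]]),
  FakeSheet.new("Notes", [["draft", nil]])
]
puts join_sheets(sheets)
```

Empty cells (`nil`) become empty strings when joined, so the output can contain doubled or trailing spaces.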
data/lib/simple_text_extract/format_extractor/xls_x.rb CHANGED
@@ -4,16 +4,18 @@ module SimpleTextExtract
   module FormatExtractor
     class XlsX < Base
       def extract
-        return nil if missing_dependency?("ssconvert")
+        require "roo"
 
-        extract_filepath = "#{file.path.split(".")[0]}.txt"
+        spreadsheet = Roo::Spreadsheet.open(file, only_visible_sheets: true)
 
-        `ssconvert -O 'separator=" "' #{Shellwords.escape(file.path)} #{extract_filepath}`
+        text = []
 
-        text = File.read(extract_filepath)
-        File.unlink(extract_filepath)
+        spreadsheet.each_with_pagename do |name, sheet|
+          text << name
+          1.upto(sheet.last_row.to_i) { |row| text << sheet.row(row) }
+        end
 
-        text
+        text.flatten.join(" ")
       end
     end
   end
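The `sheet.last_row.to_i` call in the new `XlsX` extractor guards the row loop against sheets with no cells. The sketch below shows why, using a hypothetical stand-in class (`FakeRooSheet`) that assumes `last_row` returns `nil` for an empty sheet and that `row(n)` is 1-indexed. It does not load Roo itself.

```ruby
# Hypothetical stand-in for a Roo sheet. Assumption: `last_row` is nil when
# the sheet has no cells, and `row(number)` is 1-indexed.
class FakeRooSheet
  def initialize(rows)
    @rows = rows
  end

  def last_row
    @rows.empty? ? nil : @rows.length
  end

  def row(number)
    @rows[number - 1]
  end
end

# Mirrors the per-sheet loop in XlsX#extract.
def collect_rows(name, sheet)
  text = [name]
  # nil.to_i is 0, so 1.upto(0) runs zero times for an empty sheet.
  1.upto(sheet.last_row.to_i) { |row| text << sheet.row(row) }
  text
end

puts collect_rows("Empty", FakeRooSheet.new([])).inspect
puts collect_rows("Data", FakeRooSheet.new([["a", "b"], ["c"]])).flatten.join(" ")
```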
data/lib/simple_text_extract/format_extractor_factory.rb CHANGED
@@ -4,14 +4,15 @@ require "simple_text_extract/format_extractor/base"
 require "simple_text_extract/format_extractor/plain_text"
 require "simple_text_extract/format_extractor/pdf"
 require "simple_text_extract/format_extractor/xls_x"
+require "simple_text_extract/format_extractor/xls"
 require "simple_text_extract/format_extractor/doc_x"
 require "simple_text_extract/format_extractor/doc"
 
 module SimpleTextExtract
   class FormatExtractorFactory
-    def self.call(file) # rubocop:disable Metrics/MethodLength
+    def self.call(file) # rubocop:disable Metrics/MethodLength, Metrics/CyclomaticComplexity
       case file.path
-      when /.txt$/i
+      when /(.txt$|.csv$)/i
         FormatExtractor::PlainText.new(file)
       when /.pdf$/i
         FormatExtractor::PDF.new(file)
@@ -19,8 +20,10 @@ module SimpleTextExtract
         FormatExtractor::DocX.new(file)
       when /.doc$/i
         FormatExtractor::Doc.new(file)
-      when /(.xlsx$|.xls$)/i
+      when /.xlsx$/i
         FormatExtractor::XlsX.new(file)
+      when /.xls$/i
+        FormatExtractor::Xls.new(file)
       else
         FormatExtractor::Base.new(file)
       end
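The factory dispatches on regular expressions whose leading dot is unescaped, so the dot matches any character. The sketch below compares the pattern from the diff with a stricter one and shows an extension-keyed lookup as an alternative. `STRICT_TXT`, `EXTRACTOR_FOR`, and `extractor_key` are hypothetical names and are not part of the gem, and the symbols stand in for the extractor classes.

```ruby
# Pattern as written in the diff: the leading dot is unescaped, so it
# matches any character, and `$` anchors at end of line rather than string.
LOOSE_TXT = /(.txt$|.csv$)/i

# Hypothetical stricter pattern (not in the gem): literal dot, end of string.
STRICT_TXT = /\.(txt|csv)\z/i

# Hypothetical alternative dispatch keyed on File.extname.
EXTRACTOR_FOR = {
  ".txt" => :plain_text, ".csv" => :plain_text, ".pdf" => :pdf,
  ".docx" => :doc_x, ".doc" => :doc, ".xlsx" => :xls_x, ".xls" => :xls
}.freeze

# Falls back to :base for unknown extensions, like the `else` branch above.
def extractor_key(path)
  EXTRACTOR_FOR.fetch(File.extname(path).downcase, :base)
end

puts LOOSE_TXT.match?("notes_txt")   # the unescaped dot accepts "_"
puts STRICT_TXT.match?("notes_txt")  # the literal dot rejects it
puts extractor_key("reports/q1.XLSX")
```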
data/lib/simple_text_extract/text_extractor.rb CHANGED
@@ -2,24 +2,52 @@
 
 module SimpleTextExtract
   class TextExtractor
-    def self.call(filename: nil, raw: nil, filepath: nil)
-      if !filename.nil? && !raw.nil?
-        TempfileExtractor.new(filename: filename.to_s, raw: raw).extract
-      elsif !filepath.nil? && File.exist?(filepath)
-        FileExtractor.new(filepath: filepath).extract
-      end
-    end
+    attr_reader :file
 
-    def extract
-      text = FormatExtractorFactory.call(file).extract
-      cleanup
+    def initialize(filename: nil, raw: nil, filepath: nil, tempfile: nil)
+      @file = get_file(filename: filename, raw: raw, filepath: filepath, tempfile: tempfile)
+    end
 
-      text
+    def to_s
+      @to_s ||= extract.to_s
     end
 
     private
 
+    def get_file(filename:, raw:, filepath:, tempfile:)
+      if tempfile&.class == Tempfile
+        tempfile
+      elsif !filename.nil? && !raw.nil?
+        write_tempfile(filename: filename.to_s, raw: raw)
+      elsif !filepath.nil? && File.exist?(filepath)
+        File.new(filepath)
+      end
+    end
+
+    def extract
+      return unless file
+
+      begin
+        FormatExtractorFactory.call(file).extract
+      ensure
+        cleanup
+      end
+    end
+
     def cleanup
+      return unless file.class == Tempfile
+
+      file.close
+      file.unlink
+    end
+
+    def write_tempfile(filename:, raw:)
+      filename = filename.split(".").yield_self { |parts| [parts[0], ".#{parts[1]}"] }
+      file = Tempfile.new(filename)
+      raw = String.new(raw, encoding: Encoding::UTF_8)
+
+      file.write(raw)
+      file.tap(&:rewind)
     end
   end
 end
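`write_tempfile` passes a `[basename, extension]` pair to `Tempfile.new` so that the temp file keeps an extension the factory can dispatch on, and `extract` wraps the work in `ensure` so the temp file is removed even if extraction raises. The sketch below reproduces the split from the diff next to a hypothetical alternative based on `File.extname`. `tempfile_name_parts` and `with_raw_tempfile` are names invented for illustration.

```ruby
require "tempfile"

# The split used in the diff, reproduced for comparison. With more than one
# dot in the name, parts[1] is not the real extension.
def split_name_parts(filename)
  filename.split(".").yield_self { |parts| [parts[0], ".#{parts[1]}"] }
end

# Hypothetical helper, not part of the gem: derives the pair from
# File.extname so "my.report.pdf" keeps ".pdf".
def tempfile_name_parts(filename)
  ext = File.extname(filename)
  [File.basename(filename, ext), ext]
end

# Hypothetical wrapper showing the write, rewind, and ensure-cleanup steps.
def with_raw_tempfile(filename, raw)
  file = Tempfile.new(tempfile_name_parts(filename))
  file.write(raw)
  file.rewind
  yield file
ensure
  if file
    file.close
    file.unlink
  end
end

p split_name_parts("my.report.pdf")
p tempfile_name_parts("my.report.pdf")
p with_raw_tempfile("my.report.pdf", "hello") { |f| [File.extname(f.path), f.read] }
```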
data/lib/simple_text_extract/version.rb CHANGED
@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 
 module SimpleTextExtract
-  VERSION = "0.1.3"
+  VERSION = "1.0.2"
 end
data/simple_text_extract.gemspec CHANGED
@@ -24,12 +24,14 @@ Gem::Specification.new do |spec|
   spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
   spec.require_paths = ["lib"]
 
-  spec.requirements << "Antiword"
+  spec.requirements << "antiword"
   spec.requirements << "pdftotext/poppler"
   spec.required_ruby_version = ">= 2.5"
 
-  spec.add_development_dependency "bundler", "~> 1.17"
-  spec.add_development_dependency "rake", "~> 10.0"
+  spec.add_runtime_dependency "roo", "~> 2.8.2"
+  spec.add_runtime_dependency "spreadsheet", "~> 1.1.8"
+
+  spec.add_development_dependency "rake", "~> 13.0"
   spec.add_development_dependency "minitest", "~> 5.0"
   spec.add_development_dependency "mocha"
 end
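The new runtime dependencies use the pessimistic operator (`~>`), which pins everything except the last listed segment. The sketch below checks a few versions against those constraints with RubyGems' own `Gem::Requirement`. `allowed?` is a hypothetical helper name introduced for the example.

```ruby
require "rubygems"

# Hypothetical helper: true when `version` satisfies the requirement string.
def allowed?(requirement, version)
  Gem::Requirement.new(requirement).satisfied_by?(Gem::Version.new(version))
end

puts allowed?("~> 2.8.2", "2.8.3")  # patch bump within 2.8.x
puts allowed?("~> 2.8.2", "2.9.0")  # minor bump is outside "~> 2.8.2"
puts allowed?("~> 13.0", "13.0.1")  # any 13.x satisfies "~> 13.0"
puts allowed?("~> 13.0", "14.0")
```

This is consistent with the lockfile in this diff resolving `roo` to 2.8.3 and `spreadsheet` to 1.1.9.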
metadata CHANGED
@@ -1,43 +1,57 @@
 --- !ruby/object:Gem::Specification
 name: simple_text_extract
 version: !ruby/object:Gem::Version
-  version: 0.1.3
+  version: 1.0.2
 platform: ruby
 authors:
 - Nick Weiland
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2019-01-24 00:00:00.000000000 Z
+date: 2020-07-17 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
-  name: bundler
+  name: roo
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '1.17'
-  type: :development
+        version: 2.8.2
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 2.8.2
+- !ruby/object:Gem::Dependency
+  name: spreadsheet
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 1.1.8
+  type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '1.17'
+        version: 1.1.8
 - !ruby/object:Gem::Dependency
   name: rake
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '10.0'
+        version: '13.0'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '10.0'
+        version: '13.0'
 - !ruby/object:Gem::Dependency
   name: minitest
   requirement: !ruby/object:Gem::Requirement
@@ -86,15 +100,14 @@ files:
 - bin/console
 - bin/setup
 - lib/simple_text_extract.rb
-- lib/simple_text_extract/file_extractor.rb
 - lib/simple_text_extract/format_extractor/base.rb
 - lib/simple_text_extract/format_extractor/doc.rb
 - lib/simple_text_extract/format_extractor/doc_x.rb
 - lib/simple_text_extract/format_extractor/pdf.rb
 - lib/simple_text_extract/format_extractor/plain_text.rb
+- lib/simple_text_extract/format_extractor/xls.rb
 - lib/simple_text_extract/format_extractor/xls_x.rb
 - lib/simple_text_extract/format_extractor_factory.rb
-- lib/simple_text_extract/tempfile_extractor.rb
 - lib/simple_text_extract/text_extractor.rb
 - lib/simple_text_extract/version.rb
 - simple_text_extract.gemspec
@@ -118,10 +131,9 @@ required_rubygems_version: !ruby/object:Gem::Requirement
     - !ruby/object:Gem::Version
       version: '0'
 requirements:
-- Antiword
+- antiword
 - pdftotext/poppler
-rubyforge_project:
-rubygems_version: 2.7.6
+rubygems_version: 3.0.3
 signing_key:
 specification_version: 4
 summary: Attempts to quickly extract text from various file types before resorting
data/lib/simple_text_extract/file_extractor.rb DELETED
@@ -1,17 +0,0 @@
-# frozen_string_literal: true
-
-module SimpleTextExtract
-  class FileExtractor < TextExtractor
-    attr_reader :filepath
-
-    def initialize(filepath:)
-      @filepath = filepath
-    end
-
-    private
-
-    def file
-      @file ||= File.new(filepath)
-    end
-  end
-end
data/lib/simple_text_extract/tempfile_extractor.rb DELETED
@@ -1,34 +0,0 @@
-# frozen_string_literal: true
-
-module SimpleTextExtract
-  class TempfileExtractor < TextExtractor
-    attr_reader :filename, :raw
-
-    def initialize(filename:, raw:)
-      @filename = filename
-      @raw = String.new(raw, encoding: Encoding::UTF_8)
-
-      write_raw
-    end
-
-    private
-
-    def file
-      @file ||= Tempfile.new(filepath)
-    end
-
-    def write_raw
-      file.write(raw)
-      file.rewind
-    end
-
-    def cleanup
-      file.close
-      file.unlink
-    end
-
-    def filepath
-      @filepath ||= filename.split(".").yield_self { |parts| [parts[0], ".#{parts[1]}"] }
-    end
-  end
-end