simple_text_extract 0.2.0 → 0.2.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: fa34a31195f18156df695d2fb860a2beb9562f6a8ae76d6f2b6a7d4585e2e306
- data.tar.gz: 3af31eaa54ed98bf8d0355cdf2bdf29f525a97e1ae66573b1c7317cbfcd3c951
+ metadata.gz: 83da9d28803f321b9a13aeaad4972211d40733b96f6b5fd085e52ab293a19d30
+ data.tar.gz: 99769610f1adef1d8fbe46647c7253af7859029a362854f4c8d73ec45fa9d8da
  SHA512:
- metadata.gz: 3d37e232dd959b4c0897439a29b46f64d598f024b61bba2edc2cdad1f0d14461db77b93a227326697d4913e83ab87b4393234943aef964285b253c652849436d
- data.tar.gz: 44255a841598321a97559ef3a77c3db3e6f6a7d22cb78fc78819aaecef643b8b67c85484b4f10ac979923f28c8f727fcaca69ebec694b3df9fe5f6a9c37cc1f9
+ metadata.gz: 6f8dc568cf35fe6519d24dfc9a97a2b3c4d68770d5d489a1a1c4f813307ff7cc2fb973a663656893b448fb2532198f36373827ef202887edb1ad73b0ef53d3e7
+ data.tar.gz: d334282c216656d91cb038d020e4c1da67ca563b708bf4356f0005ff8d1ec2f1dae1ea58c5427828a2a593b2ef238325caab4fdbdf9d3575ca8ca5e14b1791ca
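The entries in `checksums.yaml` are hex digests of the `metadata.gz` and `data.tar.gz` archives packed inside the `.gem` file. As a rough illustration, assuming the archives have already been unpacked to local files, a digest can be recomputed and compared using Ruby's standard `digest` library. The helper name `digest_matches?` is hypothetical and not part of RubyGems or this gem:

```ruby
require "digest"

# Hypothetical helper: returns true when the file at `path` hashes to
# `expected_hex` under the given algorithm (SHA256 by default).
def digest_matches?(path, expected_hex, algorithm: Digest::SHA256)
  # Digest::SHA256.file reads the file and returns a digest object;
  # hexdigest renders it as a lowercase hex string.
  algorithm.file(path).hexdigest == expected_hex.downcase
end
```

Passing `algorithm: Digest::SHA512` would check the longer digests listed under `SHA512:` in the same way.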
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
  PATH
  remote: .
  specs:
- simple_text_extract (0.1.3)
+ simple_text_extract (0.2.0)
  roo (~> 2.8)
  spreadsheet (~> 1.1.8)
  
data/README.md CHANGED
@@ -51,6 +51,9 @@ You can choose to use SimpleTextExtract without the following dependencies, but
  `doc` parsing requires `antiword`
  - `brew install antiword`
  
+ `xlsx` and `xls` parsing requires `ssconvert` which is part of `gnumeric`
+ - `brew install gnumeric`
+
  ### Usage on Heroku
  
  To use on Heroku you'll have to add some custom buildpacks.
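The README lines added in this hunk note that `xlsx` and `xls` parsing depends on the `ssconvert` binary from `gnumeric`. A small sketch of how an application might check for that binary before attempting extraction follows; `executable_on_path?` is a hypothetical helper, not something the gem provides:

```ruby
# Hypothetical helper: true when an executable file called `name`
# exists in one of the directories listed in the PATH variable.
def executable_on_path?(name)
  ENV.fetch("PATH", "").split(File::PATH_SEPARATOR).any? do |dir|
    candidate = File.join(dir, name)
    File.file?(candidate) && File.executable?(candidate)
  end
end
```

A caller could then guard spreadsheet handling with `executable_on_path?("ssconvert")` and fall back or warn when it returns false.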
@@ -64,13 +67,16 @@ If not, you can either add that buildpack, or add `poppler-utils` to your `Aptfi
  
  ##### heroku-buildpack-apt
  
- To add `antiword` as a dependency on Heroku, install the [heroku-buildpack-apt](https://elements.heroku.com/buildpacks/heroku/heroku-buildpack-apt) buildpack and follow the install instructions.
+ To add `antiword` and/or `gnumeric`* as a dependency on Heroku, install the [heroku-buildpack-apt](https://elements.heroku.com/buildpacks/heroku/heroku-buildpack-apt) buildpack and follow the install instructions.
  
  In your `Aptfile`, add:
  ```
  antiword
+ gnumeric
  ```
  
+ * There is currently an [issue](https://github.com/heroku/heroku-buildpack-google-chrome/issues/59) with the heroku-18 stack that requires additional dependencies added to the Aptfile to get `gnumeric` to work properly. You can reference the linked issue above to figure out those dependencies, or downgrade to heroku-16 until it is fixed.
+
  ## Benchmarks
  
  *Benchmarks test extracting text from the same file 50 times (Macbook pro)*
@@ -80,8 +86,9 @@ antiword
  | .doc | 1.40s | 74.27s |
  | .docx | 0.78s | 71.44s |
  | .pdf* | 1.73s | 82.86s |
- | .xlsx | 21.99s | 51.89s |
- | .txt | 0.036s | 39.25s |
+ | .xlsx | 1.16s | 51.89s |
+ | .xls | 0.80s | 67.88s |
+ | .txt | 0.04s | 39.25s |
  
  * SimpleTextExtract is limited in its text extraction from pdfs, as Tika can also perform OCR on pdfs with Tesseract
  
@@ -1,5 +1,5 @@
  # frozen_string_literal: true
  
  module SimpleTextExtract
-   VERSION = "0.2.0"
+   VERSION = "0.2.1"
  end
@@ -7,9 +7,15 @@ require "simple_text_extract/tempfile_extractor"
  require "simple_text_extract/format_extractor_factory"
  
  module SimpleTextExtract
+   SUPPORTED_FILETYPES = ["xls", "xlsx", "doc", "docx", "txt", "pdf"]
+
    class Error < StandardError; end
  
    def self.extract(filename: nil, raw: nil, filepath: nil)
      TextExtractor.call(filename: filename, raw: raw, filepath: filepath).to_s
    end
+
+   def self.supports?(filename: nil)
+     SUPPORTED_FILETYPES.include?(filename.split(".")[1])
+   end
  end
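The `supports?` method added in this hunk looks at the second dot-separated segment of the filename, so a name such as `report.final.pdf` or one with an uppercase extension would not match, and calling it with the default `filename: nil` raises `NoMethodError`. A minimal sketch of a more tolerant check is shown here, assuming the same list of extensions. The module name `SupportCheckSketch` is hypothetical and this is not the gem's implementation:

```ruby
# Hypothetical sketch, not part of simple_text_extract.
module SupportCheckSketch
  SUPPORTED_FILETYPES = %w[xls xlsx doc docx txt pdf].freeze

  def self.supports?(filename: nil)
    return false if filename.nil?

    # File.extname returns the final extension with its leading dot
    # (".pdf"), or an empty string when the name has no extension.
    # String#delete_prefix needs Ruby 2.5 or newer.
    extension = File.extname(filename.to_s).delete_prefix(".").downcase
    SUPPORTED_FILETYPES.include?(extension)
  end
end
```

Using `File.extname` keeps the check tied to the last extension only, which is usually what a file-type lookup wants.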
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: simple_text_extract
  version: !ruby/object:Gem::Version
- version: 0.2.0
+ version: 0.2.1
  platform: ruby
  authors:
  - Nick Weiland
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2019-01-25 00:00:00.000000000 Z
+ date: 2019-01-28 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: roo