simple_text_extract 0.2.0 → 0.2.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: fa34a31195f18156df695d2fb860a2beb9562f6a8ae76d6f2b6a7d4585e2e306
-   data.tar.gz: 3af31eaa54ed98bf8d0355cdf2bdf29f525a97e1ae66573b1c7317cbfcd3c951
+   metadata.gz: 83da9d28803f321b9a13aeaad4972211d40733b96f6b5fd085e52ab293a19d30
+   data.tar.gz: 99769610f1adef1d8fbe46647c7253af7859029a362854f4c8d73ec45fa9d8da
  SHA512:
-   metadata.gz: 3d37e232dd959b4c0897439a29b46f64d598f024b61bba2edc2cdad1f0d14461db77b93a227326697d4913e83ab87b4393234943aef964285b253c652849436d
-   data.tar.gz: 44255a841598321a97559ef3a77c3db3e6f6a7d22cb78fc78819aaecef643b8b67c85484b4f10ac979923f28c8f727fcaca69ebec694b3df9fe5f6a9c37cc1f9
+   metadata.gz: 6f8dc568cf35fe6519d24dfc9a97a2b3c4d68770d5d489a1a1c4f813307ff7cc2fb973a663656893b448fb2532198f36373827ef202887edb1ad73b0ef53d3e7
+   data.tar.gz: d334282c216656d91cb038d020e4c1da67ca563b708bf4356f0005ff8d1ec2f1dae1ea58c5427828a2a593b2ef238325caab4fdbdf9d3575ca8ca5e14b1791ca
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
  PATH
    remote: .
    specs:
-     simple_text_extract (0.1.3)
+     simple_text_extract (0.2.0)
        roo (~> 2.8)
        spreadsheet (~> 1.1.8)

data/README.md CHANGED
@@ -51,6 +51,9 @@ You can choose to use SimpleTextExtract without the following dependencies, but
  `doc` parsing requires `antiword`
  - `brew install antiword`

+ `xlsx` and `xls` parsing requires `ssconvert` which is part of `gnumeric`
+ - `brew install gnumeric`
+
  ### Usage on Heroku

  To use on Heroku you'll have to add some custom buildpacks.
@@ -64,13 +67,16 @@ If not, you can either add that buildpack, or add `poppler-utils` to your `Aptfi

  ##### heroku-buildpack-apt

- To add `antiword` as a dependency on Heroku, install the [heroku-buildpack-apt](https://elements.heroku.com/buildpacks/heroku/heroku-buildpack-apt) buildpack and follow the install instructions.
+ To add `antiword` and/or `gnumeric`* as a dependency on Heroku, install the [heroku-buildpack-apt](https://elements.heroku.com/buildpacks/heroku/heroku-buildpack-apt) buildpack and follow the install instructions.

  In your `Aptfile`, add:
  ```
  antiword
+ gnumeric
  ```

+ * There is currently an [issue](https://github.com/heroku/heroku-buildpack-google-chrome/issues/59) with the heroku-18 stack that requires additional dependencies added to the Aptfile to get `gnumeric` to work properly. You can reference the linked issue above to figure out those dependencies, or downgrade to heroku-16 until it is fixed.
+
  ## Benchmarks

  *Benchmarks test extracting text from the same file 50 times (Macbook pro)*
@@ -80,8 +86,9 @@ antiword
  | .doc | 1.40s | 74.27s |
  | .docx | 0.78s | 71.44s |
  | .pdf* | 1.73s | 82.86s |
- | .xlsx | 21.99s | 51.89s |
- | .txt | 0.036s | 39.25s |
+ | .xlsx | 1.16s | 51.89s |
+ | .xls | 0.80s | 67.88s |
+ | .txt | 0.04s | 39.25s |

  * SimpleTextExtract is limited in its text extraction from pdfs, as Tika can also perform OCR on pdfs with Tesseract

@@ -1,5 +1,5 @@
  # frozen_string_literal: true

  module SimpleTextExtract
-   VERSION = "0.2.0"
+   VERSION = "0.2.1"
  end
@@ -7,9 +7,15 @@ require "simple_text_extract/tempfile_extractor"
  require "simple_text_extract/format_extractor_factory"

  module SimpleTextExtract
+   SUPPORTED_FILETYPES = ["xls", "xlsx", "doc", "docx", "txt", "pdf"]
+
    class Error < StandardError; end

    def self.extract(filename: nil, raw: nil, filepath: nil)
      TextExtractor.call(filename: filename, raw: raw, filepath: filepath).to_s
    end
+
+   def self.supports?(filename: nil)
+     SUPPORTED_FILETYPES.include?(filename.split(".")[1])
+   end
  end
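
Review note on the new `supports?` method: `filename.split(".")[1]` returns the second dot-separated segment, not the last one, so a name such as `report.final.xlsx` yields `"final"` and is reported as unsupported. The default `filename: nil` also raises `NoMethodError`, because `nil` has no `split`. The following is a minimal sketch of one possible alternative based on `File.extname`, not the gem's code. The module name `SupportSketch` is hypothetical, and both the case-insensitive match and the `false` result for `nil` are assumptions about the desired behavior.

```ruby
# Hypothetical sketch, not part of simple_text_extract.
module SupportSketch
  SUPPORTED_FILETYPES = %w[xls xlsx doc docx txt pdf].freeze

  # Returns true when the final extension of +filename+ is in the list.
  def self.supports?(filename: nil)
    # Assumption: a missing filename counts as unsupported instead of raising.
    return false if filename.nil?

    # File.extname returns the last extension including the dot, e.g. ".xlsx".
    # Assumption: matching ignores case, so "NOTES.TXT" is accepted.
    ext = File.extname(filename.to_s).delete_prefix(".").downcase
    SUPPORTED_FILETYPES.include?(ext)
  end
end

puts SupportSketch.supports?(filename: "report.final.xlsx")
puts SupportSketch.supports?(filename: "image.png")
```

With this approach only the final extension is inspected, so file names containing extra dots are judged by their last segment.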
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: simple_text_extract
  version: !ruby/object:Gem::Version
-   version: 0.2.0
+   version: 0.2.1
  platform: ruby
  authors:
  - Nick Weiland
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2019-01-25 00:00:00.000000000 Z
+ date: 2019-01-28 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: roo