simple_text_extract 0.2.0 → 0.2.1
- checksums.yaml +4 -4
- data/Gemfile.lock +1 -1
- data/README.md +10 -3
- data/lib/simple_text_extract/version.rb +1 -1
- data/lib/simple_text_extract.rb +6 -0
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 83da9d28803f321b9a13aeaad4972211d40733b96f6b5fd085e52ab293a19d30
+  data.tar.gz: 99769610f1adef1d8fbe46647c7253af7859029a362854f4c8d73ec45fa9d8da
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 6f8dc568cf35fe6519d24dfc9a97a2b3c4d68770d5d489a1a1c4f813307ff7cc2fb973a663656893b448fb2532198f36373827ef202887edb1ad73b0ef53d3e7
+  data.tar.gz: d334282c216656d91cb038d020e4c1da67ca563b708bf4356f0005ff8d1ec2f1dae1ea58c5427828a2a593b2ef238325caab4fdbdf9d3575ca8ca5e14b1791ca
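The SHA256 entries in `checksums.yaml` are hex digests of the archives packed inside the `.gem` file. A minimal sketch of checking one of them by hand, assuming the gem has already been unpacked so that `data.tar.gz` is available on disk; the helper name `sha256_matches?` is hypothetical and not part of RubyGems or this gem:

```ruby
require "digest"

# Hypothetical helper: returns true when the SHA256 hex digest of +bytes+
# equals +expected_hex+ (compared case-insensitively).
def sha256_matches?(bytes, expected_hex)
  Digest::SHA256.hexdigest(bytes) == expected_hex.downcase
end
```

Under that assumption, a call such as `sha256_matches?(File.binread("data.tar.gz"), "99769610f1adef1d8fbe46647c7253af7859029a362854f4c8d73ec45fa9d8da")` would compare the unpacked archive with the 0.2.1 value listed above.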
data/Gemfile.lock
CHANGED
data/README.md
CHANGED
@@ -51,6 +51,9 @@ You can choose to use SimpleTextExtract without the following dependencies, but
 `doc` parsing requires `antiword`
 - `brew install antiword`
 
+`xlsx` and `xls` parsing requires `ssconvert` which is part of `gnumeric`
+- `brew install gnumeric`
+
 ### Usage on Heroku
 
 To use on Heroku you'll have to add some custom buildpacks.
@@ -64,13 +67,16 @@ If not, you can either add that buildpack, or add `poppler-utils` to your `Aptfi
 
 ##### heroku-buildpack-apt
 
-To add `antiword` as a dependency on Heroku, install the [heroku-buildpack-apt](https://elements.heroku.com/buildpacks/heroku/heroku-buildpack-apt) buildpack and follow the install instructions.
+To add `antiword` and/or `gnumeric`* as a dependency on Heroku, install the [heroku-buildpack-apt](https://elements.heroku.com/buildpacks/heroku/heroku-buildpack-apt) buildpack and follow the install instructions.
 
 In your `Aptfile`, add:
 ```
 antiword
+gnumeric
 ```
 
+* There is currently an [issue](https://github.com/heroku/heroku-buildpack-google-chrome/issues/59) with the heroku-18 stack that requires additional dependencies added to the Aptfile to get `gnumeric` to work properly. You can reference the linked issue above to figure out those dependencies, or downgrade to heroku-16 until it is fixed.
+
 ## Benchmarks
 
 *Benchmarks test extracting text from the same file 50 times (Macbook pro)*
@@ -80,8 +86,9 @@ antiword
 | .doc | 1.40s | 74.27s |
 | .docx | 0.78s | 71.44s |
 | .pdf* | 1.73s | 82.86s |
-| .xlsx |
-| .
+| .xlsx | 1.16s | 51.89s |
+| .xls | 0.80s | 67.88s |
+| .txt | 0.04s | 39.25s |
 
 * SimpleTextExtract is limited in its text extraction from pdfs, as Tika can also perform OCR on pdfs with Tesseract
 
data/lib/simple_text_extract.rb
CHANGED
@@ -7,9 +7,15 @@ require "simple_text_extract/tempfile_extractor"
 require "simple_text_extract/format_extractor_factory"
 
 module SimpleTextExtract
+  SUPPORTED_FILETYPES = ["xls", "xlsx", "doc", "docx", "txt", "pdf"]
+
   class Error < StandardError; end
 
   def self.extract(filename: nil, raw: nil, filepath: nil)
     TextExtractor.call(filename: filename, raw: raw, filepath: filepath).to_s
   end
+
+  def self.supports?(filename: nil)
+    SUPPORTED_FILETYPES.include?(filename.split(".")[1])
+  end
 end
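Note on the added `supports?`: it checks the second dot-separated segment of the filename, so a name such as `report.final.pdf` is tested against `final`, and the default `filename: nil` raises `NoMethodError` on `split`. A minimal sketch of a more defensive check, assuming the last extension is what should be matched and that case-insensitive matching is acceptable (both are assumptions, not behavior of the gem); the module name `ExtensionCheck` is hypothetical:

```ruby
# Hypothetical helper, not part of simple_text_extract.
module ExtensionCheck
  # Same entries as SUPPORTED_FILETYPES in the hunk above.
  SUPPORTED_FILETYPES = %w[xls xlsx doc docx txt pdf].freeze

  # Returns true when the final extension of +filename+ is in the list.
  # A nil filename or a name without an extension returns false.
  def self.supports?(filename: nil)
    return false if filename.nil?

    # File.extname returns the last extension including the dot, e.g. ".pdf".
    ext = File.extname(filename.to_s).delete_prefix(".").downcase
    SUPPORTED_FILETYPES.include?(ext)
  end
end
```

`String#delete_prefix` needs Ruby 2.5 or later; on older versions `sub(/\A\./, "")` would do the same job.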
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: simple_text_extract
 version: !ruby/object:Gem::Version
-  version: 0.2.
+  version: 0.2.1
 platform: ruby
 authors:
 - Nick Weiland
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2019-01-
+date: 2019-01-28 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: roo