simple_text_extract 0.2.0 → 0.2.1
- checksums.yaml +4 -4
- data/Gemfile.lock +1 -1
- data/README.md +10 -3
- data/lib/simple_text_extract/version.rb +1 -1
- data/lib/simple_text_extract.rb +6 -0
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 83da9d28803f321b9a13aeaad4972211d40733b96f6b5fd085e52ab293a19d30
+  data.tar.gz: 99769610f1adef1d8fbe46647c7253af7859029a362854f4c8d73ec45fa9d8da
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 6f8dc568cf35fe6519d24dfc9a97a2b3c4d68770d5d489a1a1c4f813307ff7cc2fb973a663656893b448fb2532198f36373827ef202887edb1ad73b0ef53d3e7
+  data.tar.gz: d334282c216656d91cb038d020e4c1da67ca563b708bf4356f0005ff8d1ec2f1dae1ea58c5427828a2a593b2ef238325caab4fdbdf9d3575ca8ca5e14b1791ca
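The SHA256 entries in `checksums.yaml` can be compared against a locally downloaded archive. The following is a minimal sketch, assuming the text of `checksums.yaml` and the file to verify are both available locally; `sha256_matches?` is a hypothetical helper name and is not part of RubyGems or of this gem.

```ruby
require "digest"
require "yaml"

# Hypothetical helper (not part of RubyGems or simple_text_extract).
# checksums_yaml: the text of checksums.yaml
# entry:          "metadata.gz" or "data.tar.gz"
# file_path:      local path of the file to verify
def sha256_matches?(checksums_yaml, entry, file_path)
  checksums = YAML.safe_load(checksums_yaml) || {}
  expected = checksums.dig("SHA256", entry)
  return false if expected.nil?

  # Hash the file on disk and compare with the recorded hex digest.
  Digest::SHA256.file(file_path).hexdigest == expected.to_s
end
```

A call such as `sha256_matches?(File.read("checksums.yaml"), "data.tar.gz", "data.tar.gz")` would return `true` only when the recorded and computed digests agree.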
data/Gemfile.lock
CHANGED
data/README.md
CHANGED
@@ -51,6 +51,9 @@ You can choose to use SimpleTextExtract without the following dependencies, but
 `doc` parsing requires `antiword`
 - `brew install antiword`

+`xlsx` and `xls` parsing requires `ssconvert` which is part of `gnumeric`
+- `brew install gnumeric`
+
 ### Usage on Heroku

 To use on Heroku you'll have to add some custom buildpacks.
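Because `antiword` and `ssconvert` are optional external tools, it can help to check for them before relying on `doc`, `xls`, or `xlsx` extraction. The sketch below is one way to do that with the Ruby standard library only; `executable_on_path?` is a hypothetical helper, not something the gem provides, and it does not handle Windows executable suffixes.

```ruby
# Hypothetical helper: true when an executable named +name+ exists in any
# directory listed in +path+ (defaults to the PATH environment variable).
def executable_on_path?(name, path: ENV.fetch("PATH", ""))
  path.split(File::PATH_SEPARATOR).reject(&:empty?).any? do |dir|
    candidate = File.join(dir, name)
    File.file?(candidate) && File.executable?(candidate)
  end
end

# The tool names are the ones named in the README lines above.
missing = %w[antiword ssconvert].reject { |tool| executable_on_path?(tool) }
warn "Missing optional tools: #{missing.join(', ')}" unless missing.empty?
```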
@@ -64,13 +67,16 @@ If not, you can either add that buildpack, or add `poppler-utils` to your `Aptfi

 ##### heroku-buildpack-apt

-To add `antiword` as a dependency on Heroku, install the [heroku-buildpack-apt](https://elements.heroku.com/buildpacks/heroku/heroku-buildpack-apt) buildpack and follow the install instructions.
+To add `antiword` and/or `gnumeric`* as a dependency on Heroku, install the [heroku-buildpack-apt](https://elements.heroku.com/buildpacks/heroku/heroku-buildpack-apt) buildpack and follow the install instructions.

 In your `Aptfile`, add:
 ```
 antiword
+gnumeric
 ```

+* There is currently an [issue](https://github.com/heroku/heroku-buildpack-google-chrome/issues/59) with the heroku-18 stack that requires additional dependencies added to the Aptfile to get `gnumeric` to work properly. You can reference the linked issue above to figure out those dependencies, or downgrade to heroku-16 until it is fixed.
+
 ## Benchmarks

 *Benchmarks test extracting text from the same file 50 times (Macbook pro)*
@@ -80,8 +86,9 @@ antiword
 | .doc | 1.40s | 74.27s |
 | .docx | 0.78s | 71.44s |
 | .pdf* | 1.73s | 82.86s |
-| .xlsx |
-| .
+| .xlsx | 1.16s | 51.89s |
+| .xls | 0.80s | 67.88s |
+| .txt | 0.04s | 39.25s |

 * SimpleTextExtract is limited in its text extraction from pdfs, as Tika can also perform OCR on pdfs with Tesseract

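The README describes the benchmark as extracting text from the same file 50 times. A timing loop of that shape could be sketched as follows; `time_repeated` is a hypothetical helper, and this is not the script the author used to produce the table.

```ruby
require "benchmark"

# Hypothetical helper: runs the given block +runs+ times and returns the
# total elapsed wall-clock time in seconds.
def time_repeated(runs = 50)
  Benchmark.realtime { runs.times { yield } }
end

# Example usage, assuming the gem is installed, that "sample.docx" exists
# (a hypothetical file), and that filepath: accepts a path string:
#   require "simple_text_extract"
#   seconds = time_repeated(50) { SimpleTextExtract.extract(filepath: "sample.docx") }
#   puts format("%.2fs", seconds)
```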
data/lib/simple_text_extract.rb
CHANGED
@@ -7,9 +7,15 @@ require "simple_text_extract/tempfile_extractor"
 require "simple_text_extract/format_extractor_factory"

 module SimpleTextExtract
+  SUPPORTED_FILETYPES = ["xls", "xlsx", "doc", "docx", "txt", "pdf"]
+
   class Error < StandardError; end

   def self.extract(filename: nil, raw: nil, filepath: nil)
     TextExtractor.call(filename: filename, raw: raw, filepath: filepath).to_s
   end
+
+  def self.supports?(filename: nil)
+    SUPPORTED_FILETYPES.include?(filename.split(".")[1])
+  end
 end
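The `supports?` method added in this release looks at the second dot-separated segment of the name, so `"report.final.xlsx".split(".")[1]` yields `"final"`, and the default `filename: nil` raises `NoMethodError`. The sketch below shows one possible alternative based on `File.extname`; the module name `SupportsSketch` is hypothetical, the nil guard and case-insensitive match are assumptions about desired behavior, and none of this is code from the released gem.

```ruby
# Hypothetical variant, not part of simple_text_extract 0.2.1.
module SupportsSketch
  SUPPORTED_FILETYPES = %w[xls xlsx doc docx txt pdf].freeze

  # Returns true when the final extension of +filename+ is supported.
  # Unlike the released method, this returns false for nil and ignores case.
  def self.supports?(filename: nil)
    return false if filename.nil?

    ext = File.extname(filename.to_s).delete_prefix(".").downcase
    SUPPORTED_FILETYPES.include?(ext)
  end
end
```

With this variant, `SupportsSketch.supports?(filename: "report.final.xlsx")` is true, while a name without an extension such as `"README"` is reported as unsupported.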
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: simple_text_extract
 version: !ruby/object:Gem::Version
-  version: 0.2.
+  version: 0.2.1
 platform: ruby
 authors:
 - Nick Weiland
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2019-01-
+date: 2019-01-28 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: roo