textractor 0.1.2 → 0.1.3
- data/Gemfile.lock +1 -1
- data/README.md +36 -11
- data/bin/textractor +6 -0
- data/lib/textractor.rb +10 -13
- data/lib/textractor/content_type_detector.rb +9 -0
- data/lib/textractor/mimetype_fu_content_type_detector.rb +11 -0
- data/lib/textractor/simple_content_type_detector.rb +20 -0
- data/lib/textractor/version.rb +1 -1
- metadata +9 -5
data/Gemfile.lock
CHANGED
data/README.md
CHANGED
@@ -1,21 +1,18 @@
 # textractor
 
-textractor is a ruby library that provides a simple wrapper around CLI
-tools for extracting text from PDF and Word documents.
+textractor is a ruby library that provides a simple wrapper around CLI tools for extracting text from PDF and Word documents.
 
 ## Setup
 
     gem install textractor
 
-In order to use textractor you have to install a few command line
-tools.
+In order to use textractor you have to install a few command line tools.
 
 ### OS X
 
     port install wv xpdf links
 
-I recommend using also passing +no_x11 to the install command, but
-this may not work on all systems due to dependency issues.
+I recommend using also passing +no_x11 to the install command, but this may not work on all systems due to dependency issues.
 
     port install wv xpdf links +no_x11
 
@@ -23,19 +20,47 @@ this may not work on all systems due to dependency issues.
 
     apt-get install wv xpdf-utils links
 
+### Optional mimetype-fu
+
+    gem install mimetype-fu
+
+If you plan on using more than the default extractors it is a good idea to install mimetype-fu. This will allow much more robust content type detection.
+
 ## Usage
 
-
-
+### Basics
+
+Due to textractor's reliance on command line tools all the methods in textractor work on paths not File objects.
 
     Textractor.text_from_path(path_to_document) # => "Ruby on rails developer"
 
-Textractor will attempt to guess what type of document you're trying
-to extract text from. However, if you know the content type of your
-document, you can provide it and Textractor won't guess.
+Textractor will attempt to guess what type of document you're trying to extract text from. However, if you know the content type of your document, you can provide it and Textractor won't guess.
 
     Textractor.text_from_path(path_to_document, :content_type => "application/doc")
 
+### Custom Extractors
+
+It's possible to define additional extractors for additional content types. An extractor only has to respond to a single method `text_from_path`.
+
+    class HTMLExtractor < Textractor::Extractors::TextExtractor
+
+      def text_from_path(path)
+        document = Nokogiri::HTML(super)
+        document.text
+      end
+
+    end
+
+    Textractor.register_content_type("text/html", HTMLExtractor)
+
+You can also remove a content type extractor:
+
+    Textractor.remove_content_type("text/html")
+
+Or clear out all known extractors:
+
+    Textractor.clear_registry
+
 ## TODO
 
 * Remove vendored docx2txt perl script
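The registry calls introduced in the README (`register_content_type`, `remove_content_type`, `clear_registry`) can be pictured with a small stand-alone sketch. Everything below is an assumption made for illustration: `RegistrySketch`, `extractor_for`, and `UpcaseExtractor` are hypothetical names, and the diff does not show when the real library raises `ContentTypeAlreadyRegistered` or `ContentTypeNotRegistered`.

```ruby
# Hypothetical stand-in for the registry described in the README.
# The real Textractor internals are not shown in this diff.
module RegistrySketch
  ContentTypeAlreadyRegistered = Class.new(StandardError)
  ContentTypeNotRegistered     = Class.new(StandardError)

  @registry = {}

  class << self
    # Map a content type string to an extractor class.
    def register_content_type(content_type, extractor)
      if @registry.key?(content_type)
        raise ContentTypeAlreadyRegistered, "#{content_type} is already registered"
      end
      @registry[content_type] = extractor
    end

    # Forget a single content type.
    def remove_content_type(content_type)
      unless @registry.key?(content_type)
        raise ContentTypeNotRegistered, "#{content_type} is not registered"
      end
      @registry.delete(content_type)
    end

    # Forget every registered content type.
    def clear_registry
      @registry.clear
    end

    # Hypothetical lookup helper: build an extractor for a content type.
    def extractor_for(content_type)
      klass = @registry.fetch(content_type) do
        raise ContentTypeNotRegistered, "#{content_type} is not registered"
      end
      klass.new
    end
  end
end

# An extractor only has to respond to text_from_path.
class UpcaseExtractor
  def text_from_path(path)
    File.read(path).upcase
  end
end
```

Registering `UpcaseExtractor` under a made-up type such as `"text/x-demo"` and then calling `RegistrySketch.extractor_for("text/x-demo").text_from_path(path)` returns the file's contents in upper case.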
data/bin/textractor
ADDED
data/lib/textractor.rb
CHANGED
@@ -5,7 +5,9 @@ module Textractor
   ContentTypeAlreadyRegistered = Class.new(StandardError)
   ContentTypeNotRegistered = Class.new(StandardError)
 
-  autoload :Extractors,
+  autoload :Extractors, 'textractor/extractors'
+  autoload :SimpleContentTypeDetector, 'textractor/simple_content_type_detector'
+  autoload :MimetypeFuContentTypeDetector, 'textractor/mimetype_fu_content_type_detector'
 
   def self.text_from_path(path, options = {})
     raise FileNotFound unless File.exists?(path)
@@ -16,19 +18,12 @@ module Textractor
     extractor.text_from_path(path)
   end
 
+  class << self
+    attr_accessor :content_type_detector
+  end
+
   def self.content_type_for_path(path)
-
-    when /\.pdf$/
-      'application/pdf'
-    when /\.doc$/
-      'application/msword'
-    when /\.docx$/
-      'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
-    when /\.txt$/
-      'text/plain'
-    else
-      raise UnknownContentType, "unable to determine content type for #{path}"
-    end
+    content_type_detector.content_type_for_path(path) or raise UnknownContentType, "unable to determine content type for #{path}"
   end
 
   def self.register_content_type(content_type, extractor)
@@ -62,3 +57,5 @@ module Textractor
   register_basic_types
 
 end
+
+require 'textractor/content_type_detector'
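The change above swaps the hard-coded `case` for a pluggable detector: any object that responds to `content_type_for_path` can be assigned to `content_type_detector`, and a `nil` result is turned into `UnknownContentType`. The sketch below reproduces that delegation in isolation. `DetectorSketch` and `MarkdownDetector` are hypothetical names, and how the real library picks its default detector is not shown in this diff.

```ruby
# Stand-alone sketch of the delegation pattern from the hunk above.
# DetectorSketch is a hypothetical module, not part of Textractor.
module DetectorSketch
  UnknownContentType = Class.new(StandardError)

  class << self
    # Any object responding to content_type_for_path(path) can be plugged in.
    attr_accessor :content_type_detector
  end

  def self.content_type_for_path(path)
    # A nil (or false) answer from the detector becomes an exception.
    content_type_detector.content_type_for_path(path) or
      raise UnknownContentType, "unable to determine content type for #{path}"
  end
end

# Hypothetical detector that only knows about Markdown files.
class MarkdownDetector
  def self.content_type_for_path(path)
    'text/markdown' if File.extname(path) == '.md'
  end
end

DetectorSketch.content_type_detector = MarkdownDetector
```

With this setup, `DetectorSketch.content_type_for_path("notes.md")` returns `"text/markdown"`, while a path with any other extension raises `DetectorSketch::UnknownContentType`.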
data/lib/textractor/simple_content_type_detector.rb
ADDED
@@ -0,0 +1,20 @@
+module Textractor
+
+  class SimpleContentTypeDetector
+
+    def self.content_type_for_path(path)
+      case File.extname(path)
+      when /\.pdf$/
+        'application/pdf'
+      when /\.doc$/
+        'application/msword'
+      when /\.docx$/
+        'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
+      when /\.txt$/
+        'text/plain'
+      end
+    end
+
+  end
+
+end
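One consequence of matching lower-case patterns against `File.extname` is that the lookup is case-sensitive and returns `nil` for anything it does not recognize. The sketch below copies a subset of the detector's logic into a hypothetical module, `ExtensionSketch`, so it can run without the gem installed.

```ruby
# Subset of the detector logic shown above, under a hypothetical module name.
module ExtensionSketch
  def self.content_type_for_path(path)
    case File.extname(path)
    when /\.pdf$/ then 'application/pdf'
    when /\.txt$/ then 'text/plain'
    end
  end
end

ExtensionSketch.content_type_for_path("resume.pdf")   # => "application/pdf"
ExtensionSketch.content_type_for_path("resume.PDF")   # => nil (extension case differs)
ExtensionSketch.content_type_for_path("archive.tar")  # => nil (unknown extension)
```

The `nil` results are what the `or raise UnknownContentType` clause in `Textractor.content_type_for_path` converts into an exception.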
data/lib/textractor/version.rb
CHANGED
metadata
CHANGED
@@ -1,13 +1,13 @@
 --- !ruby/object:Gem::Specification
 name: textractor
 version: !ruby/object:Gem::Version
-  hash:
+  hash: 29
   prerelease: false
   segments:
   - 0
   - 1
-  -
-  version: 0.1.
+  - 3
+  version: 0.1.3
 platform: ruby
 authors:
 - Michael Guterl
@@ -53,8 +53,8 @@ dependencies:
 description: simple wrapper around CLI for extracting text from PDF and Word documents
 email:
 - michael@diminishing.org
-executables:
-
+executables:
+- textractor
 extensions: []
 
 extra_rdoc_files:
@@ -68,13 +68,17 @@ files:
 - LICENSE
 - README.md
 - Rakefile
+- bin/textractor
 - lib/textractor.rb
+- lib/textractor/content_type_detector.rb
 - lib/textractor/extractors.rb
 - lib/textractor/extractors/doc_extractor.rb
 - lib/textractor/extractors/docx_extractor.rb
 - lib/textractor/extractors/pdf_extractor.rb
 - lib/textractor/extractors/text_extractor.rb
 - lib/textractor/extractors/word_extractor.rb
+- lib/textractor/mimetype_fu_content_type_detector.rb
+- lib/textractor/simple_content_type_detector.rb
 - lib/textractor/version.rb
 - spec/fixtures/document.doc
 - spec/fixtures/document.docx