textractor 0.1.2 → 0.1.3

@@ -1,7 +1,7 @@
  PATH
  remote: .
  specs:
- textractor (0.1.2)
+ textractor (0.1.3)
 
  GEM
  remote: http://rubygems.org/
data/README.md CHANGED
@@ -1,21 +1,18 @@
  # textractor
 
- textractor is a ruby library that provides a simple wrapper around CLI
- tools for extracting text from PDF and Word documents.
+ textractor is a ruby library that provides a simple wrapper around CLI tools for extracting text from PDF and Word documents.
 
  ## Setup
 
  gem install textractor
 
- In order to use textractor you have to install a few command line
- tools.
+ In order to use textractor you have to install a few command line tools.
 
  ### OS X
 
  port install wv xpdf links
 
- I recommend using also passing +no_x11 to the install command, but
- this may not work on all systems due to dependency issues.
+ I recommend using also passing +no_x11 to the install command, but this may not work on all systems due to dependency issues.
 
  port install wv xpdf links +no_x11
 
@@ -23,19 +20,47 @@ this may not work on all systems due to dependency issues.
 
  apt-get install wv xpdf-utils links
 
+ ### Optional mimetype-fu
+
+ gem install mimetype-fu
+
+ If you plan on using more than the default extractors it is a good idea to install mimetype-fu. This will allow much more robust content type detection.
+
  ## Usage
 
- Due to textractor's reliance on command line tools all the methods in
- textractor work on paths not File objects.
+ ### Basics
+
+ Due to textractor's reliance on command line tools all the methods in textractor work on paths not File objects.
 
  Textractor.text_from_path(path_to_document) # => "Ruby on rails developer"
 
- Textractor will attempt to guess what type of document you're trying
- to extract text from. However, if you know the content type of your
- document, you can provide it and Textractor won't guess.
+ Textractor will attempt to guess what type of document you're trying to extract text from. However, if you know the content type of your document, you can provide it and Textractor won't guess.
 
  Textractor.text_from_path(path_to_document, :content_type => "application/doc")
 
+ ### Custom Extractors
+
+ It's possible to define additional extractors for additional content types. An extractor only has to respond to a single method `text_from_path`.
+
+ class HTMLExtractor < Textractor::Extractors::TextExtractor
+
+ def text_from_path(path)
+ document = Nokogiri::HTML(super)
+ document.text
+ end
+
+ end
+
+ Textractor.register_content_type("text/html", HTMLExtractor)
+
+ You can also remove a content type extractor:
+
+ Textractor.remove_content_type("text/html")
+
+ Or clear out all known extractors:
+
+ Textractor.clear_registry
+
  ## TODO
 
  * Remove vendored docx2txt perl script
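
The README hunk above says an extractor only has to respond to `text_from_path`, and that extractors can be registered, removed, or cleared per content type. A minimal sketch of that contract follows. It does not load the gem: `DemoRegistry`, `PlainTextExtractor`, and `TagStrippingExtractor` are hypothetical stand-ins, and the choice to instantiate the registered class before calling it is an assumption, not something this diff confirms.

```ruby
require 'tempfile'

# Hypothetical stand-in for the registry described in the README hunk.
# This is NOT the gem's implementation, only the same duck-typed contract.
module DemoRegistry
  NotRegistered = Class.new(StandardError)

  @extractors = {}

  class << self
    def register_content_type(content_type, extractor)
      @extractors[content_type] = extractor
    end

    def remove_content_type(content_type)
      @extractors.delete(content_type)
    end

    def clear_registry
      @extractors.clear
    end

    # Looks up the extractor for an explicitly supplied content type.
    # Assumption: the registered class is instantiated on each call.
    def text_from_path(path, options = {})
      content_type = options.fetch(:content_type)
      extractor = @extractors[content_type] or
        raise NotRegistered, "no extractor for #{content_type}"
      extractor.new.text_from_path(path)
    end
  end
end

# Hypothetical base extractor: returns the file contents unchanged.
class PlainTextExtractor
  def text_from_path(path)
    File.read(path)
  end
end

# Hypothetical subclass: post-processes the parent's text, in the same way the
# README's HTMLExtractor calls super. A regex replaces Nokogiri here so the
# sketch has no gem dependencies; it is not a robust HTML parser.
class TagStrippingExtractor < PlainTextExtractor
  def text_from_path(path)
    super.gsub(/<[^>]+>/, ' ').split.join(' ')
  end
end

DemoRegistry.register_content_type('text/html', TagStrippingExtractor)

Tempfile.create(['sample', '.html']) do |file|
  file.write('<p>Ruby on <b>rails</b> developer</p>')
  file.flush
  puts DemoRegistry.text_from_path(file.path, :content_type => 'text/html')
end
```

Keeping the registry keyed by content type string means a new format can be supported without touching the lookup code; only the extractor class changes.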
@@ -0,0 +1,6 @@
+ #!/usr/bin/env ruby
+
+ require 'rubygems'
+ require 'textractor'
+
+ puts Textractor.text_from_path(File.expand_path(ARGV[0]))
@@ -5,7 +5,9 @@ module Textractor
  ContentTypeAlreadyRegistered = Class.new(StandardError)
  ContentTypeNotRegistered = Class.new(StandardError)
 
- autoload :Extractors, "textractor/extractors"
+ autoload :Extractors, 'textractor/extractors'
+ autoload :SimpleContentTypeDetector, 'textractor/simple_content_type_detector'
+ autoload :MimetypeFuContentTypeDetector, 'textractor/mimetype_fu_content_type_detector'
 
  def self.text_from_path(path, options = {})
  raise FileNotFound unless File.exists?(path)
@@ -16,19 +18,12 @@ module Textractor
  extractor.text_from_path(path)
  end
 
+ class << self
+ attr_accessor :content_type_detector
+ end
+
  def self.content_type_for_path(path)
- case File.extname(path)
- when /\.pdf$/
- 'application/pdf'
- when /\.doc$/
- 'application/msword'
- when /\.docx$/
- 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
- when /\.txt$/
- 'text/plain'
- else
- raise UnknownContentType, "unable to determine content type for #{path}"
- end
+ content_type_detector.content_type_for_path(path) or raise UnknownContentType, "unable to determine content type for #{path}"
  end
 
  def self.register_content_type(content_type, extractor)
@@ -62,3 +57,5 @@ module Textractor
  register_basic_types
 
  end
+
+ require 'textractor/content_type_detector'
@@ -0,0 +1,9 @@
+ begin
+ require 'rubygems'
+ require 'yaml'
+ require 'mimetype_fu'
+
+ Textractor.content_type_detector = Textractor::MimetypeFuContentTypeDetector
+ rescue LoadError => e
+ Textractor.content_type_detector = Textractor::SimpleContentTypeDetector
+ end
@@ -0,0 +1,11 @@
+ module Textractor
+
+ class MimetypeFuContentTypeDetector
+
+ def self.content_type_for_path(path)
+ File.mime_type?(path)
+ end
+
+ end
+
+ end
@@ -0,0 +1,20 @@
+ module Textractor
+
+ class SimpleContentTypeDetector
+
+ def self.content_type_for_path(path)
+ case File.extname(path)
+ when /\.pdf$/
+ 'application/pdf'
+ when /\.doc$/
+ 'application/msword'
+ when /\.docx$/
+ 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
+ when /\.txt$/
+ 'text/plain'
+ end
+ end
+
+ end
+
+ end
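
In this version, content type detection moves behind a pluggable detector: any object that responds to `content_type_for_path(path)` and returns a String or `nil` fits the contract that `Textractor.content_type_for_path` relies on. The sketch below shows one way such detectors could be chained. It does not load the gem, and `ExtensionDetector`, `MagicNumberDetector`, and `ChainedDetector` are hypothetical names, not part of this release.

```ruby
require 'tempfile'

# Hypothetical extension-based detector, similar in spirit to the
# SimpleContentTypeDetector in the diff (this stand-in also downcases
# the extension, which the gem's version does not).
class ExtensionDetector
  TYPES = {
    '.pdf' => 'application/pdf',
    '.txt' => 'text/plain'
  }.freeze

  def self.content_type_for_path(path)
    TYPES[File.extname(path).downcase]
  end
end

# Hypothetical detector that recognizes a PDF by its leading "%PDF-" bytes,
# regardless of the file name.
class MagicNumberDetector
  def self.content_type_for_path(path)
    return nil unless File.file?(path)
    'application/pdf' if File.binread(path, 5) == '%PDF-'
  end
end

# Tries each detector in order and returns the first non-nil answer,
# or nil when none of them recognizes the path.
class ChainedDetector
  def initialize(*detectors)
    @detectors = detectors
  end

  def content_type_for_path(path)
    @detectors.each do |detector|
      content_type = detector.content_type_for_path(path)
      return content_type if content_type
    end
    nil
  end
end

detector = ChainedDetector.new(MagicNumberDetector, ExtensionDetector)

# The extension is unknown here, so only the magic bytes can identify the file.
Tempfile.create(['scan', '.dat']) do |file|
  file.write('%PDF-1.4 stub')
  file.flush
  puts detector.content_type_for_path(file.path)
end

# No such file is expected to exist, so the extension lookup decides.
puts detector.content_type_for_path('notes.TXT')
```

Because the diff exposes `content_type_detector` through `attr_accessor`, an object like this could presumably be assigned with `Textractor.content_type_detector = detector` once the gem is loaded; returning `nil` for unknown files leaves the `UnknownContentType` error to the caller, as in the diff.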
@@ -1,3 +1,3 @@
  module Textractor
- VERSION = '0.1.2'
+ VERSION = '0.1.3'
  end
metadata CHANGED
@@ -1,13 +1,13 @@
  --- !ruby/object:Gem::Specification
  name: textractor
  version: !ruby/object:Gem::Version
- hash: 31
+ hash: 29
  prerelease: false
  segments:
  - 0
  - 1
- - 2
- version: 0.1.2
+ - 3
+ version: 0.1.3
  platform: ruby
  authors:
  - Michael Guterl
@@ -53,8 +53,8 @@ dependencies:
  description: simple wrapper around CLI for extracting text from PDF and Word documents
  email:
  - michael@diminishing.org
- executables: []
-
+ executables:
+ - textractor
  extensions: []
 
  extra_rdoc_files:
@@ -68,13 +68,17 @@ files:
  - LICENSE
  - README.md
  - Rakefile
+ - bin/textractor
  - lib/textractor.rb
+ - lib/textractor/content_type_detector.rb
  - lib/textractor/extractors.rb
  - lib/textractor/extractors/doc_extractor.rb
  - lib/textractor/extractors/docx_extractor.rb
  - lib/textractor/extractors/pdf_extractor.rb
  - lib/textractor/extractors/text_extractor.rb
  - lib/textractor/extractors/word_extractor.rb
+ - lib/textractor/mimetype_fu_content_type_detector.rb
+ - lib/textractor/simple_content_type_detector.rb
  - lib/textractor/version.rb
  - spec/fixtures/document.doc
  - spec/fixtures/document.docx