RubyGems - doc_ripper - Versions diffs - 0.0.5 → 0.0.6 - Mend

doc_ripper 0.0.5 → 0.0.6

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (23)

checksums.yaml +4 -4
data/README.md +20 -2
data/doc_ripper.gemspec +4 -0
data/lib/doc_ripper.rb +8 -9
data/lib/doc_ripper/{ripper/base.rb → base.rb} +4 -0
data/lib/doc_ripper/formats/docx_ripper.rb +11 -0
data/lib/doc_ripper/formats/ms_doc_ripper.rb +11 -0
data/lib/doc_ripper/formats/pdf_ripper.rb +11 -0
data/lib/doc_ripper/formats/sketch_ripper.rb +84 -0
data/lib/doc_ripper/text_ripper.rb +20 -8
data/lib/doc_ripper/version.rb +1 -1
data/pkg/doc_ripper-0.0.5.gem +0 -0
data/spec/doc_ripper/{ripper/base_spec.rb → base_spec.rb} +0 -0
data/spec/doc_ripper/{doc_ripper_spec.rb → formats/doc_ripper_spec.rb} +0 -0
data/spec/doc_ripper/formats/sketch_ripper_spec.rb +29 -0
data/spec/doc_ripper/{text_ripper_spec.rb → formats/text_ripper_spec.rb} +0 -0
data/spec/fixtures/complex_sketch_text.sketch +0 -0
data/spec/fixtures/simple_sketch_text.sketch +0 -0
metadata +62 -14
data/lib/doc_ripper/docx_ripper.rb +0 -9
data/lib/doc_ripper/ms_doc_ripper.rb +0 -9
data/lib/doc_ripper/pdf_ripper.rb +0 -9
data/spec/fixtures/lorem.txt +0 -1

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: e356e467916b8452aeb2121a234b0011302286ee
-  data.tar.gz: 6d7d1bc5c12f8a7de5e8585fbf5f7ca9bffe6735
+  metadata.gz: a97f6b37326f9f22afd538cd95a86da46caf5c48
+  data.tar.gz: 5ff485ab583cacfec99c9dcff702296d0cb5d1bd
 SHA512:
-  metadata.gz: c58f820acc305465e13c19e2e328de856a4e06d7e5c32a74827332002352a6e982286d0261b1e9cf99871084f65a9f2abb5c75227b7392f463325e476fc2df97
-  data.tar.gz: 812a7e6df98b6f247e0bd46e520611ad93180f43ac6e61306d3fe9b6bfc49ee75668609055ca5162fc595a223394b1bc7dc956b0eb189a2368f502a052b25528
+  metadata.gz: c450f92c3a65d8c2bf0ec167eeadaebcfa1ddbb3d0f699d1577bc112e0c98a8b77dfc8d49dac3f20073726b42b6c7788ff8b128b6b444fe39a4ec336f42f0b5a
+  data.tar.gz: c9e1cc600b2e43a6f6f3551092d674b0eefcebf2983874d79445d50631289747cf0285481caa0252cbc0e0e1dab16c14a221098197577864d172479ab4f07e06
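
The hashes above cover the two archives listed in checksums.yaml (metadata.gz and data.tar.gz). As a rough sketch of how such a value could be checked locally, the helper below compares a digest against an expected hex string. `checksum_matches?` and the sample payload are hypothetical names introduced for illustration; they are not part of doc_ripper or RubyGems.

```ruby
require "digest"

# Hypothetical helper (not part of doc_ripper): true when the digest of
# `bytes` equals `expected_hex`, ignoring hex letter case.
def checksum_matches?(bytes, expected_hex, algorithm: Digest::SHA512)
  algorithm.hexdigest(bytes) == expected_hex.downcase
end

# Stand-in for the real archive contents, e.g. File.binread("data.tar.gz").
payload  = "example archive bytes"
expected = Digest::SHA512.hexdigest(payload)

puts checksum_matches?(payload, expected)        # same bytes
puts checksum_matches?(payload + "x", expected)  # altered bytes
```

Passing `algorithm: Digest::SHA1` checks the shorter SHA1 entries in the same way.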

data/README.md CHANGED

@@ -1,11 +1,29 @@
 # DocRipper
+[![Gem Version](https://badge.fury.io/rb/doc_ripper.svg)](http://badge.fury.io/rb/doc_ripper)
-Grab the text from common document formats with 1 command. DocRipper is an extremely lightweight Ruby wrapper that can be used to parse text contents from common file formats (currently .doc, .docx and .pdf) without the need for a large number of dependencies like an OCR library or OpenOffice/LibreOffice.
+Grab the text from common document formats with 1 command. DocRipper is an extremely lightweight Ruby wrapper that can be used to parse text contents from common file formats (currently .doc, .docx and .pdf, .sketch) without the need for a large number of dependencies like an OCR library or OpenOffice/LibreOffice.
 For simple parsing, you'll likely see a large performance improvement with DocRipper over solutions that rely on OpenOffice/LibreOffice for .doc/.docx conversion.
 Need OCR support or in-image text parsing? Take a look at [Docsplit](https://github.com/documentcloud/docsplit).
+### Supported File Formats
+````
+.doc
+.docx
+.pdf
+.txt
+.sketch
+````
+File format | Supported? | Dependencies
+------------|------------|-------------
+.doc        |     x      |   Antiword
+.docx       |     x      |
+.pdf        |     x      |   Poppler-utils
+.txt        |     x      |
+.sketch     |     x      |
 ## Quickstart
 ```
@@ -27,7 +45,7 @@ Need OCR support or in-image text parsing? Take a look at [Docsplit](https://git
 ```
 #### Want to raise an exception? Use #rip!
-#rip! will raise an exception if rip returns nil or the file type isn't supported
+\#rip! will raise an exception if rip returns nil or the file type isn't supported
 ```
   # invalid file type

data/doc_ripper.gemspec CHANGED

@@ -21,6 +21,10 @@ Gem::Specification.new do |spec|
   spec.requirements << 'Antiword'
   spec.requirements << "pdftotext/poppler"
+  spec.add_dependency "sqlite3", "~> 1.3.11"
+  spec.add_dependency "activesupport", "~> 4.2.6"
+  spec.add_dependency "CFPropertyList", '~> 2.3'
   spec.add_development_dependency "bundler", "~> 1.6"
   spec.add_development_dependency "rake", "~> 10.0"
   spec.add_development_dependency "rspec"
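
The three new runtime dependencies use the pessimistic `~>` operator, which fixes every version segment except the last. The sketch below shows which versions each constraint would accept, using RubyGems' own `Gem::Requirement`. The `allowed?` helper is a hypothetical name introduced for illustration.

```ruby
require "rubygems"

# Hypothetical helper: does `version` satisfy `constraint`?
def allowed?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

# "~> 1.3.11" means >= 1.3.11 and < 1.4
puts allowed?("~> 1.3.11", "1.3.13") # true
puts allowed?("~> 1.3.11", "1.4.0")  # false

# "~> 4.2.6" means >= 4.2.6 and < 4.3
puts allowed?("~> 4.2.6", "4.2.11")  # true
puts allowed?("~> 4.2.6", "5.0.0")   # false

# "~> 2.3" means >= 2.3 and < 3.0
puts allowed?("~> 2.3", "2.6")       # true
puts allowed?("~> 2.3", "3.0")       # false
```

One practical consequence is that an application bundling this release alongside activesupport 5.x would fail dependency resolution, because `~> 4.2.6` excludes it.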

data/lib/doc_ripper.rb CHANGED

@@ -1,10 +1,12 @@
 require 'shellwords'
+require "sqlite3"
 require "doc_ripper/version"
-require "doc_ripper/ripper/base"
+require "doc_ripper/base"
 require "doc_ripper/text_ripper"
-require "doc_ripper/pdf_ripper"
-require "doc_ripper/docx_ripper"
-require "doc_ripper/ms_doc_ripper"
+require "doc_ripper/formats/pdf_ripper"
+require "doc_ripper/formats/docx_ripper"
+require "doc_ripper/formats/ms_doc_ripper"
+require "doc_ripper/formats/sketch_ripper"
 require "doc_ripper/exceptions"
 module DocRipper
@@ -15,11 +17,8 @@ module DocRipper
     def rip!(path)
       text = rip(path, raise: true)
-      if text
-        text
-      else
-        raise FileNotFound
-      end
+      text || raise(FileNotFound)
     end
   end
 end
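
The `rip!` change above collapses a five-line `if`/`else` into `text || raise(FileNotFound)`. A minimal sketch of why the two forms should behave the same follows. The `FileNotFound` class here is a local stand-in for the one in doc_ripper/exceptions, and both method names are hypothetical.

```ruby
# Local stand-in for DocRipper's exception class (assumption for this sketch).
class FileNotFound < StandardError; end

# 0.0.5 style: explicit branch on the ripped text.
def unwrap_verbose(text)
  if text
    text
  else
    raise FileNotFound
  end
end

# 0.0.6 style: `||` only evaluates its right-hand side when text is nil or false.
def unwrap_compact(text)
  text || raise(FileNotFound)
end

puts unwrap_compact("some text")

begin
  unwrap_compact(nil)
rescue FileNotFound
  puts "raised FileNotFound"
end
```

An empty string is truthy in Ruby, so both forms return `""` rather than raising for a file that ripped to no text.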

data/lib/doc_ripper/{ripper/base.rb → base.rb} RENAMED

@@ -10,6 +10,10 @@ module DocRipper
         @options = options
       end
+      def read_type
+        :file
+      end
       private
       def to_shell(file_path)

data/lib/doc_ripper/formats/docx_ripper.rb ADDED

@@ -0,0 +1,11 @@
+module DocRipper
+  module Formats
+    class DocxRipper < Ripper::Base
+      def rip
+        @text ||= system(%Q[ unzip -p #{to_shell(@file_path)} | grep '<w:t' | sed 's/<[^<]*>//g' | grep -v '^[[:space:]]*$' > #{to_shell(@text_file_path)} ])
+      end
+    end
+  end
+end

data/lib/doc_ripper/formats/ms_doc_ripper.rb ADDED

@@ -0,0 +1,11 @@
+module DocRipper
+  module Formats
+    class MsDocRipper < Ripper::Base
+      def rip
+        @text ||= system(%Q[ antiword #{to_shell(@file_path)} > #{to_shell(@text_file_path)} ])
+      end
+    end
+  end
+end

data/lib/doc_ripper/formats/pdf_ripper.rb ADDED

@@ -0,0 +1,11 @@
+module DocRipper
+  module Formats
+    class PdfRipper < Ripper::Base
+      def rip
+        @text ||= system(%Q[ pdftotext #{to_shell(@file_path)} > #{to_shell(@text_file_path)} ])
+      end
+    end
+  end
+end

data/lib/doc_ripper/formats/sketch_ripper.rb ADDED

@@ -0,0 +1,84 @@
+require 'active_support'
+require 'active_support/core_ext'
+require 'cfpropertylist'
+module DocRipper
+  module Formats
+    class SketchRipper < Ripper::Base
+      class CFPropertyList::CFString
+        def to_s
+          value
+        end
+      end
+      class CFPropertyList::CFType
+        def blacklisted_class?
+          return false if !self.value.respond_to?(:[])
+          klass = self.value['$class']
+          # Sketch Internal ID References
+          # 39 = rectangle / artboard / page / group
+          # 170 = font definition
+          return false if !klass
+          [170].include?(klass.value)
+        end
+        def sketch_page?
+          return false if !self.value.respond_to?(:[])
+          klass = self.value['$classes']
+          return false if !klass
+          klass.is_a?(CFPropertyList::CFArray)
+        end
+      end
+      def read_type
+        :mem
+      end
+      def rip
+        db = SQLite3::Database.new(@file_path)
+        data = db.execute("SELECT value FROM payload").flatten.first
+        @text ||= text_objects(data).join(" ").strip
+      end
+      def blacklist
+        %w(\$null MSAttributedStringFontAttribute NSColor NSParagraphStyle)
+      end
+      def text_objects(data)
+        objects = CFPropertyList::List.new(data: data).value.value['$objects'].value
+        evaluator = Proc.new do |object, previous_object, n_2_previous_object, next_object|
+          coordinatesRegex = /\{\{\d*, \d*}, \{\d*, \d*\}\}|\{[\d.e-]*, [\d.]*\}/
+          object.is_a?(CFPropertyList::CFString) &&
+            #ignore other blacklisted properties
+            blacklist.select { |bl| object.value.match(/#{bl}/) }.empty? &&
+            #ignore uuids
+            !object.value.match(/\w{8}-\w{4}-\w{4}-\w{4}-\w{12}/) &&
+            #ignore coordinates
+            !object.value.match(coordinatesRegex) &&
+            #ignore font definitions
+            previous_object.value != "NSFontNameAttribute" &&
+            # labels always have an dictionary defined afterwards
+            next_object.is_a?(CFPropertyList::CFDictionary) &&
+            # Check if the string is defining the name of an artboard or font
+            !(previous_object.respond_to?(:blacklisted_class?) && previous_object.blacklisted_class?) &&
+            !(n_2_previous_object.respond_to?(:blacklisted_class?) && n_2_previous_object.blacklisted_class?)
+          end
+        objects.select.with_index do |object,i|
+          next_object = objects[i+1]
+          previous_object = objects[i-1]
+          n_2_previous_object = objects[i-2]
+          evaluator.call(object, previous_object, n_2_previous_object, next_object)
+        end
+      end
+    end
+  end
+end
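
The string-level part of the `evaluator` proc above can be tried in isolation. The sketch below reuses the blacklist, UUID and coordinate patterns from the diff on plain Ruby strings. It deliberately leaves out the CFPropertyList neighbour checks (previous and next object), so it is only an approximation of the real filter. `label_candidate?` and the sample strings are hypothetical.

```ruby
# Patterns copied from the SketchRipper diff.
COORDINATES = /\{\{\d*, \d*}, \{\d*, \d*\}\}|\{[\d.e-]*, [\d.]*\}/
UUID        = /\w{8}-\w{4}-\w{4}-\w{4}-\w{12}/
BLACKLIST   = %w(\$null MSAttributedStringFontAttribute NSColor NSParagraphStyle)

# Hypothetical helper: applies only the string-based exclusions.
def label_candidate?(string)
  BLACKLIST.none? { |bl| string.match(/#{bl}/) } &&
    !string.match(UUID) &&
    !string.match(COORDINATES)
end

# Made-up values resembling what a Sketch payload might contain.
samples = [
  "Grab some text",
  "$null",
  "NSColor",
  "{{0, 0}, {375, 667}}",
  "{0.5, 0.5}",
  "0A1B2C3D-1234-5678-9ABC-0123456789AB"
]

puts samples.select { |s| label_candidate?(s) }.inspect
```

Only `"Grab some text"` survives these three checks. In the gem itself, the surrounding-object checks then decide whether a surviving string is really a label.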

data/lib/doc_ripper/text_ripper.rb CHANGED

@@ -4,27 +4,39 @@ module DocRipper
   class TextRipper < Ripper::Base
     attr_reader :text_file_path, :file_path
-    def rip
+    def ripped?
       @is_ripped ||=choose_ripper
     end
     def text
-      @text ||= IO.read(@text_file_path).force_encoding("ISO-8859-1").encode("utf-8", replace: nil) if rip
+      if ripped? && @ripper.read_type == :file
+        @text = IO.read(@text_file_path).force_encoding("ISO-8859-1").encode("utf-8", replace: nil)
+        File.delete(@text_file_path)
+      elsif ripped? && @ripper.read_type == :mem
+        @text = @ripper.text
+      end
+      @text
     end
     private
     def choose_ripper
       case
-      when !!(@file_path[-5.. -1] =~ /.docx/i)
-        DocxRipper.new(@file_path).rip
-      when !!(@file_path[-4.. -1] =~ /.doc/i)
-        MsDocRipper.new(@file_path).rip
-      when !!(@file_path[-4..-1]  =~ /.pdf/i)
-        PdfRipper.new(@file_path).rip
+      when !!(@file_path =~ /.docx$/i)
+        @ripper = Formats::DocxRipper.new(@file_path)
+      when !!(@file_path =~ /.doc$/i)
+        @ripper = Formats::MsDocRipper.new(@file_path)
+      when !!(@file_path =~ /.pdf$/i)
+        @ripper = Formats::PdfRipper.new(@file_path)
+      when !!(@file_path =~ /.sketch$/i)
+        @ripper = Formats::SketchRipper.new(@file_path)
       when @options[:raise]
         raise UnsupportedFileType
       end
+      @ripper.rip
     end
   end
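
In the new `choose_ripper`, the patterns `/.docx$/i`, `/.doc$/i` and so on leave the dot unescaped, so it matches any character rather than a literal period. The sketch below shows the difference and one possible stricter lookup based on `File.extname`. `RIPPER_FORMATS` and `format_for` are hypothetical names, not part of doc_ripper.

```ruby
# Hypothetical mapping from file extension to a format symbol.
RIPPER_FORMATS = {
  ".docx"   => :docx,
  ".doc"    => :ms_doc,
  ".pdf"    => :pdf,
  ".sketch" => :sketch
}.freeze

# Hypothetical helper: nil when the extension is not recognised.
def format_for(path)
  RIPPER_FORMATS[File.extname(path).downcase]
end

# The unescaped dot accepts a name with no extension at all.
puts !!("mydocx" =~ /.docx$/i)         # true
puts format_for("mydocx").inspect      # nil

puts format_for("report.DOCX").inspect # :docx
puts format_for("design.sketch").inspect
```

A hash lookup also makes the unsupported case explicit: in the diff, when no `when` branch matches and `:raise` is not set, `@ripper` is still `nil` at the point where `@ripper.rip` is called.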

data/lib/doc_ripper/version.rb CHANGED

@@ -1,5 +1,5 @@
 module DocRipper
-  VERSION = "0.0.5"
+  VERSION = "0.0.6"
 end

data/pkg/doc_ripper-0.0.5.gem ADDED

Binary file

data/spec/doc_ripper/{ripper/base_spec.rb → base_spec.rb} RENAMED

File without changes

data/spec/doc_ripper/{doc_ripper_spec.rb → formats/doc_ripper_spec.rb} RENAMED

File without changes

data/spec/doc_ripper/formats/sketch_ripper_spec.rb ADDED

@@ -0,0 +1,29 @@
+require 'spec_helper'
+module DocRipper
+  describe 'SketchRipper' do
+    let(:simple_sketch_path) { "#{FIXTURE_PATH}simple_sketch_text.sketch" }
+    let(:simple_sketch_text) { "Page 1 t Grab some text Grab some text t copy" }
+    let(:complex_sketch_path) { "#{FIXTURE_PATH}complex_sketch_text.sketch" }
+    let(:complex_sketch_text) do
+      "Page 1 Onboarding Wizard -- Step 3 Header Rectangle 20 Path UtilityZen UtilityZen Line notification-icons---download-for-free-at-icons8 Shape gear-icons---download-for-free-at-icons8 Rectangle 293 Sync the accounts us Sync the accounts used by 484 Sexton. You\u2019ll be asked to approve access so that we can begin monitoring home usage. Don\u2019t see one of the Don\u2019t see one of the utilities your home uses? Let us know.  2/2 Your Accounts 2/2 Your Accounts Group Rectangle 294 Next step Utility Full Chit Gas + Power Utility Rectangle 279  Pacific_Gas_and_Electric_Company_(logo) Layer_1 g2105 g2107 g2109 path2111 path2111-path g2113 path2115 path2115-path g2117 path2119 path2119-path g2121 path2123 path2123-path g2125 path2127 path2127-path g2129 path2131 path2131-path g2133 path2135 path2135-path path2135-path path2137 path2137-path path2137-path Utility Full Chit Gas + Power Rectangle 279 sfpuc-logo-vert"
+    end
+    describe '#rip' do
+      let(:ripper) { DocRipper.rip(simple_sketch_path) }
+      it 'returns all text labels, layer names and page names from Sketch documents' do
+        expect(ripper).to eq(simple_sketch_text)
+      end
+      context 'complex sketch example' do
+        let(:ripper) { DocRipper.rip(complex_sketch_path) }
+        it 'returns matching text from labels' do
+          expect(ripper.split(' ')).to match_array(complex_sketch_text.split(' '))
+        end
+      end
+    end
+  end
+end

data/spec/doc_ripper/{text_ripper_spec.rb → formats/text_ripper_spec.rb} RENAMED

File without changes

data/spec/fixtures/complex_sketch_text.sketch ADDED

Binary file

data/spec/fixtures/simple_sketch_text.sketch ADDED

Binary file

metadata CHANGED

@@ -1,15 +1,57 @@
 --- !ruby/object:Gem::Specification
 name: doc_ripper
 version: !ruby/object:Gem::Version
-  version: 0.0.5
+  version: 0.0.6
 platform: ruby
 authors:
 - Paul Zaich
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2014-12-11 00:00:00.000000000 Z
+date: 2016-07-11 00:00:00.000000000 Z
 dependencies:
+- !ruby/object:Gem::Dependency
+  name: sqlite3
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 1.3.11
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 1.3.11
+- !ruby/object:Gem::Dependency
+  name: activesupport
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 4.2.6
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 4.2.6
+- !ruby/object:Gem::Dependency
+  name: CFPropertyList
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '2.3'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '2.3'
 - !ruby/object:Gem::Dependency
   name: bundler
   requirement: !ruby/object:Gem::Requirement
@@ -68,21 +110,25 @@ files:
 - Rakefile
 - doc_ripper.gemspec
 - lib/doc_ripper.rb
-- lib/doc_ripper/docx_ripper.rb
+- lib/doc_ripper/base.rb
 - lib/doc_ripper/exceptions.rb
-- lib/doc_ripper/ms_doc_ripper.rb
-- lib/doc_ripper/pdf_ripper.rb
-- lib/doc_ripper/ripper/base.rb
+- lib/doc_ripper/formats/docx_ripper.rb
+- lib/doc_ripper/formats/ms_doc_ripper.rb
+- lib/doc_ripper/formats/pdf_ripper.rb
+- lib/doc_ripper/formats/sketch_ripper.rb
 - lib/doc_ripper/text_ripper.rb
 - lib/doc_ripper/version.rb
-- spec/doc_ripper/doc_ripper_spec.rb
-- spec/doc_ripper/ripper/base_spec.rb
-- spec/doc_ripper/text_ripper_spec.rb
+- pkg/doc_ripper-0.0.5.gem
+- spec/doc_ripper/base_spec.rb
+- spec/doc_ripper/formats/doc_ripper_spec.rb
+- spec/doc_ripper/formats/sketch_ripper_spec.rb
+- spec/doc_ripper/formats/text_ripper_spec.rb
+- spec/fixtures/complex_sketch_text.sketch
 - spec/fixtures/lorem.doc
 - spec/fixtures/lorem.docx
 - spec/fixtures/lorem.pdf
-- spec/fixtures/lorem.txt
 - spec/fixtures/missing_file.txt
+- spec/fixtures/simple_sketch_text.sketch
 - spec/fixtures/some_missing_path.txt
 - spec/spec_helper.rb
 homepage: https://github.com/pzaich/doc_ripper
@@ -112,13 +158,15 @@ signing_key:
 specification_version: 4
 summary: Rip out text from pdf, doc and docx formats
 test_files:
-- spec/doc_ripper/doc_ripper_spec.rb
-- spec/doc_ripper/ripper/base_spec.rb
-- spec/doc_ripper/text_ripper_spec.rb
+- spec/doc_ripper/base_spec.rb
+- spec/doc_ripper/formats/doc_ripper_spec.rb
+- spec/doc_ripper/formats/sketch_ripper_spec.rb
+- spec/doc_ripper/formats/text_ripper_spec.rb
+- spec/fixtures/complex_sketch_text.sketch
 - spec/fixtures/lorem.doc
 - spec/fixtures/lorem.docx
 - spec/fixtures/lorem.pdf
-- spec/fixtures/lorem.txt
 - spec/fixtures/missing_file.txt
+- spec/fixtures/simple_sketch_text.sketch
 - spec/fixtures/some_missing_path.txt
 - spec/spec_helper.rb

data/lib/doc_ripper/docx_ripper.rb DELETED

@@ -1,9 +0,0 @@
-module DocRipper
-  class DocxRipper < Ripper::Base
-    def rip
-      @text ||= system(%Q[ unzip -p #{to_shell(@file_path)} | grep '<w:t' | sed 's/<[^<]*>//g' | grep -v '^[[:space:]]*$' > #{to_shell(@text_file_path)} ])
-    end
-  end
-end

data/lib/doc_ripper/ms_doc_ripper.rb DELETED

@@ -1,9 +0,0 @@
-module DocRipper
-  class MsDocRipper < Ripper::Base
-    def rip
-      @text ||= system(%Q[ antiword #{to_shell(@file_path)} > #{to_shell(@text_file_path)} ])
-    end
-  end
-end

data/lib/doc_ripper/pdf_ripper.rb DELETED

@@ -1,9 +0,0 @@
-module DocRipper
-  class PdfRipper < Ripper::Base
-    def rip
-      @text ||= system(%Q[ pdftotext #{to_shell(@file_path)} > #{to_shell(@text_file_path)} ])
-    end
-  end
-end

data/spec/fixtures/lorem.txt DELETED

@@ -1 +0,0 @@

- Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.