pdfshaver 0.0.1 → 0.0.2 (RubyGems)

This diff shows the changes between the two package versions as they were published to the public registry. It is provided for informational purposes only.

Files changed (17)

checksums.yaml +4 -4
data/.gitignore +1 -0
data/LICENSE +21 -0
data/Readme.md +21 -28
data/bench/extract_doc.rb +3 -0
data/bench/setup.rb +10 -3
data/ext/pdfium_ruby/document.cpp +12 -5
data/ext/pdfium_ruby/document.h +1 -3
data/ext/pdfium_ruby/extconf.rb +36 -27
data/ext/pdfium_ruby/page.cpp +31 -10
data/ext/pdfium_ruby/page.h +15 -7
data/ext/pdfium_ruby/pdfium_ruby.cpp +1 -1
data/lib/pdfshaver/page.rb +5 -0
data/lib/pdfshaver/version.rb +1 -1
data/test/page_spec.rb +2 -0
metadata +5 -4
data/Gemfile.lock +0 -26

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 179892daf5c810a3516fef72ded8233423ebb6e7
-  data.tar.gz: ee7d17c718fff0ce34cec44de9f3abed6a86e2c2
+  metadata.gz: d8f76428237b693ba33bc3c3f73482fa769ac19e
+  data.tar.gz: 948f9868b24deee695115999875b2bccddc915be
 SHA512:
-  metadata.gz: 97b911682b430e6f8d4314e39563f1f9d23332143d9b0170465b35f26a5caf0466b71543cba257329f9016c7fc304b7d769956f87008beed5bd0f39024fe377e
-  data.tar.gz: 78f35366a38991906e6097f79a1b13751bcd1f6dc75caada0825588407246422c01ed59550eb45709b5e2480cce8febb683a8b8954af2bc0332b404830270251
+  metadata.gz: dc2545b699bb2f8b16e4372434eb2a2c7c0a71c99eceb94353c09726b8a9e8bda8ed02f608f29775ec668cfc877e8190bfdde6907de62bb78819a8146e8bbe8d
+  data.tar.gz: da77e75e8435285efaef20bafcbb7fdcf2247560ef753727084d7e09f1472396080ab14f0e49c3af896bee8f378a070a09ef6a3b8318ff1e773431db505dd289
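
The digests listed in checksums.yaml can be compared against local copies of the two archives. The following is a minimal sketch, assuming the archives have already been extracted from the .gem file; `digests_for` and `digest_matches?` are hypothetical helper names, while `Digest::SHA1.file` and `Digest::SHA512.file` come from Ruby's standard library.

```ruby
require 'digest'

# Hypothetical helper: compute the two digests that checksums.yaml records
# for a file such as metadata.gz or data.tar.gz.
def digests_for(path)
  {
    'SHA1'   => Digest::SHA1.file(path).hexdigest,
    'SHA512' => Digest::SHA512.file(path).hexdigest
  }
end

# Hypothetical helper: compare a computed digest with a value copied
# from checksums.yaml.
def digest_matches?(path, algorithm, expected)
  digests_for(path).fetch(algorithm) == expected
end
```

For the 0.0.2 release, a call such as `digest_matches?('data.tar.gz', 'SHA1', '948f9868b24deee695115999875b2bccddc915be')` would be expected to return `true` for an unmodified archive.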

data/.gitignore CHANGED

@@ -6,3 +6,4 @@ mkmf.log
 test/output
 .DS_Store
 output
+Gemfile.lock

data/LICENSE ADDED

@@ -0,0 +1,21 @@
+The MIT License (MIT)
+Copyright (c) 2015 Ted Han, Nathan Stitt, DocumentCloud, Investigative Reporters & Editors
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.

data/Readme.md CHANGED

@@ -2,7 +2,7 @@
 # N.B. THIS IS A WORK IN PROGRESS
-Shave pages off of PDFs as images
+Shave pages off of PDFs as images.
 ### Examples
@@ -17,51 +17,44 @@ copyright 2015 Ted Han, Nathan Stitt & DocumentCloud
 ## Installation
-PDFShaver is distributed as a Ruby gem.  Once you have it's dependencies installed, all you have to do is type `gem install pdfshaver` (although in some cases you'll need to stick a `sudo` before the command).
+PDFShaver is distributed as a Ruby gem.  Once you have its dependencies installed, all you have to do is type `gem install pdfshaver` (although in some cases you'll need to stick a `sudo` before the command).
-PDFShaver depends on [Google Chrome's `PDFium` library][pdfium], and for now, installing `PDFium` takes a little bit of doing.
+PDFShaver depends on [Google Chrome's `PDFium` library][pdfium], and, for now, installing `PDFium` takes a little bit of doing.
 [pdfium]: https://code.google.com/p/pdfium/
-In order install PDFium, you'll need Python, a C++ compiler, FreeImage and `git`.  All of these tools should be available for your operating system.
+### Getting PDFium and FreeImage
-### OSX
+#### On Ubuntu/Debian
+We've built a .deb that you can download by running: `wget 'http://assets.documentcloud.org/pdfium/libpdfium-dev_20151024.163326_amd64.deb'`
-#### C++ compiler
-Check whether you have the xcode command line tools installed by typing `xcode-select -p`.  If this command returns something like `/Applications/Xcode.app/Contents/Developer` then you have the command line tools installed already.
+Once you have downloaded the file, you can install it like this:
-If you do not already have the xcode commandline tools installed running `xcode-select --install` will start you off down the correct path.
+`sudo dpkg -i libpdfium-dev_20151024.163326_amd64.deb` (where `libpdfium-dev_20151024.163326_amd64.deb` is the name of the file you just downloaded)
--------------------
+And install FreeImage and FreeType:
-At this point, it may be convenient to install Homebrew.
+`sudo apt-get install libfreeimage-dev libfreetype6-dev`
-#### Python
+#### On OSX
-If you're using a recent Mac, you should already have Python 2.7 installed on your machine.  You can check what version of Python you're running by typing `python --version` into your terminal.  If you don't have a recent version of python (version 2.7 or greater) installed, you'll
+You can use homebrew to install pdfium's current code using our Homebrew formula:
-#### `git`
+`brew install --HEAD https://raw.githubusercontent.com/knowtheory/homebrew/45606ddde3fdd657655208be0fb1a065e142a4f1/Library/Formula/pdfium.rb`
-If you have homebrew installed simply type `brew install git`
+Then install FreeImage:
-### Linux (we'll assume ubuntu or debian)
+`brew install freeimage`
-#### C++ Compiler
-`sudo apt-get install build-essential`
-#### `git`
-`sudo apt-get install git`
-#### FreeImage
-`sudo apt-get install libfreeimage-dev`
+#### On Windows
-### Getting PDFium's dependencies
+Unfortunately, we don't have a Windows package yet.
-If you have any trouble check [PDFium's build instructions](https://code.google.com/p/pdfium/wiki/Build) for the most up to date instructions.
+#### On other Linux/Unix systems
+Sorry we don't have a release for your OS but we'd be happy to talk to you about how we packaged PDFium for OSX and Ubuntu if you'd like to help package PDFium for your distribution/os!
+### Install PDFShaver
-### Getting the PDFium code
-`git clone https://pdfium.googlesource.com/pdfium`
+`gem install pdfshaver` (you may have to use `sudo gem` instead of just `gem`)

data/bench/extract_doc.rb ADDED

@@ -0,0 +1,3 @@
+require_relative 'setup'
+extract(ARGV.pop)

data/bench/setup.rb CHANGED

@@ -2,17 +2,24 @@ require_relative '../lib/pdfshaver'
 require 'fileutils'
 require 'pp'
+def print_rss(message="", io=STDOUT)
+  io.puts "#{message} [RSS: #{`ps -eo rss,pid | grep #{Process.pid} | grep -v grep | awk '{ print $1;  }'`.chomp}]"
+end
 def extract(doc_path, prefix=rand(10**10))
   out_dir = File.join(".", "output", prefix.to_s)
   FileUtils.mkdir_p(out_dir)
-  log = File.open(File.join(out_dir, "log.txt"), 'w')
-  log.sync = true
+  #log = File.open(File.join(out_dir, "log.txt"), 'w')
+  #log.sync = true
+  log = STDOUT
   doc = PDFShaver::Document.new(doc_path)
   doc.pages.each do |page|
+    log.puts("Waiting for confirmation...")
+    STDIN.gets
     log.puts("#{Time.now}: rendering page #{page.number}")
     # shamelessly stolen from http://samsaffron.com/archive/2014/04/08/ruby-2-1-garbage-collection-ready-for-production
     log.puts "RSS: #{`ps -eo rss,pid | grep #{Process.pid} | grep -v grep | awk '{ print $1;  }'`}"
-    #GC.start
+    GC.start
     #log.puts(GC.stat)
     easy_render(page, out_dir)
   end
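
The RSS readout in bench/setup.rb lists every process and filters the result through `grep` and `awk`. The sketch below shows one possible alternative, assuming a `ps` that accepts the POSIX options `-o rss=` and `-p`; `rss_kb` is a hypothetical name and is not part of the gem.

```ruby
# Hypothetical helper, not part of pdfshaver: return the resident set size
# that `ps` reports for a process (kilobytes on Linux and OS X), or nil
# when `ps` is unavailable or prints nothing usable.
def rss_kb(pid = Process.pid)
  output = `ps -o rss= -p #{Integer(pid)}`
  value = output.strip
  value.match?(/\A\d+\z/) ? value.to_i : nil
rescue Errno::ENOENT
  # `ps` is not installed on this system
  nil
end

puts "RSS: #{rss_kb.inspect}"
```

Asking `ps` for a single process avoids the chance that `grep` matches another process whose PID merely contains the same digits.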

data/ext/pdfium_ruby/document.cpp CHANGED

@@ -68,6 +68,17 @@ void Define_Document() {
                             CPP_RUBY_METHOD_FUNC(initialize_document_internals), -1);
 };
+// Because a PDFium document's lifecycle has some complexity,
+// its ruby deallocator doesn't immediately release its memory.
+// The deallocator instead defers to the C++ class to track when
+// it should be deallocated.
+static void destroy_document_when_safe(Document* document) {
+  document->flagDocumentAsReadyForRelease();
+  document->destroyUnlessPagesAreOpen();
+}
+// Whenever a PDFShaver::Document is created, create a C++ Document
+// in the document instance.
 VALUE document_allocate(VALUE rb_PDFShaver_Document) {
   Document* document = new Document();
   return Data_Wrap_Struct(rb_PDFShaver_Document, NULL, destroy_document_when_safe, document);
@@ -79,6 +90,7 @@ VALUE initialize_document_internals(int arg_count, VALUE* args, VALUE self) {
   // `path` argument and an optional `options` hash.
   VALUE path, options;
   int number_of_args = rb_scan_args(arg_count, args, "11", &path, &options);
+  if (number_of_args > 1) { /* there are options */}
   // attempt to open document.
   // path should at this point be validated & known to exist.
@@ -121,8 +133,3 @@ void document_handle_parse_status(int status, VALUE path) {
   //    break;
   //}
 }
-static void destroy_document_when_safe(Document* document) {
-  document->flagDocumentAsReadyForRelease();
-  document->destroyUnlessPagesAreOpen();
-}
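
The comment on `destroy_document_when_safe` describes a deferred release: the Ruby deallocator only flags the document, and the C++ class decides when freeing is safe. The bodies of `flagDocumentAsReadyForRelease` and `destroyUnlessPagesAreOpen` are not part of this diff, so the following Ruby model is an assumption about how such a scheme could work, not the gem's implementation; every name in it is hypothetical.

```ruby
require 'set'

# Illustrative model of deferred release. The real logic lives in the C++
# Document class, whose method bodies are not shown in this diff.
class DocumentModel
  attr_reader :destroyed

  def initialize
    @open_pages = Set.new        # mirrors the `open_pages` set in document.h
    @ready_for_release = false
    @destroyed = false
  end

  def page_opened(page)
    @open_pages << page
  end

  # Assumed behavior: closing the last open page after the document has
  # been flagged triggers the postponed destruction.
  def page_closed(page)
    @open_pages.delete(page)
    destroy_unless_pages_are_open
  end

  def flag_as_ready_for_release
    @ready_for_release = true
  end

  def destroy_unless_pages_are_open
    @destroyed = true if @ready_for_release && @open_pages.empty?
  end
end
```

In this model, a document that is flagged while a page is still open stays alive until that page is closed.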

data/ext/pdfium_ruby/document.h CHANGED

@@ -5,7 +5,6 @@
 class Page;
 #include "pdfium_ruby.h"
 #include "fpdf_ext.h"
-//#include "core/include/fpdfapi/fpdf_parser.h"
 #include "page.h"
 #include <unordered_set>
@@ -45,9 +44,8 @@ class Document {
     std::unordered_set<Page*> open_pages;
 };
-static void destroy_document_when_safe(Document* document);
 VALUE initialize_document_internals(int arg_count, VALUE* args, VALUE self);
 VALUE document_allocate(VALUE rb_PDFShaver_Document);
+//static void destroy_document_when_safe(Document* document);
 void document_handle_parse_status(int status, VALUE path);
 #endif // __DOCUMENT_H__

data/ext/pdfium_ruby/extconf.rb CHANGED

@@ -1,52 +1,61 @@
 require "mkmf"
 require 'rbconfig'
-# List directories to search for PDFium headers and library files to link against
-def append_pdfium_directory_to paths
-  paths.map do |dir|
-    [
-      File.join(dir, 'pdfium'),
-      File.join(dir, 'pdfium', 'fpdfsdk', 'include'),
-      File.join(dir, 'pdfium', 'third_party', 'base', 'numerics')
-    ]
-  end.flatten + paths
+# Take a set of directories to search (usually system paths)
+# and append the paths that we expect to find PDFium's peices.
+def append_search_paths_to search_dirs, search_suffixes
+  search_dirs.map do |dir|
+    search_suffixes.map{ |path| File.join(dir, path) }
+  end.flatten + search_dirs
 end
-LIB_DIRS    = append_pdfium_directory_to %w[
-  /usr/local/lib/
-  /usr/lib/
+lib_dirs = %w[
+  /usr/local/Cellar/pdfium/HEAD/lib
+  /usr/local/lib/pdfium
+  /usr/lib/pdfium
+  /usr/local/lib
+  /usr/lib
 ]
-HEADER_DIRS = append_pdfium_directory_to %w[
+header_dirs = %w[
+  /usr/local/Cellar/pdfium/HEAD/include
+  /usr/local/include/pdfium
+  /usr/include/pdfium
   /usr/local/include/
   /usr/include/
 ]
+header_paths = [
+  'public',
+  File.join('core', 'include'),
+  File.join('fpdfsdk', 'include'),
+  File.join('third_party', 'base', 'numerics')
+]
+LIB_DIRS    = append_search_paths_to lib_dirs, ['third_party']
+HEADER_DIRS = append_search_paths_to header_dirs, header_paths
 # Tell ruby we want to search in the specified paths
 dir_config("pdfium", HEADER_DIRS, LIB_DIRS)
+# lib order needs to be in dependency loaded order, or will not link properly.
 LIB_FILES= %w[
   javascript
   bigint
-  freetype
+  fx_freetype
+  fx_agg
+  fx_lcms2
+  fx_libjpeg
+  fx_libopenjpeg
+  fx_zlib
+  fxedit
+  fxcrt
+  fxcodec
+  fxge
   fpdfdoc
   fpdftext
   formfiller
-  icudata
-  icuuc
-  icui18n
-  v8_libbase
-  v8_base
-  v8_snapshot
-  v8_libplatform
-  jsapi
   pdfwindow
-  fxedit
-  fxcrt
-  fxcodec
   fpdfdoc
   fdrm
-  fxge
   fpdfapi
-  freetype
   pdfium
   pthread
   freeimage
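
To make the new search-path helper concrete, the sketch below copies `append_search_paths_to` from the diff (written with explicit parentheses) and runs it on a small sample. The sample directories and suffixes are a subset chosen for illustration, not the full lists from extconf.rb.

```ruby
# Copy of the helper added in 0.0.2, written with explicit parentheses.
def append_search_paths_to(search_dirs, search_suffixes)
  search_dirs.map do |dir|
    search_suffixes.map { |path| File.join(dir, path) }
  end.flatten + search_dirs
end

# Sample inputs chosen for illustration.
dirs     = %w[/usr/local/include/pdfium /usr/include/]
suffixes = ['public', File.join('core', 'include')]

puts append_search_paths_to(dirs, suffixes)
# /usr/local/include/pdfium/public
# /usr/local/include/pdfium/core/include
# /usr/include/public
# /usr/include/core/include
# /usr/local/include/pdfium
# /usr/include/
```

The suffixed paths come first and the original directories are appended unchanged at the end, so the PDFium-specific subdirectories are searched before the bare system paths.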

data/ext/pdfium_ruby/page.cpp CHANGED

@@ -9,7 +9,7 @@
 Page::Page() { this->opened = false; }
 // When destroying a C++ Page, make sure to dispose of the internals properly.
-// And notify the parent document that this page is no longer going to be used.
+// And notify the parent document that this page will no longer be used.
 Page::~Page() {
   if (this->opened) {
     this->unload();
@@ -30,6 +30,7 @@ void Page::initialize(Document* document, int page_index) {
 bool Page::load() {
   if (!this->opened) {
     this->fpdf_page = FPDF_LoadPage(this->document->fpdf_document, this->page_index);
+    this->text_page = FPDFText_LoadPage(this->fpdf_page);
     this->opened = true;
   }
   return this->opened;
@@ -37,14 +38,18 @@ bool Page::load() {
 // Unload the page (freeing the page's memory) and mark it as not open.
 void Page::unload() {
-  if (this->opened){ FPDF_ClosePage(this->fpdf_page); }
+  if (this->opened){
+    FPDFText_ClosePage(this->text_page);
+    FPDF_ClosePage(this->fpdf_page);
+  }
   this->opened = false;
 }
 // readers for the page's dimensions.
-double Page::width(){ return FPDF_GetPageWidth(this->fpdf_page); }
-double Page::height(){ return FPDF_GetPageHeight(this->fpdf_page); }
-double Page::aspect() { return width() / height(); }
+double Page::width()  {      return FPDF_GetPageWidth(this->fpdf_page); }
+double Page::height() {      return FPDF_GetPageHeight(this->fpdf_page); }
+double Page::aspect() {      return width() / height(); }
+//int    Page::text_length() { return FPDFText_CountChars(this->text_page); }
 // Render the page to a destination path with the dimensions
 // specified by width & height (or appropriate defaults).
@@ -156,14 +161,16 @@ void Define_Page() {
   rb_define_private_method(rb_PDFShaver_Page, "unload_data", CPP_RUBY_METHOD_FUNC(page_unload_data), 0);
 }
-// Create a new C++ Page object and store it in any newly created
-// Ruby page instances.
+// the C++ page can be deleted when we're done with the Ruby page.
+static void destroy_page(Page* page) { delete page; }
+// Whenever a PDFShaver::Page is created, we'll create a new C++ Page object
+// and store it in the newly created Ruby page instances, and inform it to
+// clean the page up using `destroy_page`.
 VALUE page_allocate(VALUE rb_PDFShaver_Page) {
   Page* page = new Page();
   return Data_Wrap_Struct(rb_PDFShaver_Page, NULL, destroy_page, page);
 }
-// And delete the C++ page when we're done with the Ruby page.
-static void destroy_page(Page* page) { delete page; }
 // This function does the actual initialization of the C++ page's internals
 // defining which page of the document will be opened when `load_data` is called.
@@ -171,6 +178,7 @@ VALUE initialize_page_internals(int arg_count, VALUE* args, VALUE self) {
   // use Ruby's argument scanner to pull out a required
   VALUE rb_document, page_index, options;
   int number_of_args = rb_scan_args(arg_count, args, "21", &rb_document, &page_index, &options);
+  if (number_of_args > 2) { /* there are options */ }
   // fetch the C++ document from the Ruby document the page has been initialized with
   Document* document;
@@ -191,6 +199,7 @@ VALUE page_load_data(VALUE self) {
   rb_ivar_set(self, rb_intern("@width"),  INT2FIX(page->width()));
   rb_ivar_set(self, rb_intern("@height"), INT2FIX(page->height()));
   rb_ivar_set(self, rb_intern("@aspect"), rb_float_new(page->aspect()));
+  //rb_ivar_set(self, rb_intern("@length"), INT2FIX(page->text_length()));
   return Qtrue;
 }
@@ -202,12 +211,24 @@ VALUE page_unload_data(VALUE self) {
   return Qtrue;
 }
+//VALUE page_text_length(VALUE self) {
+//  Page* page;
+//  Data_Get_Struct(self, Page, page);
+//  return INT2FIX(page->text_length());
+//}
+//VALUE page_text(VALUE self) {
+//  Page* page;
+//  Data_Get_Struct(self, Page, page);
+//  return INT2FIX(page->text());
+//}
 //bool page_render(int arg_count, VALUE* args, VALUE self) {
 VALUE page_render(int arg_count, VALUE* args, VALUE self) {
   VALUE path, options;
   int width = 0, height = 0;
-  int number_of_args = rb_scan_args(arg_count, args, "1:", &path, &options);
+  rb_scan_args(arg_count, args, "1:", &path, &options);
   if (arg_count > 1) {
     VALUE rb_width  = rb_hash_aref(options, ID2SYM(rb_intern("width")));
     VALUE rb_height = rb_hash_aref(options, ID2SYM(rb_intern("height")));

data/ext/pdfium_ruby/page.h CHANGED

@@ -5,28 +5,35 @@
 class Document;
 #include "pdfium_ruby.h"
 #include "document.h"
+#include "fpdf_text.h"
 class Page {
   public:
+    // C++ constructor & destructor.
     Page();
+    ~Page();
+    // Ruby Data Initializer
     void initialize(Document* document, int page_number);
+    // PDFium data initializer & cleanup
     bool load();
     void unload();
+    // Data access methods
     double width();
     double height();
     double aspect();
+    int text_length();
     bool render(char* path, int width, int height);
-    ~Page();
   private:
-    int page_index;
-    bool opened;
-    Document *document;
-    FPDF_PAGE fpdf_page;
+    int           page_index;
+    bool          opened;
+    Document      *document;
+    FPDF_PAGE     fpdf_page;
+    FPDF_TEXTPAGE text_page;
 };
 void Define_Page();
@@ -35,6 +42,7 @@ VALUE page_render(int arg_count, VALUE* args, VALUE self);
 VALUE page_allocate(VALUE rb_PDFShaver_Page);
 VALUE page_load_data(VALUE rb_PDFShaver_Page);
 VALUE page_unload_data(VALUE rb_PDFShaver_Page);
-static void destroy_page(Page* page);
+VALUE page_text_length(VALUE rb_PDFShaver_Page);
+//static void destroy_page(Page* page);
 #endif

data/ext/pdfium_ruby/pdfium_ruby.cpp CHANGED

@@ -8,7 +8,7 @@ void Init_pdfium_ruby (void) {
   FPDF_InitLibrary();
   // Define `PDFShaver` module as a namespace for all of our other objects
-  VALUE rb_PDFShaver = rb_define_module("PDFShaver");
+  rb_define_module("PDFShaver");
   // Define `Document` and `Page` classes
   Define_Document();

data/lib/pdfshaver/page.rb CHANGED

@@ -39,6 +39,11 @@ module PDFShaver
       load_dimensions unless @aspect
       @aspect
     end
+    def length
+      load_dimensions unless @length
+      @length
+    end
     def with_data_loaded &block
       load_data
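
The new `length` reader follows the same guard pattern as `aspect`, but in this release the `@length` assignment in `page_load_data` is commented out. Assuming `load_dimensions` sets instance variables only through that function (its body is not shown in the diff), the guard never caches anything for `length`. The stand-in class below is a hypothetical model of that behavior, with a stubbed loader and a placeholder aspect value.

```ruby
# Stand-in for PDFShaver::Page; `load_dimensions` is a stub that counts
# how often it runs.
class PageModel
  attr_reader :loads

  def initialize
    @loads = 0
  end

  def aspect
    load_dimensions unless @aspect
    @aspect
  end

  def length
    load_dimensions unless @length
    @length
  end

  private

  def load_dimensions
    @loads += 1
    @aspect = 0.75 # placeholder value
    # @length is deliberately left unset, like the commented-out rb_ivar_set
  end
end

page = PageModel.new
page.aspect
page.aspect   # cached, no second load
page.length
page.length   # loads again each time, because @length stays nil
puts page.loads
# 3
```

This matches the test change in page_spec.rb, where the `@page.length.wont_equal nil` expectation is left commented out.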

data/lib/pdfshaver/version.rb CHANGED

@@ -1,3 +1,3 @@
 module PDFShaver
-  VERSION='0.0.1'
+  VERSION='0.0.2'
 end

data/test/page_spec.rb CHANGED

@@ -128,6 +128,7 @@ describe PDFShaver::Page do
       @page.instance_variable_get("@height").must_equal nil
       @page.instance_variable_get("@width").must_equal nil
       @page.instance_variable_get("@aspect").must_equal nil
+      @page.instance_variable_get("@length").must_equal nil
       @page.instance_variable_get("@extension_data_is_loaded").must_equal false
       @page.send(:load_dimensions)
@@ -135,6 +136,7 @@ describe PDFShaver::Page do
       @page.height.wont_equal nil
       @page.width.wont_equal nil
       @page.aspect.wont_equal nil
+      #@page.length.wont_equal nil
       @page.instance_variable_get("@extension_data_is_loaded").must_equal false
     end

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: pdfshaver
 version: !ruby/object:Gem::Version
-  version: 0.0.1
+  version: 0.0.2
 platform: ruby
 authors:
 - Ted Han
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2015-02-27 00:00:00.000000000 Z
+date: 2015-10-29 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler
@@ -91,10 +91,11 @@ extra_rdoc_files: []
 files:
 - ".gitignore"
 - Gemfile
-- Gemfile.lock
+- LICENSE
 - Rakefile
 - Readme.md
 - bench/data_loading_speed.rb
+- bench/extract_doc.rb
 - bench/memory_stress.rb
 - bench/setup.rb
 - ext/pdfium_ruby/document.cpp
@@ -138,7 +139,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
       version: '0'
 requirements: []
 rubyforge_project:
-rubygems_version: 2.4.5
+rubygems_version: 2.4.5.1
 signing_key:
 specification_version: 4
 summary: Shave pages off of PDFs as images

data/Gemfile.lock DELETED

@@ -1,26 +0,0 @@
-PATH
-  remote: .
-  specs:
-    pdfium (0.0.1)
-GEM
-  remote: https://rubygems.org/
-  specs:
-    addressable (2.3.7)
-    fastimage (1.6.6)
-      addressable (~> 2.3, >= 2.3.5)
-    minitest (5.5.1)
-    rake (10.4.2)
-    rake-compiler (0.9.5)
-      rake
-PLATFORMS
-  ruby
-DEPENDENCIES
-  bundler (~> 1.5)
-  fastimage
-  minitest
-  pdfium!
-  rake
-  rake-compiler