copy_paste_pdf 0.0.1

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: 136b95ea0d8543294bce8905905ae9f638d34367
4
+ data.tar.gz: c5f220fe4fcf77b4fedf3d6fc5e186487b492ef8
5
+ SHA512:
6
+ metadata.gz: 2522bdfe9adf4f87114d73d3b0d8ca11576fd517af6c1459c286ed22340ca6225aee4320a84e4a9513ad18f98d56117a48f9e1ffd8130260b0c05fb0130c233b
7
+ data.tar.gz: 91686dc0fdcc83700e5d4e72500fa2667ec212880176eb82587111bfa8c3bab7503c6476979b66130271f120e6604354c05a050feec0e8f76ef2079dff2cba86
data/.gitignore ADDED
@@ -0,0 +1,6 @@
1
+ *.gem
2
+ .bundle
3
+ .yardoc
4
+ Gemfile.lock
5
+ doc/*
6
+ pkg/*
data/.travis.yml ADDED
@@ -0,0 +1,7 @@
1
+ language: ruby
2
+ rvm:
3
+ - 1.8.7
4
+ - 1.9.2
5
+ - 1.9.3
6
+ - 2.0.0
7
+ - ree
data/.yardopts ADDED
@@ -0,0 +1,4 @@
1
+ --no-private
2
+ --hide-void-return
3
+ --embed-mixin ClassMethods
4
+ --markup=markdown
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source "http://rubygems.org"
2
+
3
+ # Specify your gem's dependencies in the gemspec
4
+ gemspec
data/LICENSE ADDED
@@ -0,0 +1,20 @@
1
+ Copyright (c) 2013 Open North Inc.
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining
4
+ a copy of this software and associated documentation files (the
5
+ "Software"), to deal in the Software without restriction, including
6
+ without limitation the rights to use, copy, modify, merge, publish,
7
+ distribute, sublicense, and/or sell copies of the Software, and to
8
+ permit persons to whom the Software is furnished to do so, subject to
9
+ the following conditions:
10
+
11
+ The above copyright notice and this permission notice shall be
12
+ included in all copies or substantial portions of the Software.
13
+
14
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
15
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
16
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
17
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
18
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
19
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
20
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,102 @@
1
+ # Copy-Paste PDF
2
+
3
+ <!--
4
+ [![Build Status](https://secure.travis-ci.org/opennorth/copy_paste_pdf.png)](http://travis-ci.org/opennorth/copy_paste_pdf)
5
+ [![Dependency Status](https://gemnasium.com/opennorth/copy_paste_pdf.png)](https://gemnasium.com/opennorth/copy_paste_pdf)
6
+ [![Coverage Status](https://coveralls.io/repos/opennorth/copy_paste_pdf/badge.png?branch=master)](https://coveralls.io/r/opennorth/copy_paste_pdf)
7
+ [![Code Climate](https://codeclimate.com/github/opennorth/copy_paste_pdf.png)](https://codeclimate.com/github/opennorth/copy_paste_pdf)
8
+ -->
9
+
10
+ [Tabula](https://github.com/jazzido/tabula) was written for those cases where you can’t easily copy-and-paste tables from a PDF to a spreadsheet. Surprisingly, Tabula sometimes fails where copy-and-pasting succeeds. This project is for [those cases](http://www.atipp.gov.nl.ca/info/coordinators.html) when copy-and-pasting is all you need (and where nothing else works).
11
+
12
+ This gem only works on OS X.
13
+
14
+ ## Getting Started
15
+
16
+ ### PDF to CSV
17
+
18
+ Install with:
19
+
20
+ gem install --no-wrappers copy_paste_pdf
21
+
22
+ If you omit the `--no-wrappers` switch, RubyGems wraps the executable in a Ruby script and the AppleScript will not install properly. Once the gem is installed, run the script with:
23
+
24
+ copy-paste-pdf.applescript /path/to/input.pdf /path/to/output.csv
25
+
26
+ * The script will open the PDF in Preview and copy the contents of the PDF
27
+ * The script will open Microsoft Excel, paste the contents and save as CSV
28
+
29
+ If you want the script to quit Preview and Excel once it's done, pass a third argument, like:
30
+
31
+ copy-paste-pdf.applescript /path/to/input.pdf /path/to/output.csv true
32
+
33
+ The script may [pinwheel](http://en.wikipedia.org/wiki/Spinning_pinwheel) while copying the contents of the PDF and while pasting the contents to the spreadsheet. If it looks like nothing is happening, wait a few seconds.
34
+
35
+ You can work in other applications while the script is running, but don't use the clipboard, as doing so may interfere with the script.
36
+
37
+ This method is admittedly not very efficient. Running time averages under 2 seconds per page but varies considerably depending on your system's load.
38
+
39
+ ### Data Cleaning
40
+
41
+ The Ruby gem defines helper methods for cleaning the CSV. In most cases, the PDF to CSV conversion will create many empty rows. You can easily remove those rows with:
42
+
43
+ ```ruby
44
+ require 'csv'
45
+
46
+ require 'copy_paste_pdf'
47
+
48
+ table = CopyPastePDF::Table.new(CSV.read('/path/to/output.csv'))
49
+
50
+ table.remove_empty_rows!
51
+
52
+ CSV.open('/path/to/clean.csv', 'w') do |csv|
53
+ table.each do |row|
54
+ csv << row
55
+ end
56
+ end
57
+ ```
58
+
59
+ If the table in the PDF contained vertically-merged cells, then, in the CSV, the first of the merged cells will have a value and the others will be empty. To copy the value of the first cell to the others, use the `copy_into_cell_below` method, which accepts the indices of columns containing merged cells:
60
+
61
+ ```ruby
62
+ table.copy_into_cell_below(0, 3, 4)
63
+ ```
64
+
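The fill-down behaviour described above can be sketched in plain Ruby. This is a simplified stand-in, not the gem's implementation: the `fill_down` helper and the sample rows are hypothetical, it returns a new array instead of modifying the table in place, and it does not raise on inconsistent rows the way `copy_into_cell_below` does.

```ruby
# Hypothetical helper: copies the last non-empty value in each given column
# into the empty (nil) cells below it. Returns new rows; input is untouched.
def fill_down(rows, *indices)
  last_seen = {}
  rows.map do |row|
    filled = row.dup
    indices.each do |index|
      if filled[index].nil?
        filled[index] = last_seen[index]
      else
        last_seen[index] = filled[index]
      end
    end
    filled
  end
end

# Sample data: columns 0 and 2 were vertically merged in the PDF.
rows = [
  ['Health', 'Alice', 'Gander'],
  [nil, 'Bob', nil],
  ['Finance', 'Carol', 'Corner Brook'],
  [nil, 'Dave', nil],
]

filled = fill_down(rows, 0, 2)
filled.each { |row| puts row.inspect }
```

Each row that came from a merged cell ends up carrying the value of the first row of its group in columns 0 and 2.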
65
+ Sometimes, if a cell contains multiple lines of text, the PDF to CSV conversion will incorrectly break the cell into multiple rows. To remove the spurious row and merge its values into the row above, use the `merge_into_cell_above` method, which accepts the indices of columns in which this error occurs:
66
+
67
+ ```ruby
68
+ table.merge_into_cell_above(1, 2)
69
+ ```
70
+
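To make the merge step concrete, here is a minimal sketch in plain Ruby. It is an illustration under stated assumptions rather than the gem's code: the `merge_up` helper and the sample rows are hypothetical, and unlike `merge_into_cell_above` it returns a new array and does not raise when the cell above is empty.

```ruby
# Hypothetical helper: a row whose only non-empty cells are in the given
# columns is treated as spurious and folded into the row above it.
def merge_up(rows, *indices)
  rows.each_with_object([]) do |row, merged|
    others_empty = row.each_index.all? { |j| row[j].nil? || indices.include?(j) }
    if others_empty && !merged.empty? && indices.any? { |j| row[j] }
      above = merged.last
      indices.each do |j|
        # Join the two fragments with a newline, as a multi-line cell would be.
        above[j] = [above[j], row[j]].compact.join("\n") if row[j]
      end
    else
      merged << row.dup
    end
  end
end

# Sample data: the second row is the tail of multi-line cells in columns 1 and 2.
rows = [
  ['Health', 'Access to Information', '123 Main St.'],
  [nil, 'and Privacy Coordinator', 'Suite 4'],
  ['Finance', 'Director', '1 Water St.'],
]

merged = merge_up(rows, 1, 2)
merged.each { |row| puts row.inspect }
```

The sketch skips the first row of the table, since it has no row above to merge into.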
71
+ With additional time and effort, these two methods could be made to work without needing column indices as hints.
72
+
73
+ ## Troubleshooting
74
+
75
+ If you see warnings on the command-line like:
76
+
77
+ 2013-10-09 14:30:03.704 osascript[2056:707] Error loading /Library/ScriptingAdditions/Adobe Unit Types.osax/Contents/MacOS/Adobe Unit Types: dlopen(/Library/ScriptingAdditions/Adobe Unit Types.osax/Contents/MacOS/Adobe Unit Types, 262): no suitable image found. Did find:
78
+ /Library/ScriptingAdditions/Adobe Unit Types.osax/Contents/MacOS/Adobe Unit Types: no matching architecture in universal wrapper
79
+ osascript: OpenScripting.framework - scripting addition "/Library/ScriptingAdditions/Adobe Unit Types.osax" declares no loadable handlers.
80
+
81
+ See [this Adobe help article](http://helpx.adobe.com/photoshop/kb/unit-type-conversion-error-applescript.html).
82
+
83
+ ## Developers
84
+
85
+ If, like me, you almost never write AppleScript, you can access much of AppleScript's documentation through Apple's AppleScript Editor. See, for example, how to access [the entries about Microsoft Excel](http://support.microsoft.com/kb/113891).
86
+
87
+ ## Why?
88
+
89
+ Most of the PDFs I work with contain no tables. In those cases I do one of the following:
90
+
91
+ * Run `pdftotext filename.pdf` to convert the PDF to text, and write a script using regular expressions to parse the output.
92
+ * Run `pdftotext -layout filename.pdf` to convert the PDF to text while preserving the text layout – very useful when working with two-column layouts.
93
+ * Use [commercial software](http://reviews.reporterslab.org/search?q=&type=products&category=pdf-tools-2011-11-09) like Adobe Acrobat Pro to save the PDF to another format, usually Excel.
94
+ * Use the `Extract PDF Text` action in Apple's Automator, which I recently learned performs well.
95
+
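As an illustration of the first approach in the list above, the following sketch parses text of the kind `pdftotext` might produce. The sample text, field names and regular expressions are all assumptions made for the example; real output depends entirely on the PDF.

```ruby
# Hypothetical sample of text extracted from a PDF, with records
# separated by blank lines.
text = <<-TEXT
Department of Health
Coordinator: Jane Doe
Telephone: 709-555-0101

Department of Finance
Coordinator: John Roe
Telephone: 709-555-0102
TEXT

# Split the text into blocks, then pull each field out with a regular expression.
records = text.split(/\n{2,}/).map do |block|
  {
    :department  => block[/\ADepartment of (.+)$/, 1],
    :coordinator => block[/^Coordinator: (.+)$/, 1],
    :telephone   => block[/^Telephone: ([\d-]+)$/, 1],
  }
end

records.each { |record| puts record.inspect }
```

In practice the text would come from reading the file written by `pdftotext` rather than from a heredoc.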
96
+ For PDFs containing tables, I discovered that copy-pasting from Apple's Preview to Microsoft Excel worked better than all alternatives tested, for the PDFs I was interested in.
97
+
98
+ ## Bugs? Questions?
99
+
100
+ This project's main repository is on GitHub: [http://github.com/opennorth/copy_paste_pdf](http://github.com/opennorth/copy_paste_pdf), where your contributions, forks, bug reports, feature requests, and feedback are greatly welcomed.
101
+
102
+ Copyright (c) 2013 Open North Inc., released under the MIT license
data/Rakefile ADDED
@@ -0,0 +1,16 @@
1
+ require 'bundler'
2
+ Bundler::GemHelper.install_tasks
3
+
4
+ require 'rspec/core/rake_task'
5
+ RSpec::Core::RakeTask.new(:spec)
6
+
7
+ task :default => :spec
8
+
9
+ begin
10
+ require 'yard'
11
+ YARD::Rake::YardocTask.new
12
+ rescue LoadError
13
+ task :yard do
14
+ abort 'YARD is not available. In order to run yard, you must: gem install yard'
15
+ end
16
+ end
data/USAGE ADDED
@@ -0,0 +1 @@
1
+ See README.md for full usage details.
data/bin/copy-paste-pdf.applescript ADDED
@@ -0,0 +1,63 @@
1
+ #!/usr/bin/osascript
2
+ -- @see http://appscript.sourceforge.net/osascript.html
3
+
4
+ on run argv
5
+ set start to current date
6
+
7
+ -- Default arguments.
8
+ set close_applications to true
9
+
10
+ -- Parse command-line arguments and translate paths from `/path/to/file.ext`
11
+ -- to `Macintosh HD:path:to:file.ext`.
12
+ set input to POSIX file (item 1 of argv) as alias
13
+ set output to POSIX file (item 2 of argv) as string
14
+ if count of argv is 2 then set close_applications to false
15
+
16
+ tell application "Preview"
17
+ activate
18
+ open input
19
+ end
20
+
21
+ -- Preview is not fully scriptable.
22
+ set the clipboard to ""
23
+ tell application "System Events" to tell process "Preview"
24
+ click menu item "Select All" of menu 1 of menu bar item "Edit" of menu bar 1
25
+ click menu item "Copy" of menu 1 of menu bar item "Edit" of menu bar 1
26
+ end tell
27
+ if close_applications then tell application "Preview" to quit
28
+
29
+ -- Yes, this is how to make AppleScript block.
30
+ repeat
31
+ try
32
+ if (the clipboard) is not "" then
33
+ -- One idea is to set a variable to the clipboard, in order to allow use
34
+ -- of the clipboard by the user past this point. However, the clipboard
35
+ -- appears to be complex, based on the output of `return clipboard info`.
36
+ exit repeat
37
+ end if
38
+ -- Calling `the clipboard` can sometimes cause an exception to be thrown,
39
+ -- either due to a race condition or a software error. Not sure how to recover.
40
+ -- Observed error codes are: -25130, -25132, -25133.
41
+ -- @see http://search.cpan.org/~wyant/Mac-Pasteboard-0.002/lib/Mac/Pasteboard.pm#badPasteboardSyncErr
42
+ -- @see http://search.cpan.org/~wyant/Mac-Pasteboard-0.002/lib/Mac/Pasteboard.pm#badPasteboardItemErr
43
+ -- @see http://search.cpan.org/~wyant/Mac-Pasteboard-0.002/lib/Mac/Pasteboard.pm#badPasteboardFlavorErr
44
+ on error error_message number error_number
45
+ if {-25130, -25132, -25133} contains error_number
46
+ error "Known error " & error_number & " occurred" number error_number
47
+ else
48
+ error error_message number error_number
49
+ end if
50
+ end try
51
+ end repeat
52
+
53
+ tell application "Microsoft Excel"
54
+ activate
55
+ make new workbook
56
+ paste worksheet active sheet
57
+ save workbook as active workbook filename output file format CSV file format overwrite yes
58
+ close active workbook saving no
59
+ if close_applications then quit
60
+ end tell
61
+
62
+ log "" & (current date - start) & " seconds"
63
+ end run
data/copy_paste_pdf.gemspec ADDED
@@ -0,0 +1,23 @@
1
+ # -*- encoding: utf-8 -*-
2
+ require File.expand_path('../lib/copy_paste_pdf/version', __FILE__)
3
+
4
+ Gem::Specification.new do |s|
5
+ s.name = "copy_paste_pdf"
6
+ s.version = CopyPastePDF::VERSION
7
+ s.platform = Gem::Platform::RUBY
8
+ s.authors = ["Open North"]
9
+ s.email = ["info@opennorth.ca"]
10
+ s.homepage = "http://github.com/opennorth/copy_paste_pdf"
11
+ s.summary = %q{Converts PDF to CSV by copy-pasting from Apple's Preview to Microsoft Excel}
12
+ s.license = 'MIT'
13
+
14
+ s.files = `git ls-files`.split("\n")
15
+ s.test_files = `git ls-files -- {test,spec,features}/*`.split("\n")
16
+ s.executables = `git ls-files -- bin/*`.split("\n").map{ |f| File.basename(f) }
17
+ s.require_paths = ["lib"]
18
+
19
+ s.add_development_dependency('coveralls')
20
+ s.add_development_dependency('json', '~> 1.7.7') # to silence coveralls warning
21
+ s.add_development_dependency('rake')
22
+ s.add_development_dependency('rspec', '~> 2.10')
23
+ end
data/lib/copy_paste_pdf/table.rb ADDED
@@ -0,0 +1,66 @@
1
+ module CopyPastePDF
2
+ class Table < Array
3
+ # Removes all empty rows from the table.
4
+ def remove_empty_rows!
5
+ reject! do |row|
6
+ row.all?(&:nil?)
7
+ end
8
+ end
9
+
10
+ # Copies the values of the given cells from a line to any following lines
11
+ # whose corresponding cells are empty.
12
+ #
13
+ # @param [Array] indices the cell indices to copy
14
+ # @yieldparam [Array] row a row in the table
15
+ # @yieldreturn [Boolean] whether to skip the row
16
+ # @raise if a destination has no source
17
+ # @raise if a destination cell has a value
18
+ # @raise if a row is neither a source nor a destination
19
+ def copy_into_cell_below(*indices)
20
+ source = nil
21
+ each do |row|
22
+ if !block_given? || !yield(row)
23
+ values = row.values_at(*indices)
24
+ case values.count(&:nil?)
25
+ when 0
26
+ source = values
27
+ when indices.size
28
+ if source
29
+ indices.each_with_index do |index,i|
30
+ if row[index]
31
+ raise "#{index} contains #{row[index]}"
32
+ else
33
+ row[index] = source[i]
34
+ end
35
+ end
36
+ else
37
+ raise "#{row} has no source"
38
+ end
39
+ else
40
+ raise "#{row} is neither a source nor a destination"
41
+ end
42
+ end
43
+ end
44
+ end
45
+
46
+ # Merges the values of the given cells from a line, whose other cells are
47
+ # empty, into the corresponding cells of the preceding line.
48
+ #
49
+ # @param [Array] indices the cell indices to merge
50
+ # @raise if a destination cell is empty
51
+ def merge_into_cell_above(*indices)
52
+ each_with_index.reverse_each do |row,i|
53
+ if i > 0 && row.each_with_index.all?{|value,j| value.nil? || indices.include?(j)} # the first row has no row above
54
+ indices.each do |index|
55
+ if self[i - 1][index]
56
+ self[i - 1][index] = "#{self[i - 1][index]}\n#{row[index]}"
57
+ else
58
+ raise "#{index} is empty"
59
+ end
60
+ end
61
+ delete_at(i)
62
+ end
63
+ end
64
+ end
65
+ end
66
+ end
data/lib/copy_paste_pdf/version.rb ADDED
@@ -0,0 +1,3 @@
1
+ module CopyPastePDF
2
+ VERSION = "0.0.1"
3
+ end
data/lib/copy_paste_pdf.rb ADDED
@@ -0,0 +1 @@
1
+ require 'copy_paste_pdf/table'
data/spec/spec_helper.rb ADDED
@@ -0,0 +1,3 @@
1
+ require 'rubygems'
2
+ require 'rspec'
3
+ require File.dirname(__FILE__) + '/../lib/copy_paste_pdf'
metadata ADDED
@@ -0,0 +1,116 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: copy_paste_pdf
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.0.1
5
+ platform: ruby
6
+ authors:
7
+ - Open North
8
+ autorequire:
9
+ bindir: bin
10
+ cert_chain: []
11
+ date: 2013-10-10 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: coveralls
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - '>='
18
+ - !ruby/object:Gem::Version
19
+ version: '0'
20
+ type: :development
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - '>='
25
+ - !ruby/object:Gem::Version
26
+ version: '0'
27
+ - !ruby/object:Gem::Dependency
28
+ name: json
29
+ requirement: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - ~>
32
+ - !ruby/object:Gem::Version
33
+ version: 1.7.7
34
+ type: :development
35
+ prerelease: false
36
+ version_requirements: !ruby/object:Gem::Requirement
37
+ requirements:
38
+ - - ~>
39
+ - !ruby/object:Gem::Version
40
+ version: 1.7.7
41
+ - !ruby/object:Gem::Dependency
42
+ name: rake
43
+ requirement: !ruby/object:Gem::Requirement
44
+ requirements:
45
+ - - '>='
46
+ - !ruby/object:Gem::Version
47
+ version: '0'
48
+ type: :development
49
+ prerelease: false
50
+ version_requirements: !ruby/object:Gem::Requirement
51
+ requirements:
52
+ - - '>='
53
+ - !ruby/object:Gem::Version
54
+ version: '0'
55
+ - !ruby/object:Gem::Dependency
56
+ name: rspec
57
+ requirement: !ruby/object:Gem::Requirement
58
+ requirements:
59
+ - - ~>
60
+ - !ruby/object:Gem::Version
61
+ version: '2.10'
62
+ type: :development
63
+ prerelease: false
64
+ version_requirements: !ruby/object:Gem::Requirement
65
+ requirements:
66
+ - - ~>
67
+ - !ruby/object:Gem::Version
68
+ version: '2.10'
69
+ description:
70
+ email:
71
+ - info@opennorth.ca
72
+ executables:
73
+ - copy-paste-pdf.applescript
74
+ extensions: []
75
+ extra_rdoc_files: []
76
+ files:
77
+ - .gitignore
78
+ - .travis.yml
79
+ - .yardopts
80
+ - Gemfile
81
+ - LICENSE
82
+ - README.md
83
+ - Rakefile
84
+ - USAGE
85
+ - bin/copy-paste-pdf.applescript
86
+ - copy_paste_pdf.gemspec
87
+ - lib/copy_paste_pdf.rb
88
+ - lib/copy_paste_pdf/table.rb
89
+ - lib/copy_paste_pdf/version.rb
90
+ - spec/spec_helper.rb
91
+ homepage: http://github.com/opennorth/copy_paste_pdf
92
+ licenses:
93
+ - MIT
94
+ metadata: {}
95
+ post_install_message:
96
+ rdoc_options: []
97
+ require_paths:
98
+ - lib
99
+ required_ruby_version: !ruby/object:Gem::Requirement
100
+ requirements:
101
+ - - '>='
102
+ - !ruby/object:Gem::Version
103
+ version: '0'
104
+ required_rubygems_version: !ruby/object:Gem::Requirement
105
+ requirements:
106
+ - - '>='
107
+ - !ruby/object:Gem::Version
108
+ version: '0'
109
+ requirements: []
110
+ rubyforge_project:
111
+ rubygems_version: 2.0.8
112
+ signing_key:
113
+ specification_version: 4
114
+ summary: Converts PDF to CSV by copy-pasting from Apple's Preview to Microsoft Excel
115
+ test_files:
116
+ - spec/spec_helper.rb