copy_paste_pdf 0.0.1
- checksums.yaml +7 -0
- data/.gitignore +6 -0
- data/.travis.yml +7 -0
- data/.yardopts +4 -0
- data/Gemfile +4 -0
- data/LICENSE +20 -0
- data/README.md +102 -0
- data/Rakefile +16 -0
- data/USAGE +1 -0
- data/bin/copy-paste-pdf.applescript +63 -0
- data/copy_paste_pdf.gemspec +23 -0
- data/lib/copy_paste_pdf/table.rb +66 -0
- data/lib/copy_paste_pdf/version.rb +3 -0
- data/lib/copy_paste_pdf.rb +1 -0
- data/spec/spec_helper.rb +3 -0
- metadata +116 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: 136b95ea0d8543294bce8905905ae9f638d34367
+  data.tar.gz: c5f220fe4fcf77b4fedf3d6fc5e186487b492ef8
+SHA512:
+  metadata.gz: 2522bdfe9adf4f87114d73d3b0d8ca11576fd517af6c1459c286ed22340ca6225aee4320a84e4a9513ad18f98d56117a48f9e1ffd8130260b0c05fb0130c233b
+  data.tar.gz: 91686dc0fdcc83700e5d4e72500fa2667ec212880176eb82587111bfa8c3bab7503c6476979b66130271f120e6604354c05a050feec0e8f76ef2079dff2cba86
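The checksums file pairs each archive name with a SHA1 and a SHA512 hex digest. As a rough illustration of how such a mapping could be checked, here is a minimal sketch using Ruby's standard `digest` library. The helper name `verify_checksums` and the placeholder byte strings are assumptions made for this example; this is not how RubyGems itself performs verification.

```ruby
require 'digest'

# Hypothetical helper: compares recomputed digests against expected ones.
# `checksums` mirrors the shape of checksums.yaml (algorithm => file => hex).
# `contents` maps each file name to its raw bytes.
def verify_checksums(checksums, contents)
  digests = { 'SHA1' => Digest::SHA1, 'SHA512' => Digest::SHA512 }
  checksums.flat_map do |algorithm, files|
    files.map do |name, expected|
      actual = digests.fetch(algorithm).hexdigest(contents.fetch(name))
      [algorithm, name, actual == expected]
    end
  end
end

# Placeholder bytes stand in for the real metadata.gz and data.tar.gz.
contents = {
  'metadata.gz' => 'placeholder metadata bytes',
  'data.tar.gz' => 'placeholder data bytes'
}

# Build the expected digests from the placeholders so the example is
# self-contained; a real check would load them from checksums.yaml.
checksums = {
  'SHA1'   => contents.transform_values { |bytes| Digest::SHA1.hexdigest(bytes) },
  'SHA512' => contents.transform_values { |bytes| Digest::SHA512.hexdigest(bytes) }
}

results = verify_checksums(checksums, contents)
results.each do |algorithm, name, ok|
  puts "#{algorithm} #{name}: #{ok ? 'ok' : 'MISMATCH'}"
end
```

A SHA1 hex digest is 40 characters and a SHA512 hex digest is 128, which matches the lengths of the values listed above.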
data/.gitignore
ADDED
data/.travis.yml
ADDED
data/.yardopts
ADDED
data/Gemfile
ADDED
data/LICENSE
ADDED
@@ -0,0 +1,20 @@
+Copyright (c) 2013 Open North Inc.
+
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md
ADDED
@@ -0,0 +1,102 @@
+# Copy-Paste PDF
+
+<!--
+[![Build Status](https://secure.travis-ci.org/opennorth/copy_paste_pdf.png)](http://travis-ci.org/opennorth/copy_paste_pdf)
+[![Dependency Status](https://gemnasium.com/opennorth/copy_paste_pdf.png)](https://gemnasium.com/opennorth/copy_paste_pdf)
+[![Coverage Status](https://coveralls.io/repos/opennorth/copy_paste_pdf/badge.png?branch=master)](https://coveralls.io/r/opennorth/copy_paste_pdf)
+[![Code Climate](https://codeclimate.com/github/opennorth/copy_paste_pdf.png)](https://codeclimate.com/github/opennorth/copy_paste_pdf)
+-->
+
+[Tabula](https://github.com/jazzido/tabula) was written for those cases where you can’t easily copy-and-paste tables from a PDF to a spreadsheet. Surprisingly, Tabula sometimes fails where copy-and-pasting succeeds. This project is for [those cases](http://www.atipp.gov.nl.ca/info/coordinators.html) when copy-and-pasting is all you need (and where nothing else works).
+
+This gem only works on OS X.
+
+## Getting Started
+
+### PDF to CSV
+
+Install with:
+
+    gem install --no-wrappers copy_paste_pdf
+
+If you omit the `--no-wrappers` switch, the AppleScript will not install properly. You may run the script with:
+
+    copy-paste-pdf.applescript /path/to/input.pdf /path/to/output.csv
+
+* The script will open the PDF in Preview and copy the contents of the PDF
+* The script will open Microsoft Excel, paste the contents and save as CSV
+
+If you want the script to quit Preview and Excel once it's done, pass a third argument, like:
+
+    copy-paste-pdf.applescript /path/to/input.pdf /path/to/output.csv true
+
+The script may [pinwheel](http://en.wikipedia.org/wiki/Spinning_pinwheel) while copying the contents of the PDF and while pasting the contents to the spreadsheet. If it looks like nothing is happening, wait a few seconds.
+
+You can work in other applications while the script is running - just don't use the clipboard as it may interfere with the script.
+
+This method is admittedly not very efficient. Running time averages under 2 seconds per page but varies considerably depending on your system's load.
+
+### Data Cleaning
+
+The Ruby gem defines helper methods for cleaning the CSV. In most cases, the PDF to CSV conversion will create many empty rows. You can easily remove those rows with:
+
+```ruby
+require 'csv'
+
+require 'copy_paste_pdf'
+
+table = CopyPastePDF::Table.new(CSV.read('/path/to/output.csv'))
+
+table.remove_empty_rows!
+
+CSV.open('/path/to/clean.csv', 'w') do |csv|
+  table.each do |row|
+    csv << row
+  end
+end
+```
+
+If the table in the PDF contained vertically-merged cells, then, in the CSV, the first of the merged cells will have a value and the others will be empty. To copy the value of the first cell to the others, use the `copy_into_cell_below` method, which accepts the indices of columns containing merged cells:
+
+```ruby
+table.copy_into_cell_below(0, 3, 4)
+```
+
+Sometimes, if a cell contains multiple lines of text, the PDF to CSV conversion will incorrectly break the cell into multiple rows. To remove the spurious row and merge its values into the row above, use the `merge_into_cell_above` method, which accepts the indices of columns in which this error occurs:
+
+```ruby
+table.merge_into_cell_above(1, 2)
+```
+
+With additional time and effort, these two methods can be made to operate without needing columns as hints.
+
+## Troubleshooting
+
+If you see warnings on the command-line like:
+
+    2013-10-09 14:30:03.704 osascript[2056:707] Error loading /Library/ScriptingAdditions/Adobe Unit Types.osax/Contents/MacOS/Adobe Unit Types: dlopen(/Library/ScriptingAdditions/Adobe Unit Types.osax/Contents/MacOS/Adobe Unit Types, 262): no suitable image found. Did find:
+    /Library/ScriptingAdditions/Adobe Unit Types.osax/Contents/MacOS/Adobe Unit Types: no matching architecture in universal wrapper
+    osascript: OpenScripting.framework - scripting addition "/Library/ScriptingAdditions/Adobe Unit Types.osax" declares no loadable handlers.
+
+See [this Adobe help article](http://helpx.adobe.com/photoshop/kb/unit-type-conversion-error-applescript.html).
+
+## Developers
+
+If, like me, you almost never write AppleScript, you can access much of AppleScript's documentation through Apple's AppleScript Editor. See, for example, how to access [the entries about Microsoft Excel](http://support.microsoft.com/kb/113891).
+
+## Why?
+
+Most of the PDFs I work with contain no tables. In those cases I either:
+
+* Run `pdftotext filename.pdf` to convert the PDF to text, and write a script using regular expressions to parse the output.
+* Run `pdftotext -layout filename.pdf` to convert the PDF to text while preserving the text layout – very useful when working with two-column layouts.
+* Use [commercial software](http://reviews.reporterslab.org/search?q=&type=products&category=pdf-tools-2011-11-09) like Adobe Acrobat Pro to save the PDF to another format, usually Excel.
+* I recently learned that Apple's Automator has an `Extract PDF Text` action which performs well.
+
+For PDFs containing tables, I discovered that copy-pasting from Apple's Preview to Microsoft Excel worked better than all alternatives tested, for the PDFs I was interested in.
+
+## Bugs? Questions?
+
+This project's main repository is on GitHub: [http://github.com/opennorth/copy_paste_pdf](http://github.com/opennorth/copy_paste_pdf), where your contributions, forks, bug reports, feature requests, and feedback are greatly welcomed.
+
+Copyright (c) 2013 Open North Inc., released under the MIT license
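The README's Data Cleaning section relies on empty rows being detectable after `CSV.read`. As a small sketch of why that works: Ruby's standard `csv` library parses unquoted empty fields as `nil`, so a row of nothing but separators becomes an array of `nil` values. The sample data below is made up for illustration and is not output from the gem.

```ruby
require 'csv'

# Hypothetical sample: the kind of CSV a paste into a spreadsheet might
# produce, with a fully empty row between the header and a data row.
raw = "Name,Phone\n,\nAlice,555-0100\n"

rows = CSV.parse(raw)

# Unquoted empty fields come back as nil, so the empty row is [nil, nil]
# and can be separated out with the same test remove_empty_rows! uses.
blank_rows, data_rows = rows.partition { |row| row.all?(&:nil?) }

puts "blank rows: #{blank_rows.size}, data rows: #{data_rows.size}"
```

Note that a cell containing only whitespace, or a quoted empty string, is not `nil`, so rows like that would not be treated as empty by this test.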
data/Rakefile
ADDED
@@ -0,0 +1,16 @@
|
+require 'bundler'
+Bundler::GemHelper.install_tasks
+
+require 'rspec/core/rake_task'
+RSpec::Core::RakeTask.new(:spec)
+
+task :default => :spec
+
+begin
+  require 'yard'
+  YARD::Rake::YardocTask.new
+rescue LoadError
+  task :yard do
+    abort 'YARD is not available. In order to run yard, you must: gem install yard'
+  end
+end
data/USAGE
ADDED
@@ -0,0 +1 @@
|
+See README.md for full usage details.
data/bin/copy-paste-pdf.applescript
ADDED
@@ -0,0 +1,63 @@
+#!/usr/bin/osascript
+-- @see http://appscript.sourceforge.net/osascript.html
+
+on run argv
+  set start to current date
+
+  -- Default arguments.
+  set close_applications to true
+
+  -- Parse command-line arguments and translate paths from `/path/to/file.ext`
+  -- to `Macintosh HD:path:to:file.ext`.
+  set input to POSIX file (item 1 of argv) as alias
+  set output to POSIX file (item 2 of argv) as string
+  if count of argv is 2 then set close_applications to false
+
+  tell application "Preview"
+    activate
+    open input
+  end
+
+  -- Preview is not fully scriptable.
+  set the clipboard to ""
+  tell application "System Events" to tell process "Preview"
+    click menu item "Select All" of menu 1 of menu bar item "Edit" of menu bar 1
+    click menu item "Copy" of menu 1 of menu bar item "Edit" of menu bar 1
+  end tell
+  if close_applications then tell application "Preview" to quit
+
+  -- Yes, this is how to make AppleScript block.
+  repeat
+    try
+      if (the clipboard) is not "" then
+        -- One idea is to set a variable to the clipboard, in order to allow use
+        -- of the clipboard by the user past this point. However, the clipboard
+        -- appears to be complex based on the output of `return clipboard info`.
+        exit repeat
+      end if
+    -- Calling `the clipboard` can sometimes cause an exception to be thrown,
+    -- either due to race condition or software error. Not sure how to recover.
+    -- Observed error codes are: -25130, -25132, -25133.
+    -- @see http://search.cpan.org/~wyant/Mac-Pasteboard-0.002/lib/Mac/Pasteboard.pm#badPasteboardSyncErr
+    -- @see http://search.cpan.org/~wyant/Mac-Pasteboard-0.002/lib/Mac/Pasteboard.pm#badPasteboardItemErr
+    -- @see http://search.cpan.org/~wyant/Mac-Pasteboard-0.002/lib/Mac/Pasteboard.pm#badPasteboardFlavorErr
+    on error error_message number error_number
+      if {-25130, -25132, -25133} contains error_number
+        error "Known error " & error_number & " occurred" number error_number
+      else
+        error error_message number error_number
+      end if
+    end try
+  end repeat
+
+  tell application "Microsoft Excel"
+    activate
+    make new workbook
+    paste worksheet active sheet
+    save workbook as active workbook filename output file format CSV file format overwrite yes
+    close active workbook saving no
+    if close_applications then quit
+  end tell
+
+  log "" & (current date - start) & " seconds"
+end run
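The argument handling in the script above means that exactly two arguments leave Preview and Excel open, while any third argument (whatever its value) leaves `close_applications` at its default of `true`. A minimal Ruby sketch of building that argument list follows. The helper `copy_paste_pdf_command` and its `quit_apps:` keyword are hypothetical names for this example and are not part of the gem.

```ruby
require 'shellwords'

# Hypothetical helper (not part of the gem): builds the argument list for
# the AppleScript executable. Paths are made absolute before being passed on.
def copy_paste_pdf_command(input_pdf, output_csv, quit_apps: false)
  command = [
    'copy-paste-pdf.applescript',
    File.expand_path(input_pdf),
    File.expand_path(output_csv)
  ]
  # The script only checks the argument count, so the value of the third
  # argument is arbitrary; 'true' follows the README's example.
  command << 'true' if quit_apps
  command
end

command = copy_paste_pdf_command('/tmp/input.pdf', '/tmp/output.csv', quit_apps: true)
puts Shellwords.join(command)
```

On OS X with the gem installed, `system(*command)` would be one way to run the conversion; passing the arguments as separate strings avoids shell quoting problems with paths that contain spaces.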
data/copy_paste_pdf.gemspec
ADDED
@@ -0,0 +1,23 @@
+# -*- encoding: utf-8 -*-
+require File.expand_path('../lib/copy_paste_pdf/version', __FILE__)
+
+Gem::Specification.new do |s|
+  s.name        = "copy_paste_pdf"
+  s.version     = CopyPastePDF::VERSION
+  s.platform    = Gem::Platform::RUBY
+  s.authors     = ["Open North"]
+  s.email       = ["info@opennorth.ca"]
+  s.homepage    = "http://github.com/opennorth/copy_paste_pdf"
+  s.summary     = %q{Converts PDF to CSV by copy-pasting from Apple's Preview to Microsoft Excel}
+  s.license     = 'MIT'
+
+  s.files         = `git ls-files`.split("\n")
+  s.test_files    = `git ls-files -- {test,spec,features}/*`.split("\n")
+  s.executables   = `git ls-files -- bin/*`.split("\n").map{ |f| File.basename(f) }
+  s.require_paths = ["lib"]
+
+  s.add_development_dependency('coveralls')
+  s.add_development_dependency('json', '~> 1.7.7') # to silence coveralls warning
+  s.add_development_dependency('rake')
+  s.add_development_dependency('rspec', '~> 2.10')
+end
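The gemspec pins two development dependencies with the pessimistic operator `~>`. As a quick illustration of which versions those constraints accept, the sketch below uses RubyGems' own `Gem::Requirement` and `Gem::Version` classes. The candidate version numbers are arbitrary examples, not versions the project is known to have used.

```ruby
require 'rubygems'

# The two pessimistic constraints from the gemspec above.
json_requirement  = Gem::Requirement.new('~> 1.7.7')  # >= 1.7.7 and < 1.8
rspec_requirement = Gem::Requirement.new('~> 2.10')   # >= 2.10 and < 3.0

# Arbitrary candidate versions, chosen to sit on either side of each bound.
{
  json_requirement  => %w[1.7.6 1.7.9 1.8.0],
  rspec_requirement => %w[2.9 2.14.1 3.0.0]
}.each do |requirement, versions|
  versions.each do |version|
    accepted = requirement.satisfied_by?(Gem::Version.new(version))
    puts "#{requirement} accepts #{version}: #{accepted}"
  end
end
```

The number of segments matters: `~> 1.7.7` only allows the last segment to grow, while `~> 2.10` allows any 2.x release from 2.10 onward.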
data/lib/copy_paste_pdf/table.rb
ADDED
@@ -0,0 +1,66 @@
+module CopyPastePDF
+  class Table < Array
+    # Removes all empty rows from the table.
+    def remove_empty_rows!
+      reject! do |row|
+        row.all?(&:nil?)
+      end
+    end
+
+    # Copies the values of the given cells from a line to any following lines
+    # whose corresponding cells are empty.
+    #
+    # @param [Array] indices the cell indices to copy
+    # @yieldparam [Array] row a row in the table
+    # @yieldreturn [Boolean] whether to skip the row from the table
+    # @raise if a destination has no source
+    # @raise if a destination cell has a value
+    # @raise if a row is neither a source nor a destination
+    def copy_into_cell_below(*indices)
+      source = nil
+      each do |row|
+        if !block_given? || !yield(row)
+          values = row.values_at(*indices)
+          case values.count(&:nil?)
+          when 0
+            source = values
+          when indices.size
+            if source
+              indices.each_with_index do |index,i|
+                if row[index]
+                  raise "#{index} contains #{row[index]}"
+                else
+                  row[index] = source[i]
+                end
+              end
+            else
+              raise "#{row} has no source"
+            end
+          else
+            raise "#{row} is neither a source nor a destination"
+          end
+        end
+      end
+    end
+
+    # Merges the values of the given cells from a line, whose other cells are
+    # empty, into the corresponding cells of the preceding line.
+    #
+    # @param [Array] indices the cell indices to merge
+    # @raise if a destination cell is empty
+    def merge_into_cell_above(*indices)
+      each_with_index.reverse_each do |row,i|
+        if row.each_with_index.all?{|value,j| value.nil? || indices.include?(j)}
+          indices.each do |index|
+            if self[i - 1][index]
+              self[i - 1][index] = "#{self[i - 1][index]}\n#{row[index]}"
+            else
+              raise "#{index} is empty"
+            end
+          end
+          delete_at(i)
+        end
+      end
+    end
+  end
+end
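To see the three cleaning steps working together, here is a small worked sketch. `SketchTable` is a hypothetical stand-in condensed from the class in this diff so that the example runs without the gem installed; it omits the optional block that `copy_into_cell_below` accepts for skipping rows. The sample rows are invented.

```ruby
# Hypothetical stand-in for CopyPastePDF::Table, condensed from the diff above.
class SketchTable < Array
  # Drops rows in which every cell is nil.
  def remove_empty_rows!
    reject! { |row| row.all?(&:nil?) }
  end

  # Fills nil cells in the given columns from the most recent row that had
  # values in all of those columns.
  def copy_into_cell_below(*indices)
    source = nil
    each do |row|
      values = row.values_at(*indices)
      case values.count(&:nil?)
      when 0
        source = values
      when indices.size
        raise "#{row} has no source" unless source
        indices.each_with_index { |index, i| row[index] = source[i] }
      else
        raise "#{row} is neither a source nor a destination"
      end
    end
  end

  # Appends a spurious row's values to the row above, then deletes it.
  # Iterating in reverse keeps earlier indices valid after each deletion.
  def merge_into_cell_above(*indices)
    each_with_index.reverse_each do |row, i|
      next unless row.each_with_index.all? { |value, j| value.nil? || indices.include?(j) }
      indices.each do |index|
        raise "#{index} is empty" unless self[i - 1][index]
        self[i - 1][index] = "#{self[i - 1][index]}\n#{row[index]}"
      end
      delete_at(i)
    end
  end
end

# Invented sample: column 0 was a vertically merged cell in the PDF, and
# the second row is a multi-line cell in column 2 that was split in two.
table = SketchTable.new([
  ['Dept A', 'Alice', 'Line 1'],
  [nil,      nil,     'Line 2'],
  [nil,      nil,     nil],
  [nil,      'Bob',   'Office']
])

table.remove_empty_rows!
table.merge_into_cell_above(2)
table.copy_into_cell_below(0)

table.each { |row| p row }
```

The order matters in this sketch: merging the split row first leaves only rows that are clearly a source or a destination for column 0, so `copy_into_cell_below` does not raise.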
data/lib/copy_paste_pdf.rb
ADDED
@@ -0,0 +1 @@
+require 'copy_paste_pdf/table'
data/spec/spec_helper.rb
ADDED
metadata
ADDED
@@ -0,0 +1,116 @@
+--- !ruby/object:Gem::Specification
+name: copy_paste_pdf
+version: !ruby/object:Gem::Version
+  version: 0.0.1
+platform: ruby
+authors:
+- Open North
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2013-10-10 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: coveralls
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: json
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: 1.7.7
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: 1.7.7
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: rspec
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '2.10'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '2.10'
+description:
+email:
+- info@opennorth.ca
+executables:
+- copy-paste-pdf.applescript
+extensions: []
+extra_rdoc_files: []
+files:
+- .gitignore
+- .travis.yml
+- .yardopts
+- Gemfile
+- LICENSE
+- README.md
+- Rakefile
+- USAGE
+- bin/copy-paste-pdf.applescript
+- copy_paste_pdf.gemspec
+- lib/copy_paste_pdf.rb
+- lib/copy_paste_pdf/table.rb
+- lib/copy_paste_pdf/version.rb
+- spec/spec_helper.rb
+homepage: http://github.com/opennorth/copy_paste_pdf
+licenses:
+- MIT
+metadata: {}
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubyforge_project:
+rubygems_version: 2.0.8
+signing_key:
+specification_version: 4
+summary: Converts PDF to CSV by copy-pasting from Apple's Preview to Microsoft Excel
+test_files:
+- spec/spec_helper.rb