pdf-reader 2.4.0 → 2.6.0

Files changed (40)
  1. checksums.yaml +4 -4
  2. data/CHANGELOG +31 -0
  3. data/README.md +17 -2
  4. data/Rakefile +1 -1
  5. data/examples/extract_fonts.rb +12 -7
  6. data/lib/pdf/reader/afm/Courier-Bold.afm +342 -342
  7. data/lib/pdf/reader/afm/Courier-BoldOblique.afm +342 -342
  8. data/lib/pdf/reader/afm/Courier-Oblique.afm +342 -342
  9. data/lib/pdf/reader/afm/Courier.afm +342 -342
  10. data/lib/pdf/reader/afm/Helvetica-Bold.afm +2827 -2827
  11. data/lib/pdf/reader/afm/Helvetica-BoldOblique.afm +2827 -2827
  12. data/lib/pdf/reader/afm/Helvetica-Oblique.afm +3051 -3051
  13. data/lib/pdf/reader/afm/Helvetica.afm +3051 -3051
  14. data/lib/pdf/reader/afm/MustRead.html +19 -0
  15. data/lib/pdf/reader/afm/Symbol.afm +213 -213
  16. data/lib/pdf/reader/afm/Times-Bold.afm +2588 -2588
  17. data/lib/pdf/reader/afm/Times-BoldItalic.afm +2384 -2384
  18. data/lib/pdf/reader/afm/Times-Italic.afm +2667 -2667
  19. data/lib/pdf/reader/afm/Times-Roman.afm +2419 -2419
  20. data/lib/pdf/reader/afm/ZapfDingbats.afm +225 -225
  21. data/lib/pdf/reader/buffer.rb +62 -21
  22. data/lib/pdf/reader/encoding.rb +1 -1
  23. data/lib/pdf/reader/error.rb +3 -3
  24. data/lib/pdf/reader/filter/ascii85.rb +5 -1
  25. data/lib/pdf/reader/filter/depredict.rb +3 -3
  26. data/lib/pdf/reader/filter/flate.rb +28 -16
  27. data/lib/pdf/reader/font.rb +3 -1
  28. data/lib/pdf/reader/glyph_hash.rb +15 -9
  29. data/lib/pdf/reader/glyphlist-zapfdingbats.txt +245 -0
  30. data/lib/pdf/reader/object_hash.rb +3 -1
  31. data/lib/pdf/reader/orientation_detector.rb +2 -2
  32. data/lib/pdf/reader/page.rb +28 -0
  33. data/lib/pdf/reader/page_layout.rb +19 -13
  34. data/lib/pdf/reader/page_state.rb +7 -5
  35. data/lib/pdf/reader/page_text_receiver.rb +22 -1
  36. data/lib/pdf/reader/parser.rb +8 -6
  37. data/lib/pdf/reader/width_calculator/built_in.rb +7 -15
  38. data/lib/pdf/reader/xref.rb +6 -1
  39. data/lib/pdf/reader/zero_width_runs_filter.rb +11 -0
  40. metadata +17 -14
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: c1acf8110733b6aff447e40353cc7d847e6edbb9bad016beab35bebe191bc91a
- data.tar.gz: 1e50a894289d9c4f8df83bcf55e1f0bfe9e7c3088365705bea9b253a34a9254e
+ metadata.gz: ccc4d14f5820ca798f6eafa1c0978207759ec1668c6f6307acb7cd43bcd0626e
+ data.tar.gz: 466bfe0a91f57463a56d9697ccd2529f981c6917e4ed578b4103f2bc87065522
  SHA512:
- metadata.gz: 473d030dd4e12e6aba2e037fe7b73410e99f14112ddd7823483056281cdd89e6865efb1f02866760038f892b7d60f153db8c88643d73035ee354e43d1cf047c5
- data.tar.gz: 30d842ae8260a8a2005b484c9550ae04a77e23b17e36b324e36e6d6faa4a7e0f83ebdab7cc69b504e143740ebcb9fad341e5ca8c0b0b6339fa826634f94996e2
+ metadata.gz: 45d6c16b3d9ed029e6eb5a45cc64aa95e7ada2950e052053cbe0b6f5aae632f824a86f0505a5cee660abd1cd896177a0637a2f2f5a3f3633e829e8d46fb59817
+ data.tar.gz: e3e566344bd5560387577597dea20b2f7da40aed2a7fa8b8d074c0742486db59d7e349f6c38c91c8dcd9b0a8cf2aa4c19a00d0ee097003449504b3f06f18ca3c
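The values in `checksums.yaml` are hex digests of the `metadata.gz` and `data.tar.gz` archives packed inside the `.gem` file. As a rough sketch, digests in this form could be recomputed and compared locally with Ruby's standard `digest` library. The helper names `archive_digests` and `digest_matches?` are made up for illustration and are not part of pdf-reader or RubyGems; the sketch also assumes the archive has already been extracted from the gem.

```ruby
require "digest"

# Hypothetical helper: returns the SHA256 and SHA512 hex digests of a file,
# keyed the same way as the top-level sections of checksums.yaml.
def archive_digests(path)
  {
    "SHA256" => Digest::SHA256.file(path).hexdigest,
    "SHA512" => Digest::SHA512.file(path).hexdigest,
  }
end

# Hypothetical helper: true when the digest computed for the file equals
# the hex string recorded in checksums.yaml for that algorithm.
def digest_matches?(path, algorithm, expected_hex)
  archive_digests(path).fetch(algorithm) == expected_hex
end
```

For example, `digest_matches?("data.tar.gz", "SHA256", "466bfe0a...")` would be called with the full 64-character value from the `+` line above.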
data/CHANGELOG CHANGED
@@ -1,3 +1,34 @@
+ v2.6.0 (12th November 2021)
+ - Text extraction improvements
+   - Improved text layout on pages with a variety of font sizes (http://github.com/yob/pdf-reader/pull/355)
+   - Fixed text positioning for some rotated pages (http://github.com/yob/pdf-reader/pull/356)
+   - Improved character width calculation for PDFs using built-in (non-embedded) ZapfDingbats (http://github.com/yob/pdf-reader/pull/373)
+   - Skip zero-width characters (http://github.com/yob/pdf-reader/pull/372)
+ - Performance improvements
+   - Reduced memory pressure when decoding TIFF images (http://github.com/yob/pdf-reader/pull/360)
+   - Optional dependency on ascii85_native gem for faster processing of files using the ascii85 filter (http://github.com/yob/pdf-reader/pull/359)
+ - Successfully parse more files
+   - Gracefully handle some non-spec compliant CR/LF issues (http://github.com/yob/pdf-reader/pull/364)
+   - Fix parsing of some escape sequences in content streams (http://github.com/yob/pdf-reader/pull/368)
+   - Increase the amount of junk bytes we detect and skip at the end of a file (382)
+   - Ignore "/Prev 0" in trailers (http://github.com/yob/pdf-reader/pull/383)
+   - Fix parsing of some inline images (BI ID EI tokens) (http://github.com/yob/pdf-reader/pull/389)
+   - Gracefully handle some xref tables that incorrectly start with 1 (http://github.com/yob/pdf-reader/pull/384)
+
+ v2.5.0 (6th June 2021)
+ - bump minimum ruby version to 2.0
+ - Correctly handle transcoding to UTF-8 from some fonts that use a difference table [#344](https://github.com/yob/pdf-reader/pull/344/)
+ - Fix some character spacing issues with the TJ operator [#343](https://github.com/yob/pdf-reader/pull/343)
+ - Fix crash with some encrypted PDFs [#348](https://github.com/yob/pdf-reader/pull/348/)
+ - Fix positions of text on some PDFs with pages rotated 90° [#350](https://github.com/yob/pdf-reader/pull/350/)
+
+ v2.4.2 (28th January 2021)
+ - relax ASCII85 dependency to allow 1.x
+ - improved support for decompressing objects with slightly malformed zlib data
+
+ v2.4.1 (24th September 2020)
+ - Re-vendor font metrics from Adobe to clarify their license
+
  v2.4.0 (21st November 2019)
  - Optimise overlapping characters code introduced in 2.3.0. Text extraction of pages with
  thousands of characters is still slower than it was in 2.2.1, but it might be tolerable
data/README.md CHANGED
@@ -166,6 +166,19 @@ http://groups.google.com/group/pdf-reader
  The easiest way to explain how this works in practice is to show some examples.
  Check out the examples/ directory for a few files.

+ # Alternate Decoder
+
+ For PDF files containing Ascii85 streams, the [ascii85_native](https://github.com/AnomalousBit/ascii85_native) gem can be used for increased performance. If the ascii85_native gem is detected, pdf-reader will automatically use the gem.
+
+ First, run `gem install ascii85_native` and then require the gem alongside pdf-reader:
+
+ ```ruby
+ require "pdf-reader"
+ require "ascii85_native"
+ ```
+
+ Another way of enabling native Ascii85 decoding is to place `gem 'ascii85_native'` in your project's `Gemfile`.
+
  # Known Limitations

  Occasionally some text cannot be extracted properly due to the way it has been
@@ -176,8 +189,10 @@ little UTF-8 friendly box to indicate an unrecognisable character.

  * PDF::Reader Code Repository: http://github.com/yob/pdf-reader

- * PDF Specification: http://www.adobe.com/devnet/pdf/pdf_reference.html
+ * PDF Specification: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf
+
+ * Adobe PDF Developer Resources: http://www.adobe.com/devnet/pdf/pdf_reference.html

- * PDF Tutorial Slide Presentations: http://home.comcast.net/~jk05/presentations/PDFTutorials.html
+ * PDF Tutorial Slide Presentations: https://web.archive.org/web/20150110042057/http://home.comcast.net/~jk05/presentations/PDFTutorials.html

  * Developing with PDF (book): http://shop.oreilly.com/product/0636920025269.do
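The new README section says the native decoder is used automatically when the gem is detected. A minimal sketch of that kind of optional-dependency detection follows. It is not pdf-reader's actual code: `FallbackAscii85` and `ascii85_decoder` are hypothetical names, and the sketch assumes that `ascii85_native` defines a top-level `Ascii85Native` constant with a `decode` method.

```ruby
# Try to load the optional native decoder; a missing gem is not an error.
begin
  require "ascii85_native"
rescue LoadError
  # Gem not installed: the pure-Ruby path below is used instead.
end

# Hypothetical stand-in for a pure-Ruby decoder (illustration only).
module FallbackAscii85
  def self.decode(data)
    # Drop the optional <~ ~> delimiters and any whitespace.
    body = data.sub(/\A<~/, "").sub(/~>\z/, "").gsub(/\s+/, "")
    body = body.gsub("z", "!!!!!") # "z" abbreviates a group of four zero bytes
    out = String.new               # binary (ASCII-8BIT) buffer
    body.scan(/.{1,5}/).each do |group|
      pad = 5 - group.length       # a short final group is padded with "u"
      value = (group + "u" * pad).each_byte.reduce(0) { |acc, b| acc * 85 + (b - 33) }
      out << [value].pack("N")[0, 4 - pad]
    end
    out
  end
end

# Pick the decoder based on whether the native constant was defined.
# Assumption: the gem exposes ::Ascii85Native with a compatible decode method.
def ascii85_decoder
  defined?(::Ascii85Native) ? ::Ascii85Native : FallbackAscii85
end
```

Callers would then write `ascii85_decoder.decode(stream)` without caring which implementation is present, which is why the README can describe the switch as automatic.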
data/Rakefile CHANGED
@@ -14,7 +14,7 @@ desc "Run cane to check quality metrics"
  Cane::RakeTask.new(:quality) do |cane|
  cane.abc_max = 20
  cane.style_measure = 100
- cane.max_violations = 31
+ cane.max_violations = 32

  cane.use Morecane::EncodingCheck, :encoding_glob => "{app,lib,spec}/**/*.rb"
  end
data/examples/extract_fonts.rb CHANGED
@@ -17,8 +17,8 @@ module ExtractFonts
  return count if page.fonts.nil? || page.fonts.empty?

  page.fonts.each do |label, font|
- next if complete_refs[font]
- complete_refs[font] = true
+ next if complete_refs[label]
+ complete_refs[label] = true

  process_font(page, font)

@@ -39,7 +39,7 @@ module ExtractFonts
  when :TrueType, :CIDFontType2 then
  ExtractFonts::TTF.new(page.objects, font).save("#{font[:BaseFont]}.ttf")
  else
- $stderr.puts "unsupported font type #{font[:Subtype]}"
+ $stderr.puts "unsupported font type #{font[:Subtype]} for #{font[:BaseFont]}"
  end
  end

@@ -68,10 +68,15 @@ module ExtractFonts
  end
  end

- filename = File.expand_path(File.dirname(__FILE__)) + "/../spec/data/cairo-unicode.pdf"
+ if ARGV.size == 0 # default file name
+ ARGV << File.expand_path(File.join(File.dirname(__dir__), "spec", "data", "cairo-unicode.pdf"))
+ end
+
  extractor = ExtractFonts::Extractor.new

- PDF::Reader.open(filename) do |reader|
- page = reader.page(1)
- extractor.page(page)
+ ARGV.each do |arg|
+ PDF::Reader.open(arg) do |reader|
+ page = reader.page(1)
+ extractor.page(page)
+ end
  end
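The last hunk lets the example script take PDF paths from the command line and fall back to the bundled sample file when none are given. A minimal sketch of that pattern in isolation is shown below. The method name `input_paths` is hypothetical, and a caller-supplied directory stands in for `__dir__` so the logic can be exercised without the script itself.

```ruby
# Hypothetical helper mirroring the fallback in the hunk above: use the paths
# given on the command line, or a single default file when there are none.
def input_paths(argv, examples_dir)
  return argv.dup unless argv.empty?

  # examples_dir plays the role of __dir__; its parent is the project root.
  project_root = File.dirname(examples_dir)
  [File.expand_path(File.join(project_root, "spec", "data", "cairo-unicode.pdf"))]
end

# With no arguments the default sample is selected; otherwise every
# argument is kept, in order.
input_paths([], "/tmp/pdf-reader/examples").each do |path|
  puts "would process #{path}"
end
```

Unlike the script, this sketch returns a new array instead of appending to `ARGV`, which keeps the helper free of side effects.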