pdf-reader 2.12.0 → 2.14.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: e3b00946c8b23b65d19ace187550b15bb3fd2537e518c778f4c12da28672c9d8
-   data.tar.gz: 4c2ebeb19dada9f257fa65c2add2f2f6d64f011cb13e997533a4b63fc81baa6d
+   metadata.gz: 7174f6e8c3c655cc9a1c120e5f0b99d06b0c2355803480d1cf8347c1825ddb01
+   data.tar.gz: ade7c031fe3c3d6e022125ccd3ef65a9482e21a7d90e6ea9d9d65aeeee3b30e8
  SHA512:
-   metadata.gz: 99c9ac879424056221f616d7f7299d03dfc9906c6b81c333ad255439780cf56d2dfc0c31a62347a7a163bcdb4075f8d0c914e2deeebb5d78e8ebc34e19cd7abc
-   data.tar.gz: 50ef8b5e1061dd1d6b24a7727b5537664bcb22473757274b4cc2b92c89b9ba5ea7516f055571f5c8b72d678f7cef549858631408c86a6984196ba7d1773daaca
+   metadata.gz: '08d343015a23dd678264053ade37a3449c91bdfda9764c65bc6ae196529062c33272ffe191beaea9f1e20d31cb1de9c117535e27fee763968e822288324931c6'
+   data.tar.gz: 8b0df463cc6292048f0ad68dd682131d53a4bc9630f84841103f8c73a6c837e91cdf5af08f98c255bcefc7d2a67146387d26ab0c3cbece42853f2581981a613b
data/CHANGELOG CHANGED
@@ -1,3 +1,7 @@
+ v2.13.0 (2nd November 2024)
+ - Permit Ascii86 v1.0 and v2.0 (https://github.com/yob/pdf-reader/pull/539)
+ - Allow StringIO type for PDF::Reader input (https://github.com/yob/pdf-reader/pull/535)
+
  v2.12.0 (26th December 2023)
  - Fix a sorbet method signature (http://github.com/yob/pdf-reader/pull/512)
  - Reduce allocations when parsing PDFs with hex strings (http://github.com/yob/pdf-reader/pull/528)
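
The v2.13.0 entry above lists StringIO as an accepted input type. The following is a minimal sketch of preparing such an input from in-memory bytes. The `pdf_bytes` value is a placeholder (only a PDF header, not a complete document), and the `PDF::Reader.new(io)` call is left as a comment so the snippet runs without the gem installed.

```ruby
require 'stringio'

# Placeholder bytes for illustration: a real caller would hold a complete PDF,
# for example the result of File.binread or an HTTP response body.
pdf_bytes = "%PDF-1.4\n".b # String#b returns a binary (ASCII-8BIT) copy

# Wrap the bytes in an in-memory, IO-like object.
io = StringIO.new(pdf_bytes)

# Peek at the header, then rewind so a later consumer starts at byte 0.
header = io.read(5)
io.rewind

# With the gem loaded, the same object could then be handed over:
#   reader = PDF::Reader.new(io)
#   puts reader.page_count
```

Keeping the string in binary encoding mirrors the README's advice to open files in "rb" mode.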
data/README.md CHANGED
@@ -20,7 +20,7 @@ page.
  The recommended installation method is via Rubygems.

  ```ruby
- gem install pdf-reader
+ gem install pdf-reader
  ```

  # Usage
@@ -30,23 +30,23 @@ level information (metadata, page count, bookmarks, etc) is available via
  this object.

  ```ruby
- reader = PDF::Reader.new("somefile.pdf")
+ reader = PDF::Reader.new("somefile.pdf")

- puts reader.pdf_version
- puts reader.info
- puts reader.metadata
- puts reader.page_count
+ puts reader.pdf_version
+ puts reader.info
+ puts reader.metadata
+ puts reader.page_count
  ```

  PDF::Reader.new accepts an IO stream or a filename. Here's an example with
  an IO stream:

  ```ruby
- require 'open-uri'
+ require 'open-uri'

- io = open('http://example.com/somefile.pdf')
- reader = PDF::Reader.new(io)
- puts reader.info
+ io = open('http://example.com/somefile.pdf')
+ reader = PDF::Reader.new(io)
+ puts reader.info
  ```

  If you open a PDF with File#open or IO#open, I strongly recommend using "rb"
@@ -54,47 +54,47 @@ mode to ensure the file isn't mangled by ruby being 'helpful'. This is
  particularly important on windows and MRI >= 1.9.2.

  ```ruby
- File.open("somefile.pdf", "rb") do |io|
-   reader = PDF::Reader.new(io)
-   puts reader.info
- end
+ File.open("somefile.pdf", "rb") do |io|
+   reader = PDF::Reader.new(io)
+   puts reader.info
+ end
  ```

  PDF is a page based file format, so most visible information is available via
  page-based iteration

  ```ruby
- reader = PDF::Reader.new("somefile.pdf")
+ reader = PDF::Reader.new("somefile.pdf")

- reader.pages.each do |page|
-   puts page.fonts
-   puts page.text
-   puts page.raw_content
- end
+ reader.pages.each do |page|
+   puts page.fonts
+   puts page.text
+   puts page.raw_content
+ end
  ```

  If you need to access the full program for rendering a page, use the walk() method
  of PDF::Reader::Page.

  ```ruby
- class RedGreenBlue
-   def set_rgb_color_for_nonstroking(r, g, b)
-     puts "R: #{r}, G: #{g}, B: #{b}"
-   end
- end
-
- reader = PDF::Reader.new("somefile.pdf")
- page = reader.page(1)
- receiver = RedGreenBlue.new
- page.walk(receiver)
+ class RedGreenBlue
+   def set_rgb_color_for_nonstroking(r, g, b)
+     puts "R: #{r}, G: #{g}, B: #{b}"
+   end
+ end
+
+ reader = PDF::Reader.new("somefile.pdf")
+ page = reader.page(1)
+ receiver = RedGreenBlue.new
+ page.walk(receiver)
  ```

  For low level access to the objects in a PDF file, use the ObjectHash class like
  so:

  ```ruby
- reader = PDF::Reader.new("somefile.pdf")
- puts reader.objects.inspect
+ reader = PDF::Reader.new("somefile.pdf")
+ puts reader.objects.inspect
  ```

  # Text Encoding
@@ -141,7 +141,7 @@ the spec folder when you checkout a branch from Git.
  To remove any invalid CRLF characters added while checking out a branch from Git, run:

  ```ruby
- rake fix_integrity
+ rake fix_integrity
  ```

  # Maintainers
data/Rakefile CHANGED
@@ -41,7 +41,7 @@ end
  desc "Create a YAML file of integrity info for PDFs in the spec suite"
  task :integrity_yaml do
    data = {}
-   Dir.glob("spec/data/**/*.*").sort.each do |path|
+   Dir.glob("spec/data/**/*.pdf").sort.each do |path|
      path_without_spec = path.gsub("spec/","")
      data[path_without_spec] = {
        :bytes => File.size(path),
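
The Rakefile change above narrows the glob from `*.*` to `*.pdf`. As a rough illustration of the effect, the sketch below builds a throwaway directory tree and runs both patterns against it. The file names (`a.pdf`, `b.pdf`, `notes.txt`) are hypothetical and are not taken from the gem's spec suite.

```ruby
require 'tmpdir'
require 'fileutils'

old_matches = nil
new_matches = nil

Dir.mktmpdir do |root|
  # Hypothetical layout: one PDF at the top level, one PDF and one
  # non-PDF file in a nested directory.
  nested = File.join(root, "spec", "data", "nested")
  FileUtils.mkdir_p(nested)
  File.write(File.join(root, "spec", "data", "b.pdf"), "x")
  File.write(File.join(nested, "a.pdf"), "x")
  File.write(File.join(nested, "notes.txt"), "x")

  Dir.chdir(root) do
    # Old pattern: any file name containing a dot, at any depth.
    old_matches = Dir.glob("spec/data/**/*.*").sort
    # New pattern: only names ending in .pdf, at any depth.
    new_matches = Dir.glob("spec/data/**/*.pdf").sort
  end
end

puts "old: #{old_matches.inspect}"
puts "new: #{new_matches.inspect}"
```

Under these assumptions, only the old pattern picks up `notes.txt`, so the integrity YAML generated by the task would no longer include non-PDF files.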
data/lib/pdf/reader/advanced_text_run_filter.rb ADDED
@@ -0,0 +1,137 @@
+ # coding: utf-8
+ # frozen_string_literal: true
+ # typed: strict
+
+ class PDF::Reader
+   # Filter a collection of TextRun objects based on a set of conditions.
+   # It can be used to filter text runs based on their attributes.
+   # The filter can return the text runs that matches the conditions (only) or
+   # the text runs that do not match the conditions (exclude).
+   #
+   # You can filter the text runs based on all its attributes with the operators
+   # mentioned in VALID_OPERATORS.
+   # The filter can be nested with 'or' and 'and' conditions.
+   #
+   # Examples:
+   #   1. Single condition
+   #   AdvancedTextRunFilter.exclude(text_runs, text: { include: 'sample' })
+   #
+   #   2. Multiple conditions (and)
+   #   AdvancedTextRunFilter.exclude(text_runs, {
+   #     font_size: { greater_than: 10, less_than: 15 }
+   #   })
+   #
+   #   3. Multiple possible values (or)
+   #   AdvancedTextRunFilter.exclude(text_runs, {
+   #     font_size: { equal: [10, 12] }
+   #   })
+   #
+   #   4. Complex AND/OR filter
+   #   AdvancedTextRunFilter.exclude(text_runs, {
+   #     and: [
+   #       { font_size: { greater_than: 10 } },
+   #       { or: [
+   #         { text: { include: "sample" } },
+   #         { width: { greater_than: 100 } }
+   #       ]}
+   #     ]
+   #   })
+   class AdvancedTextRunFilter
+     VALID_OPERATORS = %i[
+       equal
+       not_equal
+       greater_than
+       less_than
+       greater_than_or_equal
+       less_than_or_equal
+       include
+       exclude
+     ]
+
+     def self.only(text_runs, filter_hash)
+       new(text_runs, filter_hash).only
+     end
+
+     def self.exclude(text_runs, filter_hash)
+       new(text_runs, filter_hash).exclude
+     end
+
+     attr_reader :text_runs, :filter_hash
+
+     def initialize(text_runs, filter_hash)
+       @text_runs = text_runs
+       @filter_hash = filter_hash
+     end
+
+     def only
+       return text_runs if filter_hash.empty?
+       text_runs.select { |text_run| evaluate_filter(text_run) }
+     end
+
+     def exclude
+       return text_runs if filter_hash.empty?
+       text_runs.reject { |text_run| evaluate_filter(text_run) }
+     end
+
+     private
+
+     def evaluate_filter(text_run)
+       if filter_hash[:or]
+         evaluate_or_filters(text_run, filter_hash[:or])
+       elsif filter_hash[:and]
+         evaluate_and_filters(text_run, filter_hash[:and])
+       else
+         evaluate_filters(text_run, filter_hash)
+       end
+     end
+
+     def evaluate_or_filters(text_run, conditions)
+       conditions.any? do |condition|
+         evaluate_filters(text_run, condition)
+       end
+     end
+
+     def evaluate_and_filters(text_run, conditions)
+       conditions.all? do |condition|
+         evaluate_filters(text_run, condition)
+       end
+     end
+
+     def evaluate_filters(text_run, filter_hash)
+       filter_hash.all? do |attribute, conditions|
+         evaluate_attribute_conditions(text_run, attribute, conditions)
+       end
+     end
+
+     def evaluate_attribute_conditions(text_run, attribute, conditions)
+       conditions.all? do |operator, value|
+         unless VALID_OPERATORS.include?(operator)
+           raise ArgumentError, "Invalid operator: #{operator}"
+         end
+
+         apply_operator(text_run.send(attribute), operator, value)
+       end
+     end
+
+     def apply_operator(attribute_value, operator, filter_value)
+       case operator
+       when :equal
+         Array(filter_value).include?(attribute_value)
+       when :not_equal
+         !Array(filter_value).include?(attribute_value)
+       when :greater_than
+         attribute_value > filter_value
+       when :less_than
+         attribute_value < filter_value
+       when :greater_than_or_equal
+         attribute_value >= filter_value
+       when :less_than_or_equal
+         attribute_value <= filter_value
+       when :include
+         Array(filter_value).any? { |v| attribute_value.to_s.include?(v.to_s) }
+       when :exclude
+         Array(filter_value).none? { |v| attribute_value.to_s.include?(v.to_s) }
+       end
+     end
+   end
+ end
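
To make the flat filter-hash semantics in the new file easier to follow, here is a small standalone sketch that does not load the gem. `Run` is a hypothetical stand-in for `PDF::Reader::TextRun` with only three attributes, and `OPERATORS` and `matches?` are illustrative names that restate the operator table from the diff in compact form. It is not the gem's implementation and does not handle `and:`/`or:` keys.

```ruby
# Hypothetical stand-in for PDF::Reader::TextRun (only the attributes used here).
Run = Struct.new(:text, :font_size, :width)

# Compact restatement of the operators listed in VALID_OPERATORS.
OPERATORS = {
  equal:                 ->(a, v) { Array(v).include?(a) },
  not_equal:             ->(a, v) { !Array(v).include?(a) },
  greater_than:          ->(a, v) { a > v },
  less_than:             ->(a, v) { a < v },
  greater_than_or_equal: ->(a, v) { a >= v },
  less_than_or_equal:    ->(a, v) { a <= v },
  include:               ->(a, v) { Array(v).any?  { |s| a.to_s.include?(s.to_s) } },
  exclude:               ->(a, v) { Array(v).none? { |s| a.to_s.include?(s.to_s) } }
}.freeze

# True when the run satisfies every operator of every attribute in the hash.
def matches?(run, filter)
  filter.all? do |attribute, conditions|
    conditions.all? do |operator, value|
      check = OPERATORS.fetch(operator) do
        raise ArgumentError, "Invalid operator: #{operator}"
      end
      check.call(run.public_send(attribute), value)
    end
  end
end

runs = [
  Run.new("sample heading", 14, 120),
  Run.new("body text", 10, 300),
  Run.new("footnote", 8, 80)
]

# "only" style: keep runs whose font size lies strictly between 10 and 15.
only_mid = runs.select { |r| matches?(r, { font_size: { greater_than: 10, less_than: 15 } }) }

# "exclude" style: drop runs whose text contains "sample".
no_sample = runs.reject { |r| matches?(r, { text: { include: "sample" } }) }

# An array under :equal acts as "any of these values".
either_size = runs.select { |r| matches?(r, { font_size: { equal: [8, 10] } }) }

puts only_mid.map(&:text).inspect
puts no_sample.map(&:text).inspect
puts either_size.map(&:text).inspect
```

Several operators under one attribute combine as a logical AND, while an array value under `equal` or `include` behaves as a logical OR over its elements.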
@@ -1,5 +1,5 @@
  # coding: ASCII-8BIT
- # typed: true
+ # typed: strict
  # frozen_string_literal: true

  ################################################################################
@@ -52,7 +52,7 @@ class PDF::Reader
      CR = "\r"
      LF = "\n"
      CRLF = "\r\n"
-     WHITE_SPACE = [LF, CR, ' ']
+     WHITE_SPACE = ["\n", "\r", ' ']

      # Quite a few PDFs have trailing junk.
      # This can be several k of nuls in some cases
@@ -1,5 +1,5 @@
  # coding: utf-8
- # typed: true
+ # typed: strict
  # frozen_string_literal: true

  ################################################################################
@@ -68,6 +68,14 @@ module PDF
            runs = merge_runs(runs)
          end

+         if (only_filter = opts.fetch(:only, nil))
+           runs = AdvancedTextRunFilter.only(runs, only_filter)
+         end
+
+         if (exclude_filter = opts.fetch(:exclude, nil))
+           runs = AdvancedTextRunFilter.exclude(runs, exclude_filter)
+         end
+
          runs
        end

data/lib/pdf/reader.rb CHANGED
@@ -280,6 +280,7 @@ end
  ################################################################################

  require 'pdf/reader/resources'
+ require 'pdf/reader/advanced_text_run_filter'
  require 'pdf/reader/buffer'
  require 'pdf/reader/bounding_rectangle_runs_filter'
  require 'pdf/reader/cid_widths'
data/rbi/pdf-reader.rbi CHANGED
@@ -4,7 +4,7 @@ module PDF
      sig { returns(PDF::Reader::ObjectHash) }
      attr_reader :objects

-     sig { params(input: T.any(String, Tempfile, IO), opts: T::Hash[T.untyped, T.untyped]).void }
+     sig { params(input: T.any(String, Tempfile, IO, StringIO), opts: T::Hash[T.untyped, T.untyped]).void }
      def initialize(input, opts = {})
        @cache = T.let(T.unsafe(nil), PDF::Reader::ObjectCache)
        @objects = T.let(T.unsafe(nil), PDF::Reader::ObjectHash)
@@ -75,20 +75,20 @@ module PDF
      end

      class Buffer
-       TOKEN_WHITESPACE = T.let(T.unsafe(nil), T::Array[Integer])
-       TOKEN_DELIMITER = T.let(T.unsafe(nil), T::Array[Integer])
-       LEFT_PAREN = T.let(T.unsafe(nil), String)
-       LESS_THAN = T.let(T.unsafe(nil), String)
-       STREAM = T.let(T.unsafe(nil), String)
-       ID = T.let(T.unsafe(nil), String)
-       FWD_SLASH = T.let(T.unsafe(nil), String)
-       NULL_BYTE = T.let(T.unsafe(nil), String)
-       CR = T.let(T.unsafe(nil), String)
-       LF = T.let(T.unsafe(nil), String)
-       CRLF = T.let(T.unsafe(nil), String)
-       WHITE_SPACE = T.let(T.unsafe(nil), T::Array[String])
-       TRAILING_BYTECOUNT = T.let(T.unsafe(nil), Integer)
-       DIGITS_ONLY = T.let(T.unsafe(nil), Regexp)
+       TOKEN_WHITESPACE = T.let(T::Array[String])
+       TOKEN_DELIMITER = T.let(T::Array[Integer])
+       LEFT_PAREN = T.let(String)
+       LESS_THAN = T.let(String)
+       STREAM = T.let(String)
+       ID = T.let(String)
+       FWD_SLASH = T.let(String)
+       NULL_BYTE = T.let(String)
+       CR = T.let(String)
+       LF = T.let(String)
+       CRLF = T.let(String)
+       WHITE_SPACE = T.let(T::Array[String])
+       TRAILING_BYTECOUNT = T.let(Integer)
+       DIGITS_ONLY = T.let(Regexp)

        sig { returns(Integer) }
        attr_reader :pos
@@ -851,6 +851,52 @@ module PDF
        def self.exclude_empty_strings(runs); end
      end

+     class AdvancedTextRunFilter
+       VALID_OPERATORS = T.let(T::Array[Symbol])
+
+       sig { params(text_runs: T::Array[TextRun], filter_hash: T::Hash[Symbol, T.untyped]).returns(T::Array[TextRun]) }
+       def self.only(text_runs, filter_hash); end
+
+       sig { params(text_runs: T::Array[TextRun], filter_hash: T::Hash[Symbol, T.untyped]).returns(T::Array[TextRun]) }
+       def self.exclude(text_runs, filter_hash); end
+
+       sig { returns(T::Array[TextRun]) }
+       attr_reader :text_runs
+
+       sig { returns(T::Hash[Symbol, T.untyped]) }
+       attr_reader :filter_hash
+
+       sig { params(text_runs: T::Array[TextRun], filter_hash: T::Hash[Symbol, T.untyped]).void }
+       def initialize(text_runs, filter_hash)
+         @text_runs = T.let(T.unsafe(nil), T::Array[TextRun])
+         @filter_hash = T.let(T.unsafe(nil), T::Hash[Symbol, T.untyped])
+       end
+
+       sig { returns(T::Array[TextRun]) }
+       def only; end
+
+       sig { returns(T::Array[TextRun]) }
+       def exclude; end
+
+       sig { params(text_run: TextRun).returns(T::Boolean) }
+       def evaluate_filter(text_run); end
+
+       sig { params(text_run: TextRun, conditions: T::Array[T::Hash[Symbol, T.untyped]]).returns(T::Boolean) }
+       def evaluate_or_filters(text_run, conditions); end
+
+       sig { params(text_run: TextRun, conditions: T::Array[T::Hash[Symbol, T.untyped]]).returns(T::Boolean) }
+       def evaluate_and_filters(text_run, conditions); end
+
+       sig { params(text_run: TextRun, filter_hash: T::Hash[Symbol, T.untyped]).returns(T::Boolean) }
+       def evaluate_filters(text_run, filter_hash); end
+
+       sig { params(text_run: TextRun, attribute: Symbol, conditions: T::Hash[Symbol, T.untyped]).returns(T::Boolean) }
+       def evaluate_attribute_conditions(text_run, attribute, conditions); end
+
+       sig { params(attribute_value: T.untyped, operator: Symbol, filter_value: T.untyped).returns(T::Boolean) }
+       def apply_operator(attribute_value, operator, filter_value); end
+     end
+
      class EventPoint
        sig { returns(Numeric) }
        attr_reader :x
metadata CHANGED
@@ -1,14 +1,13 @@
  --- !ruby/object:Gem::Specification
  name: pdf-reader
  version: !ruby/object:Gem::Version
-   version: 2.12.0
+   version: 2.14.0
  platform: ruby
  authors:
  - James Healy
- autorequire:
  bindir: bin
  cert_chain: []
- date: 2023-12-26 00:00:00.000000000 Z
+ date: 2025-01-29 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: rake
@@ -98,16 +97,28 @@ dependencies:
    name: Ascii85
    requirement: !ruby/object:Gem::Requirement
      requirements:
-     - - "~>"
+     - - ">="
        - !ruby/object:Gem::Version
          version: '1.0'
+     - - "<"
+       - !ruby/object:Gem::Version
+         version: '3.0'
+     - - "!="
+       - !ruby/object:Gem::Version
+         version: 2.0.0
    type: :runtime
    prerelease: false
    version_requirements: !ruby/object:Gem::Requirement
      requirements:
-     - - "~>"
+     - - ">="
        - !ruby/object:Gem::Version
          version: '1.0'
+     - - "<"
+       - !ruby/object:Gem::Version
+         version: '3.0'
+     - - "!="
+       - !ruby/object:Gem::Version
+         version: 2.0.0
  - !ruby/object:Gem::Dependency
    name: ruby-rc4
    requirement: !ruby/object:Gem::Requirement
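
The Ascii85 constraint in the hunk above moves from `~> 1.0` to `>= 1.0`, `< 3.0`, `!= 2.0.0`. One way to see what that admits is to evaluate both requirements with RubyGems' own `Gem::Requirement`. The candidate version strings below are chosen only to exercise the boundaries and are not a claim about which Ascii85 releases exist.

```ruby
require 'rubygems'

# Constraint recorded for 2.12.0 and the one recorded for 2.14.0.
old_req = Gem::Requirement.new("~> 1.0")
new_req = Gem::Requirement.new(">= 1.0", "< 3.0", "!= 2.0.0")

# Illustrative candidates around each boundary of the two constraints.
candidates = %w[0.9.0 1.0.0 1.1.1 2.0.0 2.0.1 3.0.0].map { |v| Gem::Version.new(v) }

old_ok = candidates.select { |v| old_req.satisfied_by?(v) }.map(&:to_s)
new_ok = candidates.select { |v| new_req.satisfied_by?(v) }.map(&:to_s)

puts "old constraint accepts: #{old_ok.join(', ')}"
puts "new constraint accepts: #{new_ok.join(', ')}"
```

The new requirement admits the 2.x series while skipping exactly 2.0.0, which the pessimistic `~> 1.0` constraint could not express.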
@@ -200,6 +211,7 @@ files:
  - examples/version.rb
  - lib/pdf-reader.rb
  - lib/pdf/reader.rb
+ - lib/pdf/reader/advanced_text_run_filter.rb
  - lib/pdf/reader/aes_v2_security_handler.rb
  - lib/pdf/reader/aes_v3_security_handler.rb
  - lib/pdf/reader/afm/Courier-Bold.afm
@@ -289,10 +301,9 @@ licenses:
  - MIT
  metadata:
    bug_tracker_uri: https://github.com/yob/pdf-reader/issues
-   changelog_uri: https://github.com/yob/pdf-reader/blob/v2.12.0/CHANGELOG
-   documentation_uri: https://www.rubydoc.info/gems/pdf-reader/2.12.0
-   source_code_uri: https://github.com/yob/pdf-reader/tree/v2.12.0
- post_install_message:
+   changelog_uri: https://github.com/yob/pdf-reader/blob/v2.14.0/CHANGELOG
+   documentation_uri: https://www.rubydoc.info/gems/pdf-reader/2.14.0
+   source_code_uri: https://github.com/yob/pdf-reader/tree/v2.14.0
  rdoc_options:
  - "--title"
  - PDF::Reader Documentation
@@ -305,15 +316,14 @@ required_ruby_version: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
-       version: '2.0'
+       version: '2.1'
  required_rubygems_version: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
        version: '0'
  requirements: []
- rubygems_version: 3.4.10
- signing_key:
+ rubygems_version: 3.6.2
  specification_version: 4
  summary: A library for accessing the content of PDF files
  test_files: []