metaclean 1.0.2 → 4.0.1

@@ -1,23 +1,41 @@
1
1
  # frozen_string_literal: true
2
2
 
3
- # ───────────────────────────────────────────────────────────────────────────
4
3
  # The "policy" module: which tools to run for which file, and what counts as
5
4
  # privacy-relevant if it survives a clean.
6
5
  #
7
6
  # Keeping this logic in its own file means the runner doesn't need to know
8
7
  # about formats — it just asks Strategy.tools_for(path) and runs whatever
9
8
  # comes back.
10
- # ───────────────────────────────────────────────────────────────────────────
11
9
 
12
10
  module Metaclean
13
11
  module Strategy
14
- # Tag GROUPS that almost always carry personally identifying info.
15
- # Survival of any tag in these groups raises a flag to the user.
16
- PRIVACY_GROUPS = %w[GPS MakerNotes XMP-dc XMP-photoshop IPTC ICC-header].freeze
12
+ # Group-name PREFIXES treated as privacy-bearing. Matching whole families by
13
+ # prefix keeps the residual check fail-closed instead of an exact allowlist
14
+ # that silently misses variants:
15
+ # GPS* — GPS plus any sub-group
16
+ # XMP-* — every XMP namespace (XMP-exif GPS, XMP-mwg-rs face/person
17
+ # names, XMP-xmpMM DocumentID, XMP-iptcExt, …)
18
+ # MakerNotes*, IPTC*
19
+ # IFD1 — the embedded thumbnail IFD; a surviving thumbnail can carry
20
+ # the original's full EXIF+GPS
21
+ # Over-flagging here is deliberate: for a privacy tool a false "still
22
+ # present" is far cheaper than a false "Cleaned". (ICC colour-profile groups
23
+ # are intentionally NOT flagged — a colour profile isn't PII; any genuinely
24
+ # identifying field such as Copyright is still caught by PRIVACY_TAGS below.)
25
+ PRIVACY_GROUP_PREFIXES = %w[GPS XMP- MakerNotes IPTC IFD1].freeze
26
+
27
+ # Formats ExifTool can't WRITE, so it leaves document-internal metadata only
28
+ # mat2's rebuild removes (and can't re-read to verify). If mat2 won't run for
29
+ # one of these, the runner warns coverage is reduced rather than reporting a
30
+ # confident "Cleaned". (PDF is NOT here: ExifTool writes PDF metadata and qpdf
31
+ # rebuilds the file, so PDF is fully handled and verifiable without mat2.)
32
+ MAT2_ESSENTIAL = %w[docx xlsx pptx odt ods odp odg odf epub].freeze
17
33
 
18
34
  # Specific tag NAMES (regardless of group) we never want to leak.
19
35
  # If exiftool reports e.g. "EXIF:Artist" we still flag it because of the
20
- # tag-name match, not the group.
36
+ # tag-name match, not the group. exiftool's `-all=` normally strips these,
37
+ # so this list is a fail-closed BACKSTOP: if any survive a strip we'd rather
38
+ # over-warn than report a confident "Cleaned".
21
39
  PRIVACY_TAGS = %w[
22
40
  Artist Author Creator Copyright Rights
23
41
  By-line By-lineTitle Credit Source Contact OwnerName
@@ -25,45 +43,70 @@ module Metaclean
25
43
  Software HostComputer ProcessingSoftware
26
44
  ImageDescription UserComment
27
45
  LastModifiedBy LastSavedBy LastAuthor
46
+ Make Model LensModel DateTimeOriginal CreateDate
47
+ Title Subject Keywords Description Category Producer Company Manager
48
+ CreationDate ModDate
49
+ XPAuthor XPComment XPSubject XPKeywords XPTitle Comment
28
50
  ].freeze
29
51
 
30
52
  # File extensions where mat2 is meaningfully stricter than ExifTool and
31
53
  # should run first. For other formats, ExifTool is the broader expert.
54
+ # (mkv/webm are NOT here — see FFMPEG_FORMATS; no mat2/ExifTool path writes
55
+ # Matroska.)
32
56
  MAT2_PREFERRED = %w[
33
- pdf docx xlsx pptx odt ods odp odg epub png svg
34
- mp4 avi mkv mov webm
57
+ docx xlsx pptx odt ods odp odg odf epub png svg
58
+ mp4 avi
35
59
  ].freeze
36
60
 
61
+ # Matroska containers. ExifTool is read-only for them and mat2 has no
62
+ # Matroska parser, so neither can strip mkv/webm. ffmpeg is the only tool in
63
+ # the set that can — it remuxes the container dropping all metadata while
64
+ # copying every stream verbatim (lossless, no re-encode).
65
+ FFMPEG_FORMATS = %w[mkv webm].freeze
66
+
67
+ # Raster formats mat2 cannot strip without DAMAGING the file: it rebuilds via
68
+ # Pillow, which recompresses JPEG/WebP (visible quality loss — a clean
69
+ # wallpaper drops ~65% in size with no metadata to remove) and downconverts
70
+ # TIFF (16-bit → 8-bit). ExifTool strips all of these completely and IN PLACE
71
+ # (pixels byte-identical), so ExifTool owns them and mat2 is skipped —
72
+ # cleaning metadata must never silently damage the file.
73
+ MAT2_DEGRADES = %w[jpg jpeg webp tif tiff].freeze
74
+
37
75
  module_function
38
76
 
39
77
  # Returns an ordered list of tool symbols (e.g. `[:mat2, :exiftool, :qpdf]`)
40
78
  # to run on `path`. The runner executes them in order; if one fails or
41
- # is skipped, the next still runs.
42
- #
43
- # `prefer:` is a hash of user opt-outs from the CLI flags
44
- # (--no-mat2, --exiftool-only, etc.). The pattern `prefer[:mat2] != false`
45
- # treats both `nil` (not set) and `true` as "use it" — only an explicit
46
- # `false` disables.
47
- def tools_for(path, prefer: {})
48
- ext = File.extname(path).downcase.delete('.')
79
+ # is skipped, the next still runs. The three tools are always used together
80
+ # for maximum coverage — there is no per-tool opt-out; a tool that isn't
81
+ # installed is simply left out (the `.available?`/`.supports?` checks).
82
+ def tools_for(path)
83
+ ext = Metaclean.ext_of(path)
49
84
  tools = []
50
85
 
51
86
  if ext == 'pdf'
52
- # PDFs benefit from all three, in this order:
53
- # mat2 → cleans the high-level metadata + content streams it knows
54
- # exiftool → strips the Info dictionary (Author, Title, Producer)
55
- # qpdf → rebuilds the file, dropping any unreferenced bits
56
- tools << :mat2 if prefer[:mat2] != false && Mat2.available?
57
- tools << :exiftool if prefer[:exiftool] != false
58
- tools << :qpdf if prefer[:qpdf] != false && Qpdf.available?
59
- elsif MAT2_PREFERRED.include?(ext) && prefer[:mat2] != false && Mat2.available?
87
+ # mat2 cleans PDFs by RASTERIZING every page (text becomes images): it destroys
88
+ # the text layer and balloons the file (~35×). So PDFs skip mat2 and use:
89
+ # exiftool → strips the Info dictionary + XMP (Author, Title, Producer)
90
+ # qpdf → rebuilds the file, dropping unreferenced objects / old revisions
91
+ # Both are lossless and leave the text intact. (PDF JS/macros are out of
92
+ # scope; see README.)
93
+ tools << :exiftool
94
+ tools << :qpdf if Qpdf.available?
95
+ elsif FFMPEG_FORMATS.include?(ext)
96
+ # Matroska (mkv/webm): ffmpeg is the ONLY tool that can clean these.
97
+ # ExifTool still re-reads the result afterwards, so the residual check
98
+ # (the false-clean backstop) is not blind.
99
+ tools << :ffmpeg if Ffmpeg.available?
100
+ elsif MAT2_PREFERRED.include?(ext) && Mat2.available?
60
101
  # Office docs, modern image/video containers — mat2 leads.
61
102
  tools << :mat2
62
- tools << :exiftool if prefer[:exiftool] != false
103
+ tools << :exiftool
63
104
  else
64
- # Everything else (JPEG, MP3, RAW, …) — ExifTool is the gold standard.
65
- tools << :exiftool if prefer[:exiftool] != false
66
- tools << :mat2 if prefer[:mat2] != false && Mat2.supports?(path)
105
+ # Everything else (JPEG, MP3, RAW, …) — ExifTool has the broadest coverage.
106
+ # mat2 still adds coverage for many, but NOT for rasters it would damage
107
+ # (MAT2_DEGRADES); there ExifTool's in-place strip is complete and lossless.
108
+ tools << :exiftool
109
+ tools << :mat2 if Mat2.supports?(path) && !MAT2_DEGRADES.include?(ext)
67
110
  end
68
111
 
69
112
  tools
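
The routing introduced in this hunk can be read as a small decision table. The sketch below restates it outside the gem so the branch order is easier to follow. It is not the gem's code: `sketch_tools_for`, the `available:` list and the `mat2_supported:` flag are hypothetical stand-ins for the `Mat2`/`Qpdf`/`Ffmpeg` `.available?` checks and `Mat2.supports?`, which are not shown in this diff.

```ruby
# Illustrative sketch of the 4.0.1 routing order; not the gem's implementation.
# `available` and `mat2_supported` are hypothetical stand-ins for the wrapper
# classes' `.available?` / `Mat2.supports?` checks.
MAT2_PREFERRED = %w[docx xlsx pptx odt ods odp odg odf epub png svg mp4 avi].freeze
FFMPEG_FORMATS = %w[mkv webm].freeze
MAT2_DEGRADES  = %w[jpg jpeg webp tif tiff].freeze

def sketch_tools_for(path, available: %i[mat2 qpdf ffmpeg], mat2_supported: true)
  # Same normalization as Metaclean.ext_of: lower-case, no leading dot.
  ext = File.extname(path.to_s).downcase.delete('.')

  if ext == 'pdf'
    # PDFs never go through mat2 (it would rasterize the pages).
    tools = [:exiftool]
    tools << :qpdf if available.include?(:qpdf)
    tools
  elsif FFMPEG_FORMATS.include?(ext)
    # Matroska: ffmpeg is the only writer in the set.
    available.include?(:ffmpeg) ? [:ffmpeg] : []
  elsif MAT2_PREFERRED.include?(ext) && available.include?(:mat2)
    %i[mat2 exiftool]
  else
    # ExifTool first; mat2 only where it would not damage the file.
    tools = [:exiftool]
    tools << :mat2 if mat2_supported && !MAT2_DEGRADES.include?(ext)
    tools
  end
end

p sketch_tools_for('report.PDF')
p sketch_tools_for('clip.mkv')
p sketch_tools_for('photo.jpg')
```

The branch order matters: `pdf` and the Matroska extensions are tested before `MAT2_PREFERRED`, so those formats cannot reach mat2 even if a later change adds them to that list by mistake.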
@@ -78,19 +121,55 @@ module Metaclean
78
121
  # "Artist" in EXIF). Combining the two keeps coverage broad without
79
122
  # having to enumerate every {group, tag} pair.
80
123
  def privacy_residual(meta)
81
- meta.reject { |k, _| k == 'SourceFile' }.select do |k, _|
82
- # ExifTool keys look like "GPS:GPSLatitude". Split on the first ":".
124
+ meta.select do |k, v|
125
+ # Skip SourceFile and the System/File/etc. groups: not user metadata.
126
+ next false unless Display.embedded_key?(k)
127
+
128
+ # ExifTool keys look like "GPS:GPSLatitude". Split on the first ":";
129
+ # no "Group:" prefix means the whole key is the tag name.
83
130
  group, tag = k.to_s.split(':', 2)
84
- # Skip System/File/etc.; those aren't user metadata.
85
- next false if Display::NON_METADATA_GROUPS.include?(group)
86
-
87
- if tag.nil?
88
- # No "Group:" prefix: the whole key is the tag name.
89
- PRIVACY_TAGS.include?(group.to_s)
90
- else
91
- PRIVACY_GROUPS.include?(group) || PRIVACY_TAGS.include?(tag)
92
- end
131
+ name = tag.nil? ? group.to_s : tag
132
+
133
+ # A zeroed/empty value is not a leak for un-removable container atoms like
134
+ # QuickTime:CreateDate (deletable only by zeroing, "0000:00:00 …") — without
135
+ # this every video would fail the gate on an already-zeroed date. GPS is the
136
+ # exception: 0,0 is a REAL location (Null Island) and a coordinate ExifTool
137
+ # reports as 0 (or null) must still be caught, so the blank exemption NEVER
138
+ # applies to GPS-family entries — the whole point of the fail-closed backstop.
139
+ gps = group.to_s.start_with?('GPS') || name.start_with?('GPS')
140
+ next false if !gps && blank_value?(v)
141
+
142
+ privacy_group?(group) || privacy_tag?(name)
93
143
  end
94
144
  end
145
+
146
+ # True when a value carries no information: empty, or only zeros plus date/time
147
+ # punctuation and the "Z" (UTC) marker — e.g. "0000:00:00 00:00:00", or the ASF
148
+ # variant "0000:00:00 00:00:00Z" that mat2 writes into WMV's mandatory date
149
+ # field. Only the digit 0 is stripped (never 1-9), so a real value like
150
+ # "59.9139", "Jane Doe", or a real "2024:..." date keeps other characters and
151
+ # is NOT blank. (GPS is exempt from this check entirely — see privacy_residual.)
152
+ def blank_value?(value)
153
+ s = value.to_s
154
+ s.strip.empty? || s.gsub(/[Z0\s:.+-]/, '').empty?
155
+ end
156
+
157
+ # A group is privacy-bearing if it matches one of the family prefixes
158
+ # (GPS, XMP-, MakerNotes, IPTC, IFD1).
159
+ def privacy_group?(group)
160
+ PRIVACY_GROUP_PREFIXES.any? { |p| group.to_s.start_with?(p) }
161
+ end
162
+
163
+ # A tag is privacy-bearing if it's in the exact list OR is any GPS* tag
164
+ # (GPSLatitude/GPSLongitude/GPSPosition/… regardless of group).
165
+ def privacy_tag?(tag)
166
+ t = tag.to_s
167
+ PRIVACY_TAGS.include?(t) || t.start_with?('GPS')
168
+ end
169
+
170
+ # Does this path need mat2 for adequate coverage? (See MAT2_ESSENTIAL.)
171
+ def mat2_essential?(path)
172
+ MAT2_ESSENTIAL.include?(Metaclean.ext_of(path))
173
+ end
95
174
  end
96
175
  end
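
To make the new residual check concrete, the sketch below replays its three rules (prefix match on the group, exact or `GPS*` match on the tag name, and the blank-value exemption that never applies to GPS) on a small hand-written hash. It is a standalone approximation: `TAGS` is an abbreviated subset of `PRIVACY_TAGS`, the sample keys and values are invented for illustration, and the `Display.embedded_key?` filter is left out because its definition is not part of this diff.

```ruby
# Standalone sketch of the fail-closed residual check; sample data is made up.
PREFIXES = %w[GPS XMP- MakerNotes IPTC IFD1].freeze
TAGS     = %w[Artist Author CreateDate].freeze # abbreviated subset of PRIVACY_TAGS

# Same expression as blank_value? in the diff: only the digit 0, "Z",
# whitespace and date/time punctuation are stripped.
def blank_value?(value)
  s = value.to_s
  s.strip.empty? || s.gsub(/[Z0\s:.+-]/, '').empty?
end

def residual(meta)
  meta.select do |key, value|
    group, tag = key.to_s.split(':', 2)
    name = tag.nil? ? group.to_s : tag

    # GPS entries are never exempted, even when the value is 0 or empty.
    gps = group.to_s.start_with?('GPS') || name.start_with?('GPS')
    next false if !gps && blank_value?(value)

    PREFIXES.any? { |p| group.to_s.start_with?(p) } ||
      TAGS.include?(name) || name.start_with?('GPS')
  end
end

sample = {
  'GPS:GPSLatitude'           => 0,                     # flagged: Null Island is a real place
  'QuickTime:CreateDate'      => '0000:00:00 00:00:00', # zeroed, so not a leak
  'XMP-xmpMM:DocumentID'      => 'uuid:1234',           # flagged by the XMP- prefix
  'EXIF:Artist'               => 'Jane Doe',            # flagged by tag name
  'ICC-header:ProfileCreator' => 'appl'                 # ICC groups are not flagged
}

p residual(sample).keys
```

Under the 1.0.2 exact-match list, `XMP-xmpMM` would have slipped through because only `XMP-dc` and `XMP-photoshop` were listed; the prefix match is what closes that gap.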
@@ -1,11 +1,9 @@
1
1
  # frozen_string_literal: true
2
2
 
3
- # ───────────────────────────────────────────────────────────────────────────
4
3
  # Single source of truth for the program's version.
5
4
  # Both the gemspec and `metaclean --version` read from here, so we only have
6
5
  # one place to bump.
7
- # ───────────────────────────────────────────────────────────────────────────
8
6
 
9
7
  module Metaclean
10
- VERSION = '1.0.2'
8
+ VERSION = '4.0.1'
11
9
  end
data/lib/metaclean.rb CHANGED
@@ -1,33 +1,82 @@
1
1
  # frozen_string_literal: true
2
2
 
3
- # ───────────────────────────────────────────────────────────────────────────
4
- # lib/metaclean.rb — the library's "front door".
5
- #
6
- # In Ruby, a module is a namespace. We put everything inside `Metaclean::*`
7
- # so we don't pollute the global namespace and so it's obvious where each
8
- # piece belongs.
9
- #
10
- # The `require` order matters: a file can only reference constants from
11
- # files already loaded. We load the smallest pieces first, then the bigger
12
- # ones that depend on them.
13
- # ───────────────────────────────────────────────────────────────────────────
14
-
15
- require 'metaclean/version' # just defines VERSION
16
- require 'metaclean/display' # ANSI colors and formatters (no deps)
17
- require 'metaclean/exiftool' # ExifTool wrapper
18
- require 'metaclean/mat2' # mat2 wrapper
19
- require 'metaclean/qpdf' # qpdf wrapper
20
- require 'metaclean/strategy' # picks which tools run for each file type
21
- require 'metaclean/runner' # orchestrates a clean across many files
22
- require 'metaclean/cli' # parses ARGV and calls Runner
3
+ # Library entry point. require order matters: dependencies before dependents.
4
+
5
+ require 'metaclean/version'
6
+ require 'metaclean/display'
7
+ require 'metaclean/exiftool'
8
+ require 'metaclean/mat2'
9
+ require 'metaclean/qpdf'
10
+ require 'metaclean/ffmpeg'
11
+ require 'metaclean/strategy'
12
+ require 'metaclean/runner'
13
+ require 'metaclean/cli'
23
14
 
24
15
  module Metaclean
25
- # Custom exception classes. Inheriting from StandardError lets callers do
26
- # `rescue Metaclean::Error` to catch any of our errors without accidentally
27
- # catching things like NoMemoryError or SystemExit.
28
16
  class Error < StandardError; end
29
17
 
30
- # A more specific error so the CLI can show a tailored install hint when
31
- # ExifTool itself is missing.
32
- class ExiftoolMissing < Error; end
18
+ # Raised by ensure_tools! when any of the four required external tools is not
19
+ # on PATH. metaclean runs ExifTool, mat2, qpdf and ffmpeg together and refuses
20
+ # to run without all of them.
21
+ class ToolsMissing < Error; end
22
+
23
+ # A path beginning with "-" is misread as an *option* by the tools we shell
24
+ # out to — e.g. exiftool's `-config FILE` loads and runs arbitrary Perl.
25
+ # Open3 argument arrays bypass the shell, but NOT the invoked tool's own
26
+ # option parser. Prefixing a leading-dash relative path with "./" makes it
27
+ # unambiguously a filename to every tool. Absolute paths and normal names
28
+ # pass through untouched. Used at every shell-out boundary.
29
+ def self.safe_path(path)
30
+ s = path.to_s
31
+ s.start_with?('-') ? File.join('.', s) : s
32
+ end
33
+
34
+ # Lower-cased, dot-stripped extension used for FORMAT ROUTING decisions
35
+ # (Strategy#tools_for, Strategy#mat2_essential?, Mat2.supports?). One
36
+ # definition so every routing path normalizes the extension identically —
37
+ # a future tweak (double extensions, locale-safe downcasing) lands once.
38
+ def self.ext_of(path)
39
+ File.extname(path.to_s).downcase.delete('.')
40
+ end
41
+
42
+ # Marker embedded in every staging-temp filename (Runner, Ffmpeg, Qpdf) and
43
+ # matched by Runner#skip?, so a leftover temp from an interrupted run is
44
+ # ignored on a later directory scan. One literal keeps the producers and the
45
+ # matcher from drifting (qpdf previously embedded a divergent
46
+ # ".metaclean.qpdf.tmp." that didn't contain this marker).
47
+ TMP_MARKER = '.metaclean.tmp.'
48
+
49
+ # Suffix of the default "<name>_clean.<ext>" outputs. Runner#build_clean_path
50
+ # writes it; CLEAN_OUTPUT_RE derives the loop-prevention match from it so the
51
+ # producer and Runner#skip? can't disagree.
52
+ CLEAN_SUFFIX = '_clean'
53
+
54
+ # Matches our own "<name>_clean.<ext>" outputs (with optional "_N" collision
55
+ # counter) so a recursive re-run doesn't re-clean them. Compiled once here,
56
+ # in the module body that runs after the requires, so CLEAN_SUFFIX exists.
57
+ CLEAN_OUTPUT_RE = /#{Regexp.escape(CLEAN_SUFFIX)}(_\d+)?\.[^.]+\z/
58
+
59
+ # Preflight: all four tools must be installed. We run them together for full
60
+ # coverage and to verify the strip, so a partial toolchain is not "good enough"
61
+ # — bail with one clear message naming what's missing and how to install
62
+ # everything. Called once by the CLI before any inspect/clean work.
63
+ def self.ensure_tools!
64
+ missing = []
65
+ missing << 'exiftool' unless Exiftool.available?
66
+ missing << 'mat2' unless Mat2.available?
67
+ missing << 'qpdf' unless Qpdf.available?
68
+ missing << 'ffmpeg' unless Ffmpeg.available?
69
+ return if missing.empty?
70
+
71
+ raise ToolsMissing, <<~MSG
72
+ Missing required tool(s): #{missing.join(', ')}
73
+
74
+ metaclean needs ExifTool, mat2, qpdf and ffmpeg together. Install all four:
75
+ macOS: brew install exiftool mat2 qpdf ffmpeg
76
+ Debian/Ubuntu: sudo apt install libimage-exiftool-perl mat2 qpdf ffmpeg
77
+ Fedora: sudo dnf install perl-Image-ExifTool mat2 qpdf ffmpeg
78
+ Arch: sudo pacman -S perl-image-exiftool mat2 qpdf ffmpeg
79
+ Windows: use WSL2 (https://learn.microsoft.com/windows/wsl/install) + the Debian/Ubuntu line
80
+ MSG
81
+ end
33
82
  end
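
The two path helpers added in this file are small enough to try in isolation. The sketch below copies `safe_path` and the `CLEAN_OUTPUT_RE` construction from the diff into top-level definitions (the gem defines them on the `Metaclean` module) and runs them on a few invented filenames.

```ruby
# Top-level copies of two helpers from lib/metaclean.rb, for experimentation.
# The example filenames are invented.
CLEAN_SUFFIX    = '_clean'
CLEAN_OUTPUT_RE = /#{Regexp.escape(CLEAN_SUFFIX)}(_\d+)?\.[^.]+\z/

# A leading "-" would be parsed as an option by the invoked tool, so a
# relative path that starts with one is rewritten to "./-name".
def safe_path(path)
  s = path.to_s
  s.start_with?('-') ? File.join('.', s) : s
end

puts safe_path('-config.jpg')   # relative, leading dash: gets a "./" prefix
puts safe_path('/tmp/-x.jpg')   # absolute path: unchanged
puts safe_path('photo.jpg')     # ordinary name: unchanged

%w[photo_clean.jpg photo_clean_2.jpg photo.jpg my_cleaner.jpg].each do |name|
  puts "#{name}: #{name.match?(CLEAN_OUTPUT_RE)}"
end
```

Passing arguments as an array to `Open3` avoids shell interpretation, but the child process still parses its own argv, which is why the `./` prefix is needed in addition.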
metadata CHANGED
@@ -1,17 +1,17 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: metaclean
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.0.2
4
+ version: 4.0.1
5
5
  platform: ruby
6
6
  authors:
7
- - 26zl
7
+ - Laurent Zogaj
8
8
  bindir: bin
9
9
  cert_chain: []
10
10
  date: 1980-01-02 00:00:00.000000000 Z
11
11
  dependencies: []
12
12
  description: |
13
- metaclean is a small Ruby CLI that wraps ExifTool, mat2 and qpdf to strip
14
- removable embedded tags (EXIF, IPTC, XMP, GPS, MakerNotes, ID3, document
13
+ metaclean is a small Ruby CLI that wraps ExifTool, mat2, qpdf and ffmpeg to
14
+ strip removable embedded tags (EXIF, IPTC, XMP, GPS, MakerNotes, ID3, document
15
15
  properties, etc.) from images, audio, video, PDFs and Office documents —
16
16
  and shows a before/after diff of what was removed.
17
17
  executables:
@@ -26,6 +26,7 @@ files:
26
26
  - lib/metaclean/cli.rb
27
27
  - lib/metaclean/display.rb
28
28
  - lib/metaclean/exiftool.rb
29
+ - lib/metaclean/ffmpeg.rb
29
30
  - lib/metaclean/mat2.rb
30
31
  - lib/metaclean/qpdf.rb
31
32
  - lib/metaclean/runner.rb
@@ -37,7 +38,6 @@ licenses:
37
38
  metadata:
38
39
  allowed_push_host: https://rubygems.org
39
40
  bug_tracker_uri: https://github.com/26zl/metaclean/issues
40
- changelog_uri: https://github.com/26zl/metaclean/releases
41
41
  source_code_uri: https://github.com/26zl/metaclean
42
42
  rubygems_mfa_required: 'true'
43
43
  rdoc_options: []
@@ -54,8 +54,11 @@ required_rubygems_version: !ruby/object:Gem::Requirement
54
54
  - !ruby/object:Gem::Version
55
55
  version: '0'
56
56
  requirements:
57
- - ExifTool (https://exiftool.org) on PATH
58
- rubygems_version: 3.7.2
57
+ - ExifTool (https://exiftool.org) on PATH — required
58
+ - mat2 (https://github.com/jvoisin/mat2) on PATH — required
59
+ - qpdf (https://qpdf.sourceforge.io) on PATH — required
60
+ - ffmpeg (https://ffmpeg.org) on PATH — required
61
+ rubygems_version: 3.6.9
59
62
  specification_version: 4
60
- summary: Cross-platform CLI that strips file metadata with ExifTool, mat2 and qpdf.
63
+ summary: CLI that strips file metadata with ExifTool, mat2, qpdf and ffmpeg.
61
64
  test_files: []