rdup 0.1.0

Files changed (8)
  1. checksums.yaml +7 -0
  2. data/LICENSE +22 -0
  3. data/README.md +72 -0
  4. data/bin/rdup +79 -0
  5. data/lib/rdup.rb +252 -0
  6. data/lib/rdup/version.rb +5 -0
  7. data/rdup.gemspec +28 -0
  8. metadata +55 -0
checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: b46c4a3517e675b2710af4baea234d33ff235520
4
+ data.tar.gz: 9a5f237afaee2c2d3c184c3d07b0e543e2d5ce6b
5
+ SHA512:
6
+ metadata.gz: 98a8e3114fa9e4081037a0eca3e4cee6d6e8ad485f2562a25cb3e07a22f77636f701b39612586f94cbc619f10b199f250dcd42f59c96113e5a97c695d6342b6a
7
+ data.tar.gz: c4dd4443d3a022d19a03f3eed0f3e5518d9af9e7c6f7df8309b6a02b238f0f84c7a50d482530cfb8d376464605c8bb10d82a084a3334faad48605e2a11302bc2
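The checksums above cover the two archives packed inside the `.gem` file. As a rough sketch of how they could be checked, assuming the `.gem` has already been unpacked so that `metadata.gz` and `data.tar.gz` sit in one directory, the following compares each listed digest against the file on disk. The helper name `mismatched_checksums` and the `DIGESTS` table are hypothetical and not part of rdup or RubyGems.

```ruby
require 'digest'
require 'yaml'

# Hypothetical lookup table: algorithm names as they appear in checksums.yaml.
DIGESTS = { 'SHA1' => Digest::SHA1, 'SHA512' => Digest::SHA512 }.freeze

# Returns [algorithm, file name] pairs whose digest on disk differs from
# the value recorded in the checksums YAML text.
def mismatched_checksums(dir, checksums_yaml)
  expected = YAML.safe_load(checksums_yaml)
  mismatches = []
  expected.each do |algo, files|
    digest_class = DIGESTS.fetch(algo)
    files.each do |name, hex|
      actual = digest_class.file(File.join(dir, name)).hexdigest
      mismatches << [algo, name] unless actual == hex.to_s
    end
  end
  mismatches
end
```

An empty result means every listed digest matched.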
data/LICENSE ADDED
@@ -0,0 +1,22 @@
1
+ The MIT License (MIT)
2
+
3
+ Copyright (c) 2015 physacco
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
22
+
data/README.md ADDED
@@ -0,0 +1,72 @@
1
+ # RDup
2
+
3
+ This program finds duplicate files in multiple directories and interactively removes any of them as you wish.
4
+
5
+ It is inspired by [fdupes](https://github.com/adrianlopezroche/fdupes) and [fastdupes](https://github.com/ssokolow/fastdupes).
6
+
7
+ Written in pure Ruby. No external dependencies. Cross-platform.
8
+
9
+ Tested on Ruby 2.2.2, but it should be able to run on Ruby >= 2.0.
10
+
11
+ ## Algorithm
12
+
13
+ Files with the same SHA1 digest are considered duplicates.
14
+
15
+ RDup implements the Duplicates Finding Algorithm used by fastdupes:
16
+
17
+ 1. The given paths are recursively walked to gather a list of files.
18
+ 2. Files are grouped by size and single-entry groups are pruned away.
19
+ 3. Groups are subdivided and pruned by hashing the first *16KiB* of each file.
20
+ 4. Groups are subdivided and pruned again by hashing full contents.
21
+ 5. Any groups which remain are sets of duplicates.
22
+
23
+ Because most candidates are pruned by size or header hash before any full-content hashing, RDup aims to be much faster than fdupes.
24
+
25
+ ## Installation
26
+
27
+ `gem install rdup`
28
+
29
+ ## Usage
30
+
31
+ ```
32
+ Usage: rdup [options] dir1 [dir2 ...]
33
+
34
+ Options:
35
+ -h, --help Print this help message and exit
36
+ -v, --version Print version information and exit
37
+ -t, --mtime Show each file's mtime
38
+ -d, --delete Delete duplicated files (with prompt)
39
+ -n, --dry-run Don't actually delete any files
40
+ --min-size=NUM Files below this size will be ignored
41
+ ```
42
+
43
+ ## Example
44
+
45
+ ```
46
+ $ rdup --mtime --delete --dry-run foo/ bar/
47
+ Found 5 files to be compared for duplication.
48
+ Found 2 sets of files with identical sizes. (5 files in total)
49
+ Found 2 sets of files with identical header hashes. (5 files in total)
50
+ Found 2 sets of files with identical hashes. (5 files in total)
51
+
52
+ [1/2] SHA1: c56351f9f9eb825c743141dd4acc870166838e3c, Size: 880 bytes
53
+ 1) 2015-12-13 17:07:52 +0800 foo/abc/abc.dat
54
+ 2) 2015-12-13 17:08:04 +0800 bar/abc.dat
55
+ Which to preserve (1,2 or all): 1
56
+ [+] foo/abc/abc.dat
57
+ [-] bar/abc.dat
58
+
59
+ [2/2] SHA1: 500fe2c2d2018bbe97a1341cf826335aaafab3d9, Size: 1,076 bytes
60
+ 1) 2015-12-13 17:07:13 +0800 foo/abc/foo.txt
61
+ 2) 2015-12-13 17:06:49 +0800 foo/foo.txt
62
+ 3) 2015-12-13 17:07:25 +0800 bar/bar.txt
63
+ Which to preserve (1,2,3 or all): 2
64
+ [-] foo/abc/foo.txt
65
+ [+] foo/foo.txt
66
+ [-] bar/bar.txt
67
+ ```
68
+
69
+ ## Known issues
70
+
71
+ * RDup doesn't follow Windows [Shortcut files](https://en.wikipedia.org/wiki/File_shortcut). Shortcuts are treated as normal files.
72
+
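The five steps listed in the README's Algorithm section can be sketched compactly. This is a minimal illustration and not the code shipped in `lib/rdup.rb`: the helpers `list_files`, `subdivide` and `find_duplicates` are hypothetical names, and unlike the package it subdivides each group in place rather than building separate maps.

```ruby
require 'digest'

HEADER_SIZE = 2**14 # 16 KiB, the header size named in the README

# Step 1: recursively gather regular files under the given directories.
def list_files(dirs)
  dirs.flat_map { |dir| Dir.glob(File.join(dir, '**', '*')) }
      .select { |path| File.file?(path) }
end

# Split every group by the key the block returns, then prune
# subgroups that are left with a single entry.
def subdivide(groups)
  groups.flat_map do |paths|
    paths.group_by { |path| yield(path) }.values.select { |g| g.size > 1 }
  end
end

# Steps 2-5: size, then header hash, then full-content hash.
def find_duplicates(paths)
  groups = subdivide([paths]) { |p| File.size(p) }
  groups = subdivide(groups) do |p|
    # binread may return nil for an empty file, so fall back to ''.
    Digest::SHA1.hexdigest(File.binread(p, HEADER_SIZE) || '')
  end
  subdivide(groups) { |p| Digest::SHA1.file(p).hexdigest }
end
```

Calling `find_duplicates(list_files(['foo', 'bar']))` would return an array of groups, each an array of paths with identical contents. Only files that survive the size and header stages are read in full.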
data/bin/rdup ADDED
@@ -0,0 +1,79 @@
1
+ #!/usr/bin/env ruby
2
+ # encoding: utf-8
3
+
4
+ require 'getoptlong'
5
+ require_relative '../lib/rdup'
6
+
7
+ Usage = <<EOF
8
+ Usage: rdup [options] dir1 [dir2 ...]
9
+
10
+ Options:
11
+ -h, --help Print this help message and exit
12
+ -v, --version Print version information and exit
13
+ -t, --mtime Show each file's mtime
14
+ -d, --delete Delete duplicated files (with prompt)
15
+ -n, --dry-run Don't actually delete any files
16
+ --min-size=NUM Files below this size will be ignored
17
+ EOF
18
+
19
+ VersionInfo = "rdup v#{RDup::VERSION}"
20
+
21
+ def parse_arguments
22
+ opts = GetoptLong.new(
23
+ ['--help', '-h', GetoptLong::NO_ARGUMENT],
24
+ ['--version', '-v', GetoptLong::NO_ARGUMENT],
25
+ ['--mtime', '-t', GetoptLong::NO_ARGUMENT],
26
+ ['--delete', '-d', GetoptLong::NO_ARGUMENT],
27
+ ['--dry-run', '-n', GetoptLong::NO_ARGUMENT],
28
+ ['--min-size', GetoptLong::REQUIRED_ARGUMENT],
29
+ )
30
+
31
+ show_mtime = false
32
+ deletion = false
33
+ dry_run = false
34
+ min_size = 0
35
+
36
+ begin
37
+ opts.each do |opt, arg|
38
+ case opt
39
+ when '--help'
40
+ puts Usage
41
+ exit 0
42
+ when '--version'
43
+ puts VersionInfo
44
+ exit 0
45
+ when '--mtime'
46
+ show_mtime = true
47
+ when '--delete'
48
+ deletion = true
49
+ when '--dry-run'
50
+ dry_run = true
51
+ when '--min-size'
52
+ if arg =~ /^\d+$/
53
+ min_size = arg.to_i
54
+ else
55
+ STDERR.puts "rdup: --min-size requires a number"
56
+ exit 1
57
+ end
58
+ end
59
+ end
60
+ rescue GetoptLong::Error
61
+ exit 1
62
+ end
63
+
64
+ if ARGV.size == 0
65
+ STDERR.puts 'rdup: no directories specified'
66
+ exit 1
67
+ end
68
+
69
+ return {
70
+ :show_mtime => show_mtime,
71
+ :deletion => deletion,
72
+ :dry_run => dry_run,
73
+ :min_size => min_size,
74
+ :arguments => ARGV
75
+ }
76
+ end
77
+
78
+ args = parse_arguments
79
+ RDup::Scanner.new(args).run
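For comparison, the same option set can be expressed with `optparse` from the standard library. This is only an alternative sketch, not the parser the package uses (the script above relies on `getoptlong`). The method name `parse_rdup_args` is hypothetical, and `--help`/`--version` handling is left out for brevity.

```ruby
require 'optparse'

# Parses an argument array and returns a hash shaped like the one
# parse_arguments builds in bin/rdup. argv itself is not modified.
def parse_rdup_args(argv)
  options = { show_mtime: false, deletion: false, dry_run: false, min_size: 0 }
  parser = OptionParser.new do |o|
    o.on('-t', '--mtime')           { options[:show_mtime] = true }
    o.on('-d', '--delete')          { options[:deletion] = true }
    o.on('-n', '--dry-run')         { options[:dry_run] = true }
    # Integer makes optparse reject non-numeric values for us.
    o.on('--min-size=NUM', Integer) { |n| options[:min_size] = n }
  end
  # parse returns the remaining non-option arguments (the directories).
  options[:arguments] = parser.parse(argv)
  options
end
```

With this shape, `parse_rdup_args(%w[--mtime --delete --dry-run foo/ bar/])` yields the flags used in the README example, with `foo/` and `bar/` under `:arguments`. A non-numeric `--min-size` raises `OptionParser::InvalidArgument`, which the caller would need to rescue and report.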
data/lib/rdup.rb ADDED
@@ -0,0 +1,252 @@
1
+ # encoding: utf-8
2
+
3
+ require 'digest/sha1'
4
+ require_relative 'rdup/version'
5
+
6
+ module RDup
7
+ Defaults = {
8
+ :show_mtime => false,
9
+ :deletion => false,
10
+ :dry_run => false,
11
+ :min_size => 0,
12
+ :header_size => 2 ** 14,
13
+ }
14
+
15
+ class FileStat
16
+ attr_reader :size, :mtime
17
+ attr_accessor :header_hash, :full_hash
18
+
19
+ def initialize(size, mtime)
20
+ @size = size
21
+ @mtime = mtime
22
+ @header_hash = nil
23
+ @full_hash = nil
24
+ end
25
+ end
26
+
27
+ class Scanner
28
+ attr_accessor :opts, :files, :dirs
29
+ attr_reader :stats, :header_hashes, :full_hashes
30
+ attr_reader :size_map, :header_hash_map, :full_hash_map
31
+
32
+ def initialize(opts)
33
+ @opts = Defaults.dup
34
+ @files = []
35
+ @dirs = []
36
+ @stats = {}
37
+ @header_hashes = {}
38
+ @full_hashes = {}
39
+ @size_map = {}
40
+ @header_hash_map = {}
41
+ @full_hash_map = {}
42
+
43
+ @opts.update(opts)
44
+
45
+ opts[:arguments].each do |path|
46
+ if File.file?(path)
47
+ @files << path
48
+ elsif File.directory?(path)
49
+ @dirs << path
50
+ else
51
+ STDERR.puts "Warning: skipping `#{path}' because it's neither a file nor a directory"
52
+ end
53
+ end
54
+ end
55
+
56
+ def run
57
+ find_all_files
58
+ fcount = @stats.size
59
+ puts "Found #{fcount} files to be compared for duplication."
60
+ if fcount == 0
61
+ return
62
+ end
63
+
64
+ build_size_map
65
+ reduce_groups(@size_map)
66
+ gcount = @size_map.size
67
+ fcount = count_files(@size_map)
68
+ puts "Found #{gcount} sets of files with identical sizes. (#{fcount} files in total)"
69
+ if fcount == 0
70
+ return
71
+ end
72
+
73
+ build_header_hash_map
74
+ reduce_groups(@header_hash_map)
75
+ gcount = @header_hash_map.size
76
+ fcount = count_files(@header_hash_map)
77
+ puts "Found #{gcount} sets of files with identical header hashes. (#{fcount} files in total)"
78
+ if fcount == 0
79
+ return
80
+ end
81
+
82
+ build_full_hash_map
83
+ reduce_groups(@full_hash_map)
84
+ gcount = @full_hash_map.size
85
+ fcount = count_files(@full_hash_map)
86
+ puts "Found #{gcount} sets of files with identical hashes. (#{fcount} files in total)"
87
+ if fcount == 0
88
+ return
89
+ end
90
+
91
+ @full_hash_map.each_with_index do |pair, i|
92
+ full_hash, group = pair
93
+ size = @stats[group[0]].size
94
+ puts "\n[#{i + 1}/#{gcount}] SHA1: #{full_hash}, Size: #{csf(size)} bytes"
95
+ group.each_with_index do |path, j|
96
+ stat = @stats[path]
97
+ if @opts[:show_mtime]
98
+ puts " #{j + 1}) #{stat.mtime} #{path}"
99
+ else
100
+ puts " #{j + 1}) #{path}"
101
+ end
102
+ end
103
+
104
+ if @opts[:deletion]
105
+ survivals = which_to_preserve(group)
106
+ group.each_with_index do |path, index|
107
+ if survivals.include?(index + 1)
108
+ puts " [+] #{path}"
109
+ else
110
+ puts " [-] #{path}"
111
+ remove_file(path) unless @opts[:dry_run]
112
+ end
113
+ end
114
+ end
115
+ end
116
+ end
117
+
118
+ private
119
+
120
+ def find_all_files
121
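+ @files.select! do |path| # filter in place; deleting inside #each would skip entries
122
+ stat = File.stat(path)
123
+ if stat.size >= @opts[:min_size]
124
+ @stats[path] = FileStat.new(stat.size, stat.mtime) # truthy, so the path is kept
125
+ else
126
+ false # drop files below the minimum size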
127
+ end
128
+ end
129
+
130
+ pwd = Dir.pwd
131
+ @dirs.each do |dir|
132
+ begin
133
+ Dir.chdir(dir)
134
+ Dir['**/*'].each do |path|
135
+ stat = File.stat(path)
136
+ if stat.file? and stat.size >= @opts[:min_size]
137
+ path = File.join(dir, path)
138
+ @files << path
139
+ @stats[path] = FileStat.new(stat.size, stat.mtime)
140
+ end
141
+ end
142
+ rescue => e
143
+ STDERR.puts "Error: #{e}"
144
+ ensure
145
+ Dir.chdir(pwd)
146
+ end
147
+ end
148
+ end
149
+
150
+ # Group the files by size
151
+ # @size_map: file_size => [file1, file2, ...]
152
+ def build_size_map
153
+ @stats.each do |path, stat|
154
+ size = stat.size
155
+ if @size_map.has_key?(size)
156
+ @size_map[size] << path
157
+ else
158
+ @size_map[size] = [path]
159
+ end
160
+ end
161
+ end
162
+
163
+ # @header_hash_map: header_hash => [file1, file2, ...]
164
+ def build_header_hash_map
165
+ header_size = @opts[:header_size]
166
+ @size_map.each do |size, paths|
167
+ paths.each do |path|
168
+ header = File.open(path, 'rb'){|f| f.read(header_size)}
169
+ header = '' if header.nil? # empty file
170
+
171
+ header_hash = Digest::SHA1.new.hexdigest(header)
172
+ @stats[path].header_hash = header_hash
173
+ @stats[path].full_hash = header_hash if size <= header_size
174
+
175
+ if @header_hash_map.has_key?(header_hash)
176
+ @header_hash_map[header_hash] << path
177
+ else
178
+ @header_hash_map[header_hash] = [path]
179
+ end
180
+ end
181
+ end
182
+ end
183
+
184
+ # @full_hash_map: full_hash => [file1, file2, ...]
185
+ def build_full_hash_map
186
+ @header_hash_map.each_value do |paths|
187
+ paths.each do |path|
188
+ stat = @stats[path]
189
+ if stat.size <= @opts[:header_size]
190
+ full_hash = stat.full_hash
191
+ else
192
+ full_hash = Digest::SHA1.new.file(path).hexdigest
193
+ @stats[path].full_hash = full_hash
194
+ end
195
+
196
+ if @full_hash_map.has_key?(full_hash)
197
+ @full_hash_map[full_hash] << path
198
+ else
199
+ @full_hash_map[full_hash] = [path]
200
+ end
201
+ end
202
+ end
203
+ end
204
+
205
+ def reduce_groups(map)
206
+ map.delete_if {|key, paths| paths.size == 1}
207
+ end
208
+
209
+ def count_files(map)
210
+ map.values.flatten.size
211
+ end
212
+
213
+ # Comma-separated format
214
+ def csf(number)
215
+ number.to_s.reverse.gsub(/(\d{3})(?=\d)/, '\\1,').reverse
216
+ end
217
+
218
+ # Ask the user which files to preserve.
219
+ # Return an array of numbers
220
+ def which_to_preserve(group)
221
+ while true
222
+ all = 1.upto(group.size).to_a
223
+ print "Which to preserve (#{all.join(',')} or all): "
224
+ input = STDIN.readline.strip
225
+ if input.empty?
226
+ # continue
227
+ elsif ['a', 'all'].include?(input.downcase)
228
+ return all
229
+ elsif input =~ /^[\d\s,]+$/
230
+ nums = input.split(/[,\s]+/).delete_if(&:empty?).map(&:to_i)
231
+ if nums.empty?
232
+ STDERR.puts 'Illegal answer. Please input some numbers.'
233
+ elsif nums.min < 1 || nums.max > group.size
234
+ STDERR.puts "Illegal number. Allowed range: [1, #{group.size}]"
235
+ else # good answer
236
+ return nums
237
+ end
238
+ else
239
+ STDERR.puts 'Illegal answer. Only numbers/commas/spaces allowed.'
240
+ end
241
+ end
242
+ end
243
+
244
+ def remove_file(path)
245
+ begin
246
+ File.unlink(path)
247
+ rescue => e
248
+ STDERR.puts "Error: #{e}"
249
+ end
250
+ end
251
+ end
252
+ end
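The answer handling inside `which_to_preserve` is easier to reason about when separated from the prompt loop. The sketch below isolates it as a pure function. The name `parse_selection` is hypothetical, and it differs from the shipped code in returning `nil` for any unacceptable answer (instead of printing a message) and in removing repeated numbers.

```ruby
# Turns one line of user input into a list of 1-based indices to keep.
# Returns nil when the answer is empty, malformed, or out of range.
def parse_selection(input, group_size)
  answer = input.strip
  all = (1..group_size).to_a
  return all if %w[a all].include?(answer.downcase)
  # Only digits, commas and whitespace are accepted.
  return nil unless answer =~ /\A[\d\s,]+\z/

  nums = answer.split(/[,\s]+/).reject(&:empty?).map(&:to_i)
  return nil if nums.empty? || nums.min < 1 || nums.max > group_size

  nums.uniq
end
```

A prompt loop would then repeat the question until `parse_selection` returns a non-nil value.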
data/lib/rdup/version.rb ADDED
@@ -0,0 +1,5 @@
1
+ # encoding: utf-8
2
+
3
+ module RDup
4
+ VERSION = '0.1.0'
5
+ end
data/rdup.gemspec ADDED
@@ -0,0 +1,28 @@
1
+ # encoding: utf-8
2
+
3
+ require_relative 'lib/rdup/version'
4
+
5
+ Gem::Specification.new do |s|
6
+ s.name = 'rdup'
7
+ s.version = RDup::VERSION
8
+ s.date = '2015-12-05'
9
+
10
+ s.summary = 'Find and remove duplicate files in multiple directories.'
11
+ s.description = <<EOF
12
+ This program finds duplicate files in multiple directories and
13
+ interactively removes any of them as you wish. It is similar to fdupes,
14
+ but much faster. Written in pure Ruby. No external dependencies.
15
+ EOF
16
+
17
+ s.authors = ['physacco']
18
+ s.email = ['physacco@gmail.com']
19
+ s.homepage = 'https://github.com/physacco/rdup'
20
+ s.license = 'MIT'
21
+
22
+ s.files = Dir['lib/**/*.rb'] + Dir['bin/*'] +
23
+ ['README.md', 'LICENSE', 'rdup.gemspec']
24
+ s.executables = ['rdup']
25
+
26
+ s.platform = Gem::Platform::RUBY
27
+ s.required_ruby_version = '>= 2.0.0'
28
+ end
metadata ADDED
@@ -0,0 +1,55 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: rdup
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.1.0
5
+ platform: ruby
6
+ authors:
7
+ - physacco
8
+ autorequire:
9
+ bindir: bin
10
+ cert_chain: []
11
+ date: 2015-12-05 00:00:00.000000000 Z
12
+ dependencies: []
13
+ description: |
14
+ This program finds duplicate files in multiple directories and
15
+ interactively removes any of them as you wish. It is similar to fdupes,
16
+ but much faster. Written in pure Ruby. No external dependencies.
17
+ email:
18
+ - physacco@gmail.com
19
+ executables:
20
+ - rdup
21
+ extensions: []
22
+ extra_rdoc_files: []
23
+ files:
24
+ - LICENSE
25
+ - README.md
26
+ - bin/rdup
27
+ - lib/rdup.rb
28
+ - lib/rdup/version.rb
29
+ - rdup.gemspec
30
+ homepage: https://github.com/physacco/rdup
31
+ licenses:
32
+ - MIT
33
+ metadata: {}
34
+ post_install_message:
35
+ rdoc_options: []
36
+ require_paths:
37
+ - lib
38
+ required_ruby_version: !ruby/object:Gem::Requirement
39
+ requirements:
40
+ - - ">="
41
+ - !ruby/object:Gem::Version
42
+ version: 2.0.0
43
+ required_rubygems_version: !ruby/object:Gem::Requirement
44
+ requirements:
45
+ - - ">="
46
+ - !ruby/object:Gem::Version
47
+ version: '0'
48
+ requirements: []
49
+ rubyforge_project:
50
+ rubygems_version: 2.4.5
51
+ signing_key:
52
+ specification_version: 4
53
+ summary: Find and remove duplicate files in multiple directories.
54
+ test_files: []
55
+ has_rdoc: