rdup 0.1.0

Files changed (8)
  1. checksums.yaml +7 -0
  2. data/LICENSE +22 -0
  3. data/README.md +72 -0
  4. data/bin/rdup +79 -0
  5. data/lib/rdup.rb +252 -0
  6. data/lib/rdup/version.rb +5 -0
  7. data/rdup.gemspec +28 -0
  8. metadata +55 -0
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: b46c4a3517e675b2710af4baea234d33ff235520
4
+ data.tar.gz: 9a5f237afaee2c2d3c184c3d07b0e543e2d5ce6b
5
+ SHA512:
6
+ metadata.gz: 98a8e3114fa9e4081037a0eca3e4cee6d6e8ad485f2562a25cb3e07a22f77636f701b39612586f94cbc619f10b199f250dcd42f59c96113e5a97c695d6342b6a
7
+ data.tar.gz: c4dd4443d3a022d19a03f3eed0f3e5518d9af9e7c6f7df8309b6a02b238f0f84c7a50d482530cfb8d376464605c8bb10d82a084a3334faad48605e2a11302bc2
data/LICENSE ADDED
@@ -0,0 +1,22 @@
1
+ The MIT License (MIT)
2
+
3
+ Copyright (c) 2015 physacco
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
22
+
@@ -0,0 +1,72 @@
1
+ # RDup
2
+
3
+ This program finds duplicate files in multiple directories and interactively removes any of them as you wish.
4
+
5
+ It is inspired by [fdupes](https://github.com/adrianlopezroche/fdupes) and [fastdupes](https://github.com/ssokolow/fastdupes).
6
+
7
+ Written in pure Ruby. No external dependencies. Cross-platform.
8
+
9
+ Tested on Ruby 2.2.2, but it should be able to run on Ruby >= 2.0.
10
+
11
+ ## Algorithm
12
+
13
+ Files with the same SHA1 digest are considered duplicates.
14
+
15
+ RDup implements the Duplicates Finding Algorithm used by fastdupes:
16
+
17
+ 1. The given paths are recursively walked to gather a list of files.
18
+ 2. Files are grouped by size and single-entry groups are pruned away.
19
+ 3. Groups are subdivided and pruned by hashing the first *16KiB* of each file.
20
+ 4. Groups are subdivided and pruned again by hashing full contents.
21
+ 5. Any groups which remain are sets of duplicates.
22
+
23
+ By using this algorithm, RDup performs much faster than fdupes.
24
+
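The five steps above can be sketched in a few lines of Ruby. This is a minimal illustration rather than RDup's actual code: the names `HEADER_SIZE`, `subdivide`, and `find_duplicates` are hypothetical, and the sketch assumes it is handed a flat list of paths that has already been gathered by walking the directories (step 1).

```ruby
require 'digest'

# 16 KiB header, as described in step 3 (hypothetical constant name).
HEADER_SIZE = 2**14

# Split every group by the key the block returns for each path,
# then prune groups that are left with a single entry.
def subdivide(groups)
  groups.flat_map do |group|
    group.group_by { |path| yield path }.values
  end.reject { |group| group.size < 2 }
end

# Hypothetical helper: returns an array of duplicate sets (arrays of paths).
def find_duplicates(paths)
  groups = [paths.select { |path| File.file?(path) }]

  # Step 2: group by size.
  groups = subdivide(groups) { |path| File.size(path) }

  # Step 3: group by the SHA1 of the first HEADER_SIZE bytes.
  # IO#read(length) returns nil for an empty file, hence the fallback.
  groups = subdivide(groups) do |path|
    header = File.open(path, 'rb') { |f| f.read(HEADER_SIZE) } || ''
    Digest::SHA1.hexdigest(header)
  end

  # Step 4: group by the SHA1 of the full contents.
  # Step 5: whatever remains is a set of duplicates.
  subdivide(groups) { |path| Digest::SHA1.file(path).hexdigest }
end
```

Calling `find_duplicates(Dir['foo/**/*'] + Dir['bar/**/*'])` would return one array of paths per duplicate set. Each stage only examines files that survived the previous one, so full-content hashing is limited to files that already match in both size and header.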
25
+ ## Installation
26
+
27
+ `gem install rdup`
28
+
29
+ ## Usage
30
+
31
+ ```
32
+ Usage: rdup [options] dir1 [dir2 ...]
33
+
34
+ Options:
35
+ -h, --help Print this help message and exit
36
+ -v, --version Print version information and exit
37
+ -t, --mtime Show each file's mtime
38
+ -d, --delete Delete duplicated files (with prompt)
39
+ -n, --dry-run Don't actually delete any files
40
+ --min-size=NUM Files below this size will be ignored
41
+ ```
42
+
43
+ ## Example
44
+
45
+ ```
46
+ $ rdup --mtime --delete --dry-run foo/ bar/
47
+ Found 5 files to be compared for duplication.
48
+ Found 2 sets of files with identical sizes. (5 files in total)
49
+ Found 2 sets of files with identical header hashes. (5 files in total)
50
+ Found 2 sets of files with identical hashes. (5 files in total)
51
+
52
+ [1/2] SHA1: c56351f9f9eb825c743141dd4acc870166838e3c, Size: 880 bytes
53
+ 1) 2015-12-13 17:07:52 +0800 foo/abc/abc.dat
54
+ 2) 2015-12-13 17:08:04 +0800 bar/abc.dat
55
+ Which to preserve (1,2 or all): 1
56
+ [+] foo/abc/abc.dat
57
+ [-] bar/abc.dat
58
+
59
+ [2/2] SHA1: 500fe2c2d2018bbe97a1341cf826335aaafab3d9, Size: 1,076 bytes
60
+ 1) 2015-12-13 17:07:13 +0800 foo/abc/foo.txt
61
+ 2) 2015-12-13 17:06:49 +0800 foo/foo.txt
62
+ 3) 2015-12-13 17:07:25 +0800 bar/bar.txt
63
+ Which to preserve (1,2,3 or all): 2
64
+ [-] foo/abc/foo.txt
65
+ [+] foo/foo.txt
66
+ [-] bar/bar.txt
67
+ ```
68
+
69
+ ## Known issues
70
+
71
+ * RDup doesn't follow Windows [Shortcut files](https://en.wikipedia.org/wiki/File_shortcut). Shortcuts are treated as normal files.
72
+
@@ -0,0 +1,79 @@
1
+ #!/usr/bin/env ruby
2
+ # encoding: utf-8
3
+
4
+ require 'getoptlong'
5
+ require_relative '../lib/rdup'
6
+
7
+ Usage = <<EOF
8
+ Usage: rdup [options] dir1 [dir2 ...]
9
+
10
+ Options:
11
+ -h, --help Print this help message and exit
12
+ -v, --version Print version information and exit
13
+ -t, --mtime Show each file's mtime
14
+ -d, --delete Delete duplicated files (with prompt)
15
+ -n, --dry-run Don't actually delete any files
16
+ --min-size=NUM Files below this size will be ignored
17
+ EOF
18
+
19
+ VersionInfo = "rdup v#{RDup::VERSION}"
20
+
21
+ def parse_arguments
22
+ opts = GetoptLong.new(
23
+ ['--help', '-h', GetoptLong::NO_ARGUMENT],
24
+ ['--version', '-v', GetoptLong::NO_ARGUMENT],
25
+ ['--mtime', '-t', GetoptLong::NO_ARGUMENT],
26
+ ['--delete', '-d', GetoptLong::NO_ARGUMENT],
27
+ ['--dry-run', '-n', GetoptLong::NO_ARGUMENT],
28
+ ['--min-size', GetoptLong::REQUIRED_ARGUMENT],
29
+ )
30
+
31
+ show_mtime = false
32
+ deletion = false
33
+ dry_run = false
34
+ min_size = 0
35
+
36
+ begin
37
+ opts.each do |opt, arg|
38
+ case opt
39
+ when '--help'
40
+ puts Usage
41
+ exit 0
42
+ when '--version'
43
+ puts VersionInfo
44
+ exit 0
45
+ when '--mtime'
46
+ show_mtime = true
47
+ when '--delete'
48
+ deletion = true
49
+ when '--dry-run'
50
+ dry_run = true
51
+ when '--min-size'
52
+ if arg =~ /\A\d+\z/
53
+ min_size = arg.to_i
54
+ else
55
+ STDERR.puts "rdup: --min-size requires a number"
56
+ exit 1
57
+ end
58
+ end
59
+ end
60
+ rescue GetoptLong::Error
61
+ exit 1
62
+ end
63
+
64
+ if ARGV.size == 0
65
+ STDERR.puts 'rdup: no directories specified'
66
+ exit 1
67
+ end
68
+
69
+ return {
70
+ :show_mtime => show_mtime,
71
+ :deletion => deletion,
72
+ :dry_run => dry_run,
73
+ :min_size => min_size,
74
+ :arguments => ARGV
75
+ }
76
+ end
77
+
78
+ args = parse_arguments
79
+ RDup::Scanner.new(args).run
@@ -0,0 +1,252 @@
1
+ # encoding: utf-8
2
+
3
+ require 'digest/sha1'
4
+ require_relative 'rdup/version'
5
+
6
+ module RDup
7
+ Defaults = {
8
+ :show_mtime => false,
9
+ :deletion => false,
10
+ :dry_run => false,
11
+ :min_size => 0,
12
+ :header_size => 2 ** 14,
13
+ }
14
+
15
+ class FileStat
16
+ attr_reader :size, :mtime
17
+ attr_accessor :header_hash, :full_hash
18
+
19
+ def initialize(size, mtime)
20
+ @size = size
21
+ @mtime = mtime
22
+ @header_hash = nil
23
+ @full_hash = nil
24
+ end
25
+ end
26
+
27
+ class Scanner
28
+ attr_accessor :opts, :files, :dirs
29
+ attr_reader :stats, :header_hashes, :full_hashes
30
+ attr_reader :size_map, :header_hash_map, :full_hash_map
31
+
32
+ def initialize(opts)
33
+ @opts = Defaults.dup
34
+ @files = []
35
+ @dirs = []
36
+ @stats = {}
37
+ @header_hashes = {}
38
+ @full_hashes = {}
39
+ @size_map = {}
40
+ @header_hash_map = {}
41
+ @full_hash_map = {}
42
+
43
+ @opts.update(opts)
44
+
45
+ opts[:arguments].each do |path|
46
+ if File.file?(path)
47
+ @files << path
48
+ elsif File.directory?(path)
49
+ @dirs << path
50
+ else
51
+ STDERR.puts "Warning: skip `#{path}' because it's neither a file nor a directory"
52
+ end
53
+ end
54
+ end
55
+
56
+ def run
57
+ find_all_files
58
+ fcount = @stats.size
59
+ puts "Found #{fcount} files to be compared for duplication."
60
+ if fcount == 0
61
+ return
62
+ end
63
+
64
+ build_size_map
65
+ reduce_groups(@size_map)
66
+ gcount = @size_map.size
67
+ fcount = count_files(@size_map)
68
+ puts "Found #{gcount} sets of files with identical sizes. (#{fcount} files in total)"
69
+ if fcount == 0
70
+ return
71
+ end
72
+
73
+ build_header_hash_map
74
+ reduce_groups(@header_hash_map)
75
+ gcount = @header_hash_map.size
76
+ fcount = count_files(@header_hash_map)
77
+ puts "Found #{gcount} sets of files with identical header hashes. (#{fcount} files in total)"
78
+ if fcount == 0
79
+ return
80
+ end
81
+
82
+ build_full_hash_map
83
+ reduce_groups(@full_hash_map)
84
+ gcount = @full_hash_map.size
85
+ fcount = count_files(@full_hash_map)
86
+ puts "Found #{gcount} sets of files with identical hashes. (#{fcount} files in total)"
87
+ if fcount == 0
88
+ return
89
+ end
90
+
91
+ @full_hash_map.each_with_index do |pair, i|
92
+ full_hash, group = pair
93
+ size = @stats[group[0]].size
94
+ puts "\n[#{i + 1}/#{gcount}] SHA1: #{full_hash}, Size: #{csf(size)} bytes"
95
+ group.each_with_index do |path, j|
96
+ stat = @stats[path]
97
+ if @opts[:show_mtime]
98
+ puts " #{j + 1}) #{stat.mtime} #{path}"
99
+ else
100
+ puts " #{j + 1}) #{path}"
101
+ end
102
+ end
103
+
104
+ if @opts[:deletion]
105
+ survivals = which_to_preserve(group)
106
+ group.each_with_index do |path, index|
107
+ if survivals.include?(index + 1)
108
+ puts " [+] #{path}"
109
+ else
110
+ puts " [-] #{path}"
111
+ remove_file(path) unless @opts[:dry_run]
112
+ end
113
+ end
114
+ end
115
+ end
116
+ end
117
+
118
+ private
119
+
120
+ def find_all_files
121
+ @files.dup.each do |path| # iterate over a copy; @files is mutated in the loop body
122
+ stat = File.stat(path)
123
+ if stat.size >= @opts[:min_size]
124
+ @stats[path] = FileStat.new(stat.size, stat.mtime)
125
+ else
126
+ @files.delete(path)
127
+ end
128
+ end
129
+
130
+ pwd = Dir.pwd
131
+ @dirs.each do |dir|
132
+ begin
133
+ Dir.chdir(dir)
134
+ Dir['**/*'].each do |path|
135
+ stat = File.stat(path)
136
+ if stat.file? and stat.size >= @opts[:min_size]
137
+ path = File.join(dir, path)
138
+ @files << path
139
+ @stats[path] = FileStat.new(stat.size, stat.mtime)
140
+ end
141
+ end
142
+ rescue => e
143
+ STDERR.puts "Error: #{e}"
144
+ ensure
145
+ Dir.chdir(pwd)
146
+ end
147
+ end
148
+ end
149
+
150
+ # Group the files by size
151
+ # @size_map: file_size => [file1, file2, ...]
152
+ def build_size_map
153
+ @stats.each do |path, stat|
154
+ size = stat.size
155
+ if @size_map.has_key?(size)
156
+ @size_map[size] << path
157
+ else
158
+ @size_map[size] = [path]
159
+ end
160
+ end
161
+ end
162
+
163
+ # @header_hash_map: header_hash => [file1, file2, ...]
164
+ def build_header_hash_map
165
+ header_size = @opts[:header_size]
166
+ @size_map.each do |size, paths|
167
+ paths.each do |path|
168
+ header = File.open(path, 'rb'){|f| f.read(header_size)}
169
+ header = '' if header.nil? # empty file
170
+
171
+ header_hash = Digest::SHA1.hexdigest(header)
172
+ @stats[path].header_hash = header_hash
173
+ @stats[path].full_hash = header_hash if size <= header_size
174
+
175
+ if @header_hash_map.has_key?(header_hash)
176
+ @header_hash_map[header_hash] << path
177
+ else
178
+ @header_hash_map[header_hash] = [path]
179
+ end
180
+ end
181
+ end
182
+ end
183
+
184
+ # @full_hash_map: full_hash => [file1, file2, ...]
185
+ def build_full_hash_map
186
+ @header_hash_map.each_value do |paths|
187
+ paths.each do |path|
188
+ stat = @stats[path]
189
+ if stat.size <= @opts[:header_size]
190
+ full_hash = stat.full_hash
191
+ else
192
+ full_hash = Digest::SHA1.new.file(path).hexdigest
193
+ @stats[path].full_hash = full_hash
194
+ end
195
+
196
+ if @full_hash_map.has_key?(full_hash)
197
+ @full_hash_map[full_hash] << path
198
+ else
199
+ @full_hash_map[full_hash] = [path]
200
+ end
201
+ end
202
+ end
203
+ end
204
+
205
+ def reduce_groups(map)
206
+ map.delete_if {|key, paths| paths.size == 1}
207
+ end
208
+
209
+ def count_files(map)
210
+ map.values.flatten.size
211
+ end
212
+
213
+ # Comma-separated format
214
+ def csf(number)
215
+ number.to_s.reverse.gsub(/(\d{3})(?=\d)/, '\\1,').reverse
216
+ end
217
+
218
+ # Ask the user which files to preserve.
219
+ # Return an array of numbers
220
+ def which_to_preserve(group)
221
+ while true
222
+ all = 1.upto(group.size).to_a
223
+ print "Which to preserve (#{all.join(',')} or all): "
224
+ input = STDIN.readline.strip
225
+ if input.empty?
226
+ # continue
227
+ elsif ['a', 'all'].include?(input.downcase)
228
+ return all
229
+ elsif input =~ /^[\d\s,]+$/
230
+ nums = input.split(/[,\s]+/).delete_if(&:empty?).map(&:to_i)
231
+ if nums.empty?
232
+ STDERR.puts 'Illegal answer. Please input some numbers.'
233
+ elsif nums.min < 1 || nums.max > group.size
234
+ STDERR.puts "Illegal number. Allowed range: [1, #{group.size}]"
235
+ else # good answer
236
+ return nums
237
+ end
238
+ else
239
+ STDERR.puts 'Illegal answer. Only numbers/commas/spaces allowed.'
240
+ end
241
+ end
242
+ end
243
+
244
+ def remove_file(path)
245
+ begin
246
+ File.unlink(path)
247
+ rescue => e
248
+ STDERR.puts "Error: #{e}"
249
+ end
250
+ end
251
+ end
252
+ end
@@ -0,0 +1,5 @@
1
+ # encoding: utf-8
2
+
3
+ module RDup
4
+ VERSION = '0.1.0'
5
+ end
@@ -0,0 +1,28 @@
1
+ # encoding: utf-8
2
+
3
+ require_relative 'lib/rdup/version'
4
+
5
+ Gem::Specification.new do |s|
6
+ s.name = 'rdup'
7
+ s.version = RDup::VERSION
8
+ s.date = '2015-12-05'
9
+
10
+ s.summary = 'Find and remove duplicate files in multiple directories.'
11
+ s.description = <<EOF
12
+ This program finds duplicate files in multiple directories and
13
+ interactively removes any of them as you wish. It is similar to fdupes,
14
+ but much faster. Written in pure Ruby. No external dependencies.
15
+ EOF
16
+
17
+ s.authors = ['physacco']
18
+ s.email = ['physacco@gmail.com']
19
+ s.homepage = 'https://github.com/physacco/rdup'
20
+ s.license = 'MIT'
21
+
22
+ s.files = Dir['lib/**/*.rb'] + Dir['bin/*'] +
23
+ ['README.md', 'LICENSE', 'rdup.gemspec']
24
+ s.executables = ['rdup']
25
+
26
+ s.platform = Gem::Platform::RUBY
27
+ s.required_ruby_version = '>= 2.0.0'
28
+ end
metadata ADDED
@@ -0,0 +1,55 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: rdup
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.1.0
5
+ platform: ruby
6
+ authors:
7
+ - physacco
8
+ autorequire:
9
+ bindir: bin
10
+ cert_chain: []
11
+ date: 2015-12-05 00:00:00.000000000 Z
12
+ dependencies: []
13
+ description: |
14
+ This program finds duplicate files in multiple directories and
15
+ interactively removes any of them as you wish. It is similar to fdupes,
16
+ but much faster. Written in pure Ruby. No external dependencies.
17
+ email:
18
+ - physacco@gmail.com
19
+ executables:
20
+ - rdup
21
+ extensions: []
22
+ extra_rdoc_files: []
23
+ files:
24
+ - LICENSE
25
+ - README.md
26
+ - bin/rdup
27
+ - lib/rdup.rb
28
+ - lib/rdup/version.rb
29
+ - rdup.gemspec
30
+ homepage: https://github.com/physacco/rdup
31
+ licenses:
32
+ - MIT
33
+ metadata: {}
34
+ post_install_message:
35
+ rdoc_options: []
36
+ require_paths:
37
+ - lib
38
+ required_ruby_version: !ruby/object:Gem::Requirement
39
+ requirements:
40
+ - - ">="
41
+ - !ruby/object:Gem::Version
42
+ version: 2.0.0
43
+ required_rubygems_version: !ruby/object:Gem::Requirement
44
+ requirements:
45
+ - - ">="
46
+ - !ruby/object:Gem::Version
47
+ version: '0'
48
+ requirements: []
49
+ rubyforge_project:
50
+ rubygems_version: 2.4.5
51
+ signing_key:
52
+ specification_version: 4
53
+ summary: Find and remove duplicate files in multiple directories.
54
+ test_files: []
55
+ has_rdoc: