ruby-duplicates 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA256:
+   metadata.gz: 03bc2b060ce66e3f69e705b7574ce6f3b30520fdb2da8e0e94f838c41cc4176d
+   data.tar.gz: 6f711d2390e3608f68bf079ab01a278cef417ee408439f3cf765ec8b25caeb77
+ SHA512:
+   metadata.gz: a039bbfa6462b0955a10b0390b1007b414defe9e5536dc224846e06c6e40d71d05dad6fae0b7b2223a5a56db31e875168e1fd3680c36dbf05e803da7376d4ede
+   data.tar.gz: 73c1dfdea3cefeb83c8fc9b9a3d7b58c632689218db5f7b9570cf35f36d5e0baa7b110118beb955cebbd9e2aa20f96060f5271e5c76b4c073fc3402a305e5f94
data/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2026 Bartas Urba
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,99 @@
+ # ruby-duplicates
+
+ A small duplicate-code metric for Ruby.
+
+ `ruby-duplicates` parses Ruby with the standard library `Ripper`, normalizes syntax trees so names and literal values do not dominate the comparison, fingerprints method subtrees, and reports methods with high Jaccard similarity.
+
+ It is inspired by Uncle Bob's [`dry4clj`](https://github.com/unclebob/dry4clj), which applies the same broad idea to Clojure code: compare normalized structure instead of doing plain text clone detection.
+
+ This is a metric tool, not a refactoring engine. It points at suspiciously similar methods so a human or coding agent can decide whether the duplication is accidental, intentional symmetry, or data-shaped boilerplate.
+
+ ## Install
+
+ From this repo:
+
+ ```bash
+ bundle install
+ exe/ruby-duplicates app lib test
+ ```
+
+ As a gem from a local checkout:
+
+ ```bash
+ gem build ruby-duplicates.gemspec
+ gem install ruby-duplicates-*.gem
+ ruby-duplicates app lib test
+ ```
+
+ From another project before a RubyGems release, point at the GitHub repo:
+
+ ```ruby
+ gem "ruby-duplicates", git: "https://github.com/barturba/ruby-duplicates.git"
+ ```
+
+ ## Usage
+
+ ```bash
+ ruby-duplicates [options] [file-or-directory ...]
+ ```
+
+ Examples:
+
+ ```bash
+ ruby-duplicates app lib test
+ ruby-duplicates --threshold 0.9 --min-lines 5 --min-nodes 30 app
+ ruby-duplicates --json app/models app/controllers
+ ```
+
+ Options:
+
+ ```bash
+ --threshold N      Minimum similarity score, default 0.82
+ --min-lines N      Minimum method source lines, default 4
+ --min-nodes N      Minimum normalized syntax nodes, default 20
+ --max-results N    Maximum matches to print, default 50
+ --format F         text or json, default text
+ --json             Same as --format json
+ --ignore-dir N     Directory basename or path to skip; may be repeated
+ ```
+
+ Example output:
+
+ ```text
+ ruby_duplicates candidates=3 matches=1 threshold=0.82
+
+ DUPLICATE score=1.00 shared=21
+   examples/duplicate_sample.rb:1-4 alpha nodes=64
+   examples/duplicate_sample.rb:7-10 beta nodes=64
+ ```
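
For scripted use, the `--json` output can be post-processed from Ruby. The sketch below assumes the payload shape built by `json_payload` and `candidate_payload` in `lib/ruby_duplicates/cli.rb` (`candidates`, `threshold`, and `matches` entries with `score`, `shared`, `left`, `right`). The sample values simply mirror the text example above and are not captured tool output; `summarize` and `location` are hypothetical helper names.

```ruby
require "json"

# Illustrative payload using the keys emitted by --json.
# Values mirror the README text example; they are not real captured output.
SAMPLE = <<~JSON
  {
    "candidates": 3,
    "threshold": 0.82,
    "matches": [
      {
        "score": 1.0,
        "shared": 21,
        "left":  {"file": "examples/duplicate_sample.rb", "name": "alpha",
                  "start_line": 1, "end_line": 4, "nodes": 64},
        "right": {"file": "examples/duplicate_sample.rb", "name": "beta",
                  "start_line": 7, "end_line": 10, "nodes": 64}
      }
    ]
  }
JSON

# Hypothetical helper: render "file:start-end" for one side of a match.
def location(side)
  "#{side.fetch("file")}:#{side.fetch("start_line")}-#{side.fetch("end_line")}"
end

# Hypothetical helper: one summary line per reported pair.
def summarize(payload)
  payload.fetch("matches").map do |match|
    left, right = match.values_at("left", "right")
    format("%.2f %s <-> %s", match.fetch("score"), location(left), location(right))
  end
end

puts summarize(JSON.parse(SAMPLE))
```

Using `fetch` rather than `[]` makes the script fail loudly if a key is missing, which is useful while the payload format is still at 0.1.0.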
+
+ ## How It Works
+
+ For each Ruby method, the scanner:
+
+ 1. Parses the file with `Ripper.sexp`.
+ 2. Extracts `def` and `defs` method nodes.
+ 3. Normalizes identifiers, constants, instance variables, globals, labels, strings, and numbers into token classes.
+ 4. Normalizes most non-head symbols so tiny operator/name differences do not hide repeated shape.
+ 5. Fingerprints every normalized subtree with SHA1.
+ 6. Compares method fingerprint sets with Jaccard similarity.
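
The six steps above can be sketched in a few lines of plain Ruby. This is a simplified illustration, not the gem's code: it reduces every scanner token to its type alone (the shipped `normalize_token` also keeps operator text and maps types through `TOKEN_GROUPS`), and the sample methods `alpha` and `beta` are made up for the demonstration.

```ruby
require "digest/sha1"
require "json"
require "ripper"
require "set"

# Ripper scanner tokens look like [:@ident, "name", [line, column]].
def token?(node)
  node.is_a?(Array) && node[0].is_a?(Symbol) && node[0].to_s.start_with?("@")
end

# Steps 3-4: drop names, literal values, and positions; keep the tree shape.
def normalize(node)
  if token?(node)
    [node[0]] # simplification: keep only the token type
  elsif node.is_a?(Array)
    node.each_with_index.map do |child, index|
      index.zero? && child.is_a?(Symbol) ? child : normalize(child)
    end
  elsif node.is_a?(Symbol)
    :symbol
  else
    node.nil? ? nil : :value
  end
end

# Step 2: collect every def/defs node in the tree.
def method_nodes(node, found = [])
  return found unless node.is_a?(Array)

  found << node if %i[def defs].include?(node.first)
  node.each { |child| method_nodes(child, found) }
  found
end

# Step 5: one SHA1 per normalized subtree, gathered into a set.
def fingerprints(node, acc = Set.new)
  return acc unless node.is_a?(Array)

  acc << Digest::SHA1.hexdigest(JSON.generate(node))
  node.each { |child| fingerprints(child, acc) }
  acc
end

# Step 6: size of the intersection over size of the union.
def jaccard(left, right)
  union = (left | right).size
  union.zero? ? 0.0 : (left & right).size.to_f / union
end

source = <<~RUBY
  def alpha(items)
    items.map { |item| item.size + 1 }
  end

  def beta(rows)
    rows.map { |row| row.length + 2 }
  end
RUBY

# Step 1: parse, then run the remaining steps per method.
alpha, beta = method_nodes(Ripper.sexp(source)).map { |m| fingerprints(normalize(m)) }
puts jaccard(alpha, beta)
```

The two sample methods differ only in identifier names and one integer literal, so after normalization their fingerprint sets coincide and the score is 1.0.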
+
+ The defaults intentionally favor high-signal matches. Lower `--threshold`, `--min-lines`, or `--min-nodes` when exploring.
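
As a worked illustration of where the default threshold cuts off: for fingerprint sets $A$ and $B$ of two methods, the score is the Jaccard similarity. The counts 18, 19, and 22 below are hypothetical, chosen only to straddle 0.82.

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|},
\qquad
\frac{18}{22} \approx 0.818 < 0.82,
\qquad
\frac{19}{22} \approx 0.864 \ge 0.82
```

A pair sharing 18 of 22 distinct fingerprints is therefore dropped at the default threshold, while 19 of 22 is reported. In the example output, `score=1.00 shared=21` means the two methods have the same 21 distinct fingerprints.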
+
+ ## Limits
+
+ - It only scans Ruby methods, not arbitrary repeated blocks.
+ - It is structural, not semantic.
+ - Metaprogrammed code can look sparse because the useful behavior is hidden in data.
+ - Rails controllers and tests can produce intentional symmetry. Treat those as review candidates, not automatic refactors.
+
+ ## Development
+
+ ```bash
+ ruby -Ilib test/ruby_duplicates_test.rb
+ gem build ruby-duplicates.gemspec
+ ```
+
+ ## Inspiration
+
+ - Uncle Bob's `dry4clj`: https://github.com/unclebob/dry4clj
data/exe/ruby-duplicates ADDED
@@ -0,0 +1,11 @@
+ #!/usr/bin/env ruby
+ # frozen_string_literal: true
+
+ begin
+   require "ruby_duplicates"
+ rescue LoadError
+   $LOAD_PATH.unshift File.expand_path("../lib", __dir__)
+   require "ruby_duplicates"
+ end
+
+ RubyDuplicates::CLI.run(ARGV)
data/lib/ruby_duplicates/cli.rb ADDED
@@ -0,0 +1,334 @@
+ # frozen_string_literal: true
+
+ require "digest/sha1"
+ require "find"
+ require "json"
+ require "optparse"
+ require "ripper"
+ require "set"
+
+ Candidate = Data.define(:file, :name, :start_line, :end_line, :node_count, :fingerprints) do
+   def line_count
+     end_line - start_line + 1
+   end
+
+   def location
+     "#{file}:#{start_line}-#{end_line}"
+   end
+ end
+
+ Match = Data.define(:score, :shared, :left, :right)
+
+ class RubyDuplicates
+   DEFAULT_IGNORES = %w[
+     .git
+     .bundle
+     .burndown-swarm
+     coverage
+     log
+     node_modules
+     tmp
+     vendor/bundle
+   ].freeze
+
+   TOKEN_GROUPS = {
+     :@ident => :identifier,
+     :@const => :constant,
+     :@ivar => :ivar,
+     :@cvar => :cvar,
+     :@gvar => :gvar,
+     :@label => :label,
+     :@int => :number,
+     :@float => :number,
+     :@rational => :number,
+     :@imaginary => :number,
+     :@CHAR => :string,
+     :@tstring_content => :string,
+     :@regexp_beg => :regexp,
+     :@regexp_end => :regexp,
+     :@period => :dot,
+     :@op => :operator
+   }.freeze
+
+   def initialize(options)
+     @threshold = options.fetch(:threshold)
+     @min_lines = options.fetch(:min_lines)
+     @min_nodes = options.fetch(:min_nodes)
+     @max_results = options.fetch(:max_results)
+     @format = options.fetch(:format)
+     @paths = options.fetch(:paths)
+     @ignore_dirs = options.fetch(:ignore_dirs)
+   end
+
+   def run
+     candidates = collect_candidates
+     matches = find_matches(candidates)
+
+     if @format == "json"
+       puts JSON.pretty_generate(json_payload(matches, candidates.length))
+     else
+       print_text(matches, candidates.length)
+     end
+   end
+
+   private
+
+   def collect_candidates
+     ruby_files.flat_map { |file| candidates_for_file(file) }
+       .select { |candidate| candidate.line_count >= @min_lines && candidate.node_count >= @min_nodes }
+   end
+
+   def ruby_files
+     @paths.flat_map do |path|
+       if File.directory?(path)
+         files_under(path)
+       elsif File.file?(path) && path.end_with?(".rb")
+         [path]
+       else
+         []
+       end
+     end.uniq.sort
+   end
+
+   def files_under(root)
+     files = []
+     Find.find(root) do |path|
+       if File.directory?(path) && ignored_directory?(path)
+         Find.prune
+       elsif File.file?(path) && path.end_with?(".rb")
+         files << path
+       end
+     end
+     files
+   end
+
+   def ignored_directory?(path)
+     base = File.basename(path)
+     relative = clean_path(path)
+     @ignore_dirs.include?(base) || @ignore_dirs.include?(relative)
+   end
+
+   def candidates_for_file(file)
+     source = File.read(file)
+     sexp = Ripper.sexp(source)
+     return warn_parse_failure(file) unless sexp
+
+     lines = source.lines
+     method_nodes(sexp).map do |node|
+       name, start_line = method_name_and_start(node)
+       end_line = [max_line(node), start_line].compact.max
+       end_line = [end_line, lines.length].min
+       fingerprints = Set.new
+       normalized = normalize(node)
+       collect_fingerprints(normalized, fingerprints)
+
+       Candidate.new(
+         file: clean_path(file),
+         name: name,
+         start_line: start_line,
+         end_line: end_line,
+         node_count: count_nodes(normalized),
+         fingerprints: fingerprints
+       )
+     end
+   rescue StandardError => error
+     warn "#{file}: #{error.class}: #{error.message}"
+     []
+   end
+
+   def warn_parse_failure(file)
+     warn "#{file}: could not parse Ruby source"
+     []
+   end
+
+   def method_nodes(node, found = [])
+     return found unless node.is_a?(Array)
+
+     found << node if %i[def defs].include?(node.first)
+     node.each { |child| method_nodes(child, found) if child.is_a?(Array) }
+     found
+   end
+
+   def method_name_and_start(node)
+     token = node.first == :def ? node[1] : node[3]
+     [token_value(token), token_line(token)]
+   end
+
+   def token_value(token)
+     token.is_a?(Array) ? token[1].to_s : "unknown"
+   end
+
+   def token_line(token)
+     token.is_a?(Array) && token[2].is_a?(Array) ? token[2][0].to_i : 1
+   end
+
+   def max_line(node)
+     lines = []
+     visit_tokens(node) do |token|
+       position = token[2]
+       lines << position[0].to_i if position.is_a?(Array)
+     end
+     lines.max || 1
+   end
+
+   def visit_tokens(node, &block)
+     return unless node.is_a?(Array)
+
+     if token_node?(node)
+       yield node
+       return
+     end
+
+     node.each { |child| visit_tokens(child, &block) if child.is_a?(Array) }
+   end
+
+   def normalize(node)
+     if token_node?(node)
+       normalize_token(node)
+     elsif node.is_a?(Array)
+       normalize_array(node)
+     elsif node.is_a?(Symbol)
+       :symbol
+     elsif node.nil? || node == false || node == true
+       node
+     else
+       :value
+     end
+   end
+
+   def normalize_array(node)
+     node.each_with_index.map do |child, index|
+       if index.zero? && child.is_a?(Symbol)
+         child
+       else
+         normalize(child)
+       end
+     end
+   end
+
+   def token_node?(node)
+     node.is_a?(Array) && node[0].is_a?(Symbol) && node[0].to_s.start_with?("@")
+   end
+
+   def normalize_token(token)
+     type = token[0]
+     group = TOKEN_GROUPS.fetch(type, type)
+
+     if type == :@op
+       [type, token[1].to_s]
+     else
+       [type, group]
+     end
+   end
+
+   def collect_fingerprints(node, fingerprints)
+     return unless node.is_a?(Array)
+
+     fingerprints << Digest::SHA1.hexdigest(JSON.generate(node))
+     node.each { |child| collect_fingerprints(child, fingerprints) if child.is_a?(Array) }
+   end
+
+   def count_nodes(node)
+     return 0 unless node.is_a?(Array)
+
+     1 + node.sum { |child| count_nodes(child) }
+   end
+
+   def find_matches(candidates)
+     matches = []
+     candidates.each_with_index do |left, index|
+       candidates[(index + 1)..].to_a.each do |right|
+         score, shared = jaccard(left.fingerprints, right.fingerprints)
+         next if score < @threshold
+
+         matches << Match.new(score: score, shared: shared, left: left, right: right)
+       end
+     end
+     matches.sort_by { |match| [-match.score, -match.shared, match.left.file, match.left.start_line] }.first(@max_results)
+   end
+
+   def jaccard(left, right)
+     shared = (left & right).length
+     total = (left | right).length
+     return [0.0, 0] if total.zero?
+
+     [shared.to_f / total, shared]
+   end
+
+   def print_text(matches, candidate_count)
+     puts "ruby_duplicates candidates=#{candidate_count} matches=#{matches.length} threshold=#{format_score(@threshold)}"
+     matches.each do |match|
+       puts
+       puts "DUPLICATE score=#{format_score(match.score)} shared=#{match.shared}"
+       puts "  #{match.left.location} #{match.left.name} nodes=#{match.left.node_count}"
+       puts "  #{match.right.location} #{match.right.name} nodes=#{match.right.node_count}"
+     end
+   end
+
+   def json_payload(matches, candidate_count)
+     {
+       candidates: candidate_count,
+       threshold: @threshold,
+       matches: matches.map do |match|
+         {
+           score: match.score,
+           shared: match.shared,
+           left: candidate_payload(match.left),
+           right: candidate_payload(match.right)
+         }
+       end
+     }
+   end
+
+   def candidate_payload(candidate)
+     {
+       file: candidate.file,
+       name: candidate.name,
+       start_line: candidate.start_line,
+       end_line: candidate.end_line,
+       nodes: candidate.node_count
+     }
+   end
+
+   def clean_path(path)
+     path = File.expand_path(path)
+     cwd = "#{Dir.pwd}/"
+     path.start_with?(cwd) ? path.delete_prefix(cwd) : path
+   end
+
+   def format_score(score)
+     format("%.2f", score)
+   end
+ end
+
+ class RubyDuplicates::CLI
+   def self.run(argv = ARGV)
+     options = {
+       threshold: 0.82,
+       min_lines: 4,
+       min_nodes: 20,
+       max_results: 50,
+       format: "text",
+       ignore_dirs: RubyDuplicates::DEFAULT_IGNORES.dup
+     }
+
+     parser = OptionParser.new do |opts|
+       opts.banner = "Usage: ruby-duplicates [options] [file-or-directory ...]"
+
+       opts.on("--threshold N", Float, "Minimum structural similarity score, default 0.82") { |value| options[:threshold] = value }
+       opts.on("--min-lines N", Integer, "Minimum method source lines, default 4") { |value| options[:min_lines] = value }
+       opts.on("--min-nodes N", Integer, "Minimum normalized syntax nodes, default 20") { |value| options[:min_nodes] = value }
+       opts.on("--max-results N", Integer, "Maximum matches to print, default 50") { |value| options[:max_results] = value }
+       opts.on("--format FORMAT", "text or json, default text") { |value| options[:format] = value }
+       opts.on("--json", "Same as --format json") { options[:format] = "json" }
+       opts.on("--ignore-dir NAME", "Directory basename or path to skip; may be repeated") { |value| options[:ignore_dirs] << value }
+     end
+
+     parser.parse!(argv)
+     options[:paths] = argv.empty? ? ["."] : argv
+
+     abort "format must be text or json" unless %w[text json].include?(options[:format])
+
+     RubyDuplicates.new(options).run
+   end
+ end
data/lib/ruby_duplicates/version.rb ADDED
@@ -0,0 +1,5 @@
+ # frozen_string_literal: true
+
+ class RubyDuplicates
+   VERSION = "0.1.0"
+ end
data/lib/ruby_duplicates.rb ADDED
@@ -0,0 +1,4 @@
+ # frozen_string_literal: true
+
+ require_relative "ruby_duplicates/version"
+ require_relative "ruby_duplicates/cli"
metadata ADDED
@@ -0,0 +1,50 @@
+ --- !ruby/object:Gem::Specification
+ name: ruby-duplicates
+ version: !ruby/object:Gem::Version
+   version: 0.1.0
+ platform: ruby
+ authors:
+ - Bartas Urba
+ bindir: exe
+ cert_chain: []
+ date: 1980-01-02 00:00:00.000000000 Z
+ dependencies: []
+ description: A small duplicate-code metric for Ruby that compares normalized Ripper
+   syntax fingerprints with Jaccard similarity. Inspired by Uncle Bob's dry4clj.
+ email:
+ - b@bartas.co
+ executables:
+ - ruby-duplicates
+ extensions: []
+ extra_rdoc_files: []
+ files:
+ - LICENSE
+ - README.md
+ - exe/ruby-duplicates
+ - lib/ruby_duplicates.rb
+ - lib/ruby_duplicates/cli.rb
+ - lib/ruby_duplicates/version.rb
+ homepage: https://github.com/barturba/ruby-duplicates
+ licenses:
+ - MIT
+ metadata:
+   source_code_uri: https://github.com/barturba/ruby-duplicates
+   inspiration_uri: https://github.com/unclebob/dry4clj
+ rdoc_options: []
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '3.2'
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '0'
+ requirements: []
+ rubygems_version: 3.6.9
+ specification_version: 4
+ summary: Find structurally similar Ruby methods.
+ test_files: []