memdump 0.1.0 → 0.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: bb8daaffee80155c4227920118ed3793e3afe4ab
4
- data.tar.gz: b57f30d96b9e35d8519d800cbca852ef6995d965
3
+ metadata.gz: deaf03849e0949a5cf0f6150ea598b02e55411cb
4
+ data.tar.gz: e8d53128488b1d0c83392b0b56eed940327d5a0a
5
5
  SHA512:
6
- metadata.gz: 84b9700c70b49d35b7c7910e1a047d4f1fd9aba7161888090bc3190e6e983a835b9417363376c3eabe648763eb537e57d89726440f9b8ac022b0920be60bd2fd
7
- data.tar.gz: 206a0adbfd4c9f3b8d10f21c0ad07f911384977f79dee4a22e01af9c3710ab6e430e6f4ade72a37655e24eef2137025cec050dce232d2dacb0db75d1ae8f86e3
6
+ metadata.gz: bf78d4e885d83b66e47f8f642b90ba74117a3b7c5a6963ce602f0dbbbe5eab1c19ce2bce1a54f01290f787926a2170520b39e20661ce23ef9f4ee08d1fc2ee68
7
+ data.tar.gz: bbb79a73ec1e0dc13e42040380c12168ebdabc2e9c91d9801421aa025245c1eb3e6070b79f886642d275504ce74749afdd7bdc17c5ccd22d867db101164134b7
data/Gemfile CHANGED
@@ -1,4 +1,6 @@
1
1
  source 'https://rubygems.org'
2
2
 
3
+ gem 'rbtrace', platform: 'mri'
4
+
3
5
  # Specify your gem's dependencies in memdump.gemspec
4
6
  gemspec
data/README.md CHANGED
@@ -86,10 +86,10 @@ Allocation tracing is enabled with
86
86
 
87
87
  ~~~ ruby
88
88
  require 'objspace'
89
- ObjectSpace.trace_objects_allocation_start
89
+ ObjectSpace.trace_object_allocations_start
90
90
  ~~~
91
91
 
92
- ## Analyzing the dump
92
+ ## Basic analysis
93
93
 
94
94
  The first thing you will probably want to do is to run the replace-class command
95
95
  on the dump. It replaces the class attribute, which in the original dump is the
@@ -105,13 +105,122 @@ count by class. For memory leaks, the **diff** command allows you to output the
105
105
  part of the graph that involves new objects (removing the
106
106
  "old-and-not-referred-to-by-new")
107
107
 
108
+ Beyond this, analyzing the dump is best done through the interactive mode:
109
+
110
+ ```
111
+ memdump interactive /tmp/mydump
112
+ ```
113
+
114
+ will get you a pry shell in the context of the loaded MemoryDump object. Use
115
+ the MemoryDump API to filter out what you need. If you're dealing with big dumps,
116
+ it is usually a good idea to save them regularly with `#save`.
117
+
118
+ One useful call to make at the beginning is `#common_cleanup`. It collapses the
119
+ common collections (Array, Set, Hash) as well as internal bookkeeping objects
120
+ (ICLASS, …). I usually run this, save the result and re-load the result (which
121
+ is usually significantly smaller).
122
+
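The collapse operation can be pictured on a toy graph, independent of the memdump API: removing a node rewires its parents straight to its children, so reachability is preserved. A minimal sketch:

```ruby
require 'set'

# graph: address => set of referenced addresses
graph = {
  'a' => Set['h'],        # a -> h -> {b, c}; 'h' stands in for a
  'h' => Set['b', 'c'],   # Hash we want collapsed out of the picture
  'b' => Set[],
  'c' => Set[],
}

# Collapse +victim+: every parent of the victim inherits its
# children, then the victim itself is dropped from the graph.
def collapse(graph, victim)
  children = graph.delete(victim)
  graph.each_value do |refs|
    refs.merge(children) if refs.delete?(victim)
  end
  graph
end

collapse(graph, 'h')
# 'a' now points straight at 'b' and 'c'
```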
123
+ After that, the usual process is to find out which non-standard classes are
124
+ unexpectedly present in high numbers using `stats`, extract the objects from
125
+ these classes with `dump = objects_of_class('classname')` and the subgraph that
126
+ keeps them alive with `roots_of(dump)`.
127
+
128
+ ```
129
+ # Get the subgraph of all objects whose class name matches /Plan/ and export
130
+ # it to GML to process with Gephi (see below)
131
+ parent_dump = roots_of(objects_of_class(/Plan/))
132
+ parent_dump.to_gml('plan-subgraph.gml')
133
+ ```
134
+
135
+ Once you start filtering dumps, don't forget to simplify your life by `cd`'ing
136
+ in the context of the newly filtered dumps.
137
+
108
138
  Beyond that, I usually go back and forth between the memory dump and
109
- [gephi](http://gephi.org), a graph analysis application. the **gml** command
110
- allows to convert the memory dump into a graph format that gephi can import.
111
- From there, use gephi's layouting and filtering algorithms to get an idea of the
112
- most likely objects. Then, you can "massage" the dump using the **root_of**,
113
- **subgraph_of** and **remove-node** commands to narrow the dump to its most useful
114
- parts.
139
+ [gephi](http://gephi.org), a graph analysis application. `to_gml` lets you
140
+ convert the memory dump into a graph format that gephi can import. From there,
141
+ use gephi's layout and filtering algorithms to get an idea of the shape of
142
+ the dump. Note that you first need to get the graph below a few tens of
143
+ thousands of objects before gephi can handle it.
144
+
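The GML format itself is plain text and simple to emit by hand. This is a toy writer for illustration (not memdump's own `to_gml` implementation):

```ruby
require 'stringio'

# A tiny object graph: address => label, plus edge pairs
nodes = { '0x1' => 'MyClass', '0x2' => 'String' }
edges = [['0x1', '0x2']]

io = StringIO.new
io.puts 'graph'
io.puts '['
nodes.each do |address, label|
  io.puts '  node'
  io.puts '  ['
  io.puts "    id #{address}"
  io.puts "    label \"#{label}\""
  io.puts '  ]'
end
edges.each do |source, target|
  io.puts '  edge'
  io.puts '  ['
  io.puts "    source #{source}"
  io.puts "    target #{target}"
  io.puts '  ]'
end
io.puts ']'
puts io.string
```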
145
+ ## Dump diffs
146
+
147
+ One powerful way to find out where memory is leaked is to look at objects that
148
+ got allocated and find the interface between the long-term objects and these
149
+ objects. memdump supports this by computing diffs.
150
+
151
+ If you mean to use dump diffs you **MUST** enable allocation tracing. Not doing
152
+ so will make the diffs inaccurate, as memdump will not be able to recognize that some
153
+ object addresses have been reused after a garbage collection pass.
154
+
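The underlying idea can be sketched with plain address sets on two tiny hand-written dumps (stdlib only, no memdump API): anything in the after dump whose address is absent from the before dump is treated as newly allocated.

```ruby
require 'set'
require 'json'

# Two minimal hand-written heap dumps, one JSON record per line
before = <<~JSON.each_line.map { |l| JSON.parse(l) }
  {"address": "0x1", "type": "OBJECT"}
  {"address": "0x2", "type": "STRING"}
JSON
after = <<~JSON.each_line.map { |l| JSON.parse(l) }
  {"address": "0x1", "type": "OBJECT"}
  {"address": "0x2", "type": "STRING"}
  {"address": "0x3", "type": "ARRAY"}
JSON

# The diff is the set of records whose address only appears in `after`
before_addresses = before.map { |r| r['address'] }.to_set
new_objects = after.reject { |r| before_addresses.include?(r['address']) }
# new_objects now holds only the record for 0x3
```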
155
+ Let's assume that we have two dumps, "before.json" and "after.json". Start an
156
+ interactive shell loading `before`:
157
+
158
+ ```
159
+ memdump interactive before.json
160
+ ```
161
+
162
+ Then, in the shell, let's load the after dump:
163
+
164
+ ```
165
+ > after = MemDump::JSONDump.load('after.json')
166
+ ```
167
+
168
+ The set of objects that are in `after` but not in `before` is given by `#diff`:
169
+
170
+ ```
171
+ d = diff(after)
172
+ ```
173
+
174
+ We'll also add a special marker to the records in `d` so that we can easily colorize
175
+ them differently in Gephi.
176
+
177
+ ```
178
+ d = d.map { |r| r['in_after'] = 1; r }
179
+ ```
180
+
181
+ ## Case 1: few new objects are linked to the old ones
182
+
183
+ One possibility is that there are only a few objects in the diff that are kept
184
+ alive from `before`. These objects in turn keep alive a lot more objects (which
185
+ cause the noticeable memory leak). What's interesting in this case is to
186
+ visualize the interface, that is, that set of objects.
187
+
188
+ In memdump, one computes it with the `interface_with` method, which computes the
189
+ interface between the receiver and the argument. The receiver must contain the
190
+ edges between itself and the argument, which means in our case that we must use
191
+ `after`.
192
+
193
+ ```
194
+ self_border, diff_border = after.interface_with(d)
195
+ ```
196
+
197
+ In addition to computing the border, it computes the count of objects that are
198
+ kept alive by each object in `diff_border`. Each record in `diff_border` has an
199
+ attribute called `keepalive_count` that counts the number of nodes in `after`
200
+ that are reachable from (i.e. kept alive by) it. It is usually a good idea to
201
+ visualize the distribution of `keepalive_count` to see whether there are indeed
202
+ only a few nodes, and whether some are keeping a lot more objects alive than
203
+ others. Note that cycles that involve more than one "border node" will be
204
+ counted multiple times (so the sum of `keepalive_count` will be higher than
205
+ `d.size`).
206
+
207
+ ```
208
+ diff_border.size # is this much smaller than d.size ?
209
+ diff_border.each_record.map { |r| r['keepalive_count'] }.sort.reverse # are there some high counts at the top ?
210
+ ```
211
+
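The `keepalive_count` described above is essentially a reachability count. On a toy hash-of-arrays graph it can be sketched as a depth-first traversal (a simplified model, not memdump's implementation):

```ruby
require 'set'

graph = {
  'root' => %w[a b],
  'a'    => %w[c],
  'b'    => %w[c],
  'c'    => [],
}

# Count every node reachable from +start+, including itself.
# A shared node like 'c' is counted once per traversal, which is
# why summing counts over several border nodes can exceed the
# total number of new objects.
def keepalive_count(graph, start)
  seen = Set.new
  stack = [start]
  while (address = stack.pop)
    next unless seen.add?(address)
    stack.concat(graph.fetch(address, []))
  end
  seen.size
end

keepalive_count(graph, 'root') # counts root, a, b and c
```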
212
+ From there, one needs to do a bunch of back-and-forth between memdump and Gephi.
213
+ What I usually do is start by dumping the whole subgraph that contains the border
214
+ and visualize. If I can't make any sense of it, I isolate the high-count elements
215
+ in the border and visualize the related subgraph:
216
+
217
+ ```
218
+ full_subgraph = after.roots_of(diff_border)
219
+ full_subgraph.to_gml 'full.gml'
220
+ filtered_border = diff_border.find_all { |r| r['keepalive_count'] > 1000 }
221
+ filtered_subgraph = after.roots_of(filtered_border)
222
+ filtered_subgraph.to_gml 'filtered.gml'
223
+ ```
115
224
 
116
225
  ## Contributing
117
226
 
@@ -1,2 +1,3 @@
1
+ #! /usr/bin/env ruby
1
2
  require 'memdump/cli'
2
3
  MemDump::CLI.start(ARGV)
@@ -1,5 +1,22 @@
1
+ require 'rgl/adjacency'
2
+ require 'rgl/dijkstra'
3
+ require 'rgl/traversal'
4
+
1
5
  require "memdump/version"
6
+ require 'memdump/json_dump'
7
+ require 'memdump/memory_dump'
8
+
9
+ require 'memdump/cleanup_references'
10
+ require 'memdump/common_ancestor'
11
+ require 'memdump/convert_to_gml'
12
+ require 'memdump/out_degree'
13
+ require 'memdump/remove_node'
14
+ require 'memdump/replace_class_address_by_name'
15
+ require 'memdump/root_of'
16
+ require 'memdump/subgraph_of'
2
17
 
3
- module Memdump
4
- # Your code goes here...
18
+ module MemDump
19
+ def self.pry(dump)
20
+ binding.pry
21
+ end
5
22
  end
@@ -1,6 +1,6 @@
1
1
  require 'thor'
2
2
  require 'pathname'
3
- require 'memdump/json_dump'
3
+ require 'memdump'
4
4
 
5
5
  module MemDump
6
6
  class CLI < Thor
@@ -17,17 +17,14 @@ module MemDump
17
17
 
18
18
  desc 'diff SOURCE TARGET OUTPUT', 'generate a memory dump that contains the objects in TARGET not in SOURCE, and all their parents'
19
19
  def diff(source, target, output)
20
- require 'memdump/diff'
21
-
22
- STDOUT.sync = true
23
- from = MemDump::JSONDump.new(Pathname.new(source))
24
- to = MemDump::JSONDump.new(Pathname.new(target))
25
- records = MemDump.diff(from, to)
26
- File.open(output, 'w') do |io|
27
- records.each do |r|
28
- io.puts JSON.dump(r)
29
- end
30
- end
20
+ from = MemDump::JSONDump.load(source)
21
+ to = MemDump::JSONDump.load(target)
22
+ diff = from.diff(to)
23
+ STDOUT.sync = true
24
+ puts "#{diff.size} nodes are in target but not in source"
25
+ diff = to.roots_of(diff)
26
+ puts "#{diff.size} nodes in final dump"
27
+ diff.save(output)
31
28
  end
32
29
 
33
30
  desc 'gml DUMP GML', 'converts a memory dump into a graph in the GML format (for processing by e.g. gephi)'
@@ -82,13 +79,9 @@ module MemDump
82
79
  if output_path then Pathname.new(output_path)
83
80
  else dump_path
84
81
  end
85
- dump = MemDump::JSONDump.new(dump_path)
86
- result = MemDump.replace_class_address_by_name(dump, add_reference_to_class: options[:add_ref])
87
- output_path.open('w') do |io|
88
- result.each do |r|
89
- io.puts JSON.dump(r)
90
- end
91
- end
82
+ dump = MemDump::JSONDump.load(dump_path)
83
+ dump = dump.replace_class_id_by_class_name(add_reference_to_class: options[:add_ref])
84
+ dump.save(output_path)
92
85
  end
93
86
 
94
87
  desc 'cleanup-refs DUMP OUTPUT', "removes references to deleted objects"
@@ -121,13 +114,39 @@ module MemDump
121
114
  def stats(dump)
122
115
  require 'pp'
123
116
  require 'memdump/stats'
124
- dump = MemDump::JSONDump.new(Pathname.new(dump))
125
- unknown, by_type = MemDump.stats(dump)
117
+ dump = MemDump::JSONDump.load(dump)
118
+ unknown, by_type = dump.stats
126
119
  puts "#{unknown} objects without a known type"
127
120
  by_type.sort_by { |n, v| v }.reverse.each do |n, v|
128
121
  puts "#{n}: #{v}"
129
122
  end
130
123
  end
124
+
125
+ desc 'out_degree DUMP', 'display the direct count of objects held by each object in the dump'
126
+ option "min", desc: "hide the objects whose degree is lower than this",
127
+ type: :numeric
128
+ def out_degree(dump)
129
+ dump = MemDump::JSONDump.new(Pathname.new(dump))
130
+ min = options[:min] || 0
131
+ sorted = dump.each_record.sort_by { |r| (r['references'] || Array.new).size }
132
+ sorted.each do |r|
133
+ size = (r['references'] || Array.new).size
134
+ next if size < min
135
+ puts "#{size} #{r}"
136
+ end
137
+ end
138
+
139
+ desc 'interactive DUMP', 'loads a dump file and spawn a pry shell'
140
+ option :load, desc: 'load the whole dump in memory', type: :boolean, default: true
141
+ def interactive(dump)
142
+ require 'memdump'
143
+ require 'pry'
144
+ dump = MemDump::JSONDump.new(Pathname.new(dump))
145
+ if options[:load]
146
+ dump = dump.load
147
+ end
148
+ dump.pry
149
+ end
131
150
  end
132
151
  end
133
152
 
@@ -0,0 +1,44 @@
1
+ module MemDump
2
+ def self.common_ancestors(dump, class_name, threshold: 0.1)
3
+ selected_records = Hash.new
4
+ remaining_records = Array.new
5
+ dump.each_record do |r|
6
+ if class_name === r['class']
7
+ selected_records[r['address']] = r
8
+ else
9
+ remaining_records << r
10
+ end
11
+ end
12
+
25
+
26
+ count = 0
27
+ while count != selected_records.size
28
+ count = selected_records.size
29
+ remaining_records.delete_if do |r|
30
+ references = r['references']
31
+ if references && references.any? { |a| selected_records.has_key?(a) }
32
+ address = (r['address'] || r['root'])
33
+ selected_records[address] = r
34
+ end
35
+ end
36
+ end
37
+
38
+ selected_records.values.reverse.each do |r|
39
+ if refs = r['references']
40
+ refs.delete_if { |a| !selected_records.has_key?(a) }
41
+ end
42
+ end
43
+ end
44
+ end
@@ -1,47 +1,37 @@
1
- require 'set'
2
-
3
1
  module MemDump
4
2
  def self.convert_to_gml(dump, io)
5
- nodes = dump.each_record.map do |row|
6
- if row['class_address'] # transformed with replace_class_address_by_name
7
- name = row['class']
8
- else
9
- name = row['struct'] || row['root'] || row['type']
10
- end
11
-
12
- address = row['address'] || row['root']
13
- refs = Hash.new
14
- if row_refs = row['references']
15
- row_refs.each { |r| refs[r] = nil }
16
- end
17
-
18
- [address, refs, name]
19
- end
20
-
21
3
  io.puts "graph"
22
4
  io.puts "["
23
- known_addresses = Set.new
24
- nodes.each do |address, refs, name|
25
- known_addresses << address
5
+
6
+ edges = []
7
+ dump.each_record do |row|
8
+ address = row['address']
9
+
26
10
  io.puts " node"
27
11
  io.puts " ["
28
12
  io.puts " id #{address}"
29
- io.puts " label \"#{name}\""
13
+ row.each do |key, value|
14
+ if value.respond_to?(:to_str)
15
+ io.puts " #{key} \"#{value}\""
16
+ elsif value.kind_of?(Numeric)
17
+ io.puts " #{key} #{value}"
18
+ end
19
+ end
30
20
  io.puts " ]"
31
- end
32
21
 
33
- nodes.each do |address, refs, _|
34
- refs.each do |ref_address, ref_label|
35
- io.puts " edge"
36
- io.puts " ["
37
- io.puts " source #{address}"
38
- io.puts " target #{ref_address}"
39
- if ref_label
40
- io.puts " label \"#{ref_label}\""
41
- end
42
- io.puts " ]"
22
+ row['references'].each do |ref_address|
23
+ edges << address << ref_address
43
24
  end
44
25
  end
26
+
27
+ edges.each_slice(2) do |address, ref_address|
28
+ io.puts " edge"
29
+ io.puts " ["
30
+ io.puts " source #{address}"
31
+ io.puts " target #{ref_address}"
32
+ io.puts " ]"
33
+ end
34
+
45
35
  io.puts "]"
46
36
  end
47
37
  end
@@ -1,22 +1,65 @@
1
+ require 'pathname'
+ require 'set'
1
2
  require 'json'
2
3
  module MemDump
3
4
  class JSONDump
5
+ def self.load(filename)
6
+ new(filename).load
7
+ end
8
+
4
9
  def initialize(filename)
5
- @filename = filename
10
+ @filename = Pathname(filename)
6
11
  end
7
12
 
8
13
  def each_record
9
14
  return enum_for(__method__) if !block_given?
10
15
 
11
- if @cached_entries
12
- @cached_entries.each(&proc)
13
- else
14
- @filename.open do |f|
15
- f.each_line do |line|
16
- yield JSON.parse(line)
16
+ @filename.open do |f|
17
+ f.each_line do |line|
18
+ r = JSON.parse(line)
19
+ r['address'] ||= r['root']
20
+ r['references'] ||= Set.new
21
+ yield r
22
+ end
23
+ end
24
+ end
25
+
26
+ def load
27
+ address_to_record = Hash.new
28
+ generations = Hash.new
29
+ each_record do |r|
30
+ if !(address = r['address'])
31
+ raise "no address in #{r}"
32
+ end
33
+ r = r.dup
34
+
35
+ if generation = r['generation']
36
+ generations[address] = r['address'] = "#{address}:#{generation}"
37
+ end
38
+ r['references'] = r['references'].to_set
39
+ address_to_record[r['address']] = r
40
+ end
41
+
42
+ if !generations.empty?
43
+ address_to_record.each_value do |r|
44
+ if class_address = r['class']
45
+ r['class'] = generations.fetch(class_address, class_address)
46
+ end
47
+ if class_address = r['class_address']
48
+ r['class_address'] = generations.fetch(class_address, class_address)
17
49
  end
50
+
51
+ refs = Set.new
52
+ r['references'].each do |ref_address|
53
+ refs << generations.fetch(ref_address, ref_address)
54
+ end
55
+ r['references'] = refs
18
56
  end
19
57
  end
58
+ MemoryDump.new(address_to_record)
59
+ end
60
+
61
+ def inspect
62
+ to_s
20
63
  end
21
64
  end
22
65
  end
@@ -0,0 +1,662 @@
1
+ module MemDump
2
+ class MemoryDump
3
+ attr_reader :address_to_record
4
+
5
+ def initialize(address_to_record)
6
+ @address_to_record = address_to_record
7
+ @forward_graph = nil
8
+ @backward_graph = nil
9
+ end
10
+
11
+ def include?(address)
12
+ address_to_record.has_key?(address)
13
+ end
14
+
15
+ def each_record(&block)
16
+ address_to_record.each_value(&block)
17
+ end
18
+
19
+ def addresses
20
+ address_to_record.keys
21
+ end
22
+
23
+ def size
24
+ address_to_record.size
25
+ end
26
+
27
+ def find_by_address(address)
28
+ address_to_record[address]
29
+ end
30
+
31
+ def inspect
32
+ to_s
33
+ end
34
+
35
+ def save(io_or_path)
36
+ if io_or_path.respond_to?(:open)
37
+ io_or_path.open 'w' do |io|
38
+ save(io)
39
+ end
40
+ else
41
+ each_record do |r|
42
+ io_or_path.puts JSON.dump(r)
43
+ end
44
+ end
45
+ end
46
+
47
+ # Filter the records
48
+ #
49
+ # @yieldparam record a record
50
+ # @yieldreturn [Object] the record object that should be included in the
51
+ # returned dump
52
+ # @return [MemoryDump]
53
+ def find_all
54
+ return enum_for(__method__) if !block_given?
55
+
56
+ address_to_record = Hash.new
57
+ each_record do |r|
58
+ if yield(r)
59
+ address_to_record[r['address']] = r
60
+ end
61
+ end
62
+ MemoryDump.new(address_to_record)
63
+ end
64
+
65
+ # Map the records
66
+ #
67
+ # @yieldparam record a record
68
+ # @yieldreturn [Object] the record object that should be included in the
69
+ # returned dump
70
+ # @return [MemoryDump]
71
+ def map
72
+ return enum_for(__method__) if !block_given?
73
+
74
+ address_to_record = Hash.new
75
+ each_record do |r|
76
+ address_to_record[r['address']] = yield(r.dup).to_hash
77
+ end
78
+ MemoryDump.new(address_to_record)
79
+ end
80
+
81
+ # Filter the entries, removing those for which the block returns falsy
82
+ #
83
+ # @yieldparam record a record
84
+ # @yieldreturn [nil,Object] either a record object, or falsy to remove
85
+ # this record in the returned dump
86
+ # @return [MemoryDump]
87
+ def find_and_map
88
+ return enum_for(__method__) if !block_given?
89
+
90
+ address_to_record = Hash.new
91
+ each_record do |r|
92
+ if result = yield(r.dup)
93
+ address_to_record[r['address']] = result.to_hash
94
+ end
95
+ end
96
+ MemoryDump.new(address_to_record)
97
+ end
98
+
99
+ # Return the records of a given type
100
+ #
101
+ # @param [String] name the type
102
+ # @return [MemoryDump] the matching records
103
+ #
104
+ # @example return all ICLASS (singleton) records
105
+ # objects_of_class("ICLASS")
106
+ def objects_of_type(name)
107
+ find_all { |r| name === r['type'] }
108
+ end
109
+
110
+ # Return the records of a given class
111
+ #
112
+ # @param [String] name the class
113
+ # @return [MemoryDump] the matching entries
114
+ #
115
+ # @example return all string records
116
+ # objects_of_class("String")
117
+ def objects_of_class(name)
118
+ find_all { |r| name === r['class'] }
119
+ end
120
+
121
+ # Return the entries that refer to the entries in the dump
122
+ #
123
+ # @param [MemoryDump] the set of entries whose parents we're looking for
124
+ # @param [Integer] min only return the entries in self that refer to
125
+ # more than this much entries in 'dump'
126
+ # @param [Boolean] exclude_dump exclude the entries that are already in
127
+ # 'dump'
128
+ # @return [(MemoryDump,Hash)] the parent entries, and a mapping from
129
+ # records in the parent entries to the count of entries in 'dump' they
130
+ # refer to
131
+ def parents_of(dump, min: 0, exclude_dump: false)
132
+ children = dump.addresses.to_set
133
+ counts = Hash.new
134
+ filtered = find_all do |r|
135
+ next if exclude_dump && children.include?(r['address'])
136
+
137
+ count = r['references'].count { |r| children.include?(r) }
138
+ if count > min
139
+ counts[r] = count
140
+ true
141
+ end
142
+ end
143
+ return filtered, counts
144
+ end
145
+
146
+ # Remove entries from this dump, keeping the transitivity in the
147
+ # remaining graph
148
+ #
149
+ # @param [MemoryDump] entries entries to remove
150
+ #
151
+ # @example remove all entries that are of type HASH
152
+ # collapse(objects_of_type('HASH'))
153
+ def collapse(entries)
154
+ collapsed_entries = Hash.new
155
+ entries.each_record do |r|
156
+ collapsed_entries[r['address']] = r['references'].dup
157
+ end
158
+
159
+
160
+ # Remove references in-between the entries to collapse
161
+ already_expanded = Hash.new { |h, k| h[k] = Set[k] }
162
+ begin
163
+ changed_entries = Hash.new
164
+ collapsed_entries.each do |address, references|
165
+ sets = references.classify { |ref_address| collapsed_entries.has_key?(ref_address) }
166
+ updated_references = sets[false] || Set.new
167
+ if to_collapse = sets[true]
168
+ to_collapse.each do |ref_address|
169
+ next if already_expanded[address].include?(ref_address)
170
+ updated_references.merge(collapsed_entries[ref_address])
171
+ end
172
+ already_expanded[address].merge(to_collapse)
173
+ changed_entries[address] = updated_references
174
+ end
175
+ end
176
+ puts "#{changed_entries.size} changed entries"
177
+ collapsed_entries.merge!(changed_entries)
178
+ end while !changed_entries.empty?
179
+
180
+ find_and_map do |record|
181
+ next if collapsed_entries.has_key?(record['address'])
182
+
183
+ sets = record['references'].classify do |ref_address|
184
+ collapsed_entries.has_key?(ref_address)
185
+ end
186
+ updated_references = sets[false] || Set.new
187
+ if to_collapse = sets[true]
188
+ to_collapse.each do |ref_address|
189
+ updated_references.merge(collapsed_entries[ref_address])
190
+ end
191
+ record = record.dup
192
+ record['references'] = updated_references
193
+ end
194
+ record
195
+ end
196
+ end
197
+
198
+ # Remove entries from the dump, and all references to them
199
+ #
200
+ # @param [MemoryDump] the set of entries to remove, as e.g. returned by
201
+ # {#objects_of_class}
202
+ # @return [MemoryDump] the filtered dump
203
+ def without(entries)
204
+ find_and_map do |record|
205
+ next if entries.include?(record['address'])
206
+ record_refs = record['references']
207
+ references = record_refs.find_all { |r| !entries.include?(r) }
208
+ if references.size != record_refs.size
209
+ record = record.dup
210
+ record['references'] = references.to_set
211
+ end
212
+ record
213
+ end
214
+ end
215
+
216
+ # Write the dump to a GML file that can loaded by Gephi
217
+ #
218
+ # @param [Pathname,String,IO] the path or the IO stream into which we should
219
+ # dump
220
+ def to_gml(io_or_path)
221
+ if io_or_path.kind_of?(IO)
222
+ MemDump.convert_to_gml(self, io_or_path)
223
+ else
224
+ Pathname(io_or_path).open 'w' do |io|
225
+ to_gml(io)
226
+ end
227
+ end
228
+ nil
229
+ end
230
+
231
+ # Save the dump
232
+ def save(io_or_path)
233
+ if io_or_path.kind_of?(IO)
234
+ each_record do |r|
235
+ r = r.dup
236
+ r['address'] = r['address'].gsub(/:\d+$/, '')
237
+ if r['class_address']
238
+ r['class_address'] = r['class_address'].gsub(/:\d+$/, '')
239
+ elsif r['address']
240
+ r['address'] = r['address'].gsub(/:\d+$/, '')
241
+ end
242
+ r['references'] = r['references'].map { |ref_addr| ref_addr.gsub(/:\d+$/, '') }
243
+ io_or_path.puts JSON.dump(r)
244
+ end
245
+ nil
246
+ else
247
+ Pathname(io_or_path).open 'w' do |io|
248
+ save(io)
249
+ end
250
+ end
251
+ end
252
+
253
+ COMMON_COLLAPSE_TYPES = %w{IMEMO HASH ARRAY}
254
+ COMMON_COLLAPSE_CLASSES = %w{Set RubyVM::Env}
255
+
256
+ # Perform common initial cleanup
257
+ #
258
+ # It basically removes common classes that usually make a dump analysis
259
+ # more complicated without providing more information
260
+ #
261
+ # Namely, it collapses internal Ruby node types ROOT and IMEMO, as well
262
+ # as common collection classes {COMMON_COLLAPSE_CLASSES}.
263
+ #
264
+ # One usually analyses a cleaned-up dump before getting into the full
265
+ # dump
266
+ #
267
+ # @return [MemDump] the filtered dump
268
+ def common_cleanup
269
+ without_weakrefs = remove(objects_of_class 'WeakRef')
270
+ to_collapse = without_weakrefs.find_all do |r|
271
+ COMMON_COLLAPSE_CLASSES.include?(r['class']) ||
272
+ COMMON_COLLAPSE_TYPES.include?(r['type']) ||
273
+ r['method'] == 'dump_all'
274
+ end
275
+ without_weakrefs.collapse(to_collapse)
276
+ end
277
+
278
+ # Remove entries in the reference for which we can't find an object with
279
+ # the matching address
280
+ #
281
+ # @return [(MemoryDump,Set)] the filtered dump and the set of missing addresses found
282
+ def remove_invalid_references
283
+ addresses = self.addresses.to_set
284
+ missing = Set.new
285
+ result = map do |r|
286
+ common = (addresses & r['references'])
287
+ if common.size != r['references'].size
288
+ missing.merge(r['references'] - common)
289
+ end
290
+ r = r.dup
291
+ r['references'] = common
292
+ r
293
+ end
294
+ return result, missing
295
+ end
296
+
297
+ # Return the graph of object that keeps objects in dump alive
298
+ #
299
+ # It contains only the shortest paths from the roots to the objects in
300
+ # dump
301
+ #
302
+ # @param [MemoryDump] dump
303
+ # @return [MemoryDump]
304
+ def roots_of(dump, root_dump: nil)
305
+ if root_dump && root_dump.empty?
306
+ raise ArgumentError, "no roots provided"
307
+ end
308
+
309
+ root_addresses =
310
+ if root_dump then root_dump.addresses
311
+ else
312
+ ['ALL_ROOTS']
313
+ end
314
+
315
+ ensure_graphs_computed
316
+
317
+ result_nodes = Set.new
318
+ dump_addresses = dump.addresses
319
+ root_addresses.each do |root_address|
320
+ visitor = RGL::DijkstraVisitor.new(@forward_graph)
321
+ dijkstra = RGL::DijkstraAlgorithm.new(@forward_graph, Hash.new(1), visitor)
322
+ dijkstra.find_shortest_paths(root_address)
323
+ path_builder = RGL::PathBuilder.new(root_address, visitor.parents_map)
324
+
325
+ dump_addresses.each_with_index do |record_address, record_i|
326
+ if path = path_builder.path(record_address)
327
+ result_nodes.merge(path)
328
+ end
329
+ end
330
+ end
331
+
332
+ find_and_map do |record|
333
+ address = record['address']
334
+ next if !result_nodes.include?(address)
335
+
336
+ # Prefer records in 'dump' to allow for annotations in the
337
+ # source
338
+ record = dump.find_by_address(address) || record
339
+ record = record.dup
340
+ record['references'] = result_nodes & record['references']
341
+ record
342
+ end
343
+ end
344
+
345
+ def minimum_spanning_tree(root_dump)
346
+ if root_dump.size != 1
347
+ raise ArgumentError, "there should be exactly one root"
348
+ end
349
+ root_address, _ = root_dump.address_to_record.first
350
+ if !(root = address_to_record[root_address])
351
+ raise ArgumentError, "no record with address #{root_address} in self"
352
+ end
353
+
354
+ ensure_graphs_computed
355
+
356
+ mst = @forward_graph.minimum_spanning_tree(root)
357
+ map = Hash.new
358
+ mst.each_vertex do |address|
359
+ record = address_to_record[address].dup
360
+ record['references'] = record['references'].dup
361
+ record['references'].delete_if { |ref_address| !mst.has_vertex?(ref_address) }
362
+ map[record['address']] = record
+ end
363
+ MemoryDump.new(map)
364
+ end
365
+
366
+ # @api private
367
+ #
368
+ # Ensure that @forward_graph and @backward_graph are computed
369
+ def ensure_graphs_computed
370
+ if !@forward_graph
371
+ @forward_graph, @backward_graph = compute_graphs
372
+ end
373
+ end
374
+
375
+ # @api private
376
+ #
377
+ # Force recomputation of the graph representation of the dump the next
378
+ # time it is needed
379
+ def clear_graph
380
+ @forward_graph = nil
381
+ @backward_graph = nil
382
+ end
383
+
384
+ # @api private
385
+ #
386
+ # Create two RGL::DirectedAdjacencyGraph, for the forward and backward edges of the graph
387
+ def compute_graphs
388
+ forward_graph = RGL::DirectedAdjacencyGraph.new
389
+ forward_graph.add_vertex 'ALL_ROOTS'
390
+ address_to_record.each do |address, record|
391
+ forward_graph.add_vertex(address)
392
+
393
+ if record['type'] == 'ROOT'
394
+ forward_graph.add_edge('ALL_ROOTS', address)
395
+ end
396
+ record['references'].each do |ref_address|
397
+ forward_graph.add_edge(address, ref_address)
398
+ end
399
+ end
400
+
401
+ backward_graph = RGL::DirectedAdjacencyGraph.new
402
+ forward_graph.each_edge do |u, v|
403
+ backward_graph.add_edge(v, u)
404
+ end
405
+ return forward_graph, backward_graph
406
+ end
407
+
408
+ def depth_first_visit(root, &block)
409
+ ensure_graphs_computed
410
+ @forward_graph.depth_first_visit(root, &block)
411
+ end
412
+
413
+ # Validate that all reference entries have a matching dump entry
414
+ #
415
+ # @raise [RuntimeError] if references have been found
416
+ def validate_references
417
+ addresses = self.addresses.to_set
418
+ each_record do |r|
419
+ common = addresses & r['references']
420
+ if common.size != r['references'].size
421
+ missing = r['references'] - common
422
+ raise "#{r} references #{missing.to_a.sort.join(", ")} which do not exist"
423
+ end
424
+ end
425
+ nil
426
+ end
427
+
428
+ # Get a random sample of the records
429
+ #
430
+ # The sampling is random, so the returned set might be bigger or smaller
431
+ # than expected. Do not use on small sets.
432
+ #
433
+ # @param [Float] the ratio of selected samples vs. total samples (0.1
434
+ # will select approximately 10% of the samples)
435
+ def sample(ratio)
436
+ result = Hash.new
437
+ each_record do |record|
438
+ if rand <= ratio
439
+ result[record['address']] = record
440
+ end
441
+ end
442
+ MemoryDump.new(result)
443
+ end
444
+
445
+ # @api private
446
+ #
447
+ # Return the set of record addresses that are the addresses of roots in
448
+ # the live graph
449
+ #
450
+ # @return [Set<String>]
451
+ def root_addresses
452
+ roots = self.addresses.to_set.dup
453
+ each_record do |r|
454
+ roots.subtract(r['references'])
455
+ end
456
+ roots
457
+ end
458
+
459
+ # Returns the set of roots
460
+ def roots(with_keepalive_count: false)
461
+ result = Hash.new
462
+ self.root_addresses.each do |addr|
463
+ record = find_by_address(addr)
464
+ if with_keepalive_count
465
+ record = record.dup
466
+ count = 0
467
+ depth_first_visit(addr) { count += 1 }
468
+ record['keepalive_count'] = count
469
+ end
470
+ result[addr] = record
471
+ end
472
+ MemoryDump.new(result)
473
+ end
474
+
475
+ def add_children(roots, with_keepalive_count: false)
476
+ result = Hash.new
477
+ roots.each_record do |root_record|
478
+ result[root_record['address']] = root_record
479
+
480
+ root_record['references'].each do |addr|
481
+ ref_record = find_by_address(addr)
482
+ next if !ref_record
483
+
484
+ if with_keepalive_count
485
+ ref_record = ref_record.dup
486
+ count = 0
487
+ depth_first_visit(addr) { count += 1 }
488
+ ref_record['keepalive_count'] = count
489
+ end
490
+ result[addr] = ref_record
491
+ end
492
+ end
493
+ MemoryDump.new(result)
494
+ end
495
+
496
+ def dup
497
+ find_all { true }
498
+ end
499
+
500
+ # Simply remove the given objects
501
+ def remove(objects)
502
+ removed_addresses = objects.addresses.to_set
503
+ return dup if removed_addresses.empty?
504
+
505
+ find_and_map do |r|
506
+ if !removed_addresses.include?(r['address'])
507
+ references = r['references'].dup
508
+ references.delete_if { |a| removed_addresses.include?(a) }
509
+ r['references'] = references
510
+ r
511
+ end
512
+ end
513
+ end
514
+
515
+ # Remove all components that are smaller than the given number of nodes
516
+ #
517
+ # It really looks only at the number of nodes reachable from a root
518
+ # (i.e. won't notice if two smaller-than-threshold roots have nodes in
519
+ # common)
520
+ def remove_small_components(max_size: 1)
521
+ roots = self.addresses.to_set.dup
522
+ leaves = Set.new
523
+ each_record do |r|
524
+ refs = r['references']
525
+ if refs.empty?
526
+ leaves << r['address']
527
+ else
528
+ roots.subtract(r['references'])
529
+ end
530
+ end
531
+
532
+ to_remove = Set.new
533
+ roots.each do |root_address|
534
+ component = Set[]
535
+ queue = Set[root_address]
536
+ while !queue.empty? && (component.size <= max_size)
537
+ address = queue.first
538
+ queue.delete(address)
539
+ next if component.include?(address)
540
+ component << address
541
+ queue.merge(address_to_record[address]['references'])
542
+ end
543
+
544
+ if component.size <= max_size
545
+ to_remove.merge(component)
546
+ end
547
+ end
548
+
549
+ without(find_all { |r| to_remove.include?(r['address']) })
550
+ end
551
+
552
+ def stats
553
+ unknown_class = 0
554
+ by_class = Hash.new(0)
555
+ each_record do |r|
556
+ if klass = (r['class'] || r['type'] || r['root'])
557
+ by_class[klass] += 1
558
+ else
559
+ unknown_class += 1
560
+ end
561
+ end
562
+ return unknown_class, by_class
563
+ end
564
+
565
+ # Compute the set of records that are not in self but are in to
566
+ #
567
+ # @param [MemoryDump] to
568
+ # @return [MemoryDump]
569
+ def diff(to)
570
+ diff = Hash.new
571
+ to.each_record do |r|
572
+ address = r['address']
573
+ if !@address_to_record.include?(address)
574
+ diff[address] = r
575
+ end
576
+ end
577
+ MemoryDump.new(diff)
578
+ end
579
+
580
+ # Compute the interface between self and the other dump, that is the
581
+ # elements of self that have a child in dump, and the elements of dump
582
+ # that have a parent in self
583
+ def interface_with(dump)
584
+ self_border = Hash.new
585
+ dump_border = Hash.new
586
+ each_record do |r|
587
+ next if dump.find_by_address(r['address'])
588
+
589
+ refs_in_dump = r['references'].map do |addr|
590
+ dump.find_by_address(addr)
591
+ end.compact
592
+
593
+ if !refs_in_dump.empty?
594
+ self_border[r['address']] = r
595
+ refs_in_dump.each do |child|
596
+ dump_border[child['address']] = child.dup
597
+ end
598
+ end
599
+ end
600
+
601
+ self_border = MemoryDump.new(self_border)
602
+ dump_border = MemoryDump.new(dump_border)
603
+
604
+ dump.update_keepalive_count(dump_border)
605
+ return self_border, dump_border
606
+ end
607
+
608
+ # Replace all objects in dump by a single "group" object
609
+ def group(name, dump, attributes = Hash.new)
610
+ group_addresses = Set.new
611
+ group_references = Set.new
612
+ dump.each_record do |r|
613
+ group_addresses << r['address']
614
+ group_references.merge(r['references'])
615
+ end
616
+ group_record = attributes.dup
617
+ group_record['address'] = name
618
+ group_record['references'] = group_references - group_addresses
619
+
620
+ updated = Hash[name => group_record]
621
+ each_record do |record|
622
+ next if group_addresses.include?(record['address'])
623
+
624
+ updated_record = record.dup
625
+ updated_record['references'] -= group_addresses
626
+ if updated_record['references'].size != record['references'].size
627
+ updated_record['references'] << name
628
+ end
629
+
630
+ if group_addresses.include?(updated_record['class_address'])
631
+ updated_record['class_address'] = name
632
+ end
633
+ if group_addresses.include?(updated_record['class'])
634
+ updated_record['class'] = name
635
+ end
636
+
637
+ updated[updated_record['address']] = updated_record
638
+ end
639
+
640
+ MemoryDump.new(updated)
641
+ end
642
+
643
+ def update_keepalive_count(dump)
644
+ ensure_graphs_computed
645
+ dump.each_record do |record|
646
+ count = 0
647
+ dump.depth_first_visit(record['address']) { |obj| count += 1 }
648
+ record['keepalive_count'] = count
649
+ record
650
+ end
651
+ end
652
+
653
+ def replace_class_id_by_class_name(add_reference_to_class: false)
654
+ MemDump.replace_class_address_by_name(self, add_reference_to_class: add_reference_to_class)
655
+ end
656
+
657
+ def to_s
658
+ "#<MemoryDump size=#{size}>"
659
+ end
660
+ end
661
+ end
662
+
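Both `roots(with_keepalive_count: true)` and `update_keepalive_count` above count every object reachable from an address with a depth-first visit. A minimal, self-contained sketch of that counting over a plain address-to-references Hash (the data and the `keepalive_count` helper name are illustrative, not the gem's API):

``` ruby
require 'set'

# Toy object graph: address => addresses it references
REFERENCES = {
  'a' => ['b', 'c'],
  'b' => ['c'],
  'c' => [],
  'd' => []
}

# Count every object reachable from root, visiting each address once,
# mirroring the depth_first_visit-based keepalive_count in the diff above
def keepalive_count(references, root)
  seen = Set.new
  stack = [root]
  until stack.empty?
    addr = stack.pop
    next if seen.include?(addr)
    seen << addr
    stack.concat(references.fetch(addr, []))
  end
  seen.size
end

puts keepalive_count(REFERENCES, 'a') # 3: a, b and c
puts keepalive_count(REFERENCES, 'd') # 1: d alone
```

The `seen` set is what keeps the visit linear even when objects reference each other in cycles.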
@@ -0,0 +1,7 @@
1
+ module MemDump
2
+ def self.out_degree(dump)
3
+ dump.each_record.sort_by { |r| (r['references'] || Array.new).size }
4
+ end
5
+ end
6
+
7
+
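The new `out_degree` helper simply orders records by how many references they hold, treating a missing `references` key as zero. The same sort over plain record Hashes (toy data):

``` ruby
records = [
  { 'address' => 'a', 'references' => ['b', 'c'] },
  { 'address' => 'b' },                            # no references key at all
  { 'address' => 'c', 'references' => ['b'] }
]

# Sort ascending by out-degree; records that reference the most objects
# end up last, which makes heavy referrers easy to spot
sorted = records.sort_by { |r| (r['references'] || Array.new).size }
puts sorted.map { |r| r['address'] }.inspect # ["b", "c", "a"]
```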
@@ -2,23 +2,40 @@ module MemDump
2
2
  # Replace the address in the 'class' attribute by the class name
3
3
  def self.replace_class_address_by_name(dump, add_reference_to_class: false)
4
4
  class_names = Hash.new
5
+ iclasses = Hash.new
5
6
  dump.each_record do |row|
6
7
  if row['type'] == 'CLASS' || row['type'] == 'MODULE'
7
8
  class_names[row['address']] = row['name']
9
+ elsif row['type'] == 'ICLASS' || row['type'] == "IMEMO"
10
+ iclasses[row['address']] = row
8
11
  end
9
12
  end
10
13
 
11
- dump.each_record.map do |r|
14
+ iclass_size = 0
15
+ while !iclasses.empty? && (iclass_size != iclasses.size)
16
+ iclass_size = iclasses.size
17
+ iclasses.delete_if do |_, r|
18
+ if (klass = r['class']) && (class_name = class_names[klass])
19
+ class_names[r['address']] = "I(#{class_name})"
20
+ r['class'] = class_name
21
+ r['class_address'] = klass
22
+ if add_reference_to_class
23
+ (r['references'] ||= Set.new) << klass
24
+ end
25
+ true
26
+ end
27
+ end
28
+ end
29
+
30
+ dump.map do |r|
12
31
  if klass = r['class']
32
+ r = r.dup
13
33
  r['class'] = class_names[klass] || klass
14
34
  r['class_address'] = klass
15
35
  if add_reference_to_class
16
- (r['references'] ||= Array.new) << klass
36
+ (r['references'] ||= Set.new) << klass
17
37
  end
18
38
  end
19
- if r['type'] == 'ICLASS'
20
- r['class'] = "I(#{r['class']})"
21
- end
22
39
  r
23
40
  end
24
41
  end
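The new ICLASS handling resolves class names with a fixed-point loop: each pass names the ICLASSes whose class address is already known, so chains of ICLASSes pointing at other ICLASSes resolve over successive passes. A standalone sketch of that loop (toy addresses, simplified records):

``` ruby
class_names = { '0x1' => 'String' }
iclasses = {
  '0x2' => { 'address' => '0x2', 'class' => '0x1' },
  '0x3' => { 'address' => '0x3', 'class' => '0x2' } # chained ICLASS
}

# Repeat until a pass makes no progress: '0x3' can only be named once
# '0x2' has been resolved in an earlier pass
previous_size = -1
until iclasses.empty? || iclasses.size == previous_size
  previous_size = iclasses.size
  iclasses.delete_if do |_, record|
    if (name = class_names[record['class']])
      class_names[record['address']] = "I(#{name})"
      true
    end
  end
end

puts class_names['0x2'] # I(String)
puts class_names['0x3'] # I(I(String))
```

The size check is what terminates the loop when some ICLASSes can never be resolved, instead of spinning forever.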
@@ -1,3 +1,3 @@
1
1
  module Memdump
2
- VERSION = "0.1.0"
2
+ VERSION = "0.2.0"
3
3
  end
@@ -20,7 +20,8 @@ Gem::Specification.new do |spec|
20
20
  spec.require_paths = ["lib"]
21
21
 
22
22
  spec.add_dependency 'thor'
23
- spec.add_dependency 'rbtrace'
23
+ spec.add_dependency 'rgl'
24
+ spec.add_dependency 'pry'
24
25
  spec.add_development_dependency "bundler", "~> 1.11"
25
26
  spec.add_development_dependency "rake", "~> 10.0"
26
27
  spec.add_development_dependency "minitest", "~> 5.0"
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: memdump
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.1.0
4
+ version: 0.2.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Sylvain Joyeux
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2016-04-25 00:00:00.000000000 Z
11
+ date: 2018-02-03 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: thor
@@ -25,7 +25,21 @@ dependencies:
25
25
  - !ruby/object:Gem::Version
26
26
  version: '0'
27
27
  - !ruby/object:Gem::Dependency
28
- name: rbtrace
28
+ name: rgl
29
+ requirement: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - ">="
32
+ - !ruby/object:Gem::Version
33
+ version: '0'
34
+ type: :runtime
35
+ prerelease: false
36
+ version_requirements: !ruby/object:Gem::Requirement
37
+ requirements:
38
+ - - ">="
39
+ - !ruby/object:Gem::Version
40
+ version: '0'
41
+ - !ruby/object:Gem::Dependency
42
+ name: pry
29
43
  requirement: !ruby/object:Gem::Requirement
30
44
  requirements:
31
45
  - - ">="
@@ -98,13 +112,14 @@ files:
98
112
  - lib/memdump.rb
99
113
  - lib/memdump/cleanup_references.rb
100
114
  - lib/memdump/cli.rb
115
+ - lib/memdump/common_ancestor.rb
101
116
  - lib/memdump/convert_to_gml.rb
102
- - lib/memdump/diff.rb
103
117
  - lib/memdump/json_dump.rb
118
+ - lib/memdump/memory_dump.rb
119
+ - lib/memdump/out_degree.rb
104
120
  - lib/memdump/remove_node.rb
105
121
  - lib/memdump/replace_class_address_by_name.rb
106
122
  - lib/memdump/root_of.rb
107
- - lib/memdump/stats.rb
108
123
  - lib/memdump/subgraph_of.rb
109
124
  - lib/memdump/version.rb
110
125
  - memdump.gemspec
@@ -128,9 +143,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
128
143
  version: '0'
129
144
  requirements: []
130
145
  rubyforge_project:
131
- rubygems_version: 2.2.3
146
+ rubygems_version: 2.5.1
132
147
  signing_key:
133
148
  specification_version: 4
134
149
  summary: Tools to manipulate Ruby 2.1+ memory dumps
135
150
  test_files: []
136
- has_rdoc:
@@ -1,44 +0,0 @@
1
- require 'set'
2
-
3
- module MemDump
4
- def self.diff(from, to)
5
- from_objects = Set.new
6
- from.each_record { |r| from_objects << (r['address'] || r['root']) }
7
- puts "#{from_objects.size} objects found in source dump"
8
-
9
- selected_records = Hash.new
10
- remaining_records = Array.new
11
- to.each_record do |r|
12
- address = (r['address'] || r['root'])
13
- if !from_objects.include?(address)
14
- selected_records[address] = r
15
- r['only_in_target'] = 1
16
- else
17
- remaining_records << r
18
- end
19
- end
20
-
21
- total = remaining_records.size + selected_records.size
22
- count = 0
23
- while selected_records.size != count
24
- count = selected_records.size
25
- puts "#{count}/#{total} records selected so far"
26
- remaining_records.delete_if do |r|
27
- address = (r['address'] || r['root'])
28
- references = r['references']
29
-
30
- if references && references.any? { |r| selected_records.has_key?(r) }
31
- selected_records[address] = r
32
- end
33
- end
34
- end
35
- puts "#{count}/#{total} records selected"
36
-
37
- selected_records.each_value do |r|
38
- if references = r['references']
39
- references.delete_if { |a| !selected_records.has_key?(a) }
40
- end
41
- end
42
- selected_records.each_value
43
- end
44
- end
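The removed `MemDump.diff` seeded its selection with objects present only in the target dump, then repeatedly pulled in records referencing an already-selected one until the set stopped growing. That propagation step in isolation, over toy data:

``` ruby
require 'set'

from_addresses = Set['a', 'b']               # objects in the "before" dump
to_records = [
  { 'address' => 'a',  'references' => ['n1'] },
  { 'address' => 'b',  'references' => [] },
  { 'address' => 'n1', 'references' => ['n2'] },
  { 'address' => 'n2', 'references' => [] }
]

# Seed with the objects that did not exist in the source dump
selected = to_records.reject { |r| from_addresses.include?(r['address']) }
                     .map { |r| [r['address'], r] }.to_h
remaining = to_records.select { |r| from_addresses.include?(r['address']) }

# Pull in old objects that keep a new object alive, to a fixed point
count = -1
while selected.size != count
  count = selected.size
  remaining.delete_if do |r|
    if r['references'].any? { |a| selected.key?(a) }
      selected[r['address']] = r
    end
  end
end

puts selected.keys.sort.inspect # ["a", "n1", "n2"]
```

Here `a` is old but referenced the new `n1`, so it joins the selection; `b` references nothing new and is discarded.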
@@ -1,15 +0,0 @@
1
- module MemDump
2
- def self.stats(memdump)
3
- unknown_class = 0
4
- by_class = Hash.new(0)
5
- memdump.each_record do |r|
6
- if klass = (r['class'] || r['type'] || r['root'])
7
- by_class[klass] += 1
8
- else
9
- unknown_class += 1
10
- end
11
- end
12
- return unknown_class, by_class
13
- end
14
- end
15
-
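The removed `stats.rb` survives as `MemoryDump#stats` above. Its grouping logic in isolation (toy records): count records per class, falling back to `type`, then `root`, and tally separately anything that has none of the three.

``` ruby
records = [
  { 'address' => '0x1', 'class' => 'String' },
  { 'address' => '0x2', 'class' => 'String' },
  { 'address' => '0x3', 'type' => 'DATA' },
  { 'address' => '0x4' }
]

# Hash.new(0) makes the first increment for each class start from zero
unknown_class = 0
by_class = Hash.new(0)
records.each do |r|
  if (klass = r['class'] || r['type'] || r['root'])
    by_class[klass] += 1
  else
    unknown_class += 1
  end
end

puts by_class.inspect # {"String"=>2, "DATA"=>1}
puts unknown_class    # 1
```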