RubyGems - packfile_reader - Versions diffs - 0.0.1 - Mend

packfile_reader 0.0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (8)

checksums.yaml +7 -0
data/README.md +118 -0
data/bin/packfile_reader +62 -0
data/lib/packfile_reader.rb +3 -0
data/lib/packfile_reader/packfile_entry.rb +107 -0
data/lib/packfile_reader/packfile_header.rb +66 -0
data/lib/packfile_reader/packfile_hunk.rb +50 -0
metadata +63 -0

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA256:
+  metadata.gz: 3b8e93c293d659c69561f903785550ebdcf9cbb4eff620d5f711a2eb55e2452f
+  data.tar.gz: 504bec783784ff1bd3233d7e39b1e53e3621239c79e54d6c54375358d69d504c
+SHA512:
+  metadata.gz: 06f757f92ed4ea72c888784ce94ede9d4502da4ee840f986ee74fb4fc984387387eba316ed7fc54dbb604d7df295f38bf5df589c0fee5bb962b49eac9fb5ebf7
+  data.tar.gz: 23dbeb6424004becebd49eb8823d14230a95fcd14e842054d3be46b54acfbee887e1d47d97f4f6f66e01d63b435d97331fe21daa27b6267f142230e5bdd14fad

data/README.md ADDED

@@ -0,0 +1,118 @@
+# Packfile Reader
+Git packs several "loose" objects into a single binary file called a "packfile" to save space and be more efficient.
+Packfiles usually come in pairs: a `.pack` file and a `.idx` file.
+The `.idx` file contains offsets for all the objects in the `.pack` file, so it is easier to find the content you are looking for in the packfile.
+When we have both files, we can use the `git verify-pack` command to read the content and metadata of the objects in the packfile, but sometimes we only have the `.pack` file, and in that case `git` is not really helpful.
+![packfile](packfile-format.png?raw=true "Packfile Format")
+I created this tool to help with parsing packfiles without their index file counterparts.
+# Installation
+This gem is published on rubygems.org, so you can just run `gem install packfile_reader`.
+If you want to download the source code and play with it locally, clone the repository, then run:
+```
+gem build packfile_reader.gemspec
+gem install ./packfile_reader-<version>.gem
+```
+# Usage
+```
+This tool is used to parse and extract data from git packfiles without a .idx file.
+By default, the script will only report the object ids, their type and their uncompressed sizes.
+You can also make the script expand the content of the objects in the local directory or a directory
+of your choice.
+Usage:
+  packfile_reader [options] <packfile>
+where [options] are:
+  -h, --headers-only         Display only the headers of the packfile
+  -n, --no-headers           Skip displaying the headers of the packfile
+  -i, --filter-by-ids=<s>    Comma separated list of object ids to look for (default: any)
+  -e, --expand-objects       Whether to expand objects data
+  -o, --output-dir=<s>       Directory to store the expanded objects (default: .)
+  -v, --version              Print version and exit
+  -l, --help                 Show this message
+```
+## Example:
+The output of the command includes the headers for the packfile and a list of the objects found in the file, in the format:
+```
+<object-id> <object-type> <size-uncompressed>
+```
+### Filtering by object ids
+```
+packfile_reader -i "5297f8f21ad868d9eb6a9c01ad09a9d186177047,96438dd1e26e6963fa65be0012e8f6e84209bc5d" pack.sample
+```
+```
+Packfile Headers
+- Signature: PACK
+- Version: 2
+- Entries: 3
+96438dd1e26e6963fa65be0012e8f6e84209bc5d	OBJ_COMMIT	653
+5297f8f21ad868d9eb6a9c01ad09a9d186177047	OBJ_BLOB	10
+```
+### Getting header only
+```
+packfile_reader --headers-only pack.sample
+```
+```
+Packfile Headers
+- Signature: PACK
+- Version: 2
+- Entries: 3
+```
+### Getting the list of objects only
+```
+packfile_reader --no-headers pack.sample
+```
+```
+96438dd1e26e6963fa65be0012e8f6e84209bc5d	OBJ_COMMIT	653
+5297f8f21ad868d9eb6a9c01ad09a9d186177047	OBJ_BLOB	10
+bf195faf9d23ce0615cdefd2b746a077ef82f03f	OBJ_TREE	37
+```
+### Inflating object data
+You can also ask the tool to inflate (decompress) the object data by using the `--expand-objects` option. `packfile_reader` will create a file named after the object id, with a `.txt` extension. By default the file is created in the current directory. To change that, use the `--output-dir` option.
+```
+packfile_reader -i "5297f8f21ad868d9eb6a9c01ad09a9d186177047" -e -o /tmp pack.sample
+```
+```
+Packfile Headers
+- Signature: PACK
+- Version: 2
+- Entries: 3
+5297f8f21ad868d9eb6a9c01ad09a9d186177047	OBJ_BLOB	10
+```
+```
+$ cat /tmp/5297f8f21ad868d9eb6a9c01ad09a9d186177047.txt
+# test-git%
+```
+# References
+-  http://shafiul.github.io/gitbook/7_the_packfile.html
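The listing format documented in the README (object id, type and size separated by tabs) is straightforward to consume from a script. The following is a minimal Ruby sketch, assuming the tool's output has already been captured as a string. `PackObject` and `parse_listing` are hypothetical names used for illustration and are not part of the gem.

```ruby
# Hypothetical helper, not part of packfile_reader: parses the tool's
# tab-separated listing (<object-id>\t<object-type>\t<size>) into structs.
PackObject = Struct.new(:id, :type, :size)

def parse_listing(text)
  text.each_line.each_with_object([]) do |line, objects|
    id, type, size = line.chomp.split("\t")
    # Header lines ("Packfile Headers", "- Version: 2") and blank lines
    # do not have three tab-separated fields, so they are skipped here.
    next unless id && type && size && id.match?(/\A\h{40}\z/)
    objects << PackObject.new(id, type, Integer(size))
  end
end

# Sample text copied from the README output above.
sample = <<~OUT
  Packfile Headers
  - Signature: PACK
  - Version: 2
  - Entries: 3

  96438dd1e26e6963fa65be0012e8f6e84209bc5d\tOBJ_COMMIT\t653
  5297f8f21ad868d9eb6a9c01ad09a9d186177047\tOBJ_BLOB\t10
OUT

objects = parse_listing(sample)
objects.each { |o| puts "#{o.type} #{o.id[0, 7]} (#{o.size} bytes)" }
```

Validating the id as 40 hexadecimal characters is a design choice of this sketch; it also drops the `000` placeholder id that the entry reader uses for delta objects.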

data/bin/packfile_reader ADDED

@@ -0,0 +1,62 @@
+#!/usr/bin/env ruby
+require 'optimist'
+require 'packfile_reader'
+opts = Optimist::options do
+  version "v0.0.1 (c) 2020 Robison WR Santos"
+  banner <<~EOS
+    This tool is used to parse and extract data from git packfiles without a .idx file.
+    By default, the script will only report the object ids, their type and their uncompressed sizes.
+    You can also make the script expand the content of the objects in the local directory or a directory
+    of your choice.
+    Usage:
+      packfile_reader [options] <packfile>
+    where [options] are:
+  EOS
+  opt :headers_only, 'Display only the headers of the packfile'
+  opt :no_headers, 'Skip displaying the headers of the packfile'
+  opt :filter_by_ids, 'Comma separated list of object ids to look for', :default => 'any', :short => '-i', :type => String
+  opt :expand_objects, 'Whether to expand objects data', :default => false, :short => '-e'
+  opt :output_dir, 'Directory to store the expanded objects', :default => '.', :short => '-o', :type => String
+end
+(puts "You must inform a single packfile, found #{ARGV.size}"; exit 1) if ARGV.size > 1 or ARGV.empty?
+packfile = ARGV.first
+(puts 'Packfile not found'; exit 2) unless File.exist?(packfile)
+File.open(packfile, 'rb') do |f|
+  begin
+    header = PackfileReader::PackfileHeader.new(f)
+  rescue RuntimeError => e
+    $stderr.puts e.message
+    exit 3
+  end
+  unless opts[:no_headers]
+    puts header
+    puts
+  end
+  exit(0) if opts[:headers_only]
+  ids_to_filter = opts[:filter_by_ids]
+  ids_to_filter = 'any' if ids_to_filter.empty?
+  objects_to_find = ids_to_filter == 'any' ? :any : ids_to_filter.split(',').map(&:strip)
+  entries_processed = 0
+  limit = objects_to_find == :any ? header.n_entries : objects_to_find.size
+  (0...limit).each do
+    entry = PackfileReader::PackfileEntry.next_entry(f, objects_to_find) do |c,u,id|
+      if opts[:expand_objects]
+        dir = opts[:output_dir]
+        File.open(File.join(dir, "#{id}.txt"), 'w') {|o| o.write u}
+      end
+    end
+    break if entry.nil? # next_entry returns nil once the end of the file is reached
+    puts "#{entry.id}\t#{entry.type}\t#{entry.size}"
+  end
+end
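The way the script turns the `--filter-by-ids` value into either the symbol `:any` or a list of ids can be isolated into a small function. This is a sketch under the assumption that the raw option value is a string (or nil). `resolve_filter` and `wanted?` are hypothetical names and do not exist in the gem.

```ruby
# Hypothetical helpers mirroring the filter handling in bin/packfile_reader.

# Turns the raw option string into :any or an array of trimmed object ids.
def resolve_filter(raw)
  value = raw.to_s.strip
  return :any if value.empty? || value == 'any'

  value.split(',').map(&:strip).reject(&:empty?)
end

# True when the given object id passes the filter.
def wanted?(filter, object_id)
  filter == :any || filter.include?(object_id)
end

filter = resolve_filter('5297f8f21ad868d9eb6a9c01ad09a9d186177047, 96438dd1e26e6963fa65be0012e8f6e84209bc5d')
puts wanted?(filter, '5297f8f21ad868d9eb6a9c01ad09a9d186177047')
puts wanted?(filter, 'bf195faf9d23ce0615cdefd2b746a077ef82f03f')
```

Unlike the script, this sketch also drops empty items such as the one produced by a trailing comma.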

data/lib/packfile_reader.rb ADDED

@@ -0,0 +1,3 @@
+require 'packfile_reader/packfile_header'
+require 'packfile_reader/packfile_entry'
+require 'packfile_reader/packfile_hunk'

data/lib/packfile_reader/packfile_entry.rb ADDED

@@ -0,0 +1,107 @@
+require 'digest'
+require 'zlib'
+require 'packfile_reader/packfile_hunk'
+module PackfileReader
+  class PackfileEntry
+    attr_reader :type
+    attr_reader :size
+    attr_reader :id
+    # ZLIB RFC: https://tools.ietf.org/html/rfc1950
+    ZLIB_HEADERS = [
+      0b0111100000000001, #0x78 0x01
+      0b0111100010011100, #0x78 0x9c
+      0b0111100011011010, #0x78 0xda
+    ]
+    # Accepts a block that will receive the compressed data, uncompressed data and
+    # the computed object id
+    def self.next_entry(packfile_io, objects_to_find=:any)
+      loop do
+        return nil if packfile_io.eof?
+        hunk = PackfileReader::Hunk.new_with_type(packfile_io)
+        type = hunk.type
+        size = hunk.size
+        offset = hunk.offset_size
+        while hunk.continuation?
+          hunk = PackfileReader::Hunk.new_without_type(packfile_io)
+          size = (hunk.size << offset) | size # Data size is a combination of all hunk sizes
+          offset += hunk.offset_size
+        end
+        compressed_data, uncompressed_data = find_data(packfile_io)
+        object_id = compute_id(type, size, uncompressed_data)
+        if objects_to_find == :any || objects_to_find.member?(object_id)
+          yield compressed_data, uncompressed_data, object_id if block_given?
+          return PackfileEntry.new(type, size, object_id)
+        end
+      end
+    end
+    private
+    def self.find_data(packfile_io)
+      data_header = find_zlib_data_header(packfile_io)
+      # Since we don't have the index file that accompanies pack files,
+      # we need to use brute force to find where the compressed data ends.
+      # To do that, we go byte by byte and try to inflate the data; when
+      # that succeeds, we know we got it all.
+      compressed_data = data_header
+      compressed_data += packfile_io.read(1)
+      begin
+        uncompressed_data = Zlib.inflate(compressed_data)
+      rescue Zlib::BufError
+        compressed_data += packfile_io.read(1)
+        retry
+      end
+      [compressed_data, uncompressed_data]
+    end
+    def self.find_zlib_data_header(packfile_io)
+      # If type is BLOB, TREE, COMMIT or TAG, the data is a zlib stream
+      # ref-delta uses a 20 byte hash of the base object at the beginning of data
+      # ofs-delta stores an offset within the same packfile to identify the base object
+      #
+      # Need to skip until we find a compressed data
+      # We really don't care about the delta objects
+      data_header = packfile_io.read(2) # 2 bytes to find the zlib header
+      while (not ZLIB_HEADERS.member?(data_header.unpack('n')[0]))
+        packfile_io.seek(packfile_io.pos - 1) # step back one byte so the 2-byte window slides one byte at a time
+        data_header = packfile_io.read(2)
+      end
+      data_header
+    end
+    def self.compute_id(type, size, uncompressed_data)
+      header_type = case type
+                    when :OBJ_COMMIT then 'commit'
+                    when :OBJ_TREE then 'tree'
+                    when :OBJ_BLOB then 'blob'
+                    when :OBJ_TAG then 'tag'
+                    else ''
+      end
+      return '000' if header_type.empty?
+      header = "#{header_type} #{size}\0"
+      store = "#{header}#{uncompressed_data}"
+      Digest::SHA1.hexdigest(store)
+    end
+    def initialize(type, size, id)
+      @type = type
+      @size = size
+      @id = id
+    end
+  end
+end
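Two ideas from this file can be sketched in isolation. The first is the object id: git hashes the string `<type> <size>\0<content>` with SHA-1, which is what `compute_id` does. The second is an alternative to the byte-by-byte retry loop in `find_data`: a streaming `Zlib::Inflate` reports when the stream is finished and how many input bytes it consumed, so the end of the compressed data can be located in one pass. This is a swapped-in technique, not what the gem does. `git_object_id` and `inflate_next_stream` are hypothetical names, and the 4096-byte chunk size is an arbitrary assumption.

```ruby
require 'digest'
require 'stringio'
require 'zlib'

# Hypothetical helper: SHA-1 over "<kind> <byte size>\0<content>".
# The gem takes the size from the entry header instead of the content.
def git_object_id(kind, content)
  Digest::SHA1.hexdigest("#{kind} #{content.bytesize}\0#{content}")
end

# Hypothetical helper: inflates one zlib stream starting at the current
# position of io and leaves io positioned right after that stream.
# Returns [uncompressed_data, compressed_byte_count].
def inflate_next_stream(io)
  start = io.pos
  inflater = Zlib::Inflate.new
  output = String.new # binary (ASCII-8BIT) buffer
  until inflater.finished?
    chunk = io.read(4096)
    raise EOFError, 'zlib stream ended early' if chunk.nil?

    output << inflater.inflate(chunk)
  end
  consumed = inflater.total_in # bytes that belong to this stream only
  io.seek(start + consumed)    # rewind past any over-read bytes
  [output, consumed]
ensure
  inflater&.close
end

compressed = Zlib.deflate('hello')
io = StringIO.new(compressed + 'NEXT-ENTRY')
data, consumed = inflate_next_stream(io)
puts "#{data.inspect} used #{consumed} compressed bytes"
puts git_object_id('blob', '') # id of the empty blob
```

The streaming approach avoids re-inflating the growing buffer on every added byte. Locating the start of the stream, which the gem does by scanning for a zlib header, is a separate step that this sketch does not cover.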

data/lib/packfile_reader/packfile_header.rb ADDED

@@ -0,0 +1,66 @@
+module PackfileReader
+  # Defines a class for the HEADER portion of a git packfile
+  # From: https://git-scm.com/docs/pack-format
+  #
+  # A header appears at the beginning and consists of the following:
+  #
+  # 4-byte signature:
+  #     The signature is: {'P', 'A', 'C', 'K'}
+  #
+  # 4-byte version number (network byte order):
+  #   Git currently accepts version number 2 or 3 but generates version 2 only.
+  #
+  # 4-byte number of objects contained in the pack (network byte order)
+  #
+  # Observation: we cannot have more than 4G versions ;-) and more than 4G objects in a pack.
+  class PackfileHeader
+    attr_reader :version   # version of the packfile
+    attr_reader :sign      # signature (must be PACK)
+    attr_reader :n_entries # how many objects in the packfile
+    # Creates a new PackfileHeader instance reading data from the beginning
+    # of a packfile. It fails if it cannot parse the data or if the signature
+    # does not match the expected 'PACK' string
+    #
+    # Params:
+    # +packfile_io+:: the packfile handle opened in binary mode (usually the output of File.open('path', 'rb'))
+    def initialize(packfile_io)
+      begin
+        go_to_start(packfile_io)
+        @sign = parse_sign(packfile_io)
+        @version = parse_version(packfile_io)
+        @n_entries = parse_n_entries(packfile_io)
+      rescue
+        raise 'Invalid packfile. Cannot parse header'
+      end
+      raise "Invalid signature. Got '#{@sign}' expected 'PACK'" unless @sign == 'PACK'
+    end
+    def to_s
+      <<~EOS.strip
+      Packfile Headers
+      - Signature: #{@sign}
+      - Version: #{@version}
+      - Entries: #{@n_entries}
+      EOS
+    end
+    private
+    def go_to_start(packfile_io)
+      packfile_io.seek(0)
+    end
+    def parse_sign(packfile_io)
+      packfile_io.read(4)
+    end
+    def parse_version(packfile_io)
+      packfile_io.read(4).unpack("N")[0]
+    end
+    def parse_n_entries(packfile_io)
+      packfile_io.read(4).unpack("N")[0]
+    end
+  end
+end
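The 12-byte header layout described in the comments above (a 4-byte signature followed by two 32-bit integers in network byte order) can also be decoded with a single `unpack` call. This is a minimal sketch on an in-memory header. `PackHeader` and `parse_pack_header` are hypothetical names, not part of the gem.

```ruby
require 'stringio'

# Hypothetical value object for the three header fields.
PackHeader = Struct.new(:signature, :version, :entries)

# Reads the 12-byte packfile header from the current position of io.
def parse_pack_header(io)
  raw = io.read(12)
  raise ArgumentError, 'header is shorter than 12 bytes' if raw.nil? || raw.bytesize < 12

  # a4 = 4 raw bytes, N = 32-bit unsigned integer, big-endian (network order)
  signature, version, entries = raw.unpack('a4NN')
  raise ArgumentError, "bad signature #{signature.inspect}" unless signature == 'PACK'

  PackHeader.new(signature, version, entries)
end

# Build a header equivalent to the README sample: version 2, 3 entries.
sample = StringIO.new(['PACK', 2, 3].pack('a4NN'))
header = parse_pack_header(sample)
puts "#{header.signature} v#{header.version}, #{header.entries} entries"
```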

data/lib/packfile_reader/packfile_hunk.rb ADDED

@@ -0,0 +1,50 @@
+module PackfileReader
+  class Hunk
+    attr_reader :size
+    attr_reader :type
+    attr_reader :offset_size
+    TYPE_MAP = {
+      1 => :OBJ_COMMIT,
+      2 => :OBJ_TREE,
+      3 => :OBJ_BLOB,
+      4 => :OBJ_TAG,
+      6 => :OBJ_OFS_DELTA,
+      7 => :OBJ_REF_DELTA,
+    }
+    HUNK_TYPE_MASK = 0b01110000
+    HUNK_SIZE_4_MASK = 0b00001111
+    HUNK_SIZE_7_MASK = 0b01111111
+    def self.new_with_type(packfile_io)
+      hunk_bytes = packfile_io.read(1).unpack('C')[0]
+      continuation = hunk_bytes[7] # Most significant bit of the byte (continuation flag)
+      type = (hunk_bytes & HUNK_TYPE_MASK) >> 4 # Adjust type position (remove the extra 4 bits at the end)
+      size = (hunk_bytes & HUNK_SIZE_4_MASK) # We only have 4 bits in a hunk with type for size
+      Hunk.new(continuation == 1, size, 4, TYPE_MAP[type])
+    end
+    def self.new_without_type(packfile_io)
+      hunk_bytes = packfile_io.read(1).unpack('C')[0]
+      continuation = hunk_bytes[7] # Most significant bit of the byte (continuation flag)
+      size = (hunk_bytes & HUNK_SIZE_7_MASK) # A hunk without type has 7 bits for size
+      Hunk.new(continuation == 1, size, 7)
+    end
+    def continuation?
+      @continuation
+    end
+    private
+    def initialize(continuation, size, offset_size, type=nil)
+      @continuation = continuation
+      @size = size
+      @offset_size = offset_size
+      @type = type
+    end
+  end
+end
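The type-and-size encoding that `Hunk` handles can be followed end to end in a compact form. The first byte carries a continuation bit, 3 type bits and the low 4 size bits. Each following byte carries a continuation bit and 7 more size bits, least significant group first. This is a minimal sketch, assuming the same type numbering as `TYPE_MAP`. `read_entry_header` and `TYPE_NAMES` are hypothetical names, not part of the gem.

```ruby
require 'stringio'

# Same numbering as TYPE_MAP above; repeated so the sketch is standalone.
TYPE_NAMES = {
  1 => :OBJ_COMMIT, 2 => :OBJ_TREE, 3 => :OBJ_BLOB, 4 => :OBJ_TAG,
  6 => :OBJ_OFS_DELTA, 7 => :OBJ_REF_DELTA
}.freeze

# Hypothetical helper: decodes one entry header and returns [type, size].
def read_entry_header(io)
  byte = io.readbyte
  type = (byte >> 4) & 0b111 # bits 6..4
  size = byte & 0b1111       # bits 3..0 are the lowest size bits
  shift = 4
  while (byte & 0x80) != 0   # bit 7 set means another size byte follows
    byte = io.readbyte
    size |= (byte & 0x7f) << shift
    shift += 7
  end
  [TYPE_NAMES[type], size]
end

# 0x9d = 1_001_1101: continuation, type 1 (commit), low size bits 13.
# 0x28 = 0_0101000: no continuation, next size bits 40; 13 + (40 << 4) = 653.
p read_entry_header(StringIO.new([0x9d, 0x28].pack('C*')))
# 0x3a = 0_011_1010: no continuation, type 3 (blob), size 10.
p read_entry_header(StringIO.new([0x3a].pack('C*')))
```

The two byte sequences were built by hand to match the sizes shown in the README sample (653 and 10); they are not taken from an actual packfile.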

metadata ADDED

@@ -0,0 +1,63 @@
+--- !ruby/object:Gem::Specification
+name: packfile_reader
+version: !ruby/object:Gem::Version
+  version: 0.0.1
+platform: ruby
+authors:
+- Robison WR Santos
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2020-11-30 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: optimist
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 3.0.1
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 3.0.1
+description: A tool to parse git packfile when idx files are not present
+email: ''
+executables:
+- packfile_reader
+extensions: []
+extra_rdoc_files: []
+files:
+- README.md
+- bin/packfile_reader
+- lib/packfile_reader.rb
+- lib/packfile_reader/packfile_entry.rb
+- lib/packfile_reader/packfile_header.rb
+- lib/packfile_reader/packfile_hunk.rb
+homepage: https://github.com/robisonsantos/packfile_reader
+licenses:
+- MIT
+metadata: {}
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubygems_version: 3.1.2
+signing_key:
+specification_version: 4
+summary: Parses git packfiles without the help of idx companion
+test_files: []