packfile_reader 0.0.1

@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: 3b8e93c293d659c69561f903785550ebdcf9cbb4eff620d5f711a2eb55e2452f
4
+ data.tar.gz: 504bec783784ff1bd3233d7e39b1e53e3621239c79e54d6c54375358d69d504c
5
+ SHA512:
6
+ metadata.gz: 06f757f92ed4ea72c888784ce94ede9d4502da4ee840f986ee74fb4fc984387387eba316ed7fc54dbb604d7df295f38bf5df589c0fee5bb962b49eac9fb5ebf7
7
+ data.tar.gz: 23dbeb6424004becebd49eb8823d14230a95fcd14e842054d3be46b54acfbee887e1d47d97f4f6f66e01d63b435d97331fe21daa27b6267f142230e5bdd14fad
@@ -0,0 +1,118 @@
1
+ # Packfile Reader
2
+
3
+ Git packs several "loose" objects into a single binary file called a "packfile" to save space and be more efficient.
4
+
5
+ "packfiles" usually come in pairs: a `.pack` file and a `.idx` file.
6
+
7
+ The `.idx` file contains the offsets of all the objects in the `.pack` file, which makes it easier to find the content you are looking for in the packfile.
8
+
9
+ When we have both files, we can use the `git verify-pack` command to read the content and metadata of the objects in the packfile. Sometimes, however, we only have the `.pack` file, and in that case `git` is not much help.
10
+
11
+ ![packfile](packfile-format.png?raw=true "Packfile Format")
12
+
13
+ I created this tool to help parse packfiles that lack their index file counterparts.
14
+
15
+ # Installation
16
+
17
+ This gem is published on rubygems.org, so you can simply run `gem install packfile_reader`.
18
+
19
+ If you want to download the source code and play with it locally, clone the repository and then run:
20
+
21
+ ```
22
+ gem build packfile_reader.gemspec
23
+ gem install ./packfile_reader-<version>.gem
24
+ ```
25
+
26
+ # Usage
27
+
28
+ ```
29
+ This tool is used to parse and extract data from git packfiles without a .idx file.
30
+ By default, the script will only report the object ids, their type and their uncompressed sizes.
31
+ You can also make the script expand the content of the objects in the local directory or a directory
32
+ of your choice.
33
+
34
+ Usage:
35
+ packfile_reader [options] <packfile>
36
+ where [options] are:
37
+ -h, --headers-only Display only the headers of the packfile
38
+ -n, --no-headers Skip displaying the headers of the packfile
39
+ -i, --filter-by-ids=<s> Comma separated list of object ids to look for (default: any)
40
+ -e, --expand-objects Whether to expand objects data
41
+ -o, --output-dir=<s> Directory to store the expanded objects (default: .)
42
+ -v, --version Print version and exit
43
+ -l, --help Show this message
44
+
45
+ ```
46
+
47
+ ## Example:
48
+
49
+ The output of the command includes the packfile headers and a list of the objects found in the file, in the format:
50
+
51
+ ```
52
+ <object-id> <object-type> <size-uncompressed>
53
+ ```
54
+
55
+ ### Filtering by object ids
56
+
57
+ ```
58
+ packfile_reader -i "5297f8f21ad868d9eb6a9c01ad09a9d186177047,96438dd1e26e6963fa65be0012e8f6e84209bc5d" pack.sample
59
+ ```
60
+
61
+ ```
62
+ Packfile Headers
63
+ - Signature: PACK
64
+ - Version: 2
65
+ - Entries: 3
66
+
67
+ 96438dd1e26e6963fa65be0012e8f6e84209bc5d OBJ_COMMIT 653
68
+ 5297f8f21ad868d9eb6a9c01ad09a9d186177047 OBJ_BLOB 10
69
+ ```
70
+
71
+ ### Getting the headers only
72
+
73
+ ```
74
+ packfile_reader --headers-only pack.sample
75
+ ```
76
+
77
+ ```
78
+ Packfile Headers
79
+ - Signature: PACK
80
+ - Version: 2
81
+ - Entries: 3
82
+ ```
83
+
84
+ ### Getting the list of objects only
85
+
86
+ ```
87
+ packfile_reader --no-headers pack.sample
88
+ ```
89
+
90
+ ```
91
+ 96438dd1e26e6963fa65be0012e8f6e84209bc5d OBJ_COMMIT 653
92
+ 5297f8f21ad868d9eb6a9c01ad09a9d186177047 OBJ_BLOB 10
93
+ bf195faf9d23ce0615cdefd2b746a077ef82f03f OBJ_TREE 37
94
+ ```
95
+
96
+ ### Inflating an object's data
97
+
98
+ You can also ask the tool to inflate (decompress) the object data by using the `--expand-objects` option. `packfile_reader` will create a file named after the object id, with a `.txt` extension. By default the file is created in the current directory. To change that, use the `--output-dir` option.
99
+
100
+ ```
101
+ packfile_reader -i "5297f8f21ad868d9eb6a9c01ad09a9d186177047" -e -o /tmp pack.sample
102
+ ```
103
+
104
+ ```
105
+ Packfile Headers
106
+ - Signature: PACK
107
+ - Version: 2
108
+ - Entries: 3
109
+ 5297f8f21ad868d9eb6a9c01ad09a9d186177047 OBJ_BLOB 10
110
+ ```
111
+
112
+ ```
113
+ $ cat /tmp/5297f8f21ad868d9eb6a9c01ad09a9d186177047.txt
114
+ # test-git%
115
+ ```
116
+
117
+ # References
118
+ - http://shafiul.github.io/gitbook/7_the_packfile.html
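The object ids shown in the README examples are regular Git object ids: the SHA-1 of a `<type> <size>\0` header followed by the uncompressed content, which is what `compute_id` in `packfile_entry.rb` computes. A minimal standalone sketch of that calculation follows; the helper name `git_object_id` is hypothetical and not part of the gem.

```ruby
require 'digest'

# Hypothetical helper (not part of the gem): computes a Git object id from
# an object type name ('blob', 'tree', 'commit', 'tag') and its
# uncompressed content.
def git_object_id(type, data)
  data = data.b                               # treat the content as raw bytes
  header = "#{type} #{data.bytesize}\0".b     # e.g. "blob 10\0"
  Digest::SHA1.hexdigest(header + data)
end

# The empty blob has a well-known id in every Git repository.
puts git_object_id('blob', '')
```

Because the size in the header is the uncompressed byte count, the id can only be computed after the object data has been inflated.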
@@ -0,0 +1,62 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require 'optimist'
4
+ require 'packfile_reader'
5
+
6
+ opts = Optimist::options do
7
+ version "v0.0.1 (c) 2020 Robison WR Santos"
8
+ banner <<~EOS
9
+ This tool is used to parse and extract data from git packfiles without a .idx file.
10
+ By default, the script will only report the object ids, their type and their uncompressed sizes.
11
+ You can also make the script expand the content of the objects in the local directory or a directory
12
+ of your choice.
13
+
14
+ Usage:
15
+ packfile_reader [options] <packfile>
16
+ where [options] are:
17
+ EOS
18
+
19
+ opt :headers_only, 'Display only the headers of the packfile'
20
+ opt :no_headers, 'Skip displaying the headers of the packfile'
21
+ opt :filter_by_ids, 'Comma separated list of object ids to look for', :default => 'any', :short => '-i', :type => String
22
+ opt :expand_objects, 'Whether to expand objects data', :default => false, :short => '-e'
23
+ opt :output_dir, 'Directory to store the expanded objects', :default => '.', :short => '-o', :type => String
24
+ end
25
+
26
+ (puts "You must provide a single packfile, found #{ARGV.size}"; exit 1) unless ARGV.size == 1
27
+
28
+ packfile = ARGV.first
29
+ (puts 'Packfile not found'; exit 2) unless File.exist?(packfile)
30
+
31
+ File.open(packfile, 'rb') do |f|
32
+ begin
33
+ header = PackfileReader::PackfileHeader.new(f)
34
+ rescue RuntimeError => e
35
+ $stderr.puts e.message
36
+ exit 3
37
+ end
38
+
39
+ unless opts[:no_headers]
40
+ puts header
41
+ puts
42
+ end
43
+
44
+ exit(0) if opts[:headers_only]
45
+
46
+ ids_to_filter = opts[:filter_by_ids]
47
+ ids_to_filter = 'any' if ids_to_filter.empty?
48
+
49
+ objects_to_find = ids_to_filter == 'any' ? :any : ids_to_filter.split(',').map(&:strip)
50
+ entries_processed = 0
51
+ limit = objects_to_find == :any ? header.n_entries : objects_to_find.size
52
+ (0...limit).each do
53
+ entry = PackfileReader::PackfileEntry.next_entry(f, objects_to_find) do |c,u,id|
54
+ if opts[:expand_objects]
55
+ dir = opts[:output_dir]
56
+ File.open(File.join(dir, "#{id}.txt"), 'w') {|o| o.write u}
57
+ end
58
+ end
59
+
60
+ puts "#{entry.id}\t#{entry.type}\t#{entry.size}" unless entry.nil? # next_entry returns nil at end of file
61
+ end
62
+ end
@@ -0,0 +1,3 @@
1
+ require 'packfile_reader/packfile_header'
2
+ require 'packfile_reader/packfile_entry'
3
+ require 'packfile_reader/packfile_hunk'
@@ -0,0 +1,107 @@
1
+ require 'digest'
2
+ require 'zlib'
3
+ require 'packfile_reader/packfile_hunk'
4
+
5
+ module PackfileReader
6
+ class PackfileEntry
7
+ attr_reader :type
8
+ attr_reader :size
9
+ attr_reader :id
10
+
11
+ # ZLIB RFC: https://tools.ietf.org/html/rfc1950
12
+ ZLIB_HEADERS = [
13
+ 0b0111100000000001, #0x78 0x01
14
+ 0b0111100010011100, #0x78 0x9c
15
+ 0b0111100011011010, #0x78 0xda
16
+ ]
17
+
18
+ # Accepts a block that will receive the compressed data, uncompressed data and
19
+ # the computed object id
20
+ def self.next_entry(packfile_io, objects_to_find=:any)
21
+ loop do
22
+ return nil if packfile_io.eof?
23
+
24
+ hunk = PackfileReader::Hunk.new_with_type(packfile_io)
25
+
26
+ type = hunk.type
27
+ size = hunk.size
28
+ offset = hunk.offset_size
29
+
30
+ while hunk.continuation?
31
+ hunk = PackfileReader::Hunk.new_without_type(packfile_io)
32
+ size = (hunk.size << offset) | size # Data size is a combination of all hunk sizes
33
+ offset += hunk.offset_size
34
+ end
35
+
36
+ compressed_data, uncompressed_data = find_data(packfile_io)
37
+ object_id = compute_id(type, size, uncompressed_data)
38
+
39
+ if objects_to_find == :any || objects_to_find.member?(object_id)
40
+ yield compressed_data, uncompressed_data, object_id if block_given?
41
+ return PackfileEntry.new(type, size, object_id)
42
+ end
43
+ end
44
+ end
45
+
46
+ private
47
+ def self.find_data(packfile_io)
48
+ data_header = find_zlib_data_header(packfile_io)
49
+
50
+ # since we don't have the index file that accompanies pack files
51
+ # we need to use brute force to find where the compressed data ends
52
+ # to do that, we go byte by byte and try to inflate the data; when
53
+ # that succeeds, we know we got it all
54
+ compressed_data = data_header
55
+ compressed_data += packfile_io.read(1)
56
+
57
+ begin
58
+ uncompressed_data = Zlib.inflate(compressed_data)
59
+ rescue Zlib::BufError
60
+ compressed_data += packfile_io.read(1)
61
+ retry
62
+ end
63
+
64
+ [compressed_data, uncompressed_data]
65
+ end
66
+
67
+ def self.find_zlib_data_header(packfile_io)
68
+ # If type is COMMIT, TREE, BLOB or TAG, data is a zlib stream
69
+ # ref-delta uses a 20 byte hash of the base object at the beginning of data
70
+ # ofs-delta stores an offset within the same packfile to identify the base object
71
+ #
72
+ # Need to skip ahead until we find compressed data
73
+ # We really don't care about the delta objects
74
+
75
+ data_header = packfile_io.read(2) # 2 bytes to find the zlib header
76
+
77
+ while (not ZLIB_HEADERS.member?(data_header.unpack('n')[0]))
78
+ packfile_io.seek(packfile_io.pos - 1) # step back 1 byte so the 2-byte window advances one byte at a time
79
+ data_header = packfile_io.read(2)
80
+ end
81
+
82
+ data_header
83
+ end
84
+
85
+ def self.compute_id(type, size, uncompressed_data)
86
+ header_type = case type
87
+ when :OBJ_COMMIT then 'commit'
88
+ when :OBJ_TREE then 'tree'
89
+ when :OBJ_BLOB then 'blob'
90
+ when :OBJ_TAG then 'tag'
91
+ else ''
92
+ end
93
+
94
+ return '000' if header_type.empty?
95
+
96
+ header = "#{header_type} #{size}\0"
97
+ store = "#{header}#{uncompressed_data}"
98
+ Digest::SHA1.hexdigest(store)
99
+ end
100
+
101
+ def initialize(type, size, id)
102
+ @type = type
103
+ @size = size
104
+ @id = id
105
+ end
106
+ end
107
+ end
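The retry loop in `find_data` re-inflates the whole buffer every time it appends a byte. One possible alternative, shown here only as a sketch and not as the gem's method, is to feed the bytes to a streaming `Zlib::Inflate` and stop as soon as `finished?` reports the end of the zlib stream. The helper name `inflate_one_stream` is hypothetical.

```ruby
require 'zlib'
require 'stringio'

# Hypothetical helper (not part of the gem): reads exactly one zlib stream
# from io, leaving the position right after it.
# Returns [compressed_bytes, uncompressed_bytes].
def inflate_one_stream(io)
  zstream = Zlib::Inflate.new
  compressed = ''.b
  uncompressed = ''.b

  until zstream.finished?
    byte = io.read(1)
    raise EOFError, 'zlib stream ended unexpectedly' if byte.nil?

    compressed << byte
    uncompressed << zstream.inflate(byte) # may return an empty string
  end

  [compressed, uncompressed]
ensure
  zstream&.close
end

# A zlib stream followed by unrelated bytes, as happens inside a packfile.
io = StringIO.new(Zlib.deflate('hello packfile') + 'TRAILING')
_compressed, data = inflate_one_stream(io)
puts data
puts io.read
```

Each byte is handed to zlib only once, and the bytes after the stream stay unread, so the next entry can be parsed from the current position.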
@@ -0,0 +1,66 @@
1
+ module PackfileReader
2
+
3
+ # Defines a class for the HEADER portion of a git packfile
4
+ # From: https://git-scm.com/docs/pack-format
5
+ #
6
+ # A header appears at the beginning and consists of the following:
7
+ #
8
+ # 4-byte signature:
9
+ # The signature is: {'P', 'A', 'C', 'K'}
10
+ #
11
+ # 4-byte version number (network byte order):
12
+ # Git currently accepts version number 2 or 3 but generates version 2 only.
13
+ #
14
+ # 4-byte number of objects contained in the pack (network byte order)
15
+ #
16
+ # Observation: we cannot have more than 4G versions ;-) and more than 4G objects in a pack.
17
+ class PackfileHeader
18
+ attr_reader :version # version of the packfile
19
+ attr_reader :sign # signature (must be PACK)
20
+ attr_reader :n_entries # how many objects in the packfile
21
+
22
+ # Creates a new PackfileHeader instance reading data from the beginning
23
+ # of a packfile. It fails if it cannot parse the data or if the signature
24
+ # does not match the expected 'PACK' string
25
+ #
26
+ # Params:
27
+ # +packfile_io+:: the opened packfile handler in binary format (usually the output of File.open('path', 'rb'))
28
+ def initialize(packfile_io)
29
+ begin
30
+ go_to_start(packfile_io)
31
+ @sign = parse_sign(packfile_io)
32
+ @version = parse_version(packfile_io)
33
+ @n_entries = parse_n_entries(packfile_io)
34
+ rescue
35
+ raise 'Invalid packfile. Cannot parse header'
36
+ end
37
+ raise "Invalid signature. Got '#{@sign}' expected 'PACK'" unless @sign == 'PACK'
38
+ end
39
+
40
+ def to_s
41
+ <<~EOS.strip
42
+ Packfile Headers
43
+ - Signature: #{@sign}
44
+ - Version: #{@version}
45
+ - Entries: #{@n_entries}
46
+ EOS
47
+ end
48
+
49
+ private
50
+ def go_to_start(packfile_io)
51
+ packfile_io.seek(0)
52
+ end
53
+
54
+ def parse_sign(packfile_io)
55
+ packfile_io.read(4)
56
+ end
57
+
58
+ def parse_version(packfile_io)
59
+ packfile_io.read(4).unpack("N")[0]
60
+ end
61
+
62
+ def parse_n_entries(packfile_io)
63
+ packfile_io.read(4).unpack("N")[0]
64
+ end
65
+ end
66
+ end
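The 12-byte header layout described in the comments above (4-byte signature, 4-byte version, 4-byte object count, both integers in network byte order) can also be decoded with a single `unpack` call. The sketch below assumes that layout; `parse_pack_header` is a hypothetical helper, not part of the gem.

```ruby
require 'stringio'

# Hypothetical helper (not part of the gem): parses the 12-byte packfile
# header in one step.
# 'a4' = 4 raw bytes (signature), 'N' = 32-bit big-endian unsigned integer.
def parse_pack_header(io)
  raw = io.read(12)
  raise 'Invalid packfile. Cannot parse header' if raw.nil? || raw.bytesize < 12

  sign, version, n_entries = raw.unpack('a4NN')
  raise "Invalid signature. Got '#{sign}' expected 'PACK'" unless sign == 'PACK'

  { sign: sign, version: version, n_entries: n_entries }
end

# Build an in-memory header equivalent to the README sample (version 2, 3 entries).
sample = StringIO.new(['PACK', 2, 3].pack('a4NN'))
p parse_pack_header(sample)
```

Checking the length before unpacking gives a clear error for truncated files instead of a `nil` dereference.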
@@ -0,0 +1,50 @@
1
+ module PackfileReader
2
+ class Hunk
3
+ attr_reader :size
4
+ attr_reader :type
5
+ attr_reader :offset_size
6
+
7
+ TYPE_MAP = {
8
+ 1 => :OBJ_COMMIT,
9
+ 2 => :OBJ_TREE,
10
+ 3 => :OBJ_BLOB,
11
+ 4 => :OBJ_TAG,
12
+ 6 => :OBJ_OFS_DELTA,
13
+ 7 => :OBJ_REF_DELTA,
14
+ }
15
+
16
+ HUNK_TYPE_MASK = 0b01110000
17
+ HUNK_SIZE_4_MASK = 0b00001111
18
+ HUNK_SIZE_7_MASK = 0b01111111
19
+
20
+ def self.new_with_type(packfile_io)
21
+ hunk_bytes = packfile_io.read(1).unpack('C')[0]
22
+ continuation = hunk_bytes[7] # Most significant bit of the byte
23
+ type = (hunk_bytes & HUNK_TYPE_MASK) >> 4 # Adjust type position (remove the extra 4 bits at the end)
24
+ size = (hunk_bytes & HUNK_SIZE_4_MASK) # We only have 4 bits in a hunk with type for size
25
+
26
+ Hunk.new(continuation == 1, size, 4, TYPE_MAP[type])
27
+ end
28
+
29
+ def self.new_without_type(packfile_io)
30
+ hunk_bytes = packfile_io.read(1).unpack('C')[0]
31
+ continuation = hunk_bytes[7] # Most significant bit of the byte
32
+ size = (hunk_bytes & HUNK_SIZE_7_MASK) # We have 7 bits for size in a hunk without type
33
+
34
+ Hunk.new(continuation == 1, size, 7)
35
+ end
36
+
37
+ def continuation?
38
+ @continuation
39
+ end
40
+
41
+ private
42
+ def initialize(continuation, size, offset_size, type=nil)
43
+ @continuation = continuation
44
+ @size = size
45
+ @offset_size = offset_size
46
+ @type = type
47
+ end
48
+
49
+ end
50
+ end
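Taken together, `Hunk.new_with_type`, `Hunk.new_without_type` and the loop in `PackfileEntry.next_entry` decode a variable-length entry header: the first byte carries a continuation bit, a 3-bit type and the lowest 4 size bits, and each following byte carries a continuation bit and 7 more size bits, least significant group first. The same decoding is sketched below as one function; `read_type_and_size` is a hypothetical name, not part of the gem.

```ruby
require 'stringio'

# Hypothetical helper (not part of the gem): decodes the variable-length
# header that precedes each packfile entry. Returns [type_number, size].
def read_type_and_size(io)
  byte = io.readbyte
  type = (byte >> 4) & 0b111        # bits 6..4 hold the object type
  size = byte & 0b1111              # bits 3..0 hold the lowest 4 size bits
  shift = 4

  while (byte & 0b1000_0000) != 0   # most significant bit set: more size bytes follow
    byte = io.readbyte
    size |= (byte & 0b0111_1111) << shift
    shift += 7
  end

  [type, size]
end

# Constructed example: a commit (type 1) of size 653.
# 0x9D = 1_001_1101 -> continuation, type 1, low size bits 1101 (13)
# 0x28 = 0_0101000  -> last byte, contributes 40 << 4 = 640
p read_type_and_size(StringIO.new([0x9D, 0x28].pack('C*'))) # prints [1, 653]
```

The type number maps to a symbol through `TYPE_MAP`, so `1` corresponds to `:OBJ_COMMIT`.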
metadata ADDED
@@ -0,0 +1,63 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: packfile_reader
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.0.1
5
+ platform: ruby
6
+ authors:
7
+ - Robison WR Santos
8
+ autorequire:
9
+ bindir: bin
10
+ cert_chain: []
11
+ date: 2020-11-30 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: optimist
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - "~>"
18
+ - !ruby/object:Gem::Version
19
+ version: 3.0.1
20
+ type: :runtime
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - "~>"
25
+ - !ruby/object:Gem::Version
26
+ version: 3.0.1
27
+ description: A tool to parse git packfile when idx files are not present
28
+ email: ''
29
+ executables:
30
+ - packfile_reader
31
+ extensions: []
32
+ extra_rdoc_files: []
33
+ files:
34
+ - README.md
35
+ - bin/packfile_reader
36
+ - lib/packfile_reader.rb
37
+ - lib/packfile_reader/packfile_entry.rb
38
+ - lib/packfile_reader/packfile_header.rb
39
+ - lib/packfile_reader/packfile_hunk.rb
40
+ homepage: https://github.com/robisonsantos/packfile_reader
41
+ licenses:
42
+ - MIT
43
+ metadata: {}
44
+ post_install_message:
45
+ rdoc_options: []
46
+ require_paths:
47
+ - lib
48
+ required_ruby_version: !ruby/object:Gem::Requirement
49
+ requirements:
50
+ - - ">="
51
+ - !ruby/object:Gem::Version
52
+ version: '0'
53
+ required_rubygems_version: !ruby/object:Gem::Requirement
54
+ requirements:
55
+ - - ">="
56
+ - !ruby/object:Gem::Version
57
+ version: '0'
58
+ requirements: []
59
+ rubygems_version: 3.1.2
60
+ signing_key:
61
+ specification_version: 4
62
+ summary: Parses git packfiles without the help of idx companion
63
+ test_files: []