s3grep 0.1.8 → 0.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 804681319bd1ddfd4ff89ab54f21f9cff854b06d3a254ed7870da7be8dcfb059
- data.tar.gz: f615b79c46f460a73f530fbe075e16f9c9ee9dd4c778cf26eee9ad63eea15ae7
+ metadata.gz: 686a06c854681a05e7b5915cc5819fe710b2b3da842926b6d5acb019de08c1d8
+ data.tar.gz: ebbfd6d9f891ca5a3a7036c90775595357f376385c5573c413fff740d9a7e086
  SHA512:
- metadata.gz: 6ac540deed47a2e02ab396b0f92c018ae1ec608b4824b4adb13e2d9e5fe4631366879fdfb176cd2dfe7bbee7ec040ce53d74a95f92e30994d1b1e87120d09e6a
- data.tar.gz: 8eaa9a8e01415551c0e0770ac59b4cf8e5374f75317fbf2a9ee60f3bd16766b072acd4b0d440a79af6013a2dd221b06c7ced8a1ffbcecd8973e95397225f6695
+ metadata.gz: fd00546c540f5ef8688b9af0e3668366bc496e4eae2120f394aac67400ade50c485ae43b00e4987e03423cc2d53404bafe6fa59f07c122c82d5c9de1de5f6071
+ data.tar.gz: 216072d06695ac0404584cc1273c1e84fc988e61eab9b8700c9d3e4ffcee786f169077e860502a4db2c86aa23aa0a39e8a3ae53c9b9b3d4c98073ac123026b33
data/.ruby-version CHANGED
@@ -1 +1 @@
- 3.1.0
+ 3.4.2
data/ARCHITECTURE.md ADDED
@@ -0,0 +1,188 @@
+ # Architecture
+
+ ## Overview
+
+ s3grep is a Ruby gem providing grep-like functionality for AWS S3 objects. It streams files directly from S3 without downloading them locally, enabling efficient searching of large files.
+
+ ## Component Diagram
+
+ ```
+ ┌─────────────────────────────────────────────────────────────────┐
+ │ CLI Layer (bin/) │
+ ├──────────┬──────────┬──────────────┬───────────────────────────┤
+ │ s3grep │ s3cat │ s3info │ s3report │
+ │ (search) │ (stream) │ (dir stats) │ (bucket inventory) │
+ └────┬─────┴────┬─────┴──────┬───────┴─────────────┬─────────────┘
+ │ │ │ │
+ ▼ ▼ ▼ ▼
+ ┌─────────────────────────────────────────────────────────────────┐
+ │ S3Grep Module (lib/) │
+ ├─────────────────────┬───────────────────┬───────────────────────┤
+ │ Search │ Directory │ DirectoryInfo │
+ │ (file streaming) │ (object listing) │ (stats aggregation) │
+ └──────────┬──────────┴─────────┬─────────┴───────────┬───────────┘
+ │ │ │
+ ▼ ▼ ▼
+ ┌─────────────────────────────────────────────────────────────────┐
+ │ AWS SDK (aws-sdk-s3) │
+ │ get_object │ list_objects │ list_buckets │
+ └─────────────────────────────────────────────────────────────────┘
+ ```
+
+ ## Core Classes
+
+ ### S3Grep::Search
+
+ True streaming S3 object reader with line-by-line regex matching. Uses chunked transfer to avoid loading entire files into memory.
+
+ **Responsibilities:**
+ - Parse S3 URL to extract bucket and key
+ - Stream object content via `get_object` block form (chunked transfer)
+ - Buffer partial lines across chunk boundaries
+ - Auto-detect and decompress .gz files (streaming) and .zip files (buffered)
+ - Yield matching lines with line numbers
+ - Enforce size limits to prevent resource exhaustion
+
+ **Key Methods:**
+ - `Search.search(s3_url, client, regex)` - Class method for simple searches
+ - `Search.detect_compression(s3_url)` - Infers compression from file extension
+ - `#each_line` - Core streaming iterator, yields lines as they arrive
+ - `#to_io` - Returns StreamingIO adapter for backward compatibility
+
+ **Streaming Implementation:**
+ - Raw files: Chunks streamed directly, lines extracted from buffer
+ - Gzip files: Chunks decompressed via `Zlib::Inflate` as they arrive
+ - ZIP files: Must buffer entire archive (ZIP format requires central directory at EOF)
+
+ ### S3Grep::Directory
+
+ Lists objects in an S3 prefix with optional glob-style filtering.
+
+ **Responsibilities:**
+ - Parse S3 URL to extract bucket and prefix
+ - Handle pagination (1000 objects per request)
+ - URL-encode/decode object keys with special characters
+ - Support regex filtering via `glob` method
+
+ **Key Methods:**
+ - `Directory.glob(s3_url, client, regex)` - List objects matching pattern
+ - `#each` - Iterate full S3 URLs for all objects
+ - `#each_content` - Iterate raw S3 object metadata (for DirectoryInfo)
+ - `#info` - Factory method returning DirectoryInfo
+
+ ### S3Grep::DirectoryInfo
+
+ Aggregates statistics while iterating through directory contents.
+
+ **Responsibilities:**
+ - Count files and total size
+ - Track newest/oldest files by modification date
+ - Breakdown counts and sizes by storage class
+
+ **Key Methods:**
+ - `DirectoryInfo.get(directory)` - Process directory and return populated info
+ - `#last_modified` / `#first_modified` - Timestamp accessors
+ - `#newest_file` / `#first_file` - Key accessors
+
+ ## Data Flow
+
+ ### Search Flow (s3grep)
+
+ ```
+ User Input: regex + s3://bucket/key
+
+
+ ┌──────────────┐
+ │ Parse S3 URL │
+ └──────┬───────┘
+
+
+ ┌──────────────────────┐
+ │ Detect compression │
+ │ (.gz, .zip, or none) │
+ └──────────┬───────────┘
+
+
+ ┌──────────────────────────────────────┐
+ │ aws_s3_client.get_object(block form) │
+ │ Streams chunks as they arrive │
+ └──────────┬───────────────────────────┘
+
+ ▼ (for each chunk)
+ ┌──────────────────────────────────────┐
+ │ Decompress chunk if gzip │
+ │ (Zlib::Inflate streaming) │
+ └──────────┬───────────────────────────┘
+
+
+ ┌──────────────────────────────────────┐
+ │ Append to line buffer │
+ │ Extract complete lines │
+ │ Yield matches with line numbers │
+ └──────────┬───────────────────────────┘
+
+
+ ┌──────────────────────────────────────┐
+ │ Repeat until stream exhausted │
+ │ Yield final partial line if any │
+ └──────────────────────────────────────┘
+ ```
+
+ **Memory Behavior:**
+ - Raw/Gzip: Only current chunk + line buffer in memory (~64KB typical)
+ - ZIP: Entire archive buffered (ZIP format limitation)
+
+ ### Directory Listing Flow (s3info, recursive s3grep)
+
+ ```
+ User Input: s3://bucket/prefix/
+
+
+ ┌──────────────────────┐
+ │ list_objects │
+ │ (max_keys: 1000) │
+ └──────────┬───────────┘
+
+
+ ┌──────────────────────┐
+ │ More results? │──No──▶ Done
+ │ (size == max_keys) │
+ └──────────┬───────────┘
+ │ Yes
+
+ ┌──────────────────────┐
+ │ list_objects with │
+ │ marker = last key │
+ └──────────┬───────────┘
+
+ └───────▶ (repeat until exhausted)
+ ```
+
+ ## AWS Integration
+
+ ### Authentication
+ Uses the AWS SDK default credential chain:
+ 1. Environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`)
+ 2. Shared credentials file (`~/.aws/credentials`)
+ 3. IAM instance profile (EC2/ECS)
+
+ ### Region Handling
+ - Default client uses `AWS_REGION` or `~/.aws/config`
+ - `s3report` creates region-specific clients per bucket via `get_bucket_location`
+
+ ### S3 URL Format
+ All tools expect: `s3://bucket-name/path/to/prefix`
+ - Host = bucket name
+ - Path = object key or prefix (URL-decoded internally)
+
+ ## Compression Support
+
+ | Extension | Library | Streaming | Notes |
+ |-----------|---------|-----------|-------|
+ | `.gz` | zlib (stdlib) | ✅ Yes | Zlib::Inflate processes chunks as they arrive |
+ | `.zip` | rubyzip | ❌ No | ZIP format requires buffering (central directory at EOF) |
+ | (none) | - | ✅ Yes | Chunks streamed directly |
+
+ **Size Limits:**
+ - `MAX_BYTES_PROCESSED` (100MB default) prevents resource exhaustion
+ - Configurable via `S3Grep::Search::MAX_BYTES_PROCESSED`
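The chunk-boundary line buffering described in the search flow above can be sketched in plain Ruby. This is a simplified standalone version of the buffer-and-extract step (modeled on the gem's `extract_lines!` helper), with hypothetical hard-coded chunks standing in for the pieces `get_object` yields:

```ruby
# Simplified sketch of the line-buffering step: complete lines are sliced
# off the front of the buffer as each chunk arrives; a trailing partial
# line stays buffered until a later chunk (or end of stream) completes it.
def extract_lines!(buffer)
  while (newline_index = buffer.index("\n"))
    yield buffer.slice!(0, newline_index + 1)
  end
end

lines = []
buffer = +""
["alpha\nbe", "ta\ngam", "ma\n"].each do |chunk| # simulated S3 chunks
  buffer << chunk
  extract_lines!(buffer) { |line| lines << line }
end
lines << buffer unless buffer.empty? # final partial line, if any
# lines == ["alpha\n", "beta\n", "gamma\n"]
```

Note how `beta` is reassembled even though it straddles two chunks; only the buffer, never the whole object, is held in memory.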
data/CLAUDE.md ADDED
@@ -0,0 +1,55 @@
+ # CLAUDE.md
+
+ This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+ ## Project Overview
+
+ s3grep is a Ruby gem for searching through S3 files without downloading them. It provides CLI tools for grep-like searching, file viewing, and bucket reporting directly on S3 objects.
+
+ ## Development Commands
+
+ ```bash
+ # Install dependencies
+ bundle install
+
+ # Build the gem
+ gem build s3grep.gemspec
+
+ # Install locally for testing
+ gem install s3grep-*.gem
+ ```
+
+ ## CLI Tools
+
+ - `s3grep` - Search for patterns in S3 files (supports `-i` for case-insensitive, `-r` for recursive, `--include` for file patterns)
+ - `s3cat` - Stream S3 file contents to stdout
+ - `s3info` - Get directory statistics (file count, size, storage classes, date ranges) as JSON
+ - `s3report` - Generate CSV report of all buckets in an AWS account
+
+ ## Architecture
+
+ **Core Classes (lib/s3grep/):**
+
+ - `Search` - Streams S3 objects and searches line-by-line with regex. Auto-detects compression (.gz, .zip)
+ - `Directory` - Lists S3 objects with prefix filtering. Handles pagination via marker-based iteration
+ - `DirectoryInfo` - Aggregates statistics from Directory iteration (counts, sizes, timestamps by storage class)
+
+ **S3 URL Convention:** All tools use `s3://bucket-name/path/to/object` format. The bucket name is parsed from the URL host, and the path becomes the S3 key prefix.
+
+ **AWS Authentication:** Uses default AWS SDK credential chain (env vars, ~/.aws/credentials, IAM roles). Region-specific clients are created automatically for cross-region bucket access in s3report.
+
+ ## Code Commits
+
+ Format using angular formatting:
+ ```
+ <type>(<scope>): <short summary>
+ ```
+ - **type**: build|ci|docs|feat|fix|perf|refactor|test
+ - **scope**: The feature or component of the service we're working on
+ - **summary**: Summary in present tense. Not capitalized. No period at the end.
+
+ ## Documentation Maintenance
+
+ When modifying the codebase, keep documentation in sync:
+ - **ARCHITECTURE.md** - Update when adding/removing classes, changing component relationships, or altering data flow patterns
+ - **README.md** - Update when adding new features, changing public APIs, or modifying usage examples
data/Gemfile.lock CHANGED
@@ -1,29 +1,36 @@
  GEM
  remote: http://rubygems.org/
  specs:
- aws-eventstream (1.2.0)
- aws-partitions (1.571.0)
- aws-sdk-core (3.130.0)
- aws-eventstream (~> 1, >= 1.0.2)
- aws-partitions (~> 1, >= 1.525.0)
- aws-sigv4 (~> 1.1)
- jmespath (~> 1.0)
- aws-sdk-kms (1.55.0)
- aws-sdk-core (~> 3, >= 3.127.0)
- aws-sigv4 (~> 1.1)
- aws-sdk-s3 (1.113.0)
- aws-sdk-core (~> 3, >= 3.127.0)
+ aws-eventstream (1.4.0)
+ aws-partitions (1.1211.0)
+ aws-sdk-core (3.241.4)
+ aws-eventstream (~> 1, >= 1.3.0)
+ aws-partitions (~> 1, >= 1.992.0)
+ aws-sigv4 (~> 1.9)
+ base64
+ bigdecimal
+ jmespath (~> 1, >= 1.6.1)
+ logger
+ aws-sdk-kms (1.121.0)
+ aws-sdk-core (~> 3, >= 3.241.4)
+ aws-sigv4 (~> 1.5)
+ aws-sdk-s3 (1.213.0)
+ aws-sdk-core (~> 3, >= 3.241.4)
  aws-sdk-kms (~> 1)
- aws-sigv4 (~> 1.4)
- aws-sigv4 (1.4.0)
+ aws-sigv4 (~> 1.5)
+ aws-sigv4 (1.12.1)
  aws-eventstream (~> 1, >= 1.0.2)
- jmespath (1.6.1)
+ base64 (0.3.0)
+ bigdecimal (4.0.1)
+ jmespath (1.6.2)
+ logger (1.7.0)

  PLATFORMS
- x86_64-darwin-21
+ arm64-darwin-24
+ ruby

  DEPENDENCIES
  aws-sdk-s3

  BUNDLED WITH
- 2.3.3
+ 2.6.2
data/README.md CHANGED
@@ -2,13 +2,133 @@

  Search through S3 files without downloading them.

- # Basic Usage
+ ## Installation

- Search for a pattern in a S3 file.
+ ```bash
+ gem install s3grep
+ ```
+
+ Or add to your Gemfile:

- example:
+ ```ruby
+ gem 's3grep'
  ```
- s3grep Bob s3://exammple.com/users.csv
+
+ ## CLI Tools
+
+ ### s3grep
+
+ Search for a pattern in S3 files. Supports gzip and zip compressed files automatically.
+
+ ```bash
+ # Basic search
+ s3grep "pattern" s3://bucket-name/path/to/file.csv
+
+ # Case-insensitive search
+ s3grep -i "pattern" s3://bucket-name/path/to/file.csv
+
+ # Recursive search through a directory
+ s3grep -r "pattern" s3://bucket-name/path/to/directory/
+
+ # Recursive search with file pattern filter
+ s3grep -r --include "\.csv$" "pattern" s3://bucket-name/logs/
+
+ # Search compressed files (auto-detected)
+ s3grep "error" s3://bucket-name/logs/app.log.gz
  ```

- Outputs S3 file with line number and the matching line.
+ Output format: `s3://bucket/path/file:line_number content`
+
+ ### s3cat
+
+ Stream S3 file contents to stdout.
+
+ ```bash
+ # Print file contents
+ s3cat s3://bucket-name/path/to/file.txt
+
+ # Pipe to other commands
+ s3cat s3://bucket-name/data.csv | head -20
+
+ # Use with standard unix tools
+ s3cat s3://bucket-name/users.json | jq '.users[]'
+ ```
+
+ ### s3info
+
+ Get statistics about an S3 directory as JSON.
+
+ ```bash
+ # Get info for a prefix
+ s3info s3://bucket-name/path/to/directory/
+ ```
+
+ Output includes:
+ - `bucket` - Bucket name
+ - `base_prefix` - S3 prefix path
+ - `total_size` - Total bytes across all files
+ - `num_files` - File count
+ - `last_modified` / `newest_file` - Most recently modified file
+ - `first_modified` / `first_file` - Oldest file
+ - `num_files_by_storage_class` - File count breakdown by storage class
+ - `total_size_by_storage_class` - Size breakdown by storage class
+
+ Example output:
+ ```json
+ {
+ "bucket": "my-bucket",
+ "base_prefix": "logs/2024/",
+ "total_size": 1048576000,
+ "num_files": 365,
+ "last_modified": "2024-12-31T23:59:59+00:00",
+ "newest_file": "logs/2024/12/31/app.log",
+ "first_modified": "2024-01-01T00:00:00+00:00",
+ "first_file": "logs/2024/01/01/app.log",
+ "num_files_by_storage_class": {
+ "STANDARD": 100,
+ "STANDARD_IA": 265
+ },
+ "total_size_by_storage_class": {
+ "STANDARD": 500000000,
+ "STANDARD_IA": 548576000
+ }
+ }
+ ```
+
+ ### s3report
+
+ Generate a CSV report of all S3 buckets in your AWS account.
+
+ ```bash
+ s3report
+ ```
+
+ Creates a file named `AWS-S3-Usage-Report-YYYY-MM-DDTHHMMSS.csv` with columns:
+ - Bucket
+ - Creation Date
+ - Total Size
+ - Number of Files
+ - Last Modified
+ - Newest File
+ - First Modified
+ - First File
+
+ ## AWS Configuration
+
+ Authentication uses the standard AWS SDK credential chain:
+
+ 1. Environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`)
+ 2. Shared credentials file (`~/.aws/credentials`)
+ 3. IAM instance profile (EC2/ECS)
+
+ Set your region via `AWS_REGION` environment variable or `~/.aws/config`.
+
+ Use `AWS_PROFILE` to select a named profile:
+
+ ```bash
+ AWS_PROFILE=stage s3grep "error" s3://my-bucket/logs/app.log
+ ```
+
+ ## License
+
+ MIT
data/bin/s3cat CHANGED
@@ -1,12 +1,37 @@
  #!/usr/bin/env ruby

- require 'optparse'
  require 's3grep'
  require 'aws-sdk-s3'

- s3_file = ARGV[0]
- aws_s3_client = Aws::S3::Client.new
- search = S3Grep::Search.new(s3_file, aws_s3_client, nil)
- search.to_io.each do |line|
- print line
+ # Exit cleanly on broken pipe (e.g., when piping to head)
+ Signal.trap("PIPE", "EXIT")
+
+ s3_url = ARGV[0]
+
+ if s3_url.nil? || s3_url.empty?
+ $stderr.puts "Usage: s3cat s3://bucket/path/to/file"
+ exit 1
+ end
+
+ unless s3_url.start_with?('s3://')
+ $stderr.puts "Error: Invalid S3 URL format. Expected s3://bucket/path"
+ exit 1
+ end
+
+ begin
+ # max_attempts: 1 disables retries (required for streaming)
+ aws_s3_client = Aws::S3::Client.new(max_attempts: 1)
+ search = S3Grep::Search.new(s3_url, aws_s3_client)
+ search.to_io.each do |line|
+ print line
+ end
+ rescue Errno::EPIPE
+ # Broken pipe (e.g., piping to head) - exit silently
+ exit 0
+ rescue Aws::S3::Errors::ServiceError => e
+ $stderr.puts "S3 Error: #{e.message}"
+ exit 1
+ rescue => e
+ $stderr.puts "Error: #{e.message}"
+ exit 1
  end
data/bin/s3grep CHANGED
@@ -4,13 +4,31 @@ require 'optparse'
  require 's3grep'
  require 'aws-sdk-s3'

+ # Exit cleanly on broken pipe (e.g., when piping to head)
+ Signal.trap("PIPE", "EXIT")
+
+ # Maximum regex pattern length to prevent ReDoS
+ MAX_PATTERN_LENGTH = 1000
+
+ def safe_regexp(pattern, options = 0)
+ if pattern.length > MAX_PATTERN_LENGTH
+ $stderr.puts "Error: Pattern too long (max #{MAX_PATTERN_LENGTH} characters)"
+ exit 1
+ end
+ Regexp.new(pattern, options)
+ rescue RegexpError => e
+ $stderr.puts "Error: Invalid regular expression: #{e.message}"
+ exit 1
+ end
+
  options = {
  ignore_case: false,
  recursive: false,
  file_pattern: /.*/
  }
+
  OptionParser.new do |opts|
- opts.banner = 'Usage: s3grep [options]'
+ opts.banner = 'Usage: s3grep [options] PATTERN s3://bucket/path'

  opts.on('-i', '--ignore-case', 'Ignore case') do
  options[:ignore_case] = true
@@ -21,30 +39,48 @@ OptionParser.new do |opts|
  end

  opts.on('--include FILE_PATTERN', 'Include matching files') do |v|
- options[:file_pattern] = Regexp.new(v, Regexp::IGNORECASE)
+ options[:file_pattern] = safe_regexp(v, Regexp::IGNORECASE)
  end
  end.parse!

- regex_options =
- if options[:ignore_case]
- Regexp::IGNORECASE
- else
- 0
- end
+ if ARGV.length < 2
+ $stderr.puts "Usage: s3grep [options] PATTERN s3://bucket/path"
+ exit 1
+ end

- regex = Regexp.new(ARGV[0], regex_options)
+ pattern = ARGV[0]
  s3_url = ARGV[1]

- aws_s3_client = Aws::S3::Client.new
+ unless s3_url.start_with?('s3://')
+ $stderr.puts "Error: Invalid S3 URL format. Expected s3://bucket/path"
+ exit 1
+ end
+
+ regex_options = options[:ignore_case] ? Regexp::IGNORECASE : 0
+ regex = safe_regexp(pattern, regex_options)
+
+ begin
+ # max_attempts: 1 disables retries (required for streaming)
+ aws_s3_client = Aws::S3::Client.new(max_attempts: 1)

- if options[:recursive]
- S3Grep::Directory.glob(s3_url, aws_s3_client, options[:file_pattern]) do |s3_file|
- S3Grep::Search.search(s3_file, aws_s3_client, regex) do |line_number, line|
- puts "#{s3_file}:#{line_number} #{line}"
+ if options[:recursive]
+ S3Grep::Directory.glob(s3_url, aws_s3_client, options[:file_pattern]) do |s3_file|
+ S3Grep::Search.search(s3_file, aws_s3_client, regex) do |line_number, line|
+ puts "#{s3_file}:#{line_number} #{line}"
+ end
+ end
+ else
+ S3Grep::Search.search(s3_url, aws_s3_client, regex) do |line_number, line|
+ puts "#{s3_url}:#{line_number} #{line}"
  end
  end
- else
- S3Grep::Search.search(s3_url, aws_s3_client, regex) do |line_number, line|
- puts "#{s3_url}:#{line_number} #{line}"
- end
+ rescue Errno::EPIPE
+ # Broken pipe (e.g., piping to head) - exit silently
+ exit 0
+ rescue Aws::S3::Errors::ServiceError => e
+ $stderr.puts "S3 Error: #{e.message}"
+ exit 1
+ rescue => e
+ $stderr.puts "Error: #{e.message}"
+ exit 1
  end
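The pattern-validation approach in the diff above can be exercised in isolation. A sketch of the same guard, except it returns `nil` on bad input instead of calling `exit`, so it can run outside the CLI:

```ruby
# Sketch of the safe_regexp guard: bound the pattern length (a coarse
# ReDoS mitigation) and turn RegexpError into a nil return rather than a
# backtrace. Returning nil is a deliberate change from the CLI, which exits.
MAX_PATTERN_LENGTH = 1000

def safe_regexp(pattern, options = 0)
  return nil if pattern.length > MAX_PATTERN_LENGTH
  Regexp.new(pattern, options)
rescue RegexpError
  nil
end

safe_regexp("error|warn", Regexp::IGNORECASE) # => /error|warn/i
safe_regexp("(unclosed")                      # => nil (invalid regex)
safe_regexp("a" * 2000)                       # => nil (over length limit)
```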
data/bin/s3info CHANGED
@@ -1,23 +1,47 @@
  #!/usr/bin/env ruby

- require 'optparse'
+ require 'pathname'
+ BASE_DIR = Pathname.new(File.expand_path('..', __dir__))
+ $LOAD_PATH << "#{BASE_DIR}/lib"
+
  require 's3grep'
  require 'aws-sdk-s3'
  require 'json'

- s3_file = ARGV[0]
- aws_s3_client = Aws::S3::Client.new
- info = S3Grep::Directory.new(s3_file, aws_s3_client).info
-
- stats = {
- bucket: info.bucket,
- base_prefix: info.base_prefix,
- total_size: info.total_size,
- num_files: info.num_files,
- last_modified: info.last_modified,
- newest_file: info.newest_file,
- first_modified: info.first_modified,
- first_file: info.first_file
- }
-
- print JSON.pretty_generate(stats) + "\n"
+ s3_url = ARGV[0]
+
+ if s3_url.nil? || s3_url.empty?
+ $stderr.puts "Usage: s3info s3://bucket/path"
+ exit 1
+ end
+
+ unless s3_url.start_with?('s3://')
+ $stderr.puts "Error: Invalid S3 URL format. Expected s3://bucket/path"
+ exit 1
+ end
+
+ begin
+ aws_s3_client = Aws::S3::Client.new
+ info = S3Grep::Directory.new(s3_url, aws_s3_client).info
+
+ stats = {
+ bucket: info.bucket,
+ base_prefix: info.base_prefix,
+ total_size: info.total_size,
+ num_files: info.num_files,
+ last_modified: info.last_modified,
+ newest_file: info.newest_file,
+ first_modified: info.first_modified,
+ first_file: info.first_file,
+ num_files_by_storage_class: info.num_files_by_storage_class,
+ total_size_by_storage_class: info.total_size_by_storage_class
+ }
+
+ print JSON.pretty_generate(stats) + "\n"
+ rescue Aws::S3::Errors::ServiceError => e
+ $stderr.puts "S3 Error: #{e.message}"
+ exit 1
+ rescue => e
+ $stderr.puts "Error: #{e.message}"
+ exit 1
+ end
data/bin/s3report CHANGED
@@ -1,56 +1,82 @@
  #!/usr/bin/env ruby

- require 'optparse'
  require 's3grep'
  require 'aws-sdk-s3'
  require 'csv'

- bucket_info = {}
- aws_s3_client = Aws::S3::Client.new
- aws_s3_client.list_buckets[:buckets].each do |bucket|
- name = bucket[:name]
- puts name
-
- bucket_location = aws_s3_client.get_bucket_location(bucket: name)
- aws_s3_client_region_specific =
- if bucket_location[:location_constraint] == ''
- aws_s3_client
- else
- Aws::S3::Client.new(region: bucket_location[:location_constraint])
+ # Sanitize values to prevent CSV injection attacks
+ # Prefixes dangerous characters with a single quote
+ def sanitize_csv_value(value)
+ return value unless value.is_a?(String)
+ return value if value.empty?
+
+ # Characters that can trigger formula execution in spreadsheets
+ if %w[= + - @ \t \r].include?(value[0])
+ "'#{value}"
+ else
+ value
+ end
+ end
+
+ begin
+ bucket_info = {}
+ aws_s3_client = Aws::S3::Client.new
+
+ aws_s3_client.list_buckets[:buckets].each do |bucket|
+ name = bucket[:name]
+ puts name
+
+ begin
+ bucket_location = aws_s3_client.get_bucket_location(bucket: name)
+ aws_s3_client_region_specific =
+ if bucket_location[:location_constraint].nil? || bucket_location[:location_constraint] == ''
+ aws_s3_client
+ else
+ Aws::S3::Client.new(region: bucket_location[:location_constraint])
+ end
+
+ info = S3Grep::Directory.new("s3://#{name}/", aws_s3_client_region_specific).info
+
+ bucket_info[name] = {
+ bucket: info.bucket,
+ creation_date: bucket[:creation_date],
+ total_size: info.total_size,
+ num_files: info.num_files,
+ last_modified: info.last_modified,
+ newest_file: info.newest_file,
+ first_modified: info.first_modified,
+ first_file: info.first_file
+ }
+ rescue Aws::S3::Errors::ServiceError => e
+ $stderr.puts "Warning: Could not access bucket '#{name}': #{e.message}"
  end
+ end

- info = S3Grep::Directory.new("s3://#{name}/", aws_s3_client_region_specific).info
-
- bucket_info[name] = {
- bucket: info.bucket,
- creation_date: bucket[:creation_date],
- total_size: info.total_size,
- num_files: info.num_files,
- last_modified: info.last_modified,
- newest_file: info.newest_file,
- first_modified: info.first_modified,
- first_file: info.first_file
+ csv_headers = {
+ bucket: 'Bucket',
+ creation_date: 'Creation Date',
+ total_size: 'Total Size',
+ num_files: 'Number of Files',
+ last_modified: 'Last Modified',
+ newest_file: 'Newest File',
+ first_modified: 'First Modified',
+ first_file: 'First File'
  }
- end

- csv_headers = {
- bucket: 'Bucket',
- creation_date: 'Creation Date',
- total_size: 'Total Size',
- num_files: 'Number of Files',
- last_modified: 'Last Modified',
- newest_file: 'Newest File',
- first_modified: 'First Modified',
- first_file: 'First File'
- }
-
- file = "AWS-S3-Usage-Report-#{Time.now.strftime('%Y-%m-%dT%H%M%S')}.csv"
- CSV.open(file, 'w') do |csv|
- csv << csv_headers.values
-
- bucket_info.each_value do |stats|
- csv << csv_headers.keys.map { |k| stats[k] }
+ file = "AWS-S3-Usage-Report-#{Time.now.strftime('%Y-%m-%dT%H%M%S')}.csv"
+ CSV.open(file, 'w') do |csv|
+ csv << csv_headers.values
+
+ bucket_info.each_value do |stats|
+ csv << csv_headers.keys.map { |k| sanitize_csv_value(stats[k]) }
+ end
  end
- end

- puts file
+ puts file
+ rescue Aws::S3::Errors::ServiceError => e
+ $stderr.puts "S3 Error: #{e.message}"
+ exit 1
+ rescue => e
+ $stderr.puts "Error: #{e.message}"
+ exit 1
+ end
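The CSV-injection guard added in s3report can be demonstrated on its own. A simplified standalone sketch, where the tab and carriage-return checks are written as explicit string escapes rather than `%w[]` entries:

```ruby
# Sketch of the formula-injection guard: string cells beginning with a
# character that spreadsheets treat as a formula trigger are prefixed
# with a single quote so they render as plain text.
def sanitize_csv_value(value)
  return value unless value.is_a?(String) && !value.empty?

  if ["=", "+", "-", "@", "\t", "\r"].include?(value[0])
    "'#{value}"
  else
    value
  end
end

sanitize_csv_value("=SUM(A1:A9)") # => "'=SUM(A1:A9)"
sanitize_csv_value("bucket-42")   # => "bucket-42" (safe first character)
sanitize_csv_value(1024)          # => 1024 (non-strings pass through)
```

This matters here because bucket names and object keys are attacker-influenced input that ends up in a spreadsheet-openable CSV file.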
data/lib/s3grep/directory.rb CHANGED
@@ -12,6 +12,10 @@ module S3Grep
  @aws_s3_client = aws_s3_client
  end

+ def uri
+ @uri ||= URI(s3_url)
+ end
+
  def self.glob(s3_url, aws_s3_client, regex, &block)
  new(s3_url, aws_s3_client).glob(regex, &block)
  end
@@ -31,18 +35,14 @@ module S3Grep
  end

  def each_content
- uri = URI(s3_url)
-
  max_keys = 1_000

  prefix = CGI.unescape(uri.path[1..-1] || '')

  resp = aws_s3_client.list_objects(
- {
- bucket: uri.host,
- prefix: prefix,
- max_keys: max_keys
- }
+ bucket: uri.host,
+ prefix: prefix,
+ max_keys: max_keys
  )

  resp.contents.each do |content|
@@ -53,12 +53,10 @@ module S3Grep
  marker = resp.contents.last.key

  resp = aws_s3_client.list_objects(
- {
- bucket: uri.host,
- prefix: prefix,
- max_keys: max_keys,
- marker: marker
- }
+ bucket: uri.host,
+ prefix: prefix,
+ max_keys: max_keys,
+ marker: marker
  )

  resp.contents.each do |content|
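The marker-based pagination that `each_content` performs above can be sketched with a hypothetical in-memory client standing in for `aws_s3_client.list_objects`; everything below (the `FakeClient` struct, the sample keys) is illustrative, not part of the gem:

```ruby
# Sketch of marker-based S3 pagination: each request returns at most
# max_keys items; a full page means more results may follow, so the last
# key of the page is passed back as the marker for the next request.
FakeClient = Struct.new(:keys) do
  def list_objects(max_keys:, marker: nil)
    start = marker ? keys.index(marker) + 1 : 0
    keys[start, max_keys] || []
  end
end

client = FakeClient.new(%w[a b c d e f g])
max_keys = 3
all = []
page = client.list_objects(max_keys: max_keys)
loop do
  all.concat(page)
  break if page.size < max_keys # short page: no more results
  page = client.list_objects(max_keys: max_keys, marker: page.last)
end
# all == %w[a b c d e f g]
```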
data/lib/s3grep/directory_info.rb CHANGED
@@ -1,3 +1,5 @@
+ require 'cgi'
+
  module S3Grep
  class DirectoryInfo
  attr_reader :bucket,
@@ -5,7 +7,9 @@ module S3Grep
  :total_size,
  :num_files,
  :newest_content,
- :oldest_content
+ :oldest_content,
+ :num_files_by_storage_class,
+ :total_size_by_storage_class

  def self.get(directory)
  info = new(directory)
@@ -15,6 +19,8 @@ module S3Grep
  def initialize(directory)
  @total_size = 0
  @num_files = 0
+ @num_files_by_storage_class = Hash.new(0)
+ @total_size_by_storage_class = Hash.new(0)
  set_path(directory)
  end

@@ -23,6 +29,9 @@ module S3Grep
  @num_files += 1
  @total_size += content[:size]

+ @num_files_by_storage_class[content[:storage_class]] += 1
+ @total_size_by_storage_class[content[:storage_class]] += content[:size]
+
  set_newest(content)
  set_oldest(content)
  end
data/lib/s3grep/search.rb CHANGED
@@ -1,5 +1,6 @@
  require 'aws-sdk-s3'
  require 'cgi'
+ require 'zlib'

  module S3Grep
  class Search
@@ -10,11 +11,11 @@ module S3Grep
  def initialize(s3_url, aws_s3_client, compression = nil)
  @s3_url = s3_url
  @aws_s3_client = aws_s3_client
- @compression = compression
+ @compression = compression || self.class.detect_compression(s3_url)
  end

  def self.search(s3_url, aws_s3_client, regex, &block)
- new(s3_url, aws_s3_client, detect_compression(s3_url)).search(regex, &block)
+ new(s3_url, aws_s3_client).search(regex, &block)
  end

  def self.detect_compression(s3_url)
@@ -24,9 +25,10 @@ module S3Grep
  nil
  end

+
  def search(regex)
  line_number = 0
- to_io.each do |line|
+ each_line do |line|
  line_number += 1
  next unless line.match?(regex)

@@ -34,28 +36,114 @@ module S3Grep
  end
  end

- def s3_object
- uri = URI(s3_url)
-
- aws_s3_client.get_object(
- {
- bucket: uri.host,
- key: CGI.unescape(uri.path[1..-1])
- }
- )
+ # Stream lines from S3 without loading entire file into memory
+ def each_line(&block)
+ if compression == :gzip
+ each_line_gzip(&block)
+ elsif compression == :zip
+ each_line_zip(&block)
+ else
+ each_line_raw(&block)
+ end
  end

+ # For backward compatibility - streams content for s3cat
  def to_io
- body = s3_object.body
+ StreamingIO.new(self)
+ end

- if compression == :gzip
- Zlib::GzipReader.new(body)
- elsif compression == :zip
- require 'zip'
- zip = Zip::File.open_buffer(body)
- zip.entries.first.get_input_stream
- else
- body
+ def bucket
+ @bucket ||= URI(s3_url).host
+ end
+
+ def key
+ @key ||= CGI.unescape(URI(s3_url).path[1..-1])
+ end
+
+ private
+
+ # Stream raw (uncompressed) content line by line
+ # True streaming - only keeps current chunk + line buffer in memory
+ def each_line_raw(&block)
+ buffer = "".b
+
+ aws_s3_client.get_object(bucket: bucket, key: key) do |chunk|
+ buffer << chunk
+ extract_lines!(buffer, &block)
+ end
+
+ # Yield any remaining content (last line without newline)
+ yield buffer unless buffer.empty?
+ end
+
+ # Stream gzip content line by line
+ # True streaming - decompresses chunks as they arrive from S3
+ def each_line_gzip(&block)
+ buffer = "".b
+ # Zlib::MAX_WBITS + 32 enables automatic gzip/zlib header detection
+ inflater = Zlib::Inflate.new(Zlib::MAX_WBITS + 32)
+
+ begin
+ aws_s3_client.get_object(bucket: bucket, key: key) do |chunk|
+ # Decompress this chunk
+ decompressed = inflater.inflate(chunk)
+ buffer << decompressed
+ extract_lines!(buffer, &block)
+ end
+
+ # Finish decompression and process remaining data
+ remaining = inflater.finish
+ buffer << remaining
+ extract_lines!(buffer, &block)
+
+ yield buffer unless buffer.empty?
+ ensure
+ inflater.close
+ end
+ end
+
+ # ZIP files cannot be truly streamed (central directory is at EOF)
+ # We stream the download but must buffer before decompressing
+ def each_line_zip(&block)
+ require 'zip'
+
+ # Stream download into buffer (ZIP format requires full file)
+ body = StringIO.new("".b)
+ aws_s3_client.get_object(bucket: bucket, key: key) do |chunk|
+ body << chunk
+ end
+ body.rewind
+
+ zip = Zip::File.open_buffer(body)
+ entry = zip.entries.first
+ raise IOError, "ZIP archive is empty" if entry.nil?
+
+ buffer = "".b
+ entry.get_input_stream.each do |chunk|
+ buffer << chunk
+ extract_lines!(buffer, &block)
+ end
+
+ yield buffer unless buffer.empty?
+ end
+
+ # Extract complete lines from buffer, yielding each one
+ def extract_lines!(buffer)
+ while (newline_index = buffer.index("\n"))
+ line = buffer.slice!(0, newline_index + 1)
+ yield line
+ end
+ end
+
+ # Adapter class that provides IO-like interface for streaming
+ # Used by s3cat for backward compatibility
+ class StreamingIO
+ def initialize(search)
+ @search = search
+ end
+
+ def each(&block)
+ @search.each_line(&block)
  end
  end
  end
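The incremental decompression that `each_line_gzip` relies on above can be verified with stdlib zlib alone; `Zlib::MAX_WBITS + 32` tells the inflater to auto-detect the gzip header, so compressed data can be fed in arbitrary slices, mimicking chunks arriving from S3 (the payload and slice size here are illustrative):

```ruby
require 'stringio'
require 'zlib'

# Build a small gzip payload in memory, then feed it to an inflater in
# tiny slices to mimic chunked delivery from get_object.
io = StringIO.new
gz = Zlib::GzipWriter.new(io)
gz.write("line one\nline two\n")
gz.close
data = io.string

inflater = Zlib::Inflate.new(Zlib::MAX_WBITS + 32) # auto-detect gzip header
output = +""
data.bytes.each_slice(8) do |slice| # simulated 8-byte S3 chunks
  output << inflater.inflate(slice.pack("C*"))
end
output << inflater.finish # flush any buffered tail
inflater.close
# output == "line one\nline two\n"
```

Because the inflater keeps its own internal state between calls, no slice boundary can corrupt the stream, which is what makes chunk-by-chunk decompression safe.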
data/s3grep.gemspec CHANGED
@@ -2,10 +2,10 @@

  Gem::Specification.new do |s|
  s.name = 's3grep'
- s.version = '0.1.8'
+ s.version = '0.2.0'
  s.licenses = ['MIT']
- s.summary = 'Search through S3 files'
- s.description = 'Tools for searching files on S3'
+ s.summary = 'Search through S3 files without downloading them'
+ s.description = 'CLI tools for streaming search (s3grep), viewing (s3cat), and reporting (s3info, s3report) on S3 objects. Supports gzip compression and searches large files with minimal memory usage.'
  s.authors = ['Doug Youch']
  s.email = 'dougyouch@gmail.com'
  s.homepage = 'https://github.com/dougyouch/s3grep'
@@ -13,5 +13,8 @@ Gem::Specification.new do |s|
  s.bindir = 'bin'
  s.executables = s.files.grep(%r{^bin/}) { |f| File.basename(f) }

+ s.required_ruby_version = '>= 2.6.0'
+
  s.add_runtime_dependency 'aws-sdk-s3'
+ s.add_runtime_dependency 'rubyzip'
  end
metadata CHANGED
@@ -1,14 +1,13 @@
  --- !ruby/object:Gem::Specification
  name: s3grep
  version: !ruby/object:Gem::Version
- version: 0.1.8
+ version: 0.2.0
  platform: ruby
  authors:
  - Doug Youch
- autorequire:
  bindir: bin
  cert_chain: []
- date: 2024-08-23 00:00:00.000000000 Z
+ date: 2026-02-01 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: aws-sdk-s3
@@ -24,7 +23,23 @@ dependencies:
  - - ">="
  - !ruby/object:Gem::Version
  version: '0'
- description: Tools for searching files on S3
+ - !ruby/object:Gem::Dependency
+ name: rubyzip
+ requirement: !ruby/object:Gem::Requirement
+ requirements:
+ - - ">="
+ - !ruby/object:Gem::Version
+ version: '0'
+ type: :runtime
+ prerelease: false
+ version_requirements: !ruby/object:Gem::Requirement
+ requirements:
+ - - ">="
+ - !ruby/object:Gem::Version
+ version: '0'
+ description: CLI tools for streaming search (s3grep), viewing (s3cat), and reporting
+ (s3info, s3report) on S3 objects. Supports gzip compression and searches large files
+ with minimal memory usage.
  email: dougyouch@gmail.com
  executables:
  - s3cat
@@ -37,6 +52,8 @@ files:
  - ".gitignore"
  - ".ruby-gemset"
  - ".ruby-version"
+ - ARCHITECTURE.md
+ - CLAUDE.md
  - Gemfile
  - Gemfile.lock
  - LICENSE
@@ -55,7 +72,6 @@ homepage: https://github.com/dougyouch/s3grep
  licenses:
  - MIT
  metadata: {}
- post_install_message:
  rdoc_options: []
  require_paths:
  - lib
@@ -63,15 +79,14 @@ required_ruby_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
  - !ruby/object:Gem::Version
- version: '0'
+ version: 2.6.0
  required_rubygems_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
  - !ruby/object:Gem::Version
  version: '0'
  requirements: []
- rubygems_version: 3.3.3
- signing_key:
+ rubygems_version: 3.6.2
  specification_version: 4
- summary: Search through S3 files
+ summary: Search through S3 files without downloading them
  test_files: []