s3grep 0.1.8 → 0.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 804681319bd1ddfd4ff89ab54f21f9cff854b06d3a254ed7870da7be8dcfb059
- data.tar.gz: f615b79c46f460a73f530fbe075e16f9c9ee9dd4c778cf26eee9ad63eea15ae7
+ metadata.gz: 686a06c854681a05e7b5915cc5819fe710b2b3da842926b6d5acb019de08c1d8
+ data.tar.gz: ebbfd6d9f891ca5a3a7036c90775595357f376385c5573c413fff740d9a7e086
  SHA512:
- metadata.gz: 6ac540deed47a2e02ab396b0f92c018ae1ec608b4824b4adb13e2d9e5fe4631366879fdfb176cd2dfe7bbee7ec040ce53d74a95f92e30994d1b1e87120d09e6a
- data.tar.gz: 8eaa9a8e01415551c0e0770ac59b4cf8e5374f75317fbf2a9ee60f3bd16766b072acd4b0d440a79af6013a2dd221b06c7ced8a1ffbcecd8973e95397225f6695
+ metadata.gz: fd00546c540f5ef8688b9af0e3668366bc496e4eae2120f394aac67400ade50c485ae43b00e4987e03423cc2d53404bafe6fa59f07c122c82d5c9de1de5f6071
+ data.tar.gz: 216072d06695ac0404584cc1273c1e84fc988e61eab9b8700c9d3e4ffcee786f169077e860502a4db2c86aa23aa0a39e8a3ae53c9b9b3d4c98073ac123026b33
data/.ruby-version CHANGED
@@ -1 +1 @@
- 3.1.0
+ 3.4.2
data/ARCHITECTURE.md ADDED
@@ -0,0 +1,188 @@
+ # Architecture
+
+ ## Overview
+
+ s3grep is a Ruby gem providing grep-like functionality for AWS S3 objects. It streams files directly from S3 without downloading them locally, enabling efficient searching of large files.
+
+ ## Component Diagram
+
+ ```
+ ┌─────────────────────────────────────────────────────────────────┐
+ │ CLI Layer (bin/) │
+ ├──────────┬──────────┬──────────────┬───────────────────────────┤
+ │ s3grep │ s3cat │ s3info │ s3report │
+ │ (search) │ (stream) │ (dir stats) │ (bucket inventory) │
+ └────┬─────┴────┬─────┴──────┬───────┴─────────────┬─────────────┘
+ │ │ │ │
+ ▼ ▼ ▼ ▼
+ ┌─────────────────────────────────────────────────────────────────┐
+ │ S3Grep Module (lib/) │
+ ├─────────────────────┬───────────────────┬───────────────────────┤
+ │ Search │ Directory │ DirectoryInfo │
+ │ (file streaming) │ (object listing) │ (stats aggregation) │
+ └──────────┬──────────┴─────────┬─────────┴───────────┬───────────┘
+ │ │ │
+ ▼ ▼ ▼
+ ┌─────────────────────────────────────────────────────────────────┐
+ │ AWS SDK (aws-sdk-s3) │
+ │ get_object │ list_objects │ list_buckets │
+ └─────────────────────────────────────────────────────────────────┘
+ ```
+
+ ## Core Classes
+
+ ### S3Grep::Search
+
+ True streaming S3 object reader with line-by-line regex matching. Uses chunked transfer to avoid loading entire files into memory.
+
+ **Responsibilities:**
+ - Parse S3 URL to extract bucket and key
+ - Stream object content via `get_object` block form (chunked transfer)
+ - Buffer partial lines across chunk boundaries
+ - Auto-detect and decompress .gz files (streaming) and .zip files (buffered)
+ - Yield matching lines with line numbers
+ - Enforce size limits to prevent resource exhaustion
+
+ **Key Methods:**
+ - `Search.search(s3_url, client, regex)` - Class method for simple searches
+ - `Search.detect_compression(s3_url)` - Infers compression from file extension
+ - `#each_line` - Core streaming iterator, yields lines as they arrive
+ - `#to_io` - Returns StreamingIO adapter for backward compatibility
+
+ **Streaming Implementation:**
+ - Raw files: Chunks streamed directly, lines extracted from buffer
+ - Gzip files: Chunks decompressed via `Zlib::Inflate` as they arrive
+ - ZIP files: Must buffer entire archive (ZIP format requires central directory at EOF)
+
+ ### S3Grep::Directory
+
+ Lists objects in an S3 prefix with optional glob-style filtering.
+
+ **Responsibilities:**
+ - Parse S3 URL to extract bucket and prefix
+ - Handle pagination (1000 objects per request)
+ - URL-encode/decode object keys with special characters
+ - Support regex filtering via `glob` method
+
+ **Key Methods:**
+ - `Directory.glob(s3_url, client, regex)` - List objects matching pattern
+ - `#each` - Iterate full S3 URLs for all objects
+ - `#each_content` - Iterate raw S3 object metadata (for DirectoryInfo)
+ - `#info` - Factory method returning DirectoryInfo
+
+ ### S3Grep::DirectoryInfo
+
+ Aggregates statistics while iterating through directory contents.
+
+ **Responsibilities:**
+ - Count files and total size
+ - Track newest/oldest files by modification date
+ - Breakdown counts and sizes by storage class
+
+ **Key Methods:**
+ - `DirectoryInfo.get(directory)` - Process directory and return populated info
+ - `#last_modified` / `#first_modified` - Timestamp accessors
+ - `#newest_file` / `#first_file` - Key accessors
+
+ ## Data Flow
+
+ ### Search Flow (s3grep)
+
+ ```
+ User Input: regex + s3://bucket/key
+
+
+ ┌──────────────┐
+ │ Parse S3 URL │
+ └──────┬───────┘
+
+
+ ┌──────────────────────┐
+ │ Detect compression │
+ │ (.gz, .zip, or none) │
+ └──────────┬───────────┘
+
+
+ ┌──────────────────────────────────────┐
+ │ aws_s3_client.get_object(block form) │
+ │ Streams chunks as they arrive │
+ └──────────┬───────────────────────────┘
+
+ ▼ (for each chunk)
+ ┌──────────────────────────────────────┐
+ │ Decompress chunk if gzip │
+ │ (Zlib::Inflate streaming) │
+ └──────────┬───────────────────────────┘
+
+
+ ┌──────────────────────────────────────┐
+ │ Append to line buffer │
+ │ Extract complete lines │
+ │ Yield matches with line numbers │
+ └──────────┬───────────────────────────┘
+
+
+ ┌──────────────────────────────────────┐
+ │ Repeat until stream exhausted │
+ │ Yield final partial line if any │
+ └──────────────────────────────────────┘
+ ```
+
+ **Memory Behavior:**
+ - Raw/Gzip: Only current chunk + line buffer in memory (~64KB typical)
+ - ZIP: Entire archive buffered (ZIP format limitation)
+
+ ### Directory Listing Flow (s3info, recursive s3grep)
+
+ ```
+ User Input: s3://bucket/prefix/
+
+
+ ┌──────────────────────┐
+ │ list_objects │
+ │ (max_keys: 1000) │
+ └──────────┬───────────┘
+
+
+ ┌──────────────────────┐
+ │ More results? │──No──▶ Done
+ │ (size == max_keys) │
+ └──────────┬───────────┘
+ │ Yes
+
+ ┌──────────────────────┐
+ │ list_objects with │
+ │ marker = last key │
+ └──────────┬───────────┘
+
+ └───────▶ (repeat until exhausted)
+ ```
+
+ ## AWS Integration
+
+ ### Authentication
+ Uses the AWS SDK default credential chain:
+ 1. Environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`)
+ 2. Shared credentials file (`~/.aws/credentials`)
+ 3. IAM instance profile (EC2/ECS)
+
+ ### Region Handling
+ - Default client uses `AWS_REGION` or `~/.aws/config`
+ - `s3report` creates region-specific clients per bucket via `get_bucket_location`
+
+ ### S3 URL Format
+ All tools expect: `s3://bucket-name/path/to/prefix`
+ - Host = bucket name
+ - Path = object key or prefix (URL-decoded internally)
+
+ ## Compression Support
+
+ | Extension | Library | Streaming | Notes |
+ |-----------|---------|-----------|-------|
+ | `.gz` | zlib (stdlib) | ✅ Yes | Zlib::Inflate processes chunks as they arrive |
+ | `.zip` | rubyzip | ❌ No | ZIP format requires buffering (central directory at EOF) |
+ | (none) | - | ✅ Yes | Chunks streamed directly |
+
+ **Size Limits:**
+ - `MAX_BYTES_PROCESSED` (100MB default) prevents resource exhaustion
+ - Configurable via `S3Grep::Search::MAX_BYTES_PROCESSED`
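The chunk-boundary line buffering described in the search flow above can be sketched in plain Ruby. This is a simplified standalone version of the buffer-and-extract step (modeled on the gem's `extract_lines!` helper), with hypothetical hard-coded chunks standing in for the pieces `get_object` yields:

```ruby
# Simplified sketch of the line-buffering step: complete lines are sliced
# off the front of the buffer as each chunk arrives; a trailing partial
# line stays buffered until a later chunk (or end of stream) completes it.
def extract_lines!(buffer)
  while (newline_index = buffer.index("\n"))
    yield buffer.slice!(0, newline_index + 1)
  end
end

lines = []
buffer = +""
["alpha\nbe", "ta\ngam", "ma\n"].each do |chunk| # simulated S3 chunks
  buffer << chunk
  extract_lines!(buffer) { |line| lines << line }
end
lines << buffer unless buffer.empty? # final partial line, if any
# lines == ["alpha\n", "beta\n", "gamma\n"]
```

Note how `beta` is reassembled even though it straddles two chunks; only the buffer, never the whole object, is held in memory.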
data/CLAUDE.md ADDED
@@ -0,0 +1,55 @@
+ # CLAUDE.md
+
+ This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+ ## Project Overview
+
+ s3grep is a Ruby gem for searching through S3 files without downloading them. It provides CLI tools for grep-like searching, file viewing, and bucket reporting directly on S3 objects.
+
+ ## Development Commands
+
+ ```bash
+ # Install dependencies
+ bundle install
+
+ # Build the gem
+ gem build s3grep.gemspec
+
+ # Install locally for testing
+ gem install s3grep-*.gem
+ ```
+
+ ## CLI Tools
+
+ - `s3grep` - Search for patterns in S3 files (supports `-i` for case-insensitive, `-r` for recursive, `--include` for file patterns)
+ - `s3cat` - Stream S3 file contents to stdout
+ - `s3info` - Get directory statistics (file count, size, storage classes, date ranges) as JSON
+ - `s3report` - Generate CSV report of all buckets in an AWS account
+
+ ## Architecture
+
+ **Core Classes (lib/s3grep/):**
+
+ - `Search` - Streams S3 objects and searches line-by-line with regex. Auto-detects compression (.gz, .zip)
+ - `Directory` - Lists S3 objects with prefix filtering. Handles pagination via marker-based iteration
+ - `DirectoryInfo` - Aggregates statistics from Directory iteration (counts, sizes, timestamps by storage class)
+
+ **S3 URL Convention:** All tools use `s3://bucket-name/path/to/object` format. The bucket name is parsed from the URL host, and the path becomes the S3 key prefix.
+
+ **AWS Authentication:** Uses default AWS SDK credential chain (env vars, ~/.aws/credentials, IAM roles). Region-specific clients are created automatically for cross-region bucket access in s3report.
+
+ ## Code Commits
+
+ Format using angular formatting:
+ ```
+ <type>(<scope>): <short summary>
+ ```
+ - **type**: build|ci|docs|feat|fix|perf|refactor|test
+ - **scope**: The feature or component of the service we're working on
+ - **summary**: Summary in present tense. Not capitalized. No period at the end.
+
+ ## Documentation Maintenance
+
+ When modifying the codebase, keep documentation in sync:
+ - **ARCHITECTURE.md** - Update when adding/removing classes, changing component relationships, or altering data flow patterns
+ - **README.md** - Update when adding new features, changing public APIs, or modifying usage examples
data/Gemfile.lock CHANGED
@@ -1,29 +1,36 @@
  GEM
  remote: http://rubygems.org/
  specs:
- aws-eventstream (1.2.0)
- aws-partitions (1.571.0)
- aws-sdk-core (3.130.0)
- aws-eventstream (~> 1, >= 1.0.2)
- aws-partitions (~> 1, >= 1.525.0)
- aws-sigv4 (~> 1.1)
- jmespath (~> 1.0)
- aws-sdk-kms (1.55.0)
- aws-sdk-core (~> 3, >= 3.127.0)
- aws-sigv4 (~> 1.1)
- aws-sdk-s3 (1.113.0)
- aws-sdk-core (~> 3, >= 3.127.0)
+ aws-eventstream (1.4.0)
+ aws-partitions (1.1211.0)
+ aws-sdk-core (3.241.4)
+ aws-eventstream (~> 1, >= 1.3.0)
+ aws-partitions (~> 1, >= 1.992.0)
+ aws-sigv4 (~> 1.9)
+ base64
+ bigdecimal
+ jmespath (~> 1, >= 1.6.1)
+ logger
+ aws-sdk-kms (1.121.0)
+ aws-sdk-core (~> 3, >= 3.241.4)
+ aws-sigv4 (~> 1.5)
+ aws-sdk-s3 (1.213.0)
+ aws-sdk-core (~> 3, >= 3.241.4)
  aws-sdk-kms (~> 1)
- aws-sigv4 (~> 1.4)
- aws-sigv4 (1.4.0)
+ aws-sigv4 (~> 1.5)
+ aws-sigv4 (1.12.1)
  aws-eventstream (~> 1, >= 1.0.2)
- jmespath (1.6.1)
+ base64 (0.3.0)
+ bigdecimal (4.0.1)
+ jmespath (1.6.2)
+ logger (1.7.0)

  PLATFORMS
- x86_64-darwin-21
+ arm64-darwin-24
+ ruby

  DEPENDENCIES
  aws-sdk-s3

  BUNDLED WITH
- 2.3.3
+ 2.6.2
data/README.md CHANGED
@@ -2,13 +2,133 @@

  Search through S3 files without downloading them.

- # Basic Usage
+ ## Installation

- Search for a pattern in a S3 file.
+ ```bash
+ gem install s3grep
+ ```
+
+ Or add to your Gemfile:

- example:
+ ```ruby
+ gem 's3grep'
  ```
- s3grep Bob s3://exammple.com/users.csv
+
+ ## CLI Tools
+
+ ### s3grep
+
+ Search for a pattern in S3 files. Supports gzip and zip compressed files automatically.
+
+ ```bash
+ # Basic search
+ s3grep "pattern" s3://bucket-name/path/to/file.csv
+
+ # Case-insensitive search
+ s3grep -i "pattern" s3://bucket-name/path/to/file.csv
+
+ # Recursive search through a directory
+ s3grep -r "pattern" s3://bucket-name/path/to/directory/
+
+ # Recursive search with file pattern filter
+ s3grep -r --include "\.csv$" "pattern" s3://bucket-name/logs/
+
+ # Search compressed files (auto-detected)
+ s3grep "error" s3://bucket-name/logs/app.log.gz
  ```

- Outputs S3 file with line number and the matching line.
+ Output format: `s3://bucket/path/file:line_number content`
+
+ ### s3cat
+
+ Stream S3 file contents to stdout.
+
+ ```bash
+ # Print file contents
+ s3cat s3://bucket-name/path/to/file.txt
+
+ # Pipe to other commands
+ s3cat s3://bucket-name/data.csv | head -20
+
+ # Use with standard unix tools
+ s3cat s3://bucket-name/users.json | jq '.users[]'
+ ```
+
+ ### s3info
+
+ Get statistics about an S3 directory as JSON.
+
+ ```bash
+ # Get info for a prefix
+ s3info s3://bucket-name/path/to/directory/
+ ```
+
+ Output includes:
+ - `bucket` - Bucket name
+ - `base_prefix` - S3 prefix path
+ - `total_size` - Total bytes across all files
+ - `num_files` - File count
+ - `last_modified` / `newest_file` - Most recently modified file
+ - `first_modified` / `first_file` - Oldest file
+ - `num_files_by_storage_class` - File count breakdown by storage class
+ - `total_size_by_storage_class` - Size breakdown by storage class
+
+ Example output:
+ ```json
+ {
+ "bucket": "my-bucket",
+ "base_prefix": "logs/2024/",
+ "total_size": 1048576000,
+ "num_files": 365,
+ "last_modified": "2024-12-31T23:59:59+00:00",
+ "newest_file": "logs/2024/12/31/app.log",
+ "first_modified": "2024-01-01T00:00:00+00:00",
+ "first_file": "logs/2024/01/01/app.log",
+ "num_files_by_storage_class": {
+ "STANDARD": 100,
+ "STANDARD_IA": 265
+ },
+ "total_size_by_storage_class": {
+ "STANDARD": 500000000,
+ "STANDARD_IA": 548576000
+ }
+ }
+ ```
+
+ ### s3report
+
+ Generate a CSV report of all S3 buckets in your AWS account.
+
+ ```bash
+ s3report
+ ```
+
+ Creates a file named `AWS-S3-Usage-Report-YYYY-MM-DDTHHMMSS.csv` with columns:
+ - Bucket
+ - Creation Date
+ - Total Size
+ - Number of Files
+ - Last Modified
+ - Newest File
+ - First Modified
+ - First File
+
+ ## AWS Configuration
+
+ Authentication uses the standard AWS SDK credential chain:
+
+ 1. Environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`)
+ 2. Shared credentials file (`~/.aws/credentials`)
+ 3. IAM instance profile (EC2/ECS)
+
+ Set your region via `AWS_REGION` environment variable or `~/.aws/config`.
+
+ Use `AWS_PROFILE` to select a named profile:
+
+ ```bash
+ AWS_PROFILE=stage s3grep "error" s3://my-bucket/logs/app.log
+ ```
+
+ ## License
+
+ MIT
data/bin/s3cat CHANGED
@@ -1,12 +1,37 @@
  #!/usr/bin/env ruby

- require 'optparse'
  require 's3grep'
  require 'aws-sdk-s3'

- s3_file = ARGV[0]
- aws_s3_client = Aws::S3::Client.new
- search = S3Grep::Search.new(s3_file, aws_s3_client, nil)
- search.to_io.each do |line|
- print line
+ # Exit cleanly on broken pipe (e.g., when piping to head)
+ Signal.trap("PIPE", "EXIT")
+
+ s3_url = ARGV[0]
+
+ if s3_url.nil? || s3_url.empty?
+ $stderr.puts "Usage: s3cat s3://bucket/path/to/file"
+ exit 1
+ end
+
+ unless s3_url.start_with?('s3://')
+ $stderr.puts "Error: Invalid S3 URL format. Expected s3://bucket/path"
+ exit 1
+ end
+
+ begin
+ # max_attempts: 1 disables retries (required for streaming)
+ aws_s3_client = Aws::S3::Client.new(max_attempts: 1)
+ search = S3Grep::Search.new(s3_url, aws_s3_client)
+ search.to_io.each do |line|
+ print line
+ end
+ rescue Errno::EPIPE
+ # Broken pipe (e.g., piping to head) - exit silently
+ exit 0
+ rescue Aws::S3::Errors::ServiceError => e
+ $stderr.puts "S3 Error: #{e.message}"
+ exit 1
+ rescue => e
+ $stderr.puts "Error: #{e.message}"
+ exit 1
  end
data/bin/s3grep CHANGED
@@ -4,13 +4,31 @@ require 'optparse'
  require 's3grep'
  require 'aws-sdk-s3'

+ # Exit cleanly on broken pipe (e.g., when piping to head)
+ Signal.trap("PIPE", "EXIT")
+
+ # Maximum regex pattern length to prevent ReDoS
+ MAX_PATTERN_LENGTH = 1000
+
+ def safe_regexp(pattern, options = 0)
+ if pattern.length > MAX_PATTERN_LENGTH
+ $stderr.puts "Error: Pattern too long (max #{MAX_PATTERN_LENGTH} characters)"
+ exit 1
+ end
+ Regexp.new(pattern, options)
+ rescue RegexpError => e
+ $stderr.puts "Error: Invalid regular expression: #{e.message}"
+ exit 1
+ end
+
  options = {
  ignore_case: false,
  recursive: false,
  file_pattern: /.*/
  }
+
  OptionParser.new do |opts|
- opts.banner = 'Usage: s3grep [options]'
+ opts.banner = 'Usage: s3grep [options] PATTERN s3://bucket/path'

  opts.on('-i', '--ignore-case', 'Ignore case') do
  options[:ignore_case] = true
@@ -21,30 +39,48 @@ OptionParser.new do |opts|
  end

  opts.on('--include FILE_PATTERN', 'Include matching files') do |v|
- options[:file_pattern] = Regexp.new(v, Regexp::IGNORECASE)
+ options[:file_pattern] = safe_regexp(v, Regexp::IGNORECASE)
  end
  end.parse!

- regex_options =
- if options[:ignore_case]
- Regexp::IGNORECASE
- else
- 0
- end
+ if ARGV.length < 2
+ $stderr.puts "Usage: s3grep [options] PATTERN s3://bucket/path"
+ exit 1
+ end

- regex = Regexp.new(ARGV[0], regex_options)
+ pattern = ARGV[0]
  s3_url = ARGV[1]

- aws_s3_client = Aws::S3::Client.new
+ unless s3_url.start_with?('s3://')
+ $stderr.puts "Error: Invalid S3 URL format. Expected s3://bucket/path"
+ exit 1
+ end
+
+ regex_options = options[:ignore_case] ? Regexp::IGNORECASE : 0
+ regex = safe_regexp(pattern, regex_options)
+
+ begin
+ # max_attempts: 1 disables retries (required for streaming)
+ aws_s3_client = Aws::S3::Client.new(max_attempts: 1)

- if options[:recursive]
- S3Grep::Directory.glob(s3_url, aws_s3_client, options[:file_pattern]) do |s3_file|
- S3Grep::Search.search(s3_file, aws_s3_client, regex) do |line_number, line|
- puts "#{s3_file}:#{line_number} #{line}"
+ if options[:recursive]
+ S3Grep::Directory.glob(s3_url, aws_s3_client, options[:file_pattern]) do |s3_file|
+ S3Grep::Search.search(s3_file, aws_s3_client, regex) do |line_number, line|
+ puts "#{s3_file}:#{line_number} #{line}"
+ end
+ end
+ else
+ S3Grep::Search.search(s3_url, aws_s3_client, regex) do |line_number, line|
+ puts "#{s3_url}:#{line_number} #{line}"
  end
  end
- else
- S3Grep::Search.search(s3_url, aws_s3_client, regex) do |line_number, line|
- puts "#{s3_url}:#{line_number} #{line}"
- end
+ rescue Errno::EPIPE
+ # Broken pipe (e.g., piping to head) - exit silently
+ exit 0
+ rescue Aws::S3::Errors::ServiceError => e
+ $stderr.puts "S3 Error: #{e.message}"
+ exit 1
+ rescue => e
+ $stderr.puts "Error: #{e.message}"
+ exit 1
  end
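The pattern-validation approach in the diff above can be exercised in isolation. A sketch of the same guard, except it returns `nil` on bad input instead of calling `exit`, so it can run outside the CLI:

```ruby
# Sketch of the safe_regexp guard: bound the pattern length (a coarse
# ReDoS mitigation) and turn RegexpError into a nil return rather than a
# backtrace. Returning nil is a deliberate change from the CLI, which exits.
MAX_PATTERN_LENGTH = 1000

def safe_regexp(pattern, options = 0)
  return nil if pattern.length > MAX_PATTERN_LENGTH
  Regexp.new(pattern, options)
rescue RegexpError
  nil
end

safe_regexp("error|warn", Regexp::IGNORECASE) # => /error|warn/i
safe_regexp("(unclosed")                      # => nil (invalid regex)
safe_regexp("a" * 2000)                       # => nil (over length limit)
```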
data/bin/s3info CHANGED
@@ -1,23 +1,47 @@
  #!/usr/bin/env ruby

- require 'optparse'
+ require 'pathname'
+ BASE_DIR = Pathname.new(File.expand_path('..', __dir__))
+ $LOAD_PATH << "#{BASE_DIR}/lib"
+
  require 's3grep'
  require 'aws-sdk-s3'
  require 'json'

- s3_file = ARGV[0]
- aws_s3_client = Aws::S3::Client.new
- info = S3Grep::Directory.new(s3_file, aws_s3_client).info
-
- stats = {
- bucket: info.bucket,
- base_prefix: info.base_prefix,
- total_size: info.total_size,
- num_files: info.num_files,
- last_modified: info.last_modified,
- newest_file: info.newest_file,
- first_modified: info.first_modified,
- first_file: info.first_file
- }
-
- print JSON.pretty_generate(stats) + "\n"
+ s3_url = ARGV[0]
+
+ if s3_url.nil? || s3_url.empty?
+ $stderr.puts "Usage: s3info s3://bucket/path"
+ exit 1
+ end
+
+ unless s3_url.start_with?('s3://')
+ $stderr.puts "Error: Invalid S3 URL format. Expected s3://bucket/path"
+ exit 1
+ end
+
+ begin
+ aws_s3_client = Aws::S3::Client.new
+ info = S3Grep::Directory.new(s3_url, aws_s3_client).info
+
+ stats = {
+ bucket: info.bucket,
+ base_prefix: info.base_prefix,
+ total_size: info.total_size,
+ num_files: info.num_files,
+ last_modified: info.last_modified,
+ newest_file: info.newest_file,
+ first_modified: info.first_modified,
+ first_file: info.first_file,
+ num_files_by_storage_class: info.num_files_by_storage_class,
+ total_size_by_storage_class: info.total_size_by_storage_class
+ }
+
+ print JSON.pretty_generate(stats) + "\n"
+ rescue Aws::S3::Errors::ServiceError => e
+ $stderr.puts "S3 Error: #{e.message}"
+ exit 1
+ rescue => e
+ $stderr.puts "Error: #{e.message}"
+ exit 1
+ end
data/bin/s3report CHANGED
@@ -1,56 +1,82 @@
  #!/usr/bin/env ruby

- require 'optparse'
  require 's3grep'
  require 'aws-sdk-s3'
  require 'csv'

- bucket_info = {}
- aws_s3_client = Aws::S3::Client.new
- aws_s3_client.list_buckets[:buckets].each do |bucket|
- name = bucket[:name]
- puts name
-
- bucket_location = aws_s3_client.get_bucket_location(bucket: name)
- aws_s3_client_region_specific =
- if bucket_location[:location_constraint] == ''
- aws_s3_client
- else
- Aws::S3::Client.new(region: bucket_location[:location_constraint])
+ # Sanitize values to prevent CSV injection attacks
+ # Prefixes dangerous characters with a single quote
+ def sanitize_csv_value(value)
+ return value unless value.is_a?(String)
+ return value if value.empty?
+
+ # Characters that can trigger formula execution in spreadsheets
+ if %w[= + - @ \t \r].include?(value[0])
+ "'#{value}"
+ else
+ value
+ end
+ end
+
+ begin
+ bucket_info = {}
+ aws_s3_client = Aws::S3::Client.new
+
+ aws_s3_client.list_buckets[:buckets].each do |bucket|
+ name = bucket[:name]
+ puts name
+
+ begin
+ bucket_location = aws_s3_client.get_bucket_location(bucket: name)
+ aws_s3_client_region_specific =
+ if bucket_location[:location_constraint].nil? || bucket_location[:location_constraint] == ''
+ aws_s3_client
+ else
+ Aws::S3::Client.new(region: bucket_location[:location_constraint])
+ end
+
+ info = S3Grep::Directory.new("s3://#{name}/", aws_s3_client_region_specific).info
+
+ bucket_info[name] = {
+ bucket: info.bucket,
+ creation_date: bucket[:creation_date],
+ total_size: info.total_size,
+ num_files: info.num_files,
+ last_modified: info.last_modified,
+ newest_file: info.newest_file,
+ first_modified: info.first_modified,
+ first_file: info.first_file
+ }
+ rescue Aws::S3::Errors::ServiceError => e
+ $stderr.puts "Warning: Could not access bucket '#{name}': #{e.message}"
  end
+ end

- info = S3Grep::Directory.new("s3://#{name}/", aws_s3_client_region_specific).info
-
- bucket_info[name] = {
- bucket: info.bucket,
- creation_date: bucket[:creation_date],
- total_size: info.total_size,
- num_files: info.num_files,
- last_modified: info.last_modified,
- newest_file: info.newest_file,
- first_modified: info.first_modified,
- first_file: info.first_file
+ csv_headers = {
+ bucket: 'Bucket',
+ creation_date: 'Creation Date',
+ total_size: 'Total Size',
+ num_files: 'Number of Files',
+ last_modified: 'Last Modified',
+ newest_file: 'Newest File',
+ first_modified: 'First Modified',
+ first_file: 'First File'
  }
- end

- csv_headers = {
- bucket: 'Bucket',
- creation_date: 'Creation Date',
- total_size: 'Total Size',
- num_files: 'Number of Files',
- last_modified: 'Last Modified',
- newest_file: 'Newest File',
- first_modified: 'First Modified',
- first_file: 'First File'
- }
-
- file = "AWS-S3-Usage-Report-#{Time.now.strftime('%Y-%m-%dT%H%M%S')}.csv"
- CSV.open(file, 'w') do |csv|
- csv << csv_headers.values
-
- bucket_info.each_value do |stats|
- csv << csv_headers.keys.map { |k| stats[k] }
+ file = "AWS-S3-Usage-Report-#{Time.now.strftime('%Y-%m-%dT%H%M%S')}.csv"
+ CSV.open(file, 'w') do |csv|
+ csv << csv_headers.values
+
+ bucket_info.each_value do |stats|
+ csv << csv_headers.keys.map { |k| sanitize_csv_value(stats[k]) }
+ end
  end
- end

- puts file
+ puts file
+ rescue Aws::S3::Errors::ServiceError => e
+ $stderr.puts "S3 Error: #{e.message}"
+ exit 1
+ rescue => e
+ $stderr.puts "Error: #{e.message}"
+ exit 1
+ end
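The CSV-injection guard added in s3report can be demonstrated on its own. A simplified standalone sketch, where the tab and carriage-return checks are written as explicit string escapes rather than `%w[]` entries:

```ruby
# Sketch of the formula-injection guard: string cells beginning with a
# character that spreadsheets treat as a formula trigger are prefixed
# with a single quote so they render as plain text.
def sanitize_csv_value(value)
  return value unless value.is_a?(String) && !value.empty?

  if ["=", "+", "-", "@", "\t", "\r"].include?(value[0])
    "'#{value}"
  else
    value
  end
end

sanitize_csv_value("=SUM(A1:A9)") # => "'=SUM(A1:A9)"
sanitize_csv_value("bucket-42")   # => "bucket-42" (safe first character)
sanitize_csv_value(1024)          # => 1024 (non-strings pass through)
```

This matters here because bucket names and object keys are attacker-influenced input that ends up in a spreadsheet-openable CSV file.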
data/lib/s3grep/directory.rb CHANGED
@@ -12,6 +12,10 @@ module S3Grep
  @aws_s3_client = aws_s3_client
  end

+ def uri
+ @uri ||= URI(s3_url)
+ end
+
  def self.glob(s3_url, aws_s3_client, regex, &block)
  new(s3_url, aws_s3_client).glob(regex, &block)
  end
@@ -31,18 +35,14 @@ module S3Grep
  end

  def each_content
- uri = URI(s3_url)
-
  max_keys = 1_000

  prefix = CGI.unescape(uri.path[1..-1] || '')

  resp = aws_s3_client.list_objects(
- {
- bucket: uri.host,
- prefix: prefix,
- max_keys: max_keys
- }
+ bucket: uri.host,
+ prefix: prefix,
+ max_keys: max_keys
  )

  resp.contents.each do |content|
@@ -53,12 +53,10 @@ module S3Grep
  marker = resp.contents.last.key

  resp = aws_s3_client.list_objects(
- {
- bucket: uri.host,
- prefix: prefix,
- max_keys: max_keys,
- marker: marker
- }
+ bucket: uri.host,
+ prefix: prefix,
+ max_keys: max_keys,
+ marker: marker
  )

  resp.contents.each do |content|
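The marker-based pagination that `each_content` performs above can be sketched with a hypothetical in-memory client standing in for `aws_s3_client.list_objects`; everything below (the `FakeClient` struct, the sample keys) is illustrative, not part of the gem:

```ruby
# Sketch of marker-based S3 pagination: each request returns at most
# max_keys items; a full page means more results may follow, so the last
# key of the page is passed back as the marker for the next request.
FakeClient = Struct.new(:keys) do
  def list_objects(max_keys:, marker: nil)
    start = marker ? keys.index(marker) + 1 : 0
    keys[start, max_keys] || []
  end
end

client = FakeClient.new(%w[a b c d e f g])
max_keys = 3
all = []
page = client.list_objects(max_keys: max_keys)
loop do
  all.concat(page)
  break if page.size < max_keys # short page: no more results
  page = client.list_objects(max_keys: max_keys, marker: page.last)
end
# all == %w[a b c d e f g]
```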
data/lib/s3grep/directory_info.rb CHANGED
@@ -1,3 +1,5 @@
+ require 'cgi'
+
  module S3Grep
  class DirectoryInfo
  attr_reader :bucket,
@@ -5,7 +7,9 @@ module S3Grep
  :total_size,
  :num_files,
  :newest_content,
- :oldest_content
+ :oldest_content,
+ :num_files_by_storage_class,
+ :total_size_by_storage_class

  def self.get(directory)
  info = new(directory)
@@ -15,6 +19,8 @@ module S3Grep
  def initialize(directory)
  @total_size = 0
  @num_files = 0
+ @num_files_by_storage_class = Hash.new(0)
+ @total_size_by_storage_class = Hash.new(0)
  set_path(directory)
  end

@@ -23,6 +29,9 @@ module S3Grep
  @num_files += 1
  @total_size += content[:size]

+ @num_files_by_storage_class[content[:storage_class]] += 1
+ @total_size_by_storage_class[content[:storage_class]] += content[:size]
+
  set_newest(content)
  set_oldest(content)
  end
data/lib/s3grep/search.rb CHANGED
@@ -1,5 +1,6 @@
  require 'aws-sdk-s3'
  require 'cgi'
+ require 'zlib'

  module S3Grep
  class Search
@@ -10,11 +11,11 @@ module S3Grep
  def initialize(s3_url, aws_s3_client, compression = nil)
  @s3_url = s3_url
  @aws_s3_client = aws_s3_client
- @compression = compression
+ @compression = compression || self.class.detect_compression(s3_url)
  end

  def self.search(s3_url, aws_s3_client, regex, &block)
- new(s3_url, aws_s3_client, detect_compression(s3_url)).search(regex, &block)
+ new(s3_url, aws_s3_client).search(regex, &block)
  end

  def self.detect_compression(s3_url)
@@ -24,9 +25,10 @@ module S3Grep
  nil
  end

+
  def search(regex)
  line_number = 0
- to_io.each do |line|
+ each_line do |line|
  line_number += 1
  next unless line.match?(regex)

@@ -34,28 +36,114 @@ module S3Grep
  end
  end

- def s3_object
- uri = URI(s3_url)
-
- aws_s3_client.get_object(
- {
- bucket: uri.host,
- key: CGI.unescape(uri.path[1..-1])
- }
- )
+ # Stream lines from S3 without loading entire file into memory
+ def each_line(&block)
+ if compression == :gzip
+ each_line_gzip(&block)
+ elsif compression == :zip
+ each_line_zip(&block)
+ else
+ each_line_raw(&block)
+ end
  end

+ # For backward compatibility - streams content for s3cat
  def to_io
- body = s3_object.body
+ StreamingIO.new(self)
+ end

- if compression == :gzip
- Zlib::GzipReader.new(body)
- elsif compression == :zip
- require 'zip'
- zip = Zip::File.open_buffer(body)
- zip.entries.first.get_input_stream
- else
- body
+ def bucket
+ @bucket ||= URI(s3_url).host
+ end
+
+ def key
+ @key ||= CGI.unescape(URI(s3_url).path[1..-1])
+ end
+
+ private
+
+ # Stream raw (uncompressed) content line by line
+ # True streaming - only keeps current chunk + line buffer in memory
+ def each_line_raw(&block)
+ buffer = "".b
+
+ aws_s3_client.get_object(bucket: bucket, key: key) do |chunk|
+ buffer << chunk
+ extract_lines!(buffer, &block)
+ end
+
+ # Yield any remaining content (last line without newline)
+ yield buffer unless buffer.empty?
+ end
+
+ # Stream gzip content line by line
+ # True streaming - decompresses chunks as they arrive from S3
+ def each_line_gzip(&block)
+ buffer = "".b
+ # Zlib::MAX_WBITS + 32 enables automatic gzip/zlib header detection
+ inflater = Zlib::Inflate.new(Zlib::MAX_WBITS + 32)
+
+ begin
+ aws_s3_client.get_object(bucket: bucket, key: key) do |chunk|
+ # Decompress this chunk
+ decompressed = inflater.inflate(chunk)
+ buffer << decompressed
+ extract_lines!(buffer, &block)
+ end
+
+ # Finish decompression and process remaining data
+ remaining = inflater.finish
+ buffer << remaining
+ extract_lines!(buffer, &block)
+
+ yield buffer unless buffer.empty?
+ ensure
+ inflater.close
+ end
+ end
+
+ # ZIP files cannot be truly streamed (central directory is at EOF)
+ # We stream the download but must buffer before decompressing
+ def each_line_zip(&block)
+ require 'zip'
+
+ # Stream download into buffer (ZIP format requires full file)
+ body = StringIO.new("".b)
+ aws_s3_client.get_object(bucket: bucket, key: key) do |chunk|
+ body << chunk
+ end
+ body.rewind
+
+ zip = Zip::File.open_buffer(body)
+ entry = zip.entries.first
+ raise IOError, "ZIP archive is empty" if entry.nil?
+
+ buffer = "".b
+ entry.get_input_stream.each do |chunk|
+ buffer << chunk
+ extract_lines!(buffer, &block)
+ end
+
+ yield buffer unless buffer.empty?
+ end
+
+ # Extract complete lines from buffer, yielding each one
+ def extract_lines!(buffer)
+ while (newline_index = buffer.index("\n"))
+ line = buffer.slice!(0, newline_index + 1)
+ yield line
+ end
+ end
+
+ # Adapter class that provides IO-like interface for streaming
+ # Used by s3cat for backward compatibility
+ class StreamingIO
+ def initialize(search)
+ @search = search
+ end
+
+ def each(&block)
+ @search.each_line(&block)
  end
  end
  end
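The incremental decompression that `each_line_gzip` relies on above can be verified with stdlib zlib alone; `Zlib::MAX_WBITS + 32` tells the inflater to auto-detect the gzip header, so compressed data can be fed in arbitrary slices, mimicking chunks arriving from S3 (the payload and slice size here are illustrative):

```ruby
require 'stringio'
require 'zlib'

# Build a small gzip payload in memory, then feed it to an inflater in
# tiny slices to mimic chunked delivery from get_object.
io = StringIO.new
gz = Zlib::GzipWriter.new(io)
gz.write("line one\nline two\n")
gz.close
data = io.string

inflater = Zlib::Inflate.new(Zlib::MAX_WBITS + 32) # auto-detect gzip header
output = +""
data.bytes.each_slice(8) do |slice| # simulated 8-byte S3 chunks
  output << inflater.inflate(slice.pack("C*"))
end
output << inflater.finish # flush any buffered tail
inflater.close
# output == "line one\nline two\n"
```

Because the inflater keeps its own internal state between calls, no slice boundary can corrupt the stream, which is what makes chunk-by-chunk decompression safe.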
data/s3grep.gemspec CHANGED
@@ -2,10 +2,10 @@

  Gem::Specification.new do |s|
  s.name = 's3grep'
- s.version = '0.1.8'
+ s.version = '0.2.0'
  s.licenses = ['MIT']
- s.summary = 'Search through S3 files'
- s.description = 'Tools for searching files on S3'
+ s.summary = 'Search through S3 files without downloading them'
+ s.description = 'CLI tools for streaming search (s3grep), viewing (s3cat), and reporting (s3info, s3report) on S3 objects. Supports gzip compression and searches large files with minimal memory usage.'
  s.authors = ['Doug Youch']
  s.email = 'dougyouch@gmail.com'
  s.homepage = 'https://github.com/dougyouch/s3grep'
@@ -13,5 +13,8 @@ Gem::Specification.new do |s|
  s.bindir = 'bin'
  s.executables = s.files.grep(%r{^bin/}) { |f| File.basename(f) }

+ s.required_ruby_version = '>= 2.6.0'
+
  s.add_runtime_dependency 'aws-sdk-s3'
+ s.add_runtime_dependency 'rubyzip'
  end
metadata CHANGED
@@ -1,14 +1,13 @@
  --- !ruby/object:Gem::Specification
  name: s3grep
  version: !ruby/object:Gem::Version
- version: 0.1.8
+ version: 0.2.0
  platform: ruby
  authors:
  - Doug Youch
- autorequire:
  bindir: bin
  cert_chain: []
- date: 2024-08-23 00:00:00.000000000 Z
+ date: 2026-02-01 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: aws-sdk-s3
@@ -24,7 +23,23 @@ dependencies:
  - - ">="
  - !ruby/object:Gem::Version
  version: '0'
- description: Tools for searching files on S3
+ - !ruby/object:Gem::Dependency
+ name: rubyzip
+ requirement: !ruby/object:Gem::Requirement
+ requirements:
+ - - ">="
+ - !ruby/object:Gem::Version
+ version: '0'
+ type: :runtime
+ prerelease: false
+ version_requirements: !ruby/object:Gem::Requirement
+ requirements:
+ - - ">="
+ - !ruby/object:Gem::Version
+ version: '0'
+ description: CLI tools for streaming search (s3grep), viewing (s3cat), and reporting
+ (s3info, s3report) on S3 objects. Supports gzip compression and searches large files
+ with minimal memory usage.
  email: dougyouch@gmail.com
  executables:
  - s3cat
@@ -37,6 +52,8 @@ files:
  - ".gitignore"
  - ".ruby-gemset"
  - ".ruby-version"
+ - ARCHITECTURE.md
+ - CLAUDE.md
  - Gemfile
  - Gemfile.lock
  - LICENSE
@@ -55,7 +72,6 @@ homepage: https://github.com/dougyouch/s3grep
  licenses:
  - MIT
  metadata: {}
- post_install_message:
  rdoc_options: []
  require_paths:
  - lib
@@ -63,15 +79,14 @@ required_ruby_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
  - !ruby/object:Gem::Version
- version: '0'
+ version: 2.6.0
  required_rubygems_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
  - !ruby/object:Gem::Version
  version: '0'
  requirements: []
- rubygems_version: 3.3.3
- signing_key:
+ rubygems_version: 3.6.2
  specification_version: 4
- summary: Search through S3 files
+ summary: Search through S3 files without downloading them
  test_files: []