philiprehberger-csv_kit 0.6.0 → 0.7.0

This diff shows the changes between two publicly released versions of the package as they appear in the public registry. It is provided for informational purposes only.

Files changed (6)

checksums.yaml +4 -4
data/CHANGELOG.md +8 -0
data/README.md +10 -0
data/lib/philiprehberger/csv_kit/version.rb +1 -1
data/lib/philiprehberger/csv_kit.rb +35 -0
metadata +2 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 7a49695d1e89645d30e0d325f7c7967db04b044f0cfe69ecf66cabfcc7e28470
-  data.tar.gz: 8a0cd35cd652e1893493ac4a0428251517c4a234cc0dca7b1addc9580b3688c6
+  metadata.gz: ea3cafa68ee9e49b8c1b305af9e7f796a9c417676a6018ee416dbb978e91eb38
+  data.tar.gz: e34151bbdf97d2e78fc348620fd958fa01a3e6d446744593d7a8474b82885de6
 SHA512:
-  metadata.gz: 528761ab2269e586d1d01c20885b5113dd12fe89b94b266b4bcb7452224cedd19a626100218d77def4db8b4d087117717dd3a6738aa5fe54c4c05f20b7592b5c
-  data.tar.gz: 2838f1cb2c0bb9caa43015b2f8f07e3b6252f80d4ad376160619d2c3dc50c4f1fecb65d6d5c47f9e86ffa2537e6b9ace01623a266aa2bc9cfd0fd4c0e9977512
+  metadata.gz: a98d2e7baa28c03c04c44322482cc5a9ec4bef939692f5b3812679efe1d4ab1edc90748d65003ea4ddb287a73d27cb6ba22fd4e254ccf734d9c61cd3fbb28cab
+  data.tar.gz: 04a064667d1cbbde06ab473cad6dd18e4edc9558fa5f6efcff5ea2a0505a3c55846f016ebac118735c3b6c4b035bf48edf4c3cf03750c2ba855bf4ec4fc76f4e

data/CHANGELOG.md CHANGED

@@ -7,6 +7,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
+## [0.7.0] - 2026-04-16
+### Added
+- `CsvKit.sample(path, n, dialect:)` — return n randomly sampled rows as symbolized hashes using reservoir sampling (Algorithm R); O(n) memory regardless of file size; returns all rows if file has fewer than n rows
 ## [0.6.0] - 2026-04-15
 ### Added
@@ -96,6 +101,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Type coercion and row validation
 - Quick load and filtering convenience methods
+[Unreleased]: https://github.com/philiprehberger/rb-csv-kit/compare/v0.7.0...HEAD
+[0.7.0]: https://github.com/philiprehberger/rb-csv-kit/compare/v0.6.0...v0.7.0
+[0.6.0]: https://github.com/philiprehberger/rb-csv-kit/compare/v0.5.0...v0.6.0
 [0.5.0]: https://github.com/philiprehberger/rb-csv-kit/releases/tag/v0.5.0
 [0.4.0]: https://github.com/philiprehberger/rb-csv-kit/releases/tag/v0.4.0
 [0.3.1]: https://github.com/philiprehberger/rb-csv-kit/releases/tag/v0.3.1

data/README.md CHANGED

@@ -69,6 +69,15 @@ adults = Philiprehberger::CsvKit.each_hash("data.csv")
   .first(10)
 ```
+### Reservoir Sampling
+Return n randomly sampled rows with O(n) memory using Knuth's Algorithm R. If the file has fewer than n rows, all rows are returned:
+```ruby
+rows = Philiprehberger::CsvKit.sample("large.csv", 100)
+# => [{name: "Alice", age: "30"}, ...]
+```
 ### Find First Match
 Return the first row that matches a predicate, streaming and stopping on the first hit:
@@ -175,6 +184,7 @@ delimiter = Philiprehberger::CsvKit::Detector.detect("data.tsv")
 | Method / Class | Description |
 |----------------|-------------|
 | `CsvKit.to_hashes(path, dialect:)` | Load CSV into array of symbolized hashes |
+| `CsvKit.sample(path_or_io, n, dialect:)` | Return n randomly sampled rows using reservoir sampling (Algorithm R) |
 | `CsvKit.pluck(path, *keys, dialect:)` | Extract specific columns |
 | `CsvKit.filter(path, dialect:, &block)` | Filter rows, return CSV string |
 | `CsvKit.find(path, dialect:, &block)` | Return the first row matching the predicate, or nil |

data/lib/philiprehberger/csv_kit/version.rb CHANGED

@@ -2,6 +2,6 @@
 module Philiprehberger
   module CsvKit
-    VERSION = '0.6.0'
+    VERSION = '0.7.0'
   end
 end

data/lib/philiprehberger/csv_kit.rb CHANGED

@@ -100,6 +100,41 @@ module Philiprehberger
       block ? enum.each(&block) : enum
     end
+    # Return n randomly sampled rows using reservoir sampling (Algorithm R).
+    # Memory usage is O(n) regardless of file size.
+    # If the file has fewer than n rows, all rows are returned.
+    #
+    # @param path_or_io [String, IO] file path or IO object
+    # @param n [Integer] number of rows to sample
+    # @param dialect [Symbol, Hash, nil] CSV dialect preset or custom options
+    # @return [Array<Hash{Symbol => String}>]
+    def self.sample(path_or_io, n, dialect: nil)
+      csv_opts = { headers: true }
+      csv_opts = Dialect.new(dialect).merge_into(csv_opts) if dialect
+      reservoir = []
+      index = 0
+      iterate = lambda do |row|
+        hash = row.to_h.transform_keys(&:to_sym)
+        if index < n
+          reservoir << hash
+        else
+          j = rand(index + 1)
+          reservoir[j] = hash if j < n
+        end
+        index += 1
+      end
+      if path_or_io.is_a?(String)
+        CSV.foreach(path_or_io, **csv_opts, &iterate)
+      else
+        CSV.new(path_or_io, **csv_opts).each(&iterate)
+      end
+      reservoir
+    end
     # Find the first row matching a predicate, streaming (stops as soon as a match is found).
     #
     # @param path [String] file path
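
The sampling loop added in this hunk can be tried in isolation. The following is a minimal sketch, not code from the gem: `reservoir_sample` and its `rng:` keyword are hypothetical names introduced here for illustration, and the helper takes any Enumerable of rows instead of a CSV path so that it runs without the `csv` library. The replacement rule is the same as in the diff above (keep the first n rows, then overwrite slot `j` when a random `j` in `0..index` falls below n).

```ruby
# Hypothetical helper, not part of csv_kit: Algorithm R over any Enumerable.
# The rng: keyword is an assumption added here so that runs can be seeded.
def reservoir_sample(rows, n, rng: Random.new)
  reservoir = []
  rows.each_with_index do |row, index|
    if index < n
      # Fill phase: the first n rows are always kept.
      reservoir << row
    else
      # Replacement phase: pick a uniform integer j in 0..index and
      # overwrite slot j only when it falls inside the reservoir.
      j = rng.rand(index + 1)
      reservoir[j] = row if j < n
    end
  end
  reservoir
end

# Sample 5 values from a 10,000-element range while holding only 5 in memory.
picked = reservoir_sample(1..10_000, 5, rng: Random.new(42))
puts picked.size

# Fewer rows than n: every row is returned, in input order.
p reservoir_sample(%w[a b c], 10)
```

With this rule, once N >= n rows have been read, each row sits in the reservoir with probability n/N, which is why the result is a uniform sample even though the total row count is never known in advance.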

metadata CHANGED Viewed

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: philiprehberger-csv_kit
 version: !ruby/object:Gem::Version
-  version: 0.6.0
+  version: 0.7.0
 platform: ruby
 authors:
 - Philip Rehberger
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2026-04-15 00:00:00.000000000 Z
+date: 2026-04-17 00:00:00.000000000 Z
 dependencies: []
 description: Streaming CSV processor with row-by-row transforms, validations, column
   plucking, streaming each_hash iteration, filtering, writing, error recovery, and