RubyGems - concurrent_pipeline - Versions diffs - 0.1.0 → 1.0.0 - Mend

concurrent_pipeline 0.1.0 → 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (30)

checksums.yaml +4 -4
data/.claude/settings.local.json +9 -0
data/.ruby-version +1 -1
data/README.md +232 -353
data/Rakefile +4 -2
data/concurrent_pipeline.gemspec +3 -1
data/lib/concurrent_pipeline/pipeline.rb +14 -201
data/lib/concurrent_pipeline/pipelines/processors/asynchronous.rb +92 -0
data/lib/concurrent_pipeline/pipelines/processors/locker.rb +28 -0
data/lib/concurrent_pipeline/pipelines/processors/synchronous.rb +50 -0
data/lib/concurrent_pipeline/pipelines/schema.rb +56 -0
data/lib/concurrent_pipeline/store.rb +88 -13
data/lib/concurrent_pipeline/stores/schema/record.rb +47 -0
data/lib/concurrent_pipeline/stores/schema.rb +35 -0
data/lib/concurrent_pipeline/stores/storage/yaml/fs.rb +140 -0
data/lib/concurrent_pipeline/stores/storage/yaml.rb +196 -0
data/lib/concurrent_pipeline/version.rb +1 -1
data/lib/concurrent_pipeline.rb +13 -9
metadata +40 -14
data/.rubocop.yml +0 -14
data/lib/concurrent_pipeline/changeset.rb +0 -133
data/lib/concurrent_pipeline/model.rb +0 -31
data/lib/concurrent_pipeline/processors/actor_processor.rb +0 -363
data/lib/concurrent_pipeline/producer.rb +0 -156
data/lib/concurrent_pipeline/read_only_store.rb +0 -22
data/lib/concurrent_pipeline/registry.rb +0 -36
data/lib/concurrent_pipeline/stores/versioned.rb +0 -24
data/lib/concurrent_pipeline/stores/yaml/db.rb +0 -110
data/lib/concurrent_pipeline/stores/yaml/history.rb +0 -67
data/lib/concurrent_pipeline/stores/yaml.rb +0 -40

data/README.md CHANGED

@@ -18,456 +18,335 @@ This code I've just written is already legacy code. Good luck!
 ### License
-WTFPL - website down but you can find it if you care
+[WTFPL](https://www.wtfpl.net/txt/copying/)
 ## Guide and Code Examples
-### Simplest Usage
+The text above was written by a human. The text below was written by Monsieur Claude. Is it correct? Yeah, I guess probably, sure, let's go with "yep" ok?
-Define a producer and add a pipeline to it.
+### Basic Example
-```ruby
-# Define your producer:
+Define a store with records, create a pipeline with processing steps, and run it:
-class MyProducer < ConcurrentPipeline::Producer
-  pipeline do
-    steps(:step_1, :step_2)
+```ruby
+require "concurrent_pipeline"
-    def step_1
-      puts "hi from step_1"
-    end
+# Define your data store
+store = ConcurrentPipeline::Store.define do
+  storage(:yaml, dir: "/tmp/my_pipeline")
-    def step_2
-      puts "hi from step_2"
-    end
+  record(:user) do
+    attribute(:name)
+    attribute(:processed, default: false)
   end
 end
-# Run your producer
-producer = MyProducer.new
-producer.call
-# hi from step_1
-# hi from step_2
-```
-Wow! What a convoluted way to just run two methods!
+# Create some data
+store.create(:user, name: "Alice")
+store.create(:user, name: "Bob")
-### Example of Processing data
+# Define processing pipeline
+pipeline = ConcurrentPipeline::Pipeline.define do
+  processor(:sync)  # Run sequentially
-The previous example just wrote to stdout. In general, we want to be storing all data in a dataset that we can review and potentially write to disk.
+  process(:user, processed: false) do |user|
+    puts "Processing #{user.name}"
+    store.update(user, processed: true)
+  end
+end
-Pipelines provide three methods for you to use: (I should really figure out ruby-doc and link there :| )
+# Run it
+pipeline.process(store)
+```
-- `Pipeline#store`: returns a Store
-- `Pipeline#changeset`: returns a Changeset
-- `Pipeline#stream`: returns a Stream (covered in a later example)
+### Async Processing
-Here, we define a model and provide the producer some initial data. Models are stored in the `store`. The models themselves are immutable. In order to create or update a model, we must use the `changeset`.
+Use `:async` processor to run steps concurrently:
 ```ruby
-# Define your producer:
+pipeline = ConcurrentPipeline::Pipeline.define do
+  processor(:async)  # Run concurrently
-class MyProducer < ConcurrentPipeline::Producer
-  model(:my_model) do
-    attribute :id # an :id attribute is always required!
-    attribute :status
+  process(:user, processed: false) do |user|
+    # Each user processed in parallel
+    sleep 1
+    store.update(user, processed: true)
+  end
+end
+```
-    # You can add more methods here, but remember
-    # models are immutable. If you update an
-    # attribute here it will be forgotten at the
-    # end of the step. All models are re-created
-    # from the store for every step.
+Control concurrency and polling with optional parameters:
-    def updated?
-      status == "updated"
-    end
+```ruby
+pipeline = ConcurrentPipeline::Pipeline.define do
+  # concurrency: max parallel tasks (default: 5)
+  # enqueue_seconds: sleep between checking for new work (default: 0.1)
+  processor(:async, concurrency: 10, enqueue_seconds: 0.5)
+  process(:user, processed: false) do |user|
+    # Up to 10 users processed concurrently
+    expensive_api_call(user)
+    store.update(user, processed: true)
   end
+end
+```
-  pipeline do
-    steps(:step_1, :step_2)
+### Custom Methods on Records
-    def step_1
-      # An :id will automatically be created or you can
-      # pass your own:
-      changeset.create(:my_model, id: 1, status: "created")
-    end
+Records can have custom methods defined in the record block:
-    def step_2
-      # You can find the model in the store:
-      record = store.find(:my_model, 1)
+```ruby
+store = ConcurrentPipeline::Store.define do
+  storage(:yaml, dir: "/tmp/my_pipeline")
-      # or get them all and find it yourself if you prefer
-      record = store.all(:my_model).select { |r| r.id == 1 }
+  record(:user) do
+    attribute(:first_name)
+    attribute(:last_name)
+    attribute(:age)
-      changeset.update(record, status: "updated")
+    def full_name
+      "#{first_name} #{last_name}"
+    end
+    def adult?
+      age >= 18
     end
   end
 end
-producer = MyProducer.new
-# invoke it:
-producer.call
-# view results:
-puts producer.data
-# {
-#   my_model: [
-#     { id: 1, status: "updated" },
-#   ]
-# }
+store.create(:user, first_name: "Alice", last_name: "Smith", age: 25)
+user = store.all(:user).first
+puts user.full_name  # => "Alice Smith"
+puts user.adult?     # => true
 ```
-Future examples show how to pass your initial data to a producer.
+### Filtering Records
-### Example with Concurrency
+Use `where` to filter records, or pass filters directly to `process`:
-There are a few ways to declare what things should be done concurrently:
+```ruby
+# Manual filtering
+pending_users = store.where(:user, processed: false, active: true)
-- Put steps in an array to indicate they can run concurrently
-- Add two pipelines
-- Pass the `each: {model_type}` option to the Pipeline indicating that it should be run for every record of that type.
+# Filter with lambdas/procs for custom logic
+even_ids = store.where(:user, id: ->(id) { id.to_i.even? })
+adults = store.where(:user, age: ->(age) { age >= 18 })
-The following example contains all three.
+# Combine regular values with lambda filters
+active_adults = store.where(:user, active: true, age: ->(age) { age >= 18 })
-```ruby
-class MyProducer < ConcurrentPipeline::Producer
-  model(:my_model) do
-    attribute :id # an :id attribute is always required!
-    attribute :status
-  end
+# Or use filters in pipeline
+pipeline = ConcurrentPipeline::Pipeline.define do
+  processor(:sync)
-  pipeline do
-    # Steps :step_2 and :step_3 will be run concurrently.
-    # Step :step_4 will only be run when they have both
-    # finished successfully
-    steps(
-      :step_1,
-      [:step_2, :step_3],
-      :step_4
-    )
-    # noops since we're just demonstrating usage here.
-    def step_1; end
-    def step_2; end
-    def step_3; end
-    def step_4; end
+  # Old style: pass a lambda
+  process(-> { store.all(:user).select(&:active?) }) do |user|
+    # ...
   end
-  # this pipeline will run concurrently with the prior
-  # pipeline.
-  pipeline do
-    steps(:step_1)
-    def step_1; end
-  end
-  # passing `each:` to the Pipeline indicates that it
-  # should be run for every record of that type. When
-  # `each:` is specified, the record can be accessed
-  # using the `record` method.
-  #
-  # Note: every record will be processed concurrently.
-  # You can limit concurrency by passing the
-  # `concurrency: {integer}` option. The default
-  # concurrency is Infinite! INFINIIIIITE!!1!11!!!1!
-  pipeline(each: :my_model, concurrency: 3) do
-    steps(:process)
-    def process
-      changeset.update(record, status: "processed")
-    end
+  # New style: pass record name and filters
+  process(:user, processed: false, active: true) do |user|
+    # ...
   end
 end
-# Lets Pass some initial data:
-initial_data = {
-  my_model: [
-    { id: 1, status: "waiting" },
-    { id: 2, status: "waiting" },
-    { id: 3, status: "waiting" },
-  ]
-}
-producer = MyProducer.new(data: initial_data)
-# invoke it:
-producer.call
-# view results:
-puts producer.data
-# {
-#   my_model: [
-#     { id: 1, status: "processed" },
-#     { id: 2, status: "processed" },
-#     { id: 3, status: "processed" },
-#   ]
-# }
 ```
-### Viewing history and recovering versions
-A version is created each time a record is updated. This example shows how to view and rerun with a prior version.
+### Error Handling
-It's important to note that the system tracks which steps have been performed and which are still waiting to run by writing records to the store. If you change the structure of your Producer (add/remove pipelines or add/remove steps from a pipeline), then there's no guarantee that your data will be compatible across that change. If, however, you only change the body of a step method, then you should be able to rerun a prior version without issue.
+When errors occur during async processing, they're collected and the pipeline returns `false`:
 ```ruby
-class MyProducer < ConcurrentPipeline::Producer
-  model(:my_model) do
-    attribute :id # an :id attribute is always required!
-    attribute :status
-  end
+pipeline = ConcurrentPipeline::Pipeline.define do
+  processor(:async)
-  pipeline(each: :my_model) do
-    steps(:process)
-    def process
-      changeset.update(record, status: "processed")
-    end
+  process(:user, processed: false) do |user|
+    raise "Something went wrong with #{user.name}" if user.name == "Bob"
+    store.update(user, processed: true)
   end
 end
-initial_data = {
-  my_model: [
-    { id: 1, status: "waiting" },
-    { id: 2, status: "waiting" },
-  ]
-}
-producer = MyProducer.new(data: initial_data)
-producer.call
-# access the versions like so:
-puts producer.history.versions.count
-# 5
-# A version can tell you what diff it applied.
-# Notice here, the :PipelineStep record, that
-# is how the progress is tracked internally.
-puts producer.history.versions[3].diff
-# {
-#   changes: [
-#     {
-#       :action: :update,
-#       id: 1,
-#       type: :my_model,
-#       delta: {:status: "processed"}
-#     },
-#     {
-#       action: :update,
-#       id: "5d02ca83-0435-49b5-a812-d4da4eef080e",
-#       type: :PipelineStep,
-#       delta: {
-#         :completed_at: "2024-05-10T18:44:04+00:00",
-#         result: :success
-#       }
-#     }
-#   ]
-# }
-# Let's re-process using a previous version:
-# This will just pick up where it was left off
-re_producer = MyProducer.new(
-  store: producer.history.versions[3].store
-)
-re_producer.call
-# If you need to change the code, you'd probably
-# want to write the data to disk and then read
-# it the next time you run:
-File.write(
-  "last_good_version.yml",
-  producer.history.versions[3].store.data.to_yaml
-)
-# And then next time, load it like so:
-re_producer = MyProducer.new(
-  data: YAML.unsafe_load_file("last_good_version.yml")
-)
-```
+result = pipeline.process(store)
-### Monitoring progress
-When you run a long-running script it's nice to know that it's doing something (anything!). Staring at an unscrolling terminal might be good news, might be bad news, might be no news. How to know?
+unless result
+  puts "Pipeline failed!"
+  pipeline.errors.each { |error| puts error.message }
+end
+```
-Models are immutable and changesets are only applied after a step is completed. If you want to get some data out during processing, you can just `puts` it. Or if you'd like to be a bit more specific about what you track, you can push data to a centralized "stream".
+### Recovering from Failures
-Here's an example:
+The store automatically versions your data. If processing fails, fix your code and restore from where you left off:
 ```ruby
-class MyProducer < ConcurrentPipeline::Producer
-  stream do
-    on(:start) do |message|
-      puts "Started processing #{message}"
-    end
-    on(:progress) do |data|
-      puts "slept #{data[:slept]} times!"
+# First run - fails partway through
+store = ConcurrentPipeline::Store.define do
+  storage(:yaml, dir: "/tmp/my_pipeline")
+  record(:user) do
+    attribute(:name)
+    attribute(:email)
+    attribute(:email_sent, default: false)
+  end
+end
-      # you don't have to just "puts" here:
-      # Audio.play(:jeopardy_music)
-    end
+5.times { |i| store.create(:user, name: "User#{i}") }
-    on(:finished) do
-      # Notice you have access to the outer scope.
-      #
-      # Streams are really about monitoring progress,
-      # so mutating state here is probably recipe for
-      # chaos and darkness, but hey, it's your code
-      # and I say fortune favors the bold (I've never
-      # actually said that until now).
-      some_other_object.reverse!
-    end
-  end
+pipeline = ConcurrentPipeline::Pipeline.define do
+  processor(:async)
-  pipeline do
-    steps(:process)
-    def process
-      # the `push` method takes exactly two arguments:
-      # type: A symbol
-      # payload: any object, go crazy...
-      # ...but remember...concurrency...
-      stream.push(:start, "some_object!")
-      sleep 1
-      stream.push(:progress, {slept: 1 })
-      sleep 1
-      stream.push(:progress, { slept: 2 })
-      changeset.update(record, status: "processed")
-      # Don't feel pressured into sending an object
-      # if you don't feel like it.
-      stream.push(:finished)
-    end
+  process(:user, email_sent: false) do |user|
+    # Oops, forgot to handle missing emails
+    email = fetch_email_for(user.name)  # Might return nil!
+    send_email(email)  # This will fail if email is nil
+    store.update(user, email: email, email_sent: true)
   end
 end
-some_other_object = [1, 2, 3]
+pipeline.process(store)  # Some succeed, some fail
-producer = MyProducer.new
-producer.call
-puts some_other_object.inspect
+# Check what versions exist
+store.versions.each_with_index do |version, i|
+  puts "Version #{i}: #{version.all(:user).count { |u| u.email_sent }} emails sent"
+end
-# Started processing some_object!
-# slept 1 times!
-# slept 2 times!
-# [3, 2, 1]
-```
+# Fix the code and restore from last version
+last_version = store.versions.first
+restored_store = last_version.restore
-### Halting, Blocking, Triggering, etc
+# Now run with fixed logic
+pipeline = ConcurrentPipeline::Pipeline.define do
+  processor(:async)
-Perhaps you have a scenario where all ModelOnes have to be processed before any ModelTwo records. Or perhaps you want to find three ModelOne's that satisfy a certain condition and as soon as you've found those three, you want the pipeline to halt.
+  process(:user, email_sent: false) do |user|
+    email = fetch_email_for(user.name) || "default@example.com"  # Fixed!
+    send_email(email)
+    restored_store.update(user, email: email, email_sent: true)
+  end
+end
-In order to allow this control, each pipline can specify an `open` method indicating whether the pipeline should continue with start or stop and/or continue processing vs halt at the end of the next step.
+pipeline.process(restored_store)  # Only processes remaining users
+```
-A pipeline is always "closed" when all of its steps are complete, so pipelines cannot loop. If you want a Pipeline to loop, you'd have to have an `each: :model` pipeline that creates a new model to be processed. The pipeline would then re-run (with the new model).
+### Storage Structure
-Here's an example with some custom "open" methods:
+When using YAML storage, data is stored in a simple, human-readable file structure:
-```ruby
-class MyProducer < ConcurrentPipeline::Producer
-  model(:model_one) do
-    attribute :id # an :id attribute is always required!
-    attribute :valid
-  end
+```
+/tmp/my_pipeline/
+├── data.yml              # Current state (always up-to-date)
+└── versions/
+    ├── 0001.yml          # Historical version 1
+    ├── 0002.yml          # Historical version 2
+    └── 0003.yml          # Historical version 3
+```
-  model(:model_two) do
-    attribute :id # an :id attribute is always required!
-    attribute :processed
-  end
+- **`data.yml`**: Contains the most recent state of your data. You can inspect this file at any time to see the current state.
+- **`versions/`**: Contains snapshots of previous versions. Each file is a complete snapshot at that point in time.
-  pipeline(each: :model_one) do
-    # we close this pipeline as soon as we've found at least
-    # three valid :model_one records. Note that because of
-    # concurrency, we might not be able to stop at *exactly*
-    # three valid models!
-    open { store.all(:model_one).select(&:valid).count < 3 }
+When you restore to a previous version, that version is copied to `data.yml` and any versions after it are deleted. You can then continue working from that restored state.
-    steps(:process)
+### Running Shell Commands
-    def process
-      sleep(rand(4))
-      changeset.update(record, valid: true)
-    end
-  end
+The `Shell` class helps run external commands within your pipeline. It exists because running shell commands in Ruby can be tedious - you need to capture stdout, stderr, check exit status, and handle failures. Shell simplifies this.
-  pipeline(each: :model_two) do
-    open { store.all(:model_one).select(&:valid).count >= 3 }
+Available in process blocks via the `shell` helper:
-    # noop for example
-    steps(:process)
-    def process
-      store.update(record, processed: true)
+```ruby
+pipeline = ConcurrentPipeline::Pipeline.define do
+  processor(:sync)
+  process(:repository, cloned: false) do |repo|
+    # Shell.run returns a Result with stdout, stderr, success?, command
+    result = shell.run("git clone #{repo.url} /tmp/#{repo.name}")
+    if result.success?
+      puts result.stdout
+      store.update(repo, cloned: true)
+    else
+      puts "Failed: #{result.stderr}"
     end
   end
+end
+```
-  pipeline do
-    open {
-      is_open = store.all(:model_two).all?(&:processed)
-      stream.push(
-        :stdout,
-        "last pipeline is now: #{is_open ? :open : :closed}"
-      )
-      is_open
-    }
+Use `run!` to raise on failure:
+```ruby
+process(:repository, cloned: false) do |repo|
+  # Raises error if command fails, returns stdout if success
+  output = shell.run!("git clone #{repo.url} /tmp/#{repo.name}")
+  store.update(repo, cloned: true, output: output)
+end
+```
-    steps(:process)
+Stream output in real-time with a block:
-    def process
-      stream.push(:stdout, "all done")
-    end
+```ruby
+process(:project, built: false) do |project|
+  shell.run("npm run build") do |stream, line|
+    puts "[#{stream}] #{line}"
   end
+  store.update(project, built: true)
 end
-initial_data = {
-  model_one: [
-    { id: 1, valid: false },
-    { id: 2, valid: false },
-    { id: 3, valid: false },
-    { id: 4, valid: false },
-    { id: 5, valid: false },
-  ],
-  model_two: [
-    { id: 1, processed: false }
-  ]
-}
-producer = MyProducer.new(data: initial_data)
-producer.call
 ```
-### Error Handling
+Use outside of pipelines by calling directly:
-What happens if a step raises an error? Theoretically, that particular pipeline should just halt and the error will be logged in the corresponding PipelineStep record.
+```ruby
+# Check if a command succeeds
+result = ConcurrentPipeline::Shell.run("which docker")
+docker_installed = result.success?
-The return value of `Pipeline#call` is a boolean indicating whether all PipelineSteps have succeeded.
+# Get output or raise
+version = ConcurrentPipeline::Shell.run!("ruby --version")
+puts version  # => "ruby 3.2.9 ..."
+```
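Editorial sketch (not part of the diff): the new README says `Shell` exists because capturing stdout, stderr, and the exit status by hand is tedious. For comparison, here is roughly what that manual work looks like with Ruby's standard `Open3` library. The names `CommandResult` and `run_command` are hypothetical and chosen only for illustration; the gem's own `Result` class and implementation may differ.

```ruby
require "open3"

# Hypothetical result type for illustration; not the gem's Result class.
CommandResult = Struct.new(:command, :stdout, :stderr, :status, keyword_init: true) do
  # Delegates to Process::Status#success? (true when the exit code is 0).
  def success?
    status.success?
  end
end

# Hypothetical helper: runs a command and captures stdout, stderr, and exit status.
def run_command(command)
  stdout, stderr, status = Open3.capture3(command)
  CommandResult.new(command: command, stdout: stdout, stderr: stderr, status: status)
end

result = run_command("echo hello")
if result.success?
  puts result.stdout
else
  warn "Failed: #{result.stderr}"
end
```

Every call site would otherwise repeat this capture-and-check pattern, which is presumably the boilerplate the `shell` helper is meant to remove.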
-It is possible that I've screwed this up and that an error leads to a deadlock. In order to prevent against data-loss, each update to the yaml file is written to disk in a directory you can find at `Pipeline#dir`. You can also pass your own directory during initialization: `MyPipeline.new(dir: "/tmp/my_dir")`
+### Multiple Processing Steps
-### Other hints
+Chain multiple steps together - each step processes what the previous step created:
-You can pass your data to a Producer in three ways:
-- Passing a hash: `MyProducer.new(data: {...})`
-- Passing a store: `MyProducer.new(store: store)`
+```ruby
+store = ConcurrentPipeline::Store.define do
+  storage(:yaml, dir: "/tmp/my_pipeline")
-Lastly, you can pass a block to apply changesets immediately:
+  record(:company) do
+    attribute(:name)
+    attribute(:fetched, default: false)
+  end
-```ruby
-processor = MyProducer.new do |changeset|
-  [:a, :b, :c].each do |item|
-    changset.create(:model, name: item)
+  record(:employee) do
+    attribute(:company_name)
+    attribute(:name)
+    attribute(:processed, default: false)
   end
 end
-```
-If you need access to an outer scope for a stream to access, you can construct your own stream:
+store.create(:company, name: "Acme Corp")
+store.create(:company, name: "Tech Inc")
-```ruby
-my_stream = ConcurrentPipeline::Producer::Stream.new
-outer_variable = :something
-my_stream.on(:stdout) { puts "outer_variable: #{outer_variable}" }
-MyProducer.new(stream: my_stream)
-```
+pipeline = ConcurrentPipeline::Pipeline.define do
+  processor(:async)
+  # Step 1: Fetch employees for each company
+  process(:company, fetched: false) do |company|
+    employees = api_fetch_employees(company.name)
+    employees.each do |emp|
+      store.create(:employee, company_name: company.name, name: emp)
+    end
+    store.update(company, fetched: true)
+  end
-## Ruby can't do parallel processing so concurrency here is useless
+  # Step 2: Process each employee
+  process(:employee, processed: false) do |employee|
+    send_welcome_email(employee.name)
+    store.update(employee, processed: true)
+  end
+end
+pipeline.process(store)
+```
-Yes and no, but maybe, but maybe not. I've pulled out some explanation in a [separate doc](./concurrency.md)
+### Final words
-tl;dr: Concurrency will hurt you if you're crunching lots of numbers. It will help you if you're shelling out or hitting the network. Also if you're really crunching so many numbers and performance is critical...maybe just use Go.
+That's it, you've reached THE END OF THE INTERNET.