rperf 0.5.0 → 0.6.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 3413c4c6ed0cdc0897428bf01fc0fec17a4d14f1c2883e9e5afa0cff110247dc
4
- data.tar.gz: '097b06203ce4648a860f2816635d6dfac52f8e5987aa381653cec874d52abf7c'
3
+ metadata.gz: 497392cfda8e82d1c37aadd0953b4c73b6bfb09870e6c612c1fd5fced0e3d24f
4
+ data.tar.gz: 6960be209fc3d4aac0f268378c5b7e1399027da0c5b7f498bcb4be0662012d62
5
5
  SHA512:
6
- metadata.gz: 37065071f049a27eb1bab9f859ed39499022489a19aa8ecd91b3dc35cb6052ffb6b2fbc02c67ea46a94e8dba7644f2b23760d72d2dda7b998ccf3c61c304e225
7
- data.tar.gz: 686ab430d58e5dd5163ae65a2bd330a76e57cf0dd72e7eac2b7c61621a03007bd724cac20b1e452766870ff33de325855e199bc3d873d004344f7b26b9b6614f
6
+ metadata.gz: '09fc32b7577ac9544a846c86c37a7ad11e9de00a27bbb0bbd25cbc2fcabe04e74741c64f9fb3cfe1a9663145e058215272a247766c7b8106218eda80cbcd838f'
7
+ data.tar.gz: 9d13e685c5a293c4d9033376509bf4b5c762a5f5155a2d5dd6e838d5a55dc79b9ef7d9521a5bcf65eac88a683f0e20cd7e5dac2680134aa565c709eb48452e40
data/README.md CHANGED
@@ -2,25 +2,66 @@
2
2
  <img src="docs/logo.svg" alt="rperf logo" width="260">
3
3
  </p>
4
4
 
5
- # rperf
5
+ <h1 align="center">rperf</h1>
6
6
 
7
- A safepoint-based sampling performance profiler for Ruby. Uses actual time deltas as sample weights to correct safepoint bias.
7
+ <p align="center">
8
+ <strong>Know where your Ruby spends its time — accurately.</strong><br>
9
+ A sampling profiler that corrects safepoint bias using real time deltas.
10
+ </p>
8
11
 
9
- - Requires Ruby >= 3.4.0
10
- - Output: pprof protobuf, collapsed stacks, or text report
11
- - Modes: CPU time (per-thread) and wall time (with GVL/GC tracking)
12
- - [Online manual](https://ko1.github.io/rperf/docs/manual/) | [GitHub](https://github.com/ko1/rperf)
12
+ <p align="center">
13
+ <a href="https://rubygems.org/gems/rperf"><img src="https://img.shields.io/gem/v/rperf.svg" alt="Gem Version"></a>
14
+ <img src="https://img.shields.io/badge/Ruby-%3E%3D%203.4.0-cc342d" alt="Ruby >= 3.4.0">
15
+ <a href="https://ko1.github.io/rperf/docs/manual/"><img src="https://img.shields.io/badge/docs-manual-blue" alt="Manual"></a>
16
+ <img src="https://img.shields.io/badge/license-MIT-green" alt="MIT License">
17
+ </p>
13
18
 
14
- ## Quick Start
19
+ <p align="center">
20
+ pprof / collapsed stacks / text report &nbsp;·&nbsp; CPU mode & wall mode (GVL + GC tracking)
21
+ </p>
22
+
23
+ <p align="center">
24
+ <a href='https://ko1.github.io/rperf/'>Web site</a>,
25
+ <a href='https://ko1.github.io/rperf/docs/manual/'>Online manual</a>,
26
+ <a href='https://github.com/ko1/rperf'>GitHub repository</a>
27
+ </p>
28
+
29
+ ## See It in Action
15
30
 
16
31
  ```bash
17
- gem install rperf
32
+ $ gem install rperf
33
+ $ rperf exec ruby fib.rb
18
34
 
35
+ Performance stats for 'ruby fib.rb':
36
+
37
+ 2,326.0 ms user
38
+ 64.5 ms sys
39
+ 2,035.5 ms real
40
+
41
+ 2,034.2 ms 100.0% CPU execution
42
+ 1 [Ruby] detected threads
43
+ 7.0 ms [Ruby] GC time (7 count: 5 minor, 2 major)
44
+ 106,078 [Ruby] allocated objects
45
+ 22 MB [OS] peak memory (maxrss)
46
+
47
+ Flat:
48
+ 2,034.2 ms 100.0% Object#fibonacci (fib.rb)
49
+
50
+ Cumulative:
51
+ 2,034.2 ms 100.0% Object#fibonacci (fib.rb)
52
+ 2,034.2 ms 100.0% <main> (fib.rb)
53
+
54
+ 2034 samples / 2034 triggers, 0.1% profiler overhead
55
+ ```
56
+
57
+ ## Quick Start
58
+
59
+ ```bash
19
60
  # Performance summary (wall mode, prints to stderr)
20
61
  rperf stat ruby app.rb
21
62
 
22
- # Profile to file
23
- rperf record ruby app.rb # → rperf.data (pprof, cpu mode)
63
+ # Record a pprof profile to file
64
+ rperf record ruby app.rb # → rperf.data (cpu mode)
24
65
  rperf record -m wall -o profile.pb.gz ruby server.rb # wall mode, custom output
25
66
 
26
67
  # View results (report/diff require Go: https://go.dev/dl/)
@@ -67,19 +108,20 @@ Inspired by Linux `perf` — familiar subcommand interface for profiling workflo
67
108
  |---------|-------------|
68
109
  | `rperf record` | Profile a command and save to file |
69
110
  | `rperf stat` | Profile a command and print summary to stderr |
111
+ | `rperf exec` | Profile a command and print full report to stderr |
70
112
  | `rperf report` | Open pprof profile with `go tool pprof` (requires Go) |
71
113
  | `rperf diff` | Compare two pprof profiles (requires Go) |
72
114
  | `rperf help` | Show full reference documentation |
73
115
 
74
116
  ## How It Works
75
117
 
76
- ### The Problem
118
+ ### The Challenge: Safepoint Sampling
77
119
 
78
- Ruby's sampling profilers collect stack traces at **safepoints**, not at the exact timer tick. Traditional profilers assign equal weight to every sample, so if a safepoint is delayed 5ms, that delay is invisible.
120
+ Most Ruby profilers (e.g., stackprof) use signal handlers to capture stack traces at the exact moment the timer fires. rperf takes a different approach: it samples at **safepoints** (VM checkpoints), which is safer (no async-signal-safety concerns, reliable access to VM state) but means the sample timing can be delayed. Without correction, this delay would skew the results.
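To make the bias concrete, here is a toy model (illustrative Ruby, not rperf internals): a 1 ms timer, one safepoint that lags behind a long C call, and the two weighting schemes side by side.

```ruby
# Toy model: five samples; the 4th safepoint is delayed 6 ms by a slow call.
ticks  = [0, 1.0, 2.0, 3.0, 9.0, 10.0]        # sample times in ms
frames = %w[fast fast fast slow_c_call fast]  # stack leaf seen at each sample

equal = Hash.new(0)
delta = Hash.new(0.0)
ticks.each_cons(2).zip(frames).each do |(prev, now), frame|
  equal[frame] += 1           # classic: every sample weighs the same
  delta[frame] += now - prev  # rperf-style: weight = actual elapsed time
end

equal["slow_c_call"]  #=> 1    (1 of 5 samples: looks like 20%)
delta["slow_c_call"]  #=> 6.0  (6 of 10 ms: actually 60% of wall time)
```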
79
121
 
80
- ### The Solution
122
+ ### The Fix: Weight = Real Time
81
123
 
82
- rperf uses **time deltas as sample weights**:
124
+ rperf uses **actual elapsed time as sample weights** — so delayed samples carry proportionally more weight, and the profile matches reality:
83
125
 
84
126
  ```
85
127
  Timer (signal or thread) VM thread (postponed job)
@@ -116,23 +158,22 @@ rperf hooks GVL and GC events to attribute non-CPU time:
116
158
  | `[GC marking]` | Time in GC mark phase |
117
159
  | `[GC sweeping]` | Time in GC sweep phase |
118
160
 
119
- ## Pros & Cons
161
+ ## Why rperf?
120
162
 
121
- ### Pros
163
+ - **Accurate despite safepoints** — Safepoint sampling is *safer* (no async-signal-safety issues), but normally *inaccurate*. rperf compensates with real time-delta weights, so profiles faithfully reflect where time is actually spent.
164
+ - **See the whole picture** (wall mode) — GVL contention, off-GVL I/O, GC marking/sweeping — all attributed to the call stacks responsible, via synthetic frames.
165
+ - **Low overhead** — Signal-based timer on Linux (no extra thread). ~1–5 µs per sample.
166
+ - **pprof compatible** — Works with `go tool pprof`, speedscope, and other standard tools out of the box.
167
+ - **Zero code changes** — Profile any Ruby program via CLI or environment variables. Drop-in for Rails, too.
168
+ - **`perf`-like CLI** — `record`, `stat`, `report`, `diff` — if you know Linux perf, you already know rperf.
122
169
 
123
- - **Safepoint-based, but accurate**: Unlike signal-based profilers (e.g., stackprof), rperf samples at safepoints. Safepoint sampling is safer — no async-signal-safety constraints, so backtraces and VM state (GC phase, GVL ownership) can be inspected reliably. The downside is less precise sampling timing, but rperf compensates by using actual time deltas as sample weights — so the profiling results faithfully reflect where time is actually spent.
124
- - **GVL & GC visibility** (wall mode): Attributes off-GVL time, GVL contention, and GC phases to the responsible call stacks with synthetic frames.
125
- - **Low overhead**: No extra thread on Linux (signal-based timer). Sampling overhead is ~1-5 us per sample.
126
- - **pprof compatible**: Output works with `go tool pprof`, speedscope, and other standard tools.
127
- - **No code changes required**: Profile any Ruby program via CLI (`rperf stat ruby app.rb`) or environment variables (`RPERF_ENABLED=1`).
128
- - **perf-like CLI**: Familiar subcommand interface — `record`, `stat`, `report`, `diff` — inspired by Linux perf.
170
+ ### Limitations
129
171
 
130
- ### Cons
172
+ - **Method-level only** — no line-level granularity.
173
+ - **Ruby >= 3.4.0** — uses recent VM internals (postponed jobs, thread event hooks).
174
+ - **POSIX only** — Linux, macOS. No Windows.
175
+ - **No fork support** — profiling does not follow fork(2) child processes.
131
176
 
132
- - **Method-level only**: Profiles at the method level, not the line level. You can see which method is slow, but not which line within it.
133
- - **Ruby >= 3.4.0**: Requires recent Ruby for the internal APIs used (postponed jobs, thread event hooks).
134
- - **POSIX only**: Linux, macOS, etc. No Windows support.
135
- - **Safepoint sampling**: Cannot sample inside C extensions or during long-running C calls that don't reach a safepoint. Time spent there is attributed to the next sample.
136
177
 
137
178
  ## Output Formats
138
179
 
@@ -146,4 +187,4 @@ Format is auto-detected from extension, or set explicitly with `--format`.
146
187
 
147
188
  ## License
148
189
 
149
- MIT
190
+ MIT
data/docs/help.md CHANGED
@@ -130,22 +130,132 @@ nil if profiler was not running; otherwise a Hash:
130
130
  detected_thread_count: 4, # threads seen during profiling
131
131
  start_time_ns: 17740..., # CLOCK_REALTIME epoch nanos
132
132
  duration_ns: 10000000, # profiling duration in nanos
133
- aggregated_samples: [ # when aggregate: true (default)
134
- [frames, weight, seq], # frames: [[path, label], ...] deepest-first
135
- ... # weight: Integer (nanoseconds, merged per unique stack)
136
- ], # seq: Integer (thread sequence, 1-based)
133
+ aggregated_samples: [ # when aggregate: true (default)
134
+ [frames, weight, seq, label_set_id], # frames: [[path, label], ...] deepest-first
135
+ ... # weight: Integer (nanoseconds, merged per unique stack)
136
+ ], # seq: Integer (thread sequence, 1-based)
137
+ # label_set_id: Integer (0 = no labels)
138
+ label_sets: [{}, {request: "abc"}, ...], # label set table (index = label_set_id)
137
139
  # --- OR ---
138
- raw_samples: [ # when aggregate: false
139
- [frames, weight, seq], # one entry per timer sample (not merged)
140
+ raw_samples: [ # when aggregate: false
141
+ [frames, weight, seq, label_set_id], # one entry per timer sample (not merged)
140
142
  ...
141
143
  ] }
142
144
  ```
143
145
 
146
+ ### Rperf.snapshot(clear: false)
147
+
148
+ Returns a snapshot of the current profiling data without stopping.
149
+ Only works in aggregate mode (the default). Returns nil if not profiling.
150
+
151
+ When `clear: true` is given, resets aggregated data after taking the snapshot.
152
+ This enables interval-based profiling where each snapshot covers only the
153
+ period since the last clear.
154
+
155
+ ```ruby
156
+ Rperf.start(frequency: 1000)
157
+ # ... work ...
158
+ snap = Rperf.snapshot # read data without stopping
159
+ Rperf.save("snap.pb.gz", snap)
160
+ # ... more work ...
161
+ data = Rperf.stop
162
+ ```
163
+
164
+ Interval-based usage:
165
+
166
+ ```ruby
167
+ Rperf.start(frequency: 1000)
168
+ loop do
169
+ sleep 10
170
+ snap = Rperf.snapshot(clear: true) # each snapshot covers the last 10s
171
+ Rperf.save("profile-#{Time.now.to_i}.pb.gz", snap)
172
+ end
173
+ ```
174
+
175
+ ### Rperf.label(**labels, &block)
176
+
177
+ Attaches key-value labels to the current thread's samples. Labels appear
178
+ in pprof sample labels, enabling per-context filtering (e.g., per-request).
179
+ If profiling is not running, labels are silently ignored (no error).
180
+
181
+ ```ruby
182
+ # Block form — labels are restored when the block exits
183
+ Rperf.label(request: "abc-123", endpoint: "/api/users") do
184
+ handle_request # samples inside get these labels
185
+ end
186
+ # labels are restored to previous state here
187
+
188
+ # Without block — labels persist until changed
189
+ Rperf.label(request: "abc-123")
190
+
191
+ # Merge — new labels merge with existing ones
192
+ Rperf.label(phase: "db") # adds phase, keeps request
193
+
194
+ # Delete a key — set value to nil
195
+ Rperf.label(request: nil) # removes request key
196
+
197
+ # Nested blocks — each block restores its entry state
198
+ Rperf.label(request: "abc") do
199
+ Rperf.label(phase: "db") do
200
+ Rperf.labels #=> {request: "abc", phase: "db"}
201
+ end
202
+ Rperf.labels #=> {request: "abc"}
203
+ end
204
+ Rperf.labels #=> {}
205
+ ```
206
+
207
+ In pprof output, use labels for filtering and grouping:
208
+
209
+ go tool pprof -tagfocus=request=abc-123 profile.pb.gz
210
+ go tool pprof -tagroot=request profile.pb.gz
211
+ go tool pprof -tagleaf=request profile.pb.gz
212
+
213
+ ### Rperf.labels
214
+
215
+ Returns the current thread's labels as a Hash. Empty hash if none set.
216
+
144
217
  ### Rperf.save(path, data, format: nil)
145
218
 
146
219
  Writes data to path. format: :pprof, :collapsed, or :text.
147
220
  nil auto-detects from extension.
148
221
 
222
+ ### Rperf::Middleware (Rack)
223
+
224
+ Labels samples with the request endpoint. Requires `require "rperf/middleware"`.
225
+
226
+ ```ruby
227
+ # Rails
228
+ Rails.application.config.middleware.use Rperf::Middleware
229
+
230
+ # Sinatra
231
+ use Rperf::Middleware
232
+ ```
233
+
234
+ The middleware only sets labels — start profiling separately.
235
+ Option: `label_key:` (default: `:endpoint`).
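As a sketch of what such a label-only middleware amounts to (hypothetical `LabelingMiddleware`, with a stand-in `Rperf.label` stub so the snippet runs without the gem; the real class is `Rperf::Middleware`):

```ruby
# Stand-in for Rperf.label so this sketch runs without the rperf gem.
module Rperf
  def self.label(**labels)
    prev = Thread.current[:rperf_labels] || {}
    Thread.current[:rperf_labels] = prev.merge(labels)
    yield
  ensure
    Thread.current[:rperf_labels] = prev
  end
end

# Hypothetical label-only Rack middleware mirroring the documented behavior:
# set a per-request label, delegate to the app, restore labels on exit.
class LabelingMiddleware
  def initialize(app, label_key: :endpoint)
    @app = app
    @label_key = label_key
  end

  def call(env)
    Rperf.label(@label_key => env["PATH_INFO"]) { @app.call(env) }
  end
end

app = LabelingMiddleware.new(->(env) { Thread.current[:rperf_labels] },
                             label_key: :route)
app.call("PATH_INFO" => "/api/users")  #=> {route: "/api/users"}
```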
236
+
237
+ ### Rperf::ActiveJobMiddleware
238
+
239
+ Labels samples with the job class name. Requires `require "rperf/active_job"`.
240
+
241
+ ```ruby
242
+ class ApplicationJob < ActiveJob::Base
243
+ include Rperf::ActiveJobMiddleware
244
+ end
245
+ ```
246
+
247
+ ### Rperf::SidekiqMiddleware
248
+
249
+ Labels samples with the worker class name. Requires `require "rperf/sidekiq"`.
250
+
251
+ ```ruby
252
+ Sidekiq.configure_server do |config|
253
+ config.server_middleware do |chain|
254
+ chain.add Rperf::SidekiqMiddleware
255
+ end
256
+ end
257
+ ```
258
+
149
259
  ## PROFILING MODES
150
260
 
151
261
  - **cpu** — Measures per-thread CPU time via Linux thread clock.
@@ -175,11 +285,20 @@ Embedded metadata:
175
285
  Sample labels:
176
286
 
177
287
  thread_seq thread sequence number (1-based, assigned per profiling session)
288
+ <user labels> custom key-value labels set via Rperf.label()
178
289
 
179
290
  View comments: `go tool pprof -comments profile.pb.gz`
180
291
 
181
292
  Group by thread: `go tool pprof -tagroot=thread_seq profile.pb.gz`
182
293
 
294
+ Filter by label: `go tool pprof -tagfocus=request=abc-123 profile.pb.gz`
295
+
296
+ Group by label (root): `go tool pprof -tagroot=request profile.pb.gz`
297
+
298
+ Group by label (leaf): `go tool pprof -tagleaf=request profile.pb.gz`
299
+
300
+ Exclude by label: `go tool pprof -tagignore=request=healthcheck profile.pb.gz`
301
+
183
302
  ### collapsed
184
303
 
185
304
  Plain text. One line per unique stack: `frame1;frame2;...;leaf weight`
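For example, a collapsed file might contain lines like these (frame names and weights illustrative; weights are in nanoseconds, matching the sample weights above):

```
<main>;Object#fibonacci 2034200000
<main>;Object#serve;[GVL wait] 120000000
```

Files in this format can be fed directly to flamegraph tooling such as flamegraph.pl or speedscope.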
data/exe/rperf CHANGED
@@ -80,7 +80,7 @@ USAGE = "Usage: rperf record [options] command [args...]\n" \
80
80
  # Handle top-level flags before subcommand parsing
81
81
  case ARGV.first
82
82
  when "-v", "--version"
83
- require "rperf"
83
+ require_relative "../lib/rperf"
84
84
  puts "rperf #{Rperf::VERSION}"
85
85
  exit
86
86
  when "-h", "--help"
data/ext/rperf/rperf.c CHANGED
@@ -66,6 +66,7 @@ typedef struct rperf_sample {
66
66
  int64_t weight;
67
67
  int type; /* rperf_sample_type */
68
68
  int thread_seq; /* thread sequence number (1-based) */
69
+ int label_set_id; /* label set ID (0 = no labels) */
69
70
  } rperf_sample_t;
70
71
 
71
72
  /* ---- Sample buffer (double-buffered) ---- */
@@ -103,6 +104,7 @@ typedef struct rperf_agg_entry {
103
104
  uint32_t frame_start; /* offset into stack_pool */
104
105
  int depth; /* includes synthetic frame */
105
106
  int thread_seq;
107
+ int label_set_id; /* label set ID (0 = no labels) */
106
108
  int64_t weight; /* accumulated */
107
109
  uint32_t hash; /* cached hash value */
108
110
  int used; /* 0 = empty, 1 = used */
@@ -124,6 +126,7 @@ typedef struct rperf_thread_data {
124
126
  int64_t suspended_at_ns; /* wall time at SUSPENDED */
125
127
  int64_t ready_at_ns; /* wall time at READY */
126
128
  int thread_seq; /* thread sequence number (1-based) */
129
+ int label_set_id; /* current label set ID (0 = no labels) */
127
130
  } rperf_thread_data_t;
128
131
 
129
132
  /* ---- GC tracking state ---- */
@@ -132,6 +135,7 @@ typedef struct rperf_gc_state {
132
135
  int phase; /* rperf_gc_phase */
133
136
  int64_t enter_ns; /* wall time at GC_ENTER */
134
137
  int thread_seq; /* thread_seq at GC_ENTER */
138
+ int label_set_id; /* label_set_id at GC_ENTER */
135
139
  } rperf_gc_state_t;
136
140
 
137
141
  /* ---- Sampling overhead stats ---- */
@@ -175,6 +179,9 @@ typedef struct rperf_profiler {
175
179
  int next_thread_seq;
176
180
  /* Sampling overhead stats */
177
181
  rperf_stats_t stats;
182
+ /* Label sets: Ruby Array of Hash objects, managed from Ruby side.
183
+ * Index 0 is reserved (no labels). GC-marked via profiler_mark. */
184
+ VALUE label_sets; /* Ruby Array or Qnil */
178
185
  } rperf_profiler_t;
179
186
 
180
187
  static rperf_profiler_t g_profiler;
@@ -195,6 +202,10 @@ rperf_profiler_mark(void *ptr)
195
202
  buf->frame_pool + buf->frame_pool_count);
196
203
  }
197
204
  }
205
+ /* Mark label_sets array */
206
+ if (prof->label_sets != Qnil) {
207
+ rb_gc_mark(prof->label_sets);
208
+ }
198
209
  /* Mark frame_table keys (unique frame VALUEs).
199
210
  * Acquire count to synchronize with the release-store in insert,
200
211
  * ensuring we see the keys pointer that is valid for [0, count).
@@ -431,7 +442,7 @@ rperf_frame_table_insert(rperf_frame_table_t *ft, VALUE fval)
431
442
  /* ---- Aggregation table operations (all malloc-based, no GVL needed) ---- */
432
443
 
433
444
  static uint32_t
434
- rperf_fnv1a_u32(const uint32_t *data, int len, int thread_seq)
445
+ rperf_fnv1a_u32(const uint32_t *data, int len, int thread_seq, int label_set_id)
435
446
  {
436
447
  uint32_t h = 2166136261u;
437
448
  int i;
@@ -441,6 +452,8 @@ rperf_fnv1a_u32(const uint32_t *data, int len, int thread_seq)
441
452
  }
442
453
  h ^= (uint32_t)thread_seq;
443
454
  h *= 16777619u;
455
+ h ^= (uint32_t)label_set_id;
456
+ h *= 16777619u;
444
457
  return h;
445
458
  }
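Restated in Ruby for illustration (a sketch of the hashing scheme above, with `uint32_t` wraparound emulated by masking): two identical stacks that differ only in label set now hash, and therefore aggregate, separately.

```ruby
FNV_PRIME = 16777619
U32_MASK  = 0xFFFFFFFF  # emulate uint32_t overflow

def stack_hash(frame_ids, thread_seq, label_set_id)
  h = 2166136261  # FNV-1a 32-bit offset basis
  frame_ids.each { |id| h = ((h ^ id) * FNV_PRIME) & U32_MASK }
  h = ((h ^ thread_seq) * FNV_PRIME) & U32_MASK
  ((h ^ label_set_id) * FNV_PRIME) & U32_MASK  # the 0.6.0 addition
end

stack_hash([7, 42], 1, 0) == stack_hash([7, 42], 1, 1)  #=> false
```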
446
459
 
@@ -506,7 +519,8 @@ rperf_agg_ensure_stack_pool(rperf_agg_table_t *at, int needed)
506
519
  /* Insert or merge a stack into the aggregation table */
507
520
  static void
508
521
  rperf_agg_table_insert(rperf_agg_table_t *at, const uint32_t *frame_ids,
509
- int depth, int thread_seq, int64_t weight, uint32_t hash)
522
+ int depth, int thread_seq, int label_set_id,
523
+ int64_t weight, uint32_t hash)
510
524
  {
511
525
  size_t idx = hash % at->bucket_capacity;
512
526
 
@@ -514,6 +528,7 @@ rperf_agg_table_insert(rperf_agg_table_t *at, const uint32_t *frame_ids,
514
528
  rperf_agg_entry_t *e = &at->buckets[idx];
515
529
  if (!e->used) break;
516
530
  if (e->hash == hash && e->depth == depth && e->thread_seq == thread_seq &&
531
+ e->label_set_id == label_set_id &&
517
532
  memcmp(at->stack_pool + e->frame_start, frame_ids,
518
533
  depth * sizeof(uint32_t)) == 0) {
519
534
  /* Match — merge weight */
@@ -530,6 +545,7 @@ rperf_agg_table_insert(rperf_agg_table_t *at, const uint32_t *frame_ids,
530
545
  e->frame_start = (uint32_t)at->stack_pool_count;
531
546
  e->depth = depth;
532
547
  e->thread_seq = thread_seq;
548
+ e->label_set_id = label_set_id;
533
549
  e->weight = weight;
534
550
  e->hash = hash;
535
551
  e->used = 1;
@@ -581,10 +597,10 @@ rperf_aggregate_buffer(rperf_profiler_t *prof, rperf_sample_buffer_t *buf)
581
597
  if (overflow) break; /* frame_table full, stop aggregating this buffer */
582
598
 
583
599
  int total_depth = off + s->depth;
584
- hash = rperf_fnv1a_u32(temp_ids, total_depth, s->thread_seq);
600
+ hash = rperf_fnv1a_u32(temp_ids, total_depth, s->thread_seq, s->label_set_id);
585
601
 
586
602
  rperf_agg_table_insert(&prof->agg_table, temp_ids, total_depth,
587
- s->thread_seq, s->weight, hash);
603
+ s->thread_seq, s->label_set_id, s->weight, hash);
588
604
  }
589
605
 
590
606
  /* Reset buffer for reuse.
@@ -634,7 +650,7 @@ rperf_try_swap(rperf_profiler_t *prof)
634
650
  /* Write a sample into a specific buffer. No swap check. */
635
651
  static int
636
652
  rperf_write_sample(rperf_sample_buffer_t *buf, size_t frame_start, int depth,
637
- int64_t weight, int type, int thread_seq)
653
+ int64_t weight, int type, int thread_seq, int label_set_id)
638
654
  {
639
655
  if (weight <= 0) return 0;
640
656
  if (rperf_ensure_sample_capacity(buf) < 0) return -1;
@@ -645,16 +661,17 @@ rperf_write_sample(rperf_sample_buffer_t *buf, size_t frame_start, int depth,
645
661
  sample->weight = weight;
646
662
  sample->type = type;
647
663
  sample->thread_seq = thread_seq;
664
+ sample->label_set_id = label_set_id;
648
665
  buf->sample_count++;
649
666
  return 0;
650
667
  }
651
668
 
652
669
  static void
653
670
  rperf_record_sample(rperf_profiler_t *prof, size_t frame_start, int depth,
654
- int64_t weight, int type, int thread_seq)
671
+ int64_t weight, int type, int thread_seq, int label_set_id)
655
672
  {
656
673
  rperf_sample_buffer_t *buf = &prof->buffers[atomic_load_explicit(&prof->active_idx, memory_order_relaxed)];
657
- rperf_write_sample(buf, frame_start, depth, weight, type, thread_seq);
674
+ rperf_write_sample(buf, frame_start, depth, weight, type, thread_seq, label_set_id);
658
675
  rperf_try_swap(prof);
659
676
  }
660
677
 
@@ -676,12 +693,11 @@ rperf_thread_data_create(rperf_profiler_t *prof, VALUE thread)
676
693
  /* ---- Thread event hooks ---- */
677
694
 
678
695
  static void
679
- rperf_handle_suspended(rperf_profiler_t *prof, VALUE thread)
696
+ rperf_handle_suspended(rperf_profiler_t *prof, VALUE thread, rperf_thread_data_t *td)
680
697
  {
681
698
  /* Has GVL — safe to call Ruby APIs */
682
699
  int64_t wall_now = rperf_wall_time_ns();
683
700
 
684
- rperf_thread_data_t *td = (rperf_thread_data_t *)rb_internal_thread_specific_get(thread, prof->ts_key);
685
701
  int is_first = 0;
686
702
 
687
703
  if (td == NULL) {
@@ -705,7 +721,7 @@ rperf_handle_suspended(rperf_profiler_t *prof, VALUE thread)
705
721
  /* Record normal sample (skip if first time — no prev_time) */
706
722
  if (!is_first) {
707
723
  int64_t weight = time_now - td->prev_time_ns;
708
- rperf_record_sample(prof, frame_start, depth, weight, RPERF_SAMPLE_NORMAL, td->thread_seq);
724
+ rperf_record_sample(prof, frame_start, depth, weight, RPERF_SAMPLE_NORMAL, td->thread_seq, td->label_set_id);
709
725
  }
710
726
 
711
727
  /* Save timestamp for READY/RESUMED */
@@ -715,21 +731,18 @@ rperf_handle_suspended(rperf_profiler_t *prof, VALUE thread)
715
731
  }
716
732
 
717
733
  static void
718
- rperf_handle_ready(rperf_profiler_t *prof, VALUE thread)
734
+ rperf_handle_ready(rperf_thread_data_t *td)
719
735
  {
720
736
  /* May NOT have GVL — only simple C operations allowed */
721
- rperf_thread_data_t *td = (rperf_thread_data_t *)rb_internal_thread_specific_get(thread, prof->ts_key);
722
737
  if (!td) return;
723
738
 
724
739
  td->ready_at_ns = rperf_wall_time_ns();
725
740
  }
726
741
 
727
742
  static void
728
- rperf_handle_resumed(rperf_profiler_t *prof, VALUE thread)
743
+ rperf_handle_resumed(rperf_profiler_t *prof, VALUE thread, rperf_thread_data_t *td)
729
744
  {
730
745
  /* Has GVL */
731
- rperf_thread_data_t *td = (rperf_thread_data_t *)rb_internal_thread_specific_get(thread, prof->ts_key);
732
-
733
746
  if (td == NULL) {
734
747
  td = rperf_thread_data_create(prof, thread);
735
748
  if (!td) return;
@@ -758,12 +771,12 @@ rperf_handle_resumed(rperf_profiler_t *prof, VALUE thread)
758
771
  if (td->ready_at_ns > 0 && td->ready_at_ns > td->suspended_at_ns) {
759
772
  int64_t blocked_ns = td->ready_at_ns - td->suspended_at_ns;
760
773
  rperf_write_sample(buf, frame_start, depth, blocked_ns,
761
- RPERF_SAMPLE_GVL_BLOCKED, td->thread_seq);
774
+ RPERF_SAMPLE_GVL_BLOCKED, td->thread_seq, td->label_set_id);
762
775
  }
763
776
  if (td->ready_at_ns > 0 && wall_now > td->ready_at_ns) {
764
777
  int64_t wait_ns = wall_now - td->ready_at_ns;
765
778
  rperf_write_sample(buf, frame_start, depth, wait_ns,
766
- RPERF_SAMPLE_GVL_WAIT, td->thread_seq);
779
+ RPERF_SAMPLE_GVL_WAIT, td->thread_seq, td->label_set_id);
767
780
  }
768
781
 
769
782
  rperf_try_swap(prof);
@@ -781,9 +794,8 @@ skip_gvl:
781
794
  }
782
795
 
783
796
  static void
784
- rperf_handle_exited(rperf_profiler_t *prof, VALUE thread)
797
+ rperf_handle_exited(rperf_profiler_t *prof, VALUE thread, rperf_thread_data_t *td)
785
798
  {
786
- rperf_thread_data_t *td = (rperf_thread_data_t *)rb_internal_thread_specific_get(thread, prof->ts_key);
787
799
  if (td) {
788
800
  free(td);
789
801
  rb_internal_thread_specific_set(thread, prof->ts_key, NULL);
@@ -797,15 +809,16 @@ rperf_thread_event_hook(rb_event_flag_t event, const rb_internal_thread_event_da
797
809
  if (!prof->running) return;
798
810
 
799
811
  VALUE thread = data->thread;
812
+ rperf_thread_data_t *td = (rperf_thread_data_t *)rb_internal_thread_specific_get(thread, prof->ts_key);
800
813
 
801
814
  if (event & RUBY_INTERNAL_THREAD_EVENT_SUSPENDED)
802
- rperf_handle_suspended(prof, thread);
815
+ rperf_handle_suspended(prof, thread, td);
803
816
  else if (event & RUBY_INTERNAL_THREAD_EVENT_READY)
804
- rperf_handle_ready(prof, thread);
817
+ rperf_handle_ready(td);
805
818
  else if (event & RUBY_INTERNAL_THREAD_EVENT_RESUMED)
806
- rperf_handle_resumed(prof, thread);
819
+ rperf_handle_resumed(prof, thread, td);
807
820
  else if (event & RUBY_INTERNAL_THREAD_EVENT_EXITED)
808
- rperf_handle_exited(prof, thread);
821
+ rperf_handle_exited(prof, thread, td);
809
822
  }
810
823
 
811
824
  /* ---- GC event hook ---- */
@@ -826,13 +839,14 @@ rperf_gc_event_hook(rb_event_flag_t event, VALUE data, VALUE self, ID id, VALUE
826
839
  prof->gc.phase = RPERF_GC_NONE;
827
840
  }
828
841
  else if (event & RUBY_INTERNAL_EVENT_GC_ENTER) {
829
- /* Save timestamp and thread_seq; backtrace is captured at GC_EXIT
842
+ /* Save timestamp, thread_seq, and label_set_id; backtrace is captured at GC_EXIT
830
843
  * to avoid buffer mismatch after a double-buffer swap. */
831
844
  prof->gc.enter_ns = rperf_wall_time_ns();
832
845
  {
833
846
  VALUE thread = rb_thread_current();
834
847
  rperf_thread_data_t *td = (rperf_thread_data_t *)rb_internal_thread_specific_get(thread, prof->ts_key);
835
848
  prof->gc.thread_seq = td ? td->thread_seq : 0;
849
+ prof->gc.label_set_id = td ? td->label_set_id : 0;
836
850
  }
837
851
  }
838
852
  else if (event & RUBY_INTERNAL_EVENT_GC_EXIT) {
@@ -861,7 +875,7 @@ rperf_gc_event_hook(rb_event_flag_t event, VALUE data, VALUE self, ID id, VALUE
861
875
  }
862
876
  buf->frame_pool_count += depth;
863
877
 
864
- rperf_record_sample(prof, frame_start, depth, weight, type, prof->gc.thread_seq);
878
+ rperf_record_sample(prof, frame_start, depth, weight, type, prof->gc.thread_seq, prof->gc.label_set_id);
865
879
  prof->gc.enter_ns = 0;
866
880
  }
867
881
  }
@@ -908,7 +922,7 @@ rperf_sample_job(void *arg)
908
922
  if (depth <= 0) return;
909
923
  buf->frame_pool_count += depth;
910
924
 
911
- rperf_record_sample(prof, frame_start, depth, weight, RPERF_SAMPLE_NORMAL, td->thread_seq);
925
+ rperf_record_sample(prof, frame_start, depth, weight, RPERF_SAMPLE_NORMAL, td->thread_seq, td->label_set_id);
912
926
 
913
927
  clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts_end);
914
928
  prof->stats.sampling_count++;
@@ -1006,6 +1020,94 @@ rperf_resolve_frame(VALUE fval)
1006
1020
  return rb_ary_new3(2, path, label);
1007
1021
  }
1008
1022
 
1023
+ /* ---- Shared helpers for stop/snapshot ---- */
1024
+
1025
+ /* Flush pending sample buffers into agg_table.
1026
+ * Caller must ensure no concurrent access (worker joined or mutex held). */
1027
+ static void
1028
+ rperf_flush_buffers(rperf_profiler_t *prof)
1029
+ {
1030
+ int cur_idx = atomic_load_explicit(&prof->active_idx, memory_order_acquire);
1031
+ if (atomic_load_explicit(&prof->swap_ready, memory_order_acquire)) {
1032
+ int standby_idx = cur_idx ^ 1;
1033
+ rperf_aggregate_buffer(prof, &prof->buffers[standby_idx]);
1034
+ atomic_store_explicit(&prof->swap_ready, 0, memory_order_release);
1035
+ }
1036
+ rperf_aggregate_buffer(prof, &prof->buffers[cur_idx]);
1037
+ }
1038
+
1039
+ /* Build result hash from aggregated data (agg_table + frame_table).
1040
+ * Does NOT free any resources. Caller must hold GVL. */
1041
+ static VALUE
1042
+ rperf_build_aggregated_result(rperf_profiler_t *prof)
1043
+ {
1044
+ VALUE result, samples_ary;
1045
+ size_t i;
1046
+ int j;
1047
+
1048
+ result = rb_hash_new();
1049
+
1050
+ rb_hash_aset(result, ID2SYM(rb_intern("mode")),
1051
+ ID2SYM(rb_intern(prof->mode == 1 ? "wall" : "cpu")));
1052
+ rb_hash_aset(result, ID2SYM(rb_intern("frequency")), INT2NUM(prof->frequency));
1053
+ rb_hash_aset(result, ID2SYM(rb_intern("trigger_count")), SIZET2NUM(prof->stats.trigger_count));
1054
+ rb_hash_aset(result, ID2SYM(rb_intern("sampling_count")), SIZET2NUM(prof->stats.sampling_count));
1055
+ rb_hash_aset(result, ID2SYM(rb_intern("sampling_time_ns")), LONG2NUM(prof->stats.sampling_total_ns));
1056
+ rb_hash_aset(result, ID2SYM(rb_intern("detected_thread_count")), INT2NUM(prof->next_thread_seq));
1057
+ rb_hash_aset(result, ID2SYM(rb_intern("unique_frames")),
1058
+ SIZET2NUM(prof->frame_table.count - RPERF_SYNTHETIC_COUNT));
1059
+ rb_hash_aset(result, ID2SYM(rb_intern("unique_stacks")),
1060
+ SIZET2NUM(prof->agg_table.count));
1061
+
1062
+ {
1063
+ struct timespec now_monotonic;
1064
+ int64_t start_ns, duration_ns;
1065
+ clock_gettime(CLOCK_MONOTONIC, &now_monotonic);
1066
+ start_ns = (int64_t)prof->start_realtime.tv_sec * 1000000000LL
1067
+ + (int64_t)prof->start_realtime.tv_nsec;
1068
+ duration_ns = ((int64_t)now_monotonic.tv_sec - (int64_t)prof->start_monotonic.tv_sec) * 1000000000LL
1069
+ + ((int64_t)now_monotonic.tv_nsec - (int64_t)prof->start_monotonic.tv_nsec);
1070
+ rb_hash_aset(result, ID2SYM(rb_intern("start_time_ns")), LONG2NUM(start_ns));
1071
+ rb_hash_aset(result, ID2SYM(rb_intern("duration_ns")), LONG2NUM(duration_ns));
1072
+ }
1073
+
1074
+ {
1075
+ rperf_frame_table_t *ft = &prof->frame_table;
1076
+ VALUE resolved_ary = rb_ary_new_capa((long)ft->count);
1077
+ rb_ary_push(resolved_ary, rb_ary_new3(2, rb_str_new_lit("<GVL>"), rb_str_new_lit("[GVL blocked]")));
1078
+ rb_ary_push(resolved_ary, rb_ary_new3(2, rb_str_new_lit("<GVL>"), rb_str_new_lit("[GVL wait]")));
1079
+ rb_ary_push(resolved_ary, rb_ary_new3(2, rb_str_new_lit("<GC>"), rb_str_new_lit("[GC marking]")));
1080
+ rb_ary_push(resolved_ary, rb_ary_new3(2, rb_str_new_lit("<GC>"), rb_str_new_lit("[GC sweeping]")));
1081
+ for (i = RPERF_SYNTHETIC_COUNT; i < ft->count; i++) {
1082
+ rb_ary_push(resolved_ary, rperf_resolve_frame(atomic_load_explicit(&ft->keys, memory_order_relaxed)[i]));
1083
+ }
1084
+
1085
+ rperf_agg_table_t *at = &prof->agg_table;
1086
+ samples_ary = rb_ary_new();
1087
+ for (i = 0; i < at->bucket_capacity; i++) {
1088
+ rperf_agg_entry_t *e = &at->buckets[i];
1089
+ if (!e->used) continue;
1090
+
1091
+ VALUE frames = rb_ary_new_capa(e->depth);
1092
+ for (j = 0; j < e->depth; j++) {
1093
+ uint32_t fid = at->stack_pool[e->frame_start + j];
1094
+ rb_ary_push(frames, RARRAY_AREF(resolved_ary, fid));
1095
+ }
1096
+
1097
+ VALUE sample = rb_ary_new3(4, frames, LONG2NUM(e->weight), INT2NUM(e->thread_seq), INT2NUM(e->label_set_id));
1098
+ rb_ary_push(samples_ary, sample);
1099
+ }
1100
+ }
1101
+
1102
+ rb_hash_aset(result, ID2SYM(rb_intern("aggregated_samples")), samples_ary);
1103
+
1104
+ if (prof->label_sets != Qnil) {
1105
+ rb_hash_aset(result, ID2SYM(rb_intern("label_sets")), prof->label_sets);
1106
+ }
1107
+
1108
+ return result;
1109
+ }
1110
+
1009
1111
  /* ---- Ruby API ---- */
1010
1112
 
1011
1113
  /* _c_start(frequency, mode, aggregate, signal)
@@ -1038,6 +1140,7 @@ rb_rperf_start(VALUE self, VALUE vfreq, VALUE vmode, VALUE vagg, VALUE vsig)
1038
1140
  g_profiler.stats.trigger_count = 0;
1039
1141
  atomic_store_explicit(&g_profiler.active_idx, 0, memory_order_relaxed);
1040
1142
  atomic_store_explicit(&g_profiler.swap_ready, 0, memory_order_relaxed);
1143
+ g_profiler.label_sets = Qnil;
1041
1144
 
1042
1145
  /* Initialize worker mutex/cond */
1043
1146
  CHECKED(pthread_mutex_init(&g_profiler.worker_mutex, NULL));
@@ -1259,15 +1362,8 @@ rb_rperf_stop(VALUE self)
1259
1362
  rb_remove_event_hook(rperf_gc_event_hook);
1260
1363
 
1261
1364
  if (g_profiler.aggregate) {
1262
- /* Worker thread is joined; no concurrent access to these atomics. */
1263
- int cur_idx = atomic_load_explicit(&g_profiler.active_idx, memory_order_relaxed);
1264
- /* Aggregate remaining samples from both buffers */
1265
- if (atomic_load_explicit(&g_profiler.swap_ready, memory_order_relaxed)) {
1266
- int standby_idx = cur_idx ^ 1;
1267
- rperf_aggregate_buffer(&g_profiler, &g_profiler.buffers[standby_idx]);
1268
- atomic_store_explicit(&g_profiler.swap_ready, 0, memory_order_relaxed);
1269
- }
1270
- rperf_aggregate_buffer(&g_profiler, &g_profiler.buffers[cur_idx]);
1365
+ /* Worker thread is joined; no concurrent access. */
1366
+ rperf_flush_buffers(&g_profiler);
1271
1367
  }
1272
1368
 
1273
1369
  /* Clean up thread-specific data for all live threads */
@@ -1285,73 +1381,8 @@ rb_rperf_stop(VALUE self)
1285
1381
  }
1286
1382
  }
1287
1383
 
1288
- /* Build result hash */
1289
- result = rb_hash_new();
1290
-
1291
- /* mode */
1292
- rb_hash_aset(result, ID2SYM(rb_intern("mode")),
1293
- ID2SYM(rb_intern(g_profiler.mode == 1 ? "wall" : "cpu")));
1294
-
1295
- /* frequency */
1296
- rb_hash_aset(result, ID2SYM(rb_intern("frequency")), INT2NUM(g_profiler.frequency));
1297
-
1298
- /* trigger_count, sampling_count, sampling_time_ns, detected_thread_count */
1299
- rb_hash_aset(result, ID2SYM(rb_intern("trigger_count")), SIZET2NUM(g_profiler.stats.trigger_count));
1300
- rb_hash_aset(result, ID2SYM(rb_intern("sampling_count")), SIZET2NUM(g_profiler.stats.sampling_count));
1301
- rb_hash_aset(result, ID2SYM(rb_intern("sampling_time_ns")), LONG2NUM(g_profiler.stats.sampling_total_ns));
1302
- rb_hash_aset(result, ID2SYM(rb_intern("detected_thread_count")), INT2NUM(g_profiler.next_thread_seq));
1303
-
1304
- /* aggregation stats */
1305
1384
  if (g_profiler.aggregate) {
1306
- rb_hash_aset(result, ID2SYM(rb_intern("unique_frames")),
1307
- SIZET2NUM(g_profiler.frame_table.count - RPERF_SYNTHETIC_COUNT));
1308
- rb_hash_aset(result, ID2SYM(rb_intern("unique_stacks")),
1309
- SIZET2NUM(g_profiler.agg_table.count));
1310
- }
1311
-
1312
- /* start_time_ns (CLOCK_REALTIME epoch nanos), duration_ns (CLOCK_MONOTONIC delta) */
1313
- {
1314
- struct timespec stop_monotonic;
1315
- int64_t start_ns, duration_ns;
1316
- clock_gettime(CLOCK_MONOTONIC, &stop_monotonic);
1317
- start_ns = (int64_t)g_profiler.start_realtime.tv_sec * 1000000000LL
1318
- + (int64_t)g_profiler.start_realtime.tv_nsec;
1319
- duration_ns = ((int64_t)stop_monotonic.tv_sec - (int64_t)g_profiler.start_monotonic.tv_sec) * 1000000000LL
1320
- + ((int64_t)stop_monotonic.tv_nsec - (int64_t)g_profiler.start_monotonic.tv_nsec);
1321
- rb_hash_aset(result, ID2SYM(rb_intern("start_time_ns")), LONG2NUM(start_ns));
1322
- rb_hash_aset(result, ID2SYM(rb_intern("duration_ns")), LONG2NUM(duration_ns));
1323
- }
1324
-
1325
- if (g_profiler.aggregate) {
1326
- /* Build samples from aggregation table.
1327
- * Use a Ruby array for resolved frames so GC protects them. */
1328
- rperf_frame_table_t *ft = &g_profiler.frame_table;
1329
- VALUE resolved_ary = rb_ary_new_capa((long)ft->count);
1330
- /* Synthetic frames */
1331
- rb_ary_push(resolved_ary, rb_ary_new3(2, rb_str_new_lit("<GVL>"), rb_str_new_lit("[GVL blocked]")));
1332
- rb_ary_push(resolved_ary, rb_ary_new3(2, rb_str_new_lit("<GVL>"), rb_str_new_lit("[GVL wait]")));
1333
- rb_ary_push(resolved_ary, rb_ary_new3(2, rb_str_new_lit("<GC>"), rb_str_new_lit("[GC marking]")));
1334
- rb_ary_push(resolved_ary, rb_ary_new3(2, rb_str_new_lit("<GC>"), rb_str_new_lit("[GC sweeping]")));
1335
- /* Real frames */
1336
- for (i = RPERF_SYNTHETIC_COUNT; i < ft->count; i++) {
1337
- rb_ary_push(resolved_ary, rperf_resolve_frame(atomic_load_explicit(&ft->keys, memory_order_relaxed)[i]));
1338
- }
1339
-
1340
- rperf_agg_table_t *at = &g_profiler.agg_table;
1341
- samples_ary = rb_ary_new();
1342
- for (i = 0; i < at->bucket_capacity; i++) {
1343
- rperf_agg_entry_t *e = &at->buckets[i];
1344
- if (!e->used) continue;
1345
-
1346
- VALUE frames = rb_ary_new_capa(e->depth);
1347
- for (j = 0; j < e->depth; j++) {
1348
- uint32_t fid = at->stack_pool[e->frame_start + j];
1349
- rb_ary_push(frames, RARRAY_AREF(resolved_ary, fid));
1350
- }
1351
-
1352
- VALUE sample = rb_ary_new3(3, frames, LONG2NUM(e->weight), INT2NUM(e->thread_seq));
1353
- rb_ary_push(samples_ary, sample);
1354
- }
1385
+ result = rperf_build_aggregated_result(&g_profiler);
1355
1386
 
1356
1387
  rperf_sample_buffer_free(&g_profiler.buffers[1]);
1357
1388
  rperf_frame_table_free(&g_profiler.frame_table);
@@ -1359,6 +1390,27 @@ rb_rperf_stop(VALUE self)
1359
1390
  } else {
1360
1391
  /* Raw samples path (aggregate: false) */
1361
1392
  rperf_sample_buffer_t *buf = &g_profiler.buffers[0];
1393
+
1394
+ result = rb_hash_new();
1395
+ rb_hash_aset(result, ID2SYM(rb_intern("mode")),
1396
+ ID2SYM(rb_intern(g_profiler.mode == 1 ? "wall" : "cpu")));
1397
+ rb_hash_aset(result, ID2SYM(rb_intern("frequency")), INT2NUM(g_profiler.frequency));
1398
+ rb_hash_aset(result, ID2SYM(rb_intern("trigger_count")), SIZET2NUM(g_profiler.stats.trigger_count));
1399
+ rb_hash_aset(result, ID2SYM(rb_intern("sampling_count")), SIZET2NUM(g_profiler.stats.sampling_count));
1400
+ rb_hash_aset(result, ID2SYM(rb_intern("sampling_time_ns")), LONG2NUM(g_profiler.stats.sampling_total_ns));
1401
+ rb_hash_aset(result, ID2SYM(rb_intern("detected_thread_count")), INT2NUM(g_profiler.next_thread_seq));
1402
+ {
1403
+ struct timespec stop_monotonic;
1404
+ int64_t start_ns, duration_ns;
1405
+ clock_gettime(CLOCK_MONOTONIC, &stop_monotonic);
1406
+ start_ns = (int64_t)g_profiler.start_realtime.tv_sec * 1000000000LL
1407
+ + (int64_t)g_profiler.start_realtime.tv_nsec;
1408
+ duration_ns = ((int64_t)stop_monotonic.tv_sec - (int64_t)g_profiler.start_monotonic.tv_sec) * 1000000000LL
1409
+ + ((int64_t)stop_monotonic.tv_nsec - (int64_t)g_profiler.start_monotonic.tv_nsec);
1410
+ rb_hash_aset(result, ID2SYM(rb_intern("start_time_ns")), LONG2NUM(start_ns));
1411
+ rb_hash_aset(result, ID2SYM(rb_intern("duration_ns")), LONG2NUM(duration_ns));
1412
+ }
1413
+
1362
1414
  samples_ary = rb_ary_new_capa((long)buf->sample_count);
1363
1415
  for (i = 0; i < buf->sample_count; i++) {
1364
1416
  rperf_sample_t *s = &buf->samples[i];
@@ -1384,13 +1436,14 @@ rb_rperf_stop(VALUE self)
1384
1436
  rb_ary_push(frames, rperf_resolve_frame(fval));
1385
1437
  }
1386
1438
 
1387
- VALUE sample = rb_ary_new3(3, frames, LONG2NUM(s->weight), INT2NUM(s->thread_seq));
1439
+ VALUE sample = rb_ary_new3(4, frames, LONG2NUM(s->weight), INT2NUM(s->thread_seq), INT2NUM(s->label_set_id));
1388
1440
  rb_ary_push(samples_ary, sample);
1389
1441
  }
1442
+ rb_hash_aset(result, ID2SYM(rb_intern("raw_samples")), samples_ary);
1443
+ if (g_profiler.label_sets != Qnil) {
1444
+ rb_hash_aset(result, ID2SYM(rb_intern("label_sets")), g_profiler.label_sets);
1445
+ }
1390
1446
  }
1391
- rb_hash_aset(result,
1392
- ID2SYM(rb_intern(g_profiler.aggregate ? "aggregated_samples" : "raw_samples")),
1393
- samples_ary);
1394
1447
 
1395
1448
  /* Cleanup */
1396
1449
  rperf_sample_buffer_free(&g_profiler.buffers[0]);
@@ -1398,6 +1451,113 @@ rb_rperf_stop(VALUE self)
1398
1451
  return result;
1399
1452
  }
1400
1453
 
1454
+ /* ---- Snapshot: read aggregated data without stopping ---- */
1455
+
1456
+ /* Clear aggregated data for the next interval.
1457
+ * Caller must hold GVL + worker_mutex.
1458
+ * Keeps allocations intact for reuse. Does NOT touch frame_table
1459
+ * (frame IDs must stay stable — dmark may be iterating keys outside GVL,
1460
+ * and existing threads reference frame IDs via their thread_data). */
1461
+ static void
1462
+ rperf_clear_aggregated_data(rperf_profiler_t *prof)
1463
+ {
1464
+ /* Clear agg_table entries (keep allocation) */
1465
+ memset(prof->agg_table.buckets, 0,
1466
+ prof->agg_table.bucket_capacity * sizeof(rperf_agg_entry_t));
1467
+ prof->agg_table.count = 0;
1468
+ prof->agg_table.stack_pool_count = 0;
1469
+
1470
+ /* Reset stats */
1471
+ prof->stats.trigger_count = 0;
1472
+ prof->stats.sampling_count = 0;
1473
+ prof->stats.sampling_total_ns = 0;
1474
+
1475
+ /* Reset start timestamps so next snapshot's duration_ns covers
1476
+ * only the period since this clear. */
1477
+ clock_gettime(CLOCK_REALTIME, &prof->start_realtime);
1478
+ clock_gettime(CLOCK_MONOTONIC, &prof->start_monotonic);
1479
+ }
1480
+
1481
+ static VALUE
1482
+ rb_rperf_snapshot(VALUE self, VALUE vclear)
1483
+ {
1484
+ VALUE result;
1485
+
1486
+ if (!g_profiler.running) {
1487
+ return Qnil;
1488
+ }
1489
+
1490
+ if (!g_profiler.aggregate) {
1491
+ rb_raise(rb_eRuntimeError, "snapshot requires aggregate mode (aggregate: true)");
1492
+ }
1493
+
1494
+ /* GVL is held → no postponed jobs fire → no new samples written.
1495
+ * Lock worker_mutex to pause worker thread's aggregation. */
1496
+ CHECKED(pthread_mutex_lock(&g_profiler.worker_mutex));
1497
+ rperf_flush_buffers(&g_profiler);
1498
+
1499
+ /* Build result while mutex is held. If clear is requested, we must
1500
+ * also clear under the same lock to avoid a window where the worker
1501
+ * could aggregate into the table between build and clear. */
1502
+ result = rperf_build_aggregated_result(&g_profiler);
1503
+
1504
+ if (RTEST(vclear)) {
1505
+ rperf_clear_aggregated_data(&g_profiler);
1506
+ }
1507
+
1508
+ CHECKED(pthread_mutex_unlock(&g_profiler.worker_mutex));
1509
+
1510
+ return result;
1511
+ }
1512
+
1513
+ /* ---- Label API ---- */
1514
+
1515
+ /* _c_set_label(label_set_id) — set current thread's label_set_id.
1516
+ * Called from Ruby with GVL held. */
1517
+ static VALUE
1518
+ rb_rperf_set_label(VALUE self, VALUE vid)
1519
+ {
1520
+ if (!g_profiler.running) return vid;
1521
+
1522
+ int label_set_id = NUM2INT(vid);
1523
+ VALUE thread = rb_thread_current();
1524
+ rperf_thread_data_t *td = (rperf_thread_data_t *)rb_internal_thread_specific_get(thread, g_profiler.ts_key);
1525
+ if (td == NULL) {
1526
+ td = rperf_thread_data_create(&g_profiler, thread);
1527
+ if (!td) rb_raise(rb_eNoMemError, "rperf: failed to allocate thread data");
1528
+ }
1529
+ td->label_set_id = label_set_id;
1530
+ return vid;
1531
+ }
1532
+
1533
+ /* _c_get_label() — get current thread's label_set_id.
1534
+ * Returns 0 if not profiling or thread not yet seen. */
1535
+ static VALUE
1536
+ rb_rperf_get_label(VALUE self)
1537
+ {
1538
+ if (!g_profiler.running) return INT2FIX(0);
1539
+
1540
+ VALUE thread = rb_thread_current();
1541
+ rperf_thread_data_t *td = (rperf_thread_data_t *)rb_internal_thread_specific_get(thread, g_profiler.ts_key);
1542
+ if (td == NULL) return INT2FIX(0);
1543
+ return INT2NUM(td->label_set_id);
1544
+ }
1545
+
1546
+ /* _c_set_label_sets(ary) — store label_sets Ruby Array for result building */
1547
+ static VALUE
1548
+ rb_rperf_set_label_sets(VALUE self, VALUE ary)
1549
+ {
1550
+ g_profiler.label_sets = ary;
1551
+ return ary;
1552
+ }
1553
+
1554
+ /* _c_get_label_sets() — get label_sets Ruby Array */
1555
+ static VALUE
1556
+ rb_rperf_get_label_sets(VALUE self)
1557
+ {
1558
+ return g_profiler.label_sets;
1559
+ }
1560
+
1401
1561
  /* ---- Fork safety ---- */
1402
1562
 
1403
1563
  static void
@@ -1459,8 +1619,14 @@ Init_rperf(void)
1459
1619
  VALUE mRperf = rb_define_module("Rperf");
1460
1620
  rb_define_module_function(mRperf, "_c_start", rb_rperf_start, 4);
1461
1621
  rb_define_module_function(mRperf, "_c_stop", rb_rperf_stop, 0);
1622
+ rb_define_module_function(mRperf, "_c_snapshot", rb_rperf_snapshot, 1);
1623
+ rb_define_module_function(mRperf, "_c_set_label", rb_rperf_set_label, 1);
1624
+ rb_define_module_function(mRperf, "_c_get_label", rb_rperf_get_label, 0);
1625
+ rb_define_module_function(mRperf, "_c_set_label_sets", rb_rperf_set_label_sets, 1);
1626
+ rb_define_module_function(mRperf, "_c_get_label_sets", rb_rperf_get_label_sets, 0);
1462
1627
 
1463
1628
  memset(&g_profiler, 0, sizeof(g_profiler));
1629
+ g_profiler.label_sets = Qnil;
1464
1630
  g_profiler.pj_handle = rb_postponed_job_preregister(0, rperf_sample_job, &g_profiler);
1465
1631
  g_profiler.ts_key = rb_internal_thread_specific_key_create();
1466
1632
 
@@ -0,0 +1,13 @@
1
+ require "rperf"
2
+
3
+ module Rperf::ActiveJobMiddleware
4
+ extend ActiveSupport::Concern
5
+
6
+ included do
7
+ around_perform do |job, block|
8
+ Rperf.label(job: job.class.name) do
9
+ block.call
10
+ end
11
+ end
12
+ end
13
+ end
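A possible way to wire this concern into a Rails app (the file path is hypothetical; assumes the usual `ApplicationJob` base class):

```ruby
# app/jobs/application_job.rb — hypothetical integration sketch
require "rperf/active_job"

class ApplicationJob < ActiveJob::Base
  # Labels every performed job with its class name via around_perform.
  include Rperf::ActiveJobMiddleware
end
```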
@@ -0,0 +1,15 @@
1
+ require "rperf"
2
+
3
+ class Rperf::Middleware
4
+ def initialize(app, label_key: :endpoint)
5
+ @app = app
6
+ @label_key = label_key
7
+ end
8
+
9
+ def call(env)
10
+ endpoint = "#{env["REQUEST_METHOD"]} #{env["PATH_INFO"]}"
11
+ Rperf.label(@label_key => endpoint) do
12
+ @app.call(env)
13
+ end
14
+ end
15
+ end
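The middleware's labeling behavior can be exercised without Rack or the gem by substituting a stub for `Rperf.label` (`FakeRperf` and `EndpointLabeler` below are hypothetical stand-ins mirroring the class above):

```ruby
# Self-contained sketch of the Rack middleware's behavior, using a
# stub in place of Rperf.label so it runs without the gem.
module FakeRperf
  RECORDED = []

  def self.label(**kw)
    RECORDED << kw   # record what the middleware would label
    yield
  end
end

class EndpointLabeler
  def initialize(app, label_key: :endpoint)
    @app = app
    @label_key = label_key
  end

  def call(env)
    endpoint = "#{env["REQUEST_METHOD"]} #{env["PATH_INFO"]}"
    FakeRperf.label(@label_key => endpoint) { @app.call(env) }
  end
end

app = ->(env) { [200, {}, ["ok"]] }
status, _headers, body = EndpointLabeler.new(app).call(
  "REQUEST_METHOD" => "GET", "PATH_INFO" => "/users"
)
# FakeRperf::RECORDED → [{ endpoint: "GET /users" }]
```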
@@ -0,0 +1,9 @@
1
+ require "rperf"
2
+
3
+ class Rperf::SidekiqMiddleware
4
+ def call(_worker, job, _queue)
5
+ Rperf.label(job: job["class"]) do
6
+ yield
7
+ end
8
+ end
9
+ end
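A possible server-side registration, using the standard Sidekiq middleware API (the initializer path is an assumption):

```ruby
# config/initializers/sidekiq.rb — hypothetical integration sketch
require "rperf/sidekiq"

Sidekiq.configure_server do |config|
  config.server_middleware.add Rperf::SidekiqMiddleware
end
```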
data/lib/rperf/version.rb CHANGED
@@ -1,3 +1,3 @@
1
1
  module Rperf
2
- VERSION = "0.5.0"
2
+ VERSION = "0.6.0"
3
3
  end
data/lib/rperf.rb CHANGED
@@ -1,4 +1,4 @@
1
- require "rperf/version"
1
+ require_relative "rperf/version"
2
2
  require "zlib"
3
3
  require "stringio"
4
4
 
@@ -42,6 +42,8 @@ module Rperf
42
42
  @format = format
43
43
  @stat = stat
44
44
  @stat_start_mono = Process.clock_gettime(Process::CLOCK_MONOTONIC) if @stat
45
+ @label_set_table = nil
46
+ @label_set_index = nil
45
47
  _c_start(frequency, c_mode, aggregate, c_signal)
46
48
 
47
49
  if block_given?
@@ -61,15 +63,15 @@ module Rperf
61
63
  # :aggregated_samples. Build aggregated view so encoders always work.
62
64
  if data[:raw_samples] && !data[:aggregated_samples]
63
65
  merged = {}
64
- data[:raw_samples].each do |frames, weight, thread_seq|
65
- key = [frames, thread_seq || 0]
66
+ data[:raw_samples].each do |frames, weight, thread_seq, label_set_id|
67
+ key = [frames, thread_seq || 0, label_set_id || 0]
66
68
  if merged.key?(key)
67
69
  merged[key] += weight
68
70
  else
69
71
  merged[key] = weight
70
72
  end
71
73
  end
72
- data[:aggregated_samples] = merged.map { |(frames, ts), w| [frames, w, ts] }
74
+ data[:aggregated_samples] = merged.map { |(frames, ts, lsi), w| [frames, w, ts, lsi] }
73
75
  end
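The fallback merge above can be exercised in isolation. A minimal sketch with made-up sample tuples, using the `[frames, weight, thread_seq, label_set_id]` shape produced by the C extension:

```ruby
# Merging raw samples into aggregated samples, keyed by
# [frames, thread_seq, label_set_id].
raw_samples = [
  [["a;b"], 10, 0, 0],
  [["a;b"], 5,  0, 0],   # same stack/thread/label → weights add
  [["a;b"], 7,  0, 1],   # different label_set_id → separate bucket
  [["a;c"], 3,  1, 0],
]

merged = {}
raw_samples.each do |frames, weight, thread_seq, label_set_id|
  key = [frames, thread_seq || 0, label_set_id || 0]
  merged[key] = (merged[key] || 0) + weight
end
aggregated = merged.map { |(frames, ts, lsi), w| [frames, w, ts, lsi] }
# → [[["a;b"], 15, 0, 0], [["a;b"], 7, 0, 1], [["a;c"], 3, 1, 0]]
```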
74
76
 
75
77
  print_stats(data) if @verbose
@@ -84,6 +86,77 @@ module Rperf
84
86
  data
85
87
  end
86
88
 
89
+ # Returns a snapshot of the current profiling data without stopping.
90
+ # Only works in aggregate mode (the default). Returns nil if not profiling.
91
+ # The returned data has the same format as stop's return value and can be
92
+ # passed to save(), PProf.encode(), Collapsed.encode(), or Text.encode().
93
+ #
94
+ # +clear:+ if true, resets aggregated data after taking the snapshot.
95
+ # This allows interval-based profiling where each snapshot covers only
96
+ # the period since the last clear.
97
+ def self.snapshot(clear: false)
98
+ _c_snapshot(clear)
99
+ end
100
+
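The `clear:` contract can be illustrated with a stub (`StubProfiler` is a hypothetical stand-in, not the real extension): with `clear: true`, each snapshot covers only the interval since the previous clear.

```ruby
# Simulates snapshot(clear:) interval semantics with a stub profiler.
class StubProfiler
  def initialize
    @weights = Hash.new(0)
  end

  def record(stack, weight)
    @weights[stack] += weight
  end

  # Returns accumulated data; with clear: true, resets so the next
  # snapshot covers only the following interval.
  def snapshot(clear: false)
    data = @weights.dup
    @weights.clear if clear
    data
  end
end

prof = StubProfiler.new
prof.record("main;work", 10)
first = prof.snapshot(clear: true)   # covers the first interval
prof.record("main;work", 3)
second = prof.snapshot(clear: true)  # covers only the second interval
# first  → {"main;work" => 10}
# second → {"main;work" => 3}
```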
101
+ # Label set management for per-context profiling.
102
+ # Label sets are stored as an Array of Hashes, indexed by label_set_id.
103
+ # Index 0 is reserved (no labels).
104
+
105
+ @label_set_table = nil # Array of frozen Hash
106
+ @label_set_index = nil # Hash → id (for dedup)
107
+
108
+ def self._init_label_sets
109
+ @label_set_table = [{}] # id 0 = no labels
110
+ @label_set_index = { {} => 0 }
111
+ end
112
+
113
+ def self._intern_label_set(hash)
114
+ frozen = hash.frozen? ? hash : hash.freeze
115
+ @label_set_index[frozen] ||= begin
116
+ id = @label_set_table.size
117
+ @label_set_table << frozen
118
+ _c_set_label_sets(@label_set_table)
119
+ id
120
+ end
121
+ end
122
+
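The interning scheme can be sketched standalone — the same dedup logic as `_intern_label_set`, minus the call into the C extension (the label values are made up):

```ruby
# Label-set interning: identical hashes deduplicate to one id,
# with id 0 reserved for "no labels".
table = [{}]            # id 0 = no labels
index = { {} => 0 }

intern = lambda do |hash|
  frozen = hash.frozen? ? hash : hash.freeze
  index[frozen] ||= begin
    id = table.size
    table << frozen
    id
  end
end

a = intern.call(job: "MailerJob")
b = intern.call(job: "MailerJob")   # same content → same id
c = intern.call(job: "ReportJob")
# a == 1, b == 1, c == 2
```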
123
+ # Sets labels on the current thread for profiling annotation.
124
+ # With a block: restores previous labels when the block exits.
125
+ # Without a block: sets labels persistently on the current thread.
126
+ # Labels are key-value pairs written into pprof sample labels.
127
+ #
128
+ # Rperf.label(request: "abc") { handle_request }
129
+ # Rperf.label(request: "abc") # persistent set
130
+ #
131
+ # Values of nil remove that key. Existing labels are merged.
132
+ def self.label(**kw, &block)
133
+ _init_label_sets unless @label_set_table
134
+
135
+ cur_id = _c_get_label
136
+ cur_labels = @label_set_table[cur_id] || {}
137
+
138
+ new_labels = cur_labels.merge(kw).reject { |_, v| v.nil? }
139
+ new_id = _intern_label_set(new_labels)
140
+ _c_set_label(new_id)
141
+
142
+ if block
143
+ begin
144
+ yield
145
+ ensure
146
+ _c_set_label(cur_id)
147
+ end
148
+ end
149
+ end
150
+
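A minimal sketch of the merge rules documented above — a nil value removes a key, other values overwrite or add (the label names here are made up):

```ruby
# Merge semantics of Rperf.label, isolated from the C extension:
# kw merges over the current labels, then nil values drop their keys.
cur_labels = { request: "abc", user: "u1" }
kw = { user: nil, stage: "render" }

new_labels = cur_labels.merge(kw).reject { |_, v| v.nil? }
# → { request: "abc", stage: "render" }
```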
151
+ # Returns the current thread's labels as a Hash.
152
+ # Returns an empty Hash if no labels are set or profiling is not running.
153
+ def self.labels
154
+ return {} unless @label_set_table
155
+ cur_id = _c_get_label
156
+ @label_set_table[cur_id] || {}
157
+ end
158
+
159
+
87
160
  # Saves profiling data to a file.
88
161
  # format: :pprof, :collapsed, or :text. nil = auto-detect from path extension
89
162
  # .collapsed → collapsed stacks (FlameGraph / speedscope compatible)
@@ -498,17 +571,30 @@ module Rperf
498
571
  end
499
572
  }
500
573
 
501
- # Convert string frames to index frames and merge identical stacks per thread
574
+ # Convert string frames to index frames and merge identical stacks per thread/label
502
575
  merged = Hash.new(0)
503
576
  thread_seq_key = intern.("thread_seq")
504
- samples_raw.each do |frames, weight, thread_seq|
505
- key = [frames.map { |path, label| [intern.(path), intern.(label)] }, thread_seq || 0]
577
+ label_sets = data[:label_sets] # Array of Hash (may be nil)
578
+ samples_raw.each do |frames, weight, thread_seq, label_set_id|
579
+ key = [frames.map { |path, label| [intern.(path), intern.(label)] }, thread_seq || 0, label_set_id || 0]
506
580
  merged[key] += weight
507
581
  end
508
582
  merged = merged.to_a
509
583
 
584
+ # Intern label set keys/values for pprof labels
585
+ label_key_indices = {} # String key → string_table index
586
+ if label_sets
587
+ label_sets.each do |ls|
588
+ ls.each do |k, v|
589
+ sk = k.to_s
590
+ label_key_indices[sk] ||= intern.(sk)
591
+ intern.(v.to_s) # ensure value is interned
592
+ end
593
+ end
594
+ end
595
+
510
596
  # Build location/function tables
511
- locations, functions = build_tables(merged.map { |(frames, _), w| [frames, w] })
597
+ locations, functions = build_tables(merged.map { |(frames, _, _), w| [frames, w] })
512
598
 
513
599
  # Intern type label and unit
514
600
  type_label = mode == :wall ? "wall" : "cpu"
@@ -521,8 +607,8 @@ module Rperf
521
607
  # field 1: sample_type (repeated ValueType)
522
608
  buf << encode_message(1, encode_value_type(type_idx, ns_idx))
523
609
 
524
- # field 2: sample (repeated Sample) with thread_seq label
525
- merged.each do |(frames, thread_seq), weight|
610
+ # field 2: sample (repeated Sample) with thread_seq + user labels
611
+ merged.each do |(frames, thread_seq, label_set_id), weight|
526
612
  sample_buf = "".b
527
613
  loc_ids = frames.map { |f| locations[f] }
528
614
  sample_buf << encode_packed_uint64(1, loc_ids)
@@ -533,6 +619,17 @@ module Rperf
533
619
  label_buf << encode_int64(3, thread_seq) # num
534
620
  sample_buf << encode_message(3, label_buf)
535
621
  end
622
+ if label_sets && label_set_id && label_set_id > 0
623
+ ls = label_sets[label_set_id]
624
+ if ls
625
+ ls.each do |k, v|
626
+ label_buf = "".b
627
+ label_buf << encode_int64(1, label_key_indices[k.to_s]) # key
628
+ label_buf << encode_int64(2, string_index[v.to_s]) # str
629
+ sample_buf << encode_message(3, label_buf)
630
+ end
631
+ end
632
+ end
536
633
  buf << encode_message(2, sample_buf)
537
634
  end
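The label encoding above relies on standard protobuf varints. A minimal standalone sketch, with helpers mirroring the `encode_int64`/`encode_message` names used in this file (their real definitions are elsewhere in the encoder; indexes 5 and 9 are made-up string-table positions):

```ruby
# Protobuf wire encoding for a pprof Label: key = field 1,
# str = field 2, both varints holding string-table indexes.
def encode_varint(n)
  bytes = "".b
  loop do
    b = n & 0x7f
    n >>= 7
    if n.zero?
      bytes << b
      return bytes
    end
    bytes << (b | 0x80)   # continuation bit set on all but the last byte
  end
end

# field_number with wire type 0 (varint)
def encode_int64(field, value)
  encode_varint((field << 3) | 0) + encode_varint(value)
end

# field_number with wire type 2 (length-delimited submessage)
def encode_message(field, payload)
  encode_varint((field << 3) | 2) + encode_varint(payload.bytesize) + payload
end

# Label { key: string_table[5], str: string_table[9] }, nested into
# a Sample as field 3:
label_buf = encode_int64(1, 5) + encode_int64(2, 9)
sample_fragment = encode_message(3, label_buf)
# bytes: 0x1a 0x04 0x08 0x05 0x10 0x09
```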
538
635
 
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: rperf
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.5.0
4
+ version: 0.6.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Koichi Sasada
@@ -52,6 +52,9 @@ files:
52
52
  - ext/rperf/extconf.rb
53
53
  - ext/rperf/rperf.c
54
54
  - lib/rperf.rb
55
+ - lib/rperf/active_job.rb
56
+ - lib/rperf/middleware.rb
57
+ - lib/rperf/sidekiq.rb
55
58
  - lib/rperf/version.rb
56
59
  homepage: https://github.com/ko1/rperf
57
60
  licenses: