iriq 0.0.1 → 0.1.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: d1f6ebe248ed57192adc8f5c9600dfe14d9d0c85160dd5d6588d7fbfd7995e72
4
- data.tar.gz: d80356a646effb078ebe78b62417534746ded550fd4145faa53998aa35866bde
3
+ metadata.gz: 95d6bc09f7de65bcb4acc5db3ad68d1b83c326360d913c45aede66038a100461
4
+ data.tar.gz: ae58d39b77fce3041cc5561b575dea52706b38f4cfab11d4881cf2f644f6cc59
5
5
  SHA512:
6
- metadata.gz: d355eef90433cef5cd807d9c68d6259ad49d79dd87d842770fded377a535ed39e4d24228594b24ce96015f8c9e7246bdc96d7bcfa58f3cdf3350bb389bff211f
7
- data.tar.gz: b0e1ddf1c6bcebadbe2578fe0760076a8401fd4fc2cd9a5f345b651371cf2fb47ebaf302565df61b05a389e0253f3febdbccc4e2389cf7d2e810abc8e4156a40
6
+ metadata.gz: a8fa85f112d9766ff4e9e4ad60c1043ce9701fe42e6dbd01f70da39fa0dd554bff573f40c858d88ab1e1539de0c1f017609535cc57e8e4fcc6651f0d20697e60
7
+ data.tar.gz: 97c714e664874c08278a305b8beff470b68bd2b7e4f58977cc44cf2e3716e619d17011a33d77b8ea9356eec0715acb6c815252551edf64db16eec4599f816274
data/CHANGELOG.md CHANGED
@@ -1,2 +1,18 @@
1
+ ### 0.1.0 (2026-05-24)
2
+ - CLI: auto-detect file argument, retire --extract flag
3
+ - CLI: section flags work in pipe mode + clean up help text
4
+ - script/memory.rb — track retained memory + cache footprints
5
+ - Perf: classifier + inflector memoization, singleton classifier, combined extractor regex
6
+ - Perf: derive SegmentHints once per Corpus.observe (~2x faster)
7
+ - script/benchmark.rb — measure the main hot paths
8
+ - README: replace fabricated example numbers with real fixture output
9
+ - Pipe mode: extraction by default, auto-switch to cluster view at scale
10
+ - Iriq::Extractor — pull IRIs out of free text
11
+ - E2E spec: pipe IriGenerator stream through real iriq binary
12
+ - IriGenerator fixture + popular-outlier heuristic
13
+ - CLI --corpus persistence, pipe batch mode, --stats, E2E specs
14
+ - Streaming Corpus with rolling stats and learning
15
+ - RESTful hints, flag-based CLI, swappable inflector
16
+
1
17
  ### 0.0.1 (2026-05-24)
2
18
  - prototype
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
1
1
  PATH
2
2
  remote: .
3
3
  specs:
4
- iriq (0.0.1)
4
+ iriq (0.1.0)
5
5
 
6
6
  GEM
7
7
  remote: https://rubygems.org/
@@ -74,7 +74,7 @@ CHECKSUMS
74
74
  erb (6.0.4) sha256=38e3803694be357fe2bfe312487c74beaf9fb4e5beb3e22498952fe1645b95d9
75
75
  io-console (0.8.2) sha256=d6e3ae7a7cc7574f4b8893b4fca2162e57a825b223a177b7afa236c5ef9814cc
76
76
  irb (1.17.0) sha256=168c4ddb93d8a361a045c41d92b2952c7a118fa73f23fe14e55609eb7a863aae
77
- iriq (0.0.1)
77
+ iriq (0.1.0)
78
78
  pp (0.6.3) sha256=2951d514450b93ccfeb1df7d021cae0da16e0a7f95ee1e2273719669d0ab9df6
79
79
  prettyprint (0.2.0) sha256=2bc9e15581a94742064a3cc8b0fb9d45aae3d03a1baa6ef80922627a0766f193
80
80
  prism (1.9.0) sha256=7b530c6a9f92c24300014919c9dcbc055bf4cdf51ec30aed099b06cd6674ef85
data/README.md CHANGED
@@ -23,17 +23,40 @@ iri.path_segments # => ["users", "123"]
23
23
  iri.canonical # => "https://foo.com/users/123"
24
24
 
25
25
  Iriq.normalize("https://foo.com/users/123")
26
- # => "https://foo.com/users/{integer_id}"
26
+ # => "https://foo.com/users/{user_id}"
27
27
 
28
28
  Iriq.explain("https://foo.com/users/123/orders/456")
29
29
  # => [
30
- # { value: "users", type: :literal, variable: false },
31
- # { value: "123", type: :integer_id, variable: true },
32
- # { value: "orders", type: :literal, variable: false },
33
- # { value: "456", type: :integer_id, variable: true },
30
+ # { value: "users", type: :literal, variable: false, hint: nil },
31
+ # { value: "123", type: :integer_id, variable: true, hint: "user_id" },
32
+ # { value: "orders", type: :literal, variable: false, hint: nil },
33
+ # { value: "456", type: :integer_id, variable: true, hint: "order_id" },
34
34
  # ]
35
35
  ```
36
36
 
37
+ Pass `hints: false` to `Iriq.normalize` (or `PathShape`) for mechanical
38
+ placeholders (`{integer_id}` instead of `{user_id}`).
39
+
40
+ ## RESTful hints
41
+
42
+ When a variable segment follows a literal one, Iriq derives a hint by
43
+ singularizing the literal and suffixing `_id` (or `_uuid` for UUIDs). This is
44
+ what produces `{user_id}` from `/users/123` and `{order_id}` from
45
+ `/orders/456`. Singularization uses `Iriq::Inflector`, which delegates to a
46
+ swappable adapter:
47
+
48
+ ```ruby
49
+ # Default: ActiveSupport::Inflector if `active_support/inflector` is loadable,
50
+ # otherwise a built-in adapter with rules adapted from ActiveSupport.
51
+
52
+ Iriq::Inflector.singularize("categories") # => "category"
53
+ Iriq::Inflector.singularize("people") # => "person"
54
+
55
+ # Override:
56
+ Iriq::Inflector.adapter = MyAdapter # must respond to .singularize(String)
57
+ Iriq::Inflector.reset_adapter!
58
+ ```
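
The hint rule described above (singularize the preceding literal, then append `_id` or `_uuid`) can be sketched in a few lines. This is an illustrative approximation, not Iriq's actual code: `derive_hints`, `naive_singularize`, and both regexes are hypothetical names, and the toy singularizer only stands in for `Iriq::Inflector`.

```ruby
# Hypothetical sketch of RESTful hint derivation; not Iriq's implementation.
UUID_RE    = /\A\h{8}-\h{4}-\h{4}-\h{4}-\h{12}\z/
INTEGER_RE = /\A\d+\z/

# Toy singularizer (assumption): handles only "-ies" and a plain trailing "s".
def naive_singularize(word)
  case word
  when /ies\z/      then word.sub(/ies\z/, "y")
  when /(?<!s)s\z/  then word.sub(/s\z/, "")
  else word
  end
end

# Returns one hint (or nil) per segment. A segment gets a hint only when it
# looks variable and the segment before it is a literal.
def derive_hints(segments)
  segments.each_with_index.map do |segment, index|
    suffix =
      if segment.match?(UUID_RE) then "_uuid"
      elsif segment.match?(INTEGER_RE) then "_id"
      end
    previous = index.positive? ? segments[index - 1] : nil
    next nil if suffix.nil? || previous.nil?
    next nil if previous.match?(UUID_RE) || previous.match?(INTEGER_RE)

    "#{naive_singularize(previous)}#{suffix}"
  end
end

derive_hints(%w[users 123 orders 456])
# => [nil, "user_id", nil, "order_id"]
```

A variable segment with no literal neighbor (for example a leading `123`) gets no hint in this sketch; how the gem treats that case is not stated in the README.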
59
+
37
60
  ## Supported inputs
38
61
 
39
62
  | Input | Notes |
@@ -69,7 +92,7 @@ clusterer.add("https://foo.com/users/456")
69
92
  clusterer.add("https://foo.com/users/789/orders/1")
70
93
 
71
94
  clusterer.clusters.map(&:shape)
72
- # => ["/users/{integer_id}", "/users/{integer_id}/orders/{integer_id}"]
95
+ # => ["/users/{user_id}", "/users/{user_id}/orders/{order_id}"]
73
96
 
74
97
  clusterer.clusters.first.segment_stats
75
98
  # => [
@@ -79,8 +102,8 @@ clusterer.clusters.first.segment_stats
79
102
 
80
103
  clusterer.explain("https://foo.com/users/999")
81
104
  # => [
82
- # { value: "users", type: :literal, variable: false, stable: true },
83
- # { value: "999", type: :integer_id, variable: true, stable: false },
105
+ # { value: "users", type: :literal, variable: false, hint: nil, stable: true },
106
+ # { value: "999", type: :integer_id, variable: true, hint: "user_id", stable: false },
84
107
  # ]
85
108
  ```
86
109
 
@@ -89,6 +112,104 @@ a position the classifier *would* call variable but that is empirically
89
112
  constant across all members of the cluster will be reported with
90
113
  `stable: true, variable: false`.
91
114
 
115
+ ## Corpus (streaming + learning)
116
+
117
+ For processing many identifiers — possibly an unbounded stream — use
118
+ `Iriq::Corpus`. It maintains rolling aggregates and per-(host, prefix)
119
+ frequency stats so classification improves as more data comes in.
120
+
121
+ ```ruby
122
+ corpus = Iriq::Corpus.new
123
+
124
+ iris.each do |iri|
125
+ obs = corpus.observe(iri)
126
+ obs.fingerprint # deterministic shape: "https://foo.com/users/{user_id}"
127
+ obs.cluster # the Iriq::Cluster this fell into
128
+ obs.explanation # per-segment annotations with corpus-informed classification
129
+ end
130
+
131
+ corpus.host_counts # { "foo.com" => 1234, "bar.com" => 7 }
132
+ corpus.path_length_counts # { 2 => 800, 3 => 434 }
133
+ corpus.fingerprint_counts # shape → count
134
+ corpus.raw_shape_counts # hint-free shape → count
135
+ corpus.clusters # Iriq::Cluster instances
136
+ ```
137
+
138
+ ### Deterministic vs. corpus-informed normalization
139
+
140
+ ```ruby
141
+ Iriq.normalize("https://foo.com/users/me")
142
+ # => "https://foo.com/users/me" # mechanical: "me" is a literal
143
+
144
+ corpus.normalize("https://foo.com/users/me")
145
+ # => depends on what the corpus has seen
146
+ ```
147
+
148
+ If many `/users/{integer_id}` paths flow in alongside a handful of
149
+ `/users/me`, the cluster `/users/me` is preserved (mechanical clustering
150
+ keeps literal routes distinct). If many *distinct literal handles*
151
+ (`/users/alice`, `/users/bob`, `/users/carol`, ...) flow in, the corpus
152
+ promotes that position to a `{user}` placeholder:
153
+
154
+ ```ruby
155
+ %w[alice bob carol dave erin frank gina hank ivan jane].each do |name|
156
+ corpus.observe("https://foo.com/users/#{name}/profile")
157
+ end
158
+
159
+ corpus.normalize("https://foo.com/users/alice/profile")
160
+ # => "https://foo.com/users/{user}/profile"
161
+ ```
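
One way to picture the promotion rule is as an entropy check over the values seen at a position. The sketch below assumes Shannon entropy and a 3-bit threshold; the README only says "high entropy", so `entropy_bits`, `promote_to_variable?`, and `min_bits` are hypothetical and the real heuristic may differ.

```ruby
# Hypothetical sketch: decide whether a literal position behaves like a variable.

# Shannon entropy (in bits) of a value => count table.
def entropy_bits(value_counts)
  total = value_counts.values.sum.to_f
  return 0.0 if total.zero?

  value_counts.values.sum do |count|
    probability = count / total
    probability.zero? ? 0.0 : -probability * Math.log2(probability)
  end
end

# The threshold is an assumption chosen for illustration only.
def promote_to_variable?(value_counts, min_bits: 3.0)
  entropy_bits(value_counts) >= min_bits
end

handles = %w[alice bob carol dave erin frank gina hank ivan jane].tally
routes  = { "me" => 5, "settings" => 3 }

promote_to_variable?(handles) # ten distinct values, about 3.32 bits
promote_to_variable?(routes)  # two values, under 1 bit
```

Ten equally frequent handles clear the assumed threshold, while a couple of repeated route names do not, which matches the behavior the README describes for `/users/alice` versus `/users/me`.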
162
+
163
+ ### Explainability
164
+
165
+ Each row of `corpus.explain(...)` (and `observation.explanation`) carries a
166
+ `classification:` symbol on top of the deterministic fields:
167
+
168
+ | Classification | Meaning |
169
+ | --------------------------- | ---------------------------------------------------- |
170
+ | `:stable_literal` | Literal value dominates this position |
171
+ | `:variable_identifier` | Classifier said variable (uuid, integer, etc.) |
172
+ | `:rare_literal` | Literal seen here, but not dominant |
173
+ | `:corpus_inferred_variable` | Classifier said literal, but position has high entropy |
174
+ | `:ambiguous` | Insufficient signal — never seen, or mixed |
175
+
176
+ ## Extracting IRIs from text
177
+
178
+ `Iriq::Extractor` powers pipe mode in the CLI. It picks up explicit-
179
+ scheme URLs (`http`, `https`, `ftp`, `ws`, `wss`, `urn`) and `foo.com/path`-
180
+ style scheme-less URLs (small TLD allow-list, required path). It trims trailing
181
+ sentence punctuation iteratively and preserves balanced parens
182
+ (`https://en.wikipedia.org/wiki/Ruby_(programming_language)` stays intact;
183
+ `(see https://foo.com)` drops the outer paren).
184
+
185
+ ```ruby
186
+ Iriq.extract("Visit https://foo.com today, also hit foo.com/users.")
187
+ # => [#<Iriq::Identifier https://foo.com>,
188
+ # #<Iriq::Identifier https://foo.com/users>]
189
+
190
+ # Disable scheme-less:
191
+ Iriq::Extractor.new(scheme_less: false).extract("hit foo.com/users today")
192
+ # => []
193
+ ```
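
The iterative trimming with balanced-paren preservation can be sketched as a small loop. This is a guess at the technique, not the gem's code: `trim_candidate` and `TRAILING_PUNCTUATION` are hypothetical, and the exact punctuation set the extractor strips is an assumption.

```ruby
# Hypothetical sketch of trailing-punctuation trimming; not Iriq's implementation.
TRAILING_PUNCTUATION = ".,;:!?'\"".freeze # assumed set

def trim_candidate(candidate)
  url = candidate.dup
  loop do
    last = url[-1]
    break if last.nil?

    if TRAILING_PUNCTUATION.include?(last)
      url.chop!
    elsif last == ")" && url.count(")") > url.count("(")
      # An unmatched closing paren belongs to the surrounding prose.
      url.chop!
    else
      break
    end
  end
  url
end

trim_candidate("https://en.wikipedia.org/wiki/Ruby_(programming_language)")
# => "https://en.wikipedia.org/wiki/Ruby_(programming_language)"
trim_candidate("https://foo.com).")
# => "https://foo.com"
```

The loop repeats because stripping one character can expose another, as in `).` at the end of a parenthesized sentence.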
194
+
195
+ Known limitations (intentional):
196
+
197
+ - Comma is a URL boundary, so query strings like `?q=37.7,-122.4` truncate.
198
+ Trade-off picked to keep CSV-shaped text working.
199
+ - No HTML entity decoding (`&amp;` stays as-is).
200
+ - Scheme-less mode skips bare hostnames without a path (too noisy in prose).
201
+
202
+ ### Memory bounds
203
+
204
+ - Per-position `value_counts` is capped (`max_values_per_position`, default
205
+ 1000) — once full, `total` keeps growing but only existing keys count up.
206
+ - Cluster examples are capped at `Iriq::Cluster::MAX_EXAMPLES`.
207
+ - No raw IRI strings are retained outside the bounded cluster examples.
208
+
209
+ ```ruby
210
+ Iriq::Corpus.new(max_values_per_position: 200)
211
+ ```
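
The capping behavior in the first bullet (once full, `total` keeps growing but only existing keys count up) can be illustrated with a small counter. `CappedCounter` is a hypothetical stand-in written for this note; it is not `Iriq::PositionStats`, whose internals are not shown in this diff.

```ruby
# Hypothetical sketch of a capped frequency table.
class CappedCounter
  attr_reader :total, :value_counts

  def initialize(max_values: 1000)
    @max_values = max_values
    @value_counts = Hash.new(0)
    @total = 0
  end

  # Always counts the observation in total; only tracks the value itself if it
  # is already known or there is still room for a new key.
  def observe(value)
    @total += 1
    if @value_counts.key?(value) || @value_counts.size < @max_values
      @value_counts[value] += 1
    end
    self
  end
end

counter = CappedCounter.new(max_values: 2)
%w[a b c a c].each { |value| counter.observe(value) }
counter.total        # => 5
counter.value_counts # => {"a"=>2, "b"=>1}
```

Memory stays bounded by the number of keys, while `total` still reflects every observation, so the untracked share can be estimated as `total` minus the sum of the tracked counts.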
212
+
92
213
  ## Object model
93
214
 
94
215
  | Class | Responsibility |
@@ -96,51 +217,124 @@ constant across all members of the cluster will be reported with
96
217
  | `Iriq::Parser` | String → `Identifier` |
97
218
  | `Iriq::Identifier` | Structured fields + `canonical` reconstruction |
98
219
  | `Iriq::SegmentClassifier` | Single segment → type symbol |
99
- | `Iriq::PathShape` | Segments → `/users/{integer_id}` route shape |
220
+ | `Iriq::PathShape` | Segments → `/users/{user_id}` route shape |
221
+ | `Iriq::SegmentHints` | Derives `user_id`-style hints from neighbors |
222
+ | `Iriq::Inflector` | Singularization with swappable adapter (AS or built-in) |
100
223
  | `Iriq::Normalizer` | Identifier → canonical, shape-aware string |
101
- | `Iriq::Explanation` | Per-segment `{value, type, variable}` annotations |
224
+ | `Iriq::Explanation` | Per-segment `{value, type, variable, hint}` rows |
102
225
  | `Iriq::Cluster` | One host + shape group, with examples & stats |
103
226
  | `Iriq::Clusterer` | Many identifiers → `Cluster` set + explain |
227
+ | `Iriq::PositionStats` | Capped value/type frequencies for one position |
228
+ | `Iriq::Observation` | What `Corpus#observe` returns |
229
+ | `Iriq::Corpus` | Streaming observer with rolling aggregates + learning |
230
+ | `Iriq::Extractor` | Pulls IRIs out of free text (scheme-anchored) |
104
231
 
105
232
  ## CLI
106
233
 
107
- Installing the gem also installs an `iriq` executable.
234
+ Installing the gem installs an `iriq` executable. Two main modes:
235
+
236
+ **Single input** — combined parse + normalize summary; trim with section
237
+ flags (`-p`, `-n`).
108
238
 
109
239
  ```
110
- $ iriq parse https://foo.com/users/123
111
- original: https://foo.com/users/123
240
+ $ iriq foo.com/users/456
241
+ # parse
242
+ original: foo.com/users/456
112
243
  kind: url
113
244
  scheme: https
114
245
  host: foo.com
115
- path_segments: ["users", "123"]
116
- canonical: https://foo.com/users/123
246
+ path_segments: ["users", "456"]
247
+ canonical: https://foo.com/users/456
117
248
 
118
- $ iriq normalize foo.com/posts/2024-05-23/hello-world
119
- https://foo.com/posts/{date}/{slug}
249
+ # normalize
250
+ https://foo.com/users/{user_id}
120
251
 
121
- $ iriq explain https://foo.com/users/123/orders/456
122
- literal users
123
- * integer_id 123
124
- literal orders
125
- * integer_id 456
252
+ $ iriq -n https://foo.com/users/123
253
+ https://foo.com/users/{user_id}
254
+ ```
126
255
 
127
- $ iriq classify f47ac10b-58cc-4372-a567-0e02b2c3d479
128
- uuid
256
+ **Piped stdin**: extraction runs by default. The output auto-switches: small
257
+ inputs get a deduplicated URL list, larger inputs (≥ 10 IRIs) get the
258
+ cluster view via an ephemeral corpus. Section flags work too — emit one
259
+ normalized URL / parsed record per extracted IRI.
129
260
 
130
- $ cat urls.txt | iriq cluster
131
- [2] foo.com /users/{integer_id}
132
- https://foo.com/users/1
133
- https://foo.com/users/2
134
- [1] foo.com /posts/{slug}/edit
135
- https://foo.com/posts/abc-123/edit
261
+ ```
262
+ $ cat short.txt | iriq
263
+ [2] https://github.com/dpep/iriq
264
+ [1] https://foo.com/users
265
+
266
+ $ cat short.txt | iriq -n # normalized URL per line
267
+ https://github.com/dpep/iriq
268
+ https://foo.com/users
269
+
270
+ $ cat access.log | iriq # ≥ 10 IRIs → cluster view
271
+ [190] docs.example.com /users/{user_id}
272
+ [186] app.example.com /users/{user_id}
273
+ ...
274
+
275
+ $ cat README.md | iriq --stats # rolling aggregates
276
+ $ cat README.md | iriq cluster # force cluster view
277
+ $ cat README.md | iriq --corpus c.json # persist into a corpus
136
278
  ```
137
279
 
138
- Add `--json` to any command for machine-readable output. `iriq cluster` reads
139
- identifiers (one per line) from a file argument or stdin; lines that fail to
140
- parse are skipped with a warning on stderr.
280
+ `--corpus PATH` makes the corpus survive across invocations (atomic JSON
281
+ file). Once it has data, `-n` becomes corpus-informed:
282
+
283
+ ```
284
+ $ for n in alice bob carol dave erin frank gina hank ivan jane; do
285
+ iriq --corpus c.json https://foo.com/users/$n/profile >/dev/null
286
+ done
287
+
288
+ $ iriq -n --corpus c.json https://foo.com/users/zoe/profile
289
+ https://foo.com/users/{user}/profile # mechanical would keep "zoe"
290
+ ```
291
+
292
+ Flags:
293
+
294
+ | Flag | Effect |
295
+ | ------------------- | ------------------------------------------------------- |
296
+ | `-p, --parse` | Show parsed fields |
297
+ | `-n, --normalize` | Show the shape-normalized form |
298
+ | `-j, --json` | Emit JSON |
299
+ | `-N, --no-hints` | Use `{integer_id}` etc. instead of `{user_id}` |
300
+ | `--no-scheme-less` | Skip `foo.com/path`-style extraction (explicit-scheme only) |
301
+ | `--corpus PATH` | Load/create a JSON corpus at PATH; observe and save |
302
+ | `--stats` | Print rolling aggregates |
303
+ | `-V, --version` | Print version |
304
+
305
+ A positional argument that doesn't parse as an IRI but IS an existing
306
+ file is read and extracted from automatically — `iriq ./access.log` and
307
+ `iriq /var/log/foo.log` Just Work. (Bare filenames like `README.md`
308
+ may still parse as a URL; pipe with `cat` to disambiguate.)
141
309
 
142
310
  Exit codes: `0` success, `1` usage error, `2` parse error.
143
311
 
312
+ ## Performance
313
+
314
+ Measured on the deterministic `IriGenerator` fixture (Ruby 3.4.9, single
315
+ thread):
316
+
317
+ | Operation | Throughput |
318
+ | ------------------------ | ------------ |
319
+ | `Iriq.parse` | ~260k URLs/s |
320
+ | `Iriq.normalize` | ~148k URLs/s |
321
+ | `Iriq.explain` | ~205k URLs/s |
322
+ | `Iriq.extract` (prose) | ~9.6 MB/s |
323
+ | `Corpus#observe` | ~80k URLs/s |
324
+ | Corpus save/load (10k) | ~135 ms |
325
+
326
+ Linear scaling holds through 100k observations; per-observation retained
327
+ memory amortizes to ~100 bytes at that scale. Memoization caches are
328
+ bounded by `CACHE_MAX = 10_000` (cleared when full) — overhead is a few
329
+ hundred KB regardless of corpus size.
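
A "cleared when full" memoization cache is simple to sketch. The class below is a hypothetical illustration of that policy only; `ClearingMemo` and its interface are assumptions, and the gem's own caches may be structured differently.

```ruby
# Hypothetical sketch of a bounded memo cache that is emptied once it fills up.
class ClearingMemo
  def initialize(max_size: 10_000, &compute)
    @max_size = max_size
    @compute = compute
    @cache = {}
  end

  def fetch(key)
    return @cache[key] if @cache.key?(key)

    # No eviction bookkeeping: drop everything when the cap is reached.
    @cache.clear if @cache.size >= @max_size
    @cache[key] = @compute.call(key)
  end

  def size
    @cache.size
  end
end

memo = ClearingMemo.new(max_size: 2) { |word| word.upcase }
memo.fetch("a") # computed
memo.fetch("a") # served from the cache
memo.fetch("b") # computed, cache now full
memo.fetch("c") # cache cleared first, then "c" stored
memo.size       # => 1
```

Clearing wholesale trades some recomputation after a reset for zero per-entry overhead, which is consistent with the fixed memory ceiling described above.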
330
+
331
+ Re-run anytime with:
332
+
333
+ ```
334
+ bundle exec script/benchmark.rb # throughput
335
+ bundle exec script/memory.rb # retained memory + cache footprints
336
+ ```
337
+
144
338
  ## Limitations (intentional)
145
339
 
146
340
  This is an MVP. Iriq does **not**: