iriq 0.0.1 → 0.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +16 -0
- data/Gemfile.lock +2 -2
- data/README.md +227 -33
- data/lib/iriq/cli.rb +288 -100
- data/lib/iriq/cluster.rb +23 -0
- data/lib/iriq/clusterer.rb +32 -17
- data/lib/iriq/corpus.rb +268 -0
- data/lib/iriq/explanation.rb +6 -22
- data/lib/iriq/extractor.rb +125 -0
- data/lib/iriq/identifier.rb +11 -3
- data/lib/iriq/inflector.rb +145 -0
- data/lib/iriq/normalizer.rb +11 -8
- data/lib/iriq/observation.rb +25 -0
- data/lib/iriq/path_shape.rb +27 -9
- data/lib/iriq/position_stats.rb +64 -0
- data/lib/iriq/segment_classifier.rb +31 -7
- data/lib/iriq/segment_hints.rb +32 -0
- data/lib/iriq/version.rb +1 -1
- data/lib/iriq.rb +10 -0
- data/script/benchmark.rb +81 -0
- data/script/memory.rb +121 -0
- metadata +9 -1
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-metadata.gz:
-data.tar.gz:
+metadata.gz: 95d6bc09f7de65bcb4acc5db3ad68d1b83c326360d913c45aede66038a100461
+data.tar.gz: ae58d39b77fce3041cc5561b575dea52706b38f4cfab11d4881cf2f644f6cc59
 SHA512:
-metadata.gz:
-data.tar.gz:
+metadata.gz: a8fa85f112d9766ff4e9e4ad60c1043ce9701fe42e6dbd01f70da39fa0dd554bff573f40c858d88ab1e1539de0c1f017609535cc57e8e4fcc6651f0d20697e60
+data.tar.gz: 97c714e664874c08278a305b8beff470b68bd2b7e4f58977cc44cf2e3716e619d17011a33d77b8ea9356eec0715acb6c815252551edf64db16eec4599f816274
data/CHANGELOG.md CHANGED
@@ -1,2 +1,18 @@
+### 0.1.0 (2026-05-24)
+- CLI: auto-detect file argument, retire --extract flag
+- CLI: section flags work in pipe mode + clean up help text
+- script/memory.rb — track retained memory + cache footprints
+- Perf: classifier + inflector memoization, singleton classifier, combined extractor regex
+- Perf: derive SegmentHints once per Corpus.observe (~2x faster)
+- script/benchmark.rb — measure the main hot paths
+- README: replace fabricated example numbers with real fixture output
+- Pipe mode: extraction by default, auto-switch to cluster view at scale
+- Iriq::Extractor — pull IRIs out of free text
+- E2E spec: pipe IriGenerator stream through real iriq binary
+- IriGenerator fixture + popular-outlier heuristic
+- CLI --corpus persistence, pipe batch mode, --stats, E2E specs
+- Streaming Corpus with rolling stats and learning
+- RESTful hints, flag-based CLI, swappable inflector
+
 ### 0.0.1 (2026-05-24)
 - prototype
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
 PATH
 remote: .
 specs:
-iriq (0.0
+iriq (0.1.0)
 
 GEM
 remote: https://rubygems.org/
@@ -74,7 +74,7 @@ CHECKSUMS
 erb (6.0.4) sha256=38e3803694be357fe2bfe312487c74beaf9fb4e5beb3e22498952fe1645b95d9
 io-console (0.8.2) sha256=d6e3ae7a7cc7574f4b8893b4fca2162e57a825b223a177b7afa236c5ef9814cc
 irb (1.17.0) sha256=168c4ddb93d8a361a045c41d92b2952c7a118fa73f23fe14e55609eb7a863aae
-iriq (0.0
+iriq (0.1.0)
 pp (0.6.3) sha256=2951d514450b93ccfeb1df7d021cae0da16e0a7f95ee1e2273719669d0ab9df6
 prettyprint (0.2.0) sha256=2bc9e15581a94742064a3cc8b0fb9d45aae3d03a1baa6ef80922627a0766f193
 prism (1.9.0) sha256=7b530c6a9f92c24300014919c9dcbc055bf4cdf51ec30aed099b06cd6674ef85
data/README.md CHANGED
@@ -23,17 +23,40 @@ iri.path_segments # => ["users", "123"]
 iri.canonical # => "https://foo.com/users/123"
 
 Iriq.normalize("https://foo.com/users/123")
-# => "https://foo.com/users/{
+# => "https://foo.com/users/{user_id}"
 
 Iriq.explain("https://foo.com/users/123/orders/456")
 # => [
-# { value: "users", type: :literal, variable: false },
-# { value: "123", type: :integer_id, variable: true },
-# { value: "orders", type: :literal, variable: false },
-# { value: "456", type: :integer_id, variable: true },
+# { value: "users", type: :literal, variable: false, hint: nil },
+# { value: "123", type: :integer_id, variable: true, hint: "user_id" },
+# { value: "orders", type: :literal, variable: false, hint: nil },
+# { value: "456", type: :integer_id, variable: true, hint: "order_id" },
 # ]
 ```
 
+Pass `hints: false` to `Iriq.normalize` (or `PathShape`) for mechanical
+placeholders (`{integer_id}` instead of `{user_id}`).
+
+## RESTful hints
+
+When a variable segment follows a literal one, Iriq derives a hint by
+singularizing the literal and suffixing `_id` (or `_uuid` for UUIDs). This is
+what produces `{user_id}` from `/users/123` and `{order_id}` from
+`/orders/456`. Singularization uses `Iriq::Inflector`, which delegates to a
+swappable adapter:
+
+```ruby
+# Default: ActiveSupport::Inflector if `active_support/inflector` is loadable,
+# otherwise a built-in adapter with rules adapted from ActiveSupport.
+
+Iriq::Inflector.singularize("categories") # => "category"
+Iriq::Inflector.singularize("people") # => "person"
+
+# Override:
+Iriq::Inflector.adapter = MyAdapter # must respond to .singularize(String)
+Iriq::Inflector.reset_adapter!
+```
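Editor's sketch, not part of the gem: the hint rule described in the README text above (singularize the preceding literal, then append `_id` or `_uuid`) could look roughly like the following. The names `naive_singularize`, `variable_suffix`, and `hinted_shape` are hypothetical, the singularizer is deliberately naive, and the fallback to a bare `{id}` or `{uuid}` when no literal precedes is an assumption; the gem's own `Iriq::SegmentHints` and `Iriq::Inflector` may behave differently.

```ruby
# Illustrative only: hypothetical helpers, not Iriq's implementation.
UUID_PATTERN = /\A\h{8}-\h{4}-\h{4}-\h{4}-\h{12}\z/
INTEGER_PATTERN = /\A\d+\z/

# Deliberately naive: handles "categories" and plain trailing "s" only.
def naive_singularize(word)
  return word.sub(/ies\z/, "y") if word.end_with?("ies")
  return word.chomp("s") if word.end_with?("s") && !word.end_with?("ss")
  word
end

# Returns "uuid" or "id" for variable-looking segments, nil for literals.
def variable_suffix(segment)
  return "uuid" if segment.match?(UUID_PATTERN)
  return "id" if segment.match?(INTEGER_PATTERN)
  nil
end

# Replaces variable segments with placeholders hinted by the literal
# segment immediately before them.
def hinted_shape(path)
  previous_literal = nil
  parts = path.split("/").reject(&:empty?).map do |segment|
    suffix = variable_suffix(segment)
    if suffix.nil?
      previous_literal = segment
      next segment
    end
    # Assumption: with no preceding literal, fall back to the bare suffix.
    hint = previous_literal ? "#{naive_singularize(previous_literal)}_#{suffix}" : suffix
    previous_literal = nil
    "{#{hint}}"
  end
  "/" + parts.join("/")
end

puts hinted_shape("/users/123/orders/456")
```

Resetting `previous_literal` after each variable segment keeps the hint tied to the immediately preceding literal, which matches the "follows a literal one" wording.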
+
 ## Supported inputs
 
 | Input | Notes |
@@ -69,7 +92,7 @@ clusterer.add("https://foo.com/users/456")
 clusterer.add("https://foo.com/users/789/orders/1")
 
 clusterer.clusters.map(&:shape)
-# => ["/users/{
+# => ["/users/{user_id}", "/users/{user_id}/orders/{order_id}"]
 
 clusterer.clusters.first.segment_stats
 # => [
@@ -79,8 +102,8 @@ clusterer.clusters.first.segment_stats
 
 clusterer.explain("https://foo.com/users/999")
 # => [
-# { value: "users", type: :literal, variable: false, stable: true },
-# { value: "999", type: :integer_id, variable: true, stable: false },
+# { value: "users", type: :literal, variable: false, hint: nil, stable: true },
+# { value: "999", type: :integer_id, variable: true, hint: "user_id", stable: false },
 # ]
 ```
 
@@ -89,6 +112,104 @@ a position the classifier *would* call variable but that is empirically
 constant across all members of the cluster will be reported with
 `stable: true, variable: false`.
 
+## Corpus (streaming + learning)
+
+For processing many identifiers — possibly an unbounded stream — use
+`Iriq::Corpus`. It maintains rolling aggregates and per-(host, prefix)
+frequency stats so classification improves as more data comes in.
+
+```ruby
+corpus = Iriq::Corpus.new
+
+iris.each do |iri|
+  obs = corpus.observe(iri)
+  obs.fingerprint # deterministic shape: "https://foo.com/users/{user_id}"
+  obs.cluster # the Iriq::Cluster this fell into
+  obs.explanation # per-segment annotations with corpus-informed classification
+end
+
+corpus.host_counts # { "foo.com" => 1234, "bar.com" => 7 }
+corpus.path_length_counts # { 2 => 800, 3 => 434 }
+corpus.fingerprint_counts # shape → count
+corpus.raw_shape_counts # hint-free shape → count
+corpus.clusters # Iriq::Cluster instances
+```
+
+### Deterministic vs. corpus-informed normalization
+
+```ruby
+Iriq.normalize("https://foo.com/users/me")
+# => "https://foo.com/users/me" # mechanical: "me" is a literal
+
+corpus.normalize("https://foo.com/users/me")
+# => depends on what the corpus has seen
+```
+
+If many `/users/{integer_id}` paths flow in alongside a handful of
+`/users/me`, the cluster `/users/me` is preserved (mechanical clustering
+keeps literal routes distinct). If many *distinct literal handles*
+(`/users/alice`, `/users/bob`, `/users/carol`, ...) flow in, the corpus
+promotes that position to a `{user}` placeholder:
+
+```ruby
+%w[alice bob carol dave erin frank gina hank ivan jane].each do |name|
+  corpus.observe("https://foo.com/users/#{name}/profile")
+end
+
+corpus.normalize("https://foo.com/users/alice/profile")
+# => "https://foo.com/users/{user}/profile"
+```
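Editor's sketch, not part of the gem: the README does not state which statistic drives promotion of a literal position to a placeholder, only that promoted positions have "high entropy". One plausible reading is Shannon entropy over the values seen at a position. The function name `shannon_entropy` and the `PROMOTION_THRESHOLD_BITS` cutoff are hypothetical and chosen purely for illustration.

```ruby
# Illustrative only: the gem's real promotion rule is not shown in this diff.
PROMOTION_THRESHOLD_BITS = 2.0 # hypothetical cutoff

# Shannon entropy, in bits, of a { value => count } frequency table.
def shannon_entropy(value_counts)
  total = value_counts.values.sum.to_f
  return 0.0 if total.zero?

  value_counts.values.sum do |count|
    share = count / total
    share.zero? ? 0.0 : -share * Math.log2(share)
  end
end

# Ten distinct handles, each seen once: maximal spread for ten values.
handles = %w[alice bob carol dave erin frank gina hank ivan jane].to_h { |n| [n, 1] }
# One literal route dominating the position: no spread at all.
single_route = { "me" => 50 }

puts shannon_entropy(handles) > PROMOTION_THRESHOLD_BITS      # spread-out position
puts shannon_entropy(single_route) > PROMOTION_THRESHOLD_BITS # dominated position
```

Under this reading, a position filled by many distinct handles scores high while a position dominated by one literal such as `me` scores zero, which is consistent with the two cases the README contrasts.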
+
+### Explainability
+
+Each row of `corpus.explain(...)` (and `observation.explanation`) carries a
+`classification:` symbol on top of the deterministic fields:
+
+| Classification | Meaning |
+| --------------------------- | ---------------------------------------------------- |
+| `:stable_literal` | Literal value dominates this position |
+| `:variable_identifier` | Classifier said variable (uuid, integer, etc.) |
+| `:rare_literal` | Literal seen here, but not dominant |
+| `:corpus_inferred_variable` | Classifier said literal, but position has high entropy |
+| `:ambiguous` | Insufficient signal — never seen, or mixed |
+
+## Extracting IRIs from text
+
+`Iriq::Extractor` is what powers pipe-mode in the CLI. Picks up explicit-
+scheme URLs (`http`, `https`, `ftp`, `ws`, `wss`, `urn`) and `foo.com/path`-
+style scheme-less URLs (small TLD allow-list, required path). Trims trailing
+sentence punctuation iteratively and preserves balanced parens
+(`https://en.wikipedia.org/wiki/Ruby_(programming_language)` stays intact;
+`(see https://foo.com)` drops the outer paren).
+
+```ruby
+Iriq.extract("Visit https://foo.com today, also hit foo.com/users.")
+# => [#<Iriq::Identifier https://foo.com>,
+# #<Iriq::Identifier https://foo.com/users>]
+
+# Disable scheme-less:
+Iriq::Extractor.new(scheme_less: false).extract("hit foo.com/users today")
+# => []
+```
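Editor's sketch, not part of the gem: the iterative trimming described above (strip trailing sentence punctuation, drop a closing paren only when it is unbalanced) can be approximated as below. `trim_candidate` and `TRAILING_PUNCTUATION` are hypothetical names, and the exact punctuation set the gem trims is an assumption.

```ruby
# Illustrative only: hypothetical helper, not Iriq::Extractor's code.
TRAILING_PUNCTUATION = ".,;:!?".freeze # assumed set of sentence punctuation

# Repeatedly strips trailing punctuation; a trailing ")" is removed only
# when the candidate holds more closing than opening parens.
def trim_candidate(candidate)
  url = candidate.dup
  loop do
    last = url[-1]
    break if last.nil?

    if TRAILING_PUNCTUATION.include?(last)
      url.chop!
    elsif last == ")" && url.count(")") > url.count("(")
      url.chop!
    else
      break
    end
  end
  url
end

puts trim_candidate("https://en.wikipedia.org/wiki/Ruby_(programming_language)")
puts trim_candidate("https://foo.com).")
```

Looping until nothing more can be removed is what lets a tail such as `).` collapse in two passes while a balanced Wikipedia-style URL is left untouched.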
+
+Known limitations (intentional):
+
+- Comma is a URL boundary, so query strings like `?q=37.7,-122.4` truncate.
+Trade-off picked to keep CSV-shaped text working.
+- No HTML entity decoding (`&` stays as-is).
+- Scheme-less mode skips bare hostnames without a path (too noisy in prose).
+
+### Memory bounds
+
+- Per-position `value_counts` is capped (`max_values_per_position`, default
+1000) — once full, `total` keeps growing but only existing keys count up.
+- Cluster examples are capped at `Iriq::Cluster::MAX_EXAMPLES`.
+- No raw IRI strings are retained outside the bounded cluster examples.
+
+```ruby
+Iriq::Corpus.new(max_values_per_position: 200)
+```
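Editor's sketch, not part of the gem: the capping rule in the first bullet above (once the table is full, `total` keeps growing but only existing keys are incremented) can be illustrated with a small counter. `CappedCounter` is a hypothetical class; the gem's `Iriq::PositionStats` is not shown in this excerpt and may differ.

```ruby
# Illustrative only: hypothetical class, not Iriq::PositionStats.
class CappedCounter
  attr_reader :total, :counts

  def initialize(max_values: 1000)
    @max_values = max_values
    @counts = Hash.new(0)
    @total = 0
  end

  # Always bumps the total; tracks a value only if it is already known
  # or there is still room for a new key.
  def add(value)
    @total += 1
    @counts[value] += 1 if @counts.key?(value) || @counts.size < @max_values
    self
  end
end

counter = CappedCounter.new(max_values: 2)
%w[a b c a c].each { |value| counter.add(value) }
puts counter.total
puts counter.counts.inspect
```

With a cap of two, the value `c` never gets a key, so memory stays bounded while `total` still reflects all five observations.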
+
 ## Object model
 
 | Class | Responsibility |
@@ -96,51 +217,124 @@ constant across all members of the cluster will be reported with
 | `Iriq::Parser` | String → `Identifier` |
 | `Iriq::Identifier` | Structured fields + `canonical` reconstruction |
 | `Iriq::SegmentClassifier` | Single segment → type symbol |
-| `Iriq::PathShape` | Segments → `/users/{
+| `Iriq::PathShape` | Segments → `/users/{user_id}` route shape |
+| `Iriq::SegmentHints` | Derives `user_id`-style hints from neighbors |
+| `Iriq::Inflector` | Singularization with swappable adapter (AS or built-in) |
 | `Iriq::Normalizer` | Identifier → canonical, shape-aware string |
-| `Iriq::Explanation` | Per-segment `{value, type, variable}`
+| `Iriq::Explanation` | Per-segment `{value, type, variable, hint}` rows |
 | `Iriq::Cluster` | One host + shape group, with examples & stats |
 | `Iriq::Clusterer` | Many identifiers → `Cluster` set + explain |
+| `Iriq::PositionStats` | Capped value/type frequencies for one position |
+| `Iriq::Observation` | What `Corpus#observe` returns |
+| `Iriq::Corpus` | Streaming observer with rolling aggregates + learning |
+| `Iriq::Extractor` | Pulls IRIs out of free text (scheme-anchored) |
 
 ## CLI
 
-Installing the gem
+Installing the gem installs an `iriq` executable. Two main modes:
+
+**Single input** — combined parse + normalize summary; trim with section
+flags (`-p`, `-n`).
 
 ```
-$ iriq
-
+$ iriq foo.com/users/456
+# parse
+original: foo.com/users/456
 kind: url
 scheme: https
 host: foo.com
-path_segments: ["users", "
-canonical: https://foo.com/users/
+path_segments: ["users", "456"]
+canonical: https://foo.com/users/456
 
-
-https://foo.com/
+# normalize
+https://foo.com/users/{user_id}
 
-$ iriq
-
-
-literal orders
-* integer_id 456
+$ iriq -n https://foo.com/users/123
+https://foo.com/users/{user_id}
+```
 
-
-
+**Piped stdin** — extraction runs by default. Output auto-switches: small
+inputs get a deduplicated URL list, larger inputs (≥ 10 IRIs) get the
+cluster view via an ephemeral corpus. Section flags work too — emit one
+normalized URL / parsed record per extracted IRI.
 
-
-
-
-
-
-
+```
+$ cat short.txt | iriq
+[2] https://github.com/dpep/iriq
+[1] https://foo.com/users
+
+$ cat short.txt | iriq -n # normalized URL per line
+https://github.com/dpep/iriq
+https://foo.com/users
+
+$ cat access.log | iriq # ≥ 10 IRIs → cluster view
+[190] docs.example.com /users/{user_id}
+[186] app.example.com /users/{user_id}
+...
+
+$ cat README.md | iriq --stats # rolling aggregates
+$ cat README.md | iriq cluster # force cluster view
+$ cat README.md | iriq --corpus c.json # persist into a corpus
 ```
 
-
-
-
+`--corpus PATH` makes the corpus survive across invocations (atomic JSON
+file). Once it has data, `-n` becomes corpus-informed:
+
+```
+$ for n in alice bob carol dave erin frank gina hank ivan jane; do
+  iriq --corpus c.json https://foo.com/users/$n/profile >/dev/null
+done
+
+$ iriq -n --corpus c.json https://foo.com/users/zoe/profile
+https://foo.com/users/{user}/profile # mechanical would keep "zoe"
+```
+
+Flags:
+
+| Flag | Effect |
+| ------------------- | ------------------------------------------------------- |
+| `-p, --parse` | Show parsed fields |
+| `-n, --normalize` | Show the shape-normalized form |
+| `-j, --json` | Emit JSON |
+| `-N, --no-hints` | Use `{integer_id}` etc. instead of `{user_id}` |
+| `--no-scheme-less` | Skip `foo.com/path`-style extraction (explicit-scheme only) |
+| `--corpus PATH` | Load/create a JSON corpus at PATH; observe and save |
+| `--stats` | Print rolling aggregates |
+| `-V, --version` | Print version |
+
+A positional argument that doesn't parse as an IRI but IS an existing
+file is read and extracted from automatically — `iriq ./access.log` and
+`iriq /var/log/foo.log` Just Work. (Bare filenames like `README.md`
+may still parse as a URL; pipe with `cat` to disambiguate.)
 
 Exit codes: `0` success, `1` usage error, `2` parse error.
 
+## Performance
+
+Measured on the deterministic `IriGenerator` fixture (Ruby 3.4.9, single
+thread):
+
+| Operation | Throughput |
+| ------------------------ | ------------ |
+| `Iriq.parse` | ~260k URLs/s |
+| `Iriq.normalize` | ~148k URLs/s |
+| `Iriq.explain` | ~205k URLs/s |
+| `Iriq.extract` (prose) | ~9.6 MB/s |
+| `Corpus#observe` | ~80k URLs/s |
+| Corpus save/load (10k) | ~135 ms |
+
+Linear scaling holds through 100k observations; per-observation retained
+memory amortizes to ~100 bytes at that scale. Memoization caches are
+bounded by `CACHE_MAX = 10_000` (cleared when full) — overhead is a few
+hundred KB regardless of corpus size.
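Editor's sketch, not part of the gem: a memoization cache that is "cleared when full" trades eviction bookkeeping for a hard size bound. The class `ClearingMemo` and its `max_size:` keyword are hypothetical; only the `CACHE_MAX = 10_000` figure and the clear-when-full policy come from the README text above.

```ruby
# Illustrative only: hypothetical class showing a clear-when-full memo cache.
class ClearingMemo
  def initialize(max_size: 10_000, &compute)
    @max_size = max_size
    @compute = compute
    @cache = {}
  end

  # Returns the cached result, computing and storing it on a miss.
  # When the cache is full it is emptied wholesale before the new entry.
  def fetch(key)
    return @cache[key] if @cache.key?(key)

    @cache.clear if @cache.size >= @max_size
    @cache[key] = @compute.call(key)
  end

  def size
    @cache.size
  end
end

calls = 0
memo = ClearingMemo.new(max_size: 2) { |key| calls += 1; key.upcase }
memo.fetch("a")
memo.fetch("a") # served from the cache, no recomputation
memo.fetch("b")
memo.fetch("c") # cache was full, so it is cleared before storing "c"
puts memo.size
puts calls
```

Clearing everything at once means hot keys are occasionally recomputed, but there is no per-entry recency tracking, which keeps the overhead per lookup to a single hash probe.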
+
+Re-run anytime with:
+
+```
+bundle exec script/benchmark.rb # throughput
+bundle exec script/memory.rb # retained memory + cache footprints
+```
+
 ## Limitations (intentional)
 
 This is an MVP. Iriq does **not**: