iriq 0.2.0 → 0.30.2
- checksums.yaml +4 -4
- data/CHANGELOG.md +78 -0
- data/CLAUDE.md +128 -41
- data/Gemfile.lock +4 -4
- data/Makefile +80 -23
- data/README.md +225 -347
- data/completions/_iriq +52 -0
- data/completions/iriq.bash +70 -0
- data/docs/ARCHITECTURE.md +223 -0
- data/docs/ROADMAP.md +190 -0
- data/iriq.gemspec +2 -2
- data/lib/iriq/cli.rb +398 -46
- data/lib/iriq/cluster.rb +284 -12
- data/lib/iriq/corpus.rb +318 -36
- data/lib/iriq/cross_host_shape.rb +37 -0
- data/lib/iriq/event.rb +22 -0
- data/lib/iriq/evidence.rb +114 -0
- data/lib/iriq/explanation.rb +1 -1
- data/lib/iriq/normalizer.rb +71 -29
- data/lib/iriq/path_shape.rb +30 -24
- data/lib/iriq/position.rb +75 -0
- data/lib/iriq/position_stats.rb +74 -8
- data/lib/iriq/recognizer.rb +54 -0
- data/lib/iriq/recognizer_proposal.rb +167 -0
- data/lib/iriq/recognizers/date.rb +53 -0
- data/lib/iriq/recognizers/integer.rb +37 -0
- data/lib/iriq/recognizers/uuid.rb +16 -0
- data/lib/iriq/reducer.rb +37 -0
- data/lib/iriq/registrable_domain.rb +56 -0
- data/lib/iriq/segment_classifier.rb +475 -23
- data/lib/iriq/segment_hints.rb +9 -0
- data/lib/iriq/shape.rb +106 -0
- data/lib/iriq/specificity.rb +35 -0
- data/lib/iriq/storage/memory.rb +83 -12
- data/lib/iriq/storage/sqlite.rb +216 -37
- data/lib/iriq/synthesized_recognizer.rb +56 -0
- data/lib/iriq/trace.rb +294 -0
- data/lib/iriq/version.rb +1 -1
- data/lib/iriq.rb +17 -0
- metadata +22 -3
data/README.md
CHANGED
|
@@ -1,46 +1,47 @@
|
|
|
1
1
|
Iriq
|
|
2
2
|
======
|
|
3
|
-

|
|
4
3
|
[Codecov](https://codecov.io/gh/dpep/iriq)
|
|
5
4
|
|
|
6
|
-
|
|
5
|
+
**Iriq finds the *shape* of a URL** — the structural template you get when you
|
|
6
|
+
erase the parts that vary and keep the parts that don't. `…/users/123` and
|
|
7
|
+
`…/users/999` are the same shape: `/users/{user_id}`. Feed iriq a pile of messy
|
|
8
|
+
URLs — a log file, a column of links, free-text prose — and it collapses them
|
|
9
|
+
into a small set of stable, deterministic route templates. Fifty thousand
|
|
10
|
+
distinct URLs become twelve shapes.
|
|
7
11
|
|
|
8
|
-
|
|
9
|
-
|
|
10
|
-
|
|
12
|
+
(An **IRI** is just a URL — the internationalized superset of URI/URL that also
|
|
13
|
+
allows non-ASCII characters. If you know URLs, you know IRIs. The name is *IRI
|
|
14
|
+
Query*: iriq queries an IRI for its structure.)
|
|
11
15
|
|
|
12
|
-
|
|
13
|
-
|
|
16
|
+
Everything iriq does — parsing, normalizing, classifying path and query
|
|
17
|
+
components, clustering, learning new patterns — exists to derive, render, or
|
|
18
|
+
group by that shape.
|
|
14
19
|
|
|
15
|
-
|
|
16
|
-
|
|
17
|
-
|
|
20
|
+
And it gets sharper the more you feed it. Point a *corpus* at a stream and
|
|
21
|
+
classifications improve as data flows in — high-churn slots get promoted to
|
|
22
|
+
placeholders, and whole types emerge that you can't see in any single URL (a
|
|
23
|
+
position that's always 100–599 is an HTTP status; one bounded to a dozen values
|
|
24
|
+
is an enum).
|
|
18
25
|
|
|
19
26
|
```sh
|
|
20
|
-
|
|
21
|
-
|
|
22
|
-
|
|
23
|
-
# RubyGems — installs the CLI shim and the library
|
|
24
|
-
gem install iriq
|
|
25
|
-
|
|
26
|
-
# Go — installs the CLI binary into $GOBIN
|
|
27
|
-
go install github.com/dpep/iriq/cmd/iriq@latest
|
|
27
|
+
$ iriq -n https://foo.com/users/123
|
|
28
|
+
https://foo.com/users/{user_id}
|
|
28
29
|
```
|
|
29
30
|
|
|
30
|
-
|
|
31
|
+
It answers questions like:
|
|
31
32
|
|
|
32
|
-
|
|
33
|
-
|
|
34
|
-
|
|
35
|
-
|
|
33
|
+
- "What routes does this service actually expose?" (cluster a log file)
|
|
34
|
+
- "Which params are stable identifiers vs. churning IDs vs. enums?"
|
|
35
|
+
(`--stats`)
|
|
36
|
+
- "Are these 50,000 distinct URLs really just 12 templates?" (clustering)
|
|
37
|
+
- "What does `/api/v1/users/abc-123-def` become as a route shape?"
|
|
38
|
+
(`/api/{version}/users/{user_id}`)
|
|
36
39
|
|
|
37
|
-
|
|
38
|
-
import "github.com/dpep/iriq"
|
|
39
|
-
```
|
|
40
|
+
Iriq ships as a **command-line tool** (`iriq`) and a **Rust library**.
|
|
40
41
|
|
|
41
|
-
##
|
|
42
|
+
## Quick start
|
|
42
43
|
|
|
43
|
-
```
|
|
44
|
+
```sh
|
|
44
45
|
$ iriq https://foo.com/users/123
|
|
45
46
|
# parse
|
|
46
47
|
original: https://foo.com/users/123
|
|
@@ -56,368 +57,267 @@ https://foo.com/users/{user_id}
|
|
|
56
57
|
$ iriq -n https://foo.com/users/123
|
|
57
58
|
https://foo.com/users/{user_id}
|
|
58
59
|
|
|
59
|
-
$
|
|
60
|
-
|
|
61
|
-
$ iriq ./access.log -n # file auto-detected → normalize each found URL
|
|
60
|
+
$ iriq -n https://shop.com/pricing/usd?currency=eur
|
|
61
|
+
https://shop.com/pricing/USD?currency=EUR # currency upcased
|
|
62
62
|
```
|
|
63
63
|
|
|
64
|
-
|
|
65
|
-
|
|
66
|
-
|
|
67
|
-
|
|
68
|
-
|
|
69
|
-
# Ruby
|
|
70
|
-
iri = Iriq.parse("https://foo.com/users/123")
|
|
71
|
-
iri.scheme # => "https"
|
|
72
|
-
iri.host # => "foo.com"
|
|
73
|
-
iri.path_segments # => ["users", "123"]
|
|
74
|
-
iri.canonical # => "https://foo.com/users/123"
|
|
75
|
-
|
|
76
|
-
Iriq.normalize("https://foo.com/users/123")
|
|
77
|
-
# => "https://foo.com/users/{user_id}"
|
|
78
|
-
|
|
79
|
-
Iriq.explain("https://foo.com/users/123/orders/456")
|
|
80
|
-
# => [
|
|
81
|
-
# { value: "users", type: :literal, variable: false, hint: nil },
|
|
82
|
-
# { value: "123", type: :integer_id, variable: true, hint: "user_id" },
|
|
83
|
-
# { value: "orders", type: :literal, variable: false, hint: nil },
|
|
84
|
-
# { value: "456", type: :integer_id, variable: true, hint: "order_id" },
|
|
85
|
-
# ]
|
|
86
|
-
```
|
|
87
|
-
|
|
88
|
-
```go
|
|
89
|
-
// Go (same surface)
|
|
90
|
-
iri, _ := iriq.Parse("https://foo.com/users/123")
|
|
91
|
-
iri.Scheme // "https"
|
|
92
|
-
iri.Host // "foo.com"
|
|
93
|
-
iri.PathSegments // []string{"users", "123"}
|
|
94
|
-
iri.Canonical() // "https://foo.com/users/123"
|
|
64
|
+
```sh
|
|
65
|
+
$ cat access.log | iriq # ≥ 10 IRIs → cluster view
|
|
66
|
+
[190] docs.example.com /users/{user_id}
|
|
67
|
+
[186] app.example.com /users/{user_id}
|
|
68
|
+
...
|
|
95
69
|
|
|
96
|
-
|
|
97
|
-
|
|
70
|
+
$ cat access.log | iriq --stats # rolling aggregates
|
|
71
|
+
$ iriq ./access.log -n # auto-detect file → normalize each
|
|
72
|
+
$ iriq -J < access.log # newline-delimited JSON
|
|
73
|
+
$ iriq --corpus c.db < access.log # persist into a SQLite corpus
|
|
98
74
|
```
|
|
99
75
|
|
|
100
|
-
|
|
101
|
-
|
|
102
|
-
|
|
76
|
+
Once a corpus has data, `-n` becomes corpus-informed — a position that only ever
|
|
77
|
+
holds integers clusters to a single `{user_id}` shape, and new values normalize
|
|
78
|
+
to it:
|
|
103
79
|
|
|
104
|
-
|
|
105
|
-
|
|
80
|
+
```sh
|
|
81
|
+
$ for n in 1 2 3 4 5 6 7 8 9 10; do
|
|
82
|
+
iriq --corpus c.db https://api.foo.com/users/$n >/dev/null
|
|
83
|
+
done
|
|
106
84
|
|
|
107
|
-
|
|
85
|
+
$ iriq -n --corpus c.db https://api.foo.com/users/999
|
|
86
|
+
https://api.foo.com/users/{user_id}
|
|
87
|
+
```
|
|
108
88
|
|
|
109
|
-
|
|
110
|
-
singularizing the literal and suffixing `_id` (or `_uuid` for UUIDs). This is
|
|
111
|
-
what produces `{user_id}` from `/users/123` and `{order_id}` from
|
|
112
|
-
`/orders/456`. Singularization uses `Iriq::Inflector`, which delegates to a
|
|
113
|
-
swappable adapter:
|
|
89
|
+
### Two ways to normalize
|
|
114
90
|
|
|
115
|
-
|
|
116
|
-
# Default: ActiveSupport::Inflector if `active_support/inflector` is loadable,
|
|
117
|
-
# otherwise a built-in adapter with rules adapted from ActiveSupport.
|
|
91
|
+
Pick by the question you're asking:
|
|
118
92
|
|
|
119
|
-
|
|
120
|
-
|
|
93
|
+
- **`--canonical`** — clean up *this* URL, keeping the specifics.
|
|
94
|
+
`HTTP://Foo.com:80/pull/42` → `http://foo.com/pull/42` (scheme/host
|
|
95
|
+
lowercased, default port dropped; path and query left alone). Handy, but
|
|
96
|
+
table stakes — plenty of libraries do it.
|
|
97
|
+
- **`--normalize`** *(the default)* — find the URL's *shape*, erasing the
|
|
98
|
+
specifics into placeholders. `…/pull/42` → `…/pull/{id}`. This is the part
|
|
99
|
+
you came to iriq for.
|
|
121
100
|
|
|
122
|
-
|
|
123
|
-
|
|
124
|
-
Iriq::Inflector.reset_adapter!
|
|
125
|
-
```
|
|
101
|
+
Same input, two questions: "what's the clean form of *this* URL?" vs "what
|
|
102
|
+
*kind* of URL is this?" The second is iriq's reason to exist.
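
The `--canonical` side of that split can be illustrated with a minimal sketch. It assumes only the rules stated above (scheme and host lowercased, default port dropped, path and query untouched); `canonicalize` and `DEFAULT_PORTS` are hypothetical names for illustration, not iriq's API, and userinfo or IPv6 hosts are not handled.

```ruby
# Hypothetical sketch of the canonical form described above; not iriq's code.
DEFAULT_PORTS = { "http" => 80, "https" => 443 }.freeze

def canonicalize(url)
  # scheme :// host [:port] rest-of-url (path, query, fragment kept verbatim)
  match = url.match(%r{\A([A-Za-z][A-Za-z0-9+.-]*)://([^/:?#]+)(?::(\d+))?(.*)\z})
  return url unless match

  scheme = match[1].downcase
  host   = match[2].downcase
  port   = match[3]
  rest   = match[4]

  # Keep the port only when it differs from the scheme's default.
  port_part = port && port.to_i != DEFAULT_PORTS[scheme] ? ":#{port}" : ""
  "#{scheme}://#{host}#{port_part}#{rest}"
end

puts canonicalize("HTTP://Foo.com:80/pull/42")   # http://foo.com/pull/42
puts canonicalize("https://Foo.com:8443/A?b=C")  # https://foo.com:8443/A?b=C
```

Nothing here looks at what the path segments mean, which is the point of the contrast: shape normalization starts where this stops.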
|
|
126
103
|
|
|
127
|
-
##
|
|
104
|
+
## Install
|
|
128
105
|
|
|
129
|
-
|
|
130
|
-
|
|
131
|
-
|
|
132
|
-
| `foo.com/users/456` | Scheme-less; `https://` is assumed |
|
|
133
|
-
| `urn:isbn:0451450523` | URN — `scheme` and `nss` are populated |
|
|
134
|
-
| `https://例え.テスト/こんにちは` | Unicode IRI — display form preserved |
|
|
135
|
-
| `HTTPS://Foo.com:443/A` | Scheme + host lowercased; default port dropped |
|
|
136
|
-
| `https://foo.com/a/./b/../c` | Dot segments normalized |
|
|
106
|
+
```sh
|
|
107
|
+
# Homebrew (recommended)
|
|
108
|
+
brew install dpep/tools/iriq
|
|
137
109
|
|
|
138
|
-
|
|
110
|
+
# Cargo, from crates.io
|
|
111
|
+
cargo install iriq
|
|
139
112
|
|
|
140
|
-
|
|
141
|
-
|
|
142
|
-
- `:literal` — plain word (`users`, `orders`, `Profile`, `こんにちは`)
|
|
143
|
-
- `:integer_id` — pure digits below the timestamp range (`1`, `123`, `42`)
|
|
144
|
-
- `:uuid` — `f47ac10b-58cc-4372-a567-0e02b2c3d479`
|
|
145
|
-
- `:date` — `2024-05-23`
|
|
146
|
-
- `:timestamp` — ISO 8601, or 10/13-digit UNIX epoch
|
|
147
|
-
- `:hash` — 32+ hex chars (md5 / sha)
|
|
148
|
-
- `:slug` — `my-cool-post`, `my_cool_post`
|
|
149
|
-
- `:opaque_id` — short alphanumeric mix that doesn't fit elsewhere
|
|
150
|
-
|
|
151
|
-
Heuristics are deterministic and ordered — the first matching rule wins.
|
|
152
|
-
|
|
153
|
-
## Clustering
|
|
154
|
-
|
|
155
|
-
```ruby
|
|
156
|
-
clusterer = Iriq::Clusterer.new
|
|
157
|
-
clusterer.add("https://foo.com/users/123")
|
|
158
|
-
clusterer.add("https://foo.com/users/456")
|
|
159
|
-
clusterer.add("https://foo.com/users/789/orders/1")
|
|
160
|
-
|
|
161
|
-
clusterer.clusters.map(&:shape)
|
|
162
|
-
# => ["/users/{user_id}", "/users/{user_id}/orders/{order_id}"]
|
|
163
|
-
|
|
164
|
-
clusterer.clusters.first.segment_stats
|
|
165
|
-
# => [
|
|
166
|
-
# { position: 0, stable: true, values: { "users" => 2 } },
|
|
167
|
-
# { position: 1, stable: false, values: { "123" => 1, "456" => 1 } },
|
|
168
|
-
# ]
|
|
169
|
-
|
|
170
|
-
clusterer.explain("https://foo.com/users/999")
|
|
171
|
-
# => [
|
|
172
|
-
# { value: "users", type: :literal, variable: false, hint: nil, stable: true },
|
|
173
|
-
# { value: "999", type: :integer_id, variable: true, hint: "user_id", stable: false },
|
|
174
|
-
# ]
|
|
113
|
+
# Cargo, from a source checkout
|
|
114
|
+
cargo install --path rust/iriq
|
|
175
115
|
```
|
|
176
116
|
|
|
177
|
-
|
|
178
|
-
|
|
179
|
-
constant across all members of the cluster will be reported with
|
|
180
|
-
`stable: true, variable: false`.
|
|
117
|
+
One crate ships both the library and the `iriq` binary. Corpora persist to
|
|
118
|
+
SQLite (bundled, WAL) out of the box — nothing to flag, install, or rebuild.
|
|
181
119
|
|
|
182
|
-
##
|
|
120
|
+
## Use it as a Rust library
|
|
183
121
|
|
|
184
|
-
|
|
185
|
-
|
|
186
|
-
frequency stats so classification improves as more data comes in.
|
|
187
|
-
|
|
188
|
-
```ruby
|
|
189
|
-
corpus = Iriq::Corpus.new
|
|
190
|
-
|
|
191
|
-
iris.each do |iri|
|
|
192
|
-
obs = corpus.observe(iri)
|
|
193
|
-
obs.fingerprint # deterministic shape: "https://foo.com/users/{user_id}"
|
|
194
|
-
obs.cluster # the Iriq::Cluster this fell into
|
|
195
|
-
obs.explanation # per-segment annotations with corpus-informed classification
|
|
196
|
-
end
|
|
197
|
-
|
|
198
|
-
corpus.host_counts # { "foo.com" => 1234, "bar.com" => 7 }
|
|
199
|
-
corpus.path_length_counts # { 2 => 800, 3 => 434 }
|
|
200
|
-
corpus.fingerprint_counts # shape → count
|
|
201
|
-
corpus.raw_shape_counts # hint-free shape → count
|
|
202
|
-
corpus.clusters # Iriq::Cluster instances
|
|
122
|
+
```sh
|
|
123
|
+
cargo add iriq
|
|
203
124
|
```
|
|
204
125
|
|
|
205
|
-
|
|
126
|
+
```rust
|
|
127
|
+
use iriq::{parse, normalize, Corpus};
|
|
206
128
|
|
|
207
|
-
|
|
208
|
-
|
|
209
|
-
|
|
129
|
+
let iri = parse("https://foo.com/users/123")?;
|
|
130
|
+
iri.host; // "foo.com"
|
|
131
|
+
iri.path_segments; // ["users", "123"]
|
|
132
|
+
iri.canonical(); // "https://foo.com/users/123"
|
|
210
133
|
|
|
211
|
-
|
|
212
|
-
# => depends on what the corpus has seen
|
|
213
|
-
```
|
|
134
|
+
normalize("https://foo.com/users/123")?; // "https://foo.com/users/{user_id}"
|
|
214
135
|
|
|
215
|
-
|
|
216
|
-
|
|
217
|
-
|
|
218
|
-
(
|
|
219
|
-
promotes that position to a `{user}` placeholder:
|
|
220
|
-
|
|
221
|
-
```ruby
|
|
222
|
-
%w[alice bob carol dave erin frank gina hank ivan jane].each do |name|
|
|
223
|
-
corpus.observe("https://foo.com/users/#{name}/profile")
|
|
224
|
-
end
|
|
225
|
-
|
|
226
|
-
corpus.normalize("https://foo.com/users/alice/profile")
|
|
227
|
-
# => "https://foo.com/users/{user}/profile"
|
|
136
|
+
// Streaming clustering against a persistent corpus.
|
|
137
|
+
let mut corpus = Corpus::open("c.db")?;
|
|
138
|
+
corpus.observe("https://foo.com/users/1")?;
|
|
139
|
+
corpus.save("c.db")?;
|
|
228
140
|
```
|
|
229
141
|
|
|
230
|
-
|
|
231
|
-
|
|
232
|
-
Each row of `corpus.explain(...)` (and `observation.explanation`) carries a
|
|
233
|
-
`classification:` symbol on top of the deterministic fields:
|
|
142
|
+
Full API on [docs.rs/iriq](https://docs.rs/iriq); see the
|
|
143
|
+
[crate README](rust/iriq/README.md) for the library tour.
|
|
234
144
|
|
|
235
|
-
|
|
236
|
-
| --------------------------- | ---------------------------------------------------- |
|
|
237
|
-
| `:stable_literal` | Literal value dominates this position |
|
|
238
|
-
| `:variable_identifier` | Classifier said variable (uuid, integer, etc.) |
|
|
239
|
-
| `:rare_literal` | Literal seen here, but not dominant |
|
|
240
|
-
| `:corpus_inferred_variable` | Classifier said literal, but position has high entropy |
|
|
241
|
-
| `:ambiguous` | Insufficient signal — never seen, or mixed |
|
|
145
|
+
## Segment classification
|
|
242
146
|
|
|
243
|
-
|
|
147
|
+
Iriq classifies each path/query segment into one of ~25 types — the first
|
|
148
|
+
matching rule wins, and heuristics are deterministic:
|
|
149
|
+
|
|
150
|
+
- `literal` — plain word (`users`, `orders`, `Profile`, `こんにちは`)
|
|
151
|
+
- `integer` — pure digits below the timestamp range
|
|
152
|
+
- `float` — decimal with digits on both sides (`3.14`, `-2.5`, `1.0`)
|
|
153
|
+
- `boolean` — `true` / `false` (any case)
|
|
154
|
+
- `version` — semver-ish with `v` prefix (`v1`, `v2.0.1`, `v1.2.3-beta`)
|
|
155
|
+
- `locale` — BCP 47-ish (`en-US`, `fr_CA`, `zh-Hant`, bare `en`/`fr`/`ja`)
|
|
156
|
+
- `currency` — ISO 4217 codes (`USD`, `EUR`, `JPY`)
|
|
157
|
+
- `uuid` — `f47ac10b-58cc-4372-a567-0e02b2c3d479`
|
|
158
|
+
- `date` — `2024-05-23`, `2024/05/23`, `20240523`, `05/23/2024`. Canonicalized to ISO in `--normalize` output.
|
|
159
|
+
- `timestamp` — ISO 8601, or 10/13-digit UNIX epoch
|
|
160
|
+
- `hash` — 32+ hex chars (md5 / sha)
|
|
161
|
+
- `slug` — `my-cool-post`, `my_cool_post`
|
|
162
|
+
- `ipv4` / `ipv6` — collapsed to `{ip}` in normalized output
|
|
163
|
+
- `url` — `https://...`, `ftp://...`, also scheme-less `foo.com/path`
|
|
164
|
+
- `email` — `local@host.tld`
|
|
165
|
+
- `phone` — E.164 (`+15551234567`) or NANP (`555-666-7777`, `(555) 666-7777`)
|
|
166
|
+
- `jwt` — three base64url segments separated by dots
|
|
167
|
+
- `mime` — `image/png`, `application/vnd.api+json`
|
|
168
|
+
- `file` — `name.ext` for known extensions; per-kind grouping (image/document/data/...)
|
|
169
|
+
- `color` — hex form (`#fff`, `#ffffff`, `#ffffff80`)
|
|
170
|
+
- `coordinate` — `lat,lng` pair with plausible-range validation
|
|
171
|
+
- `country` — ISO 3166-1 alpha-2 codes (`US`, `JP`, `GB`)
|
|
172
|
+
- `base64` — standard base64 blobs with disambiguating `+`/`/`/`=`
|
|
173
|
+
- `opaque_id` — short alphanumeric mix that doesn't fit elsewhere
|
|
174
|
+
|
|
175
|
+
### RESTful hints
|
|
176
|
+
|
|
177
|
+
When a variable segment follows a literal one, iriq derives a hint by
|
|
178
|
+
singularizing the literal and suffixing `_id` (or `_uuid` for UUIDs). That's
|
|
179
|
+
what produces `{user_id}` from `/users/123` and `{order_id}` from `/orders/456`.
|
|
180
|
+
Semantic types (`version`, `locale`, `currency`, `date`, `boolean`) skip the
|
|
181
|
+
hint and surface as `{type}` — `/api/v1/status` renders as `/api/{version}/status`,
|
|
182
|
+
not the misleading `/api/{api_id}/status`. Pass `-N` / `--no-hints` for
|
|
183
|
+
mechanical placeholders (`{integer}` instead of `{user_id}`).
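
As a rough illustration of the hint rule, here is a minimal sketch. It assumes a deliberately naive singularizer and string type names; `singularize`, `placeholder`, and `SEMANTIC_TYPES` are hypothetical names, not iriq's actual hint code.

```ruby
# Hypothetical sketch of the hint derivation described above; not iriq's code.
SEMANTIC_TYPES = %w[version locale currency date boolean].freeze

# Naive singularizer that only covers common English plurals.
def singularize(word)
  case word
  when /ies\z/     then word.sub(/ies\z/, "y")  # categories -> category
  when /(ss|us)\z/ then word                    # status, address stay as-is
  when /s\z/       then word.chomp("s")         # users -> user
  else word
  end
end

# previous_literal: the literal segment just before the variable one (or nil).
# type: the classified type of the variable segment, e.g. "integer" or "uuid".
def placeholder(previous_literal, type, hints: true)
  # Semantic types, hint-less mode, and missing neighbors fall back to {type}.
  return "{#{type}}" if !hints || previous_literal.nil? || SEMANTIC_TYPES.include?(type)

  suffix = type == "uuid" ? "_uuid" : "_id"
  "{#{singularize(previous_literal)}#{suffix}}"
end

puts placeholder("users", "integer")                # {user_id}
puts placeholder("orders", "uuid")                  # {order_uuid}
puts placeholder("api", "version")                  # {version}
puts placeholder("users", "integer", hints: false)  # {integer}
```

The last call mirrors what `-N` / `--no-hints` is described as doing: the neighbor is ignored and the mechanical type name is used.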
|
|
184
|
+
|
|
185
|
+
### Types only the corpus can see
|
|
186
|
+
|
|
187
|
+
Four types never come from a single URL — they emerge from the *distribution*
|
|
188
|
+
of values a position has held across many observations:
|
|
189
|
+
|
|
190
|
+
| Type | Emerges when a position… |
|
|
191
|
+
| --- | --- |
|
|
192
|
+
| `number` | holds both integers and floats |
|
|
193
|
+
| `year` | holds integers that all land in 1900–2100 |
|
|
194
|
+
| `http_status` | holds integers that all land in 100–599 |
|
|
195
|
+
| `enum` | holds a small, bounded set of distinct values |
|
|
196
|
+
|
|
197
|
+
Mechanically, `200` is just an integer. Across ten thousand URLs where that
|
|
198
|
+
slot is always 100–599, it's an HTTP status. That's the corpus earning its keep.
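
A minimal sketch of how such distribution-based inference could look, assuming the ranges from the table and an assumed enum rule (at most a dozen distinct values, each seen about five times on average). `emergent_type`, `ENUM_MAX_DISTINCT`, and `ENUM_MIN_REPEAT` are hypothetical names; iriq's real reducers and thresholds may differ.

```ruby
# Hypothetical sketch of corpus-level type inference; not iriq's code.
ENUM_MAX_DISTINCT = 12  # assumed cap on distinct values for an enum
ENUM_MIN_REPEAT   = 5   # assumed: values must repeat this often on average

# values: every string observed at one position across the corpus.
def emergent_type(values)
  return nil if values.empty?

  ints   = values.grep(/\A-?\d+\z/)
  floats = values.grep(/\A-?\d+\.\d+\z/)

  if ints.size == values.size
    nums = ints.map(&:to_i)
    return "http_status" if nums.all? { |n| (100..599).cover?(n) }
    return "year"        if nums.all? { |n| (1900..2100).cover?(n) }
  elsif !ints.empty? && !floats.empty? && ints.size + floats.size == values.size
    return "number"
  end

  distinct = values.uniq.size
  return "enum" if distinct <= ENUM_MAX_DISTINCT && values.size >= distinct * ENUM_MIN_REPEAT

  nil # no emergent type; keep the per-value classification
end

puts emergent_type(%w[200 404 500 200 301])  # http_status
puts emergent_type(%w[1999 2024 2010])       # year
puts emergent_type(%w[3 4.5 10 0.25])        # number
puts emergent_type(%w[asc desc] * 10)        # enum
```

A single value can never trigger any of these branches in a meaningful way, which is why these types only appear once a position has accumulated observations.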
|
|
244
199
|
|
|
245
|
-
|
|
246
|
-
scheme URLs (`http`, `https`, `ftp`, `ws`, `wss`, `urn`) and `foo.com/path`-
|
|
247
|
-
style scheme-less URLs (small TLD allow-list, required path). Trims trailing
|
|
248
|
-
sentence punctuation iteratively and preserves balanced parens
|
|
249
|
-
(`https://en.wikipedia.org/wiki/Ruby_(programming_language)` stays intact;
|
|
250
|
-
`(see https://foo.com)` drops the outer paren).
|
|
251
|
-
|
|
252
|
-
```ruby
|
|
253
|
-
Iriq.extract("Visit https://foo.com today, also hit foo.com/users.")
|
|
254
|
-
# => [#<Iriq::Identifier https://foo.com>,
|
|
255
|
-
# #<Iriq::Identifier https://foo.com/users>]
|
|
256
|
-
|
|
257
|
-
# Disable scheme-less:
|
|
258
|
-
Iriq::Extractor.new(scheme_less: false).extract("hit foo.com/users today")
|
|
259
|
-
# => []
|
|
260
|
-
```
|
|
200
|
+
## Corpus (streaming + learning)
|
|
261
201
|
|
|
262
|
-
|
|
202
|
+
For processing many identifiers — possibly an unbounded stream — point iriq at a
|
|
203
|
+
corpus. It maintains rolling aggregates and per-(host, prefix) frequency stats,
|
|
204
|
+
so classification improves as more data comes in.
|
|
263
205
|
|
|
264
|
-
|
|
265
|
-
|
|
266
|
-
|
|
267
|
-
|
|
206
|
+
`--corpus PATH` makes the corpus survive across invocations. A `.db` /
|
|
207
|
+
`.sqlite` / `.sqlite3` path is stored in SQLite (WAL journaling, incremental
|
|
208
|
+
UPSERTs — multiple `iriq --corpus` processes can write concurrently); a
|
|
209
|
+
`.json` path writes a plain JSON file instead.
|
|
268
210
|
|
|
269
|
-
###
|
|
211
|
+
### Re-runnable inference
|
|
270
212
|
|
|
271
|
-
|
|
272
|
-
|
|
273
|
-
|
|
274
|
-
|
|
213
|
+
A corpus persists the source-IRI log alongside the materialized views.
|
|
214
|
+
`--reinfer` drops every view and replays the log through the current classifier
|
|
215
|
+
and reducers. Tune a threshold, swap in a different classifier, or activate new
|
|
216
|
+
recognizers (below) — then reinfer to see the new results without re-feeding
|
|
217
|
+
URLs.
|
|
275
218
|
|
|
276
|
-
```
|
|
277
|
-
|
|
219
|
+
```sh
|
|
220
|
+
$ iriq --corpus c.db --reinfer
|
|
278
221
|
```
|
|
279
222
|
|
|
280
|
-
##
|
|
281
|
-
|
|
282
|
-
| Class | Responsibility |
|
|
283
|
-
| --------------------------- | ---------------------------------------------------- |
|
|
284
|
-
| `Iriq::Parser` | String → `Identifier` |
|
|
285
|
-
| `Iriq::Identifier` | Structured fields + `canonical` reconstruction |
|
|
286
|
-
| `Iriq::SegmentClassifier` | Single segment → type symbol |
|
|
287
|
-
| `Iriq::PathShape` | Segments → `/users/{user_id}` route shape |
|
|
288
|
-
| `Iriq::SegmentHints` | Derives `user_id`-style hints from neighbors |
|
|
289
|
-
| `Iriq::Inflector` | Singularization with swappable adapter (AS or built-in) |
|
|
290
|
-
| `Iriq::Normalizer` | Identifier → canonical, shape-aware string |
|
|
291
|
-
| `Iriq::Explanation` | Per-segment `{value, type, variable, hint}` rows |
|
|
292
|
-
| `Iriq::Cluster` | One host + shape group, with examples & stats |
|
|
293
|
-
| `Iriq::Clusterer` | Many identifiers → `Cluster` set + explain |
|
|
294
|
-
| `Iriq::PositionStats` | Capped value/type frequencies for one position |
|
|
295
|
-
| `Iriq::Observation` | What `Corpus#observe` returns |
|
|
296
|
-
| `Iriq::Corpus` | Streaming observer with rolling aggregates + learning |
|
|
297
|
-
| `Iriq::Extractor` | Pulls IRIs out of free text (scheme-anchored) |
|
|
223
|
+
## Learning new types
|
|
298
224
|
|
|
299
|
-
|
|
225
|
+
Iriq doesn't just classify against a fixed list — it watches the stream and
|
|
226
|
+
*proposes new recognizers* for patterns it keeps seeing. Notice `ghp_…` or
|
|
227
|
+
`cus_…` recurring at a slug position and iriq will suggest a recognizer for it,
|
|
228
|
+
with evidence: coverage, host count, confidence. Proposals are never
|
|
229
|
+
auto-applied — you activate the ones you trust, and they persist with the
|
|
230
|
+
corpus. Human-in-the-loop by design.
|
|
300
231
|
|
|
301
|
-
|
|
302
|
-
|
|
303
|
-
|
|
304
|
-
flags (`-p`, `-n`).
|
|
232
|
+
```sh
|
|
233
|
+
# Print proposals (human-readable, or --json)
|
|
234
|
+
$ iriq --corpus c.db --propose-recognizers
|
|
305
235
|
|
|
236
|
+
# Auto-activate every proposal with confidence ≥ 0.9, then reinfer
|
|
237
|
+
$ iriq --corpus c.db --propose-recognizers --activate-above 0.9
|
|
306
238
|
```
|
|
307
|
-
$ iriq foo.com/users/456
|
|
308
|
-
# parse
|
|
309
|
-
original: foo.com/users/456
|
|
310
|
-
kind: url
|
|
311
|
-
scheme: https
|
|
312
|
-
host: foo.com
|
|
313
|
-
path_segments: ["users", "456"]
|
|
314
|
-
canonical: https://foo.com/users/456
|
|
315
239
|
|
|
316
|
-
|
|
317
|
-
https://foo.com/users/{user_id}
|
|
240
|
+
### Cross-host shape learning
|
|
318
241
|
|
|
319
|
-
|
|
320
|
-
|
|
321
|
-
|
|
322
|
-
|
|
323
|
-
**Piped stdin** — extraction runs by default. Output auto-switches: small
|
|
324
|
-
inputs get a deduplicated URL list, larger inputs (≥ 10 IRIs) get the
|
|
325
|
-
cluster view via an ephemeral corpus. Section flags work too — emit one
|
|
326
|
-
normalized URL / parsed record per extracted IRI.
|
|
242
|
+
A route shape that recurs across multiple hosts is independent evidence of a
|
|
243
|
+
semantic pattern — two unrelated hosts inventing the same `/users/{integer}`
|
|
244
|
+
structure by accident is unlikely.
|
|
327
245
|
|
|
246
|
+
```sh
|
|
247
|
+
$ iriq --corpus c.db --cross-host-shapes [--min-hosts N]
|
|
328
248
|
```
|
|
329
|
-
$ cat short.txt | iriq
|
|
330
|
-
[2] https://github.com/dpep/iriq
|
|
331
|
-
[1] https://foo.com/users
|
|
332
249
|
|
|
333
|
-
|
|
334
|
-
|
|
335
|
-
|
|
250
|
+
The same signal feeds back into proposal `confidence`: each additional host
|
|
251
|
+
beyond the first adds `0.05` to the score (capped at 1.0), so a prefix proposed
|
|
252
|
+
on 5 hosts is meaningfully stronger than the same coverage seen on 1 host.
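
The host bonus is simple arithmetic, sketched below. The base confidence is taken as an input because the text does not say how it is computed; `with_host_bonus` is a hypothetical helper, and the rounding is only there to keep floating-point output tidy.

```ruby
# Hypothetical sketch of the per-host bonus described above; not iriq's code.
HOST_BONUS = 0.05

# base_confidence: score before the host bonus (how iriq derives it is not shown here).
# host_count: number of distinct hosts the pattern was observed on.
def with_host_bonus(base_confidence, host_count)
  extra_hosts = [host_count - 1, 0].max
  # Cap at 1.0; round(2) is an illustration choice to hide float noise.
  [base_confidence + HOST_BONUS * extra_hosts, 1.0].min.round(2)
end

puts with_host_bonus(0.7, 1)   # 0.7
puts with_host_bonus(0.7, 5)   # 0.9
puts with_host_bonus(0.95, 5)  # 1.0
```

With the same base score of 0.7, one host stays at 0.7 while five hosts reach 0.9, which would clear an `--activate-above 0.9` threshold.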
|
|
336
253
|
|
|
337
|
-
|
|
338
|
-
[190] docs.example.com /users/{user_id}
|
|
339
|
-
[186] app.example.com /users/{user_id}
|
|
340
|
-
...
|
|
254
|
+
## Extracting IRIs from text
|
|
341
255
|
|
|
342
|
-
|
|
343
|
-
|
|
344
|
-
|
|
345
|
-
|
|
256
|
+
Pipe-mode extraction picks up explicit-scheme URLs (`http`, `https`, `ftp`,
|
|
257
|
+
`ws`, `wss`, `urn`) and `foo.com/path`-style scheme-less URLs (small TLD
|
|
258
|
+
allow-list, required path). It trims trailing sentence punctuation and preserves
|
|
259
|
+
balanced parens (`https://en.wikipedia.org/wiki/Ruby_(programming_language)`
|
|
260
|
+
stays intact; `(see https://foo.com)` drops the outer paren).
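
The trimming rule can be sketched as a small loop. This is an assumption-laden illustration: the punctuation set and the `trim_trailing` name are hypothetical, and the candidate string is presumed to have been cut out of the surrounding text already.

```ruby
# Hypothetical sketch of trailing-punctuation trimming; not iriq's extractor.
SENTENCE_PUNCTUATION = [".", ",", ";", ":", "!", "?"].freeze

def trim_trailing(candidate)
  url = candidate.dup
  loop do
    last = url[-1]
    if SENTENCE_PUNCTUATION.include?(last)
      url.chop!
    elsif last == ")"
      # A closing paren is kept only if an opening one inside the URL balances it.
      break if url.count("(") >= url.count(")")
      url.chop!
    else
      break
    end
  end
  url
end

puts trim_trailing("https://en.wikipedia.org/wiki/Ruby_(programming_language)")
# https://en.wikipedia.org/wiki/Ruby_(programming_language)
puts trim_trailing("https://foo.com).")
# https://foo.com
```

Looping until nothing more can be removed handles stacked endings such as `).` in one pass over the tail.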
|
|
261
|
+
|
|
262
|
+
Known limitations (intentional):
|
|
346
263
|
|
|
347
|
-
|
|
348
|
-
|
|
264
|
+
- Comma is a URL boundary, so query strings like `?q=37.7,-122.4` truncate.
|
|
265
|
+
Trade-off picked to keep CSV-shaped text working.
|
|
266
|
+
- No HTML entity decoding (`&amp;` stays as-is).
|
|
267
|
+
- Scheme-less mode skips bare hostnames without a path (too noisy in prose).
|
|
349
268
|
|
|
350
|
-
-
|
|
351
|
-
corpora and when you want the data human-readable.
|
|
352
|
-
- `.db` / `.sqlite` / `.sqlite3` — a SQLite database with WAL journaling.
|
|
353
|
-
Each observation is an incremental UPSERT, so multiple `iriq --corpus`
|
|
354
|
-
processes can write concurrently without clobbering each other, and the
|
|
355
|
-
cost of opening doesn't scale with corpus size.
|
|
269
|
+
Disable scheme-less extraction with `--no-scheme-less`.
|
|
356
270
|
|
|
357
|
-
|
|
271
|
+
## How it works
|
|
358
272
|
|
|
359
|
-
|
|
360
|
-
|
|
361
|
-
|
|
362
|
-
|
|
273
|
+
Under the shape sits one idea: **Position + Evidence**. A *Position* is a slot
|
|
274
|
+
in a host's structure — a typed path prefix, or a query-param name. *Evidence*
|
|
275
|
+
is everything the corpus has observed about that slot: which values, how often,
|
|
276
|
+
across how many hosts. Strings are observations; types are inferences drawn from
|
|
277
|
+
the pile. Shape is the surface you see; Position + Evidence is the engine
|
|
278
|
+
underneath. See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for the full model.
|
|
363
279
|
|
|
364
|
-
|
|
365
|
-
https://foo.com/users/{user}/profile # mechanical would keep "zoe"
|
|
366
|
-
```
|
|
280
|
+
## CLI reference
|
|
367
281
|
|
|
368
|
-
|
|
369
|
-
|
|
370
|
-
exports any backend as JSON.
|
|
282
|
+
**Single input** — combined parse + normalize summary; trim with section flags
|
|
283
|
+
(`-p`, `-n`).
|
|
371
284
|
|
|
372
|
-
|
|
285
|
+
**Piped stdin** — extraction runs by default. Output auto-switches: small inputs
|
|
286
|
+
get a deduplicated URL list, larger inputs (≥ 10 IRIs) get the cluster view via
|
|
287
|
+
an ephemeral corpus.
|
|
373
288
|
|
|
374
289
|
| Flag | Effect |
|
|
375
290
|
| ------------------- | ------------------------------------------------------- |
|
|
376
291
|
| `-p, --parse` | Show parsed fields |
|
|
377
292
|
| `-n, --normalize` | Show the shape-normalized form |
|
|
293
|
+
| `-c, --canonical` | Show the canonical form (no shape normalization) |
|
|
378
294
|
| `-j, --json` | Emit JSON |
|
|
379
|
-
| `-
|
|
295
|
+
| `-J, --ndjson` | Newline-delimited JSON (one object per line); implies `--json` |
|
|
296
|
+
| `-N, --no-hints` | Use `{integer}` etc. instead of `{user_id}` |
|
|
380
297
|
| `--no-scheme-less` | Skip `foo.com/path`-style extraction (explicit-scheme only) |
|
|
381
298
|
| `--corpus PATH` | Load/create a corpus at PATH (`.json` or `.db`/`.sqlite`/`.sqlite3`) |
|
|
299
|
+
| `--host MODE` | Host-keying for clustering: `full` (default), `reg` strips subdomains, `none` ignores host |
|
|
382
300
|
| `--stats` | Print rolling aggregates |
|
|
301
|
+
| `--reinfer` | Drop the materialized views and replay the source-IRI log through the current classifier + reducers |
|
|
302
|
+
| `--propose-recognizers` | Scan observed values for shape patterns that recur enough to suggest a new recognizer. Combine with `--json` for structured output |
|
|
303
|
+
| `--cross-host-shapes` | List route shapes that recur across multiple hosts |
|
|
304
|
+
| `--min-observations N` | Proposal threshold; default 20 |
|
|
305
|
+
| `--min-coverage F` | Proposal threshold; default 0.7 |
|
|
306
|
+
| `--min-hosts N` | Threshold for both proposals and cross-host shapes; default 1 / 2 respectively |
|
|
307
|
+
| `--activate-above F` | With `--propose-recognizers`, auto-activate every proposal whose confidence is ≥ F |
|
|
308
|
+
| `completion bash\|zsh` | Print shell completion script (Homebrew installs this automatically) |
|
|
383
309
|
| `-V, --version` | Print version |
|
|
384
310
|
|
|
385
|
-
A positional argument that doesn't parse as an IRI but IS an existing
|
|
386
|
-
|
|
387
|
-
`iriq /var/log/foo.log` Just Work. (Bare filenames like `README.md`
|
|
388
|
-
|
|
311
|
+
A positional argument that doesn't parse as an IRI but IS an existing file is
|
|
312
|
+
read and extracted from automatically — `iriq ./access.log` and
|
|
313
|
+
`iriq /var/log/foo.log` Just Work. (Bare filenames like `README.md` may still
|
|
314
|
+
parse as a URL; pipe with `cat` to disambiguate.)
|
|
389
315
|
|
|
390
316
|
Exit codes: `0` success, `1` usage error, `2` parse error.
|
|
391
317
|
|
|
392
|
-
## Performance
|
|
393
|
-
|
|
394
|
-
Measured on the deterministic `IriGenerator` fixture (Ruby 3.4.9, single
|
|
395
|
-
thread):
|
|
396
|
-
|
|
397
|
-
| Operation | Throughput |
|
|
398
|
-
| ------------------------ | ------------ |
|
|
399
|
-
| `Iriq.parse` | ~260k URLs/s |
|
|
400
|
-
| `Iriq.normalize` | ~148k URLs/s |
|
|
401
|
-
| `Iriq.explain` | ~205k URLs/s |
|
|
402
|
-
| `Iriq.extract` (prose) | ~9.6 MB/s |
|
|
403
|
-
| `Corpus#observe` | ~80k URLs/s |
|
|
404
|
-
| Corpus save/load (10k) | ~135 ms |
|
|
405
|
-
|
|
406
|
-
Linear scaling holds through 100k observations; per-observation retained
|
|
407
|
-
memory amortizes to ~100 bytes at that scale. Memoization caches are
|
|
408
|
-
bounded by `CACHE_MAX = 10_000` (cleared when full) — overhead is a few
|
|
409
|
-
hundred KB regardless of corpus size.
|
|
410
|
-
|
|
411
|
-
Re-run anytime with:
|
|
412
|
-
|
|
413
|
-
```
|
|
414
|
-
bundle exec script/benchmark.rb # throughput
|
|
415
|
-
bundle exec script/memory.rb # retained memory + cache footprints
|
|
416
|
-
```
|
|
417
|
-
|
|
418
318
|
## Limitations (intentional)
|
|
419
319
|
|
|
420
|
-
|
|
320
|
+
Iriq does **not**:
|
|
421
321
|
|
|
422
322
|
- Implement RFC 3986, RFC 3987, or the WHATWG URL standard fully.
|
|
423
323
|
- Convert between Unicode (IRI) and punycode (URI) — the display form is
|
|
@@ -425,31 +325,11 @@ This is an MVP. Iriq does **not**:
|
|
|
425
325
|
- Percent-encode or decode path/query bytes. Bytes are kept as written.
|
|
426
326
|
- Validate scheme-specific structure beyond URL vs. URN.
|
|
427
327
|
- Resolve relative references against a base URL.
|
|
428
|
-
- Round-trip `canonical` back to the exact original byte-for-byte (whitespace
|
|
429
|
-
|
|
430
|
-
|
|
431
|
-
For richer IRI handling, see `addressable`. Iriq's focus is the analysis
|
|
432
|
-
side: classification, normalization, and clustering — not a complete URL
|
|
433
|
-
implementation.
|
|
434
|
-
|
|
435
|
-
----
|
|
436
|
-
## Go port
|
|
437
|
-
|
|
438
|
-
A Go implementation lives under [`go/`](go/) — same public surface, same
|
|
439
|
-
behavior, ~10× faster CLI on extraction-heavy workloads. The Ruby gem is
|
|
440
|
-
the reference; the Go port stays in sync via golden JSON fixtures
|
|
441
|
-
(`spec/fixtures/`) and a CLI parity harness (`script/cli_parity.sh`), both
|
|
442
|
-
checked in CI.
|
|
443
|
-
|
|
444
|
-
```go
|
|
445
|
-
import "github.com/dpep/iriq/go/iriq"
|
|
446
|
-
|
|
447
|
-
iri, _ := iriq.Parse("https://foo.com/users/123")
|
|
448
|
-
norm, _ := iriq.Normalize("https://foo.com/users/123")
|
|
449
|
-
// "https://foo.com/users/{user_id}"
|
|
450
|
-
```
|
|
328
|
+
- Round-trip `canonical` back to the exact original byte-for-byte (whitespace is
|
|
329
|
+
stripped, default ports are dropped, dot segments are collapsed).
|
|
451
330
|
|
|
452
|
-
|
|
331
|
+
Iriq's focus is the analysis side: classification, normalization, and clustering
|
|
332
|
+
— not a complete URL implementation.
|
|
453
333
|
|
|
454
334
|
----
|
|
455
335
|
## Contributing
|
|
@@ -458,9 +338,7 @@ Yes please :)
|
|
|
458
338
|
|
|
459
339
|
1. Fork it
|
|
460
340
|
1. Create your feature branch (`git checkout -b my-feature`)
|
|
461
|
-
1. Ensure the tests pass (`
|
|
462
|
-
1. If you changed library behavior, port the change to Go (or open an
|
|
463
|
-
issue) and regenerate fixtures: `bundle exec ruby script/generate_fixtures.rb`
|
|
341
|
+
1. Ensure the tests pass (`cd rust && cargo test`)
|
|
464
342
|
1. Commit your changes (`git commit -am 'awesome new feature'`)
|
|
465
343
|
1. Push your branch (`git push origin my-feature`)
|
|
466
344
|
1. Create a Pull Request
|