iriq 0.2.0 → 0.30.2

data/README.md CHANGED
@@ -1,46 +1,47 @@
1
1
  Iriq
2
2
  ======
3
- ![Gem](https://img.shields.io/gem/dt/iriq?style=plastic)
4
3
  [![codecov](https://codecov.io/gh/dpep/iriq/branch/main/graph/badge.svg)](https://codecov.io/gh/dpep/iriq)
5
4
 
6
- IRI extraction, normalization, and clustering.
5
+ **Iriq finds the *shape* of a URL** — the structural template you get when you
6
+ erase the parts that vary and keep the parts that don't. `…/users/123` and
7
+ `…/users/999` are the same shape: `/users/{user_id}`. Feed iriq a pile of messy
8
+ URLs — a log file, a column of links, free-text prose — and it collapses them
9
+ into a small set of stable, deterministic route templates. Fifty thousand
10
+ distinct URLs become twelve shapes.
7
11
 
8
- Iriq pulls IRIs out of free text, parses them, normalizes them into
9
- canonical shape-aware forms, classifies their path and query components,
10
- and clusters similar identifiers, surfacing what's stable vs. unique.
12
+ (An **IRI** is just a URL: the internationalized superset of URI/URL that also
13
+ allows non-ASCII characters. If you know URLs, you know IRIs. The name is *IRI
14
+ Query*: iriq queries an IRI for its structure.)
11
15
 
12
- Ships as both a **command-line tool** (`iriq`) and a **library** (Ruby and
13
- Go; same behavior, enforced by parity tests).
16
+ Everything iriq does (parsing, normalizing, classifying path and query
17
+ components, clustering, learning new patterns) exists to derive, render, or
18
+ group by that shape.
14
19
 
15
- ## Install
16
-
17
- The CLI is available three ways. Pick whichever fits your workflow:
20
+ And it gets sharper the more you feed it. Point a *corpus* at a stream and
21
+ classifications improve as data flows in — high-churn slots get promoted to
22
+ placeholders, and whole types emerge that you can't see in any single URL (a
23
+ position that's always 100–599 is an HTTP status; one bounded to a dozen values
24
+ is an enum).
18
25
 
19
26
  ```sh
20
- # Homebrew (recommended)
21
- brew install dpep/tools/iriq
22
-
23
- # RubyGems — installs the CLI shim and the library
24
- gem install iriq
25
-
26
- # Go — installs the CLI binary into $GOBIN
27
- go install github.com/dpep/iriq/cmd/iriq@latest
27
+ $ iriq -n https://foo.com/users/123
28
+ https://foo.com/users/{user_id}
28
29
  ```
29
30
 
30
- For library use, depend on whichever runtime you're working in:
31
+ It answers questions like:
31
32
 
32
- ```ruby
33
- # Gemfile
34
- gem "iriq"
35
- ```
33
+ - "What routes does this service actually expose?" (cluster a log file)
34
+ - "Which params are stable identifiers vs. churning IDs vs. enums?"
35
+ (`--stats`)
36
+ - "Are these 50,000 distinct URLs really just 12 templates?" (clustering)
37
+ - "What does `/api/v1/users/abc-123-def` become as a route shape?"
38
+ (`/api/{version}/users/{user_id}`)
36
39
 
37
- ```go
38
- import "github.com/dpep/iriq"
39
- ```
40
+ Iriq ships as a **command-line tool** (`iriq`) and a **Rust library**.
40
41
 
41
- ## CLI quick start
42
+ ## Quick start
42
43
 
43
- ```
44
+ ```sh
44
45
  $ iriq https://foo.com/users/123
45
46
  # parse
46
47
  original: https://foo.com/users/123
@@ -56,368 +57,267 @@ https://foo.com/users/{user_id}
56
57
  $ iriq -n https://foo.com/users/123
57
58
  https://foo.com/users/{user_id}
58
59
 
59
- $ cat access.log | iriq # extract → URL list (or clusters at scale)
60
- $ cat access.log | iriq --stats # rolling aggregates
61
- $ iriq ./access.log -n # file auto-detected → normalize each found URL
60
+ $ iriq -n https://shop.com/pricing/usd?currency=eur
61
+ https://shop.com/pricing/USD?currency=EUR # currency upcased
62
62
  ```
63
63
 
64
- Full CLI reference is below under [CLI](#cli).
65
-
66
- ## Library quick start
67
-
68
- ```ruby
69
- # Ruby
70
- iri = Iriq.parse("https://foo.com/users/123")
71
- iri.scheme # => "https"
72
- iri.host # => "foo.com"
73
- iri.path_segments # => ["users", "123"]
74
- iri.canonical # => "https://foo.com/users/123"
75
-
76
- Iriq.normalize("https://foo.com/users/123")
77
- # => "https://foo.com/users/{user_id}"
78
-
79
- Iriq.explain("https://foo.com/users/123/orders/456")
80
- # => [
81
- # { value: "users", type: :literal, variable: false, hint: nil },
82
- # { value: "123", type: :integer_id, variable: true, hint: "user_id" },
83
- # { value: "orders", type: :literal, variable: false, hint: nil },
84
- # { value: "456", type: :integer_id, variable: true, hint: "order_id" },
85
- # ]
86
- ```
87
-
88
- ```go
89
- // Go (same surface)
90
- iri, _ := iriq.Parse("https://foo.com/users/123")
91
- iri.Scheme // "https"
92
- iri.Host // "foo.com"
93
- iri.PathSegments // []string{"users", "123"}
94
- iri.Canonical() // "https://foo.com/users/123"
64
+ ```sh
65
+ $ cat access.log | iriq # ≥ 10 IRIs → cluster view
66
+ [190] docs.example.com /users/{user_id}
67
+ [186] app.example.com /users/{user_id}
68
+ ...
95
69
 
96
- norm, _ := iriq.Normalize("https://foo.com/users/123")
97
- // "https://foo.com/users/{user_id}"
70
+ $ cat access.log | iriq --stats # rolling aggregates
71
+ $ iriq ./access.log -n # auto-detect file → normalize each
72
+ $ iriq -J < access.log # newline-delimited JSON
73
+ $ iriq --corpus c.db < access.log # persist into a SQLite corpus
98
74
  ```
99
75
 
100
- The Ruby gem is the reference implementation; Go mirrors its API and is
101
- kept in sync via JSON fixtures plus a CLI parity harness. See
102
- [CLAUDE.md](CLAUDE.md) for the dev process.
76
+ Once a corpus has data, `-n` becomes corpus-informed: a position that only ever
77
+ holds integers clusters to a single `{user_id}` shape, and new values normalize
78
+ to it:
103
79
 
104
- Pass `hints: false` to `Iriq.normalize` (or `PathShape`) for mechanical
105
- placeholders (`{integer_id}` instead of `{user_id}`).
80
+ ```sh
81
+ $ for n in 1 2 3 4 5 6 7 8 9 10; do
82
+ iriq --corpus c.db https://api.foo.com/users/$n >/dev/null
83
+ done
106
84
 
107
- ## RESTful hints
85
+ $ iriq -n --corpus c.db https://api.foo.com/users/999
86
+ https://api.foo.com/users/{user_id}
87
+ ```
108
88
 
109
- When a variable segment follows a literal one, Iriq derives a hint by
110
- singularizing the literal and suffixing `_id` (or `_uuid` for UUIDs). This is
111
- what produces `{user_id}` from `/users/123` and `{order_id}` from
112
- `/orders/456`. Singularization uses `Iriq::Inflector`, which delegates to a
113
- swappable adapter:
89
+ ### Two ways to normalize
114
90
 
115
- ```ruby
116
- # Default: ActiveSupport::Inflector if `active_support/inflector` is loadable,
117
- # otherwise a built-in adapter with rules adapted from ActiveSupport.
91
+ Pick by the question you're asking:
118
92
 
119
- Iriq::Inflector.singularize("categories") # => "category"
120
- Iriq::Inflector.singularize("people") # => "person"
93
+ - **`--canonical`** — clean up *this* URL, keeping the specifics.
94
+ `HTTP://Foo.com:80/pull/42` → `http://foo.com/pull/42` (scheme/host
95
+ lowercased, default port dropped; path and query left alone). Handy, but
96
+ table stakes — plenty of libraries do it.
97
+ - **`--normalize`** *(the default)* — find the URL's *shape*, erasing the
98
+ specifics into placeholders. `…/pull/42` → `…/pull/{id}`. This is the part
99
+ you came to iriq for.
121
100
 
122
- # Override:
123
- Iriq::Inflector.adapter = MyAdapter # must respond to .singularize(String)
124
- Iriq::Inflector.reset_adapter!
125
- ```
101
+ Same input, two questions: "what's the clean form of *this* URL?" vs "what
102
+ *kind* of URL is this?" The second is iriq's reason to exist.
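
The `--canonical` behavior described above can be sketched in a few lines. This is only an illustration under stated assumptions: `canonicalize` is a hypothetical helper (not iriq's API), and it assumes an explicit scheme with no userinfo and no IPv6 literal in the authority.

```rust
/// Hypothetical sketch of canonicalization: lowercase the scheme and host,
/// drop the scheme's default port, and leave path and query untouched.
fn canonicalize(url: &str) -> Option<String> {
    let (scheme, rest) = url.split_once("://")?;
    let scheme = scheme.to_ascii_lowercase();

    // The authority ends at the first '/', '?' or '#'.
    let end = rest
        .find(|c: char| c == '/' || c == '?' || c == '#')
        .unwrap_or(rest.len());
    let (authority, tail) = rest.split_at(end);

    // Split off a numeric port, if present.
    let (host, port) = match authority.rsplit_once(':') {
        Some((h, p)) if !p.is_empty() && p.bytes().all(|b| b.is_ascii_digit()) => (h, Some(p)),
        _ => (authority, None),
    };

    let default_port = match scheme.as_str() {
        "http" => Some("80"),
        "https" => Some("443"),
        _ => None,
    };

    let mut out = format!("{}://{}", scheme, host.to_lowercase());
    if let Some(p) = port {
        if Some(p) != default_port {
            out.push(':');
            out.push_str(p);
        }
    }
    out.push_str(tail); // path and query are kept as written
    Some(out)
}

fn main() {
    if let Some(clean) = canonicalize("HTTP://Foo.com:80/pull/42") {
        println!("{clean}");
    }
}
```

Shape normalization would then run on top of a form like this, replacing the variable path segments with placeholders.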
126
103
 
127
- ## Supported inputs
104
+ ## Install
128
105
 
129
- | Input | Notes |
130
- | ------------------------------------ | ------------------------------------------------ |
131
- | `https://foo.com/users/123` | Standard URL |
132
- | `foo.com/users/456` | Scheme-less; `https://` is assumed |
133
- | `urn:isbn:0451450523` | URN — `scheme` and `nss` are populated |
134
- | `https://例え.テスト/こんにちは` | Unicode IRI — display form preserved |
135
- | `HTTPS://Foo.com:443/A` | Scheme + host lowercased; default port dropped |
136
- | `https://foo.com/a/./b/../c` | Dot segments normalized |
106
+ ```sh
107
+ # Homebrew (recommended)
108
+ brew install dpep/tools/iriq
137
109
 
138
- ## Segment classification
110
+ # Cargo, from crates.io
111
+ cargo install iriq
139
112
 
140
- `Iriq::SegmentClassifier` returns one of:
141
-
142
- - `:literal` — plain word (`users`, `orders`, `Profile`, `こんにちは`)
143
- - `:integer_id` — pure digits below the timestamp range (`1`, `123`, `42`)
144
- - `:uuid` — `f47ac10b-58cc-4372-a567-0e02b2c3d479`
145
- - `:date` — `2024-05-23`
146
- - `:timestamp` — ISO 8601, or 10/13-digit UNIX epoch
147
- - `:hash` — 32+ hex chars (md5 / sha)
148
- - `:slug` — `my-cool-post`, `my_cool_post`
149
- - `:opaque_id` — short alphanumeric mix that doesn't fit elsewhere
150
-
151
- Heuristics are deterministic and ordered — the first matching rule wins.
152
-
153
- ## Clustering
154
-
155
- ```ruby
156
- clusterer = Iriq::Clusterer.new
157
- clusterer.add("https://foo.com/users/123")
158
- clusterer.add("https://foo.com/users/456")
159
- clusterer.add("https://foo.com/users/789/orders/1")
160
-
161
- clusterer.clusters.map(&:shape)
162
- # => ["/users/{user_id}", "/users/{user_id}/orders/{order_id}"]
163
-
164
- clusterer.clusters.first.segment_stats
165
- # => [
166
- # { position: 0, stable: true, values: { "users" => 2 } },
167
- # { position: 1, stable: false, values: { "123" => 1, "456" => 1 } },
168
- # ]
169
-
170
- clusterer.explain("https://foo.com/users/999")
171
- # => [
172
- # { value: "users", type: :literal, variable: false, hint: nil, stable: true },
173
- # { value: "999", type: :integer_id, variable: true, hint: "user_id", stable: false },
174
- # ]
113
+ # Cargo, from a source checkout
114
+ cargo install --path rust/iriq
175
115
  ```
176
116
 
177
- The clusterer combines classifier output with what it has actually observed:
178
- a position the classifier *would* call variable but that is empirically
179
- constant across all members of the cluster will be reported with
180
- `stable: true, variable: false`.
117
+ One crate ships both the library and the `iriq` binary. Corpora persist to
118
+ SQLite (bundled, WAL) out of the box: nothing to flag, install, or rebuild.
181
119
 
182
- ## Corpus (streaming + learning)
120
+ ## Use it as a Rust library
183
121
 
184
- For processing many identifiers — possibly an unbounded stream — use
185
- `Iriq::Corpus`. It maintains rolling aggregates and per-(host, prefix)
186
- frequency stats so classification improves as more data comes in.
187
-
188
- ```ruby
189
- corpus = Iriq::Corpus.new
190
-
191
- iris.each do |iri|
192
- obs = corpus.observe(iri)
193
- obs.fingerprint # deterministic shape: "https://foo.com/users/{user_id}"
194
- obs.cluster # the Iriq::Cluster this fell into
195
- obs.explanation # per-segment annotations with corpus-informed classification
196
- end
197
-
198
- corpus.host_counts # { "foo.com" => 1234, "bar.com" => 7 }
199
- corpus.path_length_counts # { 2 => 800, 3 => 434 }
200
- corpus.fingerprint_counts # shape → count
201
- corpus.raw_shape_counts # hint-free shape → count
202
- corpus.clusters # Iriq::Cluster instances
122
+ ```sh
123
+ cargo add iriq
203
124
  ```
204
125
 
205
- ### Deterministic vs. corpus-informed normalization
126
+ ```rust
127
+ use iriq::{parse, normalize, Corpus};
206
128
 
207
- ```ruby
208
- Iriq.normalize("https://foo.com/users/me")
209
- # => "https://foo.com/users/me" # mechanical: "me" is a literal
129
+ let iri = parse("https://foo.com/users/123")?;
130
+ iri.host; // "foo.com"
131
+ iri.path_segments; // ["users", "123"]
132
+ iri.canonical(); // "https://foo.com/users/123"
210
133
 
211
- corpus.normalize("https://foo.com/users/me")
212
- # => depends on what the corpus has seen
213
- ```
134
+ normalize("https://foo.com/users/123")?; // "https://foo.com/users/{user_id}"
214
135
 
215
- If many `/users/{integer_id}` paths flow in alongside a handful of
216
- `/users/me`, the cluster `/users/me` is preserved (mechanical clustering
217
- keeps literal routes distinct). If many *distinct literal handles*
218
- (`/users/alice`, `/users/bob`, `/users/carol`, ...) flow in, the corpus
219
- promotes that position to a `{user}` placeholder:
220
-
221
- ```ruby
222
- %w[alice bob carol dave erin frank gina hank ivan jane].each do |name|
223
- corpus.observe("https://foo.com/users/#{name}/profile")
224
- end
225
-
226
- corpus.normalize("https://foo.com/users/alice/profile")
227
- # => "https://foo.com/users/{user}/profile"
136
+ // Streaming clustering against a persistent corpus.
137
+ let mut corpus = Corpus::open("c.db")?;
138
+ corpus.observe("https://foo.com/users/1")?;
139
+ corpus.save("c.db")?;
228
140
  ```
229
141
 
230
- ### Explainability
231
-
232
- Each row of `corpus.explain(...)` (and `observation.explanation`) carries a
233
- `classification:` symbol on top of the deterministic fields:
142
+ Full API on [docs.rs/iriq](https://docs.rs/iriq); see the
143
+ [crate README](rust/iriq/README.md) for the library tour.
234
144
 
235
- | Classification | Meaning |
236
- | --------------------------- | ---------------------------------------------------- |
237
- | `:stable_literal` | Literal value dominates this position |
238
- | `:variable_identifier` | Classifier said variable (uuid, integer, etc.) |
239
- | `:rare_literal` | Literal seen here, but not dominant |
240
- | `:corpus_inferred_variable` | Classifier said literal, but position has high entropy |
241
- | `:ambiguous` | Insufficient signal — never seen, or mixed |
145
+ ## Segment classification
242
146
 
243
- ## Extracting IRIs from text
147
+ Iriq classifies each path/query segment into one of ~25 types — the first
148
+ matching rule wins, and heuristics are deterministic:
149
+
150
+ - `literal` — plain word (`users`, `orders`, `Profile`, `こんにちは`)
151
+ - `integer` — pure digits below the timestamp range
152
+ - `float` — decimal with digits on both sides (`3.14`, `-2.5`, `1.0`)
153
+ - `boolean` — `true` / `false` (any case)
154
+ - `version` — semver-ish with `v` prefix (`v1`, `v2.0.1`, `v1.2.3-beta`)
155
+ - `locale` — BCP 47-ish (`en-US`, `fr_CA`, `zh-Hant`, bare `en`/`fr`/`ja`)
156
+ - `currency` — ISO 4217 codes (`USD`, `EUR`, `JPY`)
157
+ - `uuid` — `f47ac10b-58cc-4372-a567-0e02b2c3d479`
158
+ - `date` — `2024-05-23`, `2024/05/23`, `20240523`, `05/23/2024`. Canonicalized to ISO in `--normalize` output.
159
+ - `timestamp` — ISO 8601, or 10/13-digit UNIX epoch
160
+ - `hash` — 32+ hex chars (md5 / sha)
161
+ - `slug` — `my-cool-post`, `my_cool_post`
162
+ - `ipv4` / `ipv6` — collapsed to `{ip}` in normalized output
163
+ - `url` — `https://...`, `ftp://...`, also scheme-less `foo.com/path`
164
+ - `email` — `local@host.tld`
165
+ - `phone` — E.164 (`+15551234567`) or NANP (`555-666-7777`, `(555) 666-7777`)
166
+ - `jwt` — three base64url segments separated by dots
167
+ - `mime` — `image/png`, `application/vnd.api+json`
168
+ - `file` — `name.ext` for known extensions; per-kind grouping (image/document/data/...)
169
+ - `color` — hex form (`#fff`, `#ffffff`, `#ffffff80`)
170
+ - `coordinate` — `lat,lng` pair with plausible-range validation
171
+ - `country` — ISO 3166-1 alpha-2 codes (`US`, `JP`, `GB`)
172
+ - `base64` — standard base64 blobs with disambiguating `+`/`/`/`=`
173
+ - `opaque_id` — short alphanumeric mix that doesn't fit elsewhere
174
+
175
+ ### RESTful hints
176
+
177
+ When a variable segment follows a literal one, iriq derives a hint by
178
+ singularizing the literal and suffixing `_id` (or `_uuid` for UUIDs). That's
179
+ what produces `{user_id}` from `/users/123` and `{order_id}` from `/orders/456`.
180
+ Semantic types (`version`, `locale`, `currency`, `date`, `boolean`) skip the
181
+ hint and surface as `{type}` — `/api/v1/status` renders as `/api/{version}/status`,
182
+ not the misleading `/api/{api_id}/status`. Pass `-N` / `--no-hints` for
183
+ mechanical placeholders (`{integer}` instead of `{user_id}`).
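
As a rough illustration of the hint rule above, the sketch below derives `{user_id}`-style placeholders from a path. Everything in it is an assumption made for the example: `singularize` is a deliberately naive stand-in for a real inflector, only integer and UUID segments are treated as variable, and semantic types such as `version` are not handled.

```rust
/// Naive singularizer (assumption): a stand-in for a real inflector.
fn singularize(word: &str) -> String {
    if let Some(stem) = word.strip_suffix("ies") {
        format!("{stem}y")
    } else if word.ends_with("ss") {
        word.to_string()
    } else if let Some(stem) = word.strip_suffix('s') {
        stem.to_string()
    } else {
        word.to_string()
    }
}

fn is_integer(seg: &str) -> bool {
    !seg.is_empty() && seg.bytes().all(|b| b.is_ascii_digit())
}

fn is_uuid(seg: &str) -> bool {
    let parts: Vec<&str> = seg.split('-').collect();
    let lens = [8usize, 4, 4, 4, 12];
    parts.len() == 5
        && parts
            .iter()
            .zip(lens)
            .all(|(p, n)| p.len() == n && p.bytes().all(|b| b.is_ascii_hexdigit()))
}

/// Render a path as a route shape, hinting variable segments from the
/// literal segment immediately before them.
fn path_shape(path: &str) -> String {
    let mut out: Vec<String> = Vec::new();
    let mut prev_literal: Option<&str> = None;
    for seg in path.split('/').filter(|s| !s.is_empty()) {
        let suffix = if is_uuid(seg) {
            Some("uuid")
        } else if is_integer(seg) {
            Some("id")
        } else {
            None
        };
        match (suffix, prev_literal) {
            // Variable segment after a literal: singularize + suffix.
            (Some(sfx), Some(lit)) => out.push(format!("{{{}_{}}}", singularize(lit), sfx)),
            // No literal to borrow from: fall back to a mechanical placeholder.
            (Some("id"), None) => out.push("{integer}".to_string()),
            (Some(_), None) => out.push("{uuid}".to_string()),
            (None, _) => out.push(seg.to_string()),
        }
        prev_literal = if suffix.is_none() { Some(seg) } else { None };
    }
    format!("/{}", out.join("/"))
}

fn main() {
    println!("{}", path_shape("/users/123/orders/456"));
}
```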
184
+
185
+ ### Types only the corpus can see
186
+
187
+ Four types never come from a single URL — they emerge from the *distribution*
188
+ of values a position has held across many observations:
189
+
190
+ | Type | Emerges when a position… |
191
+ | --- | --- |
192
+ | `number` | holds both integers and floats |
193
+ | `year` | holds integers that all land in 1900–2100 |
194
+ | `http_status` | holds integers that all land in 100–599 |
195
+ | `enum` | holds a small, bounded set of distinct values |
196
+
197
+ Mechanically, `200` is just an integer. Across ten thousand URLs where that
198
+ slot is always 100–599, it's an HTTP status. That's the corpus earning its keep.
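
The table can be read as a small decision procedure over the values one position has held. The sketch below is one possible reading, not iriq's implementation: the function name, the rule order, and both thresholds (`MAX_ENUM_VALUES`, `MIN_OBSERVATIONS`) are assumptions chosen for illustration.

```rust
use std::collections::HashSet;

const MAX_ENUM_VALUES: usize = 12; // assumption: "a dozen" distinct values
const MIN_OBSERVATIONS: usize = 20; // assumption: evidence needed before calling it an enum

fn is_int(s: &str) -> bool {
    let d = s.strip_prefix('-').unwrap_or(s);
    !d.is_empty() && d.bytes().all(|b| b.is_ascii_digit())
}

/// Decimal with digits on both sides of the point.
fn is_float(s: &str) -> bool {
    let d = s.strip_prefix('-').unwrap_or(s);
    match d.split_once('.') {
        Some((a, b)) => {
            !a.is_empty()
                && !b.is_empty()
                && a.bytes().all(|c| c.is_ascii_digit())
                && b.bytes().all(|c| c.is_ascii_digit())
        }
        None => false,
    }
}

/// Infer a corpus-only type from every value observed at one position.
fn infer_position_type(values: &[&str]) -> &'static str {
    if values.is_empty() {
        return "unknown";
    }
    let ints: Vec<i64> = values
        .iter()
        .copied()
        .filter(|v| is_int(v))
        .filter_map(|v| v.parse().ok())
        .collect();
    let floats = values.iter().copied().filter(|v| is_float(v)).count();

    if ints.len() == values.len() {
        if ints.iter().all(|n| (100..=599).contains(n)) {
            return "http_status";
        }
        if ints.iter().all(|n| (1900..=2100).contains(n)) {
            return "year";
        }
    } else if !ints.is_empty() && floats > 0 && ints.len() + floats == values.len() {
        return "number";
    }

    let distinct: HashSet<&str> = values.iter().copied().collect();
    if values.len() >= MIN_OBSERVATIONS && distinct.len() <= MAX_ENUM_VALUES {
        return "enum";
    }
    if ints.len() == values.len() { "integer" } else { "unresolved" }
}

fn main() {
    println!("{}", infer_position_type(&["200", "404", "500"]));
}
```

Checking the range rules before the enum rule is a design choice of this sketch: a slot that only ever holds `200` and `404` satisfies both, and the narrower label is the more useful one.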
244
199
 
245
- `Iriq::Extractor` is what powers pipe-mode in the CLI. Picks up explicit-
246
- scheme URLs (`http`, `https`, `ftp`, `ws`, `wss`, `urn`) and `foo.com/path`-
247
- style scheme-less URLs (small TLD allow-list, required path). Trims trailing
248
- sentence punctuation iteratively and preserves balanced parens
249
- (`https://en.wikipedia.org/wiki/Ruby_(programming_language)` stays intact;
250
- `(see https://foo.com)` drops the outer paren).
251
-
252
- ```ruby
253
- Iriq.extract("Visit https://foo.com today, also hit foo.com/users.")
254
- # => [#<Iriq::Identifier https://foo.com>,
255
- # #<Iriq::Identifier https://foo.com/users>]
256
-
257
- # Disable scheme-less:
258
- Iriq::Extractor.new(scheme_less: false).extract("hit foo.com/users today")
259
- # => []
260
- ```
200
+ ## Corpus (streaming + learning)
261
201
 
262
- Known limitations (intentional):
202
+ For processing many identifiers — possibly an unbounded stream — point iriq at a
203
+ corpus. It maintains rolling aggregates and per-(host, prefix) frequency stats,
204
+ so classification improves as more data comes in.
263
205
 
264
- - Comma is a URL boundary, so query strings like `?q=37.7,-122.4` truncate.
265
- Trade-off picked to keep CSV-shaped text working.
266
- - No HTML entity decoding (`&amp;` stays as-is).
267
- - Scheme-less mode skips bare hostnames without a path (too noisy in prose).
206
+ `--corpus PATH` makes the corpus survive across invocations. A `.db` /
207
+ `.sqlite` / `.sqlite3` path is stored in SQLite (WAL journaling, incremental
208
+ UPSERTs, so multiple `iriq --corpus` processes can write concurrently); a
209
+ `.json` path writes a plain JSON file instead.
268
210
 
269
- ### Memory bounds
211
+ ### Re-runnable inference
270
212
 
271
- - Per-position `value_counts` is capped (`max_values_per_position`, default
272
- 1000); once full, `total` keeps growing but only existing keys count up.
273
- - Cluster examples are capped at `Iriq::Cluster::MAX_EXAMPLES`.
274
- - No raw IRI strings are retained outside the bounded cluster examples.
213
+ A corpus persists the source-IRI log alongside the materialized views.
214
+ `--reinfer` drops every view and replays the log through the current classifier
215
+ and reducers. Tune a threshold, swap in a different classifier, or activate new
216
+ recognizers (below), then reinfer to see the new results without re-feeding
217
+ URLs.
275
218
 
276
- ```ruby
277
- Iriq::Corpus.new(max_values_per_position: 200)
219
+ ```sh
220
+ $ iriq --corpus c.db --reinfer
278
221
  ```
279
222
 
280
- ## Object model
281
-
282
- | Class | Responsibility |
283
- | --------------------------- | ---------------------------------------------------- |
284
- | `Iriq::Parser` | String → `Identifier` |
285
- | `Iriq::Identifier` | Structured fields + `canonical` reconstruction |
286
- | `Iriq::SegmentClassifier` | Single segment → type symbol |
287
- | `Iriq::PathShape` | Segments → `/users/{user_id}` route shape |
288
- | `Iriq::SegmentHints` | Derives `user_id`-style hints from neighbors |
289
- | `Iriq::Inflector` | Singularization with swappable adapter (AS or built-in) |
290
- | `Iriq::Normalizer` | Identifier → canonical, shape-aware string |
291
- | `Iriq::Explanation` | Per-segment `{value, type, variable, hint}` rows |
292
- | `Iriq::Cluster` | One host + shape group, with examples & stats |
293
- | `Iriq::Clusterer` | Many identifiers → `Cluster` set + explain |
294
- | `Iriq::PositionStats` | Capped value/type frequencies for one position |
295
- | `Iriq::Observation` | What `Corpus#observe` returns |
296
- | `Iriq::Corpus` | Streaming observer with rolling aggregates + learning |
297
- | `Iriq::Extractor` | Pulls IRIs out of free text (scheme-anchored) |
223
+ ## Learning new types
298
224
 
299
- ## CLI
225
+ Iriq doesn't just classify against a fixed list — it watches the stream and
226
+ *proposes new recognizers* for patterns it keeps seeing. Notice `ghp_…` or
227
+ `cus_…` recurring at a slug position and iriq will suggest a recognizer for it,
228
+ with evidence: coverage, host count, confidence. Proposals are never
229
+ auto-applied — you activate the ones you trust, and they persist with the
230
+ corpus. Human-in-the-loop by design.
300
231
 
301
- Installing the gem installs an `iriq` executable. Two main modes:
302
-
303
- **Single input**: combined parse + normalize summary; trim with section
304
- flags (`-p`, `-n`).
232
+ ```sh
233
+ # Print proposals (human-readable, or --json)
234
+ $ iriq --corpus c.db --propose-recognizers
305
235
 
236
+ # Auto-activate every proposal with confidence ≥ 0.9, then reinfer
237
+ $ iriq --corpus c.db --propose-recognizers --activate-above 0.9
306
238
  ```
307
- $ iriq foo.com/users/456
308
- # parse
309
- original: foo.com/users/456
310
- kind: url
311
- scheme: https
312
- host: foo.com
313
- path_segments: ["users", "456"]
314
- canonical: https://foo.com/users/456
315
239
 
316
- # normalize
317
- https://foo.com/users/{user_id}
240
+ ### Cross-host shape learning
318
241
 
319
- $ iriq -n https://foo.com/users/123
320
- https://foo.com/users/{user_id}
321
- ```
322
-
323
- **Piped stdin** — extraction runs by default. Output auto-switches: small
324
- inputs get a deduplicated URL list, larger inputs (≥ 10 IRIs) get the
325
- cluster view via an ephemeral corpus. Section flags work too — emit one
326
- normalized URL / parsed record per extracted IRI.
242
+ A route shape that recurs across multiple hosts is independent evidence of a
243
+ semantic pattern — two unrelated hosts inventing the same `/users/{integer}`
244
+ structure by accident is unlikely.
327
245
 
246
+ ```sh
247
+ $ iriq --corpus c.db --cross-host-shapes [--min-hosts N]
328
248
  ```
329
- $ cat short.txt | iriq
330
- [2] https://github.com/dpep/iriq
331
- [1] https://foo.com/users
332
249
 
333
- $ cat short.txt | iriq -n # normalized URL per line
334
- https://github.com/dpep/iriq
335
- https://foo.com/users
250
+ The same signal feeds back into proposal `confidence`: each additional host
251
+ beyond the first adds `0.05` to the score (capped at 1.0), so a prefix proposed
252
+ on 5 hosts is meaningfully stronger than the same coverage seen on 1 host.
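
The host bonus is simple arithmetic. In the sketch below, `boosted_confidence` and the notion of a `base` score (whatever the proposal scored before the host bonus) are assumptions for illustration; only the `0.05` per extra host and the `1.0` cap come from the text above.

```rust
const PER_HOST_BONUS: f64 = 0.05;

/// Add 0.05 per host beyond the first, capped at 1.0.
/// `base` is a hypothetical pre-bonus score.
fn boosted_confidence(base: f64, host_count: u32) -> f64 {
    let extra_hosts = host_count.saturating_sub(1);
    (base + PER_HOST_BONUS * f64::from(extra_hosts)).min(1.0)
}

fn main() {
    // Same base score, seen on 1 host vs. 5 hosts.
    println!("{:.2}", boosted_confidence(0.70, 1));
    println!("{:.2}", boosted_confidence(0.70, 5));
}
```

With a base of 0.70, five hosts add four bonuses of 0.05, for roughly 0.90.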
336
253
 
337
- $ cat access.log | iriq # ≥ 10 IRIs → cluster view
338
- [190] docs.example.com /users/{user_id}
339
- [186] app.example.com /users/{user_id}
340
- ...
254
+ ## Extracting IRIs from text
341
255
 
342
- $ cat README.md | iriq --stats # rolling aggregates
343
- $ cat README.md | iriq cluster # force cluster view
344
- $ cat README.md | iriq --corpus c.json # persist into a corpus
345
- ```
256
+ Pipe-mode extraction picks up explicit-scheme URLs (`http`, `https`, `ftp`,
257
+ `ws`, `wss`, `urn`) and `foo.com/path`-style scheme-less URLs (small TLD
258
+ allow-list, required path). It trims trailing sentence punctuation and preserves
259
+ balanced parens (`https://en.wikipedia.org/wiki/Ruby_(programming_language)`
260
+ stays intact; `(see https://foo.com)` drops the outer paren).
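
One way to get the trimming behavior described above is to strip from the right until the last character is neither sentence punctuation nor an unbalanced closing paren. The sketch below is an assumption about how that could work, not iriq's extractor; `trim_trailing` and its punctuation set are hypothetical.

```rust
/// Strip trailing sentence punctuation from a URL candidate. A trailing ')'
/// is removed only when the candidate has more ')' than '('.
fn trim_trailing(candidate: &str) -> &str {
    let mut s = candidate;
    loop {
        let Some(last) = s.chars().last() else { return s };
        let strip = match last {
            '.' | ',' | ';' | ':' | '!' | '?' | '\'' | '"' => true,
            ')' => s.matches(')').count() > s.matches('(').count(),
            _ => false,
        };
        if !strip {
            return s;
        }
        s = &s[..s.len() - last.len_utf8()];
    }
}

fn main() {
    println!("{}", trim_trailing("https://foo.com)."));
}
```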
261
+
262
+ Known limitations (intentional):
346
263
 
347
- `--corpus PATH` makes the corpus survive across invocations. The file
348
- extension picks the storage backend:
264
+ - Comma is a URL boundary, so query strings like `?q=37.7,-122.4` truncate.
265
+ Trade-off picked to keep CSV-shaped text working.
266
+ - No HTML entity decoding (`&amp;` stays as-is).
267
+ - Scheme-less mode skips bare hostnames without a path (too noisy in prose).
349
268
 
350
- - `.json`: a single atomically-written JSON file (default). Best for small
351
- corpora and when you want the data human-readable.
352
- - `.db` / `.sqlite` / `.sqlite3` — a SQLite database with WAL journaling.
353
- Each observation is an incremental UPSERT, so multiple `iriq --corpus`
354
- processes can write concurrently without clobbering each other, and the
355
- cost of opening doesn't scale with corpus size.
269
+ Disable scheme-less extraction with `--no-scheme-less`.
356
270
 
357
- Once the corpus has data, `-n` becomes corpus-informed:
271
+ ## How it works
358
272
 
359
- ```
360
- $ for n in alice bob carol dave erin frank gina hank ivan jane; do
361
- iriq --corpus c.db https://foo.com/users/$n/profile >/dev/null
362
- done
273
+ Under the shape sits one idea: **Position + Evidence**. A *Position* is a slot
274
+ in a host's structure: a typed path prefix, or a query-param name. *Evidence*
275
+ is everything the corpus has observed about that slot: which values, how often,
276
+ across how many hosts. Strings are observations; types are inferences drawn from
277
+ the pile. Shape is the surface you see; Position + Evidence is the engine
278
+ underneath. See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for the full model.
363
279
 
364
- $ iriq -n --corpus c.db https://foo.com/users/zoe/profile
365
- https://foo.com/users/{user}/profile # mechanical would keep "zoe"
366
- ```
280
+ ## CLI reference
367
281
 
368
- Library: `Iriq::Corpus.open("c.db")` (or `iriq.OpenCorpus("c.db")` in Go)
369
- dispatches on the same extension rules. `corpus.save("export.json")`
370
- exports any backend as JSON.
282
+ **Single input**: combined parse + normalize summary; trim with section flags
283
+ (`-p`, `-n`).
371
284
 
372
- Flags:
285
+ **Piped stdin** — extraction runs by default. Output auto-switches: small inputs
286
+ get a deduplicated URL list, larger inputs (≥ 10 IRIs) get the cluster view via
287
+ an ephemeral corpus.
373
288
 
374
289
  | Flag | Effect |
375
290
  | ------------------- | ------------------------------------------------------- |
376
291
  | `-p, --parse` | Show parsed fields |
377
292
  | `-n, --normalize` | Show the shape-normalized form |
293
+ | `-c, --canonical` | Show the canonical form (no shape normalization) |
378
294
  | `-j, --json` | Emit JSON |
379
- | `-N, --no-hints` | Use `{integer_id}` etc. instead of `{user_id}` |
295
+ | `-J, --ndjson` | Newline-delimited JSON (one object per line); implies `--json` |
296
+ | `-N, --no-hints` | Use `{integer}` etc. instead of `{user_id}` |
380
297
  | `--no-scheme-less` | Skip `foo.com/path`-style extraction (explicit-scheme only) |
381
298
  | `--corpus PATH` | Load/create a corpus at PATH (`.json` or `.db`/`.sqlite`/`.sqlite3`) |
299
+ | `--host MODE` | Host-keying for clustering: `full` (default), `reg` strips subdomains, `none` ignores host |
382
300
  | `--stats` | Print rolling aggregates |
301
+ | `--reinfer` | Drop the materialized views and replay the source-IRI log through the current classifier + reducers |
302
+ | `--propose-recognizers` | Scan observed values for shape patterns that recur enough to suggest a new recognizer. Combine with `--json` for structured output |
303
+ | `--cross-host-shapes` | List route shapes that recur across multiple hosts |
304
+ | `--min-observations N` | Proposal threshold; default 20 |
305
+ | `--min-coverage F` | Proposal threshold; default 0.7 |
306
+ | `--min-hosts N` | Threshold for both proposals and cross-host shapes; default 1 / 2 respectively |
307
+ | `--activate-above F` | With `--propose-recognizers`, auto-activate every proposal whose confidence is ≥ F |
308
+ | `completion bash\|zsh` | Print shell completion script (Homebrew installs this automatically) |
383
309
  | `-V, --version` | Print version |
384
310
 
385
- A positional argument that doesn't parse as an IRI but IS an existing
386
- file is read and extracted from automatically — `iriq ./access.log` and
387
- `iriq /var/log/foo.log` Just Work. (Bare filenames like `README.md`
388
- may still parse as a URL; pipe with `cat` to disambiguate.)
311
+ A positional argument that doesn't parse as an IRI but IS an existing file is
312
+ read and extracted from automatically — `iriq ./access.log` and
313
+ `iriq /var/log/foo.log` Just Work. (Bare filenames like `README.md` may still
314
+ parse as a URL; pipe with `cat` to disambiguate.)
389
315
 
390
316
  Exit codes: `0` success, `1` usage error, `2` parse error.
391
317
 
392
- ## Performance
393
-
394
- Measured on the deterministic `IriGenerator` fixture (Ruby 3.4.9, single
395
- thread):
396
-
397
- | Operation | Throughput |
398
- | ------------------------ | ------------ |
399
- | `Iriq.parse` | ~260k URLs/s |
400
- | `Iriq.normalize` | ~148k URLs/s |
401
- | `Iriq.explain` | ~205k URLs/s |
402
- | `Iriq.extract` (prose) | ~9.6 MB/s |
403
- | `Corpus#observe` | ~80k URLs/s |
404
- | Corpus save/load (10k) | ~135 ms |
405
-
406
- Linear scaling holds through 100k observations; per-observation retained
407
- memory amortizes to ~100 bytes at that scale. Memoization caches are
408
- bounded by `CACHE_MAX = 10_000` (cleared when full) — overhead is a few
409
- hundred KB regardless of corpus size.
410
-
411
- Re-run anytime with:
412
-
413
- ```
414
- bundle exec script/benchmark.rb # throughput
415
- bundle exec script/memory.rb # retained memory + cache footprints
416
- ```
417
-
418
318
  ## Limitations (intentional)
419
319
 
420
- This is an MVP. Iriq does **not**:
320
+ Iriq does **not**:
421
321
 
422
322
  - Implement RFC 3986, RFC 3987, or the WHATWG URL standard fully.
423
323
  - Convert between Unicode (IRI) and punycode (URI) — the display form is
@@ -425,31 +325,11 @@ This is an MVP. Iriq does **not**:
425
325
  - Percent-encode or decode path/query bytes. Bytes are kept as written.
426
326
  - Validate scheme-specific structure beyond URL vs. URN.
427
327
  - Resolve relative references against a base URL.
428
- - Round-trip `canonical` back to the exact original byte-for-byte (whitespace
429
- is stripped, default ports are dropped, dot segments are collapsed).
430
-
431
- For richer IRI handling, see `addressable`. Iriq's focus is the analysis
432
- side: classification, normalization, and clustering — not a complete URL
433
- implementation.
434
-
435
- ----
436
- ## Go port
437
-
438
- A Go implementation lives under [`go/`](go/) — same public surface, same
439
- behavior, ~10× faster CLI on extraction-heavy workloads. The Ruby gem is
440
- the reference; the Go port stays in sync via golden JSON fixtures
441
- (`spec/fixtures/`) and a CLI parity harness (`script/cli_parity.sh`), both
442
- checked in CI.
443
-
444
- ```go
445
- import "github.com/dpep/iriq/go/iriq"
446
-
447
- iri, _ := iriq.Parse("https://foo.com/users/123")
448
- norm, _ := iriq.Normalize("https://foo.com/users/123")
449
- // "https://foo.com/users/{user_id}"
450
- ```
328
+ - Round-trip `canonical` back to the exact original byte-for-byte (whitespace is
329
+ stripped, default ports are dropped, dot segments are collapsed).
451
330
 
452
- See [`go/README.md`](go/README.md) for the full API table and porting workflow.
331
+ Iriq's focus is the analysis side: classification, normalization, and clustering
332
+ — not a complete URL implementation.
453
333
 
454
334
  ----
455
335
  ## Contributing
@@ -458,9 +338,7 @@ Yes please :)
458
338
 
459
339
  1. Fork it
460
340
  1. Create your feature branch (`git checkout -b my-feature`)
461
- 1. Ensure the tests pass (`bundle exec rspec`)
462
- 1. If you changed library behavior, port the change to Go (or open an
463
- issue) and regenerate fixtures: `bundle exec ruby script/generate_fixtures.rb`
341
+ 1. Ensure the tests pass (`cd rust && cargo test`)
464
342
  1. Commit your changes (`git commit -am 'awesome new feature'`)
465
343
  1. Push your branch (`git push origin my-feature`)
466
344
  1. Create a Pull Request