iriq 0.1.0 → 0.30.2

Files changed (46)
  1. checksums.yaml +4 -4
  2. data/CHANGELOG.md +87 -0
  3. data/CLAUDE.md +208 -0
  4. data/Gemfile.lock +8 -2
  5. data/Makefile +113 -0
  6. data/README.md +249 -270
  7. data/completions/_iriq +52 -0
  8. data/completions/iriq.bash +70 -0
  9. data/docs/ARCHITECTURE.md +223 -0
  10. data/docs/ROADMAP.md +190 -0
  11. data/iriq.gemspec +5 -4
  12. data/lib/iriq/cli.rb +402 -49
  13. data/lib/iriq/cluster.rb +304 -8
  14. data/lib/iriq/clusterer.rb +19 -44
  15. data/lib/iriq/corpus.rb +417 -81
  16. data/lib/iriq/cross_host_shape.rb +37 -0
  17. data/lib/iriq/event.rb +22 -0
  18. data/lib/iriq/evidence.rb +114 -0
  19. data/lib/iriq/explanation.rb +1 -1
  20. data/lib/iriq/normalizer.rb +71 -29
  21. data/lib/iriq/parser.rb +1 -1
  22. data/lib/iriq/path_shape.rb +30 -24
  23. data/lib/iriq/position.rb +75 -0
  24. data/lib/iriq/position_stats.rb +74 -8
  25. data/lib/iriq/recognizer.rb +54 -0
  26. data/lib/iriq/recognizer_proposal.rb +167 -0
  27. data/lib/iriq/recognizers/date.rb +53 -0
  28. data/lib/iriq/recognizers/integer.rb +37 -0
  29. data/lib/iriq/recognizers/uuid.rb +16 -0
  30. data/lib/iriq/reducer.rb +37 -0
  31. data/lib/iriq/registrable_domain.rb +56 -0
  32. data/lib/iriq/segment_classifier.rb +475 -23
  33. data/lib/iriq/segment_hints.rb +9 -0
  34. data/lib/iriq/shape.rb +106 -0
  35. data/lib/iriq/specificity.rb +35 -0
  36. data/lib/iriq/storage/json.rb +43 -0
  37. data/lib/iriq/storage/memory.rb +209 -0
  38. data/lib/iriq/storage/sqlite.rb +546 -0
  39. data/lib/iriq/storage.rb +35 -0
  40. data/lib/iriq/synthesized_recognizer.rb +56 -0
  41. data/lib/iriq/trace.rb +294 -0
  42. data/lib/iriq/version.rb +1 -1
  43. data/lib/iriq.rb +18 -0
  44. metadata +44 -8
  45. data/script/benchmark.rb +0 -81
  46. data/script/memory.rb +0 -121
data/README.md CHANGED
@@ -1,343 +1,323 @@
1
1
  Iriq
2
2
  ======
3
- ![Gem](https://img.shields.io/gem/dt/iriq?style=plastic)
4
3
  [![codecov](https://codecov.io/gh/dpep/iriq/branch/main/graph/badge.svg)](https://codecov.io/gh/dpep/iriq)
5
4
 
6
- Semantic IRI / URI / URL / URN normalization and clustering for Ruby.
5
+ **Iriq finds the *shape* of a URL**: the structural template you get when you
6
+ erase the parts that vary and keep the parts that don't. `…/users/123` and
7
+ `…/users/999` are the same shape: `/users/{user_id}`. Feed iriq a pile of messy
8
+ URLs — a log file, a column of links, free-text prose — and it collapses them
9
+ into a small set of stable, deterministic route templates. Fifty thousand
10
+ distinct URLs become twelve shapes.
7
11
 
8
- Iriq parses resource identifiers, normalizes them into canonical IRI-like
9
- forms, classifies path and query components, clusters similar identifiers,
10
- and explains which parts are stable vs. unique.
12
+ (An **IRI** is just a URL: the internationalized superset of URI/URL that also
13
+ allows non-ASCII characters. If you know URLs, you know IRIs. The name is *IRI
14
+ Query*: iriq queries an IRI for its structure.)
11
15
 
12
- ```ruby
13
- require "iriq"
14
- ```
16
+ Everything iriq does — parsing, normalizing, classifying path and query
17
+ components, clustering, learning new patterns — exists to derive, render, or
18
+ group by that shape.
15
19
 
16
- ## Quick start
20
+ And it gets sharper the more you feed it. Point a *corpus* at a stream and
21
+ classifications improve as data flows in — high-churn slots get promoted to
22
+ placeholders, and whole types emerge that you can't see in any single URL (a
23
+ position that's always 100–599 is an HTTP status; one bounded to a dozen values
24
+ is an enum).
17
25
 
18
- ```ruby
19
- iri = Iriq.parse("https://foo.com/users/123")
20
- iri.scheme # => "https"
21
- iri.host # => "foo.com"
22
- iri.path_segments # => ["users", "123"]
23
- iri.canonical # => "https://foo.com/users/123"
24
-
25
- Iriq.normalize("https://foo.com/users/123")
26
- # => "https://foo.com/users/{user_id}"
27
-
28
- Iriq.explain("https://foo.com/users/123/orders/456")
29
- # => [
30
- # { value: "users", type: :literal, variable: false, hint: nil },
31
- # { value: "123", type: :integer_id, variable: true, hint: "user_id" },
32
- # { value: "orders", type: :literal, variable: false, hint: nil },
33
- # { value: "456", type: :integer_id, variable: true, hint: "order_id" },
34
- # ]
26
+ ```sh
27
+ $ iriq -n https://foo.com/users/123
28
+ https://foo.com/users/{user_id}
35
29
  ```
36
30
 
37
- Pass `hints: false` to `Iriq.normalize` (or `PathShape`) for mechanical
38
- placeholders (`{integer_id}` instead of `{user_id}`).
31
+ It answers questions like:
39
32
 
40
- ## RESTful hints
33
+ - "What routes does this service actually expose?" (cluster a log file)
34
+ - "Which params are stable identifiers vs. churning IDs vs. enums?"
35
+ (`--stats`)
36
+ - "Are these 50,000 distinct URLs really just 12 templates?" (clustering)
37
+ - "What does `/api/v1/users/abc-123-def` become as a route shape?"
38
+ (`/api/{version}/users/{user_id}`)
41
39
 
42
- When a variable segment follows a literal one, Iriq derives a hint by
43
- singularizing the literal and suffixing `_id` (or `_uuid` for UUIDs). This is
44
- what produces `{user_id}` from `/users/123` and `{order_id}` from
45
- `/orders/456`. Singularization uses `Iriq::Inflector`, which delegates to a
46
- swappable adapter:
40
+ Iriq ships as a **command-line tool** (`iriq`) and a **Rust library**.
47
41
 
48
- ```ruby
49
- # Default: ActiveSupport::Inflector if `active_support/inflector` is loadable,
50
- # otherwise a built-in adapter with rules adapted from ActiveSupport.
42
+ ## Quick start
51
43
 
52
- Iriq::Inflector.singularize("categories") # => "category"
53
- Iriq::Inflector.singularize("people") # => "person"
44
+ ```sh
45
+ $ iriq https://foo.com/users/123
46
+ # parse
47
+ original: https://foo.com/users/123
48
+ kind: url
49
+ scheme: https
50
+ host: foo.com
51
+ path_segments: ["users", "123"]
52
+ canonical: https://foo.com/users/123
54
53
 
55
- # Override:
56
- Iriq::Inflector.adapter = MyAdapter # must respond to .singularize(String)
57
- Iriq::Inflector.reset_adapter!
58
- ```
54
+ # normalize
55
+ https://foo.com/users/{user_id}
59
56
 
60
- ## Supported inputs
57
+ $ iriq -n https://foo.com/users/123
58
+ https://foo.com/users/{user_id}
61
59
 
62
- | Input | Notes |
63
- | ------------------------------------ | ------------------------------------------------ |
64
- | `https://foo.com/users/123` | Standard URL |
65
- | `foo.com/users/456` | Scheme-less; `https://` is assumed |
66
- | `urn:isbn:0451450523` | URN — `scheme` and `nss` are populated |
67
- | `https://例え.テスト/こんにちは` | Unicode IRI — display form preserved |
68
- | `HTTPS://Foo.com:443/A` | Scheme + host lowercased; default port dropped |
69
- | `https://foo.com/a/./b/../c` | Dot segments normalized |
60
+ $ iriq -n https://shop.com/pricing/usd?currency=eur
61
+ https://shop.com/pricing/USD?currency=EUR # currency upcased
62
+ ```
70
63
 
71
- ## Segment classification
64
+ ```sh
65
+ $ cat access.log | iriq # ≥ 10 IRIs → cluster view
66
+ [190] docs.example.com /users/{user_id}
67
+ [186] app.example.com /users/{user_id}
68
+ ...
72
69
 
73
- `Iriq::SegmentClassifier` returns one of:
74
-
75
- - `:literal`: plain word (`users`, `orders`, `Profile`, `こんにちは`)
76
- - `:integer_id`: pure digits below the timestamp range (`1`, `123`, `42`)
77
- - `:uuid` — `f47ac10b-58cc-4372-a567-0e02b2c3d479`
78
- - `:date` — `2024-05-23`
79
- - `:timestamp` — ISO 8601, or 10/13-digit UNIX epoch
80
- - `:hash` — 32+ hex chars (md5 / sha)
81
- - `:slug` — `my-cool-post`, `my_cool_post`
82
- - `:opaque_id` — short alphanumeric mix that doesn't fit elsewhere
83
-
84
- Heuristics are deterministic and ordered — the first matching rule wins.
85
-
86
- ## Clustering
87
-
88
- ```ruby
89
- clusterer = Iriq::Clusterer.new
90
- clusterer.add("https://foo.com/users/123")
91
- clusterer.add("https://foo.com/users/456")
92
- clusterer.add("https://foo.com/users/789/orders/1")
93
-
94
- clusterer.clusters.map(&:shape)
95
- # => ["/users/{user_id}", "/users/{user_id}/orders/{order_id}"]
96
-
97
- clusterer.clusters.first.segment_stats
98
- # => [
99
- # { position: 0, stable: true, values: { "users" => 2 } },
100
- # { position: 1, stable: false, values: { "123" => 1, "456" => 1 } },
101
- # ]
102
-
103
- clusterer.explain("https://foo.com/users/999")
104
- # => [
105
- # { value: "users", type: :literal, variable: false, hint: nil, stable: true },
106
- # { value: "999", type: :integer_id, variable: true, hint: "user_id", stable: false },
107
- # ]
70
+ $ cat access.log | iriq --stats # rolling aggregates
71
+ $ iriq ./access.log -n # auto-detect file → normalize each
72
+ $ iriq -J < access.log # newline-delimited JSON
73
+ $ iriq --corpus c.db < access.log # persist into a SQLite corpus
108
74
  ```
109
75
 
110
- The clusterer combines classifier output with what it has actually observed:
111
- a position the classifier *would* call variable but that is empirically
112
- constant across all members of the cluster will be reported with
113
- `stable: true, variable: false`.
76
+ Once a corpus has data, `-n` becomes corpus-informed: a position that only ever
77
+ holds integers clusters to a single `{user_id}` shape, and new values normalize
78
+ to it:
114
79
 
115
- ## Corpus (streaming + learning)
80
+ ```sh
81
+ $ for n in 1 2 3 4 5 6 7 8 9 10; do
82
+ iriq --corpus c.db https://api.foo.com/users/$n >/dev/null
83
+ done
116
84
 
117
- For processing many identifiers (possibly an unbounded stream), use
118
- `Iriq::Corpus`. It maintains rolling aggregates and per-(host, prefix)
119
- frequency stats so classification improves as more data comes in.
120
-
121
- ```ruby
122
- corpus = Iriq::Corpus.new
123
-
124
- iris.each do |iri|
125
- obs = corpus.observe(iri)
126
- obs.fingerprint # deterministic shape: "https://foo.com/users/{user_id}"
127
- obs.cluster # the Iriq::Cluster this fell into
128
- obs.explanation # per-segment annotations with corpus-informed classification
129
- end
130
-
131
- corpus.host_counts # { "foo.com" => 1234, "bar.com" => 7 }
132
- corpus.path_length_counts # { 2 => 800, 3 => 434 }
133
- corpus.fingerprint_counts # shape → count
134
- corpus.raw_shape_counts # hint-free shape → count
135
- corpus.clusters # Iriq::Cluster instances
85
+ $ iriq -n --corpus c.db https://api.foo.com/users/999
86
+ https://api.foo.com/users/{user_id}
136
87
  ```
137
88
 
138
- ### Deterministic vs. corpus-informed normalization
89
+ ### Two ways to normalize
139
90
 
140
- ```ruby
141
- Iriq.normalize("https://foo.com/users/me")
142
- # => "https://foo.com/users/me" # mechanical: "me" is a literal
91
+ Pick by the question you're asking:
143
92
 
144
- corpus.normalize("https://foo.com/users/me")
145
- # => depends on what the corpus has seen
146
- ```
93
+ - **`--canonical`** — clean up *this* URL, keeping the specifics.
94
+ `HTTP://Foo.com:80/pull/42` → `http://foo.com/pull/42` (scheme/host
95
+ lowercased, default port dropped; path and query left alone). Handy, but
96
+ table stakes — plenty of libraries do it.
97
+ - **`--normalize`** *(the default)* — find the URL's *shape*, erasing the
98
+ specifics into placeholders. `…/pull/42` → `…/pull/{id}`. This is the part
99
+ you came to iriq for.
147
100
 
148
- If many `/users/{integer_id}` paths flow in alongside a handful of
149
- `/users/me`, the cluster `/users/me` is preserved (mechanical clustering
150
- keeps literal routes distinct). If many *distinct literal handles*
151
- (`/users/alice`, `/users/bob`, `/users/carol`, ...) flow in, the corpus
152
- promotes that position to a `{user}` placeholder:
101
+ Same input, two questions: "what's the clean form of *this* URL?" vs "what
102
+ *kind* of URL is this?" The second is iriq's reason to exist.
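The two questions can be sketched with Ruby's standard `URI` library. This is only an illustration under stated assumptions, not iriq's implementation: `origin_of`, `canonical_form`, and `shape_form` are hypothetical helpers, the shape rule only replaces all-digit path segments with `{id}`, and the shape sketch ignores the query string.

```ruby
require "uri"

# Hypothetical helper: lowercased scheme and host, default port dropped.
def origin_of(uri)
  port = uri.port == uri.default_port ? "" : ":#{uri.port}"
  "#{uri.scheme.downcase}://#{uri.host.downcase}#{port}"
end

# "What's the clean form of this URL?" Path and query are left alone.
def canonical_form(url)
  uri = URI.parse(url)
  query = uri.query ? "?#{uri.query}" : ""
  "#{origin_of(uri)}#{uri.path}#{query}"
end

# "What kind of URL is this?" All-digit segments become a placeholder.
def shape_form(url)
  uri = URI.parse(url)
  segments = uri.path.split("/", -1).map do |segment|
    segment.match?(/\A\d+\z/) ? "{id}" : segment
  end
  shaped_path = segments.join("/")
  "#{origin_of(uri)}#{shaped_path}"
end
```

With these helpers, `canonical_form("HTTP://Foo.com:80/pull/42")` keeps the `42`, while `shape_form` on the same input erases it into `{id}`.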
153
103
 
154
- ```ruby
155
- %w[alice bob carol dave erin frank gina hank ivan jane].each do |name|
156
- corpus.observe("https://foo.com/users/#{name}/profile")
157
- end
104
+ ## Install
158
105
 
159
- corpus.normalize("https://foo.com/users/alice/profile")
160
- # => "https://foo.com/users/{user}/profile"
161
- ```
106
+ ```sh
107
+ # Homebrew (recommended)
108
+ brew install dpep/tools/iriq
162
109
 
163
- ### Explainability
110
+ # Cargo, from crates.io
111
+ cargo install iriq
164
112
 
165
- Each row of `corpus.explain(...)` (and `observation.explanation`) carries a
166
- `classification:` symbol on top of the deterministic fields:
113
+ # Cargo, from a source checkout
114
+ cargo install --path rust/iriq
115
+ ```
167
116
 
168
- | Classification | Meaning |
169
- | --------------------------- | ---------------------------------------------------- |
170
- | `:stable_literal` | Literal value dominates this position |
171
- | `:variable_identifier` | Classifier said variable (uuid, integer, etc.) |
172
- | `:rare_literal` | Literal seen here, but not dominant |
173
- | `:corpus_inferred_variable` | Classifier said literal, but position has high entropy |
174
- | `:ambiguous` | Insufficient signal — never seen, or mixed |
117
+ One crate ships both the library and the `iriq` binary. Corpora persist to
118
+ SQLite (bundled, WAL) out of the box — nothing to flag, install, or rebuild.
175
119
 
176
- ## Extracting IRIs from text
120
+ ## Use it as a Rust library
177
121
 
178
- `Iriq::Extractor` is what powers pipe-mode in the CLI. Picks up explicit-
179
- scheme URLs (`http`, `https`, `ftp`, `ws`, `wss`, `urn`) and `foo.com/path`-
180
- style scheme-less URLs (small TLD allow-list, required path). Trims trailing
181
- sentence punctuation iteratively and preserves balanced parens
182
- (`https://en.wikipedia.org/wiki/Ruby_(programming_language)` stays intact;
183
- `(see https://foo.com)` drops the outer paren).
184
-
185
- ```ruby
186
- Iriq.extract("Visit https://foo.com today, also hit foo.com/users.")
187
- # => [#<Iriq::Identifier https://foo.com>,
188
- # #<Iriq::Identifier https://foo.com/users>]
189
-
190
- # Disable scheme-less:
191
- Iriq::Extractor.new(scheme_less: false).extract("hit foo.com/users today")
192
- # => []
122
+ ```sh
123
+ cargo add iriq
193
124
  ```
194
125
 
195
- Known limitations (intentional):
196
-
197
- - Comma is a URL boundary, so query strings like `?q=37.7,-122.4` truncate.
198
- Trade-off picked to keep CSV-shaped text working.
199
- - No HTML entity decoding (`&amp;` stays as-is).
200
- - Scheme-less mode skips bare hostnames without a path (too noisy in prose).
126
+ ```rust
127
+ use iriq::{parse, normalize, Corpus};
201
128
 
202
- ### Memory bounds
129
+ let iri = parse("https://foo.com/users/123")?;
130
+ iri.host; // "foo.com"
131
+ iri.path_segments; // ["users", "123"]
132
+ iri.canonical(); // "https://foo.com/users/123"
203
133
 
204
- - Per-position `value_counts` is capped (`max_values_per_position`, default
205
- 1000) — once full, `total` keeps growing but only existing keys count up.
206
- - Cluster examples are capped at `Iriq::Cluster::MAX_EXAMPLES`.
207
- - No raw IRI strings are retained outside the bounded cluster examples.
134
+ normalize("https://foo.com/users/123")?; // "https://foo.com/users/{user_id}"
208
135
 
209
- ```ruby
210
- Iriq::Corpus.new(max_values_per_position: 200)
136
+ // Streaming clustering against a persistent corpus.
137
+ let mut corpus = Corpus::open("c.db")?;
138
+ corpus.observe("https://foo.com/users/1")?;
139
+ corpus.save("c.db")?;
211
140
  ```
212
141
 
213
- ## Object model
142
+ Full API on [docs.rs/iriq](https://docs.rs/iriq); see the
143
+ [crate README](rust/iriq/README.md) for the library tour.
214
144
 
215
- | Class | Responsibility |
216
- | --------------------------- | ---------------------------------------------------- |
217
- | `Iriq::Parser` | String → `Identifier` |
218
- | `Iriq::Identifier` | Structured fields + `canonical` reconstruction |
219
- | `Iriq::SegmentClassifier` | Single segment → type symbol |
220
- | `Iriq::PathShape` | Segments → `/users/{user_id}` route shape |
221
- | `Iriq::SegmentHints` | Derives `user_id`-style hints from neighbors |
222
- | `Iriq::Inflector` | Singularization with swappable adapter (AS or built-in) |
223
- | `Iriq::Normalizer` | Identifier → canonical, shape-aware string |
224
- | `Iriq::Explanation` | Per-segment `{value, type, variable, hint}` rows |
225
- | `Iriq::Cluster` | One host + shape group, with examples & stats |
226
- | `Iriq::Clusterer` | Many identifiers → `Cluster` set + explain |
227
- | `Iriq::PositionStats` | Capped value/type frequencies for one position |
228
- | `Iriq::Observation` | What `Corpus#observe` returns |
229
- | `Iriq::Corpus` | Streaming observer with rolling aggregates + learning |
230
- | `Iriq::Extractor` | Pulls IRIs out of free text (scheme-anchored) |
145
+ ## Segment classification
231
146
 
232
- ## CLI
147
+ Iriq classifies each path/query segment into one of ~25 types — the first
148
+ matching rule wins, and heuristics are deterministic:
149
+
150
+ - `literal` — plain word (`users`, `orders`, `Profile`, `こんにちは`)
151
+ - `integer` — pure digits below the timestamp range
152
+ - `float` — decimal with digits on both sides (`3.14`, `-2.5`, `1.0`)
153
+ - `boolean` — `true` / `false` (any case)
154
+ - `version` — semver-ish with `v` prefix (`v1`, `v2.0.1`, `v1.2.3-beta`)
155
+ - `locale` — BCP 47-ish (`en-US`, `fr_CA`, `zh-Hant`, bare `en`/`fr`/`ja`)
156
+ - `currency` — ISO 4217 codes (`USD`, `EUR`, `JPY`)
157
+ - `uuid` — `f47ac10b-58cc-4372-a567-0e02b2c3d479`
158
+ - `date` — `2024-05-23`, `2024/05/23`, `20240523`, `05/23/2024`. Canonicalized to ISO in `--normalize` output.
159
+ - `timestamp` — ISO 8601, or 10/13-digit UNIX epoch
160
+ - `hash` — 32+ hex chars (md5 / sha)
161
+ - `slug` — `my-cool-post`, `my_cool_post`
162
+ - `ipv4` / `ipv6` — collapsed to `{ip}` in normalized output
163
+ - `url` — `https://...`, `ftp://...`, also scheme-less `foo.com/path`
164
+ - `email` — `local@host.tld`
165
+ - `phone` — E.164 (`+15551234567`) or NANP (`555-666-7777`, `(555) 666-7777`)
166
+ - `jwt` — three base64url segments separated by dots
167
+ - `mime` — `image/png`, `application/vnd.api+json`
168
+ - `file` — `name.ext` for known extensions; per-kind grouping (image/document/data/...)
169
+ - `color` — hex form (`#fff`, `#ffffff`, `#ffffff80`)
170
+ - `coordinate` — `lat,lng` pair with plausible-range validation
171
+ - `country` — ISO 3166-1 alpha-2 codes (`US`, `JP`, `GB`)
172
+ - `base64` — standard base64 blobs with disambiguating `+`/`/`/`=`
173
+ - `opaque_id` — short alphanumeric mix that doesn't fit elsewhere
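The "first matching rule wins" idea can be sketched as an ordered rule table. The patterns below are simplified assumptions covering only a handful of the types listed above (they are not iriq's actual rules, and the `integer` rule skips the timestamp-range check); `RULES` and `classify` are hypothetical names.

```ruby
# Order matters: a UUID or a date would also satisfy the slug pattern,
# so the more specific rules have to come first.
RULES = [
  [:uuid,    /\A\h{8}-\h{4}-\h{4}-\h{4}-\h{12}\z/],
  [:date,    /\A\d{4}-\d{2}-\d{2}\z/],
  [:boolean, /\A(?:true|false)\z/i],
  [:version, /\Av\d+(?:\.\d+)*\z/],
  [:float,   /\A-?\d+\.\d+\z/],
  [:integer, /\A\d+\z/],
  [:slug,    /\A[a-z0-9]+(?:[-_][a-z0-9]+)+\z/]
].freeze

# Return the type of the first rule whose pattern matches, else :literal.
def classify(segment)
  RULES.each { |type, pattern| return type if pattern.match?(segment) }
  :literal
end
```

Because the table is a fixed array walked top to bottom, the same segment always gets the same type, which is what makes the heuristics deterministic.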
174
+
175
+ ### RESTful hints
176
+
177
+ When a variable segment follows a literal one, iriq derives a hint by
178
+ singularizing the literal and suffixing `_id` (or `_uuid` for UUIDs). That's
179
+ what produces `{user_id}` from `/users/123` and `{order_id}` from `/orders/456`.
180
+ Semantic types (`version`, `locale`, `currency`, `date`, `boolean`) skip the
181
+ hint and surface as `{type}` — `/api/v1/status` renders as `/api/{version}/status`,
182
+ not the misleading `/api/{api_id}/status`. Pass `-N` / `--no-hints` for
183
+ mechanical placeholders (`{integer}` instead of `{user_id}`).
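A rough sketch of the hint rule described above, with a deliberately naive singularizer. `naive_singularize` and `placeholder` are hypothetical names, and real inflection has many more cases than the two handled here.

```ruby
SEMANTIC_TYPES = %i[version locale currency date boolean].freeze

# Naive on purpose: handles "categories" and "users", nothing irregular.
def naive_singularize(word)
  return word.sub(/ies\z/, "y") if word.end_with?("ies")
  return word.chomp("s") if word.end_with?("s") && !word.end_with?("ss")
  word
end

# previous_literal: the literal segment just before the variable one.
# type: the classified type of the variable segment.
def placeholder(previous_literal, type, hints: true)
  # Semantic types, and the no-hints mode, surface the bare type name.
  return "{#{type}}" if SEMANTIC_TYPES.include?(type) || !hints || previous_literal.nil?

  suffix = type == :uuid ? "uuid" : "id"
  "{#{naive_singularize(previous_literal)}_#{suffix}}"
end
```

For example, `placeholder("users", :integer)` gives `{user_id}`, while `placeholder("api", :version)` gives `{version}` rather than `{api_id}`.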
184
+
185
+ ### Types only the corpus can see
186
+
187
+ Four types never come from a single URL — they emerge from the *distribution*
188
+ of values a position has held across many observations:
189
+
190
+ | Type | Emerges when a position… |
191
+ | --- | --- |
192
+ | `number` | holds both integers and floats |
193
+ | `year` | holds integers that all land in 1900–2100 |
194
+ | `http_status` | holds integers that all land in 100–599 |
195
+ | `enum` | holds a small, bounded set of distinct values |
196
+
197
+ Mechanically, `200` is just an integer. Across ten thousand URLs where that
198
+ slot is always 100–599, it's an HTTP status. That's the corpus earning its keep.
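One way to picture distribution-based inference is a function over every value a position has held. The sketch below follows the table above, but `infer_position_type`, the `enum_max` and `min_observations` thresholds, and the order of the checks are assumptions made for illustration, not iriq's actual reducers.

```ruby
# values: every string observed at one position, duplicates included.
def infer_position_type(values, enum_max: 12, min_observations: 20)
  # Too little data to say anything about the distribution.
  return :unknown if values.size < min_observations

  ints   = values.grep(/\A\d+\z/)
  floats = values.grep(/\A-?\d+\.\d+\z/)

  if ints.size == values.size
    numbers = ints.map(&:to_i)
    return :http_status if numbers.all? { |n| n.between?(100, 599) }
    return :year if numbers.all? { |n| n.between?(1900, 2100) }
  elsif !ints.empty? && !floats.empty? && ints.size + floats.size == values.size
    return :number
  end

  # A small, bounded set of distinct values seen many times.
  return :enum if values.uniq.size <= enum_max

  :unknown
end
```

A single `200` tells this function nothing; twenty observations that all land in 100–599 are what tip it to `:http_status`.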
233
199
 
234
- Installing the gem installs an `iriq` executable. Two main modes:
200
+ ## Corpus (streaming + learning)
235
201
 
236
- **Single input**: combined parse + normalize summary; trim with section
237
- flags (`-p`, `-n`).
202
+ For processing many identifiers (possibly an unbounded stream), point iriq at a
203
+ corpus. It maintains rolling aggregates and per-(host, prefix) frequency stats,
204
+ so classification improves as more data comes in.
238
205
 
239
- ```
240
- $ iriq foo.com/users/456
241
- # parse
242
- original: foo.com/users/456
243
- kind: url
244
- scheme: https
245
- host: foo.com
246
- path_segments: ["users", "456"]
247
- canonical: https://foo.com/users/456
206
+ `--corpus PATH` makes the corpus survive across invocations. A `.db` /
207
+ `.sqlite` / `.sqlite3` path is stored in SQLite (WAL journaling, incremental
208
+ UPSERTs — multiple `iriq --corpus` processes can write concurrently); a
209
+ `.json` path writes a plain JSON file instead.
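The extension-to-backend mapping described above is small enough to sketch directly. `storage_backend_for` is a hypothetical helper, and raising on an unrecognized extension is an assumption: the README does not say what happens in that case.

```ruby
# Map a corpus path to a storage backend by its file extension.
def storage_backend_for(path)
  case File.extname(path).downcase
  when ".db", ".sqlite", ".sqlite3" then :sqlite
  when ".json" then :json
  else
    # Assumed behavior; the source does not specify this case.
    raise ArgumentError, "unrecognized corpus extension: #{path}"
  end
end
```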
248
210
 
249
- # normalize
250
- https://foo.com/users/{user_id}
211
+ ### Re-runnable inference
251
212
 
252
- $ iriq -n https://foo.com/users/123
253
- https://foo.com/users/{user_id}
213
+ A corpus persists the source-IRI log alongside the materialized views.
214
+ `--reinfer` drops every view and replays the log through the current classifier
215
+ and reducers. Tune a threshold, swap in a different classifier, or activate new
216
+ recognizers (below) — then reinfer to see the new results without re-feeding
217
+ URLs.
218
+
219
+ ```sh
220
+ $ iriq --corpus c.db --reinfer
254
221
  ```
255
222
 
256
- **Piped stdin**: extraction runs by default. Output auto-switches: small
257
- inputs get a deduplicated URL list, larger inputs (≥ 10 IRIs) get the
258
- cluster view via an ephemeral corpus. Section flags work too, emitting one
259
- normalized URL / parsed record per extracted IRI.
223
+ ## Learning new types
224
+
225
+ Iriq doesn't just classify against a fixed list; it watches the stream and
226
+ *proposes new recognizers* for patterns it keeps seeing. Notice `ghp_…` or
227
+ `cus_…` recurring at a slug position and iriq will suggest a recognizer for it,
228
+ with evidence: coverage, host count, confidence. Proposals are never
229
+ auto-applied — you activate the ones you trust, and they persist with the
230
+ corpus. Human-in-the-loop by design.
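As a rough illustration of how a prefix like `cus_` could turn into a proposal, the sketch below counts candidate prefixes among the values seen at one position and keeps those that clear two thresholds. The defaults mirror the `--min-observations` (20) and `--min-coverage` (0.7) flags from the CLI reference; the prefix pattern, `Proposal`, and `propose_prefixes` are assumptions, and this is not the gem's actual proposal logic.

```ruby
Proposal = Struct.new(:prefix, :observations, :coverage)

# values: every string observed at one position.
def propose_prefixes(values, min_observations: 20, min_coverage: 0.7)
  return [] if values.empty?

  counts = Hash.new(0)
  values.each do |value|
    # Assumed pattern: 2 to 5 lowercase letters followed by an underscore.
    prefix = value[/\A([a-z]{2,5}_)/, 1]
    counts[prefix] += 1 if prefix
  end

  counts.filter_map do |prefix, count|
    coverage = count.to_f / values.size
    next unless count >= min_observations && coverage >= min_coverage

    Proposal.new(prefix, count, coverage)
  end
end
```

The function only returns proposals; deciding which ones to activate stays with the caller, which matches the human-in-the-loop design.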
260
231
 
232
+ ```sh
233
+ # Print proposals (human-readable, or --json)
234
+ $ iriq --corpus c.db --propose-recognizers
235
+
236
+ # Auto-activate every proposal with confidence ≥ 0.9, then reinfer
237
+ $ iriq --corpus c.db --propose-recognizers --activate-above 0.9
261
238
  ```
262
- $ cat short.txt | iriq
263
- [2] https://github.com/dpep/iriq
264
- [1] https://foo.com/users
265
239
 
266
- $ cat short.txt | iriq -n # normalized URL per line
267
- https://github.com/dpep/iriq
268
- https://foo.com/users
240
+ ### Cross-host shape learning
269
241
 
270
- $ cat access.log | iriq # ≥ 10 IRIs → cluster view
271
- [190] docs.example.com /users/{user_id}
272
- [186] app.example.com /users/{user_id}
273
- ...
242
+ A route shape that recurs across multiple hosts is independent evidence of a
243
+ semantic pattern — two unrelated hosts inventing the same `/users/{integer}`
244
+ structure by accident is unlikely.
274
245
 
275
- $ cat README.md | iriq --stats # rolling aggregates
276
- $ cat README.md | iriq cluster # force cluster view
277
- $ cat README.md | iriq --corpus c.json # persist into a corpus
246
+ ```sh
247
+ $ iriq --corpus c.db --cross-host-shapes [--min-hosts N]
278
248
  ```
279
249
 
280
- `--corpus PATH` makes the corpus survive across invocations (atomic JSON
281
- file). Once it has data, `-n` becomes corpus-informed:
250
+ The same signal feeds back into proposal `confidence`: each additional host
251
+ beyond the first adds `0.05` to the score (capped at 1.0), so a prefix proposed
252
+ on 5 hosts is meaningfully stronger than the same coverage seen on 1 host.
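The host bonus is plain arithmetic, sketched here. `with_host_bonus` is a hypothetical name, and how the base confidence itself is computed is not covered by the README, so it is simply taken as an input.

```ruby
# Add 0.05 per host beyond the first, capped at 1.0.
def with_host_bonus(base_confidence, host_count, per_host: 0.05)
  extra_hosts = [host_count - 1, 0].max
  [base_confidence + per_host * extra_hosts, 1.0].min
end
```

A proposal with a base score of 0.7 seen on 5 hosts gains four increments and lands at roughly 0.9; the same base score on a single host stays at 0.7.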
282
253
 
283
- ```
284
- $ for n in alice bob carol dave erin frank gina hank ivan jane; do
285
- iriq --corpus c.json https://foo.com/users/$n/profile >/dev/null
286
- done
254
+ ## Extracting IRIs from text
287
255
 
288
- $ iriq -n --corpus c.json https://foo.com/users/zoe/profile
289
- https://foo.com/users/{user}/profile # mechanical would keep "zoe"
290
- ```
256
+ Pipe-mode extraction picks up explicit-scheme URLs (`http`, `https`, `ftp`,
257
+ `ws`, `wss`, `urn`) and `foo.com/path`-style scheme-less URLs (small TLD
258
+ allow-list, required path). It trims trailing sentence punctuation and preserves
259
+ balanced parens (`https://en.wikipedia.org/wiki/Ruby_(programming_language)`
260
+ stays intact; `(see https://foo.com)` drops the outer paren).
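The trimming behavior can be sketched as a small loop: strip one trailing character at a time, and only drop a closing paren when it has no matching opener inside the URL. `trim_trailing` and the exact punctuation set are assumptions for illustration, not the extractor's real code.

```ruby
# Strip trailing sentence punctuation and unbalanced closing parens.
def trim_trailing(url)
  loop do
    last = url[-1]
    if last && ".,;:!?".include?(last)
      url = url[0...-1]
    elsif last == ")" && url.count(")") > url.count("(")
      # More closers than openers: this paren belongs to the prose.
      url = url[0...-1]
    else
      return url
    end
  end
end
```

The loop matters for inputs like `https://foo.com/a).`, where removing the period exposes a stray paren that also has to go.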
291
261
 
292
- Flags:
262
+ Known limitations (intentional):
263
+
264
+ - Comma is a URL boundary, so query strings like `?q=37.7,-122.4` truncate.
265
+ Trade-off picked to keep CSV-shaped text working.
266
+ - No HTML entity decoding (`&amp;` stays as-is).
267
+ - Scheme-less mode skips bare hostnames without a path (too noisy in prose).
268
+
269
+ Disable scheme-less extraction with `--no-scheme-less`.
270
+
271
+ ## How it works
272
+
273
+ Under the shape sits one idea: **Position + Evidence**. A *Position* is a slot
274
+ in a host's structure — a typed path prefix, or a query-param name. *Evidence*
275
+ is everything the corpus has observed about that slot: which values, how often,
276
+ across how many hosts. Strings are observations; types are inferences drawn from
277
+ the pile. Shape is the surface you see; Position + Evidence is the engine
278
+ underneath. See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for the full model.
279
+
280
+ ## CLI reference
281
+
282
+ **Single input** — combined parse + normalize summary; trim with section flags
283
+ (`-p`, `-n`).
284
+
285
+ **Piped stdin** — extraction runs by default. Output auto-switches: small inputs
286
+ get a deduplicated URL list, larger inputs (≥ 10 IRIs) get the cluster view via
287
+ an ephemeral corpus.
293
288
 
294
289
  | Flag | Effect |
295
290
  | ------------------- | ------------------------------------------------------- |
296
291
  | `-p, --parse` | Show parsed fields |
297
292
  | `-n, --normalize` | Show the shape-normalized form |
293
+ | `-c, --canonical` | Show the canonical form (no shape normalization) |
298
294
  | `-j, --json` | Emit JSON |
299
- | `-N, --no-hints` | Use `{integer_id}` etc. instead of `{user_id}` |
295
+ | `-J, --ndjson` | Newline-delimited JSON (one object per line); implies `--json` |
296
+ | `-N, --no-hints` | Use `{integer}` etc. instead of `{user_id}` |
300
297
  | `--no-scheme-less` | Skip `foo.com/path`-style extraction (explicit-scheme only) |
301
- | `--corpus PATH` | Load/create a JSON corpus at PATH; observe and save |
298
+ | `--corpus PATH` | Load/create a corpus at PATH (`.json` or `.db`/`.sqlite`/`.sqlite3`) |
299
+ | `--host MODE` | Host-keying for clustering: `full` (default), `reg` strips subdomains, `none` ignores host |
302
300
  | `--stats` | Print rolling aggregates |
301
+ | `--reinfer` | Drop the materialized views and replay the source-IRI log through the current classifier + reducers |
302
+ | `--propose-recognizers` | Scan observed values for shape patterns that recur enough to suggest a new recognizer. Combine with `--json` for structured output |
303
+ | `--cross-host-shapes` | List route shapes that recur across multiple hosts |
304
+ | `--min-observations N` | Proposal threshold; default 20 |
305
+ | `--min-coverage F` | Proposal threshold; default 0.7 |
306
+ | `--min-hosts N` | Threshold for both proposals and cross-host shapes; default 1 / 2 respectively |
307
+ | `--activate-above F` | With `--propose-recognizers`, auto-activate every proposal whose confidence is ≥ F |
308
+ | `completion bash\|zsh` | Print shell completion script (Homebrew installs this automatically) |
303
309
  | `-V, --version` | Print version |
304
310
 
305
- A positional argument that doesn't parse as an IRI but IS an existing
306
- file is read and extracted from automatically — `iriq ./access.log` and
307
- `iriq /var/log/foo.log` Just Work. (Bare filenames like `README.md`
308
- may still parse as a URL; pipe with `cat` to disambiguate.)
311
+ A positional argument that doesn't parse as an IRI but IS an existing file is
312
+ read and extracted from automatically — `iriq ./access.log` and
313
+ `iriq /var/log/foo.log` Just Work. (Bare filenames like `README.md` may still
314
+ parse as a URL; pipe with `cat` to disambiguate.)
309
315
 
310
316
  Exit codes: `0` success, `1` usage error, `2` parse error.
311
317
 
312
- ## Performance
313
-
314
- Measured on the deterministic `IriGenerator` fixture (Ruby 3.4.9, single
315
- thread):
316
-
317
- | Operation | Throughput |
318
- | ------------------------ | ------------ |
319
- | `Iriq.parse` | ~260k URLs/s |
320
- | `Iriq.normalize` | ~148k URLs/s |
321
- | `Iriq.explain` | ~205k URLs/s |
322
- | `Iriq.extract` (prose) | ~9.6 MB/s |
323
- | `Corpus#observe` | ~80k URLs/s |
324
- | Corpus save/load (10k) | ~135 ms |
325
-
326
- Linear scaling holds through 100k observations; per-observation retained
327
- memory amortizes to ~100 bytes at that scale. Memoization caches are
328
- bounded by `CACHE_MAX = 10_000` (cleared when full) — overhead is a few
329
- hundred KB regardless of corpus size.
330
-
331
- Re-run anytime with:
332
-
333
- ```
334
- bundle exec script/benchmark.rb # throughput
335
- bundle exec script/memory.rb # retained memory + cache footprints
336
- ```
337
-
338
318
  ## Limitations (intentional)
339
319
 
340
- This is an MVP. Iriq does **not**:
320
+ Iriq does **not**:
341
321
 
342
322
  - Implement RFC 3986, RFC 3987, or the WHATWG URL standard fully.
343
323
  - Convert between Unicode (IRI) and punycode (URI) — the display form is
@@ -345,12 +325,11 @@ This is an MVP. Iriq does **not**:
345
325
  - Percent-encode or decode path/query bytes. Bytes are kept as written.
346
326
  - Validate scheme-specific structure beyond URL vs. URN.
347
327
  - Resolve relative references against a base URL.
348
- - Round-trip `canonical` back to the exact original byte-for-byte (whitespace
349
- is stripped, default ports are dropped, dot segments are collapsed).
328
+ - Round-trip `canonical` back to the exact original byte-for-byte (whitespace is
329
+ stripped, default ports are dropped, dot segments are collapsed).
350
330
 
351
- For richer IRI handling, see `addressable`. Iriq's focus is the analysis
352
- side: classification, normalization, and clustering — not a complete URL
353
- implementation.
331
+ Iriq's focus is the analysis side: classification, normalization, and clustering
332
+ — not a complete URL implementation.
354
333
 
355
334
  ----
356
335
  ## Contributing
@@ -359,7 +338,7 @@ Yes please :)
359
338
 
360
339
  1. Fork it
361
340
  1. Create your feature branch (`git checkout -b my-feature`)
362
- 1. Ensure the tests pass (`bundle exec rspec`)
341
+ 1. Ensure the tests pass (`cd rust && cargo test`)
363
342
  1. Commit your changes (`git commit -am 'awesome new feature'`)
364
343
  1. Push your branch (`git push origin my-feature`)
365
344
  1. Create a Pull Request