iriq 0.1.0 → 0.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 95d6bc09f7de65bcb4acc5db3ad68d1b83c326360d913c45aede66038a100461
- data.tar.gz: ae58d39b77fce3041cc5561b575dea52706b38f4cfab11d4881cf2f644f6cc59
+ metadata.gz: e629988f23137ecb0c0f4737e246f65a949fa2843f4bd16d244566ca76dc37ed
+ data.tar.gz: c36cae38205a6a6f63a8a38e40849c922bbb6cf7046e9599aaaefc37fb443303
  SHA512:
- metadata.gz: a8fa85f112d9766ff4e9e4ad60c1043ce9701fe42e6dbd01f70da39fa0dd554bff573f40c858d88ab1e1539de0c1f017609535cc57e8e4fcc6651f0d20697e60
- data.tar.gz: 97c714e664874c08278a305b8beff470b68bd2b7e4f58977cc44cf2e3716e619d17011a33d77b8ea9356eec0715acb6c815252551edf64db16eec4599f816274
+ metadata.gz: 75d0329756d16dd7b9c8e5ca7b8ba447aa51c4d3c34a1ec31549bb6e6725338bfbdb2ee4badb18c52342c8ae0ece5f005e9288c617fa5e140d7cfff400274561
+ data.tar.gz: cb670e665a2e67feeb5f1bcad157a27e15da1aa61efc53e4ae66388a67ee498ef7caa1605be85d27e087fda44c399a9f93b6701242d1a22dec2653b667918c6d
data/CHANGELOG.md CHANGED
@@ -1,3 +1,12 @@
+ ### 0.2.0 (2026-05-25)
+ - Corpus storage backends: JSON (default) and SQLite, dispatched by file extension
+ - Go: `iriq.OpenCorpus(path)`; Ruby: `Iriq::Corpus.open(path)`
+ - SQLite backend: incremental UPSERTs, WAL mode, concurrent-safe via busy_timeout + BEGIN IMMEDIATE; checkpoints on close so the WAL sidecar doesn't grow unbounded
+ - Batch mode: `corpus.batch { ... }` (Ruby) / `corpus.Batch(fn)` (Go) wraps many observations in one transaction
+ - Clusterer now wraps the in-memory Storage backend; only one cluster code path
+ - script/bench_storage.sh — JSON vs SQLite timing across single-process, incremental, and concurrent workloads
+ - **Breaking (Go)**: `Corpus.HostCounts` / `PathLengthCounts` / `RawShapeCounts` / `FingerprintCounts` are methods now, not fields
+
  ### 0.1.0 (2026-05-24)
  - CLI: auto-detect file argument, retire --extract flag
  - CLI: section flags work in pipe mode + clean up help text
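
The changelog's "dispatched by file extension" rule can be sketched in a few lines of Ruby. This is only an illustration, not iriq's implementation: `backend_for`, `SQLITE_EXTENSIONS`, and the returned symbols are hypothetical names, and sending every unrecognized extension to JSON is an assumption based on JSON being described as the default.

```ruby
# Hypothetical sketch of extension-based backend dispatch.
# `backend_for`, `SQLITE_EXTENSIONS`, and the :json / :sqlite symbols are
# illustrative names, not part of iriq's API.
SQLITE_EXTENSIONS = %w[.db .sqlite .sqlite3].freeze

def backend_for(path)
  ext = File.extname(path).downcase
  # Assumption: anything that is not a known SQLite extension falls back
  # to the JSON backend, since the changelog calls JSON the default.
  SQLITE_EXTENSIONS.include?(ext) ? :sqlite : :json
end

puts backend_for("c.json")   # JSON backend
puts backend_for("c.sqlite3") # SQLite backend
```

Keeping the mapping in one small function would let the CLI's `--corpus PATH` flag and the library entry points share the same rule.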
data/CLAUDE.md ADDED
@@ -0,0 +1,121 @@
+ # Iriq development conventions
+
+ ## Repo layout — Ruby and Go intermixed at the root
+
+ We chose to mix Ruby and Go at the repo root rather than nest the Go module
+ under `/go/`. The signal is "both implementations are peers, not one-is-primary."
+
+ ```
+ iriq/
+ lib/ exe/ spec/ ← Ruby gem (library, CLI, specs)
+ iriq.gemspec
+ Gemfile
+
+ go.mod ← module github.com/dpep/iriq
+ *.go ← Go package `iriq` at the root
+ cmd/iriq/ ← Go CLI binary
+ bin/ ← built Go binary (gitignored)
+
+ script/ ← shared dev scripts (fixture gen, parity, benches)
+ spec/fixtures/ ← golden JSON shared by Ruby specs + Go tests
+ .github/workflows/ ← Ruby CI, Go CI, parity CI
+ ```
+
+ Trade-offs of this layout:
+
+ - Clean import path: `github.com/dpep/iriq` (no `/go/` artifact in consumers' code).
+ - One version tag (`vX.Y.Z`) serves both runtimes — Ruby's gemspec and Go's
+ module use the same tag stream.
+ - Root `ls` is busier (~15 `.go` files next to Ruby ones), accepted in exchange.
+ - The gemspec explicitly excludes Go files so `gem build` doesn't ship them:
+ `git ls-files * ':!:spec' ':!:script' ':!:cmd' ':!:bin' ':!:*.go' ':!:go.mod' ':!:go.sum'`.
+
+ ## Building
+
+ ```sh
+ # Ruby gem
+ bundle install
+ bundle exec exe/iriq --help # runs the CLI from source
+
+ # Go binary — convenience targets in the Makefile
+ make build # → ./bin/iriq
+ make install # go install into $GOBIN
+ make uninstall # remove from $GOBIN
+ make clean # remove ./bin/
+ make test # go test ./...
+
+ # Both via Homebrew
+ brew install dpep/tools/iriq # uses the Ruby gem under the hood
+ ```
+
+ ## Keeping Ruby and Go in sync
+
+ The Ruby gem is the **reference implementation**. Go mirrors its public API
+ and behavior. Two layers of parity testing keep them aligned:
+
+ 1. **Golden JSON fixtures** (`spec/fixtures/*.json`)
+ Generated by `script/generate_fixtures.rb` from the Ruby implementation
+ over a curated set of inputs. Go's `fixtures_test.go` loads each file
+ and asserts the same outputs from the Go side.
+
+ 2. **CLI parity harness** (`script/cli_parity.sh`)
+ Runs the same input through `bundle exec exe/iriq` and the Go binary and
+ diffs stdout. Lives in CI as the `Ruby ↔ Go parity` job.
+
+ When changing behavior:
+
+ 1. Update the Ruby code + specs first.
+ 2. Regenerate fixtures: `bundle exec ruby script/generate_fixtures.rb`.
+ 3. Port the change to Go.
+ 4. `go test ./...` (uses the updated fixtures).
+ 5. `script/cli_parity.sh` should pass.
+ 6. Commit fixtures with the change — CI will fail if they're stale.
+
+ ## Tests
+
+ ```sh
+ bundle exec rspec # Ruby suite (305+ examples)
+ go test ./... # Go suite (native + fixture parity tests)
+ script/cli_parity.sh # CLI parity (13+ scenarios)
+ ```
+
+ ## Releases
+
+ - One version tag covers both runtimes — bump `lib/iriq/version.rb` (and
+ optionally a matching constant on the Go side if we add one), tag `vX.Y.Z`,
+ push.
+ - `gem push iriq-X.Y.Z.gem` to publish to RubyGems.
+ - Update `Formula/iriq.rb` in the homebrew-tools tap to the new version.
+ - Go consumers pick up the tag automatically via `go get @vX.Y.Z`.
+
+ ## Corpus storage backends
+
+ The `Corpus` class delegates state to a `Storage` backend; three backends ship:
+
+ - **Memory** — default, in-process only.
+ - **JSON** — Memory wrapped with atomic load/save against a JSON file
+ (`.json` by default). Same shape both runtimes have always written.
+ - **SQLite** — incremental UPSERTs against a `.db` / `.sqlite` / `.sqlite3`
+ file with WAL journaling. Supports concurrent observers and avoids
+ loading the whole corpus into memory.
+
+ `Corpus.open(path)` (Ruby) / `iriq.OpenCorpus(path)` (Go) picks the backend
+ by file extension. `corpus.save(other_path)` exports as JSON regardless of
+ the live backend; `corpus.save(same_path)` is idempotent (no clobbering a
+ SQLite file with JSON, etc.).
+
+ The Ruby `sqlite3` gem is loaded lazily (only when a `.db` path is opened),
+ keeping the iriq install footprint minimal for users that stick with JSON.
+ On the Go side we use `modernc.org/sqlite` (pure Go — no cgo).
+
+ When adding a new backend, replicate the contract in both languages and
+ add a parity scenario in `script/cli_parity.sh`'s `corpus_pair` section.
+
+ ## What lives where in scripts
+
+ - `script/benchmark.rb` — Ruby-only throughput benchmark.
+ - `script/memory.rb` — Ruby-only memory profile.
+ - `script/generate_fixtures.rb` — produces `spec/fixtures/*.json` for cross-runtime parity.
+ - `script/cli_parity.sh` — Ruby ↔ Go CLI diff.
+ - `script/bench_compare.sh` — Ruby vs Go CLI wall-time comparison.
+ - `script/bench_storage.sh` — JSON vs SQLite backend timing (single-process, incremental, concurrent).
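
The batch mode described in the changelog (many observations wrapped in one transaction) can be illustrated without a database. The sketch below is a minimal stand-in under stated assumptions: `CountingStore` is a hypothetical class that only counts commits, and it is not iriq's `Storage` contract or its SQLite backend.

```ruby
# Hypothetical stand-in for a storage backend; it only counts commits so
# the effect of batching is visible. Not iriq's Storage API.
class CountingStore
  attr_reader :commits, :observations

  def initialize
    @commits = 0
    @observations = []
    @in_batch = false
  end

  # Record one observation; commit immediately unless a batch is open.
  def observe(iri)
    @observations << iri
    commit unless @in_batch
  end

  # Run the block with commits deferred, then commit once at the end.
  # If the block raises, the single commit is skipped.
  def batch
    return yield if @in_batch # nested batches join the outer one

    @in_batch = true
    begin
      yield
      commit
    ensure
      @in_batch = false
    end
  end

  private

  def commit
    @commits += 1
  end
end

store = CountingStore.new
urls = %w[https://foo.com/users/1 https://foo.com/users/2 https://foo.com/users/3]

urls.each { |u| store.observe(u) }                 # three separate commits
store.batch { urls.each { |u| store.observe(u) } } # one commit for all three
puts store.commits
```

With a real transactional backend, the single commit at the end of `batch` would presumably be where the per-observation write cost is amortized.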
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
  PATH
  remote: .
  specs:
- iriq (0.1.0)
+ iriq (0.2.0)

  GEM
  remote: https://rubygems.org/
@@ -19,6 +19,7 @@ GEM
  prism (>= 1.3.0)
  rdoc (>= 4.0.0)
  reline (>= 0.4.2)
+ mini_portile2 (2.8.9)
  pp (0.6.3)
  prettyprint
  prettyprint (0.2.0)
@@ -53,6 +54,8 @@ GEM
  simplecov_json_formatter (~> 0.1)
  simplecov-html (0.13.2)
  simplecov_json_formatter (0.1.4)
+ sqlite3 (2.9.4)
+ mini_portile2 (~> 2.8.0)
  stringio (3.2.0)
  tsort (0.2.0)

@@ -65,6 +68,7 @@ DEPENDENCIES
  rspec (>= 3.10)
  rspec-debugging
  simplecov (>= 0.22)
+ sqlite3 (>= 1.6)

  CHECKSUMS
  date (3.5.1) sha256=750d06384d7b9c15d562c76291407d89e368dda4d4fff957eb94962d325a0dc0
@@ -74,7 +78,8 @@ CHECKSUMS
  erb (6.0.4) sha256=38e3803694be357fe2bfe312487c74beaf9fb4e5beb3e22498952fe1645b95d9
  io-console (0.8.2) sha256=d6e3ae7a7cc7574f4b8893b4fca2162e57a825b223a177b7afa236c5ef9814cc
  irb (1.17.0) sha256=168c4ddb93d8a361a045c41d92b2952c7a118fa73f23fe14e55609eb7a863aae
- iriq (0.1.0)
+ iriq (0.2.0)
+ mini_portile2 (2.8.9) sha256=0cd7c7f824e010c072e33f68bc02d85a00aeb6fce05bb4819c03dfd3c140c289
  pp (0.6.3) sha256=2951d514450b93ccfeb1df7d021cae0da16e0a7f95ee1e2273719669d0ab9df6
  prettyprint (0.2.0) sha256=2bc9e15581a94742064a3cc8b0fb9d45aae3d03a1baa6ef80922627a0766f193
  prism (1.9.0) sha256=7b530c6a9f92c24300014919c9dcbc055bf4cdf51ec30aed099b06cd6674ef85
@@ -90,6 +95,7 @@ CHECKSUMS
  simplecov (0.22.0) sha256=fe2622c7834ff23b98066bb0a854284b2729a569ac659f82621fc22ef36213a5
  simplecov-html (0.13.2) sha256=bd0b8e54e7c2d7685927e8d6286466359b6f16b18cb0df47b508e8d73c777246
  simplecov_json_formatter (0.1.4) sha256=529418fbe8de1713ac2b2d612aa3daa56d316975d307244399fa4838c601b428
+ sqlite3 (2.9.4) sha256=6161c5b9c17886b289558e6c8082b28a22a814736d2433c9a67f4c6bfcde5c97
  stringio (3.2.0) sha256=c37cb2e58b4ffbd33fe5cd948c05934af997b36e0b6ca6fdf43afa234cf222e1
  tsort (0.2.0) sha256=9650a793f6859a43b6641671278f79cfead60ac714148aabe4e3f0060480089f

data/Makefile ADDED
@@ -0,0 +1,56 @@
+ # Iriq Go binary — build/install/clean/uninstall helpers.
+ #
+ # make - same as `make help`
+ # make build - build into ./bin/iriq
+ # make install - go install into $GOBIN (defaults to $GOPATH/bin)
+ # make test - go test ./...
+ # make clean - remove ./bin/
+ # make uninstall - remove the binary from $GOBIN
+ #
+ # Ruby gem build/install is handled by Bundler/RubyGems; see CLAUDE.md.
+
+ GO ?= go
+ BIN_DIR := bin
+ BIN := $(BIN_DIR)/iriq
+ PKG := ./cmd/iriq
+
+ # Resolve $GOBIN, falling back to $GOPATH/bin (Go's default install location).
+ GOBIN := $(shell $(GO) env GOBIN)
+ ifeq ($(GOBIN),)
+ GOBIN := $(shell $(GO) env GOPATH)/bin
+ endif
+ INSTALLED := $(GOBIN)/iriq
+
+ .DEFAULT_GOAL := help
+ .PHONY: help build install test clean uninstall
+
+ help:
+ @echo "Iriq Go targets:"
+ @echo " make build build into $(BIN)"
+ @echo " make install go install into $(GOBIN)"
+ @echo " make test run go test ./..."
+ @echo " make clean remove $(BIN_DIR)/"
+ @echo " make uninstall remove $(INSTALLED)"
+
+ build:
+ @mkdir -p $(BIN_DIR)
+ $(GO) build -o $(BIN) $(PKG)
+ @echo "built $(BIN)"
+
+ install:
+ $(GO) install $(PKG)
+ @echo "installed $(INSTALLED)"
+
+ test:
+ $(GO) test ./...
+
+ clean:
+ rm -rf $(BIN_DIR)
+ @echo "removed $(BIN_DIR)/"
+
+ uninstall:
+ @if [ -f "$(INSTALLED)" ]; then \
+ rm "$(INSTALLED)" && echo "removed $(INSTALLED)"; \
+ else \
+ echo "not installed at $(INSTALLED)"; \
+ fi
data/README.md CHANGED
@@ -3,19 +3,70 @@ Iriq
  ![Gem](https://img.shields.io/gem/dt/iriq?style=plastic)
  [![codecov](https://codecov.io/gh/dpep/iriq/branch/main/graph/badge.svg)](https://codecov.io/gh/dpep/iriq)

- Semantic IRI / URI / URL / URN normalization and clustering for Ruby.
+ IRI extraction, normalization, and clustering.

- Iriq parses resource identifiers, normalizes them into canonical IRI-like
- forms, classifies path and query components, clusters similar identifiers,
- and explains which parts are stable vs. unique.
+ Iriq pulls IRIs out of free text, parses them, normalizes them into
+ canonical shape-aware forms, classifies their path and query components,
+ and clusters similar identifiers surfacing what's stable vs. unique.
+
+ Ships as both a **command-line tool** (`iriq`) and a **library** (Ruby and
+ Go — same behavior, enforced by parity tests).
+
+ ## Install
+
+ The CLI is available three ways. Pick whichever fits your workflow:
+
+ ```sh
+ # Homebrew (recommended)
+ brew install dpep/tools/iriq
+
+ # RubyGems — installs the CLI shim and the library
+ gem install iriq
+
+ # Go — installs the CLI binary into $GOBIN
+ go install github.com/dpep/iriq/cmd/iriq@latest
+ ```
+
+ For library use, depend on whichever runtime you're working in:

  ```ruby
- require "iriq"
+ # Gemfile
+ gem "iriq"
+ ```
+
+ ```go
+ import "github.com/dpep/iriq"
+ ```
+
+ ## CLI quick start
+
+ ```
+ $ iriq https://foo.com/users/123
+ # parse
+ original: https://foo.com/users/123
+ kind: url
+ scheme: https
+ host: foo.com
+ path_segments: ["users", "123"]
+ canonical: https://foo.com/users/123
+
+ # normalize
+ https://foo.com/users/{user_id}
+
+ $ iriq -n https://foo.com/users/123
+ https://foo.com/users/{user_id}
+
+ $ cat access.log | iriq # extract → URL list (or clusters at scale)
+ $ cat access.log | iriq --stats # rolling aggregates
+ $ iriq ./access.log -n # file auto-detected → normalize each found URL
  ```

- ## Quick start
+ Full CLI reference is below under [CLI](#cli).
+
+ ## Library quick start

  ```ruby
+ # Ruby
  iri = Iriq.parse("https://foo.com/users/123")
  iri.scheme # => "https"
  iri.host # => "foo.com"
@@ -34,6 +85,22 @@ Iriq.explain("https://foo.com/users/123/orders/456")
  # ]
  ```

+ ```go
+ // Go (same surface)
+ iri, _ := iriq.Parse("https://foo.com/users/123")
+ iri.Scheme // "https"
+ iri.Host // "foo.com"
+ iri.PathSegments // []string{"users", "123"}
+ iri.Canonical() // "https://foo.com/users/123"
+
+ norm, _ := iriq.Normalize("https://foo.com/users/123")
+ // "https://foo.com/users/{user_id}"
+ ```
+
+ The Ruby gem is the reference implementation; Go mirrors its API and is
+ kept in sync via JSON fixtures plus a CLI parity harness. See
+ [CLAUDE.md](CLAUDE.md) for the dev process.
+
  Pass `hints: false` to `Iriq.normalize` (or `PathShape`) for mechanical
  placeholders (`{integer_id}` instead of `{user_id}`).

@@ -277,18 +344,31 @@ $ cat README.md | iriq cluster # force cluster view
  $ cat README.md | iriq --corpus c.json # persist into a corpus
  ```

- `--corpus PATH` makes the corpus survive across invocations (atomic JSON
- file). Once it has data, `-n` becomes corpus-informed:
+ `--corpus PATH` makes the corpus survive across invocations. The file
+ extension picks the storage backend:
+
+ - `.json` — a single atomically-written JSON file (default). Best for small
+ corpora and when you want the data human-readable.
+ - `.db` / `.sqlite` / `.sqlite3` — a SQLite database with WAL journaling.
+ Each observation is an incremental UPSERT, so multiple `iriq --corpus`
+ processes can write concurrently without clobbering each other, and the
+ cost of opening doesn't scale with corpus size.
+
+ Once the corpus has data, `-n` becomes corpus-informed:

  ```
  $ for n in alice bob carol dave erin frank gina hank ivan jane; do
- iriq --corpus c.json https://foo.com/users/$n/profile >/dev/null
+ iriq --corpus c.db https://foo.com/users/$n/profile >/dev/null
  done

- $ iriq -n --corpus c.json https://foo.com/users/zoe/profile
+ $ iriq -n --corpus c.db https://foo.com/users/zoe/profile
  https://foo.com/users/{user}/profile # mechanical would keep "zoe"
  ```

+ Library: `Iriq::Corpus.open("c.db")` (or `iriq.OpenCorpus("c.db")` in Go)
+ dispatches on the same extension rules. `corpus.save("export.json")`
+ exports any backend as JSON.
+
  Flags:

  | Flag | Effect |
@@ -298,7 +378,7 @@ Flags:
  | `-j, --json` | Emit JSON |
  | `-N, --no-hints` | Use `{integer_id}` etc. instead of `{user_id}` |
  | `--no-scheme-less` | Skip `foo.com/path`-style extraction (explicit-scheme only) |
- | `--corpus PATH` | Load/create a JSON corpus at PATH; observe and save |
+ | `--corpus PATH` | Load/create a corpus at PATH (`.json` or `.db`/`.sqlite`/`.sqlite3`) |
  | `--stats` | Print rolling aggregates |
  | `-V, --version` | Print version |

@@ -352,6 +432,25 @@ For richer IRI handling, see `addressable`. Iriq's focus is the analysis
  side: classification, normalization, and clustering — not a complete URL
  implementation.

+ ----
+ ## Go port
+
+ A Go implementation lives under [`go/`](go/) — same public surface, same
+ behavior, ~10× faster CLI on extraction-heavy workloads. The Ruby gem is
+ the reference; the Go port stays in sync via golden JSON fixtures
+ (`spec/fixtures/`) and a CLI parity harness (`script/cli_parity.sh`), both
+ checked in CI.
+
+ ```go
+ import "github.com/dpep/iriq/go/iriq"
+
+ iri, _ := iriq.Parse("https://foo.com/users/123")
+ norm, _ := iriq.Normalize("https://foo.com/users/123")
+ // "https://foo.com/users/{user_id}"
+ ```
+
+ See [`go/README.md`](go/README.md) for the full API table and porting workflow.
+
  ----
  ## Contributing

@@ -360,6 +459,8 @@ Yes please :)
  1. Fork it
  1. Create your feature branch (`git checkout -b my-feature`)
  1. Ensure the tests pass (`bundle exec rspec`)
+ 1. If you changed library behavior, port the change to Go (or open an
+ issue) and regenerate fixtures: `bundle exec ruby script/generate_fixtures.rb`
  1. Commit your changes (`git commit -am 'awesome new feature'`)
  1. Push your branch (`git push origin my-feature`)
  1. Create a Pull Request
data/iriq.gemspec CHANGED
@@ -4,13 +4,13 @@ Gem::Specification.new do |s|
    s.name = "iriq"
    s.version = Iriq::VERSION
    s.authors = ["Daniel Pepper"]
-   s.description = "Semantic IRI/URI/URL/URN parsing, normalization, classification, and clustering."
-   s.files = `git ls-files * ':!:spec'`.split("\n")
+   s.description = "IRI extraction, normalization, and clustering."
+   s.files = `git ls-files * ':!:spec' ':!:script' ':!:cmd' ':!:bin' ':!:*.go' ':!:go.mod' ':!:go.sum'`.split("\n")
    s.bindir = "exe"
    s.executables = ["iriq"]
    s.homepage = "https://github.com/dpep/iriq"
    s.license = "MIT"
-   s.summary = "Semantic IRI normalization and clustering."
+   s.summary = "IRI extraction, normalization, and clustering."

    s.required_ruby_version = ">= 3.2"

@@ -18,4 +18,5 @@ Gem::Specification.new do |s|
    s.add_development_dependency 'rspec', '>= 3.10'
    s.add_development_dependency 'rspec-debugging'
    s.add_development_dependency 'simplecov', '>= 0.22'
+   s.add_development_dependency 'sqlite3', '>= 1.6'
  end
data/lib/iriq/cli.rb CHANGED
@@ -150,9 +150,7 @@ module Iriq
      end

      def load_corpus(path)
-       return Corpus.load(path) if File.exist?(path)
-
-       Corpus.new
+       Corpus.open(path)
      end

      def print_usage(io, code)
@@ -193,7 +191,7 @@ module Iriq
      def cmd_batch(args, opts, corpus, explicit_cluster: false)
        corpus ||= Corpus.new
        iris = extract_text(read_text(args.first), opts)
-       iris.each { |iri| corpus.observe(iri) }
+       corpus.batch { iris.each { |iri| corpus.observe(iri) } }

        if opts[:sections].any?
          emit_per_iri_sections(iris, opts)
@@ -298,6 +296,9 @@ module Iriq
        end
      end

+     # Compact identifier hash for parse output (both JSON and human). Drops
+     # nil values and empty collections so URN dumps don't carry empty
+     # host/path/query slots, and URL dumps don't include null fragment/nss.
      def identifier_hash(iri)
        {
          original: iri.original,
@@ -310,7 +311,7 @@ module Iriq
          fragment: iri.fragment,
          nss: iri.nss,
          canonical: iri.canonical,
-       }
+       }.reject { |_, v| v.nil? || (v.respond_to?(:empty?) && v.empty?) }
      end

      def emit_sections(data, sections)
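
The `.reject` filter added to `identifier_hash` can be tried on a plain hash. In the sketch below, `compact_fields` is a hypothetical helper name and the sample values are made up for illustration; only the block passed to `reject` is taken from the diff.

```ruby
# Same filter as the `.reject` added to identifier_hash, applied to a
# plain hash. `compact_fields` and the sample values are illustrative.
def compact_fields(hash)
  hash.reject { |_, v| v.nil? || (v.respond_to?(:empty?) && v.empty?) }
end

url_like = {
  original: "https://foo.com/users/123",
  scheme: "https",
  host: "foo.com",
  path_segments: ["users", "123"],
  query: {},     # empty collection: dropped
  fragment: nil, # nil: dropped
  nss: nil,
}

p compact_fields(url_like).keys
# => [:original, :scheme, :host, :path_segments]
```

Because the check is `respond_to?(:empty?)`, empty strings are dropped along with empty arrays and hashes, while values such as `0` or `false` are kept.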
data/lib/iriq/cluster.rb CHANGED
@@ -77,5 +77,29 @@ module Iriq
        cluster.instance_variable_set(:@segment_counts, h["segment_counts"].map { |sub| Hash.new(0).merge(sub) })
        cluster
      end
+
+     # Shared cluster-key derivation. Returns [key, host, scheme, shape] —
+     # callers that already have a hinted shape can pass it in to skip the
+     # recomputation; URN inputs ignore the override and always derive their
+     # own shape from the NSS value.
+     def self.key_for(iri, classifier:, shape: nil)
+       if iri.urn?
+         ns, value = (iri.nss || "").split(":", 2)
+         derived = value ? urn_value_shape(ns, value, classifier) : nil
+         key = "urn:#{ns}:#{derived}"
+         [key, nil, "urn", key]
+       else
+         shape ||= PathShape.new(classifier: classifier).for(iri.path_segments)
+         key = "#{iri.scheme}://#{iri.host}#{shape}"
+         [key, iri.host, iri.scheme, shape]
+       end
+     end
+
+     def self.urn_value_shape(ns, value, classifier)
+       entry = SegmentHints.derive([ns, value], classifier).last
+       return entry[:value] unless entry[:variable]
+
+       "{#{entry[:hint] || entry[:type]}}"
+     end
    end
  end
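
The URL branch of `Cluster.key_for` can be approximated in isolation to show why two URLs that differ only in an ID land in the same cluster. This is a rough sketch under stated assumptions: `SketchIri`, `mechanical_shape`, and `sketch_key_for` are hypothetical stand-ins, and the digit-only rule is far cruder than iriq's `PathShape` and classifier. The URN branch is omitted.

```ruby
# Simplified stand-ins: SketchIri and mechanical_shape are hypothetical and
# much cruder than iriq's Identifier and PathShape.
SketchIri = Struct.new(:scheme, :host, :path_segments, keyword_init: true)

# Replace all-digit segments with a mechanical placeholder.
def mechanical_shape(segments)
  "/" + segments.map { |s| s.match?(/\A\d+\z/) ? "{integer_id}" : s }.join("/")
end

# Mirrors the URL branch of Cluster.key_for: a caller-supplied shape wins,
# otherwise the shape is derived from the path segments.
def sketch_key_for(iri, shape: nil)
  shape ||= mechanical_shape(iri.path_segments)
  ["#{iri.scheme}://#{iri.host}#{shape}", iri.host, iri.scheme, shape]
end

a = SketchIri.new(scheme: "https", host: "foo.com", path_segments: %w[users 123])
b = SketchIri.new(scheme: "https", host: "foo.com", path_segments: %w[users 456])

key_a, = sketch_key_for(a)
key_b, = sketch_key_for(b)
puts key_a          # https://foo.com/users/{integer_id}
puts key_a == key_b # true
```

Passing `shape: "/users/{user_id}"` skips the derivation entirely, which is the override the comment on `key_for` describes for callers that already hold a hinted shape.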
@@ -3,31 +3,28 @@ module Iriq
    # `clusters` to read out the groups. `explain` annotates a single identifier
    # against the cluster it would fall into, including which positions are
    # stable across all observed members.
+   #
+   # Implemented as a thin wrapper over Storage::Memory — the same code path
+   # Corpus uses for the cluster portion of its state, so there's only one
+   # place that knows how clusters get stored.
    class Clusterer
      def initialize(classifier: SegmentClassifier::DEFAULT)
        @classifier = classifier
-       @clusters = {}
+       @storage = Storage::Memory.new(classifier: classifier)
      end

      def add(input, shape: nil)
        iri = coerce(input)
-       key, host, scheme, shape = cluster_key(iri, shape: shape)
-       cluster = @clusters[key] ||= Cluster.new(
-         key: key,
-         host: host,
-         scheme: scheme,
-         shape: shape,
-       )
-       cluster.add(iri)
-       cluster
+       key, host, scheme, derived = Cluster.key_for(iri, classifier: @classifier, shape: shape)
+       @storage.add_to_cluster(key, host, scheme, derived, iri)
      end

      def clusters
-       @clusters.values
+       @storage.clusters
      end

      def size
-       @clusters.size
+       @storage.cluster_size
      end

      # Returns a per-segment explanation for the input, merging classifier
@@ -36,8 +33,8 @@ module Iriq
      # would otherwise call them variable).
      def explain(input)
        iri = coerce(input)
-       key, * = cluster_key(iri)
-       cluster = @clusters[key]
+       key, * = Cluster.key_for(iri, classifier: @classifier)
+       cluster = clusters.find { |c| c.key == key }
        stats = cluster ? cluster.segment_stats : []
        hinted = SegmentHints.derive(iri.path_segments, @classifier)

@@ -50,43 +47,21 @@ module Iriq
        end
      end

-     private
-
-     def coerce(input)
-       input.is_a?(Identifier) ? input : Parser.parse(input)
-     end
-
-     def cluster_key(iri, shape: nil)
-       if iri.urn?
-         ns, value = (iri.nss || "").split(":", 2)
-         shape = value ? urn_value_shape(ns, value) : nil
-         key = "urn:#{ns}:#{shape}"
-         [key, nil, "urn", key]
-       else
-         shape ||= PathShape.new(classifier: @classifier).for(iri.path_segments)
-         key = "#{iri.scheme}://#{iri.host}#{shape}"
-         [key, iri.host, iri.scheme, shape]
-       end
-     end
-
-     def urn_value_shape(ns, value)
-       entry = SegmentHints.derive([ns, value], @classifier).last
-       return entry[:value] unless entry[:variable]
-
-       "{#{entry[:hint] || entry[:type]}}"
-     end
-
-     public
-
      def dump
-       { "clusters" => @clusters.transform_values(&:dump) }
+       { "clusters" => clusters.each_with_object({}) { |c, h| h[c.key] = c.dump } }
      end

      def self.from_dump(h, classifier: SegmentClassifier::DEFAULT)
        c = new(classifier: classifier)
        restored = h["clusters"].transform_values { |cdump| Cluster.from_dump(cdump) }
-       c.instance_variable_set(:@clusters, restored)
+       c.instance_variable_get(:@storage).instance_variable_set(:@clusters, restored)
        c
      end
+
+     private
+
+     def coerce(input)
+       input.is_a?(Identifier) ? input : Parser.parse(input)
+     end
    end
  end