iriq 0.1.0 → 0.2.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +9 -0
- data/CLAUDE.md +121 -0
- data/Gemfile.lock +8 -2
- data/Makefile +56 -0
- data/README.md +112 -11
- data/iriq.gemspec +4 -3
- data/lib/iriq/cli.rb +6 -5
- data/lib/iriq/cluster.rb +24 -0
- data/lib/iriq/clusterer.rb +19 -44
- data/lib/iriq/corpus.rb +123 -69
- data/lib/iriq/parser.rb +1 -1
- data/lib/iriq/storage/json.rb +43 -0
- data/lib/iriq/storage/memory.rb +138 -0
- data/lib/iriq/storage/sqlite.rb +367 -0
- data/lib/iriq/storage.rb +35 -0
- data/lib/iriq/version.rb +1 -1
- data/lib/iriq.rb +1 -0
- metadata +23 -6
- data/script/benchmark.rb +0 -81
- data/script/memory.rb +0 -121
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: e629988f23137ecb0c0f4737e246f65a949fa2843f4bd16d244566ca76dc37ed
+  data.tar.gz: c36cae38205a6a6f63a8a38e40849c922bbb6cf7046e9599aaaefc37fb443303
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 75d0329756d16dd7b9c8e5ca7b8ba447aa51c4d3c34a1ec31549bb6e6725338bfbdb2ee4badb18c52342c8ae0ece5f005e9288c617fa5e140d7cfff400274561
+  data.tar.gz: cb670e665a2e67feeb5f1bcad157a27e15da1aa61efc53e4ae66388a67ee498ef7caa1605be85d27e087fda44c399a9f93b6701242d1a22dec2653b667918c6d
data/CHANGELOG.md
CHANGED
@@ -1,3 +1,12 @@
+### 0.2.0 (2026-05-25)
+- Corpus storage backends: JSON (default) and SQLite, dispatched by file extension
+- Go: `iriq.OpenCorpus(path)`; Ruby: `Iriq::Corpus.open(path)`
+- SQLite backend: incremental UPSERTs, WAL mode, concurrent-safe via busy_timeout + BEGIN IMMEDIATE; checkpoints on close so the WAL sidecar doesn't grow unbounded
+- Batch mode: `corpus.batch { ... }` (Ruby) / `corpus.Batch(fn)` (Go) wraps many observations in one transaction
+- Clusterer now wraps the in-memory Storage backend; only one cluster code path
+- script/bench_storage.sh — JSON vs SQLite timing across single-process, incremental, and concurrent workloads
+- **Breaking (Go)**: `Corpus.HostCounts` / `PathLengthCounts` / `RawShapeCounts` / `FingerprintCounts` are methods now, not fields
+
 ### 0.1.0 (2026-05-24)
 - CLI: auto-detect file argument, retire --extract flag
 - CLI: section flags work in pipe mode + clean up help text
data/CLAUDE.md
ADDED
@@ -0,0 +1,121 @@
+# Iriq development conventions
+
+## Repo layout — Ruby and Go intermixed at the root
+
+We chose to mix Ruby and Go at the repo root rather than nest the Go module
+under `/go/`. The signal is "both implementations are peers, not one-is-primary."
+
+```
+iriq/
+  lib/ exe/ spec/       ← Ruby gem (library, CLI, specs)
+  iriq.gemspec
+  Gemfile
+
+  go.mod                ← module github.com/dpep/iriq
+  *.go                  ← Go package `iriq` at the root
+  cmd/iriq/             ← Go CLI binary
+  bin/                  ← built Go binary (gitignored)
+
+  script/               ← shared dev scripts (fixture gen, parity, benches)
+  spec/fixtures/        ← golden JSON shared by Ruby specs + Go tests
+  .github/workflows/    ← Ruby CI, Go CI, parity CI
+```
+
+Trade-offs of this layout:
+
+- Clean import path: `github.com/dpep/iriq` (no `/go/` artifact in consumers' code).
+- One version tag (`vX.Y.Z`) serves both runtimes — Ruby's gemspec and Go's
+  module use the same tag stream.
+- Root `ls` is busier (~15 `.go` files next to Ruby ones), accepted in exchange.
+- The gemspec explicitly excludes Go files so `gem build` doesn't ship them:
+  `git ls-files * ':!:spec' ':!:script' ':!:cmd' ':!:bin' ':!:*.go' ':!:go.mod' ':!:go.sum'`.
+
+## Building
+
+```sh
+# Ruby gem
+bundle install
+bundle exec exe/iriq --help     # runs the CLI from source
+
+# Go binary — convenience targets in the Makefile
+make build                      # → ./bin/iriq
+make install                    # go install into $GOBIN
+make uninstall                  # remove from $GOBIN
+make clean                      # remove ./bin/
+make test                       # go test ./...
+
+# Both via Homebrew
+brew install dpep/tools/iriq    # uses the Ruby gem under the hood
+```
+
+## Keeping Ruby and Go in sync
+
+The Ruby gem is the **reference implementation**. Go mirrors its public API
+and behavior. Two layers of parity testing keep them aligned:
+
+1. **Golden JSON fixtures** (`spec/fixtures/*.json`)
+   Generated by `script/generate_fixtures.rb` from the Ruby implementation
+   over a curated set of inputs. Go's `fixtures_test.go` loads each file
+   and asserts the same outputs from the Go side.
+
+2. **CLI parity harness** (`script/cli_parity.sh`)
+   Runs the same input through `bundle exec exe/iriq` and the Go binary and
+   diffs stdout. Lives in CI as the `Ruby ↔ Go parity` job.
+
+When changing behavior:
+
+1. Update the Ruby code + specs first.
+2. Regenerate fixtures: `bundle exec ruby script/generate_fixtures.rb`.
+3. Port the change to Go.
+4. `go test ./...` (uses the updated fixtures).
+5. `script/cli_parity.sh` should pass.
+6. Commit fixtures with the change — CI will fail if they're stale.
+
+## Tests
+
+```sh
+bundle exec rspec         # Ruby suite (305+ examples)
+go test ./...             # Go suite (native + fixture parity tests)
+script/cli_parity.sh      # CLI parity (13+ scenarios)
+```
+
+## Releases
+
+- One version tag covers both runtimes — bump `lib/iriq/version.rb` (and
+  optionally a matching constant on the Go side if we add one), tag `vX.Y.Z`,
+  push.
+- `gem push iriq-X.Y.Z.gem` to publish to RubyGems.
+- Update `Formula/iriq.rb` in the homebrew-tools tap to the new version.
+- Go consumers pick up the tag automatically via `go get @vX.Y.Z`.
+
+## Corpus storage backends
+
+The `Corpus` class delegates state to a `Storage` backend; three backends ship:
+
+- **Memory** — default, in-process only.
+- **JSON** — Memory wrapped with atomic load/save against a JSON file
+  (`.json` by default). Same shape both runtimes have always written.
+- **SQLite** — incremental UPSERTs against a `.db` / `.sqlite` / `.sqlite3`
+  file with WAL journaling. Supports concurrent observers and avoids
+  loading the whole corpus into memory.
+
+`Corpus.open(path)` (Ruby) / `iriq.OpenCorpus(path)` (Go) picks the backend
+by file extension. `corpus.save(other_path)` exports as JSON regardless of
+the live backend; `corpus.save(same_path)` is idempotent (no clobbering a
+SQLite file with JSON, etc.).
+
+The Ruby `sqlite3` gem is loaded lazily (only when a `.db` path is opened),
+keeping the iriq install footprint minimal for users that stick with JSON.
+On the Go side we use `modernc.org/sqlite` (pure Go — no cgo).
+
+When adding a new backend, replicate the contract in both languages and
+add a parity scenario in `script/cli_parity.sh`'s `corpus_pair` section.
+
+## What lives where in scripts
+
+- `script/benchmark.rb` — Ruby-only throughput benchmark.
+- `script/memory.rb` — Ruby-only memory profile.
+- `script/generate_fixtures.rb` — produces `spec/fixtures/*.json` for cross-runtime parity.
+- `script/cli_parity.sh` — Ruby ↔ Go CLI diff.
+- `script/bench_compare.sh` — Ruby vs Go CLI wall-time comparison.
+- `script/bench_storage.sh` — JSON vs SQLite backend timing (single-process, incremental, concurrent).
|
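Reviewer note: the "Corpus storage backends" section above says the backend is picked by file extension. A minimal sketch of that dispatch rule follows. The method name `backend_for`, the constant, and the returned symbols are hypothetical and are not taken from iriq's source. Treating an unrecognised extension as JSON and a missing path as the in-memory backend are assumptions based on "JSON (default)" and "Memory — default, in-process only"; case-insensitive matching is also an assumption.

```ruby
# Hypothetical illustration only; iriq's real dispatch lives in lib/iriq/storage.rb,
# which is not shown in this part of the diff.
SQLITE_EXTENSIONS = %w[.db .sqlite .sqlite3].freeze

# Returns a symbol naming the backend a given corpus path would select.
def backend_for(path)
  return :memory if path.nil? # assumption: no path means in-process only

  ext = File.extname(path.to_s).downcase
  SQLITE_EXTENSIONS.include?(ext) ? :sqlite : :json # assumption: JSON is the fallback
end

puts backend_for("c.db")
puts backend_for("c.json")
```

Keeping the rule in one small function makes it straightforward to mirror in a second runtime and to cover with a parity scenario.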
data/Gemfile.lock
CHANGED
@@ -1,7 +1,7 @@
 PATH
   remote: .
   specs:
-    iriq (0.
+    iriq (0.2.0)
 
 GEM
   remote: https://rubygems.org/
@@ -19,6 +19,7 @@ GEM
       prism (>= 1.3.0)
       rdoc (>= 4.0.0)
       reline (>= 0.4.2)
+    mini_portile2 (2.8.9)
     pp (0.6.3)
       prettyprint
     prettyprint (0.2.0)
@@ -53,6 +54,8 @@ GEM
       simplecov_json_formatter (~> 0.1)
     simplecov-html (0.13.2)
     simplecov_json_formatter (0.1.4)
+    sqlite3 (2.9.4)
+      mini_portile2 (~> 2.8.0)
     stringio (3.2.0)
     tsort (0.2.0)
 
@@ -65,6 +68,7 @@ DEPENDENCIES
   rspec (>= 3.10)
   rspec-debugging
   simplecov (>= 0.22)
+  sqlite3 (>= 1.6)
 
 CHECKSUMS
   date (3.5.1) sha256=750d06384d7b9c15d562c76291407d89e368dda4d4fff957eb94962d325a0dc0
@@ -74,7 +78,8 @@ CHECKSUMS
   erb (6.0.4) sha256=38e3803694be357fe2bfe312487c74beaf9fb4e5beb3e22498952fe1645b95d9
   io-console (0.8.2) sha256=d6e3ae7a7cc7574f4b8893b4fca2162e57a825b223a177b7afa236c5ef9814cc
   irb (1.17.0) sha256=168c4ddb93d8a361a045c41d92b2952c7a118fa73f23fe14e55609eb7a863aae
-  iriq (0.
+  iriq (0.2.0)
+  mini_portile2 (2.8.9) sha256=0cd7c7f824e010c072e33f68bc02d85a00aeb6fce05bb4819c03dfd3c140c289
   pp (0.6.3) sha256=2951d514450b93ccfeb1df7d021cae0da16e0a7f95ee1e2273719669d0ab9df6
   prettyprint (0.2.0) sha256=2bc9e15581a94742064a3cc8b0fb9d45aae3d03a1baa6ef80922627a0766f193
   prism (1.9.0) sha256=7b530c6a9f92c24300014919c9dcbc055bf4cdf51ec30aed099b06cd6674ef85
@@ -90,6 +95,7 @@ CHECKSUMS
   simplecov (0.22.0) sha256=fe2622c7834ff23b98066bb0a854284b2729a569ac659f82621fc22ef36213a5
   simplecov-html (0.13.2) sha256=bd0b8e54e7c2d7685927e8d6286466359b6f16b18cb0df47b508e8d73c777246
   simplecov_json_formatter (0.1.4) sha256=529418fbe8de1713ac2b2d612aa3daa56d316975d307244399fa4838c601b428
+  sqlite3 (2.9.4) sha256=6161c5b9c17886b289558e6c8082b28a22a814736d2433c9a67f4c6bfcde5c97
   stringio (3.2.0) sha256=c37cb2e58b4ffbd33fe5cd948c05934af997b36e0b6ca6fdf43afa234cf222e1
   tsort (0.2.0) sha256=9650a793f6859a43b6641671278f79cfead60ac714148aabe4e3f0060480089f
 
data/Makefile
ADDED
@@ -0,0 +1,56 @@
+# Iriq Go binary — build/install/clean/uninstall helpers.
+#
+#   make            - same as `make help`
+#   make build      - build into ./bin/iriq
+#   make install    - go install into $GOBIN (defaults to $GOPATH/bin)
+#   make test       - go test ./...
+#   make clean      - remove ./bin/
+#   make uninstall  - remove the binary from $GOBIN
+#
+# Ruby gem build/install is handled by Bundler/RubyGems; see CLAUDE.md.
+
+GO ?= go
+BIN_DIR := bin
+BIN := $(BIN_DIR)/iriq
+PKG := ./cmd/iriq
+
+# Resolve $GOBIN, falling back to $GOPATH/bin (Go's default install location).
+GOBIN := $(shell $(GO) env GOBIN)
+ifeq ($(GOBIN),)
+GOBIN := $(shell $(GO) env GOPATH)/bin
+endif
+INSTALLED := $(GOBIN)/iriq
+
+.DEFAULT_GOAL := help
+.PHONY: help build install test clean uninstall
+
+help:
+	@echo "Iriq Go targets:"
+	@echo "  make build      build into $(BIN)"
+	@echo "  make install    go install into $(GOBIN)"
+	@echo "  make test       run go test ./..."
+	@echo "  make clean      remove $(BIN_DIR)/"
+	@echo "  make uninstall  remove $(INSTALLED)"
+
+build:
+	@mkdir -p $(BIN_DIR)
+	$(GO) build -o $(BIN) $(PKG)
+	@echo "built $(BIN)"
+
+install:
+	$(GO) install $(PKG)
+	@echo "installed $(INSTALLED)"
+
+test:
+	$(GO) test ./...
+
+clean:
+	rm -rf $(BIN_DIR)
+	@echo "removed $(BIN_DIR)/"
+
+uninstall:
+	@if [ -f "$(INSTALLED)" ]; then \
+		rm "$(INSTALLED)" && echo "removed $(INSTALLED)"; \
+	else \
+		echo "not installed at $(INSTALLED)"; \
+	fi
data/README.md
CHANGED
@@ -3,19 +3,70 @@ Iriq
 
 [](https://codecov.io/gh/dpep/iriq)
 
-
+IRI extraction, normalization, and clustering.
 
-Iriq
-forms, classifies path and query components,
-and
+Iriq pulls IRIs out of free text, parses them, normalizes them into
+canonical shape-aware forms, classifies their path and query components,
+and clusters similar identifiers — surfacing what's stable vs. unique.
+
+Ships as both a **command-line tool** (`iriq`) and a **library** (Ruby and
+Go — same behavior, enforced by parity tests).
+
+## Install
+
+The CLI is available three ways. Pick whichever fits your workflow:
+
+```sh
+# Homebrew (recommended)
+brew install dpep/tools/iriq
+
+# RubyGems — installs the CLI shim and the library
+gem install iriq
+
+# Go — installs the CLI binary into $GOBIN
+go install github.com/dpep/iriq/cmd/iriq@latest
+```
+
+For library use, depend on whichever runtime you're working in:
 
 ```ruby
-
+# Gemfile
+gem "iriq"
+```
+
+```go
+import "github.com/dpep/iriq"
+```
+
+## CLI quick start
+
+```
+$ iriq https://foo.com/users/123
+# parse
+original: https://foo.com/users/123
+kind: url
+scheme: https
+host: foo.com
+path_segments: ["users", "123"]
+canonical: https://foo.com/users/123
+
+# normalize
+https://foo.com/users/{user_id}
+
+$ iriq -n https://foo.com/users/123
+https://foo.com/users/{user_id}
+
+$ cat access.log | iriq # extract → URL list (or clusters at scale)
+$ cat access.log | iriq --stats # rolling aggregates
+$ iriq ./access.log -n # file auto-detected → normalize each found URL
 ```
 
-
+Full CLI reference is below under [CLI](#cli).
+
+## Library quick start
 
 ```ruby
+# Ruby
 iri = Iriq.parse("https://foo.com/users/123")
 iri.scheme # => "https"
 iri.host # => "foo.com"
@@ -34,6 +85,22 @@ Iriq.explain("https://foo.com/users/123/orders/456")
 # ]
 ```
 
+```go
+// Go (same surface)
+iri, _ := iriq.Parse("https://foo.com/users/123")
+iri.Scheme // "https"
+iri.Host // "foo.com"
+iri.PathSegments // []string{"users", "123"}
+iri.Canonical() // "https://foo.com/users/123"
+
+norm, _ := iriq.Normalize("https://foo.com/users/123")
+// "https://foo.com/users/{user_id}"
+```
+
+The Ruby gem is the reference implementation; Go mirrors its API and is
+kept in sync via JSON fixtures plus a CLI parity harness. See
+[CLAUDE.md](CLAUDE.md) for the dev process.
+
 Pass `hints: false` to `Iriq.normalize` (or `PathShape`) for mechanical
 placeholders (`{integer_id}` instead of `{user_id}`).
 
@@ -277,18 +344,31 @@ $ cat README.md | iriq cluster # force cluster view
 $ cat README.md | iriq --corpus c.json # persist into a corpus
 ```
 
-`--corpus PATH` makes the corpus survive across invocations
-
+`--corpus PATH` makes the corpus survive across invocations. The file
+extension picks the storage backend:
+
+- `.json` — a single atomically-written JSON file (default). Best for small
+  corpora and when you want the data human-readable.
+- `.db` / `.sqlite` / `.sqlite3` — a SQLite database with WAL journaling.
+  Each observation is an incremental UPSERT, so multiple `iriq --corpus`
+  processes can write concurrently without clobbering each other, and the
+  cost of opening doesn't scale with corpus size.
+
+Once the corpus has data, `-n` becomes corpus-informed:
 
 ```
 $ for n in alice bob carol dave erin frank gina hank ivan jane; do
-iriq --corpus c.
+iriq --corpus c.db https://foo.com/users/$n/profile >/dev/null
 done
 
-$ iriq -n --corpus c.
+$ iriq -n --corpus c.db https://foo.com/users/zoe/profile
 https://foo.com/users/{user}/profile # mechanical would keep "zoe"
 ```
 
+Library: `Iriq::Corpus.open("c.db")` (or `iriq.OpenCorpus("c.db")` in Go)
+dispatches on the same extension rules. `corpus.save("export.json")`
+exports any backend as JSON.
+
 Flags:
 
 | Flag | Effect |
@@ -298,7 +378,7 @@ Flags:
 | `-j, --json` | Emit JSON |
 | `-N, --no-hints` | Use `{integer_id}` etc. instead of `{user_id}` |
 | `--no-scheme-less` | Skip `foo.com/path`-style extraction (explicit-scheme only) |
-| `--corpus PATH` | Load/create a
+| `--corpus PATH` | Load/create a corpus at PATH (`.json` or `.db`/`.sqlite`/`.sqlite3`) |
 | `--stats` | Print rolling aggregates |
 | `-V, --version` | Print version |
 
@@ -352,6 +432,25 @@ For richer IRI handling, see `addressable`. Iriq's focus is the analysis
 side: classification, normalization, and clustering — not a complete URL
 implementation.
 
+----
+## Go port
+
+A Go implementation lives under [`go/`](go/) — same public surface, same
+behavior, ~10× faster CLI on extraction-heavy workloads. The Ruby gem is
+the reference; the Go port stays in sync via golden JSON fixtures
+(`spec/fixtures/`) and a CLI parity harness (`script/cli_parity.sh`), both
+checked in CI.
+
+```go
+import "github.com/dpep/iriq/go/iriq"
+
+iri, _ := iriq.Parse("https://foo.com/users/123")
+norm, _ := iriq.Normalize("https://foo.com/users/123")
+// "https://foo.com/users/{user_id}"
+```
+
+See [`go/README.md`](go/README.md) for the full API table and porting workflow.
+
 ----
 ## Contributing
 
@@ -360,6 +459,8 @@ Yes please :)
 1. Fork it
 1. Create your feature branch (`git checkout -b my-feature`)
 1. Ensure the tests pass (`bundle exec rspec`)
+1. If you changed library behavior, port the change to Go (or open an
+   issue) and regenerate fixtures: `bundle exec ruby script/generate_fixtures.rb`
 1. Commit your changes (`git commit -am 'awesome new feature'`)
 1. Push your branch (`git push origin my-feature`)
 1. Create a Pull Request
data/iriq.gemspec
CHANGED
@@ -4,13 +4,13 @@ Gem::Specification.new do |s|
   s.name = "iriq"
   s.version = Iriq::VERSION
   s.authors = ["Daniel Pepper"]
-  s.description = "
-  s.files = `git ls-files * ':!:spec'`.split("\n")
+  s.description = "IRI extraction, normalization, and clustering."
+  s.files = `git ls-files * ':!:spec' ':!:script' ':!:cmd' ':!:bin' ':!:*.go' ':!:go.mod' ':!:go.sum'`.split("\n")
   s.bindir = "exe"
   s.executables = ["iriq"]
   s.homepage = "https://github.com/dpep/iriq"
   s.license = "MIT"
-  s.summary = "
+  s.summary = "IRI extraction, normalization, and clustering."
 
   s.required_ruby_version = ">= 3.2"
 
@@ -18,4 +18,5 @@ Gem::Specification.new do |s|
   s.add_development_dependency 'rspec', '>= 3.10'
   s.add_development_dependency 'rspec-debugging'
   s.add_development_dependency 'simplecov', '>= 0.22'
+  s.add_development_dependency 'sqlite3', '>= 1.6'
 end
data/lib/iriq/cli.rb
CHANGED
@@ -150,9 +150,7 @@ module Iriq
     end
 
     def load_corpus(path)
-
-
-      Corpus.new
+      Corpus.open(path)
     end
 
     def print_usage(io, code)
@@ -193,7 +191,7 @@ module Iriq
     def cmd_batch(args, opts, corpus, explicit_cluster: false)
       corpus ||= Corpus.new
       iris = extract_text(read_text(args.first), opts)
-      iris.each { |iri| corpus.observe(iri) }
+      corpus.batch { iris.each { |iri| corpus.observe(iri) } }
 
       if opts[:sections].any?
         emit_per_iri_sections(iris, opts)
@@ -298,6 +296,9 @@ module Iriq
       end
     end
 
+    # Compact identifier hash for parse output (both JSON and human). Drops
+    # nil values and empty collections so URN dumps don't carry empty
+    # host/path/query slots, and URL dumps don't include null fragment/nss.
     def identifier_hash(iri)
       {
         original: iri.original,
@@ -310,7 +311,7 @@ module Iriq
         fragment: iri.fragment,
         nss: iri.nss,
         canonical: iri.canonical,
-      }
+      }.reject { |_, v| v.nil? || (v.respond_to?(:empty?) && v.empty?) }
     end
 
     def emit_sections(data, sections)
|
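Reviewer note: the `.reject` added to `identifier_hash` above can be tried in isolation. The sketch below applies the same predicate to a hand-written hash. The helper name `compact_fields` and the sample values are hypothetical; only the predicate itself is copied from the diff.

```ruby
# Same predicate as the `.reject` added to identifier_hash in the diff above.
# `compact_fields` is a hypothetical wrapper used only for this illustration.
def compact_fields(hash)
  hash.reject { |_, v| v.nil? || (v.respond_to?(:empty?) && v.empty?) }
end

# Hand-written sample resembling a URN dump (values are made up).
urn_like = {
  original: "urn:isbn:0451450523",
  scheme: "urn",
  host: nil,
  path_segments: [],
  query: {},
  fragment: nil,
  nss: "isbn:0451450523",
}

p compact_fields(urn_like).keys
```

Because the check is `respond_to?(:empty?)`, empty strings are dropped along with empty arrays and hashes, while `0` and `false` are kept.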
data/lib/iriq/cluster.rb
CHANGED
@@ -77,5 +77,29 @@ module Iriq
       cluster.instance_variable_set(:@segment_counts, h["segment_counts"].map { |sub| Hash.new(0).merge(sub) })
       cluster
     end
+
+    # Shared cluster-key derivation. Returns [key, host, scheme, shape] —
+    # callers that already have a hinted shape can pass it in to skip the
+    # recomputation; URN inputs ignore the override and always derive their
+    # own shape from the NSS value.
+    def self.key_for(iri, classifier:, shape: nil)
+      if iri.urn?
+        ns, value = (iri.nss || "").split(":", 2)
+        derived = value ? urn_value_shape(ns, value, classifier) : nil
+        key = "urn:#{ns}:#{derived}"
+        [key, nil, "urn", key]
+      else
+        shape ||= PathShape.new(classifier: classifier).for(iri.path_segments)
+        key = "#{iri.scheme}://#{iri.host}#{shape}"
+        [key, iri.host, iri.scheme, shape]
+      end
+    end
+
+    def self.urn_value_shape(ns, value, classifier)
+      entry = SegmentHints.derive([ns, value], classifier).last
+      return entry[:value] unless entry[:variable]
+
+      "{#{entry[:hint] || entry[:type]}}"
+    end
   end
 end
|
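Reviewer note: the URL branch of `Cluster.key_for` above builds the key as scheme, host, and path shape, and returns `[key, host, scheme, shape]`. The sketch below imitates that branch with the standard library only. `toy_shape` and `toy_key_for` are hypothetical names, and the digit-only rule is a deliberately crude stand-in for iriq's `PathShape` and classifier, which are not reproduced here. It also assumes the shape string carries its own leading slash.

```ruby
require "uri"

# Crude stand-in for PathShape (assumption): digit-only segments become {integer_id}.
def toy_shape(segments)
  "/" + segments.map { |s| s.match?(/\A\d+\z/) ? "{integer_id}" : s }.join("/")
end

# Imitates the URL branch of Cluster.key_for: returns [key, host, scheme, shape].
# A caller that already has a shape can pass it in to skip recomputation.
def toy_key_for(url, shape: nil)
  uri = URI.parse(url)
  segments = uri.path.split("/").reject(&:empty?)
  shape ||= toy_shape(segments)
  ["#{uri.scheme}://#{uri.host}#{shape}", uri.host, uri.scheme, shape]
end

a = toy_key_for("https://foo.com/users/123")
b = toy_key_for("https://foo.com/users/456")
puts a.first
puts a.first == b.first
```

Two URLs that differ only in a variable segment end up with the same key, which is what lets them land in the same cluster.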
data/lib/iriq/clusterer.rb
CHANGED
@@ -3,31 +3,28 @@ module Iriq
   # `clusters` to read out the groups. `explain` annotates a single identifier
   # against the cluster it would fall into, including which positions are
   # stable across all observed members.
+  #
+  # Implemented as a thin wrapper over Storage::Memory — the same code path
+  # Corpus uses for the cluster portion of its state, so there's only one
+  # place that knows how clusters get stored.
   class Clusterer
     def initialize(classifier: SegmentClassifier::DEFAULT)
       @classifier = classifier
-      @
+      @storage = Storage::Memory.new(classifier: classifier)
     end
 
     def add(input, shape: nil)
       iri = coerce(input)
-      key, host, scheme,
-
-        key: key,
-        host: host,
-        scheme: scheme,
-        shape: shape,
-      )
-      cluster.add(iri)
-      cluster
+      key, host, scheme, derived = Cluster.key_for(iri, classifier: @classifier, shape: shape)
+      @storage.add_to_cluster(key, host, scheme, derived, iri)
     end
 
     def clusters
-      @clusters
+      @storage.clusters
     end
 
     def size
-      @
+      @storage.cluster_size
     end
 
     # Returns a per-segment explanation for the input, merging classifier
@@ -36,8 +33,8 @@ module Iriq
     # would otherwise call them variable).
     def explain(input)
       iri = coerce(input)
-      key, * =
-      cluster =
+      key, * = Cluster.key_for(iri, classifier: @classifier)
+      cluster = clusters.find { |c| c.key == key }
       stats = cluster ? cluster.segment_stats : []
       hinted = SegmentHints.derive(iri.path_segments, @classifier)
 
@@ -50,43 +47,21 @@ module Iriq
       end
     end
 
-    private
-
-    def coerce(input)
-      input.is_a?(Identifier) ? input : Parser.parse(input)
-    end
-
-    def cluster_key(iri, shape: nil)
-      if iri.urn?
-        ns, value = (iri.nss || "").split(":", 2)
-        shape = value ? urn_value_shape(ns, value) : nil
-        key = "urn:#{ns}:#{shape}"
-        [key, nil, "urn", key]
-      else
-        shape ||= PathShape.new(classifier: @classifier).for(iri.path_segments)
-        key = "#{iri.scheme}://#{iri.host}#{shape}"
-        [key, iri.host, iri.scheme, shape]
-      end
-    end
-
-    def urn_value_shape(ns, value)
-      entry = SegmentHints.derive([ns, value], @classifier).last
-      return entry[:value] unless entry[:variable]
-
-      "{#{entry[:hint] || entry[:type]}}"
-    end
-
-    public
-
     def dump
-      { "clusters" =>
+      { "clusters" => clusters.each_with_object({}) { |c, h| h[c.key] = c.dump } }
     end
 
     def self.from_dump(h, classifier: SegmentClassifier::DEFAULT)
       c = new(classifier: classifier)
       restored = h["clusters"].transform_values { |cdump| Cluster.from_dump(cdump) }
-      c.instance_variable_set(:@clusters, restored)
+      c.instance_variable_get(:@storage).instance_variable_set(:@clusters, restored)
       c
     end
+
+    private
+
+    def coerce(input)
+      input.is_a?(Identifier) ? input : Parser.parse(input)
+    end
   end
 end