RubyGems - iriq - Versions diffs - 0.0.1 → 0.2.0 - Mend

iriq 0.0.1 → 0.2.0

This diff shows the changes between publicly released package versions as they appear in their public registries. It is provided for informational purposes only.

Files changed (29)

checksums.yaml +4 -4
data/CHANGELOG.md +25 -0
data/CLAUDE.md +121 -0
data/Gemfile.lock +8 -2
data/Makefile +56 -0
data/README.md +334 -39
data/iriq.gemspec +4 -3
data/lib/iriq/cli.rb +289 -100
data/lib/iriq/cluster.rb +47 -0
data/lib/iriq/clusterer.rb +29 -39
data/lib/iriq/corpus.rb +322 -0
data/lib/iriq/explanation.rb +6 -22
data/lib/iriq/extractor.rb +125 -0
data/lib/iriq/identifier.rb +11 -3
data/lib/iriq/inflector.rb +145 -0
data/lib/iriq/normalizer.rb +11 -8
data/lib/iriq/observation.rb +25 -0
data/lib/iriq/parser.rb +1 -1
data/lib/iriq/path_shape.rb +27 -9
data/lib/iriq/position_stats.rb +64 -0
data/lib/iriq/segment_classifier.rb +31 -7
data/lib/iriq/segment_hints.rb +32 -0
data/lib/iriq/storage/json.rb +43 -0
data/lib/iriq/storage/memory.rb +138 -0
data/lib/iriq/storage/sqlite.rb +367 -0
data/lib/iriq/storage.rb +35 -0
data/lib/iriq/version.rb +1 -1
data/lib/iriq.rb +11 -0
metadata +29 -4

data/README.md CHANGED

@@ -3,19 +3,70 @@ Iriq
 ![Gem](https://img.shields.io/gem/dt/iriq?style=plastic)
 [![codecov](https://codecov.io/gh/dpep/iriq/branch/main/graph/badge.svg)](https://codecov.io/gh/dpep/iriq)
-Semantic IRI / URI / URL / URN normalization and clustering for Ruby.
+IRI extraction, normalization, and clustering.
-Iriq parses resource identifiers, normalizes them into canonical IRI-like
-forms, classifies path and query components, clusters similar identifiers,
-and explains which parts are stable vs. unique.
+Iriq pulls IRIs out of free text, parses them, normalizes them into
+canonical shape-aware forms, classifies their path and query components,
+and clusters similar identifiers — surfacing what's stable vs. unique.
+Ships as both a **command-line tool** (`iriq`) and a **library** (Ruby and
+Go — same behavior, enforced by parity tests).
+## Install
+The CLI is available three ways. Pick whichever fits your workflow:
+```sh
+# Homebrew (recommended)
+brew install dpep/tools/iriq
+# RubyGems — installs the CLI shim and the library
+gem install iriq
+# Go — installs the CLI binary into $GOBIN
+go install github.com/dpep/iriq/cmd/iriq@latest
+```
+For library use, depend on whichever runtime you're working in:
 ```ruby
-require "iriq"
+# Gemfile
+gem "iriq"
+```
+```go
+import "github.com/dpep/iriq"
 ```
-## Quick start
+## CLI quick start
+```
+$ iriq https://foo.com/users/123
+# parse
+original:      https://foo.com/users/123
+kind:          url
+scheme:        https
+host:          foo.com
+path_segments: ["users", "123"]
+canonical:     https://foo.com/users/123
+# normalize
+https://foo.com/users/{user_id}
+$ iriq -n https://foo.com/users/123
+https://foo.com/users/{user_id}
+$ cat access.log | iriq             # extract → URL list (or clusters at scale)
+$ cat access.log | iriq --stats     # rolling aggregates
+$ iriq ./access.log -n              # file auto-detected → normalize each found URL
+```
+Full CLI reference is below under [CLI](#cli).
+## Library quick start
 ```ruby
+# Ruby
 iri = Iriq.parse("https://foo.com/users/123")
 iri.scheme         # => "https"
 iri.host           # => "foo.com"
@@ -23,17 +74,56 @@ iri.path_segments  # => ["users", "123"]
 iri.canonical      # => "https://foo.com/users/123"
 Iriq.normalize("https://foo.com/users/123")
-# => "https://foo.com/users/{integer_id}"
+# => "https://foo.com/users/{user_id}"
 Iriq.explain("https://foo.com/users/123/orders/456")
 # => [
-#      { value: "users",  type: :literal,    variable: false },
-#      { value: "123",    type: :integer_id, variable: true  },
-#      { value: "orders", type: :literal,    variable: false },
-#      { value: "456",    type: :integer_id, variable: true  },
+#      { value: "users",  type: :literal,    variable: false, hint: nil        },
+#      { value: "123",    type: :integer_id, variable: true,  hint: "user_id"  },
+#      { value: "orders", type: :literal,    variable: false, hint: nil        },
+#      { value: "456",    type: :integer_id, variable: true,  hint: "order_id" },
 #    ]
 ```
+```go
+// Go (same surface)
+iri, _ := iriq.Parse("https://foo.com/users/123")
+iri.Scheme         // "https"
+iri.Host           // "foo.com"
+iri.PathSegments   // []string{"users", "123"}
+iri.Canonical()    // "https://foo.com/users/123"
+norm, _ := iriq.Normalize("https://foo.com/users/123")
+// "https://foo.com/users/{user_id}"
+```
+The Ruby gem is the reference implementation; Go mirrors its API and is
+kept in sync via JSON fixtures plus a CLI parity harness. See
+[CLAUDE.md](CLAUDE.md) for the dev process.
+Pass `hints: false` to `Iriq.normalize` (or `PathShape`) for mechanical
+placeholders (`{integer_id}` instead of `{user_id}`).
+## RESTful hints
+When a variable segment follows a literal one, Iriq derives a hint by
+singularizing the literal and suffixing `_id` (or `_uuid` for UUIDs). This is
+what produces `{user_id}` from `/users/123` and `{order_id}` from
+`/orders/456`. Singularization uses `Iriq::Inflector`, which delegates to a
+swappable adapter:
+```ruby
+# Default: ActiveSupport::Inflector if `active_support/inflector` is loadable,
+# otherwise a built-in adapter with rules adapted from ActiveSupport.
+Iriq::Inflector.singularize("categories")  # => "category"
+Iriq::Inflector.singularize("people")      # => "person"
+# Override:
+Iriq::Inflector.adapter = MyAdapter        # must respond to .singularize(String)
+Iriq::Inflector.reset_adapter!
+```
 ## Supported inputs
 | Input                                | Notes                                            |
@@ -69,7 +159,7 @@ clusterer.add("https://foo.com/users/456")
 clusterer.add("https://foo.com/users/789/orders/1")
 clusterer.clusters.map(&:shape)
-# => ["/users/{integer_id}", "/users/{integer_id}/orders/{integer_id}"]
+# => ["/users/{user_id}", "/users/{user_id}/orders/{order_id}"]
 clusterer.clusters.first.segment_stats
 # => [
@@ -79,8 +169,8 @@ clusterer.clusters.first.segment_stats
 clusterer.explain("https://foo.com/users/999")
 # => [
-#      { value: "users", type: :literal,    variable: false, stable: true  },
-#      { value: "999",   type: :integer_id, variable: true,  stable: false },
+#      { value: "users", type: :literal,    variable: false, hint: nil,       stable: true  },
+#      { value: "999",   type: :integer_id, variable: true,  hint: "user_id", stable: false },
 #    ]
 ```
@@ -89,6 +179,104 @@ a position the classifier *would* call variable but that is empirically
 constant across all members of the cluster will be reported with
 `stable: true, variable: false`.
+## Corpus (streaming + learning)
+For processing many identifiers — possibly an unbounded stream — use
+`Iriq::Corpus`. It maintains rolling aggregates and per-(host, prefix)
+frequency stats so classification improves as more data comes in.
+```ruby
+corpus = Iriq::Corpus.new
+iris.each do |iri|
+  obs = corpus.observe(iri)
+  obs.fingerprint   # deterministic shape: "https://foo.com/users/{user_id}"
+  obs.cluster       # the Iriq::Cluster this fell into
+  obs.explanation   # per-segment annotations with corpus-informed classification
+end
+corpus.host_counts          # { "foo.com" => 1234, "bar.com" => 7 }
+corpus.path_length_counts   # { 2 => 800, 3 => 434 }
+corpus.fingerprint_counts   # shape → count
+corpus.raw_shape_counts     # hint-free shape → count
+corpus.clusters             # Iriq::Cluster instances
+```
+### Deterministic vs. corpus-informed normalization
+```ruby
+Iriq.normalize("https://foo.com/users/me")
+# => "https://foo.com/users/me"   # mechanical: "me" is a literal
+corpus.normalize("https://foo.com/users/me")
+# => depends on what the corpus has seen
+```
+If many `/users/{integer_id}` paths flow in alongside a handful of
+`/users/me`, the cluster `/users/me` is preserved (mechanical clustering
+keeps literal routes distinct). If many *distinct literal handles*
+(`/users/alice`, `/users/bob`, `/users/carol`, ...) flow in, the corpus
+promotes that position to a `{user}` placeholder:
+```ruby
+%w[alice bob carol dave erin frank gina hank ivan jane].each do |name|
+  corpus.observe("https://foo.com/users/#{name}/profile")
+end
+corpus.normalize("https://foo.com/users/alice/profile")
+# => "https://foo.com/users/{user}/profile"
+```
+### Explainability
+Each row of `corpus.explain(...)` (and `observation.explanation`) carries a
+`classification:` symbol on top of the deterministic fields:
+| Classification              | Meaning                                              |
+| --------------------------- | ---------------------------------------------------- |
+| `:stable_literal`           | Literal value dominates this position                |
+| `:variable_identifier`      | Classifier said variable (uuid, integer, etc.)       |
+| `:rare_literal`             | Literal seen here, but not dominant                  |
+| `:corpus_inferred_variable` | Classifier said literal, but position has high entropy |
+| `:ambiguous`                | Insufficient signal — never seen, or mixed           |
+## Extracting IRIs from text
+`Iriq::Extractor` is what powers pipe-mode in the CLI. Picks up explicit-
+scheme URLs (`http`, `https`, `ftp`, `ws`, `wss`, `urn`) and `foo.com/path`-
+style scheme-less URLs (small TLD allow-list, required path). Trims trailing
+sentence punctuation iteratively and preserves balanced parens
+(`https://en.wikipedia.org/wiki/Ruby_(programming_language)` stays intact;
+`(see https://foo.com)` drops the outer paren).
+```ruby
+Iriq.extract("Visit https://foo.com today, also hit foo.com/users.")
+# => [#<Iriq::Identifier https://foo.com>,
+#     #<Iriq::Identifier https://foo.com/users>]
+# Disable scheme-less:
+Iriq::Extractor.new(scheme_less: false).extract("hit foo.com/users today")
+# => []
+```
+Known limitations (intentional):
+- Comma is a URL boundary, so query strings like `?q=37.7,-122.4` truncate.
+  Trade-off picked to keep CSV-shaped text working.
+- No HTML entity decoding (`&amp;` stays as-is).
+- Scheme-less mode skips bare hostnames without a path (too noisy in prose).
+### Memory bounds
+- Per-position `value_counts` is capped (`max_values_per_position`, default
+  1000) — once full, `total` keeps growing but only existing keys count up.
+- Cluster examples are capped at `Iriq::Cluster::MAX_EXAMPLES`.
+- No raw IRI strings are retained outside the bounded cluster examples.
+```ruby
+Iriq::Corpus.new(max_values_per_position: 200)
+```
 ## Object model
 | Class                       | Responsibility                                       |
@@ -96,51 +284,137 @@ constant across all members of the cluster will be reported with
 | `Iriq::Parser`              | String → `Identifier`                                |
 | `Iriq::Identifier`          | Structured fields + `canonical` reconstruction       |
 | `Iriq::SegmentClassifier`   | Single segment → type symbol                         |
-| `Iriq::PathShape`           | Segments → `/users/{integer_id}` route shape         |
+| `Iriq::PathShape`           | Segments → `/users/{user_id}` route shape            |
+| `Iriq::SegmentHints`        | Derives `user_id`-style hints from neighbors         |
+| `Iriq::Inflector`           | Singularization with swappable adapter (AS or built-in) |
 | `Iriq::Normalizer`          | Identifier → canonical, shape-aware string           |
-| `Iriq::Explanation`         | Per-segment `{value, type, variable}` annotations    |
+| `Iriq::Explanation`         | Per-segment `{value, type, variable, hint}` rows     |
 | `Iriq::Cluster`             | One host + shape group, with examples & stats        |
 | `Iriq::Clusterer`           | Many identifiers → `Cluster` set + explain          |
+| `Iriq::PositionStats`       | Capped value/type frequencies for one position       |
+| `Iriq::Observation`         | What `Corpus#observe` returns                        |
+| `Iriq::Corpus`              | Streaming observer with rolling aggregates + learning |
+| `Iriq::Extractor`           | Pulls IRIs out of free text (scheme-anchored)        |
 ## CLI
-Installing the gem also installs an `iriq` executable.
+Installing the gem installs an `iriq` executable. Two main modes:
+**Single input** — combined parse + normalize summary; trim with section
+flags (`-p`, `-n`).
 ```
-$ iriq parse https://foo.com/users/123
-original:      https://foo.com/users/123
+$ iriq foo.com/users/456
+# parse
+original:      foo.com/users/456
 kind:          url
 scheme:        https
 host:          foo.com
-path_segments: ["users", "123"]
-canonical:     https://foo.com/users/123
+path_segments: ["users", "456"]
+canonical:     https://foo.com/users/456
-$ iriq normalize foo.com/posts/2024-05-23/hello-world
-https://foo.com/posts/{date}/{slug}
+# normalize
+https://foo.com/users/{user_id}
-$ iriq explain https://foo.com/users/123/orders/456
-  literal      users
-* integer_id   123
-  literal      orders
-* integer_id   456
+$ iriq -n https://foo.com/users/123
+https://foo.com/users/{user_id}
+```
-$ iriq classify f47ac10b-58cc-4372-a567-0e02b2c3d479
-uuid
+**Piped stdin** — extraction runs by default. Output auto-switches: small
+inputs get a deduplicated URL list, larger inputs (≥ 10 IRIs) get the
+cluster view via an ephemeral corpus. Section flags work too — emit one
+normalized URL / parsed record per extracted IRI.
-$ cat urls.txt | iriq cluster
-[2] foo.com  /users/{integer_id}
-    https://foo.com/users/1
-    https://foo.com/users/2
-[1] foo.com  /posts/{slug}/edit
-    https://foo.com/posts/abc-123/edit
 ```
+$ cat short.txt | iriq
+[2] https://github.com/dpep/iriq
+[1] https://foo.com/users
+$ cat short.txt | iriq -n                     # normalized URL per line
+https://github.com/dpep/iriq
+https://foo.com/users
+$ cat access.log | iriq                       # ≥ 10 IRIs → cluster view
+[190] docs.example.com  /users/{user_id}
+[186] app.example.com   /users/{user_id}
+...
+$ cat README.md | iriq --stats                # rolling aggregates
+$ cat README.md | iriq cluster                # force cluster view
+$ cat README.md | iriq --corpus c.json        # persist into a corpus
+```
+`--corpus PATH` makes the corpus survive across invocations. The file
+extension picks the storage backend:
-Add `--json` to any command for machine-readable output. `iriq cluster` reads
-identifiers (one per line) from a file argument or stdin; lines that fail to
-parse are skipped with a warning on stderr.
+- `.json` — a single atomically-written JSON file (default). Best for small
+  corpora and when you want the data human-readable.
+- `.db` / `.sqlite` / `.sqlite3` — a SQLite database with WAL journaling.
+  Each observation is an incremental UPSERT, so multiple `iriq --corpus`
+  processes can write concurrently without clobbering each other, and the
+  cost of opening doesn't scale with corpus size.
+Once the corpus has data, `-n` becomes corpus-informed:
+```
+$ for n in alice bob carol dave erin frank gina hank ivan jane; do
+    iriq --corpus c.db https://foo.com/users/$n/profile >/dev/null
+  done
+$ iriq -n --corpus c.db https://foo.com/users/zoe/profile
+https://foo.com/users/{user}/profile         # mechanical would keep "zoe"
+```
+Library: `Iriq::Corpus.open("c.db")` (or `iriq.OpenCorpus("c.db")` in Go)
+dispatches on the same extension rules. `corpus.save("export.json")`
+exports any backend as JSON.
+Flags:
+| Flag                | Effect                                                  |
+| ------------------- | ------------------------------------------------------- |
+| `-p, --parse`       | Show parsed fields                                      |
+| `-n, --normalize`   | Show the shape-normalized form                          |
+| `-j, --json`        | Emit JSON                                               |
+| `-N, --no-hints`    | Use `{integer_id}` etc. instead of `{user_id}`          |
+| `--no-scheme-less`  | Skip `foo.com/path`-style extraction (explicit-scheme only) |
+| `--corpus PATH`     | Load/create a corpus at PATH (`.json` or `.db`/`.sqlite`/`.sqlite3`) |
+| `--stats`           | Print rolling aggregates                                |
+| `-V, --version`     | Print version                                           |
+A positional argument that doesn't parse as an IRI but IS an existing
+file is read and extracted from automatically — `iriq ./access.log` and
+`iriq /var/log/foo.log` Just Work. (Bare filenames like `README.md`
+may still parse as a URL; pipe with `cat` to disambiguate.)
 Exit codes: `0` success, `1` usage error, `2` parse error.
+## Performance
+Measured on the deterministic `IriGenerator` fixture (Ruby 3.4.9, single
+thread):
+| Operation                | Throughput   |
+| ------------------------ | ------------ |
+| `Iriq.parse`             | ~260k URLs/s |
+| `Iriq.normalize`         | ~148k URLs/s |
+| `Iriq.explain`           | ~205k URLs/s |
+| `Iriq.extract` (prose)   | ~9.6 MB/s    |
+| `Corpus#observe`         | ~80k URLs/s  |
+| Corpus save/load (10k)   | ~135 ms      |
+Linear scaling holds through 100k observations; per-observation retained
+memory amortizes to ~100 bytes at that scale. Memoization caches are
+bounded by `CACHE_MAX = 10_000` (cleared when full) — overhead is a few
+hundred KB regardless of corpus size.
+Re-run anytime with:
+```
+bundle exec script/benchmark.rb       # throughput
+bundle exec script/memory.rb          # retained memory + cache footprints
+```
 ## Limitations (intentional)
 This is an MVP. Iriq does **not**:
@@ -158,6 +432,25 @@ For richer IRI handling, see `addressable`. Iriq's focus is the analysis
 side: classification, normalization, and clustering — not a complete URL
 implementation.
+----
+## Go port
+A Go implementation lives under [`go/`](go/) — same public surface, same
+behavior, ~10× faster CLI on extraction-heavy workloads. The Ruby gem is
+the reference; the Go port stays in sync via golden JSON fixtures
+(`spec/fixtures/`) and a CLI parity harness (`script/cli_parity.sh`), both
+checked in CI.
+```go
+import "github.com/dpep/iriq/go/iriq"
+iri, _ := iriq.Parse("https://foo.com/users/123")
+norm, _ := iriq.Normalize("https://foo.com/users/123")
+// "https://foo.com/users/{user_id}"
+```
+See [`go/README.md`](go/README.md) for the full API table and porting workflow.
 ----
 ## Contributing
@@ -166,6 +459,8 @@ Yes please  :)
 1. Fork it
 1. Create your feature branch (`git checkout -b my-feature`)
 1. Ensure the tests pass (`bundle exec rspec`)
+1. If you changed library behavior, port the change to Go (or open an
+   issue) and regenerate fixtures: `bundle exec ruby script/generate_fixtures.rb`
 1. Commit your changes (`git commit -am 'awesome new feature'`)
 1. Push your branch (`git push origin my-feature`)
 1. Create a Pull Request
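
The "RESTful hints" section added to the README in this diff says a variable segment that follows a literal gets a hint built by singularizing the literal and appending `_id` (or `_uuid` for UUIDs). A minimal sketch of that rule follows. It is not the gem's code: `naive_singularize`, `derive_hints`, and `hinted_shape` are hypothetical names, the two regexes are simplified stand-ins for the segment classifier, and the singularizer handles only the plurals used here (the gem delegates to `Iriq::Inflector`).

```ruby
# Hypothetical sketch of the hint rule described in the README; not Iriq's implementation.
UUID_RE    = /\A\h{8}-\h{4}-\h{4}-\h{4}-\h{12}\z/
INTEGER_RE = /\A\d+\z/

# Deliberately naive singularizer (assumption): handles "-ies" and a plain trailing "s".
def naive_singularize(word)
  return word.sub(/ies\z/, "y") if word.end_with?("ies")
  return word.chomp("s") if word.end_with?("s") && !word.end_with?("ss")
  word
end

# Returns one hint (or nil) per path segment.
def derive_hints(segments)
  segments.each_with_index.map do |segment, i|
    suffix =
      if segment.match?(UUID_RE) then "_uuid"
      elsif segment.match?(INTEGER_RE) then "_id"
      end
    prev = i.zero? ? nil : segments[i - 1]
    prev_literal = prev && !prev.match?(UUID_RE) && !prev.match?(INTEGER_RE)
    suffix && prev_literal ? naive_singularize(prev) + suffix : nil
  end
end

# Replaces hinted segments with {hint}; segments without a hint are left as-is in this sketch.
def hinted_shape(segments)
  hints = derive_hints(segments)
  "/" + segments.zip(hints).map { |seg, hint| hint ? "{#{hint}}" : seg }.join("/")
end

puts hinted_shape(%w[users 123 orders 456])
# prints "/users/{user_id}/orders/{order_id}"
```

A variable segment with no preceding literal gets no hint here. Per the README, the gem would fall back to a mechanical placeholder such as `{integer_id}` in that case.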

data/iriq.gemspec CHANGED

@@ -4,13 +4,13 @@ Gem::Specification.new do |s|
   s.name        = "iriq"
   s.version     = Iriq::VERSION
   s.authors     = ["Daniel Pepper"]
-  s.description = "Semantic IRI/URI/URL/URN parsing, normalization, classification, and clustering."
-  s.files       = `git ls-files * ':!:spec'`.split("\n")
+  s.description = "IRI extraction, normalization, and clustering."
+  s.files       = `git ls-files * ':!:spec' ':!:script' ':!:cmd' ':!:bin' ':!:*.go' ':!:go.mod' ':!:go.sum'`.split("\n")
   s.bindir      = "exe"
   s.executables = ["iriq"]
   s.homepage    = "https://github.com/dpep/iriq"
   s.license     = "MIT"
-  s.summary     = "Semantic IRI normalization and clustering."
+  s.summary     = "IRI extraction, normalization, and clustering."
   s.required_ruby_version = ">= 3.2"
@@ -18,4 +18,5 @@ Gem::Specification.new do |s|
   s.add_development_dependency 'rspec', '>= 3.10'
   s.add_development_dependency 'rspec-debugging'
   s.add_development_dependency 'simplecov', '>= 0.22'
+  s.add_development_dependency 'sqlite3', '>= 1.6'
 end
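
The new `sqlite3` development dependency presumably supports the SQLite storage backend added in this release. The README says the file extension passed to `--corpus` selects the backend. A minimal sketch of that dispatch follows, under stated assumptions: `backend_for` and the returned symbols are hypothetical, and sending unrecognized extensions to JSON is a guess based on the README calling JSON the default.

```ruby
# Hypothetical sketch of extension-based backend selection; not Iriq's implementation.
SQLITE_EXTENSIONS = %w[.db .sqlite .sqlite3].freeze

# Maps a corpus path to a backend symbol based on its file extension.
def backend_for(path)
  ext = File.extname(path)
  return :sqlite if SQLITE_EXTENSIONS.include?(ext)
  :json # assumption: anything else uses the default JSON backend
end

puts backend_for("c.db")        # prints "sqlite"
puts backend_for("corpus.json") # prints "json"
```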