RubyGems - ucode - Versions diffs - 0.1.0 - Mend

ucode 0.1.0


Files changed (228)

checksums.yaml +7 -0
data/CLAUDE.md +211 -0
data/Gemfile +22 -0
data/Gemfile.lock +406 -0
data/README.md +469 -0
data/Rakefile +18 -0
data/TODO.new/00-README.md +66 -0
data/TODO.new/01-pillar-terminology-alignment.md +69 -0
data/TODO.new/02-audit-schema-design.md +255 -0
data/TODO.new/03-directory-output-spec.md +203 -0
data/TODO.new/04-fontist-org-contract.md +173 -0
data/TODO.new/05-baseline-unicode17-coverage-audit.md +144 -0
data/TODO.new/06-audit-namespace-skeleton.md +105 -0
data/TODO.new/07-audit-models-port.md +132 -0
data/TODO.new/08-extractors-cheap-port.md +113 -0
data/TODO.new/09-extractors-expensive-port.md +99 -0
data/TODO.new/10-aggregations-ucd-rewrite.md +168 -0
data/TODO.new/11-differ-and-library-auditor-port.md +102 -0
data/TODO.new/12-formatters-port.md +115 -0
data/TODO.new/13-directory-emitter.md +147 -0
data/TODO.new/14-html-face-browser.md +144 -0
data/TODO.new/15-html-library-browser.md +102 -0
data/TODO.new/16-cli-audit-subcommands.md +142 -0
data/TODO.new/17-fontisan-cleanup-audit.md +147 -0
data/TODO.new/18-fontisan-cleanup-ucd.md +156 -0
data/TODO.new/19-fontisan-docs-update.md +155 -0
data/TODO.new/20-canonical-resolver-4-tier.md +182 -0
data/TODO.new/21-canonical-unicode17-build.md +148 -0
data/TODO.new/22-implementation-order.md +176 -0
data/UCODE_CHANGELOG.md +97 -0
data/exe/ucode +8 -0
data/lib/ucode/aggregator.rb +77 -0
data/lib/ucode/audit/block_aggregator.rb +90 -0
data/lib/ucode/audit/codepoint_range_coalescer.rb +42 -0
data/lib/ucode/audit/context.rb +137 -0
data/lib/ucode/audit/discrepancy_detector.rb +213 -0
data/lib/ucode/audit/extractors/aggregations.rb +70 -0
data/lib/ucode/audit/extractors/base.rb +21 -0
data/lib/ucode/audit/extractors/color_capabilities.rb +143 -0
data/lib/ucode/audit/extractors/coverage.rb +55 -0
data/lib/ucode/audit/extractors/hinting.rb +199 -0
data/lib/ucode/audit/extractors/identity.rb +65 -0
data/lib/ucode/audit/extractors/licensing.rb +75 -0
data/lib/ucode/audit/extractors/metrics.rb +108 -0
data/lib/ucode/audit/extractors/opentype_layout.rb +71 -0
data/lib/ucode/audit/extractors/provenance.rb +34 -0
data/lib/ucode/audit/extractors/style.rb +88 -0
data/lib/ucode/audit/extractors/variation_detail.rb +101 -0
data/lib/ucode/audit/extractors.rb +31 -0
data/lib/ucode/audit/plane_aggregator.rb +37 -0
data/lib/ucode/audit/registry.rb +63 -0
data/lib/ucode/audit/script_aggregator.rb +92 -0
data/lib/ucode/audit.rb +27 -0
data/lib/ucode/cache.rb +113 -0
data/lib/ucode/cli.rb +272 -0
data/lib/ucode/commands/build.rb +68 -0
data/lib/ucode/commands/cache.rb +46 -0
data/lib/ucode/commands/fetch.rb +62 -0
data/lib/ucode/commands/font_coverage.rb +57 -0
data/lib/ucode/commands/glyphs.rb +136 -0
data/lib/ucode/commands/lookup.rb +65 -0
data/lib/ucode/commands/parse.rb +62 -0
data/lib/ucode/commands/site.rb +33 -0
data/lib/ucode/commands.rb +19 -0
data/lib/ucode/config.rb +110 -0
data/lib/ucode/coordinator/indices.rb +34 -0
data/lib/ucode/coordinator.rb +397 -0
data/lib/ucode/database.rb +214 -0
data/lib/ucode/db_builder.rb +107 -0
data/lib/ucode/error.rb +96 -0
data/lib/ucode/fetch/code_charts.rb +57 -0
data/lib/ucode/fetch/http.rb +83 -0
data/lib/ucode/fetch/ucd_zip.rb +57 -0
data/lib/ucode/fetch/unihan_zip.rb +57 -0
data/lib/ucode/fetch.rb +14 -0
data/lib/ucode/glyphs/cell_extractor.rb +130 -0
data/lib/ucode/glyphs/dvisvgm_renderer.rb +29 -0
data/lib/ucode/glyphs/embedded_fonts/catalog.rb +372 -0
data/lib/ucode/glyphs/embedded_fonts/content_stream_correlator.rb +228 -0
data/lib/ucode/glyphs/embedded_fonts/font_entry.rb +126 -0
data/lib/ucode/glyphs/embedded_fonts/renderer.rb +47 -0
data/lib/ucode/glyphs/embedded_fonts/source.rb +94 -0
data/lib/ucode/glyphs/embedded_fonts/svg.rb +123 -0
data/lib/ucode/glyphs/embedded_fonts/tounicode.rb +103 -0
data/lib/ucode/glyphs/embedded_fonts/writer.rb +76 -0
data/lib/ucode/glyphs/embedded_fonts.rb +50 -0
data/lib/ucode/glyphs/grid.rb +30 -0
data/lib/ucode/glyphs/grid_detector.rb +165 -0
data/lib/ucode/glyphs/last_resort/cmap_index.rb +96 -0
data/lib/ucode/glyphs/last_resort/contents.rb +74 -0
data/lib/ucode/glyphs/last_resort/glif.rb +124 -0
data/lib/ucode/glyphs/last_resort/renderer.rb +67 -0
data/lib/ucode/glyphs/last_resort/source.rb +125 -0
data/lib/ucode/glyphs/last_resort/svg.rb +247 -0
data/lib/ucode/glyphs/last_resort/writer.rb +83 -0
data/lib/ucode/glyphs/last_resort.rb +36 -0
data/lib/ucode/glyphs/monolith_page_map.rb +181 -0
data/lib/ucode/glyphs/mutool_renderer.rb +28 -0
data/lib/ucode/glyphs/page_renderer.rb +221 -0
data/lib/ucode/glyphs/path_bbox.rb +62 -0
data/lib/ucode/glyphs/pdf2svg_renderer.rb +26 -0
data/lib/ucode/glyphs/pdf_fetcher.rb +102 -0
data/lib/ucode/glyphs/pdftocairo_renderer.rb +32 -0
data/lib/ucode/glyphs/real_fonts/block_coverage.rb +45 -0
data/lib/ucode/glyphs/real_fonts/coverage_auditor.rb +117 -0
data/lib/ucode/glyphs/real_fonts/font_coverage_report.rb +45 -0
data/lib/ucode/glyphs/real_fonts/font_locator.rb +95 -0
data/lib/ucode/glyphs/real_fonts/unicode_17_blocks.rb +104 -0
data/lib/ucode/glyphs/real_fonts/writer.rb +50 -0
data/lib/ucode/glyphs/real_fonts.rb +32 -0
data/lib/ucode/glyphs/writer.rb +250 -0
data/lib/ucode/glyphs.rb +27 -0
data/lib/ucode/index.rb +106 -0
data/lib/ucode/index_builder.rb +94 -0
data/lib/ucode/models/audit/audit_axis.rb +30 -0
data/lib/ucode/models/audit/audit_diff.rb +77 -0
data/lib/ucode/models/audit/audit_report.rb +137 -0
data/lib/ucode/models/audit/baseline.rb +32 -0
data/lib/ucode/models/audit/block_summary.rb +72 -0
data/lib/ucode/models/audit/codepoint_detail.rb +45 -0
data/lib/ucode/models/audit/codepoint_range.rb +39 -0
data/lib/ucode/models/audit/codepoint_set_diff.rb +34 -0
data/lib/ucode/models/audit/color_capabilities.rb +91 -0
data/lib/ucode/models/audit/discrepancy.rb +38 -0
data/lib/ucode/models/audit/duplicate_group.rb +23 -0
data/lib/ucode/models/audit/embedding_type.rb +81 -0
data/lib/ucode/models/audit/field_change.rb +28 -0
data/lib/ucode/models/audit/fs_selection_flags.rb +65 -0
data/lib/ucode/models/audit/gasp_range.rb +63 -0
data/lib/ucode/models/audit/hinting.rb +99 -0
data/lib/ucode/models/audit/library_summary.rb +40 -0
data/lib/ucode/models/audit/licensing.rb +48 -0
data/lib/ucode/models/audit/metrics.rb +111 -0
data/lib/ucode/models/audit/named_instance.rb +41 -0
data/lib/ucode/models/audit/opentype_layout.rb +38 -0
data/lib/ucode/models/audit/plane_summary.rb +31 -0
data/lib/ucode/models/audit/script_coverage_row.rb +26 -0
data/lib/ucode/models/audit/script_features.rb +28 -0
data/lib/ucode/models/audit/script_summary.rb +54 -0
data/lib/ucode/models/audit/variation_detail.rb +42 -0
data/lib/ucode/models/audit.rb +50 -0
data/lib/ucode/models/bidi_bracket_pair.rb +20 -0
data/lib/ucode/models/bidi_mirroring.rb +19 -0
data/lib/ucode/models/binary_property_assignment.rb +26 -0
data/lib/ucode/models/block.rb +36 -0
data/lib/ucode/models/case_folding_rule.rb +23 -0
data/lib/ucode/models/cjk_radical.rb +23 -0
data/lib/ucode/models/codepoint/bidi.rb +28 -0
data/lib/ucode/models/codepoint/break_segmentation.rb +22 -0
data/lib/ucode/models/codepoint/case_folding.rb +25 -0
data/lib/ucode/models/codepoint/casing.rb +32 -0
data/lib/ucode/models/codepoint/decomposition.rb +27 -0
data/lib/ucode/models/codepoint/display.rb +24 -0
data/lib/ucode/models/codepoint/emoji.rb +29 -0
data/lib/ucode/models/codepoint/hangul.rb +20 -0
data/lib/ucode/models/codepoint/identifier.rb +30 -0
data/lib/ucode/models/codepoint/indic.rb +20 -0
data/lib/ucode/models/codepoint/joining.rb +20 -0
data/lib/ucode/models/codepoint/normalization.rb +35 -0
data/lib/ucode/models/codepoint/numeric_value.rb +35 -0
data/lib/ucode/models/codepoint.rb +122 -0
data/lib/ucode/models/name_alias.rb +21 -0
data/lib/ucode/models/named_sequence.rb +19 -0
data/lib/ucode/models/names_list_entry.rb +38 -0
data/lib/ucode/models/plane.rb +36 -0
data/lib/ucode/models/property_alias.rb +24 -0
data/lib/ucode/models/property_value_alias.rb +26 -0
data/lib/ucode/models/relationship/compat_equiv.rb +18 -0
data/lib/ucode/models/relationship/cross_reference.rb +17 -0
data/lib/ucode/models/relationship/footnote.rb +24 -0
data/lib/ucode/models/relationship/informal_alias.rb +18 -0
data/lib/ucode/models/relationship/sample_sequence.rb +24 -0
data/lib/ucode/models/relationship/variation_sequence.rb +19 -0
data/lib/ucode/models/relationship.rb +57 -0
data/lib/ucode/models/script.rb +41 -0
data/lib/ucode/models/special_casing_rule.rb +28 -0
data/lib/ucode/models/standardized_variant.rb +24 -0
data/lib/ucode/models/unihan_entry.rb +23 -0
data/lib/ucode/models.rb +47 -0
data/lib/ucode/parsers/auxiliary.rb +26 -0
data/lib/ucode/parsers/base.rb +137 -0
data/lib/ucode/parsers/bidi_brackets.rb +41 -0
data/lib/ucode/parsers/bidi_mirroring.rb +37 -0
data/lib/ucode/parsers/blocks.rb +63 -0
data/lib/ucode/parsers/case_folding.rb +53 -0
data/lib/ucode/parsers/cjk_radicals.rb +102 -0
data/lib/ucode/parsers/derived_age.rb +59 -0
data/lib/ucode/parsers/derived_core_properties.rb +60 -0
data/lib/ucode/parsers/extracted_properties.rb +74 -0
data/lib/ucode/parsers/name_aliases.rb +44 -0
data/lib/ucode/parsers/named_sequences.rb +51 -0
data/lib/ucode/parsers/names_list.rb +250 -0
data/lib/ucode/parsers/property_aliases.rb +41 -0
data/lib/ucode/parsers/property_value_aliases.rb +46 -0
data/lib/ucode/parsers/script_extensions.rb +64 -0
data/lib/ucode/parsers/scripts.rb +60 -0
data/lib/ucode/parsers/special_casing.rb +62 -0
data/lib/ucode/parsers/standardized_variants.rb +56 -0
data/lib/ucode/parsers/unicode_data/hangul_name.rb +73 -0
data/lib/ucode/parsers/unicode_data.rb +268 -0
data/lib/ucode/parsers/unihan.rb +125 -0
data/lib/ucode/parsers.rb +35 -0
data/lib/ucode/range_entry.rb +58 -0
data/lib/ucode/repo/aggregate_writer.rb +364 -0
data/lib/ucode/repo/atomic_writes.rb +48 -0
data/lib/ucode/repo/codepoint_writer.rb +96 -0
data/lib/ucode/repo/paths.rb +122 -0
data/lib/ucode/repo.rb +22 -0
data/lib/ucode/site/config_emitter.rb +124 -0
data/lib/ucode/site/generator.rb +178 -0
data/lib/ucode/site/search_index.rb +68 -0
data/lib/ucode/site/template/.gitignore +4 -0
data/lib/ucode/site/template/.vitepress/config.ts +8 -0
data/lib/ucode/site/template/.vitepress/theme/index.js +20 -0
data/lib/ucode/site/template/char/[codepoint].md +13 -0
data/lib/ucode/site/template/components/BlockView.vue +57 -0
data/lib/ucode/site/template/components/CharView.vue +85 -0
data/lib/ucode/site/template/components/PlaneView.vue +56 -0
data/lib/ucode/site/template/components/SearchView.vue +66 -0
data/lib/ucode/site/template/index.md +25 -0
data/lib/ucode/site/template/package.json +18 -0
data/lib/ucode/site/template/search.md +9 -0
data/lib/ucode/site.rb +13 -0
data/lib/ucode/version.rb +5 -0
data/lib/ucode/version_resolver.rb +76 -0
data/lib/ucode.rb +74 -0
data/ucode.gemspec +56 -0
metadata +404 -0

data/README.md ADDED

@@ -0,0 +1,469 @@
+# ucode
+`ucode` is a Ruby toolkit for the Unicode Character Database (UCD). It turns the
+official UCD text files into a structured, browsable dataset: one JSON document
+per assigned codepoint, plus a Vitepress site for navigation.
+> **Status (v0.1).** The JSON dataset, lookup index, and Vitepress site are
+> production-ready. **SVG glyph extraction from the Code Charts PDFs is
+> experimental and deferred to v0.2**; see
+> [Glyph extraction (experimental in v0.1; concrete plan for v0.2)](#glyph-extraction-experimental-in-v01-concrete-plan-for-v02) below.
+## What you get (v0.1)
+- **Per-codepoint JSON** at `output/blocks/<BLOCK>/<U+XXXX>/index.json` with
+  full UCD properties, the human-curated relationships from `NamesList.txt`
+  (cross-references, see-also, compatibility equivalents, sample sequences,
+  informal aliases, footnotes), Unihan readings, and machine-computed refs
+  (decomposition, case mappings, case folding, bidi mirror, named sequences,
+  standardized variants, script extensions).
+- **Aggregate JSON**: planes, blocks, scripts, search index, enums,
+  relationships, named sequences, manifest.
+- **SQLite lookup index** for fast codepoint → block/script/char queries.
+- **Vitepress site** at `site/` for browsing Plane → Block → Character.
+## Install
+```sh
+gem install ucode
+```
+Or in a Gemfile:
+```ruby
+gem "ucode", "~> 0.1"
+```
+## Quick start
+```sh
+# 1. Fetch UCD + Unihan for Unicode 17.0.0
+ucode fetch ucd 17.0.0
+ucode fetch unihan 17.0.0
+# 2. Stream UCD → output/ JSON tree
+ucode parse 17.0.0 --to ./output
+# 3. (Optional) Build the SQLite lookup index + dataset in one go
+ucode build 17.0.0 --to ./output    # fetch + parse (glyphs skipped by default)
+# 4. (Optional) Generate the Vitepress site
+ucode site init --to ./site
+ucode site build --from ./output --to ./site
+cd site && npm install && npm run dev
+```
+## Three modes
+### Lookup mode
+Read-only access to the SQLite cache.
+```ruby
+require "ucode"
+db = Ucode::Database.open("17.0.0")
+db.lookup_block(0x0041)   # => "Basic Latin"
+db.lookup_script(0x0041)  # => "Latin"
+```
+CLI equivalent:
+```sh
+ucode lookup block 0x0041   # U+0041 → Basic Latin
+ucode lookup char U+1F600
+```
+### Dataset mode
+Build the per-codepoint JSON dataset.
+```ruby
+require "ucode"
+Ucode::Commands::ParseCommand.new.call("17.0.0", output_root: "./output")
+```
+Or via CLI:
+```sh
+ucode build 17.0.0 --to ./output
+```
+### Site mode
+Generate the Vitepress site.
+```ruby
+require "ucode"
+Ucode::Commands::SiteCommand.new.init(site_root: "./site")
+Ucode::Commands::SiteCommand.new.build(output_root: "./output", site_root: "./site")
+```
+Then:
+```sh
+cd site && npm install && npm run dev
+```
+## Glyph extraction (experimental in v0.1; concrete plan for v0.2)
+The `ucode glyphs` command and the `--include-glyphs` flag on `ucode build`
+are **opt-in and experimental in v0.1**. They emit per-codepoint `glyph.svg`
+files today, but the output is not yet suitable for end-user display.
+To run the pipeline anyway (e.g. for development or benchmarking):
+```sh
+ucode glyphs 17.0.0 --to ./output --include-glyphs
+ucode build 17.0.0 --to ./output --include-glyphs
+```
+Both emit a one-line experimental warning on stderr.
+### Why v0.1 glyph output is wrong
+The Code Charts PDFs composite each cell's content — the cell-border
+decoration (L-shaped corner ticks + dashed edges) **and** the actual
+character outline — into a single glyph definition. `pdftocairo -svg` (or
+any other PDF→SVG renderer) faithfully emits that composite as one `<path>`,
+so the v0.1 cell extractor grabs border + character together. Trying to
+post-process that composite path (drop sub-paths that hug the cell edge,
+keep the largest interior cluster) is fragile because the border and the
+character overlap.
+### The v0.2 plan — 4-tier glyph sourcing
+The v0.1 cell-position resolution (`GridDetector` + `CellExtractor`) is
+correct — the right `<use>` element is selected. The fix is not to keep
+post-processing the rendered SVG; it is to **bypass the renderer entirely**
+and read the character outline straight from one of four sources, tried in
+priority order. Lower tiers are fallbacks.
+| Priority | Tier         | Source                                              | Use when                                                                                                                          |
+| -------- | ------------ | --------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------- |
+| 1        | **Tier 1**   | Real-font cmap (`fontist`-discovered)               | A redistributable/accessible font covers the codepoint. Highest fidelity; avoids Code Charts compositing of mark + base.          |
+| 2        | **Pillar 1** | PDF-embedded font + `/ToUnicode` CMap               | Code Charts PDF embeds a subsetted CIDFont whose `/ToUnicode` lets us map glyph IDs to codepoints directly.                       |
+| 3        | **Pillar 2** | PDF content-stream positional correlation           | Code Charts PDF embeds a CIDFont without `/ToUnicode`; glyphs are correlated to codepoints via chart-grid geometry (row/column labels). |
+| 4        | **Pillar 3** | Last Resort UFO                                     | Codepoint is a placeholder box (unassigned, PUA, noncharacter) or no higher tier produced a glyph.                                |
+The naming distinguishes **Tier 1** (real fonts, off-PDF) from the three
+**pillars** (PDF-embedded or fallback). For full details — including the
+PDF font object graph and how each pillar attributes a glyph ID to a
+codepoint — see [docs/architecture.md → The 4-tier glyph sourcing
+strategy](docs/architecture.md#the-4-tier-glyph-sourcing-strategy).
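The priority order in the table above reads as a first-match fallback chain. The following is a minimal sketch of that idea only: `TieredResolver`, the stand-in lambdas, and the returned Hash shape are hypothetical names chosen for illustration and are not ucode's resolver API.

```ruby
# Hypothetical sketch of a first-match fallback chain over glyph sources.
# None of these names come from ucode; each tier is any callable that
# returns an SVG string, or nil when it cannot supply the codepoint.
class TieredResolver
  def initialize(tiers)
    @tiers = tiers # Array of [name, callable], highest priority first
  end

  # Returns { tier:, svg: } for the first tier that yields a glyph, or nil.
  def resolve(codepoint)
    @tiers.each do |name, source|
      svg = source.call(codepoint)
      return { tier: name, svg: svg } if svg
    end
    nil
  end
end

# Stand-in sources: each one covers a small, fixed set of codepoints.
REAL_FONTS  = ->(cp) { "<svg>real</svg>" if cp == 0x0041 }
PILLAR_ONE  = ->(cp) { "<svg>embedded</svg>" if [0x0041, 0x2010].include?(cp) }
LAST_RESORT = ->(cp) { "<svg>placeholder</svg>" if cp >= 0xE000 && cp <= 0xF8FF }

RESOLVER = TieredResolver.new([
  [:tier1, REAL_FONTS],
  [:pillar1, PILLAR_ONE],
  [:pillar3, LAST_RESORT],
])
```

Because the loop returns on the first hit, each codepoint is attributed to at most one tier even when several sources could supply it.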
+**Status (post-v0.2):**
+- **Tier 1** (`Ucode::Glyphs::RealFonts`) — implemented. Uses
+  `fontist` for discovery and `fontisan` for parsing (never `ttfunk`).
+- **Pillar 1** (`Ucode::Glyphs::EmbeddedFonts::Catalog`) — implemented.
+  Walks Type0 → CIDFont → FontDescriptor → FontFile2/3; for fonts with
+  `/ToUnicode`, builds `{codepoint => gid}` directly from the CMap stream
+  and lifts the outline by GID.
+- **Pillar 2** (`Ucode::Glyphs::EmbeddedFonts::ContentStreamCorrelator`)
+  — implemented. Renders the relevant pages to SVG via `mutool draw -F
+  svg`, parses `<use>` elements, partitions labels from specimens by
+  font_obj_id, clusters by quantized (Y, X) position, decodes hex
+  codepoints from joined label glyphs, and matches positionally within
+  Y-rows.
+- **Pillar 3** (`Ucode::Glyphs::LastResort`) — implemented. Reads `.glif`
+  outlines directly from Unicode's
+  [Last Resort Font](https://github.com/unicode-org/last-resort-font) UFO
+  source and converts them to SVG.
+The 4 tiers are MECE (mutually exclusive, collectively exhaustive): every codepoint in the charts is attributed to
+exactly one tier by the canonical resolver. The v0.1 cell extractor is
+retired once all four tiers ship.
+## How embedded font extraction works
+The v0.1 cell extractor rendered each Code Charts page to SVG and grabbed
+the `<path>` that landed in a grid cell. That grabbed the cell-border
+decoration along with the character. v0.2 pillar 1
+(`Ucode::Glyphs::EmbeddedFonts`) bypasses the renderer entirely and reads
+the character outline straight from the embedded font program — which
+contains only the character, never the border.
+### The PDF font object graph
+Every modern Code Charts font is a Type0 (composite) font whose PDF object
+graph has three layers below the Type0 outer font:
+```
+Type0 font (referenced from page content streams)
+  /BaseFont          /CIAIIP+Uni2000Generalpunctuation
+  /Encoding          /Identity-H          ← 2-byte CID encoding
+  /DescendantFonts   [ <CIDFontType2 ref> ]
+  /ToUnicode         <stream ref>         ← CID → Unicode codepoint
+       │
+       ▼
+CIDFontType2 (the "inner" CID font)
+  /BaseFont          /CIAIIP+Uni2000Generalpunctuation
+  /CIDToGIDMap       /Identity            ← CID == GID (common case)
+  /FontDescriptor    <ref>
+       │
+       ▼
+FontDescriptor
+  /FontFile2         <stream ref>         ← TrueType program
+  /FontFile3         <stream ref>         ← CFF / Type 1C (alternative)
+```
+The font program (the binary stream `/FontFile2` or `/FontFile3` points at)
+is the actual outline data — the `glyf` table for TrueType, the
+`CharStrings` dict for CFF. Reading it gives you the character outline with
+zero PDF page content attached.
+### The three ID spaces
+Three different integer ID spaces flow through the graph, and the
+architecture's job is to chain them:
+| ID space | What it numbers | Where it lives |
+| --- | --- | --- |
+| **CID** | Code shown in the content stream (`Tj`/`TJ` operators) | per-font; with `/Identity-H` it is a 16-bit index |
+| **GID** | Glyph in the font program's outline table | the font program itself |
+| **Unicode codepoint** | The scalar value (U+XXXX) the glyph represents | the `/ToUnicode` CMap |
+Two PDF-side maps connect them:
+- **CID → GID** via `/CIDToGIDMap`. If `/Identity`, they are equal.
+  Otherwise it is a binary stream lookup table (which ucode does not
+  currently parse — fonts that need it are skipped).
+- **CID → Unicode codepoint** via the `/ToUnicode` CMap stream
+  (Adobe Technical Note #5014). This is the same map the PDF viewer uses
+  to make text selectable and searchable.
+The third map — **GID → outline** — lives in the font program itself,
+queried by GID.
+### Correlation walk: codepoint → outline
+To render U+2010 (HYPHEN) the pipeline chains all three maps:
+1. **codepoint → FontEntry.** `Catalog#lookup(0x2010)` returns the
+   FontEntry whose ToUnicode CMap mentions U+2010 —
+   `CIAIIP+Uni2000Generalpunctuation`.
+2. **codepoint → GID.** `FontEntry#gid_for(0x2010)` looks up the per-font
+   `codepoint_to_gid` Hash. That Hash was built by inverting the parsed
+   ToUnicode `{cid => cp}` to `{cp => cid}`, then (with
+   `/CIDToGIDMap /Identity`) treating `cid == gid`. So GID = the CID the
+   CMap named.
+3. **GID → outline.** `FontEntry#accessor.outline_for_id(gid)` asks
+   fontisan for the outline at that GID — returns a `GlyphOutline` with
+   contours, control points, and bbox.
+4. **outline → SVG.** `Svg` walks `outline.to_commands`, emits each
+   command with y negated (fonts grow up, SVG grows down), wraps in a
+   viewBox padded 8% around the bbox, and produces a standalone XML
+   document.
+For U+2010 specifically, the ToUnicode CMap of
+`CIAIIP+Uni2000Generalpunctuation` contains:
+```
+1 beginbfchar
+<000A> <2010>
+endbfchar
+```
+CID `0x000A` → Unicode `U+2010`. With Identity CIDToGIDMap, GID = CID =
+10. The renderer asks fontisan for the outline at GID 10.
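Step 2 of the walk for this U+2010 entry can be written out with plain Hashes. This is a sketch under the stated assumption of `/CIDToGIDMap /Identity`; the constant names are illustrative and are not ucode internals.

```ruby
# Sketch of the codepoint -> GID step for the U+2010 example, assuming
# /CIDToGIDMap /Identity. The Hash mirrors the single bfchar entry above.
CID_TO_CODEPOINT = { 0x000A => 0x2010 }.freeze

# Invert {cid => cp} to {cp => cid}. With an identity CIDToGIDMap the CID
# is also the GID, so the inverted Hash doubles as {cp => gid}.
CODEPOINT_TO_GID = CID_TO_CODEPOINT.each_with_object({}) do |(cid, cp), acc|
  acc[cp] ||= cid # keep the first CID if two CIDs name the same codepoint
end.freeze

gid = CODEPOINT_TO_GID[0x2010]
puts format("U+%04X -> GID %d", 0x2010, gid) # prints "U+2010 -> GID 10"
```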
+**Why this is authoritative.** The ToUnicode CMap is the same data the
+PDF viewer uses to make text selectable and searchable. The Code Charts
+authors generated it when subsetting the font; it tells you exactly which
+glyph represents which codepoint. We are not guessing from glyph shape or
+grid position — we are reading the same correlation table the PDF itself
+uses.
+### Pipeline components
+```
+                  ┌──────────────────────────────────────┐
+                  │ Source                                │
+                  │  resolves CodeCharts.pdf + cache_dir │
+                  └──────────────┬───────────────────────┘
+                                 │
+                                 ▼
+                  ┌──────────────────────────────────────┐
+                  │ Catalog                               │
+                  │  walks PDF via mutool →               │
+                  │  builds { codepoint => FontEntry }    │
+                  └────────┬──────────────┬───────────────┘
+                           │              │
+                           ▼              ▼
+              ┌──────────────────┐  ┌──────────────────────┐
+              │ ToUnicode        │  │ FontEntry             │
+              │  parse CMap →    │  │  lazy fontisan accessor│
+              │  { cid => cp }   │  │  + codepoint_to_gid   │
+              └──────────────────┘  └──────────┬───────────┘
+                                               │ on first lookup
+                                               ▼
+                          ┌────────────────────────────────────┐
+                          │ mutool show -o <tmp> -b            │
+                          │   extracts /FontFile2 or /FontFile3│
+                          │   stream → cache_dir/<font>.ttf    │
+                          └────────────────┬───────────────────┘
+                                           │
+                                           ▼
+                          ┌────────────────────────────────────┐
+                          │ fontisan FontLoader                │
+                          │   parses glyf / CharStrings        │
+                          │   → GlyphAccessor                  │
+                          │   → OutlineExtractor               │
+                          │   → GlyphOutline#to_commands       │
+                          └────────────────┬───────────────────┘
+                                           │
+                                           ▼
+                          ┌────────────────────────────────────┐
+                          │ Svg                                │
+                          │   y-flip, viewBox + 8% padding,    │
+                          │   standalone XML                   │
+                          └────────────────────────────────────┘
+```
+**`Source`** — resolves the PDF path (`pdf:` arg →
+`UCODE_CODE_CHARTS_PDF` env → `<gem_root>/CodeCharts.pdf`) and the cache
+directory for extracted font programs (same pattern,
+`UCODE_PDF_FONT_CACHE` env, default `<gem_root>/data/pdf-fonts/`). Raises
+`EmbeddedFontsMissingError` when the resolved PDF doesn't exist.
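The resolution order described for `Source` (explicit argument, then environment variable, then a default) might look roughly like the sketch below. Only the `UCODE_CODE_CHARTS_PDF` variable name comes from the text above; the method name, the `PdfMissingError` class, and the default path are assumptions made for illustration.

```ruby
# Hypothetical sketch of "explicit argument, then environment variable,
# then default" path resolution. PdfMissingError stands in for whatever
# error the real Source raises.
class PdfMissingError < StandardError; end

def resolve_pdf_path(pdf: nil, env: ENV, default: "CodeCharts.pdf")
  candidate = pdf || env["UCODE_CODE_CHARTS_PDF"] || default
  path = File.expand_path(candidate)
  raise PdfMissingError, "Code Charts PDF not found at #{path}" unless File.file?(path)

  path
end
```

Passing `env:` as a parameter keeps the lookup testable without mutating the process environment.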
+**`Catalog`** — walks the PDF once via `mutool` and builds the global
+`{codepoint => FontEntry}` index. Discovery takes five `mutool` steps; the
+first four are batched and the last runs once per font:
+- `mutool info CodeCharts.pdf` — lists every Type0 font and its object ID.
+- `mutool show -g <pdf> <id1> <id2> ...` — batched fetch of Type0 dicts.
+- Same for descendant CIDFont dicts.
+- Same for FontDescriptors.
+- Per-font `mutool show -o <tmp> -b <pdf> <tu_ref>` — fetches each
+  ToUnicode stream (cannot be batched because each is a separate binary
+  stream).
+PDF dict parsing is **not** a full grammar walk — instead, `Catalog`
+regex-extracts each field it needs (`/BaseFont`, `/DescendantFonts[<ref>]`,
+`/ToUnicode <ref>`, `/FontDescriptor <ref>`, `/FontFile2/3 <ref>`,
+`/CIDToGIDMap /Identity|<ref>`). The targeted approach is robust to the
+`<<...>>`/`[...]` nesting that breaks naive whitespace-split parsers.
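A targeted extraction of this kind can be sketched as follows. The patterns, the method name, and the object numbers in the sample dictionary are illustrative assumptions, not the regexes `Catalog` actually uses.

```ruby
# Sketch of targeted field extraction from the text of a PDF dictionary,
# as an alternative to a full grammar walk.
REF = /(\d+)\s+\d+\s+R/ # indirect reference: "<obj> <gen> R"

def extract_type0_fields(dict_text)
  {
    base_font:  dict_text[%r{/BaseFont\s*/([^\s/<>\[\]()]+)}, 1],
    descendant: dict_text[%r{/DescendantFonts\s*\[\s*#{REF}}, 1]&.to_i,
    to_unicode: dict_text[%r{/ToUnicode\s+#{REF}}, 1]&.to_i,
  }
end

# Sample Type0 dictionary text; the object numbers are made up.
SAMPLE_DICT = <<~PDF
  << /Type /Font /Subtype /Type0
     /BaseFont /CIAIIP+Uni2000Generalpunctuation
     /Encoding /Identity-H
     /DescendantFonts [ 412 0 R ]
     /ToUnicode 413 0 R >>
PDF

p extract_type0_fields(SAMPLE_DICT)
```

Each field is matched independently, so a missing key simply yields `nil` instead of derailing the rest of the parse.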
+**`ToUnicode`** — parses a CMap stream text into a frozen
+`{cid => codepoint}` Hash. Supports:
+- `beginbfchar` / `endbfchar` — one-to-one `<cid> <uni>` pairs.
+- `beginbfrange` / `endbfrange` — two forms:
+  - `<lo> <hi> <start>` — cids `lo..hi` map to consecutive codepoints
+    starting at `start`.
+  - `<lo> <hi> [<u1> ... <un>]` — explicit per-cid codepoints within the
+    range.
+- UTF-16 surrogate-pair decoding — 8 hex digits (e.g. `D83DDE00`) decode
+  to one astral codepoint (U+1F600).
+`codespacerange` and `notdefrange` blocks are ignored; multi-codepoint
+targets (ligatures) take only the first codepoint.
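The supported forms can be illustrated with a small parser. This is a sketch, not ucode's `ToUnicode` class: the module name is hypothetical, the CMap text is assumed to be already decompressed, and range targets are incremented as whole codepoints for simplicity.

```ruby
# Minimal sketch of a ToUnicode CMap parser covering bfchar, both bfrange
# forms, surrogate-pair decoding, and first-codepoint-only ligatures.
module ToUnicodeSketch
  module_function

  # Decode a hex target (UTF-16BE units) to its first codepoint. A
  # high/low surrogate pair combines into one astral codepoint.
  def first_codepoint(hex)
    hi, lo = hex.scan(/\h{4}/).map { |unit| unit.to_i(16) }
    if hi && lo && (0xD800..0xDBFF).cover?(hi) && (0xDC00..0xDFFF).cover?(lo)
      0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
    else
      hi
    end
  end

  def parse(text)
    map = {}
    text.scan(/beginbfchar(.*?)endbfchar/m) do |(body)|
      body.scan(/<(\h+)>\s*<(\h+)>/) do |cid, uni|
        map[cid.to_i(16)] = first_codepoint(uni)
      end
    end
    text.scan(/beginbfrange(.*?)endbfrange/m) do |(body)|
      body.scan(/<(\h+)>\s*<(\h+)>\s*(?:<(\h+)>|\[(.*?)\])/m) do |lo, hi, start, list|
        cids = (lo.to_i(16)..hi.to_i(16)).to_a
        if start
          # <lo> <hi> <start>: consecutive codepoints from start
          base = first_codepoint(start)
          cids.each_with_index { |cid, i| map[cid] = base + i }
        else
          # <lo> <hi> [<u1> ... <un>]: explicit per-CID targets
          targets = list.scan(/<(\h+)>/).flatten
          cids.zip(targets).each { |cid, uni| map[cid] = first_codepoint(uni) if uni }
        end
      end
    end
    map.freeze
  end
end
```

For example, `ToUnicodeSketch.parse("1 beginbfchar\n<000A> <2010>\nendbfchar")` yields a frozen Hash mapping CID 10 to `0x2010`.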
+**`FontEntry`** — value object per Type0 font, holds the identity
+(`base_font`, object IDs), the kind of font program (`:ttf` or `:cff`),
+the resolved `cid_to_gid_map` (`:identity` or nil), and the frozen
+`codepoint_to_gid` Hash. The fontisan accessor is built lazily on first
+`#accessor` call: extracts the font stream via `mutool show -o <tmp> -b`
+to a `Tempfile`, atomically moves it into the cache (`FileUtils.mv`), then
+loads via `Fontisan::FontLoader`. Cache hits skip extraction entirely;
+cache files are invalidated by comparing mtime against the source PDF.
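The cache behaviour described for `FontEntry` (reuse when fresh, otherwise extract and move into place) can be sketched like this. The method name is hypothetical, the block stands in for the real `mutool` extraction, and a sibling temp path is used here in place of a `Tempfile`.

```ruby
require "fileutils"

# Sketch of an mtime-checked cache with a move into place. The caller's
# block stands in for the real extraction step and returns the font bytes.
def cached_font_path(cache_path, source_pdf)
  fresh = File.exist?(cache_path) &&
          File.mtime(cache_path) >= File.mtime(source_pdf)
  return cache_path if fresh # cache hit: skip extraction entirely

  FileUtils.mkdir_p(File.dirname(cache_path))
  # Write next to the final path so the move stays on one filesystem.
  tmp = "#{cache_path}.#{Process.pid}.tmp"
  File.binwrite(tmp, yield)
  FileUtils.mv(tmp, cache_path)
  cache_path
end
```

Readers never observe a half-written font file, because the final path only appears once the temp file is complete.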
+**`Svg`** — converts a `GlyphOutline` into a standalone SVG document. Two
+coordinate transforms happen at emit time: y-negation (font space y grows
+up, SVG y grows down) and viewBox computation (bbox plus 8% padding on
+each side, y-flipped). Walks `outline.to_commands` and emits
+`M`/`L`/`Q`/`Z` directly — no intermediate path string is parsed back.
+Emits a `<title>` of the form `U+XXXX (Code Charts: <base_font>)` for
+debugging.
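The two coordinate transforms can be sketched in a few lines. The command shape (`[:move_to, x, y]`, `[:quad_to, cx, cy, x, y]`, and so on), the bbox Hash, and the reading of "8% padding" as 8% of the bbox width and height are all assumptions for illustration; fontisan's actual outline types may differ.

```ruby
# Sketch of y-negation plus padded viewBox computation for one outline.
# The title is inserted as-is; a real emitter would XML-escape it.
def outline_to_svg(commands, bbox, title:, padding: 0.08)
  d = commands.map do |op, *c|
    case op
    when :move_to then "M#{c[0]} #{-c[1]}"
    when :line_to then "L#{c[0]} #{-c[1]}"
    when :quad_to then "Q#{c[0]} #{-c[1]} #{c[2]} #{-c[3]}"
    when :close   then "Z"
    end
  end.join(" ")

  w = bbox[:x_max] - bbox[:x_min]
  h = bbox[:y_max] - bbox[:y_min]
  pad_x = w * padding
  pad_y = h * padding
  # After negating y, the top edge of the glyph sits at -y_max.
  view_box = [bbox[:x_min] - pad_x, -bbox[:y_max] - pad_y,
              w + 2 * pad_x, h + 2 * pad_y].join(" ")

  %(<?xml version="1.0" encoding="UTF-8"?>\n) +
    %(<svg xmlns="http://www.w3.org/2000/svg" viewBox="#{view_box}">) +
    %(<title>#{title}</title><path d="#{d}"/></svg>\n)
end
```

Negating y in both the path data and the viewBox origin keeps the glyph upright without needing a `transform` attribute.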
+**`Renderer`** — thin orchestrator: `Catalog#lookup` →
+`FontEntry#gid_for` → `FontEntry#accessor.outline_for_id` → `Svg#to_s`.
+Returns a `Result` struct (`codepoint`, `base_font`, `gid`, `svg`) on
+success or nil on any miss.
+**`Writer`** — iterates codepoints (defaults to `Catalog#codepoints`),
+calls `Renderer#render`, writes `glyph.svg` into the per-codepoint output
+folder. Idempotent via `Repo::AtomicWrites` (content-hash compare;
+existing identical files are left untouched). Returns a tally
+`{written:, skipped:, missing:, total:}`. `block_lookup:` is a callable
+that maps a codepoint to its original block name (verbatim from
+`Blocks.txt`) — codepoints returning nil are skipped.
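The idempotent write that `Writer` relies on can be approximated as below. This is a sketch under assumptions: `write_if_changed` and its `:written`/`:skipped` return values are hypothetical, and `Repo::AtomicWrites` may compare and rename differently.

```ruby
require "digest"
require "fileutils"

# Sketch of an idempotent write: skip when the existing file already has
# the same content hash, otherwise write a temp file and rename it.
def write_if_changed(path, content)
  if File.exist?(path) &&
     Digest::SHA256.file(path).hexdigest == Digest::SHA256.hexdigest(content)
    return :skipped
  end

  FileUtils.mkdir_p(File.dirname(path))
  tmp = "#{path}.#{Process.pid}.tmp"
  File.binwrite(tmp, content)
  File.rename(tmp, path) # atomic on POSIX when both paths share a filesystem
  :written
end
```

Tallying the returned symbols across codepoints gives counts comparable to the `written:` and `skipped:` entries in the tally.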
+### What pillar 1 does not cover
+Pillar 1 handles only the fonts where correlation is unambiguous. It does not cover:
+- **Label fonts** (`MyriadPro-Bold` and friends) — these draw row/column
+  header text, not character glyphs. They are not Type0 with a ToUnicode
+  CMap, so they are invisible to discovery.
+- **Type0 fonts without `/ToUnicode`** — older subset practice. Without
+  the CMap we cannot attribute a glyph to a codepoint, so the font is
+  skipped. These codepoints fall through to **pillar 2** (content-stream
+  positional correlation), and from there to **pillar 3** (Last Resort)
+  if pillar 2 cannot resolve them either.
+- **Stream-form `/CIDToGIDMap`** — a binary lookup table. Treated as
+  unsupported; the font is skipped.
+- **Bare CFF streams fontisan does not yet recognize** — a separate
+  fontisan-side issue; flagged for investigation.
+Code Charts cells not covered by pillar 1 are exactly the cells whose
+character is not drawn from an embedded subsetted font with
+`/ToUnicode` — either a label, a glyph in a font without `/ToUnicode`,
+or a placeholder. **Pillar 2** (content-stream positional correlation)
+handles the no-`/ToUnicode` case, and **pillar 3** (Last Resort UFO)
+handles the placeholder case; the small remainder are correctly absent
+from the dataset.
+## System dependencies
+- Ruby ≥ 3.1
+- `mupdf-tools` (provides the `mutool` binary) — required for **v0.2 pillar 1
+  glyph extraction** (the default pipeline). `mutool` enumerates the subsetted
+  fonts embedded in `CodeCharts.pdf` and extracts their font program streams
+  for outline parsing. Install via Homebrew with `brew install mupdf-tools`,
+  or via apt with `apt install mupdf-tools`.
+- `fontisan` Ruby gem — pulled in automatically through the `Gemfile`; used
+  by pillar 1 to parse extracted TrueType (`.ttf`) and CFF/Type 1C font
+  programs and emit per-glyph outline data (contours, control points, bbox).
+- `pdftocairo` (poppler) — only required for the experimental v0.1
+  `glyphs` cell-extractor path. Alternatives (`pdf2svg`, `dvisvgm`) are
+  auto-detected.
+- `pdftk` — only required for the v0.1 `glyphs` command's monolith fallback
+  path.
+## Architecture
+Six concerns, each isolated:
+1. **`Ucode::Models`** — `lutaml-model` classes for every UCD aggregate.
+2. **`Ucode::Parsers`** — one streaming parser per UCD text file.
+3. **`Ucode::Coordinator`** — single-pass enrichment that merges indices
+   into each `CodePoint` as it streams.
+4. **`Ucode::Repo`** — atomic, idempotent writers for the output tree.
+5. **`Ucode::Glyphs`** — vector glyph extraction from Code Charts PDFs
+   (experimental in v0.1).
+6. **`Ucode::Site`** — Vitepress scaffold + config/page generator.
+CLI is thin Thor dispatch over `Ucode::Commands::*`. Each command class
+is a pure, in-process testable unit.
+See `CLAUDE.md` for the full architecture notes. See
+`docs/FONTISAN_MIGRATION.md` for the fontisan integration plan.
+## Authoritative source
+ucode parses the **UCD text files** (per UAX #44). The
+`ucd.all.flat.xml` shipped with the repo is reference-only — it omits
+the human-curated relationship data in `NamesList.txt` and has partial
+Unihan coverage. We never parse it.
+## License
+BSD-2-Clause. See `LICENSE.txt`.
+## Code of conduct
+Contributors are expected to follow the standard fontist org CoC.

data/Rakefile ADDED

@@ -0,0 +1,18 @@
+# frozen_string_literal: true
+require "rubygems"
+require "rake"
+require "bundler/gem_tasks"
+require "rspec/core/rake_task"
+RSpec::Core::RakeTask.new(:spec)
+require "rubocop/rake_task"
+RuboCop::RakeTask.new
+require "yard"
+YARD::Rake::YardocTask.new do |t|
+  t.options = ["--output-dir", "docs/api"]
+end
+task default: %i[spec rubocop]

data/TODO.new/00-README.md ADDED

@@ -0,0 +1,66 @@
+# TODO.new — audit migration + Mode 2 work
+Work tracks for the fontisan audit → ucode audit migration, the
+per-font-audit output spec, and the Mode 1 canonical-dataset alignment.
+The full architecture reference is `docs/architecture.md` — read that
+first; these TODOs reference sections of it.
+## Tracks
+### Alignment & contract (lock these before any code moves)
+- [01 — Pillar terminology alignment](01-pillar-terminology-alignment.md)
+- [02 — Audit schema design](02-audit-schema-design.md)
+- [03 — Directory output spec](03-directory-output-spec.md)
+- [04 — fontist.org contract](04-fontist-org-contract.md)
+### Baseline measurement (know where we are)
+- [05 — Unicode 17 baseline coverage audit](05-baseline-unicode17-coverage-audit.md)
+### Audit migration (the big work)
+- [06 — Audit namespace skeleton](06-audit-namespace-skeleton.md)
+- [07 — Models::Audit port](07-audit-models-port.md)
+- [08 — Cheap extractors port](08-extractors-cheap-port.md)
+- [09 — Expensive extractors port](09-extractors-expensive-port.md)
+- [10 — Aggregations rewrite on ucode UCD](10-aggregations-ucd-rewrite.md)
+- [11 — Differ + library auditor port](11-differ-and-library-auditor-port.md)
+- [12 — Formatters port](12-formatters-port.md)
+### Output + browser
+- [13 — Directory emitter](13-directory-emitter.md)
+- [14 — HTML face browser](14-html-face-browser.md)
+- [15 — HTML library browser](15-html-library-browser.md)
+- [16 — CLI audit subcommands](16-cli-audit-subcommands.md)
+### Fontisan cleanup (after ucode audit ships)
+- [17 — Fontisan: delete audit subsystem](17-fontisan-cleanup-audit.md)
+- [18 — Fontisan: delete UCD subsystem](18-fontisan-cleanup-ucd.md)
+- [19 — Fontisan: docs and shim update](19-fontisan-docs-update.md)
+### Canonical Mode 1 alignment
+- [20 — Canonical 4-tier resolver](20-canonical-resolver-4-tier.md)
+- [21 — Canonical Unicode 17 dataset build](21-canonical-unicode17-build.md)
+### Sequencing
+- [22 — Implementation order](22-implementation-order.md)
+## Conventions
+- One concern per file. If a TODO grows past ~250 lines it should split.
+- File numbering is stable; use the next free number for additions.
+- Every TODO lists: Goal, Files, Scope, Acceptance, References.
+- Specs use real model instances — never `double()` (global rule).
+- All new lib files use Ruby `autoload` (declared in the immediate
+  parent namespace's file) for same-library code. No `require_relative`
+  and no `require "ucode/..."` inside the library.
+- No AI attribution in any commit, doc, or comment.
+- Branch naming: `audit/<track-slug>` (e.g. `audit/schema-design`).
+  One PR per track unless tracks are tightly coupled.
+- Land PR #1 (`tier1-cmap-audit`) before starting any track in this dir.
+  The migration builds on top of the merged RealFonts subsystem.
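The `autoload` convention in the list above can be sketched as follows. The module and file names (`AutoloadSketch`, `autoload_sketch/widget`) are hypothetical, and a temporary directory stands in for the gem's `lib/` so that the example runs on its own.

```ruby
require "tmpdir"

# A temp directory stands in for lib/; in a gem these would be real files.
lib_dir = Dir.mktmpdir
Dir.mkdir(File.join(lib_dir, "autoload_sketch"))

# lib/autoload_sketch/widget.rb: the child file has no require statements.
File.write(File.join(lib_dir, "autoload_sketch", "widget.rb"), <<~RUBY)
  module AutoloadSketch
    class Widget
      def name
        "widget"
      end
    end
  end
RUBY

# lib/autoload_sketch.rb: the immediate parent namespace declares the autoload.
File.write(File.join(lib_dir, "autoload_sketch.rb"), <<~RUBY)
  module AutoloadSketch
    autoload :Widget, "autoload_sketch/widget"
  end
RUBY

$LOAD_PATH.unshift(lib_dir)
require "autoload_sketch" # the consumer's entry point, outside the library

pending = AutoloadSketch.autoload?(:Widget) # path while still unloaded
widget = AutoloadSketch::Widget.new         # first reference loads the file
loaded = AutoloadSketch.autoload?(:Widget)  # nil once the constant exists
```

Because the parent file owns the declaration, a child file never needs to know where its siblings live, and nothing is loaded until it is first referenced.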

data/TODO.new/01-pillar-terminology-alignment.md ADDED

@@ -0,0 +1,69 @@
+# 01 — Pillar terminology alignment
+## Goal
+Fix the inconsistency between the README's "two pillars" claim and the
+actual 4-tier glyph sourcing strategy. The recent commit `24e6bfd`
+("Pillar-2 content-stream correlation fallback") was named correctly;
+the "4-tier strategy" section of `docs/architecture.md` is
+authoritative.
+## Problem
+The README currently says (line ~155):
+> ### The v0.2 plan — two pillars
+> 1. Real character glyphs — extract the subsetted fonts from the PDF.
+> 2. Last Resort placeholders — render directly from the UFO source.
+This collapses Tier 1 (real-font cmap) and the three pillars
+into "two pillars". The actual strategy (per project memory and the
+in-tree code) is four-tier:
+1. **Tier 1** — real-font cmap (`Ucode::Glyphs::RealFonts`).
+2. **Pillar 1** — PDF-embedded font with `/ToUnicode` (`EmbeddedFonts::Catalog`).
+3. **Pillar 2** — PDF content-stream correlation (`ContentStreamCorrelator`).
+4. **Pillar 3** — Last Resort UFO (`Ucode::Glyphs::LastResort`).
+The mismatch confuses anyone reading the code (where each tier is
+distinct) vs the README (which merges three of them).
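To make the distinction between the four tiers concrete, here is a minimal sketch of an ordered-fallback resolver. It assumes the tiers are tried in the order listed and that the first source returning a glyph wins; that ordering is an inference from the "fallback" wording, not something this TODO states. `TieredResolver`, `Resolution`, and the lambdas are hypothetical stand-ins for `RealFonts`, `EmbeddedFonts::Catalog`, `ContentStreamCorrelator`, and `LastResort`.

```ruby
# Hypothetical result type: which tier supplied the glyph for a codepoint.
Resolution = Struct.new(:codepoint, :tier, :glyph)

class TieredResolver
  # +tiers+ is an ordered list of [label, source] pairs. A source is any
  # callable that returns a glyph for a codepoint, or nil if it has none.
  def initialize(tiers)
    @tiers = tiers
  end

  # Walks the tiers in priority order and returns the first hit.
  def resolve(codepoint)
    @tiers.each do |label, source|
      glyph = source.call(codepoint)
      return Resolution.new(codepoint, label, glyph) if glyph
    end
    nil
  end
end

# Toy sources: hashes and strings stand in for real outline data.
resolver = TieredResolver.new([
  ["Tier 1",   ->(cp) { { 0x41 => "A-outline" }[cp] }],
  ["Pillar 1", ->(cp) { { 0x42 => "B-outline" }[cp] }],
  ["Pillar 2", ->(_cp) { nil }],
  ["Pillar 3", ->(cp) { "placeholder-#{format('%04X', cp)}" }],
])

hit = resolver.resolve(0x42)
```

Keeping the label alongside the glyph records provenance, which is exactly the information lost when the prose merges several tiers into one.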
+## Files to change
+- `README.md` — replace the "two pillars" section with the 4-tier table.
+  Cross-link to `docs/architecture.md` §"The 4-tier glyph sourcing
+  strategy" as the canonical reference.
+- `docs/architecture.md` — already correct; no change here.
+- `CLAUDE.md` — has a brief mention of glyph sourcing; align the
+  vocabulary with the 4-tier names.
+## Scope
+In scope:
+- README rewrite (one section, ~50 lines).
+- CLAUDE.md vocabulary tweak (one paragraph).
+- No code changes.
+Out of scope:
+- Renaming any code symbol. The current symbols (`RealFonts`,
+  `EmbeddedFonts::Catalog`, `ContentStreamCorrelator`, `LastResort`) are
+  fine; the names match their function. Only the prose label "tier" vs
+  "pillar" needs disambiguation.
+- Updating the commit message of `24e6bfd`. The commit was correctly
+  named; do not rewrite history.
+## Acceptance
+- `grep -ni "two pillars" README.md` returns no matches.
+- `grep -ni "pillar" README.md` returns only matches that fit the 4-tier
+  vocabulary (Tier 1 + Pillars 1-3).
+- README's strategy section cross-links to `docs/architecture.md`.
+- No code changes; no spec changes; no changelog entry needed beyond
+  the commit message.
+## References
+- `docs/architecture.md` §"The 4-tier glyph sourcing strategy"
+- Commit `24e6bfd` (correctly named)
+- Commit `307fda3` (Tier-1 implementation)
+- Memory: `ucode_glyph_extraction_cell_border_bug.md`