ucode 0.1.0

Files changed (228)
  1. checksums.yaml +7 -0
  2. data/CLAUDE.md +211 -0
  3. data/Gemfile +22 -0
  4. data/Gemfile.lock +406 -0
  5. data/README.md +469 -0
  6. data/Rakefile +18 -0
  7. data/TODO.new/00-README.md +66 -0
  8. data/TODO.new/01-pillar-terminology-alignment.md +69 -0
  9. data/TODO.new/02-audit-schema-design.md +255 -0
  10. data/TODO.new/03-directory-output-spec.md +203 -0
  11. data/TODO.new/04-fontist-org-contract.md +173 -0
  12. data/TODO.new/05-baseline-unicode17-coverage-audit.md +144 -0
  13. data/TODO.new/06-audit-namespace-skeleton.md +105 -0
  14. data/TODO.new/07-audit-models-port.md +132 -0
  15. data/TODO.new/08-extractors-cheap-port.md +113 -0
  16. data/TODO.new/09-extractors-expensive-port.md +99 -0
  17. data/TODO.new/10-aggregations-ucd-rewrite.md +168 -0
  18. data/TODO.new/11-differ-and-library-auditor-port.md +102 -0
  19. data/TODO.new/12-formatters-port.md +115 -0
  20. data/TODO.new/13-directory-emitter.md +147 -0
  21. data/TODO.new/14-html-face-browser.md +144 -0
  22. data/TODO.new/15-html-library-browser.md +102 -0
  23. data/TODO.new/16-cli-audit-subcommands.md +142 -0
  24. data/TODO.new/17-fontisan-cleanup-audit.md +147 -0
  25. data/TODO.new/18-fontisan-cleanup-ucd.md +156 -0
  26. data/TODO.new/19-fontisan-docs-update.md +155 -0
  27. data/TODO.new/20-canonical-resolver-4-tier.md +182 -0
  28. data/TODO.new/21-canonical-unicode17-build.md +148 -0
  29. data/TODO.new/22-implementation-order.md +176 -0
  30. data/UCODE_CHANGELOG.md +97 -0
  31. data/exe/ucode +8 -0
  32. data/lib/ucode/aggregator.rb +77 -0
  33. data/lib/ucode/audit/block_aggregator.rb +90 -0
  34. data/lib/ucode/audit/codepoint_range_coalescer.rb +42 -0
  35. data/lib/ucode/audit/context.rb +137 -0
  36. data/lib/ucode/audit/discrepancy_detector.rb +213 -0
  37. data/lib/ucode/audit/extractors/aggregations.rb +70 -0
  38. data/lib/ucode/audit/extractors/base.rb +21 -0
  39. data/lib/ucode/audit/extractors/color_capabilities.rb +143 -0
  40. data/lib/ucode/audit/extractors/coverage.rb +55 -0
  41. data/lib/ucode/audit/extractors/hinting.rb +199 -0
  42. data/lib/ucode/audit/extractors/identity.rb +65 -0
  43. data/lib/ucode/audit/extractors/licensing.rb +75 -0
  44. data/lib/ucode/audit/extractors/metrics.rb +108 -0
  45. data/lib/ucode/audit/extractors/opentype_layout.rb +71 -0
  46. data/lib/ucode/audit/extractors/provenance.rb +34 -0
  47. data/lib/ucode/audit/extractors/style.rb +88 -0
  48. data/lib/ucode/audit/extractors/variation_detail.rb +101 -0
  49. data/lib/ucode/audit/extractors.rb +31 -0
  50. data/lib/ucode/audit/plane_aggregator.rb +37 -0
  51. data/lib/ucode/audit/registry.rb +63 -0
  52. data/lib/ucode/audit/script_aggregator.rb +92 -0
  53. data/lib/ucode/audit.rb +27 -0
  54. data/lib/ucode/cache.rb +113 -0
  55. data/lib/ucode/cli.rb +272 -0
  56. data/lib/ucode/commands/build.rb +68 -0
  57. data/lib/ucode/commands/cache.rb +46 -0
  58. data/lib/ucode/commands/fetch.rb +62 -0
  59. data/lib/ucode/commands/font_coverage.rb +57 -0
  60. data/lib/ucode/commands/glyphs.rb +136 -0
  61. data/lib/ucode/commands/lookup.rb +65 -0
  62. data/lib/ucode/commands/parse.rb +62 -0
  63. data/lib/ucode/commands/site.rb +33 -0
  64. data/lib/ucode/commands.rb +19 -0
  65. data/lib/ucode/config.rb +110 -0
  66. data/lib/ucode/coordinator/indices.rb +34 -0
  67. data/lib/ucode/coordinator.rb +397 -0
  68. data/lib/ucode/database.rb +214 -0
  69. data/lib/ucode/db_builder.rb +107 -0
  70. data/lib/ucode/error.rb +96 -0
  71. data/lib/ucode/fetch/code_charts.rb +57 -0
  72. data/lib/ucode/fetch/http.rb +83 -0
  73. data/lib/ucode/fetch/ucd_zip.rb +57 -0
  74. data/lib/ucode/fetch/unihan_zip.rb +57 -0
  75. data/lib/ucode/fetch.rb +14 -0
  76. data/lib/ucode/glyphs/cell_extractor.rb +130 -0
  77. data/lib/ucode/glyphs/dvisvgm_renderer.rb +29 -0
  78. data/lib/ucode/glyphs/embedded_fonts/catalog.rb +372 -0
  79. data/lib/ucode/glyphs/embedded_fonts/content_stream_correlator.rb +228 -0
  80. data/lib/ucode/glyphs/embedded_fonts/font_entry.rb +126 -0
  81. data/lib/ucode/glyphs/embedded_fonts/renderer.rb +47 -0
  82. data/lib/ucode/glyphs/embedded_fonts/source.rb +94 -0
  83. data/lib/ucode/glyphs/embedded_fonts/svg.rb +123 -0
  84. data/lib/ucode/glyphs/embedded_fonts/tounicode.rb +103 -0
  85. data/lib/ucode/glyphs/embedded_fonts/writer.rb +76 -0
  86. data/lib/ucode/glyphs/embedded_fonts.rb +50 -0
  87. data/lib/ucode/glyphs/grid.rb +30 -0
  88. data/lib/ucode/glyphs/grid_detector.rb +165 -0
  89. data/lib/ucode/glyphs/last_resort/cmap_index.rb +96 -0
  90. data/lib/ucode/glyphs/last_resort/contents.rb +74 -0
  91. data/lib/ucode/glyphs/last_resort/glif.rb +124 -0
  92. data/lib/ucode/glyphs/last_resort/renderer.rb +67 -0
  93. data/lib/ucode/glyphs/last_resort/source.rb +125 -0
  94. data/lib/ucode/glyphs/last_resort/svg.rb +247 -0
  95. data/lib/ucode/glyphs/last_resort/writer.rb +83 -0
  96. data/lib/ucode/glyphs/last_resort.rb +36 -0
  97. data/lib/ucode/glyphs/monolith_page_map.rb +181 -0
  98. data/lib/ucode/glyphs/mutool_renderer.rb +28 -0
  99. data/lib/ucode/glyphs/page_renderer.rb +221 -0
  100. data/lib/ucode/glyphs/path_bbox.rb +62 -0
  101. data/lib/ucode/glyphs/pdf2svg_renderer.rb +26 -0
  102. data/lib/ucode/glyphs/pdf_fetcher.rb +102 -0
  103. data/lib/ucode/glyphs/pdftocairo_renderer.rb +32 -0
  104. data/lib/ucode/glyphs/real_fonts/block_coverage.rb +45 -0
  105. data/lib/ucode/glyphs/real_fonts/coverage_auditor.rb +117 -0
  106. data/lib/ucode/glyphs/real_fonts/font_coverage_report.rb +45 -0
  107. data/lib/ucode/glyphs/real_fonts/font_locator.rb +95 -0
  108. data/lib/ucode/glyphs/real_fonts/unicode_17_blocks.rb +104 -0
  109. data/lib/ucode/glyphs/real_fonts/writer.rb +50 -0
  110. data/lib/ucode/glyphs/real_fonts.rb +32 -0
  111. data/lib/ucode/glyphs/writer.rb +250 -0
  112. data/lib/ucode/glyphs.rb +27 -0
  113. data/lib/ucode/index.rb +106 -0
  114. data/lib/ucode/index_builder.rb +94 -0
  115. data/lib/ucode/models/audit/audit_axis.rb +30 -0
  116. data/lib/ucode/models/audit/audit_diff.rb +77 -0
  117. data/lib/ucode/models/audit/audit_report.rb +137 -0
  118. data/lib/ucode/models/audit/baseline.rb +32 -0
  119. data/lib/ucode/models/audit/block_summary.rb +72 -0
  120. data/lib/ucode/models/audit/codepoint_detail.rb +45 -0
  121. data/lib/ucode/models/audit/codepoint_range.rb +39 -0
  122. data/lib/ucode/models/audit/codepoint_set_diff.rb +34 -0
  123. data/lib/ucode/models/audit/color_capabilities.rb +91 -0
  124. data/lib/ucode/models/audit/discrepancy.rb +38 -0
  125. data/lib/ucode/models/audit/duplicate_group.rb +23 -0
  126. data/lib/ucode/models/audit/embedding_type.rb +81 -0
  127. data/lib/ucode/models/audit/field_change.rb +28 -0
  128. data/lib/ucode/models/audit/fs_selection_flags.rb +65 -0
  129. data/lib/ucode/models/audit/gasp_range.rb +63 -0
  130. data/lib/ucode/models/audit/hinting.rb +99 -0
  131. data/lib/ucode/models/audit/library_summary.rb +40 -0
  132. data/lib/ucode/models/audit/licensing.rb +48 -0
  133. data/lib/ucode/models/audit/metrics.rb +111 -0
  134. data/lib/ucode/models/audit/named_instance.rb +41 -0
  135. data/lib/ucode/models/audit/opentype_layout.rb +38 -0
  136. data/lib/ucode/models/audit/plane_summary.rb +31 -0
  137. data/lib/ucode/models/audit/script_coverage_row.rb +26 -0
  138. data/lib/ucode/models/audit/script_features.rb +28 -0
  139. data/lib/ucode/models/audit/script_summary.rb +54 -0
  140. data/lib/ucode/models/audit/variation_detail.rb +42 -0
  141. data/lib/ucode/models/audit.rb +50 -0
  142. data/lib/ucode/models/bidi_bracket_pair.rb +20 -0
  143. data/lib/ucode/models/bidi_mirroring.rb +19 -0
  144. data/lib/ucode/models/binary_property_assignment.rb +26 -0
  145. data/lib/ucode/models/block.rb +36 -0
  146. data/lib/ucode/models/case_folding_rule.rb +23 -0
  147. data/lib/ucode/models/cjk_radical.rb +23 -0
  148. data/lib/ucode/models/codepoint/bidi.rb +28 -0
  149. data/lib/ucode/models/codepoint/break_segmentation.rb +22 -0
  150. data/lib/ucode/models/codepoint/case_folding.rb +25 -0
  151. data/lib/ucode/models/codepoint/casing.rb +32 -0
  152. data/lib/ucode/models/codepoint/decomposition.rb +27 -0
  153. data/lib/ucode/models/codepoint/display.rb +24 -0
  154. data/lib/ucode/models/codepoint/emoji.rb +29 -0
  155. data/lib/ucode/models/codepoint/hangul.rb +20 -0
  156. data/lib/ucode/models/codepoint/identifier.rb +30 -0
  157. data/lib/ucode/models/codepoint/indic.rb +20 -0
  158. data/lib/ucode/models/codepoint/joining.rb +20 -0
  159. data/lib/ucode/models/codepoint/normalization.rb +35 -0
  160. data/lib/ucode/models/codepoint/numeric_value.rb +35 -0
  161. data/lib/ucode/models/codepoint.rb +122 -0
  162. data/lib/ucode/models/name_alias.rb +21 -0
  163. data/lib/ucode/models/named_sequence.rb +19 -0
  164. data/lib/ucode/models/names_list_entry.rb +38 -0
  165. data/lib/ucode/models/plane.rb +36 -0
  166. data/lib/ucode/models/property_alias.rb +24 -0
  167. data/lib/ucode/models/property_value_alias.rb +26 -0
  168. data/lib/ucode/models/relationship/compat_equiv.rb +18 -0
  169. data/lib/ucode/models/relationship/cross_reference.rb +17 -0
  170. data/lib/ucode/models/relationship/footnote.rb +24 -0
  171. data/lib/ucode/models/relationship/informal_alias.rb +18 -0
  172. data/lib/ucode/models/relationship/sample_sequence.rb +24 -0
  173. data/lib/ucode/models/relationship/variation_sequence.rb +19 -0
  174. data/lib/ucode/models/relationship.rb +57 -0
  175. data/lib/ucode/models/script.rb +41 -0
  176. data/lib/ucode/models/special_casing_rule.rb +28 -0
  177. data/lib/ucode/models/standardized_variant.rb +24 -0
  178. data/lib/ucode/models/unihan_entry.rb +23 -0
  179. data/lib/ucode/models.rb +47 -0
  180. data/lib/ucode/parsers/auxiliary.rb +26 -0
  181. data/lib/ucode/parsers/base.rb +137 -0
  182. data/lib/ucode/parsers/bidi_brackets.rb +41 -0
  183. data/lib/ucode/parsers/bidi_mirroring.rb +37 -0
  184. data/lib/ucode/parsers/blocks.rb +63 -0
  185. data/lib/ucode/parsers/case_folding.rb +53 -0
  186. data/lib/ucode/parsers/cjk_radicals.rb +102 -0
  187. data/lib/ucode/parsers/derived_age.rb +59 -0
  188. data/lib/ucode/parsers/derived_core_properties.rb +60 -0
  189. data/lib/ucode/parsers/extracted_properties.rb +74 -0
  190. data/lib/ucode/parsers/name_aliases.rb +44 -0
  191. data/lib/ucode/parsers/named_sequences.rb +51 -0
  192. data/lib/ucode/parsers/names_list.rb +250 -0
  193. data/lib/ucode/parsers/property_aliases.rb +41 -0
  194. data/lib/ucode/parsers/property_value_aliases.rb +46 -0
  195. data/lib/ucode/parsers/script_extensions.rb +64 -0
  196. data/lib/ucode/parsers/scripts.rb +60 -0
  197. data/lib/ucode/parsers/special_casing.rb +62 -0
  198. data/lib/ucode/parsers/standardized_variants.rb +56 -0
  199. data/lib/ucode/parsers/unicode_data/hangul_name.rb +73 -0
  200. data/lib/ucode/parsers/unicode_data.rb +268 -0
  201. data/lib/ucode/parsers/unihan.rb +125 -0
  202. data/lib/ucode/parsers.rb +35 -0
  203. data/lib/ucode/range_entry.rb +58 -0
  204. data/lib/ucode/repo/aggregate_writer.rb +364 -0
  205. data/lib/ucode/repo/atomic_writes.rb +48 -0
  206. data/lib/ucode/repo/codepoint_writer.rb +96 -0
  207. data/lib/ucode/repo/paths.rb +122 -0
  208. data/lib/ucode/repo.rb +22 -0
  209. data/lib/ucode/site/config_emitter.rb +124 -0
  210. data/lib/ucode/site/generator.rb +178 -0
  211. data/lib/ucode/site/search_index.rb +68 -0
  212. data/lib/ucode/site/template/.gitignore +4 -0
  213. data/lib/ucode/site/template/.vitepress/config.ts +8 -0
  214. data/lib/ucode/site/template/.vitepress/theme/index.js +20 -0
  215. data/lib/ucode/site/template/char/[codepoint].md +13 -0
  216. data/lib/ucode/site/template/components/BlockView.vue +57 -0
  217. data/lib/ucode/site/template/components/CharView.vue +85 -0
  218. data/lib/ucode/site/template/components/PlaneView.vue +56 -0
  219. data/lib/ucode/site/template/components/SearchView.vue +66 -0
  220. data/lib/ucode/site/template/index.md +25 -0
  221. data/lib/ucode/site/template/package.json +18 -0
  222. data/lib/ucode/site/template/search.md +9 -0
  223. data/lib/ucode/site.rb +13 -0
  224. data/lib/ucode/version.rb +5 -0
  225. data/lib/ucode/version_resolver.rb +76 -0
  226. data/lib/ucode.rb +74 -0
  227. data/ucode.gemspec +56 -0
  228. metadata +404 -0
data/README.md ADDED
@@ -0,0 +1,469 @@
1
+ # ucode
2
+
3
+ `ucode` is a Ruby toolkit for the Unicode Character Database (UCD). It turns the
4
+ official UCD text files into a structured, browsable dataset: one JSON document
5
+ per assigned codepoint, plus a Vitepress site for navigation.
6
+
7
+ > **Status (v0.1).** The JSON dataset, lookup index, and Vitepress site are
8
+ > production-ready. **SVG glyph extraction from the Code Charts PDFs is
9
+ > experimental and deferred to v0.2** — see
10
+ > [Glyph extraction (experimental)](#glyph-extraction-experimental-in-v01-concrete-plan-for-v02) below.
11
+
12
+ ## What you get (v0.1)
13
+
14
+ - **Per-codepoint JSON** at `output/blocks/<BLOCK>/<U+XXXX>/index.json` with
15
+ full UCD properties, the human-curated relationships from `NamesList.txt`
16
+ (cross-references, see-also, compatibility equivalents, sample sequences,
17
+ informal aliases, footnotes), Unihan readings, and machine-computed refs
18
+ (decomposition, case mappings, case folding, bidi mirror, named sequences,
19
+ standardized variants, script extensions).
20
+ - **Aggregate JSON**: planes, blocks, scripts, search index, enums,
21
+ relationships, named sequences, manifest.
22
+ - **SQLite lookup index** for fast codepoint → block/script/char queries.
23
+ - **Vitepress site** at `site/` for browsing Plane → Block → Character.
24
+
25
+ ## Install
26
+
27
+ ```sh
28
+ gem install ucode
29
+ ```
30
+
31
+ Or in a Gemfile:
32
+
33
+ ```ruby
34
+ gem "ucode", "~> 0.1"
35
+ ```
36
+
37
+ ## Quick start
38
+
39
+ ```sh
40
+ # 1. Fetch UCD + Unihan for Unicode 17.0.0
41
+ ucode fetch ucd 17.0.0
42
+ ucode fetch unihan 17.0.0
43
+
44
+ # 2. Stream UCD → output/ JSON tree
45
+ ucode parse 17.0.0 --to ./output
46
+
47
+ # 3. (Optional) Build the SQLite lookup index + dataset in one go
48
+ ucode build 17.0.0 --to ./output # fetch + parse (glyphs skipped by default)
49
+
50
+ # 4. (Optional) Generate the Vitepress site
51
+ ucode site init --to ./site
52
+ ucode site build --from ./output --to ./site
53
+ cd site && npm install && npm run dev
54
+ ```
55
+
56
+ ## Three modes
57
+
58
+ ### Lookup mode
59
+
60
+ Read-only access to the SQLite cache.
61
+
62
+ ```ruby
63
+ require "ucode"
64
+
65
+ db = Ucode::Database.open("17.0.0")
66
+ db.lookup_block(0x0041) # => "Basic Latin"
67
+ db.lookup_script(0x0041) # => "Latin"
68
+ ```
69
+
70
+ CLI equivalent:
71
+
72
+ ```sh
73
+ ucode lookup block 0x0041 # U+0041 → Basic Latin
74
+ ucode lookup char U+1F600
75
+ ```
76
+
77
+ ### Dataset mode
78
+
79
+ Build the per-codepoint JSON dataset.
80
+
81
+ ```ruby
82
+ require "ucode"
83
+
84
+ Ucode::Commands::ParseCommand.new.call("17.0.0", output_root: "./output")
85
+ ```
86
+
87
+ Or via CLI:
88
+
89
+ ```sh
90
+ ucode build 17.0.0 --to ./output
91
+ ```
92
+
93
+ ### Site mode
94
+
95
+ Generate the Vitepress site.
96
+
97
+ ```ruby
98
+ require "ucode"
99
+
100
+ Ucode::Commands::SiteCommand.new.init(site_root: "./site")
101
+ Ucode::Commands::SiteCommand.new.build(output_root: "./output", site_root: "./site")
102
+ ```
103
+
104
+ Then:
105
+
106
+ ```sh
107
+ cd site && npm install && npm run dev
108
+ ```
109
+
110
+ ## Glyph extraction (experimental in v0.1; concrete plan for v0.2)
111
+
112
+ The `ucode glyphs` command and the `--include-glyphs` flag on `ucode build`
113
+ are **opt-in and experimental in v0.1**. They emit per-codepoint `glyph.svg`
114
+ files today, but the output is not yet suitable for end-user display.
115
+
116
+ To run the pipeline anyway (e.g. for development or benchmarking):
117
+
118
+ ```sh
119
+ ucode glyphs 17.0.0 --to ./output --include-glyphs
120
+ ucode build 17.0.0 --to ./output --include-glyphs
121
+ ```
122
+
123
+ Both emit a one-line experimental warning on stderr.
124
+
125
+ ### Why v0.1 glyph output is wrong
126
+
127
+ The Code Charts PDFs composite each cell's content — the cell-border
128
+ decoration (L-shaped corner ticks + dashed edges) **and** the actual
129
+ character outline — into a single glyph definition. `pdftocairo -svg` (or
130
+ any other PDF→SVG renderer) faithfully emits that composite as one `<path>`,
131
+ so the v0.1 cell extractor grabs border + character together. Trying to
132
+ post-process that composite path (drop sub-paths that hug the cell edge,
133
+ keep the largest interior cluster) is fragile because the border and the
134
+ character overlap.
135
+
136
+ ### The v0.2 plan — 4-tier glyph sourcing
137
+
138
+ The v0.1 cell-position resolution (`GridDetector` + `CellExtractor`) is
139
+ correct — the right `<use>` element is selected. The fix is not to keep
140
+ post-processing the rendered SVG; it is to **bypass the renderer entirely**
141
+ and read the character outline straight from one of four sources, tried in
142
+ priority order. Lower tiers are fallbacks.
143
+
144
+ | Priority | Tier | Source | Use when |
145
+ | -------- | ------------ | --------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------- |
146
+ | 1 | **Tier 1** | Real-font cmap (`fontist`-discovered) | A redistributable/accessible font covers the codepoint. Highest fidelity; avoids Code Charts compositing of mark + base. |
147
+ | 2 | **Pillar 1** | PDF-embedded font + `/ToUnicode` CMap | Code Charts PDF embeds a subsetted CIDFont whose `/ToUnicode` lets us map glyph IDs to codepoints directly. |
148
+ | 3 | **Pillar 2** | PDF content-stream positional correlation | Code Charts PDF embeds a CIDFont without `/ToUnicode`; glyphs are correlated to codepoints via chart-grid geometry (row/column labels). |
149
+ | 4 | **Pillar 3** | Last Resort UFO | Codepoint is a placeholder box (unassigned, PUA, noncharacter) or no higher tier produced a glyph. |
150
+
151
+ The naming distinguishes **Tier 1** (real fonts, off-PDF) from the three
152
+ **pillars** (PDF-embedded or fallback). For full details — including the
153
+ PDF font object graph and how each pillar attributes a glyph ID to a
154
+ codepoint — see [docs/architecture.md → The 4-tier glyph sourcing
155
+ strategy](docs/architecture.md#the-4-tier-glyph-sourcing-strategy).
156
+
157
+ **Status (post-v0.2):**
158
+
159
+ - **Tier 1** (`Ucode::Glyphs::RealFonts`) — implemented. Uses
160
+ `fontist` for discovery and `fontisan` for parsing (never `ttfunk`).
161
+ - **Pillar 1** (`Ucode::Glyphs::EmbeddedFonts::Catalog`) — implemented.
162
+ Walks Type0 → CIDFont → FontDescriptor → FontFile2/3; for fonts with
163
+ `/ToUnicode`, builds `{codepoint => gid}` directly from the CMap stream
164
+ and lifts the outline by GID.
165
+ - **Pillar 2** (`Ucode::Glyphs::EmbeddedFonts::ContentStreamCorrelator`)
166
+ — implemented. Renders the relevant pages to SVG via `mutool draw -F
167
+ svg`, parses `<use>` elements, partitions labels from specimens by
168
+ font_obj_id, clusters by quantized (Y, X) position, decodes hex
169
+ codepoints from joined label glyphs, and matches positionally within
170
+ Y-rows.
171
+ - **Pillar 3** (`Ucode::Glyphs::LastResort`) — implemented. Reads `.glif`
172
+ outlines directly from Unicode's
173
+ [Last Resort Font](https://github.com/unicode-org/last-resort-font) UFO
174
+ source and converts them to SVG.
175
+
176
+ The 4 tiers are MECE (mutually exclusive, collectively exhaustive): every codepoint in the charts is attributed to
177
+ exactly one tier by the canonical resolver. The v0.1 cell extractor is
178
+ retired once all four tiers ship.
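The priority-order fallback in the table above can be sketched roughly as follows. This is a minimal illustration, not the gem's canonical resolver: `resolve_glyph` and the lambda lookups are hypothetical stand-ins for the real tier implementations.

```ruby
# Hypothetical resolver sketch. Tier names follow the table above; the
# lookup callables are stand-ins, not the gem's real tier classes.
def resolve_glyph(codepoint, tiers)
  # Hashes iterate in insertion order, so tiers are tried by priority.
  tiers.each do |name, lookup|
    glyph = lookup.call(codepoint)
    return { tier: name, glyph: glyph } if glyph
  end
  nil # no tier produced a glyph
end

tiers = {
  tier1:   ->(cp) { cp == 0x0041 ? "<svg>A</svg>" : nil },
  pillar1: ->(cp) { cp == 0x2010 ? "<svg>hyphen</svg>" : nil },
  pillar2: ->(_cp) { nil },
  pillar3: ->(_cp) { "<svg>last resort</svg>" },
}

resolve_glyph(0x2010, tiers)[:tier] # => :pillar1
```

Because the first tier that returns a glyph wins, each codepoint ends up attributed to exactly one tier.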
179
+
180
+ ## How embedded font extraction works
181
+
182
+ The v0.1 cell extractor rendered each Code Charts page to SVG and grabbed
183
+ the `<path>` that landed in a grid cell. That grabbed the cell-border
184
+ decoration along with the character. v0.2 pillar 1
185
+ (`Ucode::Glyphs::EmbeddedFonts`) bypasses the renderer entirely and reads
186
+ the character outline straight from the embedded font program — which
187
+ contains only the character, never the border.
188
+
189
+ ### The PDF font object graph
190
+
191
+ Every modern Code Charts font is a Type0 (composite) font whose PDF object
192
+ graph has three layers below the Type0 outer font:
193
+
194
+ ```
195
+ Type0 font (referenced from page content streams)
196
+ /BaseFont /CIAIIP+Uni2000Generalpunctuation
197
+ /Encoding /Identity-H ← 2-byte CID encoding
198
+ /DescendantFonts [ <CIDFontType2 ref> ]
199
+ /ToUnicode <stream ref> ← CID → Unicode codepoint
200
+
201
+
202
+ CIDFontType2 (the "inner" CID font)
203
+ /BaseFont /CIAIIP+Uni2000Generalpunctuation
204
+ /CIDToGIDMap /Identity ← CID == GID (common case)
205
+ /FontDescriptor <ref>
206
+
207
+
208
+ FontDescriptor
209
+ /FontFile2 <stream ref> ← TrueType program
210
+ /FontFile3 <stream ref> ← CFF / Type 1C (alternative)
211
+ ```
212
+
213
+ The font program (the binary stream `/FontFile2` or `/FontFile3` points at)
214
+ is the actual outline data — the `glyf` table for TrueType, the
215
+ `CharStrings` dict for CFF. Reading it gives you the character outline with
216
+ zero PDF page content attached.
217
+
218
+ ### The three ID spaces
219
+
220
+ Three different integer ID spaces flow through the graph, and the
221
+ architecture's job is to chain them:
222
+
223
+ | ID space | What it numbers | Where it lives |
224
+ | --- | --- | --- |
225
+ | **CID** | Code shown in the content stream (`Tj`/`TJ` operators) | per-font; with `/Identity-H` it is a 16-bit index |
226
+ | **GID** | Glyph in the font program's outline table | the font program itself |
227
+ | **Unicode codepoint** | The scalar value (U+XXXX) the glyph represents | the `/ToUnicode` CMap |
228
+
229
+ Two PDF-side maps connect them:
230
+
231
+ - **CID → GID** via `/CIDToGIDMap`. If `/Identity`, they are equal.
232
+ Otherwise it is a binary stream lookup table (which ucode does not
233
+ currently parse — fonts that need it are skipped).
234
+ - **CID → Unicode codepoint** via the `/ToUnicode` CMap stream
235
+ (Adobe Technical Note #5014). This is the same map the PDF viewer uses
236
+ to make text selectable and searchable.
237
+
238
+ The third map — **GID → outline** — lives in the font program itself,
239
+ queried by GID.
240
+
241
+ ### Correlation walk: codepoint → outline
242
+
243
+ To render U+2010 (HYPHEN) the pipeline chains all three maps:
244
+
245
+ 1. **codepoint → FontEntry.** `Catalog#lookup(0x2010)` returns the
246
+ FontEntry whose ToUnicode CMap mentions U+2010 —
247
+ `CIAIIP+Uni2000Generalpunctuation`.
248
+ 2. **codepoint → GID.** `FontEntry#gid_for(0x2010)` looks up the per-font
249
+ `codepoint_to_gid` Hash. That Hash was built by inverting the parsed
250
+ ToUnicode `{cid => cp}` to `{cp => cid}`, then (with
251
+ `/CIDToGIDMap /Identity`) treating `cid == gid`. So GID = the CID the
252
+ CMap named.
253
+ 3. **GID → outline.** `FontEntry#accessor.outline_for_id(gid)` asks
254
+ fontisan for the outline at that GID — returns a `GlyphOutline` with
255
+ contours, control points, and bbox.
256
+ 4. **outline → SVG.** `Svg` walks `outline.to_commands`, emits each
257
+ command with y negated (fonts grow up, SVG grows down), wraps in a
258
+ viewBox padded 8% around the bbox, and produces a standalone XML
259
+ document.
260
+
261
+ For U+2010 specifically, the ToUnicode CMap of
262
+ `CIAIIP+Uni2000Generalpunctuation` contains:
263
+
264
+ ```
265
+ 1 beginbfchar
266
+ <000A> <2010>
267
+ endbfchar
268
+ ```
269
+
270
+ CID `0x000A` → Unicode `U+2010`. With Identity CIDToGIDMap, GID = CID =
271
+ 10. The renderer asks fontisan for the outline at GID 10.
272
+
273
+ **Why this is authoritative.** The ToUnicode CMap is the same data the
274
+ PDF viewer uses to make text selectable and searchable. The Code Charts
275
+ authors generated it when subsetting the font; it tells you exactly which
276
+ glyph represents which codepoint. We are not guessing from glyph shape or
277
+ grid position — we are reading the same correlation table the PDF itself
278
+ uses.
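Step 2 of the walk (inverting the parsed ToUnicode map) can be sketched as below. The method name, the keyword argument, and the "first CID wins" tie-break are assumptions made for illustration; the README does not say how duplicate codepoints are resolved.

```ruby
# Hypothetical sketch of the codepoint -> GID step; names are illustrative.
def codepoint_to_gid(cid_to_codepoint, cid_to_gid_map: :identity)
  # Stream-form /CIDToGIDMap is not parsed, so such fonts yield no table.
  return nil unless cid_to_gid_map == :identity

  cid_to_codepoint.each_with_object({}) do |(cid, cp), inverted|
    # With an identity map, GID == CID. Assumption: if two CIDs name the
    # same codepoint, keep the first one seen.
    inverted[cp] ||= cid
  end.freeze
end

table = codepoint_to_gid({ 0x000A => 0x2010 })
table[0x2010] # => 10
```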
279
+
280
+ ### Pipeline components
281
+
282
+ ```
283
+ ┌──────────────────────────────────────┐
284
+ │ Source │
285
+ │ resolves CodeCharts.pdf + cache_dir │
286
+ └──────────────┬───────────────────────┘
287
+
288
+
289
+ ┌──────────────────────────────────────┐
290
+ │ Catalog │
291
+ │ walks PDF via mutool → │
292
+ │ builds { codepoint => FontEntry } │
293
+ └────────┬──────────────┬───────────────┘
294
+ │ │
295
+ ▼ ▼
296
+ ┌──────────────────┐ ┌──────────────────────┐
297
+ │ ToUnicode │ │ FontEntry │
298
+ │ parse CMap → │ │ lazy fontisan accessor│
299
+ │ { cid => cp } │ │ + codepoint_to_gid │
300
+ └──────────────────┘ └──────────┬───────────┘
301
+ │ on first lookup
302
+
303
+ ┌────────────────────────────────────┐
304
+ │ mutool show -o <tmp> -b │
305
+ │ extracts /FontFile2 or /FontFile3│
306
+ │ stream → cache_dir/<font>.ttf │
307
+ └────────────────┬───────────────────┘
308
+
309
+
310
+ ┌────────────────────────────────────┐
311
+ │ fontisan FontLoader │
312
+ │ parses glyf / CharStrings │
313
+ │ → GlyphAccessor │
314
+ │ → OutlineExtractor │
315
+ │ → GlyphOutline#to_commands │
316
+ └────────────────┬───────────────────┘
317
+
318
+
319
+ ┌────────────────────────────────────┐
320
+ │ Svg │
321
+ │ y-flip, viewBox + 8% padding, │
322
+ │ standalone XML │
323
+ └────────────────────────────────────┘
324
+ ```
325
+
326
+ **`Source`** — resolves the PDF path (`pdf:` arg →
327
+ `UCODE_CODE_CHARTS_PDF` env → `<gem_root>/CodeCharts.pdf`) and the cache
328
+ directory for extracted font programs (same pattern,
329
+ `UCODE_PDF_FONT_CACHE` env, default `<gem_root>/data/pdf-fonts/`). Raises
330
+ `EmbeddedFontsMissingError` when the resolved PDF doesn't exist.
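The resolution order described for `Source` might look roughly like this. It is a sketch under assumptions: `resolve_pdf`, its keyword arguments, and the locally defined error class are hypothetical and only mirror the names in the prose.

```ruby
# Hypothetical sketch of the resolution order; not the gem's Source class.
class EmbeddedFontsMissingError < StandardError; end

def resolve_pdf(pdf: nil, env: ENV, gem_root: Dir.pwd)
  # Explicit argument first, then the environment variable, then the default.
  path = pdf || env["UCODE_CODE_CHARTS_PDF"] || File.join(gem_root, "CodeCharts.pdf")
  unless File.file?(path)
    raise EmbeddedFontsMissingError, "Code Charts PDF not found: #{path}"
  end

  path
end
```

Passing `env:` as a plain Hash keeps the lookup testable without mutating the process environment.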
331
+
332
+ **`Catalog`** — walks the PDF once via `mutool` and builds the global
333
+ `{codepoint => FontEntry}` index. Discovery happens in five `mutool`
334
+ steps, of which only the last runs once per font:
335
+
336
+ - `mutool info CodeCharts.pdf` — lists every Type0 font and its object ID.
337
+ - `mutool show -g <pdf> <id1> <id2> ...` — batched fetch of Type0 dicts.
338
+ - Same for descendant CIDFont dicts.
339
+ - Same for FontDescriptors.
340
+ - Per-font `mutool show -o <tmp> -b <pdf> <tu_ref>` — fetches each
341
+ ToUnicode stream (cannot be batched because each is a separate binary
342
+ stream).
343
+
344
+ PDF dict parsing is **not** a full grammar walk — instead, `Catalog`
345
+ regex-extracts each field it needs (`/BaseFont`, `/DescendantFonts[<ref>]`,
346
+ `/ToUnicode <ref>`, `/FontDescriptor <ref>`, `/FontFile2/3 <ref>`,
347
+ `/CIDToGIDMap /Identity|<ref>`). The targeted approach is robust to the
348
+ `<<...>>`/`[...]` nesting that breaks naive whitespace-split parsers.
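A targeted field extraction of this kind could be sketched as follows. The module name and the regular expressions are assumptions for illustration, not the patterns `Catalog` actually uses, and the object numbers in the sample are made up.

```ruby
# Hypothetical field extractor; the patterns are illustrative assumptions.
module PdfDictSketch
  # Indirect reference such as "812 0 R"; captures the object number.
  REF = /(\d+)\s+\d+\s+R/

  def self.type0_fields(dict_text)
    {
      base_font:  dict_text[%r{/BaseFont\s*/([^\s/<>\[\]()]+)}, 1],
      descendant: dict_text[%r{/DescendantFonts\s*\[\s*#{REF}}, 1]&.to_i,
      to_unicode: dict_text[%r{/ToUnicode\s+#{REF}}, 1]&.to_i,
    }
  end
end

sample = <<~DICT
  <<
    /Type /Font
    /Subtype /Type0
    /BaseFont /CIAIIP+Uni2000Generalpunctuation
    /Encoding /Identity-H
    /DescendantFonts [ 812 0 R ]
    /ToUnicode 813 0 R
  >>
DICT

PdfDictSketch.type0_fields(sample)
# => {base_font: "CIAIIP+Uni2000Generalpunctuation", descendant: 812, to_unicode: 813}
```

Each field is matched independently, so a missing key simply yields `nil` instead of derailing the rest of the parse.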
349
+
350
+ **`ToUnicode`** — parses a CMap stream text into a frozen
351
+ `{cid => codepoint}` Hash. Supports:
352
+
353
+ - `beginbfchar` / `endbfchar` — one-to-one `<cid> <uni>` pairs.
354
+ - `beginbfrange` / `endbfrange` — two forms:
355
+ - `<lo> <hi> <start>` — cids `lo..hi` map to consecutive codepoints
356
+ starting at `start`.
357
+ - `<lo> <hi> [<u1> ... <un>]` — explicit per-cid codepoints within the
358
+ range.
359
+ - UTF-16 surrogate-pair decoding — 8 hex digits (e.g. `D83DDE00`) decode
360
+ to one astral codepoint (U+1F600).
361
+
362
+ `codespacerange` and `notdefrange` blocks are ignored; multi-codepoint
363
+ targets (ligatures) take only the first codepoint.
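The parsing rules above can be sketched in plain Ruby. This is a minimal illustration under stated assumptions, not the gem's `ToUnicode` class: the module name `ToUnicodeSketch` and its methods are hypothetical, and the parser ignores everything outside `bfchar`/`bfrange` blocks.

```ruby
# Hypothetical sketch; not the gem's actual ToUnicode implementation.
module ToUnicodeSketch
  # Decode a hex target such as "2010" or "D83DDE00" to one codepoint.
  # Only the first UTF-16 unit (or surrogate pair) is used, so ligature
  # targets collapse to their first codepoint.
  def self.decode(hex)
    hi, lo = hex.scan(/.{1,4}/).map { |unit| unit.to_i(16) }
    if (0xD800..0xDBFF).cover?(hi) && lo && (0xDC00..0xDFFF).cover?(lo)
      0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
    else
      hi
    end
  end

  def self.parse(text)
    map = {}
    # One-to-one <cid> <uni> pairs.
    text.scan(/beginbfchar(.*?)endbfchar/m).flatten.each do |body|
      body.scan(/<(\h+)>\s*<(\h+)>/) { |cid, uni| map[cid.to_i(16)] = decode(uni) }
    end
    # Ranges: either <lo> <hi> <start> or <lo> <hi> [<u1> ... <un>].
    text.scan(/beginbfrange(.*?)endbfrange/m).flatten.each do |body|
      body.scan(/<(\h+)>\s*<(\h+)>\s*(<\h+>|\[[^\]]*\])/) do |lo, hi, target|
        cids = (lo.to_i(16)..hi.to_i(16)).to_a
        if target.start_with?("[")
          targets = target.scan(/<(\h+)>/).flatten
          cids.zip(targets).each { |cid, uni| map[cid] = decode(uni) if uni }
        else
          start = decode(target[1..-2])
          cids.each_with_index { |cid, i| map[cid] = start + i }
        end
      end
    end
    map.freeze
  end
end

cmap = <<~CMAP
  1 beginbfchar
  <000A> <2010>
  endbfchar
CMAP

ToUnicodeSketch.parse(cmap) # => {10=>8208}, i.e. CID 0x000A maps to U+2010
```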
364
+
365
+ **`FontEntry`** — value object per Type0 font, holds the identity
366
+ (`base_font`, object IDs), the kind of font program (`:ttf` or `:cff`),
367
+ the resolved `cid_to_gid_map` (`:identity` or nil), and the frozen
368
+ `codepoint_to_gid` Hash. The fontisan accessor is built lazily on first
369
+ `#accessor` call: extracts the font stream via `mutool show -o <tmp> -b`
370
+ to a `Tempfile`, atomically moves it into the cache (`FileUtils.mv`), then
371
+ loads via `Fontisan::FontLoader`. Cache hits skip extraction entirely;
372
+ cache files are invalidated by comparing mtime against the source PDF.
373
+
374
+ **`Svg`** — converts a `GlyphOutline` into a standalone SVG document. Two
375
+ coordinate transforms happen at emit time: y-negation (font space y grows
376
+ up, SVG y grows down) and viewBox computation (bbox plus 8% padding on
377
+ each side, y-flipped). Walks `outline.to_commands` and emits
378
+ `M`/`L`/`Q`/`Z` directly — no intermediate path string is parsed back.
379
+ Emits a `<title>` of the form `U+XXXX (Code Charts: <base_font>)` for
380
+ debugging.
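The two coordinate transforms can be illustrated with a small sketch. The command tuples (`[:move_to, x, y]` and so on) and the bbox Hash are assumed shapes chosen for the example; they are not fontisan's documented return values. Padding is taken as 8% of the bbox width and height, which is one plausible reading of the description.

```ruby
# Hypothetical sketch: command tuples and the bbox Hash are assumed shapes.
module SvgSketch
  PADDING = 0.08
  LETTERS = { move_to: "M", line_to: "L", quad_to: "Q" }.freeze

  def self.path_data(commands)
    commands.map do |op, *coords|
      next "Z" if op == :close

      # Coordinates alternate x, y; negate each y to flip into SVG space.
      flipped = coords.each_with_index.map { |v, i| i.odd? ? -v : v }
      "#{LETTERS.fetch(op)}#{flipped.join(' ')}"
    end.join(" ")
  end

  def self.view_box(bbox)
    width  = bbox[:x_max] - bbox[:x_min]
    height = bbox[:y_max] - bbox[:y_min]
    pad_x = width * PADDING
    pad_y = height * PADDING
    # After the flip, the top edge of the glyph sits at -y_max.
    [bbox[:x_min] - pad_x, -bbox[:y_max] - pad_y,
     width + 2 * pad_x, height + 2 * pad_y]
  end

  def self.render(commands, bbox, title)
    vb = view_box(bbox).map { |v| v.round(2) }.join(" ")
    <<~SVG
      <?xml version="1.0" encoding="UTF-8"?>
      <svg xmlns="http://www.w3.org/2000/svg" viewBox="#{vb}">
        <title>#{title}</title>
        <path d="#{path_data(commands)}"/>
      </svg>
    SVG
  end
end

commands = [[:move_to, 0, 0], [:line_to, 100, 0], [:quad_to, 100, 50, 50, 50], [:close]]
SvgSketch.path_data(commands) # => "M0 0 L100 0 Q100 -50 50 -50 Z"
```

The sketch does not XML-escape the title, which is acceptable only while titles contain no markup characters.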
381
+
382
+ **`Renderer`** — thin orchestrator: `Catalog#lookup` →
383
+ `FontEntry#gid_for` → `FontEntry#accessor.outline_for_id` → `Svg#to_s`.
384
+ Returns a `Result` struct (`codepoint`, `base_font`, `gid`, `svg`) on
385
+ success or nil on any miss.
386
+
387
+ **`Writer`** — iterates codepoints (defaults to `Catalog#codepoints`),
388
+ calls `Renderer#render`, writes `glyph.svg` into the per-codepoint output
389
+ folder. Idempotent via `Repo::AtomicWrites` (content-hash compare;
390
+ existing identical files are left untouched). Returns a tally
391
+ `{written:, skipped:, missing:, total:}`. `block_lookup:` is a callable
392
+ that maps a codepoint to its original block name (verbatim from
393
+ `Blocks.txt`) — codepoints returning nil are skipped.
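The idempotent write described here (content-hash compare, then an atomic replace) can be approximated as below. `write_if_changed` is a hypothetical stand-in for `Repo::AtomicWrites`, whose real interface is not shown in this README; SHA-256 is an assumed choice of hash.

```ruby
require "digest"
require "fileutils"
require "tempfile"

# Hypothetical stand-in for Repo::AtomicWrites; the real API may differ.
def write_if_changed(path, content)
  # Identical content on disk: leave the file (and its mtime) untouched.
  if File.exist?(path) &&
     Digest::SHA256.file(path).hexdigest == Digest::SHA256.hexdigest(content)
    return :skipped
  end

  dir = File.dirname(path)
  FileUtils.mkdir_p(dir)
  # Write to a temp file in the same directory so the rename stays on one
  # filesystem and replaces the target in a single step.
  tmp = Tempfile.create(["glyph", ".tmp"], dir)
  tmp.binmode
  tmp.write(content)
  tmp.close
  File.rename(tmp.path, path)
  :written
end
```

Counting the returned symbols per codepoint is one way to build a tally like the `{written:, skipped:, ...}` Hash mentioned above.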
394
+
395
+ ### What pillar 1 does not cover
396
+
397
+ Pillar 1 handles only the fonts where correlation is unambiguous. It does not cover:
398
+
399
+ - **Label fonts** (`MyriadPro-Bold` and friends) — these draw row/column
400
+ header text, not character glyphs. They are not Type0 with a ToUnicode
401
+ CMap, so they are invisible to discovery.
402
+ - **Type0 fonts without `/ToUnicode`** — older subset practice. Without
403
+ the CMap we cannot attribute a glyph to a codepoint, so the font is
404
+ skipped. These codepoints fall through to **pillar 2** (content-stream
405
+ positional correlation), and from there to **pillar 3** (Last Resort)
406
+ if pillar 2 cannot resolve them either.
407
+ - **Stream-form `/CIDToGIDMap`** — a binary lookup table. Treated as
408
+ unsupported; the font is skipped.
409
+ - **Bare CFF streams fontisan does not yet recognize** — a separate
410
+ fontisan-side issue; flagged for investigation.
411
+
412
+ Code Charts cells not covered by pillar 1 are exactly the cells whose
413
+ character is not drawn from an embedded subsetted font with
414
+ `/ToUnicode` — either a label, a glyph in a font without `/ToUnicode`,
415
+ or a placeholder. **Pillar 2** (content-stream positional correlation)
416
+ handles the no-`/ToUnicode` case, and **pillar 3** (Last Resort UFO)
417
+ handles the placeholder case; the small remainder are correctly absent
418
+ from the dataset.
419
+
420
+ ## System dependencies
421
+
422
+ - Ruby ≥ 3.1
423
+ - `mupdf-tools` (provides the `mutool` binary) — required for **v0.2 pillar 1
424
+ glyph extraction** (the default pipeline). `mutool` enumerates the subsetted
425
+ fonts embedded in `CodeCharts.pdf` and extracts their font program streams
426
+ for outline parsing. Install via Homebrew with `brew install mupdf-tools`,
427
+ or via apt with `apt install mupdf-tools`.
428
+ - `fontisan` Ruby gem — pulled in automatically through the `Gemfile`; used
429
+ by pillar 1 to parse extracted TrueType (`.ttf`) and CFF/Type 1C font
430
+ programs and emit per-glyph outline data (contours, control points, bbox).
431
+ - `pdftocairo` (poppler) — only required for the experimental v0.1
432
+ `glyphs` cell-extractor path. Alternatives (`pdf2svg`, `dvisvgm`) are
433
+ auto-detected.
434
+ - `pdftk` — only required for the v0.1 `glyphs` command's monolith fallback
435
+ path.
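To check which of the external binaries listed above are present before running a glyph command, a PATH scan along these lines may help. It is a minimal sketch: `find_executable` and `missing_tools` are hypothetical helper names, not part of ucode's API, and only the tool names come from the list above.

```ruby
# Hypothetical helpers; not part of ucode's API.

# Returns the path of the first executable called `name` found on PATH,
# or nil when no directory on PATH contains one.
def find_executable(name, path_env = ENV.fetch("PATH", ""))
  path_env.split(File::PATH_SEPARATOR).each do |dir|
    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

# Returns the subset of `names` that could not be found on PATH.
def missing_tools(names, path_env = ENV.fetch("PATH", ""))
  names.reject { |name| find_executable(name, path_env) }
end
```

For example, `missing_tools(%w[mutool pdftocairo pdftk])` returns the names of any tools that still need installing.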
436
+
437
+ ## Architecture
438
+
439
+ Six concerns, each isolated:
440
+
441
+ 1. **`Ucode::Models`** — `lutaml-model` classes for every UCD aggregate.
442
+ 2. **`Ucode::Parsers`** — one streaming parser per UCD text file.
443
+ 3. **`Ucode::Coordinator`** — single-pass enrichment that merges indices
444
+ into each `CodePoint` as it streams.
445
+ 4. **`Ucode::Repo`** — atomic, idempotent writers for the output tree.
446
+ 5. **`Ucode::Glyphs`** — vector glyph extraction from Code Charts PDFs
447
+ (experimental in v0.1).
448
+ 6. **`Ucode::Site`** — Vitepress scaffold + config/page generator.
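One common way to get "atomic, idempotent" writes of the kind attributed to `Ucode::Repo` is to skip unchanged files and otherwise write to a temporary sibling file that is then renamed into place. The sketch below illustrates that general technique only; `write_atomically` is a hypothetical name, and nothing here is claimed to match ucode's actual implementation.

```ruby
require "fileutils"

# Hypothetical sketch; not ucode's actual `Ucode::Repo` code.
# Writes `content` to `path` and returns true if the file was (re)written,
# false if it already held exactly this content.
def write_atomically(path, content)
  # Idempotent: an identical file is left untouched.
  return false if File.exist?(path) && File.binread(path) == content.b

  FileUtils.mkdir_p(File.dirname(path))
  # The temp file sits next to the target so the rename stays on one
  # filesystem, where POSIX rename replaces the destination in one step.
  tmp = "#{path}.#{Process.pid}.tmp"
  begin
    File.binwrite(tmp, content)
    File.rename(tmp, path)
  ensure
    # No-op after a successful rename; removes the temp file on failure.
    FileUtils.rm_f(tmp)
  end
  true
end
```

Because readers only ever see the old file or the complete new one, an interrupted run can be repeated safely.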
449
+
450
+ CLI is thin Thor dispatch over `Ucode::Commands::*`. Each command class
451
+ is a pure, in-process testable unit.
452
+
453
+ See `CLAUDE.md` for the full architecture notes. See
454
+ `docs/FONTISAN_MIGRATION.md` for the fontisan integration plan.
455
+
456
+ ## Authoritative source
457
+
458
+ ucode parses the **UCD text files** (per UAX #44). The
459
+ `ucd.all.flat.xml` shipped with the repo is reference-only — it omits
460
+ the human-curated relationship data in `NamesList.txt` and has partial
461
+ Unihan coverage. We never parse it.
462
+
463
+ ## License
464
+
465
+ BSD-2-Clause. See `LICENSE.txt`.
466
+
467
+ ## Code of conduct
468
+
469
+ Contributors are expected to follow the standard fontist org CoC.
data/Rakefile ADDED
@@ -0,0 +1,18 @@
1
+ # frozen_string_literal: true
2
+
3
+ require "rubygems"
4
+ require "rake"
5
+ require "bundler/gem_tasks"
6
+
7
+ require "rspec/core/rake_task"
8
+ RSpec::Core::RakeTask.new(:spec)
9
+
10
+ require "rubocop/rake_task"
11
+ RuboCop::RakeTask.new
12
+
13
+ require "yard"
14
+ YARD::Rake::YardocTask.new do |t|
15
+ t.options = ["--output-dir", "docs/api"]
16
+ end
17
+
18
+ task default: %i[spec rubocop]
data/TODO.new/00-README.md ADDED
@@ -0,0 +1,66 @@
1
+ # TODO.new — audit migration + Mode 2 work
2
+
3
+ Work tracks for the fontisan audit → ucode audit migration, the
4
+ per-font-audit output spec, and the Mode 1 canonical-dataset alignment.
5
+ The full architecture reference is `docs/architecture.md` — read that
6
+ first; these TODOs reference sections of it.
7
+
8
+ ## Tracks
9
+
10
+ ### Alignment & contract (lock these before any code moves)
11
+
12
+ - [01 — Pillar terminology alignment](01-pillar-terminology-alignment.md)
13
+ - [02 — Audit schema design](02-audit-schema-design.md)
14
+ - [03 — Directory output spec](03-directory-output-spec.md)
15
+ - [04 — fontist.org contract](04-fontist-org-contract.md)
16
+
17
+ ### Baseline measurement (know where we are)
18
+
19
+ - [05 — Unicode 17 baseline coverage audit](05-baseline-unicode17-coverage-audit.md)
20
+
21
+ ### Audit migration (the big work)
22
+
23
+ - [06 — Audit namespace skeleton](06-audit-namespace-skeleton.md)
24
+ - [07 — Models::Audit port](07-audit-models-port.md)
25
+ - [08 — Cheap extractors port](08-extractors-cheap-port.md)
26
+ - [09 — Expensive extractors port](09-extractors-expensive-port.md)
27
+ - [10 — Aggregations rewrite on ucode UCD](10-aggregations-ucd-rewrite.md)
28
+ - [11 — Differ + library auditor port](11-differ-and-library-auditor-port.md)
29
+ - [12 — Formatters port](12-formatters-port.md)
30
+
31
+ ### Output + browser
32
+
33
+ - [13 — Directory emitter](13-directory-emitter.md)
34
+ - [14 — HTML face browser](14-html-face-browser.md)
35
+ - [15 — HTML library browser](15-html-library-browser.md)
36
+ - [16 — CLI audit subcommands](16-cli-audit-subcommands.md)
37
+
38
+ ### Fontisan cleanup (after ucode audit ships)
39
+
40
+ - [17 — Fontisan: delete audit subsystem](17-fontisan-cleanup-audit.md)
41
+ - [18 — Fontisan: delete UCD subsystem](18-fontisan-cleanup-ucd.md)
42
+ - [19 — Fontisan: docs and shim update](19-fontisan-docs-update.md)
43
+
44
+ ### Canonical Mode 1 alignment
45
+
46
+ - [20 — Canonical 4-tier resolver](20-canonical-resolver-4-tier.md)
47
+ - [21 — Canonical Unicode 17 dataset build](21-canonical-unicode17-build.md)
48
+
49
+ ### Sequencing
50
+
51
+ - [22 — Implementation order](22-implementation-order.md)
52
+
53
+ ## Conventions
54
+
55
+ - One concern per file. If a TODO grows past ~250 lines it should split.
56
+ - File numbering is stable; use the next free number for additions.
57
+ - Every TODO lists: Goal, Files, Scope, Acceptance, References.
58
+ - Specs use real model instances — never `double()` (global rule).
59
+ - All new lib files use Ruby `autoload` (declared in the immediate
60
+ parent namespace's file) for same-library code. No `require_relative`
61
+ and no `require "ucode/..."` inside the library.
62
+ - No AI attribution in any commit, doc, or comment.
63
+ - Branch naming: `audit/<track-slug>` (e.g. `audit/schema-design`).
64
+ One PR per track unless tracks are tightly coupled.
65
+ - Land PR #1 (`tier1-cmap-audit`) before starting any track in this dir.
66
+ The migration builds on top of the merged RealFonts subsystem.
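The `autoload` convention listed above defers loading a file until its constant is first referenced. The following self-contained demonstration shows that behaviour with Ruby's built-in `Module#autoload`. All names in it (`Demo`, `Demo::Widget`, `autoload_demo`) are hypothetical, and the temporary directory stands in for a real `lib/` tree; in a library the declaration would normally sit inside the parent namespace's file as `autoload :Widget, "demo/widget"`.

```ruby
require "tmpdir"

# Stands in for the parent namespace's file (hypothetical name).
module Demo; end

# Source of the child file that the autoload entry points at.
WIDGET_SOURCE = <<~RUBY
  module Demo
    class Widget
      def label
        "widget"
      end
    end
  end
RUBY

# Registers the autoload, then reports whether it was pending before the
# first reference, what the loaded class returns, and the autoload state
# afterwards (nil once the constant has been loaded).
def autoload_demo
  Dir.mktmpdir do |dir|
    file = File.join(dir, "widget.rb")
    File.write(file, WIDGET_SOURCE)

    Demo.autoload(:Widget, file)
    pending = !Demo.autoload?(:Widget).nil?
    label = Demo::Widget.new.label # first reference triggers the load
    [pending, label, Demo.autoload?(:Widget)]
  end
end
```

Calling `autoload_demo` once shows the entry pending before the first reference and cleared afterwards, which is why no explicit `require` is needed inside the library.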
data/TODO.new/01-pillar-terminology-alignment.md ADDED
@@ -0,0 +1,69 @@
1
+ # 01 — Pillar terminology alignment
2
+
3
+ ## Goal
4
+
5
+ Fix the inconsistency between the README's "two pillars" claim and the
6
+ actual 4-tier glyph sourcing strategy. The recent commit `24e6bfd`
7
+ ("Pillar-2 content-stream correlation fallback") was named correctly;
8
+ the "4-tier glyph sourcing strategy" section of `docs/architecture.md` is
9
+ authoritative.
10
+
11
+ ## Problem
12
+
13
+ The README currently says (line ~155):
14
+
15
+ > ### The v0.2 plan — two pillars
16
+ > 1. Real character glyphs — extract the subsetted fonts from the PDF.
17
+ > 2. Last Resort placeholders — render directly from the UFO source.
18
+
19
+ This collapses Tier 1 (real-font cmap) and the three PDF-side pillars
20
+ into "two pillars". The actual strategy (per project memory and the
21
+ in-tree code) is four-tier:
22
+
23
+ 1. **Tier 1** — real-font cmap (`Ucode::Glyphs::RealFonts`).
24
+ 2. **Pillar 1** — PDF-embedded font with `/ToUnicode` (`EmbeddedFonts::Catalog`).
25
+ 3. **Pillar 2** — PDF content-stream correlation (`ContentStreamCorrelator`).
26
+ 4. **Pillar 3** — Last Resort UFO (`Ucode::Glyphs::LastResort`).
27
+
28
+ The mismatch confuses anyone reading the code (where each tier is
29
+ distinct) vs the README (which merges three of them).
30
+
31
+ ## Files to change
32
+
33
+ - `README.md` — replace the "two pillars" section with the 4-tier table.
34
+ Cross-link to `docs/architecture.md` §"The 4-tier glyph sourcing
35
+ strategy" as the canonical reference.
36
+ - `docs/architecture.md` — already correct; no change here.
37
+ - `CLAUDE.md` — has a brief mention of glyph sourcing; align the
38
+ vocabulary with the 4-tier names.
39
+
40
+ ## Scope
41
+
42
+ In scope:
43
+ - README rewrite (one section, ~50 lines).
44
+ - CLAUDE.md vocabulary tweak (one paragraph).
45
+ - No code changes.
46
+
47
+ Out of scope:
48
+ - Renaming any code symbol. The current symbols (`RealFonts`,
49
+ `EmbeddedFonts::Catalog`, `ContentStreamCorrelator`, `LastResort`) are
50
+ fine; the names match their function. Only the prose label "tier" vs
51
+ "pillar" needs disambiguation.
52
+ - Updating the commit message of `24e6bfd`. The commit was correctly
53
+ named; do not rewrite history.
54
+
55
+ ## Acceptance
56
+
57
+ - `grep -ni "two pillars" README.md` returns no matches.
58
+ - `grep -ni "pillar" README.md` returns matches that fit the 4-tier
59
+ vocabulary (Tier 1 + Pillars 1-3).
60
+ - README's strategy section cross-links to `docs/architecture.md`.
61
+ - No code changes; no spec changes; no changelog entry needed beyond
62
+ the commit message.
63
+
64
+ ## References
65
+
66
+ - `docs/architecture.md` §"The 4-tier glyph sourcing strategy"
67
+ - Commit `24e6bfd` (correctly named)
68
+ - Commit `307fda3` (Tier-1 implementation)
69
+ - Memory: `ucode_glyph_extraction_cell_border_bug.md`