ucode 0.1.0 → 0.1.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (174)
  1. checksums.yaml +4 -4
  2. data/CHANGELOG.md +72 -0
  3. data/Gemfile.lock +2 -2
  4. data/TODO.full/00-README.md +116 -0
  5. data/TODO.full/01-panglyph-vision.md +112 -0
  6. data/TODO.full/02-panglyph-repo-bootstrap.md +184 -0
  7. data/TODO.full/03-panglyph-font-builder.md +201 -0
  8. data/TODO.full/04-panglyph-publish-pipeline.md +126 -0
  9. data/TODO.full/05-ucode-0-1-1-release.md +139 -0
  10. data/TODO.full/06-fontisan-remove-audit.md +142 -0
  11. data/TODO.full/07-fontisan-remove-ucd.md +125 -0
  12. data/TODO.full/08-archive-private-bin-build.md +143 -0
  13. data/TODO.full/09-archive-public-structure.md +164 -0
  14. data/TODO.full/10-fontist-org-woff-glyphs.md +131 -0
  15. data/TODO.full/11-fontist-org-audit-coverage.md +140 -0
  16. data/TODO.full/12-implementation-order.md +216 -0
  17. data/TODO.full/13-fontisan-font-writer-api.md +189 -0
  18. data/TODO.full/14-fontisan-table-writers.md +66 -0
  19. data/TODO.full/15-panglyph-builder-real.md +82 -0
  20. data/TODO.full/16-archive-public-sync-workflows.md +167 -0
  21. data/TODO.full/17-fontist-org-font-picker.md +73 -0
  22. data/TODO.full/18-comprehensive-spec-coverage.md +64 -0
  23. data/TODO.full/19-ucode-0-1-2-patch.md +32 -0
  24. data/TODO.full/20-fontisan-0-2-23-release.md +52 -0
  25. data/TODO.new/00-README.md +30 -0
  26. data/TODO.new/23-universal-glyph-set-source-map.md +312 -0
  27. data/TODO.new/24-universal-glyph-set-build.md +189 -0
  28. data/TODO.new/25-font-audit-against-universal-set.md +195 -0
  29. data/TODO.new/26-missing-glyph-reporter.md +189 -0
  30. data/TODO.new/27-fontist-org-consumer-integration.md +200 -0
  31. data/TODO.new/28-implementation-order-update.md +187 -0
  32. data/TODO.new/29-universal-set-curation-uc17.md +312 -0
  33. data/TODO.new/30-tier1-font-acquisition.md +241 -0
  34. data/TODO.new/31-universal-set-production-build.md +205 -0
  35. data/TODO.new/32-uc17-coverage-matrix.md +165 -0
  36. data/TODO.new/33-specialist-font-acquisition-refresh.md +138 -0
  37. data/TODO.new/34-pillar2-content-stream-correlator.md +147 -0
  38. data/TODO.new/35-universal-set-production-run.md +160 -0
  39. data/TODO.new/36-per-font-coverage-audit.md +145 -0
  40. data/TODO.new/37-coverage-highlight-reporter.md +125 -0
  41. data/TODO.new/38-fontist-org-glyph-consumer.md +141 -0
  42. data/TODO.new/39-implementation-order-update-32-38.md +258 -0
  43. data/TODO.new/40-archive-private-uses-ucode-audit.md +124 -0
  44. data/TODO.new/41-ucode-unicode-archive-bridge.md +160 -0
  45. data/config/specialist_fonts.yml +102 -0
  46. data/config/unicode17_tier1_fonts.yml +42 -0
  47. data/config/unicode17_universal_glyph_set.yml +293 -0
  48. data/lib/ucode/audit/block_aggregator.rb +57 -29
  49. data/lib/ucode/audit/browser/face_page.rb +128 -0
  50. data/lib/ucode/audit/browser/glyph_panel.rb +124 -0
  51. data/lib/ucode/audit/browser/library_page.rb +74 -0
  52. data/lib/ucode/audit/browser/missing_glyph_page.rb +87 -0
  53. data/lib/ucode/audit/browser/template.rb +47 -0
  54. data/lib/ucode/audit/browser/templates/face.css +200 -0
  55. data/lib/ucode/audit/browser/templates/face.html.erb +41 -0
  56. data/lib/ucode/audit/browser/templates/face.js +298 -0
  57. data/lib/ucode/audit/browser/templates/library.css +119 -0
  58. data/lib/ucode/audit/browser/templates/library.html.erb +42 -0
  59. data/lib/ucode/audit/browser/templates/library.js +99 -0
  60. data/lib/ucode/audit/browser/templates/missing_glyph_page.css +119 -0
  61. data/lib/ucode/audit/browser/templates/missing_glyph_page.html.erb +58 -0
  62. data/lib/ucode/audit/browser/templates/missing_glyph_page.js +2 -0
  63. data/lib/ucode/audit/browser.rb +32 -0
  64. data/lib/ucode/audit/context.rb +27 -1
  65. data/lib/ucode/audit/coverage_reference.rb +103 -0
  66. data/lib/ucode/audit/differ.rb +121 -0
  67. data/lib/ucode/audit/emitter/block_emitter.rb +52 -0
  68. data/lib/ucode/audit/emitter/codepoint_emitter.rb +87 -0
  69. data/lib/ucode/audit/emitter/collection_emitter.rb +80 -0
  70. data/lib/ucode/audit/emitter/face_directory.rb +212 -0
  71. data/lib/ucode/audit/emitter/glyph_emitter.rb +48 -0
  72. data/lib/ucode/audit/emitter/index_emitter.rb +149 -0
  73. data/lib/ucode/audit/emitter/library_emitter.rb +96 -0
  74. data/lib/ucode/audit/emitter/paths.rb +312 -0
  75. data/lib/ucode/audit/emitter/plane_emitter.rb +29 -0
  76. data/lib/ucode/audit/emitter/script_emitter.rb +29 -0
  77. data/lib/ucode/audit/emitter.rb +29 -0
  78. data/lib/ucode/audit/extractors/aggregations.rb +31 -2
  79. data/lib/ucode/audit/face_auditor.rb +86 -0
  80. data/lib/ucode/audit/formatters/audit_diff_text.rb +112 -0
  81. data/lib/ucode/audit/formatters/audit_text.rb +411 -0
  82. data/lib/ucode/audit/formatters/color.rb +48 -0
  83. data/lib/ucode/audit/formatters/library_summary_text.rb +98 -0
  84. data/lib/ucode/audit/formatters/text_formatter.rb +83 -0
  85. data/lib/ucode/audit/formatters.rb +23 -0
  86. data/lib/ucode/audit/library_aggregator.rb +86 -0
  87. data/lib/ucode/audit/library_auditor.rb +105 -0
  88. data/lib/ucode/audit/release/emitter.rb +152 -0
  89. data/lib/ucode/audit/release/face_card.rb +93 -0
  90. data/lib/ucode/audit/release/formula_audits.rb +50 -0
  91. data/lib/ucode/audit/release/library_index_builder.rb +78 -0
  92. data/lib/ucode/audit/release/manifest_builder.rb +127 -0
  93. data/lib/ucode/audit/release.rb +42 -0
  94. data/lib/ucode/audit/ucd_only_reference.rb +81 -0
  95. data/lib/ucode/audit/universal_set_reference.rb +136 -0
  96. data/lib/ucode/audit.rb +31 -0
  97. data/lib/ucode/cli.rb +339 -33
  98. data/lib/ucode/commands/audit/browser_command.rb +82 -0
  99. data/lib/ucode/commands/audit/collection_command.rb +103 -0
  100. data/lib/ucode/commands/audit/compare_command.rb +188 -0
  101. data/lib/ucode/commands/audit/font_command.rb +140 -0
  102. data/lib/ucode/commands/audit/library_command.rb +87 -0
  103. data/lib/ucode/commands/audit/reference_builder.rb +64 -0
  104. data/lib/ucode/commands/audit.rb +20 -0
  105. data/lib/ucode/commands/block_feed.rb +73 -0
  106. data/lib/ucode/commands/canonical_build.rb +138 -0
  107. data/lib/ucode/commands/fetch.rb +37 -1
  108. data/lib/ucode/commands/release.rb +115 -0
  109. data/lib/ucode/commands/universal_set.rb +211 -0
  110. data/lib/ucode/commands.rb +5 -0
  111. data/lib/ucode/coordinator/indices.rb +11 -0
  112. data/lib/ucode/coordinator.rb +138 -5
  113. data/lib/ucode/error.rb +30 -2
  114. data/lib/ucode/fetch/font_fetcher/result.rb +39 -0
  115. data/lib/ucode/fetch/font_fetcher.rb +16 -0
  116. data/lib/ucode/fetch/specialist_font_fetcher.rb +280 -0
  117. data/lib/ucode/fetch.rb +7 -3
  118. data/lib/ucode/glyphs/real_fonts/cmap_cache.rb +74 -0
  119. data/lib/ucode/glyphs/real_fonts.rb +1 -0
  120. data/lib/ucode/glyphs/resolver.rb +62 -0
  121. data/lib/ucode/glyphs/source.rb +48 -0
  122. data/lib/ucode/glyphs/source_builder.rb +61 -0
  123. data/lib/ucode/glyphs/source_config/coverage_assertion.rb +79 -0
  124. data/lib/ucode/glyphs/source_config/gap_report.rb +54 -0
  125. data/lib/ucode/glyphs/source_config.rb +104 -0
  126. data/lib/ucode/glyphs/sources/pillar1_embedded_tounicode.rb +63 -0
  127. data/lib/ucode/glyphs/sources/pillar3_last_resort.rb +51 -0
  128. data/lib/ucode/glyphs/sources/tier1_real_font.rb +104 -0
  129. data/lib/ucode/glyphs/sources.rb +20 -0
  130. data/lib/ucode/glyphs/universal_set/builder.rb +161 -0
  131. data/lib/ucode/glyphs/universal_set/coverage_report.rb +139 -0
  132. data/lib/ucode/glyphs/universal_set/idempotency.rb +86 -0
  133. data/lib/ucode/glyphs/universal_set/manifest_accumulator.rb +195 -0
  134. data/lib/ucode/glyphs/universal_set/manifest_writer.rb +61 -0
  135. data/lib/ucode/glyphs/universal_set/pre_build_check.rb +197 -0
  136. data/lib/ucode/glyphs/universal_set/validator.rb +204 -0
  137. data/lib/ucode/glyphs/universal_set.rb +45 -0
  138. data/lib/ucode/glyphs.rb +6 -0
  139. data/lib/ucode/models/audit/baseline.rb +6 -0
  140. data/lib/ucode/models/audit/block_summary.rb +7 -0
  141. data/lib/ucode/models/audit/codepoint_provenance.rb +39 -0
  142. data/lib/ucode/models/audit/release_face.rb +42 -0
  143. data/lib/ucode/models/audit/release_formula.rb +33 -0
  144. data/lib/ucode/models/audit/release_manifest.rb +43 -0
  145. data/lib/ucode/models/audit/release_universal_set.rb +37 -0
  146. data/lib/ucode/models/audit.rb +9 -0
  147. data/lib/ucode/models/block.rb +2 -0
  148. data/lib/ucode/models/build_report.rb +109 -0
  149. data/lib/ucode/models/codepoint/glyph.rb +42 -0
  150. data/lib/ucode/models/codepoint.rb +3 -0
  151. data/lib/ucode/models/glyph_source.rb +86 -0
  152. data/lib/ucode/models/glyph_source_map.rb +138 -0
  153. data/lib/ucode/models/specialist_font.rb +70 -0
  154. data/lib/ucode/models/specialist_font_manifest.rb +48 -0
  155. data/lib/ucode/models/unihan_entry.rb +81 -9
  156. data/lib/ucode/models/unihan_field.rb +21 -0
  157. data/lib/ucode/models/universal_set_entry.rb +47 -0
  158. data/lib/ucode/models/universal_set_manifest.rb +78 -0
  159. data/lib/ucode/models/validation_report.rb +99 -0
  160. data/lib/ucode/models.rb +9 -0
  161. data/lib/ucode/parsers/named_sequences.rb +5 -5
  162. data/lib/ucode/parsers/unihan.rb +50 -19
  163. data/lib/ucode/repo/aggregate_writer.rb +34 -2
  164. data/lib/ucode/repo/block_feed_emitter.rb +153 -0
  165. data/lib/ucode/repo/build_report_accumulator.rb +138 -0
  166. data/lib/ucode/repo/build_report_writer.rb +46 -0
  167. data/lib/ucode/repo/build_validator.rb +229 -0
  168. data/lib/ucode/repo/codepoint_writer.rb +50 -1
  169. data/lib/ucode/repo/paths.rb +8 -0
  170. data/lib/ucode/repo.rb +4 -0
  171. data/lib/ucode/version.rb +1 -1
  172. data/schema/block-feed.output.schema.yml +134 -0
  173. metadata +143 -2
  174. data/ucode.gemspec +0 -56
@@ -0,0 +1,312 @@
1
+ # 23 — Universal glyph set: Tier 1 source map
2
+
3
+ ## Goal
4
+
5
+ Pin the canonical "one best font per Unicode 17 block" map as a
6
+ first-class, versioned artifact. This is the single source of truth
7
+ that drives the universal glyph set build (TODO 24) and the
8
+ audit-against-universal-set pipeline (TODO 25).
9
+
10
+ The resolver in TODO 20 reads this config; the build in TODO 21
11
+ materializes it; audits in TODO 25 reference it. Without this file we
12
+ have resolver mechanics but no opinionated, full-coverage font choice.
13
+
14
+ ## Why a separate file
15
+
16
+ Embedding the block→font table inside the resolver (as TODO 20's
17
+ example shows) blurs two concerns:
18
+
19
+ 1. **Mechanism** (the priority-ordered dispatch loop) — belongs in
20
+ `Resolver`. Stable across Unicode versions.
21
+ 2. **Policy** (which font wins for which block this Unicode version) —
22
+ belongs in a versioned data file. Changes every Unicode release.
23
+
24
+ Lifting policy out into `config/unicode17_universal_glyph_set.yml`
25
+ makes it reviewable on its own, diffable across versions, and editable
26
+ without touching Ruby.
27
+
28
+ ## Files to create
29
+
30
+ - `config/unicode17_universal_glyph_set.yml` — the curated map.
31
+ - `lib/ucode/glyphs/source_config.rb` — loader/validator (returns a
32
+ frozen `SourceConfig` instance with `#fonts_for(block_id)`).
33
+ - `lib/ucode/models/glyph_source.rb` — typed model for one entry in
34
+ the yaml (label, kind, path_or_fontist_name, priority, license).
35
+ - `lib/ucode/models/glyph_source_map.rb` — typed model for the whole
36
+ yaml (top-level `unicode_version`, `map` keyed by block_id).
37
+ - `spec/ucode/glyphs/source_config_spec.rb` — loader specs (real
38
+ fixtures, no doubles).
39
+ - `spec/fixtures/glyph_source_map/minimal.yml` — small fixture.
40
+ - `spec/fixtures/glyph_source_map/full.yml` — symlink or copy of the
41
+ production config (exercised by one smoke spec).
42
+
43
+ ## YAML shape
44
+
45
+ ```yaml
46
+ # config/unicode17_universal_glyph_set.yml
47
+ unicode_version: "17.0.0"
48
+ ucode_version: "0.2.0"
49
+ generated_at: "2026-06-27T12:00:00Z"
50
+
51
+ # Block IDs use the verbatim Unicode original name with underscores
52
+ # (same convention as Blocks.txt folder names). One entry per block;
53
+ # the resolver tries fonts in listed order.
54
+ map:
55
+ Basic_Latin:
56
+ sources:
57
+ - kind: fontist
58
+ label: noto-sans
59
+ priority: 1
60
+ license: OFL
61
+ provenance: "Google Noto Sans, system fallback for Latin"
62
+ - kind: path
63
+ label: system-ui
64
+ path: "/System/Library/Fonts/Helvetica.ttc"
65
+ priority: 2
66
+ license: PROPRIETARY
67
+ provenance: "macOS system font, fallback only"
68
+
69
+ Greek_And_Coptic:
70
+ sources:
71
+ - kind: fontist
72
+ label: noto-sans
73
+ priority: 1
74
+
75
+ CJK_Unified_Ideographs:
76
+ sources:
77
+ - kind: path
78
+ label: FSung-1
79
+ path: "~/Downloads/全宋體/FSung-1.ttf"
80
+ priority: 1
81
+ license: OFL
82
+ provenance: "Taiwan MOE 全宋體, covers U+4E00..U+9FFF core"
83
+ - kind: path
84
+ label: FSung-2
85
+ path: "~/Downloads/全宋體/FSung-2.ttf"
86
+ priority: 2
87
+ # ... FSung-3 .. FSung-X cover the rest of CJK + extensions
88
+ - kind: fontist
89
+ label: noto-sans-cjk-jp
90
+ priority: 99
91
+ provenance: "Catch-all fallback for any CJK codepoint FSung misses"
92
+
93
+ CJK_Unified_Ideographs_Extension_J:
94
+ sources:
95
+ - kind: path
96
+ label: FSung-J
97
+ path: "~/Downloads/全宋體/FSung-J.ttf"
98
+ priority: 1
99
+ - kind: fontist
100
+ label: noto-sans-cjk-jp
101
+ priority: 2
102
+
103
+ Sidetic:
104
+ sources:
105
+ - kind: fontist
106
+ label: lentariso
107
+ priority: 1
108
+ license: OFL
109
+ provenance: "Lentariso ≥1.029 (github.com/Bry10022/Lentariso)"
110
+ - kind: fontist
111
+ label: noto-sans-sidetic
112
+ priority: 2
113
+
114
+ Beria_Erfe:
115
+ sources:
116
+ - kind: fontist
117
+ label: kedebideri
118
+ priority: 1
119
+ license: OFL
120
+ provenance: "Kedebideri 3.001 (software.sil.org/kedebideri)"
121
+
122
+ Tai_Yo:
123
+ sources:
124
+ - kind: path
125
+ label: NotoSerifTaiYo
126
+ path: "data/fonts/NotoSerifTaiYo.ttf"
127
+ priority: 1
128
+ license: OFL
129
+ provenance: "translationcommons.org, proven via correlate-v4"
130
+
131
+ Tolong_Siki:
132
+ sources:
133
+ - kind: fontist
134
+ label: noto-sans-tolong-siki
135
+ priority: 1
136
+
137
+ Sharada_Supplement:
138
+ sources:
139
+ - kind: fontist
140
+ label: noto-sans-sharada
141
+ priority: 1
142
+
143
+ Egyptian_Hieroglyphs:
144
+ sources:
145
+ - kind: path
146
+ label: UniHieroglyphica
147
+ path: "data/fonts/UniHieroglyphica.ttf"
148
+ priority: 1
149
+ license: OFL
150
+ provenance: "suignard.com, authoritative for Egyptian Hieroglyphs"
151
+
152
+ Egyptian_Hieroglyphs_Format_Controls:
153
+ sources:
154
+ - kind: path
155
+ label: Egyptian-Text
156
+ path: "data/fonts/EgyptianText-Regular.ttf"
157
+ priority: 1
158
+ license: OFL
159
+ provenance: "microsoft/font-tools, OFL"
160
+
161
+ Egyptian_Hieroglyphs_Extended_A:
162
+ sources:
163
+ - kind: path
164
+ label: UniHieroglyphica
165
+ path: "data/fonts/UniHieroglyphica.ttf"
166
+ priority: 1
167
+
168
+ Egyptian_Hieroglyphs_Extended_B:
169
+ sources:
170
+ - kind: path
171
+ label: UniHieroglyphica
172
+ path: "data/fonts/UniHieroglyphica.ttf"
173
+ priority: 1
174
+
175
+ Symbols_for_Legacy_Computing_Supplement:
176
+ sources:
177
+ - kind: fontist
178
+ label: babelstone-pseudographica
179
+ priority: 1
180
+ provenance: "BabelStone, partial Unicode 17 coverage"
181
+
182
+ Supplemental_Arrows_C:
183
+ sources:
184
+ - kind: fontist
185
+ label: symbola
186
+ priority: 1
187
+
188
+ Alchemical_Symbols:
189
+ sources:
190
+ - kind: fontist
191
+ label: noto-sans-symbols
192
+ priority: 1
193
+ - kind: fontist
194
+ label: symbola
195
+ priority: 2
196
+
197
+ Miscellaneous_Symbols_Supplement:
198
+ sources:
199
+ - kind: fontist
200
+ label: noto-sans-symbols-2
201
+ priority: 1
202
+
203
+ Musical_Symbols:
204
+ sources:
205
+ - kind: fontist
206
+ label: noto-music
207
+ priority: 1
208
+
209
+ Tangut:
210
+ Tangut_Supplement:
211
+ Tangut_Components:
212
+ sources:
213
+ - kind: fontist
214
+ label: noto-sans-tangut
215
+ priority: 1
216
+
217
+ Adlam:
218
+ sources:
219
+ - kind: fontist
220
+ label: noto-sans-adlam
221
+ priority: 1
222
+
223
+ # ... one entry per Unicode 17 block (~340 total) ...
224
+
225
+ # Blocks with no known Tier 1 font. The resolver falls through to
226
+ # Pillar 1 → Pillar 2 → Pillar 3 for these. Listed here for explicit
227
+ # documentation; resolver treats absent block_id same as empty sources.
228
+ no_tier1_font:
229
+ - Combining_Diacritical_Marks_Extended # additions: font support spotty
230
+ ```
231
+
232
+ ## Source kinds
233
+
234
+ - `fontist` — fontist-resolvable name. `FontLocator` finds/installs.
235
+ - `path` — explicit filesystem path. Used for local-only fonts
236
+ (FSung, NotoSerifTaiYo before upstreaming).
237
+ - `system` — system font via fontist's system index (macOS `/System`,
238
+ Linux `/usr/share/fonts`). Reserve for fallbacks.
239
+
240
+ `priority` is a per-block integer; lower wins. The resolver iterates
241
+ the block's `sources` in priority order; first hit wins.
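The "lower priority wins, first hit wins" rule can be sketched as follows. This is a minimal illustration, not the gem's `Resolver`: the `Source` struct, its `covered` list, and the `first_hit` helper are hypothetical stand-ins for a real cmap lookup.

```ruby
# Hypothetical stand-in for one `sources:` entry (not the gem's GlyphSource model).
Source = Struct.new(:label, :priority, :covered, keyword_init: true) do
  # Pretend cmap lookup: true when this source has a glyph for the codepoint.
  def covers?(codepoint)
    covered.include?(codepoint)
  end
end

# Returns the first source, in ascending priority order, that covers the
# codepoint, or nil when no Tier 1 source hits.
def first_hit(sources, codepoint)
  sources.sort_by(&:priority).find { |source| source.covers?(codepoint) }
end

sources = [
  Source.new(label: "noto-sans-cjk-jp", priority: 99, covered: [0x4E00, 0x9FFF]),
  Source.new(label: "FSung-1", priority: 1, covered: [0x4E00])
]

puts first_hit(sources, 0x4E00).label # both cover it; the lower priority number wins
puts first_hit(sources, 0x9FFF).label # only the catch-all covers it
p first_hit(sources, 0x3400)          # no Tier 1 hit; pillars would take over
```

A `nil` result is where the implicit Pillar 1 → 2 → 3 fallbacks would apply.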
242
+
243
+ ## Curation rules
244
+
245
+ 1. **One font per script family where possible.** Don't list three
246
+ Latin fonts; pick one (Noto Sans) and let pillar 1-3 catch misses.
247
+ 2. **CJK is the exception** — FSung is split across many files; one
248
+ entry per file with monotonic priority. The resolver loads all
249
+ of them; `fontist` fallback ensures the long tail still hits.
250
+ 3. **Proprietary fonts never ship.** Sources with `license:
251
+ PROPRIETARY` are loaded for glyph extraction only; the extracted
252
+ SVG (open data) ships, the font file does not.
253
+ 4. **Provenance is mandatory.** Every entry cites where the font comes
254
+ from and why it's the chosen source. Without provenance, the entry
255
+ is unreviewable.
256
+ 5. **Versioned.** Bump `ucode_version` field on every config edit.
257
+ Consumers can detect config drift vs the dataset.
258
+
259
+ ## Source config loader
260
+
261
+ ```ruby
262
+ class Ucode::Glyphs::SourceConfig
263
+   # @param yaml_path [Pathname]
264
+   # @return [Ucode::Models::GlyphSourceMap]
265
+   def self.load(yaml_path = DEFAULT_PATH)
266
+     parsed = YAML.safe_load(yaml_path.read)
267
+     Ucode::Models::GlyphSourceMap.from_hash(parsed)
268
+   end
269
+
270
+   DEFAULT_PATH = Pathname.new("config/unicode17_universal_glyph_set.yml")
271
+ end
272
+ ```
273
+
274
+ The loader validates:
275
+ - `unicode_version` matches the active UCD baseline (`Ucode.configuration.unicode_version`).
276
+ - Every block_id in `Blocks.txt` has an entry (empty `sources:` allowed).
277
+ - Every `path:` resolves to an existing file (warning, not error, for
278
+ paths under `~/Downloads` since those are user-local).
279
+ - Every `kind: fontist` label is known to fontist's index (warning if not).
280
+
281
+ ## Acceptance
282
+
283
+ - `config/unicode17_universal_glyph_set.yml` exists with one entry per
284
+ Unicode 17 block (~340 entries).
285
+ - Every Unicode 17 new block (Sidetic, Beria Erfe, Tai Yo, Tolong
286
+ Siki, Sharada Supplement, CJK Ext J, Symbols Legacy Supp, Supp
287
+ Arrows-C, Alchemical Symbols ext, Misc Symbols Supp, Musical Symbols
288
+ Supp) has at least one Tier 1 source.
289
+ - Every Egyptian Hieroglyphs block has UniHieroglyphica + Egyptian
290
+ Text entries.
291
+ - Loader specs cover: happy path, missing block (warn), invalid yaml
292
+ (raise), missing font file (warn).
293
+ - Smoke spec against `full.yml` confirms the file parses and every
294
+ block_id resolves to a `GlyphSource` array.
295
+ - Rubocop clean.
296
+
297
+ ## Out of scope
298
+
299
+ - The resolver mechanics — TODO 20.
300
+ - The build that materializes glyphs from this config — TODO 24.
301
+ - The audit pipeline that uses the universal set as reference — TODO 25.
302
+ - Pillar 1/2/3 sources — these are not in the yaml; the resolver
303
+ appends them implicitly as fallbacks after Tier 1 sources.
304
+
305
+ ## References
306
+
307
+ - Resolver mechanics: `TODO.new/20-canonical-resolver-4-tier.md`
308
+ - Universal build: `TODO.new/24-universal-glyph-set-build.md`
309
+ - Baseline data: `TODO.new/05-baseline-unicode17-coverage-audit.md`
310
+ - Architecture: `docs/architecture.md` §"The 4-tier glyph sourcing
311
+ strategy"
312
+ - FontLocator: `lib/ucode/glyphs/real_fonts/font_locator.rb`
@@ -0,0 +1,189 @@
1
+ # 24 — Universal glyph set build
2
+
3
+ ## Goal
4
+
5
+ Materialize the universal glyph set: one SVG file per assigned Unicode
6
+ 17 codepoint, sourced via the 4-tier resolver using the curated Tier 1
7
+ config from TODO 23. The set is the canonical reference for "what
8
+ Unicode 17 looks like" — every codepoint has exactly one glyph, with
9
+ documented provenance.
10
+
11
+ This is Part 1 of the user's three-part directive: build the FULL base
12
+ with full coverage so it can serve as the reference for font audits.
13
+
14
+ ## What "universal" means
15
+
16
+ The universal set is:
17
+
18
+ - **Total**: every assigned codepoint has a glyph.
19
+ - **Single-sourced**: exactly one glyph per codepoint (no alternatives).
20
+ - **Provenance-tagged**: each glyph records its tier + source font.
21
+ - **Stable**: re-running with the same config + Unicode version
22
+ produces byte-identical SVGs.
23
+ - **Public**: derived SVGs are open data even when the source font is
24
+ proprietary.
25
+
26
+ The set is distinct from the per-codepoint Mode 1 dataset (TODO 21).
27
+ Mode 1 puts glyph.svg inside each codepoint's directory along with
28
+ full UCD properties. The universal set is glyph-only, in a flat
29
+ layout, designed for fast lookup by audits.
30
+
31
+ ## Files to create
32
+
33
+ ```
34
+ lib/ucode/glyphs/universal_set.rb # namespace hub
35
+ lib/ucode/glyphs/universal_set/builder.rb # iterates codepoints, calls resolver, writes
36
+ lib/ucode/glyphs/universal_set/manifest.rb # builds manifest.json with provenance rollup
37
+ lib/ucode/glyphs/universal_set/idempotency.rb # mtime + content-hash check
38
+ lib/ucode/models/universal_set_entry.rb # one manifest entry
39
+ lib/ucode/models/universal_set_manifest.rb # full manifest model
40
+ lib/ucode/commands/universal_set.rb # CLI: bin/ucode universal-set build
41
+ spec/ucode/glyphs/universal_set/builder_spec.rb
42
+ spec/ucode/glyphs/universal_set/manifest_spec.rb
43
+ spec/ucode/commands/universal_set_spec.rb
44
+ spec/fixtures/universal_set/minimal/ # small slice for fixture-driven specs
45
+ ```
46
+
47
+ ## Output layout
48
+
49
+ ```
50
+ output/universal_glyph_set/
51
+ ├── manifest.json # one entry per codepoint with provenance
52
+ ├── glyphs/
53
+ │ ├── U+0000.svg
54
+ │ ├── U+0001.svg
55
+ │ ├── ...
56
+ │ ├── U+1F6A0.svg
57
+ │ └── ...
58
+ └── reports/
59
+ ├── by_tier.json # tier-1: N1, pillar-1: N2, ...
60
+ ├── by_block.json # per-block tier breakdown
61
+ └── gaps.json # assigned codepoints with no glyph (should be empty)
62
+ ```
63
+
64
+ Filename pattern: `<U+XXXX>.svg` with uppercase hex, zero-padded to at
65
+ least 4 digits (codepoints above U+FFFF use 5 or 6 digits as needed, e.g. `U+1F6A0.svg`). Same convention as Mode 1.
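As a sketch of the filename rule, the helper below assumes the minimum-four-digit convention shown in the layout listing (`U+0000.svg`, `U+1F6A0.svg`). The name `glyph_filename` is hypothetical and does not come from the gem.

```ruby
# Hypothetical helper: maps an integer codepoint to its SVG filename.
# %04X pads to at least four uppercase hex digits and never truncates,
# so supplementary-plane codepoints keep all of their digits.
def glyph_filename(codepoint)
  format("U+%04X.svg", codepoint)
end

puts glyph_filename(0x41)    # U+0041.svg
puts glyph_filename(0x1F6A0) # U+1F6A0.svg
```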
66
+
67
+ ## Manifest shape
68
+
69
+ ```json
70
+ {
71
+ "unicode_version": "17.0.0",
72
+ "ucode_version": "0.2.0",
73
+ "generated_at": "2026-06-27T12:00:00Z",
74
+ "source_config_sha256": "abc...",
75
+ "totals": {
76
+ "codepoints_assigned": 150012,
77
+ "codepoints_built": 150012,
78
+ "codepoints_skipped": 0,
79
+ "codepoints_failed": 0
80
+ },
81
+ "by_tier": {
82
+ "tier-1": 147512,
83
+ "pillar-1": 800,
84
+ "pillar-2": 200,
85
+ "pillar-3": 1500
86
+ },
87
+ "entries": [
88
+ { "codepoint": 65, "id": "U+0041", "tier": "tier-1",
89
+ "source": "noto-sans", "svg_sha256": "def...",
90
+ "svg_size_bytes": 412 },
91
+ { "codepoint": 67904, "id": "U+10940", "tier": "tier-1",
92
+ "source": "lentariso", "svg_sha256": "...",
93
+ "svg_size_bytes": 1820 },
94
+ ...
95
+ ]
96
+ }
97
+ ```
98
+
99
+ The manifest is the single index into the set. Audits (TODO 25) read
100
+ the manifest, not the SVGs, for the "is this codepoint in the
101
+ universal set?" check.
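A minimal sketch of the manifest-only membership check, assuming the manifest shape shown above. The inline JSON is a two-entry stand-in, and the variable names are illustrative rather than taken from the gem.

```ruby
require "json"
require "set"

# Two-entry stand-in for output/universal_glyph_set/manifest.json.
manifest_json = <<~JSON
  {
    "unicode_version": "17.0.0",
    "entries": [
      { "codepoint": 65, "id": "U+0041", "tier": "tier-1", "source": "noto-sans" },
      { "codepoint": 66, "id": "U+0042", "tier": "pillar-3", "source": "last-resort" }
    ]
  }
JSON

manifest = JSON.parse(manifest_json)

# Build the lookup once; membership checks never touch the SVG files.
in_universal_set = manifest.fetch("entries").map { |entry| entry.fetch("codepoint") }.to_set

puts in_universal_set.include?(65)
puts in_universal_set.include?(0x10FFFF)
```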
102
+
103
+ ## Build flow
104
+
105
+ ```ruby
106
+ builder = Ucode::Glyphs::UniversalSet::Builder.new(
107
+   output_root: Pathname.new("output/universal_glyph_set"),
108
+   resolver: Ucode::Glyphs::Resolver.new(sources: resolver_sources),
109
+   unicode_version: "17.0.0",
110
+   parallel_workers: Ucode.configuration.parallel_workers,
111
+ )
112
+ builder.build
113
+ ```
114
+
115
+ The builder:
116
+
117
+ 1. Reads the assigned-codepoints list from the active UCD baseline.
118
+ 2. For each codepoint, calls `resolver.resolve(codepoint)` → `Result`.
119
+ 3. Writes `glyphs/<U+XXXX>.svg` atomically (reuse
120
+ `Ucode::Repo::AtomicWrites`).
121
+ 4. Records the entry in the manifest.
122
+ 5. Emits the manifest + reports at the end.
123
+
124
+ Idempotency follows Mode 1's pattern: a codepoint whose source font
125
+ mtime + content hash are unchanged is skipped. Re-running with one
126
+ new Tier 1 font re-resolves only the codepoints the new font covers.
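The skip rule can be illustrated with a fingerprint made of the source font's mtime and content hash. This is a sketch under assumptions: `fingerprint` and `unchanged?` are hypothetical helpers, and how the real `Idempotency` class stores its records is not shown in this document.

```ruby
require "digest/sha2"
require "tmpdir"

# Hypothetical fingerprint of a source font: mtime plus SHA-256 of the bytes.
def fingerprint(path)
  { mtime: File.mtime(path).to_i, sha256: Digest::SHA256.file(path).hexdigest }
end

# A codepoint can be skipped when its source font still matches the
# fingerprint recorded by the previous build.
def unchanged?(path, recorded)
  fingerprint(path) == recorded
end

Dir.mktmpdir do |dir|
  font = File.join(dir, "demo.ttf")
  File.write(font, "font bytes v1")
  recorded = fingerprint(font)

  puts unchanged?(font, recorded) # nothing changed, so skip
  File.write(font, "font bytes v2")
  puts unchanged?(font, recorded) # content hash differs, so rebuild
end
```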
127
+
128
+ ## CLI
129
+
130
+ ```bash
131
+ bin/ucode universal-set build \
132
+ --version 17.0.0 \
133
+ --source-config config/unicode17_universal_glyph_set.yml \
134
+ --output output/universal_glyph_set \
135
+ [--parallel 8] \
136
+ [--block Sidetic] # optional: build only one block
137
+ ```
138
+
139
+ Output: stdout reports progress; final manifest at the output root.
140
+
141
+ ## Provenance recording
142
+
143
+ Every `Result` from the resolver carries `tier` and `provenance`. The
144
+ builder copies these into the manifest entry. Per-tier counts are
145
+ rolled up from the entry list.
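Rolling up per-tier counts from the entry list might look like the sketch below. The entries are made-up sample data, and the final check shows the invariant that the per-tier counts should sum to the number of built codepoints.

```ruby
# Sample manifest entries (illustrative data only).
entries = [
  { codepoint: 0x41, tier: "tier-1", provenance: "noto-sans" },
  { codepoint: 0x42, tier: "tier-1", provenance: "noto-sans" },
  { codepoint: 0xE000, tier: "pillar-3", provenance: "pillar-3:last-resort" }
]

# Enumerable#tally (Ruby 2.7+) counts occurrences of each tier label.
by_tier = entries.map { |entry| entry[:tier] }.tally

p by_tier
puts by_tier.values.sum == entries.size # counts must add up to the entry total
```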
146
+
147
+ Special: pillar 3 (Last Resort) glyphs are visually identical tofu
148
+ boxes; their `provenance` is `"pillar-3:last-resort"` and their
149
+ `source` field records the Last Resort UFO version. This makes pillar
150
+ 3 coverage visible in the audit drill-down (TODO 26) so users know
151
+ "this glyph is a placeholder; we don't have a real outline."
152
+
153
+ ## Acceptance
154
+
155
+ - `bin/ucode universal-set build` completes against Unicode 17.0
156
+ without errors.
157
+ - `output/universal_glyph_set/manifest.json` shows
158
+ `codepoints_built == codepoints_assigned`.
159
+ - `reports/gaps.json` is empty (or documents each gap with a reason).
160
+ - Re-running with no source changes produces zero file writes
161
+ (idempotency check).
162
+ - `--block Sidetic` produces only the Sidetic glyphs (~26 files);
163
+ manifest reflects the partial build.
164
+ - A new Tier 1 font addition (e.g. adding a Sidetic font) re-resolves
165
+ only Sidetic; manifest delta shows old pillar-1 entries flipping to
166
+ tier-1.
167
+ - Specs cover: builder happy path (small fixture set), idempotency,
168
+ per-block scoping, manifest serialization round-trip.
169
+ - Rubocop clean.
170
+
171
+ ## Out of scope
172
+
173
+ - The Tier 1 source config (TODO 23).
174
+ - Resolver mechanics (TODO 20).
175
+ - Audits that consume the set (TODO 25).
176
+ - Per-codepoint Mode 1 dataset (TODO 21). The universal set is
177
+ separate; it does not replace Mode 1.
178
+ - Site rendering of the universal set (that's a TODO 26 / fontist.org
179
+ concern).
180
+
181
+ ## References
182
+
183
+ - Source config: `TODO.new/23-universal-glyph-set-source-map.md`
184
+ - Resolver: `TODO.new/20-canonical-resolver-4-tier.md`
185
+ - Mode 1 build: `TODO.new/21-canonical-unicode17-build.md`
186
+ - Audit consumer: `TODO.new/25-font-audit-against-universal-set.md`
187
+ - AtomicWrites: `lib/ucode/repo/atomic_writes.rb`
188
+ - Existing pillar implementations: `lib/ucode/glyphs/{real_fonts,
189
+ embedded_fonts,last_resort}/`
@@ -0,0 +1,195 @@
1
+ # 25 — Font audit against universal set
2
+
3
+ ## Goal
4
+
5
+ Replace the current cmap-vs-UCD coverage audit with a cmap-vs-universal-set
6
+ audit. The font's coverage is now compared against the universal glyph set
7
+ (TODO 24) — one glyph per assigned codepoint — instead of against the
8
+ abstract UCD codepoint list.
9
+
10
+ This is Part 2 of the user's three-part directive. The universal set
11
+ becomes the reference for "what could be rendered." A font's coverage
12
+ report shows not just "1,500 codepoints covered" but "1,500 of the
13
+ 150,012 Unicode-17-representable glyphs."
14
+
15
+ ## Why universal-set reference, not UCD codepoint list
16
+
17
+ Today's audit (TODOs 04, 11, 13) compares a font's cmap against the
18
+ abstract set of assigned Unicode 17 codepoints. That's correct but
19
+ abstract — a consumer can't see "what does the missing codepoint
20
+ look like?"
21
+
22
+ By comparing against the universal glyph set instead:
23
+
24
+ - Every "missing" codepoint has a renderable glyph the consumer can
25
+ preview (TODO 26).
26
+ - Tier provenance is visible: "this font is missing U+10940 SIDETIC
27
+ LETTER N01, which the universal set sources from Lentariso."
28
+ - Audits across fonts are directly comparable: two fonts both missing
29
+ "all of Sidetic" show the same gap, in the same way.
30
+
31
+ Mechanically, the universal set's codepoint list == the assigned
32
+ codepoint list. The audit logic is identical; the difference is that
33
+ every codepoint has an attached glyph + provenance that the renderer
34
+ (TODO 14, TODO 26) can surface.
35
+
36
+ ## Files to create / change
37
+
38
+ - `lib/ucode/audit/universal_set_reference.rb` — adapter that wraps
39
+ the universal-set manifest as a `CoverageReference` (interface below).
40
+ - `lib/ucode/audit/coverage_reference.rb` — common interface for any
41
+ "what's the assigned codepoint set" reference (UCD-only and
42
+ universal-set both implement).
43
+ - `lib/ucode/audit/extractors/aggregations.rb` — change to accept a
44
+ `CoverageReference` instead of always reading UCD directly. Default:
45
+ universal-set reference if available; fall back to UCD-only.
46
+ - `lib/ucode/audit/face_auditor.rb` — accept `reference:` kwarg;
47
+ thread it through to extractors.
48
+ - `lib/ucode/audit/library_auditor.rb` — same.
49
+ - `lib/ucode/commands/audit.rb` (new, was originally going to be TODO
50
+ 16's CLI) — `ucode audit font` now takes
51
+ `--reference-universal-set=<path>` flag (default: enabled if the
52
+ universal set exists).
53
+ - Specs:
54
+ - `spec/ucode/audit/universal_set_reference_spec.rb`
55
+ - `spec/ucode/audit/extractors/aggregations_with_universal_set_spec.rb`
56
+ - `spec/ucode/commands/audit_with_universal_set_spec.rb`
57
+
58
+ ## CoverageReference interface
59
+
60
+ ```ruby
61
+ class Ucode::Audit::CoverageReference
62
+   Entry = Struct.new(:codepoint, :id, :tier, :source, keyword_init: true)
63
+
64
+   # @param codepoint [Integer]
65
+   # @return [Boolean]
66
+   def include?(codepoint)
67
+     raise NotImplementedError
68
+   end
69
+
70
+   # @param block_id [String] verbatim block name
71
+   # @return [Array<Entry>] every assigned codepoint in the block,
72
+   #   with tier + source from the universal-set manifest
73
+   def entries_for_block(block_id)
74
+     raise NotImplementedError
75
+   end
76
+
77
+   # @return [String] e.g. "ucd-17.0.0", "universal-set:17.0.0:sha256"
78
+   def reference_id
79
+     raise NotImplementedError
80
+   end
81
+
82
+   # @return [Hash{String=>String}] provenance metadata for the report
83
+   def baseline_metadata
84
+     raise NotImplementedError
85
+   end
86
+ end
87
+ ```
88
+
89
+ Two concrete implementations:
90
+
91
+ - `Ucode::Audit::UcdOnlyReference` — reads `Blocks.txt` and assigned
92
+ codepoints from the active UCD database. Entry.tier/source are nil.
93
+ - `Ucode::Audit::UniversalSetReference` — reads the universal-set
94
+ manifest (TODO 24). Every entry carries tier + source.
95
+
96
+ ## Aggregation changes
97
+
98
+ `BlockAggregator` previously took `block_total_assigned:` integer from
99
+ the UCD-only baseline. It now takes a `CoverageReference` and calls
100
+ `reference.entries_for_block(block_id)` to get the per-codepoint list.
101
+ For each block, the summary includes:
102
+
103
+ - `covered_count` — codepoints in this block that the font's cmap covers.
104
+ - `missing_codepoints` — codepoints in this block that the font's cmap
105
+ does NOT cover, with universal-set entry attached for renderer drill-down.
106
+
107
+ The `AuditReport.baseline` field gains a `reference_kind` ("ucd" or
108
+ "universal-set") so consumers know which kind of reference produced
109
+ the per-block counts.
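The aggregation described above can be sketched as a partition of the reference entries against the font's cmap. This is an illustration under stated assumptions, not the `BlockAggregator` code: the method name `summarize_block` and the `total_assigned` key are hypothetical, and `entries` stands for whatever `reference.entries_for_block(block_id)` returns.

```ruby
require "set"

# Hypothetical stand-in for the Entry struct described above.
Entry = Struct.new(:codepoint, :id, :tier, :source, keyword_init: true)

# entries:         Array<Entry> for one block, from the coverage reference.
# cmap_codepoints: Enumerable of Integers the font's cmap maps.
def summarize_block(entries, cmap_codepoints)
  cmap = cmap_codepoints.to_set
  # Split the block's assigned codepoints into covered and missing.
  covered, missing = entries.partition { |entry| cmap.include?(entry.codepoint) }
  {
    total_assigned: entries.size, # assumed key name
    covered_count: covered.size,
    missing_codepoints: missing.map(&:codepoint).sort,
  }
end
```

Codepoints that the cmap maps but the reference does not list for the block are ignored, so the counts always stay relative to the reference.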
110
+
111
+ ## Report shape delta
112
+
113
+ Existing `block_summaries[i]` (per TODO 03 + 04) carries
114
+ `missing_codepoints: [Integer]`. New optional field per
115
+ `BlockSummary`:
116
+
117
+ ```json
+ {
+   "name": "Sidetic",
+   ...
+   "missing_codepoints": [10981, 10982, ...],
+   "missing_codepoint_provenance": [
+     { "codepoint": 10981, "tier": "tier-1", "source": "lentariso" },
+     ...
+   ]
+ }
+ ```
128
+
129
+ `missing_codepoint_provenance` is only populated when the reference is
130
+ a UniversalSetReference. UcdOnlyReference produces the existing
131
+ schema (no provenance).
132
+
133
+ This is an additive change. Old consumers ignore the new field. The
134
+ contract (TODO 04) calls this out as a minor version bump.
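One way to keep the change additive is to attach the new key only when the reference kind calls for it. The sketch below assumes a hypothetical helper `block_summary_hash` and reuses the `"ucd"` / `"universal-set"` kind strings from this document; the real serializer may be structured differently.

```ruby
require "json"

# Hypothetical stand-in for the Entry struct described above.
Entry = Struct.new(:codepoint, :id, :tier, :source, keyword_init: true)

# Builds the JSON-ready hash for one block summary.
# missing_entries: Array<Entry> for codepoints the font does not cover.
# reference_kind:  "ucd" or "universal-set".
def block_summary_hash(name, missing_entries, reference_kind:)
  summary = {
    "name" => name,
    "missing_codepoints" => missing_entries.map(&:codepoint),
  }
  # Only the universal-set reference carries tier and source, so the
  # UCD-only output keeps the existing schema with no extra key.
  if reference_kind == "universal-set"
    summary["missing_codepoint_provenance"] = missing_entries.map do |entry|
      { "codepoint" => entry.codepoint, "tier" => entry.tier, "source" => entry.source }
    end
  end
  summary
end

# Example: the same entries serialized under each reference kind.
example = [Entry.new(codepoint: 0x42, tier: "tier-1", source: "example-font")]
puts JSON.generate(block_summary_hash("Basic Latin", example, reference_kind: "ucd"))
puts JSON.generate(block_summary_hash("Basic Latin", example, reference_kind: "universal-set"))
```

Omitting the key entirely, rather than emitting an empty array, is what lets the UCD-only output stay identical to the pre-change report.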
135
+
136
+ ## Backwards compatibility
137
+
138
+ - `ucode audit font` without a universal set behaves exactly as today
139
+ (UCD-only reference).
140
+ - `ucode audit font` with `--reference-universal-set=<path>` switches
141
+ to the universal-set reference. With no flag, the default is to look for the manifest
142
+ at `output/universal_glyph_set/manifest.json`; if present, use it;
143
+ if absent, warn and fall back to UCD-only.
144
+
145
+ This means CI runs that haven't built the universal set yet continue
146
+ to pass. The new functionality is opt-in via presence of the manifest.
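The fallback rule can be sketched as a small selection function. Everything here is illustrative: `select_reference`, its keyword names, and the choice to write the warning to standard error are assumptions, and the sketch does not decide what should happen when an explicitly passed path does not exist, since this document leaves that unspecified.

```ruby
DEFAULT_MANIFEST = "output/universal_glyph_set/manifest.json"

# Returns [kind, manifest_path], where kind is "universal-set" or "ucd".
# flag_path:    value of --reference-universal-set, or nil when not given.
# default_path: where to look when the flag is absent.
# warn_io:      where the fallback warning goes (stderr assumed).
def select_reference(flag_path: nil, default_path: DEFAULT_MANIFEST, warn_io: $stderr)
  # An explicit flag always wins.
  return ["universal-set", flag_path] if flag_path
  # No flag: opt in automatically when the manifest is on disk.
  return ["universal-set", default_path] if File.exist?(default_path)

  warn_io.puts "universal-set manifest not found at #{default_path}; using UCD-only reference"
  ["ucd", nil]
end
```

Keeping the warning off standard output is one way to stay compatible with the byte-identical regression check listed under Acceptance.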
147
+
148
+ ## Cross-font comparison
149
+
150
+ A new optional output: `output/font_audit/_comparison/<label1>_vs_<label2>.json`
151
+ produced by:
152
+
153
+ ```bash
+ bin/ucode audit compare <label1> <label2>
+ ```
156
+
157
+ Diffs two audits over the same blocks and codepoints and reports the
+ codepoints whose coverage differs. Powers "Inter covers these N codepoints
+ that Arial misses" visualizations on fontist.org.
160
+
161
+ Implementation: extends `Ucode::Audit::Differ` to compare two
162
+ `AuditReport`s at the codepoint level (current `Differ` compares
163
+ fields and structural inventories; new mode compares per-block
164
+ coverage).
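A per-block, per-codepoint comparison can be sketched with set differences over `missing_codepoints`. This is not the `Differ` implementation: the function names and the output keys `covered_only_by_first` and `covered_only_by_second` are hypothetical, and the sketch assumes both reports were produced against the same reference, so a codepoint missing from one report but not the other is covered by the other font.

```ruby
# Maps block name to its missing codepoints for one parsed report.
def index_missing(report)
  report.fetch("block_summaries").to_h do |summary|
    [summary.fetch("name"), summary.fetch("missing_codepoints")]
  end
end

# report_a, report_b: parsed audit reports (Hashes with "block_summaries").
# Returns only the blocks where coverage differs.
def compare_coverage(report_a, report_b)
  missing_a = index_missing(report_a)
  missing_b = index_missing(report_b)
  (missing_a.keys | missing_b.keys).sort.each_with_object({}) do |block, diff|
    a = missing_a.fetch(block, []) # both reports are assumed to list the block
    b = missing_b.fetch(block, [])
    only_first = (b - a).sort  # missing in B but not in A: only A covers them
    only_second = (a - b).sort # missing in A but not in B: only B covers them
    next if only_first.empty? && only_second.empty?

    diff[block] = {
      "covered_only_by_first" => only_first,
      "covered_only_by_second" => only_second,
    }
  end
end
```

Skipping blocks with identical coverage keeps the comparison file small when two fonts mostly agree.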
165
+
166
+ ## Acceptance
167
+
168
+ - `UniversalSetReference` round-trips the universal-set manifest into
169
+ the CoverageReference interface correctly (specs).
170
+ - `FaceAuditor` accepts `reference:` kwarg; defaults to UCD-only when
171
+ omitted.
172
+ - `BlockAggregator` produces `missing_codepoint_provenance` when given
173
+ a UniversalSetReference; omits the field for UcdOnlyReference.
174
+ - `bin/ucode audit font <path> --reference-universal-set=<manifest>`
175
+ produces a report where every missing codepoint carries provenance.
176
+ - `bin/ucode audit font <path>` (no flag, no manifest on disk) is
177
+ byte-identical to today's output (regression check).
178
+ - `bin/ucode audit compare` produces a per-block per-codepoint diff.
179
+ - Rubocop clean.
180
+
181
+ ## Out of scope
182
+
183
+ - The drill-down HTML view that renders the universal glyphs next to
184
+ each missing codepoint — TODO 26.
185
+ - The fontist.org consumer side that surfaces the new field — TODO 27.
186
+ - The universal set build itself — TODO 24.
187
+
188
+ ## References
189
+
190
+ - Universal set build: `TODO.new/24-universal-glyph-set-build.md`
191
+ - HTML browser: `TODO.new/14-html-face-browser.md`
192
+ - fontist.org contract: `TODO.new/04-fontist-org-contract.md`
193
+ - Existing Differ: `lib/ucode/audit/differ.rb`
194
+ - Existing aggregations extractor:
195
+ `lib/ucode/audit/extractors/aggregations.rb`