ucode 0.1.0 → 0.1.1
- checksums.yaml +4 -4
- data/CHANGELOG.md +72 -0
- data/Gemfile.lock +2 -2
- data/TODO.full/00-README.md +116 -0
- data/TODO.full/01-panglyph-vision.md +112 -0
- data/TODO.full/02-panglyph-repo-bootstrap.md +184 -0
- data/TODO.full/03-panglyph-font-builder.md +201 -0
- data/TODO.full/04-panglyph-publish-pipeline.md +126 -0
- data/TODO.full/05-ucode-0-1-1-release.md +139 -0
- data/TODO.full/06-fontisan-remove-audit.md +142 -0
- data/TODO.full/07-fontisan-remove-ucd.md +125 -0
- data/TODO.full/08-archive-private-bin-build.md +143 -0
- data/TODO.full/09-archive-public-structure.md +164 -0
- data/TODO.full/10-fontist-org-woff-glyphs.md +131 -0
- data/TODO.full/11-fontist-org-audit-coverage.md +140 -0
- data/TODO.full/12-implementation-order.md +216 -0
- data/TODO.full/13-fontisan-font-writer-api.md +189 -0
- data/TODO.full/14-fontisan-table-writers.md +66 -0
- data/TODO.full/15-panglyph-builder-real.md +82 -0
- data/TODO.full/16-archive-public-sync-workflows.md +167 -0
- data/TODO.full/17-fontist-org-font-picker.md +73 -0
- data/TODO.full/18-comprehensive-spec-coverage.md +64 -0
- data/TODO.full/19-ucode-0-1-2-patch.md +32 -0
- data/TODO.full/20-fontisan-0-2-23-release.md +52 -0
- data/TODO.new/00-README.md +30 -0
- data/TODO.new/23-universal-glyph-set-source-map.md +312 -0
- data/TODO.new/24-universal-glyph-set-build.md +189 -0
- data/TODO.new/25-font-audit-against-universal-set.md +195 -0
- data/TODO.new/26-missing-glyph-reporter.md +189 -0
- data/TODO.new/27-fontist-org-consumer-integration.md +200 -0
- data/TODO.new/28-implementation-order-update.md +187 -0
- data/TODO.new/29-universal-set-curation-uc17.md +312 -0
- data/TODO.new/30-tier1-font-acquisition.md +241 -0
- data/TODO.new/31-universal-set-production-build.md +205 -0
- data/TODO.new/32-uc17-coverage-matrix.md +165 -0
- data/TODO.new/33-specialist-font-acquisition-refresh.md +138 -0
- data/TODO.new/34-pillar2-content-stream-correlator.md +147 -0
- data/TODO.new/35-universal-set-production-run.md +160 -0
- data/TODO.new/36-per-font-coverage-audit.md +145 -0
- data/TODO.new/37-coverage-highlight-reporter.md +125 -0
- data/TODO.new/38-fontist-org-glyph-consumer.md +141 -0
- data/TODO.new/39-implementation-order-update-32-38.md +258 -0
- data/TODO.new/40-archive-private-uses-ucode-audit.md +124 -0
- data/TODO.new/41-ucode-unicode-archive-bridge.md +160 -0
- data/config/specialist_fonts.yml +102 -0
- data/config/unicode17_tier1_fonts.yml +42 -0
- data/config/unicode17_universal_glyph_set.yml +293 -0
- data/lib/ucode/audit/block_aggregator.rb +57 -29
- data/lib/ucode/audit/browser/face_page.rb +128 -0
- data/lib/ucode/audit/browser/glyph_panel.rb +124 -0
- data/lib/ucode/audit/browser/library_page.rb +74 -0
- data/lib/ucode/audit/browser/missing_glyph_page.rb +87 -0
- data/lib/ucode/audit/browser/template.rb +47 -0
- data/lib/ucode/audit/browser/templates/face.css +200 -0
- data/lib/ucode/audit/browser/templates/face.html.erb +41 -0
- data/lib/ucode/audit/browser/templates/face.js +298 -0
- data/lib/ucode/audit/browser/templates/library.css +119 -0
- data/lib/ucode/audit/browser/templates/library.html.erb +42 -0
- data/lib/ucode/audit/browser/templates/library.js +99 -0
- data/lib/ucode/audit/browser/templates/missing_glyph_page.css +119 -0
- data/lib/ucode/audit/browser/templates/missing_glyph_page.html.erb +58 -0
- data/lib/ucode/audit/browser/templates/missing_glyph_page.js +2 -0
- data/lib/ucode/audit/browser.rb +32 -0
- data/lib/ucode/audit/context.rb +27 -1
- data/lib/ucode/audit/coverage_reference.rb +103 -0
- data/lib/ucode/audit/differ.rb +121 -0
- data/lib/ucode/audit/emitter/block_emitter.rb +52 -0
- data/lib/ucode/audit/emitter/codepoint_emitter.rb +87 -0
- data/lib/ucode/audit/emitter/collection_emitter.rb +80 -0
- data/lib/ucode/audit/emitter/face_directory.rb +212 -0
- data/lib/ucode/audit/emitter/glyph_emitter.rb +48 -0
- data/lib/ucode/audit/emitter/index_emitter.rb +149 -0
- data/lib/ucode/audit/emitter/library_emitter.rb +96 -0
- data/lib/ucode/audit/emitter/paths.rb +312 -0
- data/lib/ucode/audit/emitter/plane_emitter.rb +29 -0
- data/lib/ucode/audit/emitter/script_emitter.rb +29 -0
- data/lib/ucode/audit/emitter.rb +29 -0
- data/lib/ucode/audit/extractors/aggregations.rb +31 -2
- data/lib/ucode/audit/face_auditor.rb +86 -0
- data/lib/ucode/audit/formatters/audit_diff_text.rb +112 -0
- data/lib/ucode/audit/formatters/audit_text.rb +411 -0
- data/lib/ucode/audit/formatters/color.rb +48 -0
- data/lib/ucode/audit/formatters/library_summary_text.rb +98 -0
- data/lib/ucode/audit/formatters/text_formatter.rb +83 -0
- data/lib/ucode/audit/formatters.rb +23 -0
- data/lib/ucode/audit/library_aggregator.rb +86 -0
- data/lib/ucode/audit/library_auditor.rb +105 -0
- data/lib/ucode/audit/release/emitter.rb +152 -0
- data/lib/ucode/audit/release/face_card.rb +93 -0
- data/lib/ucode/audit/release/formula_audits.rb +50 -0
- data/lib/ucode/audit/release/library_index_builder.rb +78 -0
- data/lib/ucode/audit/release/manifest_builder.rb +127 -0
- data/lib/ucode/audit/release.rb +42 -0
- data/lib/ucode/audit/ucd_only_reference.rb +81 -0
- data/lib/ucode/audit/universal_set_reference.rb +136 -0
- data/lib/ucode/audit.rb +31 -0
- data/lib/ucode/cli.rb +339 -33
- data/lib/ucode/commands/audit/browser_command.rb +82 -0
- data/lib/ucode/commands/audit/collection_command.rb +103 -0
- data/lib/ucode/commands/audit/compare_command.rb +188 -0
- data/lib/ucode/commands/audit/font_command.rb +140 -0
- data/lib/ucode/commands/audit/library_command.rb +87 -0
- data/lib/ucode/commands/audit/reference_builder.rb +64 -0
- data/lib/ucode/commands/audit.rb +20 -0
- data/lib/ucode/commands/block_feed.rb +73 -0
- data/lib/ucode/commands/canonical_build.rb +138 -0
- data/lib/ucode/commands/fetch.rb +37 -1
- data/lib/ucode/commands/release.rb +115 -0
- data/lib/ucode/commands/universal_set.rb +211 -0
- data/lib/ucode/commands.rb +5 -0
- data/lib/ucode/coordinator/indices.rb +11 -0
- data/lib/ucode/coordinator.rb +138 -5
- data/lib/ucode/error.rb +30 -2
- data/lib/ucode/fetch/font_fetcher/result.rb +39 -0
- data/lib/ucode/fetch/font_fetcher.rb +16 -0
- data/lib/ucode/fetch/specialist_font_fetcher.rb +280 -0
- data/lib/ucode/fetch.rb +7 -3
- data/lib/ucode/glyphs/real_fonts/cmap_cache.rb +74 -0
- data/lib/ucode/glyphs/real_fonts.rb +1 -0
- data/lib/ucode/glyphs/resolver.rb +62 -0
- data/lib/ucode/glyphs/source.rb +48 -0
- data/lib/ucode/glyphs/source_builder.rb +61 -0
- data/lib/ucode/glyphs/source_config/coverage_assertion.rb +79 -0
- data/lib/ucode/glyphs/source_config/gap_report.rb +54 -0
- data/lib/ucode/glyphs/source_config.rb +104 -0
- data/lib/ucode/glyphs/sources/pillar1_embedded_tounicode.rb +63 -0
- data/lib/ucode/glyphs/sources/pillar3_last_resort.rb +51 -0
- data/lib/ucode/glyphs/sources/tier1_real_font.rb +104 -0
- data/lib/ucode/glyphs/sources.rb +20 -0
- data/lib/ucode/glyphs/universal_set/builder.rb +161 -0
- data/lib/ucode/glyphs/universal_set/coverage_report.rb +139 -0
- data/lib/ucode/glyphs/universal_set/idempotency.rb +86 -0
- data/lib/ucode/glyphs/universal_set/manifest_accumulator.rb +195 -0
- data/lib/ucode/glyphs/universal_set/manifest_writer.rb +61 -0
- data/lib/ucode/glyphs/universal_set/pre_build_check.rb +197 -0
- data/lib/ucode/glyphs/universal_set/validator.rb +204 -0
- data/lib/ucode/glyphs/universal_set.rb +45 -0
- data/lib/ucode/glyphs.rb +6 -0
- data/lib/ucode/models/audit/baseline.rb +6 -0
- data/lib/ucode/models/audit/block_summary.rb +7 -0
- data/lib/ucode/models/audit/codepoint_provenance.rb +39 -0
- data/lib/ucode/models/audit/release_face.rb +42 -0
- data/lib/ucode/models/audit/release_formula.rb +33 -0
- data/lib/ucode/models/audit/release_manifest.rb +43 -0
- data/lib/ucode/models/audit/release_universal_set.rb +37 -0
- data/lib/ucode/models/audit.rb +9 -0
- data/lib/ucode/models/block.rb +2 -0
- data/lib/ucode/models/build_report.rb +109 -0
- data/lib/ucode/models/codepoint/glyph.rb +42 -0
- data/lib/ucode/models/codepoint.rb +3 -0
- data/lib/ucode/models/glyph_source.rb +86 -0
- data/lib/ucode/models/glyph_source_map.rb +138 -0
- data/lib/ucode/models/specialist_font.rb +70 -0
- data/lib/ucode/models/specialist_font_manifest.rb +48 -0
- data/lib/ucode/models/unihan_entry.rb +81 -9
- data/lib/ucode/models/unihan_field.rb +21 -0
- data/lib/ucode/models/universal_set_entry.rb +47 -0
- data/lib/ucode/models/universal_set_manifest.rb +78 -0
- data/lib/ucode/models/validation_report.rb +99 -0
- data/lib/ucode/models.rb +9 -0
- data/lib/ucode/parsers/named_sequences.rb +5 -5
- data/lib/ucode/parsers/unihan.rb +50 -19
- data/lib/ucode/repo/aggregate_writer.rb +34 -2
- data/lib/ucode/repo/block_feed_emitter.rb +153 -0
- data/lib/ucode/repo/build_report_accumulator.rb +138 -0
- data/lib/ucode/repo/build_report_writer.rb +46 -0
- data/lib/ucode/repo/build_validator.rb +229 -0
- data/lib/ucode/repo/codepoint_writer.rb +50 -1
- data/lib/ucode/repo/paths.rb +8 -0
- data/lib/ucode/repo.rb +4 -0
- data/lib/ucode/version.rb +1 -1
- data/schema/block-feed.output.schema.yml +134 -0
- metadata +143 -2
- data/ucode.gemspec +0 -56
|
@@ -0,0 +1,205 @@
|
|
|
1
|
+
# 31 — Universal set production build + coverage validation
|
|
2
|
+
|
|
3
|
+
## Goal
|
|
4
|
+
|
|
5
|
+
Execute the universal-set build (TODO 24) end-to-end against the
|
|
6
|
+
curated source config (TODO 29) with the acquired fonts (TODO 30).
|
|
7
|
+
Validate the output: every assigned Unicode 17 codepoint has a glyph,
|
|
8
|
+
the manifest is complete, provenance is recorded, and per-tier
|
|
9
|
+
coverage matches the curated expectations.
|
|
10
|
+
|
|
11
|
+
This is the actual **production run**. It produces the artifact that
|
|
12
|
+
fontist.org (TODO 27) and the missing-glyph reporter (TODO 26)
|
|
13
|
+
consume.
|
|
14
|
+
|
|
15
|
+
## Why a separate TODO
|
|
16
|
+
|
|
17
|
+
TODO 24 built the **mechanics**. TODO 29 curated the **policy**. TODO
|
|
18
|
+
30 fetched the **fonts**. TODO 31 is **execution + validation**.
|
|
19
|
+
|
|
20
|
+
Splitting execution from mechanics lets us:
|
|
21
|
+
|
|
22
|
+
- Catch curation gaps (a font that doesn't actually cover a block).
|
|
23
|
+
- Catch resolver bugs (a Tier 1 font listed but never queried).
|
|
24
|
+
- Catch pillar fallback regressions (pillar-2 should produce
|
|
25
|
+
identical results to correlate-v4, but only if the catalog wiring
|
|
26
|
+
is correct).
|
|
27
|
+
- Produce an auditable coverage report alongside the manifest.
|
|
28
|
+
|
|
29
|
+
## Pre-build validation
|
|
30
|
+
|
|
31
|
+
Before running the build, assert:
|
|
32
|
+
|
|
33
|
+
1. **Source config loads cleanly.** `SourceConfig.load(path)` returns
|
|
34
|
+
a `GlyphSourceMap` with no schema errors.
|
|
35
|
+
2. **All fonts present.** Every `path:` entry in the YAML exists on
|
|
36
|
+
disk (or is fontist-discoverable). Missing fonts = list + abort.
|
|
37
|
+
Don't start a 4-hour build with known-missing inputs.
|
|
38
|
+
3. **Coverage assertion runs.** TODO 29's `CoverageAssertion` runs;
|
|
39
|
+
gaps are listed but don't abort (expected for some blocks).
|
|
40
|
+
|
|
41
|
+
If pre-build validation fails, abort with a typed
|
|
42
|
+
`Ucode::Glyphs::UniversalSet::PreBuildError` listing each failure.
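
The blocking part of these checks can be sketched in Ruby as follows. This is a minimal illustration under stated assumptions, not the gem's implementation: the `PreBuildSketch` module, its `run` method, and the argument shapes are hypothetical; only the `PreBuildError` name comes from the text above. Missing fonts are collected and reported together, while coverage gaps are passed through as warnings.

```ruby
# Hypothetical sketch of a pre-build check. Every name except PreBuildError
# is an assumption made for illustration.
module PreBuildSketch
  # Carries every blocking failure, not just the first one found.
  class PreBuildError < StandardError
    attr_reader :failures

    def initialize(failures)
      @failures = failures
      super("pre-build validation failed:\n" + failures.map { |f| "  - #{f}" }.join("\n"))
    end
  end

  # font_paths:    path strings taken from the source config's `path:` entries.
  # coverage_gaps: strings describing known gaps; listed but never blocking.
  def self.run(font_paths:, coverage_gaps: [])
    missing = font_paths.reject { |path| File.exist?(path) }
    failures = missing.map { |path| "missing font: #{path}" }
    raise PreBuildError.new(failures) unless failures.empty?

    { fonts_checked: font_paths.size, warnings: coverage_gaps }
  end
end
```

A caller would rescue `PreBuildSketch::PreBuildError` and print `failures` before exiting, so a long build never starts with known-missing inputs.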
|
|
43
|
+
|
|
44
|
+
## Build execution
|
|
45
|
+
|
|
46
|
+
```bash
|
|
47
|
+
bin/ucode universal-set build \
|
|
48
|
+
--version 17.0.0 \
|
|
49
|
+
--source-config config/unicode17_universal_glyph_set.yml \
|
|
50
|
+
--output output/universal_glyph_set \
|
|
51
|
+
--parallel 8
|
|
52
|
+
```
|
|
53
|
+
|
|
54
|
+
Expected runtime: ~3-4 hours for full Unicode 17 (150,000+ codepoints).
|
|
55
|
+
CJK dominates the runtime (~45,000 ideographs via FSung).
|
|
56
|
+
|
|
57
|
+
## Post-build validation
|
|
58
|
+
|
|
59
|
+
After the build, validate:
|
|
60
|
+
|
|
61
|
+
1. **Completeness.** Every assigned codepoint has a `glyphs/<U+XXXX>.svg`.
|
|
62
|
+
2. **Manifest integrity.** `manifest.json` parses, has an entry for
|
|
63
|
+
every assigned codepoint, totals reconcile.
|
|
64
|
+
3. **Provenance recorded.** Every entry has non-nil `tier` and
|
|
65
|
+
`source` fields.
|
|
66
|
+
4. **No tofu leaks.** Count pillar-3 entries; investigate any that
|
|
67
|
+
aren't documented as expected gaps (unassigned, PUA,
|
|
68
|
+
noncharacter — Last Resort is correct for these).
|
|
69
|
+
5. **Idempotency.** Re-running with no source changes produces zero
|
|
70
|
+
file writes.
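
Checks 1 to 4 can be sketched as a pure function over already-loaded data. This is an illustration only: the `PostBuildSketch` module and the manifest shape (`{"entries" => [{"codepoint", "tier", "source"}]}`) are assumptions, since the real manifest schema is not shown here.

```ruby
require "json"
require "set"

# Hypothetical post-build checks. The manifest shape used here is an
# assumption for illustration, not the shipped schema.
module PostBuildSketch
  def self.glyph_name(codepoint)
    format("U+%04X.svg", codepoint)
  end

  # Returns a hash of problem lists; all lists empty means checks 1-4 passed.
  def self.validate(manifest_json:, assigned:, glyph_files:, expected_tofu: [])
    entries = JSON.parse(manifest_json).fetch("entries")
    by_cp = entries.to_h { |e| [e.fetch("codepoint"), e] }
    present = glyph_files.to_set
    expected = expected_tofu.to_set

    {
      missing_glyphs: assigned.reject { |cp| present.include?(glyph_name(cp)) },
      missing_entries: assigned.reject { |cp| by_cp.key?(cp) },
      missing_provenance: entries.select { |e| e["tier"].nil? || e["source"].nil? }
                                 .map { |e| e["codepoint"] },
      tofu_leaks: entries.select { |e| e["tier"] == "pillar-3" }
                         .map { |e| e["codepoint"] }
                         .reject { |cp| expected.include?(cp) }
    }
  end
end
```

Keeping the function free of file I/O makes it easy to exercise in a spec with a three-entry manifest instead of a full build output.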
|
|
71
|
+
|
|
72
|
+
## Per-tier coverage report
|
|
73
|
+
|
|
74
|
+
`reports/by_tier.json`:
|
|
75
|
+
|
|
76
|
+
```json
|
|
77
|
+
{
|
|
78
|
+
"tier-1": 148512,
|
|
79
|
+
"pillar-1": 800,
|
|
80
|
+
"pillar-2": 200,
|
|
81
|
+
"pillar-3": 1500,
|
|
82
|
+
"gaps": 0
|
|
83
|
+
}
|
|
84
|
+
```
|
|
85
|
+
|
|
86
|
+
Target: tier-1 ≥ 95% of assigned codepoints. Pillar-3 (Last Resort
|
|
87
|
+
tofu) ≤ 1% of assigned codepoints (Last Resort is the correct tier
|
|
88
|
+
for unassigned/PUA/noncharacter; those should be the only pillar-3
|
|
89
|
+
entries among assigned codepoints, and there should be none).
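
The two targets can be checked mechanically against the report. The sketch below is a hypothetical helper, assuming every key in `by_tier.json` except `gaps` is a count of assigned codepoints; the function names are not part of ucode.

```ruby
require "json"

# Hypothetical threshold check over a by_tier.json payload.
# Assumption: every key except "gaps" is a per-tier count of assigned codepoints.
def tier_shares(by_tier_json)
  counts = JSON.parse(by_tier_json).reject { |key, _| key == "gaps" }
  total = counts.values.sum
  counts.transform_values { |n| n.to_f / total }
end

# True when the Tier 1 floor and the pillar-3 ceiling are both respected.
def targets_met?(shares, tier1_min: 0.95, pillar3_max: 0.01)
  shares.fetch("tier-1", 0.0) >= tier1_min && shares.fetch("pillar-3", 0.0) <= pillar3_max
end

sample = '{"tier-1": 148512, "pillar-1": 800, "pillar-2": 200, "pillar-3": 1500, "gaps": 0}'
shares = tier_shares(sample)
puts format("tier-1 %.2f%%, pillar-3 %.2f%%", shares["tier-1"] * 100, shares["pillar-3"] * 100)
puts targets_met?(shares)
```

With the sample figures the total is 151,012 codepoints, so tier-1 sits a little above 98% and pillar-3 just under 1%; both targets hold.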
|
|
90
|
+
|
|
91
|
+
## Per-block coverage report
|
|
92
|
+
|
|
93
|
+
`reports/by_block.json`:
|
|
94
|
+
|
|
95
|
+
```json
|
|
96
|
+
{
|
|
97
|
+
"Sidetic": {
|
|
98
|
+
"assigned": 26, "tier-1": 26, "pillar-1": 0, "pillar-2": 0, "pillar-3": 0
|
|
99
|
+
},
|
|
100
|
+
"Beria_Erfe": {
|
|
101
|
+
"assigned": 50, "tier-1": 50, "pillar-1": 0, "pillar-2": 0, "pillar-3": 0
|
|
102
|
+
},
|
|
103
|
+
"Combining_Diacritical_Marks_Extended": {
|
|
104
|
+
"assigned": 90, "tier-1": 63, "pillar-1": 0, "pillar-2": 27, "pillar-3": 0
|
|
105
|
+
}
|
|
106
|
+
}
|
|
107
|
+
```
|
|
108
|
+
|
|
109
|
+
Each block's per-tier breakdown makes it obvious where Tier 1 coverage
|
|
110
|
+
is incomplete. In the example, Combining Diacritical Marks Extended
|
|
111
|
+
has 27 codepoints that fell through to pillar-2 — the residual gap
|
|
112
|
+
the curation (TODO 29) flagged.
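
A short Ruby sketch shows how such shortfalls could be pulled out of the report automatically. The helper name is hypothetical and the code assumes the `by_block.json` shape from the example.

```ruby
require "json"

# Hypothetical helper: map each block to the number of assigned codepoints
# that Tier 1 did not cover. Blocks with full Tier 1 coverage are omitted.
def tier1_shortfalls(by_block_json)
  rows = JSON.parse(by_block_json)
  pairs = rows.filter_map do |block, row|
    shortfall = row.fetch("assigned") - row.fetch("tier-1")
    [block, shortfall] if shortfall.positive?
  end
  pairs.to_h
end
```

Applied to the three example rows, only Combining_Diacritical_Marks_Extended is returned, with a shortfall of 27.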
|
|
113
|
+
|
|
114
|
+
## Gap investigation
|
|
115
|
+
|
|
116
|
+
`reports/gaps.json` lists every assigned codepoint that ended up at
|
|
117
|
+
pillar-3 (tofu) — these are **bugs**:
|
|
118
|
+
|
|
119
|
+
```json
|
|
120
|
+
[
|
|
121
|
+
{ "codepoint": 119808, "block": "Mathematical_Alphanumeric_Symbols",
|
|
122
|
+
"reason": "tier-1:noto-sans-math did not cover; pillar-2 catalog miss" }
|
|
123
|
+
]
|
|
124
|
+
```
|
|
125
|
+
|
|
126
|
+
Each gap entry records the path through the resolver that led to tofu.
|
|
127
|
+
Zero gaps = perfect coverage. Non-zero gaps = actionable curation
|
|
128
|
+
follow-ups (typically: "add font X to block Y's source list").
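
For triage it helps to group gaps by block and print codepoints in `U+XXXX` form. The following is a minimal sketch assuming the entry shape shown above; `summarize_gaps` is a hypothetical name.

```ruby
require "json"

# Hypothetical summary of a gaps.json payload: block name => sorted list of
# codepoints in U+XXXX notation.
def summarize_gaps(gaps_json)
  grouped = JSON.parse(gaps_json).group_by { |gap| gap.fetch("block") }
  grouped.transform_values do |gaps|
    gaps.map { |gap| gap.fetch("codepoint") }.sort.map { |cp| format("U+%04X", cp) }
  end
end
```

The decimal codepoint 119808 from the example is rendered as `U+1D400`.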
|
|
129
|
+
|
|
130
|
+
## CJK Ext J verification
|
|
131
|
+
|
|
132
|
+
Special verification for the largest block added in Unicode 17: CJK Unified
|
|
133
|
+
Ideographs Extension J (4,298 codepoints). The build should produce:
|
|
134
|
+
|
|
135
|
+
- `tier-1` count == 4,298 if FSung-* covers all of them.
|
|
136
|
+
- `tier-1` + `pillar-1` count == 4,298 if FSung-* misses some that
|
|
137
|
+
Code Charts PDF covers.
|
|
138
|
+
|
|
139
|
+
Either is acceptable. The `reports/by_block.json` row for Ext J
|
|
140
|
+
documents which path actually fired.
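
The two acceptable outcomes can be told apart from the Ext J row alone. The check below is a hypothetical sketch; the constant and method names are assumptions, and the row shape follows the `by_block.json` example.

```ruby
# Hypothetical acceptance check for the Ext J row of by_block.json.
EXT_J_ASSIGNED = 4_298

# Returns which resolution path accounts for all Ext J codepoints, or
# :incomplete when neither acceptable outcome applies.
def ext_j_path(row)
  tier1 = row.fetch("tier-1")
  pillar1 = row.fetch("pillar-1", 0)
  return :tier1_only if tier1 == EXT_J_ASSIGNED
  return :tier1_plus_pillar1 if tier1 + pillar1 == EXT_J_ASSIGNED

  :incomplete
end
```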
|
|
141
|
+
|
|
142
|
+
## Files to create
|
|
143
|
+
|
|
144
|
+
- `lib/ucode/glyphs/universal_set/validator.rb` — post-build
|
|
145
|
+
validator. Reads manifest + glyphs dir, runs the 5 checks above.
|
|
146
|
+
- `lib/ucode/glyphs/universal_set/coverage_report.rb` — emits
|
|
147
|
+
per-tier + per-block + gaps JSON reports.
|
|
148
|
+
- `lib/ucode/glyphs/universal_set/pre_build_check.rb` — runs
|
|
149
|
+
pre-build validation (config + fonts + assertion).
|
|
150
|
+
- `lib/ucode/commands/universal_set.rb` — autoload hub (extend if
|
|
151
|
+
present).
|
|
152
|
+
- `lib/ucode/commands/universal_set/validate.rb` — CLI subcommand.
|
|
153
|
+
- Specs:
|
|
154
|
+
- `spec/ucode/glyphs/universal_set/validator_spec.rb`
|
|
155
|
+
- `spec/ucode/glyphs/universal_set/coverage_report_spec.rb`
|
|
156
|
+
- `spec/ucode/glyphs/universal_set/pre_build_check_spec.rb`
|
|
157
|
+
|
|
158
|
+
## CLI
|
|
159
|
+
|
|
160
|
+
```bash
|
|
161
|
+
bin/ucode universal-set build # TODO 24, existing
|
|
162
|
+
bin/ucode universal-set validate # TODO 31, new — post-build validation
|
|
163
|
+
bin/ucode universal-set report # TODO 31, new — emit coverage reports
|
|
164
|
+
bin/ucode universal-set pre-check # TODO 31, new — pre-build validation
|
|
165
|
+
```
|
|
166
|
+
|
|
167
|
+
`build` runs `pre-check` automatically before starting; the standalone
|
|
168
|
+
`pre-check` is for iterating on curation without burning a 4-hour
|
|
169
|
+
build.
|
|
170
|
+
|
|
171
|
+
## Acceptance
|
|
172
|
+
|
|
173
|
+
- `bin/ucode universal-set build` completes against Unicode 17.0 in
|
|
174
|
+
under 4 hours.
|
|
175
|
+
- `output/universal_glyph_set/manifest.json` shows
|
|
176
|
+
`codepoints_built == codepoints_assigned` (≥ 150,000).
|
|
177
|
+
- `reports/gaps.json` is empty for assigned codepoints outside the
|
|
178
|
+
documented residual cases (Combining Diacritical Marks Extended
|
|
179
|
+
additions, Symbols Legacy Supp additions, Supp Arrows-C additions).
|
|
180
|
+
- `reports/by_tier.json` shows tier-1 ≥ 95% (target: 100% for
|
|
181
|
+
assigned codepoints outside documented gaps).
|
|
182
|
+
- Re-running with no source changes produces zero file writes.
|
|
183
|
+
- The build correctly handles CJK Ext J: all 4,298 codepoints
|
|
184
|
+
resolved via FSung-* or noto-sans-cjk-jp fallback (no tofu leaks).
|
|
185
|
+
- Residual gaps fall through to Pillar 2 cleanly; no crashes, no
|
|
186
|
+
silent skips.
|
|
187
|
+
- `pre-check` aborts on missing font files with a clear list of
|
|
188
|
+
what's missing.
|
|
189
|
+
- Rubocop clean.
|
|
190
|
+
|
|
191
|
+
## Out of scope
|
|
192
|
+
|
|
193
|
+
- Source config curation — TODO 29.
|
|
194
|
+
- Font acquisition — TODO 30.
|
|
195
|
+
- fontist.org consumer integration — TODO 27.
|
|
196
|
+
- Site rendering of the universal set — TODO 26 / TODO 27.
|
|
197
|
+
|
|
198
|
+
## References
|
|
199
|
+
|
|
200
|
+
- Build mechanics: `TODO.new/24-universal-glyph-set-build.md`
|
|
201
|
+
- Source config: `TODO.new/29-universal-set-curation-uc17.md`
|
|
202
|
+
- Font acquisition: `TODO.new/30-tier1-font-acquisition.md`
|
|
203
|
+
- Audit consumer: `TODO.new/25-font-audit-against-universal-set.md`
|
|
204
|
+
- Existing builder: `lib/ucode/glyphs/universal_set/builder.rb`
|
|
205
|
+
- Existing manifest model: `lib/ucode/models/universal_set_manifest.rb`
|
|
@@ -0,0 +1,165 @@
|
|
|
1
|
+
# 32 — Universal glyph set: full UC17 coverage matrix (Part 1 master)
|
|
2
|
+
|
|
3
|
+
## Goal
|
|
4
|
+
|
|
5
|
+
Produce **one canonical Tier 1 font recommendation per Unicode 17 block**
|
|
6
|
+
(~346 entries). This is the master output of Part 1 — the artifact that
|
|
7
|
+
defines "full coverage" for ucode's universal glyph set. Once this
|
|
8
|
+
matrix is encoded in `config/unicode17_universal_glyph_set.yml`, every
|
|
9
|
+
downstream TODO (production build, per-font audit, missing-glyph
|
|
10
|
+
reporter, fontist.org consumer) treats it as ground truth.
|
|
11
|
+
|
|
12
|
+
The matrix does NOT require fonts to be installed or cmaps to be
|
|
13
|
+
verified — that's TODO 35 (production build) and TODO 36 (per-font
|
|
14
|
+
audit). This TODO is purely the **policy**: "for block X, use font Y
|
|
15
|
+
(fallback chain Z)."
|
|
16
|
+
|
|
17
|
+
## Why a separate TODO
|
|
18
|
+
|
|
19
|
+
TODO 29 (UC17 curation) started this work but stopped at ~30 specialist
|
|
20
|
+
blocks. The remaining ~315 blocks rely on a single `default_sources`
|
|
21
|
+
entry pointing at `noto-sans` via fontist — which the fontist formula
|
|
22
|
+
repo doesn't actually carry as a generic package. So the current config
|
|
23
|
+
CLAIMS full coverage but the resolver can't materialize glyphs for most
|
|
24
|
+
blocks.
|
|
25
|
+
|
|
26
|
+
This TODO splits the policy work from the acquisition work:
|
|
27
|
+
|
|
28
|
+
- **TODO 32 (this)**: decide the canonical font per block (policy)
|
|
29
|
+
- **TODO 33**: fix the acquisition paths (URLs + fontist formulas)
|
|
30
|
+
- **TODO 35**: build the universal set end-to-end (run the policy)
|
|
31
|
+
|
|
32
|
+
Reviewers can sign off on the per-block choices here without waiting
|
|
33
|
+
for font availability.
|
|
34
|
+
|
|
35
|
+
## Coverage policy (the recommendation)
|
|
36
|
+
|
|
37
|
+
### Tier 1 default — Noto Sans family
|
|
38
|
+
|
|
39
|
+
Noto is the canonical Tier 1 source for ~250 of ~346 blocks. Where a
|
|
40
|
+
dedicated `Noto Sans <Script>` variant exists, use it; otherwise fall
|
|
41
|
+
back to `noto-sans` (Latin/core).
|
|
42
|
+
|
|
43
|
+
| Script family | Tier 1 font | Blocks covered |
|
|
44
|
+
|---|---|---|
|
|
45
|
+
| Latin + extensions + IPA + Spacing Modifier + Combining Diacriticals | `noto-sans` | ~20 blocks |
|
|
46
|
+
| Greek + Coptic | `noto-sans` | 2 |
|
|
47
|
+
| Cyrillic (all extensions) | `noto-sans` | 4 |
|
|
48
|
+
| Armenian, Hebrew | `noto-sans-armenian`, `noto-sans-hebrew` | 2 |
|
|
49
|
+
| Arabic + extensions + Supplement | `noto-naskh-arabic` / `noto-sans-arabic` | 4 |
|
|
50
|
+
| Syriac, Thaana, NKo, Samaritan, Mandaic | `noto-sans-<script>` | 5 |
|
|
51
|
+
| Brahmic (Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala) | `noto-sans-<script>` | 10 |
|
|
52
|
+
| Tibetan, Myanmar, Georgian | `noto-sans-<script>` | 3+ |
|
|
53
|
+
| Hangul Jamo + compatibility | `noto-sans-hangul` or `noto-sans-kr` | 5 |
|
|
54
|
+
| Ethiopic + extensions | `noto-sans-ethiopic` | 3 |
|
|
55
|
+
| Cherokee, Canadian Aboriginal | `noto-sans-cherokee`, `noto-sans-canadian-aboriginal` | 2 |
|
|
56
|
+
| Khmer, Mongolian, Limbu, Tai Le, Tai Tham, Buginese | `noto-sans-<script>` | 6 |
|
|
57
|
+
| Symbol blocks (Math, Arrows, Misc, Geometric, Dingbats) | `noto-sans-symbols`, `noto-sans-symbols-2`, `noto-sans-math` | ~10 |
|
|
58
|
+
| Music | `noto-music` | 1 |
|
|
59
|
+
|
|
60
|
+
### Tier 1 specialists (non-Noto)
|
|
61
|
+
|
|
62
|
+
These blocks need fonts outside the Noto family. Each must be acquired
|
|
63
|
+
via `ucode fetch fonts` (specialist manifest, TODO 30).
|
|
64
|
+
|
|
65
|
+
| Block | Range | Tier 1 font | Provenance | Confidence |
|
|
66
|
+
|---|---|---|---:|---|
|
|
67
|
+
| Sidetic | U+10940–1095F | Lentariso ≥1.029 | github.com/Bry10022/Lentariso | HIGH |
|
|
68
|
+
| Beria Erfe | U+16EA0–16EDF | Kedebideri 3.001 | software.sil.org/kedebideri | HIGH |
|
|
69
|
+
| Tai Yo | U+1E6C0–1E6F3 | NotoSerifTaiYo | translationcommons.org | HIGH |
|
|
70
|
+
| Tolong Siki | U+11DB0–11DEF | Noto Sans Tolong Siki | notofonts.github.io | HIGH |
|
|
71
|
+
| Sharada Supplement | U+11B60–11B7F | Noto Sans Sharada | Google Fonts | HIGH |
|
|
72
|
+
| Egyptian Hieroglyphs | U+13000–1342F | UniHieroglyphica v16 | suignard.com/Ptolemaic/ | HIGH |
|
|
73
|
+
| Egyptian Hieroglyph Format Controls | U+13430–1345F | Egyptian Text | github.com/microsoft/font-tools | HIGH |
|
|
74
|
+
| Egyptian Hieroglyphs Extended-A | U+13460–143FF | UniHieroglyphica v16 | suignard.com | HIGH |
|
|
75
|
+
| Egyptian Hieroglyphs Extended-B (new UC17) | U+134A0.. | UniHieroglyphica v16 | suignard.com | HIGH |
|
|
76
|
+
| CJK Unified Ideographs | U+4E00–9FFF | FSung-1.ttf (local) + Noto Sans CJK JP fallback | ~/Downloads/全宋體 | HIGH |
|
|
77
|
+
| CJK Unified Ideographs Extension A | U+3400–4DBF | FSung + Noto Sans CJK JP | ~/Downloads/全宋體 | HIGH |
|
|
78
|
+
| CJK Unified Ideographs Extension B–H | various | FSung-2.ttf..FSung-X.ttf | ~/Downloads/全宋體 | HIGH |
|
|
79
|
+
| CJK Unified Ideographs Extension J (new UC17) | U+323B0–3347F | FSung (latest) + Noto Sans CJK JP | ~/Downloads/全宋體 | HIGH |
|
|
80
|
+
| Tangut + Components + Supplement | U+17000–187FF | Noto Sans Tangut | notofonts.github.io | HIGH |
|
|
81
|
+
| Symbols for Legacy Computing Supplement | U+1CC00–1CEBF | BabelStone Pseudographica | babelstone.co.uk | MEDIUM |
|
|
82
|
+
| Supplemental Arrows-C (UC17 additions) | U+1F800–1F8FF | Symbola | dn-works.com / github.com/zhm/symbola mirror | MEDIUM |
|
|
83
|
+
|
|
84
|
+
### Tier 1 emoji
|
|
85
|
+
|
|
86
|
+
| Block | Range | Tier 1 font |
|
|
87
|
+
|---|---|---|
|
|
88
|
+
| Emoticons + Pictographs + Supplemental + Transport + Symbols & Pictographs Extended-A | various | Noto Emoji (monochrome; Noto Color Emoji for color rendering only) |
|
|
89
|
+
| Variation Selectors | U+FE00–FE0F | Noto Sans (special handling — invisible format chars) |
|
|
90
|
+
|
|
91
|
+
### Pillar 2 fallback (no Tier 1 available)
|
|
92
|
+
|
|
93
|
+
Blocks with no redistributable Tier 1 font MUST go through pillar 2
|
|
94
|
+
(content-stream correlation). TODO 34 builds this; TODO 32 just
|
|
95
|
+
records the policy.
|
|
96
|
+
|
|
97
|
+
| Block | Why pillar 2 | Pillar 2 PDF source |
|
|
98
|
+
|---|---|---|
|
|
99
|
+
| Sidetic (if Lentariso unavailable) | Private foundry | U10940.pdf |
|
|
100
|
+
| Beria Erfe (if Kedebideri unavailable) | UFO source, complex extract | U16EA0.pdf |
|
|
101
|
+
| Egyptian Hieroglyph Format Controls (gap) | Egyptian Text limitations | U13430.pdf |
|
|
102
|
+
|
|
103
|
+
### Pillar 3 last resort (always-on fallback)
|
|
104
|
+
|
|
105
|
+
When both Tier 1 and pillar 2 fail (or for unassigned/PUA ranges that
|
|
106
|
+
still need a placeholder glyph), the resolver emits a Last Resort Font
|
|
107
|
+
tofu box. This is encoded as the lowest-priority source on
|
|
108
|
+
`default_sources`, not per-block.
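
The priority order described in this section amounts to walking a list of sources and stopping at the first one that yields a glyph. The sketch below illustrates that idea only; `Resolution`, `resolve`, and the lambda-based sources are hypothetical stand-ins, not the gem's resolver API.

```ruby
# Hypothetical resolver chain: try each source in priority order and record
# which tier produced the glyph. Sources are plain lambdas for illustration.
Resolution = Struct.new(:codepoint, :tier, :glyph)

def resolve(codepoint, sources)
  sources.each do |tier, lookup|
    glyph = lookup.call(codepoint)
    return Resolution.new(codepoint, tier, glyph) if glyph
  end
  nil # only reachable when no always-on fallback is configured
end

SAMPLE_SOURCES = [
  ["tier-1", ->(cp) { cp == 0x41 ? "<svg>A</svg>" : nil }],
  ["pillar-2", ->(_cp) { nil }],
  ["pillar-3", ->(_cp) { "<svg>tofu</svg>" }] # Last Resort always answers
].freeze
```

Because the Last Resort entry sits last and always answers, every codepoint resolves to something, and the recorded `tier` shows whether the result is a real glyph or a placeholder.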
|
|
109
|
+
|
|
110
|
+
## Scope
|
|
111
|
+
|
|
112
|
+
1. **YAML structure** — extend `Models::GlyphSourceMap` to accept a
|
|
113
|
+
`default_sources` block at the top level (currently forces ~315
|
|
114
|
+
repetitions of the same Noto Sans entry). See TODO 29 §"Architectural
|
|
115
|
+
improvements" for the shape.
|
|
116
|
+
|
|
117
|
+
2. **Curate every block** — walk `output/blocks/index.json`, decide
|
|
118
|
+
Tier 1 for each. Output: ~346 distinct entries (or ~30 specialists +
|
|
119
|
+
`default_sources`).
|
|
120
|
+
|
|
121
|
+
3. **Per-block rationale comment** — every non-default entry must
|
|
122
|
+
explain WHY (provenance URL, OFL check, known coverage gaps). This
|
|
123
|
+
becomes the documentation for the universal set; reviewers should
|
|
124
|
+
not need to chase external links to understand a choice.
|
|
125
|
+
|
|
126
|
+
4. **Resolve the specialists named in TODO 29** that didn't have
|
|
127
|
+
concrete URLs:
|
|
128
|
+
- Lentariso: GitHub repo has no releases (the prior URL was 404).
|
|
129
|
+
Policy: vendor the TTFs from `TTFs/` folder of the repo ZIP
|
|
130
|
+
(downloadable via `git clone` or codeload ZIP).
|
|
131
|
+
- EgyptianText: Microsoft/font-tools has no releases. Policy: pull
|
|
132
|
+
from `EgyptianOpenType/` directory in the repo.
|
|
133
|
+
- UniHieroglyphica: canonical URL is `suignard.com/Ptolemaic/` (BBAW
|
|
134
|
+
page is authoritative), not the prior `/UniHieroglyphica/` path.
|
|
135
|
+
- Symbola: dn-works.com no longer hosts public downloads. Policy:
|
|
136
|
+
mirror via `github.com/zhm/symbola` (OFL, version-pinned).
|
|
137
|
+
|
|
138
|
+
5. **Test fixtures** — for each curated specialist, capture a small
|
|
139
|
+
fixture (1–5 codepoint ids) and assert the source map returns the
|
|
140
|
+
expected font label. Tests run without the font installed.
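
Items 1 and 5 can be illustrated together: a source map with a top-level `default_sources` fallback, plus the kind of lookup a fixture spec would assert on. This is a sketch under stated assumptions. The `SourceMapSketch` class and the YAML keys (`blocks`, `sources`, `font`) are hypothetical and are not the shipped schema; only `default_sources` and `sources_for` are named in this document.

```ruby
require "yaml"

# Hypothetical source map with a top-level default_sources fallback.
# YAML keys other than default_sources are assumptions for illustration.
class SourceMapSketch
  def initialize(config)
    @defaults = config.fetch("default_sources")
    @blocks = config.fetch("blocks", {})
  end

  # Specialist blocks list their own sources; all other blocks inherit the defaults.
  def sources_for(block_id)
    @blocks.dig(block_id, "sources") || @defaults
  end
end

SAMPLE_CONFIG = YAML.safe_load(<<~DOC)
  default_sources:
    - { font: noto-sans }
    - { font: last-resort }
  blocks:
    Sidetic:
      sources:
        - { font: lentariso }
        - { font: last-resort }
DOC

map = SourceMapSketch.new(SAMPLE_CONFIG)
puts map.sources_for("Basic_Latin").first["font"]
puts map.sources_for("Sidetic").first["font"]
```

Since the lookup only reads configuration, a spec built this way needs no font files on disk.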
|
|
141
|
+
|
|
142
|
+
## Acceptance
|
|
143
|
+
|
|
144
|
+
- [ ] `config/unicode17_universal_glyph_set.yml` lists every Unicode
|
|
145
|
+
17 block by id, with `sources:` per entry or implicit
|
|
146
|
+
`default_sources` fallback.
|
|
147
|
+
- [ ] Each specialist entry carries `provenance`, `license`, `url`
|
|
148
|
+
(or `path` for local), and a rationale comment.
|
|
149
|
+
- [ ] `Ucode::Models::GlyphSourceMap#sources_for(block_id)` returns
|
|
150
|
+
the right list for default AND specialist entries.
|
|
151
|
+
- [ ] Every specialist URL is HTTP 200-verifiable (or marked
|
|
152
|
+
`local_only: true` for user-supplied fonts like FSung).
|
|
153
|
+
- [ ] Curation specs cover at least: Basic_Latin (default),
|
|
154
|
+
Sidetic (specialist fontist), Tai Yo (specialist path),
|
|
155
|
+
CJK Unified Ideographs (specialist multi-source with fallback),
|
|
156
|
+
Egyptian Hieroglyphs (specialist path).
|
|
157
|
+
|
|
158
|
+
## References
|
|
159
|
+
|
|
160
|
+
- [TODO 23](23-universal-glyph-set-source-map.md) — source map mechanism
|
|
161
|
+
- [TODO 29](29-universal-set-curation-uc17.md) — initial curation
|
|
162
|
+
- [TODO 33](33-specialist-font-acquisition-refresh.md) — fix URLs
|
|
163
|
+
- [TODO 35](35-universal-set-production-run.md) — build it
|
|
164
|
+
- `docs/architecture.md` — 4-tier glyph strategy
|
|
165
|
+
- BBAW Egyptological Unicode Fonts page — authoritative for Egyptian family
|
|
@@ -0,0 +1,138 @@
|
|
|
1
|
+
# 33 — Specialist font acquisition refresh
|
|
2
|
+
|
|
3
|
+
## Goal
|
|
4
|
+
|
|
5
|
+
Fix the broken acquisition paths that block the universal-set build
|
|
6
|
+
(TODO 35) from completing. Six of the seven specialists in
|
|
7
|
+
`config/specialist_fonts.yml` return HTTP 404/301 today:
|
|
8
|
+
|
|
9
|
+
| Label | Current URL | Status | Working URL |
|
|
10
|
+
|---|---|---|---|
|
|
11
|
+
| Lentariso | `github.com/Bry10022/Lentariso/releases/download/1.033/Lentariso.otf` | 404 (no releases published) | Repo has no release artifacts; vendor `TTFs/*.ttf` from `archive/master.zip` |
|
|
12
|
+
| NotoSerifTaiYo | `translationcommons.org/wp-content/uploads/2025/09/NotoSerifTaiYo.ttf` | 404 | Path changed; needs upstream contact or alternate mirror |
|
|
13
|
+
| UniHieroglyphica | `suignard.com/UniHieroglyphica/UniHieroglyphica-16.0.zip` | 301 redirect | New path is `suignard.com/Ptolemaic/` per BBAW |
|
|
14
|
+
| EgyptianText | `github.com/microsoft/font-tools/releases/download/v1.0/EgyptianText-Regular.ttf` | 404 (no releases) | Vendor from `EgyptianOpenType/` in the repo |
|
|
15
|
+
| BabelStonePseudographica | `babelstone.co.uk/Fonts/Download/BabelStonePseudographica.zip` | 404 | Page exists; download path moved — needs page scrape |
|
|
16
|
+
| Symbola | `dn-works.com/wp-content/uploads/2020/ufas/Symbola.zip` | 404 (site no longer hosts downloads) | Mirror at `github.com/zhm/symbola` (version-pinned) |
|
|
17
|
+
|
|
18
|
+
Plus: `noto-sans`, `noto-sans-cjk-jp`, `noto-sans-arabic`, `noto-sans-telugu`,
|
|
19
|
+
`noto-sans-kannada`, `noto-sans-symbols`, `noto-sans-symbols-2`, `noto-music`,
|
|
20
|
+
`noto-sans-sharada`, `noto-sans-sidetic`, `noto-sans-tolong-siki`,
|
|
21
|
+
`noto-sans-tangut` — none are resolvable via `fontist install` (not in
|
|
22
|
+
the formulas repo).
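
The Status column above could be regenerated with a small check along these lines. This is a hedged sketch: `classify_status` and `head_status` are hypothetical helper names, the status buckets are an assumption, and `head_status` performs a real HEAD request via Ruby's standard `Net::HTTP`, so it needs network access.

```ruby
require "net/http"
require "uri"

# Hypothetical URL health check for specialist font entries.
# classify_status is pure; head_status issues a real HEAD request.
def classify_status(code)
  case code.to_i
  when 200..299 then :ok
  when 300..399 then :moved
  when 404 then :missing
  else :error
  end
end

def head_status(url)
  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.head(uri.request_uri)
  end
  classify_status(response.code)
end
```

Treating redirects as `:moved` rather than following them keeps a relocated download visible in the report, which matches how the 301 case is handled in the table.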
|
|
23
|
+
|
|
24
|
+
## Why a separate TODO
|
|
25
|
+
|
|
26
|
+
The fontist formulas repo (`github.com/fontist/formulas`) doesn't carry
|
|
27
|
+
most Noto variants as separate packages. ucode's pre-build check fails
|
|
28
|
+
hard on the first missing font; without fixes here, TODO 35 cannot
|
|
29
|
+
proceed.
|
|
30
|
+
|
|
31
|
+
Two distinct fixes are needed:
|
|
32
|
+
|
|
33
|
+
1. **Direct-fetch URLs** — for specialists with known canonical sources
|
|
34
|
+
not in fontist (Lentariso, EgyptianText, UniHieroglyphica,
|
|
35
|
+
NotoSerifTaiYo, BabelStone, Symbola). These go through
|
|
36
|
+
`ucode fetch fonts` via `config/specialist_fonts.yml`.
|
|
37
|
+
|
|
38
|
+
2. **fontist formula PRs** — for Noto variants that SHOULD be in
|
|
39
|
+
fontist but aren't yet. Upstream PRs to
|
|
40
|
+
`github.com/fontist/formulas`. Until merged, ucode can fall back
|
|
41
|
+
to direct GitHub release URLs (notofonts.github.io publishes
|
|
42
|
+
release artifacts).

## Scope

### Phase A — Specialist URL refresh (this ucode repo)

1. **Lentariso** — change `url:` from the dead release URL to the
   codeload archive: `https://codeload.github.com/Bry10022/Lentariso/zip/refs/heads/master`,
   set `extract: true`, `extract_member: TTFs/Lentariso-Regular.ttf`
   (and Bold/Italic if needed). Set `extract_multi: true` if the
   fetcher needs to pull multiple members.

2. **EgyptianText** — change `url:` to
   `https://codeload.github.com/microsoft/font-tools/zip/refs/heads/main`,
   set `extract: true`, `extract_member: EgyptianOpenType/EgyptianText-Regular.ttf`.
   License is MIT per the repo; confirm against the LICENSE file
   before recording.

3. **UniHieroglyphica** — change `url:` to the new path under
   `suignard.com/Ptolemaic/`. The exact filename is not yet known and
   needs a HEAD request to discover (likely `UniHieroglyphica-16.0.zip` or
   `UniHieroglyphica.zip`). Record sha256 on first successful fetch.

4. **BabelStonePseudographica** — fetch
   `babelstone.co.uk/Fonts/Pseudographica.html`, parse for the actual
   download link (likely `BabelStonePseudographica.ttf` direct, not
   zip). Update URL accordingly.

5. **Symbola** — change `url:` to
   `https://raw.githubusercontent.com/zhm/symbola/master/fonts/Symbola.ttf`
   (verified HTTP 200). License: OFL per the mirror; confirm upstream
   license matches before recording.

6. **NotoSerifTaiYo** — needs upstream contact (translationcommons.org
   doesn't expose a current download). Options:
   - Email the maintainers (out of scope for code)
   - Mark `local_only: true` and document that the user must supply
     the file
   - Find a GitHub mirror with the font committed

   **Recommendation:** mark `local_only: true` for now, document in
   the entry's `provenance:` field. Pillar 2 (TODO 34) covers
   U+1E6C0–U+1E6FF if the font isn't available.
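
A minimal sketch of how Phase A entries might be sanity-checked before a fetch is attempted. The field names `url`, `extract`, `extract_member`, `local_only` and `provenance` come from the items above; the top-level `fonts:` list, the `name` key, and the `check_entry` / `sha256_of` helpers are assumptions made for illustration and may not match the real `config/specialist_fonts.yml` schema.

```ruby
require "yaml"
require "digest"

# Hypothetical manifest shape: a top-level `fonts:` list of hashes.
MANIFEST = <<~DOC
  fonts:
    - name: Symbola
      url: https://raw.githubusercontent.com/zhm/symbola/master/fonts/Symbola.ttf
    - name: EgyptianText
      url: https://codeload.github.com/microsoft/font-tools/zip/refs/heads/main
      extract: true
      extract_member: EgyptianOpenType/EgyptianText-Regular.ttf
    - name: NotoSerifTaiYo
      local_only: true
      provenance: no public download; user must supply the file
    - name: Broken
      extract: true
DOC

# Returns a list of problems for one entry; an empty list means it looks usable.
def check_entry(entry)
  problems = []
  if entry["local_only"]
    problems << "local_only entry needs provenance" if entry["provenance"].to_s.empty?
    return problems
  end
  problems << "missing url" if entry["url"].to_s.empty?
  if entry["extract"] && entry["extract_member"].to_s.empty?
    problems << "extract: true needs extract_member"
  end
  problems
end

# sha256 to record in the manifest after the first successful fetch.
def sha256_of(bytes)
  Digest::SHA256.hexdigest(bytes)
end

# Map each entry name to its list of problems.
REPORT = YAML.safe_load(MANIFEST).fetch("fonts").to_h do |entry|
  [entry["name"], check_entry(entry)]
end
```

Running a check like this before any network access keeps malformed entries from being mistaken for dead URLs.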

### Phase B — fontist formula PRs (external repo)

For each missing Noto variant, open a PR against
`github.com/fontist/formulas` adding a formula. Each formula is a
small YAML carrying:

- Font metadata (name, license, copyright)
- One or more release URLs with sha256
- Per-platform install paths

Variants to upstream (in priority order):

1. **noto-sans-cjk-jp** — covers the most codepoints; user-visible
   block (CJK Unified Ideographs). Already documented at
   `github.com/notofonts/noto-cjk`.
2. **noto-sans-symbols** + **noto-sans-symbols-2** — cover ~10 symbol
   blocks.
3. **noto-music** — covers Musical Symbols block.
4. **noto-sans-sharada**, **noto-sans-sidetic**, **noto-sans-tolong-siki** —
   UC17 specialists.
5. **noto-sans-arabic**, **noto-sans-telugu**, **noto-sans-kannada** —
   scripts where ucode needs the variant.
6. **noto-sans-tangut** — Tangut block.

### Phase C — Local fallbacks (until Phase B merges)

Until fontist/formulas merges the new formulas, ucode's fetcher
subsystem can pull directly from `notofonts.github.io` release
artifacts (e.g. `https://github.com/notofonts/notofonts.github.io/raw/main/fonts/NotoSansTolongSiki/hinted/ttf/NotoSansTolongSiki-Regular.ttf`).

Extend `specialist_fonts.yml` to include these as fallback entries
when `kind: fontist` resolution fails. The fetcher already supports
`kind: path` for direct URLs; just add Noto variants as path-kind
entries.
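
One way the fallback could be wired, as a sketch only. It assumes every Noto family follows the directory layout of the single example URL above, which should be verified per family. The helper names (`noto_fallback_url`, `fallback_entry`, `resolve`) and the `fontist_lookup` callable are hypothetical and are not part of ucode or fontist.

```ruby
# Assumption: all families share the layout of the one example URL in the text.
NOTO_RAW_BASE = "https://github.com/notofonts/notofonts.github.io/raw/main/fonts"

# Build a direct-download URL for a family such as "NotoSansTolongSiki".
def noto_fallback_url(family, style: "Regular")
  "#{NOTO_RAW_BASE}/#{family}/hinted/ttf/#{family}-#{style}.ttf"
end

# Hypothetical path-kind entry for a font that fontist cannot resolve yet.
def fallback_entry(fontist_name, family)
  { "name" => fontist_name, "kind" => "path", "url" => noto_fallback_url(family) }
end

# Prefer the fontist result; use the path-kind entry only when lookup returns nil.
def resolve(fontist_name, family, fontist_lookup:)
  fontist_lookup.call(fontist_name) || fallback_entry(fontist_name, family)
end
```

Keeping the fontist lookup first means the path-kind entries stop being used automatically once the Phase B formulas are merged.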

## Acceptance

- [ ] `config/specialist_fonts.yml` URLs all return HTTP 200 (or are
      marked `local_only: true` with documented user-supplied path)
- [ ] `ucode fetch fonts` succeeds for every entry (including the
      previously-broken ones); sha256 recorded in the YAML
- [ ] Universal-set pre-check (`ucode universal-set pre-check 17.0.0`)
      reports zero `fontist`-kind missing fonts (path-kind allowed
      for not-yet-upstreamed Noto variants)
- [ ] At least 3 fontist/formulas PRs opened for the highest-priority
      Noto variants (CJK JP, Symbols, Symbols 2)
- [ ] Each PR carries the upstream license + sha256 in the formula YAML

## References

- [TODO 30](30-tier1-font-acquisition.md) — original acquisition design
- [TODO 32](32-uc17-coverage-matrix.md) — what we need these fonts FOR
- `config/specialist_fonts.yml` — current (broken) manifest
- `lib/ucode/commands/fetch.rb` — fetcher implementation

@@ -0,0 +1,147 @@
# 34 — Pillar 2 ContentStreamCorrelator (generalize correlate-v4)

## Goal

Promote pillar 2 (PDF content-stream positional correlation) from a
throwaway proof-of-concept into a first-class fallback in the
canonical 4-tier resolver. Today `lib/ucode/glyphs/embedded_fonts/catalog.rb`
bails at line 226 when `tu_ref` (ToUnicode CMap) is nil; this TODO
makes it delegate to a new `ContentStreamCorrelator` that recovers
CID→codepoint mappings from chart geometry alone.

Proven on Tai Yo (all 54 assigned codepoints correctly mapped without
ToUnicode — see `/tmp/correlate_v4.rb`). This TODO generalizes that
script and makes it the fallback for blocks where:

- Tier 1 font is unavailable (Sidetic if Lentariso unavailable,
  Beria Erfe if Kedebideri unavailable, Egyptian Hieroglyph Format
  Controls gaps)
- The Code Charts PDF embeds subsetted CIDFonts without ToUnicode
  (common for private specimen fonts — Unicode Consortium uses 80+
  such fonts that are not redistributable)

## Why a separate TODO

The Catalog is the entry point for pillar 1 extraction. When
`tu_ref` is nil, it currently returns nil, so the resolver
silently drops to pillar 3 (Last Resort tofu). For blocks like
Egyptian Hieroglyphs (4k+ codepoints where source fonts are
private), this would mean 4k tofu boxes instead of real outlines.

Pillar 2 is the only path for these blocks. Generalizing the Tai Yo
proof is the unlock.

## Algorithm (extracted from correlate-v4)

```ruby
# 1. Render the chart page to SVG via mutool:
#    `mutool draw -F svg <pdf> <page>` produces an SVG with:
#      <defs><path id="font_N_M"/> for every CID M in specimen font N
#      <use xlink:href="#font_N_M" transform="matrix(a,b,c,d,X,Y)"/>
#        for every placement
#
# 2. Partition <use> elements by their font index (the N in font_N_M):
#    - Labels: fonts that emit hex digits (typically font_3, font_8)
#    - Specimens: the CIDFont carrying the actual glyph outlines
#      (typically font_4 or font_6)
#
# 3. Cluster label uses by grid cell (Y-row, X-column):
#      yb = (y / 1.5).round * 1.5     # quantize to row height
#      xb = (x / 50.0).round * 50.0   # quantize to column width
#      clusters[[yb, xb]] << label
#
# 4. Per cluster, sort members by X and join decoded text:
#      decode = ->(s) { s.gsub(/&#x([0-9a-fA-F]+);/) { [$1.to_i(16)].pack("U") } }
#      cp_hex = members.sort_by { |m| m[:x] }.map { |m| decode.call(m[:text]) }.join
#
# 5. The rightmost cluster per Y-row is the specimen codepoint label.
#    The rightmost <use> per Y-row in the specimen font is the
#    specimen glyph placement. CID(M) ↔ codepoint established.
#
# 6. Lift <path id="font_<specimen_idx>_<CID>"> outline, normalize
#    viewBox, emit glyph.svg.
```
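
Steps 3 to 5 can be sketched as runnable Ruby. This is an illustration under stated assumptions, not the contents of `correlate_v4.rb`: the `<use>` elements are assumed to be already parsed into hashes (`{ x:, y:, text: }` for labels, `{ x:, y:, cid: }` for specimens), labels and specimens are assumed to share a baseline closely enough to land in the same row bucket, and the names `bucket` and `correlate` are hypothetical.

```ruby
# Decode numeric character references such as "&#x45;" into text.
DECODE = ->(s) { s.gsub(/&#x([0-9a-fA-F]+);/) { [Regexp.last_match(1).to_i(16)].pack("U") } }

# Integer bucket index; groups the same way as (value / step).round * step.
def bucket(value, step)
  (value / step).round
end

# Returns { codepoint_int => cid_int } for every row that has both a
# plausible hex label and a specimen placement.
def correlate(labels, specimens, row_step: 1.5, col_step: 50.0)
  # Step 3: cluster label glyphs by (row, column) grid cell.
  clusters = Hash.new { |h, k| h[k] = [] }
  labels.each { |l| clusters[[bucket(l[:y], row_step), bucket(l[:x], col_step)]] << l }

  # Step 5a: keep only the rightmost label cluster in each row.
  rightmost = {}
  clusters.each do |(row, col), members|
    next if rightmost[row] && rightmost[row][:col] >= col
    rightmost[row] = { col: col, members: members }
  end

  # Step 5b: the rightmost specimen placement in each row.
  glyphs = {}
  specimens.each do |s|
    row = bucket(s[:y], row_step)
    glyphs[row] = s if glyphs[row].nil? || s[:x] > glyphs[row][:x]
  end

  # Step 4: join each winning cluster's decoded digits, then pair with the CID.
  rightmost.each_with_object({}) do |(row, cluster), mapping|
    hex = cluster[:members].sort_by { |m| m[:x] }.map { |m| DECODE.call(m[:text]) }.join
    next unless glyphs[row] && hex.match?(/\A\h{4,6}\z/)
    mapping[hex.to_i(16)] = glyphs[row][:cid]
  end
end
```

Rows whose joined label is not 4 to 6 hex digits are skipped, so stray text in a row does not produce a bogus mapping.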

The Y-quantization (1.5) and X-quantization (50.0) come from the
Code Charts typesetting convention. They should be parameters, not
constants, because different charts may use different grid sizes. The
values can be discovered empirically: walk all labels, find the
smallest Y-gap, and use that as the quantization base.
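
A minimal sketch of that empirical discovery, assuming the label coordinates are already available as an array of floats. The `min_gap` jitter threshold and the name `discover_step` are assumptions for illustration; the right threshold would have to be tuned against real charts.

```ruby
# Smallest gap between neighbouring coordinates, ignoring jitter below min_gap.
# Returns nil when no gap of at least min_gap exists.
def discover_step(values, min_gap: 0.5)
  values.sort
        .each_cons(2)
        .map { |a, b| b - a }
        .select { |gap| gap >= min_gap }
        .min
end
```

The same function can be applied to the X positions to estimate the column width.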

## Combinator caveat (managed)

Code Charts convention draws combining marks (Mn category) as
"dotted-circle + mark" side-by-side. The dotted circle is a separate
`<use>` element; it does NOT contaminate the mark's glyf outline.
Verified clean on all 5 Tai Yo Mn codepoints.

However, some foundries ship composite glyphs (mark + base in same
glyf). For those we'd need a dotted-circle subtraction step:

1. Detect U+25CC outline in the extracted path (signature: ring of
   small dots)
2. Remove its subpaths from the final glyph

This is a follow-up if any block needs it. Initial implementation
just extracts the outline as-is; compositing artifacts get flagged
in the validator (TODO 35).

## Scope

1. **`Ucode::Glyphs::EmbeddedFonts::ContentStreamCorrelator`** —
   new class next to `Catalog`. API:
   ```ruby
   correlator = ContentStreamCorrelator.new(pdf_page:, specimen_font_index:)
   mapping = correlator.call  # { codepoint_int => cid_int }
   ```

2. **Patch `Catalog#build_entry`** — when `tu_ref` is nil, instead
   of returning nil, delegate to `ContentStreamCorrelator`. Callers are
   unchanged: they see a populated entry regardless of
   whether pillar 1 or pillar 2 produced the mapping.

3. **Page-walk helper** — for a given block PDF, identify the
   specimen font index automatically (currently hardcoded in
   correlate-v4 as font_4). Heuristic: the font with the most
   `<use>` placements AND the highest CID count in `<defs>` is the
   specimen font.

4. **Y-row quantization auto-discovery** — collect all label Y
   positions, find the smallest non-trivial gap, use that as the
   row-height quantization. Same for X-gap → column width.

5. **Path lifting** — given the specimen font index and CID, find
   `<path id="font_<idx>_<cid>">` in the SVG, extract its `d=`
   attribute, normalize the viewBox (typical Code Charts cell is
   ~1000×1000 user units).

6. **mutool integration** — wrap the `mutool draw -F svg` shell
   call. Cache the rendered SVG keyed by PDF path + page number
   under `~/.cache/ucode/unicode/<version>/svg/<block_id>-<page>.svg`.

7. **Specs** — fixture-based tests for:
   - Tai Yo (proven baseline — must reproduce correlate-v4 output
     exactly)
   - Sidetic (no Tier 1 fallback available; pillar 2 mandatory)
   - Beria Erfe (same)
   - At least one block WITH ToUnicode to ensure pillar 1 still
     works (regression guard)
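
Items 3 and 5 could look roughly like the sketch below. It scans the SVG text with regular expressions and assumes the `font_N_M` id scheme described in the Algorithm section; a real implementation would more likely use an XML parser. The function names are hypothetical, and returning nil when the two counts disagree is one possible policy, not something the heuristic above prescribes.

```ruby
# Count placements (<use ... href="#font_N_M">) and defined glyphs
# (<path id="font_N_M">) per font index N.
def font_stats(svg)
  uses = svg.scan(/href="#font_(\d+)_\d+"/).flatten.map(&:to_i).tally
  defs = svg.scan(/\bid="font_(\d+)_\d+"/).flatten.map(&:to_i).tally
  [uses, defs]
end

# Returns the font index that leads on both counts, or nil if they disagree.
def detect_specimen_font(svg)
  uses, defs = font_stats(svg)
  return nil if uses.empty? || defs.empty?

  by_uses = uses.max_by { |_, count| count }.first
  by_defs = defs.max_by { |_, count| count }.first
  by_uses == by_defs ? by_uses : nil
end

# Extract the `d` attribute of <path id="font_<idx>_<cid>">, or nil if absent.
def lift_path(svg, font_idx, cid)
  tag = svg[/<path\b[^>]*\bid="font_#{font_idx}_#{cid}"[^>]*>/]
  tag && tag[/\bd="([^"]*)"/, 1]
end
```

viewBox normalization is left out here because it depends on the cell geometry of the specific chart.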

## Acceptance

- [ ] `ContentStreamCorrelator` class exists with documented API
- [ ] Catalog delegates to it when `tu_ref` is nil
- [ ] Tai Yo test fixture reproduces the correlate-v4 mapping (54/54
      codepoints correctly attributed)
- [ ] Sidetic + Beria Erfe PDFs produce complete mappings via
      pillar 2 (no tofu fallback)
- [ ] Combinator cleanliness check: every Mn codepoint's extracted
      glyph passes the "no U+25CC sub-path" heuristic
- [ ] mutool SVG output is cached; re-runs are no-ops
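
The caching criterion could be met with a wrapper along these lines. The cache layout follows the path given in the Scope section; the function names, the injectable `renderer`, and the `cache_root` parameter are assumptions added so the sketch can be tested without mutool installed. It also assumes `mutool draw -F svg` writes the SVG to stdout when no `-o` is given.

```ruby
require "fileutils"
require "open3"

# Cache location: <cache_root>/unicode/<version>/svg/<block_id>-<page>.svg
def svg_cache_path(version:, block_id:, page:, cache_root: File.expand_path("~/.cache/ucode"))
  File.join(cache_root, "unicode", version, "svg", "#{block_id}-#{page}.svg")
end

# Default renderer: shells out to mutool and returns the SVG text.
MUTOOL = lambda do |pdf, page|
  out, status = Open3.capture2("mutool", "draw", "-F", "svg", pdf, page.to_s)
  raise "mutool failed for #{pdf} page #{page}" unless status.success?

  out
end

# Returns the cached SVG path, rendering only when the file is missing.
def cached_svg(pdf, page, version:, block_id:, cache_root:, renderer: MUTOOL)
  path = svg_cache_path(version: version, block_id: block_id, page: page, cache_root: cache_root)
  return path if File.exist?(path)

  FileUtils.mkdir_p(File.dirname(path))
  File.write(path, renderer.call(pdf, page))
  path
end
```

A second call with the same arguments returns the existing file without invoking the renderer, which is the "re-runs are no-ops" behaviour the checklist asks for.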

## References

- `/tmp/correlate_v4.rb` — proven implementation (112 lines, Tai Yo)
- `lib/ucode/glyphs/embedded_fonts/catalog.rb:226` — bail point
- [TODO 20](20-canonical-resolver-4-tier.md) — original 4-tier design
- [TODO 32](32-uc17-coverage-matrix.md) — pillar 2 fallback policy