ucode 0.2.0 → 0.2.1
- checksums.yaml +4 -4
- data/TODO.extract-code-chart/01-pdf-fetch-validation.md +80 -0
- data/TODO.extract-code-chart/02-block-name-resolver.md +68 -0
- data/TODO.extract-code-chart/03-codechart-namespace.md +82 -0
- data/TODO.extract-code-chart/04-codechart-extractor.md +154 -0
- data/TODO.extract-code-chart/05-provenance-and-sidecar.md +147 -0
- data/TODO.extract-code-chart/06-codechart-writer.md +134 -0
- data/TODO.extract-code-chart/07-codechart-cli.md +135 -0
- data/TODO.extract-code-chart/08-specs.md +87 -0
- data/lib/ucode/cli.rb +99 -0
- data/lib/ucode/code_chart/extractor.rb +122 -0
- data/lib/ucode/code_chart/provenance.rb +81 -0
- data/lib/ucode/code_chart/sidecar.rb +52 -0
- data/lib/ucode/code_chart/writer.rb +128 -0
- data/lib/ucode/code_chart.rb +39 -0
- data/lib/ucode/error.rb +11 -0
- data/lib/ucode/fetch/code_charts.rb +1 -1
- data/lib/ucode/fetch/http.rb +84 -14
- data/lib/ucode/parsers/blocks.rb +34 -0
- data/lib/ucode/version.rb +1 -1
- data/lib/ucode.rb +3 -0
- metadata +15 -2
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: 9c2f1e8cbf000eca25d51f0df33bb07c560396b5bcbb43e1347e8be31609c6e9
|
|
4
|
+
data.tar.gz: 654c8d3d9d63551faf5945c7ae4e42f51caffa81716e5a02b93d551de18aeed7
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: d0100a2f733bd043a3902cb82cbc5a337ae9a1b19bee952ae9fb2ea0827c8b7ed226cf67db12c0ec997c22b942f5ac735093f6b5a7dc6461c1c2e374db41221b
|
|
7
|
+
data.tar.gz: 5e0e6dca7f3b4146aa62bdaf493e00e9ca19335c846a8af10beda76a1530dc2fd8bb8bddbcdf914fe60c9c109cfc8e8c095e412cad591e105779263399783bd6
|
data/TODO.extract-code-chart/01-pdf-fetch-validation.md
ADDED
|
@@ -0,0 +1,80 @@
|
|
|
1
|
+
# TODO 01 — PDF fetch validation
|
|
2
|
+
|
|
3
|
+
## Status
|
|
4
|
+
|
|
5
|
+
Pending.
|
|
6
|
+
|
|
7
|
+
## Goal
|
|
8
|
+
|
|
9
|
+
Raise a typed `Ucode::CodeChartNotFoundError` when a Unicode Code
|
|
10
|
+
Charts PDF cannot be downloaded or fails content validation. The REQ
|
|
11
|
+
(R1) requires:
|
|
12
|
+
|
|
13
|
+
- HTTP 4xx / 5xx → `CodeChartNotFoundError`
|
|
14
|
+
- `Content-Type: application/pdf`
|
|
15
|
+
- First 4 bytes are `%PDF`
|
|
16
|
+
|
|
17
|
+
## Files
|
|
18
|
+
|
|
19
|
+
- `lib/ucode/error.rb` — add `Ucode::CodeChartNotFoundError` under the
|
|
20
|
+
`GlyphError` subtree.
|
|
21
|
+
- `lib/ucode.rb` — add an `autoload` for the new class so any rescue
|
|
22
|
+
clause triggers one load of `error.rb`.
|
|
23
|
+
- `lib/ucode/fetch/http.rb` — extend `Http.get` with an optional
|
|
24
|
+
`validate:` keyword. When `validate: :pdf`, after a successful
|
|
25
|
+
download, verify the `Content-Type` response header starts with
|
|
26
|
+
`application/pdf` and the first 4 bytes of the body are `%PDF`.
|
|
27
|
+
- `lib/ucode/fetch/code_charts.rb` — pass `validate: :pdf` to `Http.get`
|
|
28
|
+
for every chart PDF download.
|
|
29
|
+
- `spec/ucode/fetch/code_charts_spec.rb` (new) — cover the happy path
|
|
30
|
+
and the validation failure modes.
|
|
31
|
+
|
|
32
|
+
## Design
|
|
33
|
+
|
|
34
|
+
### Why a new error class
|
|
35
|
+
|
|
36
|
+
`FetchError` already covers transport failures, but it doesn't carry
|
|
37
|
+
"this URL produced an HTML error page / 404 / non-PDF body" semantics.
|
|
38
|
+
Splitting the type keeps existing `rescue Ucode::FetchError` callers
|
|
39
|
+
from accidentally swallowing the typed signal that "we expected a PDF
|
|
40
|
+
and didn't get one" — which is a different problem class from "the
|
|
41
|
+
network was down."
|
|
42
|
+
|
|
43
|
+
`CodeChartNotFoundError < Ucode::Error` (under `GlyphError`) reflects
|
|
44
|
+
the REQ's framing: the chart for the requested block is not
|
|
45
|
+
obtainable.
|
|
46
|
+
|
|
47
|
+
### Why `validate:` is optional on `Http.get`
|
|
48
|
+
|
|
49
|
+
`Http` is the single network boundary (per the comment at the top of
|
|
50
|
+
`http.rb`). All callers funnel through it. Adding an optional
|
|
51
|
+
keyword keeps the MECE pattern intact: non-PDF callers (UCD zip,
|
|
52
|
+
Unihan zip, font zip) pass nothing; the single PDF caller passes
|
|
53
|
+
`validate: :pdf`. No second network boundary is needed.
|
|
54
|
+
|
|
55
|
+
### Why no separate "magic bytes" check class
|
|
56
|
+
|
|
57
|
+
Magic-byte verification is 4 lines of code; extracting it into a
|
|
58
|
+
class would be ceremony. Inline check after `write_body`, raising
|
|
59
|
+
the typed error with the offending content-type or magic bytes in
|
|
60
|
+
the context payload.
|
|
61
|
+
|
|
62
|
+
## Acceptance
|
|
63
|
+
|
|
64
|
+
- `Http.get(url, dest:, validate: :pdf)` raises
|
|
65
|
+
`CodeChartNotFoundError` (a) when the response Content-Type is not
|
|
66
|
+
`application/pdf`, (b) when the first 4 bytes are not `%PDF`.
|
|
67
|
+
- `Fetch::CodeCharts.call(version, block_first_cps: [...])` raises
|
|
68
|
+
`CodeChartNotFoundError` when the unicode.org endpoint returns
|
|
69
|
+
4xx/5xx or non-PDF content.
|
|
70
|
+
- Existing callers of `Http.get` that don't pass `validate:` are
|
|
71
|
+
unchanged.
|
|
72
|
+
- Spec coverage: happy path, HTTP 404, wrong content-type, truncated
|
|
73
|
+
body missing the `%PDF` magic.
|
|
74
|
+
|
|
75
|
+
## Out of scope
|
|
76
|
+
|
|
77
|
+
- SHA-256 verification of the PDF — that's a downstream concern (the
|
|
78
|
+
Code Charts are not versioned by hash on unicode.org).
|
|
79
|
+
- Resumable / partial downloads — the existing `Http` writes a
|
|
80
|
+
`.part` then renames; that's sufficient.
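The two content checks in TODO 01 can be sketched as below. This is a minimal illustration under stated assumptions, not the code in `lib/ucode/fetch/http.rb`: `validate_pdf!` is a hypothetical helper name, and the `CodeChartNotFoundError` defined here is a local stand-in for `Ucode::CodeChartNotFoundError`.

```ruby
require "tempfile"

# Local stand-in for Ucode::CodeChartNotFoundError (assumption: the real
# class lives in lib/ucode/error.rb and may carry a context payload).
class CodeChartNotFoundError < StandardError; end

PDF_MAGIC = "%PDF".b

# Hypothetical helper: raises unless the Content-Type header starts with
# application/pdf and the first 4 bytes of the downloaded file are "%PDF".
def validate_pdf!(content_type:, path:)
  unless content_type.to_s.downcase.start_with?("application/pdf")
    raise CodeChartNotFoundError, "unexpected content-type: #{content_type.inspect}"
  end

  magic = File.binread(path, 4) # nil for an empty file
  return if magic == PDF_MAGIC

  raise CodeChartNotFoundError, "unexpected magic bytes: #{magic.inspect}"
end

# Usage: a body that starts with the PDF magic passes silently.
Tempfile.create("chart") do |file|
  file.binmode
  file.write("%PDF-1.7\n")
  file.flush
  validate_pdf!(content_type: "application/pdf", path: file.path)
end
```

Checking the header before reading the body means an HTML error page is rejected without inspecting its bytes, and a mislabeled body is still caught by the magic-byte check.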
|
data/TODO.extract-code-chart/02-block-name-resolver.md
ADDED
|
@@ -0,0 +1,68 @@
|
|
|
1
|
+
# TODO 02 — Block name resolver
|
|
2
|
+
|
|
3
|
+
## Status
|
|
4
|
+
|
|
5
|
+
Pending. Depends on nothing.
|
|
6
|
+
|
|
7
|
+
## Goal
|
|
8
|
+
|
|
9
|
+
Add a class method `Ucode::Parsers::Blocks.find_by_name(name)` that
|
|
10
|
+
resolves a Unicode block identifier (e.g. `"Sidetic"`,
|
|
11
|
+
`"Egyptian_Hieroglyphs_Extended-B"`) to the `Ucode::Models::Block`
|
|
12
|
+
instance in a given version's cached `Blocks.txt`.
|
|
13
|
+
|
|
14
|
+
This is the CLI ergonomics glue: the REQ's `ucode code-chart extract
|
|
15
|
+
--block Sidetic` flow takes a human-readable name and needs to know
|
|
16
|
+
the block's range to know which `U+XXXX` codepoints to iterate.
|
|
17
|
+
|
|
18
|
+
## Files
|
|
19
|
+
|
|
20
|
+
- `lib/ucode/parsers/blocks.rb` — add `Blocks.find_by_name(path, name)`.
|
|
21
|
+
- `spec/ucode/parsers/blocks_spec.rb` — cover name lookup, missing-name,
|
|
22
|
+
case-sensitivity.
|
|
23
|
+
|
|
24
|
+
## Design
|
|
25
|
+
|
|
26
|
+
### Method shape
|
|
27
|
+
|
|
28
|
+
```ruby
|
|
29
|
+
# @param path [Pathname, String] path to a Blocks.txt
|
|
30
|
+
# @param name [String] block identifier (matches Models::Block#id)
|
|
31
|
+
# @return [Models::Block, nil] nil when no block matches
|
|
32
|
+
def find_by_name(path, name)
|
|
33
|
+
```
|
|
34
|
+
|
|
35
|
+
Returns nil for "not found" — callers (CLI, Extractor) decide whether
|
|
36
|
+
to raise. This matches `Models::Block` consumers that already expect
|
|
37
|
+
nilable lookups.
|
|
38
|
+
|
|
39
|
+
### Name matching rule
|
|
40
|
+
|
|
41
|
+
`Blocks.txt` uses `name` with whitespace collapsed to underscores
|
|
42
|
+
into `id`. `find_by_name` matches against `id` (the underscored
|
|
43
|
+
form). The REQ's example `--block Sidetic` shows that the caller
|
|
44
|
+
provides the underscored form already. This is consistent with the
|
|
45
|
+
existing `Parsers::Blocks` build logic (`name.gsub(/\s+/, "_")`).
|
|
46
|
+
|
|
47
|
+
### Why a separate method
|
|
48
|
+
|
|
49
|
+
`each_record` streams every block — the caller doesn't want to walk
|
|
50
|
+
~340 blocks for every name lookup. `find_by_name` short-circuits on
|
|
51
|
+
first match.
|
|
52
|
+
|
|
53
|
+
## Acceptance
|
|
54
|
+
|
|
55
|
+
- `find_by_name(path, "Basic_Latin")` returns the Basic Latin block.
|
|
56
|
+
- `find_by_name(path, "Nonexistent")` returns nil.
|
|
57
|
+
- Streaming still works for callers that need every block.
|
|
58
|
+
|
|
59
|
+
## Out of scope
|
|
60
|
+
|
|
61
|
+
- Fuzzy matching — exact match only. Callers validate the user's
|
|
62
|
+
input against `Parsers::Blocks.each_record(path).map(&:id)` to
|
|
63
|
+
surface "did you mean …?" suggestions if we ever want that; for
|
|
64
|
+
now, a clean `UnknownBlockError` at the call site is enough.
|
|
65
|
+
- Database-backed lookup — `Ucode::Database#block_ranges_by_name` is
|
|
66
|
+
a different concern (full UCD index). `find_by_name` operates on
|
|
67
|
+
the cached `Blocks.txt` directly because the CodeChart extractor
|
|
68
|
+
is meant to be runnable without a built database.
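The lookup rule in TODO 02 (exact, case-sensitive match on the underscored id, stopping at the first hit) can be sketched as follows. This is an illustration only: `BlockSketch` is a hypothetical stand-in for `Ucode::Models::Block`, `find_block_by_name` is a hypothetical free function rather than the real `Parsers::Blocks.find_by_name`, and the parser assumes the standard `Blocks.txt` line shape `0000..007F; Basic Latin`.

```ruby
require "tempfile"

# Hypothetical stand-in for Ucode::Models::Block.
BlockSketch = Struct.new(:id, :name, :first, :last, keyword_init: true)

# Matches a data line such as "0000..007F; Basic Latin".
BLOCK_LINE = /\A(\h+)\.\.(\h+);\s*(.+?)\s*\z/

# Streams the file and returns the first block whose underscored id equals
# `name`, or nil when nothing matches. Comparison is exact and case-sensitive.
def find_block_by_name(path, name)
  File.foreach(path) do |line|
    line = line.sub(/#.*/, "").strip # drop comments and blank lines
    next if line.empty?

    match = BLOCK_LINE.match(line)
    next unless match

    id = match[3].gsub(/\s+/, "_")
    next unless id == name

    # Returning here stops the scan at the first match.
    return BlockSketch.new(id: id, name: match[3],
                           first: match[1].hex, last: match[2].hex)
  end
  nil
end
```

Returning nil leaves the decision to raise (for example a hypothetical `UnknownBlockError`) with the caller, as the TODO describes.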
|
data/TODO.extract-code-chart/03-codechart-namespace.md
ADDED
|
@@ -0,0 +1,82 @@
|
|
|
1
|
+
# TODO 03 — CodeChart namespace
|
|
2
|
+
|
|
3
|
+
## Status
|
|
4
|
+
|
|
5
|
+
Pending. Depends on TODO 02 (block name resolver) so the Extractor
|
|
6
|
+
can consume it; depends on TODO 01 (error class) so the namespace
|
|
7
|
+
can declare typed errors.
|
|
8
|
+
|
|
9
|
+
## Goal
|
|
10
|
+
|
|
11
|
+
Establish the `Ucode::CodeChart` namespace as the home for the
|
|
12
|
+
Code Charts per-codepoint extraction feature. The REQ calls this
|
|
13
|
+
namespace `Ucode::CodeChart::*`; we follow the REQ.
|
|
14
|
+
|
|
15
|
+
This is the autoload-hub file plus the autoload declaration in
|
|
16
|
+
`lib/ucode.rb`.
|
|
17
|
+
|
|
18
|
+
## Files
|
|
19
|
+
|
|
20
|
+
- `lib/ucode/code_chart.rb` — new autoload hub (defines
|
|
21
|
+
`Ucode::CodeChart` and declares child autoloads).
|
|
22
|
+
- `lib/ucode.rb` — add `autoload :CodeChart, "ucode/code_chart"` in
|
|
23
|
+
the namespace-hubs block.
|
|
24
|
+
|
|
25
|
+
## Design
|
|
26
|
+
|
|
27
|
+
### Autoload hub shape
|
|
28
|
+
|
|
29
|
+
```ruby
|
|
30
|
+
# lib/ucode/code_chart.rb
|
|
31
|
+
module Ucode
|
|
32
|
+
module CodeChart
|
|
33
|
+
autoload :Extractor, "ucode/code_chart/extractor"
|
|
34
|
+
autoload :Provenance, "ucode/code_chart/provenance"
|
|
35
|
+
autoload :Sidecar, "ucode/code_chart/sidecar"
|
|
36
|
+
autoload :Writer, "ucode/code_chart/writer"
|
|
37
|
+
end
|
|
38
|
+
end
|
|
39
|
+
```
|
|
40
|
+
|
|
41
|
+
Per the global rule (`~/.claude/CLAUDE.md`): declare autoloads in the
|
|
42
|
+
immediate parent namespace's file. `Ucode::CodeChart` is the immediate
|
|
43
|
+
parent of `Extractor`, `Provenance`, `Sidecar`, `Writer`; this file
|
|
44
|
+
is the immediate parent's file.
|
|
45
|
+
|
|
46
|
+
`Ucode` is the immediate parent of `CodeChart`; the autoload
|
|
47
|
+
declaration `autoload :CodeChart, "ucode/code_chart"` goes in
|
|
48
|
+
`lib/ucode.rb`.
|
|
49
|
+
|
|
50
|
+
### Why a new namespace (not under `Ucode::Glyphs`)
|
|
51
|
+
|
|
52
|
+
`Ucode::Glyphs::*` is the existing 4-tier sourcing pipeline
|
|
53
|
+
(`EmbeddedFonts`, `RealFonts`, `LastResort`, `Writer`). The REQ's
|
|
54
|
+
`CodeChart::*` is a feature-facing namespace that orchestrates the
|
|
55
|
+
glyphs pipeline for one specific use case (extracting from a per-block
|
|
56
|
+
PDF for the essenfont donor pipeline). Keeping the feature-facing
|
|
57
|
+
namespace separate from the implementation namespace:
|
|
58
|
+
|
|
59
|
+
- Lets callers say `Ucode::CodeChart.extract(block: "Sidetic")`
|
|
60
|
+
without first knowing about `Glyphs::EmbeddedFonts`.
|
|
61
|
+
- Makes it easy to swap the implementation later (different
|
|
62
|
+
resolution strategy, alternative PDF parser) without breaking the
|
|
63
|
+
public API.
|
|
64
|
+
- Keeps `Glyphs::` focused on tier mechanics, free of feature
|
|
65
|
+
ergonomics.
|
|
66
|
+
|
|
67
|
+
The REQ's namespace name is what we use.
|
|
68
|
+
|
|
69
|
+
## Acceptance
|
|
70
|
+
|
|
71
|
+
- `lib/ucode/code_chart.rb` exists with the autoload declarations.
|
|
72
|
+
- `lib/ucode.rb` has the new `autoload :CodeChart, "ucode/code_chart"`
|
|
73
|
+
in the namespace-hubs block.
|
|
74
|
+
- `Ucode::CodeChart` resolves to a module without loading any of its
|
|
75
|
+
children.
|
|
76
|
+
|
|
77
|
+
## Out of scope
|
|
78
|
+
|
|
79
|
+
- `Ucode::CodeChart::Command` (thin wrapper). The CLI lives in
|
|
80
|
+
`lib/ucode/cli.rb` per the existing pattern; no separate
|
|
81
|
+
`Commands::CodeChartCommand` is introduced (single source of
|
|
82
|
+
truth for CLI dispatch).
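The acceptance criterion that the hub resolves "without loading any of its children" can be checked with `Module#autoload?`, which returns the registered path while a constant is still pending and nil once it has been loaded. The sketch below uses a throwaway `DemoHub` module and temp files as hypothetical stand-ins for `Ucode::CodeChart` and its children.

```ruby
require "tmpdir"

# Write two tiny child files into a temp directory (hypothetical names).
dir = Dir.mktmpdir
%w[Extractor Writer].each do |name|
  File.write(File.join(dir, "demo_#{name.downcase}.rb"),
             "module DemoHub\n  class #{name}; end\nend\n")
end

# The hub only registers autoloads; nothing is required yet. Absolute paths
# are used here; a gem would register load-path-relative names instead.
module DemoHub; end
DemoHub.autoload(:Extractor, File.join(dir, "demo_extractor.rb"))
DemoHub.autoload(:Writer, File.join(dir, "demo_writer.rb"))

p DemoHub.autoload?(:Extractor) # a path while the child is still pending
DemoHub::Extractor.name         # first reference triggers the require
p DemoHub.autoload?(:Extractor) # nil once the constant is defined
p DemoHub.autoload?(:Writer)    # still a path: this child was never touched
```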
|
data/TODO.extract-code-chart/04-codechart-extractor.md
ADDED
|
@@ -0,0 +1,154 @@
|
|
|
1
|
+
# TODO 04 — CodeChart::Extractor
|
|
2
|
+
|
|
3
|
+
## Status
|
|
4
|
+
|
|
5
|
+
Pending. Depends on TODO 01 (error class), TODO 02 (block name
|
|
6
|
+
resolver), TODO 03 (namespace).
|
|
7
|
+
|
|
8
|
+
## Goal
|
|
9
|
+
|
|
10
|
+
`Ucode::CodeChart::Extractor` is the single entry point for
|
|
11
|
+
"extract every assigned codepoint in block X as a standalone SVG."
|
|
12
|
+
|
|
13
|
+
It orchestrates the existing 4-tier resolver (one source of truth for
|
|
14
|
+
"how do I get the SVG for a given codepoint") and returns a list of
|
|
15
|
+
extraction results — one per codepoint — that the downstream Writer
|
|
16
|
+
serializes to disk.
|
|
17
|
+
|
|
18
|
+
This is *not* a new extraction pipeline; it is the existing
|
|
19
|
+
`Ucode::Glyphs::Resolver` with per-block inputs pre-configured.
|
|
20
|
+
|
|
21
|
+
## Files
|
|
22
|
+
|
|
23
|
+
- `lib/ucode/code_chart/extractor.rb` — `Ucode::CodeChart::Extractor`
|
|
24
|
+
class.
|
|
25
|
+
- `spec/ucode/code_chart/extractor_spec.rb` — model/value-object
|
|
26
|
+
specs (constructor invariants, Resolver wiring) plus an integration
|
|
27
|
+
test against the fixture `spec/fixtures/pdfs/basic_latin.pdf`.
|
|
28
|
+
|
|
29
|
+
## Design
|
|
30
|
+
|
|
31
|
+
### Class shape
|
|
32
|
+
|
|
33
|
+
```ruby
|
|
34
|
+
class Ucode::CodeChart::Extractor
|
|
35
|
+
Result = Struct.new(:codepoint, :svg, :tier, :provenance, :base_font,
|
|
36
|
+
:gid, keyword_init: true)
|
|
37
|
+
|
|
38
|
+
def initialize(block:, blocks_txt:, pdf_fetcher: nil,
|
|
39
|
+
font_cache_dir: nil, last_resort_root: nil)
|
|
40
|
+
@block = block # Models::Block
|
|
41
|
+
@blocks_txt = blocks_txt # Pathname
|
|
42
|
+
@pdf_fetcher = pdf_fetcher # optional injectable
|
|
43
|
+
@font_cache_dir = font_cache_dir # default: data/pdf-fonts/
|
|
44
|
+
@last_resort_root = last_resort_root
|
|
45
|
+
end
|
|
46
|
+
|
|
47
|
+
# Walks every assigned codepoint in @block and returns one Result
|
|
48
|
+
# per codepoint. Codepoints with no glyph from any tier are
|
|
49
|
+
# silently skipped (no Result yielded) — the REQ's "skip
|
|
50
|
+
# unassigned codepoints with a warning" is satisfied by the
|
|
51
|
+
# Resolver returning nil for them.
|
|
52
|
+
#
|
|
53
|
+
# @return [Array<Result>]
|
|
54
|
+
def extract
|
|
55
|
+
end
|
|
56
|
+
end
|
|
57
|
+
```
|
|
58
|
+
|
|
59
|
+
### Wiring (single source of truth)
|
|
60
|
+
|
|
61
|
+
The Extractor does NOT implement tier selection. It builds a
|
|
62
|
+
`Ucode::Glyphs::Resolver` and calls `resolver.resolve(codepoint)` for
|
|
63
|
+
each cp. The Resolver's tier order is preserved (Pillar 1 → 3
|
|
64
|
+
for this feature; no Tier 1 because we're starting from the Code
|
|
65
|
+
Charts PDF, not a real-font source).
|
|
66
|
+
|
|
67
|
+
```ruby
|
|
68
|
+
def build_resolver
|
|
69
|
+
pdf = fetch_pdf!
|
|
70
|
+
embedded_source = Glyphs::EmbeddedFonts::Source.new(
|
|
71
|
+
pdf: pdf, cache_dir: @font_cache_dir,
|
|
72
|
+
)
|
|
73
|
+
catalog = Glyphs::EmbeddedFonts::Catalog.new(embedded_source)
|
|
74
|
+
pillar1 = Glyphs::Sources::Pillar1EmbeddedTounicode.new(
|
|
75
|
+
renderer: Glyphs::EmbeddedFonts::Renderer.new(catalog),
|
|
76
|
+
)
|
|
77
|
+
pillar3 = Glyphs::Sources::Pillar3LastResort.new(
|
|
78
|
+
renderer: Glyphs::LastResort::Renderer.new(
|
|
79
|
+
Glyphs::LastResort::Source.new(root: @last_resort_root),
|
|
80
|
+
),
|
|
81
|
+
)
|
|
82
|
+
Glyphs::Resolver.new(
|
|
83
|
+
sources: [pillar1, pillar3],
|
|
84
|
+
order: %i[pillar1 pillar3],
|
|
85
|
+
)
|
|
86
|
+
end
|
|
87
|
+
```
|
|
88
|
+
|
|
89
|
+
The tier ordering is documented inline: we skip Pillar 2 because for
|
|
90
|
+
the CodeChart use case the catalog's ToUnicode is the dominant path
|
|
91
|
+
and Pillar 2 (positional correlation) is reserved for fonts where
|
|
92
|
+
Pillar 1 fails. If a future use case needs Pillar 2, add it without
|
|
93
|
+
changing this constructor — that's the OCP payoff of consuming the
|
|
94
|
+
Resolver.
|
|
95
|
+
|
|
96
|
+
### Why no Tier 1
|
|
97
|
+
|
|
98
|
+
Tier 1 (real-font cmap) needs a configured `SourceConfig` mapping
|
|
99
|
+
block → font. The CodeChart use case is for blocks where no real
|
|
100
|
+
font exists (Sidetic, Egyptian Ext-B). Tier 1 wouldn't contribute
|
|
101
|
+
anything. The Extractor accepts a Tier 1 source in the future by
|
|
102
|
+
having callers pass a fully-built Resolver instead of constructing
|
|
103
|
+
one internally.
|
|
104
|
+
|
|
105
|
+
### Why PDF fetch is delegated to `PdfFetcher`
|
|
106
|
+
|
|
107
|
+
`Ucode::Glyphs::PdfFetcher` is the existing seam for resolving a
|
|
108
|
+
block to its PDF on disk (per-block cache + monolith fallback). It
|
|
109
|
+
already handles `force:` and the cache directory. The Extractor
|
|
110
|
+
constructs a `PdfFetcher` per call (cheap — it's just a path
|
|
111
|
+
resolver) and reuses it across codepoints.
|
|
112
|
+
|
|
113
|
+
### Per-codepoint loop
|
|
114
|
+
|
|
115
|
+
```ruby
|
|
116
|
+
def extract
|
|
117
|
+
resolver = build_resolver
|
|
118
|
+
@block.codepoint_ids.map do |cp_id|
|
|
119
|
+
cp = Integer(cp_id.delete_prefix("U+"), 16)
|
|
120
|
+
resolver_result = resolver.resolve(cp)
|
|
121
|
+
next nil unless resolver_result&.svg
|
|
122
|
+
|
|
123
|
+
Result.new(
|
|
124
|
+
codepoint: cp,
|
|
125
|
+
svg: resolver_result.svg,
|
|
126
|
+
tier: resolver_result.tier,
|
|
127
|
+
provenance: resolver_result.provenance,
|
|
128
|
+
)
|
|
129
|
+
end.compact
|
|
130
|
+
end
|
|
131
|
+
```
|
|
132
|
+
|
|
133
|
+
The Resolver returns a `Sources::Result` (tier + codepoint + svg +
|
|
134
|
+
provenance). We adapt that to the Extractor's `Result` (with
|
|
135
|
+
codepoint + svg + tier + provenance), stripping the resolver-specific
|
|
136
|
+
shape at the boundary.
|
|
137
|
+
|
|
138
|
+
## Acceptance
|
|
139
|
+
|
|
140
|
+
- `Extractor.new(block: ..., blocks_txt: ...)` constructs without
|
|
141
|
+
raising when the block and PDF are present.
|
|
142
|
+
- `#extract` returns one Result per codepoint that any tier
|
|
143
|
+
produced a glyph for.
|
|
144
|
+
- `#extract` skips codepoints no tier could produce (returns no
|
|
145
|
+
Result, not a Result-with-nil).
|
|
146
|
+
- Integration test: against the fixture PDF, at least one codepoint's
|
|
147
|
+
Result has tier `:pillar1` and provenance `"pillar-1:embedded-tounicode"`.
|
|
148
|
+
|
|
149
|
+
## Out of scope
|
|
150
|
+
|
|
151
|
+
- Writing files — that's `CodeChart::Writer` (TODO 06).
|
|
152
|
+
- Provenance JSON — that's `CodeChart::Sidecar` (TODO 05).
|
|
153
|
+
- Tier 1 (real-font) source injection — not needed for the REQ's
|
|
154
|
+
blocks. Future extension point if a real-font fallback is desired.
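The skip-on-miss behaviour from the acceptance list in TODO 04 can be exercised without a PDF by swapping in a stub resolver. Everything below is a sketch under stated assumptions: `StubResolver`, `ResolverResult`, `ExtractResult`, and the free function `extract_results` are hypothetical stand-ins for `Ucode::Glyphs::Resolver`, `Sources::Result`, `Extractor::Result`, and `Extractor#extract`. It uses `filter_map`, which needs Ruby 2.7 or later.

```ruby
# Hypothetical stand-ins mirroring the shapes described in this TODO.
ResolverResult = Struct.new(:tier, :codepoint, :svg, :provenance,
                            keyword_init: true)
ExtractResult = Struct.new(:codepoint, :svg, :tier, :provenance, :base_font,
                           :gid, keyword_init: true)

# Resolves only the codepoints it was given glyphs for; nil otherwise.
class StubResolver
  def initialize(glyphs)
    @glyphs = glyphs
  end

  def resolve(codepoint)
    svg = @glyphs[codepoint]
    return nil unless svg

    ResolverResult.new(tier: :pillar1, codepoint: codepoint, svg: svg,
                       provenance: "pillar-1:embedded-tounicode")
  end
end

# One ExtractResult per codepoint that resolved; misses are dropped rather
# than represented as a result with a nil svg.
def extract_results(codepoint_ids, resolver)
  codepoint_ids.filter_map do |cp_id|
    cp = Integer(cp_id.delete_prefix("U+"), 16)
    hit = resolver.resolve(cp)
    next unless hit&.svg

    ExtractResult.new(codepoint: cp, svg: hit.svg, tier: hit.tier,
                      provenance: hit.provenance)
  end
end

resolver = StubResolver.new(0x41 => "<svg/>", 0x43 => "<svg/>")
results = extract_results(%w[U+0041 U+0042 U+0043], resolver)
puts results.map { |r| format("U+%04X", r.codepoint) }.join(", ")
```

Because the extractor only talks to `resolve`, a fully built resolver with more sources could be passed in later without changing this loop.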
|
data/TODO.extract-code-chart/05-provenance-and-sidecar.md
ADDED
|
@@ -0,0 +1,147 @@
|
|
|
1
|
+
# TODO 05 — CodeChart::Provenance and CodeChart::Sidecar
|
|
2
|
+
|
|
3
|
+
## Status
|
|
4
|
+
|
|
5
|
+
Pending. Depends on TODO 03 (namespace). Depends on TODO 04 (Extractor)
|
|
6
|
+
because Sidecar consumes the Extractor's Result.
|
|
7
|
+
|
|
8
|
+
## Goal
|
|
9
|
+
|
|
10
|
+
`Provenance` is the value object carrying the metadata the REQ (R5)
|
|
11
|
+
requires for each extracted SVG's sidecar JSON. `Sidecar` is the
|
|
12
|
+
writer that serializes one Provenance to disk next to its SVG.
|
|
13
|
+
|
|
14
|
+
## Files
|
|
15
|
+
|
|
16
|
+
- `lib/ucode/code_chart/provenance.rb` — `Ucode::CodeChart::Provenance`
|
|
17
|
+
Struct with all REQ R5 fields plus the construction helper.
|
|
18
|
+
- `lib/ucode/code_chart/sidecar.rb` — `Ucode::CodeChart::Sidecar`
|
|
19
|
+
class (writes a sidecar JSON next to an SVG, idempotent via the
|
|
20
|
+
existing `Ucode::Repo::AtomicWrites`).
|
|
21
|
+
- `spec/ucode/code_chart/provenance_spec.rb`
|
|
22
|
+
- `spec/ucode/code_chart/sidecar_spec.rb`
|
|
23
|
+
|
|
24
|
+
## Design
|
|
25
|
+
|
|
26
|
+
### Provenance value object
|
|
27
|
+
|
|
28
|
+
The REQ (R5) lists these fields:
|
|
29
|
+
|
|
30
|
+
```json
|
|
31
|
+
{
|
|
32
|
+
"codepoint": "U+10920",
|
|
33
|
+
"block": "Sidetic",
|
|
34
|
+
"source_pdf_url": "https://www.unicode.org/charts/PDF/U-10920.pdf",
|
|
35
|
+
"source_pdf_sha256": "...",
|
|
36
|
+
"ucd_version": "17.0.0",
|
|
37
|
+
"extracted_at": "2026-06-30T12:00:00Z",
|
|
38
|
+
"extractor_version": "0.1.0"
|
|
39
|
+
}
|
|
40
|
+
```
|
|
41
|
+
|
|
42
|
+
`Struct` is the right tool — single source of truth for the schema,
|
|
43
|
+
keyword-init for clarity, immutable-by-convention. Mirror the
|
|
44
|
+
existing `Ucode::Repo::BuildReportAccumulator` pattern.
|
|
45
|
+
|
|
46
|
+
```ruby
|
|
47
|
+
Provenance = Struct.new(
|
|
48
|
+
:codepoint, # String "U+10920"
|
|
49
|
+
:block, # String "Sidetic"
|
|
50
|
+
:source_pdf_url, # String
|
|
51
|
+
:source_pdf_sha256, # String (hex digest)
|
|
52
|
+
:ucd_version, # String "17.0.0"
|
|
53
|
+
:extracted_at, # String ISO8601 UTC
|
|
54
|
+
:extractor_version, # String "0.2.0"
|
|
55
|
+
keyword_init: true,
|
|
56
|
+
)
|
|
57
|
+
```
|
|
58
|
+
|
|
59
|
+
`extractor_version` reads from `Ucode::VERSION` so it stays in sync
|
|
60
|
+
with the gem — single source of truth.
|
|
61
|
+
|
|
62
|
+
`extracted_at` is set at construction (not at file write) so the
|
|
63
|
+
field describes the extraction event, not the serialization event.
|
|
64
|
+
|
|
65
|
+
### Provenance → Hash serialization
|
|
66
|
+
|
|
67
|
+
`Provenance#to_h` returns the hash form. NO hand-rolled `to_json` /
|
|
68
|
+
`from_json` per the global rule — `Provenance` is a value object, but
|
|
69
|
+
its schema is simple enough that `to_h` + `JSON.pretty_generate` is
|
|
70
|
+
the framework-driven approach (lutaml-model is overkill for a flat
|
|
71
|
+
struct).
|
|
72
|
+
|
|
73
|
+
Wait — re-reading the global rule: "ALL (de)serialization goes through
|
|
74
|
+
the framework. In Coradoc and any project using `lutaml-model`." So
|
|
75
|
+
the rule is for projects using lutaml-model, and ucode uses lutaml-model
|
|
76
|
+
for UCD models. This Provenance struct is not a UCD model — it's a
|
|
77
|
+
feature-local value object. JSON via `JSON.pretty_generate(provenance.to_h)`
|
|
78
|
+
is acceptable and avoids ceremony.
|
|
79
|
+
|
|
80
|
+
`to_h` produces a hash with the Struct's keyword keys. No
|
|
81
|
+
indirection, no lutaml-model mapping for what is effectively a
|
|
82
|
+
record.
|
|
83
|
+
|
|
84
|
+
### Sidecar writer
|
|
85
|
+
|
|
86
|
+
```ruby
|
|
87
|
+
class Sidecar
|
|
88
|
+
include Ucode::Repo::AtomicWrites
|
|
89
|
+
|
|
90
|
+
def initialize(output_root:)
|
|
91
|
+
@output_root = Pathname.new(output_root)
|
|
92
|
+
end
|
|
93
|
+
|
|
94
|
+
# Writes <output_root>/<cp_id>.json next to the corresponding SVG.
|
|
95
|
+
# Idempotent: re-writing the same content is a no-op (byte-stable).
|
|
96
|
+
#
|
|
97
|
+
# @param provenance [Ucode::CodeChart::Provenance]
|
|
98
|
+
# @return [Pathname] the written path
|
|
99
|
+
def write(provenance)
|
|
100
|
+
path = path_for(provenance)
|
|
101
|
+
payload = JSON.pretty_generate(provenance.to_h)
|
|
102
|
+
write_atomic(path, payload + "\n")
|
|
103
|
+
path
|
|
104
|
+
end
|
|
105
|
+
|
|
106
|
+
private
|
|
107
|
+
|
|
108
|
+
def path_for(provenance)
|
|
109
|
+
@output_root.join("#{provenance.codepoint}.json")
|
|
110
|
+
end
|
|
111
|
+
end
|
|
112
|
+
```
|
|
113
|
+
|
|
114
|
+
`Repo::AtomicWrites#write_atomic` is the project's single source of
|
|
115
|
+
truth for idempotent file writes — bytes-identical re-writes are
|
|
116
|
+
no-ops (the temp-file rename is skipped when content matches).
|
|
117
|
+
Reuse, don't reimplement.
|
|
118
|
+
|
|
119
|
+
### Why a separate Sidecar class
|
|
120
|
+
|
|
121
|
+
A `Provenance.to_disk(path)` method would couple the value object
|
|
122
|
+
to I/O. Keeping the writer separate lets:
|
|
123
|
+
- Tests assert `Provenance#to_h` without touching disk.
|
|
124
|
+
- The Writer (TODO 06) compose `Extractor` + `Sidecar` with explicit
|
|
125
|
+
dependency injection (seam for testing).
|
|
126
|
+
- Future formats (e.g. a different sidecar schema) replace Sidecar
|
|
127
|
+
without touching Provenance.
|
|
128
|
+
|
|
129
|
+
This is MECE: Provenance is data; Sidecar is I/O; Writer is
|
|
130
|
+
orchestration.
|
|
131
|
+
|
|
132
|
+
## Acceptance
|
|
133
|
+
|
|
134
|
+
- `Provenance.new(codepoint: "U+10920", block: "Sidetic", ...)`
|
|
135
|
+
constructs without raising.
|
|
136
|
+
- `Provenance#to_h` returns a Hash with exactly the REQ's fields.
|
|
137
|
+
- `Sidecar#write(provenance)` writes `<codepoint>.json` next to
|
|
138
|
+
where the SVG lives; the JSON content matches `Provenance#to_h`.
|
|
139
|
+
- Re-writing the same Provenance is a no-op (file unchanged).
|
|
140
|
+
- Specs cover all seven REQ fields plus the idempotency guarantee.
|
|
141
|
+
|
|
142
|
+
## Out of scope
|
|
143
|
+
|
|
144
|
+
- License attribution text — the REQ mentions `LICENSE-SOURCES.md`
|
|
145
|
+
obligations, but that's a fontist-side concern (downstream
|
|
146
|
+
essenfont build). Ucode emits provenance; the consumer stitches
|
|
147
|
+
attribution.
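The split between data (`Provenance`) and I/O (`Sidecar`) in TODO 05, together with the byte-stable rewrite guarantee, can be sketched as below. The names `ProvenanceSketch`, `write_if_changed`, and `write_sidecar` are hypothetical, and `write_if_changed` only approximates what the TODO says about `Repo::AtomicWrites#write_atomic` (skip the rename when the content already matches); the real module's behaviour is not shown in this diff.

```ruby
require "json"
require "pathname"
require "tmpdir"

# Hypothetical stand-in for Ucode::CodeChart::Provenance (same seven fields).
ProvenanceSketch = Struct.new(
  :codepoint, :block, :source_pdf_url, :source_pdf_sha256,
  :ucd_version, :extracted_at, :extractor_version,
  keyword_init: true,
)

# Writes payload through a temp file plus rename, unless the target already
# holds exactly these bytes. Returns true only when a write happened.
def write_if_changed(path, payload)
  path = Pathname.new(path)
  return false if path.exist? && path.binread == payload.b

  path.dirname.mkpath
  tmp = Pathname.new("#{path}.part")
  tmp.binwrite(payload)
  File.rename(tmp.to_s, path.to_s)
  true
end

# Serializes one provenance record to <output_root>/<codepoint>.json.
def write_sidecar(output_root, provenance)
  payload = JSON.pretty_generate(provenance.to_h) + "\n"
  path = Pathname.new(output_root).join("#{provenance.codepoint}.json")
  write_if_changed(path, payload)
  path
end
```

Since `extracted_at` is fixed when the record is constructed, serializing the same record twice yields identical bytes, which is what makes the second write a no-op.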
|
data/TODO.extract-code-chart/06-codechart-writer.md
ADDED
|
@@ -0,0 +1,134 @@
|
|
|
1
|
+
# TODO 06 — CodeChart::Writer
|
|
2
|
+
|
|
3
|
+
## Status
|
|
4
|
+
|
|
5
|
+
Pending. Depends on TODO 04 (Extractor) and TODO 05 (Provenance +
|
|
6
|
+
Sidecar).
|
|
7
|
+
|
|
8
|
+
## Goal
|
|
9
|
+
|
|
10
|
+
`Ucode::CodeChart::Writer` is the single entry point for
|
|
11
|
+
"extract every codepoint in block X and write SVG + sidecar JSON
|
|
12
|
+
files under output_dir." It's the orchestration layer the CLI
|
|
13
|
+
calls and the only thing that touches disk.
|
|
14
|
+
|
|
15
|
+
## Files
|
|
16
|
+
|
|
17
|
+
- `lib/ucode/code_chart/writer.rb` — `Ucode::CodeChart::Writer` class.
|
|
18
|
+
- `spec/ucode/code_chart/writer_spec.rb`
|
|
19
|
+
|
|
20
|
+
## Design
|
|
21
|
+
|
|
22
|
+
### Class shape
|
|
23
|
+
|
|
24
|
+
```ruby
|
|
25
|
+
class Ucode::CodeChart::Writer
|
|
26
|
+
Summary = Struct.new(:block, :codepoints_total, :svgs_written,
|
|
27
|
+
:sidecars_written, :pdf_sha256, keyword_init: true)
|
|
28
|
+
|
|
29
|
+
def initialize(output_root:, pdf_path:, cache_dir: nil,
|
|
30
|
+
last_resort_root: nil, blocks_txt: nil)
|
|
31
|
+
@output_root = Pathname.new(output_root)
|
|
32
|
+
@pdf_path = Pathname.new(pdf_path)
|
|
33
|
+
@cache_dir = cache_dir
|
|
34
|
+
@last_resort_root = last_resort_root
|
|
35
|
+
@blocks_txt = blocks_txt || Ucode::Cache.ucd_dir(Ucode::VersionResolver.resolve(nil)).join("Blocks.txt")
|
|
36
|
+
@sidecar = Sidecar.new(output_root: @output_root)
|
|
37
|
+
end
|
|
38
|
+
|
|
39
|
+
# Extracts every codepoint in @block (a Models::Block) and writes
|
|
40
|
+
# SVG + sidecar JSON under @output_root. Returns a Summary.
|
|
41
|
+
#
|
|
42
|
+
# @param block [Ucode::Models::Block]
|
|
43
|
+
# @return [Summary]
|
|
44
|
+
def write(block)
|
|
45
|
+
end
|
|
46
|
+
end
|
|
47
|
+
```
|
|
48
|
+
|
|
49
|
+
### Per-codepoint flow (single source of truth)
|
|
50
|
+
|
|
51
|
+
```ruby
|
|
52
|
+
def write(block)
|
|
53
|
+
output_root_for(block).mkpath
|
|
54
|
+
pdf_sha = sha256(@pdf_path)
|
|
55
|
+
|
|
56
|
+
extractor = Extractor.new(
|
|
57
|
+
block: block,
|
|
58
|
+
blocks_txt: @blocks_txt,
|
|
59
|
+
pdf_path: @pdf_path,
|
|
60
|
+
cache_dir: @cache_dir,
|
|
61
|
+
last_resort_root: @last_resort_root,
|
|
62
|
+
)
|
|
63
|
+
results = extractor.extract
|
|
64
|
+
|
|
65
|
+
svgs = 0
|
|
66
|
+
sidecars = 0
|
|
67
|
+
results.each do |result|
|
|
68
|
+
svg_path = output_root_for(block).join("#{cp_id(result.codepoint)}.svg")
|
|
69
|
+
File.write(svg_path, result.svg) unless svg_path.exist? && File.read(svg_path) == result.svg
|
|
70
|
+
svgs += 1 if svg_path.exist?
|
|
71
|
+
|
|
72
|
+
provenance = build_provenance(block, result.codepoint, pdf_sha)
|
|
73
|
+
@sidecar.write(provenance)
|
|
74
|
+
sidecars += 1
|
|
75
|
+
end
|
|
76
|
+
|
|
77
|
+
Summary.new(
|
|
78
|
+
block: block.id,
|
|
79
|
+
codepoints_total: results.size,
|
|
80
|
+
svgs_written: svgs,
|
|
81
|
+
sidecars_written: sidecars,
|
|
82
|
+
pdf_sha256: pdf_sha,
|
|
83
|
+
)
|
|
84
|
+
end
|
|
85
|
+
```
|
|
86
|
+
|
|
87
|
+
### Why not use `Repo::AtomicWrites` for the SVGs
|
|
88
|
+
|
|
89
|
+
The Sidecar uses `Repo::AtomicWrites` because JSON has a stable
|
|
90
|
+
canonical form. SVG output from `EmbeddedFonts::Svg#to_s` is also
|
|
91
|
+
byte-stable, but the writer pattern is simpler with `File.write` —
|
|
92
|
+
the byte-equality check above guarantees idempotency at the I/O
|
|
93
|
+
layer. Both paths reach the same outcome.
|
|
94
|
+
|
|
95
|
+
If future output formats gain non-stable serialization (timestamps,
|
|
96
|
+
random IDs), the SVG path will need `Repo::AtomicWrites` too. Until
|
|
97
|
+
then, simpler is better.
|
|
98
|
+
|
|
99
|
+
### Why compute `pdf_sha256` once
|
|
100
|
+
|
|
101
|
+
Every Provenance in this block carries the same `source_pdf_sha256`.
|
|
102
|
+
Computing it once avoids 32+ disk reads of the PDF per block
|
|
103
|
+
extraction. The Writer is the single place that knows "one block,
|
|
104
|
+
one PDF, one hash" — pushing the calculation into Sidecar or
|
|
105
|
+
Provenance would require either (a) repeated computation per
|
|
106
|
+
Provenance or (b) a parameter-passing thread through Extractor. Both
|
|
107
|
+
violate locality.
|
|
108
|
+
|
|
109
|
+
### Output layout
|
|
110
|
+
|
|
111
|
+
`<output_root>/<block_id>/<U+XXXX>.svg` and `<U+XXXX>.json`.
|
|
112
|
+
|
|
113
|
+
One folder per block keeps the `Writer`'s output self-contained and
|
|
114
|
+
discoverable — a downstream consumer (fontisan) can iterate a block's
|
|
115
|
+
folder without scanning the whole tree. This mirrors the existing
|
|
116
|
+
`Ucode::Repo::Writers::BlocksWriter` output convention (one folder
|
|
117
|
+
per block, index.json inside).
|
|
118
|
+
|
|
119
|
+
## Acceptance
|
|
120
|
+
|
|
121
|
+
- `Writer#write(block)` creates `<output_root>/<block_id>/` and
|
|
122
|
+
fills it with `<U+XXXX>.svg` + `<U+XXXX>.json` for every
|
|
123
|
+
extracted codepoint.
|
|
124
|
+
- Re-running `Writer#write(block)` with no changes produces
|
|
125
|
+
byte-identical files (no rewrites).
|
|
126
|
+
- `Summary#svgs_written` equals the number of extracted codepoints.
|
|
127
|
+
- Specs cover the full lifecycle including idempotency.
|
|
128
|
+
|
|
129
|
+
## Out of scope
|
|
130
|
+
|
|
131
|
+
- Per-block write isolation — concurrent `Writer#write` calls for
|
|
132
|
+
different blocks are safe (different folders), but the Writer is
|
|
133
|
+
not thread-safe within a single block. That's a parallel-extraction
|
|
134
|
+
concern, not a per-block concern.
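Two helpers that the Writer flow in TODO 06 calls but does not define, `cp_id` and `sha256`, could look like the sketch below, along with the `<output_root>/<block_id>/<U+XXXX>.svg` layout. The bodies are assumptions for illustration, `svg_path_for` is a hypothetical name, and the shipped `writer.rb` may differ.

```ruby
require "digest"
require "pathname"

# Formats an Integer codepoint as "U+XXXX" (at least four hex digits).
def cp_id(codepoint)
  format("U+%04X", codepoint)
end

# Hashes the PDF once per block; the hex digest is then reused for every
# provenance record written for that block.
def sha256(path)
  Digest::SHA256.file(path.to_s).hexdigest
end

# Builds <output_root>/<block_id>/<U+XXXX>.svg for one codepoint.
def svg_path_for(output_root, block_id, codepoint)
  Pathname.new(output_root).join(block_id, "#{cp_id(codepoint)}.svg")
end

puts cp_id(0x10920)
puts svg_path_for("out", "Sidetic", 0x10920)
```

`Digest::SHA256.file` streams the file from disk, so the PDF does not have to be read into memory to compute the digest.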
|