ucode 0.2.0 → 0.2.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: d9d5a29a388017c0338a8d2723f97dfd4ac8985324b6ab7fc9b8921a5f4b0b82
4
- data.tar.gz: f6988cf20f74efc1b94ebfedbd7b9507680de1c0880ffd28ac9a4a00851195a8
3
+ metadata.gz: 9c2f1e8cbf000eca25d51f0df33bb07c560396b5bcbb43e1347e8be31609c6e9
4
+ data.tar.gz: 654c8d3d9d63551faf5945c7ae4e42f51caffa81716e5a02b93d551de18aeed7
5
5
  SHA512:
6
- metadata.gz: 890fec1b40ed6016b269982d32d28b15eb2683dfa7d4e9fc32e5a6633c46fb6f8a94b0d004d1bbd4381fb5bbe629faf00984729036005c12e1558c65d70afe4f
7
- data.tar.gz: 792b58d5eb3434b05008740457b1c432f5810bd45b2d0d80936bd0c539445098b835c1d242093487cf0a4a4754efba2fe725c500b2d3f1a23f4906b4a17809dc
6
+ metadata.gz: d0100a2f733bd043a3902cb82cbc5a337ae9a1b19bee952ae9fb2ea0827c8b7ed226cf67db12c0ec997c22b942f5ac735093f6b5a7dc6461c1c2e374db41221b
7
+ data.tar.gz: 5e0e6dca7f3b4146aa62bdaf493e00e9ca19335c846a8af10beda76a1530dc2fd8bb8bddbcdf914fe60c9c109cfc8e8c095e412cad591e105779263399783bd6
@@ -0,0 +1,80 @@
1
+ # TODO 01 — PDF fetch validation
2
+
3
+ ## Status
4
+
5
+ Pending.
6
+
7
+ ## Goal
8
+
9
+ Raise a typed `Ucode::CodeChartNotFoundError` when a Unicode Code
10
+ Charts PDF cannot be downloaded or fails content validation. The REQ
11
+ (R1) requires:
12
+
13
+ - HTTP 4xx / 5xx → `CodeChartNotFoundError`
14
+ - `Content-Type` other than `application/pdf` → `CodeChartNotFoundError`
15
+ - First 4 bytes other than `%PDF` → `CodeChartNotFoundError`
16
+
17
+ ## Files
18
+
19
+ - `lib/ucode/error.rb` — add `Ucode::CodeChartNotFoundError` under the
20
+ `GlyphError` subtree.
21
+ - `lib/ucode.rb` — add an `autoload` for the new class so any rescue
22
+ clause triggers one load of `error.rb`.
23
+ - `lib/ucode/fetch/http.rb` — extend `Http.get` with an optional
24
+ `validate:` keyword. When `validate: :pdf`, after a successful
25
+ download, verify the `Content-Type` response header starts with
26
+ `application/pdf` and the first 4 bytes of the body are `%PDF`.
27
+ - `lib/ucode/fetch/code_charts.rb` — pass `validate: :pdf` to `Http.get`
28
+ for every chart PDF download.
29
+ - `spec/ucode/fetch/code_charts_spec.rb` (new) — cover the happy path
30
+ and the validation failure modes.
31
+
32
+ ## Design
33
+
34
+ ### Why a new error class
35
+
36
+ `FetchError` already covers transport failures, but it doesn't carry
37
+ "this URL produced an HTML error page / 404 / non-PDF body" semantics.
38
+ Splitting the type keeps existing `rescue Ucode::FetchError` callers
39
+ from accidentally swallowing the typed signal that "we expected a PDF
40
+ and didn't get one" — which is a different problem class from "the
41
+ network was down."
42
+
43
+ `CodeChartNotFoundError` (a `Ucode::Error` descendant, under `GlyphError`) reflects
44
+ the REQ's framing: the chart for the requested block is not
45
+ obtainable.
46
+
47
+ ### Why `validate:` is optional on `Http.get`
48
+
49
+ `Http` is the single network boundary (per the comment at the top of
50
+ `http.rb`). All callers funnel through it. Adding an optional
51
+ keyword keeps the MECE pattern intact: non-PDF callers (UCD zip,
52
+ Unihan zip, font zip) pass nothing; the single PDF caller passes
53
+ `validate: :pdf`. No second network boundary is needed.
54
+
55
+ ### Why no separate "magic bytes" check class
56
+
57
+ Magic-byte verification is 4 lines of code; extracting it into a
58
+ class would be ceremony. Inline check after `write_body`, raising
59
+ the typed error with the offending content-type or magic bytes in
60
+ the context payload.
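A minimal sketch of the inline check described above, assuming the response's Content-Type header and the downloaded body are already in hand. The module name `PdfValidationSketch`, the `validate_pdf!` helper, and the `context:` keyword are illustrative stand-ins and not the gem's actual API.

```ruby
# Illustrative sketch only: names below are hypothetical, not ucode's real classes.
module PdfValidationSketch
  class CodeChartNotFoundError < StandardError
    attr_reader :context

    def initialize(message, context: {})
      super(message)
      @context = context
    end
  end

  PDF_MAGIC = "%PDF".b

  # Raises unless the header starts with application/pdf and the body
  # starts with the four magic bytes. Returns true otherwise.
  def self.validate_pdf!(content_type:, body:)
    unless content_type.to_s.downcase.start_with?("application/pdf")
      raise CodeChartNotFoundError.new(
        "expected application/pdf", context: { content_type: content_type }
      )
    end

    magic = body.to_s.byteslice(0, 4).to_s.b
    return true if magic == PDF_MAGIC

    raise CodeChartNotFoundError.new(
      "body does not start with %PDF", context: { magic: magic }
    )
  end
end
```

Using a prefix match on the header lets values such as `application/pdf; charset=binary` pass, which matches the "starts with" wording in the Files section.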
61
+
62
+ ## Acceptance
63
+
64
+ - `Http.get(url, dest:, validate: :pdf)` raises
65
+ `CodeChartNotFoundError` (a) when the response Content-Type is not
66
+ `application/pdf`, (b) when the first 4 bytes are not `%PDF`.
67
+ - `Fetch::CodeCharts.call(version, block_first_cps: [...])` raises
68
+ `CodeChartNotFoundError` when the unicode.org endpoint returns
69
+ 4xx/5xx or non-PDF content.
70
+ - Existing callers of `Http.get` that don't pass `validate:` are
71
+ unchanged.
72
+ - Spec coverage: happy path, HTTP 404, wrong content-type, truncated
73
+ body missing the `%PDF` magic.
74
+
75
+ ## Out of scope
76
+
77
+ - SHA-256 verification of the PDF — that's a downstream concern (the
78
+ Code Charts are not versioned by hash on unicode.org).
79
+ - Resumable / partial downloads — the existing `Http` writes a
80
+ `.part` then renames; that's sufficient.
@@ -0,0 +1,68 @@
1
+ # TODO 02 — Block name resolver
2
+
3
+ ## Status
4
+
5
+ Pending. Depends on nothing.
6
+
7
+ ## Goal
8
+
9
+ Add a class method `Ucode::Parsers::Blocks.find_by_name(path, name)` that
10
+ resolves a Unicode block identifier (e.g. `"Sidetic"`,
11
+ `"Egyptian_Hieroglyphs_Extended-B"`) to the `Ucode::Models::Block`
12
+ instance in a given version's cached `Blocks.txt`.
13
+
14
+ This is the CLI ergonomics glue: the REQ's `ucode code-chart extract
15
+ --block Sidetic` flow takes a human-readable name and needs to know
16
+ the block's range to know which `U+XXXX` codepoints to iterate.
17
+
18
+ ## Files
19
+
20
+ - `lib/ucode/parsers/blocks.rb` — add `Blocks.find_by_name(path, name)`.
21
+ - `spec/ucode/parsers/blocks_spec.rb` — cover name lookup, missing-name,
22
+ case-sensitivity.
23
+
24
+ ## Design
25
+
26
+ ### Method shape
27
+
28
+ ```ruby
29
+ # @param path [Pathname, String] path to a Blocks.txt
30
+ # @param name [String] block identifier (matches Models::Block#id)
31
+ # @return [Models::Block, nil] nil when no block matches
32
+ def find_by_name(path, name)
33
+ ```
34
+
35
+ Returns nil for "not found" — callers (CLI, Extractor) decide whether
36
+ to raise. This matches `Models::Block` consumers that already expect
37
+ nilable lookups.
38
+
39
+ ### Name matching rule
40
+
41
+ The parser derives `id` from each `Blocks.txt` `name` by collapsing
41
+ whitespace to underscores. `find_by_name` matches against `id` (the underscored
43
+ form). The REQ's example `--block Sidetic` shows that the caller
44
+ provides the underscored form already. This is consistent with the
45
+ existing `Parsers::Blocks` build logic (`name.gsub(/\s+/, "_")`).
46
+
47
+ ### Why a separate method
48
+
49
+ `each_record` streams every block — the caller doesn't want to walk
50
+ ~340 blocks for every name lookup. `find_by_name` short-circuits on
51
+ first match.
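The short-circuiting lookup could look roughly like the sketch below. It assumes the standard `Blocks.txt` line shape (`0000..007F; Basic Latin`) and exact, case-sensitive matching on the underscored id. `BlockSketch` and `find_block_by_name` are hypothetical names used only for illustration; the real method would return a `Models::Block`.

```ruby
# Hypothetical stand-in for Models::Block, for illustration only.
BlockSketch = Struct.new(:id, :name, :first_cp, :last_cp, keyword_init: true)

# Matches data lines such as "0000..007F; Basic Latin"; comment lines
# (starting with "#") and blank lines do not match.
BLOCK_LINE = /\A(\h+)\.\.(\h+);\s*(.+?)\s*\z/

# Streams the file and returns on the first matching id, or nil when
# no block matches.
def find_block_by_name(path, name)
  File.foreach(path) do |line|
    match = BLOCK_LINE.match(line.chomp)
    next unless match

    id = match[3].gsub(/\s+/, "_")
    next unless id == name

    return BlockSketch.new(
      id: id, name: match[3],
      first_cp: match[1].hex, last_cp: match[2].hex
    )
  end
  nil
end
```

Returning from inside the `File.foreach` block stops reading at the first match, so a lookup near the top of the file never touches the remaining lines.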
52
+
53
+ ## Acceptance
54
+
55
+ - `find_by_name(path, "Basic_Latin")` returns the Basic Latin block.
56
+ - `find_by_name(path, "Nonexistent")` returns nil.
57
+ - Streaming still works for callers that need every block.
58
+
59
+ ## Out of scope
60
+
61
+ - Fuzzy matching: exact match only. Callers can validate the user's
62
+ input against `Parsers::Blocks.each_record(path).map(&:id)` to
63
+ surface "did you mean …?" suggestions if we ever want that; for
64
+ now, a clean `UnknownBlockError` at the call site is enough.
65
+ - Database-backed lookup — `Ucode::Database#block_ranges_by_name` is
66
+ a different concern (full UCD index). `find_by_name` operates on
67
+ the cached `Blocks.txt` directly because the CodeChart extractor
68
+ is meant to be runnable without a built database.
@@ -0,0 +1,82 @@
1
+ # TODO 03 — CodeChart namespace
2
+
3
+ ## Status
4
+
5
+ Pending. Depends on TODO 02 (block name resolver) so the Extractor
6
+ can consume it; depends on TODO 01 (error class) so the namespace
7
+ can declare typed errors.
8
+
9
+ ## Goal
10
+
11
+ Establish the `Ucode::CodeChart` namespace as the home for the
12
+ Code Charts per-codepoint extraction feature. The REQ calls this
13
+ namespace `Ucode::CodeChart::*`; we follow the REQ.
14
+
15
+ This is the autoload-hub file plus the autoload declaration in
16
+ `lib/ucode.rb`.
17
+
18
+ ## Files
19
+
20
+ - `lib/ucode/code_chart.rb` — new autoload hub (defines
21
+ `Ucode::CodeChart` and declares child autoloads).
22
+ - `lib/ucode.rb` — add `autoload :CodeChart, "ucode/code_chart"` in
23
+ the namespace-hubs block.
24
+
25
+ ## Design
26
+
27
+ ### Autoload hub shape
28
+
29
+ ```ruby
30
+ # lib/ucode/code_chart.rb
31
+ module Ucode
32
+ module CodeChart
33
+ autoload :Extractor, "ucode/code_chart/extractor"
34
+ autoload :Provenance, "ucode/code_chart/provenance"
35
+ autoload :Sidecar, "ucode/code_chart/sidecar"
36
+ autoload :Writer, "ucode/code_chart/writer"
37
+ end
38
+ end
39
+ ```
40
+
41
+ Per the global rule (`~/.claude/CLAUDE.md`): declare autoloads in the
42
+ immediate parent namespace's file. `Ucode::CodeChart` is the immediate
43
+ parent of `Extractor`, `Provenance`, `Sidecar`, `Writer`; this file
44
+ is the immediate parent's file.
45
+
46
+ `Ucode` is the immediate parent of `CodeChart`; the autoload
47
+ declaration `autoload :CodeChart, "ucode/code_chart"` goes in
48
+ `lib/ucode.rb`.
49
+
50
+ ### Why a new namespace (not under `Ucode::Glyphs`)
51
+
52
+ `Ucode::Glyphs::*` is the existing 4-tier sourcing pipeline
53
+ (`EmbeddedFonts`, `RealFonts`, `LastResort`, `Writer`). The REQ's
54
+ `CodeChart::*` is a feature-facing namespace that orchestrates the
55
+ glyphs pipeline for one specific use case (extracting from a per-block
56
+ PDF for the essenfont donor pipeline). Keeping the feature-facing
57
+ namespace separate from the implementation namespace:
58
+
59
+ - Lets callers say `Ucode::CodeChart.extract(block: "Sidetic")`
60
+ without first knowing about `Glyphs::EmbeddedFonts`.
61
+ - Makes it easy to swap the implementation later (different
62
+ resolution strategy, alternative PDF parser) without breaking the
63
+ public API.
64
+ - Keeps `Glyphs::` focused on tier mechanics, free of feature
65
+ ergonomics.
66
+
67
+ The REQ's namespace name is what we use.
68
+
69
+ ## Acceptance
70
+
71
+ - `lib/ucode/code_chart.rb` exists with the autoload declarations.
72
+ - `lib/ucode.rb` has the new `autoload :CodeChart, "ucode/code_chart"`
73
+ in the namespace-hubs block.
74
+ - `Ucode::CodeChart` resolves to a module without loading any of its
75
+ children.
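One way to check the "without loading any of its children" criterion is `Module#autoload?`, which returns the registered path while a constant is still unloaded. The sketch below uses a hypothetical `UcodeSketch` namespace and an illustrative `unloaded_children` helper; the registered paths do not need to exist, since `autoload` only records them until the constant is first referenced.

```ruby
# Hypothetical namespace mirroring the hub shape; not the gem's real files.
module UcodeSketch
  module CodeChart
    autoload :Extractor, "ucode_sketch/code_chart/extractor"
    autoload :Writer, "ucode_sketch/code_chart/writer"
  end
end

# Module#autoload? returns the registered path while the constant is
# still unloaded, and nil once it has been loaded or was never registered.
def unloaded_children(namespace, names)
  names.select { |name| namespace.autoload?(name) }
end
```

A spec could assert that `unloaded_children` still lists every child right after the hub module itself has been referenced.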
76
+
77
+ ## Out of scope
78
+
79
+ - `Ucode::CodeChart::Command` (thin wrapper). The CLI lives in
80
+ `lib/ucode/cli.rb` per the existing pattern; no separate
81
+ `Commands::CodeChartCommand` is introduced (single source of
82
+ truth for CLI dispatch).
@@ -0,0 +1,154 @@
1
+ # TODO 04 — CodeChart::Extractor
2
+
3
+ ## Status
4
+
5
+ Pending. Depends on TODO 01 (error class), TODO 02 (block name
6
+ resolver), TODO 03 (namespace).
7
+
8
+ ## Goal
9
+
10
+ `Ucode::CodeChart::Extractor` is the single entry point for
11
+ "extract every assigned codepoint in block X as a standalone SVG."
12
+
13
+ It orchestrates the existing 4-tier resolver (one source of truth for
14
+ "how do I get the SVG for a given codepoint") and returns a list of
15
+ extraction results — one per codepoint — that the downstream Writer
16
+ serializes to disk.
17
+
18
+ This is *not* a new extraction pipeline; it is the existing
19
+ `Ucode::Glyphs::Resolver` with per-block inputs pre-configured.
20
+
21
+ ## Files
22
+
23
+ - `lib/ucode/code_chart/extractor.rb` — `Ucode::CodeChart::Extractor`
24
+ class.
25
+ - `spec/ucode/code_chart/extractor_spec.rb` — model/value-object
26
+ specs (constructor invariants, Resolver wiring) plus an integration
27
+ test against the fixture `spec/fixtures/pdfs/basic_latin.pdf`.
28
+
29
+ ## Design
30
+
31
+ ### Class shape
32
+
33
+ ```ruby
34
+ class Ucode::CodeChart::Extractor
35
+ Result = Struct.new(:codepoint, :svg, :tier, :provenance, :base_font,
36
+ :gid, keyword_init: true)
37
+
38
+ def initialize(block:, blocks_txt:, pdf_fetcher: nil,
39
+ font_cache_dir: nil, last_resort_root: nil)
40
+ @block = block # Models::Block
41
+ @blocks_txt = blocks_txt # Pathname
42
+ @pdf_fetcher = pdf_fetcher # optional injectable
43
+ @font_cache_dir = font_cache_dir # default: data/pdf-fonts/
44
+ @last_resort_root = last_resort_root
45
+ end
46
+
47
+ # Walks every assigned codepoint in @block and returns one Result
48
+ # per codepoint. Codepoints with no glyph from any tier are
49
+ # silently skipped (no Result yielded) — the REQ's "skip
50
+ # unassigned codepoints with a warning" is satisfied by the
51
+ # Resolver returning nil for them.
52
+ #
53
+ # @return [Array<Result>]
54
+ def extract
55
+ end
56
+ end
57
+ ```
58
+
59
+ ### Wiring (single source of truth)
60
+
61
+ The Extractor does NOT implement tier selection. It builds a
62
+ `Ucode::Glyphs::Resolver` and calls `resolver.resolve(codepoint)` for
63
+ each cp. The Resolver's tier order is preserved (Pillar 1 → 3
64
+ for this feature; no Tier 1 because we're starting from the Code
65
+ Charts PDF, not a real-font source).
66
+
67
+ ```ruby
68
+ def build_resolver
69
+ pdf = fetch_pdf!
70
+ embedded_source = Glyphs::EmbeddedFonts::Source.new(
71
+ pdf: pdf, cache_dir: @font_cache_dir,
72
+ )
73
+ catalog = Glyphs::EmbeddedFonts::Catalog.new(embedded_source)
74
+ pillar1 = Glyphs::Sources::Pillar1EmbeddedTounicode.new(
75
+ renderer: Glyphs::EmbeddedFonts::Renderer.new(catalog),
76
+ )
77
+ pillar3 = Glyphs::Sources::Pillar3LastResort.new(
78
+ renderer: Glyphs::LastResort::Renderer.new(
79
+ Glyphs::LastResort::Source.new(root: @last_resort_root),
80
+ ),
81
+ )
82
+ Glyphs::Resolver.new(
83
+ sources: [pillar1, pillar3],
84
+ order: %i[pillar1 pillar3],
85
+ )
86
+ end
87
+ ```
88
+
89
+ The tier ordering is documented inline: we skip Pillar 2 because for
90
+ the CodeChart use case the catalog's ToUnicode is the dominant path
91
+ and Pillar 2 (positional correlation) is reserved for fonts where
92
+ Pillar 1 fails. If a future use case needs Pillar 2, add it without
93
+ changing this constructor — that's the OCP payoff of consuming the
94
+ Resolver.
95
+
96
+ ### Why no Tier 1
97
+
98
+ Tier 1 (real-font cmap) needs a configured `SourceConfig` mapping
99
+ block → font. The CodeChart use case is for blocks where no real
100
+ font exists (Sidetic, Egyptian Ext-B). Tier 1 wouldn't contribute
101
+ anything. The Extractor could accept a Tier 1 source in the future by
102
+ having callers pass a fully-built Resolver instead of constructing
103
+ one internally.
104
+
105
+ ### Why PDF fetch is delegated to `PdfFetcher`
106
+
107
+ `Ucode::Glyphs::PdfFetcher` is the existing seam for resolving a
108
+ block to its PDF on disk (per-block cache + monolith fallback). It
109
+ already handles `force:` and the cache directory. The Extractor
110
+ constructs a `PdfFetcher` per call (cheap — it's just a path
111
+ resolver) and reuses it across codepoints.
112
+
113
+ ### Per-codepoint loop
114
+
115
+ ```ruby
116
+ def extract
117
+ resolver = build_resolver
118
+ @block.codepoint_ids.filter_map do |cp_id|
119
+ cp = Integer(cp_id.delete_prefix("U+"), 16)
120
+ resolver_result = resolver.resolve(cp)
121
+ next unless resolver_result&.svg
122
+
123
+ Result.new(
124
+ codepoint: cp,
125
+ svg: resolver_result.svg,
126
+ tier: resolver_result.tier,
127
+ provenance: resolver_result.provenance,
128
+ )
129
+ end
130
+ end
131
+ ```
132
+
133
+ The Resolver returns a `Sources::Result` (tier + codepoint + svg +
134
+ provenance). We adapt that to the Extractor's `Result` (with
135
+ codepoint + svg + tier + provenance), stripping the resolver-specific
136
+ shape at the boundary.
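The skip-on-nil behaviour can be exercised without a PDF by swapping in a fake resolver. This is a sketch under stated assumptions: `FakeResolver`, `SourceResult`, `ExtractResult`, and `extract_sketch` are hypothetical stand-ins, and the real `Glyphs::Resolver` and result types may differ. It uses `filter_map`, which drops nil block results.

```ruby
# Stand-ins for illustration; the real Resolver and Result types differ.
SourceResult = Struct.new(:codepoint, :svg, :tier, :provenance, keyword_init: true)
ExtractResult = Struct.new(:codepoint, :svg, :tier, :provenance, keyword_init: true)

class FakeResolver
  # glyphs: { Integer codepoint => String svg }
  def initialize(glyphs)
    @glyphs = glyphs
  end

  # Returns nil when no tier can produce a glyph for the codepoint.
  def resolve(codepoint)
    svg = @glyphs[codepoint]
    return nil unless svg

    SourceResult.new(codepoint: codepoint, svg: svg, tier: :pillar1,
                     provenance: "pillar-1:embedded-tounicode")
  end
end

# Maps "U+XXXX" ids to results, dropping codepoints the resolver
# could not resolve instead of emitting a result with a nil svg.
def extract_sketch(codepoint_ids, resolver)
  codepoint_ids.filter_map do |cp_id|
    cp = Integer(cp_id.delete_prefix("U+"), 16)
    resolved = resolver.resolve(cp)
    next unless resolved&.svg

    ExtractResult.new(codepoint: cp, svg: resolved.svg,
                      tier: resolved.tier, provenance: resolved.provenance)
  end
end
```

Injecting the resolver this way is also what would let a caller supply a differently configured one later.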
137
+
138
+ ## Acceptance
139
+
140
+ - `Extractor.new(block: ..., blocks_txt: ...)` constructs without
141
+ raising when the block and PDF are present.
142
+ - `#extract` returns one Result per codepoint that any tier
143
+ produced a glyph for.
144
+ - `#extract` skips codepoints no tier could produce (returns no
145
+ Result, not a Result-with-nil).
146
+ - Integration test: against the fixture PDF, at least one codepoint's
147
+ Result has tier `:pillar1` and provenance `"pillar-1:embedded-tounicode"`.
148
+
149
+ ## Out of scope
150
+
151
+ - Writing files — that's `CodeChart::Writer` (TODO 06).
152
+ - Provenance JSON — that's `CodeChart::Sidecar` (TODO 05).
153
+ - Tier 1 (real-font) source injection — not needed for the REQ's
154
+ blocks. Future extension point if a real-font fallback is desired.
@@ -0,0 +1,147 @@
1
+ # TODO 05 — CodeChart::Provenance and CodeChart::Sidecar
2
+
3
+ ## Status
4
+
5
+ Pending. Depends on TODO 03 (namespace). Depends on TODO 04 (Extractor)
6
+ because Sidecar consumes the Extractor's Result.
7
+
8
+ ## Goal
9
+
10
+ `Provenance` is the value object carrying the metadata the REQ (R5)
11
+ requires for each extracted SVG's sidecar JSON. `Sidecar` is the
12
+ writer that serializes one Provenance to disk next to its SVG.
13
+
14
+ ## Files
15
+
16
+ - `lib/ucode/code_chart/provenance.rb` — `Ucode::CodeChart::Provenance`
17
+ Struct with all REQ R5 fields plus the construction helper.
18
+ - `lib/ucode/code_chart/sidecar.rb` — `Ucode::CodeChart::Sidecar`
19
+ class (writes a sidecar JSON next to an SVG, idempotent via the
20
+ existing `Ucode::Repo::AtomicWrites`).
21
+ - `spec/ucode/code_chart/provenance_spec.rb`
22
+ - `spec/ucode/code_chart/sidecar_spec.rb`
23
+
24
+ ## Design
25
+
26
+ ### Provenance value object
27
+
28
+ The REQ (R5) lists these fields:
29
+
30
+ ```json
31
+ {
32
+ "codepoint": "U+10920",
33
+ "block": "Sidetic",
34
+ "source_pdf_url": "https://www.unicode.org/charts/PDF/U-10920.pdf",
35
+ "source_pdf_sha256": "...",
36
+ "ucd_version": "17.0.0",
37
+ "extracted_at": "2026-06-30T12:00:00Z",
38
+ "extractor_version": "0.1.0"
39
+ }
40
+ ```
41
+
42
+ `Struct` is the right tool — single source of truth for the schema,
43
+ keyword-init for clarity, immutable-by-convention. Mirror the
44
+ existing `Ucode::Repo::BuildReportAccumulator` pattern.
45
+
46
+ ```ruby
47
+ Provenance = Struct.new(
48
+ :codepoint, # String "U+10920"
49
+ :block, # String "Sidetic"
50
+ :source_pdf_url, # String
51
+ :source_pdf_sha256, # String (hex digest)
52
+ :ucd_version, # String "17.0.0"
53
+ :extracted_at, # String ISO8601 UTC
54
+ :extractor_version, # String "0.2.0"
55
+ keyword_init: true,
56
+ )
57
+ ```
58
+
59
+ `extractor_version` reads from `Ucode::VERSION` so it stays in sync
60
+ with the gem — single source of truth.
61
+
62
+ `extracted_at` is set at construction (not at file write) so the
63
+ field describes the extraction event, not the serialization event.
64
+
65
+ ### Provenance → Hash serialization
66
+
67
+ `Provenance#to_h` returns the hash form. There is no hand-rolled
68
+ `to_json` / `from_json`: `Provenance` is a value object, and its
69
+ schema is simple enough that `to_h` + `JSON.pretty_generate` is
70
+ sufficient (lutaml-model is overkill for a flat
71
+ struct).
72
+
73
+ The global rule reads: "ALL (de)serialization goes through
74
+ the framework. In Coradoc and any project using `lutaml-model`." The
75
+ rule targets projects using lutaml-model, and ucode uses lutaml-model
76
+ for UCD models. This Provenance struct is not a UCD model; it is a
77
+ feature-local value object, so JSON via `JSON.pretty_generate(provenance.to_h)`
78
+ is acceptable and avoids ceremony.
79
+
80
+ `to_h` produces a hash with the Struct's keyword keys. No
81
+ indirection, no lutaml-model mapping for what is effectively a
82
+ record.
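As a rough illustration of the `to_h` plus `JSON.pretty_generate` route, the sketch below builds the struct with a construction helper that stamps `extracted_at` at build time. `ProvenanceSketch`, its `build` helper with the `now:` keyword, and `provenance_json` are assumptions made for this example, not the gem's actual definitions.

```ruby
require "json"
require "time"

ProvenanceSketch = Struct.new(
  :codepoint, :block, :source_pdf_url, :source_pdf_sha256,
  :ucd_version, :extracted_at, :extractor_version,
  keyword_init: true
) do
  # Hypothetical construction helper: records the extraction time when
  # the object is built, not when it is serialized.
  def self.build(now: Time.now.utc, **fields)
    new(extracted_at: now.iso8601, **fields)
  end
end

# Serializes via the struct's own to_h; no custom to_json is defined.
def provenance_json(provenance)
  JSON.pretty_generate(provenance.to_h) + "\n"
end
```

Passing `now:` explicitly keeps specs deterministic, and the key order of `to_h` follows the member order of the struct.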
83
+
84
+ ### Sidecar writer
85
+
86
+ ```ruby
87
+ class Sidecar
88
+ include Ucode::Repo::AtomicWrites
89
+
90
+ def initialize(output_root:)
91
+ @output_root = Pathname.new(output_root)
92
+ end
93
+
94
+ # Writes <output_root>/<cp_id>.json next to the corresponding SVG.
95
+ # Idempotent: re-writing the same content is a no-op (byte-stable).
96
+ #
97
+ # @param provenance [Ucode::CodeChart::Provenance]
98
+ # @return [Pathname] the written path
99
+ def write(provenance)
100
+ path = path_for(provenance)
101
+ payload = JSON.pretty_generate(provenance.to_h)
102
+ write_atomic(path, payload + "\n")
103
+ path
104
+ end
105
+
106
+ private
107
+
108
+ def path_for(provenance)
109
+ @output_root.join("#{provenance.codepoint}.json")
110
+ end
111
+ end
112
+ ```
113
+
114
+ `Repo::AtomicWrites#write_atomic` is the project's single source of
115
+ truth for idempotent file writes — bytes-identical re-writes are
116
+ no-ops (the temp-file rename is skipped when content matches).
117
+ Reuse, don't reimplement.
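For readers unfamiliar with the pattern, the behaviour described above (skip when bytes match, otherwise write a temp file and rename) can be sketched as follows. This is not the gem's `Repo::AtomicWrites` implementation; `write_atomic_sketch`, the `.part` suffix on the temp file, and the boolean return value are assumptions for illustration.

```ruby
require "pathname"

# Hypothetical helper; not the gem's Repo::AtomicWrites implementation.
# Returns true when bytes were written, false when the write was skipped.
def write_atomic_sketch(path, payload)
  path = Pathname.new(path)
  bytes = payload.b
  # Byte-identical content already on disk: leave the file untouched.
  return false if path.file? && path.binread == bytes

  path.dirname.mkpath
  tmp = Pathname.new("#{path}.part")
  tmp.binwrite(bytes)
  # Same-directory rename, which is atomic on POSIX filesystems.
  tmp.rename(path)
  true
end
```

Comparing raw bytes rather than parsed JSON keeps the check format-agnostic, so the same helper would work for any byte-stable payload.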
118
+
119
+ ### Why a separate Sidecar class
120
+
121
+ A `Provenance.to_disk(path)` method would couple the value object
122
+ to I/O. Keeping the writer separate lets:
123
+ - Tests assert `Provenance#to_h` without touching disk.
124
+ - The Writer (TODO 06) compose `Extractor` + `Sidecar` with explicit
125
+ dependency injection (seam for testing).
126
+ - Future formats (e.g. a different sidecar schema) replace Sidecar
127
+ without touching Provenance.
128
+
129
+ This is MECE: Provenance is data; Sidecar is I/O; Writer is
130
+ orchestration.
131
+
132
+ ## Acceptance
133
+
134
+ - `Provenance.new(codepoint: "U+10920", block: "Sidetic", ...)`
135
+ constructs without raising.
136
+ - `Provenance#to_h` returns a Hash with exactly the REQ's fields.
137
+ - `Sidecar#write(provenance)` writes `<codepoint>.json` next to
138
+ where the SVG lives; the JSON content matches `Provenance#to_h`.
139
+ - Re-writing the same Provenance is a no-op (file unchanged).
140
+ - Specs cover all seven REQ fields plus the idempotency guarantee.
141
+
142
+ ## Out of scope
143
+
144
+ - License attribution text — the REQ mentions `LICENSE-SOURCES.md`
145
+ obligations, but that's a fontist-side concern (downstream
146
+ essenfont build). Ucode emits provenance; the consumer stitches
147
+ attribution.
@@ -0,0 +1,134 @@
1
+ # TODO 06 — CodeChart::Writer
2
+
3
+ ## Status
4
+
5
+ Pending. Depends on TODO 04 (Extractor) and TODO 05 (Provenance +
6
+ Sidecar).
7
+
8
+ ## Goal
9
+
10
+ `Ucode::CodeChart::Writer` is the single entry point for
11
+ "extract every codepoint in block X and write SVG + sidecar JSON
12
+ files under output_dir." It's the orchestration layer the CLI
13
+ calls and the only thing that touches disk.
14
+
15
+ ## Files
16
+
17
+ - `lib/ucode/code_chart/writer.rb` — `Ucode::CodeChart::Writer` class.
18
+ - `spec/ucode/code_chart/writer_spec.rb`
19
+
20
+ ## Design
21
+
22
+ ### Class shape
23
+
24
+ ```ruby
25
+ class Ucode::CodeChart::Writer
26
+ Summary = Struct.new(:block, :codepoints_total, :svgs_written,
27
+ :sidecars_written, :pdf_sha256, keyword_init: true)
28
+
29
+ def initialize(output_root:, pdf_path:, cache_dir: nil,
30
+ last_resort_root: nil, blocks_txt: nil)
31
+ @output_root = Pathname.new(output_root)
32
+ @pdf_path = Pathname.new(pdf_path)
33
+ @cache_dir = cache_dir
34
+ @last_resort_root = last_resort_root
35
+ @blocks_txt = blocks_txt || Ucode::Cache.ucd_dir(Ucode::VersionResolver.resolve(nil)).join("Blocks.txt")
36
+ @sidecar = Sidecar.new(output_root: @output_root)
37
+ end
38
+
39
+ # Extracts every codepoint in block (a Models::Block) and writes
40
+ # SVG + sidecar JSON under @output_root. Returns a Summary.
41
+ #
42
+ # @param block [Ucode::Models::Block]
43
+ # @return [Summary]
44
+ def write(block)
45
+ end
46
+ end
47
+ ```
48
+
49
+ ### Per-codepoint flow (single source of truth)
50
+
51
+ ```ruby
52
+ def write(block)
53
+ output_root_for(block).mkpath
54
+ pdf_sha = sha256(@pdf_path)
55
+
56
+ extractor = Extractor.new(
57
+ block: block,
58
+ blocks_txt: @blocks_txt,
59
+ pdf_path: @pdf_path,
60
+ cache_dir: @cache_dir,
61
+ last_resort_root: @last_resort_root,
62
+ )
63
+ results = extractor.extract
64
+
65
+ svgs = 0
66
+ sidecars = 0
67
+ results.each do |result|
68
+ svg_path = output_root_for(block).join("#{cp_id(result.codepoint)}.svg")
69
+ File.write(svg_path, result.svg) unless svg_path.exist? && File.read(svg_path) == result.svg
70
+ svgs += 1 if svg_path.exist?
71
+
72
+ provenance = build_provenance(block, result.codepoint, pdf_sha)
73
+ @sidecar.write(provenance)
74
+ sidecars += 1
75
+ end
76
+
77
+ Summary.new(
78
+ block: block.id,
79
+ codepoints_total: results.size,
80
+ svgs_written: svgs,
81
+ sidecars_written: sidecars,
82
+ pdf_sha256: pdf_sha,
83
+ )
84
+ end
85
+ ```
86
+
87
+ ### Why not use `Repo::AtomicWrites` for the SVGs
88
+
89
+ The Sidecar uses `Repo::AtomicWrites` because JSON has a stable
90
+ canonical form. SVG output from `EmbeddedFonts::Svg#to_s` is also
91
+ byte-stable, but the writer pattern is simpler with `File.write` —
92
+ the byte-equality check above guarantees idempotency at the I/O
93
+ layer. Both paths reach the same outcome.
94
+
95
+ If future output formats gain non-stable serialization (timestamps,
96
+ random IDs), the SVG path will need `Repo::AtomicWrites` too. Until
97
+ then, simpler is better.
98
+
99
+ ### Why compute `pdf_sha256` once
100
+
101
+ Every Provenance in this block carries the same `source_pdf_sha256`.
102
+ Computing it once avoids 32+ disk reads of the PDF per block
103
+ extraction. The Writer is the single place that knows "one block,
104
+ one PDF, one hash" — pushing the calculation into Sidecar or
105
+ Provenance would require either (a) repeated computation per
106
+ Provenance or (b) a parameter-passing thread through Extractor. Both
107
+ violate locality.
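A small sketch of the hash-once idea, using Ruby's standard `Digest::SHA256.file`. The helper names `sha256_hex` and `provenance_rows`, and the row shape, are hypothetical and only illustrate the data flow.

```ruby
require "digest"

# Streams the file through SHA-256 and returns the lowercase hex digest.
def sha256_hex(path)
  Digest::SHA256.file(path.to_s).hexdigest
end

# Hypothetical shape: hash the PDF once per block, then hand the same
# digest to every per-codepoint record.
def provenance_rows(pdf_path, codepoints)
  pdf_sha = sha256_hex(pdf_path)
  codepoints.map do |cp|
    { codepoint: format("U+%04X", cp), source_pdf_sha256: pdf_sha }
  end
end
```

The PDF is read exactly once regardless of how many codepoints the block contains.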
108
+
109
+ ### Output layout
110
+
111
+ `<output_root>/<block_id>/<U+XXXX>.svg` and `<U+XXXX>.json`.
112
+
113
+ One folder per block keeps the `Writer`'s output self-contained and
114
+ discoverable — a downstream consumer (fontisan) can iterate a block's
115
+ folder without scanning the whole tree. This mirrors the existing
116
+ `Ucode::Repo::Writers::BlocksWriter` output convention (one folder
117
+ per block, index.json inside).
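The layout rule can be sketched as a pure path computation. `cp_id_sketch` and `output_paths_sketch` are hypothetical helper names, and the four-digit minimum padding is an assumption based on the conventional `U+XXXX` notation.

```ruby
require "pathname"

# Formats an Integer codepoint as "U+XXXX" (at least four hex digits).
def cp_id_sketch(codepoint)
  format("U+%04X", codepoint)
end

# Returns the SVG and sidecar paths for one codepoint under
# <output_root>/<block_id>/.
def output_paths_sketch(output_root, block_id, codepoint)
  dir = Pathname.new(output_root).join(block_id)
  id = cp_id_sketch(codepoint)
  { svg: dir.join("#{id}.svg"), sidecar: dir.join("#{id}.json") }
end
```

Keeping the path computation free of I/O makes the layout easy to assert in specs without touching disk.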
118
+
119
+ ## Acceptance
120
+
121
+ - `Writer#write(block)` creates `<output_root>/<block_id>/` and
122
+ fills it with `<U+XXXX>.svg` + `<U+XXXX>.json` for every
123
+ extracted codepoint.
124
+ - Re-running `Writer#write(block)` with no changes produces
125
+ byte-identical files (no rewrites).
126
+ - `Summary#svgs_written` equals the number of extracted codepoints.
127
+ - Specs cover the full lifecycle including idempotency.
128
+
129
+ ## Out of scope
130
+
131
+ - Per-block write isolation — concurrent `Writer#write` calls for
132
+ different blocks are safe (different folders), but the Writer is
133
+ not thread-safe within a single block. That's a parallel-extraction
134
+ concern, not a per-block concern.