woods 1.3.0 → 1.4.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 69b5f822b28adb68fa962350e44b8721811c62bfa25b0350ab8a46b8b121f3d4
4
- data.tar.gz: 6e1e42f994fd57f5de592f662d451e73b967fd4c84744b1638d94e833a044852
3
+ metadata.gz: b2da9b3b863eb794ca880de7a8b327c7edd22f5ea0b027bd705191af85ea755a
4
+ data.tar.gz: 31c23f340816f84d3c1acc8e2cf09daa9bb7009179d1bd51cca51f7833c376e5
5
5
  SHA512:
6
- metadata.gz: a83cb96217695d13bba825a2fb25dcf257bb1de94565c2005fb559771d0670b628db70a19651c7626d02f2baf030a98e116bd7c971f22823ce4300642dc1bd73
7
- data.tar.gz: 8d420c40672e99f2395a410b4f83cf8df761c7f06b42e8558199b7c9d26db9b1ffd9aa866520f3d245a2fcd3bdd2cea9f007c01cfd3ed970d509fe6aebd80eab
6
+ metadata.gz: 69f9bc1e0e83a7894ab0618b1044608f7eb3b869c7a881b04820d033a1a4c66bae7ce56be4c7bd858915679e941354af3f2907c2faa2decac1de8d0a4511913c
7
+ data.tar.gz: 17360ebaf41923cb074d0b829b8940e24fbd3b7724243a0738fe73ee5a4fbcaf43108194d025a955073088dcf2ea0cd19380304191c41d9b325f02dcee43badd
data/CHANGELOG.md CHANGED
@@ -7,6 +7,81 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
7
7
 
8
8
  ## [Unreleased]
9
9
 
10
+ ## [1.4.1] - 2026-06-10
11
+
12
+ ### Fixed
13
+
14
+ - **Unblocked sync: multiple units sharing one file no longer collide on a single
15
+ URI** (#130). A document's URI derives from `file_path`, so a file defining
16
+ several extracted units (nested/namespaced classes, STI subclasses, multiple
17
+ classes in one `.rb`) mapped every unit to the same URI — the remote document
18
+ was overwritten per unit (only the last survived) and, under the content-hash
19
+ manifest, those units re-pushed on every run. The exporter now detects files
20
+ shared by more than one synced unit and disambiguates: the lexically-first
21
+ identifier keeps the bare blob URL, siblings get a `?unit=<identifier>` suffix.
22
+ Solo files (the overwhelming majority) are untouched. Sibling of the
23
+ no-`file_path` guard shipped in 1.4.0.
24
+
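The disambiguation rule in the 1.4.1 entry can be sketched as follows. This is a minimal illustration, not the gem's code: `REPO_URL`, `base_uri`, and `disambiguated_uris` are hypothetical names, every unit is assumed to have a `file_path`, and URL-encoding the identifier in the `?unit=` suffix is an assumption.

```ruby
require 'uri'

# Hypothetical placeholder; the real value comes from the Woods configuration.
REPO_URL = 'https://example.com/acme/app'

# Bare blob URL for a unit hash with a 'file_path' key.
def base_uri(unit)
  "#{REPO_URL}/blob/main/#{unit['file_path']}"
end

# Returns { identifier => uri }. Within each group of units sharing a file,
# the lexically-first identifier keeps the bare URI; every sibling gets a
# `?unit=<identifier>` suffix. Solo files keep the bare URI.
def disambiguated_uris(units)
  units.group_by { |u| base_uri(u) }.each_with_object({}) do |(uri, group), out|
    primary = group.map { |u| u['identifier'] }.min
    group.each do |u|
      id = u['identifier']
      # Encoding the identifier is an assumption made for this sketch.
      out[id] = id == primary ? uri : "#{uri}?unit=#{URI.encode_www_form_component(id)}"
    end
  end
end
```

Because the primary is chosen by sorting identifiers rather than by extraction order, the same unit should keep the same URI from run to run, which is what a content-hash manifest keyed by URI needs.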
25
+ ## [1.4.0] - 2026-06-10
26
+
27
+ ### Added — Incremental Unblocked sync (PR #128)
28
+
29
+ - **`woods:unblocked_sync` is now incremental.** A new `Woods::Unblocked::SyncManifest`
30
+ (JSON at `<output_dir>/unblocked_sync_manifest.json`) records the content hash and
31
+ remote document id of everything last pushed. Each run skips unchanged documents,
32
+ pushes only new/changed ones, and deletes documents whose source unit disappeared.
33
+ A missing manifest (first run / CI cache miss) degrades to a correct full re-push
34
+ that rebuilds it; steady state on an unchanged codebase costs ~0 API calls (was
35
+ ~800–1200 per run). Persist the manifest across CI runs via your provider's cache —
36
+ see `docs/UNBLOCKED_INTEGRATION.md`.
37
+ - **Deletion safety.** Orphan purging is skipped when the daily API budget exhausts
38
+ mid-run, and a mass-deletion guard refuses to delete more than 30% of a ≥10-entry
39
+ manifest in one run (`UNBLOCKED_FORCE_PURGE=1` overrides) — protection against
40
+ syncing a partial index. `UNBLOCKED_FORCE_FULL_SYNC=1` re-pushes everything (use
41
+ after a document-format change). Both flags parse `1`/`true`/`yes` (case-insensitive).
42
+ - **`Client#list_documents` / `#all_documents`** — paginated document listing with
43
+ client-side collection filtering, used to reconcile remote document ids when the
44
+ manifest is missing. Cursors are URL-encoded and a page without a cursor id stops
45
+ pagination rather than looping against the rate budget.
46
+ - **`Woods::Unblocked::ApiError`** (subclass of `Woods::Error`) carries the required
47
+ HTTP `status` of failed API calls; a 404 on delete is treated as already-gone.
48
+ **`Woods::Unblocked::BudgetExhaustedError`** (also a `Woods::Error` subclass) is
49
+ raised by the rate limiter, so budget detection no longer depends on message text.
50
+ - **CI-visible failures.** `woods:unblocked_sync` exits non-zero when the sync
51
+ recorded errors (delete failures are now surfaced in the error list too). The one
52
+ tolerated shape is budget exhaustion with partial progress — the expected
53
+ cold-start outcome, which converges on the next run. Reconcile aborts loudly on
54
+ auth failures (401/403) instead of burning the budget on doomed calls.
55
+ - **Deterministic document bodies.** `DocumentBuilder` sorts every rendered collection
56
+ (associations, dependents, routes, enums, scopes, concerns, callbacks) so an
57
+ unchanged unit always produces byte-identical output — the precondition for
58
+ hash-based change detection.
59
+ - **`Client#create_collection` defaults `iconUrl`** to the repo-hosted Woods mark.
60
+ The live API rejects collection creation without an `iconUrl` despite the API docs
61
+ marking it optional (documented quirk).
62
+ - **Branding.** Tree Rings logo set under `assets/` (marks, wordmark lockups, PNG
63
+ exports); README wordmark.
64
+
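A rough sketch of the mass-deletion guard described in the "Deletion safety" entry above. The two constants match the values in the exporter diff; the method name `purge_blocked?` and its keyword arguments are hypothetical, and whether the real guard compares with `>` or `>=` at exactly 30% is an assumption here.

```ruby
PURGE_GUARD_FRACTION = 0.30
PURGE_GUARD_MIN_DOCS = 10

# True when a purge should be refused: the manifest is large enough for the
# guard to apply and more than PURGE_GUARD_FRACTION of it would be deleted.
# force_purge (the UNBLOCKED_FORCE_PURGE override) bypasses the check.
def purge_blocked?(manifest_size:, stale_count:, force_purge: false)
  return false if force_purge
  return false if manifest_size < PURGE_GUARD_MIN_DOCS

  stale_count > manifest_size * PURGE_GUARD_FRACTION
end
```

The minimum-size threshold keeps the guard from firing on small collections, where removing one or two units would otherwise trip the percentage limit.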
65
+ ### Fixed
66
+
67
+ - `Client#list_collections` no longer raises `TypeError` on the live API's bare-array
68
+ response.
69
+ - API error messages now surface RFC7807 `title`/`detail` fields (previously
70
+ "Unknown error").
71
+ - `require 'woods/unblocked/client'` works standalone (previously needed `woods`
72
+ loaded first).
73
+ - Units without a `file_path` are skipped instead of synced. Previously every
74
+ such unit fell back to the bare repo URL as its document URI — and since URIs
75
+ are the upsert key, they silently overwrote each other in the collection
76
+ (and would have ping-ponged the new manifest hash every run).
77
+
78
+ ### Build
79
+
80
+ - The suite now installs and runs on Ruby 4.0: the optional `tokenizers` gem (whose
81
+ native extension cannot build against the Ruby 4.0 ABI) is gated behind
82
+ `install_if (Ruby < 4.0)`, and `benchmark` (no longer a default gem in 4.0) is
83
+ declared explicitly. Lockfile unchanged.
84
+
10
85
  ## [1.3.0] - 2026-05-13
11
86
 
12
87
  ### Upgrade Notes
data/README.md CHANGED
@@ -1,3 +1,7 @@
1
+ <p align="center">
2
+ <img src="assets/woods-wordmark-white-with-bg.png" width="400" alt="woods">
3
+ </p>
4
+
1
5
  # Woods
2
6
 
3
7
  **Your AI coding assistant is guessing about your Rails app. Woods gives it the real answers.**
@@ -1,7 +1,7 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  # Woods configuration
4
- # Full reference: https://github.com/bigcartel/woods/blob/main/docs/CONFIGURATION_REFERENCE.md
4
+ # Full reference: https://github.com/lost-in-the/woods/blob/main/docs/CONFIGURATION_REFERENCE.md
5
5
  #
6
6
  # Quick-start presets (uncomment one instead of the full block below):
7
7
  # Woods.configure_with_preset(:local) # in-memory + Ollama, no external services
data/lib/tasks/woods.rake CHANGED
@@ -604,25 +604,46 @@ namespace :woods do
604
604
  end
605
605
 
606
606
  output_dir = ENV.fetch('WOODS_OUTPUT', config.output_dir)
607
+ # Truthy set, so FLAG=false / FLAG=0 disables rather than silently enabling.
608
+ env_flag = ->(name) { %w[1 true yes].include?(ENV.fetch(name, '').strip.downcase) }
609
+ force_full = env_flag.call('UNBLOCKED_FORCE_FULL_SYNC')
610
+ force_purge = env_flag.call('UNBLOCKED_FORCE_PURGE')
607
611
 
608
612
  puts 'Syncing extraction data to Unblocked...'
609
613
  puts " Output dir: #{output_dir}"
610
614
  puts " Collection: #{config.unblocked_collection_id}"
611
615
  puts " Repo URL: #{config.unblocked_repo_url}"
616
+ puts ' Mode: full re-sync (UNBLOCKED_FORCE_FULL_SYNC set)' if force_full
612
617
  puts
613
618
 
614
- exporter = Woods::Unblocked::Exporter.new(index_dir: output_dir)
619
+ exporter = Woods::Unblocked::Exporter.new(
620
+ index_dir: output_dir,
621
+ force_full: force_full,
622
+ force_purge: force_purge
623
+ )
615
624
  stats = exporter.sync_all
616
625
 
617
626
  puts
618
627
  puts 'Sync complete!'
619
628
  puts " Documents synced: #{stats[:synced]}"
620
629
  puts " Documents skipped: #{stats[:skipped]}"
630
+ puts " Documents deleted: #{stats[:deleted]}"
621
631
 
622
632
  if stats[:errors].any?
623
633
  puts " Errors: #{stats[:errors].size}"
624
634
  stats[:errors].first(5).each { |e| puts " - #{e}" }
625
635
  puts " ... and #{stats[:errors].size - 5} more" if stats[:errors].size > 5
636
+
637
+ # Fail the task so CI notices — a printed-but-green run is invisible in
638
+ # post-merge pipelines (a dead token would otherwise stay green forever).
639
+ # Exception: budget exhaustion *with* partial progress is the expected
640
+ # cold-start shape; it converges on the next run.
641
+ budget_only = stats[:errors].all? { |e| e.include?('daily budget exhausted') }
642
+ unless budget_only && stats[:synced].positive?
643
+ puts
644
+ puts 'Sync completed with errors — failing so CI surfaces it.'
645
+ exit 1
646
+ end
626
647
  end
627
648
  end
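The truthy-set flag parsing in the task above can be exercised in isolation. A minimal sketch: `env_flag?` and `TRUTHY` are hypothetical names, and the injectable `env` argument is added only so the behavior can be checked without mutating `ENV`.

```ruby
TRUTHY = %w[1 true yes].freeze

# Returns true only for 1/true/yes (case-insensitive, surrounding whitespace
# ignored). Unset, empty, "0", and "false" all disable the flag.
def env_flag?(name, env = ENV)
  TRUTHY.include?(env.fetch(name, '').strip.downcase)
end
```

For example, `env_flag?('UNBLOCKED_FORCE_PURGE', { 'UNBLOCKED_FORCE_PURGE' => 'false' })` is false, whereas a plain presence check on the variable would have treated it as enabled.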
628
649
 
@@ -226,10 +226,10 @@ module Woods
226
226
 
227
227
  # Parent chain for understanding inherited behavior
228
228
  ancestors: controller.ancestors
229
- .take_while { |a| a != ActionController::Base && a != ActionController::API }
230
- .grep(Class)
231
- .map(&:name)
232
- .compact,
229
+ .take_while { |a| a != ActionController::Base && a != ActionController::API }
230
+ .grep(Class)
231
+ .map(&:name)
232
+ .compact,
233
233
 
234
234
  # Concerns included
235
235
  included_concerns: extract_included_concerns(controller),
@@ -3,10 +3,28 @@
3
3
  require 'json'
4
4
  require 'net/http'
5
5
  require 'uri'
6
+ require 'woods'
6
7
  require_relative 'rate_limiter'
7
8
 
8
9
  module Woods
9
10
  module Unblocked
11
+ # API error carrying the HTTP status code, so callers can branch on
12
+ # status (e.g. treat a 404 on delete as "already gone") instead of
13
+ # matching message strings. Subclasses Woods::Error, so existing
14
+ # +rescue Woods::Error+ sites keep working unchanged.
15
+ class ApiError < Woods::Error
16
+ # @return [Integer] HTTP status code of the failed response
17
+ attr_reader :status
18
+
19
+ # @param message [String] Error message
20
+ # @param status [Integer] HTTP status code — required, because callers
21
+ # branch on it (a nil status would silently miss every status check)
22
+ def initialize(message, status:)
23
+ super(message)
24
+ @status = Integer(status)
25
+ end
26
+ end
27
+
10
28
  # REST client for the Unblocked API v1.
11
29
  #
12
30
  # Handles document and collection CRUD with rate limiting, retries,
@@ -25,6 +43,12 @@ module Woods
25
43
  BASE_URL = 'https://getunblocked.com/api/v1'
26
44
  MAX_RETRIES = 3
27
45
  DEFAULT_TIMEOUT = 30
46
+ # Max page size the list endpoint accepts (per API docs).
47
+ PAGE_SIZE = 200
48
+ # Repo-hosted Woods mark, used as the collection icon when none is given.
49
+ # The live API rejects collection creation without an iconUrl (despite
50
+ # the API docs marking it optional), so a working default matters.
51
+ DEFAULT_ICON_URL = 'https://raw.githubusercontent.com/lost-in-the/woods/main/assets/woods-mark-black.svg'
28
52
 
29
53
  # @param api_token [String] Unblocked API token (Personal or Team)
30
54
  # @param rate_limiter [RateLimiter] Rate limiter instance
@@ -60,12 +84,17 @@ module Woods
60
84
  #
61
85
  # @param name [String] Collection name (1-32 chars)
62
86
  # @param description [String] Collection description (1-4096 chars)
63
- # @param icon_url [String, nil] Optional icon URL
87
+ # @param icon_url [String, nil] Icon URL. The live API rejects creation
88
+ # with a bare 400 when omitted (despite the API docs marking it
89
+ # optional), so nil falls back to DEFAULT_ICON_URL — the repo-hosted
90
+ # Woods mark.
64
91
  # @return [Hash] { "id" => "collection-uuid", "name" => "...", ... }
65
92
  def create_collection(name:, description:, icon_url: nil)
66
- body = { name: name, description: description }
67
- body[:iconUrl] = icon_url if icon_url
68
- request(:post, 'collections', body)
93
+ request(:post, 'collections', {
94
+ name: name,
95
+ description: description,
96
+ iconUrl: icon_url || DEFAULT_ICON_URL
97
+ })
69
98
  end
70
99
 
71
100
  # List all collections.
@@ -73,6 +102,10 @@ module Woods
73
102
  # @return [Array<Hash>] Collection objects
74
103
  def list_collections
75
104
  result = request(:get, 'collections')
105
+ # The live API returns a bare JSON array; the envelope fallbacks are
106
+ # defensive (calling ['items'] on an Array raises TypeError).
107
+ return result if result.is_a?(Array)
108
+
76
109
  result['items'] || result['data'] || [result].flatten.compact
77
110
  end
78
111
 
@@ -84,6 +117,52 @@ module Woods
84
117
  request(:delete, "documents/#{document_id}")
85
118
  end
86
119
 
120
+ # List a single page of documents.
121
+ #
122
+ # The endpoint returns a bare JSON array of document metadata (no body):
123
+ # `id, collectionId, title, uri, createdAt, updatedAt`. Pagination is
124
+ # cursor-based via `after`/`before` (opaque cursors); there is no
125
+ # server-side collection filter.
126
+ #
127
+ # @param limit [Integer] Page size (1-200)
128
+ # @param after [String, nil] Opaque forward cursor (typically the last id)
129
+ # @return [Array<Hash>] One page of document metadata
130
+ def list_documents(limit: PAGE_SIZE, after: nil)
131
+ query = "limit=#{limit}"
132
+ query += "&after=#{URI.encode_www_form_component(after)}" if after
133
+ result = request(:get, "documents?#{query}")
134
+ return result if result.is_a?(Array)
135
+
136
+ result['items'] || result['data'] || []
137
+ end
138
+
139
+ # List every document in a collection, paging until exhausted.
140
+ #
141
+ # Filters client-side on `collectionId` since the API has no collection
142
+ # filter. ~5 calls for ~1000 documents; each goes through the rate limiter.
143
+ #
144
+ # @param collection_id [String] Collection UUID to filter to
145
+ # @return [Array<Hash>] All matching document metadata
146
+ def all_documents(collection_id:)
147
+ docs = []
148
+ after = nil
149
+
150
+ loop do
151
+ page = list_documents(limit: PAGE_SIZE, after: after)
152
+ break if page.empty?
153
+
154
+ docs.concat(page)
155
+ break if page.size < PAGE_SIZE
156
+
157
+ after = page.last['id']
158
+ # A full page with no cursor id would refetch page 1 forever —
159
+ # stop with what we have rather than loop against the budget.
160
+ break if after.nil?
161
+ end
162
+
163
+ docs.select { |doc| doc['collectionId'] == collection_id }
164
+ end
165
+
87
166
  private
88
167
 
89
168
  def request(method, path, body = nil)
@@ -155,8 +234,10 @@ module Woods
155
234
  rescue JSON::ParserError, TypeError
156
235
  { 'message' => response.body&.slice(0, 200) || 'Unknown error' }
157
236
  end
158
- message = parsed['message'] || parsed['error'] || 'Unknown error'
159
- raise Woods::Error, "Unblocked API error #{response.code}: #{message}"
237
+ # The Unblocked API returns RFC7807-style bodies ({ status, title, detail });
238
+ # older/other paths use message/error. Check all so failures stay legible.
239
+ message = parsed['message'] || parsed['error'] || parsed['detail'] || parsed['title'] || 'Unknown error'
240
+ raise ApiError.new("Unblocked API error #{response.code}: #{message}", status: response.code.to_i)
160
241
  end
161
242
  end
162
243
  end
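The pagination loop in `all_documents` can be tried against an in-memory stand-in. In this sketch `collect_all` and `fetch_page` are hypothetical: `fetch_page` is any callable that accepts `limit:` and `after:` and returns an array of hashes, standing in for `Client#list_documents`.

```ruby
PAGE_SIZE = 200

# Pages through fetch_page until a short or empty page is returned.
# A full page whose last entry has no 'id' also stops the loop, since
# continuing with a nil cursor would refetch the first page forever.
def collect_all(fetch_page, page_size: PAGE_SIZE)
  docs = []
  after = nil

  loop do
    page = fetch_page.call(limit: page_size, after: after)
    break if page.empty?

    docs.concat(page)
    break if page.size < page_size

    after = page.last['id']
    break if after.nil?
  end

  docs
end
```

Filtering on `collectionId` would then happen on the returned array, as in the diff above.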
@@ -27,23 +27,30 @@ module Woods
27
27
  def build(unit_data)
28
28
  type = unit_data['type']
29
29
  identifier = unit_data['identifier']
30
- file_path = unit_data['file_path']
31
30
 
32
31
  {
33
32
  title: "#{identifier} (#{type})",
34
33
  body: build_body(unit_data),
35
- uri: build_uri(file_path)
34
+ uri: uri_for(unit_data)
36
35
  }
37
36
  end
38
37
 
39
- private
40
-
41
- def build_uri(file_path)
38
+ # The citation URI for a unit (GitHub blob URL, or the repo root when the
39
+ # unit has no file_path). Public so callers can compute a unit's URI
40
+ # cheaply — e.g. to build the set of currently-existing URIs — without
41
+ # building the full document body.
42
+ #
43
+ # @param unit_data [Hash] Parsed unit JSON (needs 'file_path')
44
+ # @return [String] Citation URI
45
+ def uri_for(unit_data)
46
+ file_path = unit_data['file_path']
42
47
  return @repo_url unless file_path
43
48
 
44
49
  "#{@repo_url}/blob/main/#{file_path}"
45
50
  end
46
51
 
52
+ private
53
+
47
54
  def build_body(unit_data)
48
55
  type = unit_data['type']
49
56
  body = case type
@@ -125,7 +132,9 @@ module Woods
125
132
  dep = a.dig('options', 'dependent')
126
133
  dep ? "#{name} (#{dep})" : name
127
134
  end
128
- lines << "**#{type}:** #{targets.join(', ')}"
135
+ # Sorted so the body is a function of association content, not order
136
+ # (the exporter hashes this body to detect changes).
137
+ lines << "**#{type}:** #{targets.sort.join(', ')}"
129
138
  end
130
139
 
131
140
  lines.join("\n")
@@ -136,7 +145,7 @@ module Woods
136
145
  return nil if deps.empty?
137
146
 
138
147
  grouped = deps.group_by { |d| d['type'] }
139
- summary_parts = grouped.map { |type, items| "#{items.size} #{type}s" }
148
+ summary_parts = grouped.sort_by { |type, _| type.to_s }.map { |type, items| "#{items.size} #{type}s" }
140
149
 
141
150
  lines = ["## Dependents (#{deps.size} units)"]
142
151
  lines << summary_parts.join(', ')
@@ -160,9 +169,9 @@ module Woods
160
169
  return nil if controllers.empty? && graphql.empty?
161
170
 
162
171
  lines = ['## Entry Points']
163
- lines << "**Controllers:** #{controllers.map { |c| c['identifier'] }.join(', ')}" if controllers.any?
164
- lines << "**GraphQL:** #{graphql.map { |g| g['identifier'] }.join(', ')}" if graphql.any?
165
- lines << "**Jobs:** #{jobs.map { |j| j['identifier'] }.join(', ')}" if jobs.any?
172
+ lines << "**Controllers:** #{controllers.map { |c| c['identifier'] }.sort.join(', ')}" if controllers.any?
173
+ lines << "**GraphQL:** #{graphql.map { |g| g['identifier'] }.sort.join(', ')}" if graphql.any?
174
+ lines << "**Jobs:** #{jobs.map { |j| j['identifier'] }.sort.join(', ')}" if jobs.any?
166
175
 
167
176
  lines.join("\n")
168
177
  end
@@ -172,15 +181,16 @@ module Woods
172
181
 
173
182
  enums = meta['enums']
174
183
  if enums.is_a?(Hash) && enums.any?
175
- enum_strs = enums.map { |name, values| "#{name} (#{format_enum_values(values)})" }
184
+ enum_strs = enums.sort_by { |name, _| name.to_s }
185
+ .map { |name, values| "#{name} (#{format_enum_values(values)})" }
176
186
  parts << "**Enums:** #{enum_strs.join('; ')}"
177
187
  end
178
188
 
179
189
  scopes = meta['scopes']
180
- parts << "**Scopes:** #{scopes.map { |s| s['name'] }.join(', ')}" if scopes.is_a?(Array) && scopes.any?
190
+ parts << "**Scopes:** #{scopes.map { |s| s['name'] }.sort.join(', ')}" if scopes.is_a?(Array) && scopes.any?
181
191
 
182
192
  concerns = meta['inlined_concerns']
183
- parts << "**Concerns:** #{concerns.join(', ')}" if concerns.is_a?(Array) && concerns.any?
193
+ parts << "**Concerns:** #{concerns.sort.join(', ')}" if concerns.is_a?(Array) && concerns.any?
184
194
 
185
195
  callbacks = meta['callbacks']
186
196
  if callbacks.is_a?(Array) && callbacks.any?
@@ -200,8 +210,8 @@ module Woods
200
210
  return nil if jobs.empty? && mailers.empty?
201
211
 
202
212
  lines = ['## Side Effects']
203
- lines << "**Jobs:** #{jobs.map { |j| j['identifier'] }.join(', ')}" if jobs.any?
204
- lines << "**Mailers:** #{mailers.map { |m| m['identifier'] }.join(', ')}" if mailers.any?
213
+ lines << "**Jobs:** #{jobs.map { |j| j['identifier'] }.sort.join(', ')}" if jobs.any?
214
+ lines << "**Mailers:** #{mailers.map { |m| m['identifier'] }.sort.join(', ')}" if mailers.any?
205
215
 
206
216
  lines.join("\n")
207
217
  end
@@ -229,18 +239,21 @@ module Woods
229
239
  routes = meta['routes']
230
240
  return nil unless routes.is_a?(Hash) && routes.any?
231
241
 
232
- lines = ['## Routes']
242
+ route_lines = []
233
243
  routes.each do |action, route_list|
234
244
  next unless route_list.is_a?(Array)
235
245
 
236
246
  route_list.each do |route|
237
247
  next unless route.is_a?(Hash)
238
248
 
239
- lines << "- `#{route['verb']} #{route['path']}` (#{action})"
249
+ route_lines << "- `#{route['verb']} #{route['path']}` (#{action})"
240
250
  end
241
251
  end
242
252
 
243
- lines.size > 1 ? lines.first(20).join("\n") : nil
253
+ # Sort before truncating so the kept subset is stable across runs.
254
+ return nil if route_lines.empty?
255
+
256
+ (['## Routes'] + route_lines.sort.first(20)).join("\n")
244
257
  end
245
258
 
246
259
  def controller_dependencies(unit)
@@ -250,7 +263,7 @@ module Woods
250
263
  models = deps.select { |d| d['type'] == 'model' }.map { |d| d['target'] }
251
264
  return nil if models.empty?
252
265
 
253
- "## Dependencies\n**Models:** #{models.join(', ')}"
266
+ "## Dependencies\n**Models:** #{models.sort.join(', ')}"
254
267
  end
255
268
 
256
269
  def controller_dependents(unit)
@@ -258,7 +271,7 @@ module Woods
258
271
  views = deps.select { |d| d['type'] == 'view_template' }
259
272
  return nil if views.empty?
260
273
 
261
- "## Views\n#{views.map { |v| "- `#{v['identifier']}`" }.first(10).join("\n")}"
274
+ "## Views\n#{views.map { |v| "- `#{v['identifier']}`" }.sort.first(10).join("\n")}"
262
275
  end
263
276
 
264
277
  # ── GraphQL formatting ───────────────────────────────────────────
@@ -271,7 +284,7 @@ module Woods
271
284
 
272
285
  deps = unit['dependencies'] || []
273
286
  models = deps.select { |d| d['type'] == 'model' }.map { |d| d['target'] }
274
- sections << "**Models:** #{models.join(', ')}" if models.any?
287
+ sections << "**Models:** #{models.sort.join(', ')}" if models.any?
275
288
 
276
289
  dependents = unit['dependents'] || []
277
290
  sections << "**Referenced by:** #{dependents.size} units" if dependents.any?
@@ -292,14 +305,15 @@ module Woods
292
305
  deps = unit['dependencies'] || []
293
306
  if deps.any?
294
307
  by_type = deps.group_by { |d| d['type'] }
295
- dep_parts = by_type.map { |type, items| "#{type}: #{items.map { |d| d['target'] }.join(', ')}" }
308
+ dep_parts = by_type.sort_by { |type, _| type.to_s }
309
+ .map { |type, items| "#{type}: #{items.map { |d| d['target'] }.sort.join(', ')}" }
296
310
  sections << "## Dependencies\n#{dep_parts.join("\n")}"
297
311
  end
298
312
 
299
313
  dependents = unit['dependents'] || []
300
314
  if dependents.any?
301
315
  grouped = dependents.group_by { |d| d['type'] }
302
- summary = grouped.map { |type, items| "#{items.size} #{type}s" }
316
+ summary = grouped.sort_by { |type, _| type.to_s }.map { |type, items| "#{items.size} #{type}s" }
303
317
  sections << "## Dependents (#{dependents.size})\n#{summary.join(', ')}"
304
318
  end
305
319
 
@@ -317,9 +331,9 @@ module Woods
317
331
  end
318
332
 
319
333
  def format_callbacks(callbacks)
320
- callbacks.first(5).map do |cb|
321
- "#{cb['type']}: #{cb['filter']}"
322
- end.join(', ')
334
+ # Sort before truncating so both the selection and order are stable
335
+ # regardless of input order (the body is hashed for change detection).
336
+ callbacks.map { |cb| "#{cb['type']}: #{cb['filter']}" }.sort.first(5).join(', ')
323
337
  end
324
338
  end
325
339
  end
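Why the sorting matters for change detection can be shown with a small sketch. The names `render_scopes` and `fingerprint` below are illustrative; hashing the title and body with SHA-256 is an assumption, since the exporter's actual fingerprint implementation is not part of this diff.

```ruby
require 'digest'

# Renders scope names in sorted order so the output does not depend on the
# order in which the extractor happened to emit them.
def render_scopes(scopes)
  "**Scopes:** #{scopes.map { |s| s['name'] }.sort.join(', ')}"
end

# Hypothetical content hash over a document's title and body.
def fingerprint(title:, body:)
  Digest::SHA256.hexdigest("#{title}\n#{body}")
end
```

With this in place, two extractions that list the same scopes in a different order produce the same body and therefore the same hash, so the unit is skipped instead of re-pushed.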
@@ -1,9 +1,13 @@
1
1
  # frozen_string_literal: true
2
2
 
3
+ require 'set'
4
+ require 'digest'
5
+ require 'uri'
3
6
  require 'woods'
4
7
  require_relative 'client'
5
8
  require_relative 'rate_limiter'
6
9
  require_relative 'document_builder'
10
+ require_relative 'sync_manifest'
7
11
 
8
12
  module Woods
9
13
  module Unblocked
@@ -11,16 +15,27 @@ module Woods
11
15
  #
12
16
  # Reads extraction output from disk via IndexReader, converts units to
13
17
  # condensed Markdown documents, and pushes via the Unblocked Documents API.
14
- # All syncs are idempotent; documents are upserted by URI.
18
+ # Syncs are incremental: a {SyncManifest} records the content hash and
19
+ # remote document_id of everything last pushed, so each run only PUTs
20
+ # new/changed documents, skips unchanged ones, and deletes documents whose
21
+ # source unit has disappeared. Documents are upserted by URI, so a missing
22
+ # manifest (first run / CI cache miss) degrades to a correct full sync.
15
23
  #
16
24
  # @example
17
25
  # exporter = Exporter.new(index_dir: "tmp/woods")
18
26
  # stats = exporter.sync_all
19
- # # => { synced: 940, skipped: 5060, errors: [] }
27
+ # # => { synced: 12, skipped: 928, deleted: 1, errors: [] }
20
28
  #
21
29
  class Exporter
22
30
  MAX_ERRORS = 100
23
31
 
32
+ # Mass-deletion guard: refuse to purge when more than this fraction of a
33
+ # manifest of at least PURGE_GUARD_MIN_DOCS entries would be deleted —
34
+ # the signature of a sync run against a partial index. Override with
35
+ # force_purge.
36
+ PURGE_GUARD_FRACTION = 0.30
37
+ PURGE_GUARD_MIN_DOCS = 10
38
+
24
39
  # Unit types to sync, in priority order.
25
40
  # All units are synced for these types.
26
41
  FULL_SYNC_TYPES = %w[
@@ -39,9 +54,13 @@ module Woods
39
54
  # @param config [Configuration] Woods configuration (default: global config)
40
55
  # @param client [Client, nil] Unblocked API client (auto-created from config if nil)
41
56
  # @param reader [Object, nil] IndexReader instance (auto-created if nil)
57
+ # @param manifest [SyncManifest, nil] Sync manifest (auto-created under index_dir if nil)
58
+ # @param force_full [Boolean] Re-push every unit, ignoring the unchanged check
59
+ # @param force_purge [Boolean] Bypass the mass-deletion guard
42
60
  # @param output [IO] Progress output stream (default: $stdout)
43
61
  # @raise [ConfigurationError] if required config is missing
44
- def initialize(index_dir:, config: Woods.configuration, client: nil, reader: nil, output: $stdout)
62
+ def initialize(index_dir:, config: Woods.configuration, client: nil, reader: nil,
63
+ manifest: nil, force_full: false, force_purge: false, output: $stdout)
45
64
  @collection_id = config.unblocked_collection_id
46
65
  raise ConfigurationError, 'unblocked_collection_id is required' unless @collection_id
47
66
 
@@ -57,18 +76,35 @@ module Woods
57
76
  @client = client || Client.new(api_token: api_token, rate_limiter: limiter)
58
77
  @reader = reader || build_reader(index_dir)
59
78
  @builder = DocumentBuilder.new(repo_url: repo_url)
79
+ @manifest = manifest || build_manifest(index_dir)
80
+ @force_full = force_full
81
+ @force_purge = force_purge
60
82
  @output = output
83
+ # Initialized here as well as in sync_all so the public sync_type /
84
+ # sync_type_partial methods work standalone (track_uri needs them).
85
+ @current_uris = Set.new
86
+ @budget_exhausted = false
87
+ # base URI => identifier that keeps the bare URI (only populated for
88
+ # URIs shared by >1 unit). Rebuilt per sync_all run.
89
+ @uri_primary = {}
61
90
  end
62
91
 
63
92
  # Sync all configured unit types to the Unblocked collection.
64
93
  #
65
- # @return [Hash] { synced: Integer, skipped: Integer, errors: Array<String> }
94
+ # @return [Hash] { synced:, skipped:, deleted:, errors: }
66
95
  def sync_all
96
+ @current_uris = Set.new
97
+ @budget_exhausted = false
98
+ build_uri_index
99
+ reconcile_from_remote if @manifest.empty?
100
+
67
101
  synced = 0
68
102
  skipped = 0
69
103
  errors = []
70
104
 
71
105
  FULL_SYNC_TYPES.each do |type|
106
+ break if @budget_exhausted
107
+
72
108
  result = sync_type(type)
73
109
  synced += result[:synced]
74
110
  skipped += result[:skipped]
@@ -76,19 +112,24 @@ module Woods
76
112
  end
77
113
 
78
114
  PARTIAL_SYNC_TYPES.each do |type, max_count|
115
+ break if @budget_exhausted
116
+
79
117
  result = sync_type_partial(type, max_count)
80
118
  synced += result[:synced]
81
119
  skipped += result[:skipped]
82
120
  errors.concat(result[:errors])
83
121
  end
84
122
 
85
- { synced: synced, skipped: skipped, errors: cap_errors(errors) }
123
+ deleted = @budget_exhausted ? 0 : purge_stale(errors)
124
+ { synced: synced, skipped: skipped, deleted: deleted, errors: cap_errors(errors) }
125
+ ensure
126
+ save_manifest
86
127
  end
87
128
 
88
129
  # Sync all units of a given type.
89
130
  #
90
131
  # @param type [String] Unit type (e.g. "model", "controller")
91
- # @return [Hash] { synced: Integer, skipped: Integer, errors: Array<String> }
132
+ # @return [Hash] { synced:, skipped:, errors: }
92
133
  def sync_type(type)
93
134
  units = @reader.list_units(type: type)
94
135
  log " #{type}: #{units.size} units"
@@ -100,7 +141,7 @@ module Woods
100
141
  #
101
142
  # @param type [String] Unit type
102
143
  # @param max_count [Integer] Maximum units to sync
103
- # @return [Hash] { synced: Integer, skipped: Integer, errors: Array<String> }
144
+ # @return [Hash] { synced:, skipped:, errors: }
104
145
  def sync_type_partial(type, max_count)
105
146
  units = @reader.list_units(type: type)
106
147
  return empty_stats if units.empty?
@@ -114,8 +155,14 @@ module Woods
114
155
  { entry: entry, data: data, dep_count: dep_count }
115
156
  end
116
157
 
158
+ # Every unit of this type still exists — track its URI so partial units
159
+ # that fall *out* of the top-N are never mistaken for deletions.
160
+ units_with_data.each { |u| track_uri(u[:data]) }
161
+
117
162
  top_units = units_with_data.sort_by { |u| -u[:dep_count] }.first(max_count)
118
- skipped_count = [units.size - max_count, 0].max
163
+ # Count against what was actually synced — units.size includes entries
164
+ # whose unit data was missing (dropped by the filter_map above).
165
+ skipped_count = units.size - top_units.size
119
166
 
120
167
  log " #{type}: #{top_units.size}/#{units.size} units (top by dependents)"
121
168
 
@@ -138,13 +185,19 @@ module Woods
138
185
  next
139
186
  end
140
187
 
141
- push_document(unit_data)
142
- synced += 1
188
+ track_uri(unit_data)
189
+ if push_document(unit_data) == :skipped
190
+ skipped += 1
191
+ else
192
+ synced += 1
193
+ end
143
194
  rescue Woods::Error => e
144
195
  errors << "#{entry['identifier']}: #{e.message}"
145
- break if e.message.include?('daily budget exhausted')
196
+ break if note_budget_exhaustion(e)
146
197
  rescue StandardError => e
147
- errors << "#{entry['identifier']}: #{e.message}"
198
+ # Include the class — "undefined method for nil" without it is
199
+ # unactionable in CI logs.
200
+ errors << "#{entry['identifier']}: #{e.class}: #{e.message}"
148
201
  end
149
202
 
150
203
  { synced: synced, skipped: skipped, errors: errors }
@@ -156,26 +209,222 @@ module Woods
156
209
  errors = []
157
210
 
158
211
  entries_with_data.each do |entry, unit_data|
159
- push_document(unit_data)
160
- synced += 1
212
+ track_uri(unit_data)
213
+ if push_document(unit_data) == :skipped
214
+ skipped += 1
215
+ else
216
+ synced += 1
217
+ end
161
218
  rescue Woods::Error => e
162
219
  errors << "#{entry['identifier']}: #{e.message}"
163
- break if e.message.include?('daily budget exhausted')
220
+ break if note_budget_exhaustion(e)
164
221
  rescue StandardError => e
165
- errors << "#{entry['identifier']}: #{e.message}"
222
+ # Include the class — "undefined method for nil" without it is
223
+ # unactionable in CI logs.
224
+ errors << "#{entry['identifier']}: #{e.class}: #{e.message}"
166
225
  end
167
226
 
168
227
  { synced: synced, skipped: skipped, errors: errors }
169
228
  end
170
229
 
230
+ # Build the document, skip it if the manifest says it is unchanged,
231
+ # otherwise upsert it and record the new hash + remote document_id.
232
+ #
233
+ # @return [Symbol] :synced or :skipped
171
234
  def push_document(unit_data)
235
+ # No file_path → the URI falls back to the bare repo URL, which every
236
+ # such unit would share: they'd overwrite each other remotely and
237
+ # ping-pong the manifest hash forever. Skip them.
238
+ return :skipped unless unit_data['file_path']
239
+
240
+ # When several units share one file they share one base URI; only one
241
+ # keeps it, the rest get a `?unit=` suffix so each is a distinct remote
242
+ # document (and a distinct manifest key).
243
+ uri = effective_uri(unit_data)
244
+
172
245
  doc = @builder.build(unit_data)
173
- @client.put_document(
246
+ # An empty body means the credential scrub failed closed (the builders
247
+ # always emit at least a header). Upserting it would overwrite a good
248
+ # remote document with nothing — error out and leave the remote as-is.
249
+ if doc[:body].nil? || doc[:body].empty?
250
+ raise Woods::ExtractionError, 'document body empty (credential scrub failure?) — push skipped'
251
+ end
252
+
253
+ hash = fingerprint(doc)
254
+ return :skipped if !@force_full && @manifest.unchanged?(uri, hash)
255
+
256
+ response = @client.put_document(
174
257
  collection_id: @collection_id,
175
258
  title: doc[:title],
176
259
  body: doc[:body],
177
- uri: doc[:uri]
260
+ uri: uri
178
261
  )
262
+ document_id = (response['id'] if response.is_a?(Hash)) || @manifest.document_id_for(uri)
263
+ @manifest.record(uri: uri, hash: hash, document_id: document_id)
264
+ :synced
265
+ end
266
+
267
+ # Delete remote documents whose source unit no longer exists. Failures
268
+ # are appended to +errors+ — a delete that fails silently every run is
269
+ # how a collection rots while "deleted: 0" looks normal.
270
+ #
271
+ # @param errors [Array<String>] sink for delete failures
272
+ # @return [Integer] number of documents deleted
273
+ def purge_stale(errors)
274
+ stale = @manifest.stale_uris(@current_uris)
275
+ return 0 if stale.empty?
276
+ return 0 if guard_blocks_purge?(stale)
277
+
278
+ resolve_missing_document_ids(stale)
279
+
280
+ deleted = 0
281
+ stale.each do |uri|
282
+ document_id = @manifest.document_id_for(uri)
283
+ next unless document_id
284
+
285
+ @client.delete_document(document_id: document_id)
286
+ @manifest.forget(uri)
287
+ deleted += 1
288
+ rescue ApiError => e
289
+ if e.status == 404
290
+ # Already gone remotely — goal state reached, drop the entry
291
+ # rather than retrying every run.
292
+ @manifest.forget(uri)
293
+ else
294
+ errors << "delete #{uri}: #{e.message}"
295
+ end
296
+ rescue Woods::Error => e
297
+ break if note_budget_exhaustion(e)
298
+
299
+ errors << "delete #{uri}: #{e.message}"
300
+ rescue StandardError => e
301
+ # Entry stays in the manifest so a later run retries the delete —
302
+ # but surface the failure so systematic breakage is visible.
303
+ errors << "delete #{uri}: #{e.class}: #{e.message}"
304
+ end
305
+ deleted
306
+ end
307
+
308
+ # A manifest entry can carry a nil document_id (e.g. the PUT response
309
+ # body was empty). Those entries would be permanently undeletable, so
310
+ # before purging, make one bounded all_documents sweep to resolve ids.
311
+ # Best-effort: unresolved entries are simply skipped by the purge loop.
312
+ def resolve_missing_document_ids(stale)
313
+ missing = stale.select { |uri| @manifest.document_id_for(uri).nil? }
314
+ return if missing.empty?
315
+
316
+ ids_by_uri = @client.all_documents(collection_id: @collection_id)
317
+ .to_h { |doc| [doc['uri'], doc['id']] }
318
+ missing.each do |uri|
319
+ id = ids_by_uri[uri]
320
+ @manifest.record(uri: uri, hash: nil, document_id: id) if id
321
+ end
322
+ rescue StandardError => e
323
+ log " id resolution skipped (#{e.message})"
324
+ end
325
+
326
+ # True when purging +stale+ would delete too large a fraction of the
327
+ # manifest — the signature of running against a partial index. The floor
328
+ # (PURGE_GUARD_MIN_DOCS) keeps small collections deletable.
329
+ def guard_blocks_purge?(stale)
330
+ return false if @force_purge
331
+
332
+ size = @manifest.size
333
+ return false if size < PURGE_GUARD_MIN_DOCS
334
+
335
+ fraction = stale.size.to_f / size
336
+ return false unless fraction > PURGE_GUARD_FRACTION
337
+
338
+ log " WARNING: refusing to delete #{stale.size} of #{size} documents " \
339
+ "(#{(fraction * 100).round}% > #{(PURGE_GUARD_FRACTION * 100).to_i}% — likely a partial index). " \
340
+ 'Set UNBLOCKED_FORCE_PURGE=1 to override.'
341
+ true
342
+ end
343
+
344
+ # Seed the manifest from the remote collection when we have no local
345
+ # state (first run / CI cache miss). The list endpoint returns no body,
346
+ # so hashes are nil (everything re-pushes), but recovering document_ids
347
+ # lets this run still purge orphaned documents.
348
+ #
349
+ # Auth failures re-raise: a 401/403 here dooms every subsequent call,
350
+ # and "proceeding with full sync" would burn the whole daily budget on
351
+ # guaranteed failures.
352
+ def reconcile_from_remote
353
+ @client.all_documents(collection_id: @collection_id).each do |doc|
354
+ uri = doc['uri']
355
+ next unless uri
356
+
357
+ @manifest.record(uri: uri, hash: nil, document_id: doc['id'])
358
+ end
359
+ rescue ApiError => e
360
+ raise if [401, 403].include?(e.status)
361
+
362
+ log " reconcile skipped (#{e.message}) — proceeding with full sync"
363
+ rescue StandardError => e
364
+ log " reconcile skipped (#{e.message}) — proceeding with full sync"
365
+ end
366
+
367
+ def track_uri(unit_data)
368
+ # Units without a file_path are never pushed (see push_document), so
369
+ # their fallback repo-root URI must not be marked current either — a
370
+ # stale repo-root document from before this guard should purge.
371
+ return unless unit_data['file_path']
372
+
373
+ # Must match the URI push_document actually uses, or a colliding unit's
374
+ # disambiguated document would look stale and be purged.
375
+ @current_uris << effective_uri(unit_data)
376
+ end
377
+
378
+ # The URI a unit's document is stored under. Normally the file's blob URL;
379
+ # when several units share that file, all but the lexically-first
380
+ # identifier get a `?unit=` suffix so each keeps a distinct document
381
+ # rather than overwriting the others (see #build_uri_index).
382
+ def effective_uri(unit_data)
383
+ base = @builder.uri_for(unit_data)
384
+ primary = @uri_primary[base]
385
+ return base if primary.nil? || primary == unit_data['identifier']
386
+
387
+ "#{base}?unit=#{URI.encode_www_form_component(unit_data['identifier'])}"
388
+ end
389
+
390
+ # One cheap pass over the type indexes (entries already carry file_path,
391
+ # and read_index is cached) to find files that define more than one synced
392
+ # unit. For each such base URI, the lexically-smallest identifier — the
393
+ # outer/top-level class — keeps the bare URI; siblings are suffixed. Solo
394
+ # files (the overwhelming majority) are absent from the map and unchanged,
395
+ # so this introduces no churn for them.
396
+ def build_uri_index
397
+ groups = Hash.new { |h, k| h[k] = [] }
398
+ synced_types.each do |type|
399
+ @reader.list_units(type: type).each do |entry|
400
+ next unless entry['file_path']
401
+
402
+ groups[@builder.uri_for(entry)] << entry['identifier']
403
+ end
404
+ end
405
+
406
+ @uri_primary = groups.each_with_object({}) do |(uri, identifiers), primary|
407
+ unique = identifiers.uniq
408
+ primary[uri] = unique.min if unique.size > 1
409
+ end
410
+ end
411
+
412
+ def synced_types
413
+ FULL_SYNC_TYPES + PARTIAL_SYNC_TYPES.map(&:first)
414
+ end
415
+
416
+ def fingerprint(doc)
417
+ Digest::SHA256.hexdigest("#{doc[:title]}\n#{doc[:body]}")
418
+ end
419
+
420
+ # Records whether an error was a budget-exhaustion stop. Returns true when
421
+ # it was, so callers can break out of their loop. Class check first; the
422
+ # message match remains as a fallback for injected clients that raise
423
+ # plain Woods::Error.
424
+ def note_budget_exhaustion(error)
425
+ return false unless error.is_a?(BudgetExhaustedError) || error.message.include?('daily budget exhausted')
426
+
427
+ @budget_exhausted = true
179
428
  end
180
429
 
181
430
  def build_reader(index_dir)
@@ -183,6 +432,23 @@ module Woods
183
432
  Woods::MCP::IndexReader.new(index_dir)
184
433
  end
185
434
 
435
+ # Persist the manifest, downgrading failures to a warning: losing the
436
+ # manifest only costs a full re-check next run, which must not turn an
437
+ # otherwise-successful sync into a crash (this runs from an ensure, where
438
+ # a raise would also mask any in-flight exception).
439
+ def save_manifest
440
+ @manifest.save
441
+ rescue StandardError => e
442
+ log " WARNING: sync manifest not persisted (#{e.message}) — next run will re-push all documents"
443
+ end
444
+
445
+ def build_manifest(index_dir)
446
+ SyncManifest.new(
447
+ path: File.join(index_dir, 'unblocked_sync_manifest.json'),
448
+ collection_id: @collection_id
449
+ )
450
+ end
451
+
186
452
  def empty_stats
187
453
  { synced: 0, skipped: 0, errors: [] }
188
454
  end
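The collision handling added to the exporter in this diff (`build_uri_index` plus `effective_uri`) can be tried in isolation. What follows is a minimal standalone sketch, not the gem's code: the helper names `primary_by_uri` and `disambiguate`, the `ENTRIES` sample data, and the `example.test` URLs are all hypothetical, and the real exporter derives its base URI from `@builder.uri_for`.

```ruby
require 'uri'

# Hypothetical stand-in for the exporter's index entries: [base_uri, identifier].
ENTRIES = [
  ['https://example.test/blob/app/models/order.rb', 'Order'],
  ['https://example.test/blob/app/models/order.rb', 'Order::LineItem'],
  ['https://example.test/blob/app/models/user.rb', 'User']
].freeze

# Map each shared base URI to the identifier that keeps the bare URI
# (the lexically smallest one). Files with a single unit are left out.
def primary_by_uri(entries)
  groups = Hash.new { |h, k| h[k] = [] }
  entries.each { |uri, identifier| groups[uri] << identifier }

  groups.each_with_object({}) do |(uri, identifiers), primary|
    unique = identifiers.uniq
    primary[uri] = unique.min if unique.size > 1
  end
end

# Return the URI a unit is stored under: the bare URI for solo files and
# for the primary unit, a `?unit=` suffixed URI for every other sibling.
def disambiguate(base, identifier, primary)
  owner = primary[base]
  return base if owner.nil? || owner == identifier

  "#{base}?unit=#{URI.encode_www_form_component(identifier)}"
end

PRIMARY = primary_by_uri(ENTRIES)
ENTRIES.each { |uri, id| puts disambiguate(uri, id, PRIMARY) }
```

Because solo files never enter the map, their URIs stay exactly as they were before, which is why the change should cause no re-push churn for them.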
@@ -1,7 +1,15 @@
1
1
  # frozen_string_literal: true
2
2
 
3
+ require 'woods'
4
+
3
5
  module Woods
4
6
  module Unblocked
7
+ # Raised when the daily API call budget is exhausted. Subclasses
8
+ # Woods::Error so existing +rescue Woods::Error+ sites keep working;
9
+ # callers that need to branch on exhaustion rescue this class instead of
10
+ # matching the message string.
11
+ class BudgetExhaustedError < Woods::Error; end
12
+
5
13
  # Daily budget-based rate limiter for the Unblocked API (1000 calls/day).
6
14
  #
7
15
  # Unlike Notion's per-second throttling, Unblocked limits by daily call count.
@@ -35,13 +43,13 @@ module Woods
35
43
  #
36
44
  # @yield The API call to execute
37
45
  # @return [Object] The block's return value
38
- # @raise [Woods::Error] if daily budget is exhausted
46
+ # @raise [BudgetExhaustedError] if daily budget is exhausted
39
47
  def track
40
48
  raise ArgumentError, 'block required' unless block_given?
41
49
 
42
50
  @mutex.synchronize do
43
51
  if @calls_today >= @daily_budget
44
- raise Woods::Error,
52
+ raise BudgetExhaustedError,
45
53
  "Unblocked API daily budget exhausted (#{@daily_budget} calls). " \
46
54
  'Budget resets at midnight PST. Use UNBLOCKED_DAILY_BUDGET to adjust.'
47
55
  end
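The rate limiter change above swaps a message match for a dedicated error class. As a rough illustration of why subclassing keeps older `rescue` sites working while letting new callers branch on the class, here is a self-contained sketch. `SketchError`, `SketchBudgetExhausted`, `TinyLimiter`, and `push_all` are hypothetical stand-ins, and the toy limiter omits the daily reset that the real one performs.

```ruby
# Hypothetical stand-ins for Woods::Error and BudgetExhaustedError.
class SketchError < StandardError; end
class SketchBudgetExhausted < SketchError; end

# A toy call counter with a fixed budget; day rollover is left out.
class TinyLimiter
  def initialize(daily_budget)
    @daily_budget = daily_budget
    @calls_today = 0
    @mutex = Mutex.new
  end

  # Count the call under the lock, then run the block outside it.
  def track
    raise ArgumentError, 'block required' unless block_given?

    @mutex.synchronize do
      raise SketchBudgetExhausted, "daily budget exhausted (#{@daily_budget} calls)" if @calls_today >= @daily_budget

      @calls_today += 1
    end
    yield
  end
end

# Push items until the budget runs out; stop on exhaustion by class,
# not by matching the message text.
def push_all(limiter, items)
  pushed = []
  stopped = false
  items.each do |item|
    limiter.track { pushed << item }
  rescue SketchBudgetExhausted
    stopped = true
    break
  rescue SketchError => e
    warn "#{item}: #{e.message}"
  end
  { pushed: pushed, stopped: stopped }
end

RESULT = push_all(TinyLimiter.new(2), %w[a b c])
puts RESULT.inspect
```

The more specific `rescue` clause has to come first; a plain `rescue SketchError` placed above it would also catch the exhaustion error, which is the compatibility property the diff relies on.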
@@ -0,0 +1,135 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'json'
4
+ require 'fileutils'
5
+
6
+ module Woods
7
+ module Unblocked
8
+ # Tracks what was last pushed to an Unblocked collection so a sync can
9
+ # skip unchanged documents, re-push changed ones, and delete orphans.
10
+ #
11
+ # The manifest is the local source of truth for change detection: each
12
+ # entry records the content hash of the document we last pushed for a URI
13
+ # plus the remote +document_id+ (needed for deletes). Persisted as JSON
14
+ # alongside the extraction output and restored across CI runs via the CI
15
+ # provider's cache. A missing or corrupt file degrades to "everything is
16
+ # new" — a correct (if expensive) full sync that rebuilds the manifest.
17
+ #
18
+ # Modeled on the embedding indexer's checkpoint (load JSON → compare
19
+ # per-key hash → save JSON).
20
+ #
21
+ # @example
22
+ # manifest = SyncManifest.new(path: "tmp/woods/unblocked_sync_manifest.json",
23
+ # collection_id: "col-uuid")
24
+ # manifest.unchanged?(uri, hash) # => false on first run
25
+ # manifest.record(uri:, hash:, document_id:)
26
+ # manifest.save
27
+ #
28
+ class SyncManifest
29
+ VERSION = 1
30
+
31
+ # @param path [String] JSON file path for the manifest
32
+ # @param collection_id [String] Target collection UUID — a stored manifest
33
+ # for a *different* collection is discarded (cache-key reuse guard).
34
+ def initialize(path:, collection_id:)
35
+ @path = path
36
+ @collection_id = collection_id
37
+ @documents = load
38
+ end
39
+
40
+ # @return [Boolean] true when no documents are recorded
41
+ def empty?
42
+ @documents.empty?
43
+ end
44
+
45
+ # @param uri [String] Document URI
46
+ # @param hash [String] Content hash of the document we would push now
47
+ # @return [Boolean] true when the recorded hash matches (safe to skip)
48
+ def unchanged?(uri, hash)
49
+ entry = @documents[uri]
50
+ !entry.nil? && entry['hash'] == hash
51
+ end
52
+
53
+ # Record (or update) what we pushed for a URI.
54
+ #
55
+ # @param uri [String] Document URI
56
+ # @param hash [String, nil] Content hash pushed (nil forces a future re-push)
57
+ # @param document_id [String, nil] Remote document UUID (for later deletes)
58
+ def record(uri:, hash:, document_id:)
59
+ @documents[uri] = { 'hash' => hash, 'document_id' => document_id }
60
+ end
61
+
62
+ # @param uri [String] Document URI
63
+ # @return [String, nil] Stored remote document_id, if known
64
+ def document_id_for(uri)
65
+ @documents.dig(uri, 'document_id')
66
+ end
67
+
68
+ # URIs we have a record of that are absent from the current run's set.
69
+ #
70
+ # @param current_uris [Array<String>, Set] URIs that still exist this run
71
+ # @return [Array<String>] recorded URIs no longer present (deletion candidates)
72
+ def stale_uris(current_uris)
73
+ present = current_uris.to_a
74
+ @documents.keys - present
75
+ end
76
+
77
+ # @return [Integer] number of recorded documents
78
+ def size
79
+ @documents.size
80
+ end
81
+
82
+ # Drop a URI from the manifest (after a successful remote delete).
83
+ #
84
+ # @param uri [String] Document URI
85
+ def forget(uri)
86
+ @documents.delete(uri)
87
+ end
88
+
89
+ # Persist the manifest atomically (temp file + rename) so an interrupted
90
+ # write never leaves a torn file in the CI cache.
91
+ def save
92
+ FileUtils.mkdir_p(File.dirname(@path))
93
+ payload = JSON.generate(
94
+ 'version' => VERSION,
95
+ 'collection_id' => @collection_id,
96
+ 'documents' => @documents
97
+ )
98
+ tmp = "#{@path}.tmp"
99
+ File.write(tmp, payload)
100
+ File.rename(tmp, @path)
101
+ end
102
+
103
+ private
104
+
105
+ # Load the persisted documents, discarding data from a different
106
+ # collection, a different schema version, or an unparseable file.
107
+ # Every discard warns to stderr — the consequence (a full re-push) is
108
+ # expensive enough that operators need to know why it happened.
109
+ #
110
+ # @return [Hash{String=>Hash}] uri => { 'hash' =>, 'document_id' => }
111
+ def load
112
+ return {} unless File.exist?(@path)
113
+
114
+ parsed = JSON.parse(File.read(@path))
115
+ return discard('not a JSON object') unless parsed.is_a?(Hash)
116
+ return discard("schema version #{parsed['version'].inspect}, expected #{VERSION}") unless
117
+ parsed['version'] == VERSION
118
+ return discard("written for collection #{parsed['collection_id'].inspect}, expected #{@collection_id}") unless
119
+ parsed['collection_id'] == @collection_id
120
+
121
+ documents = parsed['documents']
122
+ documents.is_a?(Hash) ? documents : {}
123
+ rescue JSON::ParserError
124
+ discard('unparseable JSON')
125
+ end
126
+
127
+ # @param reason [String] Why the persisted manifest is unusable
128
+ # @return [Hash] empty documents hash (degrades to a full re-push)
129
+ def discard(reason)
130
+ warn "WARNING: discarding sync manifest at #{@path} (#{reason}) — next sync re-pushes all documents"
131
+ {}
132
+ end
133
+ end
134
+ end
135
+ end
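The skip, re-push, and orphan decisions that `SyncManifest` supports can be sketched without any file or network access. The example below is an assumption-laden illustration rather than the gem's implementation: `plan_sync`, the in-memory `MANIFEST` hash, and the sample URIs are hypothetical, while the fingerprint follows the title-newline-body SHA256 shape shown in the exporter diff.

```ruby
require 'digest'

# Same fingerprint shape as the exporter: title and body joined by a newline.
def fingerprint(title, body)
  Digest::SHA256.hexdigest("#{title}\n#{body}")
end

# Decide what to push and what is stale, given a manifest of the form
# uri => { 'hash' =>, 'document_id' => } and the documents built this run.
def plan_sync(manifest, docs)
  to_push = docs.reject do |uri, doc|
    entry = manifest[uri]
    !entry.nil? && entry['hash'] == fingerprint(doc[:title], doc[:body])
  end
  { push: to_push.keys, stale: manifest.keys - docs.keys }
end

# Hypothetical state left behind by a previous run.
MANIFEST = {
  'u/a.rb' => { 'hash' => fingerprint('A', 'same'), 'document_id' => 'doc-1' },
  'u/b.rb' => { 'hash' => fingerprint('B', 'old'), 'document_id' => 'doc-2' },
  'u/gone.rb' => { 'hash' => nil, 'document_id' => 'doc-3' }
}.freeze

# Hypothetical documents built during the current run.
DOCS = {
  'u/a.rb' => { title: 'A', body: 'same' },
  'u/b.rb' => { title: 'B', body: 'new' },
  'u/c.rb' => { title: 'C', body: 'first push' }
}.freeze

PLAN = plan_sync(MANIFEST, DOCS)
puts PLAN.inspect
```

An entry with a `nil` hash never matches a real fingerprint, so it would always re-push if its URI were still present; that mirrors how the reconcile step in the exporter seeds entries that carry a document id but no hash.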
data/lib/woods/version.rb CHANGED
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module Woods
4
- VERSION = '1.3.0'
4
+ VERSION = '1.4.1'
5
5
  end
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: woods
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.3.0
4
+ version: 1.4.1
5
5
  platform: ruby
6
6
  authors:
7
7
  - Leah Armstrong
8
8
  autorequire:
9
9
  bindir: exe
10
10
  cert_chain: []
11
- date: 2026-05-13 00:00:00.000000000 Z
11
+ date: 2026-06-11 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: mcp
@@ -305,6 +305,7 @@ files:
305
305
  - lib/woods/unblocked/document_builder.rb
306
306
  - lib/woods/unblocked/exporter.rb
307
307
  - lib/woods/unblocked/rate_limiter.rb
308
+ - lib/woods/unblocked/sync_manifest.rb
308
309
  - lib/woods/util/host_guard.rb
309
310
  - lib/woods/version.rb
310
311
  homepage: https://github.com/lost-in-the/woods