woods 1.3.0 → 1.4.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +60 -0
- data/README.md +4 -0
- data/lib/generators/woods/templates/woods.rb.tt +1 -1
- data/lib/tasks/woods.rake +22 -1
- data/lib/woods/extractors/controller_extractor.rb +4 -4
- data/lib/woods/unblocked/client.rb +87 -6
- data/lib/woods/unblocked/document_builder.rb +40 -26
- data/lib/woods/unblocked/exporter.rb +233 -17
- data/lib/woods/unblocked/rate_limiter.rb +10 -2
- data/lib/woods/unblocked/sync_manifest.rb +135 -0
- data/lib/woods/version.rb +1 -1
- metadata +3 -2
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: f8991ef048ac039aafe41403a11c320333107ece6ad89da376765107820d27b9
|
|
4
|
+
data.tar.gz: 3fdc6630de8946e1d2ba074ca0b115588934ff965712f1d18d6d3abeb8c5c50b
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: fed6f9a4a26f68dcd5304bc5ab6ba8a99ead3e5ca872a4c466c8ceb9faa23345e16a8405c14397bccb5fb0290970503b6b5ccb62ffe4d041d583593a64bd619a
|
|
7
|
+
data.tar.gz: 10451e14fce82dd983d43b60875fac444dbb3eb08092c85f2085c08de79ebeeca7be80c3278de8145b88c841dae3f2cd909035899460dffd243cc85711523975
|
data/CHANGELOG.md
CHANGED
|
@@ -7,6 +7,66 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
|
|
|
7
7
|
|
|
8
8
|
## [Unreleased]
|
|
9
9
|
|
|
10
|
+
## [1.4.0] - 2026-06-10
|
|
11
|
+
|
|
12
|
+
### Added — Incremental Unblocked sync (PR #128)
|
|
13
|
+
|
|
14
|
+
- **`woods:unblocked_sync` is now incremental.** A new `Woods::Unblocked::SyncManifest`
|
|
15
|
+
(JSON at `<output_dir>/unblocked_sync_manifest.json`) records the content hash and
|
|
16
|
+
remote document id of everything last pushed. Each run skips unchanged documents,
|
|
17
|
+
pushes only new/changed ones, and deletes documents whose source unit disappeared.
|
|
18
|
+
A missing manifest (first run / CI cache miss) degrades to a correct full re-push
|
|
19
|
+
that rebuilds it; steady state on an unchanged codebase costs ~0 API calls (was
|
|
20
|
+
~800–1200 per run). Persist the manifest across CI runs via your provider's cache —
|
|
21
|
+
see `docs/UNBLOCKED_INTEGRATION.md`.
|
|
22
|
+
- **Deletion safety.** Orphan purging is skipped when the daily API budget exhausts
|
|
23
|
+
mid-run, and a mass-deletion guard refuses to delete more than 30% of a ≥10-entry
|
|
24
|
+
manifest in one run (`UNBLOCKED_FORCE_PURGE=1` overrides) — protection against
|
|
25
|
+
syncing a partial index. `UNBLOCKED_FORCE_FULL_SYNC=1` re-pushes everything (use
|
|
26
|
+
after a document-format change). Both flags parse `1`/`true`/`yes` (case-insensitive).
|
|
27
|
+
- **`Client#list_documents` / `#all_documents`** — paginated document listing with
|
|
28
|
+
client-side collection filtering, used to reconcile remote document ids when the
|
|
29
|
+
manifest is missing. Cursors are URL-encoded and a page without a cursor id stops
|
|
30
|
+
pagination rather than looping against the rate budget.
|
|
31
|
+
- **`Woods::Unblocked::ApiError`** (subclass of `Woods::Error`) carries the required
|
|
32
|
+
HTTP `status` of failed API calls; a 404 on delete is treated as already-gone.
|
|
33
|
+
**`Woods::Unblocked::BudgetExhaustedError`** (also a `Woods::Error` subclass) is
|
|
34
|
+
raised by the rate limiter, so budget detection no longer depends on message text.
|
|
35
|
+
- **CI-visible failures.** `woods:unblocked_sync` exits non-zero when the sync
|
|
36
|
+
recorded errors (delete failures are now surfaced in the error list too). The one
|
|
37
|
+
tolerated shape is budget exhaustion with partial progress — the expected
|
|
38
|
+
cold-start outcome, which converges on the next run. Reconcile aborts loudly on
|
|
39
|
+
auth failures (401/403) instead of burning the budget on doomed calls.
|
|
40
|
+
- **Deterministic document bodies.** `DocumentBuilder` sorts every rendered collection
|
|
41
|
+
(associations, dependents, routes, enums, scopes, concerns, callbacks) so an
|
|
42
|
+
unchanged unit always produces byte-identical output — the precondition for
|
|
43
|
+
hash-based change detection.
|
|
44
|
+
- **`Client#create_collection` defaults `iconUrl`** to the repo-hosted Woods mark.
|
|
45
|
+
The live API rejects collection creation without an `iconUrl` despite the API docs
|
|
46
|
+
marking it optional (documented quirk).
|
|
47
|
+
- **Branding.** Tree Rings logo set under `assets/` (marks, wordmark lockups, PNG
|
|
48
|
+
exports); README wordmark.
|
|
49
|
+
|
|
50
|
+
### Fixed
|
|
51
|
+
|
|
52
|
+
- `Client#list_collections` no longer raises `TypeError` on the live API's bare-array
|
|
53
|
+
response.
|
|
54
|
+
- API error messages now surface RFC7807 `title`/`detail` fields (previously
|
|
55
|
+
"Unknown error").
|
|
56
|
+
- `require 'woods/unblocked/client'` works standalone (previously needed `woods`
|
|
57
|
+
loaded first).
|
|
58
|
+
- Units without a `file_path` are skipped instead of synced. Previously every
|
|
59
|
+
such unit fell back to the bare repo URL as its document URI — and since URIs
|
|
60
|
+
are the upsert key, they silently overwrote each other in the collection
|
|
61
|
+
(and would have ping-ponged the new manifest hash every run).
|
|
62
|
+
|
|
63
|
+
### Build
|
|
64
|
+
|
|
65
|
+
- The suite now installs and runs on Ruby 4.0: the optional `tokenizers` gem (whose
|
|
66
|
+
native extension cannot build against the Ruby 4.0 ABI) is gated behind
|
|
67
|
+
`install_if (Ruby < 4.0)`, and `benchmark` (no longer a default gem in 4.0) is
|
|
68
|
+
declared explicitly. Lockfile unchanged.
|
|
69
|
+
|
|
10
70
|
## [1.3.0] - 2026-05-13
|
|
11
71
|
|
|
12
72
|
### Upgrade Notes
|
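The incremental sync and mass-deletion guard described in the 1.4.0 changelog entry above can be sketched roughly as follows. This is a minimal illustration, not the gem's implementation: `plan_sync`, the plain-Hash manifest, and the use of SHA-256 are assumptions made for the example. Only the 30% / 10-entry thresholds come from the changelog text.

```ruby
require 'digest'

# Hypothetical helper (not part of Woods): plan one incremental sync run.
#   manifest  - { uri => content_hash } recorded by the previous run
#   documents - { uri => body } rendered from the current index
def plan_sync(manifest, documents, force_purge: false)
  # Push anything whose hash is new or differs from the manifest entry.
  push = documents.reject { |uri, body| manifest[uri] == Digest::SHA256.hexdigest(body) }.keys
  skip = documents.keys - push
  # Manifest entries with no current document are orphans.
  stale = manifest.keys - documents.keys

  # Thresholds from the changelog: more than 30% of a manifest with >= 10 entries.
  blocked = !force_purge && manifest.size >= 10 && stale.size > manifest.size * 0.30
  { push: push, skip: skip, delete: blocked ? [] : stale, purge_blocked: blocked }
end

sha = ->(body) { Digest::SHA256.hexdigest(body) }
manifest = { 'a.rb' => sha.call('one'), 'b.rb' => sha.call('two'), 'gone.rb' => sha.call('old') }
documents = { 'a.rb' => 'one', 'b.rb' => 'two, edited', 'c.rb' => 'three' }
puts plan_sync(manifest, documents).inspect
```

With an empty manifest every document lands in `push` and nothing is deleted, which matches the "missing manifest degrades to a full re-push" behavior the entry describes.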
data/README.md
CHANGED
|
data/lib/generators/woods/templates/woods.rb.tt
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
# frozen_string_literal: true
|
|
2
2
|
|
|
3
3
|
# Woods configuration
|
|
4
|
-
# Full reference: https://github.com/
|
|
4
|
+
# Full reference: https://github.com/lost-in-the/woods/blob/main/docs/CONFIGURATION_REFERENCE.md
|
|
5
5
|
#
|
|
6
6
|
# Quick-start presets (uncomment one instead of the full block below):
|
|
7
7
|
# Woods.configure_with_preset(:local) # in-memory + Ollama, no external services
|
data/lib/tasks/woods.rake
CHANGED
|
@@ -604,25 +604,46 @@ namespace :woods do
|
|
|
604
604
|
end
|
|
605
605
|
|
|
606
606
|
output_dir = ENV.fetch('WOODS_OUTPUT', config.output_dir)
|
|
607
|
+
# Truthy set, so FLAG=false / FLAG=0 disables rather than silently enabling.
|
|
608
|
+
env_flag = ->(name) { %w[1 true yes].include?(ENV.fetch(name, '').strip.downcase) }
|
|
609
|
+
force_full = env_flag.call('UNBLOCKED_FORCE_FULL_SYNC')
|
|
610
|
+
force_purge = env_flag.call('UNBLOCKED_FORCE_PURGE')
|
|
607
611
|
|
|
608
612
|
puts 'Syncing extraction data to Unblocked...'
|
|
609
613
|
puts " Output dir: #{output_dir}"
|
|
610
614
|
puts " Collection: #{config.unblocked_collection_id}"
|
|
611
615
|
puts " Repo URL: #{config.unblocked_repo_url}"
|
|
616
|
+
puts ' Mode: full re-sync (UNBLOCKED_FORCE_FULL_SYNC set)' if force_full
|
|
612
617
|
puts
|
|
613
618
|
|
|
614
|
-
exporter = Woods::Unblocked::Exporter.new(
|
|
619
|
+
exporter = Woods::Unblocked::Exporter.new(
|
|
620
|
+
index_dir: output_dir,
|
|
621
|
+
force_full: force_full,
|
|
622
|
+
force_purge: force_purge
|
|
623
|
+
)
|
|
615
624
|
stats = exporter.sync_all
|
|
616
625
|
|
|
617
626
|
puts
|
|
618
627
|
puts 'Sync complete!'
|
|
619
628
|
puts " Documents synced: #{stats[:synced]}"
|
|
620
629
|
puts " Documents skipped: #{stats[:skipped]}"
|
|
630
|
+
puts " Documents deleted: #{stats[:deleted]}"
|
|
621
631
|
|
|
622
632
|
if stats[:errors].any?
|
|
623
633
|
puts " Errors: #{stats[:errors].size}"
|
|
624
634
|
stats[:errors].first(5).each { |e| puts " - #{e}" }
|
|
625
635
|
puts " ... and #{stats[:errors].size - 5} more" if stats[:errors].size > 5
|
|
636
|
+
|
|
637
|
+
# Fail the task so CI notices — a printed-but-green run is invisible in
|
|
638
|
+
# post-merge pipelines (a dead token would otherwise stay green forever).
|
|
639
|
+
# Exception: budget exhaustion *with* partial progress is the expected
|
|
640
|
+
# cold-start shape; it converges on the next run.
|
|
641
|
+
budget_only = stats[:errors].all? { |e| e.include?('daily budget exhausted') }
|
|
642
|
+
unless budget_only && stats[:synced].positive?
|
|
643
|
+
puts
|
|
644
|
+
puts 'Sync completed with errors — failing so CI surfaces it.'
|
|
645
|
+
exit 1
|
|
646
|
+
end
|
|
626
647
|
end
|
|
627
648
|
end
|
|
628
649
|
|
|
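The flag parsing and exit decision added to the rake task above can be exercised in isolation. The sketch below restates that logic as two standalone methods. The names `env_flag?` and `sync_failed?` are hypothetical, and the `env` parameter exists only so the example can run without touching the real environment. Note that the task as shown matches on the text `daily budget exhausted` in the collected error strings.

```ruby
# Truthy-set parsing: anything outside the set (including "false" and "0")
# leaves the flag disabled. `env` defaults to ENV; a Hash works for testing.
def env_flag?(name, env = ENV)
  %w[1 true yes].include?(env.fetch(name, '').strip.downcase)
end

# Mirrors the task's exit decision: recorded errors fail the run unless every
# error is a budget-exhaustion message and at least one document was synced.
def sync_failed?(stats)
  return false if stats[:errors].empty?

  budget_only = stats[:errors].all? { |e| e.include?('daily budget exhausted') }
  !(budget_only && stats[:synced].positive?)
end

puts env_flag?('UNBLOCKED_FORCE_PURGE', { 'UNBLOCKED_FORCE_PURGE' => ' Yes ' })
puts sync_failed?({ synced: 40, errors: ['User: daily budget exhausted'] })
```

Budget exhaustion with zero synced documents still fails, since no progress was made on that run.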
data/lib/woods/extractors/controller_extractor.rb
CHANGED
|
@@ -226,10 +226,10 @@ module Woods
|
|
|
226
226
|
|
|
227
227
|
# Parent chain for understanding inherited behavior
|
|
228
228
|
ancestors: controller.ancestors
|
|
229
|
-
|
|
230
|
-
|
|
231
|
-
|
|
232
|
-
|
|
229
|
+
.take_while { |a| a != ActionController::Base && a != ActionController::API }
|
|
230
|
+
.grep(Class)
|
|
231
|
+
.map(&:name)
|
|
232
|
+
.compact,
|
|
233
233
|
|
|
234
234
|
# Concerns included
|
|
235
235
|
included_concerns: extract_included_concerns(controller),
|
|
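To see what the new `ancestors` chain in the hunk above produces, here is a small sketch that runs without Rails. `FrameworkBase` stands in for `ActionController::Base`, and the other class and module names, as well as the `parent_chain` helper, are made up for the example.

```ruby
module Auditable; end
class FrameworkBase; end # stand-in for ActionController::Base
class ApplicationController < FrameworkBase; end

class AdminController < ApplicationController
  include Auditable
end

# Hypothetical helper: same pipeline as the hunk, with the stop classes passed in.
def parent_chain(klass, stop_at:)
  klass.ancestors
       .take_while { |a| !stop_at.include?(a) } # stop at the framework base class
       .grep(Class)                             # drop included modules
       .map(&:name)
       .compact                                 # drop anonymous classes (nil name)
end

puts parent_chain(AdminController, stop_at: [FrameworkBase]).inspect
```

The chain includes the controller itself, because `ancestors` always starts with the receiver.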
data/lib/woods/unblocked/client.rb
CHANGED
|
@@ -3,10 +3,28 @@
|
|
|
3
3
|
require 'json'
|
|
4
4
|
require 'net/http'
|
|
5
5
|
require 'uri'
|
|
6
|
+
require 'woods'
|
|
6
7
|
require_relative 'rate_limiter'
|
|
7
8
|
|
|
8
9
|
module Woods
|
|
9
10
|
module Unblocked
|
|
11
|
+
# API error carrying the HTTP status code, so callers can branch on
|
|
12
|
+
# status (e.g. treat a 404 on delete as "already gone") instead of
|
|
13
|
+
# matching message strings. Subclasses Woods::Error, so existing
|
|
14
|
+
# +rescue Woods::Error+ sites keep working unchanged.
|
|
15
|
+
class ApiError < Woods::Error
|
|
16
|
+
# @return [Integer] HTTP status code of the failed response
|
|
17
|
+
attr_reader :status
|
|
18
|
+
|
|
19
|
+
# @param message [String] Error message
|
|
20
|
+
# @param status [Integer] HTTP status code — required, because callers
|
|
21
|
+
# branch on it (a nil status would silently miss every status check)
|
|
22
|
+
def initialize(message, status:)
|
|
23
|
+
super(message)
|
|
24
|
+
@status = Integer(status)
|
|
25
|
+
end
|
|
26
|
+
end
|
|
27
|
+
|
|
10
28
|
# REST client for the Unblocked API v1.
|
|
11
29
|
#
|
|
12
30
|
# Handles document and collection CRUD with rate limiting, retries,
|
|
@@ -25,6 +43,12 @@ module Woods
|
|
|
25
43
|
BASE_URL = 'https://getunblocked.com/api/v1'
|
|
26
44
|
MAX_RETRIES = 3
|
|
27
45
|
DEFAULT_TIMEOUT = 30
|
|
46
|
+
# Max page size the list endpoint accepts (per API docs).
|
|
47
|
+
PAGE_SIZE = 200
|
|
48
|
+
# Repo-hosted Woods mark, used as the collection icon when none is given.
|
|
49
|
+
# The live API rejects collection creation without an iconUrl (despite
|
|
50
|
+
# the API docs marking it optional), so a working default matters.
|
|
51
|
+
DEFAULT_ICON_URL = 'https://raw.githubusercontent.com/lost-in-the/woods/main/assets/woods-mark-black.svg'
|
|
28
52
|
|
|
29
53
|
# @param api_token [String] Unblocked API token (Personal or Team)
|
|
30
54
|
# @param rate_limiter [RateLimiter] Rate limiter instance
|
|
@@ -60,12 +84,17 @@ module Woods
|
|
|
60
84
|
#
|
|
61
85
|
# @param name [String] Collection name (1-32 chars)
|
|
62
86
|
# @param description [String] Collection description (1-4096 chars)
|
|
63
|
-
# @param icon_url [String, nil]
|
|
87
|
+
# @param icon_url [String, nil] Icon URL. The live API rejects creation
|
|
88
|
+
# with a bare 400 when omitted (despite the API docs marking it
|
|
89
|
+
# optional), so nil falls back to DEFAULT_ICON_URL — the repo-hosted
|
|
90
|
+
# Woods mark.
|
|
64
91
|
# @return [Hash] { "id" => "collection-uuid", "name" => "...", ... }
|
|
65
92
|
def create_collection(name:, description:, icon_url: nil)
|
|
66
|
-
|
|
67
|
-
|
|
68
|
-
|
|
93
|
+
request(:post, 'collections', {
|
|
94
|
+
name: name,
|
|
95
|
+
description: description,
|
|
96
|
+
iconUrl: icon_url || DEFAULT_ICON_URL
|
|
97
|
+
})
|
|
69
98
|
end
|
|
70
99
|
|
|
71
100
|
# List all collections.
|
|
@@ -73,6 +102,10 @@ module Woods
|
|
|
73
102
|
# @return [Array<Hash>] Collection objects
|
|
74
103
|
def list_collections
|
|
75
104
|
result = request(:get, 'collections')
|
|
105
|
+
# The live API returns a bare JSON array; the envelope fallbacks are
|
|
106
|
+
# defensive (calling ['items'] on an Array raises TypeError).
|
|
107
|
+
return result if result.is_a?(Array)
|
|
108
|
+
|
|
76
109
|
result['items'] || result['data'] || [result].flatten.compact
|
|
77
110
|
end
|
|
78
111
|
|
|
@@ -84,6 +117,52 @@ module Woods
|
|
|
84
117
|
request(:delete, "documents/#{document_id}")
|
|
85
118
|
end
|
|
86
119
|
|
|
120
|
+
# List a single page of documents.
|
|
121
|
+
#
|
|
122
|
+
# The endpoint returns a bare JSON array of document metadata (no body):
|
|
123
|
+
# `id, collectionId, title, uri, createdAt, updatedAt`. Pagination is
|
|
124
|
+
# cursor-based via `after`/`before` (opaque cursors); there is no
|
|
125
|
+
# server-side collection filter.
|
|
126
|
+
#
|
|
127
|
+
# @param limit [Integer] Page size (1-200)
|
|
128
|
+
# @param after [String, nil] Opaque forward cursor (typically the last id)
|
|
129
|
+
# @return [Array<Hash>] One page of document metadata
|
|
130
|
+
def list_documents(limit: PAGE_SIZE, after: nil)
|
|
131
|
+
query = "limit=#{limit}"
|
|
132
|
+
query += "&after=#{URI.encode_www_form_component(after)}" if after
|
|
133
|
+
result = request(:get, "documents?#{query}")
|
|
134
|
+
return result if result.is_a?(Array)
|
|
135
|
+
|
|
136
|
+
result['items'] || result['data'] || []
|
|
137
|
+
end
|
|
138
|
+
|
|
139
|
+
# List every document in a collection, paging until exhausted.
|
|
140
|
+
#
|
|
141
|
+
# Filters client-side on `collectionId` since the API has no collection
|
|
142
|
+
# filter. ~5 calls for ~1000 documents; each goes through the rate limiter.
|
|
143
|
+
#
|
|
144
|
+
# @param collection_id [String] Collection UUID to filter to
|
|
145
|
+
# @return [Array<Hash>] All matching document metadata
|
|
146
|
+
def all_documents(collection_id:)
|
|
147
|
+
docs = []
|
|
148
|
+
after = nil
|
|
149
|
+
|
|
150
|
+
loop do
|
|
151
|
+
page = list_documents(limit: PAGE_SIZE, after: after)
|
|
152
|
+
break if page.empty?
|
|
153
|
+
|
|
154
|
+
docs.concat(page)
|
|
155
|
+
break if page.size < PAGE_SIZE
|
|
156
|
+
|
|
157
|
+
after = page.last['id']
|
|
158
|
+
# A full page with no cursor id would refetch page 1 forever —
|
|
159
|
+
# stop with what we have rather than loop against the budget.
|
|
160
|
+
break if after.nil?
|
|
161
|
+
end
|
|
162
|
+
|
|
163
|
+
docs.select { |doc| doc['collectionId'] == collection_id }
|
|
164
|
+
end
|
|
165
|
+
|
|
87
166
|
private
|
|
88
167
|
|
|
89
168
|
def request(method, path, body = nil)
|
|
@@ -155,8 +234,10 @@ module Woods
|
|
|
155
234
|
rescue JSON::ParserError, TypeError
|
|
156
235
|
{ 'message' => response.body&.slice(0, 200) || 'Unknown error' }
|
|
157
236
|
end
|
|
158
|
-
|
|
159
|
-
|
|
237
|
+
# The Unblocked API returns RFC7807-style bodies ({ status, title, detail });
|
|
238
|
+
# older/other paths use message/error. Check all so failures stay legible.
|
|
239
|
+
message = parsed['message'] || parsed['error'] || parsed['detail'] || parsed['title'] || 'Unknown error'
|
|
240
|
+
raise ApiError.new("Unblocked API error #{response.code}: #{message}", status: response.code.to_i)
|
|
160
241
|
end
|
|
161
242
|
end
|
|
162
243
|
end
|
|
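The cursor pagination added to the client above can be tried against an in-memory page source. This sketch follows the same loop shape as `all_documents`, but `collect_all`, `documents_query`, and the block-based page fetcher are hypothetical names introduced here. No HTTP call or rate limiter is involved.

```ruby
require 'uri'

# Build the query string with a URL-encoded cursor, as the hunk does.
def documents_query(limit:, after: nil)
  query = "limit=#{limit}"
  query += "&after=#{URI.encode_www_form_component(after)}" if after
  query
end

# Page through a source until it is exhausted, then filter client-side.
# The block receives (page_size, after_cursor) and returns an Array of Hashes.
def collect_all(page_size:, collection_id:, &fetch_page)
  docs = []
  after = nil

  loop do
    page = fetch_page.call(page_size, after)
    break if page.empty?

    docs.concat(page)
    break if page.size < page_size # short page: nothing left to fetch

    after = page.last['id']
    break if after.nil? # no cursor: stop instead of refetching page 1
  end

  docs.select { |doc| doc['collectionId'] == collection_id }
end

store = (1..5).map { |i| { 'id' => "d#{i}", 'collectionId' => i.odd? ? 'c1' : 'c2' } }
fetch = lambda do |limit, after|
  start = after ? store.index { |d| d['id'] == after } + 1 : 0
  store[start, limit] || []
end
puts collect_all(page_size: 2, collection_id: 'c1', &fetch).map { |d| d['id'] }.inspect
```

Filtering after collection keeps the loop's termination checks based on the raw page size, which is what the short-page test relies on.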
data/lib/woods/unblocked/document_builder.rb
CHANGED
|
@@ -27,23 +27,30 @@ module Woods
|
|
|
27
27
|
def build(unit_data)
|
|
28
28
|
type = unit_data['type']
|
|
29
29
|
identifier = unit_data['identifier']
|
|
30
|
-
file_path = unit_data['file_path']
|
|
31
30
|
|
|
32
31
|
{
|
|
33
32
|
title: "#{identifier} (#{type})",
|
|
34
33
|
body: build_body(unit_data),
|
|
35
|
-
uri:
|
|
34
|
+
uri: uri_for(unit_data)
|
|
36
35
|
}
|
|
37
36
|
end
|
|
38
37
|
|
|
39
|
-
|
|
40
|
-
|
|
41
|
-
|
|
38
|
+
# The citation URI for a unit (GitHub blob URL, or the repo root when the
|
|
39
|
+
# unit has no file_path). Public so callers can compute a unit's URI
|
|
40
|
+
# cheaply — e.g. to build the set of currently-existing URIs — without
|
|
41
|
+
# building the full document body.
|
|
42
|
+
#
|
|
43
|
+
# @param unit_data [Hash] Parsed unit JSON (needs 'file_path')
|
|
44
|
+
# @return [String] Citation URI
|
|
45
|
+
def uri_for(unit_data)
|
|
46
|
+
file_path = unit_data['file_path']
|
|
42
47
|
return @repo_url unless file_path
|
|
43
48
|
|
|
44
49
|
"#{@repo_url}/blob/main/#{file_path}"
|
|
45
50
|
end
|
|
46
51
|
|
|
52
|
+
private
|
|
53
|
+
|
|
47
54
|
def build_body(unit_data)
|
|
48
55
|
type = unit_data['type']
|
|
49
56
|
body = case type
|
|
@@ -125,7 +132,9 @@ module Woods
|
|
|
125
132
|
dep = a.dig('options', 'dependent')
|
|
126
133
|
dep ? "#{name} (#{dep})" : name
|
|
127
134
|
end
|
|
128
|
-
|
|
135
|
+
# Sorted so the body is a function of association content, not order
|
|
136
|
+
# (the exporter hashes this body to detect changes).
|
|
137
|
+
lines << "**#{type}:** #{targets.sort.join(', ')}"
|
|
129
138
|
end
|
|
130
139
|
|
|
131
140
|
lines.join("\n")
|
|
@@ -136,7 +145,7 @@ module Woods
|
|
|
136
145
|
return nil if deps.empty?
|
|
137
146
|
|
|
138
147
|
grouped = deps.group_by { |d| d['type'] }
|
|
139
|
-
summary_parts = grouped.map { |type, items| "#{items.size} #{type}s" }
|
|
148
|
+
summary_parts = grouped.sort_by { |type, _| type.to_s }.map { |type, items| "#{items.size} #{type}s" }
|
|
140
149
|
|
|
141
150
|
lines = ["## Dependents (#{deps.size} units)"]
|
|
142
151
|
lines << summary_parts.join(', ')
|
|
@@ -160,9 +169,9 @@ module Woods
|
|
|
160
169
|
return nil if controllers.empty? && graphql.empty?
|
|
161
170
|
|
|
162
171
|
lines = ['## Entry Points']
|
|
163
|
-
lines << "**Controllers:** #{controllers.map { |c| c['identifier'] }.join(', ')}" if controllers.any?
|
|
164
|
-
lines << "**GraphQL:** #{graphql.map { |g| g['identifier'] }.join(', ')}" if graphql.any?
|
|
165
|
-
lines << "**Jobs:** #{jobs.map { |j| j['identifier'] }.join(', ')}" if jobs.any?
|
|
172
|
+
lines << "**Controllers:** #{controllers.map { |c| c['identifier'] }.sort.join(', ')}" if controllers.any?
|
|
173
|
+
lines << "**GraphQL:** #{graphql.map { |g| g['identifier'] }.sort.join(', ')}" if graphql.any?
|
|
174
|
+
lines << "**Jobs:** #{jobs.map { |j| j['identifier'] }.sort.join(', ')}" if jobs.any?
|
|
166
175
|
|
|
167
176
|
lines.join("\n")
|
|
168
177
|
end
|
|
@@ -172,15 +181,16 @@ module Woods
|
|
|
172
181
|
|
|
173
182
|
enums = meta['enums']
|
|
174
183
|
if enums.is_a?(Hash) && enums.any?
|
|
175
|
-
enum_strs = enums.
|
|
184
|
+
enum_strs = enums.sort_by { |name, _| name.to_s }
|
|
185
|
+
.map { |name, values| "#{name} (#{format_enum_values(values)})" }
|
|
176
186
|
parts << "**Enums:** #{enum_strs.join('; ')}"
|
|
177
187
|
end
|
|
178
188
|
|
|
179
189
|
scopes = meta['scopes']
|
|
180
|
-
parts << "**Scopes:** #{scopes.map { |s| s['name'] }.join(', ')}" if scopes.is_a?(Array) && scopes.any?
|
|
190
|
+
parts << "**Scopes:** #{scopes.map { |s| s['name'] }.sort.join(', ')}" if scopes.is_a?(Array) && scopes.any?
|
|
181
191
|
|
|
182
192
|
concerns = meta['inlined_concerns']
|
|
183
|
-
parts << "**Concerns:** #{concerns.join(', ')}" if concerns.is_a?(Array) && concerns.any?
|
|
193
|
+
parts << "**Concerns:** #{concerns.sort.join(', ')}" if concerns.is_a?(Array) && concerns.any?
|
|
184
194
|
|
|
185
195
|
callbacks = meta['callbacks']
|
|
186
196
|
if callbacks.is_a?(Array) && callbacks.any?
|
|
@@ -200,8 +210,8 @@ module Woods
|
|
|
200
210
|
return nil if jobs.empty? && mailers.empty?
|
|
201
211
|
|
|
202
212
|
lines = ['## Side Effects']
|
|
203
|
-
lines << "**Jobs:** #{jobs.map { |j| j['identifier'] }.join(', ')}" if jobs.any?
|
|
204
|
-
lines << "**Mailers:** #{mailers.map { |m| m['identifier'] }.join(', ')}" if mailers.any?
|
|
213
|
+
lines << "**Jobs:** #{jobs.map { |j| j['identifier'] }.sort.join(', ')}" if jobs.any?
|
|
214
|
+
lines << "**Mailers:** #{mailers.map { |m| m['identifier'] }.sort.join(', ')}" if mailers.any?
|
|
205
215
|
|
|
206
216
|
lines.join("\n")
|
|
207
217
|
end
|
|
@@ -229,18 +239,21 @@ module Woods
|
|
|
229
239
|
routes = meta['routes']
|
|
230
240
|
return nil unless routes.is_a?(Hash) && routes.any?
|
|
231
241
|
|
|
232
|
-
|
|
242
|
+
route_lines = []
|
|
233
243
|
routes.each do |action, route_list|
|
|
234
244
|
next unless route_list.is_a?(Array)
|
|
235
245
|
|
|
236
246
|
route_list.each do |route|
|
|
237
247
|
next unless route.is_a?(Hash)
|
|
238
248
|
|
|
239
|
-
|
|
249
|
+
route_lines << "- `#{route['verb']} #{route['path']}` (#{action})"
|
|
240
250
|
end
|
|
241
251
|
end
|
|
242
252
|
|
|
243
|
-
|
|
253
|
+
# Sort before truncating so the kept subset is stable across runs.
|
|
254
|
+
return nil if route_lines.empty?
|
|
255
|
+
|
|
256
|
+
(['## Routes'] + route_lines.sort.first(20)).join("\n")
|
|
244
257
|
end
|
|
245
258
|
|
|
246
259
|
def controller_dependencies(unit)
|
|
@@ -250,7 +263,7 @@ module Woods
|
|
|
250
263
|
models = deps.select { |d| d['type'] == 'model' }.map { |d| d['target'] }
|
|
251
264
|
return nil if models.empty?
|
|
252
265
|
|
|
253
|
-
"## Dependencies\n**Models:** #{models.join(', ')}"
|
|
266
|
+
"## Dependencies\n**Models:** #{models.sort.join(', ')}"
|
|
254
267
|
end
|
|
255
268
|
|
|
256
269
|
def controller_dependents(unit)
|
|
@@ -258,7 +271,7 @@ module Woods
|
|
|
258
271
|
views = deps.select { |d| d['type'] == 'view_template' }
|
|
259
272
|
return nil if views.empty?
|
|
260
273
|
|
|
261
|
-
"## Views\n#{views.map { |v| "- `#{v['identifier']}`" }.first(10).join("\n")}"
|
|
274
|
+
"## Views\n#{views.map { |v| "- `#{v['identifier']}`" }.sort.first(10).join("\n")}"
|
|
262
275
|
end
|
|
263
276
|
|
|
264
277
|
# ── GraphQL formatting ───────────────────────────────────────────
|
|
@@ -271,7 +284,7 @@ module Woods
|
|
|
271
284
|
|
|
272
285
|
deps = unit['dependencies'] || []
|
|
273
286
|
models = deps.select { |d| d['type'] == 'model' }.map { |d| d['target'] }
|
|
274
|
-
sections << "**Models:** #{models.join(', ')}" if models.any?
|
|
287
|
+
sections << "**Models:** #{models.sort.join(', ')}" if models.any?
|
|
275
288
|
|
|
276
289
|
dependents = unit['dependents'] || []
|
|
277
290
|
sections << "**Referenced by:** #{dependents.size} units" if dependents.any?
|
|
@@ -292,14 +305,15 @@ module Woods
|
|
|
292
305
|
deps = unit['dependencies'] || []
|
|
293
306
|
if deps.any?
|
|
294
307
|
by_type = deps.group_by { |d| d['type'] }
|
|
295
|
-
dep_parts = by_type.
|
|
308
|
+
dep_parts = by_type.sort_by { |type, _| type.to_s }
|
|
309
|
+
.map { |type, items| "#{type}: #{items.map { |d| d['target'] }.sort.join(', ')}" }
|
|
296
310
|
sections << "## Dependencies\n#{dep_parts.join("\n")}"
|
|
297
311
|
end
|
|
298
312
|
|
|
299
313
|
dependents = unit['dependents'] || []
|
|
300
314
|
if dependents.any?
|
|
301
315
|
grouped = dependents.group_by { |d| d['type'] }
|
|
302
|
-
summary = grouped.map { |type, items| "#{items.size} #{type}s" }
|
|
316
|
+
summary = grouped.sort_by { |type, _| type.to_s }.map { |type, items| "#{items.size} #{type}s" }
|
|
303
317
|
sections << "## Dependents (#{dependents.size})\n#{summary.join(', ')}"
|
|
304
318
|
end
|
|
305
319
|
|
|
@@ -317,9 +331,9 @@ module Woods
|
|
|
317
331
|
end
|
|
318
332
|
|
|
319
333
|
def format_callbacks(callbacks)
|
|
320
|
-
|
|
321
|
-
|
|
322
|
-
|
|
334
|
+
# Sort before truncating so both the selection and order are stable
|
|
335
|
+
# regardless of input order (the body is hashed for change detection).
|
|
336
|
+
callbacks.map { |cb| "#{cb['type']}: #{cb['filter']}" }.sort.first(5).join(', ')
|
|
323
337
|
end
|
|
324
338
|
end
|
|
325
339
|
end
|
|
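The reason for the `.sort` calls throughout the hunks above is that the rendered body is hashed for change detection, so input order must not leak into the output. A minimal sketch of the effect follows. `render_scopes` and `body_digest` are illustrative names, and SHA-256 is an assumption, since the exporter's own fingerprint method is not shown in this part of the diff.

```ruby
require 'digest'

# Render a scopes line the way the builder does, with sorting made optional
# so the difference is visible.
def render_scopes(scopes, sorted:)
  names = scopes.map { |s| s['name'] }
  names = names.sort if sorted
  "**Scopes:** #{names.join(', ')}"
end

# Assumed digest for illustration only.
def body_digest(body)
  Digest::SHA256.hexdigest(body)
end

run_a = [{ 'name' => 'recent' }, { 'name' => 'active' }]
run_b = run_a.reverse # same content, different extraction order

puts body_digest(render_scopes(run_a, sorted: true)) == body_digest(render_scopes(run_b, sorted: true))
puts body_digest(render_scopes(run_a, sorted: false)) == body_digest(render_scopes(run_b, sorted: false))
```

Without sorting, two runs over an unchanged unit could produce different digests and trigger a needless re-push.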
data/lib/woods/unblocked/exporter.rb
CHANGED
|
@@ -1,9 +1,12 @@
|
|
|
1
1
|
# frozen_string_literal: true
|
|
2
2
|
|
|
3
|
+
require 'set'
|
|
4
|
+
require 'digest'
|
|
3
5
|
require 'woods'
|
|
4
6
|
require_relative 'client'
|
|
5
7
|
require_relative 'rate_limiter'
|
|
6
8
|
require_relative 'document_builder'
|
|
9
|
+
require_relative 'sync_manifest'
|
|
7
10
|
|
|
8
11
|
module Woods
|
|
9
12
|
module Unblocked
|
|
@@ -11,16 +14,27 @@ module Woods
|
|
|
11
14
|
#
|
|
12
15
|
# Reads extraction output from disk via IndexReader, converts units to
|
|
13
16
|
# condensed Markdown documents, and pushes via the Unblocked Documents API.
|
|
14
|
-
#
|
|
17
|
+
# Syncs are incremental: a {SyncManifest} records the content hash and
|
|
18
|
+
# remote document_id of everything last pushed, so each run only PUTs
|
|
19
|
+
# new/changed documents, skips unchanged ones, and deletes documents whose
|
|
20
|
+
# source unit has disappeared. Documents are upserted by URI, so a missing
|
|
21
|
+
# manifest (first run / CI cache miss) degrades to a correct full sync.
|
|
15
22
|
#
|
|
16
23
|
# @example
|
|
17
24
|
# exporter = Exporter.new(index_dir: "tmp/woods")
|
|
18
25
|
# stats = exporter.sync_all
|
|
19
|
-
# # => { synced:
|
|
26
|
+
# # => { synced: 12, skipped: 928, deleted: 1, errors: [] }
|
|
20
27
|
#
|
|
21
28
|
class Exporter
|
|
22
29
|
MAX_ERRORS = 100
|
|
23
30
|
|
|
31
|
+
# Mass-deletion guard: refuse to purge when more than this fraction of a
|
|
32
|
+
# manifest of at least PURGE_GUARD_MIN_DOCS entries would be deleted —
|
|
33
|
+
# the signature of a sync run against a partial index. Override with
|
|
34
|
+
# force_purge.
|
|
35
|
+
PURGE_GUARD_FRACTION = 0.30
|
|
36
|
+
PURGE_GUARD_MIN_DOCS = 10
|
|
37
|
+
|
|
24
38
|
# Unit types to sync, in priority order.
|
|
25
39
|
# All units are synced for these types.
|
|
26
40
|
FULL_SYNC_TYPES = %w[
|
|
@@ -39,9 +53,13 @@ module Woods
|
|
|
39
53
|
# @param config [Configuration] Woods configuration (default: global config)
|
|
40
54
|
# @param client [Client, nil] Unblocked API client (auto-created from config if nil)
|
|
41
55
|
# @param reader [Object, nil] IndexReader instance (auto-created if nil)
|
|
56
|
+
# @param manifest [SyncManifest, nil] Sync manifest (auto-created under index_dir if nil)
|
|
57
|
+
# @param force_full [Boolean] Re-push every unit, ignoring the unchanged check
|
|
58
|
+
# @param force_purge [Boolean] Bypass the mass-deletion guard
|
|
42
59
|
# @param output [IO] Progress output stream (default: $stdout)
|
|
43
60
|
# @raise [ConfigurationError] if required config is missing
|
|
44
|
-
def initialize(index_dir:, config: Woods.configuration, client: nil, reader: nil,
|
|
61
|
+
def initialize(index_dir:, config: Woods.configuration, client: nil, reader: nil,
|
|
62
|
+
manifest: nil, force_full: false, force_purge: false, output: $stdout)
|
|
45
63
|
@collection_id = config.unblocked_collection_id
|
|
46
64
|
raise ConfigurationError, 'unblocked_collection_id is required' unless @collection_id
|
|
47
65
|
|
|
@@ -57,18 +75,31 @@ module Woods
|
|
|
57
75
|
@client = client || Client.new(api_token: api_token, rate_limiter: limiter)
|
|
58
76
|
@reader = reader || build_reader(index_dir)
|
|
59
77
|
@builder = DocumentBuilder.new(repo_url: repo_url)
|
|
78
|
+
@manifest = manifest || build_manifest(index_dir)
|
|
79
|
+
@force_full = force_full
|
|
80
|
+
@force_purge = force_purge
|
|
60
81
|
@output = output
|
|
82
|
+
# Initialized here as well as in sync_all so the public sync_type /
|
|
83
|
+
# sync_type_partial methods work standalone (track_uri needs them).
|
|
84
|
+
@current_uris = Set.new
|
|
85
|
+
@budget_exhausted = false
|
|
61
86
|
end
|
|
62
87
|
|
|
63
88
|
# Sync all configured unit types to the Unblocked collection.
|
|
64
89
|
#
|
|
65
|
-
# @return [Hash] { synced
|
|
90
|
+
# @return [Hash] { synced:, skipped:, deleted:, errors: }
|
|
66
91
|
def sync_all
|
|
92
|
+
@current_uris = Set.new
|
|
93
|
+
@budget_exhausted = false
|
|
94
|
+
reconcile_from_remote if @manifest.empty?
|
|
95
|
+
|
|
67
96
|
synced = 0
|
|
68
97
|
skipped = 0
|
|
69
98
|
errors = []
|
|
70
99
|
|
|
71
100
|
FULL_SYNC_TYPES.each do |type|
|
|
101
|
+
break if @budget_exhausted
|
|
102
|
+
|
|
72
103
|
result = sync_type(type)
|
|
73
104
|
synced += result[:synced]
|
|
74
105
|
skipped += result[:skipped]
|
|
@@ -76,19 +107,24 @@ module Woods
|
|
|
76
107
|
end
|
|
77
108
|
|
|
78
109
|
PARTIAL_SYNC_TYPES.each do |type, max_count|
|
|
110
|
+
break if @budget_exhausted
|
|
111
|
+
|
|
79
112
|
result = sync_type_partial(type, max_count)
|
|
80
113
|
synced += result[:synced]
|
|
81
114
|
skipped += result[:skipped]
|
|
82
115
|
errors.concat(result[:errors])
|
|
83
116
|
end
|
|
84
117
|
|
|
85
|
-
|
|
118
|
+
deleted = @budget_exhausted ? 0 : purge_stale(errors)
|
|
119
|
+
{ synced: synced, skipped: skipped, deleted: deleted, errors: cap_errors(errors) }
|
|
120
|
+
ensure
|
|
121
|
+
save_manifest
|
|
86
122
|
end
|
|
87
123
|
|
|
88
124
|
# Sync all units of a given type.
|
|
89
125
|
#
|
|
90
126
|
# @param type [String] Unit type (e.g. "model", "controller")
|
|
91
|
-
# @return [Hash] { synced
|
|
127
|
+
# @return [Hash] { synced:, skipped:, errors: }
|
|
92
128
|
def sync_type(type)
|
|
93
129
|
units = @reader.list_units(type: type)
|
|
94
130
|
log " #{type}: #{units.size} units"
|
|
@@ -100,7 +136,7 @@ module Woods
|
|
|
100
136
|
#
|
|
101
137
|
# @param type [String] Unit type
|
|
102
138
|
# @param max_count [Integer] Maximum units to sync
|
|
103
|
-
# @return [Hash] { synced
|
|
139
|
+
# @return [Hash] { synced:, skipped:, errors: }
|
|
104
140
|
def sync_type_partial(type, max_count)
|
|
105
141
|
units = @reader.list_units(type: type)
|
|
106
142
|
return empty_stats if units.empty?
|
|
@@ -114,8 +150,14 @@ module Woods
|
|
|
114
150
|
{ entry: entry, data: data, dep_count: dep_count }
|
|
115
151
|
end
|
|
116
152
|
|
|
153
|
+
# Every unit of this type still exists — track its URI so partial units
|
|
154
|
+
# that fall *out* of the top-N are never mistaken for deletions.
|
|
155
|
+
units_with_data.each { |u| track_uri(u[:data]) }
|
|
156
|
+
|
|
117
157
|
top_units = units_with_data.sort_by { |u| -u[:dep_count] }.first(max_count)
|
|
118
|
-
|
|
158
|
+
# Count against what was actually synced — units.size includes entries
|
|
159
|
+
# whose unit data was missing (dropped by the filter_map above).
|
|
160
|
+
skipped_count = units.size - top_units.size
|
|
119
161
|
|
|
120
162
|
log " #{type}: #{top_units.size}/#{units.size} units (top by dependents)"
|
|
121
163
|
|
|
@@ -138,13 +180,19 @@ module Woods
|
|
|
138
180
|
next
|
|
139
181
|
end
|
|
140
182
|
|
|
141
|
-
|
|
142
|
-
|
|
183
|
+
track_uri(unit_data)
|
|
184
|
+
if push_document(unit_data) == :skipped
|
|
185
|
+
skipped += 1
|
|
186
|
+
else
|
|
187
|
+
synced += 1
|
|
188
|
+
end
|
|
143
189
|
rescue Woods::Error => e
|
|
144
190
|
errors << "#{entry['identifier']}: #{e.message}"
|
|
145
|
-
break if e
|
|
191
|
+
break if note_budget_exhaustion(e)
|
|
146
192
|
rescue StandardError => e
|
|
147
|
-
|
|
193
|
+
# Include the class — "undefined method for nil" without it is
|
|
194
|
+
# unactionable in CI logs.
|
|
195
|
+
errors << "#{entry['identifier']}: #{e.class}: #{e.message}"
|
|
148
196
|
end
|
|
149
197
|
|
|
150
198
|
{ synced: synced, skipped: skipped, errors: errors }
|
|
@@ -156,26 +204,177 @@ module Woods
         errors = []
 
         entries_with_data.each do |entry, unit_data|
-
-
+          track_uri(unit_data)
+          if push_document(unit_data) == :skipped
+            skipped += 1
+          else
+            synced += 1
+          end
         rescue Woods::Error => e
           errors << "#{entry['identifier']}: #{e.message}"
-          break if e
+          break if note_budget_exhaustion(e)
         rescue StandardError => e
-
+          # Include the class — "undefined method for nil" without it is
+          # unactionable in CI logs.
+          errors << "#{entry['identifier']}: #{e.class}: #{e.message}"
         end
 
         { synced: synced, skipped: skipped, errors: errors }
       end
 
+      # Build the document, skip it if the manifest says it is unchanged,
+      # otherwise upsert it and record the new hash + remote document_id.
+      #
+      # @return [Symbol] :synced or :skipped
       def push_document(unit_data)
+        # No file_path → the URI falls back to the bare repo URL, which every
+        # such unit would share: they'd overwrite each other remotely and
+        # ping-pong the manifest hash forever. Skip them.
+        return :skipped unless unit_data['file_path']
+
         doc = @builder.build(unit_data)
-
+        # An empty body means the credential scrub failed closed (the builders
+        # always emit at least a header). Upserting it would overwrite a good
+        # remote document with nothing — error out and leave the remote as-is.
+        if doc[:body].nil? || doc[:body].empty?
+          raise Woods::ExtractionError, 'document body empty (credential scrub failure?) — push skipped'
+        end
+
+        hash = fingerprint(doc)
+        return :skipped if !@force_full && @manifest.unchanged?(doc[:uri], hash)
+
+        response = @client.put_document(
           collection_id: @collection_id,
           title: doc[:title],
           body: doc[:body],
           uri: doc[:uri]
         )
+        document_id = (response['id'] if response.is_a?(Hash)) || @manifest.document_id_for(doc[:uri])
+        @manifest.record(uri: doc[:uri], hash: hash, document_id: document_id)
+        :synced
+      end
+
+      # Delete remote documents whose source unit no longer exists. Failures
+      # are appended to +errors+ — a delete that fails silently every run is
+      # how a collection rots while "deleted: 0" looks normal.
+      #
+      # @param errors [Array<String>] sink for delete failures
+      # @return [Integer] number of documents deleted
+      def purge_stale(errors)
+        stale = @manifest.stale_uris(@current_uris)
+        return 0 if stale.empty?
+        return 0 if guard_blocks_purge?(stale)
+
+        resolve_missing_document_ids(stale)
+
+        deleted = 0
+        stale.each do |uri|
+          document_id = @manifest.document_id_for(uri)
+          next unless document_id
+
+          @client.delete_document(document_id: document_id)
+          @manifest.forget(uri)
+          deleted += 1
+        rescue ApiError => e
+          if e.status == 404
+            # Already gone remotely — goal state reached, drop the entry
+            # rather than retrying every run.
+            @manifest.forget(uri)
+          else
+            errors << "delete #{uri}: #{e.message}"
+          end
+        rescue Woods::Error => e
+          break if note_budget_exhaustion(e)
+
+          errors << "delete #{uri}: #{e.message}"
+        rescue StandardError => e
+          # Entry stays in the manifest so a later run retries the delete —
+          # but surface the failure so systematic breakage is visible.
+          errors << "delete #{uri}: #{e.class}: #{e.message}"
+        end
+        deleted
+      end
+
+      # A manifest entry can carry a nil document_id (e.g. the PUT response
+      # body was empty). Those entries would be permanently undeletable, so
+      # before purging, make one bounded all_documents sweep to resolve ids.
+      # Best-effort: unresolved entries are simply skipped by the purge loop.
+      def resolve_missing_document_ids(stale)
+        missing = stale.select { |uri| @manifest.document_id_for(uri).nil? }
+        return if missing.empty?
+
+        ids_by_uri = @client.all_documents(collection_id: @collection_id)
+                            .to_h { |doc| [doc['uri'], doc['id']] }
+        missing.each do |uri|
+          id = ids_by_uri[uri]
+          @manifest.record(uri: uri, hash: nil, document_id: id) if id
+        end
+      rescue StandardError => e
+        log " id resolution skipped (#{e.message})"
+      end
+
+      # True when purging +stale+ would delete too large a fraction of the
+      # manifest — the signature of running against a partial index. The floor
+      # (PURGE_GUARD_MIN_DOCS) keeps small collections deletable.
+      def guard_blocks_purge?(stale)
+        return false if @force_purge
+
+        size = @manifest.size
+        return false if size < PURGE_GUARD_MIN_DOCS
+
+        fraction = stale.size.to_f / size
+        return false unless fraction > PURGE_GUARD_FRACTION
+
+        log " WARNING: refusing to delete #{stale.size} of #{size} documents " \
+            "(#{(fraction * 100).round}% > #{(PURGE_GUARD_FRACTION * 100).to_i}% — likely a partial index). " \
+            'Set UNBLOCKED_FORCE_PURGE=1 to override.'
+        true
+      end
+
+      # Seed the manifest from the remote collection when we have no local
+      # state (first run / CI cache miss). The list endpoint returns no body,
+      # so hashes are nil (everything re-pushes), but recovering document_ids
+      # lets this run still purge orphaned documents.
+      #
+      # Auth failures re-raise: a 401/403 here dooms every subsequent call,
+      # and "proceeding with full sync" would burn the whole daily budget on
+      # guaranteed failures.
+      def reconcile_from_remote
+        @client.all_documents(collection_id: @collection_id).each do |doc|
+          uri = doc['uri']
+          next unless uri
+
+          @manifest.record(uri: uri, hash: nil, document_id: doc['id'])
+        end
+      rescue ApiError => e
+        raise if [401, 403].include?(e.status)
+
+        log " reconcile skipped (#{e.message}) — proceeding with full sync"
+      rescue StandardError => e
+        log " reconcile skipped (#{e.message}) — proceeding with full sync"
+      end
+
+      def track_uri(unit_data)
+        # Units without a file_path are never pushed (see push_document), so
+        # their fallback repo-root URI must not be marked current either — a
+        # stale repo-root document from before this guard should purge.
+        return unless unit_data['file_path']
+
+        @current_uris << @builder.uri_for(unit_data)
+      end
+
+      def fingerprint(doc)
+        Digest::SHA256.hexdigest("#{doc[:title]}\n#{doc[:body]}")
+      end
+
+      # Records whether an error was a budget-exhaustion stop. Returns true when
+      # it was, so callers can break out of their loop. Class check first; the
+      # message match remains as a fallback for injected clients that raise
+      # plain Woods::Error.
+      def note_budget_exhaustion(error)
+        return false unless error.is_a?(BudgetExhaustedError) || error.message.include?('daily budget exhausted')
+
+        @budget_exhausted = true
       end
 
       def build_reader(index_dir)
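The purge guard in the hunk above reduces to a small piece of arithmetic, sketched here on its own. The values of `PURGE_GUARD_MIN_DOCS` and `PURGE_GUARD_FRACTION` are not visible in this excerpt, so the defaults below (`min_docs: 20`, `fraction: 0.5`) are assumptions chosen purely for illustration, and `purge_blocked?` is a hypothetical free-standing helper rather than the exporter's method.

```ruby
# Returns true when deleting stale_count of manifest_size documents should be refused.
# min_docs and fraction are ASSUMED values, not the gem's real constants.
def purge_blocked?(stale_count, manifest_size, force: false, min_docs: 20, fraction: 0.5)
  return false if force                        # explicit override always wins
  return false if manifest_size < min_docs     # small collections stay deletable

  # Strictly greater than the threshold, matching `fraction > PURGE_GUARD_FRACTION`.
  stale_count.to_f / manifest_size > fraction
end

small_collection = purge_blocked?(9, 10)                 # below the floor
partial_index    = purge_blocked?(60, 100)               # 60% of the manifest is stale
at_threshold     = purge_blocked?(50, 100)               # exactly 50%, not above it
forced           = purge_blocked?(60, 100, force: true)  # override set
```

Under these assumed values, only the `partial_index` case is blocked. A run against a truncated index marks most of the manifest stale at once, which is the situation the guard is meant to catch.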
@@ -183,6 +382,23 @@ module Woods
         Woods::MCP::IndexReader.new(index_dir)
       end
 
+      # Persist the manifest, downgrading failures to a warning: losing the
+      # manifest only costs a full re-check next run, which must not turn an
+      # otherwise-successful sync into a crash (this runs from an ensure, where
+      # a raise would also mask any in-flight exception).
+      def save_manifest
+        @manifest.save
+      rescue StandardError => e
+        log " WARNING: sync manifest not persisted (#{e.message}) — next run will re-push all documents"
+      end
+
+      def build_manifest(index_dir)
+        SyncManifest.new(
+          path: File.join(index_dir, 'unblocked_sync_manifest.json'),
+          collection_id: @collection_id
+        )
+      end
+
       def empty_stats
         { synced: 0, skipped: 0, errors: [] }
       end
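The comment on `save_manifest` notes that it runs from an `ensure`, where a raise would mask an in-flight exception. A minimal sketch of that behaviour follows. `TinyManifest` and `run_sync` are hypothetical names, and the unwritable path is contrived to force the save to fail.

```ruby
require 'json'
require 'tmpdir'

# Hypothetical minimal manifest; save raises when the parent directory is missing.
class TinyManifest
  def initialize(path)
    @path = path
  end

  def save
    File.write(@path, JSON.generate('documents' => {}))
  end
end

# Runs the block, then always tries to persist the manifest.
def run_sync(manifest, warnings)
  yield
ensure
  begin
    manifest.save
  rescue StandardError => e
    # Downgrade to a warning: raising here would replace any in-flight exception.
    warnings << "sync manifest not persisted (#{e.class})"
  end
end

base = Dir.mktmpdir
manifest = TinyManifest.new(File.join(base, 'missing-dir', 'manifest.json'))
warnings = []

ok_result = run_sync(manifest, warnings) { :ok }

propagated = begin
  run_sync(manifest, warnings) { raise ArgumentError, 'boom' }
rescue ArgumentError => e
  e.message
end
```

In both calls the save fails and is recorded as a warning. The first call still returns the block's value, and the second still surfaces the original `ArgumentError` rather than the file error.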
@@ -1,7 +1,15 @@
 # frozen_string_literal: true
 
+require 'woods'
+
 module Woods
   module Unblocked
+    # Raised when the daily API call budget is exhausted. Subclasses
+    # Woods::Error so existing +rescue Woods::Error+ sites keep working;
+    # callers that need to branch on exhaustion rescue this class instead of
+    # matching the message string.
+    class BudgetExhaustedError < Woods::Error; end
+
     # Daily budget-based rate limiter for the Unblocked API (1000 calls/day).
     #
     # Unlike Notion's per-second throttling, Unblocked limits by daily call count.
@@ -35,13 +43,13 @@ module Woods
       #
       # @yield The API call to execute
       # @return [Object] The block's return value
-      # @raise [
+      # @raise [BudgetExhaustedError] if daily budget is exhausted
       def track
         raise ArgumentError, 'block required' unless block_given?
 
         @mutex.synchronize do
           if @calls_today >= @daily_budget
-            raise
+            raise BudgetExhaustedError,
                   "Unblocked API daily budget exhausted (#{@daily_budget} calls). " \
                   'Budget resets at midnight PST. Use UNBLOCKED_DAILY_BUDGET to adjust.'
           end
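The reason for subclassing `Woods::Error` can be shown with a small stand-alone sketch. Every name here (`DemoError`, `DemoBudgetExhaustedError`, `DemoLimiter`) is a hypothetical stand-in, and how the sketch increments its counter is an assumption, since that part of the real limiter is not shown in this excerpt.

```ruby
# Hypothetical stand-ins for Woods::Error and BudgetExhaustedError.
class DemoError < StandardError; end
class DemoBudgetExhaustedError < DemoError; end

# Simplified daily-budget limiter; the counting strategy is an assumption.
class DemoLimiter
  def initialize(daily_budget)
    @daily_budget = daily_budget
    @calls_today = 0
    @mutex = Mutex.new
  end

  def track
    raise ArgumentError, 'block required' unless block_given?

    @mutex.synchronize do
      if @calls_today >= @daily_budget
        raise DemoBudgetExhaustedError, "daily budget exhausted (#{@daily_budget} calls)"
      end

      @calls_today += 1
    end
    yield
  end
end

limiter = DemoLimiter.new(2)
results = []
3.times do |i|
  results << limiter.track { "call #{i}" }
rescue DemoError => e
  # An existing parent-class rescue still catches the new subclass;
  # the class check replaces matching on the message string.
  results << (e.is_a?(DemoBudgetExhaustedError) ? :budget_exhausted : :other_error)
end
```

The third call exceeds the budget of two, and the existing `rescue DemoError` site still catches it while being able to tell exhaustion apart from other errors.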
@@ -0,0 +1,135 @@
+# frozen_string_literal: true
+
+require 'json'
+require 'fileutils'
+
+module Woods
+  module Unblocked
+    # Tracks what was last pushed to an Unblocked collection so a sync can
+    # skip unchanged documents, re-push changed ones, and delete orphans.
+    #
+    # The manifest is the local source of truth for change detection: each
+    # entry records the content hash of the document we last pushed for a URI
+    # plus the remote +document_id+ (needed for deletes). Persisted as JSON
+    # alongside the extraction output and restored across CI runs via the CI
+    # provider's cache. A missing or corrupt file degrades to "everything is
+    # new" — a correct (if expensive) full sync that rebuilds the manifest.
+    #
+    # Modeled on the embedding indexer's checkpoint (load JSON → compare
+    # per-key hash → save JSON).
+    #
+    # @example
+    #   manifest = SyncManifest.new(path: "tmp/woods/unblocked_sync_manifest.json",
+    #                               collection_id: "col-uuid")
+    #   manifest.unchanged?(uri, hash) # => false on first run
+    #   manifest.record(uri:, hash:, document_id:)
+    #   manifest.save
+    #
+    class SyncManifest
+      VERSION = 1
+
+      # @param path [String] JSON file path for the manifest
+      # @param collection_id [String] Target collection UUID — a stored manifest
+      #   for a *different* collection is discarded (cache-key reuse guard).
+      def initialize(path:, collection_id:)
+        @path = path
+        @collection_id = collection_id
+        @documents = load
+      end
+
+      # @return [Boolean] true when no documents are recorded
+      def empty?
+        @documents.empty?
+      end
+
+      # @param uri [String] Document URI
+      # @param hash [String] Content hash of the document we would push now
+      # @return [Boolean] true when the recorded hash matches (safe to skip)
+      def unchanged?(uri, hash)
+        entry = @documents[uri]
+        !entry.nil? && entry['hash'] == hash
+      end
+
+      # Record (or update) what we pushed for a URI.
+      #
+      # @param uri [String] Document URI
+      # @param hash [String, nil] Content hash pushed (nil forces a future re-push)
+      # @param document_id [String, nil] Remote document UUID (for later deletes)
+      def record(uri:, hash:, document_id:)
+        @documents[uri] = { 'hash' => hash, 'document_id' => document_id }
+      end
+
+      # @param uri [String] Document URI
+      # @return [String, nil] Stored remote document_id, if known
+      def document_id_for(uri)
+        @documents.dig(uri, 'document_id')
+      end
+
+      # URIs we have a record of that are absent from the current run's set.
+      #
+      # @param current_uris [Array<String>, Set] URIs that still exist this run
+      # @return [Array<String>] recorded URIs no longer present (deletion candidates)
+      def stale_uris(current_uris)
+        present = current_uris.to_a
+        @documents.keys - present
+      end
+
+      # @return [Integer] number of recorded documents
+      def size
+        @documents.size
+      end
+
+      # Drop a URI from the manifest (after a successful remote delete).
+      #
+      # @param uri [String] Document URI
+      def forget(uri)
+        @documents.delete(uri)
+      end
+
+      # Persist the manifest atomically (temp file + rename) so an interrupted
+      # write never leaves a torn file in the CI cache.
+      def save
+        FileUtils.mkdir_p(File.dirname(@path))
+        payload = JSON.generate(
+          'version' => VERSION,
+          'collection_id' => @collection_id,
+          'documents' => @documents
+        )
+        tmp = "#{@path}.tmp"
+        File.write(tmp, payload)
+        File.rename(tmp, @path)
+      end
+
+      private
+
+      # Load the persisted documents, discarding data from a different
+      # collection, a different schema version, or an unparseable file.
+      # Every discard warns to stderr — the consequence (a full re-push) is
+      # expensive enough that operators need to know why it happened.
+      #
+      # @return [Hash{String=>Hash}] uri => { 'hash' =>, 'document_id' => }
+      def load
+        return {} unless File.exist?(@path)
+
+        parsed = JSON.parse(File.read(@path))
+        return discard('not a JSON object') unless parsed.is_a?(Hash)
+        return discard("schema version #{parsed['version'].inspect}, expected #{VERSION}") unless
+          parsed['version'] == VERSION
+        return discard("written for collection #{parsed['collection_id'].inspect}, expected #{@collection_id}") unless
+          parsed['collection_id'] == @collection_id
+
+        documents = parsed['documents']
+        documents.is_a?(Hash) ? documents : {}
+      rescue JSON::ParserError
+        discard('unparseable JSON')
+      end
+
+      # @param reason [String] Why the persisted manifest is unusable
+      # @return [Hash] empty documents hash (degrades to a full re-push)
+      def discard(reason)
+        warn "WARNING: discarding sync manifest at #{@path} (#{reason}) — next sync re-pushes all documents"
+        {}
+      end
+    end
+  end
+end
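The change-detection cycle the manifest supports (fingerprint, compare, write atomically, reload) can be sketched end to end without the gem. The helper names `fingerprint`, `unchanged?`, and `atomic_write_json` below are simplified free functions written for this illustration, and the URI and document id are placeholders. Only the hash input (`title`, newline, `body`) and the temp-file-plus-rename approach are taken from the diff.

```ruby
require 'digest'
require 'json'
require 'tmpdir'
require 'fileutils'

# Same fingerprint shape as the exporter hunk: title and body joined by a newline.
def fingerprint(title, body)
  Digest::SHA256.hexdigest("#{title}\n#{body}")
end

# True when a recorded entry exists and its hash matches.
def unchanged?(documents, uri, hash)
  entry = documents[uri]
  !entry.nil? && entry['hash'] == hash
end

# Hypothetical helper: write to a temp file, then rename over the target.
def atomic_write_json(path, data)
  tmp = "#{path}.tmp"
  File.write(tmp, JSON.generate(data))
  File.rename(tmp, path)
end

dir = Dir.mktmpdir
path = File.join(dir, 'manifest.json')
uri = 'https://example.com/repo/app/models/user.rb' # placeholder URI
hash = fingerprint('User', 'class User; end')

first_run = unchanged?({}, uri, hash) # nothing recorded yet, so push
atomic_write_json(path, { uri => { 'hash' => hash, 'document_id' => 'doc-1' } })

reloaded = JSON.parse(File.read(path))
second_run = unchanged?(reloaded, uri, hash) # same content, so skip
after_edit = unchanged?(reloaded, uri, fingerprint('User', 'class User; def x; end; end'))
tmp_left_behind = File.exist?("#{path}.tmp")

FileUtils.remove_entry(dir)
```

The rename replaces the target in a single step, so a reader sees either the old manifest or the new one. That is why an interrupted write costs at most a full re-push on the next run instead of leaving a corrupt file.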
data/lib/woods/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: woods
 version: !ruby/object:Gem::Version
-  version: 1.
+  version: 1.4.0
 platform: ruby
 authors:
 - Leah Armstrong
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2026-
+date: 2026-06-10 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: mcp
@@ -305,6 +305,7 @@ files:
 - lib/woods/unblocked/document_builder.rb
 - lib/woods/unblocked/exporter.rb
 - lib/woods/unblocked/rate_limiter.rb
+- lib/woods/unblocked/sync_manifest.rb
 - lib/woods/util/host_guard.rb
 - lib/woods/version.rb
 homepage: https://github.com/lost-in-the/woods