RubyGems - woods - Versions diffs - 1.3.0 → 1.4.1 - Mend

woods 1.3.0 → 1.4.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (13)

checksums.yaml +4 -4
data/CHANGELOG.md +75 -0
data/README.md +4 -0
data/lib/generators/woods/templates/woods.rb.tt +1 -1
data/lib/tasks/woods.rake +22 -1
data/lib/woods/extractors/controller_extractor.rb +4 -4
data/lib/woods/unblocked/client.rb +87 -6
data/lib/woods/unblocked/document_builder.rb +40 -26
data/lib/woods/unblocked/exporter.rb +284 -18
data/lib/woods/unblocked/rate_limiter.rb +10 -2
data/lib/woods/unblocked/sync_manifest.rb +135 -0
data/lib/woods/version.rb +1 -1
metadata +3 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 69b5f822b28adb68fa962350e44b8721811c62bfa25b0350ab8a46b8b121f3d4
-  data.tar.gz: 6e1e42f994fd57f5de592f662d451e73b967fd4c84744b1638d94e833a044852
+  metadata.gz: b2da9b3b863eb794ca880de7a8b327c7edd22f5ea0b027bd705191af85ea755a
+  data.tar.gz: 31c23f340816f84d3c1acc8e2cf09daa9bb7009179d1bd51cca51f7833c376e5
 SHA512:
-  metadata.gz: a83cb96217695d13bba825a2fb25dcf257bb1de94565c2005fb559771d0670b628db70a19651c7626d02f2baf030a98e116bd7c971f22823ce4300642dc1bd73
-  data.tar.gz: 8d420c40672e99f2395a410b4f83cf8df761c7f06b42e8558199b7c9d26db9b1ffd9aa866520f3d245a2fcd3bdd2cea9f007c01cfd3ed970d509fe6aebd80eab
+  metadata.gz: 69f9bc1e0e83a7894ab0618b1044608f7eb3b869c7a881b04820d033a1a4c66bae7ce56be4c7bd858915679e941354af3f2907c2faa2decac1de8d0a4511913c
+  data.tar.gz: 17360ebaf41923cb074d0b829b8940e24fbd3b7724243a0738fe73ee5a4fbcaf43108194d025a955073088dcf2ea0cd19380304191c41d9b325f02dcee43badd

data/CHANGELOG.md CHANGED

@@ -7,6 +7,81 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
+## [1.4.1] - 2026-06-10
+### Fixed
+- **Unblocked sync: multiple units sharing one file no longer collide on a single
+  URI** (#130). A document's URI derives from `file_path`, so a file defining
+  several extracted units (nested/namespaced classes, STI subclasses, multiple
+  classes in one `.rb`) mapped every unit to the same URI — the remote document
+  was overwritten per unit (only the last survived) and, under the content-hash
+  manifest, those units re-pushed on every run. The exporter now detects files
+  shared by more than one synced unit and disambiguates: the lexically-first
+  identifier keeps the bare blob URL, siblings get a `?unit=<identifier>` suffix.
+  Solo files (the overwhelming majority) are untouched. Sibling of the
+  no-`file_path` guard shipped in 1.4.0.
+## [1.4.0] - 2026-06-10
+### Added — Incremental Unblocked sync (PR #128)
+- **`woods:unblocked_sync` is now incremental.** A new `Woods::Unblocked::SyncManifest`
+  (JSON at `<output_dir>/unblocked_sync_manifest.json`) records the content hash and
+  remote document id of everything last pushed. Each run skips unchanged documents,
+  pushes only new/changed ones, and deletes documents whose source unit disappeared.
+  A missing manifest (first run / CI cache miss) degrades to a correct full re-push
+  that rebuilds it; steady state on an unchanged codebase costs ~0 API calls (was
+  ~800–1200 per run). Persist the manifest across CI runs via your provider's cache —
+  see `docs/UNBLOCKED_INTEGRATION.md`.
+- **Deletion safety.** Orphan purging is skipped when the daily API budget exhausts
+  mid-run, and a mass-deletion guard refuses to delete more than 30% of a ≥10-entry
+  manifest in one run (`UNBLOCKED_FORCE_PURGE=1` overrides) — protection against
+  syncing a partial index. `UNBLOCKED_FORCE_FULL_SYNC=1` re-pushes everything (use
+  after a document-format change). Both flags parse `1`/`true`/`yes` (case-insensitive).
+- **`Client#list_documents` / `#all_documents`** — paginated document listing with
+  client-side collection filtering, used to reconcile remote document ids when the
+  manifest is missing. Cursors are URL-encoded and a page without a cursor id stops
+  pagination rather than looping against the rate budget.
+- **`Woods::Unblocked::ApiError`** (subclass of `Woods::Error`) carries the required
+  HTTP `status` of failed API calls; a 404 on delete is treated as already-gone.
+  **`Woods::Unblocked::BudgetExhaustedError`** (also a `Woods::Error` subclass) is
+  raised by the rate limiter, so budget detection no longer depends on message text.
+- **CI-visible failures.** `woods:unblocked_sync` exits non-zero when the sync
+  recorded errors (delete failures are now surfaced in the error list too). The one
+  tolerated shape is budget exhaustion with partial progress — the expected
+  cold-start outcome, which converges on the next run. Reconcile aborts loudly on
+  auth failures (401/403) instead of burning the budget on doomed calls.
+- **Deterministic document bodies.** `DocumentBuilder` sorts every rendered collection
+  (associations, dependents, routes, enums, scopes, concerns, callbacks) so an
+  unchanged unit always produces byte-identical output — the precondition for
+  hash-based change detection.
+- **`Client#create_collection` defaults `iconUrl`** to the repo-hosted Woods mark.
+  The live API rejects collection creation without an `iconUrl` despite the API docs
+  marking it optional (documented quirk).
+- **Branding.** Tree Rings logo set under `assets/` (marks, wordmark lockups, PNG
+  exports); README wordmark.
+### Fixed
+- `Client#list_collections` no longer raises `TypeError` on the live API's bare-array
+  response.
+- API error messages now surface RFC7807 `title`/`detail` fields (previously
+  "Unknown error").
+- `require 'woods/unblocked/client'` works standalone (previously needed `woods`
+  loaded first).
+- Units without a `file_path` are skipped instead of synced. Previously every
+  such unit fell back to the bare repo URL as its document URI — and since URIs
+  are the upsert key, they silently overwrote each other in the collection
+  (and would have ping-ponged the new manifest hash every run).
+### Build
+- The suite now installs and runs on Ruby 4.0: the optional `tokenizers` gem (whose
+  native extension cannot build against the Ruby 4.0 ABI) is gated behind
+  `install_if (Ruby < 4.0)`, and `benchmark` (no longer a default gem in 4.0) is
+  declared explicitly. Lockfile unchanged.
 ## [1.3.0] - 2026-05-13
 ### Upgrade Notes

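The 1.4.1 entry above describes the disambiguation rule only in prose, so here is a minimal sketch of it. The helper name `disambiguate_uris`, the example repo URL, and the URL-encoding of the `?unit=` value are assumptions for illustration; the gem's own exporter code for this step is not part of the lines shown in this diff.

```ruby
# frozen_string_literal: true

require 'uri'

# Hypothetical helper (not the gem's API): maps each unit identifier to a
# distinct document URI, following the rule stated in the changelog.
def disambiguate_uris(units, repo_url:)
  # Units without a file_path are skipped, as in the 1.4.0 fix.
  with_path = units.select { |u| u['file_path'] }

  with_path.group_by { |u| u['file_path'] }.each_with_object({}) do |(path, group), out|
    base = "#{repo_url}/blob/main/#{path}"
    ids = group.map { |u| u['identifier'] }.sort

    ids.each_with_index do |id, i|
      # The lexically-first identifier keeps the bare blob URL;
      # every sibling gets a ?unit=<identifier> suffix.
      # Encoding the identifier is an assumption of this sketch.
      out[id] = i.zero? ? base : "#{base}?unit=#{URI.encode_www_form_component(id)}"
    end
  end
end
```

A file that defines a single unit goes through the `i.zero?` branch only, so its URI is unchanged, which matches the changelog's note that solo files are untouched.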
data/README.md CHANGED

@@ -1,3 +1,7 @@
+<p align="center">
+  <img src="assets/woods-wordmark-white-with-bg.png" width="400" alt="woods">
+</p>
 # Woods
 **Your AI coding assistant is guessing about your Rails app. Woods gives it the real answers.**

data/lib/generators/woods/templates/woods.rb.tt CHANGED

@@ -1,7 +1,7 @@
 # frozen_string_literal: true
 # Woods configuration
-# Full reference: https://github.com/bigcartel/woods/blob/main/docs/CONFIGURATION_REFERENCE.md
+# Full reference: https://github.com/lost-in-the/woods/blob/main/docs/CONFIGURATION_REFERENCE.md
 #
 # Quick-start presets (uncomment one instead of the full block below):
 #   Woods.configure_with_preset(:local)       # in-memory + Ollama, no external services

data/lib/tasks/woods.rake CHANGED

@@ -604,25 +604,46 @@ namespace :woods do
     end
     output_dir = ENV.fetch('WOODS_OUTPUT', config.output_dir)
+    # Truthy set, so FLAG=false / FLAG=0 disables rather than silently enabling.
+    env_flag = ->(name) { %w[1 true yes].include?(ENV.fetch(name, '').strip.downcase) }
+    force_full = env_flag.call('UNBLOCKED_FORCE_FULL_SYNC')
+    force_purge = env_flag.call('UNBLOCKED_FORCE_PURGE')
     puts 'Syncing extraction data to Unblocked...'
     puts "  Output dir:     #{output_dir}"
     puts "  Collection:     #{config.unblocked_collection_id}"
     puts "  Repo URL:       #{config.unblocked_repo_url}"
+    puts '  Mode:           full re-sync (UNBLOCKED_FORCE_FULL_SYNC set)' if force_full
     puts
-    exporter = Woods::Unblocked::Exporter.new(index_dir: output_dir)
+    exporter = Woods::Unblocked::Exporter.new(
+      index_dir: output_dir,
+      force_full: force_full,
+      force_purge: force_purge
+    )
     stats = exporter.sync_all
     puts
     puts 'Sync complete!'
     puts "  Documents synced:   #{stats[:synced]}"
     puts "  Documents skipped:  #{stats[:skipped]}"
+    puts "  Documents deleted:  #{stats[:deleted]}"
     if stats[:errors].any?
       puts "  Errors:             #{stats[:errors].size}"
       stats[:errors].first(5).each { |e| puts "    - #{e}" }
       puts "    ... and #{stats[:errors].size - 5} more" if stats[:errors].size > 5
+      # Fail the task so CI notices — a printed-but-green run is invisible in
+      # post-merge pipelines (a dead token would otherwise stay green forever).
+      # Exception: budget exhaustion *with* partial progress is the expected
+      # cold-start shape; it converges on the next run.
+      budget_only = stats[:errors].all? { |e| e.include?('daily budget exhausted') }
+      unless budget_only && stats[:synced].positive?
+        puts
+        puts 'Sync completed with errors — failing so CI surfaces it.'
+        exit 1
+      end
     end
   end

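The rake hunk above makes two decisions inline: how an environment flag is parsed, and when a run with errors should fail. The sketch below pulls both into small methods so they can be exercised without running the task. The method names `env_flag?` and `fail_run?` are hypothetical, and passing the environment as an argument is a choice made here for testability; the task itself reads `ENV` directly.

```ruby
# frozen_string_literal: true

# Same truthy set as the task: 1 / true / yes, case-insensitive and
# whitespace-tolerant. Anything else (including "0" and "false") is off.
def env_flag?(name, env = ENV)
  %w[1 true yes].include?(env.fetch(name, '').strip.downcase)
end

# Mirrors the task's exit decision: a run with errors fails unless every
# error is budget exhaustion AND at least one document was synced.
def fail_run?(stats)
  return false if stats[:errors].empty?

  budget_only = stats[:errors].all? { |e| e.include?('daily budget exhausted') }
  !(budget_only && stats[:synced].positive?)
end
```

Note that the task matches on the error text because `stats[:errors]` holds strings; the class-based `BudgetExhaustedError` check mentioned in the changelog applies where the exception object is still available.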
data/lib/woods/extractors/controller_extractor.rb CHANGED

@@ -226,10 +226,10 @@ module Woods
           # Parent chain for understanding inherited behavior
           ancestors: controller.ancestors
-                     .take_while { |a| a != ActionController::Base && a != ActionController::API }
-                     .grep(Class)
-                     .map(&:name)
-                     .compact,
+                               .take_while { |a| a != ActionController::Base && a != ActionController::API }
+                               .grep(Class)
+                               .map(&:name)
+                               .compact,
           # Concerns included
           included_concerns: extract_included_concerns(controller),

data/lib/woods/unblocked/client.rb CHANGED

@@ -3,10 +3,28 @@
 require 'json'
 require 'net/http'
 require 'uri'
+require 'woods'
 require_relative 'rate_limiter'
 module Woods
   module Unblocked
+    # API error carrying the HTTP status code, so callers can branch on
+    # status (e.g. treat a 404 on delete as "already gone") instead of
+    # matching message strings. Subclasses Woods::Error, so existing
+    # +rescue Woods::Error+ sites keep working unchanged.
+    class ApiError < Woods::Error
+      # @return [Integer] HTTP status code of the failed response
+      attr_reader :status
+      # @param message [String] Error message
+      # @param status [Integer] HTTP status code — required, because callers
+      #   branch on it (a nil status would silently miss every status check)
+      def initialize(message, status:)
+        super(message)
+        @status = Integer(status)
+      end
+    end
     # REST client for the Unblocked API v1.
     #
     # Handles document and collection CRUD with rate limiting, retries,
@@ -25,6 +43,12 @@ module Woods
       BASE_URL = 'https://getunblocked.com/api/v1'
       MAX_RETRIES = 3
       DEFAULT_TIMEOUT = 30
+      # Max page size the list endpoint accepts (per API docs).
+      PAGE_SIZE = 200
+      # Repo-hosted Woods mark, used as the collection icon when none is given.
+      # The live API rejects collection creation without an iconUrl (despite
+      # the API docs marking it optional), so a working default matters.
+      DEFAULT_ICON_URL = 'https://raw.githubusercontent.com/lost-in-the/woods/main/assets/woods-mark-black.svg'
       # @param api_token [String] Unblocked API token (Personal or Team)
       # @param rate_limiter [RateLimiter] Rate limiter instance
@@ -60,12 +84,17 @@ module Woods
       #
       # @param name [String] Collection name (1-32 chars)
       # @param description [String] Collection description (1-4096 chars)
-      # @param icon_url [String, nil] Optional icon URL
+      # @param icon_url [String, nil] Icon URL. The live API rejects creation
+      #   with a bare 400 when omitted (despite the API docs marking it
+      #   optional), so nil falls back to DEFAULT_ICON_URL — the repo-hosted
+      #   Woods mark.
       # @return [Hash] { "id" => "collection-uuid", "name" => "...", ... }
       def create_collection(name:, description:, icon_url: nil)
-        body = { name: name, description: description }
-        body[:iconUrl] = icon_url if icon_url
-        request(:post, 'collections', body)
+        request(:post, 'collections', {
+                  name: name,
+                  description: description,
+                  iconUrl: icon_url || DEFAULT_ICON_URL
+                })
       end
       # List all collections.
@@ -73,6 +102,10 @@ module Woods
       # @return [Array<Hash>] Collection objects
       def list_collections
         result = request(:get, 'collections')
+        # The live API returns a bare JSON array; the envelope fallbacks are
+        # defensive (calling ['items'] on an Array raises TypeError).
+        return result if result.is_a?(Array)
         result['items'] || result['data'] || [result].flatten.compact
       end
@@ -84,6 +117,52 @@ module Woods
         request(:delete, "documents/#{document_id}")
       end
+      # List a single page of documents.
+      #
+      # The endpoint returns a bare JSON array of document metadata (no body):
+      # `id, collectionId, title, uri, createdAt, updatedAt`. Pagination is
+      # cursor-based via `after`/`before` (opaque cursors); there is no
+      # server-side collection filter.
+      #
+      # @param limit [Integer] Page size (1-200)
+      # @param after [String, nil] Opaque forward cursor (typically the last id)
+      # @return [Array<Hash>] One page of document metadata
+      def list_documents(limit: PAGE_SIZE, after: nil)
+        query = "limit=#{limit}"
+        query += "&after=#{URI.encode_www_form_component(after)}" if after
+        result = request(:get, "documents?#{query}")
+        return result if result.is_a?(Array)
+        result['items'] || result['data'] || []
+      end
+      # List every document in a collection, paging until exhausted.
+      #
+      # Filters client-side on `collectionId` since the API has no collection
+      # filter. ~5 calls for ~1000 documents; each goes through the rate limiter.
+      #
+      # @param collection_id [String] Collection UUID to filter to
+      # @return [Array<Hash>] All matching document metadata
+      def all_documents(collection_id:)
+        docs = []
+        after = nil
+        loop do
+          page = list_documents(limit: PAGE_SIZE, after: after)
+          break if page.empty?
+          docs.concat(page)
+          break if page.size < PAGE_SIZE
+          after = page.last['id']
+          # A full page with no cursor id would refetch page 1 forever —
+          # stop with what we have rather than loop against the budget.
+          break if after.nil?
+        end
+        docs.select { |doc| doc['collectionId'] == collection_id }
+      end
       private
       def request(method, path, body = nil)
@@ -155,8 +234,10 @@ module Woods
         rescue JSON::ParserError, TypeError
           { 'message' => response.body&.slice(0, 200) || 'Unknown error' }
         end
-        message = parsed['message'] || parsed['error'] || 'Unknown error'
-        raise Woods::Error, "Unblocked API error #{response.code}: #{message}"
+        # The Unblocked API returns RFC7807-style bodies ({ status, title, detail });
+        # older/other paths use message/error. Check all so failures stay legible.
+        message = parsed['message'] || parsed['error'] || parsed['detail'] || parsed['title'] || 'Unknown error'
+        raise ApiError.new("Unblocked API error #{response.code}: #{message}", status: response.code.to_i)
       end
     end
   end

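The pagination loop in `all_documents` above has three exit conditions (empty page, short page, missing cursor id) that are easy to misread in diff form. The sketch below reproduces the loop against an in-memory stand-in so each exit can be checked offline. `collect_documents` and `fake_remote` are hypothetical names, and the fake assumes the cursor is simply the last document id of the previous page, which the hunk's comment describes only as typical.

```ruby
# frozen_string_literal: true

# fetch_page stands in for Client#list_documents: it is called with
# (limit, after) and returns an Array of document metadata hashes.
def collect_documents(collection_id:, page_size:, fetch_page:)
  docs = []
  after = nil

  loop do
    page = fetch_page.call(page_size, after)
    break if page.empty?

    docs.concat(page)
    break if page.size < page_size # short page: nothing left to fetch

    after = page.last['id']
    break if after.nil? # no cursor: stop rather than refetch page 1 forever
  end

  # The API has no collection filter, so filtering happens client-side.
  docs.select { |doc| doc['collectionId'] == collection_id }
end

# In-memory fake of the list endpoint (assumption: cursor == last id seen).
def fake_remote(all_docs)
  lambda do |limit, after|
    start = after ? all_docs.index { |d| d['id'] == after } + 1 : 0
    all_docs[start, limit] || []
  end
end
```

Because filtering happens after paging, the number of API calls depends on the total document count across all collections, not on the size of the target collection.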
data/lib/woods/unblocked/document_builder.rb CHANGED

@@ -27,23 +27,30 @@ module Woods
       def build(unit_data)
         type = unit_data['type']
         identifier = unit_data['identifier']
-        file_path = unit_data['file_path']
         {
           title: "#{identifier} (#{type})",
           body: build_body(unit_data),
-          uri: build_uri(file_path)
+          uri: uri_for(unit_data)
         }
       end
-      private
-      def build_uri(file_path)
+      # The citation URI for a unit (GitHub blob URL, or the repo root when the
+      # unit has no file_path). Public so callers can compute a unit's URI
+      # cheaply — e.g. to build the set of currently-existing URIs — without
+      # building the full document body.
+      #
+      # @param unit_data [Hash] Parsed unit JSON (needs 'file_path')
+      # @return [String] Citation URI
+      def uri_for(unit_data)
+        file_path = unit_data['file_path']
         return @repo_url unless file_path
         "#{@repo_url}/blob/main/#{file_path}"
       end
+      private
       def build_body(unit_data)
         type = unit_data['type']
         body = case type
@@ -125,7 +132,9 @@ module Woods
             dep = a.dig('options', 'dependent')
             dep ? "#{name} (#{dep})" : name
           end
-          lines << "**#{type}:** #{targets.join(', ')}"
+          # Sorted so the body is a function of association content, not order
+          # (the exporter hashes this body to detect changes).
+          lines << "**#{type}:** #{targets.sort.join(', ')}"
         end
         lines.join("\n")
@@ -136,7 +145,7 @@ module Woods
         return nil if deps.empty?
         grouped = deps.group_by { |d| d['type'] }
-        summary_parts = grouped.map { |type, items| "#{items.size} #{type}s" }
+        summary_parts = grouped.sort_by { |type, _| type.to_s }.map { |type, items| "#{items.size} #{type}s" }
         lines = ["## Dependents (#{deps.size} units)"]
         lines << summary_parts.join(', ')
@@ -160,9 +169,9 @@ module Woods
         return nil if controllers.empty? && graphql.empty?
         lines = ['## Entry Points']
-        lines << "**Controllers:** #{controllers.map { |c| c['identifier'] }.join(', ')}" if controllers.any?
-        lines << "**GraphQL:** #{graphql.map { |g| g['identifier'] }.join(', ')}" if graphql.any?
-        lines << "**Jobs:** #{jobs.map { |j| j['identifier'] }.join(', ')}" if jobs.any?
+        lines << "**Controllers:** #{controllers.map { |c| c['identifier'] }.sort.join(', ')}" if controllers.any?
+        lines << "**GraphQL:** #{graphql.map { |g| g['identifier'] }.sort.join(', ')}" if graphql.any?
+        lines << "**Jobs:** #{jobs.map { |j| j['identifier'] }.sort.join(', ')}" if jobs.any?
         lines.join("\n")
       end
@@ -172,15 +181,16 @@ module Woods
         enums = meta['enums']
         if enums.is_a?(Hash) && enums.any?
-          enum_strs = enums.map { |name, values| "#{name} (#{format_enum_values(values)})" }
+          enum_strs = enums.sort_by { |name, _| name.to_s }
+                           .map { |name, values| "#{name} (#{format_enum_values(values)})" }
           parts << "**Enums:** #{enum_strs.join('; ')}"
         end
         scopes = meta['scopes']
-        parts << "**Scopes:** #{scopes.map { |s| s['name'] }.join(', ')}" if scopes.is_a?(Array) && scopes.any?
+        parts << "**Scopes:** #{scopes.map { |s| s['name'] }.sort.join(', ')}" if scopes.is_a?(Array) && scopes.any?
         concerns = meta['inlined_concerns']
-        parts << "**Concerns:** #{concerns.join(', ')}" if concerns.is_a?(Array) && concerns.any?
+        parts << "**Concerns:** #{concerns.sort.join(', ')}" if concerns.is_a?(Array) && concerns.any?
         callbacks = meta['callbacks']
         if callbacks.is_a?(Array) && callbacks.any?
@@ -200,8 +210,8 @@ module Woods
         return nil if jobs.empty? && mailers.empty?
         lines = ['## Side Effects']
-        lines << "**Jobs:** #{jobs.map { |j| j['identifier'] }.join(', ')}" if jobs.any?
-        lines << "**Mailers:** #{mailers.map { |m| m['identifier'] }.join(', ')}" if mailers.any?
+        lines << "**Jobs:** #{jobs.map { |j| j['identifier'] }.sort.join(', ')}" if jobs.any?
+        lines << "**Mailers:** #{mailers.map { |m| m['identifier'] }.sort.join(', ')}" if mailers.any?
         lines.join("\n")
       end
@@ -229,18 +239,21 @@ module Woods
         routes = meta['routes']
         return nil unless routes.is_a?(Hash) && routes.any?
-        lines = ['## Routes']
+        route_lines = []
         routes.each do |action, route_list|
           next unless route_list.is_a?(Array)
           route_list.each do |route|
             next unless route.is_a?(Hash)
-            lines << "- `#{route['verb']} #{route['path']}` (#{action})"
+            route_lines << "- `#{route['verb']} #{route['path']}` (#{action})"
           end
         end
-        lines.size > 1 ? lines.first(20).join("\n") : nil
+        # Sort before truncating so the kept subset is stable across runs.
+        return nil if route_lines.empty?
+        (['## Routes'] + route_lines.sort.first(20)).join("\n")
       end
       def controller_dependencies(unit)
@@ -250,7 +263,7 @@ module Woods
         models = deps.select { |d| d['type'] == 'model' }.map { |d| d['target'] }
         return nil if models.empty?
-        "## Dependencies\n**Models:** #{models.join(', ')}"
+        "## Dependencies\n**Models:** #{models.sort.join(', ')}"
       end
       def controller_dependents(unit)
@@ -258,7 +271,7 @@ module Woods
         views = deps.select { |d| d['type'] == 'view_template' }
         return nil if views.empty?
-        "## Views\n#{views.map { |v| "- `#{v['identifier']}`" }.first(10).join("\n")}"
+        "## Views\n#{views.map { |v| "- `#{v['identifier']}`" }.sort.first(10).join("\n")}"
       end
       # ── GraphQL formatting ───────────────────────────────────────────
@@ -271,7 +284,7 @@ module Woods
         deps = unit['dependencies'] || []
         models = deps.select { |d| d['type'] == 'model' }.map { |d| d['target'] }
-        sections << "**Models:** #{models.join(', ')}" if models.any?
+        sections << "**Models:** #{models.sort.join(', ')}" if models.any?
         dependents = unit['dependents'] || []
         sections << "**Referenced by:** #{dependents.size} units" if dependents.any?
@@ -292,14 +305,15 @@ module Woods
         deps = unit['dependencies'] || []
         if deps.any?
           by_type = deps.group_by { |d| d['type'] }
-          dep_parts = by_type.map { |type, items| "#{type}: #{items.map { |d| d['target'] }.join(', ')}" }
+          dep_parts = by_type.sort_by { |type, _| type.to_s }
+                             .map { |type, items| "#{type}: #{items.map { |d| d['target'] }.sort.join(', ')}" }
           sections << "## Dependencies\n#{dep_parts.join("\n")}"
         end
         dependents = unit['dependents'] || []
         if dependents.any?
           grouped = dependents.group_by { |d| d['type'] }
-          summary = grouped.map { |type, items| "#{items.size} #{type}s" }
+          summary = grouped.sort_by { |type, _| type.to_s }.map { |type, items| "#{items.size} #{type}s" }
           sections << "## Dependents (#{dependents.size})\n#{summary.join(', ')}"
         end
@@ -317,9 +331,9 @@ module Woods
       end
       def format_callbacks(callbacks)
-        callbacks.first(5).map do |cb|
-          "#{cb['type']}: #{cb['filter']}"
-        end.join(', ')
+        # Sort before truncating so both the selection and order are stable
+        # regardless of input order (the body is hashed for change detection).
+        callbacks.map { |cb| "#{cb['type']}: #{cb['filter']}" }.sort.first(5).join(', ')
       end
     end
   end

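Most of the `document_builder.rb` changes above apply one idea: sort before truncating, so that the rendered body depends on content rather than input order. The sketch below shows why that matters for hash-based change detection. `format_callbacks` copies the new one-liner from the hunk; `format_callbacks_unsorted` reproduces the removed version for contrast; `fingerprint` is an assumption (SHA-256 over the body), since the exporter's actual fingerprint method is not visible in this diff.

```ruby
# frozen_string_literal: true

require 'digest'

# New behavior from the hunk: sort, then keep the first five.
def format_callbacks(callbacks)
  callbacks.map { |cb| "#{cb['type']}: #{cb['filter']}" }.sort.first(5).join(', ')
end

# Old behavior from the hunk: truncate in input order.
def format_callbacks_unsorted(callbacks)
  callbacks.first(5).map { |cb| "#{cb['type']}: #{cb['filter']}" }.join(', ')
end

# Assumed fingerprint for illustration: hex SHA-256 of the rendered text.
def fingerprint(body)
  Digest::SHA256.hexdigest(body)
end
```

With more than five callbacks, the unsorted variant can select a different subset when the extractor emits them in a different order, so its fingerprint changes even though the unit did not.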
data/lib/woods/unblocked/exporter.rb CHANGED

@@ -1,9 +1,13 @@
 # frozen_string_literal: true
+require 'set'
+require 'digest'
+require 'uri'
 require 'woods'
 require_relative 'client'
 require_relative 'rate_limiter'
 require_relative 'document_builder'
+require_relative 'sync_manifest'
 module Woods
   module Unblocked
@@ -11,16 +15,27 @@ module Woods
     #
     # Reads extraction output from disk via IndexReader, converts units to
     # condensed Markdown documents, and pushes via the Unblocked Documents API.
-    # All syncs are idempotent — documents are upserted by URI.
+    # Syncs are incremental: a {SyncManifest} records the content hash and
+    # remote document_id of everything last pushed, so each run only PUTs
+    # new/changed documents, skips unchanged ones, and deletes documents whose
+    # source unit has disappeared. Documents are upserted by URI, so a missing
+    # manifest (first run / CI cache miss) degrades to a correct full sync.
     #
     # @example
     #   exporter = Exporter.new(index_dir: "tmp/woods")
     #   stats = exporter.sync_all
-    #   # => { synced: 940, skipped: 5060, errors: [] }
+    #   # => { synced: 12, skipped: 928, deleted: 1, errors: [] }
     #
     class Exporter
       MAX_ERRORS = 100
+      # Mass-deletion guard: refuse to purge when more than this fraction of a
+      # manifest of at least PURGE_GUARD_MIN_DOCS entries would be deleted —
+      # the signature of a sync run against a partial index. Override with
+      # force_purge.
+      PURGE_GUARD_FRACTION = 0.30
+      PURGE_GUARD_MIN_DOCS = 10
       # Unit types to sync, in priority order.
       # All units are synced for these types.
       FULL_SYNC_TYPES = %w[
@@ -39,9 +54,13 @@ module Woods
       # @param config [Configuration] Woods configuration (default: global config)
       # @param client [Client, nil] Unblocked API client (auto-created from config if nil)
       # @param reader [Object, nil] IndexReader instance (auto-created if nil)
+      # @param manifest [SyncManifest, nil] Sync manifest (auto-created under index_dir if nil)
+      # @param force_full [Boolean] Re-push every unit, ignoring the unchanged check
+      # @param force_purge [Boolean] Bypass the mass-deletion guard
       # @param output [IO] Progress output stream (default: $stdout)
       # @raise [ConfigurationError] if required config is missing
-      def initialize(index_dir:, config: Woods.configuration, client: nil, reader: nil, output: $stdout)
+      def initialize(index_dir:, config: Woods.configuration, client: nil, reader: nil,
+                     manifest: nil, force_full: false, force_purge: false, output: $stdout)
         @collection_id = config.unblocked_collection_id
         raise ConfigurationError, 'unblocked_collection_id is required' unless @collection_id
@@ -57,18 +76,35 @@ module Woods
         @client = client || Client.new(api_token: api_token, rate_limiter: limiter)
         @reader = reader || build_reader(index_dir)
         @builder = DocumentBuilder.new(repo_url: repo_url)
+        @manifest = manifest || build_manifest(index_dir)
+        @force_full = force_full
+        @force_purge = force_purge
         @output = output
+        # Initialized here as well as in sync_all so the public sync_type /
+        # sync_type_partial methods work standalone (track_uri needs them).
+        @current_uris = Set.new
+        @budget_exhausted = false
+        # base URI => identifier that keeps the bare URI (only populated for
+        # URIs shared by >1 unit). Rebuilt per sync_all run.
+        @uri_primary = {}
       end
       # Sync all configured unit types to the Unblocked collection.
       #
-      # @return [Hash] { synced: Integer, skipped: Integer, errors: Array<String> }
+      # @return [Hash] { synced:, skipped:, deleted:, errors: }
       def sync_all
+        @current_uris = Set.new
+        @budget_exhausted = false
+        build_uri_index
+        reconcile_from_remote if @manifest.empty?
         synced = 0
         skipped = 0
         errors = []
         FULL_SYNC_TYPES.each do |type|
+          break if @budget_exhausted
           result = sync_type(type)
           synced += result[:synced]
           skipped += result[:skipped]
@@ -76,19 +112,24 @@ module Woods
         end
         PARTIAL_SYNC_TYPES.each do |type, max_count|
+          break if @budget_exhausted
           result = sync_type_partial(type, max_count)
           synced += result[:synced]
           skipped += result[:skipped]
           errors.concat(result[:errors])
         end
-        { synced: synced, skipped: skipped, errors: cap_errors(errors) }
+        deleted = @budget_exhausted ? 0 : purge_stale(errors)
+        { synced: synced, skipped: skipped, deleted: deleted, errors: cap_errors(errors) }
+      ensure
+        save_manifest
       end
       # Sync all units of a given type.
       #
       # @param type [String] Unit type (e.g. "model", "controller")
-      # @return [Hash] { synced: Integer, skipped: Integer, errors: Array<String> }
+      # @return [Hash] { synced:, skipped:, errors: }
       def sync_type(type)
         units = @reader.list_units(type: type)
         log "  #{type}: #{units.size} units"
@@ -100,7 +141,7 @@ module Woods
       #
       # @param type [String] Unit type
       # @param max_count [Integer] Maximum units to sync
-      # @return [Hash] { synced: Integer, skipped: Integer, errors: Array<String> }
+      # @return [Hash] { synced:, skipped:, errors: }
       def sync_type_partial(type, max_count)
         units = @reader.list_units(type: type)
         return empty_stats if units.empty?
@@ -114,8 +155,14 @@ module Woods
           { entry: entry, data: data, dep_count: dep_count }
         end
+        # Every unit of this type still exists — track its URI so partial units
+        # that fall *out* of the top-N are never mistaken for deletions.
+        units_with_data.each { |u| track_uri(u[:data]) }
         top_units = units_with_data.sort_by { |u| -u[:dep_count] }.first(max_count)
-        skipped_count = [units.size - max_count, 0].max
+        # Count against what was actually synced — units.size includes entries
+        # whose unit data was missing (dropped by the filter_map above).
+        skipped_count = units.size - top_units.size
         log "  #{type}: #{top_units.size}/#{units.size} units (top by dependents)"
@@ -138,13 +185,19 @@ module Woods
             next
           end
-          push_document(unit_data)
-          synced += 1
+          track_uri(unit_data)
+          if push_document(unit_data) == :skipped
+            skipped += 1
+          else
+            synced += 1
+          end
         rescue Woods::Error => e
           errors << "#{entry['identifier']}: #{e.message}"
-          break if e.message.include?('daily budget exhausted')
+          break if note_budget_exhaustion(e)
         rescue StandardError => e
-          errors << "#{entry['identifier']}: #{e.message}"
+          # Include the class — "undefined method for nil" without it is
+          # unactionable in CI logs.
+          errors << "#{entry['identifier']}: #{e.class}: #{e.message}"
         end
         { synced: synced, skipped: skipped, errors: errors }
@@ -156,26 +209,222 @@ module Woods
         errors = []
         entries_with_data.each do |entry, unit_data|
-          push_document(unit_data)
-          synced += 1
+          track_uri(unit_data)
+          if push_document(unit_data) == :skipped
+            skipped += 1
+          else
+            synced += 1
+          end
         rescue Woods::Error => e
           errors << "#{entry['identifier']}: #{e.message}"
-          break if e.message.include?('daily budget exhausted')
+          break if note_budget_exhaustion(e)
         rescue StandardError => e
-          errors << "#{entry['identifier']}: #{e.message}"
+          # Include the class — "undefined method for nil" without it is
+          # unactionable in CI logs.
+          errors << "#{entry['identifier']}: #{e.class}: #{e.message}"
         end
         { synced: synced, skipped: skipped, errors: errors }
       end
+      # Build the document, skip it if the manifest says it is unchanged,
+      # otherwise upsert it and record the new hash + remote document_id.
+      #
+      # @return [Symbol] :synced or :skipped
       def push_document(unit_data)
+        # No file_path → the URI falls back to the bare repo URL, which every
+        # such unit would share: they'd overwrite each other remotely and
+        # ping-pong the manifest hash forever. Skip them.
+        return :skipped unless unit_data['file_path']
+        # When several units share one file they share one base URI; only one
+        # keeps it, the rest get a `?unit=` suffix so each is a distinct remote
+        # document (and a distinct manifest key).
+        uri = effective_uri(unit_data)
         doc = @builder.build(unit_data)
-        @client.put_document(
+        # An empty body means the credential scrub failed closed (the builders
+        # always emit at least a header). Upserting it would overwrite a good
+        # remote document with nothing — error out and leave the remote as-is.
+        if doc[:body].nil? || doc[:body].empty?
+          raise Woods::ExtractionError, 'document body empty (credential scrub failure?) — push skipped'
+        end
+        hash = fingerprint(doc)
+        return :skipped if !@force_full && @manifest.unchanged?(uri, hash)
+        response = @client.put_document(
           collection_id: @collection_id,
           title: doc[:title],
           body: doc[:body],
-          uri: doc[:uri]
+          uri: uri
         )
+        document_id = (response['id'] if response.is_a?(Hash)) || @manifest.document_id_for(uri)
+        @manifest.record(uri: uri, hash: hash, document_id: document_id)
+        :synced
+      end
+      # Delete remote documents whose source unit no longer exists. Failures
+      # are appended to +errors+ — a delete that fails silently every run is
+      # how a collection rots while "deleted: 0" looks normal.
+      #
+      # @param errors [Array<String>] sink for delete failures
+      # @return [Integer] number of documents deleted
+      def purge_stale(errors)
+        stale = @manifest.stale_uris(@current_uris)
+        return 0 if stale.empty?
+        return 0 if guard_blocks_purge?(stale)
+        resolve_missing_document_ids(stale)
+        deleted = 0
+        stale.each do |uri|
+          document_id = @manifest.document_id_for(uri)
+          next unless document_id
+          @client.delete_document(document_id: document_id)
+          @manifest.forget(uri)
+          deleted += 1
+        rescue ApiError => e
+          if e.status == 404
+            # Already gone remotely — goal state reached, drop the entry
+            # rather than retrying every run.
+            @manifest.forget(uri)
+          else
+            errors << "delete #{uri}: #{e.message}"
+          end
+        rescue Woods::Error => e
+          break if note_budget_exhaustion(e)
+          errors << "delete #{uri}: #{e.message}"
+        rescue StandardError => e
+          # Entry stays in the manifest so a later run retries the delete —
+          # but surface the failure so systematic breakage is visible.
+          errors << "delete #{uri}: #{e.class}: #{e.message}"
+        end
+        deleted
+      end
+      # A manifest entry can carry a nil document_id (e.g. the PUT response
+      # body was empty). Those entries would be permanently undeletable, so
+      # before purging, make one bounded all_documents sweep to resolve ids.
+      # Best-effort: unresolved entries are simply skipped by the purge loop.
+      def resolve_missing_document_ids(stale)
+        missing = stale.select { |uri| @manifest.document_id_for(uri).nil? }
+        return if missing.empty?
+        ids_by_uri = @client.all_documents(collection_id: @collection_id)
+                            .to_h { |doc| [doc['uri'], doc['id']] }
+        missing.each do |uri|
+          id = ids_by_uri[uri]
+          @manifest.record(uri: uri, hash: nil, document_id: id) if id
+        end
+      rescue StandardError => e
+        log "  id resolution skipped (#{e.message})"
+      end
+      # True when purging +stale+ would delete too large a fraction of the
+      # manifest — the signature of running against a partial index. The floor
+      # (PURGE_GUARD_MIN_DOCS) keeps small collections deletable.
+      def guard_blocks_purge?(stale)
+        return false if @force_purge
+        size = @manifest.size
+        return false if size < PURGE_GUARD_MIN_DOCS
+        fraction = stale.size.to_f / size
+        return false unless fraction > PURGE_GUARD_FRACTION
+        log "  WARNING: refusing to delete #{stale.size} of #{size} documents " \
+            "(#{(fraction * 100).round}% > #{(PURGE_GUARD_FRACTION * 100).to_i}% — likely a partial index). " \
+            'Set UNBLOCKED_FORCE_PURGE=1 to override.'
+        true
+      end
+      # Seed the manifest from the remote collection when we have no local
+      # state (first run / CI cache miss). The list endpoint returns no body,
+      # so hashes are nil (everything re-pushes), but recovering document_ids
+      # lets this run still purge orphaned documents.
+      #
+      # Auth failures re-raise: a 401/403 here dooms every subsequent call,
+      # and "proceeding with full sync" would burn the whole daily budget on
+      # guaranteed failures.
+      def reconcile_from_remote
+        @client.all_documents(collection_id: @collection_id).each do |doc|
+          uri = doc['uri']
+          next unless uri
+          @manifest.record(uri: uri, hash: nil, document_id: doc['id'])
+        end
+      rescue ApiError => e
+        raise if [401, 403].include?(e.status)
+        log "  reconcile skipped (#{e.message}) — proceeding with full sync"
+      rescue StandardError => e
+        log "  reconcile skipped (#{e.message}) — proceeding with full sync"
+      end
+      def track_uri(unit_data)
+        # Units without a file_path are never pushed (see push_document), so
+        # their fallback repo-root URI must not be marked current either — a
+        # stale repo-root document from before this guard should purge.
+        return unless unit_data['file_path']
+        # Must match the URI push_document actually uses, or a colliding unit's
+        # disambiguated document would look stale and be purged.
+        @current_uris << effective_uri(unit_data)
+      end
+      # The URI a unit's document is stored under. Normally the file's blob URL;
+      # when several units share that file, all but the lexically-first
+      # identifier get a `?unit=` suffix so each keeps a distinct document
+      # rather than overwriting the others (see #build_uri_index).
+      def effective_uri(unit_data)
+        base = @builder.uri_for(unit_data)
+        primary = @uri_primary[base]
+        return base if primary.nil? || primary == unit_data['identifier']
+        "#{base}?unit=#{URI.encode_www_form_component(unit_data['identifier'])}"
+      end
+      # One cheap pass over the type indexes (entries already carry file_path,
+      # and read_index is cached) to find files that define more than one synced
+      # unit. For each such base URI, the lexically-smallest identifier — the
+      # outer/top-level class — keeps the bare URI; siblings are suffixed. Solo
+      # files (the overwhelming majority) are absent from the map and unchanged,
+      # so this introduces no churn for them.
+      def build_uri_index
+        groups = Hash.new { |h, k| h[k] = [] }
+        synced_types.each do |type|
+          @reader.list_units(type: type).each do |entry|
+            next unless entry['file_path']
+            groups[@builder.uri_for(entry)] << entry['identifier']
+          end
+        end
+        @uri_primary = groups.each_with_object({}) do |(uri, identifiers), primary|
+          unique = identifiers.uniq
+          primary[uri] = unique.min if unique.size > 1
+        end
+      end
+      def synced_types
+        FULL_SYNC_TYPES + PARTIAL_SYNC_TYPES.map(&:first)
+      end
+      def fingerprint(doc)
+        Digest::SHA256.hexdigest("#{doc[:title]}\n#{doc[:body]}")
+      end
+      # Records whether an error was a budget-exhaustion stop. Returns true when
+      # it was, so callers can break out of their loop. Class check first; the
+      # message match remains as a fallback for injected clients that raise
+      # plain Woods::Error.
+      def note_budget_exhaustion(error)
+        return false unless error.is_a?(BudgetExhaustedError) || error.message.include?('daily budget exhausted')
+        @budget_exhausted = true
       end
       def build_reader(index_dir)
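
The `effective_uri` / `build_uri_index` pair in the hunk above can be read in isolation as the following sketch. It is not the gem's code: the `base_uri` key, the `example.test` URLs, and the helper name `build_uri_primary` are hypothetical stand-ins for what `@builder.uri_for` and the index reader supply in the real exporter.

```ruby
require 'uri'

# Map each shared base URI to its lexically smallest identifier.
# Files that define a single unit are left out of the map entirely.
def build_uri_primary(units)
  groups = Hash.new { |h, k| h[k] = [] }
  units.each { |unit| groups[unit['base_uri']] << unit['identifier'] }
  groups.each_with_object({}) do |(uri, identifiers), primary|
    unique = identifiers.uniq
    primary[uri] = unique.min if unique.size > 1
  end
end

# The primary (or a solo unit) keeps the bare URI; siblings get a ?unit= suffix.
def effective_uri(unit, uri_primary)
  base = unit['base_uri']
  primary = uri_primary[base]
  return base if primary.nil? || primary == unit['identifier']

  "#{base}?unit=#{URI.encode_www_form_component(unit['identifier'])}"
end

# Hypothetical sample data: two units in one file, one unit alone in another.
sample_units = [
  { 'identifier' => 'Billing::Invoice', 'base_uri' => 'https://example.test/blob/main/billing.rb' },
  { 'identifier' => 'Billing',          'base_uri' => 'https://example.test/blob/main/billing.rb' },
  { 'identifier' => 'User',             'base_uri' => 'https://example.test/blob/main/user.rb' }
]
sample_primary = build_uri_primary(sample_units)
sample_units.each { |unit| puts effective_uri(unit, sample_primary) }
```

Because solo files never enter the map, their URIs stay identical from run to run, which is what keeps the manifest keys stable for the common case.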
@@ -183,6 +432,23 @@ module Woods
         Woods::MCP::IndexReader.new(index_dir)
       end
+      # Persist the manifest, downgrading failures to a warning: losing the
+      # manifest only costs a full re-check next run, which must not turn an
+      # otherwise-successful sync into a crash (this runs from an ensure, where
+      # a raise would also mask any in-flight exception).
+      def save_manifest
+        @manifest.save
+      rescue StandardError => e
+        log "  WARNING: sync manifest not persisted (#{e.message}) — next run will re-push all documents"
+      end
+      def build_manifest(index_dir)
+        SyncManifest.new(
+          path: File.join(index_dir, 'unblocked_sync_manifest.json'),
+          collection_id: @collection_id
+        )
+      end
       def empty_stats
         { synced: 0, skipped: 0, errors: [] }
       end
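
The purge guard added in this file (`guard_blocks_purge?`) reduces to a small predicate. The sketch below restates it as a standalone method; the values of `PURGE_GUARD_MIN_DOCS` and `PURGE_GUARD_FRACTION` are not visible in this part of the diff, so the `20` and `0.5` used in the usage lines are assumptions chosen only for illustration.

```ruby
# Returns true when the purge should be refused.
# min_docs and max_fraction stand in for PURGE_GUARD_MIN_DOCS and
# PURGE_GUARD_FRACTION, whose real values are not shown in this diff.
def guard_blocks_purge?(stale_count:, manifest_size:, min_docs:, max_fraction:, force: false)
  return false if force                     # explicit override
  return false if manifest_size < min_docs  # small collections stay deletable

  # Strictly greater than: a purge at exactly the threshold is allowed.
  stale_count.to_f / manifest_size > max_fraction
end

# Hypothetical thresholds, for illustration only.
puts guard_blocks_purge?(stale_count: 60, manifest_size: 100, min_docs: 20, max_fraction: 0.5)
puts guard_blocks_purge?(stale_count: 3, manifest_size: 5, min_docs: 20, max_fraction: 0.5)
```

The floor matters because a five-document collection that loses three documents is a 60% deletion, yet it is far more likely to be a legitimate cleanup than a partial index.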

data/lib/woods/unblocked/rate_limiter.rb CHANGED

@@ -1,7 +1,15 @@
 # frozen_string_literal: true
+require 'woods'
 module Woods
   module Unblocked
+    # Raised when the daily API call budget is exhausted. Subclasses
+    # Woods::Error so existing +rescue Woods::Error+ sites keep working;
+    # callers that need to branch on exhaustion rescue this class instead of
+    # matching the message string.
+    class BudgetExhaustedError < Woods::Error; end
     # Daily budget-based rate limiter for the Unblocked API (1000 calls/day).
     #
     # Unlike Notion's per-second throttling, Unblocked limits by daily call count.
@@ -35,13 +43,13 @@ module Woods
       #
       # @yield The API call to execute
       # @return [Object] The block's return value
-      # @raise [Woods::Error] if daily budget is exhausted
+      # @raise [BudgetExhaustedError] if daily budget is exhausted
       def track
         raise ArgumentError, 'block required' unless block_given?
         @mutex.synchronize do
           if @calls_today >= @daily_budget
-            raise Woods::Error,
+            raise BudgetExhaustedError,
                   "Unblocked API daily budget exhausted (#{@daily_budget} calls). " \
                   'Budget resets at midnight PST. Use UNBLOCKED_DAILY_BUDGET to adjust.'
           end
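
The point of introducing `BudgetExhaustedError` as a subclass is that callers can branch on the class while older `rescue Woods::Error` sites keep working. The sketch below shows that pattern with stand-in classes (`SketchError`, `SketchBudgetExhausted`, `SketchLimiter` are all hypothetical names). Where the real limiter increments its counter is not visible in this hunk, so the increment inside the lock is an assumption.

```ruby
# Stand-ins for Woods::Error and Woods::Unblocked::BudgetExhaustedError.
class SketchError < StandardError; end
class SketchBudgetExhausted < SketchError; end

# Minimal daily-budget limiter; counting inside the mutex is an assumption.
class SketchLimiter
  def initialize(daily_budget:)
    @daily_budget = daily_budget
    @calls_today = 0
    @mutex = Mutex.new
  end

  def track
    raise ArgumentError, 'block required' unless block_given?

    @mutex.synchronize do
      if @calls_today >= @daily_budget
        raise SketchBudgetExhausted, "daily budget exhausted (#{@daily_budget} calls)"
      end
      @calls_today += 1
    end
    yield
  end
end

# Stops the loop on the exhaustion class instead of matching the message text.
def run_calls(limiter, count)
  completed = 0
  count.times do
    limiter.track { completed += 1 }
  rescue SketchBudgetExhausted
    break
  end
  completed
end

puts run_calls(SketchLimiter.new(daily_budget: 3), 5)
```

A site that still rescues the parent class catches the new error as well, which is why the change is backward compatible for existing callers.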

data/lib/woods/unblocked/sync_manifest.rb ADDED

@@ -0,0 +1,135 @@
+# frozen_string_literal: true
+require 'json'
+require 'fileutils'
+module Woods
+  module Unblocked
+    # Tracks what was last pushed to an Unblocked collection so a sync can
+    # skip unchanged documents, re-push changed ones, and delete orphans.
+    #
+    # The manifest is the local source of truth for change detection: each
+    # entry records the content hash of the document we last pushed for a URI
+    # plus the remote +document_id+ (needed for deletes). Persisted as JSON
+    # alongside the extraction output and restored across CI runs via the CI
+    # provider's cache. A missing or corrupt file degrades to "everything is
+    # new" — a correct (if expensive) full sync that rebuilds the manifest.
+    #
+    # Modeled on the embedding indexer's checkpoint (load JSON → compare
+    # per-key hash → save JSON).
+    #
+    # @example
+    #   manifest = SyncManifest.new(path: "tmp/woods/unblocked_sync_manifest.json",
+    #                               collection_id: "col-uuid")
+    #   manifest.unchanged?(uri, hash)  # => false on first run
+    #   manifest.record(uri:, hash:, document_id:)
+    #   manifest.save
+    #
+    class SyncManifest
+      VERSION = 1
+      # @param path [String] JSON file path for the manifest
+      # @param collection_id [String] Target collection UUID — a stored manifest
+      #   for a *different* collection is discarded (cache-key reuse guard).
+      def initialize(path:, collection_id:)
+        @path = path
+        @collection_id = collection_id
+        @documents = load
+      end
+      # @return [Boolean] true when no documents are recorded
+      def empty?
+        @documents.empty?
+      end
+      # @param uri [String] Document URI
+      # @param hash [String] Content hash of the document we would push now
+      # @return [Boolean] true when the recorded hash matches (safe to skip)
+      def unchanged?(uri, hash)
+        entry = @documents[uri]
+        !entry.nil? && entry['hash'] == hash
+      end
+      # Record (or update) what we pushed for a URI.
+      #
+      # @param uri [String] Document URI
+      # @param hash [String, nil] Content hash pushed (nil forces a future re-push)
+      # @param document_id [String, nil] Remote document UUID (for later deletes)
+      def record(uri:, hash:, document_id:)
+        @documents[uri] = { 'hash' => hash, 'document_id' => document_id }
+      end
+      # @param uri [String] Document URI
+      # @return [String, nil] Stored remote document_id, if known
+      def document_id_for(uri)
+        @documents.dig(uri, 'document_id')
+      end
+      # URIs we have a record of that are absent from the current run's set.
+      #
+      # @param current_uris [Array<String>, Set] URIs that still exist this run
+      # @return [Array<String>] recorded URIs no longer present (deletion candidates)
+      def stale_uris(current_uris)
+        present = current_uris.to_a
+        @documents.keys - present
+      end
+      # @return [Integer] number of recorded documents
+      def size
+        @documents.size
+      end
+      # Drop a URI from the manifest (after a successful remote delete).
+      #
+      # @param uri [String] Document URI
+      def forget(uri)
+        @documents.delete(uri)
+      end
+      # Persist the manifest atomically (temp file + rename) so an interrupted
+      # write never leaves a torn file in the CI cache.
+      def save
+        FileUtils.mkdir_p(File.dirname(@path))
+        payload = JSON.generate(
+          'version' => VERSION,
+          'collection_id' => @collection_id,
+          'documents' => @documents
+        )
+        tmp = "#{@path}.tmp"
+        File.write(tmp, payload)
+        File.rename(tmp, @path)
+      end
+      private
+      # Load the persisted documents, discarding data from a different
+      # collection, a different schema version, or an unparseable file.
+      # Every discard warns to stderr — the consequence (a full re-push) is
+      # expensive enough that operators need to know why it happened.
+      #
+      # @return [Hash{String=>Hash}] uri => { 'hash' =>, 'document_id' => }
+      def load
+        return {} unless File.exist?(@path)
+        parsed = JSON.parse(File.read(@path))
+        return discard('not a JSON object') unless parsed.is_a?(Hash)
+        return discard("schema version #{parsed['version'].inspect}, expected #{VERSION}") unless
+          parsed['version'] == VERSION
+        return discard("written for collection #{parsed['collection_id'].inspect}, expected #{@collection_id}") unless
+          parsed['collection_id'] == @collection_id
+        documents = parsed['documents']
+        documents.is_a?(Hash) ? documents : {}
+      rescue JSON::ParserError
+        discard('unparseable JSON')
+      end
+      # @param reason [String] Why the persisted manifest is unusable
+      # @return [Hash] empty documents hash (degrades to a full re-push)
+      def discard(reason)
+        warn "WARNING: discarding sync manifest at #{@path} (#{reason}) — next sync re-pushes all documents"
+        {}
+      end
+    end
+  end
+end
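
The two behaviours that make the manifest safe to keep in a CI cache are the temp-file-plus-rename write and the "degrade to empty" load. The sketch below reproduces both outside the class, as a rough illustration rather than the gem's implementation; the helper names `save_atomically` and `load_documents` are hypothetical, and the warning output of the real `discard` method is omitted.

```ruby
require 'json'
require 'tmpdir'

# Write to a sibling temp file, then rename over the target. A rename within
# one filesystem replaces the file in a single step, so readers see either
# the old manifest or the new one, never a partial write.
def save_atomically(path, payload)
  tmp = "#{path}.tmp"
  File.write(tmp, JSON.generate(payload))
  File.rename(tmp, path)
end

# Any unusable file yields an empty hash, which the sync treats as
# "everything is new" (a full re-push that rebuilds the manifest).
def load_documents(path, collection_id:, version: 1)
  return {} unless File.exist?(path)

  parsed = JSON.parse(File.read(path))
  return {} unless parsed.is_a?(Hash)
  return {} unless parsed['version'] == version && parsed['collection_id'] == collection_id

  parsed['documents'].is_a?(Hash) ? parsed['documents'] : {}
rescue JSON::ParserError
  {}
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'manifest.json')
  documents = { 'uri-1' => { 'hash' => 'abc', 'document_id' => 'doc-1' } }
  save_atomically(path, { 'version' => 1, 'collection_id' => 'col-a', 'documents' => documents })
  puts load_documents(path, collection_id: 'col-a').keys.inspect
  puts load_documents(path, collection_id: 'col-b').inspect
end
```

Checking `collection_id` on load guards against a cache key being reused across collections: a manifest written for another collection would otherwise cause documents to be skipped or deleted against the wrong remote state.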

data/lib/woods/version.rb CHANGED

@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module Woods
-  VERSION = '1.3.0'
+  VERSION = '1.4.1'
 end

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: woods
 version: !ruby/object:Gem::Version
-  version: 1.3.0
+  version: 1.4.1
 platform: ruby
 authors:
 - Leah Armstrong
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2026-05-13 00:00:00.000000000 Z
+date: 2026-06-11 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: mcp
@@ -305,6 +305,7 @@ files:
 - lib/woods/unblocked/document_builder.rb
 - lib/woods/unblocked/exporter.rb
 - lib/woods/unblocked/rate_limiter.rb
+- lib/woods/unblocked/sync_manifest.rb
 - lib/woods/util/host_guard.rb
 - lib/woods/version.rb
 homepage: https://github.com/lost-in-the/woods