@pinkpixel/sugarstitch 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (38)
  1. package/CHANGELOG.md +59 -0
  2. package/LICENSE +21 -0
  3. package/OVERVIEW.md +306 -0
  4. package/README.md +462 -0
  5. package/assets/banner_dark.png +0 -0
  6. package/assets/banner_light.png +0 -0
  7. package/assets/logo.png +0 -0
  8. package/assets/screenshot_cli.png +0 -0
  9. package/assets/screenshot_completed.png +0 -0
  10. package/assets/screenshot_homepage.png +0 -0
  11. package/assets/screenshot_scraping.png +0 -0
  12. package/dist/index.js +216 -0
  13. package/dist/scraper.js +719 -0
  14. package/dist/server.js +1272 -0
  15. package/package.json +26 -0
  16. package/public/favicon.png +0 -0
  17. package/scripts/add-shebang.js +11 -0
  18. package/src/index.ts +217 -0
  19. package/src/scraper.ts +903 -0
  20. package/src/server.ts +1319 -0
  21. package/tsconfig.json +12 -0
  22. package/website/astro.config.mjs +5 -0
  23. package/website/package-lock.json +6358 -0
  24. package/website/package.json +18 -0
  25. package/website/public/banner_dark.png +0 -0
  26. package/website/public/banner_light.png +0 -0
  27. package/website/public/favicon.png +0 -0
  28. package/website/public/screenshot_cli.png +0 -0
  29. package/website/public/screenshot_completed.png +0 -0
  30. package/website/public/screenshot_homepage.png +0 -0
  31. package/website/public/screenshot_scraping.png +0 -0
  32. package/website/src/layouts/DocsLayout.astro +142 -0
  33. package/website/src/pages/docs/install.astro +96 -0
  34. package/website/src/pages/docs/use-the-app.astro +131 -0
  35. package/website/src/pages/index.astro +94 -0
  36. package/website/src/styles/site.css +611 -0
  37. package/website/tsconfig.json +3 -0
  38. package/website/wrangler.toml +6 -0
package/CHANGELOG.md ADDED
# Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project currently tracks releases in a simple manual format.

## [1.0.0] - 2026-03-25

### Added

- Initial public CLI for scraping fiber arts pattern pages into local JSON, image, and PDF files
- Branded README header with the SugarStitch logo
- Plain-text pattern artifacts saved under `texts/<pattern_title>/pattern.txt` for each scraped page
- README improvements covering installation, usage, selector discovery, troubleshooting, and output structure
- MIT license
- A simple local web UI with a mode dropdown, URL inputs, output field, summary view, and run log
- A shared scraper module so the CLI and UI use the same scraping engine
- Built-in selector presets for generic pages, WordPress posts, and WooCommerce product pages
- Advanced per-run selector overrides for title, description, materials, instructions, and images
- Saved site profiles loaded from a JSON config file
- Preview mode for testing selectors before a full scrape
- Discovery crawl mode for following links from a listing page and scraping discovered child pages
- Output directory support for both the CLI and local UI
- In-page loading feedback with a spinner/progress overlay in the local UI
- `OVERVIEW.md` with a technical and development-oriented guide to the project
- Crawl language filtering to keep discovered URLs focused on one language
- Crawl pagination support for paginated listing pages and load-more-style archives that expose regular page URLs
- A light/dark mode toggle in the local UI with branded banner swapping
- Local static serving for UI branding assets and favicon support
- A colored ASCII SugarStitch banner for the CLI

### Changed

- Broadened user-facing project wording from sewing-specific language to fiber arts language so the docs and CLI better reflect crochet, knitting, sewing, and related pattern sites
- Corrected the CLI version string to `1.0.0`
- Wired the package up as a real executable CLI with a `bin` entry and a post-build shebang step
- Improved CLI option placeholders to use clearer argument names
- Refactored the scraping logic out of the CLI entrypoint and into reusable shared code
- Added preset-aware selector matching so the CLI and UI can switch site strategies without code edits
- Added field-level selector overrides so users can adjust only the parts a preset misses
- Added reusable profile resolution so presets and overrides can be saved and reused across runs
- Added preview flows in both the CLI and the local UI
- Added early validation so `--url` and `--file` cannot be used together
- Added URL normalization and validation for single-URL and batch file input
- Added URL deduplication for batch mode
- Added image and PDF URL deduplication before download
- Added request timeouts for page fetches and file downloads
- Improved filename sanitization with a stable fallback for empty titles
- Changed duplicate handling so previously scraped `sourceUrl` entries are skipped before network work begins
- Made output loading safer by surfacing invalid JSON instead of silently overwriting it
- Improved result messaging so PDF-only or partial-content matches are clearer in the UI
- Expanded the README to document crawl mode, output directories, profiles, preview flow, and UI behavior
- Refreshed the docs so README, OVERVIEW, and CHANGELOG stay aligned with the current UI, output model, and branding
- Expanded crawl controls in both the CLI and UI to include language and pagination tuning

### Fixed

- Prevented accidental re-scraping of URLs already present in the output JSON
- Reduced duplicate asset downloads caused by repeated image or PDF links on a page
package/LICENSE ADDED
MIT License

Copyright (c) 2026 Pink Pixel

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
package/OVERVIEW.md ADDED
# Overview

## Purpose

SugarStitch is a local-first Node/TypeScript scraper for fiber arts pattern websites. It is meant to support a few related workflows without forcing users into one style of use:

- scrape one known pattern page
- scrape many known pattern pages from a list
- start from a listing or archive page, discover child links, and scrape the discovered pattern pages
- preview extraction before saving files

The project supports both a command-line interface and a simple browser UI so the same scraping engine can be used by both technical and non-technical users.

## Current Feature Set

At a high level, SugarStitch now supports:

- CLI scraping
- CLI startup banner
- local browser UI
- light/dark mode toggle in the UI
- selector presets
- one-off selector overrides
- saved site profiles
- preview mode
- output directory selection
- discovery crawl mode
- crawl language filtering
- crawl pagination support for listing pages with regular paginated URLs
- duplicate detection by `sourceUrl`
- plain-text pattern artifact generation
- PDF and image downloading

## Core Architecture

The project is split into three main layers:

1. Shared scraping engine
   File: [`src/scraper.ts`](/home/sizzlebop/PINKPIXEL/PROJECTS/CURRENT/sugarstitch/src/scraper.ts)

2. CLI wrapper
   File: [`src/index.ts`](/home/sizzlebop/PINKPIXEL/PROJECTS/CURRENT/sugarstitch/src/index.ts)

3. Local browser UI
   File: [`src/server.ts`](/home/sizzlebop/PINKPIXEL/PROJECTS/CURRENT/sugarstitch/src/server.ts)

The shared scraper owns the real behavior. The CLI and UI should stay as thin adapters around that logic.

## File Map

- [`src/scraper.ts`](/home/sizzlebop/PINKPIXEL/PROJECTS/CURRENT/sugarstitch/src/scraper.ts)
  Shared types, selector presets, saved profile loading, preview logic, discovery crawl logic, pagination expansion, language filtering, page scraping, file downloads, and JSON append behavior.

- [`src/index.ts`](/home/sizzlebop/PINKPIXEL/PROJECTS/CURRENT/sugarstitch/src/index.ts)
  CLI argument parsing, URL source handling, output path resolution, crawl option collection, and handoff into the shared scraper.

- [`src/server.ts`](/home/sizzlebop/PINKPIXEL/PROJECTS/CURRENT/sugarstitch/src/server.ts)
  Small Node HTTP server that renders the HTML UI, serves local branding assets, handles preview/scrape form posts, manages loading-state UX, and returns result pages.

- [`sugarstitch.profiles.json`](/home/sizzlebop/PINKPIXEL/PROJECTS/CURRENT/sugarstitch/sugarstitch.profiles.json)
  Starter saved profile config file. The UI and CLI know how to load profiles from it by default.

- [`README.md`](/home/sizzlebop/PINKPIXEL/PROJECTS/CURRENT/sugarstitch/README.md)
  User-facing usage guide.

- [`CHANGELOG.md`](/home/sizzlebop/PINKPIXEL/PROJECTS/CURRENT/sugarstitch/CHANGELOG.md)
  Release-oriented change history.

- [`scripts/add-shebang.js`](/home/sizzlebop/PINKPIXEL/PROJECTS/CURRENT/sugarstitch/scripts/add-shebang.js)
  Adds a Node shebang to the built CLI entrypoint after TypeScript compilation.

- [`package.json`](/home/sizzlebop/PINKPIXEL/PROJECTS/CURRENT/sugarstitch/package.json)
  Package metadata, scripts, executable `bin` entry, and dependency definitions.

## Shared Scraper Responsibilities

The shared scraper currently handles:

- selector preset definitions
- selector override sanitization and merging
- saved site profile loading
- run strategy resolution
- URL normalization and deduplication
- preview extraction
- bounded discovery crawl
- crawl language filtering
- pagination seed expansion for listing pages
- image and PDF download handling
- plain-text pattern artifact generation
- output JSON loading and append behavior
- duplicate `sourceUrl` prevention

If a feature changes what SugarStitch actually scrapes or how it discovers pages, it should usually begin in [`src/scraper.ts`](/home/sizzlebop/PINKPIXEL/PROJECTS/CURRENT/sugarstitch/src/scraper.ts).

## Selector System

The selector system has three layers:

1. Built-in preset
   Examples: `generic`, `wordpress`, `woocommerce`

2. Optional saved profile
   Loaded from a JSON config file and typically used for site-specific tuning

3. Optional one-off overrides
   Per-run selector overrides that replace only the fields provided

Resolution order:

- choose preset
- optionally load saved profile
- merge profile overrides
- merge one-off overrides last

That means:

- one-off overrides win over profile values
- profile values win over the base preset

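The resolution order can be sketched in TypeScript. This is a minimal illustration of the layering described above, not the project's actual implementation; the `SelectorSet` field names mirror the fields named in the docs, and the rule that empty values never clobber earlier layers is an assumption.

```typescript
// Illustrative selector-set shape; field names follow the docs.
type SelectorSet = {
  title?: string;
  description?: string;
  materials?: string;
  instructions?: string;
  images?: string;
};

// Merge order: preset first, then saved profile, then one-off overrides.
// Later layers win; undefined/empty values never overwrite earlier ones.
function resolveSelectors(
  preset: SelectorSet,
  profile?: SelectorSet,
  overrides?: SelectorSet
): SelectorSet {
  const merged: SelectorSet = { ...preset };
  for (const layer of [profile, overrides]) {
    if (!layer) continue;
    for (const [field, value] of Object.entries(layer)) {
      if (value) merged[field as keyof SelectorSet] = value;
    }
  }
  return merged;
}
```

For example, a `wordpress` preset title of `h1.entry-title` would survive a profile that only tunes `images`, but lose to a one-off `title` override supplied for a single run.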
## Preview Flow

Preview mode exists to answer:

“Before I download anything, what does SugarStitch think this page contains?”

The preview flow:

1. resolve the preset/profile/override strategy
2. fetch the page HTML
3. run selector extraction
4. return title, description, materials, instructions, image URLs, and PDF URLs
5. also derive a fuller page-text block for plain-text artifact output
6. do not write JSON
7. do not download assets

This is the safest way to validate selectors on a new site.

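One step of that flow, reporting PDF URLs without downloading them, can be sketched as follows. The real scraper presumably uses proper selector matching rather than a regex; this stand-alone sketch only illustrates the “report, don't fetch” shape of preview mode, including the deduplication of repeated links.

```typescript
// Collect distinct PDF link targets from raw HTML for preview display.
// Regex extraction is a simplification; nothing is written to disk.
function previewPdfUrls(html: string): string[] {
  const urls = new Set<string>(); // Set dedupes repeated links on the page
  const re = /href="([^"]+\.pdf)"/gi;
  let m: RegExpExecArray | null;
  while ((m = re.exec(html)) !== null) urls.add(m[1]);
  return [...urls]; // returned for display only
}
```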
## Discovery Crawl Flow

Discovery crawl mode is for archive or listing pages where the real patterns live deeper in the site.

The crawl flow:

1. start from one or more seed URLs
2. optionally expand paginated listing pages
3. fetch page HTML
4. collect link targets from `a[href]`
5. resolve them to absolute URLs
6. optionally restrict to the same domain
7. optionally restrict by language
8. optionally filter by URL or link text pattern
9. stop at the configured depth and max-URL limits
10. pass the discovered URLs into the normal scrape flow

Important:

- crawl mode is intentionally bounded
- it is not meant to be a full general-purpose spider
- it is designed to help discover likely pattern pages from listing pages

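Steps 4 through 6 and the max-URL bound can be sketched for a single page. This is an assumption-level illustration (regex link collection, a bare `maxUrls` cap); the real crawler also applies depth, language, and text-pattern filters.

```typescript
// Collect absolute, same-domain, deduplicated link targets from one page's
// HTML, stopping at `maxUrls`. Illustrative only.
function discoverLinks(html: string, baseUrl: string, maxUrls: number): string[] {
  const found = new Set<string>();
  const base = new URL(baseUrl);
  const re = /<a\s[^>]*href="([^"]+)"/gi;
  let m: RegExpExecArray | null;
  while ((m = re.exec(html)) !== null && found.size < maxUrls) {
    let resolved: URL;
    try {
      resolved = new URL(m[1], base); // step 5: resolve relative hrefs
    } catch {
      continue; // skip malformed hrefs
    }
    if (resolved.hostname !== base.hostname) continue; // step 6: same domain
    found.add(resolved.href); // Set dedupes repeated listing links
  }
  return [...found];
}
```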
## Crawl Language Filtering

Some sites expose multiple language sections from the same listing page. For example:

- English archive page
- French archive page
- Portuguese archive page

The crawler can now prefer one language when discovered URLs clearly indicate a language, either by query string or pathname conventions. This helps avoid mixing multiple language archives into a single run.

This is especially useful for sites like Tilda where a top-level page links to multiple language-specific pattern sections.

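A heuristic of that kind might look like the sketch below. The specific marker conventions (a leading `/fr/` path segment, a `?lang=fr` query parameter) are assumptions about common site layouts, not SugarStitch's exact rules; URLs with no detectable marker are kept rather than dropped.

```typescript
// Language codes this sketch recognizes; illustrative subset.
const KNOWN_LANGS = ["en", "fr", "pt", "de", "es"];

// Keep a URL when its query string or first path segment signals the wanted
// language, or when it carries no language marker at all.
function matchesLanguage(url: string, wanted: string): boolean {
  const u = new URL(url);
  const queryLang = u.searchParams.get("lang");
  if (queryLang) return queryLang.toLowerCase() === wanted;
  const segment = u.pathname.split("/").filter(Boolean)[0]?.toLowerCase();
  if (segment && KNOWN_LANGS.includes(segment)) return segment === wanted;
  return true; // no detectable marker: do not exclude
}
```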
## Crawl Pagination Support

Some sites use a `Load More` interaction in the UI but also expose those later batches as normal paginated URLs.

SugarStitch now supports pagination-aware crawl seeding:

- inspect the seed page for listing pagination hints
- detect max page counts where possible
- add `/page/2/`, `/page/3/`, and similar listing pages up to a configured cap
- continue crawl discovery from those expanded listing pages

This works well when the site exposes traditional archive pages even if the visual UI presents them as a load-more interaction.

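The seed-expansion step can be sketched as below. The WordPress-style `/page/N/` convention is an assumption; real sites vary, which is why the scraper inspects each seed page for pagination hints first.

```typescript
// Expand one listing URL into /page/N/ siblings up to a detected or
// configured max page count. Illustrative URL convention only.
function expandPaginationSeeds(listingUrl: string, maxPages: number): string[] {
  const seeds = [listingUrl]; // the original listing page is always page 1
  const base = listingUrl.endsWith("/") ? listingUrl : listingUrl + "/";
  for (let page = 2; page <= maxPages; page++) {
    seeds.push(`${base}page/${page}/`);
  }
  return seeds;
}
```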
## Output Model

Each successful page becomes a `PatternData` object with:

- `title`
- `description`
- `materials`
- `instructions`
- `sourceUrl`
- `localImages`
- `localPdfs`
- `localTextFile`

Output behavior:

- JSON is appended to an output file
- duplicate `sourceUrl` entries are skipped before re-scraping
- downloaded images go into `images/<sanitized-title>/`
- downloaded PDFs go into `pdfs/<sanitized-title>/`
- extracted page text is written to `texts/<sanitized-title>/pattern.txt`
- all of that lives under the selected output directory

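As a TypeScript sketch, the record shape and the duplicate-skipping step might look like this. The field names come from the list above, but the field types are inferred, and `urlsNeedingScrape` is an illustrative helper name, not necessarily the project's.

```typescript
// One scraped record; field names from the docs, types inferred.
interface PatternData {
  title: string;
  description: string;
  materials: string;
  instructions: string;
  sourceUrl: string;
  localImages: string[];
  localPdfs: string[];
  localTextFile: string;
}

// Filter candidate URLs against what the output JSON already holds, so
// duplicates are skipped before any network work begins.
function urlsNeedingScrape(candidates: string[], existing: PatternData[]): string[] {
  const seen = new Set(existing.map((p) => p.sourceUrl));
  return candidates.filter((url) => !seen.has(url));
}
```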
## UI Notes

The UI is intentionally simple and server-rendered.

Key design choices:

- no frontend framework
- HTML assembled on the server
- form-submit workflow instead of a client-side app
- loading overlay and spinner so long requests feel active
- persisted light/dark theme toggle with light mode as the default
- result pages returned after preview or scrape completion

Why it is structured this way:

- easy to maintain
- easy to inspect
- minimal dependencies
- keeps most logic in the shared scraper instead of duplicated client code

## CLI Notes

The CLI acts as a thin adapter over the shared scraper. It currently handles:

- mutual exclusion of `--url` and `--file`
- URL list loading from files
- output path and output directory resolution
- crawl option parsing
- startup banner rendering
- preview vs. full scrape routing

If behavior is purely about command syntax or argument ergonomics, it belongs in [`src/index.ts`](/home/sizzlebop/PINKPIXEL/PROJECTS/CURRENT/sugarstitch/src/index.ts).

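The first item in that list, early validation of the URL source, can be sketched as a small resolver. The option names `--url` and `--file` come from the docs; the function name and error wording are illustrative.

```typescript
// Resolve exactly one URL source from the parsed CLI options, failing fast
// when both or neither are supplied.
function resolveUrlSource(opts: { url?: string; file?: string }):
  { kind: "url" | "file"; value: string } {
  if (opts.url && opts.file) {
    throw new Error("Use either --url or --file, not both.");
  }
  if (opts.url) return { kind: "url", value: opts.url };
  if (opts.file) return { kind: "file", value: opts.file };
  throw new Error("Provide --url or --file.");
}
```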
## Result and Success Semantics

A “successful” scrape does not mean every field matched.

Examples of valid successful runs:

- title plus PDF download only
- title plus images only
- title, description, and article content with no PDF

This matters because some fiber arts sites store the actual pattern inside a PDF, while others use a blog-post layout with visible HTML content.

## Known Constraints

- selector quality is site-dependent
- some sites render important content with JavaScript after load
- browser security means the UI uses an output-directory path field rather than a native folder picker
- the crawler is bounded and heuristic-based, not exhaustive
- HTTPS certificate issues in some environments can affect preview or scrape runs against certain sites
- pagination support currently assumes the site exposes discoverable regular listing pages rather than a fully hidden, API-only load-more flow

## Good Next Improvements

If development continues, strong candidates include:

- a richer site-specific profile library
- selector test diagnostics showing which selector matched which field
- live streaming logs in the UI instead of request/response-only updates
- an optional desktop wrapper if native folder selection becomes important
- export options beyond JSON
- more explicit crawl diagnostics for why a URL was followed or skipped
- support for sites whose load-more behavior only exists through AJAX or browser interaction

## Development Workflow

### Build

```bash
npm run build
```

### Run CLI

```bash
npm run scrape -- --url "https://example.com/pattern"
```

### Run UI

```bash
npm run ui
```

## Common Places To Edit

If scraping quality is the issue:
- edit [`src/scraper.ts`](/home/sizzlebop/PINKPIXEL/PROJECTS/CURRENT/sugarstitch/src/scraper.ts)

If CLI behavior is the issue:
- edit [`src/index.ts`](/home/sizzlebop/PINKPIXEL/PROJECTS/CURRENT/sugarstitch/src/index.ts)

If UI workflow or local-browser UX is the issue:
- edit [`src/server.ts`](/home/sizzlebop/PINKPIXEL/PROJECTS/CURRENT/sugarstitch/src/server.ts)

If site-specific defaults are the issue:
- edit [`sugarstitch.profiles.json`](/home/sizzlebop/PINKPIXEL/PROJECTS/CURRENT/sugarstitch/sugarstitch.profiles.json)