@pinkpixel/sugarstitch 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (38)
  1. package/CHANGELOG.md +59 -0
  2. package/LICENSE +21 -0
  3. package/OVERVIEW.md +306 -0
  4. package/README.md +462 -0
  5. package/assets/banner_dark.png +0 -0
  6. package/assets/banner_light.png +0 -0
  7. package/assets/logo.png +0 -0
  8. package/assets/screenshot_cli.png +0 -0
  9. package/assets/screenshot_completed.png +0 -0
  10. package/assets/screenshot_homepage.png +0 -0
  11. package/assets/screenshot_scraping.png +0 -0
  12. package/dist/index.js +216 -0
  13. package/dist/scraper.js +719 -0
  14. package/dist/server.js +1272 -0
  15. package/package.json +26 -0
  16. package/public/favicon.png +0 -0
  17. package/scripts/add-shebang.js +11 -0
  18. package/src/index.ts +217 -0
  19. package/src/scraper.ts +903 -0
  20. package/src/server.ts +1319 -0
  21. package/tsconfig.json +12 -0
  22. package/website/astro.config.mjs +5 -0
  23. package/website/package-lock.json +6358 -0
  24. package/website/package.json +18 -0
  25. package/website/public/banner_dark.png +0 -0
  26. package/website/public/banner_light.png +0 -0
  27. package/website/public/favicon.png +0 -0
  28. package/website/public/screenshot_cli.png +0 -0
  29. package/website/public/screenshot_completed.png +0 -0
  30. package/website/public/screenshot_homepage.png +0 -0
  31. package/website/public/screenshot_scraping.png +0 -0
  32. package/website/src/layouts/DocsLayout.astro +142 -0
  33. package/website/src/pages/docs/install.astro +96 -0
  34. package/website/src/pages/docs/use-the-app.astro +131 -0
  35. package/website/src/pages/index.astro +94 -0
  36. package/website/src/styles/site.css +611 -0
  37. package/website/tsconfig.json +3 -0
  38. package/website/wrangler.toml +6 -0
package/CHANGELOG.md ADDED
# Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project currently tracks releases in a simple manual format.

## [1.0.0] - 2026-03-25

### Added

- Initial public CLI for scraping fiber arts pattern pages into local JSON, image, and PDF files
- Branded README header with the SugarStitch logo
- Plain-text pattern artifacts saved under `texts/<pattern_title>/pattern.txt` for each scraped page
- README improvements covering installation, usage, selector discovery, troubleshooting, and output structure
- MIT license
- A simple local web UI with a mode dropdown, URL inputs, output field, summary view, and run log
- A shared scraper module so the CLI and UI use the same scraping engine
- Built-in selector presets for generic pages, WordPress posts, and WooCommerce product pages
- Advanced per-run selector overrides for title, description, materials, instructions, and images
- Saved site profiles loaded from a JSON config file
- Preview mode for testing selectors before a full scrape
- Discovery crawl mode for following links from a listing page and scraping discovered child pages
- Output directory support for both the CLI and local UI
- In-page loading feedback with a spinner/progress overlay in the local UI
- `OVERVIEW.md` with a technical and development-oriented guide to the project
- Crawl language filtering to keep discovered URLs focused on one language
- Crawl pagination support for paginated listing pages and load-more-style archives that expose regular page URLs
- A light/dark mode toggle in the local UI with branded banner swapping
- Local static serving for UI branding assets and favicon support
- A colored ASCII SugarStitch banner for the CLI

### Changed

- Broadened user-facing project wording from sewing-specific language to fiber arts language so the docs and CLI better reflect crochet, knitting, sewing, and related pattern sites
- Corrected the CLI version string to `1.0.0`
- Wired the package up as a real executable CLI with a `bin` entry and a post-build shebang step
- Improved CLI option placeholders to use clearer argument names
- Refactored the scraping logic out of the CLI entrypoint and into reusable shared code
- Added preset-aware selector matching so the CLI and UI can switch site strategies without code edits
- Added field-level selector overrides so users can adjust only the parts a preset misses
- Added reusable profile resolution so presets and overrides can be saved and reused across runs
- Added preview flows in both the CLI and the local UI
- Added early validation so `--url` and `--file` cannot be used together
- Added URL normalization and validation for single-URL and batch file input
- Added URL deduplication for batch mode
- Added image and PDF URL deduplication before download
- Added request timeouts for page fetches and file downloads
- Improved filename sanitization with a stable fallback for empty titles
- Changed duplicate handling so previously scraped `sourceUrl` entries are skipped before network work begins
- Made output loading safer by surfacing invalid JSON instead of silently overwriting it
- Improved result messaging so PDF-only or partial-content matches are clearer in the UI
- Expanded the README to document crawl mode, output directories, profiles, preview flow, and UI behavior
- Refreshed the docs so README, OVERVIEW, and CHANGELOG stay aligned with the current UI, output model, and branding
- Expanded crawl controls in both the CLI and UI to include language and pagination tuning

### Fixed

- Prevented accidental re-scraping of URLs already present in the output JSON
- Reduced duplicate asset downloads caused by repeated image or PDF links on a page
package/LICENSE ADDED
MIT License

Copyright (c) 2026 Pink Pixel

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
package/OVERVIEW.md ADDED
# Overview

## Purpose

SugarStitch is a local-first Node/TypeScript scraper for fiber arts pattern websites. It is meant to support a few related workflows without forcing users into one style of use:

- scrape one known pattern page
- scrape many known pattern pages from a list
- start from a listing or archive page, discover child links, and scrape the discovered pattern pages
- preview extraction before saving files

The project supports both a command-line interface and a simple browser UI so the same scraping engine can be used by both technical and non-technical users.

## Current Feature Set

At a high level, SugarStitch now supports:

- CLI scraping
- CLI startup banner
- local browser UI
- light/dark mode toggle in the UI
- selector presets
- one-off selector overrides
- saved site profiles
- preview mode
- output directory selection
- discovery crawl mode
- crawl language filtering
- crawl pagination support for listing pages with regular paginated URLs
- duplicate detection by `sourceUrl`
- plain-text pattern artifact generation
- PDF and image downloading

## Core Architecture

The project is split into three main layers:

1. Shared scraping engine
   File: [`src/scraper.ts`](/home/sizzlebop/PINKPIXEL/PROJECTS/CURRENT/sugarstitch/src/scraper.ts)

2. CLI wrapper
   File: [`src/index.ts`](/home/sizzlebop/PINKPIXEL/PROJECTS/CURRENT/sugarstitch/src/index.ts)

3. Local browser UI
   File: [`src/server.ts`](/home/sizzlebop/PINKPIXEL/PROJECTS/CURRENT/sugarstitch/src/server.ts)

The shared scraper owns the real behavior. The CLI and UI should stay as thin adapters around that logic.

## File Map

- [`src/scraper.ts`](/home/sizzlebop/PINKPIXEL/PROJECTS/CURRENT/sugarstitch/src/scraper.ts)
  Shared types, selector presets, saved profile loading, preview logic, discovery crawl logic, pagination expansion, language filtering, page scraping, file downloads, and JSON append behavior.

- [`src/index.ts`](/home/sizzlebop/PINKPIXEL/PROJECTS/CURRENT/sugarstitch/src/index.ts)
  CLI argument parsing, URL source handling, output path resolution, crawl option collection, and handoff into the shared scraper.

- [`src/server.ts`](/home/sizzlebop/PINKPIXEL/PROJECTS/CURRENT/sugarstitch/src/server.ts)
  Small Node HTTP server that renders the HTML UI, serves local branding assets, handles preview/scrape form posts, manages loading-state UX, and returns result pages.

- [`sugarstitch.profiles.json`](/home/sizzlebop/PINKPIXEL/PROJECTS/CURRENT/sugarstitch/sugarstitch.profiles.json)
  Starter saved profile config file. The UI and CLI know how to load profiles from it by default.

- [`README.md`](/home/sizzlebop/PINKPIXEL/PROJECTS/CURRENT/sugarstitch/README.md)
  User-facing usage guide.

- [`CHANGELOG.md`](/home/sizzlebop/PINKPIXEL/PROJECTS/CURRENT/sugarstitch/CHANGELOG.md)
  Release-oriented change history.

- [`scripts/add-shebang.js`](/home/sizzlebop/PINKPIXEL/PROJECTS/CURRENT/sugarstitch/scripts/add-shebang.js)
  Adds a Node shebang to the built CLI entrypoint after TypeScript compilation.

- [`package.json`](/home/sizzlebop/PINKPIXEL/PROJECTS/CURRENT/sugarstitch/package.json)
  Package metadata, scripts, executable `bin` entry, and dependency definitions.

## Shared Scraper Responsibilities

The shared scraper currently handles:

- selector preset definitions
- selector override sanitization and merging
- saved site profile loading
- run strategy resolution
- URL normalization and deduplication
- preview extraction
- bounded discovery crawl
- crawl language filtering
- pagination seed expansion for listing pages
- image and PDF download handling
- plain-text pattern artifact generation
- output JSON loading and append behavior
- duplicate `sourceUrl` prevention

If a feature changes what SugarStitch actually scrapes or how it discovers pages, it should usually begin in [`src/scraper.ts`](/home/sizzlebop/PINKPIXEL/PROJECTS/CURRENT/sugarstitch/src/scraper.ts).

## Selector System

The selector system has three layers:

1. Built-in preset
   Examples: `generic`, `wordpress`, `woocommerce`

2. Optional saved profile
   Loaded from a JSON config file and typically used for site-specific tuning

3. Optional one-off overrides
   Per-run selector overrides that replace only the fields provided

Resolution order:

- choose preset
- optionally load saved profile
- merge profile overrides
- merge one-off overrides last

That means:

- one-off overrides win over profile values
- profile values win over the base preset

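The resolution order can be sketched in TypeScript. This is a minimal illustration of the layering described above, not the project's actual implementation; the `SelectorSet` field names mirror the fields named in the docs, and the rule that empty values never clobber earlier layers is an assumption.

```typescript
// Illustrative selector-set shape; field names follow the docs.
type SelectorSet = {
  title?: string;
  description?: string;
  materials?: string;
  instructions?: string;
  images?: string;
};

// Merge order: preset first, then saved profile, then one-off overrides.
// Later layers win; undefined/empty values never overwrite earlier ones.
function resolveSelectors(
  preset: SelectorSet,
  profile?: SelectorSet,
  overrides?: SelectorSet
): SelectorSet {
  const merged: SelectorSet = { ...preset };
  for (const layer of [profile, overrides]) {
    if (!layer) continue;
    for (const [field, value] of Object.entries(layer)) {
      if (value) merged[field as keyof SelectorSet] = value;
    }
  }
  return merged;
}
```

For example, a `wordpress` preset title of `h1.entry-title` would survive a profile that only tunes `images`, but lose to a one-off `title` override supplied for a single run.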
## Preview Flow

Preview mode exists to answer:

“Before I download anything, what does SugarStitch think this page contains?”

The preview flow:

1. resolve the preset/profile/override strategy
2. fetch the page HTML
3. run selector extraction
4. return title, description, materials, instructions, image URLs, and PDF URLs
5. also derive a fuller page-text block for plain-text artifact output
6. do not write JSON
7. do not download assets

This is the safest way to validate selectors on a new site.

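One step of that flow, reporting PDF URLs without downloading them, can be sketched as follows. The real scraper presumably uses proper selector matching rather than a regex; this stand-alone sketch only illustrates the “report, don't fetch” shape of preview mode, including the deduplication of repeated links.

```typescript
// Collect distinct PDF link targets from raw HTML for preview display.
// Regex extraction is a simplification; nothing is written to disk.
function previewPdfUrls(html: string): string[] {
  const urls = new Set<string>(); // Set dedupes repeated links on the page
  const re = /href="([^"]+\.pdf)"/gi;
  let m: RegExpExecArray | null;
  while ((m = re.exec(html)) !== null) urls.add(m[1]);
  return [...urls]; // returned for display only
}
```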
## Discovery Crawl Flow

Discovery crawl mode is for archive or listing pages where the real patterns live deeper in the site.

The crawl flow:

1. start from one or more seed URLs
2. optionally expand paginated listing pages
3. fetch page HTML
4. collect link targets from `a[href]`
5. resolve them to absolute URLs
6. optionally restrict to the same domain
7. optionally restrict by language
8. optionally filter by URL or link text pattern
9. stop at the configured depth and max-URL limits
10. pass the discovered URLs into the normal scrape flow

Important:

- crawl mode is intentionally bounded
- it is not meant to be a full general-purpose spider
- it is designed to help discover likely pattern pages from listing pages

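Steps 4 through 6 and the max-URL bound can be sketched for a single page. This is an assumption-level illustration (regex link collection, a bare `maxUrls` cap); the real crawler also applies depth, language, and text-pattern filters.

```typescript
// Collect absolute, same-domain, deduplicated link targets from one page's
// HTML, stopping at `maxUrls`. Illustrative only.
function discoverLinks(html: string, baseUrl: string, maxUrls: number): string[] {
  const found = new Set<string>();
  const base = new URL(baseUrl);
  const re = /<a\s[^>]*href="([^"]+)"/gi;
  let m: RegExpExecArray | null;
  while ((m = re.exec(html)) !== null && found.size < maxUrls) {
    let resolved: URL;
    try {
      resolved = new URL(m[1], base); // step 5: resolve relative hrefs
    } catch {
      continue; // skip malformed hrefs
    }
    if (resolved.hostname !== base.hostname) continue; // step 6: same domain
    found.add(resolved.href); // Set dedupes repeated listing links
  }
  return [...found];
}
```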
## Crawl Language Filtering

Some sites expose multiple language sections from the same listing page. For example:

- English archive page
- French archive page
- Portuguese archive page

The crawler can now prefer one language when discovered URLs clearly indicate a language, either by query string or pathname conventions. This helps avoid mixing multiple language archives into a single run.

This is especially useful for sites like Tilda where a top-level page links to multiple language-specific pattern sections.

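A heuristic of that kind might look like the sketch below. The specific marker conventions (a leading `/fr/` path segment, a `?lang=fr` query parameter) are assumptions about common site layouts, not SugarStitch's exact rules; URLs with no detectable marker are kept rather than dropped.

```typescript
// Language codes this sketch recognizes; illustrative subset.
const KNOWN_LANGS = ["en", "fr", "pt", "de", "es"];

// Keep a URL when its query string or first path segment signals the wanted
// language, or when it carries no language marker at all.
function matchesLanguage(url: string, wanted: string): boolean {
  const u = new URL(url);
  const queryLang = u.searchParams.get("lang");
  if (queryLang) return queryLang.toLowerCase() === wanted;
  const segment = u.pathname.split("/").filter(Boolean)[0]?.toLowerCase();
  if (segment && KNOWN_LANGS.includes(segment)) return segment === wanted;
  return true; // no detectable marker: do not exclude
}
```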
## Crawl Pagination Support

Some sites use a `Load More` interaction in the UI but also expose those later batches as normal paginated URLs.

SugarStitch now supports pagination-aware crawl seeding:

- inspect the seed page for listing pagination hints
- detect max page counts where possible
- add `/page/2/`, `/page/3/`, and similar listing pages up to a configured cap
- continue crawl discovery from those expanded listing pages

This works well when the site exposes traditional archive pages even if the visual UI presents them as a load-more interaction.

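The seed-expansion step can be sketched as below. The WordPress-style `/page/N/` convention is an assumption; real sites vary, which is why the scraper inspects each seed page for pagination hints first.

```typescript
// Expand one listing URL into /page/N/ siblings up to a detected or
// configured max page count. Illustrative URL convention only.
function expandPaginationSeeds(listingUrl: string, maxPages: number): string[] {
  const seeds = [listingUrl]; // the original listing page is always page 1
  const base = listingUrl.endsWith("/") ? listingUrl : listingUrl + "/";
  for (let page = 2; page <= maxPages; page++) {
    seeds.push(`${base}page/${page}/`);
  }
  return seeds;
}
```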
## Output Model

Each successful page becomes a `PatternData` object with:

- `title`
- `description`
- `materials`
- `instructions`
- `sourceUrl`
- `localImages`
- `localPdfs`
- `localTextFile`

Output behavior:

- JSON is appended to an output file
- duplicate `sourceUrl` entries are skipped before re-scraping
- downloaded images go into `images/<sanitized-title>/`
- downloaded PDFs go into `pdfs/<sanitized-title>/`
- extracted page text is written to `texts/<sanitized-title>/pattern.txt`
- all of that lives under the selected output directory

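As a TypeScript sketch, the record shape and the duplicate-skipping step might look like this. The field names come from the list above, but the field types are inferred, and `urlsNeedingScrape` is an illustrative helper name, not necessarily the project's.

```typescript
// One scraped record; field names from the docs, types inferred.
interface PatternData {
  title: string;
  description: string;
  materials: string;
  instructions: string;
  sourceUrl: string;
  localImages: string[];
  localPdfs: string[];
  localTextFile: string;
}

// Filter candidate URLs against what the output JSON already holds, so
// duplicates are skipped before any network work begins.
function urlsNeedingScrape(candidates: string[], existing: PatternData[]): string[] {
  const seen = new Set(existing.map((p) => p.sourceUrl));
  return candidates.filter((url) => !seen.has(url));
}
```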
## UI Notes

The UI is intentionally simple and server-rendered.

Key design choices:

- no frontend framework
- HTML assembled on the server
- form-submit workflow instead of a client-side app
- loading overlay and spinner so long requests feel active
- persisted light/dark theme toggle with light mode as the default
- result pages returned after preview or scrape completion

Why it is structured this way:

- easy to maintain
- easy to inspect
- minimal dependencies
- keeps most logic in the shared scraper instead of duplicated client code

## CLI Notes

The CLI acts as a thin adapter over the shared scraper. It currently handles:

- mutual exclusion of `--url` and `--file`
- URL list loading from files
- output path and output directory resolution
- crawl option parsing
- startup banner rendering
- preview vs. full scrape routing

If behavior is purely about command syntax or argument ergonomics, it belongs in [`src/index.ts`](/home/sizzlebop/PINKPIXEL/PROJECTS/CURRENT/sugarstitch/src/index.ts).

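The first item in that list, early validation of the URL source, can be sketched as a small resolver. The option names `--url` and `--file` come from the docs; the function name and error wording are illustrative.

```typescript
// Resolve exactly one URL source from the parsed CLI options, failing fast
// when both or neither are supplied.
function resolveUrlSource(opts: { url?: string; file?: string }):
  { kind: "url" | "file"; value: string } {
  if (opts.url && opts.file) {
    throw new Error("Use either --url or --file, not both.");
  }
  if (opts.url) return { kind: "url", value: opts.url };
  if (opts.file) return { kind: "file", value: opts.file };
  throw new Error("Provide --url or --file.");
}
```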
## Result and Success Semantics

A “successful” scrape does not mean every field matched.

Examples of valid successful runs:

- title plus PDF download only
- title plus images only
- title, description, and article content with no PDF

This matters because some fiber arts sites store the actual pattern inside a PDF, while others use a blog-post layout with visible HTML content.

## Known Constraints

- selector quality is site-dependent
- some sites render important content with JavaScript after load
- browser security means the UI uses an output-directory path field rather than a native folder picker
- the crawler is bounded and heuristic-based, not exhaustive
- HTTPS certificate issues in some environments can affect preview or scrape runs against certain sites
- pagination support currently assumes the site exposes discoverable regular listing pages rather than a fully hidden, API-only load-more flow

## Good Next Improvements

If development continues, strong candidates include:

- a richer site-specific profile library
- selector test diagnostics showing which selector matched which field
- live streaming logs in the UI instead of request/response-only updates
- an optional desktop wrapper if native folder selection becomes important
- export options beyond JSON
- more explicit crawl diagnostics for why a URL was followed or skipped
- support for sites whose load-more behavior only exists through AJAX or browser interaction

## Development Workflow

### Build

```bash
npm run build
```

### Run CLI

```bash
npm run scrape -- --url "https://example.com/pattern"
```

### Run UI

```bash
npm run ui
```

## Common Places To Edit

If scraping quality is the issue:
- edit [`src/scraper.ts`](/home/sizzlebop/PINKPIXEL/PROJECTS/CURRENT/sugarstitch/src/scraper.ts)

If CLI behavior is the issue:
- edit [`src/index.ts`](/home/sizzlebop/PINKPIXEL/PROJECTS/CURRENT/sugarstitch/src/index.ts)

If UI workflow or local-browser UX is the issue:
- edit [`src/server.ts`](/home/sizzlebop/PINKPIXEL/PROJECTS/CURRENT/sugarstitch/src/server.ts)

If site-specific defaults are the issue:
- edit [`sugarstitch.profiles.json`](/home/sizzlebop/PINKPIXEL/PROJECTS/CURRENT/sugarstitch/sugarstitch.profiles.json)