recipe-scrapers-js 1.0.0-rc.0 → 1.0.0-rc.1

package/CHANGELOG.md ADDED
@@ -0,0 +1,37 @@
1
+ <!-- markdownlint-disable MD024 -->
2
+ # Changelog
3
+
4
+ All notable changes to this project will be documented in this file.
5
+
6
+ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
7
+ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
8
+
9
+ ## [1.0.0-rc.1] - 2025-12-20
10
+
11
+ ### Added
12
+
13
+ - chore: tsdown configuration file
14
+
15
+ ### Fixed
16
+
17
+ - fix: main/module/types entries in package.json; add exports field
18
+
19
+ ## [1.0.0-rc.0] - 2025-12-20
20
+
21
+ ### Added
22
+
23
+ - Optional ingredient parsing via [parse-ingredient](https://github.com/jakeboone02/parse-ingredient)
24
+ - `parse()` and `safeParse()` methods for Zod schema validated recipe extraction
25
+
26
+ ### Changed
27
+
28
+ - **BREAKING**: Renamed `toObject()` method to `toRecipeObject()` for clarity
29
+ - **BREAKING**: Ingredients and instructions now require grouped structures (each group has `name` and `items`) instead of flat arrays
30
+
31
+ ---
32
+
33
+ ## Pre-Release History
34
+
35
+ Prior to version 1.0.0-rc.0, this project was in alpha development. No formal changelog was maintained during the alpha phase.
36
+
37
+ [1.0.0-rc.0]: https://github.com/nerdstep/recipe-scrapers-js/releases/tag/v1.0.0-rc.0
package/docs/architecture.md ADDED
@@ -0,0 +1,578 @@
1
+ # Recipe Scrapers JS - Architecture
2
+
3
+ ## Overview
4
+
5
+ Recipe Scrapers JS is a TypeScript library for extracting structured recipe data from cooking websites. It uses a plugin-based architecture that combines generic extraction methods with site-specific customizations to produce consistent `RecipeObject` outputs.
6
+
7
+ ### Key Features
8
+
9
+ - **Multi-source extraction**: JSON-LD (Schema.org), Microdata, OpenGraph
10
+ - **Plugin architecture**: Extensible extraction and post-processing pipeline
11
+ - **Site-specific scrapers**: Custom logic for popular recipe sites
12
+ - **Runtime validation**: Zod-based schema validation for data quality
13
+ - **Type-safe**: Full TypeScript support with strict typing
14
+ - **Test coverage**: Comprehensive test suite with real HTML fixtures
15
+
16
+ ## Core Concepts
17
+
18
+ ### RecipeObject
19
+
20
+ The output format representing a complete recipe. See `src/types/recipe.interface.ts` for the full interface definition.
21
+
22
+ ### Extraction & Validation Pipeline
23
+
24
+ ```txt
25
+ HTML Input
26
+
27
+ RecipeExtractor / AbstractScraper
28
+
29
+ ┌─────────────────────────┐
30
+ │ Extractor Plugins │
31
+ │ (Priority Order) │
32
+ ├─────────────────────────┤
33
+ │ 1. Schema.org (90) │ ← JSON-LD extraction
34
+ │ 2. Microdata (80) │ ← HTML microdata
35
+ │ 3. OpenGraph (50) │ ← OG meta tags
36
+ └─────────────────────────┘
37
+
38
+ Partial RecipeObject
39
+
40
+ Site-Specific Scraper (optional)
41
+
42
+ ┌─────────────────────────┐
43
+ │ PostProcessor Plugins │
44
+ ├─────────────────────────┤
45
+ │ • HTML Stripper (100) │ ← Remove HTML tags
46
+ │ • Ingredient Parser* │ ← Parse ingredients (optional)
47
+ └─────────────────────────┘
48
+
49
+ ┌─────────────────────────┐
50
+ │ Validation (Optional) │
51
+ ├─────────────────────────┤
52
+ │ • Zod Schema │ ← Runtime validation
53
+ │ • Type checking │
54
+ │ • Auto-fixes │
55
+ └─────────────────────────┘
56
+
57
+ Final RecipeObject
58
+ ```
59
+
60
+ ## Plugin System
61
+
62
+ ### Plugin Types
63
+
64
+ #### 1. ExtractorPlugin
65
+
66
+ Extracts data from HTML to populate `RecipeObject` fields.
67
+
68
+ ```typescript
69
+ abstract class AbstractExtractorPlugin {
70
+ abstract readonly priority: number // Higher = runs first
71
+
72
+ // Override to extract specific fields
73
+ name?($: CheerioAPI, prevValue?: string): string | undefined
74
+ ingredients?($: CheerioAPI, prevValue?: Ingredients): Ingredients | undefined
75
+ // ... other field extractors
76
+ }
77
+ ```
78
+
79
+ **Priority System**:
80
+
81
+ - Higher priority plugins run first
82
+ - Later plugins can override or enhance earlier results
83
+ - Use `prevValue` to access previous extraction results
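The priority rules above can be sketched in a few lines. This is a simplified illustration, not the library's `PluginManager`: the `SketchPlugin` type, the `extractName` helper, and the sample values are hypothetical, and the `$` (Cheerio) argument of the real extractor methods is left out to keep the sketch self-contained.

```typescript
// Hypothetical, simplified stand-ins for the library's plugin types.
type FieldExtractor<T> = (prevValue?: T) => T | undefined

interface SketchPlugin {
  readonly label: string
  readonly priority: number // Higher = runs first
  readonly name?: FieldExtractor<string>
}

// Run plugins from highest to lowest priority, handing each one the
// value produced so far so it can keep, refine, or replace it.
function extractName(plugins: readonly SketchPlugin[]): string | undefined {
  const ordered = [...plugins].sort((a, b) => b.priority - a.priority)
  let value: string | undefined
  for (const plugin of ordered) {
    const next = plugin.name?.(value)
    if (next !== undefined) value = next
  }
  return value
}

const schemaOrg: SketchPlugin = {
  label: 'schema-org',
  priority: 90,
  name: () => 'Classic Lasagna',
}

const openGraph: SketchPlugin = {
  label: 'opengraph',
  priority: 50,
  // Acts as a fallback: keep the earlier result when one exists.
  name: (prevValue) => prevValue ?? 'Lasagna | Example Site',
}

// Registration order does not matter; priority decides the run order.
const extractedName = extractName([openGraph, schemaOrg])
```

In this sketch the lower-priority plugin decides for itself whether to keep `prevValue`, which is one way to get fallback behavior without special cases in the runner.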
84
+
85
+ **Built-in Extractors**:
86
+
87
+ - **Schema.org** (priority 90): Primary extractor for JSON-LD data
88
+ - **OpenGraph** (priority 50): Fallback for basic metadata
89
+
90
+ #### 2. PostProcessorPlugin
91
+
92
+ Processes extracted data after all extraction is complete.
93
+
94
+ ```typescript
95
+ abstract class AbstractPostprocessorPlugin {
96
+ abstract readonly priority: number
97
+
98
+ abstract process(recipe: RecipeObject): RecipeObject
99
+ }
100
+ ```
101
+
102
+ **Built-in Processors**:
103
+
104
+ - **HtmlStripperPlugin** (priority 100): Removes HTML tags from text fields
105
+ - **IngredientParserPlugin** (priority 50): Parses ingredient strings into structured data (optional, enabled via `parseIngredients` option)
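As a rough sketch of why post-processor priority matters, the example below runs a tag stripper before a processor that depends on clean text. The `SketchRecipe` shape, the `lengthTag` processor, and the regex-based stripping are assumptions made for illustration; they are not the library's `HtmlStripperPlugin` or `RecipeObject`.

```typescript
// Hypothetical, reduced recipe shape for this sketch only.
type SketchRecipe = { description: string; tags: string[] }

interface SketchPostprocessor {
  readonly priority: number
  process(recipe: SketchRecipe): SketchRecipe
}

const stripHtml: SketchPostprocessor = {
  priority: 100,
  // Naive tag removal for illustration; real HTML stripping needs more care.
  process: (recipe) => ({
    ...recipe,
    description: recipe.description.replace(/<[^>]*>/g, '').trim(),
  }),
}

const lengthTag: SketchPostprocessor = {
  priority: 50,
  // Depends on clean text, so it must run after stripHtml.
  process: (recipe) => ({
    ...recipe,
    tags: [...recipe.tags, `length:${recipe.description.length}`],
  }),
}

// Apply processors from highest to lowest priority, feeding each the
// output of the previous one.
function runPostprocessors(
  recipe: SketchRecipe,
  plugins: readonly SketchPostprocessor[],
): SketchRecipe {
  const ordered = [...plugins].sort((a, b) => b.priority - a.priority)
  let current = recipe
  for (const plugin of ordered) {
    current = plugin.process(current)
  }
  return current
}

const processed = runPostprocessors(
  { description: '<p>Rich <b>tomato</b> sauce</p>', tags: [] },
  [lengthTag, stripHtml],
)
```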
106
+
107
+ ### Plugin Registration
108
+
109
+ Plugins are registered in `PluginManager`:
110
+
111
+ ```typescript
112
+ const pluginManager = new PluginManager()
113
+
114
+ // Add extractors
115
+ pluginManager.registerExtractorPlugin(new SchemaOrgExtractorPlugin())
116
+ pluginManager.registerExtractorPlugin(new OpenGraphExtractorPlugin())
117
+
118
+ // Add post-processors
119
+ pluginManager.registerPostprocessorPlugin(new HtmlStripperPlugin())
120
+ ```
121
+
122
+ ## Site-Specific Scrapers
123
+
124
+ ### AbstractScraper
125
+
126
+ Base class for all site-specific scrapers:
127
+
128
+ ```typescript
129
+ abstract class AbstractScraper {
130
+ protected $: CheerioAPI // Cheerio instance for HTML parsing
131
+
132
+ static host(): string // Domain this scraper handles
133
+
134
+ extractors: {
135
+ [K in keyof RecipeFields]?: (prevValue?: RecipeFields[K]) => RecipeFields[K]
136
+ }
137
+
138
+ // Extraction methods
139
+ toRecipeObject(): RecipeObject // Extract without validation
140
+ parse(): RecipeObjectValidated // Extract with validation (throws on error)
141
+ safeParse(): SafeParseReturnType // Extract with validation (returns result)
142
+
143
+ // Override for custom validation
144
+ protected getSchema(): ZodSchema
145
+ }
146
+ ```
147
+
148
+ ### Validation Methods
149
+
150
+ **`toRecipeObject()`** - Extract without validation
151
+
152
+ ```typescript
153
+ const recipe = scraper.toRecipeObject()
154
+ // Returns RecipeObject - no runtime checks
155
+ ```
156
+
157
+ **`parse()`** - Extract with validation (throws ZodError on failure)
158
+
159
+ ```typescript
160
+ try {
161
+ const recipe = scraper.parse()
162
+ // Returns RecipeObjectValidated - guaranteed valid
163
+ } catch (error) {
164
+ if (error instanceof ZodError) {
165
+ console.error(error.format())
166
+ }
167
+ }
168
+ ```
169
+
170
+ **`safeParse()`** - Extract with validation (returns result object)
171
+
172
+ ```typescript
173
+ const result = scraper.safeParse()
174
+ if (result.success) {
175
+ console.log(result.data) // RecipeObjectValidated
176
+ } else {
177
+ console.error(result.error.issues) // Validation errors
178
+ }
179
+ ```
180
+
181
+ ### How Scrapers Work
182
+
183
+ 1. **Domain matching**: `host()` method identifies which scraper to use
184
+ 2. **Plugin extraction**: Generic plugins extract base data
185
+ 3. **Custom extraction**: Scraper's `extractors` override/enhance fields
186
+ 4. **Post-processing**: PostProcessor plugins clean up final output
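Step 1 (domain matching) might look roughly like the sketch below. The registry `Map`, the `findScraper` helper, the `*Sketch` classes, and the stripping of a leading `www.` are all assumptions for illustration; the actual registry lives in `src/scrapers/_index.ts` and may work differently.

```typescript
// Anything with a static host() can be registered in this sketch.
interface ScraperClass {
  host(): string
}

class NYTimesSketch {
  static host(): string {
    return 'cooking.nytimes.com'
  }
}

class AllRecipesSketch {
  static host(): string {
    return 'allrecipes.com'
  }
}

// Hypothetical registry keyed by the host each scraper declares.
const scraperClasses: ScraperClass[] = [NYTimesSketch, AllRecipesSketch]
const registry = new Map<string, ScraperClass>()
for (const scraper of scraperClasses) {
  registry.set(scraper.host(), scraper)
}

function findScraper(url: string): ScraperClass | undefined {
  // Assumption: strip a leading "www." so both host variants resolve.
  const hostname = new URL(url).hostname.replace(/^www\./, '')
  return registry.get(hostname)
}
```

When `findScraper` returns `undefined`, the generic plugins alone would produce the result, which matches the "optional" site-specific step in the pipeline diagram.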
187
+
188
+ ### Example Scraper with Custom Validation
189
+
190
+ ```typescript
191
+ class NYTimes extends AbstractScraper {
192
+ static host() {
193
+ return 'cooking.nytimes.com'
194
+ }
195
+
196
+ // Override schema for site-specific validation
197
+ protected getSchema() {
198
+ return RecipeObjectSchema.extend({
199
+ title: z.string()
200
+ .refine((v) => v.length >= 10, 'Titles must be descriptive')
201
+ })
202
+ }
203
+ }
204
+ ```
205
+
206
+ ### When to Create a Scraper
207
+
208
+ Create a site-specific scraper when:
209
+
210
+ - ✅ Generic plugins produce incorrect/incomplete data
211
+ - ✅ Site has unique HTML structure requiring custom parsing
212
+ - ✅ Data needs restructuring (e.g., ingredient grouping)
213
+ - ✅ Site lacks proper Schema.org markup
214
+
215
+ Don't create a scraper when:
216
+
217
+ - ❌ Generic plugins already extract everything correctly
218
+ - ❌ Only minor text cleanup needed (use post-processors instead)
219
+
220
+ ## RecipeExtractor
221
+
222
+ Main entry point for recipe extraction:
223
+
224
+ ```typescript
225
+ class RecipeExtractor {
226
+ constructor(
227
+ html: string,
228
+ url: string,
229
+ pluginManager?: PluginManager,
230
+ )
231
+
232
+ extract(): RecipeObject
233
+ }
234
+ ```
235
+
236
+ ### Extraction Process
237
+
238
+ 1. **Parse HTML**: Load HTML with Cheerio
239
+ 2. **Find scraper**: Match URL domain to registered scrapers
240
+ 3. **Run extractors**: Execute plugins and scraper extractors
241
+ 4. **Post-process**: Apply post-processor plugins
242
+ 5. **Return result**: Final `RecipeObject`
243
+
244
+ ### Usage
245
+
246
+ ```typescript
247
+ import { RecipeExtractor } from 'recipe-scrapers-js'
248
+
249
+ const response = await fetch('https://cooking.nytimes.com/recipes/...')
250
+ const extractor = new RecipeExtractor(
251
+ await response.text(),
252
+ 'https://cooking.nytimes.com/recipes/...'
253
+ )
254
+
255
+ const recipe = extractor.extract()
256
+ console.log(recipe.name)
257
+ console.log(recipe.ingredients)
258
+ ```
259
+
260
+ ## Validation System
261
+
262
+ The library uses [Zod](https://zod.dev/) for runtime validation.
263
+
264
+ ### Schema Organization
265
+
266
+ ```txt
267
+ src/schemas/
268
+ ├── recipe.schema.ts # Main RecipeObject schema
269
+ └── common.schema.ts # Reusable helpers (zString, zHttpUrl, etc.)
270
+ ```
271
+
272
+ ### Validation Features
273
+
274
+ - **Auto-fixing**: Calculates missing totalTime from cook + prep times
275
+ - **Cross-field validation**: Ensures time consistency and rating relationships
276
+ - **Custom error messages**: Clear validation feedback
277
+ - **Transform pipeline**: Trims whitespace, normalizes data
278
+ - **Extensible**: Scrapers can override schemas for site-specific rules
279
+
280
+ ### Validation Guarantees
281
+
282
+ Validated recipes (`RecipeObjectValidated`) are guaranteed to have:
283
+
284
+ - ✅ Valid URLs for host, canonicalUrl, image
285
+ - ✅ Non-empty title, yields, description, author
286
+ - ✅ At least one ingredient group with items
287
+ - ✅ At least one instruction group with steps
288
+ - ✅ Positive time values (when present)
289
+ - ✅ Valid ratings (0-5) and non-negative rating counts
290
+ - ✅ Consistent totalTime ≥ cookTime + prepTime
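To make the time-related rules concrete, here is a hand-rolled sketch of the auto-fix and the cross-field check. It deliberately avoids Zod so it stays self-contained; the `Times` shape, the `checkTimes` helper, and the result type are hypothetical, and the real schema in `src/schemas/recipe.schema.ts` may apply these rules differently.

```typescript
// Hypothetical subset of the recipe fields that carry durations.
interface Times {
  prepTime?: number
  cookTime?: number
  totalTime?: number
}

// Result shape loosely modeled on a safeParse-style return value.
type TimesResult =
  | { success: true; data: Times }
  | { success: false; issues: string[] }

function checkTimes(input: Times): TimesResult {
  const issues: string[] = []
  const data: Times = { ...input }

  // Time values must be positive when present.
  for (const key of ['prepTime', 'cookTime', 'totalTime'] as const) {
    const value = data[key]
    if (value !== undefined && value <= 0) {
      issues.push(`${key} must be positive`)
    }
  }

  // Auto-fix: derive a missing totalTime from prep + cook times.
  if (
    data.totalTime === undefined &&
    data.prepTime !== undefined &&
    data.cookTime !== undefined
  ) {
    data.totalTime = data.prepTime + data.cookTime
  }

  // Cross-field rule: totalTime may not be shorter than its parts.
  const parts = (data.prepTime ?? 0) + (data.cookTime ?? 0)
  if (data.totalTime !== undefined && data.totalTime < parts) {
    issues.push('totalTime must be >= prepTime + cookTime')
  }

  if (issues.length > 0) {
    return { success: false, issues }
  }
  return { success: true, data }
}

const fixed = checkTimes({ prepTime: 15, cookTime: 30 })
const broken = checkTimes({ prepTime: 15, cookTime: 30, totalTime: 20 })
```

With Zod, the same ideas would typically be expressed with `.transform()` for the auto-fix and `.refine()` for the cross-field rule.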
291
+
292
+ ## Project Structure
293
+
294
+ ```txt
295
+ recipe-scrapers-js/
296
+ ├── src/
297
+ │ ├── index.ts # Public API exports
298
+ │ ├── recipe-extractor.ts # Main extraction orchestrator
299
+ │ ├── plugin-manager.ts # Plugin registration & execution
300
+ │ ├── abstract-scraper.ts # Base scraper class
301
+ │ ├── abstract-plugin.ts # Base plugin classes
302
+ │ │
303
+ │ ├── schemas/ # Zod validation schemas
304
+ │ │ └── recipe.schema.ts # RecipeObject schema
305
+ │ │
306
+ │ ├── plugins/ # Generic extraction plugins
307
+ │ │ ├── schema-org.extractor/
308
+ │ │ ├── opengraph.extractor.ts
309
+ │ │ └── html-stripper.processor.ts
310
+ │ │
311
+ │ ├── scrapers/ # Site-specific scrapers
312
+ │ │ ├── _index.ts # Scraper registry
313
+ │ │ ├── nytimes.ts
314
+ │ │ ├── allrecipes.ts
315
+ │ │ └── ...
316
+ │ │
317
+ │ ├── utils/ # Utility functions
318
+ │ │ ├── ingredients.ts # Ingredient utilities
319
+ │ │ ├── instructions.ts # Instruction utilities
320
+ │ │ ├── parsing.ts # Text parsing helpers
321
+ │ │ └── ...
322
+ │ │
323
+ │ ├── types/ # TypeScript types
324
+ │ │ ├── recipe.interface.ts # RecipeObject definition
325
+ │ │ └── scraper.interface.ts
326
+ │ │
327
+ │ └── exceptions/ # Custom exceptions
328
+
329
+ ├── test-data/ # Test fixtures by domain
330
+ │ ├── allrecipes.com/
331
+ │ │ ├── allrecipes.testhtml # HTML fixture
332
+ │ │ └── allrecipes.json # Expected output
333
+ │ └── ...
334
+
335
+ └── docs/
336
+ ├── architecture.md # This file
337
+ └── ingredients-architecture.md # Ingredients deep-dive
338
+ ```
339
+
340
+ ## Testing Strategy
341
+
342
+ ### Test Structure
343
+
344
+ Each site has test fixtures in `test-data/[domain]/`:
345
+
346
+ - `*.testhtml` - Real HTML from the site (anonymized if needed)
347
+ - `*.json` - Expected `RecipeObject` output
348
+
349
+ ### Test Types
350
+
351
+ #### 1. Unit Tests
352
+
353
+ Located in `__tests__/` directories alongside source files:
354
+
355
+ - Plugin tests: Verify extraction logic
356
+ - Utility tests: Test helper functions
357
+ - Type predicates: Validate type guards
358
+
359
+ ```typescript
360
+ // Example: Plugin test
361
+ describe('SchemaOrgExtractorPlugin', () => {
362
+ it('should extract ingredients from recipeIngredient', () => {
363
+ const html = '<script type="application/ld+json">...</script>'
364
+ const $ = cheerio.load(html)
365
+ const plugin = new SchemaOrgExtractorPlugin()
366
+
367
+ const result = plugin.ingredients($)
368
+ expect(result).toEqual([...])
369
+ })
370
+ })
371
+ ```
372
+
373
+ #### 2. Integration Tests
374
+
375
+ Tests scrapers with real HTML fixtures:
376
+
377
+ ```typescript
378
+ describe('NYTimes scraper', () => {
379
+ it('should extract recipe from nytimes.testhtml', async () => {
380
+ const html = await Bun.file('test-data/cooking.nytimes.com/nytimes.testhtml').text()
381
+ const expected = await Bun.file('test-data/cooking.nytimes.com/nytimes.json').json()
382
+
383
+ const extractor = new RecipeExtractor(html, 'https://cooking.nytimes.com/...')
384
+ const result = extractor.extract()
385
+
386
+ expect(result).toEqual(expected)
387
+ })
388
+ })
389
+ ```
390
+
391
+ ### Writing Tests
392
+
393
+ When adding a new scraper:
394
+
395
+ 1. **Fetch real HTML**: Get actual page HTML from the target site
396
+ 2. **Create fixture**: Save as `test-data/[domain]/[testname].testhtml`
397
+ 3. **Run extraction**: Use the library to extract data
398
+ 4. **Verify manually**: Check that extraction is correct
399
+ 5. **Save expected**: Store result as `[testname].json`
400
+ 6. **Write test**: Create integration test comparing fixture to expected
401
+
402
+ ### Test Best Practices
403
+
404
+ - ✅ Use real HTML from actual sites
405
+ - ✅ Test edge cases (missing fields, unusual structure)
406
+ - ✅ Use `toEqual` for full object comparison
407
+ - ✅ Keep tests focused and independent
408
+ - ❌ Don't test internal implementation details
409
+ - ❌ Don't use mock HTML that doesn't match real sites
410
+
411
+ ### Running Tests
412
+
413
+ ```bash
414
+ # Run all tests
415
+ bun test
416
+
417
+ # Run specific test file
418
+ bun test src/plugins/__tests__/schema-org.test.ts
419
+
420
+ # Run with coverage
421
+ bun test --coverage
422
+
423
+ # Run scrapers tests only
424
+ bun test src/scrapers/__tests__/
425
+ ```
426
+
427
+ ## Type System
428
+
429
+ ### Strict Type Safety
430
+
431
+ The project enforces strict TypeScript rules:
432
+
433
+ ```typescript
434
+ // ❌ NEVER use non-null assertions
435
+ const value = obj.field! // Don't do this
436
+
437
+ // ✅ Use optional chaining and nullish coalescing
438
+ const value = obj.field ?? 'default'
439
+
440
+ // ❌ NEVER use 'any'
441
+ function process(data: any) { ... } // Don't do this
442
+
443
+ // ✅ Use proper types or 'unknown'
444
+ function process(data: RecipeObject) { ... }
445
+ function process(data: unknown) {
446
+ if (isRecipeObject(data)) {
447
+ // Type guard narrows to RecipeObject
448
+ }
449
+ }
450
+ ```
451
+
452
+ ### Type Guards vs Zod Validation
453
+
454
+ **Type Guards** - Lightweight runtime checks that narrow types at compile time:
455
+
456
+ ```typescript
457
+ export function isIngredientItem(value: unknown): value is IngredientItem {
458
+ return isPlainObject(value) && 'value' in value && isString(value.value)
459
+ }
460
+ ```
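The guard above relies on two helpers that are not shown. A self-contained version could look like the sketch below; the bodies of `isString` and `isPlainObject` are assumptions, and the library's own helpers may be stricter (for example, about class instances).

```typescript
type IngredientItem = { value: string }

// Assumed helper shapes; the library's own versions may differ in detail.
function isString(value: unknown): value is string {
  return typeof value === 'string'
}

function isPlainObject(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value)
}

function isIngredientItem(value: unknown): value is IngredientItem {
  return isPlainObject(value) && 'value' in value && isString(value.value)
}

// Narrow an untyped array down to well-formed items.
const rawItems: unknown[] = [{ value: '1 cup flour' }, { value: 42 }, 'salt', null]
const validItems: IngredientItem[] = rawItems.filter(isIngredientItem)
```

Because `filter` accepts a type predicate, `validItems` is typed as `IngredientItem[]` without a cast.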
461
+
462
+ **Zod Schemas** - For runtime validation with error details:
463
+
464
+ ```typescript
465
+ const result = RecipeObjectSchema.safeParse(data)
466
+ if (result.success) {
467
+ // data is validated RecipeObjectValidated
468
+ } else {
469
+ // result.error contains detailed validation errors
470
+ }
471
+ ```
472
+
473
+ ## Code Style
474
+
475
+ ### Modern ECMAScript
476
+
477
+ - Use ESM (`import`/`export`) not CommonJS
478
+ - Use `const`/`let` instead of `var`
479
+ - Prefer template literals for strings
480
+ - Use destructuring for objects/arrays
481
+ - Use async/await for promises
482
+
483
+ ### Bun-First
484
+
485
+ - Prefer Bun APIs over Node.js when available
486
+ - Use Bun's built-in test framework
487
+ - Leverage Bun's fast module resolution
488
+
489
+ ### Documentation
490
+
491
+ - Add JSDoc comments for public APIs
492
+ - Include examples in documentation
493
+ - Document complex algorithms inline
494
+ - Keep comments up-to-date with code
495
+
496
+ ## Adding New Features
497
+
498
+ ### Adding a New Scraper
499
+
500
+ 1. **Research**: Analyze target site's HTML structure
501
+ 2. **Check JSON-LD**: Verify if Schema.org data exists and quality
502
+ 3. **Create scraper**: Extend `AbstractScraper` in `src/scrapers/`
503
+ 4. **Override extractors**: Add custom extraction methods
504
+ 5. **Custom validation** (optional): Override `getSchema()` for site-specific rules
505
+ 6. **Register**: Add to `src/scrapers/_index.ts`
506
+ 7. **Add tests**: Create fixtures in `test-data/[domain]/`
507
+ 8. **Test validation**: Ensure `parse()` succeeds on test data
508
+ 9. **Verify**: Run tests and ensure extraction is accurate
509
+
510
+ ### Adding a New Plugin
511
+
512
+ 1. **Identify need**: What extraction method is missing?
513
+ 2. **Choose type**: ExtractorPlugin or PostProcessorPlugin?
514
+ 3. **Implement**: Extend appropriate abstract class
515
+ 4. **Register**: Add to `PluginManager` initialization
516
+ 5. **Test**: Add unit tests for plugin logic
517
+ 6. **Document**: Update relevant architecture docs
518
+
519
+ ### Adding Schema Validation Rules
520
+
521
+ 1. **Identify need**: What validation is missing?
522
+ 2. **Update schema**: Modify `src/schemas/recipe.schema.ts`
523
+ 3. **Add refinements**: Use `.refine()` for cross-field validation
524
+ 4. **Add transforms**: Use `.transform()` for auto-fixes
525
+ 5. **Test**: Add unit tests for new validation rules
526
+
527
+ ## Performance Considerations
528
+
529
+ ### Cheerio Usage
530
+
531
+ - Cheerio is synchronous and fast
532
+ - Cache selectors when reusing: `const $heading = this.$('h1')`
533
+ - Use efficient selectors (IDs and classes over complex queries)
534
+ - Limit DOM traversal when possible
535
+
536
+ ### Memory
537
+
538
+ - HTML fixtures can be large - stream if needed
539
+ - Don't store entire DOM in memory unnecessarily
540
+ - Clear references to Cheerio instances after extraction
541
+
542
+ ### Optimization Opportunities
543
+
544
+ - **Selector optimization**: Profile slow selectors
545
+ - **Lazy loading**: Only load scrapers for matched domains
546
+
547
+ ## Future Enhancements
548
+
549
+ ### Planned Features
550
+
551
+ 1. **More scrapers**: Expand site coverage
552
+ 2. **Better fuzzy matching**: Improve ingredient text matching
553
+ 3. **Ingredient parsing**: Break down quantity/unit/ingredient
554
+ 4. **Video extraction**: Support recipe videos
555
+ 5. **CLI tool**: Scaffold new scrapers
556
+
557
+ ### Architecture Improvements
558
+
559
+ 1. **Scraper generator**: CLI tool to scaffold new scrapers
560
+ 2. **Schema versioning**: Track validation rule changes
561
+ 3. **Streaming API**: Process large HTML documents efficiently
562
+ 4. **Performance profiling**: Identify validation bottlenecks
563
+
564
+ ## Contributing
565
+
566
+ See the main README.md for contribution guidelines. Key points:
567
+
568
+ - Follow TypeScript strict mode rules
569
+ - Add tests for all new features
570
+ - Update documentation
571
+ - Use existing patterns and conventions
572
+ - Run `bun test` before submitting PRs
573
+
574
+ ## Related Documentation
575
+
576
+ - **[Ingredients Architecture](./ingredients-architecture.md)**: Deep dive into ingredient extraction system
577
+ - **API Reference**: (Future: Generated from TSDoc comments)
578
+ - **Recipe Schema**: (Future: Detailed RecipeObject field documentation)
package/docs/ingredients-architecture.md ADDED
@@ -0,0 +1,363 @@
1
+ # Ingredients Architecture
2
+
3
+ > **Note**: This document focuses specifically on the ingredients extraction system. For overall project architecture, see [Architecture](./architecture.md).
4
+
5
+ ## Overview
6
+
7
+ The ingredients extraction system is designed to extract and structure recipe ingredients from various cooking websites. The system uses a multi-layered approach combining generic extraction plugins with site-specific scrapers to produce a consistent data structure.
8
+
9
+ ## Data Structure
10
+
11
+ ### Core Types
12
+
13
+ ```typescript
14
+ type ParsedIngredient = {
15
+ quantity: number | null // Primary quantity (e.g., 2)
16
+ quantity2: number | null // Secondary quantity for ranges (e.g., 1-2 cups)
17
+ unitOfMeasureID: string | null // Normalized unit key (e.g., "cup")
18
+ unitOfMeasure: string | null // Unit as written (e.g., "cups")
19
+ description: string // Ingredient name (e.g., "flour")
20
+ isGroupHeader: boolean // True if this is a section header
21
+ }
22
+
23
+ type IngredientItem = {
24
+ value: string // The ingredient text, e.g., "1 1/2 cups flour"
25
+ parsed?: ParsedIngredient // Structured data (optional, via parseIngredients option)
26
+ }
27
+
28
+ type IngredientGroup = {
29
+ name: string | null // Group name, e.g., "For the dough" or null for default
30
+ items: IngredientItem[] // Array of ingredients in this group
31
+ }
32
+
33
+ type Ingredients = IngredientGroup[] // Array of groups
34
+ ```
35
+
36
+ ### Design Decisions
37
+
38
+ **Why no IDs?**
39
+
40
+ - Initially had `id` fields on both items and groups
41
+ - Removed for simplicity - IDs added complexity without providing value
42
+ - Groups and items are identified by their position in arrays
43
+
44
+ **Why groups?**
45
+
46
+ - Many recipes organize ingredients into sections (e.g., "For the crust", "For the filling")
47
+ - Default group name is `null` for ungrouped ingredients
48
+ - Preserves recipe structure and improves readability
49
+
50
+ ## Extraction Flow
51
+
52
+ ### 1. Plugin-Based Extraction
53
+
54
+ The `RecipeExtractor` runs multiple extraction plugins in priority order:
55
+
56
+ ```txt
57
+ 1. Schema.org Plugin (priority: 90)
58
+ ├─> Extracts from <script type="application/ld+json">
59
+ └─> Returns: Ingredients with clean, normalized text
60
+
61
+ 2. OpenGraph Plugin (priority: 50)
62
+ └─> Fallback for basic metadata
63
+
64
+ 3. PostProcessor Plugins (in priority order)
65
+ ├─> HtmlStripperPlugin (100): Removes HTML tags from text values
66
+ └─> IngredientParserPlugin (50): Parses ingredients into structured data*
67
+
68
+ *Only active when `parseIngredients` option is enabled
69
+ ```
70
+
71
+ **Key Insight**: Schema.org JSON-LD provides **clean, well-formatted text**:
72
+
73
+ - Proper spacing: `"1 1/2 cups flour"`
74
+ - Normalized fractions: `"1/2"` instead of `"½"`
75
+ - No HTML artifacts or concatenation issues
76
+
77
+ ### 2. Site-Specific Scrapers
78
+
79
+ Scrapers extend `AbstractScraper` and can override any field extractor:
80
+
81
+ ```typescript
82
+ class NYTimes extends AbstractScraper {
83
+ extractors = {
84
+ ingredients: this.ingredients.bind(this),
85
+ }
86
+
87
+ protected ingredients(prevValue: Ingredients | undefined): Ingredients {
88
+ // Custom logic here
89
+ }
90
+ }
91
+ ```
92
+
93
+ **The `prevValue` Parameter**:
94
+
95
+ - Contains the result from plugin extraction (usually Schema.org)
96
+ - Provides clean text that scrapers can restructure
97
+ - Optional - scrapers can extract from scratch if needed
98
+
99
+ ## The Flatten→Regroup Pattern
100
+
101
+ ### Why This Pattern Exists
102
+
103
+ Many recipe websites have **structure in HTML** but **clean text in JSON-LD**:
104
+
105
+ **HTML Structure** (NYTimes example):
106
+
107
+ ```html
108
+ <h3>For the dough</h3>
109
+ <li>1½cups flour</li> <!-- ❌ No space, unicode fraction -->
110
+ <li>½teaspoon salt</li> <!-- ❌ Concatenated -->
111
+
112
+ <h3>For the filling</h3>
113
+ <li>2cups sugar</li> <!-- ❌ No space -->
114
+ ```
115
+
116
+ **JSON-LD Text** (from Schema.org):
117
+
118
+ ```json
119
+ [
120
+ "1 1/2 cups flour", // ✅ Clean, spaced, normalized
121
+ "1/2 teaspoon salt", // ✅ Perfect formatting
122
+ "2 cups sugar" // ✅ No issues
123
+ ]
124
+ ```
125
+
126
+ **The Problem**:
127
+
128
+ - JSON-LD has **perfect text** but **loses grouping** (all ingredients in flat array)
129
+ - HTML has **accurate grouping** but **poor text quality** (no spaces, unicode chars, etc.)
130
+
131
+ **The Solution**: Use BOTH!
132
+
133
+ 1. Extract clean text from JSON-LD (via Schema.org plugin)
134
+ 2. Flatten it to strings for matching
135
+ 3. Parse HTML structure for grouping
136
+ 4. Match HTML elements to JSON-LD text
137
+ 5. Rebuild grouped structure with clean text
138
+
139
+ ### Implementation
140
+
141
+ ```typescript
142
+ protected ingredients(prevValue: Ingredients | undefined): Ingredients {
143
+ const headingSelector = 'h3.ingredient-group'
144
+ const ingredientSelector = 'li.ingredient'
145
+
146
+ if (prevValue && prevValue.length > 0) {
147
+ // Step 1: Flatten prevValue (from Schema.org) to get clean text
148
+ const values = flattenIngredients(prevValue)
149
+ // values = ["1 1/2 cups flour", "1/2 teaspoon salt", "2 cups sugar"]
150
+
151
+ // Step 2: Parse HTML and match text to rebuild groups
152
+ return groupIngredients(
153
+ this.$, // Cheerio instance
154
+ values, // Clean text from JSON-LD
155
+ headingSelector, // Where to find group names
156
+ ingredientSelector, // Where to find ingredient items
157
+ )
158
+ }
159
+
160
+ throw new NoIngredientsFoundException()
161
+ }
162
+ ```
163
+
164
+ ### Utility Functions
165
+
166
+ #### `flattenIngredients(ingredients: Ingredients): string[]`
167
+
168
+ Converts grouped structure to flat array of strings:
169
+
170
+ ```typescript
171
+ // Input:
172
+ [
173
+ { name: "For the dough", items: [{ value: "1 cup flour" }] },
174
+ { name: null, items: [{ value: "Salt to taste" }] }
175
+ ]
176
+
177
+ // Output:
178
+ ["1 cup flour", "Salt to taste"]
179
+ ```
180
+
181
+ **Why flatten?** To get an ordered list of clean text values for matching against HTML elements.
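A minimal sketch of how such a function could be written, assuming only the `Ingredients` types defined earlier in this document. This is one possible implementation, not necessarily the one in `src/utils/ingredients.ts`.

```typescript
type IngredientItem = { value: string }
type IngredientGroup = { name: string | null; items: IngredientItem[] }
type Ingredients = IngredientGroup[]

// Walk groups in order and keep only the text of each item,
// dropping group names along the way.
function flattenIngredients(ingredients: Ingredients): string[] {
  const values: string[] = []
  for (const group of ingredients) {
    for (const item of group.items) {
      values.push(item.value)
    }
  }
  return values
}

const flattened = flattenIngredients([
  { name: 'For the dough', items: [{ value: '1 cup flour' }] },
  { name: null, items: [{ value: 'Salt to taste' }] },
])
```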
182
+
183
+ #### `groupIngredients($, values, headingSelector, ingredientSelector): Ingredients`
184
+
185
+ Rebuilds grouped structure by:
186
+
187
+ 1. Parsing HTML to find headings and items
188
+ 2. Matching HTML text (normalized) to `values` array (normalized)
189
+ 3. Creating groups based on HTML structure
190
+ 4. Using matched text from `values` (preserving clean formatting)
191
+
192
+ **Normalization**: Both HTML text and values are normalized before matching (trim, lowercase, collapse whitespace) to ensure reliable matches despite formatting differences.
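The matching step can be sketched without Cheerio by assuming the headings and item text have already been pulled out of the HTML. The `HtmlSection` type, the `regroup` function, and the fallback to the HTML text for unmatched items are assumptions for illustration; only the normalization steps (trim, lowercase, collapse whitespace) come from the description above.

```typescript
type IngredientItem = { value: string }
type IngredientGroup = { name: string | null; items: IngredientItem[] }

// Assumed shape for headings and item text already read from the HTML.
type HtmlSection = { heading: string | null; texts: string[] }

// Trim, lowercase, and collapse whitespace, as described above.
function normalize(text: string): string {
  return text.trim().toLowerCase().replace(/\s+/g, ' ')
}

function regroup(sections: HtmlSection[], values: string[]): IngredientGroup[] {
  // Index the clean values by their normalized form for lookup.
  const cleanByKey = new Map<string, string>()
  for (const value of values) {
    cleanByKey.set(normalize(value), value)
  }

  const groups: IngredientGroup[] = []
  for (const section of sections) {
    const items: IngredientItem[] = []
    for (const text of section.texts) {
      // Prefer the clean value; fall back to the trimmed HTML text
      // when nothing matches (an assumption of this sketch).
      items.push({ value: cleanByKey.get(normalize(text)) ?? text.trim() })
    }
    groups.push({ name: section.heading, items })
  }
  return groups
}

const regrouped = regroup(
  [
    { heading: 'For the dough', texts: ['  1 Cup   flour ', 'Pinch of Salt'] },
    { heading: 'For the filling', texts: ['2 cups  sugar'] },
  ],
  ['1 cup flour', 'pinch of salt', '2 cups sugar'],
)
```

Note that this normalization alone would not match `1½cups flour` to `1 1/2 cups flour`; handling that kind of divergence needs extra normalization or fuzzy matching.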
193
+
194
+ #### `stringsToIngredients(values: string[]): Ingredients`
195
+
196
+ Converts flat string array to default group structure:
197
+
198
+ ```typescript
199
+ // Input:
200
+ ["1 cup flour", "Salt to taste"]
201
+
202
+ // Output:
203
+ [
204
+ {
205
+ name: null, // Default group
206
+ items: [
207
+ { value: "1 cup flour" },
208
+ { value: "Salt to taste" }
209
+ ]
210
+ }
211
+ ]
212
+ ```
213
+
214
+ Used by Schema.org plugin when converting `recipeIngredient` array.
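A minimal sketch of this conversion, assuming the `Ingredients` types defined earlier. How the real function treats an empty input is not stated in this document; the sketch simply returns a single default group with no items.

```typescript
type IngredientItem = { value: string }
type IngredientGroup = { name: string | null; items: IngredientItem[] }
type Ingredients = IngredientGroup[]

// Wrap every string in an item and place all items in one default
// (unnamed) group.
function stringsToIngredients(values: string[]): Ingredients {
  const items: IngredientItem[] = values.map((value) => ({ value }))
  return [{ name: null, items }]
}

const defaultGrouped = stringsToIngredients(['1 cup flour', 'Salt to taste'])
```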
215
+
216
+ ## When to Use Each Approach
217
+
218
+ ### Use Flatten→Regroup When
219
+
220
+ - ✅ JSON-LD provides clean text but lacks grouping
221
+ - ✅ HTML has visible grouping structure (headings, sections)
222
+ - ✅ Text quality in HTML is poor (spacing, unicode, concatenation issues)
223
+ - **Example**: NYTimes, BBC Good Food, Simply Recipes
224
+
225
+ ### Parse HTML Directly When
226
+
227
+ - ✅ JSON-LD is missing or unreliable
228
+ - ✅ HTML text quality is good
229
+ - ✅ Complex custom structure that doesn't fit standard pattern
230
+ - **Example**: Sites without proper Schema.org markup
231
+
232
+ ### Use Default (No Override) When
233
+
234
+ - ✅ Schema.org JSON-LD provides both good text AND grouping
235
+ - ✅ No special processing needed
236
+ - **Example**: Sites with perfect Schema.org implementation
237
+
238
+ ## Common Pitfalls
239
+
240
+ ### ❌ Don't Parse HTML Text Directly If You Have JSON-LD
241
+
242
+ ```typescript
243
+ // BAD: HTML text has formatting issues
244
+ protected ingredients(): Ingredients {
245
+ const items = this.$('li.ingredient').map((_, el) =>
246
+ this.$(el).text() // "1½cups" - no space!
247
+ ).get()
248
+ // ...
249
+ }
250
+ ```
251
+
252
+ ```typescript
253
+ // GOOD: Use JSON-LD text with HTML structure
254
+ protected ingredients(prevValue: Ingredients | undefined): Ingredients {
255
+ const values = flattenIngredients(prevValue) // Clean text from JSON-LD
256
+ return groupIngredients(this.$, values, headingSelector, itemSelector)
257
+ }
258
+ ```
259
+
260
+ ### ❌ Don't Modify Text After Flattening
261
+
262
+ ```typescript
263
+ // BAD: Modifying clean text
264
+ const values = flattenIngredients(prevValue)
265
+ .map(v => v.toUpperCase()) // Don't do this!
266
+ ```
267
+
268
+ The text from JSON-LD is already clean and normalized. Preserve it.
269
+
270
+ ### ❌ Don't Flatten If You're Not Regrouping
271
+
272
+ ```typescript
273
+ // BAD: Unnecessary work
274
+ const values = flattenIngredients(prevValue)
275
+ return stringsToIngredients(values) // Just return prevValue!
276
+ ```
277
+
278
+ If you are not regrouping, return `prevValue` as-is, then verify that the test passes with both HTML and JSON-LD extraction.
279
+
280
+ ## Ingredient Parsing
281
+
282
+ ### Parsing Overview
283
+
284
+ The library supports optional structured parsing of ingredient strings using the [parse-ingredient](https://github.com/jakeboone02/parse-ingredient) library. When enabled via the `parseIngredients` option, each ingredient item includes a `parsed` field with extracted data.
285
+
286
+ ### Enabling Parsing
287
+
288
+ ```typescript
289
+ // Enable with defaults
290
+ const scraper = new MyScraper(html, url, { parseIngredients: true })
291
+
292
+ // Enable with options
293
+ const scraper = new MyScraper(html, url, {
294
+ parseIngredients: {
295
+ normalizeUOM: true, // "tbsp" → "tablespoon"
296
+ ignoreUOMs: ['small'], // Treat as description, not unit
297
+ }
298
+ })
299
+ ```
300
+
301
+ ### Parsing Pipeline
302
+
303
+ The `IngredientParserPlugin` runs as a PostProcessor (priority 50), after HTML stripping:
304
+
305
+ ```txt
306
+ 1. HtmlStripperPlugin (100)
307
+ └─> "2 cups <b>flour</b>" → "2 cups flour"
308
+
309
+ 2. IngredientParserPlugin (50)
310
+ └─> "2 cups flour" → { value: "2 cups flour", parsed: {...} }
311
+ ```
312
+
313
+ ### Parsed Data Structure
314
+
315
+ ```typescript
316
+ {
317
+ value: "1-2 tablespoons olive oil",
318
+ parsed: {
319
+ quantity: 1, // Primary quantity
320
+ quantity2: 2, // Secondary (range) quantity
321
+ unitOfMeasure: "tablespoons", // As written
322
+ unitOfMeasureID: "tablespoon", // Normalized key
323
+ description: "olive oil", // Ingredient name
324
+ isGroupHeader: false // True for "For the sauce:" etc.
325
+ }
326
+ }
327
+ ```
328
+
329
+ ### Use Cases
330
+
331
+ - **Scaling recipes**: Multiply quantities by a factor
332
+ - **Shopping lists**: Aggregate same ingredients across recipes
333
+ - **Nutritional lookup**: Search by normalized ingredient name
334
+ - **Unit conversion**: Convert between measurement systems
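As an example of the first use case, scaling could be sketched as below. The `scaleIngredient` helper is hypothetical and not part of the library; it only assumes the `ParsedIngredient` shape documented above and leaves missing quantities untouched.

```typescript
type ParsedIngredient = {
  quantity: number | null
  quantity2: number | null
  unitOfMeasureID: string | null
  unitOfMeasure: string | null
  description: string
  isGroupHeader: boolean
}

// Multiply both ends of the quantity range; null quantities
// (e.g. "salt to taste") stay null.
function scaleIngredient(parsed: ParsedIngredient, factor: number): ParsedIngredient {
  return {
    ...parsed,
    quantity: parsed.quantity === null ? null : parsed.quantity * factor,
    quantity2: parsed.quantity2 === null ? null : parsed.quantity2 * factor,
  }
}

const oliveOil: ParsedIngredient = {
  quantity: 1,
  quantity2: 2,
  unitOfMeasure: 'tablespoons',
  unitOfMeasureID: 'tablespoon',
  description: 'olive oil',
  isGroupHeader: false,
}

const doubled = scaleIngredient(oliveOil, 2)
```

The original `value` string is not rewritten here, so a caller that displays scaled recipes would need to format the text from the `parsed` fields.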
335
+
336
+ ## Future Considerations
337
+
338
+ ### Potential Improvements
339
+
340
+ 1. **Text Normalization Pipeline**: Configurable normalization steps (fraction conversion, unit standardization, etc.)
341
+
342
+ 2. **Fuzzy Matching**: Handle cases where HTML text diverges significantly from JSON-LD (currently uses exact normalized matching)
343
+
344
+ 3. **Fallback Strategies**: Graceful degradation when JSON-LD is partial or HTML structure is ambiguous
345
+
346
+ 4. **Schema.org Validation**: Detect and handle malformed JSON-LD more robustly
347
+
348
+ ### Non-Goals
349
+
350
+ - **Unit Conversion**: Converting between measurement systems (use parsed data with external tools)
351
+ - **Substitutions**: Handling ingredient alternatives or substitutions
352
+ - **Nutritional Analysis**: Calculating nutrition facts from ingredients
353
+
354
+ ## Summary
355
+
356
+ The ingredients system balances **text quality** (from JSON-LD) with **structural accuracy** (from HTML):
357
+
358
+ 1. **Schema.org Plugin** extracts clean text → `Ingredients` with default grouping
359
+ 2. **Scrapers** flatten text → parse HTML structure → rebuild groups with clean text
360
+ 3. **IngredientParserPlugin** (optional) → adds structured `parsed` data
361
+ 4. **Result**: Accurate grouping with high-quality, normalized ingredient text and optional structured data
362
+
363
+ This architecture leverages the strengths of both JSON-LD (clean data) and HTML (visual structure) to produce the best possible output.
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "recipe-scrapers-js",
3
- "version": "1.0.0-rc.0",
3
+ "version": "1.0.0-rc.1",
4
4
  "license": "MIT",
5
5
  "description": "A recipe scrapers library",
6
6
  "author": {
@@ -12,13 +12,22 @@
12
12
  "url": "git+https://github.com/nerdstep/recipe-scrapers-js.git"
13
13
  },
14
14
  "type": "module",
15
- "module": "dist/index.js",
16
- "main": "dist/index.js",
17
- "types": "dist/index.d.ts",
15
+ "module": "dist/index.mjs",
16
+ "main": "dist/index.mjs",
17
+ "types": "dist/index.d.mts",
18
+ "exports": {
19
+ "./package.json": "./package.json",
20
+ ".": {
21
+ "types": "./dist/index.d.mts",
22
+ "import": "./dist/index.mjs"
23
+ }
24
+ },
18
25
  "files": [
19
26
  "dist",
20
- "README.md",
21
- "LICENSE"
27
+ "docs",
28
+ "CHANGELOG.md",
29
+ "LICENSE",
30
+ "README.md"
22
31
  ],
23
32
  "keywords": [
24
33
  "recipe",