recipe-scrapers-js 1.0.0-rc.0 → 1.0.0-rc.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CHANGELOG.md +37 -0
- package/docs/architecture.md +578 -0
- package/docs/ingredients-architecture.md +363 -0
- package/package.json +15 -6
package/CHANGELOG.md
ADDED
|
@@ -0,0 +1,37 @@
|
|
|
1
|
+
<!-- markdownlint-disable MD024 -->
|
|
2
|
+
# Changelog
|
|
3
|
+
|
|
4
|
+
All notable changes to this project will be documented in this file.
|
|
5
|
+
|
|
6
|
+
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
|
|
7
|
+
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
|
|
8
|
+
|
|
9
|
+
## [1.0.0-rc.1] - 2025-12-20
|
|
10
|
+
|
|
11
|
+
### Added
|
|
12
|
+
|
|
13
|
+
- chore: tsdown configuration file
|
|
14
|
+
|
|
15
|
+
### Fixed
|
|
16
|
+
|
|
17
|
+
- fix: main/module/type entries in package.json; add exports field
|
|
18
|
+
|
|
19
|
+
## [1.0.0-rc.0] - 2025-12-20
|
|
20
|
+
|
|
21
|
+
### Added
|
|
22
|
+
|
|
23
|
+
- Optional ingredient parsing via [parse-ingredient](https://github.com/jakeboone02/parse-ingredient)
|
|
24
|
+
- `parse()` and `safeParse()` methods for Zod schema validated recipe extraction
|
|
25
|
+
|
|
26
|
+
### Changed
|
|
27
|
+
|
|
28
|
+
- **BREAKING**: Renamed `toObject()` method to `toRecipeObject()` for clarity
|
|
29
|
+
- **BREAKING**: Ingredients and instructions now require grouped structures (each group has `name` and `items`) instead of flat arrays
|
|
30
|
+
|
|
31
|
+
---
|
|
32
|
+
|
|
33
|
+
## Pre-Release History
|
|
34
|
+
|
|
35
|
+
Prior to version 1.0.0-rc.0, this project was in alpha development. No formal changelog was maintained during the alpha phase.
|
|
36
|
+
|
|
37
|
+
[1.0.0-rc.0]: https://github.com/nerdstep/recipe-scrapers-js/releases/tag/v1.0.0-rc.0
|
|
package/docs/architecture.md
ADDED
|
@@ -0,0 +1,578 @@
|
|
|
1
|
+
# Recipe Scrapers JS - Architecture
|
|
2
|
+
|
|
3
|
+
## Overview
|
|
4
|
+
|
|
5
|
+
Recipe Scrapers JS is a TypeScript library for extracting structured recipe data from cooking websites. It uses a plugin-based architecture that combines generic extraction methods with site-specific customizations to produce consistent `RecipeObject` outputs.
|
|
6
|
+
|
|
7
|
+
### Key Features
|
|
8
|
+
|
|
9
|
+
- **Multi-source extraction**: JSON-LD (Schema.org), Microdata, OpenGraph
|
|
10
|
+
- **Plugin architecture**: Extensible extraction and post-processing pipeline
|
|
11
|
+
- **Site-specific scrapers**: Custom logic for popular recipe sites
|
|
12
|
+
- **Runtime validation**: Zod-based schema validation for data quality
|
|
13
|
+
- **Type-safe**: Full TypeScript support with strict typing
|
|
14
|
+
- **Test coverage**: Comprehensive test suite with real HTML fixtures
|
|
15
|
+
|
|
16
|
+
## Core Concepts
|
|
17
|
+
|
|
18
|
+
### RecipeObject
|
|
19
|
+
|
|
20
|
+
The output format representing a complete recipe. See `src/types/recipe.interface.ts` for the full interface definition.
|
|
21
|
+
|
|
22
|
+
### Extraction & Validation Pipeline
|
|
23
|
+
|
|
24
|
+
```txt
|
|
25
|
+
HTML Input
|
|
26
|
+
↓
|
|
27
|
+
RecipeExtractor / AbstractScraper
|
|
28
|
+
↓
|
|
29
|
+
┌─────────────────────────┐
|
|
30
|
+
│ Extractor Plugins │
|
|
31
|
+
│ (Priority Order) │
|
|
32
|
+
├─────────────────────────┤
|
|
33
|
+
│ 1. Schema.org (90) │ ← JSON-LD extraction
|
|
34
|
+
│ 2. Microdata (80) │ ← HTML microdata
|
|
35
|
+
│ 3. OpenGraph (50) │ ← OG meta tags
|
|
36
|
+
└─────────────────────────┘
|
|
37
|
+
↓
|
|
38
|
+
Partial RecipeObject
|
|
39
|
+
↓
|
|
40
|
+
Site-Specific Scraper (optional)
|
|
41
|
+
↓
|
|
42
|
+
┌─────────────────────────┐
|
|
43
|
+
│ PostProcessor Plugins │
|
|
44
|
+
├─────────────────────────┤
|
|
45
|
+
│ • HTML Stripper (100) │ ← Remove HTML tags
|
|
46
|
+
│ • Ingredient Parser* │ ← Parse ingredients (optional)
|
|
47
|
+
└─────────────────────────┘
|
|
48
|
+
↓
|
|
49
|
+
┌─────────────────────────┐
|
|
50
|
+
│ Validation (Optional) │
|
|
51
|
+
├─────────────────────────┤
|
|
52
|
+
│ • Zod Schema │ ← Runtime validation
|
|
53
|
+
│ • Type checking │
|
|
54
|
+
│ • Auto-fixes │
|
|
55
|
+
└─────────────────────────┘
|
|
56
|
+
↓
|
|
57
|
+
Final RecipeObject
|
|
58
|
+
```
|
|
59
|
+
|
|
60
|
+
## Plugin System
|
|
61
|
+
|
|
62
|
+
### Plugin Types
|
|
63
|
+
|
|
64
|
+
#### 1. ExtractorPlugin
|
|
65
|
+
|
|
66
|
+
Extracts data from HTML to populate `RecipeObject` fields.
|
|
67
|
+
|
|
68
|
+
```typescript
|
|
69
|
+
abstract class AbstractExtractorPlugin {
|
|
70
|
+
abstract readonly priority: number // Higher = runs first
|
|
71
|
+
|
|
72
|
+
// Override to extract specific fields
|
|
73
|
+
name?($: CheerioAPI, prevValue?: string): string | undefined
|
|
74
|
+
ingredients?($: CheerioAPI, prevValue?: Ingredients): Ingredients | undefined
|
|
75
|
+
// ... other field extractors
|
|
76
|
+
}
|
|
77
|
+
```
|
|
78
|
+
|
|
79
|
+
**Priority System**:
|
|
80
|
+
|
|
81
|
+
- Higher priority plugins run first
|
|
82
|
+
- Later plugins can override or enhance earlier results
|
|
83
|
+
- Use `prevValue` to access previous extraction results
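
The priority rules above can be sketched in a few lines. This is a simplified illustration, not the library's `PluginManager`: the `NameExtractor` type, the `extractName` function, and the sample plugins are hypothetical stand-ins, and the Cheerio argument is left out so the snippet has no dependencies.

```typescript
// Hypothetical, simplified stand-in for an extractor plugin (one field only).
type NameExtractor = {
  label: string
  priority: number // Higher = runs first
  name: (prevValue?: string) => string | undefined
}

function extractName(plugins: NameExtractor[]): string | undefined {
  // Sort a copy so the caller's array is left untouched.
  const ordered = [...plugins].sort((a, b) => b.priority - a.priority)
  let value: string | undefined
  for (const plugin of ordered) {
    // Each plugin sees the result so far and may keep, replace, or refine it.
    value = plugin.name(value) ?? value
  }
  return value
}

const samplePlugins: NameExtractor[] = [
  // Fallback: only supplies a value when nothing was extracted before it.
  { label: 'opengraph', priority: 50, name: (prev) => prev ?? 'OG Title' },
  { label: 'schema-org', priority: 90, name: () => 'Schema.org Title' },
]

const extractedName = extractName(samplePlugins)
```

Because the higher-priority plugin runs first, the lower-priority one receives its result as `prevValue` and, in this sketch, chooses to keep it.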
|
|
84
|
+
|
|
85
|
+
**Built-in Extractors**:
|
|
86
|
+
|
|
87
|
+
- **Schema.org** (priority 90): Primary extractor for JSON-LD data
|
|
88
|
+
- **OpenGraph** (priority 50): Fallback for basic metadata
|
|
89
|
+
|
|
90
|
+
#### 2. PostProcessorPlugin
|
|
91
|
+
|
|
92
|
+
Processes extracted data after all extraction is complete.
|
|
93
|
+
|
|
94
|
+
```typescript
|
|
95
|
+
abstract class AbstractPostprocessorPlugin {
|
|
96
|
+
abstract readonly priority: number
|
|
97
|
+
|
|
98
|
+
abstract process(recipe: RecipeObject): RecipeObject
|
|
99
|
+
}
|
|
100
|
+
```
|
|
101
|
+
|
|
102
|
+
**Built-in Processors**:
|
|
103
|
+
|
|
104
|
+
- **HtmlStripperPlugin** (priority 100): Removes HTML tags from text fields
|
|
105
|
+
- **IngredientParserPlugin** (priority 50): Parses ingredient strings into structured data (optional, enabled via `parseIngredients` option)
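
As a rough sketch of how a post-processor chain might run, the snippet below applies processors in descending priority, which is an assumption based on the priorities listed above. `MiniRecipe`, `runPostProcessors`, and the whitespace processor are hypothetical, and the regex-based tag removal is only an illustration, not the actual `HtmlStripperPlugin`.

```typescript
// Hypothetical minimal recipe shape; the real RecipeObject has many more fields.
type MiniRecipe = { name: string; description: string }

type PostProcessor = {
  priority: number
  process: (recipe: MiniRecipe) => MiniRecipe
}

// Naive tag removal for illustration only.
const stripTags = (text: string): string => text.replace(/<[^>]*>/g, '').trim()

const htmlStripper: PostProcessor = {
  priority: 100,
  process: (recipe) => ({
    name: stripTags(recipe.name),
    description: stripTags(recipe.description),
  }),
}

// Hypothetical second processor, included only to show ordering.
const whitespaceCollapser: PostProcessor = {
  priority: 50,
  process: (recipe) => ({
    ...recipe,
    description: recipe.description.replace(/\s+/g, ' '),
  }),
}

function runPostProcessors(recipe: MiniRecipe, processors: PostProcessor[]): MiniRecipe {
  // Assumed order: higher priority first, each processor feeding the next.
  return [...processors]
    .sort((a, b) => b.priority - a.priority)
    .reduce((current, processor) => processor.process(current), recipe)
}

const cleanedRecipe = runPostProcessors(
  { name: '<b>Apple Pie</b>', description: '<p>Warm  and\n flaky.</p>' },
  [whitespaceCollapser, htmlStripper],
)
```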
|
|
106
|
+
|
|
107
|
+
### Plugin Registration
|
|
108
|
+
|
|
109
|
+
Plugins are registered in `PluginManager`:
|
|
110
|
+
|
|
111
|
+
```typescript
|
|
112
|
+
const pluginManager = new PluginManager()
|
|
113
|
+
|
|
114
|
+
// Add extractors
|
|
115
|
+
pluginManager.registerExtractorPlugin(new SchemaOrgExtractorPlugin())
|
|
116
|
+
pluginManager.registerExtractorPlugin(new OpenGraphExtractorPlugin())
|
|
117
|
+
|
|
118
|
+
// Add post-processors
|
|
119
|
+
pluginManager.registerPostprocessorPlugin(new HtmlStripperPlugin())
|
|
120
|
+
```
|
|
121
|
+
|
|
122
|
+
## Site-Specific Scrapers
|
|
123
|
+
|
|
124
|
+
### AbstractScraper
|
|
125
|
+
|
|
126
|
+
Base class for all site-specific scrapers:
|
|
127
|
+
|
|
128
|
+
```typescript
|
|
129
|
+
abstract class AbstractScraper {
|
|
130
|
+
protected $: CheerioAPI // Cheerio instance for HTML parsing
|
|
131
|
+
|
|
132
|
+
static host(): string // Domain this scraper handles
|
|
133
|
+
|
|
134
|
+
extractors: {
|
|
135
|
+
[K in keyof RecipeFields]?: (prevValue?: RecipeFields[K]) => RecipeFields[K]
|
|
136
|
+
}
|
|
137
|
+
|
|
138
|
+
// Extraction methods
|
|
139
|
+
toRecipeObject(): RecipeObject // Extract without validation
|
|
140
|
+
parse(): RecipeObjectValidated // Extract with validation (throws on error)
|
|
141
|
+
safeParse(): SafeParseReturnType // Extract with validation (returns result)
|
|
142
|
+
|
|
143
|
+
// Override for custom validation
|
|
144
|
+
protected getSchema(): ZodSchema
|
|
145
|
+
}
|
|
146
|
+
```
|
|
147
|
+
|
|
148
|
+
### Validation Methods
|
|
149
|
+
|
|
150
|
+
**`toRecipeObject()`** - Extract without validation
|
|
151
|
+
|
|
152
|
+
```typescript
|
|
153
|
+
const recipe = scraper.toRecipeObject()
|
|
154
|
+
// Returns RecipeObject - no runtime checks
|
|
155
|
+
```
|
|
156
|
+
|
|
157
|
+
**`parse()`** - Extract with validation (throws ZodError on failure)
|
|
158
|
+
|
|
159
|
+
```typescript
|
|
160
|
+
try {
|
|
161
|
+
const recipe = scraper.parse()
|
|
162
|
+
// Returns RecipeObjectValidated - guaranteed valid
|
|
163
|
+
} catch (error) {
|
|
164
|
+
if (error instanceof ZodError) {
|
|
165
|
+
console.error(error.format())
|
|
166
|
+
}
|
|
167
|
+
}
|
|
168
|
+
```
|
|
169
|
+
|
|
170
|
+
**`safeParse()`** - Extract with validation (returns result object)
|
|
171
|
+
|
|
172
|
+
```typescript
|
|
173
|
+
const result = scraper.safeParse()
|
|
174
|
+
if (result.success) {
|
|
175
|
+
console.log(result.data) // RecipeObjectValidated
|
|
176
|
+
} else {
|
|
177
|
+
console.error(result.error.issues) // Validation errors
|
|
178
|
+
}
|
|
179
|
+
```
|
|
180
|
+
|
|
181
|
+
### How Scrapers Work
|
|
182
|
+
|
|
183
|
+
1. **Domain matching**: `host()` method identifies which scraper to use
|
|
184
|
+
2. **Plugin extraction**: Generic plugins extract base data
|
|
185
|
+
3. **Custom extraction**: Scraper's `extractors` override/enhance fields
|
|
186
|
+
4. **Post-processing**: PostProcessor plugins clean up final output
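
Steps 2 and 3 can be pictured as a merge in which each scraper extractor receives the plugin result for its own field. The sketch below is an assumption about that flow, not the library's code: `Fields`, `applyExtractors`, and the sample values are hypothetical.

```typescript
// Hypothetical two-field subset of the recipe fields.
type Fields = { name: string; author: string }

// Same shape as the `extractors` map on AbstractScraper, narrowed to Fields.
type Extractors = { [K in keyof Fields]?: (prevValue?: Fields[K]) => Fields[K] }

function applyExtractors(base: Partial<Fields>, extractors: Extractors): Partial<Fields> {
  const merged: Partial<Fields> = { ...base }
  // Each override receives the plugin result for its own field as prevValue.
  if (extractors.name) merged.name = extractors.name(base.name)
  if (extractors.author) merged.author = extractors.author(base.author)
  return merged
}

// Step 2: pretend the generic plugins produced this.
const fromPlugins: Partial<Fields> = { name: 'Apple Pie | Example Site', author: 'Jane Doe' }

// Step 3: a site-specific override that drops the site suffix from the name.
const siteExtractors: Extractors = {
  name: (prevValue) => (prevValue ?? '').replace(/\s*\|.*$/, ''),
}

const mergedFields = applyExtractors(fromPlugins, siteExtractors)
```

Fields without an override, such as `author` here, pass through unchanged from the plugin stage.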
|
|
187
|
+
|
|
188
|
+
### Example Scraper with Custom Validation
|
|
189
|
+
|
|
190
|
+
```typescript
|
|
191
|
+
class NYTimes extends AbstractScraper {
|
|
192
|
+
static host() {
|
|
193
|
+
return 'cooking.nytimes.com'
|
|
194
|
+
}
|
|
195
|
+
|
|
196
|
+
// Override schema for site-specific validation
|
|
197
|
+
protected getSchema() {
|
|
198
|
+
return RecipeObjectSchema.extend({
|
|
199
|
+
title: z.string()
|
|
200
|
+
.refine((v) => v.length >= 10, 'Titles must be descriptive')
|
|
201
|
+
})
|
|
202
|
+
}
|
|
203
|
+
}
|
|
204
|
+
```
|
|
205
|
+
|
|
206
|
+
### When to Create a Scraper
|
|
207
|
+
|
|
208
|
+
Create a site-specific scraper when:
|
|
209
|
+
|
|
210
|
+
- ✅ Generic plugins produce incorrect/incomplete data
|
|
211
|
+
- ✅ Site has unique HTML structure requiring custom parsing
|
|
212
|
+
- ✅ Data needs restructuring (e.g., ingredient grouping)
|
|
213
|
+
- ✅ Site lacks proper Schema.org markup
|
|
214
|
+
|
|
215
|
+
Don't create a scraper when:
|
|
216
|
+
|
|
217
|
+
- ❌ Generic plugins already extract everything correctly
|
|
218
|
+
- ❌ Only minor text cleanup needed (use post-processors instead)
|
|
219
|
+
|
|
220
|
+
## RecipeExtractor
|
|
221
|
+
|
|
222
|
+
Main entry point for recipe extraction:
|
|
223
|
+
|
|
224
|
+
```typescript
|
|
225
|
+
class RecipeExtractor {
|
|
226
|
+
constructor(
|
|
227
|
+
html: string,
|
|
228
|
+
url: string,
|
|
229
|
+
pluginManager?: PluginManager,
|
|
230
|
+
)
|
|
231
|
+
|
|
232
|
+
extract(): RecipeObject
|
|
233
|
+
}
|
|
234
|
+
```
|
|
235
|
+
|
|
236
|
+
### Extraction Process
|
|
237
|
+
|
|
238
|
+
1. **Parse HTML**: Load HTML with Cheerio
|
|
239
|
+
2. **Find scraper**: Match URL domain to registered scrapers
|
|
240
|
+
3. **Run extractors**: Execute plugins and scraper extractors
|
|
241
|
+
4. **Post-process**: Apply post-processor plugins
|
|
242
|
+
5. **Return result**: Final `RecipeObject`
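
Step 2, matching the URL's domain to a scraper, might look roughly like the following. The `scraperNames` table and `findScraperName` are hypothetical (the real registry lives in `src/scrapers/_index.ts`), and treating a `www.` prefix as equivalent to the bare host is an assumption.

```typescript
// Hypothetical registry keyed by the value each scraper's host() would return.
const scraperNames: Record<string, string> = {
  'cooking.nytimes.com': 'NYTimes',
  'allrecipes.com': 'AllRecipes',
}

function findScraperName(url: string): string | undefined {
  // URL is a standard global in Bun, Node.js, and browsers.
  const hostname = new URL(url).hostname.toLowerCase()
  // Assumption: "www.example.com" and "example.com" are the same site.
  const host = hostname.replace(/^www\./, '')
  // Guard against inherited keys such as "constructor".
  return Object.prototype.hasOwnProperty.call(scraperNames, host)
    ? scraperNames[host]
    : undefined
}
```

When no scraper matches, extraction would presumably fall back to the generic plugins alone, as the pipeline diagram marks the site-specific scraper as optional.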
|
|
243
|
+
|
|
244
|
+
### Usage
|
|
245
|
+
|
|
246
|
+
```typescript
|
|
247
|
+
import { RecipeExtractor } from 'recipe-scrapers-js'
|
|
248
|
+
|
|
249
|
+
const response = await fetch('https://cooking.nytimes.com/recipes/...')
|
|
250
|
+
const extractor = new RecipeExtractor(
|
|
251
|
+
await response.text(),
|
|
252
|
+
'https://cooking.nytimes.com/recipes/...'
|
|
253
|
+
)
|
|
254
|
+
|
|
255
|
+
const recipe = extractor.extract()
|
|
256
|
+
console.log(recipe.name)
|
|
257
|
+
console.log(recipe.ingredients)
|
|
258
|
+
```
|
|
259
|
+
|
|
260
|
+
## Validation System
|
|
261
|
+
|
|
262
|
+
The library uses [Zod](https://zod.dev/) for runtime validation.
|
|
263
|
+
|
|
264
|
+
### Schema Organization
|
|
265
|
+
|
|
266
|
+
```txt
|
|
267
|
+
src/schemas/
|
|
268
|
+
├── recipe.schema.ts # Main RecipeObject schema
|
|
269
|
+
└── common.schema.ts # Reusable helpers (zString, zHttpUrl, etc.)
|
|
270
|
+
```
|
|
271
|
+
|
|
272
|
+
### Validation Features
|
|
273
|
+
|
|
274
|
+
- **Auto-fixing**: Calculates missing totalTime from cook + prep times
|
|
275
|
+
- **Cross-field validation**: Ensures time consistency and rating relationships
|
|
276
|
+
- **Custom error messages**: Clear validation feedback
|
|
277
|
+
- **Transform pipeline**: Trims whitespace, normalizes data
|
|
278
|
+
- **Extensible**: Scrapers can override schemas for site-specific rules
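
The auto-fix for `totalTime` could be approximated without Zod as shown below. This is a sketch under assumptions: times are taken to be minutes, `RecipeTimes` and `fillTotalTime` are hypothetical names, and treating a single missing component as zero is a guess rather than documented behavior.

```typescript
// Hypothetical subset of recipe fields, with times assumed to be in minutes.
type RecipeTimes = { prepTime?: number; cookTime?: number; totalTime?: number }

function fillTotalTime(times: RecipeTimes): RecipeTimes {
  // Leave an explicit totalTime alone; only fill the gap.
  if (times.totalTime !== undefined) return times
  // Without at least one component there is nothing to derive.
  if (times.prepTime === undefined && times.cookTime === undefined) return times
  // Assumption: a missing component counts as zero.
  return { ...times, totalTime: (times.prepTime ?? 0) + (times.cookTime ?? 0) }
}
```

In the library this kind of fix would sit in a schema transform, so it runs as part of `parse()` and `safeParse()` rather than as a separate call.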
|
|
279
|
+
|
|
280
|
+
### Validation Guarantees
|
|
281
|
+
|
|
282
|
+
Validated recipes (`RecipeObjectValidated`) are guaranteed to have:
|
|
283
|
+
|
|
284
|
+
- ✅ Valid URLs for host, canonicalUrl, image
|
|
285
|
+
- ✅ Non-empty title, yields, description, author
|
|
286
|
+
- ✅ At least one ingredient group with items
|
|
287
|
+
- ✅ At least one instruction group with steps
|
|
288
|
+
- ✅ Positive time values (when present)
|
|
289
|
+
- ✅ Valid ratings (0-5) and non-negative rating counts
|
|
290
|
+
- ✅ Consistent totalTime ≥ cookTime + prepTime
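
A few of these guarantees can be expressed as plain checks, which may help when reading validation errors. The snippet is a dependency-free sketch, not the Zod schema itself; the field names `ratings` and `ratingsCount` and the message wording are assumptions.

```typescript
// Hypothetical subset of validated fields; times assumed to be in minutes.
type Checkable = {
  prepTime?: number
  cookTime?: number
  totalTime?: number
  ratings?: number
  ratingsCount?: number
}

function collectIssues(recipe: Checkable): string[] {
  const issues: string[] = []
  const { prepTime, cookTime, totalTime, ratings, ratingsCount } = recipe

  // Time values must be positive when present.
  const times: Array<[string, number | undefined]> = [
    ['prepTime', prepTime],
    ['cookTime', cookTime],
    ['totalTime', totalTime],
  ]
  for (const [field, minutes] of times) {
    if (minutes !== undefined && minutes <= 0) issues.push(`${field} must be positive`)
  }

  if (ratings !== undefined && (ratings < 0 || ratings > 5)) {
    issues.push('ratings must be between 0 and 5')
  }
  if (ratingsCount !== undefined && ratingsCount < 0) {
    issues.push('ratingsCount must not be negative')
  }

  // Cross-field rule: the total cannot be shorter than its parts.
  if (prepTime !== undefined && cookTime !== undefined && totalTime !== undefined) {
    if (totalTime < prepTime + cookTime) {
      issues.push('totalTime must be at least prepTime + cookTime')
    }
  }
  return issues
}
```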
|
|
291
|
+
|
|
292
|
+
## Project Structure
|
|
293
|
+
|
|
294
|
+
```txt
|
|
295
|
+
recipe-scrapers-js/
|
|
296
|
+
├── src/
|
|
297
|
+
│ ├── index.ts # Public API exports
|
|
298
|
+
│ ├── recipe-extractor.ts # Main extraction orchestrator
|
|
299
|
+
│ ├── plugin-manager.ts # Plugin registration & execution
|
|
300
|
+
│ ├── abstract-scraper.ts # Base scraper class
|
|
301
|
+
│ ├── abstract-plugin.ts # Base plugin classes
|
|
302
|
+
│ │
|
|
303
|
+
│ ├── schemas/ # Zod validation schemas
|
|
304
|
+
│ │ └── recipe.schema.ts # RecipeObject schema
|
|
305
|
+
│ │
|
|
306
|
+
│ ├── plugins/ # Generic extraction plugins
|
|
307
|
+
│ │ ├── schema-org.extractor/
|
|
308
|
+
│ │ ├── opengraph.extractor.ts
|
|
309
|
+
│ │ └── html-stripper.processor.ts
|
|
310
|
+
│ │
|
|
311
|
+
│ ├── scrapers/ # Site-specific scrapers
|
|
312
|
+
│ │ ├── _index.ts # Scraper registry
|
|
313
|
+
│ │ ├── nytimes.ts
|
|
314
|
+
│ │ ├── allrecipes.ts
|
|
315
|
+
│ │ └── ...
|
|
316
|
+
│ │
|
|
317
|
+
│ ├── utils/ # Utility functions
|
|
318
|
+
│ │ ├── ingredients.ts # Ingredient utilities
|
|
319
|
+
│ │ ├── instructions.ts # Instruction utilities
|
|
320
|
+
│ │ ├── parsing.ts # Text parsing helpers
|
|
321
|
+
│ │ └── ...
|
|
322
|
+
│ │
|
|
323
|
+
│ ├── types/ # TypeScript types
|
|
324
|
+
│ │ ├── recipe.interface.ts # RecipeObject definition
|
|
325
|
+
│ │ └── scraper.interface.ts
|
|
326
|
+
│ │
|
|
327
|
+
│ └── exceptions/ # Custom exceptions
|
|
328
|
+
│
|
|
329
|
+
├── test-data/ # Test fixtures by domain
|
|
330
|
+
│ ├── allrecipes.com/
|
|
331
|
+
│ │ ├── allrecipes.testhtml # HTML fixture
|
|
332
|
+
│ │ └── allrecipes.json # Expected output
|
|
333
|
+
│ └── ...
|
|
334
|
+
│
|
|
335
|
+
└── docs/
|
|
336
|
+
├── architecture.md # This file
|
|
337
|
+
└── ingredients-architecture.md # Ingredients deep-dive
|
|
338
|
+
```
|
|
339
|
+
|
|
340
|
+
## Testing Strategy
|
|
341
|
+
|
|
342
|
+
### Test Structure
|
|
343
|
+
|
|
344
|
+
Each site has test fixtures in `test-data/[domain]/`:
|
|
345
|
+
|
|
346
|
+
- `*.testhtml` - Real HTML from the site (anonymized if needed)
|
|
347
|
+
- `*.json` - Expected `RecipeObject` output
|
|
348
|
+
|
|
349
|
+
### Test Types
|
|
350
|
+
|
|
351
|
+
#### 1. Unit Tests
|
|
352
|
+
|
|
353
|
+
Located in `__tests__/` directories alongside source files:
|
|
354
|
+
|
|
355
|
+
- Plugin tests: Verify extraction logic
|
|
356
|
+
- Utility tests: Test helper functions
|
|
357
|
+
- Type predicates: Validate type guards
|
|
358
|
+
|
|
359
|
+
```typescript
|
|
360
|
+
// Example: Plugin test
|
|
361
|
+
describe('SchemaOrgExtractorPlugin', () => {
|
|
362
|
+
it('should extract ingredients from recipeIngredient', () => {
|
|
363
|
+
const html = '<script type="application/ld+json">...</script>'
|
|
364
|
+
const $ = cheerio.load(html)
|
|
365
|
+
const plugin = new SchemaOrgExtractorPlugin()
|
|
366
|
+
|
|
367
|
+
const result = plugin.ingredients($)
|
|
368
|
+
expect(result).toEqual([...])
|
|
369
|
+
})
|
|
370
|
+
})
|
|
371
|
+
```
|
|
372
|
+
|
|
373
|
+
#### 2. Integration Tests
|
|
374
|
+
|
|
375
|
+
Tests scrapers with real HTML fixtures:
|
|
376
|
+
|
|
377
|
+
```typescript
|
|
378
|
+
describe('NYTimes scraper', () => {
|
|
379
|
+
it('should extract recipe from nytimes.testhtml', async () => {
|
|
380
|
+
const html = await Bun.file('test-data/cooking.nytimes.com/nytimes.testhtml').text()
|
|
381
|
+
const expected = await Bun.file('test-data/cooking.nytimes.com/nytimes.json').json()
|
|
382
|
+
|
|
383
|
+
const extractor = new RecipeExtractor(html, 'https://cooking.nytimes.com/...')
|
|
384
|
+
const result = extractor.extract()
|
|
385
|
+
|
|
386
|
+
expect(result).toEqual(expected)
|
|
387
|
+
})
|
|
388
|
+
})
|
|
389
|
+
```
|
|
390
|
+
|
|
391
|
+
### Writing Tests
|
|
392
|
+
|
|
393
|
+
When adding a new scraper:
|
|
394
|
+
|
|
395
|
+
1. **Fetch real HTML**: Get actual page HTML from the target site
|
|
396
|
+
2. **Create fixture**: Save as `test-data/[domain]/[testname].testhtml`
|
|
397
|
+
3. **Run extraction**: Use the library to extract data
|
|
398
|
+
4. **Verify manually**: Check that extraction is correct
|
|
399
|
+
5. **Save expected**: Store result as `[testname].json`
|
|
400
|
+
6. **Write test**: Create integration test comparing fixture to expected
|
|
401
|
+
|
|
402
|
+
### Test Best Practices
|
|
403
|
+
|
|
404
|
+
- ✅ Use real HTML from actual sites
|
|
405
|
+
- ✅ Test edge cases (missing fields, unusual structure)
|
|
406
|
+
- ✅ Use `toEqual` for full object comparison
|
|
407
|
+
- ✅ Keep tests focused and independent
|
|
408
|
+
- ❌ Don't test internal implementation details
|
|
409
|
+
- ❌ Don't use mock HTML that doesn't match real sites
|
|
410
|
+
|
|
411
|
+
### Running Tests
|
|
412
|
+
|
|
413
|
+
```bash
|
|
414
|
+
# Run all tests
|
|
415
|
+
bun test
|
|
416
|
+
|
|
417
|
+
# Run specific test file
|
|
418
|
+
bun test src/plugins/__tests__/schema-org.test.ts
|
|
419
|
+
|
|
420
|
+
# Run with coverage
|
|
421
|
+
bun test --coverage
|
|
422
|
+
|
|
423
|
+
# Run scrapers tests only
|
|
424
|
+
bun test src/scrapers/__tests__/
|
|
425
|
+
```
|
|
426
|
+
|
|
427
|
+
## Type System
|
|
428
|
+
|
|
429
|
+
### Strict Type Safety
|
|
430
|
+
|
|
431
|
+
The project enforces strict TypeScript rules:
|
|
432
|
+
|
|
433
|
+
```typescript
|
|
434
|
+
// ❌ NEVER use non-null assertions
|
|
435
|
+
const value = obj.field! // Don't do this
|
|
436
|
+
|
|
437
|
+
// ✅ Use optional chaining and nullish coalescing
|
|
438
|
+
const value = obj.field ?? 'default'
|
|
439
|
+
|
|
440
|
+
// ❌ NEVER use 'any'
|
|
441
|
+
function process(data: any) { ... } // Don't do this
|
|
442
|
+
|
|
443
|
+
// ✅ Use proper types or 'unknown'
|
|
444
|
+
function process(data: RecipeObject) { ... }
|
|
445
|
+
function process(data: unknown) {
|
|
446
|
+
if (isRecipeObject(data)) {
|
|
447
|
+
// Type guard narrows to RecipeObject
|
|
448
|
+
}
|
|
449
|
+
}
|
|
450
|
+
```
|
|
451
|
+
|
|
452
|
+
### Type Guards vs Zod Validation
|
|
453
|
+
|
|
454
|
+
**Type Guards** - Runtime checks that narrow types at compile time:
|
|
455
|
+
|
|
456
|
+
```typescript
|
|
457
|
+
export function isIngredientItem(value: unknown): value is IngredientItem {
|
|
458
|
+
return isPlainObject(value) && 'value' in value && isString(value.value)
|
|
459
|
+
}
|
|
460
|
+
```
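
The guard above relies on `isPlainObject` and `isString`, which are not shown here. A self-contained version might look like this; the helper implementations are assumptions and the library's own versions may differ.

```typescript
type IngredientItem = { value: string }

// Assumed helper implementations, written for this sketch.
function isString(value: unknown): value is string {
  return typeof value === 'string'
}

function isPlainObject(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value)
}

function isIngredientItem(value: unknown): value is IngredientItem {
  return isPlainObject(value) && 'value' in value && isString(value.value)
}

// A type guard can be passed straight to filter() to narrow an array.
const candidates: unknown[] = [{ value: '1 cup flour' }, { value: 42 }, 'flour', null]
const validItems: IngredientItem[] = candidates.filter(isIngredientItem)
```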
|
|
461
|
+
|
|
462
|
+
**Zod Schemas** - For runtime validation with error details:
|
|
463
|
+
|
|
464
|
+
```typescript
|
|
465
|
+
const result = RecipeObjectSchema.safeParse(data)
|
|
466
|
+
if (result.success) {
|
|
467
|
+
// data is validated RecipeObjectValidated
|
|
468
|
+
} else {
|
|
469
|
+
// result.error contains detailed validation errors
|
|
470
|
+
}
|
|
471
|
+
```
|
|
472
|
+
|
|
473
|
+
## Code Style
|
|
474
|
+
|
|
475
|
+
### Modern ECMAScript
|
|
476
|
+
|
|
477
|
+
- Use ESM (`import`/`export`) not CommonJS
|
|
478
|
+
- Use `const`/`let` instead of `var`
|
|
479
|
+
- Prefer template literals for strings
|
|
480
|
+
- Use destructuring for objects/arrays
|
|
481
|
+
- Use async/await for promises
|
|
482
|
+
|
|
483
|
+
### Bun-First
|
|
484
|
+
|
|
485
|
+
- Prefer Bun APIs over Node.js when available
|
|
486
|
+
- Use Bun's built-in test framework
|
|
487
|
+
- Leverage Bun's fast module resolution
|
|
488
|
+
|
|
489
|
+
### Documentation
|
|
490
|
+
|
|
491
|
+
- Add JSDoc comments for public APIs
|
|
492
|
+
- Include examples in documentation
|
|
493
|
+
- Document complex algorithms inline
|
|
494
|
+
- Keep comments up-to-date with code
|
|
495
|
+
|
|
496
|
+
## Adding New Features
|
|
497
|
+
|
|
498
|
+
### Adding a New Scraper
|
|
499
|
+
|
|
500
|
+
1. **Research**: Analyze target site's HTML structure
|
|
501
|
+
2. **Check JSON-LD**: Verify whether Schema.org data exists and assess its quality
|
|
502
|
+
3. **Create scraper**: Extend `AbstractScraper` in `src/scrapers/`
|
|
503
|
+
4. **Override extractors**: Add custom extraction methods
|
|
504
|
+
5. **Custom validation** (optional): Override `getSchema()` for site-specific rules
|
|
505
|
+
6. **Register**: Add to `src/scrapers/_index.ts`
|
|
506
|
+
7. **Add tests**: Create fixtures in `test-data/[domain]/`
|
|
507
|
+
8. **Test validation**: Ensure `parse()` succeeds on test data
|
|
508
|
+
9. **Verify**: Run tests and ensure extraction is accurate
|
|
509
|
+
|
|
510
|
+
### Adding a New Plugin
|
|
511
|
+
|
|
512
|
+
1. **Identify need**: What extraction method is missing?
|
|
513
|
+
2. **Choose type**: ExtractorPlugin or PostProcessorPlugin?
|
|
514
|
+
3. **Implement**: Extend appropriate abstract class
|
|
515
|
+
4. **Register**: Add to `PluginManager` initialization
|
|
516
|
+
5. **Test**: Add unit tests for plugin logic
|
|
517
|
+
6. **Document**: Update relevant architecture docs
|
|
518
|
+
|
|
519
|
+
### Adding Schema Validation Rules
|
|
520
|
+
|
|
521
|
+
1. **Identify need**: What validation is missing?
|
|
522
|
+
2. **Update schema**: Modify `src/schemas/recipe.schema.ts`
|
|
523
|
+
3. **Add refinements**: Use `.refine()` for cross-field validation
|
|
524
|
+
4. **Add transforms**: Use `.transform()` for auto-fixes
|
|
525
|
+
5. **Test**: Add unit tests for new validation rules
|
|
526
|
+
|
|
527
|
+
## Performance Considerations
|
|
528
|
+
|
|
529
|
+
### Cheerio Usage
|
|
530
|
+
|
|
531
|
+
- Cheerio is synchronous and fast
|
|
532
|
+
- Cache selectors when reusing: `const $heading = this.$('h1')`
|
|
533
|
+
- Use efficient selectors (IDs and classes over complex queries)
|
|
534
|
+
- Limit DOM traversal when possible
|
|
535
|
+
|
|
536
|
+
### Memory
|
|
537
|
+
|
|
538
|
+
- HTML fixtures can be large - stream if needed
|
|
539
|
+
- Don't store entire DOM in memory unnecessarily
|
|
540
|
+
- Clear references to Cheerio instances after extraction
|
|
541
|
+
|
|
542
|
+
### Optimization Opportunities
|
|
543
|
+
|
|
544
|
+
- **Selector optimization**: Profile slow selectors
|
|
545
|
+
- **Lazy loading**: Only load scrapers for matched domains
|
|
546
|
+
|
|
547
|
+
## Future Enhancements
|
|
548
|
+
|
|
549
|
+
### Planned Features
|
|
550
|
+
|
|
551
|
+
1. **More scrapers**: Expand site coverage
|
|
552
|
+
2. **Better fuzzy matching**: Improve ingredient text matching
|
|
553
|
+
3. **Ingredient parsing**: Break down quantity/unit/ingredient
|
|
554
|
+
4. **Video extraction**: Support recipe videos
|
|
555
|
+
5. **CLI tool**: Scaffold new scrapers
|
|
556
|
+
|
|
557
|
+
### Architecture Improvements
|
|
558
|
+
|
|
559
|
+
1. **Scraper generator**: CLI tool to scaffold new scrapers
|
|
560
|
+
2. **Schema versioning**: Track validation rule changes
|
|
561
|
+
3. **Streaming API**: Process large HTML documents efficiently
|
|
562
|
+
4. **Performance profiling**: Identify validation bottlenecks
|
|
563
|
+
|
|
564
|
+
## Contributing
|
|
565
|
+
|
|
566
|
+
See the main README.md for contribution guidelines. Key points:
|
|
567
|
+
|
|
568
|
+
- Follow TypeScript strict mode rules
|
|
569
|
+
- Add tests for all new features
|
|
570
|
+
- Update documentation
|
|
571
|
+
- Use existing patterns and conventions
|
|
572
|
+
- Run `bun test` before submitting PRs
|
|
573
|
+
|
|
574
|
+
## Related Documentation
|
|
575
|
+
|
|
576
|
+
- **[Ingredients Architecture](./ingredients-architecture.md)**: Deep dive into ingredient extraction system
|
|
577
|
+
- **API Reference**: (Future: Generated from TSDoc comments)
|
|
578
|
+
- **Recipe Schema**: (Future: Detailed RecipeObject field documentation)
|
|
package/docs/ingredients-architecture.md
ADDED
|
@@ -0,0 +1,363 @@
|
|
|
1
|
+
# Ingredients Architecture
|
|
2
|
+
|
|
3
|
+
> **Note**: This document focuses specifically on the ingredients extraction system. For overall project architecture, see [Architecture](./architecture.md).
|
|
4
|
+
|
|
5
|
+
## Overview
|
|
6
|
+
|
|
7
|
+
The ingredients extraction system is designed to extract and structure recipe ingredients from various cooking websites. The system uses a multi-layered approach combining generic extraction plugins with site-specific scrapers to produce a consistent data structure.
|
|
8
|
+
|
|
9
|
+
## Data Structure
|
|
10
|
+
|
|
11
|
+
### Core Types
|
|
12
|
+
|
|
13
|
+
```typescript
|
|
14
|
+
type ParsedIngredient = {
|
|
15
|
+
quantity: number | null // Primary quantity (e.g., 2)
|
|
16
|
+
quantity2: number | null // Secondary quantity for ranges (e.g., 1-2 cups)
|
|
17
|
+
unitOfMeasureID: string | null // Normalized unit key (e.g., "cup")
|
|
18
|
+
unitOfMeasure: string | null // Unit as written (e.g., "cups")
|
|
19
|
+
description: string // Ingredient name (e.g., "flour")
|
|
20
|
+
isGroupHeader: boolean // True if this is a section header
|
|
21
|
+
}
|
|
22
|
+
|
|
23
|
+
type IngredientItem = {
|
|
24
|
+
value: string // The ingredient text, e.g., "1 1/2 cups flour"
|
|
25
|
+
parsed?: ParsedIngredient // Structured data (optional, via parseIngredients option)
|
|
26
|
+
}
|
|
27
|
+
|
|
28
|
+
type IngredientGroup = {
|
|
29
|
+
name: string | null // Group name, e.g., "For the dough" or null for default
|
|
30
|
+
items: IngredientItem[] // Array of ingredients in this group
|
|
31
|
+
}
|
|
32
|
+
|
|
33
|
+
type Ingredients = IngredientGroup[] // Array of groups
|
|
34
|
+
```
|
|
35
|
+
|
|
36
|
+
### Design Decisions
|
|
37
|
+
|
|
38
|
+
**Why no IDs?**
|
|
39
|
+
|
|
40
|
+
- Initially had `id` fields on both items and groups
|
|
41
|
+
- Removed for simplicity - IDs added complexity without providing value
|
|
42
|
+
- Groups and items are identified by their position in arrays
|
|
43
|
+
|
|
44
|
+
**Why groups?**
|
|
45
|
+
|
|
46
|
+
- Many recipes organize ingredients into sections (e.g., "For the crust", "For the filling")
|
|
47
|
+
- Default group name is `null` for ungrouped ingredients
|
|
48
|
+
- Preserves recipe structure and improves readability
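
To make the shape concrete, here is a hand-written value that satisfies the types above. The recipe content is invented for illustration, and the `parsed` object is a plausible example rather than captured output from parse-ingredient.

```typescript
type ParsedIngredient = {
  quantity: number | null
  quantity2: number | null
  unitOfMeasureID: string | null
  unitOfMeasure: string | null
  description: string
  isGroupHeader: boolean
}
type IngredientItem = { value: string; parsed?: ParsedIngredient }
type IngredientGroup = { name: string | null; items: IngredientItem[] }
type Ingredients = IngredientGroup[]

const pieIngredients: Ingredients = [
  {
    name: 'For the crust',
    items: [
      {
        value: '1 1/2 cups flour',
        // Illustrative only; present when the parseIngredients option is enabled.
        parsed: {
          quantity: 1.5,
          quantity2: null,
          unitOfMeasureID: 'cup',
          unitOfMeasure: 'cups',
          description: 'flour',
          isGroupHeader: false,
        },
      },
      { value: '1/2 teaspoon salt' },
    ],
  },
  // Ungrouped ingredients live in a group whose name is null.
  { name: null, items: [{ value: '2 cups sugar' }] },
]

// Items are addressed by position, so counting them is a simple reduction.
const itemCount = pieIngredients.reduce((total, group) => total + group.items.length, 0)
```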
|
|
49
|
+
|
|
50
|
+
## Extraction Flow
|
|
51
|
+
|
|
52
|
+
### 1. Plugin-Based Extraction
|
|
53
|
+
|
|
54
|
+
The `RecipeExtractor` runs multiple extraction plugins in priority order:
|
|
55
|
+
|
|
56
|
+
```txt
|
|
57
|
+
1. Schema.org Plugin (priority: 90)
|
|
58
|
+
├─> Extracts from <script type="application/ld+json">
|
|
59
|
+
└─> Returns: Ingredients with clean, normalized text
|
|
60
|
+
|
|
61
|
+
2. OpenGraph Plugin (priority: 50)
|
|
62
|
+
└─> Fallback for basic metadata
|
|
63
|
+
|
|
64
|
+
3. PostProcessor Plugins (in priority order)
|
|
65
|
+
├─> HtmlStripperPlugin (100): Removes HTML tags from text values
|
|
66
|
+
└─> IngredientParserPlugin (50): Parses ingredients into structured data*
|
|
67
|
+
|
|
68
|
+
*Only active when `parseIngredients` option is enabled
|
|
69
|
+
```
|
|
70
|
+
|
|
71
|
+
**Key Insight**: Schema.org JSON-LD provides **clean, well-formatted text**:
|
|
72
|
+
|
|
73
|
+
- Proper spacing: `"1 1/2 cups flour"`
|
|
74
|
+
- Normalized fractions: `"1/2"` instead of `"½"`
|
|
75
|
+
- No HTML artifacts or concatenation issues
|
|
76
|
+
|
|
77
|
+
### 2. Site-Specific Scrapers
|
|
78
|
+
|
|
79
|
+
Scrapers extend `AbstractScraper` and can override any field extractor:
|
|
80
|
+
|
|
81
|
+
```typescript
|
|
82
|
+
class NYTimes extends AbstractScraper {
|
|
83
|
+
extractors = {
|
|
84
|
+
ingredients: this.ingredients.bind(this),
|
|
85
|
+
}
|
|
86
|
+
|
|
87
|
+
protected ingredients(prevValue: Ingredients | undefined): Ingredients {
|
|
88
|
+
// Custom logic here
|
|
89
|
+
}
|
|
90
|
+
}
|
|
91
|
+
```
|
|
92
|
+
|
|
93
|
+
**The `prevValue` Parameter**:
|
|
94
|
+
|
|
95
|
+
- Contains the result from plugin extraction (usually Schema.org)
|
|
96
|
+
- Provides clean text that scrapers can restructure
|
|
97
|
+
- Optional - scrapers can extract from scratch if needed
|
|
98
|
+
|
|
99
|
+
## The Flatten→Regroup Pattern
|
|
100
|
+
|
|
101
|
+
### Why This Pattern Exists
|
|
102
|
+
|
|
103
|
+
Many recipe websites have **structure in HTML** but **clean text in JSON-LD**:
|
|
104
|
+
|
|
105
|
+
**HTML Structure** (NYTimes example):
|
|
106
|
+
|
|
107
|
+
```html
|
|
108
|
+
<h3>For the dough</h3>
|
|
109
|
+
<li>1½cups flour</li> <!-- ❌ No space, unicode fraction -->
|
|
110
|
+
<li>½teaspoon salt</li> <!-- ❌ Concatenated -->
|
|
111
|
+
|
|
112
|
+
<h3>For the filling</h3>
|
|
113
|
+
<li>2cups sugar</li> <!-- ❌ No space -->
|
|
114
|
+
```
|
|
115
|
+
|
|
116
|
+
**JSON-LD Text** (from Schema.org):
|
|
117
|
+
|
|
118
|
+
```jsonc
|
|
119
|
+
[
|
|
120
|
+
"1 1/2 cups flour", // ✅ Clean, spaced, normalized
|
|
121
|
+
"1/2 teaspoon salt", // ✅ Perfect formatting
|
|
122
|
+
"2 cups sugar" // ✅ No issues
|
|
123
|
+
]
|
|
124
|
+
```
|
|
125
|
+
|
|
126
|
+
**The Problem**:
|
|
127
|
+
|
|
128
|
+
- JSON-LD has **perfect text** but **loses grouping** (all ingredients in flat array)
|
|
129
|
+
- HTML has **accurate grouping** but **poor text quality** (no spaces, unicode chars, etc.)
|
|
130
|
+
|
|
131
|
+
**The Solution**: Use BOTH!
|
|
132
|
+
|
|
133
|
+
1. Extract clean text from JSON-LD (via Schema.org plugin)
|
|
134
|
+
2. Flatten it to strings for matching
|
|
135
|
+
3. Parse HTML structure for grouping
|
|
136
|
+
4. Match HTML elements to JSON-LD text
|
|
137
|
+
5. Rebuild grouped structure with clean text
|
|
138
|
+
|
|
139
|
+
### Implementation
|
|
140
|
+
|
|
141
|
+
```typescript
|
|
142
|
+
protected ingredients(prevValue: Ingredients | undefined): Ingredients {
|
|
143
|
+
const headingSelector = 'h3.ingredient-group'
|
|
144
|
+
const ingredientSelector = 'li.ingredient'
|
|
145
|
+
|
|
146
|
+
if (prevValue && prevValue.length > 0) {
|
|
147
|
+
// Step 1: Flatten prevValue (from Schema.org) to get clean text
|
|
148
|
+
const values = flattenIngredients(prevValue)
|
|
149
|
+
// values = ["1 1/2 cups flour", "1/2 teaspoon salt", "2 cups sugar"]
|
|
150
|
+
|
|
151
|
+
// Step 2: Parse HTML and match text to rebuild groups
|
|
152
|
+
return groupIngredients(
|
|
153
|
+
this.$, // Cheerio instance
|
|
154
|
+
values, // Clean text from JSON-LD
|
|
155
|
+
headingSelector, // Where to find group names
|
|
156
|
+
ingredientSelector, // Where to find ingredient items
|
|
157
|
+
)
|
|
158
|
+
}
|
|
159
|
+
|
|
160
|
+
throw new NoIngredientsFoundException()
|
|
161
|
+
}
|
|
162
|
+
```
|
|
163
|
+
|
|
164
|
+
### Utility Functions
|
|
165
|
+
|
|
166
|
+
#### `flattenIngredients(ingredients: Ingredients): string[]`
|
|
167
|
+
|
|
168
|
+
Converts grouped structure to flat array of strings:
|
|
169
|
+
|
|
170
|
+
```typescript
|
|
171
|
+
// Input:
|
|
172
|
+
[
|
|
173
|
+
{ name: "For the dough", items: [{ value: "1 cup flour" }] },
|
|
174
|
+
{ name: null, items: [{ value: "Salt to taste" }] }
|
|
175
|
+
]
|
|
176
|
+
|
|
177
|
+
// Output:
|
|
178
|
+
["1 cup flour", "Salt to taste"]
|
|
179
|
+
```
|
|
180
|
+
|
|
181
|
+
**Why flatten?** To get an ordered list of clean text values for matching against HTML elements.
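
A minimal version of this helper could be written as follows. It is a sketch consistent with the input and output shown above, not necessarily the library's implementation, and the types are repeated so the snippet stands alone.

```typescript
type IngredientItem = { value: string }
type IngredientGroup = { name: string | null; items: IngredientItem[] }
type Ingredients = IngredientGroup[]

function flattenIngredients(ingredients: Ingredients): string[] {
  const values: string[] = []
  // Walk groups, then items, so the original order is preserved.
  for (const group of ingredients) {
    for (const item of group.items) {
      values.push(item.value)
    }
  }
  return values
}

const flatValues = flattenIngredients([
  { name: 'For the dough', items: [{ value: '1 cup flour' }] },
  { name: null, items: [{ value: 'Salt to taste' }] },
])
```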
|
|
182
|
+
|
|
183
|
+
#### `groupIngredients($, values, headingSelector, ingredientSelector): Ingredients`
|
|
184
|
+
|
|
185
|
+
Rebuilds grouped structure by:
|
|
186
|
+
|
|
187
|
+
1. Parsing HTML to find headings and items
|
|
188
|
+
2. Matching HTML text (normalized) to `values` array (normalized)
|
|
189
|
+
3. Creating groups based on HTML structure
|
|
190
|
+
4. Using matched text from `values` (preserving clean formatting)
|
|
191
|
+
|
|
192
|
+
**Normalization**: Both HTML text and values are normalized before matching (trim, lowercase, collapse whitespace) to ensure reliable matches despite formatting differences.
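
The matching idea can be sketched without Cheerio. In the example below, `HtmlNode` and `regroup` are hypothetical names: a plain array of headings and items stands in for the parsed HTML, whereas the real `groupIngredients` takes a Cheerio instance and selectors. Treat it as an illustration of normalized matching under those assumptions, not as the library's code.

```typescript
// Hypothetical stand-in for parsed HTML: headings and items in document order.
type HtmlNode = { kind: 'heading' | 'item'; text: string }

interface IngredientGroup {
  name: string | null
  items: { value: string }[]
}

// Trim, lowercase, and collapse runs of whitespace.
function normalize(text: string): string {
  return text.trim().toLowerCase().replace(/\s+/g, ' ')
}

// Sketch only: group boundaries come from the HTML order, text comes from `values`.
function regroup(nodes: HtmlNode[], values: string[]): IngredientGroup[] {
  // Index the clean values by their normalized form.
  const byKey = new Map<string, string>()
  for (const value of values) byKey.set(normalize(value), value)

  const groups: IngredientGroup[] = []
  let current: IngredientGroup | undefined

  for (const node of nodes) {
    if (node.kind === 'heading') {
      current = { name: node.text.trim(), items: [] }
      groups.push(current)
      continue
    }
    const clean = byKey.get(normalize(node.text))
    if (clean === undefined) continue // unmatched HTML item: skipped in this sketch
    if (!current) {
      // Items that appear before any heading go into an unnamed group.
      current = { name: null, items: [] }
      groups.push(current)
    }
    current.items.push({ value: clean })
  }
  return groups
}

const grouped = regroup(
  [
    { kind: 'heading', text: 'For the dough' },
    { kind: 'item', text: '  1 CUP   flour ' },
    { kind: 'heading', text: 'For the topping' },
    { kind: 'item', text: 'salt to taste' },
  ],
  ['1 cup flour', 'Salt to taste'],
)
```

Here `grouped` holds two groups whose item text is taken from `values` ("1 cup flour", "Salt to taste"), not from the messier HTML strings.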
|
|
193
|
+
|
|
194
|
+
#### `stringsToIngredients(values: string[]): Ingredients`
|
|
195
|
+
|
|
196
|
+
Converts a flat string array to the default group structure:
|
|
197
|
+
|
|
198
|
+
```typescript
|
|
199
|
+
// Input:
|
|
200
|
+
["1 cup flour", "Salt to taste"]
|
|
201
|
+
|
|
202
|
+
// Output:
|
|
203
|
+
[
|
|
204
|
+
{
|
|
205
|
+
name: null, // Default group
|
|
206
|
+
items: [
|
|
207
|
+
{ value: "1 cup flour" },
|
|
208
|
+
{ value: "Salt to taste" }
|
|
209
|
+
]
|
|
210
|
+
}
|
|
211
|
+
]
|
|
212
|
+
```
|
|
213
|
+
|
|
214
|
+
Used by the Schema.org plugin when converting the `recipeIngredient` array.
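
A minimal sketch of this conversion, assuming the group shape shown in the example above (the `IngredientGroup` type is an assumption, and the real function may handle edge cases such as an empty array differently):

```typescript
// Hypothetical types, inferred from the examples in this document.
interface IngredientGroup {
  name: string | null
  items: { value: string }[]
}

type Ingredients = IngredientGroup[]

// Sketch only: wraps every string in a single unnamed (default) group.
function stringsToIngredients(values: string[]): Ingredients {
  return [{ name: null, items: values.map((value) => ({ value })) }]
}

const defaultGrouped = stringsToIngredients(['1 cup flour', 'Salt to taste'])
```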
|
|
215
|
+
|
|
216
|
+
## When to Use Each Approach
|
|
217
|
+
|
|
218
|
+
### Use Flatten→Regroup When
|
|
219
|
+
|
|
220
|
+
- ✅ JSON-LD provides clean text but lacks grouping
|
|
221
|
+
- ✅ HTML has visible grouping structure (headings, sections)
|
|
222
|
+
- ✅ Text quality in HTML is poor (spacing, unicode, concatenation issues)
|
|
223
|
+
- **Example**: NYTimes, BBC Good Food, Simply Recipes
|
|
224
|
+
|
|
225
|
+
### Parse HTML Directly When
|
|
226
|
+
|
|
227
|
+
- ✅ JSON-LD is missing or unreliable
|
|
228
|
+
- ✅ HTML text quality is good
|
|
229
|
+
- ✅ Complex custom structure that doesn't fit the standard pattern
|
|
230
|
+
- **Example**: Sites without proper Schema.org markup
|
|
231
|
+
|
|
232
|
+
### Use Default (No Override) When
|
|
233
|
+
|
|
234
|
+
- ✅ Schema.org JSON-LD provides both good text AND grouping
|
|
235
|
+
- ✅ No special processing needed
|
|
236
|
+
- **Example**: Sites with perfect Schema.org implementation
|
|
237
|
+
|
|
238
|
+
## Common Pitfalls
|
|
239
|
+
|
|
240
|
+
### ❌ Don't Parse HTML Text Directly If You Have JSON-LD
|
|
241
|
+
|
|
242
|
+
```typescript
|
|
243
|
+
// BAD: HTML text has formatting issues
|
|
244
|
+
protected ingredients(): Ingredients {
|
|
245
|
+
const items = this.$('li.ingredient').map((_, el) =>
|
|
246
|
+
this.$(el).text() // "1½cups" - no space!
|
|
247
|
+
).get()
|
|
248
|
+
// ...
|
|
249
|
+
}
|
|
250
|
+
```
|
|
251
|
+
|
|
252
|
+
```typescript
|
|
253
|
+
// GOOD: Use JSON-LD text with HTML structure
|
|
254
|
+
protected ingredients(prevValue: Ingredients | undefined): Ingredients {
|
|
255
|
+
const values = flattenIngredients(prevValue) // Clean text from JSON-LD
|
|
256
|
+
return groupIngredients(this.$, values, headingSelector, ingredientSelector)
|
|
257
|
+
}
|
|
258
|
+
```
|
|
259
|
+
|
|
260
|
+
### ❌ Don't Modify Text After Flattening
|
|
261
|
+
|
|
262
|
+
```typescript
|
|
263
|
+
// BAD: Modifying clean text
|
|
264
|
+
const values = flattenIngredients(prevValue)
|
|
265
|
+
.map(v => v.toUpperCase()) // Don't do this!
|
|
266
|
+
```
|
|
267
|
+
|
|
268
|
+
The text from JSON-LD is already clean, and `groupIngredients` normalizes both sides itself before matching. Preserve the values as they are.
|
|
269
|
+
|
|
270
|
+
### ❌ Don't Flatten If You're Not Regrouping
|
|
271
|
+
|
|
272
|
+
```typescript
|
|
273
|
+
// BAD: Unnecessary work
|
|
274
|
+
const values = flattenIngredients(prevValue)
|
|
275
|
+
return stringsToIngredients(values) // Just return prevValue!
|
|
276
|
+
```
|
|
277
|
+
|
|
278
|
+
If you are not regrouping, return `prevValue` unchanged. Verify that the test passes with both HTML and JSON-LD extraction.
|
|
279
|
+
|
|
280
|
+
## Ingredient Parsing
|
|
281
|
+
|
|
282
|
+
### Parsing Overview
|
|
283
|
+
|
|
284
|
+
The library supports optional structured parsing of ingredient strings using the [parse-ingredient](https://github.com/jakeboone02/parse-ingredient) library. When enabled via the `parseIngredients` option, each ingredient item includes a `parsed` field with extracted data.
|
|
285
|
+
|
|
286
|
+
### Enabling Parsing
|
|
287
|
+
|
|
288
|
+
```typescript
|
|
289
|
+
// Enable with defaults
|
|
290
|
+
const scraper = new MyScraper(html, url, { parseIngredients: true })
|
|
291
|
+
|
|
292
|
+
// Enable with options
|
|
293
|
+
const scraper = new MyScraper(html, url, {
|
|
294
|
+
parseIngredients: {
|
|
295
|
+
normalizeUOM: true, // "tbsp" → "tablespoon"
|
|
296
|
+
ignoreUOMs: ['small'], // Treat as description, not unit
|
|
297
|
+
}
|
|
298
|
+
})
|
|
299
|
+
```
|
|
300
|
+
|
|
301
|
+
### Parsing Pipeline
|
|
302
|
+
|
|
303
|
+
The `IngredientParserPlugin` runs as a PostProcessor (priority 50), after HTML stripping:
|
|
304
|
+
|
|
305
|
+
```txt
|
|
306
|
+
1. HtmlStripperPlugin (100)
|
|
307
|
+
└─> "2 cups <b>flour</b>" → "2 cups flour"
|
|
308
|
+
|
|
309
|
+
2. IngredientParserPlugin (50)
|
|
310
|
+
└─> "2 cups flour" → { value: "2 cups flour", parsed: {...} }
|
|
311
|
+
```
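
The ordering can be pictured with a small sketch. The `PostProcessor` interface and `runPostProcessors` helper below are hypothetical, and the rule that a higher priority runs first is an assumption inferred from the listing above (100 before 50); the library's plugin API may look different.

```typescript
// Hypothetical post-processor shape, for illustration only.
interface PostProcessor {
  name: string
  priority: number
  process(value: string): string
}

// Records what the parser stand-in receives, to show it sees stripped text.
const seen: string[] = []

const processors: PostProcessor[] = [
  { name: 'parser', priority: 50, process: (v) => { seen.push(v); return v } },
  { name: 'htmlStripper', priority: 100, process: (v) => v.replace(/<[^>]+>/g, '') },
]

// Assumption: higher priority runs first.
function runPostProcessors(list: PostProcessor[], input: string): string {
  return [...list]
    .sort((a, b) => b.priority - a.priority)
    .reduce((value, processor) => processor.process(value), input)
}

const output = runPostProcessors(processors, '2 cups <b>flour</b>')
// output is "2 cups flour", and the parser stand-in saw the same stripped text.
```

Running the stripper first means the parser never has to deal with markup inside ingredient strings.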
|
|
312
|
+
|
|
313
|
+
### Parsed Data Structure
|
|
314
|
+
|
|
315
|
+
```typescript
|
|
316
|
+
{
|
|
317
|
+
value: "1-2 tablespoons olive oil",
|
|
318
|
+
parsed: {
|
|
319
|
+
quantity: 1, // Primary quantity
|
|
320
|
+
quantity2: 2, // Secondary (range) quantity
|
|
321
|
+
unitOfMeasure: "tablespoons", // As written
|
|
322
|
+
unitOfMeasureID: "tablespoon", // Normalized key
|
|
323
|
+
description: "olive oil", // Ingredient name
|
|
324
|
+
isGroupHeader: false // True for "For the sauce:" etc.
|
|
325
|
+
}
|
|
326
|
+
}
|
|
327
|
+
```
|
|
328
|
+
|
|
329
|
+
### Use Cases
|
|
330
|
+
|
|
331
|
+
- **Scaling recipes**: Multiply quantities by a factor
|
|
332
|
+
- **Shopping lists**: Aggregate same ingredients across recipes
|
|
333
|
+
- **Nutritional lookup**: Search by normalized ingredient name
|
|
334
|
+
- **Unit conversion**: Feed parsed quantities and units to external tools that convert between measurement systems
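
As an example of the first use case, the sketch below scales a parsed ingredient. The `ParsedIngredient` interface mirrors the fields shown under "Parsed Data Structure"; treating the quantity and unit fields as nullable is an assumption, and `scaleParsed` is a hypothetical helper, not part of the library.

```typescript
// Hypothetical shape, mirroring the fields shown in "Parsed Data Structure".
interface ParsedIngredient {
  quantity: number | null
  quantity2: number | null
  unitOfMeasure: string | null
  unitOfMeasureID: string | null
  description: string
  isGroupHeader: boolean
}

// Multiplies both ends of a quantity range, leaving missing quantities as null.
function scaleParsed(parsed: ParsedIngredient, factor: number): ParsedIngredient {
  const scale = (q: number | null): number | null => (q === null ? null : q * factor)
  return {
    ...parsed,
    quantity: scale(parsed.quantity),
    quantity2: scale(parsed.quantity2),
  }
}

const oil: ParsedIngredient = {
  quantity: 1,
  quantity2: 2,
  unitOfMeasure: 'tablespoons',
  unitOfMeasureID: 'tablespoon',
  description: 'olive oil',
  isGroupHeader: false,
}

const doubled = scaleParsed(oil, 2)
// doubled.quantity is 2 and doubled.quantity2 is 4; the original object is untouched.
```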
|
|
335
|
+
|
|
336
|
+
## Future Considerations
|
|
337
|
+
|
|
338
|
+
### Potential Improvements
|
|
339
|
+
|
|
340
|
+
1. **Text Normalization Pipeline**: Configurable normalization steps (fraction conversion, unit standardization, etc.)
|
|
341
|
+
|
|
342
|
+
2. **Fuzzy Matching**: Handle cases where HTML text diverges significantly from JSON-LD (currently uses exact normalized matching)
|
|
343
|
+
|
|
344
|
+
3. **Fallback Strategies**: Graceful degradation when JSON-LD is partial or HTML structure is ambiguous
|
|
345
|
+
|
|
346
|
+
4. **Schema.org Validation**: Detect and handle malformed JSON-LD more robustly
|
|
347
|
+
|
|
348
|
+
### Non-Goals
|
|
349
|
+
|
|
350
|
+
- **Unit Conversion**: Converting between measurement systems (use parsed data with external tools)
|
|
351
|
+
- **Substitutions**: Handling ingredient alternatives or substitutions
|
|
352
|
+
- **Nutritional Analysis**: Calculating nutrition facts from ingredients
|
|
353
|
+
|
|
354
|
+
## Summary
|
|
355
|
+
|
|
356
|
+
The ingredients system balances **text quality** (from JSON-LD) with **structural accuracy** (from HTML):
|
|
357
|
+
|
|
358
|
+
1. **Schema.org Plugin** extracts clean text → `Ingredients` with default grouping
|
|
359
|
+
2. **Scrapers** flatten text → parse HTML structure → rebuild groups with clean text
|
|
360
|
+
3. **IngredientParserPlugin** (optional) → adds structured `parsed` data
|
|
361
|
+
4. **Result**: Accurate grouping with high-quality, normalized ingredient text and optional structured data
|
|
362
|
+
|
|
363
|
+
This architecture leverages the strengths of both JSON-LD (clean data) and HTML (visual structure) to produce the best possible output.
|
package/package.json
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "recipe-scrapers-js",
|
|
3
|
-
"version": "1.0.0-rc.
|
|
3
|
+
"version": "1.0.0-rc.1",
|
|
4
4
|
"license": "MIT",
|
|
5
5
|
"description": "A recipe scrapers library",
|
|
6
6
|
"author": {
|
|
@@ -12,13 +12,22 @@
|
|
|
12
12
|
"url": "git+https://github.com/nerdstep/recipe-scrapers-js.git"
|
|
13
13
|
},
|
|
14
14
|
"type": "module",
|
|
15
|
-
"module": "dist/index.
|
|
16
|
-
"main": "dist/index.
|
|
17
|
-
"types": "dist/index.d.
|
|
15
|
+
"module": "dist/index.mjs",
|
|
16
|
+
"main": "dist/index.mjs",
|
|
17
|
+
"types": "dist/index.d.mts",
|
|
18
|
+
"exports": {
|
|
19
|
+
"./package.json": "./package.json",
|
|
20
|
+
".": {
|
|
21
|
+
"types": "./dist/index.d.mts",
|
|
22
|
+
"import": "./dist/index.mjs"
|
|
23
|
+
}
|
|
24
|
+
},
|
|
18
25
|
"files": [
|
|
19
26
|
"dist",
|
|
20
|
-
"
|
|
21
|
-
"
|
|
27
|
+
"docs",
|
|
28
|
+
"CHANGELOG.md",
|
|
29
|
+
"LICENSE",
|
|
30
|
+
"README.md"
|
|
22
31
|
],
|
|
23
32
|
"keywords": [
|
|
24
33
|
"recipe",
|