defuddle 0.3.8 → 0.5.1

Files changed (64)
  1. package/README.md +61 -48
  2. package/dist/constants.js +744 -0
  3. package/dist/constants.js.map +1 -0
  4. package/dist/defuddle.d.ts +3 -2
  5. package/dist/defuddle.js +1676 -0
  6. package/dist/defuddle.js.map +1 -0
  7. package/dist/elements/code.js +287 -0
  8. package/dist/elements/code.js.map +1 -0
  9. package/dist/elements/headings.js +95 -0
  10. package/dist/elements/headings.js.map +1 -0
  11. package/dist/elements/math.base.js +192 -0
  12. package/dist/elements/math.base.js.map +1 -0
  13. package/dist/elements/math.full.d.ts +1 -1
  14. package/dist/elements/math.full.js +121 -0
  15. package/dist/elements/math.full.js.map +1 -0
  16. package/dist/extractor-registry.d.ts +15 -0
  17. package/dist/extractor-registry.js +101 -0
  18. package/dist/extractor-registry.js.map +1 -0
  19. package/dist/extractors/_base.d.ts +9 -0
  20. package/dist/extractors/_base.js +12 -0
  21. package/dist/extractors/_base.js.map +1 -0
  22. package/dist/extractors/_conversation.d.ts +9 -0
  23. package/dist/extractors/_conversation.js +77 -0
  24. package/dist/extractors/_conversation.js.map +1 -0
  25. package/dist/extractors/chatgpt.d.ts +13 -0
  26. package/dist/extractors/chatgpt.js +142 -0
  27. package/dist/extractors/chatgpt.js.map +1 -0
  28. package/dist/extractors/claude.d.ts +10 -0
  29. package/dist/extractors/claude.js +87 -0
  30. package/dist/extractors/claude.js.map +1 -0
  31. package/dist/extractors/hackernews.d.ts +21 -0
  32. package/dist/extractors/hackernews.js +206 -0
  33. package/dist/extractors/hackernews.js.map +1 -0
  34. package/dist/extractors/reddit.d.ts +16 -0
  35. package/dist/extractors/reddit.js +143 -0
  36. package/dist/extractors/reddit.js.map +1 -0
  37. package/dist/extractors/twitter.d.ts +16 -0
  38. package/dist/extractors/twitter.js +199 -0
  39. package/dist/extractors/twitter.js.map +1 -0
  40. package/dist/extractors/youtube.d.ts +12 -0
  41. package/dist/extractors/youtube.js +53 -0
  42. package/dist/extractors/youtube.js.map +1 -0
  43. package/dist/index.full.js +19181 -1
  44. package/dist/index.full.js.map +1 -0
  45. package/dist/index.js +6 -1
  46. package/dist/index.js.map +1 -0
  47. package/dist/markdown.d.ts +1 -0
  48. package/dist/markdown.js +545 -0
  49. package/dist/markdown.js.map +1 -0
  50. package/dist/metadata.js +268 -0
  51. package/dist/metadata.js.map +1 -0
  52. package/dist/node.d.ts +12 -0
  53. package/dist/node.js +50 -0
  54. package/dist/node.js.map +1 -0
  55. package/dist/scoring.d.ts +8 -0
  56. package/dist/scoring.js +95 -0
  57. package/dist/scoring.js.map +1 -0
  58. package/dist/types/extractors.d.ts +41 -0
  59. package/dist/types/extractors.js +3 -0
  60. package/dist/types/extractors.js.map +1 -0
  61. package/dist/types.d.ts +14 -0
  62. package/dist/types.js +3 -0
  63. package/dist/types.js.map +1 -0
  64. package/package.json +19 -5
package/README.md CHANGED
@@ -24,72 +24,63 @@ Defuddle can be used as a replacement for [Mozilla Readability](https://github.c
  npm install defuddle
  ```
 
- ## Usage
-
- ```typescript
- import Defuddle from 'defuddle';
+ For Node.js usage, you'll also need to install JSDOM:
 
- const article = new Defuddle(document).parse();
-
- // Use the extracted content and metadata
- console.log(article.content); // HTML string of the main content
- console.log(article.title); // Title of the article
+ ```bash
+ npm install jsdom
  ```
 
- ### Bundles
-
- Defuddle comes in two bundles:
-
- **Core bundle** (~50kB), no dependencies
- ```js
- import Defuddle from 'defuddle';
- ```
- **Full bundle** (~432kB), includes advanced math conversion capabilities
- ```js
- import Defuddle from 'defuddle/full';
- ```
+ ## Usage
 
- The core bundle is recommended for most use cases. It still handles math content, but doesn't include fallbacks for converting between MathML and LaTeX formats. The full bundle adds the ability to create reliable `<math>` elements using `mathml-to-latex` and `temml` libraries.
+ ### Browser
 
- ### Debug mode
+ ```javascript
+ import { Defuddle } from 'defuddle';
 
- You can enable debug mode by passing an options object when creating a new Defuddle instance:
+ // Parse the current document
+ const defuddle = new Defuddle(document);
+ const result = defuddle.parse();
 
- ```typescript
- const article = new Defuddle(document, { debug: true }).parse();
+ // Access the content and metadata
+ console.log(result.content);
+ console.log(result.title);
+ console.log(result.author);
  ```
 
- - More verbose console logging about the parsing process
- - Preserves HTML class and id attributes that are normally stripped
- - Retains all data-* attributes
- - Skips div flattening to preserve document structure
+ ### Node.js
 
- ### Server-side usage
+ ```javascript
+ import { JSDOM } from 'jsdom';
+ import { Defuddle } from 'defuddle/node';
 
- When using Defuddle in a Node.js environment, you can use JSDOM to create a DOM document:
+ // Parse HTML from a string
+ const html = '<html><body><article>...</article></body></html>';
+ const result = await Defuddle(html);
 
- ```typescript
- import Defuddle from 'defuddle';
- import { JSDOM } from 'jsdom';
+ // Parse HTML from a URL
+ const dom = await JSDOM.fromURL('https://example.com/article');
+ const result = await Defuddle(dom);
 
- const html = '...'; // Your HTML string
- const dom = new JSDOM(html, {
- url: "https://www.example.com/page-url" // Optional: helps resolve relative URLs
+ // With options
+ const result = await Defuddle(dom, {
+ debug: true, // Enable debug mode for verbose logging
+ markdown: true, // Convert content to markdown
+ url: 'https://example.com/article' // Original URL of the page
  });
 
- const article = new Defuddle(dom.window.document).parse();
- console.log(article.content);
+ // Access the content and metadata
+ console.log(result.content);
+ console.log(result.title);
+ console.log(result.author);
  ```
 
- Providing `url` in the JSDOM constructor helps convert relative URLs (images, links, etc.) to absolute URLs.
-
  ## Response
 
- The `parse()` method returns an object with the following properties:
+ Defuddle returns an object with the following properties:
 
  | Property | Type | Description |
  |----------|------|-------------|
- | `content` | string | HTML string of the extracted main content |
+ | `content` | string | Cleaned up string of the extracted content |
  | `title` | string | Title of the article |
  | `description` | string | Description or summary of the article |
  | `domain` | string | Domain name of the website |
@@ -102,6 +93,32 @@ The `parse()` method returns an object with the following properties:
  | `schemaOrgData` | object | Raw schema.org data extracted from the page |
  | `wordCount` | number | Total number of words in the extracted content |
 
+ ## Bundles
+
+ Defuddle is available in three different bundles:
+
+ 1. Core bundle (`defuddle`): The main bundle for browser usage. No dependencies.
+ 2. Full bundle (`defuddle/full`): Includes additional features for math equation parsing.
+ 3. Node.js bundle (`defuddle/node`): Optimized for Node.js environments using JSDOM. Includes full capabilities for math and Markdown conversion.
+
+ The core bundle is recommended for most use cases. It still handles math content, but doesn't include fallbacks for converting between MathML and LaTeX formats. The full bundle adds the ability to create reliable `<math>` elements using `mathml-to-latex` and `temml` libraries.
+
+ ## Options
+
+ ### Debug mode
+
+ You can enable debug mode by passing an options object when creating a new Defuddle instance:
+
+ ```typescript
+ const article = new Defuddle(document, { debug: true }).parse();
+ ```
+
+ - More verbose console logging about the parsing process
+ - Preserves HTML class and id attributes that are normally stripped
+ - Retains all data-* attributes
+ - Skips div flattening to preserve document structure
+
+
  ## HTML standardization
 
  Defuddle attempts to standardize HTML elements to provide a consistent input for subsequent manipulation such as conversion to Markdown.
@@ -167,7 +184,3 @@ npm install
  # Clean and build
  npm run build
  ```
-
- This will generate:
- - `dist/index.js` - UMD build for both Node.js and browsers
- - `dist/index.d.ts` - TypeScript declaration file