contextractor 0.2.0 → 0.2.1

Files changed (2)
  1. package/README.md +133 -0
  2. package/package.json +3 -2
package/README.md ADDED
@@ -0,0 +1,133 @@
# Contextractor

Extract clean, readable content from any website using [Trafilatura](https://trafilatura.readthedocs.io/).

Available as: [npm CLI](#install) | [Docker](#docker) | [Apify actor](https://apify.com/shortc/contextractor) | [Web app](https://contextractor.com)

## Install

```bash
npm install -g contextractor
```

Requires Node.js 18+. Playwright Chromium is installed automatically.
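
If you are unsure whether your environment satisfies the version requirement, a quick check can be sketched like this (the version string is illustrative; in a real shell you would capture it with `ver="$(node --version)"`):

```shell
# Sketch: does a Node.js version string satisfy the 18+ requirement?
ver="v18.19.0"
major=${ver#v}       # strip the leading "v"  -> 18.19.0
major=${major%%.*}   # keep the major version -> 18
if [ "$major" -ge 18 ]; then
  echo "Node.js $ver meets the 18+ requirement"
else
  echo "Node.js 18+ required (found $ver)" >&2
fi
```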

## Usage

```bash
contextractor https://example.com
```

Works with zero config. Pass URLs directly, or use a config file for complex setups:

```bash
contextractor https://example.com --precision --format json -o ./results
contextractor --config config.yaml --max-pages 10
```

### CLI Options

```
contextractor [OPTIONS] [URLS...]

Options:
  --config, -c                          Path to YAML or JSON config file
  --output-dir, -o                      Output directory
  --format, -f                          Output format (txt, markdown, json, xml, xmltei)
  --max-pages                           Max pages to crawl (0 = unlimited)
  --crawl-depth                         Max link depth from start URLs (0 = start only)
  --headless/--no-headless              Browser headless mode (default: headless)
  --precision                           High precision mode (less noise)
  --recall                              High recall mode (more content)
  --fast                                Fast extraction mode (less thorough)
  --no-links                            Exclude links from output
  --no-comments                         Exclude comments from output
  --include-tables/--no-tables          Include tables (default: include)
  --include-images                      Include image descriptions
  --include-formatting/--no-formatting  Preserve formatting (default: preserve)
  --deduplicate                         Deduplicate extracted content
  --target-language                     Filter by language (e.g. "en")
  --with-metadata/--no-metadata         Extract metadata (default: with)
  --prune-xpath                         XPath patterns to remove from content
  --verbose, -v                         Enable verbose logging
```

CLI flags override config file settings. Merge order: `defaults → config file → CLI args`.
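
For example, given a config file that sets markdown output (the file name is arbitrary):

```yaml
# config.yaml
urls:
  - https://example.com
outputFormat: markdown
```

running `contextractor --config config.yaml --format json` emits JSON for that run, because the CLI flag outranks `outputFormat` from the file.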

### Config File (optional)

```yaml
urls:
  - https://example.com
  - https://docs.example.com
outputFormat: markdown
outputDir: ./output
crawlDepth: 1

extraction:
  favorPrecision: true
  includeLinks: true
  includeTables: true
  deduplicate: true
```

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `urls` | array | `[]` | URLs to extract content from |
| `maxPages` | int | 0 | Max pages to crawl (0 = unlimited) |
| `outputFormat` | string | `"markdown"` | `txt`, `markdown`, `json`, `xml`, `xmltei` |
| `outputDir` | string | `"./output"` | Directory for extracted content |
| `crawlDepth` | int | 0 | How deep to follow links (0 = start URLs only) |
| `headless` | bool | true | Browser headless mode |
| `extraction` | object | `{}` | Trafilatura extraction options (see below) |

### Extraction Options

All options below go under the `extraction` key in a config file; each has an equivalent CLI flag:

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `favorPrecision` | bool | false | High precision, less noise |
| `favorRecall` | bool | false | High recall, more content |
| `includeComments` | bool | true | Include comments |
| `includeTables` | bool | true | Include tables |
| `includeImages` | bool | false | Include image descriptions |
| `includeFormatting` | bool | true | Preserve formatting |
| `includeLinks` | bool | true | Include links |
| `deduplicate` | bool | false | Deduplicate content |
| `withMetadata` | bool | true | Extract metadata (title, author, date) |
| `targetLanguage` | string | null | Filter by language (e.g. `"en"`) |
| `fast` | bool | false | Fast mode (less thorough) |
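
Putting the two tables together, a precision-oriented config might look like this (field names as documented above; the URL and values are illustrative):

```yaml
urls:
  - https://example.com/blog
outputFormat: json
extraction:
  favorPrecision: true
  includeComments: false
  targetLanguage: "en"
  deduplicate: true
```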

## Docker

```bash
docker run ghcr.io/contextractor/contextractor https://example.com
```

Save output to your local machine:

```bash
docker run -v ./output:/output ghcr.io/contextractor/contextractor https://example.com -o /output
```

Use a config file:

```bash
docker run -v ./config.yaml:/config.yaml ghcr.io/contextractor/contextractor --config /config.yaml
```

All CLI flags work the same inside Docker.

## Output

One file per crawled page, named from the URL slug (e.g. `example-com-page.md`). Metadata (title, author, date) is included in the output header when available.
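
The exact slugging rule is not documented here, so as a rough mental model only (an assumption, not contextractor's actual implementation), a URL maps to a filename along these lines:

```shell
# Hypothetical slugging sketch — NOT contextractor's real code:
# strip the scheme, drop any trailing slash, replace dots and slashes with dashes
url="https://example.com/page"
slug=$(printf '%s' "$url" | sed -E -e 's|^https?://||' -e 's|/$||' -e 's|[./]|-|g')
printf '%s.md\n' "$slug"   # -> example-com-page.md
```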

## Platforms

- npm: macOS arm64, Linux (x64, arm64), Windows x64
- Docker: linux/amd64, linux/arm64

## License

MIT
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "contextractor",
- "version": "0.2.0",
+ "version": "0.2.1",
  "description": "Extract web content from URLs with configurable extraction options",
  "license": "MIT",
  "repository": {
@@ -29,7 +29,8 @@
  "files": [
  "cli.js",
  "index.js",
- "postinstall.js"
+ "postinstall.js",
+ "README.md"
  ],
  "scripts": {
  "postinstall": "node postinstall.js"