contextractor 0.2.0 → 0.2.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +133 -0
- package/package.json +3 -2
package/README.md
ADDED

@@ -0,0 +1,133 @@
# Contextractor

Extract clean, readable content from any website using [Trafilatura](https://trafilatura.readthedocs.io/).

Available as: [npm CLI](#install) | [Docker](#docker) | [Apify actor](https://apify.com/shortc/contextractor) | [Web app](https://contextractor.com)

## Install

```bash
npm install -g contextractor
```

Requires Node.js 18+. Playwright Chromium is installed automatically.

## Usage

```bash
contextractor https://example.com
```

Works with zero config. Pass URLs directly, or use a config file for complex setups:

```bash
contextractor https://example.com --precision --format json -o ./results
contextractor --config config.yaml --max-pages 10
```

### CLI Options

```
contextractor [OPTIONS] [URLS...]

Options:
  --config, -c                          Path to YAML or JSON config file
  --output-dir, -o                      Output directory
  --format, -f                          Output format (txt, markdown, json, xml, xmltei)
  --max-pages                           Max pages to crawl (0 = unlimited)
  --crawl-depth                         Max link depth from start URLs (0 = start only)
  --headless/--no-headless              Browser headless mode (default: headless)
  --precision                           High precision mode (less noise)
  --recall                              High recall mode (more content)
  --fast                                Fast extraction mode (less thorough)
  --no-links                            Exclude links from output
  --no-comments                         Exclude comments from output
  --include-tables/--no-tables          Include tables (default: include)
  --include-images                      Include image descriptions
  --include-formatting/--no-formatting  Preserve formatting (default: preserve)
  --deduplicate                         Deduplicate extracted content
  --target-language                     Filter by language (e.g. "en")
  --with-metadata/--no-metadata         Extract metadata (default: with)
  --prune-xpath                         XPath patterns to remove from content
  --verbose, -v                         Enable verbose logging
```

CLI flags override config file settings. Merge order: `defaults → config file → CLI args`
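The precedence behaves like a shallow merge in which later sources win. A minimal illustrative sketch (not contextractor's actual implementation; option names taken from the config examples in this README):

```javascript
// Illustrative only: later sources override earlier ones,
// mirroring defaults -> config file -> CLI args.
const defaults = { outputFormat: "markdown", crawlDepth: 0, headless: true };
const configFile = { outputFormat: "json", crawlDepth: 1 };
const cliArgs = { crawlDepth: 2 };

const merged = { ...defaults, ...configFile, ...cliArgs };
console.log(merged); // { outputFormat: 'json', crawlDepth: 2, headless: true }
```

So `crawlDepth` comes from the CLI, `outputFormat` from the config file, and `headless` falls back to the default.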

### Config File (optional)

```yaml
urls:
  - https://example.com
  - https://docs.example.com
outputFormat: markdown
outputDir: ./output
crawlDepth: 1

extraction:
  favorPrecision: true
  includeLinks: true
  includeTables: true
  deduplicate: true
```

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `urls` | array | `[]` | URLs to extract content from |
| `maxPages` | int | `0` | Max pages to crawl (0 = unlimited) |
| `outputFormat` | string | `"markdown"` | `txt`, `markdown`, `json`, `xml`, `xmltei` |
| `outputDir` | string | `"./output"` | Directory for extracted content |
| `crawlDepth` | int | `0` | How deep to follow links (0 = start URLs only) |
| `headless` | bool | `true` | Browser headless mode |
| `extraction` | object | `{}` | Trafilatura extraction options (see below) |
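Since `--config` accepts YAML or JSON, the same configuration can be written as JSON (field names mirror the YAML example above):

```json
{
  "urls": ["https://example.com", "https://docs.example.com"],
  "outputFormat": "markdown",
  "outputDir": "./output",
  "crawlDepth": 1,
  "extraction": {
    "favorPrecision": true,
    "includeLinks": true,
    "includeTables": true,
    "deduplicate": true
  }
}
```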

### Extraction Options

All options go under the `extraction` key in config files, or use the equivalent CLI flags:

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `favorPrecision` | bool | `false` | High precision, less noise |
| `favorRecall` | bool | `false` | High recall, more content |
| `includeComments` | bool | `true` | Include comments |
| `includeTables` | bool | `true` | Include tables |
| `includeImages` | bool | `false` | Include images |
| `includeFormatting` | bool | `true` | Preserve formatting |
| `includeLinks` | bool | `true` | Include links |
| `deduplicate` | bool | `false` | Deduplicate content |
| `withMetadata` | bool | `true` | Extract metadata (title, author, date) |
| `targetLanguage` | string | `null` | Filter by language (e.g. `"en"`) |
| `fast` | bool | `false` | Fast mode (less thorough) |
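For instance, to bias extraction toward recall, capture image descriptions, and keep only English pages, an `extraction` block could look like this (values are illustrative; field names come from the table above):

```yaml
extraction:
  favorRecall: true
  includeImages: true
  targetLanguage: en
```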

## Docker

```bash
docker run ghcr.io/contextractor/contextractor https://example.com
```

Save output to your local machine:

```bash
docker run -v ./output:/output ghcr.io/contextractor/contextractor https://example.com -o /output
```

Use a config file:

```bash
docker run -v ./config.yaml:/config.yaml ghcr.io/contextractor/contextractor --config /config.yaml
```

All CLI flags work the same inside Docker.
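For repeated runs, the `docker run` examples above can be combined into a Compose file. A hypothetical sketch (the service name and mount paths are assumptions, not part of the project):

```yaml
services:
  contextractor:
    image: ghcr.io/contextractor/contextractor
    volumes:
      - ./config.yaml:/config.yaml
      - ./output:/output
    command: ["--config", "/config.yaml", "-o", "/output"]
```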

## Output

One file per crawled page, named from the URL slug (e.g. `example-com-page.md`). Metadata (title, author, date) is included in the output header when available.
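The exact naming scheme is internal to the CLI; a hypothetical sketch consistent with the `example-com-page.md` example:

```javascript
// Hypothetical slug function -- contextractor's real naming may differ.
// Joins host and path, then collapses non-alphanumeric runs into "-".
function urlSlug(url) {
  const { hostname, pathname } = new URL(url);
  return `${hostname}${pathname}`
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

console.log(urlSlug("https://example.com/page") + ".md"); // example-com-page.md
```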

## Platforms

- npm: macOS arm64, Linux (x64, arm64), Windows x64
- Docker: linux/amd64, linux/arm64

## License

MIT
package/package.json
CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "contextractor",
-  "version": "0.2.0",
+  "version": "0.2.1",
   "description": "Extract web content from URLs with configurable extraction options",
   "license": "MIT",
   "repository": {
@@ -29,7 +29,8 @@
   "files": [
     "cli.js",
     "index.js",
-    "postinstall.js"
+    "postinstall.js",
+    "README.md"
   ],
   "scripts": {
     "postinstall": "node postinstall.js"