crawlberg 0.0.1 → 1.0.0.pre.rc.1
- checksums.yaml +4 -4
- data/README.md +160 -1
- data/Steepfile +14 -0
- data/ext/crawlberg_rb/native/Cargo.lock +3419 -0
- data/ext/crawlberg_rb/native/Cargo.toml +26 -0
- data/ext/crawlberg_rb/native/extconf.rb +14 -0
- data/ext/crawlberg_rb/src/lib.rs +7702 -0
- data/lib/crawlberg/native.rb +494 -0
- data/lib/crawlberg/version.rb +10 -0
- data/lib/crawlberg.rb +14 -2
- data/lib/crawlberg_rb.so +0 -0
- data/sig/types.rbs +530 -0
- metadata +62 -13
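The target version in the title, `1.0.0.pre.rc.1`, is a prerelease string. As a minimal sketch of how RubyGems orders the two versions, the snippet below uses only RubyGems' own `Gem::Version` class; nothing in it comes from the crawlberg package itself.

```ruby
require "rubygems"

# The two version strings from the title of this diff.
old_version = Gem::Version.new("0.0.1")
new_version = Gem::Version.new("1.0.0.pre.rc.1")

# A version with letter segments ("pre", "rc") is treated as a prerelease.
puts new_version.prerelease?                     # true

# It still sorts above the previous release...
puts new_version > old_version                   # true

# ...but below the final 1.0.0 release it leads up to.
puts new_version < Gem::Version.new("1.0.0")     # true

# `release` drops the prerelease segments.
puts new_version.release                         # 1.0.0
```

Because the new version is a prerelease, a plain `gem install crawlberg` would normally not select it; `gem install crawlberg --pre` opts in to prerelease versions.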
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: a26c90b8c3a0b70eef7d1f0028f27259832cd9b64228b4b9fd2c11efbdc6880c
+  data.tar.gz: 77168b56b966f5def8746365b621b9907318b8ab924326b944d9ef2e70035285
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 25ffc939ed2fc63d181388c379dc04322eb1f169da22ed208f8408a73d514a509c90c7a26bc51623dae8f3eb8da54818fbc801eef711c1af32121a6c751eb807
+  data.tar.gz: 2cd010b1460e61ba6f8ef9784c02506cd410da0c17ca754046798a5e7c0635dc34d646b3127b817e14385db0ab336e79330c3f4df2f46eeb213bb4fa43d5571f
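The digests above cover the `metadata.gz` and `data.tar.gz` members of the gem archive. As a rough sketch of how such entries could be checked by hand, the hypothetical helper below (`verify_checksums` is an illustrative name, not part of RubyGems or crawlberg) recomputes SHA256 and SHA512 with Ruby's standard `digest` library. It assumes the two files have already been extracted into a directory.

```ruby
require "digest"
require "yaml"

# Hypothetical helper: compares the digests recorded in a checksums.yaml
# document against the files found in `dir`.
# Returns an array of [algorithm, file name, matched?] triples.
def verify_checksums(checksums_yaml, dir)
  checksums = YAML.safe_load(checksums_yaml)
  algorithms = { "SHA256" => Digest::SHA256, "SHA512" => Digest::SHA512 }

  algorithms.flat_map do |name, digest_class|
    # Skip an algorithm entirely if checksums.yaml has no section for it.
    checksums.fetch(name, {}).map do |file, expected|
      actual = digest_class.file(File.join(dir, file)).hexdigest
      [name, file, actual == expected.to_s]
    end
  end
end
```

A call such as `verify_checksums(File.read("checksums.yaml"), "unpacked")` would then yield one triple per recorded digest; the directory name `unpacked` is only a placeholder.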
data/README.md
CHANGED
@@ -1,3 +1,162 @@
+<p align="center">
+<picture>
+<source media="(prefers-color-scheme: dark)" srcset="https://cdn.jsdelivr.net/gh/xberg-io/assets@v1/banner/readme-banner-dark.svg">
+<img alt="Xberg" width="420" src="https://cdn.jsdelivr.net/gh/xberg-io/assets@v1/banner/readme-banner-light.svg">
+</picture>
+</p>
+
 # crawlberg
 
-
+<div align="center" style="display: flex; flex-wrap: wrap; gap: 8px; justify-content: center; margin: 20px 0;">
+<a href="https://github.com/xberg-io/alef">
+<img src="https://img.shields.io/badge/Bindings-alef%20%D7%90-007ec6" alt="Bindings">
+</a>
+<!-- Language Bindings -->
+<a href="https://crates.io/crates/crawlberg">
+<img src="https://img.shields.io/crates/v/crawlberg?label=Rust&color=007ec6" alt="Rust">
+</a>
+<a href="https://pypi.org/project/crawlberg/">
+<img src="https://img.shields.io/pypi/v/crawlberg?label=Python&color=007ec6" alt="Python">
+</a>
+<a href="https://www.npmjs.com/package/@xberg-io/crawlberg">
+<img src="https://img.shields.io/npm/v/@xberg-io/crawlberg?label=Node.js&color=007ec6" alt="Node.js">
+</a>
+<a href="https://www.npmjs.com/package/@xberg-io/crawlberg-wasm">
+<img src="https://img.shields.io/npm/v/@xberg-io/crawlberg-wasm?label=WASM&color=007ec6" alt="WASM">
+</a>
+<a href="https://central.sonatype.com/artifact/io.xberg.crawlberg/crawlberg">
+<img src="https://img.shields.io/maven-central/v/io.xberg.crawlberg/crawlberg?label=Java&color=007ec6" alt="Java">
+</a>
+<a href="https://pkg.go.dev/github.com/xberg-io/crawlberg/packages/go">
+<img src="https://img.shields.io/github/v/tag/xberg-io/crawlberg?label=Go&color=007ec6" alt="Go">
+</a>
+<a href="https://www.nuget.org/packages/Crawlberg/">
+<img src="https://img.shields.io/nuget/v/Crawlberg?label=C%23&color=007ec6" alt="C#">
+</a>
+<a href="https://packagist.org/packages/xberg-io/crawlberg">
+<img src="https://img.shields.io/packagist/v/xberg-io/crawlberg?label=PHP&color=007ec6" alt="PHP">
+</a>
+<a href="https://rubygems.org/gems/crawlberg">
+<img src="https://img.shields.io/gem/v/crawlberg?label=Ruby&color=007ec6" alt="Ruby">
+</a>
+<a href="https://hex.pm/packages/crawlberg">
+<img src="https://img.shields.io/hexpm/v/crawlberg?label=Elixir&color=007ec6" alt="Elixir">
+</a>
+<a href="https://pub.dev/packages/crawlberg">
+<img src="https://img.shields.io/pub/v/crawlberg?label=Dart&color=007ec6" alt="Dart">
+</a>
+<a href="https://central.sonatype.com/artifact/io.xberg.crawlberg.android/crawlberg-android">
+<img src="https://img.shields.io/maven-central/v/io.xberg.crawlberg.android/crawlberg-android?label=Kotlin&color=007ec6" alt="Kotlin">
+</a>
+<a href="https://github.com/xberg-io/crawlberg/tree/main/packages/swift">
+<img src="https://img.shields.io/badge/Swift-SPM-007ec6" alt="Swift">
+</a>
+<a href="https://github.com/xberg-io/crawlberg/tree/main/packages/zig">
+<img src="https://img.shields.io/badge/Zig-package-007ec6" alt="Zig">
+</a>
+<a href="https://github.com/xberg-io/crawlberg/releases">
+<img src="https://img.shields.io/badge/C-FFI-007ec6" alt="C FFI">
+</a>
+<a href="https://github.com/xberg-io/crawlberg/pkgs/container/crawlberg">
+<img src="https://img.shields.io/badge/Docker-ghcr.io-007ec6?logo=docker&logoColor=white" alt="Docker">
+</a>
+
+<!-- Project Info -->
+<a href="https://github.com/xberg-io/crawlberg/blob/main/LICENSE">
+<img src="https://img.shields.io/badge/License-MIT-007ec6" alt="License">
+</a>
+<a href="https://docs.crawlberg.xberg.io">
+<img src="https://img.shields.io/badge/Docs-crawlberg-007ec6" alt="Documentation">
+</a>
+</div>
+
+<div align="center" style="display: flex; flex-wrap: wrap; gap: 12px; justify-content: center; margin: 28px 0 24px;">
+<a href="https://discord.gg/xt9WY3GnKR">
+<img height="22" src="https://img.shields.io/badge/Discord-Chat-007ec6?logo=discord&logoColor=white" alt="Join Discord">
+</a>
+</div>
+
+Ruby bindings for **crawlberg** — a high-performance Rust web crawling engine. Powered by
+Magnus with native Ruby objects, full metadata extraction, and Markdown conversion.
+
+## What This Package Provides
+
+- **Same crawler as every binding** — one Rust engine behind Python, Node.js, Ruby, Go, Java, .NET, PHP, Elixir, Dart, Kotlin Android, Swift, Zig, WASM, and C FFI.
+- **Structured scrape output** — HTML, Markdown, metadata, links, assets, response headers, and extraction warnings with consistent field names.
+- **Crawl controls** — depth, page limits, concurrency, URL filters, robots/sitemap handling, rate limits, and partial failure reporting.
+- **Rendering path** — optional browser rendering for JavaScript-heavy pages; direct HTTP path for fast static pages.
+- **Ruby package** — Magnus-backed native extension with Ruby objects for crawl results.
+
+## Installation
+
+```bash
+gem install crawlberg
+```
+
+## Agent plugin
+
+The `crawlberg` plugin is available via the `xberg-io/plugins` marketplace.
+
+```text
+/plugin marketplace add xberg-io/plugins
+/plugin install crawlberg@xberg
+```
+
+Works with Claude Code, Codex, Cursor, Gemini CLI, Factory Droid, GitHub Copilot CLI, and opencode. See [the marketplace README](https://github.com/xberg-io/plugins) for harness-specific install instructions.
+
+## Quick Start
+
+```ruby title="Ruby"
+require "crawlberg"
+
+# Simplest case: scrape a single page with default settings.
+engine = Crawlberg.create_engine
+result = Crawlberg.scrape(engine, "https://example.com/")
+puts "Title: #{result.metadata.title}"
+puts "Status: #{result.status_code}"
+puts "Links found: #{result.links.length}"
+
+# Crawl from a seed URL, limited to one hop and a handful of pages.
+config = Crawlberg::CrawlConfig.new(max_depth: 1, max_pages: 5)
+crawl_engine = Crawlberg.create_engine(config)
+crawl_result = Crawlberg.crawl(crawl_engine, "https://en.wikipedia.org/wiki/Web_scraping")
+puts "Pages crawled: #{crawl_result.pages.length}"
+```
+
+## API Reference
+
+Full API documentation is available at [docs.crawlberg.xberg.io](https://docs.crawlberg.xberg.io).
+
+Key functions:
+
+- `create_engine(config?)` — Create a crawl engine with optional configuration
+- `scrape(engine, url)` — Scrape a single URL
+- `crawl(engine, url)` — Crawl a website following links
+- `map_urls(engine, url)` — Discover all pages on a site
+- `batch_scrape(engine, urls)` — Scrape multiple URLs concurrently
+- `batch_crawl(engine, urls)` — Crawl multiple seed URLs concurrently
+
+## Contributing
+
+Contributions are welcome! Please see our [Contributing Guide](https://github.com/xberg-io/crawlberg/blob/main/CONTRIBUTING.md) for details.
+
+## Part of Xberg.dev
+
+- [Xberg](https://github.com/xberg-io/xberg) — document intelligence: text, tables, metadata from 91+ formats with optional OCR.
+- [Xberg Enterprise](https://github.com/xberg-io/xberg-enterprise) — managed extraction API with SDKs, dashboards, and observability.
+- [crawlberg](https://github.com/xberg-io/crawlberg) — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
+- [html-to-markdown](https://github.com/xberg-io/html-to-markdown) — fast, lossless HTML→Markdown engine.
+- [liter-llm](https://github.com/xberg-io/liter-llm) — universal LLM API client with native bindings for 14 languages and 143 providers.
+- [tree-sitter-language-pack](https://github.com/xberg-io/tree-sitter-language-pack) — tree-sitter grammars and code-intelligence primitives.
+- [alef](https://github.com/xberg-io/alef) — the polyglot binding generator that produces every per-language binding across the 5 polyglot repos.
+- [Discord](https://discord.gg/xt9WY3GnKR) — community, roadmap, announcements.
+
+## License
+
+This project is licensed under [MIT License](https://github.com/xberg-io/crawlberg/blob/main/LICENSE).
+
+## Links
+
+- [Documentation](https://docs.crawlberg.xberg.io)
+- [GitHub Repository](https://github.com/xberg-io/crawlberg)
+- [Issue Tracker](https://github.com/xberg-io/crawlberg/issues)
data/Steepfile
ADDED
@@ -0,0 +1,14 @@
+# frozen_string_literal: true
+
+target :lib do
+  signature "sig"
+  check "lib"
+  # The generated `lib/crawlberg/native.rb` carries inline Sorbet
+  # `sig { ... }` blocks on tagged-enum variant Data classes. Sorbet's runtime
+  # provides those via `extend T::Sig`, but Steep does not understand the
+  # extension (it relies on RBS, not Sorbet sigs) and reports
+  # `Type `self` does not have method `sig`` on every block. RBS coverage
+  # for the same surface lives in `sig/types.rbs`, so we steer Steep to the
+  # RBS file by ignoring the .rb.
+  ignore "lib/crawlberg/native.rb"
+end