crawlberg 0.0.1 → 1.0.0.pre.rc.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 84d24f5fa9af1d0b4b70254ba190b3cc949c51c6383f4242ec13dbce8fdb6c3b
- data.tar.gz: 40281a66a14d43b1ea9da7f3142ef35f775b1a97fd69f138fe25b602b5a58ed1
+ metadata.gz: 07722edb64b4714a652ad533eb1594e52ee56e172c5f62e911870a7b7a00cf49
+ data.tar.gz: 312938f63afed7f64edb9765493e72242b186ff473e181ec27ffa570c5c3e86d
  SHA512:
- metadata.gz: bb5dd17b7ca11213404ed014056027c1aabbef4bf9ced1af345b5ae8ccebe42e75934c07a5d1dc1665f7ca9ed67fe24b2fd4ae126d9239187ba4b7dc69708e0d
- data.tar.gz: 76515a688aa854dc7514bbd4c1e5716f46614af40ebab1567340a4fbae00cdc137eec9941c58ca7473868bc507b4574e4ea358eed7fa80079a7fce70de58f0bf
+ metadata.gz: 6db2096db0dea5bae3f9d1b211a64fe29350d58a66d4b18474ecb37ac148fcf0c2c63f27cfeaf1254ad33dee7c8f62ad002f6fb933e9ddec74f069ed633aa076
+ data.tar.gz: f3474d90832cab725e874a122a89bace6e2f85df4ffc56bd169a8e57a171f6bc5d9f281657704768570d70f51d25840e29d64548ac591a303e2e55da4a90c804
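The digests above cover the two members of the packed `.gem` archive, `metadata.gz` and `data.tar.gz`. As a rough sketch of how such entries could be checked, the snippet below recomputes SHA256 and SHA512 with Ruby's standard `digest` library and compares them with a hash shaped like `checksums.yaml`. The helper name `checksum_mismatches` and the stand-in byte strings are assumptions for illustration; a real check would read the two files out of the unpacked gem.

```ruby
require "digest"

# Hypothetical helper: compare raw file bytes against the hex digests
# recorded in a checksums.yaml-style hash ({"SHA256" => {"name" => hex}}).
# Returns a list of [algorithm, file_name] pairs that did not match.
def checksum_mismatches(expected, files)
  algorithms = { "SHA256" => Digest::SHA256, "SHA512" => Digest::SHA512 }
  mismatches = []
  expected.each do |algo_name, entries|
    digest_class = algorithms.fetch(algo_name)
    entries.each do |file_name, expected_hex|
      actual_hex = digest_class.hexdigest(files.fetch(file_name))
      mismatches << [algo_name, file_name] unless actual_hex == expected_hex
    end
  end
  mismatches
end

# Stand-in bytes (assumption); a real check would read metadata.gz and
# data.tar.gz from the unpacked .gem archive instead.
files = { "metadata.gz" => "demo-metadata", "data.tar.gz" => "demo-data" }
expected = {
  "SHA256" => files.transform_values { |bytes| Digest::SHA256.hexdigest(bytes) },
  "SHA512" => files.transform_values { |bytes| Digest::SHA512.hexdigest(bytes) }
}

puts checksum_mismatches(expected, files).empty? ? "all digests match" : "mismatch"
```

Returning the mismatching pairs, instead of a single boolean, makes it easier to see which file and which algorithm disagreed.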
data/README.md CHANGED
@@ -1,3 +1,162 @@
+ <p align="center">
+ <picture>
+ <source media="(prefers-color-scheme: dark)" srcset="https://cdn.jsdelivr.net/gh/xberg-io/assets@v1/banner/readme-banner-dark.svg">
+ <img alt="Xberg" width="420" src="https://cdn.jsdelivr.net/gh/xberg-io/assets@v1/banner/readme-banner-light.svg">
+ </picture>
+ </p>
+
  # crawlberg
 
- Reserved name. See https://github.com/xberg-io/crawlberg
+ <div align="center" style="display: flex; flex-wrap: wrap; gap: 8px; justify-content: center; margin: 20px 0;">
+ <a href="https://github.com/xberg-io/alef">
+ <img src="https://img.shields.io/badge/Bindings-alef%20%D7%90-007ec6" alt="Bindings">
+ </a>
+ <!-- Language Bindings -->
+ <a href="https://crates.io/crates/crawlberg">
+ <img src="https://img.shields.io/crates/v/crawlberg?label=Rust&color=007ec6" alt="Rust">
+ </a>
+ <a href="https://pypi.org/project/crawlberg/">
+ <img src="https://img.shields.io/pypi/v/crawlberg?label=Python&color=007ec6" alt="Python">
+ </a>
+ <a href="https://www.npmjs.com/package/@xberg-io/crawlberg">
+ <img src="https://img.shields.io/npm/v/@xberg-io/crawlberg?label=Node.js&color=007ec6" alt="Node.js">
+ </a>
+ <a href="https://www.npmjs.com/package/@xberg-io/crawlberg-wasm">
+ <img src="https://img.shields.io/npm/v/@xberg-io/crawlberg-wasm?label=WASM&color=007ec6" alt="WASM">
+ </a>
+ <a href="https://central.sonatype.com/artifact/io.xberg.crawlberg/crawlberg">
+ <img src="https://img.shields.io/maven-central/v/io.xberg.crawlberg/crawlberg?label=Java&color=007ec6" alt="Java">
+ </a>
+ <a href="https://pkg.go.dev/github.com/xberg-io/crawlberg/packages/go">
+ <img src="https://img.shields.io/github/v/tag/xberg-io/crawlberg?label=Go&color=007ec6" alt="Go">
+ </a>
+ <a href="https://www.nuget.org/packages/XbergIo.Crawlberg/">
+ <img src="https://img.shields.io/nuget/v/XbergIo.Crawlberg?label=C%23&color=007ec6" alt="C#">
+ </a>
+ <a href="https://packagist.org/packages/xberg-io/crawlberg">
+ <img src="https://img.shields.io/packagist/v/xberg-io/crawlberg?label=PHP&color=007ec6" alt="PHP">
+ </a>
+ <a href="https://rubygems.org/gems/crawlberg">
+ <img src="https://img.shields.io/gem/v/crawlberg?label=Ruby&color=007ec6" alt="Ruby">
+ </a>
+ <a href="https://hex.pm/packages/crawlberg">
+ <img src="https://img.shields.io/hexpm/v/crawlberg?label=Elixir&color=007ec6" alt="Elixir">
+ </a>
+ <a href="https://pub.dev/packages/crawlberg">
+ <img src="https://img.shields.io/pub/v/crawlberg?label=Dart&color=007ec6" alt="Dart">
+ </a>
+ <a href="https://central.sonatype.com/artifact/io.xberg.crawlberg.android/crawlberg-android">
+ <img src="https://img.shields.io/maven-central/v/io.xberg.crawlberg.android/crawlberg-android?label=Kotlin&color=007ec6" alt="Kotlin">
+ </a>
+ <a href="https://github.com/xberg-io/crawlberg/tree/main/packages/swift">
+ <img src="https://img.shields.io/badge/Swift-SPM-007ec6" alt="Swift">
+ </a>
+ <a href="https://github.com/xberg-io/crawlberg/tree/main/packages/zig">
+ <img src="https://img.shields.io/badge/Zig-package-007ec6" alt="Zig">
+ </a>
+ <a href="https://github.com/xberg-io/crawlberg/releases">
+ <img src="https://img.shields.io/badge/C-FFI-007ec6" alt="C FFI">
+ </a>
+ <a href="https://github.com/xberg-io/crawlberg/pkgs/container/crawlberg">
+ <img src="https://img.shields.io/badge/Docker-ghcr.io-007ec6?logo=docker&logoColor=white" alt="Docker">
+ </a>
+
+ <!-- Project Info -->
+ <a href="https://github.com/xberg-io/crawlberg/blob/main/LICENSE">
+ <img src="https://img.shields.io/badge/License-MIT-007ec6" alt="License">
+ </a>
+ <a href="https://docs.crawlberg.xberg.io">
+ <img src="https://img.shields.io/badge/Docs-crawlberg-007ec6" alt="Documentation">
+ </a>
+ </div>
+
+ <div align="center" style="display: flex; flex-wrap: wrap; gap: 12px; justify-content: center; margin: 28px 0 24px;">
+ <a href="https://discord.gg/xt9WY3GnKR">
+ <img height="22" src="https://img.shields.io/badge/Discord-Chat-007ec6?logo=discord&logoColor=white" alt="Join Discord">
+ </a>
+ </div>
+
+ Ruby bindings for **crawlberg** — a high-performance Rust web crawling engine. Powered by
+ Magnus with native Ruby objects, full metadata extraction, and Markdown conversion.
+
+ ## What This Package Provides
+
+ - **Same crawler as every binding** — one Rust engine behind Python, Node.js, Ruby, Go, Java, .NET, PHP, Elixir, Dart, Kotlin Android, Swift, Zig, WASM, and C FFI.
+ - **Structured scrape output** — HTML, Markdown, metadata, links, assets, response headers, and extraction warnings with consistent field names.
+ - **Crawl controls** — depth, page limits, concurrency, URL filters, robots/sitemap handling, rate limits, and partial failure reporting.
+ - **Rendering path** — optional browser rendering for JavaScript-heavy pages; direct HTTP path for fast static pages.
+ - **Ruby package** — Magnus-backed native extension with Ruby objects for crawl results.
+
+ ## Installation
+
+ ```bash
+ gem install crawlberg
+ ```
+
+ ## Agent plugin
+
+ The `crawlberg` plugin is available via the `xberg-io/plugins` marketplace.
+
+ ```text
+ /plugin marketplace add xberg-io/plugins
+ /plugin install crawlberg@xberg
+ ```
+
+ Works with Claude Code, Codex, Cursor, Gemini CLI, Factory Droid, GitHub Copilot CLI, and opencode. See [the marketplace README](https://github.com/xberg-io/plugins) for harness-specific install instructions.
+
+ ## Quick Start
+
+ ```ruby title="Ruby"
+ require "crawlberg"
+
+ # Simplest case: scrape a single page with default settings.
+ engine = Crawlberg.create_engine
+ result = Crawlberg.scrape(engine, "https://example.com/")
+ puts "Title: #{result.metadata.title}"
+ puts "Status: #{result.status_code}"
+ puts "Links found: #{result.links.length}"
+
+ # Crawl from a seed URL, limited to one hop and a handful of pages.
+ config = Crawlberg::CrawlConfig.new(max_depth: 1, max_pages: 5)
+ crawl_engine = Crawlberg.create_engine(config)
+ crawl_result = Crawlberg.crawl(crawl_engine, "https://en.wikipedia.org/wiki/Web_scraping")
+ puts "Pages crawled: #{crawl_result.pages.length}"
+ ```
+
+ ## API Reference
+
+ Full API documentation is available at [docs.crawlberg.xberg.io](https://docs.crawlberg.xberg.io).
+
+ Key functions:
+
+ - `create_engine(config?)` — Create a crawl engine with optional configuration
+ - `scrape(engine, url)` — Scrape a single URL
+ - `crawl(engine, url)` — Crawl a website following links
+ - `map_urls(engine, url)` — Discover all pages on a site
+ - `batch_scrape(engine, urls)` — Scrape multiple URLs concurrently
+ - `batch_crawl(engine, urls)` — Crawl multiple seed URLs concurrently
+
+ ## Contributing
+
+ Contributions are welcome! Please see our [Contributing Guide](https://github.com/xberg-io/crawlberg/blob/main/CONTRIBUTING.md) for details.
+
+ ## Part of Xberg.dev
+
+ - [Xberg](https://github.com/xberg-io/xberg) — document intelligence: text, tables, metadata from 91+ formats with optional OCR.
+ - [Xberg Enterprise](https://github.com/xberg-io/xberg-enterprise) — managed extraction API with SDKs, dashboards, and observability.
+ - [crawlberg](https://github.com/xberg-io/crawlberg) — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
+ - [html-to-markdown](https://github.com/xberg-io/html-to-markdown) — fast, lossless HTML→Markdown engine.
+ - [liter-llm](https://github.com/xberg-io/liter-llm) — universal LLM API client with native bindings for 14 languages and 143 providers.
+ - [tree-sitter-language-pack](https://github.com/xberg-io/tree-sitter-language-pack) — tree-sitter grammars and code-intelligence primitives.
+ - [alef](https://github.com/xberg-io/alef) — the polyglot binding generator that produces every per-language binding across the 5 polyglot repos.
+ - [Discord](https://discord.gg/xt9WY3GnKR) — community, roadmap, announcements.
+
+ ## License
+
+ This project is licensed under [MIT License](https://github.com/xberg-io/crawlberg/blob/main/LICENSE).
+
+ ## Links
+
+ - [Documentation](https://docs.crawlberg.xberg.io)
+ - [GitHub Repository](https://github.com/xberg-io/crawlberg)
+ - [Issue Tracker](https://github.com/xberg-io/crawlberg/issues)
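The README's Quick Start reads three fields from a scrape result: `metadata.title`, `status_code`, and `links`. The sketch below shows one way code that consumes those fields could be exercised without the native extension installed. `FakeMetadata`, `FakeResult`, and `summarize_scrape` are hypothetical names introduced here, and the `nil` title fallback is an assumption, since the README does not say whether a title can be missing.

```ruby
# Stand-ins (assumption) for the objects returned by Crawlberg.scrape; only
# the fields used in the Quick Start are modeled.
FakeMetadata = Struct.new(:title)
FakeResult = Struct.new(:metadata, :status_code, :links)

# Hypothetical helper: turn a scrape result into a one-line summary.
# Works on any object that responds to metadata.title, status_code and links.
def summarize_scrape(result)
  title = result.metadata.title || "(untitled)"
  "#{result.status_code} #{title} (#{result.links.length} links)"
end

sample = FakeResult.new(
  FakeMetadata.new("Example Domain"),
  200,
  ["https://www.iana.org/domains/example"]
)
puts summarize_scrape(sample)
```

With the gem installed, the same helper should accept the value returned by `Crawlberg.scrape(engine, url)`, provided the result exposes these fields as the Quick Start suggests.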
data/Steepfile ADDED
@@ -0,0 +1,14 @@
+ # frozen_string_literal: true
+
+ target :lib do
+   signature "sig"
+   check "lib"
+   # The generated `lib/crawlberg/native.rb` carries inline Sorbet
+   # `sig { ... }` blocks on tagged-enum variant Data classes. Sorbet's runtime
+   # provides those via `extend T::Sig`, but Steep does not understand the
+   # extension (it relies on RBS, not Sorbet sigs) and reports
+   # `Type `self` does not have method `sig`` on every block. RBS coverage
+   # for the same surface lives in `sig/types.rbs`, so we steer Steep to the
+   # RBS file by ignoring the .rb.
+   ignore "lib/crawlberg/native.rb"
+ end
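The Steepfile comment turns on the fact that `sig { ... }` is an ordinary class-level method call: it resolves only because the class extends a module that defines `sig`. The sketch below illustrates that with a no-op shim. `NoopSig` and `PageOk` are hypothetical names, and the shim is not Sorbet's implementation; it only stands in for the method that `extend T::Sig` supplies at runtime.

```ruby
# Hypothetical no-op shim standing in for what `extend T::Sig` provides at
# runtime. This is NOT Sorbet's implementation; it only makes `sig` resolvable.
module NoopSig
  def sig(&_block)
    nil # the signature block is ignored and never evaluated
  end
end

# Hypothetical class in the style of a generated variant class.
class PageOk
  extend NoopSig # without this line, the `sig` call below raises NoMethodError

  sig { returns(Integer) } # a plain method call on the class itself
  def status_code
    200
  end
end

puts PageOk.new.status_code
```

A checker that types the class body from RBS alone sees a call to `sig` on `self` with no matching declaration, which is consistent with the error quoted in the comment and with the choice to ignore the generated file in favor of `sig/types.rbs`.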