kreuzcrawl 0.3.0.pre.rc.19 → 0.3.0.pre.rc.42
- checksums.yaml +4 -4
- data/LICENSE +93 -0
- data/README.md +150 -0
- data/Steepfile +14 -0
- data/ext/kreuzcrawl_rb/native/Cargo.lock +491 -13
- data/ext/kreuzcrawl_rb/native/Cargo.toml +12 -9
- data/ext/kreuzcrawl_rb/native/extconf.rb +11 -0
- data/ext/kreuzcrawl_rb/src/lib.rs +3843 -1114
- data/lib/kreuzcrawl/native.rb +613 -0
- data/{ext/kreuzcrawl_rb/src → lib}/kreuzcrawl/version.rb +2 -3
- data/lib/kreuzcrawl.rb +10 -1
- data/lib/kreuzcrawl_rb.so +0 -0
- data/sig/types.rbs +519 -0
- metadata +39 -13
- data/ext/kreuzcrawl_rb/Cargo.lock +0 -3163
- data/ext/kreuzcrawl_rb/Cargo.toml +0 -15
- data/ext/kreuzcrawl_rb/extconf.rb +0 -10
- data/ext/kreuzcrawl_rb/src/kreuzcrawl.rb +0 -13
- data/lib/kreuzcrawl_rb.bundle +0 -0
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: be6d4e8418112b5bf2e374769ad30d974f4fe9f7faf39cbea4d01adabc647620
+  data.tar.gz: d99861fcb4000219d16b0487e5c93f87ca19fe288dc4d00ef028c964dce63178
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 4f3dd04f57b59ddd9a64619dadb0a5c2100b5bf41faff9f020fa2aa11b9263b912daf9542a8cd221d3d1cfe69c2b3bf27576513b239cf92a3d7f19327be3d3ba
+  data.tar.gz: a32941dad01241666ec4fab94bc15e048463ac8f8fa4a6db8dc470fc94e565bdc963fa37c5d41527373ef2b8dc1b287823bb369ddde660a764ee51c129d0d6b0
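The SHA256 and SHA512 entries in checksums.yaml are hex digests of the gem's `metadata.gz` and `data.tar.gz` members. As a minimal sketch of how such a digest could be checked with Ruby's standard `digest` library: the helper name `digest_matches?` and the sample input are hypothetical, chosen for illustration, and are not part of kreuzcrawl or RubyGems.

```ruby
require "digest"

# Hypothetical helper: true when the digest of `bytes` equals `expected_hex`.
# `algorithm` is :sha256 or :sha512, mirroring the two sections of checksums.yaml.
def digest_matches?(bytes, expected_hex, algorithm: :sha256)
  klass = { sha256: Digest::SHA256, sha512: Digest::SHA512 }.fetch(algorithm)
  klass.hexdigest(bytes) == expected_hex.strip.downcase
end

# Illustrative input only. A real check would read the member from the
# unpacked gem instead, e.g. File.binread("data.tar.gz").
sample = "abc"
expected = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
puts digest_matches?(sample, expected) # prints "true"
```

Passing `algorithm: :sha512` together with the corresponding SHA512 entry would check the longer digest in the same way.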
data/LICENSE
ADDED
@@ -0,0 +1,93 @@
+Elastic License 2.0 (ELv2)
+
+Copyright 2025-2026 Kreuzberg, Inc.
+
+Acceptance
+
+By using the software, you agree to all of the terms and conditions below.
+
+Copyright License
+
+The licensor grants you a non-exclusive, royalty-free, worldwide,
+non-sublicensable, non-transferable license to use, copy, distribute, make
+available, and prepare derivative works of the software, in each case subject to
+the limitations and conditions below.
+
+Limitations
+
+You may not provide the software to third parties as a hosted or managed
+service, where the service provides users with access to any substantial set of
+the features or functionality of the software.
+
+You may not move, change, disable, or circumvent the license key functionality
+in the software, and you may not remove or obscure any functionality in the
+software that is protected by the license key.
+
+You may not alter, remove, or obscure any licensing, copyright, or other notices
+of the licensor in the software. Any use of the licensor's trademarks is subject
+to applicable law.
+
+Patents
+
+The licensor grants you a license, under any patent claims the licensor can
+license, or becomes able to license, to make, have made, use, sell, offer for
+sale, import and have imported the software, in each case subject to the
+limitations and conditions in this license. This license does not cover any
+patent claims that you cause to be infringed by modifications or additions to the
+software. If you or your company make any written claim that the software
+infringes or contributes to infringement of any patent, your patent license for
+the software granted under these terms ends immediately. If your company makes
+such a claim, your patent license ends immediately for work on behalf of your
+company.
+
+Notices
+
+You must ensure that anyone who gets a copy of any part of the software from you
+also gets a copy of these terms.
+
+If you modify the software, you must include in any modified copies of the
+software prominent notices stating that you have modified the software.
+
+No Other Rights
+
+These terms do not imply any licenses other than those expressly granted in
+these terms.
+
+Termination
+
+If you use the software in violation of these terms, such use is not licensed,
+and your licenses will automatically terminate. If the licensor provides you with
+a notice of your violation, and you cease all violation of this license no later
+than 30 days after you receive that notice, your licenses will be reinstated
+retroactively. However, if you violate these terms after such reinstatement, any
+additional violation of these terms will cause your licenses to terminate
+automatically and permanently.
+
+No Liability
+
+As far as the law allows, the software comes as is, without any warranty or
+condition, and the licensor will not be liable to you for any damages arising out
+of these terms or the use or nature of the software, under any kind of legal
+claim.
+
+Definitions
+
+The licensor is the entity offering these terms, and the software is the
+software the licensor makes available under these terms, including any portion
+of it.
+
+you refers to the individual or entity agreeing to these terms.
+
+your company is any legal entity, sole proprietorship, or other kind of
+organization that you work for, plus all organizations that have control over,
+are under the control of, or are under common control with that organization.
+control means ownership of substantially all the assets of an entity, or the
+power to direct its management and policies by vote, contract, or otherwise.
+Control can be direct or indirect.
+
+your licenses are all the licenses granted to you for the software under these
+terms.
+
+use means anything you do with the software requiring one of your licenses.
+
+trademark means trademarks, service marks, and similar rights.
data/README.md
ADDED
@@ -0,0 +1,150 @@
+# kreuzcrawl
+
+<div align="center" style="display: flex; flex-wrap: wrap; gap: 8px; justify-content: center; margin: 20px 0;">
+<a href="https://github.com/kreuzberg-dev/alef">
+<img src="https://img.shields.io/badge/Bindings-alef%20%D7%90-007ec6" alt="Bindings">
+</a>
+<!-- Language Bindings -->
+<a href="https://crates.io/crates/kreuzcrawl">
+<img src="https://img.shields.io/crates/v/kreuzcrawl?label=Rust&color=007ec6" alt="Rust">
+</a>
+<a href="https://pypi.org/project/kreuzcrawl/">
+<img src="https://img.shields.io/pypi/v/kreuzcrawl?label=Python&color=007ec6" alt="Python">
+</a>
+<a href="https://www.npmjs.com/package/@kreuzberg/kreuzcrawl">
+<img src="https://img.shields.io/npm/v/@kreuzberg/kreuzcrawl?label=Node.js&color=007ec6" alt="Node.js">
+</a>
+<a href="https://www.npmjs.com/package/@kreuzberg/kreuzcrawl-wasm">
+<img src="https://img.shields.io/npm/v/@kreuzberg/kreuzcrawl-wasm?label=WASM&color=007ec6" alt="WASM">
+</a>
+<a href="https://central.sonatype.com/artifact/dev.kreuzberg.kreuzcrawl/kreuzcrawl">
+<img src="https://img.shields.io/maven-central/v/dev.kreuzberg.kreuzcrawl/kreuzcrawl?label=Java&color=007ec6" alt="Java">
+</a>
+<a href="https://pkg.go.dev/github.com/kreuzberg-dev/kreuzcrawl/packages/go">
+<img src="https://img.shields.io/github/v/tag/kreuzberg-dev/kreuzcrawl?label=Go&color=007ec6" alt="Go">
+</a>
+<a href="https://www.nuget.org/packages/Kreuzcrawl/">
+<img src="https://img.shields.io/nuget/v/Kreuzcrawl?label=C%23&color=007ec6" alt="C#">
+</a>
+<a href="https://packagist.org/packages/kreuzberg-dev/kreuzcrawl">
+<img src="https://img.shields.io/packagist/v/kreuzberg-dev/kreuzcrawl?label=PHP&color=007ec6" alt="PHP">
+</a>
+<a href="https://rubygems.org/gems/kreuzcrawl">
+<img src="https://img.shields.io/gem/v/kreuzcrawl?label=Ruby&color=007ec6" alt="Ruby">
+</a>
+<a href="https://hex.pm/packages/kreuzcrawl">
+<img src="https://img.shields.io/hexpm/v/kreuzcrawl?label=Elixir&color=007ec6" alt="Elixir">
+</a>
+<a href="https://pub.dev/packages/kreuzcrawl">
+<img src="https://img.shields.io/pub/v/kreuzcrawl?label=Dart&color=007ec6" alt="Dart">
+</a>
+<a href="https://central.sonatype.com/artifact/dev.kreuzberg.kreuzcrawl.android/kreuzcrawl-android">
+<img src="https://img.shields.io/maven-central/v/dev.kreuzberg.kreuzcrawl.android/kreuzcrawl-android?label=Kotlin&color=007ec6" alt="Kotlin">
+</a>
+<a href="https://github.com/kreuzberg-dev/kreuzcrawl/tree/main/packages/swift">
+<img src="https://img.shields.io/badge/Swift-SPM-007ec6" alt="Swift">
+</a>
+<a href="https://github.com/kreuzberg-dev/kreuzcrawl/tree/main/packages/zig">
+<img src="https://img.shields.io/badge/Zig-package-007ec6" alt="Zig">
+</a>
+<a href="https://github.com/kreuzberg-dev/kreuzcrawl/releases">
+<img src="https://img.shields.io/badge/C-FFI-007ec6" alt="C FFI">
+</a>
+<a href="https://github.com/kreuzberg-dev/kreuzcrawl/pkgs/container/kreuzcrawl">
+<img src="https://img.shields.io/badge/Docker-ghcr.io-007ec6?logo=docker&logoColor=white" alt="Docker">
+</a>
+
+<!-- Project Info -->
+<a href="https://github.com/kreuzberg-dev/kreuzcrawl/blob/main/LICENSE">
+<img src="https://img.shields.io/badge/License-Elastic--2.0-007ec6" alt="License">
+</a>
+<a href="https://docs.kreuzcrawl.kreuzberg.dev">
+<img src="https://img.shields.io/badge/Docs-kreuzcrawl-007ec6" alt="Documentation">
+</a>
+</div>
+
+<div align="center" style="margin: 24px 0 0;">
+<a href="https://kreuzberg.dev">
+<img alt="Kreuzcrawl" src="https://raw.githubusercontent.com/kreuzberg-dev/kreuzcrawl/main/docs/assets/docs_top_banner.svg" />
+</a>
+</div>
+
+<div align="center" style="display: flex; flex-wrap: wrap; gap: 12px; justify-content: center; margin: 28px 0 24px;">
+<a href="https://discord.gg/xt9WY3GnKR">
+<img height="22" src="https://img.shields.io/badge/Discord-Chat-007ec6?logo=discord&logoColor=white" alt="Join Discord">
+</a>
+</div>
+
+Ruby bindings for **kreuzcrawl** — a high-performance Rust web crawling engine. Powered by
+Magnus with native Ruby objects, full metadata extraction, and Markdown conversion.
+
+## What This Package Provides
+
+- **Same crawler as every binding** — one Rust engine behind Python, Node.js, Ruby, Go, Java, .NET, PHP, Elixir, Dart, Kotlin Android, Swift, Zig, WASM, and C FFI.
+- **Structured scrape output** — HTML, Markdown, metadata, links, assets, response headers, and extraction warnings with consistent field names.
+- **Crawl controls** — depth, page limits, concurrency, URL filters, robots/sitemap handling, rate limits, and partial failure reporting.
+- **Rendering path** — optional browser rendering for JavaScript-heavy pages; direct HTTP path for fast static pages.
+- **Ruby package** — Magnus-backed native extension with Ruby objects for crawl results.
+
+## Installation
+
+```bash
+gem install kreuzcrawl
+```
+
+## Quick Start
+
+```ruby title="Ruby"
+require "kreuzcrawl"
+
+# Simplest case: scrape a single page with default settings.
+engine = Kreuzcrawl.create_engine
+result = Kreuzcrawl.scrape(engine, "https://example.com/")
+puts "Title: #{result.metadata.title}"
+puts "Status: #{result.status_code}"
+puts "Links found: #{result.links.length}"
+
+# Crawl from a seed URL, limited to one hop and a handful of pages.
+config = Kreuzcrawl::CrawlConfig.new(max_depth: 1, max_pages: 5)
+crawl_engine = Kreuzcrawl.create_engine(config)
+crawl_result = Kreuzcrawl.crawl(crawl_engine, "https://en.wikipedia.org/wiki/Web_scraping")
+puts "Pages crawled: #{crawl_result.pages.length}"
+```
+
+## API Reference
+
+Full API documentation is available at [docs.kreuzcrawl.kreuzberg.dev](https://docs.kreuzcrawl.kreuzberg.dev).
+
+Key functions:
+
+- `create_engine(config?)` — Create a crawl engine with optional configuration
+- `scrape(engine, url)` — Scrape a single URL
+- `crawl(engine, url)` — Crawl a website following links
+- `map_urls(engine, url)` — Discover all pages on a site
+- `batch_scrape(engine, urls)` — Scrape multiple URLs concurrently
+- `batch_crawl(engine, urls)` — Crawl multiple seed URLs concurrently
+
+## Contributing
+
+Contributions are welcome! Please see our [Contributing Guide](https://github.com/kreuzberg-dev/kreuzcrawl/blob/main/CONTRIBUTING.md) for details.
+
+## Part of Kreuzberg.dev
+
+- [Kreuzberg](https://github.com/kreuzberg-dev/kreuzberg) — document intelligence: text, tables, metadata from 90+ formats with optional OCR.
+- [Kreuzberg Cloud](https://github.com/kreuzberg-dev/kreuzberg-cloud) — managed extraction API with SDKs, dashboards, and observability.
+- [html-to-markdown](https://github.com/kreuzberg-dev/html-to-markdown) — fast, lossless HTML→Markdown engine.
+- [liter-llm](https://github.com/kreuzberg-dev/liter-llm) — universal LLM API client with native bindings for 14 languages and 143 providers.
+- [tree-sitter-language-pack](https://github.com/kreuzberg-dev/tree-sitter-language-pack) — tree-sitter grammars and code-intelligence primitives.
+- [alef](https://github.com/kreuzberg-dev/alef) — the polyglot binding generator that produces this README and all per-language bindings.
+- [Discord](https://discord.gg/xt9WY3GnKR) — community, roadmap, announcements.
+
+## License
+
+This project is licensed under [Elastic License 2.0](https://github.com/kreuzberg-dev/kreuzcrawl/blob/main/LICENSE).
+
+## Links
+
+- [Documentation](https://docs.kreuzcrawl.kreuzberg.dev)
+- [GitHub Repository](https://github.com/kreuzberg-dev/kreuzcrawl)
+- [Issue Tracker](https://github.com/kreuzberg-dev/kreuzcrawl/issues)
+- [Issues](https://github.com/kreuzberg-dev/kreuzcrawl/issues)
data/Steepfile
ADDED
@@ -0,0 +1,14 @@
+# frozen_string_literal: true
+
+target :lib do
+  signature "sig"
+  check "lib"
+  # The generated `lib/kreuzcrawl/native.rb` carries inline Sorbet
+  # `sig { ... }` blocks on tagged-enum variant Data classes. Sorbet's runtime
+  # provides those via `extend T::Sig`, but Steep does not understand the
+  # extension (it relies on RBS, not Sorbet sigs) and reports
+  # `Type `self` does not have method `sig`` on every block. RBS coverage
+  # for the same surface lives in `sig/types.rbs`, so we steer Steep to the
+  # RBS file by ignoring the .rb.
+  ignore "lib/kreuzcrawl/native.rb"
+end