kreuzcrawl 0.3.0.pre.rc.19 → 0.3.0.pre.rc.42

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: d09ae66f4d1e1225eff324784cd63c0d255fb9d4a3889bb3ae94c4e35c58c6de
- data.tar.gz: fa17e9c1fc7ec56401a396044476d83a596cd445481ed01a4e871662b12a354e
+ metadata.gz: be6d4e8418112b5bf2e374769ad30d974f4fe9f7faf39cbea4d01adabc647620
+ data.tar.gz: d99861fcb4000219d16b0487e5c93f87ca19fe288dc4d00ef028c964dce63178
  SHA512:
- metadata.gz: 5260d33d43f4fdd815a69a62e062efa8607a25bfe0d353d349d698a6beddf8a239ce7135bd17eff750cb865a0cfc9619a0856be1250ebbeaf66e99dbc7f52284
- data.tar.gz: bc7f16750f3c99ac60aa05566e41c07d5eb9dea8dc2118e73a646a398b764f7e0c3996c703c2739756ac0bda42d8e7833c4524771b1ddf59027aa5068c055898
+ metadata.gz: 4f3dd04f57b59ddd9a64619dadb0a5c2100b5bf41faff9f020fa2aa11b9263b912daf9542a8cd221d3d1cfe69c2b3bf27576513b239cf92a3d7f19327be3d3ba
+ data.tar.gz: a32941dad01241666ec4fab94bc15e048463ac8f8fa4a6db8dc470fc94e565bdc963fa37c5d41527373ef2b8dc1b287823bb369ddde660a764ee51c129d0d6b0
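The SHA256 and SHA512 entries in checksums.yaml are hex digests of the two archive members packed inside the gem, `metadata.gz` and `data.tar.gz`. As a minimal sketch of how such entries could be checked, the Ruby below compares recorded digests against the bytes of each member using only the standard `digest` library. The helper name `verify_checksums` and the shape of its arguments are hypothetical and are not part of RubyGems or kreuzcrawl.

```ruby
require "digest"

# Hypothetical helper (not a RubyGems or kreuzcrawl API).
# expected: { "SHA256" => { "metadata.gz" => "<hex>", ... }, "SHA512" => { ... } }
# members:  { "metadata.gz" => "<raw bytes>", "data.tar.gz" => "<raw bytes>" }
# Returns a list of "<algorithm> <member>" strings for every digest that differs.
def verify_checksums(expected, members)
  algorithms = { "SHA256" => Digest::SHA256, "SHA512" => Digest::SHA512 }
  mismatches = []

  expected.each do |algorithm, entries|
    digest_class = algorithms.fetch(algorithm)
    entries.each do |name, recorded_hex|
      actual_hex = digest_class.hexdigest(members.fetch(name))
      mismatches << "#{algorithm} #{name}" unless actual_hex == recorded_hex
    end
  end

  mismatches
end
```

An empty return value means every recorded digest matched. Reading the members out of a `.gem` file (a plain tar archive) is left out of the sketch; the bytes are assumed to have been loaded already.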
data/LICENSE ADDED
@@ -0,0 +1,93 @@
+ Elastic License 2.0 (ELv2)
+
+ Copyright 2025-2026 Kreuzberg, Inc.
+
+ Acceptance
+
+ By using the software, you agree to all of the terms and conditions below.
+
+ Copyright License
+
+ The licensor grants you a non-exclusive, royalty-free, worldwide,
+ non-sublicensable, non-transferable license to use, copy, distribute, make
+ available, and prepare derivative works of the software, in each case subject to
+ the limitations and conditions below.
+
+ Limitations
+
+ You may not provide the software to third parties as a hosted or managed
+ service, where the service provides users with access to any substantial set of
+ the features or functionality of the software.
+
+ You may not move, change, disable, or circumvent the license key functionality
+ in the software, and you may not remove or obscure any functionality in the
+ software that is protected by the license key.
+
+ You may not alter, remove, or obscure any licensing, copyright, or other notices
+ of the licensor in the software. Any use of the licensor's trademarks is subject
+ to applicable law.
+
+ Patents
+
+ The licensor grants you a license, under any patent claims the licensor can
+ license, or becomes able to license, to make, have made, use, sell, offer for
+ sale, import and have imported the software, in each case subject to the
+ limitations and conditions in this license. This license does not cover any
+ patent claims that you cause to be infringed by modifications or additions to the
+ software. If you or your company make any written claim that the software
+ infringes or contributes to infringement of any patent, your patent license for
+ the software granted under these terms ends immediately. If your company makes
+ such a claim, your patent license ends immediately for work on behalf of your
+ company.
+
+ Notices
+
+ You must ensure that anyone who gets a copy of any part of the software from you
+ also gets a copy of these terms.
+
+ If you modify the software, you must include in any modified copies of the
+ software prominent notices stating that you have modified the software.
+
+ No Other Rights
+
+ These terms do not imply any licenses other than those expressly granted in
+ these terms.
+
+ Termination
+
+ If you use the software in violation of these terms, such use is not licensed,
+ and your licenses will automatically terminate. If the licensor provides you with
+ a notice of your violation, and you cease all violation of this license no later
+ than 30 days after you receive that notice, your licenses will be reinstated
+ retroactively. However, if you violate these terms after such reinstatement, any
+ additional violation of these terms will cause your licenses to terminate
+ automatically and permanently.
+
+ No Liability
+
+ As far as the law allows, the software comes as is, without any warranty or
+ condition, and the licensor will not be liable to you for any damages arising out
+ of these terms or the use or nature of the software, under any kind of legal
+ claim.
+
+ Definitions
+
+ The licensor is the entity offering these terms, and the software is the
+ software the licensor makes available under these terms, including any portion
+ of it.
+
+ you refers to the individual or entity agreeing to these terms.
+
+ your company is any legal entity, sole proprietorship, or other kind of
+ organization that you work for, plus all organizations that have control over,
+ are under the control of, or are under common control with that organization.
+ control means ownership of substantially all the assets of an entity, or the
+ power to direct its management and policies by vote, contract, or otherwise.
+ Control can be direct or indirect.
+
+ your licenses are all the licenses granted to you for the software under these
+ terms.
+
+ use means anything you do with the software requiring one of your licenses.
+
+ trademark means trademarks, service marks, and similar rights.
data/README.md ADDED
@@ -0,0 +1,150 @@
+ # kreuzcrawl
+
+ <div align="center" style="display: flex; flex-wrap: wrap; gap: 8px; justify-content: center; margin: 20px 0;">
+ <a href="https://github.com/kreuzberg-dev/alef">
+ <img src="https://img.shields.io/badge/Bindings-alef%20%D7%90-007ec6" alt="Bindings">
+ </a>
+ <!-- Language Bindings -->
+ <a href="https://crates.io/crates/kreuzcrawl">
+ <img src="https://img.shields.io/crates/v/kreuzcrawl?label=Rust&color=007ec6" alt="Rust">
+ </a>
+ <a href="https://pypi.org/project/kreuzcrawl/">
+ <img src="https://img.shields.io/pypi/v/kreuzcrawl?label=Python&color=007ec6" alt="Python">
+ </a>
+ <a href="https://www.npmjs.com/package/@kreuzberg/kreuzcrawl">
+ <img src="https://img.shields.io/npm/v/@kreuzberg/kreuzcrawl?label=Node.js&color=007ec6" alt="Node.js">
+ </a>
+ <a href="https://www.npmjs.com/package/@kreuzberg/kreuzcrawl-wasm">
+ <img src="https://img.shields.io/npm/v/@kreuzberg/kreuzcrawl-wasm?label=WASM&color=007ec6" alt="WASM">
+ </a>
+ <a href="https://central.sonatype.com/artifact/dev.kreuzberg.kreuzcrawl/kreuzcrawl">
+ <img src="https://img.shields.io/maven-central/v/dev.kreuzberg.kreuzcrawl/kreuzcrawl?label=Java&color=007ec6" alt="Java">
+ </a>
+ <a href="https://pkg.go.dev/github.com/kreuzberg-dev/kreuzcrawl/packages/go">
+ <img src="https://img.shields.io/github/v/tag/kreuzberg-dev/kreuzcrawl?label=Go&color=007ec6" alt="Go">
+ </a>
+ <a href="https://www.nuget.org/packages/Kreuzcrawl/">
+ <img src="https://img.shields.io/nuget/v/Kreuzcrawl?label=C%23&color=007ec6" alt="C#">
+ </a>
+ <a href="https://packagist.org/packages/kreuzberg-dev/kreuzcrawl">
+ <img src="https://img.shields.io/packagist/v/kreuzberg-dev/kreuzcrawl?label=PHP&color=007ec6" alt="PHP">
+ </a>
+ <a href="https://rubygems.org/gems/kreuzcrawl">
+ <img src="https://img.shields.io/gem/v/kreuzcrawl?label=Ruby&color=007ec6" alt="Ruby">
+ </a>
+ <a href="https://hex.pm/packages/kreuzcrawl">
+ <img src="https://img.shields.io/hexpm/v/kreuzcrawl?label=Elixir&color=007ec6" alt="Elixir">
+ </a>
+ <a href="https://pub.dev/packages/kreuzcrawl">
+ <img src="https://img.shields.io/pub/v/kreuzcrawl?label=Dart&color=007ec6" alt="Dart">
+ </a>
+ <a href="https://central.sonatype.com/artifact/dev.kreuzberg.kreuzcrawl.android/kreuzcrawl-android">
+ <img src="https://img.shields.io/maven-central/v/dev.kreuzberg.kreuzcrawl.android/kreuzcrawl-android?label=Kotlin&color=007ec6" alt="Kotlin">
+ </a>
+ <a href="https://github.com/kreuzberg-dev/kreuzcrawl/tree/main/packages/swift">
+ <img src="https://img.shields.io/badge/Swift-SPM-007ec6" alt="Swift">
+ </a>
+ <a href="https://github.com/kreuzberg-dev/kreuzcrawl/tree/main/packages/zig">
+ <img src="https://img.shields.io/badge/Zig-package-007ec6" alt="Zig">
+ </a>
+ <a href="https://github.com/kreuzberg-dev/kreuzcrawl/releases">
+ <img src="https://img.shields.io/badge/C-FFI-007ec6" alt="C FFI">
+ </a>
+ <a href="https://github.com/kreuzberg-dev/kreuzcrawl/pkgs/container/kreuzcrawl">
+ <img src="https://img.shields.io/badge/Docker-ghcr.io-007ec6?logo=docker&logoColor=white" alt="Docker">
+ </a>
+
+ <!-- Project Info -->
+ <a href="https://github.com/kreuzberg-dev/kreuzcrawl/blob/main/LICENSE">
+ <img src="https://img.shields.io/badge/License-Elastic--2.0-007ec6" alt="License">
+ </a>
+ <a href="https://docs.kreuzcrawl.kreuzberg.dev">
+ <img src="https://img.shields.io/badge/Docs-kreuzcrawl-007ec6" alt="Documentation">
+ </a>
+ </div>
+
+ <div align="center" style="margin: 24px 0 0;">
+ <a href="https://kreuzberg.dev">
+ <img alt="Kreuzcrawl" src="https://raw.githubusercontent.com/kreuzberg-dev/kreuzcrawl/main/docs/assets/docs_top_banner.svg" />
+ </a>
+ </div>
+
+ <div align="center" style="display: flex; flex-wrap: wrap; gap: 12px; justify-content: center; margin: 28px 0 24px;">
+ <a href="https://discord.gg/xt9WY3GnKR">
+ <img height="22" src="https://img.shields.io/badge/Discord-Chat-007ec6?logo=discord&logoColor=white" alt="Join Discord">
+ </a>
+ </div>
+
+ Ruby bindings for **kreuzcrawl** — a high-performance Rust web crawling engine. Powered by
+ Magnus with native Ruby objects, full metadata extraction, and Markdown conversion.
+
+ ## What This Package Provides
+
+ - **Same crawler as every binding** — one Rust engine behind Python, Node.js, Ruby, Go, Java, .NET, PHP, Elixir, Dart, Kotlin Android, Swift, Zig, WASM, and C FFI.
+ - **Structured scrape output** — HTML, Markdown, metadata, links, assets, response headers, and extraction warnings with consistent field names.
+ - **Crawl controls** — depth, page limits, concurrency, URL filters, robots/sitemap handling, rate limits, and partial failure reporting.
+ - **Rendering path** — optional browser rendering for JavaScript-heavy pages; direct HTTP path for fast static pages.
+ - **Ruby package** — Magnus-backed native extension with Ruby objects for crawl results.
+
+ ## Installation
+
+ ```bash
+ gem install kreuzcrawl
+ ```
+
+ ## Quick Start
+
+ ```ruby title="Ruby"
+ require "kreuzcrawl"
+
+ # Simplest case: scrape a single page with default settings.
+ engine = Kreuzcrawl.create_engine
+ result = Kreuzcrawl.scrape(engine, "https://example.com/")
+ puts "Title: #{result.metadata.title}"
+ puts "Status: #{result.status_code}"
+ puts "Links found: #{result.links.length}"
+
+ # Crawl from a seed URL, limited to one hop and a handful of pages.
+ config = Kreuzcrawl::CrawlConfig.new(max_depth: 1, max_pages: 5)
+ crawl_engine = Kreuzcrawl.create_engine(config)
+ crawl_result = Kreuzcrawl.crawl(crawl_engine, "https://en.wikipedia.org/wiki/Web_scraping")
+ puts "Pages crawled: #{crawl_result.pages.length}"
+ ```
+
+ ## API Reference
+
+ Full API documentation is available at [docs.kreuzcrawl.kreuzberg.dev](https://docs.kreuzcrawl.kreuzberg.dev).
+
+ Key functions:
+
+ - `create_engine(config?)` — Create a crawl engine with optional configuration
+ - `scrape(engine, url)` — Scrape a single URL
+ - `crawl(engine, url)` — Crawl a website following links
+ - `map_urls(engine, url)` — Discover all pages on a site
+ - `batch_scrape(engine, urls)` — Scrape multiple URLs concurrently
+ - `batch_crawl(engine, urls)` — Crawl multiple seed URLs concurrently
+
+ ## Contributing
+
+ Contributions are welcome! Please see our [Contributing Guide](https://github.com/kreuzberg-dev/kreuzcrawl/blob/main/CONTRIBUTING.md) for details.
+
+ ## Part of Kreuzberg.dev
+
+ - [Kreuzberg](https://github.com/kreuzberg-dev/kreuzberg) — document intelligence: text, tables, metadata from 90+ formats with optional OCR.
+ - [Kreuzberg Cloud](https://github.com/kreuzberg-dev/kreuzberg-cloud) — managed extraction API with SDKs, dashboards, and observability.
+ - [html-to-markdown](https://github.com/kreuzberg-dev/html-to-markdown) — fast, lossless HTML→Markdown engine.
+ - [liter-llm](https://github.com/kreuzberg-dev/liter-llm) — universal LLM API client with native bindings for 14 languages and 143 providers.
+ - [tree-sitter-language-pack](https://github.com/kreuzberg-dev/tree-sitter-language-pack) — tree-sitter grammars and code-intelligence primitives.
+ - [alef](https://github.com/kreuzberg-dev/alef) — the polyglot binding generator that produces this README and all per-language bindings.
+ - [Discord](https://discord.gg/xt9WY3GnKR) — community, roadmap, announcements.
+
+ ## License
+
+ This project is licensed under [Elastic License 2.0](https://github.com/kreuzberg-dev/kreuzcrawl/blob/main/LICENSE).
+
+ ## Links
+
+ - [Documentation](https://docs.kreuzcrawl.kreuzberg.dev)
+ - [GitHub Repository](https://github.com/kreuzberg-dev/kreuzcrawl)
+ - [Issue Tracker](https://github.com/kreuzberg-dev/kreuzcrawl/issues)
+ - [Issues](https://github.com/kreuzberg-dev/kreuzcrawl/issues)
data/Steepfile ADDED
@@ -0,0 +1,14 @@
+ # frozen_string_literal: true
+
+ target :lib do
+ signature "sig"
+ check "lib"
+ # The generated `lib/kreuzcrawl/native.rb` carries inline Sorbet
+ # `sig { ... }` blocks on tagged-enum variant Data classes. Sorbet's runtime
+ # provides those via `extend T::Sig`, but Steep does not understand the
+ # extension (it relies on RBS, not Sorbet sigs) and reports
+ # `Type `self` does not have method `sig`` on every block. RBS coverage
+ # for the same surface lives in `sig/types.rbs`, so we steer Steep to the
+ # RBS file by ignoring the .rb.
+ ignore "lib/kreuzcrawl/native.rb"
+ end
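The Steepfile comment turns on the fact that `sig { ... }` is an ordinary method call, available only on classes that extend a module providing it. The sketch below illustrates that mechanism in plain Ruby. `TinySig`, `WithSig`, and `define_without_extend` are hypothetical names used for illustration; `TinySig` is a stand-in for Sorbet's `T::Sig`, not its implementation, and nothing here comes from the gem's generated `native.rb`.

```ruby
# Hypothetical stand-in for Sorbet's T::Sig: `sig` is a normal method that
# exists only on classes that extend this module.
module TinySig
  def sig(&block)
    (@recorded_sigs ||= []) << block
  end

  def recorded_sigs
    @recorded_sigs || []
  end
end

class WithSig
  extend TinySig

  sig { :string } # resolves at runtime because of `extend TinySig`
  def label
    "ok"
  end
end

# Without the `extend`, the same call fails while the class body is evaluated.
# Returns the name of the missing method.
def define_without_extend
  Class.new do
    sig { :string }
  end
rescue NoMethodError => e
  e.name
end
```

A checker that does not model the `extend` sees only the second situation, which is consistent with the "does not have method `sig`" report quoted in the Steepfile comment.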