firecrawl 0.0.1 → 0.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: d79423650fbec60124bab7710e372f16774987f437e8bb52cb13e72a434bdec6
-   data.tar.gz: bfb0ec5c302e4e646812855d5c954e2fc6c075e4a03af495a6a09815ce588e44
+   metadata.gz: 7e55dc5e433f0632ab0c11feda818bf82ddf77ae8ea2fdaa624c5a9af1dccf4c
+   data.tar.gz: 2781378e0a6b62c2e7befb0ebd0298b0b91b2cf249fe33115dc675606aed7bfe
  SHA512:
-   metadata.gz: 46c819135a19388beb434c797e16a3a73f52b1d85cc37f317e03dc159de8ed1a56a6a5170f2e2457e2798fbff25d93cfbf11f7be12682fcb6ef13534aa347f62
-   data.tar.gz: ffe9b239c29138617a1902f8502099b6f0b2019eeed8b266cd8f77e94eee868e8e876ee9f8545cd584d701a8bfffa75f6552508b2949463f2cc187580eeb35fe
+   metadata.gz: 6fa88114c36df02f9cd261159132298e9a44b01464334daeb31ea2d8d5b11122321066ddf55207fddc3bef72704b353cf42d7aebaa46ac70298ea1efe19c6885
+   data.tar.gz: 622fd277c01854131b4a21742c915359c97fffa54b4dc41aff47d09b06d4d7c6c485361ff6409dc1361cf561f313d42096674a426987b5bbf696dcba1f52cd96
data/README.md ADDED
@@ -0,0 +1,213 @@
+ # Firecrawl
+
+ Firecrawl is a lightweight Ruby gem that provides a semantically straightforward interface to
+ the Firecrawl.dev API, allowing you to easily scrape web content, take screenshots, and
+ crawl entire web domains.
+
+ The gem is particularly useful when working with Large Language Models (LLMs) as it can
+ provide markdown content for real-time information lookup as well as grounding.
+
+ ```ruby
+ require 'firecrawl'
+
+ Firecrawl.api_key ENV[ 'FIRECRAWL_API_KEY' ]
+ response = Firecrawl.scrape( 'https://example.com' )
+ if response.success?
+   result = response.result
+   puts result.metadata[ 'title' ]
+   puts '---'
+   puts result.markdown
+   puts "Screenshot URL: #{ result.screenshot_url }"
+ else
+   puts response.result.error_description
+ end
+ ```
+
+ ## Installation
+
+ Add this line to your application's Gemfile:
+
+ ```ruby
+ gem 'firecrawl'
+ ```
+
+ Then execute:
+
+ ```bash
+ $ bundle install
+ ```
+
+ Or install it directly:
+
+ ```bash
+ $ gem install firecrawl
+ ```
+
+ ## Usage
+
+ ### Scraping
+
+ The simplest way to use Firecrawl is to `scrape`, which will scrape the content of a single page
+ at the given URL and optionally convert it to markdown as well as create a screenshot. You can
+ choose to scrape the entire page or only the main content.
+
+ ```ruby
+ Firecrawl.api_key ENV[ 'FIRECRAWL_API_KEY' ]
+ response = Firecrawl.scrape( 'https://example.com', format: :markdown )
+
+ if response.success?
+   result = response.result
+   if result.success?
+     puts result.metadata[ 'title' ]
+     puts result.markdown
+   end
+ else
+   puts response.result.error_description
+ end
+ ```
+
+ In this basic example we have globally set the `Firecrawl.api_key` from the environment and then
+ used the `Firecrawl.scrape` convenience method to make a request to the Firecrawl API to scrape
+ the `https://example.com` page and return markdown ( markdown and the main content of the page
+ are returned by default so we could have omitted the options entirely ).
+
+ The `Firecrawl.scrape` method instantiates a `Firecrawl::ScrapeRequest` instance and then calls
+ its `submit` method. The following is the equivalent code which makes explicit use of the
+ `Firecrawl::ScrapeRequest` class.
+
+ ```ruby
+ request = Firecrawl::ScrapeRequest.new( api_key: ENV[ 'FIRECRAWL_API_KEY' ] )
+ response = request.submit( 'https://example.com', format: :markdown )
+
+ if response.success?
+   result = response.result
+   if result.success?
+     puts result.metadata[ 'title' ]
+     puts result.markdown
+   end
+ else
+   puts response.result.error_description
+ end
+ ```
+
+ Notice also that in this example we've passed the `api_key` directly to the individual request.
+ This is optional. If you set the key globally and omit it in the request constructor, the
+ `ScrapeRequest` instance will use the globally assigned `api_key`.
+
+ #### Scrape Options
+
+ You can customize scraping behavior using options, either by passing an options hash to the
+ `submit` method, as we have done above, or by building a `ScrapeOptions` instance:
+
+ ```ruby
+ options = Firecrawl::ScrapeOptions.build do
+   formats [ :html, :markdown, :screenshot ]
+   only_main_content true
+   include_tags [ 'article', 'main' ]
+   exclude_tags [ 'nav', 'footer' ]
+   wait_for 5000 # milliseconds
+ end
+
+ request = Firecrawl::ScrapeRequest.new( api_key: ENV[ 'FIRECRAWL_API_KEY' ] )
+ response = request.submit( 'https://example.com', options )
+ ```
+
+ #### Scrape Response
+
+ The `Firecrawl` gem is built on the `Faraday` gem, which permits you to customize the request
+ orchestration, up to and including changing the actual HTTP implementation used to make the
+ request, as sketched below.
+
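+ For instance, the following is a minimal sketch of supplying a custom Faraday connection,
+ either globally or per request ( the timeout shown is illustrative, not a default of the
+ gem ):
+
+ ```ruby
+ require 'faraday'
+ require 'firecrawl'
+
+ # build a Faraday connection with a custom timeout; any Faraday
+ # middleware or adapter may be configured here
+ connection = Faraday.new do | builder |
+   builder.options.timeout = 30
+   builder.adapter Faraday.default_adapter
+ end
+
+ # assign the connection globally ...
+ Firecrawl.connection( connection )
+
+ # ... or pass it to an individual request
+ request = Firecrawl::ScrapeRequest.new(
+   connection: connection, api_key: ENV[ 'FIRECRAWL_API_KEY' ]
+ )
+ ```
+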
+ Any `Firecrawl` request, including the `submit` method used above, will thus return a
+ `Faraday::Response`. This response includes a `success?` method which indicates whether the
+ request was successful. If the request was successful, the `response.result` method will be an
+ instance of `Firecrawl::ScrapeResult` that encapsulates the scraping result. This instance, in
+ turn, has a `success?` method which will return `true` if Firecrawl successfully scraped the
+ page.
+
+ A successful result will include the html, markdown, and screenshot output, as well as any
+ action and LLM extraction results and related metadata.
+
+ If the response is not successful ( if `response.success?` is `false` ) then `response.result`
+ will be an instance of `Firecrawl::ErrorResult` which will provide additional details about the
+ nature of the failure.
+
+ ### Batch Scraping
+
+ For scraping multiple URLs efficiently:
+
+ ```ruby
+ request = Firecrawl::BatchScrapeRequest.new( api_key: ENV[ 'FIRECRAWL_API_KEY' ] )
+
+ urls = [ 'https://example.com', 'https://example.org' ]
+ options = Firecrawl::ScrapeOptions.build do
+   format :markdown
+   only_main_content true
+ end
+
+ response = request.submit( urls, options )
+ while response.success?
+   batch_result = response.result
+   batch_result.scrape_results.each do | result |
+     puts result.metadata[ 'title' ]
+     puts result.markdown
+     puts "\n---\n"
+   end
+   break unless batch_result.status?( :scraping )
+   sleep 0.5
+   response = request.retrieve( batch_result )
+ end
+ ```
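+
+ The `BatchScrapeRequest` class also provides a `retrieve_all` method which returns the scrape
+ results accumulated up to the time of the call, rather than only those completed since the
+ previous retrieval. A minimal sketch, reusing the `request` and `batch_result` from the example
+ above:
+
+ ```ruby
+ response = request.retrieve_all( batch_result )
+ if response.success?
+   response.result.scrape_results.each do | result |
+     puts result.metadata[ 'title' ]
+   end
+ end
+ ```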
+
+ ### Site Mapping
+
+ To retrieve a site's structure:
+
+ ```ruby
+ request = Firecrawl::MapRequest.new( api_key: ENV[ 'FIRECRAWL_API_KEY' ] )
+
+ options = Firecrawl::MapOptions.build do
+   limit 100
+   ignore_subdomains true
+ end
+
+ response = request.submit( 'https://example.com', options )
+ if response.success?
+   result = response.result
+   result.links.each do | link |
+     puts link
+   end
+ end
+ ```
+
+ ### Site Crawling
+
+ For comprehensive site crawling:
+
+ ```ruby
+ request = Firecrawl::CrawlRequest.new( api_key: ENV[ 'FIRECRAWL_API_KEY' ] )
+
+ options = Firecrawl::CrawlOptions.build do
+   maximum_depth 2
+   limit 10
+   scrape_options do
+     format :markdown
+     only_main_content true
+   end
+ end
+
+ response = request.submit( 'https://example.com', options )
+ while response.success?
+   crawl_result = response.result
+   crawl_result.scrape_results.each do | result |
+     puts result.metadata[ 'title' ]
+     puts result.markdown
+   end
+   break unless crawl_result.status?( :scraping )
+   sleep 0.5
+   response = request.retrieve( crawl_result )
+ end
+ ```
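+
+ An in-progress crawl can be cancelled by passing the most recent `CrawlResult` to the request's
+ `cancel` method. A minimal sketch, reusing the `request` and `crawl_result` from the example
+ above:
+
+ ```ruby
+ response = request.cancel( crawl_result )
+ puts response.result.error_description unless response.success?
+ ```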
+
+ ## License
+
+ The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
data/firecrawl.gemspec CHANGED
@@ -1,7 +1,7 @@
  Gem::Specification.new do | spec |

    spec.name = 'firecrawl'
-   spec.version = '0.0.1'
+   spec.version = '0.2.0'
    spec.authors = [ 'Kristoph Cichocki-Romanov' ]
    spec.email = [ 'rubygems.org@kristoph.net' ]

@@ -29,9 +29,10 @@ Gem::Specification.new do | spec |
    spec.require_paths = [ "lib" ]

    spec.add_runtime_dependency 'faraday', '~> 2.7'
-   spec.add_runtime_dependency 'dynamicschema', '~> 1.0.0.beta03'
+   spec.add_runtime_dependency 'dynamicschema', '~> 1.0.0.beta04'

    spec.add_development_dependency 'rspec', '~> 3.13'
    spec.add_development_dependency 'debug', '~> 1.9'
+   spec.add_development_dependency 'vcr', '~> 6.3'

  end
data/lib/firecrawl/batch_scrape_request.rb CHANGED
@@ -3,8 +3,8 @@ module Firecrawl
    ##
    # The +BatchScrapeRequest+ class encapsulates a batch scrape request to the Firecrawl API.
    # After creating a new +BatchScrapeRequest+ instance you can begin batch scraping by calling
-   # the +begin_scraping+ method and then subsequently evaluate the results by calling the
-   # +continue_scraping' method.
+   # the +submit+ method and then subsequently retrieve the results by calling the
+   # +retrieve+ method.
    #
    # === examples
    #
@@ -18,7 +18,7 @@ module Firecrawl
    #     only_main_content true
    #   end
    #
-   #   batch_response = request.beging_scraping( urls, options )
+   #   batch_response = request.submit( urls, options )
    #   while batch_response.success?
    #     batch_result = batch_response.result
    #     if batch_result.success?
@@ -30,17 +30,18 @@ module Firecrawl
    #       end
    #     end
    #     break unless batch_result.status?( :scraping )
+   #     batch_response = request.retrieve( batch_result )
    #   end
    #
-   #   unless response.success?
-   #     puts response.result.error_description
+   #   unless batch_response.success?
+   #     puts batch_response.result.error_description
    #   end
    #
    class BatchScrapeRequest < Request

      ##
-     # The +start_scraping+ method makes a Firecrawl '/batch/scrape' POST request which will
-     # initiate batch scraping of the given urls.
+     # The +submit+ method makes a Firecrawl '/batch/scrape' POST request which will initiate
+     # batch scraping of the given urls.
      #
      # The response is always an instance of +Faraday::Response+. If +response.success?+ is true,
      # then +response.result+ will be an instance of +BatchScrapeResult+. If the request is not
@@ -50,7 +51,7 @@ module Firecrawl
      # successful and then +response.result.success?+ to validate that the API processed the
      # request successfully.
      #
-     def start_scraping( urls, options = nil, &block )
+     def submit( urls, options = nil, &block )
        if options
          options = options.is_a?( ScrapeOptions ) ? options : ScrapeOptions.build( options.to_h )
          options = options.to_h
@@ -58,25 +59,25 @@ module Firecrawl
        else
          options = {}
        end
        options[ :urls ] = [ urls ].flatten
-
        response = post( "#{BASE_URI}/batch/scrape", options, &block )
        result = nil
+       attributes = JSON.parse( response.body, symbolize_names: true ) rescue nil
        if response.success?
-         attributes = ( JSON.parse( response.body, symbolize_names: true ) rescue nil )
          attributes ||= { success: false, status: :failed }
          result = BatchScrapeResult.new( attributes[ :success ], attributes )
        else
-         result = ErrorResult.new( response.status, attributes )
+         result = ErrorResult.new( response.status, attributes || {} )
        end

        ResponseMethods.install( response, result )
      end

      ##
-     # The +retrieve_scraping+ method makes a Firecrawl '/batch/scrape' GET request which will
-     # retrieve batch scraping results. Note that there is no guarantee that there are any batch
-     # scraping results at the time of the call and you may need to call this method multiple
-     # times.
+     # The +retrieve+ method makes a Firecrawl '/batch/scrape' GET request which will return the
+     # scrape results that were completed since the previous call to this method ( or, if this is
+     # the first call to this method, since the batch scrape was started ). Note that there is no
+     # guarantee that there are any new batch scrape results at the time you make this call
+     # ( scrape_results may be empty ).
      #
      # The response is always an instance of +Faraday::Response+. If +response.success?+ is +true+,
      # then +response.result+ will be an instance of +BatchScrapeResult+. If the request is not
@@ -86,17 +87,53 @@ module Firecrawl
      # successful and then +response.result.success?+ to validate that the API processed the
      # request successfully.
      #
-     def retrieve_scraping( batch_result, &block )
+     def retrieve( batch_result, &block )
        raise ArgumentError, "The first argument must be an instance of BatchScrapeResult." \
          unless batch_result.is_a?( BatchScrapeResult )
        response = get( batch_result.next_url, &block )
        result = nil
+       attributes = JSON.parse( response.body, symbolize_names: true ) rescue nil
+       if response.success?
+         attributes ||= { success: false, status: :failed }
+         result = batch_result.merge( attributes )
+       else
+         result = ErrorResult.new( response.status, attributes || {} )
+       end
+
+       ResponseMethods.install( response, result )
+     end
+
+     ##
+     # The +retrieve_all+ method makes a Firecrawl '/batch/scrape' GET request which will return
+     # the scrape results that were completed at the time of this call. Repeated calls to this
+     # method will retrieve the scrape results previously returned as well as any scrape results
+     # that have accumulated since.
+     #
+     # Note that there is no guarantee that there are any new batch scrape results at the time you
+     # make this call ( scrape_results may be empty ).
+     #
+     # The response is always an instance of +Faraday::Response+. If +response.success?+ is +true+,
+     # then +response.result+ will be an instance of +BatchScrapeResult+. If the request is not
+     # successful then +response.result+ will be an instance of +ErrorResult+.
+     #
+     # Remember that you should call +response.success?+ to validate that the call to the API was
+     # successful and then +response.result.success?+ to validate that the API processed the
+     # request successfully.
+     #
+     def retrieve_all( batch_result, &block )
+       raise ArgumentError, "The first argument must be an instance of BatchScrapeResult." \
+         unless batch_result.is_a?( BatchScrapeResult )
+       response = get( batch_result.url, &block )
+       result = nil
+       attributes = JSON.parse( response.body, symbolize_names: true ) rescue nil
        if response.success?
-         attributes = ( JSON.parse( response.body, symbolize_names: true ) rescue nil )
          attributes ||= { success: false, status: :failed }
+         # the next url should not be set by this method so that retrieve and retrieve_all do
+         # not impact each other
+         attributes.delete( :next )
          result = batch_result.merge( attributes )
        else
-         result = ErrorResult.new( response.status, attributes )
+         result = ErrorResult.new( response.status, attributes || {} )
        end

        ResponseMethods.install( response, result )
data/lib/firecrawl/crawl_options.rb CHANGED
@@ -1,21 +1,17 @@
  module Firecrawl
    class CrawlOptions
      include DynamicSchema::Definable
-     include DynamicSchema::Buildable
-
-     FORMATS = [ :markdown, :links, :html, :raw_html, :screenshot ]
-
-     ACTIONS = [ :wait, :click, :write, :press, :screenshot, :scrape ]
+     include Helpers

      schema do
        exclude_paths String, as: :excludePaths, array: true
        include_paths String, as: :includePaths, array: true
        maximum_depth Integer, as: :maxDepth
        ignore_sitemap [ TrueClass, FalseClass ], as: :ignoreSitemap
-       limit Integer
+       limit Integer, in: (0..)
        allow_backward_links [ TrueClass, FalseClass ], as: :allowBackwardLinks
        allow_external_links [ TrueClass, FalseClass ], as: :allowExternalLinks
-       webhook String
+       webhook_uri URI, as: :webhook
        scrape_options as: :scrapeOptions, &ScrapeOptions.schema
      end

@@ -27,13 +23,13 @@ module Firecrawl
        new( api_options: builder.build!( options, &block ) )
      end

-     def initialize( options, api_options: nil )
+     def initialize( options = nil, api_options: nil )
        @options = self.class.builder.build( options || {} )
        @options = api_options.merge( @options ) if api_options

        scrape_options = @options[ :scrapeOptions ]
        if scrape_options
-         scrape_options[ :formats ]&.map!( &method( :string_camelize ) )
+         scrape_options[ :formats ]&.map! { | format | string_camelize( format.to_s ) }
        end
      end
data/lib/firecrawl/crawl_request.rb ADDED
@@ -0,0 +1,135 @@
+ module Firecrawl
+
+   ##
+   # The +CrawlRequest+ class encapsulates a crawl request to the Firecrawl API. After creating
+   # a new +CrawlRequest+ instance you can begin crawling by calling the +submit+ method and
+   # then subsequently retrieve the results by calling the +retrieve+ method.
+   #
+   # You can also optionally cancel the crawling operation by calling +cancel+.
+   #
+   # === examples
+   #
+   #   require 'firecrawl'
+   #
+   #   request = Firecrawl::CrawlRequest.new( api_key: ENV[ 'FIRECRAWL_API_KEY' ] )
+   #
+   #   url = 'https://icann.org'
+   #   options = Firecrawl::CrawlOptions.build do
+   #     scrape_options do
+   #       only_main_content true
+   #     end
+   #   end
+   #
+   #   crawl_response = request.submit( url, options )
+   #   while crawl_response.success?
+   #     crawl_result = crawl_response.result
+   #     if crawl_result.success?
+   #       crawl_result.scrape_results.each do | result |
+   #         puts result.metadata[ 'title' ]
+   #         puts '---'
+   #         puts result.markdown
+   #         puts "\n\n"
+   #       end
+   #     end
+   #     break unless crawl_result.status?( :scraping )
+   #     crawl_response = request.retrieve( crawl_result )
+   #   end
+   #
+   #   unless crawl_response.success?
+   #     puts crawl_response.result.error_description
+   #   end
+   #
+   class CrawlRequest < Request
+
+     ##
+     # The +submit+ method makes a Firecrawl '/crawl' POST request which will initiate crawling
+     # of the given url.
+     #
+     # The response is always an instance of +Faraday::Response+. If +response.success?+ is true,
+     # then +response.result+ will be an instance of +CrawlResult+. If the request is not
+     # successful then +response.result+ will be an instance of +ErrorResult+.
+     #
+     # Remember that you should call +response.success?+ to validate that the call to the API was
+     # successful and then +response.result.success?+ to validate that the API processed the
+     # request successfully.
+     #
+     def submit( url, options = nil, &block )
+       if options
+         options = options.is_a?( CrawlOptions ) ? options : CrawlOptions.build( options.to_h )
+         options = options.to_h
+       else
+         options = {}
+       end
+       options[ :url ] = url
+       response = post( "#{BASE_URI}/crawl", options, &block )
+       result = nil
+       attributes = JSON.parse( response.body, symbolize_names: true ) rescue nil
+       if response.success?
+         attributes ||= { success: false, status: :failed }
+         result = CrawlResult.new( attributes[ :success ], attributes )
+       else
+         result = ErrorResult.new( response.status, attributes )
+       end
+
+       ResponseMethods.install( response, result )
+     end
+
+     ##
+     # The +retrieve+ method makes a Firecrawl '/crawl/{id}' GET request which will return the
+     # crawl results that were completed since the previous call to this method ( or, if this is
+     # the first call to this method, since the crawl was started ). Note that there is no
+     # guarantee that there are any new crawl results at the time you make this call
+     # ( scrape_results may be empty ).
+     #
+     # The response is always an instance of +Faraday::Response+. If +response.success?+ is
+     # +true+, then +response.result+ will be an instance of +CrawlResult+. If the request is not
+     # successful then +response.result+ will be an instance of +ErrorResult+.
+     #
+     # Remember that you should call +response.success?+ to validate that the call to the API was
+     # successful and then +response.result.success?+ to validate that the API processed the
+     # request successfully.
+     #
+     def retrieve( crawl_result, &block )
+       raise ArgumentError, "The first argument must be an instance of CrawlResult." \
+         unless crawl_result.is_a?( CrawlResult )
+       response = get( crawl_result.next_url, &block )
+       result = nil
+       attributes = JSON.parse( response.body, symbolize_names: true ) rescue nil
+       if response.success?
+         result = crawl_result.merge( attributes || { success: false, status: :failed } )
+       else
+         result = ErrorResult.new( response.status, attributes || {} )
+       end
+
+       ResponseMethods.install( response, result )
+     end
+
+     ##
+     # The +cancel+ method makes a Firecrawl '/crawl/{id}' DELETE request which will cancel a
+     # previously submitted crawl.
+     #
+     # The response is always an instance of +Faraday::Response+. If +response.success?+ is
+     # +true+, then +response.result+ will be an instance of +CrawlResult+. If the request is not
+     # successful then +response.result+ will be an instance of +ErrorResult+.
+     #
+     # Remember that you should call +response.success?+ to validate that the call to the API was
+     # successful and then +response.result.success?+ to validate that the API processed the
+     # request successfully.
+     #
+     def cancel( crawl_result, &block )
+       raise ArgumentError, "The first argument must be an instance of CrawlResult." \
+         unless crawl_result.is_a?( CrawlResult )
+       response = delete( crawl_result.url, &block )
+       result = nil
+       attributes = JSON.parse( response.body, symbolize_names: true ) rescue nil
+       if response.success?
+         result = crawl_result.merge( attributes || { success: false, status: :failed } )
+       else
+         result = ErrorResult.new( response.status, attributes || {} )
+       end
+
+       ResponseMethods.install( response, result )
+     end
+
+   end
+ end
data/lib/firecrawl/crawl_result.rb ADDED
@@ -0,0 +1,63 @@
+ module Firecrawl
+   class CrawlResult
+
+     def initialize( success, attributes )
+       @success = success
+       @attributes = attributes || {}
+     end
+
+     def success?
+       @success || false
+     end
+
+     def status
+       # the initial Firecrawl response does not have a status so we synthesize a 'scraping'
+       # status if the operation was otherwise successful
+       @attributes[ :status ]&.to_sym || ( @success ? :scraping : :failed )
+     end
+
+     def status?( status )
+       self.status == status
+     end
+
+     def id
+       @attributes[ :id ]
+     end
+
+     def total
+       @attributes[ :total ] || 0
+     end
+
+     def completed
+       @attributes[ :completed ] || 0
+     end
+
+     def credits_used
+       @attributes[ :creditsUsed ] || 0
+     end
+
+     def expires_at
+       Date.parse( @attributes[ :expiresAt ] ) rescue nil
+     end
+
+     def url
+       @attributes[ :url ]
+     end
+
+     def next_url
+       @attributes[ :next ] || @attributes[ :url ]
+     end
+
+     def scrape_results
+       success = @attributes[ :success ]
+       # note: the &.compact is here because I've noted null entries in the data
+       ( @attributes[ :data ]&.compact || [] ).map do | attr |
+         ScrapeResult.new( success, attr )
+       end
+     end
+
+     def merge( attributes )
+       self.class.new( attributes[ :success ], @attributes.merge( attributes ) )
+     end
+   end
+ end
data/lib/firecrawl/error_result.rb CHANGED
@@ -5,7 +5,7 @@ module Firecrawl

    def initialize( status_code, attributes = nil )
      @error_code, @error_description = status_code_to_error( status_code )
-     @error_description = attributes[ :error ] if @attributes&.respond_to?( :[] )
+     @error_description = attributes[ :error ] if attributes&.respond_to?( :[] )
    end

    private
data/lib/firecrawl/map_request.rb CHANGED
@@ -10,7 +10,7 @@ module Firecrawl
    #
    #   request = Firecrawl::MapRequest.new( api_key: ENV[ 'FIRECRAWL_API_KEY' ] )
    #
-   #   response = request.map( 'https://example.com', { limit: 100 } )
+   #   response = request.submit( 'https://example.com', { limit: 100 } )
    #   if response.success?
    #     result = response.result
    #     if result.success?
@@ -25,14 +25,14 @@ module Firecrawl
    class MapRequest < Request

      ##
-     # The +map+ method makes a Firecrawl '/map' POST request which will scrape the site with
-     # given url.
+     # The +submit+ method makes a Firecrawl '/map' POST request which will scrape the site with
+     # the given url and return links to all hosted pages related to that url.
      #
      # The response is always an instance of +Faraday::Response+. If +response.success?+ is true,
      # then +response.result+ will be an instance of +MapResult+. If the request is not successful
      # then +response.result+ will be an instance of +ErrorResult+.
      #
-     def map( url, options = nil, &block )
+     def submit( url, options = nil, &block )
        if options
          options = options.is_a?( MapOptions ) ? options : MapOptions.build( options.to_h )
          options = options.to_h
data/lib/firecrawl/module_methods.rb ADDED
@@ -0,0 +1,18 @@
+ module Firecrawl
+   module ModuleMethods
+     DEFAULT_CONNECTION = Faraday.new { | builder | builder.adapter Faraday.default_adapter }
+
+     def connection( connection = nil )
+       @connection = connection || @connection || DEFAULT_CONNECTION
+     end
+
+     def api_key( api_key = nil )
+       @api_key = api_key || @api_key
+       @api_key
+     end
+
+     def scrape( url, options = nil, &block )
+       Firecrawl::ScrapeRequest.new.submit( url, options, &block )
+     end
+   end
+ end
data/lib/firecrawl/request.rb CHANGED
@@ -28,8 +28,6 @@ module Firecrawl
    #
    class Request

-     DEFAULT_CONNECTION = Faraday.new { | builder | builder.adapter Faraday.default_adapter }
-
      BASE_URI = 'https://api.firecrawl.dev/v1'

      ##
@@ -37,7 +35,7 @@ module Firecrawl
      # and optionally a (Faraday) +connection+.
      #
      def initialize( connection: nil, api_key: nil )
-       @connection = connection || DEFAULT_CONNECTION
+       @connection = connection || Firecrawl.connection
        @api_key = api_key || Firecrawl.api_key
        raise ArgumentError, "An 'api_key' is required unless configured using 'Firecrawl.api_key'." \
          unless @api_key
@@ -70,6 +68,18 @@ module Firecrawl
        end
      end

+     def delete( uri, &block )
+       headers = {
+         'Authorization' => "Bearer #{@api_key}",
+         'Content-Type' => 'application/json'
+       }
+
+       @connection.delete( uri ) do | request |
+         headers.each { | key, value | request.headers[ key ] = value }
+         block.call( request ) if block
+       end
+     end
+
    end

  end
data/lib/firecrawl/scrape_options.rb CHANGED
@@ -9,7 +9,7 @@ module Firecrawl

    schema do
      # note: both format and formats are defined as a semantic convenience
-     format String, as: :formats, array: true, in: FORMATS
+     format String, as: :formats, array: true, in: FORMATS
      formats String, array: true, in: FORMATS
      only_main_content [ TrueClass, FalseClass ], as: :onlyMainContent
      include_tags String, as: :includeTags, array: true
@@ -17,7 +17,7 @@ module Firecrawl
      wait_for Integer
      timeout Integer
      extract do
-       #schema Hash
+       schema Hash
        system_prompt String, as: :systemPrompt
        prompt String
      end
data/lib/firecrawl/scrape_request.rb CHANGED
@@ -1,7 +1,7 @@
  module Firecrawl
    ##
    # The +ScrapeRequest+ class encapsulates a '/scrape' POST request to the Firecrawl API. After
-   # creating a new +ScrapeRequest+ instance you can initiate the request by calling the +scrape+
+   # creating a new +ScrapeRequest+ instance you can initiate the request by calling the +submit+
    # method to perform synchronous scraping.
    #
    # === examples
@@ -15,7 +15,7 @@ module Firecrawl
    #     only_main_content true
    #   end
    #
-   #   response = request.scrape( 'https://example.com', options )
+   #   response = request.submit( 'https://example.com', options )
    #   if response.success?
    #     result = response.result
    #     puts result.metadata[ 'title' ]
@@ -28,13 +28,13 @@ module Firecrawl
    class ScrapeRequest < Request

      ##
-     # The +scrape+ method makes a Firecrawl '/scrape' POST request which will scrape the given url.
+     # The +submit+ method makes a Firecrawl '/scrape' POST request which will scrape the given url.
      #
      # The response is always an instance of +Faraday::Response+. If +response.success?+ is true,
      # then +response.result+ will be an instance of +ScrapeResult+. If the request is not successful
      # then +response.result+ will be an instance of +ErrorResult+.
      #
-     def scrape( url, options = nil, &block )
+     def submit( url, options = nil, &block )
        if options
          options = options.is_a?( ScrapeOptions ) ? options : ScrapeOptions.build( options.to_h )
          options = options.to_h
data/lib/firecrawl/scrape_result.rb CHANGED
@@ -16,6 +16,20 @@ module Firecrawl
      @success || false
    end

+   def metadata
+     unless @metadata
+       metadata = @attributes[ :metadata ] || {}
+       @metadata = metadata.transform_keys do | key |
+         key.to_s.gsub( /([a-z])([A-Z])/, '\1_\2' ).downcase
+       end
+       # remove the camelCase forms injected by Firecrawl
+       @metadata.delete_if do | key, _ |
+         key.start_with?( 'og_' ) && @metadata.key?( key.sub( 'og_', 'og:' ) )
+       end
+     end
+     @metadata
+   end
+
    ##
    # The +markdown+ method returns scraped content that has been converted to markdown. The
    # markdown content is present only if the request options +formats+ included +markdown+.
@@ -66,20 +80,6 @@ module Firecrawl
      @attributes[ :actions ] || {}
    end

-   def metadata
-     unless @metadata
-       metadata = @attributes[ :metadata ] || {}
-       @metadata = metadata.transform_keys do | key |
-         key.to_s.gsub( /([a-z])([A-Z])/, '\1_\2' ).downcase
-       end
-       # remove the camelCase forms injected by Firecrawl
-       @metadata.delete_if do | key, _ |
-         key.start_with?( 'og_' ) && @metadata.key?( key.sub( 'og_', 'og:' ) )
-       end
-     end
-     @metadata
-   end
-
    def llm_extraction
      @attributes[ :llm_extraction ] || {}
    end
data/lib/firecrawl.rb CHANGED
@@ -18,10 +18,14 @@ require_relative 'firecrawl/batch_scrape_request'
  require_relative 'firecrawl/map_options'
  require_relative 'firecrawl/map_result'
  require_relative 'firecrawl/map_request'
+ require_relative 'firecrawl/crawl_options'
+ require_relative 'firecrawl/crawl_result'
+ require_relative 'firecrawl/crawl_request'
+
+ require_relative 'firecrawl/module_methods'

  module Firecrawl
-   class << self
-     attr_accessor :api_key
-   end
+   extend ModuleMethods
  end

+
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: firecrawl
  version: !ruby/object:Gem::Version
-   version: 0.0.1
+   version: 0.2.0
  platform: ruby
  authors:
  - Kristoph Cichocki-Romanov
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2024-11-04 00:00:00.000000000 Z
+ date: 2024-11-29 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: faraday
@@ -30,14 +30,14 @@ dependencies:
      requirements:
      - - "~>"
        - !ruby/object:Gem::Version
-         version: 1.0.0.beta03
+         version: 1.0.0.beta04
    type: :runtime
    prerelease: false
    version_requirements: !ruby/object:Gem::Requirement
      requirements:
      - - "~>"
        - !ruby/object:Gem::Version
-         version: 1.0.0.beta03
+         version: 1.0.0.beta04
  - !ruby/object:Gem::Dependency
    name: rspec
    requirement: !ruby/object:Gem::Requirement
@@ -66,6 +66,20 @@ dependencies:
      - - "~>"
        - !ruby/object:Gem::Version
          version: '1.9'
+ - !ruby/object:Gem::Dependency
+   name: vcr
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '6.3'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '6.3'
  description: |-
    The Firecrawl gem implements a lightweight interface to the Firecrawl.dev API. Firecrawl can take a URL, scrape the page contents and return the whole page or principal content as html, markdown, or structured data.

@@ -77,16 +91,20 @@ extensions: []
  extra_rdoc_files: []
  files:
  - LICENSE
+ - README.md
  - firecrawl.gemspec
  - lib/firecrawl.rb
  - lib/firecrawl/batch_scrape_request.rb
  - lib/firecrawl/batch_scrape_result.rb
  - lib/firecrawl/crawl_options.rb
+ - lib/firecrawl/crawl_request.rb
+ - lib/firecrawl/crawl_result.rb
  - lib/firecrawl/error_result.rb
  - lib/firecrawl/helpers.rb
  - lib/firecrawl/map_options.rb
  - lib/firecrawl/map_request.rb
  - lib/firecrawl/map_result.rb
+ - lib/firecrawl/module_methods.rb
  - lib/firecrawl/request.rb
  - lib/firecrawl/response_methods.rb
  - lib/firecrawl/scrape_options.rb