ultimate-pi 0.1.2 → 0.1.4

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (516)
  1. package/.agents/skills/ck-search/SKILL.md +99 -0
  2. package/.agents/skills/defuddle/SKILL.md +90 -0
  3. package/.agents/skills/find-skills/SKILL.md +142 -0
  4. package/.agents/skills/firecrawl/SKILL.md +150 -0
  5. package/.agents/skills/firecrawl/rules/install.md +82 -0
  6. package/.agents/skills/firecrawl/rules/security.md +26 -0
  7. package/.agents/skills/firecrawl-agent/SKILL.md +57 -0
  8. package/.agents/skills/firecrawl-build-interact/SKILL.md +67 -0
  9. package/.agents/skills/firecrawl-build-onboarding/SKILL.md +102 -0
  10. package/.agents/skills/firecrawl-build-onboarding/references/auth-flow.md +39 -0
  11. package/.agents/skills/firecrawl-build-onboarding/references/project-setup.md +20 -0
  12. package/.agents/skills/firecrawl-build-onboarding/references/sdk-installation.md +17 -0
  13. package/.agents/skills/firecrawl-build-scrape/SKILL.md +68 -0
  14. package/.agents/skills/firecrawl-build-search/SKILL.md +68 -0
  15. package/.agents/skills/firecrawl-crawl/SKILL.md +58 -0
  16. package/.agents/skills/firecrawl-download/SKILL.md +69 -0
  17. package/.agents/skills/firecrawl-interact/SKILL.md +83 -0
  18. package/.agents/skills/firecrawl-map/SKILL.md +50 -0
  19. package/.agents/skills/firecrawl-parse/SKILL.md +61 -0
  20. package/.agents/skills/firecrawl-scrape/SKILL.md +68 -0
  21. package/.agents/skills/firecrawl-search/SKILL.md +59 -0
  22. package/.agents/skills/obsidian-bases/SKILL.md +299 -0
  23. package/.agents/skills/obsidian-markdown/SKILL.md +237 -0
  24. package/.agents/skills/posthog-analyst/SKILL.md +306 -0
  25. package/.agents/skills/posthog-analyst/evals/evals.json +23 -0
  26. package/.agents/skills/wiki/SKILL.md +215 -0
  27. package/.agents/skills/wiki/references/css-snippets.md +122 -0
  28. package/.agents/skills/wiki/references/frontmatter.md +107 -0
  29. package/.agents/skills/wiki/references/git-setup.md +58 -0
  30. package/.agents/skills/wiki/references/mcp-setup.md +149 -0
  31. package/.agents/skills/wiki/references/modes.md +259 -0
  32. package/.agents/skills/wiki/references/plugins.md +96 -0
  33. package/.agents/skills/wiki/references/rest-api.md +124 -0
  34. package/.agents/skills/wiki-autoresearch/SKILL.md +211 -0
  35. package/.agents/skills/wiki-autoresearch/references/program.md +75 -0
  36. package/.agents/skills/wiki-fold/SKILL.md +204 -0
  37. package/.agents/skills/wiki-fold/references/fold-template.md +133 -0
  38. package/.agents/skills/wiki-ingest/SKILL.md +288 -0
  39. package/.agents/skills/wiki-lint/SKILL.md +183 -0
  40. package/.agents/skills/wiki-query/SKILL.md +176 -0
  41. package/.agents/skills/wiki-save/SKILL.md +128 -0
  42. package/.ckignore +41 -0
  43. package/.env.example +9 -0
  44. package/.github/workflows/lint.yml +33 -0
  45. package/.github/workflows/publish-github-packages.yml +35 -0
  46. package/.github/workflows/publish-npm.yml +1 -1
  47. package/.pi/SYSTEM.md +107 -40
  48. package/.pi/agents/pi-pi/agent-expert.md +205 -0
  49. package/.pi/agents/pi-pi/cli-expert.md +47 -0
  50. package/.pi/agents/pi-pi/config-expert.md +67 -0
  51. package/.pi/agents/pi-pi/ext-expert.md +53 -0
  52. package/.pi/agents/pi-pi/keybinding-expert.md +123 -0
  53. package/.pi/agents/pi-pi/pi-orchestrator.md +103 -0
  54. package/.pi/agents/pi-pi/prompt-expert.md +83 -0
  55. package/.pi/agents/pi-pi/skill-expert.md +52 -0
  56. package/.pi/agents/pi-pi/theme-expert.md +46 -0
  57. package/.pi/agents/pi-pi/tui-expert.md +100 -0
  58. package/.pi/agents/rethink.md +140 -0
  59. package/.pi/agents/wiki-ingest.md +67 -0
  60. package/.pi/agents/wiki-lint.md +75 -0
  61. package/.pi/auto-commit.json +20 -0
  62. package/.pi/extensions/banner.png +0 -0
  63. package/.pi/extensions/ck-enforce.ts +216 -0
  64. package/.pi/extensions/custom-footer.ts +308 -0
  65. package/.pi/extensions/custom-header.ts +116 -0
  66. package/.pi/extensions/dotenv-loader.ts +170 -0
  67. package/.pi/internal/cursor-sdk-transcript-parser.ts +59 -0
  68. package/.pi/model-router.json +95 -0
  69. package/.pi/npm/.gitignore +2 -0
  70. package/.pi/prompts/git-sync.md +124 -0
  71. package/.pi/prompts/harness-setup.md +509 -0
  72. package/.pi/prompts/save.md +16 -0
  73. package/.pi/prompts/wiki-autoresearch.md +19 -0
  74. package/.pi/prompts/wiki.md +23 -0
  75. package/.pi/providers/cursor-sdk-provider.test.mjs +476 -0
  76. package/.pi/providers/cursor-sdk-provider.ts +1085 -0
  77. package/.pi/settings.json +14 -4
  78. package/.pi/skills/agent-router/SKILL.md +174 -0
  79. package/.pi/sounds/alert/1-kaching-track.mp3 +0 -0
  80. package/.pi/sounds/error/1-ksi-wth-track.mp3 +0 -0
  81. package/.pi/sounds/error/2-smash-track.mp3 +0 -0
  82. package/.pi/sounds/error/3-buzzer-track.mp3 +0 -0
  83. package/.pi/sounds/notification/1-soft-notification-track.mp3 +0 -0
  84. package/.pi/sounds/project-sounds.json +25 -0
  85. package/.pi/sounds/reminder/1-soft-notification-track.mp3 +0 -0
  86. package/.pi/sounds/success/1-tada-track.mp3 +0 -0
  87. package/.pi/sounds/success/2-jobs-done-track.mp3 +0 -0
  88. package/.pi/sounds/success/3-yay-track.mp3 +0 -0
  89. package/CONTRIBUTING.md +116 -0
  90. package/README.md +32 -39
  91. package/biome.json +34 -0
  92. package/firecrawl/.env.template +58 -0
  93. package/firecrawl/README.md +49 -0
  94. package/firecrawl/docker-compose.yaml +201 -0
  95. package/firecrawl/searxng/searxng.env +3 -0
  96. package/firecrawl/searxng/settings.yml +85 -0
  97. package/lefthook.yml +8 -0
  98. package/package.json +55 -24
  99. package/vault/AGENTS.md +37 -0
  100. package/vault/wiki/_templates/comparison.md +39 -0
  101. package/vault/wiki/_templates/concept.md +40 -0
  102. package/vault/wiki/_templates/decision.md +21 -0
  103. package/vault/wiki/_templates/entity.md +32 -0
  104. package/vault/wiki/_templates/flow.md +14 -0
  105. package/vault/wiki/_templates/module.md +18 -0
  106. package/vault/wiki/_templates/question.md +31 -0
  107. package/vault/wiki/_templates/source.md +39 -0
  108. package/vault/wiki/concepts/AST-Aware Code Chunking.md +44 -0
  109. package/vault/wiki/concepts/Build-Time Prompt Compilation.md +107 -0
  110. package/vault/wiki/concepts/Context Engine (AI Coding).md +47 -0
  111. package/vault/wiki/concepts/Context-Aware System Reminders.md +61 -0
  112. package/vault/wiki/concepts/Contextualized Text Embedding.md +42 -0
  113. package/vault/wiki/concepts/Contractor vs Employee AI Model.md +55 -0
  114. package/vault/wiki/concepts/Dual-Model Agent Architecture.md +65 -0
  115. package/vault/wiki/concepts/Late Chunking vs Early Chunking.md +43 -0
  116. package/vault/wiki/concepts/Majority Vote Ensembling.md +68 -0
  117. package/vault/wiki/concepts/Meta-Harness.md +16 -0
  118. package/vault/wiki/concepts/Multi-Agent AI Coding Architecture.md +75 -0
  119. package/vault/wiki/concepts/Prompt Enhancement.md +90 -0
  120. package/vault/wiki/concepts/Prompt Renderer.md +89 -0
  121. package/vault/wiki/concepts/Semantic Codebase Indexing.md +67 -0
  122. package/vault/wiki/concepts/additive-config-hierarchy.md +16 -0
  123. package/vault/wiki/concepts/agent-artifacts-verifiable-deliverables.md +71 -0
  124. package/vault/wiki/concepts/agent-browser-browser-automation.md +99 -0
  125. package/vault/wiki/concepts/agent-codebase-interface.md +43 -0
  126. package/vault/wiki/concepts/agent-harness-architecture.md +67 -0
  127. package/vault/wiki/concepts/agent-loop-detection-patterns.md +133 -0
  128. package/vault/wiki/concepts/agent-search-enforcement.md +126 -0
  129. package/vault/wiki/concepts/agent-skills-ecosystem.md +74 -0
  130. package/vault/wiki/concepts/agent-skills-pattern.md +68 -0
  131. package/vault/wiki/concepts/agentic-harness-context-enforcement.md +91 -0
  132. package/vault/wiki/concepts/agentic-harness.md +34 -0
  133. package/vault/wiki/concepts/agentic-orchestration-pipeline.md +56 -0
  134. package/vault/wiki/concepts/agentic-search-no-embeddings.md +18 -0
  135. package/vault/wiki/concepts/anthropic-context-engineering.md +13 -0
  136. package/vault/wiki/concepts/antigravity-agent-first-architecture.md +61 -0
  137. package/vault/wiki/concepts/ast-compression.md +19 -0
  138. package/vault/wiki/concepts/ast-truncation.md +66 -0
  139. package/vault/wiki/concepts/barrel-files.md +37 -0
  140. package/vault/wiki/concepts/browser-harness-agent.md +41 -0
  141. package/vault/wiki/concepts/browser-subagent-visual-verification.md +82 -0
  142. package/vault/wiki/concepts/codebase-intelligence-ecosystem-comparison.md +192 -0
  143. package/vault/wiki/concepts/codebase-intelligence-harness-integration.md +161 -0
  144. package/vault/wiki/concepts/codebase-to-context-ingestion.md +46 -0
  145. package/vault/wiki/concepts/codex-harness-innovations.md +147 -0
  146. package/vault/wiki/concepts/consensus-debate-flow.md +17 -0
  147. package/vault/wiki/concepts/consensus-debate.md +206 -0
  148. package/vault/wiki/concepts/content-addressed-spec-identity.md +166 -0
  149. package/vault/wiki/concepts/context-anxiety.md +57 -0
  150. package/vault/wiki/concepts/context-compression-techniques.md +19 -0
  151. package/vault/wiki/concepts/context-continuity.md +22 -0
  152. package/vault/wiki/concepts/context-drift-in-agents.md +106 -0
  153. package/vault/wiki/concepts/context-engineering.md +62 -0
  154. package/vault/wiki/concepts/context-folding.md +67 -0
  155. package/vault/wiki/concepts/context-mode.md +38 -0
  156. package/vault/wiki/concepts/cursor-harness-innovations.md +107 -0
  157. package/vault/wiki/concepts/deterministic-session-compaction.md +79 -0
  158. package/vault/wiki/concepts/drift-detection-unified.md +296 -0
  159. package/vault/wiki/concepts/execution-feedback-loop.md +46 -0
  160. package/vault/wiki/concepts/feedforward-feedback-harness.md +60 -0
  161. package/vault/wiki/concepts/five-root-cause-metrics-sentrux.md +40 -0
  162. package/vault/wiki/concepts/fork-safe-spec-storage.md +89 -0
  163. package/vault/wiki/concepts/fts5-sandbox.md +19 -0
  164. package/vault/wiki/concepts/fuzzy-edit-matching.md +71 -0
  165. package/vault/wiki/concepts/gemini-cli-architecture.md +104 -0
  166. package/vault/wiki/concepts/generator-evaluator-architecture.md +64 -0
  167. package/vault/wiki/concepts/guardian-agent-pattern.md +67 -0
  168. package/vault/wiki/concepts/harness-configuration-layers.md +89 -0
  169. package/vault/wiki/concepts/harness-control-frameworks.md +155 -0
  170. package/vault/wiki/concepts/harness-engineering-first-principles.md +90 -0
  171. package/vault/wiki/concepts/harness-h-formalism.md +53 -0
  172. package/vault/wiki/concepts/hybrid-code-search.md +61 -0
  173. package/vault/wiki/concepts/inline-post-edit-validation.md +112 -0
  174. package/vault/wiki/concepts/legendary-engineering-patterns-harness.md +110 -0
  175. package/vault/wiki/concepts/lifecycle-hooks.md +94 -0
  176. package/vault/wiki/concepts/mcp-tool-routing.md +102 -0
  177. package/vault/wiki/concepts/memory-system-of-record-vs-ephemeral-cache.md +47 -0
  178. package/vault/wiki/concepts/meta-agent-context-pruning.md +151 -0
  179. package/vault/wiki/concepts/model-adaptive-harness.md +122 -0
  180. package/vault/wiki/concepts/model-routing-agents.md +101 -0
  181. package/vault/wiki/concepts/monorepo-architecture.md +45 -0
  182. package/vault/wiki/concepts/multi-agent-specialization.md +61 -0
  183. package/vault/wiki/concepts/permission-subsystem.md +16 -0
  184. package/vault/wiki/concepts/pi-messenger-analysis.md +243 -0
  185. package/vault/wiki/concepts/pi-vscode-extension-landscape.md +37 -0
  186. package/vault/wiki/concepts/policy-engine-pattern.md +78 -0
  187. package/vault/wiki/concepts/progressive-disclosure-agents.md +53 -0
  188. package/vault/wiki/concepts/progressive-skill-disclosure.md +17 -0
  189. package/vault/wiki/concepts/provider-native-prompting.md +203 -0
  190. package/vault/wiki/concepts/quality-signal-sentrux.md +37 -0
  191. package/vault/wiki/concepts/repo-map-ranking.md +42 -0
  192. package/vault/wiki/concepts/result-monad-error-handling.md +47 -0
  193. package/vault/wiki/concepts/safety-defense-in-depth.md +83 -0
  194. package/vault/wiki/concepts/sandbox-os-enforcement.md +18 -0
  195. package/vault/wiki/concepts/selective-debate-routing.md +70 -0
  196. package/vault/wiki/concepts/self-evolving-harness.md +60 -0
  197. package/vault/wiki/concepts/sentrux-mcp-integration.md +36 -0
  198. package/vault/wiki/concepts/sentrux-rules-engine.md +49 -0
  199. package/vault/wiki/concepts/shell-pattern-compression.md +24 -0
  200. package/vault/wiki/concepts/skill-first-architecture.md +166 -0
  201. package/vault/wiki/concepts/structured-compaction.md +78 -0
  202. package/vault/wiki/concepts/subagent-orchestration.md +17 -0
  203. package/vault/wiki/concepts/subagent-worktree-isolation.md +68 -0
  204. package/vault/wiki/concepts/superpowers-methodology.md +78 -0
  205. package/vault/wiki/concepts/think-in-code.md +73 -0
  206. package/vault/wiki/concepts/ts-execution-layer.md +100 -0
  207. package/vault/wiki/concepts/typescript-strict-mode.md +37 -0
  208. package/vault/wiki/concepts/vcc-conversation-compaction-for-pi.md +51 -0
  209. package/vault/wiki/concepts/verification-drift-detection.md +19 -0
  210. package/vault/wiki/consensus/consensus-records.md +58 -0
  211. package/vault/wiki/decisions/2026-04-30-pi-lean-ctx-native.md +122 -0
  212. package/vault/wiki/decisions/adr-008.md +40 -0
  213. package/vault/wiki/decisions/adr-009.md +46 -0
  214. package/vault/wiki/decisions/adr-010.md +55 -0
  215. package/vault/wiki/decisions/adr-011.md +165 -0
  216. package/vault/wiki/decisions/adr-012.md +102 -0
  217. package/vault/wiki/decisions/adr-013.md +59 -0
  218. package/vault/wiki/decisions/adr-014.md +73 -0
  219. package/vault/wiki/decisions/adr-015.md +81 -0
  220. package/vault/wiki/decisions/adr-016.md +91 -0
  221. package/vault/wiki/decisions/adr-017.md +79 -0
  222. package/vault/wiki/decisions/adr-018.md +100 -0
  223. package/vault/wiki/decisions/adr-019.md +75 -0
  224. package/vault/wiki/decisions/adr-020.md +106 -0
  225. package/vault/wiki/decisions/adr-021.md +86 -0
  226. package/vault/wiki/decisions/adr-022.md +113 -0
  227. package/vault/wiki/decisions/adr-023.md +113 -0
  228. package/vault/wiki/decisions/adr-024.md +73 -0
  229. package/vault/wiki/decisions/adr-025.md +130 -0
  230. package/vault/wiki/decisions/adr-026.md +56 -0
  231. package/vault/wiki/decisions/colocate-wiki.md +34 -0
  232. package/vault/wiki/entities/Anders Hejlsberg.md +29 -0
  233. package/vault/wiki/entities/Anthropic.md +17 -0
  234. package/vault/wiki/entities/Augment Code.md +49 -0
  235. package/vault/wiki/entities/Bjarne Stroustrup.md +26 -0
  236. package/vault/wiki/entities/Bolt.new (StackBlitz).md +39 -0
  237. package/vault/wiki/entities/Boris Cherny.md +11 -0
  238. package/vault/wiki/entities/Claude Code.md +19 -0
  239. package/vault/wiki/entities/Dennis Ritchie.md +26 -0
  240. package/vault/wiki/entities/Emergent Labs.md +32 -0
  241. package/vault/wiki/entities/Google Cloud.md +16 -0
  242. package/vault/wiki/entities/Guido van Rossum.md +28 -0
  243. package/vault/wiki/entities/Ken Thompson.md +28 -0
  244. package/vault/wiki/entities/Lee et al.md +16 -0
  245. package/vault/wiki/entities/Linus Torvalds.md +28 -0
  246. package/vault/wiki/entities/Lovable (company).md +40 -0
  247. package/vault/wiki/entities/Martin Fowler.md +16 -0
  248. package/vault/wiki/entities/Meng et al.md +16 -0
  249. package/vault/wiki/entities/OpenAI.md +16 -0
  250. package/vault/wiki/entities/Rocket.new.md +38 -0
  251. package/vault/wiki/entities/VILA-Lab.md +15 -0
  252. package/vault/wiki/entities/autodev-codebase.md +18 -0
  253. package/vault/wiki/entities/ck-tool.md +59 -0
  254. package/vault/wiki/entities/codesearch.md +18 -0
  255. package/vault/wiki/entities/disler-indydevdan.md +33 -0
  256. package/vault/wiki/entities/gsd-get-shit-done.md +56 -0
  257. package/vault/wiki/entities/javascript-runtimes.md +48 -0
  258. package/vault/wiki/entities/jesse-vincent.md +38 -0
  259. package/vault/wiki/entities/lean-ctx.md +32 -0
  260. package/vault/wiki/entities/opendev.md +41 -0
  261. package/vault/wiki/entities/ops-codegraph-tool.md +18 -0
  262. package/vault/wiki/entities/pi-coding-agent.md +53 -0
  263. package/vault/wiki/entities/sentrux.md +54 -0
  264. package/vault/wiki/entities/vgrep-tool.md +57 -0
  265. package/vault/wiki/entities/vitest.md +41 -0
  266. package/vault/wiki/flows/harness-wiki-pipeline.md +204 -0
  267. package/vault/wiki/hot.md +932 -0
  268. package/vault/wiki/index.md +437 -0
  269. package/vault/wiki/log.md +418 -0
  270. package/vault/wiki/meta/dashboard.md +30 -0
  271. package/vault/wiki/meta/lint-report-2026-04-30.md +86 -0
  272. package/vault/wiki/meta/lint-report-2026-05-02.md +251 -0
  273. package/vault/wiki/meta/overview.canvas +43 -0
  274. package/vault/wiki/modules/adversarial-verification.md +57 -0
  275. package/vault/wiki/modules/automated-observability.md +54 -0
  276. package/vault/wiki/modules/bench.md +20 -0
  277. package/vault/wiki/modules/extensions.md +23 -0
  278. package/vault/wiki/modules/grounding-checkpoints.md +62 -0
  279. package/vault/wiki/modules/harness-implementation-plan.md +345 -0
  280. package/vault/wiki/modules/harness-wiki-skill-mapping.md +135 -0
  281. package/vault/wiki/modules/harness.md +86 -0
  282. package/vault/wiki/modules/persistent-memory.md +85 -0
  283. package/vault/wiki/modules/schema-orchestration.md +68 -0
  284. package/vault/wiki/modules/skills.md +27 -0
  285. package/vault/wiki/modules/spec-hardening.md +58 -0
  286. package/vault/wiki/modules/structured-planning.md +53 -0
  287. package/vault/wiki/modules/think-in-code-enforcement.md +153 -0
  288. package/vault/wiki/modules/wiki-query-interface.md +64 -0
  289. package/vault/wiki/overview.md +51 -0
  290. package/vault/wiki/questions/Research-pi-vs-claude-code-agentic-orchestration-pipeline.md +87 -0
  291. package/vault/wiki/questions/Research-sentrux-dev.md +123 -0
  292. package/vault/wiki/questions/Research-superpowers-skill-for-agentic-coding-agents.md +164 -0
  293. package/vault/wiki/questions/Research: Augment Code Context Engine.md +244 -0
  294. package/vault/wiki/questions/Research: Automating Software Engineering - Lovable, Bolt, Emergent, Rocket.md +112 -0
  295. package/vault/wiki/questions/Research: Claude Code State-of-the-Art Harness Improvements.md +209 -0
  296. package/vault/wiki/questions/Research: Codex State-of-the-Art Harness Improvements.md +99 -0
  297. package/vault/wiki/questions/Research: Engineering Workflows of Legendary Programmers and AI Harness Mapping.md +107 -0
  298. package/vault/wiki/questions/Research: Fallow Codebase Intelligence Harness Integration.md +72 -0
  299. package/vault/wiki/questions/Research: Gemini CLI SOTA Harness Integration.md +166 -0
  300. package/vault/wiki/questions/Research: GitHub Issues as Harness Spec Storage.md +188 -0
  301. package/vault/wiki/questions/Research: Google Antigravity Harness Integration.md +120 -0
  302. package/vault/wiki/questions/Research: Meta-Agent Context Drift Detection.md +236 -0
  303. package/vault/wiki/questions/Research: Model-Adaptive Agent Harness Design.md +95 -0
  304. package/vault/wiki/questions/Research: Model-Specific Prompting Guides.md +165 -0
  305. package/vault/wiki/questions/Research: Prompt Renderer for Multi-Model Agent Harness.md +216 -0
  306. package/vault/wiki/questions/Research: Skill-First Harness Architecture.md +91 -0
  307. package/vault/wiki/questions/Research: TypeScript Best Practices and Codebase Structure.md +88 -0
  308. package/vault/wiki/questions/Research: TypeScript Execution Layer for Agent Tool Calling.md +81 -0
  309. package/vault/wiki/questions/Research: claude-mem over Obsidian for Harness Layer.md +71 -0
  310. package/vault/wiki/questions/Research: claude-mem over obsidian wiki as the knowledge base for our agentic harness pipeline. think from first principles. does this replace or complement our current setup? no hard feelings about previous decisions. gimme accurate points.md +80 -0
  311. package/vault/wiki/questions/Research: context-mode vs lean-ctx.md +72 -0
  312. package/vault/wiki/questions/Research: cursor.sh Harness Innovations.md +92 -0
  313. package/vault/wiki/questions/Research: executor.sh Harness Integration.md +170 -0
  314. package/vault/wiki/questions/Research: how GSD fits into our coding harness setup.md +97 -0
  315. package/vault/wiki/questions/Research: how claude-mem fits into our workflow. and whether it should replace obsidian in the codebase. no hard feelings about previous actions, rethink from first principles always.md +80 -0
  316. package/vault/wiki/questions/Research: pi-vcc.md +113 -0
  317. package/vault/wiki/questions/Research: semantic code search tools.md +69 -0
  318. package/vault/wiki/questions/Research: vcc extension for pi coding agent.md +73 -0
  319. package/vault/wiki/questions/how-to-enable-semantic-code-search-now.md +111 -0
  320. package/vault/wiki/questions/mvp-implementation-blueprint.md +552 -0
  321. package/vault/wiki/questions/research-agent-first-codebase-exploration.md +199 -0
  322. package/vault/wiki/questions/research-agentic-coding-harness-latest-papers.md +142 -0
  323. package/vault/wiki/questions/research-gitingest-gitreverse-integration.md +100 -0
  324. package/vault/wiki/questions/research-wozcode-token-reduction.md +67 -0
  325. package/vault/wiki/questions/resolved-context-pruning-inplace-vs-restart.md +95 -0
  326. package/vault/wiki/questions/resolved-context-window-economics.md +167 -0
  327. package/vault/wiki/questions/resolved-imad-debate-gating-transfer.md +126 -0
  328. package/vault/wiki/questions/resolved-mcp-tool-preference.md +112 -0
  329. package/vault/wiki/questions/resolved-small-model-meta-agents.md +107 -0
  330. package/vault/wiki/questions/resolved-treesitter-dynamic-languages.md +95 -0
  331. package/vault/wiki/sources/Auggie Context MCP Server.md +63 -0
  332. package/vault/wiki/sources/Augment Code Codacy AI Giants.md +61 -0
  333. package/vault/wiki/sources/Augment Code MCP SiliconAngle.md +49 -0
  334. package/vault/wiki/sources/Augment Code WorkOS ERC 2025.md +55 -0
  335. package/vault/wiki/sources/Augment Context Engine Official.md +71 -0
  336. package/vault/wiki/sources/Augment SWE-bench Agent GitHub.md +74 -0
  337. package/vault/wiki/sources/Augment SWE-bench Pro Blog.md +58 -0
  338. package/vault/wiki/sources/Source: AgentBus Jinja2 Prompt Pipelines.md +75 -0
  339. package/vault/wiki/sources/Source: Arxiv — Don't Break the Cache.md +85 -0
  340. package/vault/wiki/sources/Source: Augment - Harness Engineering for AI Coding Agents.md +58 -0
  341. package/vault/wiki/sources/Source: Blake Crosley Agent Architecture Guide.md +100 -0
  342. package/vault/wiki/sources/Source: Bolt.new Architecture & Case Study.md +75 -0
  343. package/vault/wiki/sources/Source: Build-Time Prompt Compilation Architecture.md +107 -0
  344. package/vault/wiki/sources/Source: Claude API Agent Skills Overview.md +70 -0
  345. package/vault/wiki/sources/Source: Gemini CLI Changelogs.md +88 -0
  346. package/vault/wiki/sources/Source: Google Blog - Gemini CLI Announcement.md +57 -0
  347. package/vault/wiki/sources/Source: Google Gemini CLI Architecture Docs.md +53 -0
  348. package/vault/wiki/sources/Source: LangChain - Anatomy of Agent Harness.md +65 -0
  349. package/vault/wiki/sources/Source: Lovable Architecture & Clone Analysis.md +83 -0
  350. package/vault/wiki/sources/Source: Martin Fowler - Harness Engineering.md +70 -0
  351. package/vault/wiki/sources/Source: OpenAI Harness Engineering Five Principles.md +58 -0
  352. package/vault/wiki/sources/Source: OpenAI Harness Engineering — 0 Lines of Human Code.md +101 -0
  353. package/vault/wiki/sources/Source: OpenDev — Building AI Coding Agents for the Terminal.md +100 -0
  354. package/vault/wiki/sources/Source: Render AI Coding Agents Benchmark 2025.md +53 -0
  355. package/vault/wiki/sources/Source: Rocket.new — Vibe Solutioning Platform.md +70 -0
  356. package/vault/wiki/sources/Source: SwirlAI Agent Skills Progressive Disclosure.md +71 -0
  357. package/vault/wiki/sources/Source: TianPan Prompt Caching Architecture.md +89 -0
  358. package/vault/wiki/sources/Source: Vercel Labs agent-browser.md +155 -0
  359. package/vault/wiki/sources/Source: browser-harness CDP Harness.md +126 -0
  360. package/vault/wiki/sources/agent-drift-academic-paper.md +79 -0
  361. package/vault/wiki/sources/aider-repomap-tree-sitter.md +42 -0
  362. package/vault/wiki/sources/anthropic-compaction-api.md +58 -0
  363. package/vault/wiki/sources/anthropic-effective-harnesses.md +42 -0
  364. package/vault/wiki/sources/anthropic-prompt-best-practices.md +100 -0
  365. package/vault/wiki/sources/anthropic2026-harness-design.md +63 -0
  366. package/vault/wiki/sources/barrel-files-tkdodo.md +38 -0
  367. package/vault/wiki/sources/birth-of-unix-kernighan-interview.md +57 -0
  368. package/vault/wiki/sources/bockeler2026-harness-engineering.md +69 -0
  369. package/vault/wiki/sources/cast-code-chunking-paper.md +50 -0
  370. package/vault/wiki/sources/ck-semantic-search.md +78 -0
  371. package/vault/wiki/sources/claude-code-architecture-karaxai-2026.md +71 -0
  372. package/vault/wiki/sources/claude-code-architecture-qubytes-2026.md +50 -0
  373. package/vault/wiki/sources/claude-code-architecture-vila-lab-2026.md +64 -0
  374. package/vault/wiki/sources/claude-code-security-architecture-penligent-2026.md +70 -0
  375. package/vault/wiki/sources/claude-context-editing-docs.md +13 -0
  376. package/vault/wiki/sources/cloudflare-codemode.md +63 -0
  377. package/vault/wiki/sources/code-chunk-library-supermemory.md +63 -0
  378. package/vault/wiki/sources/codeact-apple-2024.md +62 -0
  379. package/vault/wiki/sources/codex-dsc-rfc-8573.md +41 -0
  380. package/vault/wiki/sources/codex-open-source-agent-2026.md +110 -0
  381. package/vault/wiki/sources/coir-code-retrieval-benchmark.md +51 -0
  382. package/vault/wiki/sources/colinmcnamara-context-optimization-codemode.md +48 -0
  383. package/vault/wiki/sources/context-folding-paper.md +61 -0
  384. package/vault/wiki/sources/context-mode-website.md +63 -0
  385. package/vault/wiki/sources/cursor-agent-best-practices-2026.md +62 -0
  386. package/vault/wiki/sources/cursor-fork-29b-2025.md +50 -0
  387. package/vault/wiki/sources/cursor-harness-april-2026.md +76 -0
  388. package/vault/wiki/sources/cursor-instant-apply-2024.md +45 -0
  389. package/vault/wiki/sources/cursor-shadow-workspace-2024.md +52 -0
  390. package/vault/wiki/sources/cursor-shipped-coding-agent-2026.md +53 -0
  391. package/vault/wiki/sources/cursor-vs-antigravity-2026.md +51 -0
  392. package/vault/wiki/sources/disler-pi-vs-claude-code.md +69 -0
  393. package/vault/wiki/sources/distill-deterministic-context-compression.md +53 -0
  394. package/vault/wiki/sources/embedding-models-benchmark-supermemory-2025.md +48 -0
  395. package/vault/wiki/sources/executor-rhyssullivan.md +122 -0
  396. package/vault/wiki/sources/fallow-rs-codebase-intelligence.md +125 -0
  397. package/vault/wiki/sources/fan2025-imad.md +60 -0
  398. package/vault/wiki/sources/forgecode-gpt5-agent-improvements.md +63 -0
  399. package/vault/wiki/sources/gemini-3-prompting-guide.md +78 -0
  400. package/vault/wiki/sources/gh-cli-sub-issue-rfc.md +50 -0
  401. package/vault/wiki/sources/gh-sub-issue-extension.md +72 -0
  402. package/vault/wiki/sources/github-fork-issues-discussion.md +44 -0
  403. package/vault/wiki/sources/github-issue-dependencies-docs.md +49 -0
  404. package/vault/wiki/sources/github-sub-issues-docs.md +51 -0
  405. package/vault/wiki/sources/gitingest.md +91 -0
  406. package/vault/wiki/sources/gitreverse.md +63 -0
  407. package/vault/wiki/sources/google-antigravity-official-blog.md +47 -0
  408. package/vault/wiki/sources/google-antigravity-wikipedia.md +53 -0
  409. package/vault/wiki/sources/gsd-codecentric-deep-dive.md +57 -0
  410. package/vault/wiki/sources/gsd-github-repo.md +51 -0
  411. package/vault/wiki/sources/gsd-hn-discussion.md +59 -0
  412. package/vault/wiki/sources/guido-python-design-philosophy.md +56 -0
  413. package/vault/wiki/sources/hejlsberg-7-learnings.md +48 -0
  414. package/vault/wiki/sources/ironclaw-drift-monitor.md +80 -0
  415. package/vault/wiki/sources/langsight-loop-detection.md +80 -0
  416. package/vault/wiki/sources/leanctx-website.md +69 -0
  417. package/vault/wiki/sources/lee2026-meta-harness.md +59 -0
  418. package/vault/wiki/sources/linux-kernel-coding-workflow.md +50 -0
  419. package/vault/wiki/sources/lou2026-autoharness.md +53 -0
  420. package/vault/wiki/sources/martin-fowler-harness-engineering.md +73 -0
  421. package/vault/wiki/sources/mcp-architecture-docs.md +13 -0
  422. package/vault/wiki/sources/meng2026-agent-harness-survey.md +79 -0
  423. package/vault/wiki/sources/mindstudio-four-agent-types.md +68 -0
  424. package/vault/wiki/sources/ms-chat-history-management.md +13 -0
  425. package/vault/wiki/sources/openai-prompt-guidance.md +104 -0
  426. package/vault/wiki/sources/openclaw-session-pruning.md +13 -0
  427. package/vault/wiki/sources/opencode-dcp.md +13 -0
  428. package/vault/wiki/sources/opendev-arxiv-2603.05344v1.md +79 -0
  429. package/vault/wiki/sources/openhands-platform.md +39 -0
  430. package/vault/wiki/sources/oss-guide-codebase-exploration.md +53 -0
  431. package/vault/wiki/sources/pi-compaction-extensions-ecosystem.md +102 -0
  432. package/vault/wiki/sources/pi-context-prune-github-repo.md +38 -0
  433. package/vault/wiki/sources/pi-mono-compaction-docs.md +38 -0
  434. package/vault/wiki/sources/pi-omni-compact-github-repo.md +50 -0
  435. package/vault/wiki/sources/pi-rtk-optimizer-github-repo.md +45 -0
  436. package/vault/wiki/sources/pi-vcc-github-repo.md +69 -0
  437. package/vault/wiki/sources/pi-vscode-marketplace.md +41 -0
  438. package/vault/wiki/sources/pi-vscode-model-provider-marketplace.md +39 -0
  439. package/vault/wiki/sources/py-tree-sitter.md +13 -0
  440. package/vault/wiki/sources/sentrux-dev-landing.md +40 -0
  441. package/vault/wiki/sources/sentrux-docs-pro-architecture.md +75 -0
  442. package/vault/wiki/sources/sentrux-docs-quality-signal.md +46 -0
  443. package/vault/wiki/sources/sentrux-docs-root-cause-metrics.md +57 -0
  444. package/vault/wiki/sources/sentrux-docs-rules-engine.md +58 -0
  445. package/vault/wiki/sources/sentrux-github-repo.md +56 -0
  446. package/vault/wiki/sources/superpowers-github-repo.md +56 -0
  447. package/vault/wiki/sources/superpowers-release-blog.md +54 -0
  448. package/vault/wiki/sources/superpowers-termdock-analysis.md +45 -0
  449. package/vault/wiki/sources/swe-agent-aci.md +42 -0
  450. package/vault/wiki/sources/swe-bench.md +45 -0
  451. package/vault/wiki/sources/swe-pruner-context-pruning.md +13 -0
  452. package/vault/wiki/sources/think-in-code-blog.md +48 -0
  453. package/vault/wiki/sources/tree-sitter-docs.md +13 -0
  454. package/vault/wiki/sources/ts-best-practices-2025-devto.md +42 -0
  455. package/vault/wiki/sources/ts-folder-structure-mingyang.md +58 -0
  456. package/vault/wiki/sources/ts-monorepo-koerselman.md +44 -0
  457. package/vault/wiki/sources/ts-result-error-handling-kkalamarski.md +52 -0
  458. package/vault/wiki/sources/ts-runtimes-comparison-betterstack.md +42 -0
  459. package/vault/wiki/sources/ts-strict-mode-rishikc.md +43 -0
  460. package/vault/wiki/sources/unix-philosophy.md +48 -0
  461. package/vault/wiki/sources/vectara-chunking-vs-embedding-naacl2025.md +39 -0
  462. package/vault/wiki/sources/vectara-guardian-agents.md +79 -0
  463. package/vault/wiki/sources/vgrep-semantic-search.md +76 -0
  464. package/vault/wiki/sources/vitest-official.md +41 -0
  465. package/vault/wiki/sources/vscode-pi-community-extension.md +40 -0
  466. package/vault/wiki/sources/wozcode.md +79 -0
  467. package/.agents/skills/compress/SKILL.md +0 -111
  468. package/.agents/skills/compress/scripts/__init__.py +0 -9
  469. package/.agents/skills/compress/scripts/__main__.py +0 -3
  470. package/.agents/skills/compress/scripts/benchmark.py +0 -78
  471. package/.agents/skills/compress/scripts/cli.py +0 -73
  472. package/.agents/skills/compress/scripts/compress.py +0 -227
  473. package/.agents/skills/compress/scripts/detect.py +0 -121
  474. package/.agents/skills/compress/scripts/validate.py +0 -189
  475. package/.agents/skills/emil-design-eng/SKILL.md +0 -679
  476. package/.agents/skills/lean-ctx/SKILL.md +0 -149
  477. package/.agents/skills/lean-ctx/scripts/install.sh +0 -95
  478. package/.agents/skills/scrapling-official/LICENSE.txt +0 -28
  479. package/.agents/skills/scrapling-official/SKILL.md +0 -390
  480. package/.agents/skills/scrapling-official/examples/01_fetcher_session.py +0 -26
  481. package/.agents/skills/scrapling-official/examples/02_dynamic_session.py +0 -26
  482. package/.agents/skills/scrapling-official/examples/03_stealthy_session.py +0 -26
  483. package/.agents/skills/scrapling-official/examples/04_spider.py +0 -58
  484. package/.agents/skills/scrapling-official/examples/README.md +0 -45
  485. package/.agents/skills/scrapling-official/references/fetching/choosing.md +0 -78
  486. package/.agents/skills/scrapling-official/references/fetching/dynamic.md +0 -352
  487. package/.agents/skills/scrapling-official/references/fetching/static.md +0 -432
  488. package/.agents/skills/scrapling-official/references/fetching/stealthy.md +0 -255
  489. package/.agents/skills/scrapling-official/references/mcp-server.md +0 -214
  490. package/.agents/skills/scrapling-official/references/migrating_from_beautifulsoup.md +0 -86
  491. package/.agents/skills/scrapling-official/references/parsing/adaptive.md +0 -212
  492. package/.agents/skills/scrapling-official/references/parsing/main_classes.md +0 -586
  493. package/.agents/skills/scrapling-official/references/parsing/selection.md +0 -494
  494. package/.agents/skills/scrapling-official/references/spiders/advanced.md +0 -344
  495. package/.agents/skills/scrapling-official/references/spiders/architecture.md +0 -94
  496. package/.agents/skills/scrapling-official/references/spiders/getting-started.md +0 -164
  497. package/.agents/skills/scrapling-official/references/spiders/proxy-blocking.md +0 -235
  498. package/.agents/skills/scrapling-official/references/spiders/requests-responses.md +0 -196
  499. package/.agents/skills/scrapling-official/references/spiders/sessions.md +0 -205
  500. package/PLAN.md +0 -11
  501. package/extensions/lean-ctx-enforce.ts +0 -166
  502. package/skills-lock.json +0 -35
  503. package/wiki/README.md +0 -19
  504. package/wiki/decisions/0001-establish-project-wiki-and-decision-record-format.md +0 -25
  505. package/wiki/decisions/0002-add-project-banner-to-readme.md +0 -26
  506. package/wiki/decisions/0003-remove-redundant-readme-title-heading.md +0 -26
  507. package/wiki/decisions/0004-publish-package-to-npm-as-ultimate-pi.md +0 -26
  508. package/wiki/decisions/0005-automate-npm-publish-with-github-actions.md +0 -27
  509. package/wiki/decisions/0006-switch-to-npm-trusted-publishing.md +0 -26
  510. package/wiki/decisions/0007-use-absolute-banner-url-for-npm-readme-rendering.md +0 -26
  511. package/wiki/decisions/0008-rename-banner-asset-for-cache-busting.md +0 -26
  512. package/wiki/decisions/0009-force-oidc-path-by-clearing-node-auth-token-in-publish-step.md +0 -25
  513. package/wiki/decisions/0010-simplify-setup-node-for-npm-trusted-publishing.md +0 -26
  514. package/wiki/decisions/0011-add-noop-workflow-change-to-force-fresh-publish-run.md +0 -25
  515. package/wiki/decisions/0012-align-workflow-runtime-with-npm-trusted-publishing-requirements.md +0 -26
  516. package/wiki/decisions/0013-add-package-repository-url-for-provenance-validation.md +0 -25
@@ -1,344 +0,0 @@
1
- # Advanced usages
2
-
3
- ## Concurrency Control
4
-
5
- The spider system uses four class attributes to control how aggressively it crawls:
6
-
7
- | Attribute | Default | Description |
8
- |----------------------------------|---------|------------------------------------------------------------------|
9
- | `concurrent_requests` | `4` | Maximum number of requests being processed at the same time |
10
- | `concurrent_requests_per_domain` | `0` | Maximum concurrent requests per domain (0 = no per-domain limit) |
11
- | `download_delay` | `0.0` | Seconds to wait before each request |
12
- | `robots_txt_obey` | `False` | Respect robots.txt rules (Disallow, Crawl-delay, Request-rate) |
13
-
14
- ```python
15
- class PoliteSpider(Spider):
16
- name = "polite"
17
- start_urls = ["https://example.com"]
18
-
19
- # Be gentle with the server
20
- concurrent_requests = 4
21
- concurrent_requests_per_domain = 2
22
- download_delay = 1.0 # Wait 1 second between requests
23
-
24
- async def parse(self, response: Response):
25
- yield {"title": response.css("title::text").get("")}
26
- ```
27
-
28
- When `concurrent_requests_per_domain` is set, each domain gets its own concurrency limiter in addition to the global limit. This is useful when crawling multiple domains simultaneously - you can allow high global concurrency while being polite to each individual domain.
29
-
30
- **Tip:** The `download_delay` parameter adds a fixed wait before every request, regardless of the domain. Use it for simple rate limiting.
31
-
32
- ### Using uvloop
33
-
34
- The `start()` method accepts a `use_uvloop` parameter to use the faster [uvloop](https://github.com/MagicStack/uvloop)/[winloop](https://github.com/nicktimko/winloop) event loop implementation, if available:
35
-
36
- ```python
37
- result = MySpider().start(use_uvloop=True)
38
- ```
39
-
40
- This can improve throughput for I/O-heavy crawls. You'll need to install `uvloop` (Linux/macOS) or `winloop` (Windows) separately.
41
-
42
- ## Pause & Resume
43
-
44
- The spider supports graceful pause-and-resume via checkpointing. To enable it, pass a `crawldir` directory to the spider constructor:
45
-
46
- ```python
47
- spider = MySpider(crawldir="crawl_data/my_spider")
48
- result = spider.start()
49
-
50
- if result.paused:
51
- print("Crawl was paused. Run again to resume.")
52
- else:
53
- print("Crawl completed!")
54
- ```
55
-
56
- ### How It Works
57
-
58
- 1. **Pausing**: Press `Ctrl+C` during a crawl. The spider waits for all in-flight requests to finish, saves a checkpoint (pending requests + a set of seen request fingerprints), and then exits.
59
- 2. **Force stopping**: Press `Ctrl+C` a second time to stop immediately without waiting for active tasks.
60
- 3. **Resuming**: Run the spider again with the same `crawldir`. It detects the checkpoint, restores the queue and seen set, and continues from where it left off, skipping `start_requests()`.
61
- 4. **Cleanup**: When a crawl completes normally (not paused), the checkpoint files are deleted automatically.
62
-
63
- **Checkpoints are also saved periodically during the crawl (every 5 minutes by default).**
64
-
65
- You can change the interval as follows:
66
-
67
- ```python
68
- # Save checkpoint every 2 minutes
69
- spider = MySpider(crawldir="crawl_data/my_spider", interval=120.0)
70
- ```
71
-
72
- Checkpoint writes are atomic (written to a temporary file, then renamed into place), so an interrupted save can never corrupt an existing checkpoint.
73
-
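The temp-file-plus-rename pattern behind that guarantee can be sketched as follows (the function name and pickle payload are illustrative, not the library's actual code):

```python
import os
import pickle
import tempfile

def save_checkpoint(state: object, path: str) -> None:
    # Write to a temp file in the same directory, then rename over the
    # target: a crash mid-write can never leave a half-written checkpoint.
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp_path, path)  # atomic rename on POSIX and Windows
    except BaseException:
        os.remove(tmp_path)  # clean up the partial temp file
        raise
```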
74
- **Tip:** Pressing `Ctrl+C` during a crawl always causes the spider to close gracefully, even if the checkpoint system is not enabled. Doing it again without waiting forces the spider to close immediately.
75
-
76
- ### Knowing If You're Resuming
77
-
78
- The `on_start()` hook receives a `resuming` flag:
79
-
80
- ```python
81
- async def on_start(self, resuming: bool = False):
82
- if resuming:
83
- self.logger.info("Resuming from checkpoint!")
84
- else:
85
- self.logger.info("Starting fresh crawl")
86
- ```
87
-
88
- ## Development Mode
89
-
90
- When you're iterating on a spider's `parse()` logic, re-hitting the target servers on every run is slow and noisy. Development mode caches every response to disk on the first run and replays them from disk on subsequent runs, so you can tweak your selectors and re-run the spider as many times as you want without making a single network request.
91
-
92
- Enable it by setting `development_mode = True` on your spider:
93
-
94
- ```python
95
- class MySpider(Spider):
96
- name = "my_spider"
97
- start_urls = ["https://example.com"]
98
- development_mode = True
99
-
100
- async def parse(self, response: Response):
101
- yield {"title": response.css("title::text").get("")}
102
- ```
103
-
104
- The first run fetches normally and stores each response on disk. Every subsequent run serves the same requests from the cache, skipping the network entirely.
105
-
106
- ### Cache Location
107
-
108
- By default, responses are cached in `.scrapling_cache/{spider.name}/` relative to the current working directory (where you ran the spider from, **not** where the spider script lives). You can override the location with `development_cache_dir`:
109
-
110
- ```python
111
- class MySpider(Spider):
112
- name = "my_spider"
113
- start_urls = ["https://example.com"]
114
- development_mode = True
115
- development_cache_dir = "/tmp/my_spider_cache"
116
- ```
117
-
118
- ### How It Works
119
-
120
- 1. **Cache key**: Each response is keyed by the request's fingerprint, so any change to fingerprint-affecting attributes (`fp_include_kwargs`, `fp_include_headers`, `fp_keep_fragments`) will produce a fresh fetch.
121
- 2. **Storage format**: One JSON file per response, named `{fingerprint_hex}.json`. The body is base64-encoded so binary content is preserved exactly. Writes are atomic (temp file + rename).
122
- 3. **Replay**: On a cache hit, the engine skips the network entirely, including `download_delay`, rate limiting, and the `is_blocked()` retry path. The cached response goes straight to your callback.
123
- 4. **Stats**: Cached requests still count toward `requests_count`, `response_bytes`, and the per-status counters, so your stat output looks the same as a normal crawl. Two extra counters, `cache_hits` and `cache_misses`, let you see how the cache performed.
124
-
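Given that storage format, a cache directory can be inspected with a few lines of standard-library code (the path and JSON key here are illustrative):

```python
import base64
import json
from pathlib import Path

def inspect_cache(cache_dir: str) -> None:
    # Each file is {fingerprint_hex}.json with a base64-encoded body.
    for entry in sorted(Path(cache_dir).glob("*.json")):
        data = json.loads(entry.read_text())
        body = base64.b64decode(data["body"])
        print(f"{entry.stem}: {len(body)} bytes")
```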
125
- ### Clearing the Cache
126
-
127
- There's no automatic expiration. To force a fresh crawl, delete the cache directory or call the manager's `clear()` method directly.
128
-
129
- **Warning:** Development mode is meant for development, not production. Cached responses never expire, and replay bypasses rate limiting and blocked-request retries. Don't ship a spider with `development_mode = True`.
130
-
131
- ## Streaming
132
-
133
- For long-running spiders or applications that need real-time access to scraped items, use the `stream()` method instead of `start()`:
134
-
135
- ```python
136
- import anyio
137
-
138
- async def main():
139
- spider = MySpider()
140
- async for item in spider.stream():
141
- print(f"Got item: {item}")
142
- # Access real-time stats
143
- print(f"Items so far: {spider.stats.items_scraped}")
144
- print(f"Requests made: {spider.stats.requests_count}")
145
-
146
- anyio.run(main)
147
- ```
148
-
149
- Key differences from `start()`:
150
-
151
- - `stream()` must be called from an async context
152
- - Items are yielded one by one as they're scraped, not collected into a list
153
- - You can access `spider.stats` during iteration for real-time statistics
154
-
155
- **Note:** The full list of stats available on `spider.stats` is covered in [Results &amp; Statistics](#results--statistics) below.
156
-
157
- You can combine it with the checkpoint system too, which makes it easy to build UIs on top of spiders: UIs that show real-time data and can be paused and resumed.
158
-
159
- ```python
160
- import anyio
161
-
162
- async def main():
163
- spider = MySpider(crawldir="crawl_data/my_spider")
164
- async for item in spider.stream():
165
- print(f"Got item: {item}")
166
- # Access real-time stats
167
- print(f"Items so far: {spider.stats.items_scraped}")
168
- print(f"Requests made: {spider.stats.requests_count}")
169
-
170
- anyio.run(main)
171
- ```
172
- You can also call `spider.pause()` to shut down the spider in the code above. If the checkpoint system is not enabled, it simply closes the crawl without saving state.
173
-
174
- ## Lifecycle Hooks
175
-
176
- The spider provides several hooks you can override to add custom behavior at different stages of the crawl:
177
-
178
- ### on_start
179
-
180
- Called before crawling begins. Use it for setup tasks like loading data or initializing resources:
181
-
182
- ```python
183
- async def on_start(self, resuming: bool = False):
184
- self.logger.info("Spider starting up")
185
- # Load seed URLs from a database, initialize counters, etc.
186
- ```
187
-
188
- ### on_close
189
-
190
- Called after crawling finishes (whether completed or paused). Use it for cleanup:
191
-
192
- ```python
193
- async def on_close(self):
194
- self.logger.info("Spider shutting down")
195
- # Close database connections, flush buffers, etc.
196
- ```
197
-
198
- ### on_error
199
-
200
- Called when a request fails with an exception. Use it for error tracking or custom recovery logic:
201
-
202
- ```python
203
- async def on_error(self, request: Request, error: Exception):
204
- self.logger.error(f"Failed: {request.url} - {error}")
205
- # Log to error tracker, save failed URL for later, etc.
206
- ```
207
-
208
- ### on_scraped_item
209
-
210
- Called for every scraped item before it's added to the results. Return the item (modified or not) to keep it, or return `None` to drop it:
211
-
212
- ```python
213
- async def on_scraped_item(self, item: dict) -> dict | None:
214
- # Drop items without a title
215
- if not item.get("title"):
216
- return None
217
-
218
- # Modify items (e.g., add timestamps)
219
- item["scraped_at"] = "2026-01-01"
220
- return item
221
- ```
222
-
223
- **Tip:** This hook can also be used to direct items through your own pipelines and drop them from the spider.
224
-
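For example, a minimal JSONL sink (a hypothetical helper, not part of Scrapling) could be opened in `on_start()`, fed from `on_scraped_item()` (returning `None` so items stay out of `result.items`), and closed in `on_close()`:

```python
import json

class JsonlSink:
    """Hypothetical pipeline sink: appends each item as one JSON line."""

    def __init__(self, path: str):
        self._file = open(path, "a", encoding="utf-8")

    def process(self, item: dict) -> None:
        # One JSON object per line, flushed when the sink is closed
        self._file.write(json.dumps(item, ensure_ascii=False) + "\n")

    def close(self) -> None:
        self._file.close()
```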
225
- ### start_requests
226
-
227
- Override `start_requests()` for custom initial request generation instead of using `start_urls`:
228
-
229
- ```python
230
- async def start_requests(self):
231
- # POST request to log in first
232
- yield Request(
233
- "https://example.com/login",
234
- method="POST",
235
- data={"user": "admin", "pass": "secret"},
236
- callback=self.after_login,
237
- )
238
-
239
- async def after_login(self, response: Response):
240
- # Now crawl the authenticated pages
241
- yield response.follow("/dashboard", callback=self.parse)
242
- ```
243
-
244
- ## Results & Statistics
245
-
246
- The `CrawlResult` returned by `start()` contains both the scraped items and detailed statistics:
247
-
248
- ```python
249
- result = MySpider().start()
250
-
251
- # Items
252
- print(f"Total items: {len(result.items)}")
253
- result.items.to_json("output.json", indent=True)
254
-
255
- # Did the crawl complete?
256
- print(f"Completed: {result.completed}")
257
- print(f"Paused: {result.paused}")
258
-
259
- # Statistics
260
- stats = result.stats
261
- print(f"Requests: {stats.requests_count}")
262
- print(f"Failed: {stats.failed_requests_count}")
263
- print(f"Blocked: {stats.blocked_requests_count}")
264
- print(f"Offsite filtered: {stats.offsite_requests_count}")
265
- print(f"Robots.txt disallowed: {stats.robots_disallowed_count}")
266
- print(f"Cache hits: {stats.cache_hits}")
267
- print(f"Cache misses: {stats.cache_misses}")
268
- print(f"Items scraped: {stats.items_scraped}")
269
- print(f"Items dropped: {stats.items_dropped}")
270
- print(f"Response bytes: {stats.response_bytes}")
271
- print(f"Duration: {stats.elapsed_seconds:.1f}s")
272
- print(f"Speed: {stats.requests_per_second:.1f} req/s")
273
- ```
274
-
275
- ### Detailed Stats
276
-
277
- The `CrawlStats` object tracks granular information:
278
-
279
- ```python
280
- stats = result.stats
281
-
282
- # Status code distribution
283
- print(stats.response_status_count)
284
- # {'status_200': 150, 'status_404': 3, 'status_403': 1}
285
-
286
- # Bytes downloaded per domain
287
- print(stats.domains_response_bytes)
288
- # {'example.com': 1234567, 'api.example.com': 45678}
289
-
290
- # Requests per session
291
- print(stats.sessions_requests_count)
292
- # {'http': 120, 'stealth': 34}
293
-
294
- # Proxies used during the crawl
295
- print(stats.proxies)
296
- # ['http://proxy1:8080', 'http://proxy2:8080']
297
-
298
- # Log level counts
299
- print(stats.log_levels_counter)
300
- # {'debug': 200, 'info': 50, 'warning': 3, 'error': 1, 'critical': 0}
301
-
302
- # Timing information
303
- print(stats.start_time) # Unix timestamp when crawl started
304
- print(stats.end_time) # Unix timestamp when crawl finished
305
- print(stats.download_delay) # The download delay used (seconds)
306
-
307
- # Concurrency settings used
308
- print(stats.concurrent_requests) # Global concurrency limit
309
- print(stats.concurrent_requests_per_domain) # Per-domain concurrency limit
310
-
311
- # Custom stats (set by your spider code)
312
- print(stats.custom_stats)
313
- # {'login_attempts': 3, 'pages_with_errors': 5}
314
-
315
- # Export everything as a dict
316
- print(stats.to_dict())
317
- ```
318
-
319
- ## Logging
320
-
321
- The spider has a built-in logger accessible via `self.logger`. It's pre-configured with the spider's name and supports several customization options:
322
-
323
- | Attribute | Default | Description |
324
- |-----------------------|--------------------------------------------------------------|----------------------------------------------------|
325
- | `logging_level` | `logging.DEBUG` | Minimum log level |
326
- | `logging_format` | `"[%(asctime)s]:({spider_name}) %(levelname)s: %(message)s"` | Log message format |
327
- | `logging_date_format` | `"%Y-%m-%d %H:%M:%S"` | Date format in log messages |
328
- | `log_file` | `None` | Path to a log file (in addition to console output) |
329
-
330
- ```python
331
- import logging
332
-
333
- class MySpider(Spider):
334
- name = "my_spider"
335
- start_urls = ["https://example.com"]
336
- logging_level = logging.INFO
337
- log_file = "logs/my_spider.log"
338
-
339
- async def parse(self, response: Response):
340
- self.logger.info(f"Processing {response.url}")
341
- yield {"title": response.css("title::text").get("")}
342
- ```
343
-
344
- The log file directory is created automatically if it doesn't exist. Both console and file output use the same format.
@@ -1,94 +0,0 @@
1
- # Spiders architecture
2
-
3
- Scrapling's spider system is an async crawling framework designed for concurrent, multi-session crawls with built-in pause/resume support. It brings together Scrapling's parsing engine and fetchers into a unified crawling API while adding scheduling, concurrency control, and checkpointing.
4
-
5
- ## Data Flow
6
-
7
- Here's what happens, step by step, as data flows through the spider system when you run a spider:
10
-
11
- 1. The **Spider** produces the first batch of `Request` objects. By default, it creates one request for each URL in `start_urls`, but you can override `start_requests()` for custom logic.
12
- 2. The **Scheduler** receives requests and places them in a priority queue, and creates fingerprints for them. Higher-priority requests are dequeued first.
13
- 3. The **Crawler Engine** asks the **Scheduler** to dequeue the next request, respecting concurrency limits (global and per-domain) and download delays. If `robots_txt_obey` is enabled, the engine checks the domain's robots.txt rules before proceeding -- disallowed requests are dropped silently. Once the **Crawler Engine** receives the request, it passes it to the **Session Manager**, which routes it to the correct session based on the request's `sid` (session ID).
14
- 4. The **session** fetches the page and returns a [Response](../fetching/choosing.md#response-object) object to the **Crawler Engine**. The engine records statistics and checks for blocked responses. If the response is blocked, the engine retries the request up to `max_blocked_retries` times. Both the blocking detection and the retry logic for blocked requests can be customized.
15
- 5. The **Crawler Engine** passes the [Response](../fetching/choosing.md#response-object) to the request's callback. The callback either yields a dictionary, which gets treated as a scraped item, or a follow-up request, which gets sent to the scheduler for queuing.
16
- 6. The cycle repeats from step 2 until the scheduler is empty and no tasks are active, or the spider is paused.
17
- 7. If `crawldir` is set while starting the spider, the **Crawler Engine** periodically saves a checkpoint (pending requests + seen URLs set) to disk. On graceful shutdown (Ctrl+C), a final checkpoint is saved. The next time the spider runs with the same `crawldir`, it resumes from where it left off, skipping `start_requests()` and restoring the scheduler state.
18
-
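The steps above can be condensed into a deliberately simplified, sequential sketch (the stub `Scheduler` and `Sessions` classes stand in for the real components, which add priorities, concurrency limits, delays, robots.txt checks, retries, and checkpointing):

```python
import asyncio
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:                      # stand-in for the real Request
    url: str
    callback: object = None

class Scheduler:                    # stand-in: FIFO instead of priority queue
    def __init__(self):
        self._queue = deque()
    def empty(self) -> bool:
        return not self._queue
    async def enqueue(self, request: Request) -> None:
        self._queue.append(request)
    async def dequeue(self) -> Request:
        return self._queue.popleft()

class Sessions:                     # stand-in: "fetches" by echoing the URL
    async def fetch(self, request: Request) -> str:
        return request.url

async def crawl_loop(scheduler: Scheduler, sessions: Sessions, items: list) -> None:
    while not scheduler.empty():
        request = await scheduler.dequeue()               # step 3
        response = await sessions.fetch(request)          # step 4
        async for result in request.callback(response):   # step 5
            if isinstance(result, dict):
                items.append(result)                      # scraped item
            else:
                await scheduler.enqueue(result)           # follow-up request

async def main() -> list:
    async def parse(response):
        yield {"url": response}
        if response == "https://example.com":
            yield Request("https://example.com/page2", callback=parse)

    scheduler, items = Scheduler(), []
    await scheduler.enqueue(Request("https://example.com", callback=parse))
    await crawl_loop(scheduler, Sessions(), items)
    return items

print(asyncio.run(main()))
# → [{'url': 'https://example.com'}, {'url': 'https://example.com/page2'}]
```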
19
-
20
- ## Components
21
-
22
- ### Spider
23
-
24
- The central class you interact with. You subclass `Spider`, define your `start_urls` and `parse()` method, and optionally configure sessions and override lifecycle hooks.
25
-
26
- ```python
27
- from scrapling.spiders import Spider, Response, Request
28
-
29
- class MySpider(Spider):
30
- name = "my_spider"
31
- start_urls = ["https://example.com"]
32
-
33
- async def parse(self, response: Response):
34
- for link in response.css("a::attr(href)").getall():
35
- yield response.follow(link, callback=self.parse_page)
36
-
37
- async def parse_page(self, response: Response):
38
- yield {"title": response.css("h1::text").get("")}
39
- ```
40
-
41
- ### Crawler Engine
42
-
43
- The engine orchestrates the entire crawl. It manages the main loop, enforces concurrency limits, dispatches requests through the Session Manager, and processes results from callbacks. You don't interact with it directly - the `Spider.start()` and `Spider.stream()` methods handle it for you.
44
-
45
- ### Scheduler
46
-
47
- A priority queue with built-in URL deduplication. Requests are fingerprinted based on their URL, HTTP method, body, and session ID. The scheduler supports `snapshot()` and `restore()` for the checkpoint system, allowing the crawl state to be saved and resumed.
48
-
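A fingerprint of that shape can be sketched with `hashlib` (the hash inputs match the description above, but the exact recipe is illustrative, not the library's actual code):

```python
import hashlib

def fingerprint(url: str, method: str = "GET", body: bytes = b"", sid: str = "default") -> str:
    # Same URL + method + body + session ID always hash to the same value,
    # so duplicates can be dropped with a simple set membership test.
    digest = hashlib.sha256()
    for part in (method.upper().encode(), url.encode(), body, sid.encode()):
        digest.update(part)
        digest.update(b"\x00")  # separator prevents ambiguous concatenations
    return digest.hexdigest()

seen: set[str] = set()
fp = fingerprint("https://example.com/page")
is_duplicate = fp in seen
seen.add(fp)
```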
49
- ### Session Manager
50
-
51
- Manages one or more named session instances. Each session is one of:
52
-
53
- - [FetcherSession](../fetching/static.md)
54
- - [AsyncDynamicSession](../fetching/dynamic.md)
55
- - [AsyncStealthySession](../fetching/stealthy.md)
56
-
57
- When a request comes in, the Session Manager routes it to the correct session based on the request's `sid` field. Sessions can be started when the spider starts (the default) or lazily, on first use.
58
-
59
- ### Checkpoint System
60
-
61
- An optional system that, if enabled, saves the crawler's state (pending requests + seen URL fingerprints) to a pickle file on disk. Writes are atomic (temp file + rename) to prevent corruption. Checkpoints are saved periodically at a configurable interval and on graceful shutdown. Upon successful completion (not paused), checkpoint files are automatically cleaned up.
62
-
63
- ### Response Cache
64
-
65
- An optional cache that, when development mode is enabled, stores every fetched response on disk and replays it on subsequent runs. Each response is keyed by request fingerprint and serialized as JSON (with the body base64-encoded so binary content survives). It's meant for iterating on `parse()` logic without re-hitting the target servers, not for production use.
66
-
67
- ### Output
68
-
69
- Scraped items are collected in an `ItemList` (a list subclass with `to_json()` and `to_jsonl()` export methods). Crawl statistics are tracked in a `CrawlStats` dataclass covering request counts, byte totals, timing, and more.
70
-
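A list subclass with that interface could look roughly like this (a sketch of the idea, not Scrapling's actual implementation):

```python
import json
from pathlib import Path

class ItemList(list):
    """Sketch: a plain list plus JSON / JSON Lines export helpers."""

    def to_json(self, path: str, indent: bool = False) -> None:
        target = Path(path)
        target.parent.mkdir(parents=True, exist_ok=True)  # create parent dirs
        target.write_text(json.dumps(self, indent=2 if indent else None))

    def to_jsonl(self, path: str) -> None:
        target = Path(path)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("".join(json.dumps(item) + "\n" for item in self))
```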
71
-
72
- ## Comparison with Scrapy
73
-
74
- If you're coming from Scrapy, here's how Scrapling's spider system maps:
75
-
76
- | Concept | Scrapy | Scrapling |
77
- |--------------------|-------------------------------|-----------------------------------------------------------------|
78
- | Spider definition | `scrapy.Spider` subclass | `scrapling.spiders.Spider` subclass |
79
- | Initial requests | `start_requests()` | `async start_requests()` |
80
- | Callbacks | `def parse(self, response)` | `async def parse(self, response)` |
81
- | Following links | `response.follow(url)` | `response.follow(url)` |
82
- | Item output | `yield dict` or `yield Item` | `yield dict` |
83
- | Request scheduling | Scheduler + Dupefilter | Scheduler with built-in deduplication |
84
- | Downloading | Downloader + Middlewares | Session Manager with multi-session support |
85
- | Item processing | Item Pipelines | `on_scraped_item()` hook |
86
- | Blocked detection | Through custom middlewares | Built-in `is_blocked()` + `retry_blocked_request()` hooks |
87
- | Concurrency | `CONCURRENT_REQUESTS` setting | `concurrent_requests` class attribute |
88
- | Domain filtering | `allowed_domains` | `allowed_domains` |
89
- | Robots.txt | `ROBOTSTXT_OBEY` setting | `robots_txt_obey` class attribute |
90
- | Pause/Resume | `JOBDIR` setting | `crawldir` constructor argument |
91
- | Export | Feed exports | `result.items.to_json()` / `to_jsonl()` or custom through hooks |
92
- | Running | `scrapy crawl spider_name` | `MySpider().start()` |
93
- | Streaming | N/A | `async for item in spider.stream()` |
94
- | Multi-session | N/A | Multiple sessions with different types per spider |
@@ -1,164 +0,0 @@
1
- # Getting started
2
-
3
- ## Your First Spider
4
-
5
- A spider is a class that defines how to crawl and extract data from websites. Here's the simplest possible spider:
6
-
7
- ```python
8
- from scrapling.spiders import Spider, Response
9
-
10
- class QuotesSpider(Spider):
11
- name = "quotes"
12
- start_urls = ["https://quotes.toscrape.com"]
13
-
14
- async def parse(self, response: Response):
15
- for quote in response.css("div.quote"):
16
- yield {
17
- "text": quote.css("span.text::text").get(""),
18
- "author": quote.css("small.author::text").get(""),
19
- }
20
- ```
21
-
22
- Every spider needs three things:
23
-
24
- 1. **`name`**: A unique identifier for the spider.
25
- 2. **`start_urls`**: A list of URLs to start crawling from.
26
- 3. **`parse()`**: An async generator method that processes each response and yields results.
27
-
28
- Inside `parse()`, you use the same selection methods you'd use with Scrapling's [Selector](../parsing/main_classes.md#selector)/[Response](../fetching/choosing.md#response-object), and `yield` dictionaries to output scraped items.
29
-
30
- ## Running the Spider
31
-
32
- To run your spider, create an instance and call `start()`:
33
-
34
- ```python
35
- result = QuotesSpider().start()
36
- ```
37
-
38
- The `start()` method handles all the async machinery internally, so there is no need to manage event loops yourself. While the spider is running, progress is logged to the terminal, and at the end of the crawl you get a detailed stats summary.
39
-
40
- Those stats are in the returned `CrawlResult` object, which gives you everything you need:
41
-
42
- ```python
43
- result = QuotesSpider().start()
44
-
45
- # Access scraped items
46
- for item in result.items:
47
- print(item["text"], "-", item["author"])
48
-
49
- # Check statistics
50
- print(f"Scraped {result.stats.items_scraped} items")
51
- print(f"Made {result.stats.requests_count} requests")
52
- print(f"Took {result.stats.elapsed_seconds:.1f} seconds")
53
-
54
- # Did the crawl finish or was it paused?
55
- print(f"Completed: {result.completed}")
56
- ```
57
-
58
- ## Following Links
59
-
60
- Most crawls need to follow links across multiple pages. Use `response.follow()` to create follow-up requests:
61
-
62
- ```python
63
- from scrapling.spiders import Spider, Response
64
-
65
- class QuotesSpider(Spider):
66
- name = "quotes"
67
- start_urls = ["https://quotes.toscrape.com"]
68
-
69
- async def parse(self, response: Response):
70
- # Extract items from the current page
71
- for quote in response.css("div.quote"):
72
- yield {
73
- "text": quote.css("span.text::text").get(""),
74
- "author": quote.css("small.author::text").get(""),
75
- }
76
-
77
- # Follow the "next page" link
78
- next_page = response.css("li.next a::attr(href)").get()
79
- if next_page:
80
- yield response.follow(next_page, callback=self.parse)
81
- ```
82
-
83
- `response.follow()` handles relative URLs automatically by joining them with the current page's URL. It also sets the current page as the `Referer` header by default.
84
-
85
- You can point follow-up requests at different callback methods for different page types:
86
-
87
- ```python
88
- async def parse(self, response: Response):
89
- for link in response.css("a.product-link::attr(href)").getall():
90
- yield response.follow(link, callback=self.parse_product)
91
-
92
- async def parse_product(self, response: Response):
93
- yield {
94
- "name": response.css("h1::text").get(""),
95
- "price": response.css(".price::text").get(""),
96
- }
97
- ```
98
-
99
- **Note:** All callback methods must be async generators (using `async def` and `yield`).
100
-
101
- ## Exporting Data
102
-
103
- The `ItemList` returned in `result.items` has built-in export methods:
104
-
105
- ```python
106
- result = QuotesSpider().start()
107
-
108
- # Export as JSON
109
- result.items.to_json("quotes.json")
110
-
111
- # Export as JSON with pretty-printing
112
- result.items.to_json("quotes.json", indent=True)
113
-
114
- # Export as JSON Lines (one JSON object per line)
115
- result.items.to_jsonl("quotes.jsonl")
116
- ```
117
-
118
- Both methods create parent directories automatically if they don't exist.
119
-
120
- ## Filtering Domains
121
-
122
- Use `allowed_domains` to restrict the spider to specific domains. This prevents it from accidentally following links to external websites:
123
-
124
- ```python
125
- class MySpider(Spider):
126
- name = "my_spider"
127
- start_urls = ["https://example.com"]
128
- allowed_domains = {"example.com"}
129
-
130
- async def parse(self, response: Response):
131
- for link in response.css("a::attr(href)").getall():
132
- # Links to other domains are silently dropped
133
- yield response.follow(link, callback=self.parse)
134
- ```
135
-
136
- Subdomains are matched automatically, so setting `allowed_domains = {"example.com"}` also allows `sub.example.com`, `blog.example.com`, etc.
137
-
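That matching rule boils down to a suffix check on the hostname, roughly like this (illustrative, not the library's actual code):

```python
from urllib.parse import urlparse

def is_allowed(url: str, allowed_domains: set[str]) -> bool:
    # Exact domain match, or any subdomain (host ends with ".domain")
    host = (urlparse(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain)
               for domain in allowed_domains)
```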
138
- When a request is filtered out, it's counted in `stats.offsite_requests_count` so you can see how many were dropped.
139
-
140
- ## Robots.txt Compliance
141
-
142
- Set `robots_txt_obey = True` to make the spider respect robots.txt rules before crawling any domain:
143
-
144
- ```python
145
- class PoliteSpider(Spider):
146
- name = "polite"
147
- start_urls = ["https://example.com"]
148
- robots_txt_obey = True
149
-
150
- async def parse(self, response: Response):
151
- for link in response.css("a::attr(href)").getall():
152
- yield response.follow(link, callback=self.parse)
153
- ```
154
-
155
- When enabled, the spider will:
156
-
157
- 1. **Pre-fetch robots.txt** for all domains in `start_urls` before the crawl begins (concurrently).
158
- 2. **Check every request** against the domain's robots.txt `Disallow` rules. Disallowed requests are silently dropped and counted in `stats.robots_disallowed_count`.
159
- 3. **Respect `Crawl-delay` and `Request-rate` directives** by taking the maximum of the directive and your configured `download_delay`. This means robots.txt delays never reduce your configured delay, only increase it when needed.
160
-
161
- Robots.txt files are fetched using the spider's default session and cached per domain for the entire crawl. Domains discovered mid-crawl (not in `start_urls`) have their robots.txt fetched on the first request to that domain.
162
-
163
- **Note:** `robots_txt_obey` is turned off by default. It does not affect your concurrency settings -- only the delay between requests is adjusted.
164
-