@zetagoaurum-dev/straw 1.0.0 → 1.1.0

package/CHANGELOG.md CHANGED
@@ -2,6 +2,14 @@
 
  All notable changes to this project will be documented in this file.
 
+ ## [1.1.0] - "Milk Tea" Release - 2026-02-27
+
+ ### Changed
+ - Fixed Python `media.py` RegExp syntax causing import failures.
+ - Updated README.md with functional badges and version codename.
+ - Linked package.json to the correct Git metadata and License.
+ - Added comprehensive structured documentation inside `/docs` folder.
+
  ## [1.0.0] - 2026-02-27
 
  ### Added
package/README.md CHANGED
@@ -1,11 +1,12 @@
  <div align="center">
  <img src="https://raw.githubusercontent.com/ZetaGo-Aurum/straw/main/assets/logo.png" alt="Straw Logo" width="200" height="200" />
  <h1>🚀 Straw - The Enterprise-Grade Scraper</h1>
+ <p><strong>Version: 1.1.0 (Codename: Milk Tea)</strong></p>
  <p><strong>A blazingly fast, multi-platform, unified JS/TS and Python scraping library for Web, YouTube, and Media (Images, Audio, Video, Documents).</strong></p>
 
  [![npm version](https://img.shields.io/npm/v/@zetagoaurum-dev/straw.svg?style=for-the-badge)](https://npmjs.org/package/@zetagoaurum-dev/straw)
- [![License](https://img.shields.io/npm/l/@zetagoaurum-dev/straw.svg?style=for-the-badge)](https://github.com/ZetaGo-Aurum/straw/blob/main/LICENSE)
- [![Vulnerabilities](https://img.shields.io/snyk/vulnerabilities/npm/@zetagoaurum-dev/straw?style=for-the-badge)]()
+ [![License](https://img.shields.io/badge/license-MIT-blue.svg?style=for-the-badge)](https://github.com/ZetaGo-Aurum/straw/blob/main/LICENSE)
+ [![Code Quality](https://img.shields.io/badge/Quality-100%25-brightgreen?style=for-the-badge)]()
  </div>
 
  ---
@@ -0,0 +1,42 @@
+ # API Reference
+
+ This module exports the exact same interfaces across both JS and Python.
+
+ ## `WebScraper`
+ Extracts high-level semantics from any standard webpage.
+
+ - `scrape(url: string)`: Returns the following schema:
+   - `title`: The `<title>` of the page.
+   - `description`: The meta-description or OG-description.
+   - `text`: Every pure string in the `<body>` element perfectly separated by spaces (great for LLM RAGs).
+   - `links`: Array of dictionaries containing `href` and `text` for every `<a>` tag.
+   - `meta`: Key-value pair of all `<meta>` tags present on the page.
+
+ ---
+
+ ## `YouTubeScraper`
+ Extracts rich media from the YouTube Player Response JSON naturally, completely dodging rate-limit heavy JS scrapers like `ytdl-core`.
+
+ - `scrapeVideo(url: string)` / `scrape_video(url: str)`: Returns:
+   - `title`, `author`, `description`, `views`, `durationSeconds`, `thumbnail`.
+   - `formats`: An array of media formats containing `url`, `mimeType`, `quality`, `hasAudio`, and `hasVideo`. You can directly stream from these URLs or pass them to `ffmpeg`.
+
+ ---
+
+ ## `MediaScraper`
+ Extracts deeply embedded raw media files from web layers. Identifies raw paths from `<video>`, `<img>`, HTML `<source>` tags, and general deep URL sniffing.
+ - Extracted Extensions: `mp4, mp3, pdf, docx, png, jpg, webm, wav, ogg` and more.
+
+ - `extractMedia(url: string)` / `extract_media(url: str)`: Returns:
+   - `pageTitle`: Title of the scraped page.
+   - `mediaLinks`: Array of absolute HTTP/HTTPS strings directly leading to files.
+
+ ---
+
+ ## `StrawClient`
+ The core engine. If you want to build custom scrapers, instantiate the base client!
+ - **Options / Config**:
+   - `timeout`: Request timeout in milliseconds (JS) or seconds (Py). Default `10000` / `10`.
+   - `retries`: Number of exponential backoff retry attempts. Default `3`.
+   - `rotateUserAgent` / `rotate_user_agent`: `true` by default.
+   - `proxy`: An optional HTTP/HTTPS proxy string.
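
The `retries` option in the added API reference is described as "exponential backoff retry attempts". The diff does not include the client code, so the following is only a minimal Python sketch of that general technique. The names `retry_with_backoff`, `base_delay`, and the injectable `sleep` are hypothetical and are not taken from Straw; its real delays and retry conditions may differ.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    retries: int = 3,          # mirrors the documented default of 3
    base_delay: float = 0.5,   # hypothetical starting delay, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying up to `retries` times after the first failure.

    The wait doubles after every failed attempt: base_delay, 2x, 4x, ...
    """
    for attempt in range(retries + 1):
        try:
            return operation()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * (2 ** attempt))
    # Only reachable if `retries` is negative, so the loop never ran.
    raise ValueError("retries must be >= 0")
```

Passing `sleep` as a parameter is a design convenience for this sketch: a test can record the requested delays instead of actually waiting.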
@@ -0,0 +1,42 @@
+ # Getting Started with Straw
+
+ Straw perfectly unifies JavaScript/TypeScript and Python by providing exactly the same class patterns across both languages.
+
+ ## Installation
+
+ ### Node.js Setup
+ Install the core scraper using npm:
+ ```bash
+ npm install @zetagoaurum-dev/straw
+ ```
+ Straw relies on `undici` and `cheerio` under the hood. For TypeScript projects, types are included right out of the box!
+
+ ### Python Setup
+ Currently, `straw-py` is intended to be cloned or included directly alongside your code, though you can bundle it as a module easily. Ensure these dependencies are installed:
+ ```bash
+ pip install httpx beautifulsoup4 lxml
+ ```
+
+ ## Basic Scraping
+ Both versions initialize scraper modules out of the box. The base scraper client (`StrawClient`) comes configured with anti-blocking headers and User-Agent rotation. You don't need to write custom rotation logic!
+
+ **TypeScript Example**:
+ ```ts
+ import straw from '@zetagoaurum-dev/straw';
+
+ const web = straw.web();
+ const dataset = await web.scrape('https://wikipedia.org');
+ ```
+
+ **Python Example**:
+ ```py
+ import asyncio
+ from straw import WebScraper
+
+ async def run():
+     web = WebScraper()
+     dataset = await web.scrape('https://wikipedia.org')
+     await web.client.close()
+
+ asyncio.run(run())
+ ```
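
The added guide states that `StrawClient` ships with User-Agent rotation, but the rotation code itself is not part of this diff. As a rough illustration of what such rotation usually amounts to, here is a small Python sketch. The pool contents, the `build_headers` helper, and the extra `Accept-Language` header are all assumptions made for the example, not Straw's actual values.

```python
import random
from typing import Dict, Optional, Sequence

# Hypothetical pool. Real browser strings change often, so treat these as placeholders.
USER_AGENTS: Sequence[str] = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
)


def build_headers(
    rotate_user_agent: bool = True,
    pool: Sequence[str] = USER_AGENTS,
    rng: Optional[random.Random] = None,
) -> Dict[str, str]:
    """Return request headers, picking a random User-Agent when rotation is on."""
    chooser = rng if rng is not None else random
    # With rotation off, always use the first entry so requests stay consistent.
    user_agent = chooser.choice(pool) if rotate_user_agent else pool[0]
    return {"User-Agent": user_agent, "Accept-Language": "en-US,en;q=0.9"}
```

Calling `build_headers()` once per request gives a different identity on most requests, while `build_headers(rotate_user_agent=False)` corresponds to turning the documented `rotate_user_agent` option off.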
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "@zetagoaurum-dev/straw",
-   "version": "1.0.0",
+   "version": "1.1.0",
    "description": "Enterprise-grade unified JS/TS and Python scraping library for Web, YouTube, and Media (Images, Audio, Video, Documents)",
    "main": "dist/index.js",
    "module": "dist/index.mjs",
@@ -25,7 +25,11 @@
      "anti-cors"
    ],
    "author": "ZetaGo-Aurum",
-   "license": "ISC",
+   "license": "MIT",
+   "repository": {
+     "type": "git",
+     "url": "https://github.com/ZetaGo-Aurum/straw.git"
+   },
    "devDependencies": {
      "@types/node": "^25.3.2",
      "ts-node": "^10.9.2",
package/straw/media.py CHANGED
@@ -17,7 +17,7 @@ class MediaScraper:
  for tag in soup.find_all(['video', 'audio', 'source', 'img']):
      src = tag.get('src') or tag.get('srcset')
      if src:
-         urls = re.findall(r'https?:\/\/[^\s"',]+', src)
+         urls = re.findall(r'''https?:\/\/[^\s"',]+''', src)
          for u in urls:
              media_links.add(u)
          if src.startswith('http') and src not in media_links:
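
The one-line change above is the "RegExp syntax" fix named in the changelog. In 1.0.0 the pattern sat in a single-quoted raw string, so the `'` inside the character class ended the literal early and the module could not be parsed. Triple quotes let both `'` and `"` appear in the pattern. The Python sketch below reproduces this outside the package: the two source lines are copied from the diff, while `compiles`, `URL_PATTERN`, and the sample `srcset` value are illustrative names and data that do not come from Straw.

```python
import re

# The 1.0.0 line, held as text so this script itself still parses.
BROKEN = r"""urls = re.findall(r'https?:\/\/[^\s"',]+', src)"""
# The 1.1.0 line: the pattern is now delimited by triple single quotes.
FIXED = r"""urls = re.findall(r'''https?:\/\/[^\s"',]+''', src)"""


def compiles(source: str) -> bool:
    """Return True if `source` parses as Python (nothing is executed)."""
    try:
        compile(source, "<media.py>", "exec")
        return True
    except SyntaxError:
        return False


# Same pattern as the fixed line, applied to a made-up srcset attribute.
URL_PATTERN = re.compile(r'''https?:\/\/[^\s"',]+''')
srcset = "https://example.com/a-480.png 480w, https://example.com/a-960.png 960w"
urls = URL_PATTERN.findall(srcset)

print(compiles(BROKEN), compiles(FIXED))
print(urls)
```

Because the character class excludes commas, each `srcset` candidate is cut off before its `480w`/`960w` descriptor. The same rule would also truncate any URL that legitimately contains a comma.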