kabigon-0.14.2-py3-none-any.whl

@@ -0,0 +1,319 @@
+ Metadata-Version: 2.4
+ Name: kabigon
+ Version: 0.14.2
+ Author-email: narumi <toucans-cutouts0f@icloud.com>
+ License-File: LICENSE
+ Requires-Python: >=3.12
+ Requires-Dist: firecrawl-py>=2.4.1
+ Requires-Dist: httpx>=0.28.1
+ Requires-Dist: loguru>=0.7.3
+ Requires-Dist: markdownify>=0.14.1
+ Requires-Dist: openai-whisper>=20250625
+ Requires-Dist: playwright>=1.52.0
+ Requires-Dist: pypdf>=5.3.0
+ Requires-Dist: rich>=13.9.4
+ Requires-Dist: typer>=0.15.3
+ Requires-Dist: youtube-transcript-api>=1.2.2
+ Requires-Dist: yt-dlp>=2025.4.30
+ Description-Content-Type: text/markdown
+
+ # kabigon
+
+ [![PyPI version](https://badge.fury.io/py/kabigon.svg)](https://badge.fury.io/py/kabigon)
+ [![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+ [![codecov](https://codecov.io/gh/narumiruna/kabigon/branch/main/graph/badge.svg)](https://codecov.io/gh/narumiruna/kabigon)
+
+ A URL content loader library that extracts content from various sources (YouTube, Instagram Reels, Twitter/X, Reddit, Truth Social, GitHub files, PDFs, web pages) and converts them to text/markdown format.
+
+ ## Features
+
+ ✹ **Multi-Platform Support**: YouTube, Twitter/X, Truth Social, Reddit, Instagram Reels, PTT, GitHub files, PDFs, and generic web pages
+
+ 🔄 **Async-First Design**: Built with async/await for efficient parallel processing
+
+ 🎯 **Smart Fallback**: Automatically tries multiple extraction strategies until one succeeds
+
+ 🚀 **Simple API**: Single-line usage with sensible defaults, or full control with custom loader chains
+
+ 🔌 **Extensible**: Easy to add new loaders for additional platforms
+
+ ## Table of Contents
+
+ - [Installation](#installation)
+ - [Usage](#usage)
+ - [CLI](#cli)
+ - [Python API - Sync](#python-api---sync)
+ - [Python API - Async](#python-api---async)
+ - [Supported Sources](#supported-sources)
+ - [Examples](#examples)
+ - [Troubleshooting](#troubleshooting)
+ - [Development](#development)
+ - [License](#license)
+
+ ## Installation
+
+ ```shell
+ pip install kabigon
+ playwright install chromium
+ ```
+
+ ## Usage
+
+ ### CLI
+
+ ```shell
+ kabigon <url>
+
+ # Examples
+ kabigon https://www.youtube.com/watch?v=dQw4w9WgXcQ
+ kabigon https://x.com/elonmusk/status/123456789
+ kabigon https://truthsocial.com/@realDonaldTrump/posts/123456
+ kabigon https://reddit.com/r/python/comments/xyz/...
+ kabigon https://github.com/anthropics/claude-code/blob/main/plugins/ralph-wiggum/README.md
+ kabigon https://example.com/document.pdf
+ ```
+
+ ### Python API - Sync
+
+ ```python
+ import kabigon
+
+ url = "https://www.google.com.tw"
+
+ # Simplest usage - automatically uses the best loader
+ content = kabigon.load_url_sync(url)
+ print(content)
+
+ # Or use specific loader
+ content = kabigon.PlaywrightLoader().load_sync(url)
+ print(content)
+
+ # With multiple loaders (tries each in order)
+ loader = kabigon.Compose([
+     kabigon.TwitterLoader(),
+     kabigon.TruthSocialLoader(),
+     kabigon.YoutubeLoader(),
+     kabigon.RedditLoader(),
+     kabigon.PDFLoader(),
+     kabigon.PlaywrightLoader(),  # Fallback for generic URLs
+ ])
+ content = loader.load_sync(url)
+ print(content)
+ ```
+
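The "tries each in order" behavior shown above can be pictured with a small stand-alone sketch. This is not kabigon's actual `Compose` implementation: `FallbackChain`, `LoaderError`, and the two toy loaders are hypothetical names introduced only for illustration, and the sketch assumes that a loader signals failure by raising an exception.

```python
import asyncio
from collections.abc import Awaitable, Callable

# Hypothetical type for illustration; kabigon's real classes may differ.
LoadFn = Callable[[str], Awaitable[str]]


class LoaderError(Exception):
    """Raised when a loader cannot handle the given URL."""


class FallbackChain:
    """Try each loader in order and return the first successful result."""

    def __init__(self, loaders: list[LoadFn]) -> None:
        self.loaders = loaders

    async def load(self, url: str) -> str:
        errors: list[Exception] = []
        for loader in self.loaders:
            try:
                return await loader(url)
            except Exception as exc:
                # Remember the failure and fall through to the next loader.
                errors.append(exc)
        raise LoaderError(f"all {len(self.loaders)} loaders failed for {url}: {errors}")


async def pdf_only(url: str) -> str:
    """Toy loader that accepts only URLs ending in .pdf."""
    if not url.endswith(".pdf"):
        raise LoaderError("not a PDF")
    return f"pdf text from {url}"


async def generic(url: str) -> str:
    """Toy catch-all loader, playing the role of the final fallback."""
    return f"page text from {url}"


chain = FallbackChain([pdf_only, generic])
print(asyncio.run(chain.load("https://example.com/index.html")))
```

Ordering matters in a chain like this: specific loaders go first and the most permissive loader goes last, which mirrors the `PlaywrightLoader` fallback position in the example above.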
+ ### Python API - Async
+
+ ```python
+ import asyncio
+ import kabigon
+
+ async def main():
+     url = "https://www.google.com.tw"
+
+     # Simplest usage - automatically uses the best loader
+     content = await kabigon.load_url(url)
+     print(content)
+
+     # Or use specific loader
+     loader = kabigon.PlaywrightLoader()
+     content = await loader.load(url)
+     print(content)
+
+     # Batch processing multiple URLs in parallel
+     urls = [
+         "https://x.com/user1/status/123",
+         "https://truthsocial.com/@user/posts/456",
+         "https://youtube.com/watch?v=abc",
+         "https://reddit.com/r/python/comments/xyz",
+     ]
+
+     loader = kabigon.Compose([
+         kabigon.TwitterLoader(),
+         kabigon.TruthSocialLoader(),
+         kabigon.YoutubeLoader(),
+         kabigon.RedditLoader(),
+         kabigon.PlaywrightLoader(),
+     ])
+
+     # Parallel processing with automatic loader selection
+     results = await asyncio.gather(*[kabigon.load_url(url) for url in urls])
+     for url, content in zip(urls, results):
+         print(f"{url}: {len(content)} chars")
+
+ asyncio.run(main())
+ ```
+
+ ### API Comparison
+
+ | Usage | Simplest | Custom Loader Chain |
+ |-------|----------|---------------------|
+ | **Sync** | `kabigon.load_url_sync(url)` | `loader.load_sync(url)` |
+ | **Async** | `await kabigon.load_url(url)` | `await loader.load(url)` |
+ | **Batch Async** | `await asyncio.gather(*[kabigon.load_url(url) for url in urls])` | `await asyncio.gather(*[loader.load(url) for url in urls])` |
+
+ ## Supported Sources
+
+ | Source | Loader | Description |
+ |--------|--------|-------------|
+ | YouTube | `YoutubeLoader` | Extracts video transcripts |
+ | YouTube | `YoutubeYtdlpLoader` | Audio transcription via yt-dlp + Whisper |
+ | Twitter/X | `TwitterLoader` | Extracts tweet content |
+ | Truth Social | `TruthSocialLoader` | Extracts Truth Social posts |
+ | Reddit | `RedditLoader` | Extracts Reddit posts and comments |
+ | Instagram Reels | `ReelLoader` | Audio transcription + metadata |
+ | GitHub | `GitHubLoader` | Fetches GitHub web pages and file content (supports repo URLs + `github.com/.../blob/...`) |
+ | PDF | `PDFLoader` | Extracts text from PDF files (URL or local) |
+ | PTT | `PttLoader` | Taiwan PTT forum posts |
+ | Generic Web | `PlaywrightLoader` | Browser-based scraping for any website |
+ | Generic Web | `HttpxLoader` | Simple HTTP requests with markdown conversion |
+
+ ## Examples
+
+ See the [`examples/`](examples/) directory for more usage examples:
+
+ - [`simple_usage.py`](examples/simple_usage.py) - Basic single-line usage
+ - [`async_usage.py`](examples/async_usage.py) - Async usage and parallel batch processing
+ - [`twitter.py`](examples/twitter.py) - Twitter/X post extraction
+ - [`truthsocial.py`](examples/truthsocial.py) - Truth Social post extraction
+ - [`read_reddit.py`](examples/read_reddit.py) - Reddit post and comments extraction
+ - [`ptt.py`](examples/ptt.py) - PTT forum post extraction
+ - [`fetch_billgertz_tweet.py`](examples/fetch_billgertz_tweet.py) - Real-world Twitter scraping example
+
+ ## Troubleshooting
+
+ ### Playwright browser not installed
+
+ **Error**: `Executable doesn't exist at /path/to/chromium`
+
+ **Solution**: Install Playwright browsers after installing kabigon:
+ ```bash
+ playwright install chromium
+ ```
+
+ ### FFmpeg not found (for audio transcription)
+
+ **Error**: `ffmpeg not found`
+
+ **Solution**: Install FFmpeg for your platform:
+ ```bash
+ # Ubuntu/Debian
+ sudo apt-get install ffmpeg
+
+ # macOS
+ brew install ffmpeg
+
+ # Windows
+ # Download from https://ffmpeg.org/download.html
+ ```
+
+ Or set custom FFmpeg path:
+ ```bash
+ export FFMPEG_PATH=/path/to/ffmpeg
+ ```
+
+ ### Timeout errors
+
+ **Error**: `Timeout 30000ms exceeded`
+
+ **Solution**: Increase timeout for slow-loading pages:
+ ```python
+ # Increase timeout to 60 seconds
+ loader = kabigon.PlaywrightLoader(timeout=60_000)
+ content = loader.load_sync(url)
+ ```
+
+ ### CAPTCHA or rate limiting
+
+ Some websites may show CAPTCHAs or block automated access. For Reddit, kabigon automatically uses `old.reddit.com` to avoid CAPTCHAs. For other sites, you may need to:
+
+ - Add delays between requests
+ - Use a custom user agent
+ - Implement retry logic with exponential backoff
+
+ ### Logging and debugging
+
+ Enable debug logging to see what's happening:
+ ```bash
+ export LOGURU_LEVEL=DEBUG
+ kabigon <url>
+ ```
+
+ Or in Python:
+ ```python
+ import os
+ os.environ["LOGURU_LEVEL"] = "DEBUG"
+ import kabigon
+ ```
+
+ ## Development
+
+ ### Setup
+
+ ```bash
+ # Clone the repository
+ git clone https://github.com/narumiruna/kabigon.git
+ cd kabigon
+
+ # Install dependencies with uv
+ uv sync
+
+ # Install Playwright browsers
+ playwright install chromium
+ ```
+
+ ### Testing
+
+ ```bash
+ # Run all tests with coverage
+ uv run pytest -v -s --cov=src tests
+
+ # Run specific test file
+ uv run pytest -v -s tests/loaders/test_youtube.py
+ ```
+
+ Current test coverage: **69%** (37 tests passing)
+
+ ### Linting and Type Checking
+
+ ```bash
+ # Run linter
+ uv run ruff check .
+
+ # Run type checker
+ uv run ty check .
+
+ # Auto-fix linting issues
+ uv run ruff check --fix .
+
+ # Format code
+ uv run ruff format .
+ ```
+
+ ### Building and Publishing
+
+ ```bash
+ # Build wheel
+ uv build -f wheel
+
+ # Publish to PyPI
+ uv publish
+ ```
+
+ ### Contributing
+
+ Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
+
+ When adding a new loader:
+ 1. Create a new file in `src/kabigon/loaders/`
+ 2. Inherit from the `Loader` base class
+ 3. Implement `async def load(url: str) -> str`
+ 4. Add domain validation
+ 5. Add tests in `tests/loaders/`
+ 6. Update documentation
+
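Steps 2 to 4 above might look roughly like the sketch below. The `Loader` class here is a self-contained stand-in whose interface is assumed from the README (an async `load` plus a `load_sync` wrapper); it is not copied from `kabigon/core/loader.py`, and `ExampleOrgLoader` with its `allowed_domains` attribute is purely hypothetical.

```python
import asyncio
from abc import ABC, abstractmethod
from urllib.parse import urlparse


class Loader(ABC):
    """Stand-in for kabigon's `Loader` base class (assumed interface)."""

    @abstractmethod
    async def load(self, url: str) -> str:
        """Return the extracted text for `url`."""

    def load_sync(self, url: str) -> str:
        # Convenience wrapper for callers that are not already in an event loop.
        return asyncio.run(self.load(url))


class ExampleOrgLoader(Loader):
    """Hypothetical loader that only accepts example.org URLs."""

    allowed_domains = {"example.org", "www.example.org"}

    async def load(self, url: str) -> str:
        # Domain validation: reject URLs this loader is not meant to handle,
        # so that a fallback chain can move on to the next loader.
        host = urlparse(url).hostname or ""
        if host not in self.allowed_domains:
            raise ValueError(f"unsupported domain: {host!r}")
        # A real loader would fetch the page and convert it to text here.
        return f"# Content from {url}"


print(ExampleOrgLoader().load_sync("https://example.org/post/1"))
```

Raising on an unsupported domain, rather than returning an empty string, is a design assumption that lets a chain of loaders distinguish "not mine" from "empty page".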
+ See [`CLAUDE.md`](CLAUDE.md) for detailed development guidelines.
+
+ ## License
+
+ MIT License - see [LICENSE](LICENSE) file for details.
@@ -0,0 +1,28 @@
+ kabigon/__init__.py,sha256=dkHmFZImqh0b4YnoDrD3bhjknyO5r_0LTg7CIcmq0zI,1070
+ kabigon/api.py,sha256=9xLM_A9L0V5Ipxpb921lusPO4NcNrBMJxdYNMLPPi5A,2029
+ kabigon/cli.py,sha256=-fl8KhW6k2bLREVponkvZ82a3Q_D0St8oE2dn032EZ8,188
+ kabigon/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
+ kabigon/core/__init__.py,sha256=ZJNvmTRPUAqFxWn4QuVizsbbP9pFNGupl8rt67wNFrg,536
+ kabigon/core/exception.py,sha256=OZFAyU0CMX9eA2rKBrKM1lTRWR3KFOm6m1_FgWl79pw,1267
+ kabigon/core/loader.py,sha256=DXKRzmd8Dl4u-UgPHVcn4V37nPfeMehpL4O7cBzOibU,220
+ kabigon/loaders/__init__.py,sha256=ZmcXKE9RxDnX_q5eg4UnzlBG_i1KjUFYJNsIvn49B8Q,779
+ kabigon/loaders/compose.py,sha256=37F5C_HS1YeMUmtTFc4VzQCbFszXdq8OBymXzqmP5PU,971
+ kabigon/loaders/firecrawl.py,sha256=SHA_X1qKkb9R2nxe32NqD5dS3-3lHHCb266N0-CgbZM,804
+ kabigon/loaders/github.py,sha256=quoSHikGRaIpbKI_9z7PtSbh96-eG3ry1iJyE96J1sY,5217
+ kabigon/loaders/httpx.py,sha256=RoR2YIZErSq9JMeM8L6Xk7i0HfURT-2YCkSvE4uQl4o,499
+ kabigon/loaders/pdf.py,sha256=jxzo8DTB6YBi3k4U3Ub7ngkd7__5PUZB0kut6j9c1Ak,1484
+ kabigon/loaders/playwright.py,sha256=3n8gfOoIxHs8tn6vBtPMCzDY7u1J7d7qLXIlRWE28G0,1220
+ kabigon/loaders/ptt.py,sha256=BLbsaJfnnGRVKr8PJpybTWn49jXXvH2pzQFt-AKO5wA,824
+ kabigon/loaders/reddit.py,sha256=R5hBDftJ7bc4xtK8EBbXrxkul9kX3olm1k7ALKxVmtA,2244
+ kabigon/loaders/reel.py,sha256=8U-MDag31lTnF9siptLS2hv1uAh_P4EF92o4XsTfmmY,692
+ kabigon/loaders/truthsocial.py,sha256=HnIlci_p7lWwnkDt4xaRHdt-x0aeLXjahwxzEB7toNM,1929
+ kabigon/loaders/twitter.py,sha256=deQRQphE6WWVe5YE37YhDZsRyDnGgeZmHRB5TgD1DY4,3955
+ kabigon/loaders/utils.py,sha256=eNTLtHLSB2erDac2HH3jWemgfr8Ou_ozwVb8h9BD-4g,922
+ kabigon/loaders/youtube.py,sha256=cmzhw8aCsmh5LYrtmWpKtCl06OU6KUS3pHa6tpVHgFg,4611
+ kabigon/loaders/youtube_ytdlp.py,sha256=EsTkWWwhxFPmv7wtzZsVcWQoC89-CZ7qLHo9cpG3sdQ,342
+ kabigon/loaders/ytdlp.py,sha256=De4myywRYPmUEz5ZVUzPCLb2T9vFLPvS9NWZqsHkLog,1696
+ kabigon-0.14.2.dist-info/METADATA,sha256=qerhecLBMs4Jyji4SuRXLFIA0VkttTs_H_ypHL2hgRc,8801
+ kabigon-0.14.2.dist-info/WHEEL,sha256=WLgqFyCfm_KASv4WHyYy0P3pM_m7J5L9k2skdKLirC8,87
+ kabigon-0.14.2.dist-info/entry_points.txt,sha256=O3FYAO9w-NQvlGMJrBvtrnGHSK2QkUnQBTa30YXRbVE,45
+ kabigon-0.14.2.dist-info/licenses/LICENSE,sha256=H2T3_RTgmcngMeC7p_SXT3GwBLkd2DaNgAZuxulcfiA,1066
+ kabigon-0.14.2.dist-info/RECORD,,
@@ -0,0 +1,4 @@
+ Wheel-Version: 1.0
+ Generator: hatchling 1.28.0
+ Root-Is-Purelib: true
+ Tag: py3-none-any
@@ -0,0 +1,2 @@
+ [console_scripts]
+ kabigon = kabigon.cli:main
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2022 ăȘるみ
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.