insta2table 0.1.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- insta2table-0.1.0/LICENSE +21 -0
- insta2table-0.1.0/MANIFEST.in +3 -0
- insta2table-0.1.0/PKG-INFO +86 -0
- insta2table-0.1.0/README.md +61 -0
- insta2table-0.1.0/examples/links.txt +4 -0
- insta2table-0.1.0/pyproject.toml +37 -0
- insta2table-0.1.0/setup.cfg +7 -0
- insta2table-0.1.0/src/insta2table/__init__.py +7 -0
- insta2table-0.1.0/src/insta2table/cli.py +38 -0
- insta2table-0.1.0/src/insta2table/crawler.py +224 -0
- insta2table-0.1.0/src/insta2table/processor.py +77 -0
- insta2table-0.1.0/src/insta2table/utils.py +44 -0
- insta2table-0.1.0/src/insta2table.egg-info/PKG-INFO +86 -0
- insta2table-0.1.0/src/insta2table.egg-info/SOURCES.txt +18 -0
- insta2table-0.1.0/src/insta2table.egg-info/dependency_links.txt +1 -0
- insta2table-0.1.0/src/insta2table.egg-info/entry_points.txt +3 -0
- insta2table-0.1.0/src/insta2table.egg-info/requires.txt +14 -0
- insta2table-0.1.0/src/insta2table.egg-info/top_level.txt +1 -0
- insta2table-0.1.0/tests/test_import.py +3 -0
insta2table-0.1.0/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2025 S. Kashyap
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
insta2table-0.1.0/PKG-INFO
@@ -0,0 +1,86 @@
+Metadata-Version: 2.4
+Name: insta2table
+Version: 0.1.0
+Summary: Fetch Instagram captions (with optional OCR) and convert them to a normalized table via Gemini.
+Author: S. Kashyap
+License: MIT License
+Project-URL: Homepage, https://github.com/Sk1499/insta2table
+Project-URL: Issues, https://github.com/Sk1499/insta2table/issues
+Requires-Python: >=3.9
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: pandas>=2.0
+Requires-Dist: tqdm>=4.66
+Requires-Dist: tenacity>=8.2
+Requires-Dist: python-dotenv>=1.0
+Requires-Dist: instaloader>=4.12
+Requires-Dist: requests>=2.31
+Provides-Extra: ocr
+Requires-Dist: pytesseract>=0.3.10; extra == "ocr"
+Requires-Dist: Pillow>=10.0; extra == "ocr"
+Provides-Extra: genai
+Requires-Dist: langchain>=0.2.0; extra == "genai"
+Requires-Dist: langchain-google-genai>=2.0.0; extra == "genai"
+Dynamic: license-file
+
+# insta2table
+
+[](https://github.com/Sk1499/insta2table/actions)
+[](https://pypi.org/project/insta2table/)
+
+
+A tiny toolkit to:
+1. **Crawl Instagram links** from a text file, extract captions and (optionally) **OCR** text from images → `output.csv`
+2. **Convert** those rows into a clean **single-row table** per link using **Gemini** → `result.csv`
+
+> Credentials & keys via env:
+> - `IG_USER`, `IG_PASS` (optional): for Instaloader login (reduces 403s)
+> - `GOOGLE_API_KEY`: required for Gemini
+
+## Quickstart
+
+```bash
+# 1) (Recommended) Create and activate a virtualenv
+python -m venv .venv
+# Linux/macOS
+source .venv/bin/activate
+# Windows (PowerShell)
+# .venv\Scripts\Activate.ps1
+
+# 2) Install (with OCR & Gemini extras if you need them)
+pip install -e .[ocr,genai]
+
+# 3) Prepare links
+cp examples/links.txt .
+
+# 4) Crawl -> output.csv
+export IG_USER="your_user"
+export IG_PASS="your_pass"
+insta2csv --links links.txt --out output.csv
+
+# 5) Process with Gemini -> result.csv
+export GOOGLE_API_KEY="your_key"
+insta2table --in output.csv --out result.csv
+```
+
+## CLI
+
+```bash
+insta2csv --links links.txt --out output.csv [--no-ocr]
+insta2table --in output.csv --out result.csv
+```
+
+## Notes
+- OCR requires Tesseract installed on your system if you opt in.
+- Instagram scraping without login can trigger 403s. Supplying IG credentials helps.
+- Gemini formatting expects a single-row Markdown table per input.
+
+
+## Publishing
+
+We use **trusted publishing** from GitHub to PyPI (no API token needed).
+
+1. Create the project on PyPI (only first time) and enable **'Manage publishing'** with GitHub OIDC for your repo.
+2. In GitHub: create a new Release with a tag like `v0.1.0`.
+3. The **Publish to PyPI** workflow will build and upload the release automatically.
+4. Alternatively (manual): `make build` then `twine upload dist/*` (requires a PyPI token).
insta2table-0.1.0/README.md
@@ -0,0 +1,61 @@
+# insta2table
+
+[](https://github.com/Sk1499/insta2table/actions)
+[](https://pypi.org/project/insta2table/)
+
+
+A tiny toolkit to:
+1. **Crawl Instagram links** from a text file, extract captions and (optionally) **OCR** text from images → `output.csv`
+2. **Convert** those rows into a clean **single-row table** per link using **Gemini** → `result.csv`
+
+> Credentials & keys via env:
+> - `IG_USER`, `IG_PASS` (optional): for Instaloader login (reduces 403s)
+> - `GOOGLE_API_KEY`: required for Gemini
+
+## Quickstart
+
+```bash
+# 1) (Recommended) Create and activate a virtualenv
+python -m venv .venv
+# Linux/macOS
+source .venv/bin/activate
+# Windows (PowerShell)
+# .venv\Scripts\Activate.ps1
+
+# 2) Install (with OCR & Gemini extras if you need them)
+pip install -e .[ocr,genai]
+
+# 3) Prepare links
+cp examples/links.txt .
+
+# 4) Crawl -> output.csv
+export IG_USER="your_user"
+export IG_PASS="your_pass"
+insta2csv --links links.txt --out output.csv
+
+# 5) Process with Gemini -> result.csv
+export GOOGLE_API_KEY="your_key"
+insta2table --in output.csv --out result.csv
+```
+
+## CLI
+
+```bash
+insta2csv --links links.txt --out output.csv [--no-ocr]
+insta2table --in output.csv --out result.csv
+```
+
+## Notes
+- OCR requires Tesseract installed on your system if you opt in.
+- Instagram scraping without login can trigger 403s. Supplying IG credentials helps.
+- Gemini formatting expects a single-row Markdown table per input.
+
+
+## Publishing
+
+We use **trusted publishing** from GitHub to PyPI (no API token needed).
+
+1. Create the project on PyPI (only first time) and enable **'Manage publishing'** with GitHub OIDC for your repo.
+2. In GitHub: create a new Release with a tag like `v0.1.0`.
+3. The **Publish to PyPI** workflow will build and upload the release automatically.
+4. Alternatively (manual): `make build` then `twine upload dist/*` (requires a PyPI token).
insta2table-0.1.0/pyproject.toml
@@ -0,0 +1,37 @@
+[build-system]
+requires = ["setuptools>=68", "wheel"]
+build-backend = "setuptools.build_meta"
+
+[project]
+name = "insta2table"
+version = "0.1.0"
+description = "Fetch Instagram captions (with optional OCR) and convert them to a normalized table via Gemini."
+readme = "README.md"
+requires-python = ">=3.9"
+license = {text = "MIT License"}
+authors = [{name = "S. Kashyap"}]
+dependencies = [
+    "pandas>=2.0",
+    "tqdm>=4.66",
+    "tenacity>=8.2",
+    "python-dotenv>=1.0",
+    "instaloader>=4.12",
+    "requests>=2.31"
+]
+
+[project.optional-dependencies]
+ocr = ["pytesseract>=0.3.10", "Pillow>=10.0"]
+genai = ["langchain>=0.2.0", "langchain-google-genai>=2.0.0"]
+
+[project.scripts]
+insta2csv = "insta2table.cli:crawl"
+insta2table = "insta2table.cli:process"
+
+[tool.setuptools]
+package-dir = {"" = "src"}
+
+[tool.setuptools.packages.find]
+where = ["src"]
+[project.urls]
+Homepage = "https://github.com/Sk1499/insta2table"
+Issues = "https://github.com/Sk1499/insta2table/issues"
insta2table-0.1.0/src/insta2table/cli.py
@@ -0,0 +1,38 @@
+from __future__ import annotations
+import argparse, logging, os
+from .crawler import crawl_links
+from .processor import process_csv
+
+logging.basicConfig(level=os.getenv("INSTA2TABLE_LOG", "INFO"))
+
+def crawl(argv=None):
+    p = argparse.ArgumentParser(description="Crawl Instagram links -> output.csv")
+    p.add_argument("--links", required=True, help="Path to links.txt (one URL per line)")
+    p.add_argument("--out", default="output.csv", help="Path to write output CSV")
+    p.add_argument("--no-ocr", action="store_true", help="Disable OCR")
+    p.add_argument("--proxies-file", help="Path to a file with proxies (one per line)")
+    p.add_argument("--user-agents-file", help="Path to a file with user agents (one per line)")
+    p.add_argument("--proxy", action='append', help="Provide a proxy (can be used multiple times). Example: http://user:pass@1.2.3.4:3128")
+    p.add_argument("--user-agent", action='append', help="Provide a user-agent (can be used multiple times).")
+    args = p.parse_args(argv)
+    proxies = None
+    user_agents = None
+    if args.proxy:
+        proxies = args.proxy
+    if args.user_agent:
+        user_agents = args.user_agent
+    # Files override env if provided via CLI
+    if args.proxies_file:
+        os.environ['PROXIES_FILE'] = args.proxies_file
+    if args.user_agents_file:
+        os.environ['USER_AGENTS_FILE'] = args.user_agents_file
+    out = crawl_links(args.links, args.out, do_ocr=not args.no_ocr, proxies=proxies, user_agents=user_agents)
+    print(out)
+
+def process(argv=None):
+    p = argparse.ArgumentParser(description="Convert output.csv to result.csv using Gemini")
+    p.add_argument("--in", dest="in_csv", required=True, help="Input CSV from insta2csv")
+    p.add_argument("--out", dest="out_csv", default="result.csv", help="Path to write result CSV")
+    args = p.parse_args(argv)
+    out = process_csv(args.in_csv, args.out_csv)
+    print(out)
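One detail worth noting in `process` above: `--in` needs an explicit `dest`, because the attribute would otherwise be named `in`, a Python keyword, and `args.in` would be a syntax error. A minimal standalone sketch (outside the package, names illustrative):

```python
import argparse

# Mirrors the argument wiring in cli.py's process(): "--in" is mapped to
# the attribute "in_csv" so it can be accessed without clashing with the
# "in" keyword; "--out" gets a default of result.csv.
p = argparse.ArgumentParser()
p.add_argument("--in", dest="in_csv", required=True)
p.add_argument("--out", dest="out_csv", default="result.csv")

args = p.parse_args(["--in", "output.csv"])
print(args.in_csv, args.out_csv)  # → output.csv result.csv
```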
insta2table-0.1.0/src/insta2table/crawler.py
@@ -0,0 +1,200 @@
+from __future__ import annotations
+import os, csv, io, logging, requests
+from dataclasses import dataclass
+from itertools import cycle
+from pathlib import Path
+from typing import Iterable, Iterator, List, Optional
+from .utils import classify_url, safe_sleep
+
+log = logging.getLogger("insta2table")
+
+def _read_list_from_file(path: Optional[str]) -> list:
+    if not path:
+        return []
+    p = Path(path)
+    if not p.exists():
+        return []
+    return [line.strip() for line in p.read_text(encoding="utf-8").splitlines() if line.strip()]
+
+def _coerce_list(env_var: Optional[str]) -> list:
+    if not env_var:
+        return []
+    # Accept comma-separated or newline-separated lists
+    items = [i.strip() for i in env_var.replace("\r", "").split(",") if i.strip()]
+    if len(items) == 1 and "\n" in env_var:
+        items = [i.strip() for i in env_var.splitlines() if i.strip()]
+    return items
+
+def _make_cycle(items: Iterable[str]) -> Iterator[Optional[str]]:
+    items = list(items) or [None]
+    # With no items the cycle yields None, letting callers skip proxies/UAs.
+    return cycle(items)
+
+def _apply_to_instaloader_session(L, user_agent: Optional[str], proxy: Optional[str]):
+    """
+    Apply user-agent & proxy settings to an Instaloader session if available.
+    We try to update L.context._session (a requests.Session) headers & proxies.
+    """
+    try:
+        if getattr(L, 'context', None) and hasattr(L.context, '_session'):
+            sess = getattr(L.context, '_session')
+            if user_agent:
+                sess.headers.update({'User-Agent': user_agent})
+            if proxy:
+                sess.proxies.update({'http': proxy, 'https': proxy})
+    except Exception:
+        pass
+
+try:
+    import instaloader
+except Exception as e:
+    instaloader = None
+    log.warning("Instaloader not available: %s", e)
+
+# Optional OCR deps
+try:
+    from PIL import Image
+    import pytesseract
+except Exception:
+    Image = None
+    pytesseract = None
+
+@dataclass
+class Row:
+    url: str
+    category: str
+    caption: str = ""
+    image_urls: str = ""
+    image_text: str = ""
+    image_count: int = 0
+    error: str = ""
+
+def _login_loader():
+    if instaloader is None:
+        return None
+    L = instaloader.Instaloader(download_pictures=False, download_comments=False, save_metadata=False, compress_json=False)
+    user = os.getenv("IG_USER")
+    pw = os.getenv("IG_PASS")
+    if user and pw:
+        try:
+            L.login(user, pw)
+            log.info("Logged into Instagram as %s", user)
+        except Exception as e:
+            log.warning("Login failed; continuing unauthenticated: %s", e)
+    return L
+
+def _fetch_post(L, shortcode: str):
+    try:
+        return instaloader.Post.from_shortcode(L.context, shortcode)
+    except Exception as e:
+        log.debug("Failed fetching shortcode %s: %s", shortcode, e)
+        return None
+
+def _gather_image_urls(post) -> List[str]:
+    urls = []
+    try:
+        if post.typename == "GraphSidecar":
+            for node in post.get_sidecar_nodes():
+                if not node.is_video and node.display_url:
+                    urls.append(node.display_url)
+        else:
+            if not post.is_video and getattr(post, "url", None):
+                urls.append(post.url)
+    except Exception as e:
+        log.debug("Error collecting image URLs: %s", e)
+    return urls
+
+def _ocr_images(urls: List[str], ua_cycle=None, proxy_cycle=None) -> str:
+    # Rotates user agents & proxies per image when cycles are supplied.
+    if not urls or Image is None or pytesseract is None:
+        return ""
+    texts = []
+    for u in urls:
+        try:
+            ua = next(ua_cycle) if ua_cycle else None
+            proxy = next(proxy_cycle) if proxy_cycle else None
+            kwargs = {'timeout': 30}
+            if ua:
+                kwargs['headers'] = {'User-Agent': ua}
+            if proxy:
+                kwargs['proxies'] = {'http': proxy, 'https': proxy}
+            r = requests.get(u, **kwargs)
+            r.raise_for_status()
+            img = Image.open(io.BytesIO(r.content))
+            txt = pytesseract.image_to_string(img)
+            if txt:
+                texts.append(txt.strip())
+        except Exception:
+            continue
+    return " || ".join(t.strip() for t in texts if t.strip())
+
+def crawl_links(links_path: str, out_csv: str, do_ocr: bool = True, proxies: Optional[Iterable[str]] = None, user_agents: Optional[Iterable[str]] = None) -> str:
+    """
+    Read links from file, fetch caption & image URLs (and optionally OCR), write CSV.
+    Returns the output CSV path.
+    """
+    L = _login_loader() if instaloader else None
+
+    # Prepare proxy & user-agent cycles. Sources, in priority order:
+    # 1) explicit args passed to crawl_links
+    # 2) environment variables PROXIES and USER_AGENTS (comma-separated)
+    # 3) files referenced by PROXIES_FILE / USER_AGENTS_FILE in env (newline-separated)
+    proxies_list = list(proxies) if proxies else _coerce_list(os.getenv("PROXIES"))
+    if not proxies_list and os.getenv("PROXIES_FILE"):
+        proxies_list = _read_list_from_file(os.getenv("PROXIES_FILE"))
+    user_agents_list = list(user_agents) if user_agents else _coerce_list(os.getenv("USER_AGENTS"))
+    if not user_agents_list and os.getenv("USER_AGENTS_FILE"):
+        user_agents_list = _read_list_from_file(os.getenv("USER_AGENTS_FILE"))
+
+    proxy_cycle = _make_cycle([p for p in proxies_list if p])
+    ua_cycle = _make_cycle([u for u in user_agents_list if u])
+
+    rows: List[Row] = []
+    with open(links_path, "r", encoding="utf-8") as f:
+        for raw in f:
+            url = raw.strip()
+            if not url:
+                continue
+            category, shortcode = classify_url(url)
+            if category in ("not-instagram", "insta-page"):
+                rows.append(Row(url=url, category=category))
+                continue
+            caption = ""
+            image_urls = []
+            image_text = ""
+            image_count = 0
+            error = ""
+            post = None
+            if shortcode and L:
+                # rotate UA & proxy for this fetch
+                _apply_to_instaloader_session(L, next(ua_cycle), next(proxy_cycle))
+                post = _fetch_post(L, shortcode)
+            if post is None:
+                error = "failed to fetch (unauthenticated or invalid link)"
+            else:
+                try:
+                    caption = (post.caption or "").strip()
+                    image_urls = _gather_image_urls(post)
+                    image_count = len(image_urls) if image_urls else (0 if post.is_video else 1)
+                    if do_ocr:
+                        image_text = _ocr_images(image_urls, ua_cycle, proxy_cycle)
+                except Exception as e:
+                    error = str(e)
+            rows.append(Row(
+                url=url,
+                category=category,
+                caption=caption,
+                image_urls=";".join(image_urls) if image_urls else "",
+                image_text=image_text,
+                image_count=image_count,
+                error=error,
+            ))
+            safe_sleep(0.5)
+
+    fieldnames = ["url", "category", "caption", "image_urls", "image_text", "image_count", "error"]
+    with open(out_csv, "w", newline="", encoding="utf-8-sig") as f:
+        w = csv.DictWriter(f, fieldnames=fieldnames)
+        w.writeheader()
+        for r in rows:
+            w.writerow({k: getattr(r, k) for k in fieldnames})
+    return out_csv
insta2table-0.1.0/src/insta2table/processor.py
@@ -0,0 +1,77 @@
+from __future__ import annotations
+import os, csv, logging
+from typing import List
+from tenacity import retry, stop_after_attempt, wait_fixed
+
+log = logging.getLogger("insta2table")
+
+PROMPT = """You are given data from an Instagram link.
+Summarize into ONE single-row Markdown table with EXACTLY these columns:
+| Place | City | Country | Category | Brief Description | Social Media / Website |
+
+If information is missing, leave that cell blank. Do not add extra rows or text.
+Content:
+CAPTION: {caption}
+OCR_TEXT: {image_text}
+URL: {url}
+"""
+
+def _get_llm():
+    api_key = os.getenv("GOOGLE_API_KEY")
+    if not api_key:
+        raise RuntimeError("GOOGLE_API_KEY is required for processing with Gemini.")
+    from langchain_google_genai import ChatGoogleGenerativeAI
+    return ChatGoogleGenerativeAI(model="gemini-2.5-flash", api_key=api_key)
+
+@retry(stop=stop_after_attempt(3), wait=wait_fixed(1))
+def _call_llm(llm, url: str, caption: str, image_text: str) -> str:
+    prompt = PROMPT.format(url=url, caption=caption or "", image_text=image_text or "")
+    resp = llm.invoke(prompt)
+    text = getattr(resp, "content", None) or str(resp)
+    return text.strip()
+
+def _parse_single_row_table(md: str) -> List[str]:
+    lines = [l.rstrip() for l in md.splitlines() if l.strip()]
+    start = None
+    for i, l in enumerate(lines):
+        if l.strip().startswith("|") and "Place" in l and "Country" in l:
+            start = i
+            break
+    if start is None or start + 2 >= len(lines):
+        raise ValueError("No valid markdown table header found.")
+    data_line = lines[start + 2] if lines[start + 1].strip().startswith("|-") else lines[start + 1]
+    cells = [c.strip() for c in data_line.strip("|").split("|")]
+    if len(cells) != 6:
+        raise ValueError(f"Expected 6 columns, got {len(cells)}")
+    return cells
+
+def process_csv(in_csv: str, out_csv: str) -> str:
+    llm = _get_llm()
+    with open(in_csv, "r", encoding="utf-8") as f:
+        rows = list(csv.DictReader(f))
+    out_rows = []
+    for r in rows:
+        url = r.get("url", "")
+        caption = r.get("caption", "")
+        image_text = r.get("image_text", "")
+        md = _call_llm(llm, url, caption, image_text)
+        try:
+            cells = _parse_single_row_table(md)
+        except Exception as e:
+            log.warning("Parse failed for %s: %s", url, e)
+            cells = ["", "", "", "", "", url]
+        out_rows.append({
+            "Place": cells[0],
+            "City": cells[1],
+            "Country": cells[2],
+            "Category": cells[3],
+            "Brief Description": cells[4],
+            "Social Media / Website": cells[5],
+        })
+    fieldnames = ["Place", "City", "Country", "Category", "Brief Description", "Social Media / Website"]
+    with open(out_csv, "w", newline="", encoding="utf-8-sig") as f:
+        w = csv.DictWriter(f, fieldnames=fieldnames)
+        w.writeheader()
+        for r in out_rows:
+            w.writerow(r)
+    return out_csv
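The table-parsing step in processor.py is the part most worth testing in isolation, since it has to cope with Gemini replies that include text around the table. A standalone sketch mirroring `_parse_single_row_table` (re-implemented here for illustration; the sample reply is made up):

```python
def parse_single_row_table(md):
    # Mirrors processor.py's _parse_single_row_table: locate the header row,
    # skip the |---| separator if present, and split the data row into 6 cells.
    lines = [l.rstrip() for l in md.splitlines() if l.strip()]
    start = next((i for i, l in enumerate(lines)
                  if l.strip().startswith("|") and "Place" in l and "Country" in l), None)
    if start is None or start + 2 >= len(lines):
        raise ValueError("No valid markdown table header found.")
    data_line = lines[start + 2] if lines[start + 1].strip().startswith("|-") else lines[start + 1]
    cells = [c.strip() for c in data_line.strip("|").split("|")]
    if len(cells) != 6:
        raise ValueError(f"Expected 6 columns, got {len(cells)}")
    return cells

sample = """Here is the table:
| Place | City | Country | Category | Brief Description | Social Media / Website |
|---|---|---|---|---|---|
| Cafe Aroma | Lisbon | Portugal | Cafe | Cozy espresso bar | @cafearoma |
"""
print(parse_single_row_table(sample))
# → ['Cafe Aroma', 'Lisbon', 'Portugal', 'Cafe', 'Cozy espresso bar', '@cafearoma']
```

Preamble text before the table is ignored, and a malformed reply raises, which `process_csv` catches and turns into a mostly blank row carrying the URL.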
insta2table-0.1.0/src/insta2table/utils.py
@@ -0,0 +1,44 @@
+from __future__ import annotations
+import time, logging, urllib.parse as urlparse
+from typing import Optional, Tuple
+
+log = logging.getLogger("insta2table")
+
+def classify_url(url: str) -> Tuple[str, Optional[str]]:
+    """
+    Returns (category, shortcode_or_none)
+
+    Categories:
+    - insta-reel -> path contains '/reel/<shortcode>'
+    - insta-image -> path contains '/p/<shortcode>' or '/tv/<shortcode>'
+    - insta-page -> instagram host but not a specific post
+    - not-instagram -> non-instagram host
+    """
+    try:
+        u = urlparse.urlparse(url)
+        host = (u.netloc or "").lower()
+        if "instagram.com" not in host:
+            return "not-instagram", None
+
+        # split and remove empty segments
+        parts = [p for p in (u.path or "/").split("/") if p]
+
+        # search known markers anywhere in the path
+        for marker in ("reel", "p", "tv"):
+            if marker in parts:
+                idx = parts.index(marker)
+                shortcode = parts[idx + 1] if len(parts) > idx + 1 else None
+                if marker == "reel":
+                    return "insta-reel", shortcode
+                else:
+                    return "insta-image", shortcode
+
+        return "insta-page", None
+    except Exception:
+        return "not-instagram", None
+
+def safe_sleep(seconds: float):
+    try:
+        time.sleep(seconds)
+    except Exception:
+        pass
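The URL classifier above is pure stdlib, so its behavior is easy to check standalone. A sketch re-implementing the same logic for illustration (example URLs are made up):

```python
import urllib.parse as urlparse

def classify_url(url):
    # Mirrors utils.py's classify_url: check the host, then look for a
    # /reel/, /p/ or /tv/ marker anywhere in the path and take the next
    # segment as the shortcode.
    u = urlparse.urlparse(url)
    host = (u.netloc or "").lower()
    if "instagram.com" not in host:
        return "not-instagram", None
    parts = [p for p in (u.path or "/").split("/") if p]
    for marker in ("reel", "p", "tv"):
        if marker in parts:
            idx = parts.index(marker)
            shortcode = parts[idx + 1] if len(parts) > idx + 1 else None
            return ("insta-reel" if marker == "reel" else "insta-image"), shortcode
    return "insta-page", None

print(classify_url("https://www.instagram.com/reel/XYZ123/"))  # → ('insta-reel', 'XYZ123')
print(classify_url("https://www.instagram.com/p/ABC9/"))       # → ('insta-image', 'ABC9')
print(classify_url("https://www.instagram.com/someuser/"))     # → ('insta-page', None)
print(classify_url("https://example.com/p/ABC9/"))             # → ('not-instagram', None)
```

Note that the crawler only fetches posts for the `insta-reel`/`insta-image` categories; `insta-page` and `not-instagram` rows are written to the CSV with empty captions.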
insta2table-0.1.0/src/insta2table.egg-info/PKG-INFO
@@ -0,0 +1,86 @@
+Metadata-Version: 2.4
+Name: insta2table
+Version: 0.1.0
+Summary: Fetch Instagram captions (with optional OCR) and convert them to a normalized table via Gemini.
+Author: S. Kashyap
+License: MIT License
+Project-URL: Homepage, https://github.com/Sk1499/insta2table
+Project-URL: Issues, https://github.com/Sk1499/insta2table/issues
+Requires-Python: >=3.9
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: pandas>=2.0
+Requires-Dist: tqdm>=4.66
+Requires-Dist: tenacity>=8.2
+Requires-Dist: python-dotenv>=1.0
+Requires-Dist: instaloader>=4.12
+Requires-Dist: requests>=2.31
+Provides-Extra: ocr
+Requires-Dist: pytesseract>=0.3.10; extra == "ocr"
+Requires-Dist: Pillow>=10.0; extra == "ocr"
+Provides-Extra: genai
+Requires-Dist: langchain>=0.2.0; extra == "genai"
+Requires-Dist: langchain-google-genai>=2.0.0; extra == "genai"
+Dynamic: license-file
+
+# insta2table
+
+[](https://github.com/Sk1499/insta2table/actions)
+[](https://pypi.org/project/insta2table/)
+
+
+A tiny toolkit to:
+1. **Crawl Instagram links** from a text file, extract captions and (optionally) **OCR** text from images → `output.csv`
+2. **Convert** those rows into a clean **single-row table** per link using **Gemini** → `result.csv`
+
+> Credentials & keys via env:
+> - `IG_USER`, `IG_PASS` (optional): for Instaloader login (reduces 403s)
+> - `GOOGLE_API_KEY`: required for Gemini
+
+## Quickstart
+
+```bash
+# 1) (Recommended) Create and activate a virtualenv
+python -m venv .venv
+# Linux/macOS
+source .venv/bin/activate
+# Windows (PowerShell)
+# .venv\Scripts\Activate.ps1
+
+# 2) Install (with OCR & Gemini extras if you need them)
+pip install -e .[ocr,genai]
+
+# 3) Prepare links
+cp examples/links.txt .
+
+# 4) Crawl -> output.csv
+export IG_USER="your_user"
+export IG_PASS="your_pass"
+insta2csv --links links.txt --out output.csv
+
+# 5) Process with Gemini -> result.csv
+export GOOGLE_API_KEY="your_key"
+insta2table --in output.csv --out result.csv
+```
+
+## CLI
+
+```bash
+insta2csv --links links.txt --out output.csv [--no-ocr]
+insta2table --in output.csv --out result.csv
+```
+
+## Notes
+- OCR requires Tesseract installed on your system if you opt in.
+- Instagram scraping without login can trigger 403s. Supplying IG credentials helps.
+- Gemini formatting expects a single-row Markdown table per input.
+
+
+## Publishing
+
+We use **trusted publishing** from GitHub to PyPI (no API token needed).
+
+1. Create the project on PyPI (only first time) and enable **'Manage publishing'** with GitHub OIDC for your repo.
+2. In GitHub: create a new Release with a tag like `v0.1.0`.
+3. The **Publish to PyPI** workflow will build and upload the release automatically.
+4. Alternatively (manual): `make build` then `twine upload dist/*` (requires a PyPI token).
insta2table-0.1.0/src/insta2table.egg-info/SOURCES.txt
@@ -0,0 +1,18 @@
+LICENSE
+MANIFEST.in
+README.md
+pyproject.toml
+setup.cfg
+examples/links.txt
+src/insta2table/__init__.py
+src/insta2table/cli.py
+src/insta2table/crawler.py
+src/insta2table/processor.py
+src/insta2table/utils.py
+src/insta2table.egg-info/PKG-INFO
+src/insta2table.egg-info/SOURCES.txt
+src/insta2table.egg-info/dependency_links.txt
+src/insta2table.egg-info/entry_points.txt
+src/insta2table.egg-info/requires.txt
+src/insta2table.egg-info/top_level.txt
+tests/test_import.py
insta2table-0.1.0/src/insta2table.egg-info/dependency_links.txt
@@ -0,0 +1 @@
+
insta2table-0.1.0/src/insta2table.egg-info/top_level.txt
@@ -0,0 +1 @@
+insta2table