xtimeline 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2024 Stephan Akkerman
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
@@ -0,0 +1,107 @@
+ Metadata-Version: 2.4
+ Name: xtimeline
+ Version: 0.1.0
+ Summary: Lightweight X/Twitter timeline client (GraphQL via cURL or auth strategies)
+ Author: Stephan Akkerman
+ License: MIT
+ Keywords: twitter,x,timeline,scraping,aiohttp,asyncio
+ Requires-Python: >=3.10
+ Description-Content-Type: text/markdown
+ License-File: LICENSE
+ Requires-Dist: aiohttp>=3.12.15
+ Requires-Dist: uncurl>=0.0.11
+ Dynamic: license-file
+
+ # X-Timeline Scraper
+ A Python client to scrape tweets from X (formerly Twitter) timelines using a cURL command.
+
+ <!-- Add a banner here like: https://github.com/StephanAkkerman/fintwit-bot/blob/main/img/logo/fintwit-banner.png -->
+
+ ---
+ <!-- Adjust the link of the second badge to your own repo -->
+ <p align="center">
+ <img src="https://img.shields.io/badge/python-3.10%2B-blue.svg" alt="Supported versions">
+ <img src="https://img.shields.io/github/license/StephanAkkerman/x-timeline-scraper.svg?color=brightgreen" alt="License">
+ <a href="https://github.com/psf/black"><img src="https://img.shields.io/badge/code%20style-black-000000.svg" alt="Code style: black"></a>
+ </p>
+
+ ## Introduction
+
+ This project provides a Python client to scrape tweets from X (formerly Twitter) timelines using a cURL command. It leverages asynchronous programming for efficient data retrieval and includes features for parsing tweet data.
+
+ ## Table of Contents 🗂
+
+ - [Installation](#installation)
+ - [Usage](#usage)
+ - [Citation](#citation)
+ - [Contributing](#contributing)
+ - [License](#license)
+
+ ## Installation ⚙️
+ <!-- Adjust the link of the second command to your own repo -->
+
+ Dependencies are declared in pyproject.toml. After cloning the repository, install the package with:
+
+ ```bash
+ pip install .
+ ```
+
+ or install directly from GitHub:
+
+ ```bash
+ pip install git+https://github.com/StephanAkkerman/x-timeline-scraper.git
+ ```
+
+ ## Usage ⌨️
+
+ To use the X-Timeline Scraper, you need to provide a cURL command that accesses the desired X timeline. The instructions can be found in [curl_example.txt](curl_example.txt). Then, you can use the `XTimelineClient` class to fetch and parse tweets.
+
+ Here's a simple example of how to use the client:
+
+ ```python
+ import asyncio
+
+ from xtimeline import XTimelineClient
+
+
+ async def main():
+     async with XTimelineClient(
+         "curl.txt", persist_last_id_path="state/last_id.txt"
+     ) as xc:
+         tweets = await xc.fetch_tweets(update_last_id=False)
+         for t in tweets:
+             print(t.to_markdown())
+
+
+ asyncio.run(main())
+ ```
+
+ You can also stream new tweets in real time:
+
+ ```python
+ import asyncio
+
+ from xtimeline import XTimelineClient
+
+
+ async def main():
+     async with XTimelineClient(
+         "curl.txt", persist_last_id_path="state/last_id.txt"
+     ) as xc:
+         async for t in xc.stream(interval_s=5.0):
+             print(t.to_markdown())
+
+
+ asyncio.run(main())
+ ```
+
+ ## Citation ✍️
+ <!-- Be sure to adjust everything here so it matches your name and repo -->
+ If you use this project in your research, please cite as follows:
+
+ ```bibtex
+ @misc{x_timeline_scraper,
+   author = {Stephan Akkerman},
+   title = {X-Timeline Scraper},
+   year = {2025},
+   publisher = {GitHub},
+   journal = {GitHub repository},
+   howpublished = {\url{https://github.com/StephanAkkerman/x-timeline-scraper}}
+ }
+ ```
+
+ ## Contributing 🛠
+ <!-- Be sure to adjust the repo name here for both the URL and GitHub link -->
+ Contributions are welcome! If you have a feature request, bug report, or proposal for code refactoring, please feel free to open an issue on GitHub. We appreciate your help in improving this project.
+
+ [![Contributors](https://contributors-img.firebaseapp.com/image?repo=StephanAkkerman/x-timeline-scraper)](https://github.com/StephanAkkerman/x-timeline-scraper/graphs/contributors)
+
+ ## License 📜
+
+ This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
@@ -0,0 +1,93 @@
+ # X-Timeline Scraper
+ A Python client to scrape tweets from X (formerly Twitter) timelines using a cURL command.
+
+ <!-- Add a banner here like: https://github.com/StephanAkkerman/fintwit-bot/blob/main/img/logo/fintwit-banner.png -->
+
+ ---
+ <!-- Adjust the link of the second badge to your own repo -->
+ <p align="center">
+ <img src="https://img.shields.io/badge/python-3.10%2B-blue.svg" alt="Supported versions">
+ <img src="https://img.shields.io/github/license/StephanAkkerman/x-timeline-scraper.svg?color=brightgreen" alt="License">
+ <a href="https://github.com/psf/black"><img src="https://img.shields.io/badge/code%20style-black-000000.svg" alt="Code style: black"></a>
+ </p>
+
+ ## Introduction
+
+ This project provides a Python client to scrape tweets from X (formerly Twitter) timelines using a cURL command. It leverages asynchronous programming for efficient data retrieval and includes features for parsing tweet data.
+
+ ## Table of Contents 🗂
+
+ - [Installation](#installation)
+ - [Usage](#usage)
+ - [Citation](#citation)
+ - [Contributing](#contributing)
+ - [License](#license)
+
+ ## Installation ⚙️
+ <!-- Adjust the link of the second command to your own repo -->
+
+ Dependencies are declared in pyproject.toml. After cloning the repository, install the package with:
+
+ ```bash
+ pip install .
+ ```
+
+ or install directly from GitHub:
+
+ ```bash
+ pip install git+https://github.com/StephanAkkerman/x-timeline-scraper.git
+ ```
+
+ ## Usage ⌨️
+
+ To use the X-Timeline Scraper, you need to provide a cURL command that accesses the desired X timeline. The instructions can be found in [curl_example.txt](curl_example.txt). Then, you can use the `XTimelineClient` class to fetch and parse tweets.
+
+ Here's a simple example of how to use the client:
+
+ ```python
+ import asyncio
+
+ from xtimeline import XTimelineClient
+
+
+ async def main():
+     async with XTimelineClient(
+         "curl.txt", persist_last_id_path="state/last_id.txt"
+     ) as xc:
+         tweets = await xc.fetch_tweets(update_last_id=False)
+         for t in tweets:
+             print(t.to_markdown())
+
+
+ asyncio.run(main())
+ ```
+
+ You can also stream new tweets in real time:
+
+ ```python
+ import asyncio
+
+ from xtimeline import XTimelineClient
+
+
+ async def main():
+     async with XTimelineClient(
+         "curl.txt", persist_last_id_path="state/last_id.txt"
+     ) as xc:
+         async for t in xc.stream(interval_s=5.0):
+             print(t.to_markdown())
+
+
+ asyncio.run(main())
+ ```
+
+ ## Citation ✍️
+ <!-- Be sure to adjust everything here so it matches your name and repo -->
+ If you use this project in your research, please cite as follows:
+
+ ```bibtex
+ @misc{x_timeline_scraper,
+   author = {Stephan Akkerman},
+   title = {X-Timeline Scraper},
+   year = {2025},
+   publisher = {GitHub},
+   journal = {GitHub repository},
+   howpublished = {\url{https://github.com/StephanAkkerman/x-timeline-scraper}}
+ }
+ ```
+
+ ## Contributing 🛠
+ <!-- Be sure to adjust the repo name here for both the URL and GitHub link -->
+ Contributions are welcome! If you have a feature request, bug report, or proposal for code refactoring, please feel free to open an issue on GitHub. We appreciate your help in improving this project.
+
+ [![Contributors](https://contributors-img.firebaseapp.com/image?repo=StephanAkkerman/x-timeline-scraper)](https://github.com/StephanAkkerman/x-timeline-scraper/graphs/contributors)
+
+ ## License 📜
+
+ This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
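The `persist_last_id_path` option shown in the usage examples keeps a high-water mark of the last seen tweet id between runs, so only newer tweets are yielded. A minimal stdlib-only sketch of that gating; the helper names (`load_last_id`, `gate_new_ids`) are illustrative, not part of the package API:

```python
import tempfile
from pathlib import Path


def load_last_id(path: Path) -> int:
    # A missing state file means nothing has been seen yet.
    try:
        return int(path.read_text().strip() or "0")
    except FileNotFoundError:
        return 0


def gate_new_ids(ids, path: Path):
    """Yield only ids above the persisted high-water mark, updating it."""
    last = load_last_id(path)
    for tid in ids:
        if tid <= last:
            continue  # already seen (or older): skip
        last = tid
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(str(last), encoding="utf-8")
        yield tid


state = Path(tempfile.mkdtemp()) / "last_id.txt"
first = list(gate_new_ids([101, 102, 100], state))  # fresh state file
second = list(gate_new_ids([102, 103], state))      # resumed from disk
print(first, second)  # [101, 102] [103]
```

Because the mark persists to disk, a restarted process resumes where it left off instead of re-emitting old tweets.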
@@ -0,0 +1,28 @@
+ [project]
+ name = "xtimeline"
+ version = "0.1.0"
+ description = "Lightweight X/Twitter timeline client (GraphQL via cURL or auth strategies)"
+ readme = "README.md"
+ requires-python = ">=3.10"
+ license = { text = "MIT" }
+ authors = [{ name = "Stephan Akkerman" }]
+ keywords = ["twitter", "x", "timeline", "scraping", "aiohttp", "asyncio"]
+ dependencies = [
+     "aiohttp>=3.12.15",
+     "uncurl>=0.0.11",
+ ]
+
+ [tool.isort]
+ multi_line_output = 3
+ include_trailing_comma = true
+ force_grid_wrap = 0
+ line_length = 88
+ profile = "black"
+
+ [tool.ruff]
+ line-length = 88
+ #select = ["I001"]
+
+ [tool.ruff.lint.pydocstyle]
+ # Use NumPy-style docstrings.
+ convention = "numpy"
@@ -0,0 +1,4 @@
+ [egg_info]
+ tag_build =
+ tag_date = 0
+
@@ -0,0 +1,5 @@
+ from .tweet import MediaItem, Tweet
+ from .xclient import XTimelineClient
+
+ __all__ = ["XTimelineClient", "Tweet", "MediaItem"]
+ __version__ = "0.1.0"
@@ -0,0 +1,71 @@
+ import logging
+ from dataclasses import asdict, dataclass
+
+ logger = logging.getLogger(__name__)
+
+
+ @dataclass(frozen=True, slots=True)
+ class MediaItem:
+     """Single media attachment."""
+
+     url: str
+     type: str  # e.g. "photo", "video"
+
+
+ @dataclass(slots=True)
+ class Tweet:
+     """
+     Normalized tweet.
+
+     Attributes
+     ----------
+     id : int
+         Tweet rest_id as int.
+     text : str
+         Full text, HTML entities unescaped and trailing t.co removed.
+     user_name : str
+         Display name.
+     user_screen_name : str
+         @handle (without @).
+     user_img : str
+         Profile image URL.
+     url : str
+         Canonical tweet URL.
+     media : list[MediaItem]
+         Unique media attachments.
+     tickers : list[str]
+         Uppercased $SYMBOLS.
+     hashtags : list[str]
+         Uppercased hashtags (CRYPTO excluded).
+     title : str
+         A short human-readable title (“X retweeted Y”, etc.).
+     media_types : list[str]
+         Mirrors the attachment types for convenience.
+     """
+
+     id: int
+     text: str
+     user_name: str
+     user_screen_name: str
+     user_img: str
+     url: str
+     media: list[MediaItem]
+     tickers: list[str]
+     hashtags: list[str]
+     title: str
+     media_types: list[str]
+
+     def to_dict(self) -> dict:
+         """Serialize to a plain dict safe for JSON."""
+         d = asdict(self)
+         d["media"] = [asdict(m) for m in self.media]
+         return d
+
+     def to_markdown(self) -> str:
+         """Compact markdown rendering."""
+         md = (
+             f"**{self.user_name}** "
+             f"([@{self.user_screen_name}](https://twitter.com/{self.user_screen_name}))"
+             f"\n\n{self.text}\n\n{self.url}"
+         )
+         if self.tickers:
+             md += f"\n\n**Tickers:** {', '.join(self.tickers)}"
+         if self.hashtags:
+             md += f"\n**Hashtags:** {', '.join(self.hashtags)}"
+         return md
@@ -0,0 +1,522 @@
+ import asyncio
+ import datetime as dt
+ import json
+ import logging
+ import re
+ from collections.abc import AsyncIterator, Iterable
+ from pathlib import Path
+ from typing import Any
+
+ import aiohttp
+ import uncurl
+
+ from .tweet import MediaItem, Tweet
+
+ logger = logging.getLogger(__name__)
+
+
+ def get_in(obj: Any, path: list[str] | tuple[str, ...], default: Any = None) -> Any:
+     """Safe nested get: get_in(d, ['a','b','c'])."""
+     cur = obj
+     for key in path:
+         if not isinstance(cur, dict) or key not in cur:
+             return default
+         cur = cur[key]
+     return cur
+
+
+ def is_promoted_entry(entry: dict) -> bool:
+     """Detect ads/promoted units."""
+     eid = entry.get("entryId", "")
+     if eid.startswith(("promoted-", "advertiser-")):
+         return True
+     content = entry.get("content", {})
+     item = content.get("itemContent", {})
+     return "promotedMetadata" in item or "promotedMetadata" in content
+
+
+ def is_tweet_item(entry: dict) -> bool:
+     """Detect timeline items that actually contain tweets."""
+     content = entry.get("content", {})
+     if content.get("entryType") != "TimelineTimelineItem":
+         return False
+     item = content.get("itemContent", {})
+     return item.get("itemType") == "TimelineTweet" and "tweet_results" in item
+
+
+ def normalize_tweet_result(tweet_results: dict) -> dict | None:
+     """
+     Normalize GraphQL result:
+     - 'Tweet' -> as-is
+     - 'TweetWithVisibilityResults' -> unwrap '.tweet'
+     """
+     result = tweet_results.get("result")
+     if not isinstance(result, dict):
+         return None
+     tname = result.get("__typename")
+     if tname == "Tweet":
+         return result
+     if tname == "TweetWithVisibilityResults":
+         inner = result.get("tweet")
+         return inner if isinstance(inner, dict) else None
+     return None
+
+
+ def unescape_entities(text: str) -> str:
+     """Unescape a minimal set of HTML entities that appear in tweet text."""
+     return text.replace("&amp;", "&").replace("&gt;", ">").replace("&lt;", "<")
+
+
+ _RE_TRAILING_TCO = re.compile(r"(https?://t\.co/\S+)$")
+
+
+ def strip_trailing_tco(text: str) -> str:
+     return _RE_TRAILING_TCO.sub("", text)
+
+
+ class XTimelineClient:
+     """
+     Minimal client for polling an X/Twitter timeline endpoint described by a cURL.
+
+     Parameters
+     ----------
+     curl_path : str
+         Path to a text file containing a single cURL command.
+     timeout_s : float
+         Per-request timeout in seconds.
+     persist_last_id_path : str | None
+         Optional path to persist last seen tweet id between runs.
+     """
+
+     def __init__(
+         self,
+         curl_path: str = "curl.txt",
+         timeout_s: float = 30.0,
+         persist_last_id_path: str | None = None,
+     ) -> None:
+         self.curl_path = Path(curl_path)
+         self.timeout_s = timeout_s
+         self._session: aiohttp.ClientSession | None = None
+         self._req: dict[str, Any] = {}
+         self._last_tweet_id: int = 0
+         self.persist_last_id_path = (
+             Path(persist_last_id_path) if persist_last_id_path else None
+         )
+         self._load_curl()
+         self._load_last_id()
+
+     # ---------- lifecycle ----------
+
+     def _load_curl(self) -> None:
+         """Parse the cURL file into url/headers/cookies/json payload."""
+         try:
+             raw = self.curl_path.read_text(encoding="utf-8")
+             ctx = uncurl.parse_context(
+                 "".join(line.strip() for line in raw.splitlines())
+             )
+             self._req = {
+                 "url": ctx.url,
+                 "headers": dict(ctx.headers) if ctx.headers else {},
+                 "cookies": dict(ctx.cookies) if ctx.cookies else {},
+                 "json": json.loads(ctx.data) if ctx.data else None,
+                 "method": ctx.method.upper(),
+             }
+         except Exception as e:
+             logger.critical("Error reading %s: %s", self.curl_path, e)
+             self._req = {}
+
+     def _load_last_id(self) -> None:
+         """Load last tweet id from disk (if configured)."""
+         if not self.persist_last_id_path:
+             return
+         try:
+             self._last_tweet_id = int(
+                 self.persist_last_id_path.read_text().strip() or "0"
+             )
+         except FileNotFoundError:
+             self._last_tweet_id = 0
+         except Exception as e:
+             logger.warning("Could not read last id file: %s", e)
+
+     def _store_last_id(self) -> None:
+         """Persist last tweet id to disk (if configured)."""
+         if not self.persist_last_id_path:
+             return
+         try:
+             self.persist_last_id_path.parent.mkdir(parents=True, exist_ok=True)
+             self.persist_last_id_path.write_text(
+                 str(self._last_tweet_id), encoding="utf-8"
+             )
+         except Exception as e:
+             logger.warning("Could not write last id file: %s", e)
+
+     async def __aenter__(self) -> "XTimelineClient":
+         await self._ensure_session()
+         return self
+
+     async def __aexit__(self, exc_type, exc, tb) -> None:
+         await self.aclose()
+
+     async def aclose(self) -> None:
+         if self._session and not self._session.closed:
+             await self._session.close()
+         self._session = None
+
+     async def _ensure_session(self) -> None:
+         if self._session is None or self._session.closed:
+             timeout = aiohttp.ClientTimeout(total=self.timeout_s)
+             self._session = aiohttp.ClientSession(
+                 headers=self._req.get("headers"),
+                 cookies=self._req.get("cookies"),
+                 timeout=timeout,
+             )
+
+     # ---------- HTTP ----------
+
+     async def fetch_raw(self, *, text: bool = False) -> dict | str:
+         """
+         Perform a single GET call using the cURL details.
+
+         Parameters
+         ----------
+         text : bool
+             If True, return raw text; else parse JSON.
+
+         Returns
+         -------
+         dict | str
+             Parsed JSON (dict) or raw text.
+         """
+         if not self._req:
+             logger.critical("No cURL loaded. Aborting fetch.")
+             return "" if text else {}
+
+         await self._ensure_session()
+         assert self._session is not None
+         url = self._req["url"]
+         json_payload = self._req.get("json")
+
+         try:
+             async with self._session.get(url, json=json_payload) as resp:
+                 if resp.status >= 400:
+                     body = await resp.text()
+                     logger.error(
+                         "HTTP %s for %s\nResponse: %s", resp.status, url, body[:2000]
+                     )
+                     return "" if text else {}
+
+                 if text:
+                     return await resp.text()
+
+                 try:
+                     return await resp.json()
+                 except aiohttp.ContentTypeError:
+                     raw = await resp.text()
+                     logger.error("Non-JSON response from %s\nBody: %s", url, raw[:2000])
+                     return {}
+                 except json.JSONDecodeError as e:
+                     raw = await resp.text()
+                     logger.error("JSON decode error: %s\nBody: %s", e, raw[:2000])
+                     return {}
+         except aiohttp.ClientError as e:
+             logger.error("Network error for %s: %s", url, e)
+         except asyncio.TimeoutError:
+             logger.error("Timeout after %ss for %s", self.timeout_s, url)
+         return "" if text else {}
+
+     # ---------- extraction ----------
+
+     @staticmethod
+     def _get_entries(payload: dict) -> list[dict]:
+         """
+         Extract raw timeline entries (not yet filtered/normalized).
+
+         Returns
+         -------
+         list[dict]
+             Entries inside TimelineAddEntries.
+         """
+         instructions = get_in(
+             payload, ["data", "home", "home_timeline_urt", "instructions"], []
+         )
+         if not isinstance(instructions, list):
+             return []
+         for inst in instructions:
+             if inst.get("type") == "TimelineAddEntries":
+                 entries = inst.get("entries", [])
+                 return entries if isinstance(entries, list) else []
+         return []
+
+     def _iter_entry_tweets(self, entries: list[dict]) -> Iterable[dict]:
+         """
+         Yield normalized tweet dicts from raw entries, skipping promoted & non-tweet items.
+         """
+         for entry in entries:
+             if not is_tweet_item(entry):
+                 continue
+             if is_promoted_entry(entry):
+                 continue
+             item = entry["content"]["itemContent"]
+             twr = item.get("tweet_results", {})
+             tw = normalize_tweet_result(twr)
+             if not tw:
+                 continue
+             yield tw
+
+     # ---------- parsing ----------
+
+     def _save_errored_tweet(self, tweet: dict, error_msg: str) -> None:
+         logger.error(error_msg)
+         ts = dt.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
+         Path("logs").mkdir(parents=True, exist_ok=True)
+         (Path("logs") / f"error_tweet_{ts}.json").write_text(
+             json.dumps(tweet, ensure_ascii=False, indent=4), encoding="utf-8"
+         )
+
+     def _user_field(self, tweet: dict, key: str) -> str:
+         return (
+             tweet.get("core", {})
+             .get("user_results", {})
+             .get("result", {})
+             .get("legacy", {})
+             .get(key, "")
+         )
+
+     def _entities(self, tweet: dict, key: str) -> list[str]:
+         legacy = tweet.get("legacy", {})
+         entities = legacy.get("entities", {}).get(key)
+         if not entities:
+             return []
+         return [
+             e.get("text", "") for e in entities if isinstance(e, dict) and "text" in e
+         ]
+
+     def _collect_media(self, tweet: dict) -> tuple[list[MediaItem], list[str]]:
+         legacy = tweet.get("legacy", {})
+         ext = legacy.get("extended_entities", {})
+         media_items: list[MediaItem] = []
+         media_types: list[str] = []
+         if "media" in ext:
+             for m in ext["media"]:
+                 if not isinstance(m, dict):
+                     continue
+                 url = m.get("media_url_https") or m.get("media_url") or ""
+                 mtype = m.get("type", "")
+                 if url:
+                     media_items.append(MediaItem(url=url, type=mtype))
+                     media_types.append(mtype)
+         # dedupe by URL
+         seen = set()
+         uniq_items: list[MediaItem] = []
+         uniq_types: list[str] = []
+         for mi, mt in zip(media_items, media_types):
+             if mi.url in seen:
+                 continue
+             seen.add(mi.url)
+             uniq_items.append(mi)
+             uniq_types.append(mt)
+         return uniq_items, uniq_types
+
+     def _tweet_url(self, tweet_id: int) -> str:
+         return f"https://twitter.com/user/status/{tweet_id}"
+
+     def _parse_single_tweet(
+         self, tw: dict, *, allow_update_last_id: bool
+     ) -> Tweet | None:
+         """
+         Parse a normalized Tweet GraphQL node into a Tweet object.
+
+         Parameters
+         ----------
+         tw : dict
+             GraphQL 'Tweet' node (already normalized).
+         allow_update_last_id : bool
+             If True, update last seen id and skip older/equal.
+
+         Returns
+         -------
+         Tweet | None
+             Parsed Tweet or None if filtered/invalid.
+         """
+         # Prefer legacy.id_str; fall back to rest_id
+         try:
+             tid = int(tw.get("legacy", {}).get("id_str") or tw.get("rest_id") or 0)
+         except Exception:
+             tid = 0
+         if tid <= 0:
+             self._save_errored_tweet(tw, "Missing tweet id")
+             return None
+
+         # last-id gate
+         if allow_update_last_id:
+             if tid <= self._last_tweet_id:
+                 return None
+             self._last_tweet_id = tid
+             self._store_last_id()
+
+         # text & entities
+         legacy = tw.get("legacy", {})
+         text = legacy.get("full_text", "")
+         text = unescape_entities(strip_trailing_tco(text))
+
+         tickers = [t.upper() for t in self._entities(tw, "symbols") if t]
+         hashtags = [
+             h.upper()
+             for h in self._entities(tw, "hashtags")
+             if h and h.upper() != "CRYPTO"
+         ]
+
+         # user info
+         user_name = self._user_field(tw, "name")
+         user_screen = self._user_field(tw, "screen_name")
+         user_img = self._user_field(tw, "profile_image_url_https")
+         url = self._tweet_url(tid)
+
+         # media
+         media_items, media_types = self._collect_media(tw)
+
+         # reply/quote/retweet handling (best-effort)
+         title = f"{user_name} tweeted"
+         quoted = tw.get("quoted_status_result") or None
+         retweeted = tw.get("legacy", {}).get("retweeted_status_result") or None
+
+         # For replies, X often embeds them differently; handle when present.
+         # We try to reuse this parser recursively on embedded nodes.
+         def _parse_nested(n: dict) -> Tweet | None:
+             # n may already be the normalized 'Tweet' or wrapped
+             if "result" in n:
+                 inner = (
+                     normalize_tweet_result(n)
+                     if "tweet" in n.get("result", {})
+                     else n.get("result")
+                 )
+             else:
+                 inner = normalize_tweet_result({"result": n}) or n
+             if not isinstance(inner, dict):
+                 return None
+             # don't update last_id for nested
+             return self._parse_single_tweet(inner, allow_update_last_id=False)
+
+         nested: Tweet | None = None
+         if quoted:
+             nested = _parse_nested(quoted)
+             if nested:
+                 title = f"{user_name} quote tweeted {nested.user_name}"
+                 q_text = "\n".join("> " + line for line in nested.text.splitlines())
+                 text = f"{text}\n\n> [@{nested.user_screen_name}](https://twitter.com/{nested.user_screen_name}):\n{q_text}"
+                 # Merge entities/media
+                 media_items += nested.media
+                 media_types += nested.media_types
+                 tickers = sorted(set(tickers) | set(nested.tickers))
+                 hashtags = sorted(set(hashtags) | set(nested.hashtags))
+
+         if retweeted:
+             nested = _parse_nested(retweeted)
+             if nested:
+                 title = f"{user_name} retweeted {nested.user_name}"
+                 # Use the full RT text
+                 text = nested.text
+                 media_items += nested.media
+                 media_types += nested.media_types
+                 tickers = sorted(set(tickers) | set(nested.tickers))
+                 hashtags = sorted(set(hashtags) | set(nested.hashtags))
+
+         # Replies can show up as composite timeline items; usually handled in the entry stage.
+         # If you later surface reply threads, add that here.
+
+         # dedupe media again after merges
+         uniq_media: list[MediaItem] = []
+         seen_urls = set()
+         for m in media_items:
+             if m.url and m.url not in seen_urls:
+                 uniq_media.append(m)
+                 seen_urls.add(m.url)
+
+         return Tweet(
+             id=tid,
+             text=text,
+             user_name=user_name,
+             user_screen_name=user_screen,
+             user_img=user_img,
+             url=url,
+             media=uniq_media,
+             tickers=sorted(set(tickers)),
+             hashtags=sorted(set(hashtags)),
+             title=title,
+             media_types=[m.type for m in uniq_media],
+         )
+
+     # ---------- public APIs ----------
+
+     async def fetch_tweets(self, *, update_last_id: bool = False) -> list[Tweet]:
+         """
+         Fetch entries and parse into `Tweet` objects.
+
+         Parameters
+         ----------
+         update_last_id : bool
+             If True, update the client's last-seen tweet id (skip older/equal).
+
+         Returns
+         -------
+         list[Tweet]
+             Parsed tweets (ads removed, deduped).
+         """
+         payload = await self.fetch_raw(text=False)
+         if not isinstance(payload, dict) or not payload:
+             return []
+
+         out: list[Tweet] = []
+         seen: set[int] = set()
+         for tw in self._iter_entry_tweets(self._get_entries(payload)):
+             parsed = self._parse_single_tweet(tw, allow_update_last_id=update_last_id)
+             if not parsed:
+                 continue
+             if parsed.id in seen:
+                 continue
+             seen.add(parsed.id)
+             out.append(parsed)
+         return out
+
+     async def stream(self, interval_s: float = 5.0) -> AsyncIterator[Tweet]:
+         """
+         Async generator that yields new tweets forever.
+
+         Parameters
+         ----------
+         interval_s : float
+             Polling interval in seconds.
+
+         Yields
+         ------
+         Tweet
+             Each new parsed Tweet.
+         """
+         while True:
+             try:
+                 for tw in await self.fetch_tweets(update_last_id=True):
+                     yield tw
+             except Exception as e:
+                 logger.error("stream() iteration error: %s", e)
+             await asyncio.sleep(interval_s)
+
+
+ async def _example_once():
+     async with XTimelineClient(
+         "curl.txt", persist_last_id_path="state/last_id.txt"
+     ) as xc:
+         tweets = await xc.fetch_tweets(update_last_id=False)
+         for t in tweets:
+             print(t.to_markdown())
+
+
+ async def _example_stream():
+     async with XTimelineClient(
+         "curl.txt", persist_last_id_path="state/last_id.txt"
+     ) as xc:
+         async for tweet in xc.stream(interval_s=5.0):
+             print(tweet.id, tweet.text)
+
+
+ # if __name__ == "__main__":
+ #     asyncio.run(_example_stream())
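The module's pure helpers (`get_in`, the trailing-t.co regex) are easy to exercise in isolation. This sketch reproduces their logic so it runs without the package installed:

```python
import re
from typing import Any

# Same pattern the client uses to drop a trailing t.co link from tweet text.
_RE_TRAILING_TCO = re.compile(r"(https?://t\.co/\S+)$")


def get_in(obj: Any, path, default: Any = None) -> Any:
    """Walk nested dicts, returning default on the first missing key."""
    cur = obj
    for key in path:
        if not isinstance(cur, dict) or key not in cur:
            return default
        cur = cur[key]
    return cur


# The nested path the client walks to reach timeline instructions.
payload = {"data": {"home": {"home_timeline_urt": {"instructions": []}}}}
entries_path = ["data", "home", "home_timeline_urt", "instructions"]
print(get_in(payload, entries_path))            # []
print(get_in(payload, ["data", "missing"], 0))  # 0

cleaned = _RE_TRAILING_TCO.sub("", "GM everyone https://t.co/abc123").rstrip()
print(cleaned)  # GM everyone
```

Keeping these helpers free of client state is what lets `_get_entries` stay a `@staticmethod` and makes the parsing layer unit-testable without network access.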
@@ -0,0 +1,107 @@
+ Metadata-Version: 2.4
+ Name: xtimeline
+ Version: 0.1.0
+ Summary: Lightweight X/Twitter timeline client (GraphQL via cURL or auth strategies)
+ Author: Stephan Akkerman
+ License: MIT
+ Keywords: twitter,x,timeline,scraping,aiohttp,asyncio
+ Requires-Python: >=3.10
+ Description-Content-Type: text/markdown
+ License-File: LICENSE
+ Requires-Dist: aiohttp>=3.12.15
+ Requires-Dist: uncurl>=0.0.11
+ Dynamic: license-file
+
+ # X-Timeline Scraper
+ A Python client to scrape tweets from X (formerly Twitter) timelines using a cURL command.
+
+ <!-- Add a banner here like: https://github.com/StephanAkkerman/fintwit-bot/blob/main/img/logo/fintwit-banner.png -->
+
+ ---
+ <!-- Adjust the link of the second badge to your own repo -->
+ <p align="center">
+ <img src="https://img.shields.io/badge/python-3.10%2B-blue.svg" alt="Supported versions">
+ <img src="https://img.shields.io/github/license/StephanAkkerman/x-timeline-scraper.svg?color=brightgreen" alt="License">
+ <a href="https://github.com/psf/black"><img src="https://img.shields.io/badge/code%20style-black-000000.svg" alt="Code style: black"></a>
+ </p>
+
+ ## Introduction
+
+ This project provides a Python client to scrape tweets from X (formerly Twitter) timelines using a cURL command. It leverages asynchronous programming for efficient data retrieval and includes features for parsing tweet data.
+
+ ## Table of Contents 🗂
+
+ - [Installation](#installation)
+ - [Usage](#usage)
+ - [Citation](#citation)
+ - [Contributing](#contributing)
+ - [License](#license)
+
+ ## Installation ⚙️
+ <!-- Adjust the link of the second command to your own repo -->
+
+ Dependencies are declared in pyproject.toml. After cloning the repository, install the package with:
+
+ ```bash
+ pip install .
+ ```
+
+ or install directly from GitHub:
+
+ ```bash
+ pip install git+https://github.com/StephanAkkerman/x-timeline-scraper.git
+ ```
+
+ ## Usage ⌨️
+
+ To use the X-Timeline Scraper, you need to provide a cURL command that accesses the desired X timeline. The instructions can be found in [curl_example.txt](curl_example.txt). Then, you can use the `XTimelineClient` class to fetch and parse tweets.
+
+ Here's a simple example of how to use the client:
+
+ ```python
+ import asyncio
+
+ from xtimeline import XTimelineClient
+
+
+ async def main():
+     async with XTimelineClient(
+         "curl.txt", persist_last_id_path="state/last_id.txt"
+     ) as xc:
+         tweets = await xc.fetch_tweets(update_last_id=False)
+         for t in tweets:
+             print(t.to_markdown())
+
+
+ asyncio.run(main())
+ ```
+
+ You can also stream new tweets in real time:
+
+ ```python
+ import asyncio
+
+ from xtimeline import XTimelineClient
+
+
+ async def main():
+     async with XTimelineClient(
+         "curl.txt", persist_last_id_path="state/last_id.txt"
+     ) as xc:
+         async for t in xc.stream(interval_s=5.0):
+             print(t.to_markdown())
+
+
+ asyncio.run(main())
+ ```
+
+ ## Citation ✍️
+ <!-- Be sure to adjust everything here so it matches your name and repo -->
+ If you use this project in your research, please cite as follows:
+
+ ```bibtex
+ @misc{x_timeline_scraper,
+   author = {Stephan Akkerman},
+   title = {X-Timeline Scraper},
+   year = {2025},
+   publisher = {GitHub},
+   journal = {GitHub repository},
+   howpublished = {\url{https://github.com/StephanAkkerman/x-timeline-scraper}}
+ }
+ ```
+
+ ## Contributing 🛠
+ <!-- Be sure to adjust the repo name here for both the URL and GitHub link -->
+ Contributions are welcome! If you have a feature request, bug report, or proposal for code refactoring, please feel free to open an issue on GitHub. We appreciate your help in improving this project.
+
+ [![Contributors](https://contributors-img.firebaseapp.com/image?repo=StephanAkkerman/x-timeline-scraper)](https://github.com/StephanAkkerman/x-timeline-scraper/graphs/contributors)
+
+ ## License 📜
+
+ This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
@@ -0,0 +1,11 @@
+ LICENSE
+ README.md
+ pyproject.toml
+ src/__init__.py
+ src/tweet.py
+ src/xclient.py
+ src/xtimeline.egg-info/PKG-INFO
+ src/xtimeline.egg-info/SOURCES.txt
+ src/xtimeline.egg-info/dependency_links.txt
+ src/xtimeline.egg-info/requires.txt
+ src/xtimeline.egg-info/top_level.txt
@@ -0,0 +1,2 @@
+ aiohttp>=3.12.15
+ uncurl>=0.0.11
@@ -0,0 +1,3 @@
+ __init__
+ tweet
+ xclient