twitwi-0.21.1.tar.gz → twitwi-0.22.0.tar.gz
- {twitwi-0.21.1/twitwi.egg-info → twitwi-0.22.0}/PKG-INFO +35 -5
- {twitwi-0.21.1 → twitwi-0.22.0}/README.md +34 -4
- {twitwi-0.21.1 → twitwi-0.22.0}/setup.py +1 -1
- {twitwi-0.21.1 → twitwi-0.22.0}/test/bluesky/formatters_test.py +48 -1
- {twitwi-0.21.1 → twitwi-0.22.0}/test/bluesky/normalizers_test.py +49 -3
- {twitwi-0.21.1 → twitwi-0.22.0}/twitwi/bluesky/__init__.py +10 -1
- {twitwi-0.21.1 → twitwi-0.22.0}/twitwi/bluesky/constants.py +3 -1
- {twitwi-0.21.1 → twitwi-0.22.0}/twitwi/bluesky/formatters.py +9 -0
- {twitwi-0.21.1 → twitwi-0.22.0}/twitwi/bluesky/normalizers.py +94 -28
- {twitwi-0.21.1 → twitwi-0.22.0}/twitwi/bluesky/types.py +29 -15
- {twitwi-0.21.1 → twitwi-0.22.0}/twitwi/bluesky/utils.py +33 -14
- {twitwi-0.21.1 → twitwi-0.22.0}/twitwi/utils.py +28 -2
- {twitwi-0.21.1 → twitwi-0.22.0/twitwi.egg-info}/PKG-INFO +35 -5
- {twitwi-0.21.1 → twitwi-0.22.0}/LICENSE.txt +0 -0
- {twitwi-0.21.1 → twitwi-0.22.0}/setup.cfg +0 -0
- {twitwi-0.21.1 → twitwi-0.22.0}/test/bluesky/__init__.py +0 -0
- {twitwi-0.21.1 → twitwi-0.22.0}/twitwi/__init__.py +0 -0
- {twitwi-0.21.1 → twitwi-0.22.0}/twitwi/anonymizers.py +0 -0
- {twitwi-0.21.1 → twitwi-0.22.0}/twitwi/constants.py +0 -0
- {twitwi-0.21.1 → twitwi-0.22.0}/twitwi/exceptions.py +0 -0
- {twitwi-0.21.1 → twitwi-0.22.0}/twitwi/formatters.py +0 -0
- {twitwi-0.21.1 → twitwi-0.22.0}/twitwi/normalizers.py +0 -0
- {twitwi-0.21.1 → twitwi-0.22.0}/twitwi.egg-info/SOURCES.txt +0 -0
- {twitwi-0.21.1 → twitwi-0.22.0}/twitwi.egg-info/dependency_links.txt +0 -0
- {twitwi-0.21.1 → twitwi-0.22.0}/twitwi.egg-info/requires.txt +0 -0
- {twitwi-0.21.1 → twitwi-0.22.0}/twitwi.egg-info/top_level.txt +0 -0
- {twitwi-0.21.1 → twitwi-0.22.0}/twitwi.egg-info/zip-safe +0 -0
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.4
|
|
2
2
|
Name: twitwi
|
|
3
|
-
Version: 0.
|
|
3
|
+
Version: 0.22.0
|
|
4
4
|
Summary: A collection of Twitter-related helper functions for python.
|
|
5
5
|
Home-page: http://github.com/medialab/twitwi
|
|
6
6
|
Author: Béatrice Mazoyer, Guillaume Plique, Benjamin Ooghe-Tabanou
|
|
@@ -56,18 +56,22 @@ pip install twitwi
|
|
|
56
56
|
*Normalization functions*
|
|
57
57
|
|
|
58
58
|
* [normalize_profile](#normalize_profile)
|
|
59
|
+
* [normalize_partial_profile](#normalize_partial_profile)
|
|
59
60
|
* [normalize_post](#normalize_post)
|
|
60
61
|
|
|
61
62
|
*Formatting functions*
|
|
62
63
|
|
|
63
64
|
* [transform_profile_into_csv_dict](#transform_profile_into_csv_dict)
|
|
65
|
+
* [transform_partial_profile_into_csv_dict](#transform_partial_profile_into_csv_dict)
|
|
64
66
|
* [transform_post_into_csv_dict](#transform_post_into_csv_dict)
|
|
65
67
|
* [format_profile_as_csv_row](#format_profile_as_csv_row)
|
|
68
|
+
* [format_partial_profile_as_csv_row](#format_partial_profile_as_csv_row)
|
|
66
69
|
* [format_post_as_csv_row](#format_post_as_csv_row)
|
|
67
70
|
|
|
68
71
|
*Useful constants (under `twitwi.bluesky.constants`)*
|
|
69
72
|
|
|
70
73
|
* [PROFILE_FIELDS](#profile_fields)
|
|
74
|
+
* [PARTIAL_PROFILE_FIELDS](#partial_profile_fields)
|
|
71
75
|
* [POST_FIELDS](#post_fields)
|
|
72
76
|
|
|
73
77
|
*Examples*
|
|
@@ -95,7 +99,7 @@ for post_data in posts_payload_from_API:
|
|
|
95
99
|
|
|
96
100
|
# Then, saving normalized profiles into a CSV using DictWriter:
|
|
97
101
|
|
|
98
|
-
|
|
102
|
+
import csv
|
|
99
103
|
from twitwi.bluesky.constants import POST_FIELDS
|
|
100
104
|
from twitwi.bluesky import transform_post_into_csv_dict
|
|
101
105
|
|
|
@@ -108,7 +112,6 @@ with open("normalized_bluesky_posts.csv", "w") as f:
|
|
|
108
112
|
|
|
109
113
|
# Or using the basic CSV writer:
|
|
110
114
|
|
|
111
|
-
from csv import writer
|
|
112
115
|
from twitwi.bluesky import format_post_as_csv_row
|
|
113
116
|
|
|
114
117
|
with open("normalized_bluesky_posts.csv", "w") as f:
|
|
@@ -180,7 +183,18 @@ with open("normalized_bluesky_profiles.csv", "w") as f:
|
|
|
180
183
|
|
|
181
184
|
### normalize_profile
|
|
182
185
|
|
|
183
|
-
Function taking a nested dict describing a user profile from Bluesky's JSON payload and returning a flat "normalized" dict composed of all [PROFILE_FIELDS](#profile_fields) keys.
|
|
186
|
+
Function taking a nested dict describing a user profile from Bluesky's JSON payload (with the same format as retrieved from [`app.bsky.actor.getProfiles` HTTP endpoint](docs.bsky.app/docs/api/app-bsky-actor-get-profiles#responses)) and returning a flat "normalized" dict composed of all [PROFILE_FIELDS](#profile_fields) keys. Be careful not to confuse with the [normalize_partial_profile](#normalize_partial_profile) function which operate on a lighter version of the profile data, retrieved from [follower/follow profile payloads](https://docs.bsky.app/docs/api/app-bsky-graph-get-followers#responses) for example.
|
|
187
|
+
|
|
188
|
+
Will return datetimes as UTC but can take an optional second `locale` argument as a [`pytz`](https://pypi.org/project/pytz/) string timezone.
|
|
189
|
+
|
|
190
|
+
*Arguments*
|
|
191
|
+
|
|
192
|
+
* **data** *(dict)*: user profile data payload coming from Bluesky API.
|
|
193
|
+
* **locale** *(pytz.timezone as str, optional)*: timezone used to convert dates. If not given, will default to UTC.
|
|
194
|
+
|
|
195
|
+
### normalize_partial_profile
|
|
196
|
+
|
|
197
|
+
Function taking a nested dict describing a user profile from Bluesky's JSON payload (with the same format as retrieved from [`app.bsky.graph.getFollowers` HTTP endpoint](https://docs.bsky.app/docs/api/app-bsky-graph-get-followers#responses)) and returning a flat "normalized" dict composed of all [PARTIAL_PROFILE_FIELDS](#partial_profile_fields) keys. Be careful not to confuse with the [normalize_profile](#normalize_profile) function which operate on the full version of the profile data, retrieved from [`app.bsky.actor.getProfiles` HTTP endpoint](docs.bsky.app/docs/api/app-bsky-actor-get-profiles#responses) for example.
|
|
184
198
|
|
|
185
199
|
Will return datetimes as UTC but can take an optional second `locale` argument as a [`pytz`](https://pypi.org/project/pytz/) string timezone.
|
|
186
200
|
|
|
@@ -210,6 +224,12 @@ Function transforming (i.e. mutating, so beware) a given normalized Bluesky prof
|
|
|
210
224
|
|
|
211
225
|
Will convert list elements of the normalized data into a string with all elements separated by the `|` character, which can be changed using an optional `plural_separator` argument.
|
|
212
226
|
|
|
227
|
+
### transform_partial_profile_into_csv_dict
|
|
228
|
+
|
|
229
|
+
Function transforming (i.e. mutating, so beware) a given normalized Bluesky partial profile into a suitable dict able to be written by a `csv.DictWriter` as a row.
|
|
230
|
+
|
|
231
|
+
Will convert list elements of the normalized data into a string with all elements separated by the `|` character, which can be changed using an optional `plural_separator` argument.
|
|
232
|
+
|
|
213
233
|
### transform_post_into_csv_dict
|
|
214
234
|
|
|
215
235
|
Function transforming (i.e. mutating, so beware) a given normalized Bluesky post into a suitable dict able to be written by a `csv.DictWriter` as a row.
|
|
@@ -222,6 +242,12 @@ Function formatting the given normalized Bluesky profile as a list able to be wr
|
|
|
222
242
|
|
|
223
243
|
Will convert list elements of the normalized data into a string with all elements separated by the `|` character, which can be changed using an optional `plural_separator` argument.
|
|
224
244
|
|
|
245
|
+
### format_partial_profile_as_csv_row
|
|
246
|
+
|
|
247
|
+
Function formatting the given normalized Bluesky partial profile as a list able to be written by a `csv.writer` as a row in the order of [PARTIAL_PROFILE_FIELDS](#partial_profile_fields) (which can therefore be used as header row of the CSV).
|
|
248
|
+
|
|
249
|
+
Will convert list elements of the normalized data into a string with all elements separated by the `|` character, which can be changed using an optional `plural_separator` argument.
|
|
250
|
+
|
|
225
251
|
### format_post_as_csv_row
|
|
226
252
|
|
|
227
253
|
Function formatting the given normalized tBluesky post as a list able to be written by a `csv.writer` as a row in the order of [POST_FIELDS](#post_fields) (which can therefore be used as header row of the CSV).
|
|
@@ -230,7 +256,11 @@ Will convert list elements of the normalized data into a string with all element
|
|
|
230
256
|
|
|
231
257
|
### PROFILE_FIELDS
|
|
232
258
|
|
|
233
|
-
List of a Bluesky user profile's normalized field names. Useful to declare headers with csv writers.
|
|
259
|
+
List of a Bluesky user profile's normalized field names. Useful to declare headers with csv writers. Be careful not to confuse with [PARTIAL_PROFILE_FIELDS](#partial_profile_fields) which correspond to a lighter version of the profile data, retrieved from [follower/follow profile payloads](https://docs.bsky.app/docs/api/app-bsky-graph-get-followers#responses) for example.
|
|
260
|
+
|
|
261
|
+
### PARTIAL_PROFILE_FIELDS
|
|
262
|
+
|
|
263
|
+
List of a Bluesky user partial profile's (retrieved from [`app.bsky.graph.getFollowers` HTTP endpoint](https://docs.bsky.app/docs/api/app-bsky-graph-get-followers#responses) for example) normalized field names. Useful to declare headers with csv writers. Be careful not to confuse with [PROFILE_FIELDS](#profile_fields) which correspond to the full version of the profile data, retrieved from [`app.bsky.actor.getProfiles` HTTP endpoint](docs.bsky.app/docs/api/app-bsky-actor-get-profiles#responses) for example.
|
|
234
264
|
|
|
235
265
|
### POST_FIELDS
|
|
236
266
|
|
|
@@ -29,18 +29,22 @@ pip install twitwi
|
|
|
29
29
|
*Normalization functions*
|
|
30
30
|
|
|
31
31
|
* [normalize_profile](#normalize_profile)
|
|
32
|
+
* [normalize_partial_profile](#normalize_partial_profile)
|
|
32
33
|
* [normalize_post](#normalize_post)
|
|
33
34
|
|
|
34
35
|
*Formatting functions*
|
|
35
36
|
|
|
36
37
|
* [transform_profile_into_csv_dict](#transform_profile_into_csv_dict)
|
|
38
|
+
* [transform_partial_profile_into_csv_dict](#transform_partial_profile_into_csv_dict)
|
|
37
39
|
* [transform_post_into_csv_dict](#transform_post_into_csv_dict)
|
|
38
40
|
* [format_profile_as_csv_row](#format_profile_as_csv_row)
|
|
41
|
+
* [format_partial_profile_as_csv_row](#format_partial_profile_as_csv_row)
|
|
39
42
|
* [format_post_as_csv_row](#format_post_as_csv_row)
|
|
40
43
|
|
|
41
44
|
*Useful constants (under `twitwi.bluesky.constants`)*
|
|
42
45
|
|
|
43
46
|
* [PROFILE_FIELDS](#profile_fields)
|
|
47
|
+
* [PARTIAL_PROFILE_FIELDS](#partial_profile_fields)
|
|
44
48
|
* [POST_FIELDS](#post_fields)
|
|
45
49
|
|
|
46
50
|
*Examples*
|
|
@@ -68,7 +72,7 @@ for post_data in posts_payload_from_API:
|
|
|
68
72
|
|
|
69
73
|
# Then, saving normalized profiles into a CSV using DictWriter:
|
|
70
74
|
|
|
71
|
-
|
|
75
|
+
import csv
|
|
72
76
|
from twitwi.bluesky.constants import POST_FIELDS
|
|
73
77
|
from twitwi.bluesky import transform_post_into_csv_dict
|
|
74
78
|
|
|
@@ -81,7 +85,6 @@ with open("normalized_bluesky_posts.csv", "w") as f:
|
|
|
81
85
|
|
|
82
86
|
# Or using the basic CSV writer:
|
|
83
87
|
|
|
84
|
-
from csv import writer
|
|
85
88
|
from twitwi.bluesky import format_post_as_csv_row
|
|
86
89
|
|
|
87
90
|
with open("normalized_bluesky_posts.csv", "w") as f:
|
|
@@ -153,7 +156,18 @@ with open("normalized_bluesky_profiles.csv", "w") as f:
|
|
|
153
156
|
|
|
154
157
|
### normalize_profile
|
|
155
158
|
|
|
156
|
-
Function taking a nested dict describing a user profile from Bluesky's JSON payload and returning a flat "normalized" dict composed of all [PROFILE_FIELDS](#profile_fields) keys.
|
|
159
|
+
Function taking a nested dict describing a user profile from Bluesky's JSON payload (with the same format as retrieved from [`app.bsky.actor.getProfiles` HTTP endpoint](docs.bsky.app/docs/api/app-bsky-actor-get-profiles#responses)) and returning a flat "normalized" dict composed of all [PROFILE_FIELDS](#profile_fields) keys. Be careful not to confuse with the [normalize_partial_profile](#normalize_partial_profile) function which operate on a lighter version of the profile data, retrieved from [follower/follow profile payloads](https://docs.bsky.app/docs/api/app-bsky-graph-get-followers#responses) for example.
|
|
160
|
+
|
|
161
|
+
Will return datetimes as UTC but can take an optional second `locale` argument as a [`pytz`](https://pypi.org/project/pytz/) string timezone.
|
|
162
|
+
|
|
163
|
+
*Arguments*
|
|
164
|
+
|
|
165
|
+
* **data** *(dict)*: user profile data payload coming from Bluesky API.
|
|
166
|
+
* **locale** *(pytz.timezone as str, optional)*: timezone used to convert dates. If not given, will default to UTC.
|
|
167
|
+
|
|
168
|
+
### normalize_partial_profile
|
|
169
|
+
|
|
170
|
+
Function taking a nested dict describing a user profile from Bluesky's JSON payload (with the same format as retrieved from [`app.bsky.graph.getFollowers` HTTP endpoint](https://docs.bsky.app/docs/api/app-bsky-graph-get-followers#responses)) and returning a flat "normalized" dict composed of all [PARTIAL_PROFILE_FIELDS](#partial_profile_fields) keys. Be careful not to confuse with the [normalize_profile](#normalize_profile) function which operate on the full version of the profile data, retrieved from [`app.bsky.actor.getProfiles` HTTP endpoint](docs.bsky.app/docs/api/app-bsky-actor-get-profiles#responses) for example.
|
|
157
171
|
|
|
158
172
|
Will return datetimes as UTC but can take an optional second `locale` argument as a [`pytz`](https://pypi.org/project/pytz/) string timezone.
|
|
159
173
|
|
|
@@ -183,6 +197,12 @@ Function transforming (i.e. mutating, so beware) a given normalized Bluesky prof
|
|
|
183
197
|
|
|
184
198
|
Will convert list elements of the normalized data into a string with all elements separated by the `|` character, which can be changed using an optional `plural_separator` argument.
|
|
185
199
|
|
|
200
|
+
### transform_partial_profile_into_csv_dict
|
|
201
|
+
|
|
202
|
+
Function transforming (i.e. mutating, so beware) a given normalized Bluesky partial profile into a suitable dict able to be written by a `csv.DictWriter` as a row.
|
|
203
|
+
|
|
204
|
+
Will convert list elements of the normalized data into a string with all elements separated by the `|` character, which can be changed using an optional `plural_separator` argument.
|
|
205
|
+
|
|
186
206
|
### transform_post_into_csv_dict
|
|
187
207
|
|
|
188
208
|
Function transforming (i.e. mutating, so beware) a given normalized Bluesky post into a suitable dict able to be written by a `csv.DictWriter` as a row.
|
|
@@ -195,6 +215,12 @@ Function formatting the given normalized Bluesky profile as a list able to be wr
|
|
|
195
215
|
|
|
196
216
|
Will convert list elements of the normalized data into a string with all elements separated by the `|` character, which can be changed using an optional `plural_separator` argument.
|
|
197
217
|
|
|
218
|
+
### format_partial_profile_as_csv_row
|
|
219
|
+
|
|
220
|
+
Function formatting the given normalized Bluesky partial profile as a list able to be written by a `csv.writer` as a row in the order of [PARTIAL_PROFILE_FIELDS](#partial_profile_fields) (which can therefore be used as header row of the CSV).
|
|
221
|
+
|
|
222
|
+
Will convert list elements of the normalized data into a string with all elements separated by the `|` character, which can be changed using an optional `plural_separator` argument.
|
|
223
|
+
|
|
198
224
|
### format_post_as_csv_row
|
|
199
225
|
|
|
200
226
|
Function formatting the given normalized tBluesky post as a list able to be written by a `csv.writer` as a row in the order of [POST_FIELDS](#post_fields) (which can therefore be used as header row of the CSV).
|
|
@@ -203,7 +229,11 @@ Will convert list elements of the normalized data into a string with all element
|
|
|
203
229
|
|
|
204
230
|
### PROFILE_FIELDS
|
|
205
231
|
|
|
206
|
-
List of a Bluesky user profile's normalized field names. Useful to declare headers with csv writers.
|
|
232
|
+
List of a Bluesky user profile's normalized field names. Useful to declare headers with csv writers. Be careful not to confuse with [PARTIAL_PROFILE_FIELDS](#partial_profile_fields) which correspond to a lighter version of the profile data, retrieved from [follower/follow profile payloads](https://docs.bsky.app/docs/api/app-bsky-graph-get-followers#responses) for example.
|
|
233
|
+
|
|
234
|
+
### PARTIAL_PROFILE_FIELDS
|
|
235
|
+
|
|
236
|
+
List of a Bluesky user partial profile's (retrieved from [`app.bsky.graph.getFollowers` HTTP endpoint](https://docs.bsky.app/docs/api/app-bsky-graph-get-followers#responses) for example) normalized field names. Useful to declare headers with csv writers. Be careful not to confuse with [PROFILE_FIELDS](#profile_fields) which correspond to the full version of the profile data, retrieved from [`app.bsky.actor.getProfiles` HTTP endpoint](docs.bsky.app/docs/api/app-bsky-actor-get-profiles#responses) for example.
|
|
207
237
|
|
|
208
238
|
### POST_FIELDS
|
|
209
239
|
|
|
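The README hunks above describe one CSV convention shared by all the `format_*_as_csv_row` helpers: values are emitted in the order of a field list, and list values are joined with `|` unless another `plural_separator` is given. The sketch below illustrates that convention only. It is not twitwi's implementation, and `FIELDS`, `format_as_csv_row` and the sample record are hypothetical names chosen for the example.

```python
import csv
from io import StringIO

# Hypothetical field list standing in for a constant such as PARTIAL_PROFILE_FIELDS.
FIELDS = ["did", "handle", "tags"]


def format_as_csv_row(item, fields=FIELDS, plural_separator="|"):
    """Return the values of `item` in `fields` order, joining lists into one string."""
    row = []
    for field in fields:
        value = item.get(field, "")
        if isinstance(value, list):
            # List values become a single cell, e.g. ["a", "b"] -> "a|b".
            value = plural_separator.join(str(v) for v in value)
        elif value is None:
            value = ""
        row.append(value)
    return row


# Hypothetical normalized record (key order does not matter, FIELDS decides).
profile = {"handle": "alice.example", "did": "did:plc:abc", "tags": ["a", "b"]}

buffer = StringIO()
writer = csv.writer(buffer)
writer.writerow(FIELDS)  # the field list doubles as the header row
writer.writerow(format_as_csv_row(profile))
csv_text = buffer.getvalue()
```

Because the row follows the field list rather than the dict's own key order, the same list can safely be written as the header.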
@@ -5,7 +5,7 @@ with open("./README.md", "r") as f:
|
|
|
5
5
|
|
|
6
6
|
setup(
|
|
7
7
|
name="twitwi",
|
|
8
|
-
version="0.
|
|
8
|
+
version="0.22.0",
|
|
9
9
|
description="A collection of Twitter-related helper functions for python.",
|
|
10
10
|
long_description=long_description,
|
|
11
11
|
long_description_content_type="text/markdown",
|
|
@@ -2,11 +2,13 @@ import csv
|
|
|
2
2
|
from io import StringIO
|
|
3
3
|
from twitwi.bluesky import (
|
|
4
4
|
format_profile_as_csv_row,
|
|
5
|
+
format_partial_profile_as_csv_row,
|
|
5
6
|
format_post_as_csv_row,
|
|
6
7
|
transform_profile_into_csv_dict,
|
|
8
|
+
transform_partial_profile_into_csv_dict,
|
|
7
9
|
transform_post_into_csv_dict,
|
|
8
10
|
)
|
|
9
|
-
from twitwi.bluesky.constants import PROFILE_FIELDS, POST_FIELDS
|
|
11
|
+
from twitwi.bluesky.constants import PROFILE_FIELDS, PARTIAL_PROFILE_FIELDS, POST_FIELDS
|
|
10
12
|
from test.utils import get_json_resource, open_resource
|
|
11
13
|
|
|
12
14
|
|
|
@@ -56,6 +58,51 @@ class TestFormatters:
|
|
|
56
58
|
buffer.seek(0)
|
|
57
59
|
assert list(csv.DictReader(buffer)) == list(csv.DictReader(f))
|
|
58
60
|
|
|
61
|
+
def test_format_partial_profile_as_csv_row(self):
|
|
62
|
+
normalized_partial_profiles = get_json_resource(
|
|
63
|
+
"bluesky-normalized-partial-profiles.json"
|
|
64
|
+
)
|
|
65
|
+
|
|
66
|
+
buffer = StringIO(newline=None)
|
|
67
|
+
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL)
|
|
68
|
+
writer.writerow(PARTIAL_PROFILE_FIELDS)
|
|
69
|
+
|
|
70
|
+
for profile in normalized_partial_profiles:
|
|
71
|
+
writer.writerow(format_partial_profile_as_csv_row(profile))
|
|
72
|
+
|
|
73
|
+
if OVERWRITE_TESTS:
|
|
74
|
+
written = buffer.getvalue()
|
|
75
|
+
|
|
76
|
+
with open("test/resources/bluesky-partial-profiles-export.csv", "w") as f:
|
|
77
|
+
f.write(written)
|
|
78
|
+
|
|
79
|
+
with open_resource("bluesky-partial-profiles-export.csv") as f:
|
|
80
|
+
buffer.seek(0)
|
|
81
|
+
assert list(csv.reader(buffer)) == list(csv.reader(f))
|
|
82
|
+
|
|
83
|
+
def test_transform_partial_profile_into_csv_dict(self):
|
|
84
|
+
normalized_partial_profiles = get_json_resource(
|
|
85
|
+
"bluesky-normalized-partial-profiles.json"
|
|
86
|
+
)
|
|
87
|
+
|
|
88
|
+
buffer = StringIO(newline=None)
|
|
89
|
+
writer = csv.DictWriter(
|
|
90
|
+
buffer,
|
|
91
|
+
fieldnames=PARTIAL_PROFILE_FIELDS,
|
|
92
|
+
extrasaction="ignore",
|
|
93
|
+
restval="",
|
|
94
|
+
quoting=csv.QUOTE_MINIMAL,
|
|
95
|
+
)
|
|
96
|
+
writer.writeheader()
|
|
97
|
+
|
|
98
|
+
for profile in normalized_partial_profiles:
|
|
99
|
+
transform_partial_profile_into_csv_dict(profile)
|
|
100
|
+
writer.writerow(profile)
|
|
101
|
+
|
|
102
|
+
with open_resource("bluesky-partial-profiles-export.csv") as f:
|
|
103
|
+
buffer.seek(0)
|
|
104
|
+
assert list(csv.DictReader(buffer)) == list(csv.DictReader(f))
|
|
105
|
+
|
|
59
106
|
def test_format_post_as_csv_row(self):
|
|
60
107
|
normalized_posts = get_json_resource("bluesky-normalized-posts.json")
|
|
61
108
|
|
|
@@ -5,7 +5,7 @@ from functools import partial
|
|
|
5
5
|
from pytz import timezone
|
|
6
6
|
from copy import deepcopy
|
|
7
7
|
|
|
8
|
-
from twitwi.bluesky import normalize_profile, normalize_post
|
|
8
|
+
from twitwi.bluesky import normalize_profile, normalize_partial_profile, normalize_post
|
|
9
9
|
|
|
10
10
|
from test.utils import get_json_resource
|
|
11
11
|
|
|
@@ -15,6 +15,8 @@ OVERWRITE_TESTS = False
|
|
|
15
15
|
|
|
16
16
|
|
|
17
17
|
FAKE_COLLECTION_TIME = "2025-01-01T00:00:00.000000"
|
|
18
|
+
|
|
19
|
+
|
|
18
20
|
def set_fake_collection_time(dico):
|
|
19
21
|
if "collection_time" in dico:
|
|
20
22
|
dico["collection_time"] = FAKE_COLLECTION_TIME
|
|
@@ -47,7 +49,9 @@ class TestNormalizers:
|
|
|
47
49
|
if OVERWRITE_TESTS:
|
|
48
50
|
from test.utils import dump_json_resource
|
|
49
51
|
|
|
50
|
-
normalized_profiles = [
|
|
52
|
+
normalized_profiles = [
|
|
53
|
+
set_fake_collection_time(fn(profile)) for profile in profiles
|
|
54
|
+
]
|
|
51
55
|
dump_json_resource(normalized_profiles, "bluesky-normalized-profiles.json")
|
|
52
56
|
|
|
53
57
|
expected = get_json_resource("bluesky-normalized-profiles.json")
|
|
@@ -70,6 +74,42 @@ class TestNormalizers:
|
|
|
70
74
|
|
|
71
75
|
assert profile == original_arg
|
|
72
76
|
|
|
77
|
+
def test_normalize_partial_profile(self):
|
|
78
|
+
tz = timezone("Europe/Paris")
|
|
79
|
+
|
|
80
|
+
profiles = get_json_resource("bluesky-partial-profiles.json")
|
|
81
|
+
fn = partial(normalize_partial_profile, locale=tz)
|
|
82
|
+
|
|
83
|
+
if OVERWRITE_TESTS:
|
|
84
|
+
from test.utils import dump_json_resource
|
|
85
|
+
|
|
86
|
+
normalized_profiles = [
|
|
87
|
+
set_fake_collection_time(fn(profile)) for profile in profiles
|
|
88
|
+
]
|
|
89
|
+
dump_json_resource(
|
|
90
|
+
normalized_profiles, "bluesky-normalized-partial-profiles.json"
|
|
91
|
+
)
|
|
92
|
+
|
|
93
|
+
expected = get_json_resource("bluesky-normalized-partial-profiles.json")
|
|
94
|
+
|
|
95
|
+
for idx, profile in enumerate(profiles):
|
|
96
|
+
result = fn(profile)
|
|
97
|
+
assert isinstance(result, dict)
|
|
98
|
+
assert "collection_time" in result and isinstance(
|
|
99
|
+
result["collection_time"], str
|
|
100
|
+
)
|
|
101
|
+
|
|
102
|
+
compare_dicts(profile["handle"], result, expected[idx])
|
|
103
|
+
|
|
104
|
+
def test_normalize_partial_profile_should_not_mutate(self):
|
|
105
|
+
profile = get_json_resource("bluesky-partial-profiles.json")[0]
|
|
106
|
+
|
|
107
|
+
original_arg = deepcopy(profile)
|
|
108
|
+
|
|
109
|
+
normalize_partial_profile(profile)
|
|
110
|
+
|
|
111
|
+
assert profile == original_arg
|
|
112
|
+
|
|
73
113
|
def test_normalize_post(self):
|
|
74
114
|
tz = timezone("Europe/Paris")
|
|
75
115
|
|
|
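The `test_normalize_partial_profile_should_not_mutate` test added above follows a snapshot-and-compare pattern: deep-copy the payload, call the normalizer, then assert the payload still equals the copy. A minimal self-contained sketch of that pattern follows. `normalize_example` and `check_does_not_mutate` are hypothetical stand-ins, not functions from twitwi or its test suite.

```python
from copy import deepcopy


def normalize_example(data):
    """Hypothetical normalizer: builds a new flat dict without touching `data`."""
    associated = data.get("associated", {})
    return {
        "handle": data["handle"],
        "display_name": data.get("displayName"),
        "lists": associated.get("lists"),
    }


def check_does_not_mutate(fn, payload):
    """Return True if calling fn(payload) leaves payload unchanged."""
    snapshot = deepcopy(payload)  # deep copy so nested changes are detected too
    fn(payload)
    return payload == snapshot


payload = {"handle": "alice.example", "associated": {"lists": 2}}
```

A deep copy matters here: a shallow copy would share the nested `associated` dict with the original and hide mutations made inside it.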
@@ -79,7 +119,13 @@ class TestNormalizers:
|
|
|
79
119
|
if OVERWRITE_TESTS:
|
|
80
120
|
from test.utils import dump_json_resource
|
|
81
121
|
|
|
82
|
-
normalized_posts = [
|
|
122
|
+
normalized_posts = [
|
|
123
|
+
[
|
|
124
|
+
set_fake_collection_time(p)
|
|
125
|
+
for p in fn(post, extract_referenced_posts=True)
|
|
126
|
+
]
|
|
127
|
+
for post in posts
|
|
128
|
+
]
|
|
83
129
|
dump_json_resource(normalized_posts, "bluesky-normalized-posts.json")
|
|
84
130
|
|
|
85
131
|
expected = get_json_resource("bluesky-normalized-posts.json")
|
|
@@ -1,7 +1,13 @@
|
|
|
1
|
-
from twitwi.bluesky.normalizers import
|
|
1
|
+
from twitwi.bluesky.normalizers import (
|
|
2
|
+
normalize_profile,
|
|
3
|
+
normalize_partial_profile,
|
|
4
|
+
normalize_post,
|
|
5
|
+
)
|
|
2
6
|
from twitwi.bluesky.formatters import (
|
|
3
7
|
transform_profile_into_csv_dict,
|
|
4
8
|
format_profile_as_csv_row,
|
|
9
|
+
transform_partial_profile_into_csv_dict,
|
|
10
|
+
format_partial_profile_as_csv_row,
|
|
5
11
|
transform_post_into_csv_dict,
|
|
6
12
|
format_post_as_csv_row,
|
|
7
13
|
)
|
|
@@ -9,8 +15,11 @@ from twitwi.bluesky.formatters import (
|
|
|
9
15
|
__all__ = [
|
|
10
16
|
"transform_profile_into_csv_dict",
|
|
11
17
|
"format_profile_as_csv_row",
|
|
18
|
+
"transform_partial_profile_into_csv_dict",
|
|
19
|
+
"format_partial_profile_as_csv_row",
|
|
12
20
|
"transform_post_into_csv_dict",
|
|
13
21
|
"format_post_as_csv_row",
|
|
14
22
|
"normalize_profile",
|
|
23
|
+
"normalize_partial_profile",
|
|
15
24
|
"normalize_post",
|
|
16
25
|
]
|
|
@@ -1,9 +1,11 @@
|
|
|
1
1
|
from typing import List, Optional
|
|
2
2
|
|
|
3
|
-
from twitwi.bluesky.types import BlueskyProfile, BlueskyPost
|
|
3
|
+
from twitwi.bluesky.types import BlueskyProfile, BlueskyPartialProfile, BlueskyPost
|
|
4
4
|
|
|
5
5
|
PROFILE_FIELDS = list(BlueskyProfile.__annotations__.keys())
|
|
6
6
|
|
|
7
|
+
PARTIAL_PROFILE_FIELDS = list(BlueskyPartialProfile.__annotations__.keys())
|
|
8
|
+
|
|
7
9
|
POST_FIELDS = list(BlueskyPost.__annotations__.keys())
|
|
8
10
|
|
|
9
11
|
POST_PLURAL_FIELDS = [
|
|
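The constants in this hunk are derived from the annotations of the record types rather than written out by hand, so the field lists cannot drift from the type definitions. The sketch below shows the same technique on a hypothetical `TypedDict`. It assumes the types in `twitwi/bluesky/types.py` are `TypedDict`-style classes, which this diff does not show; `ExampleProfile` and `EXAMPLE_FIELDS` are made-up names.

```python
from typing import Optional, TypedDict


class ExampleProfile(TypedDict):
    # Hypothetical record type used only for this illustration.
    did: str
    handle: str
    display_name: Optional[str]


# Annotations keep declaration order, so this list can serve as a CSV header.
EXAMPLE_FIELDS = list(ExampleProfile.__annotations__.keys())
```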
@@ -1,6 +1,7 @@
|
|
|
1
1
|
from twitwi.formatters import make_transform_into_csv_dict, make_format_as_csv_row
|
|
2
2
|
from twitwi.bluesky.constants import (
|
|
3
3
|
PROFILE_FIELDS,
|
|
4
|
+
PARTIAL_PROFILE_FIELDS,
|
|
4
5
|
POST_FIELDS,
|
|
5
6
|
POST_PLURAL_FIELDS,
|
|
6
7
|
POST_BOOLEAN_FIELDS,
|
|
@@ -20,10 +21,18 @@ transform_profile_into_csv_dict = make_transform_into_csv_dict([], [])
|
|
|
20
21
|
|
|
21
22
|
format_profile_as_csv_row = make_format_as_csv_row(PROFILE_FIELDS, [], [])
|
|
22
23
|
|
|
24
|
+
transform_partial_profile_into_csv_dict = make_transform_into_csv_dict([], [])
|
|
25
|
+
|
|
26
|
+
format_partial_profile_as_csv_row = make_format_as_csv_row(
|
|
27
|
+
PARTIAL_PROFILE_FIELDS, [], []
|
|
28
|
+
)
|
|
29
|
+
|
|
23
30
|
|
|
24
31
|
__all__ = [
|
|
25
32
|
"transform_post_into_csv_dict",
|
|
26
33
|
"format_post_as_csv_row",
|
|
27
34
|
"transform_profile_into_csv_dict",
|
|
28
35
|
"format_profile_as_csv_row",
|
|
36
|
+
"transform_partial_profile_into_csv_dict",
|
|
37
|
+
"format_partial_profile_as_csv_row",
|
|
29
38
|
]
|
|
@@ -1,11 +1,13 @@
|
|
|
1
1
|
from copy import deepcopy
|
|
2
|
-
from typing import List, Dict, Union, Optional, Literal, overload
|
|
2
|
+
from typing import List, Dict, Union, Optional, Literal, Any, overload
|
|
3
|
+
|
|
4
|
+
from ural import is_url
|
|
3
5
|
|
|
4
6
|
from twitwi.exceptions import BlueskyPayloadError
|
|
5
7
|
from twitwi.utils import (
|
|
6
8
|
get_collection_time,
|
|
7
9
|
get_dates,
|
|
8
|
-
|
|
10
|
+
safe_normalize_url,
|
|
9
11
|
custom_get_normalized_hostname,
|
|
10
12
|
)
|
|
11
13
|
from twitwi.bluesky.utils import (
|
|
@@ -18,10 +20,10 @@ from twitwi.bluesky.utils import (
|
|
|
18
20
|
format_starterpack_url,
|
|
19
21
|
format_media_url,
|
|
20
22
|
)
|
|
21
|
-
from twitwi.bluesky.types import BlueskyProfile, BlueskyPost
|
|
23
|
+
from twitwi.bluesky.types import BlueskyProfile, BlueskyPartialProfile, BlueskyPost
|
|
22
24
|
|
|
23
25
|
|
|
24
|
-
def normalize_profile(data: Dict, locale: Optional[
|
|
26
|
+
def normalize_profile(data: Dict, locale: Optional[Any] = None) -> BlueskyProfile:
|
|
25
27
|
associated = data["associated"]
|
|
26
28
|
|
|
27
29
|
pinned_post_uri = None
|
|
@@ -38,23 +40,48 @@ def normalize_profile(data: Dict, locale: Optional[str] = None) -> BlueskyProfil
|
|
|
38
40
|
"did": data["did"],
|
|
39
41
|
"url": format_profile_url(data["handle"]),
|
|
40
42
|
"handle": data["handle"],
|
|
41
|
-
"display_name": data.get("displayName"
|
|
43
|
+
"display_name": data.get("displayName"),
|
|
42
44
|
"created_at": created_at,
|
|
43
45
|
"timestamp_utc": timestamp_utc,
|
|
44
|
-
"description": data
|
|
45
|
-
"avatar": data.get("avatar"
|
|
46
|
+
"description": data.get("description"),
|
|
47
|
+
"avatar": data.get("avatar"),
|
|
46
48
|
"posts": data["postsCount"],
|
|
47
49
|
"followers": data["followersCount"],
|
|
48
50
|
"follows": data["followsCount"],
|
|
49
51
|
"lists": associated["lists"],
|
|
50
52
|
"feedgens": associated["feedgens"],
|
|
51
53
|
"starter_packs": associated["starterPacks"],
|
|
52
|
-
"banner": data
|
|
54
|
+
"banner": data.get("banner"),
|
|
53
55
|
"pinned_post_uri": pinned_post_uri,
|
|
54
56
|
"collection_time": get_collection_time(),
|
|
55
57
|
}
|
|
56
58
|
|
|
57
59
|
|
|
60
|
+
def normalize_partial_profile(
|
|
61
|
+
data: Dict, locale: Optional[Any] = None
|
|
62
|
+
) -> BlueskyPartialProfile:
|
|
63
|
+
associated = data["associated"]
|
|
64
|
+
|
|
65
|
+
timestamp_utc, created_at = get_dates(
|
|
66
|
+
data["createdAt"], locale=locale, source="bluesky"
|
|
67
|
+
)
|
|
68
|
+
|
|
69
|
+
return {
|
|
70
|
+
"did": data["did"],
|
|
71
|
+
"url": format_profile_url(data["handle"]),
|
|
72
|
+
"handle": data["handle"],
|
|
73
|
+
"display_name": data.get("displayName"),
|
|
74
|
+
"created_at": created_at,
|
|
75
|
+
"timestamp_utc": timestamp_utc,
|
|
76
|
+
"description": data.get("description"),
|
|
77
|
+
"avatar": data.get("avatar"),
|
|
78
|
+
"lists": associated.get("lists"),
|
|
79
|
+
"feedgens": associated.get("feedgens"),
|
|
80
|
+
"starter_packs": associated.get("starterPacks"),
|
|
81
|
+
"collection_time": get_collection_time(),
|
|
82
|
+
}
|
|
83
|
+
|
|
84
|
+
|
|
58
85
|
def prepare_native_gif_as_media(gif_data, user_did, source):
|
|
59
86
|
if "thumb" in gif_data:
|
|
60
87
|
media_cid = gif_data["thumb"]["ref"]["$link"]
|
|
@@ -73,8 +100,12 @@ def prepare_native_gif_as_media(gif_data, user_did, source):
|
|
|
73
100
|
|
|
74
101
|
|
|
75
102
|
def prepare_image_as_media(image_data):
|
|
103
|
+
if "ref" not in image_data["image"] or "$link" not in image_data["image"]["ref"]:
|
|
104
|
+
image_id = image_data["image"]["cid"]
|
|
105
|
+
else:
|
|
106
|
+
image_id = image_data["image"]["ref"]["$link"]
|
|
76
107
|
return {
|
|
77
|
-
"id":
|
|
108
|
+
"id": image_id,
|
|
78
109
|
"type": image_data["image"]["mimeType"],
|
|
79
110
|
"alt": image_data["alt"],
|
|
80
111
|
}
|
|
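The change to `prepare_image_as_media` above adds a fallback: the image id is read from `image["ref"]["$link"]` when that path exists, and from `image["cid"]` otherwise. A standalone sketch of that lookup is given below. `extract_image_id` is a hypothetical helper written for illustration, and its extra `isinstance` guard is an assumption that goes slightly beyond what the diff checks.

```python
def extract_image_id(image):
    """Prefer image["ref"]["$link"]; fall back to image["cid"] when it is absent."""
    ref = image.get("ref")
    if isinstance(ref, dict) and "$link" in ref:
        return ref["$link"]
    # Assumption mirrored from the diff: payloads without a usable ref carry a "cid" key.
    return image["cid"]
```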
@@ -92,7 +123,9 @@ def process_starterpack_card(embed_data, post):
|
|
|
92
123
|
|
|
93
124
|
card = embed_data.get("record", {})
|
|
94
125
|
creator_did, pack_did = parse_post_uri(embed_data["uri"])
|
|
95
|
-
post["card_link"] = format_starterpack_url(
|
|
126
|
+
post["card_link"] = format_starterpack_url(
|
|
127
|
+
embed_data.get("creator", {}).get("handle") or creator_did, pack_did
|
|
128
|
+
)
|
|
96
129
|
post["card_title"] = card.get("name", "")
|
|
97
130
|
post["card_description"] = card.get("description", "")
|
|
98
131
|
post["card_thumbnail"] = card.get("thumb", "")
|
|
@@ -119,7 +152,14 @@ def prepare_quote_data(embed_quote, card_data, post, links):
|
|
|
119
152
|
)
|
|
120
153
|
|
|
121
154
|
# First store ugly quoted url with user did in case full quote data is missing (recursion > 3 or detached quote)
|
|
122
|
-
|
|
155
|
+
# Handling special posts types (only lists for now, for example: https://bsky.app/profile/lanana421.bsky.social/lists/3lxdgjtpqhf2z)
|
|
156
|
+
if "/app.bsky.graph.list/" in post["quoted_uri"]:
|
|
157
|
+
post_splitter = "/lists/"
|
|
158
|
+
else:
|
|
159
|
+
post_splitter = "/post/"
|
|
160
|
+
post["quoted_url"] = format_post_url(
|
|
161
|
+
post["quoted_user_did"], post["quoted_did"], post_splitter=post_splitter
|
|
162
|
+
)
|
|
123
163
|
|
|
124
164
|
quoted_data = None
|
|
125
165
|
if card_data:
|
|
@@ -145,7 +185,9 @@ def prepare_quote_data(embed_quote, card_data, post, links):
|
|
|
145
185
|
|
|
146
186
|
# Extract user handle from url
|
|
147
187
|
if "did:plc:" not in post["quoted_url"]:
|
|
148
|
-
post["quoted_user_handle"], _ = parse_post_url(
|
|
188
|
+
post["quoted_user_handle"], _ = parse_post_url(
|
|
189
|
+
post["quoted_url"], post["url"]
|
|
190
|
+
)
|
|
149
191
|
|
|
150
192
|
return (post, quoted_data, links)
|
|
151
193
|
|
|
@@ -176,7 +218,7 @@ def merge_nested_posts(referenced_posts, nested, source):
|
|
|
176
218
|
|
|
177
219
|
@overload
|
|
178
220
|
def normalize_post(
|
|
179
|
-
|
|
221
|
+
payload: Dict,
|
|
180
222
|
locale: Optional[str] = ...,
|
|
181
223
|
extract_referenced_posts: Literal[True] = ...,
|
|
182
224
|
collection_source: Optional[str] = ...,
|
|
@@ -185,7 +227,7 @@ def normalize_post(
|
|
|
185
227
|
|
|
186
228
|
@overload
|
|
187
229
|
def normalize_post(
|
|
188
|
-
|
|
230
|
+
payload: Dict,
|
|
189
231
|
locale: Optional[str] = ...,
|
|
190
232
|
extract_referenced_posts: Literal[False] = ...,
|
|
191
233
|
collection_source: Optional[str] = ...,
|
|
@@ -194,7 +236,7 @@ def normalize_post(
|
|
|
194
236
|
|
|
195
237
|
def normalize_post(
|
|
196
238
|
payload: Dict,
|
|
197
|
-
locale: Optional[
|
|
239
|
+
locale: Optional[Any] = None,
|
|
198
240
|
extract_referenced_posts: bool = False,
|
|
199
241
|
collection_source: Optional[str] = None,
|
|
200
242
|
) -> Union[BlueskyPost, List[BlueskyPost]]:
|
|
@@ -308,7 +350,7 @@ def normalize_post(
|
|
|
308
350
|
feat = facet["features"][0]
|
|
309
351
|
|
|
310
352
|
# Hashtags
|
|
311
|
-
if feat["$type"].endswith("#tag"):
|
|
353
|
+
if feat["$type"].endswith("#tag") or feat["$type"].endswith("#hashtag"):
|
|
312
354
|
hashtags.add(feat["tag"].strip().lower())
|
|
313
355
|
|
|
314
356
|
# Mentions
|
|
@@ -323,7 +365,11 @@ def normalize_post(
|
|
|
323
365
|
byteStart = text.find(b"@", byteStart)
|
|
324
366
|
|
|
325
367
|
handle = (
|
|
326
|
-
text[
|
|
368
|
+
text[
|
|
369
|
+
byteStart + 1 : facet["index"]["byteEnd"]
|
|
370
|
+
+ byteStart
|
|
371
|
+
- facet["index"]["byteStart"]
|
|
372
|
+
]
|
|
327
373
|
.strip()
|
|
328
374
|
.lower()
|
|
329
375
|
.decode("utf-8")
|
|
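Review note: the new slice above moves the end offset by the same amount as `byteStart`, so the mention keeps its length once the start has been moved onto the actual `@`. Below is a minimal standalone sketch of that arithmetic. `extract_handle` and its parameters are hypothetical names, not part of twitwi, and the condition that triggers the search for `@` is an assumption because it is not visible in this hunk.

```python
def extract_handle(text: bytes, facet_start: int, facet_end: int) -> str:
    """Hypothetical helper: recover a mentioned handle from UTF-8 bytes."""
    byte_start = facet_start
    # Assumption: when the facet does not begin on "@", search forward for it.
    if text[byte_start : byte_start + 1] != b"@":
        found = text.find(b"@", byte_start)
        if found != -1:
            byte_start = found
    # Shift the end offset by the same amount as the start offset,
    # mirroring `byteEnd + byteStart - facet byteStart` in the diff.
    byte_end = facet_end + byte_start - facet_start
    return text[byte_start + 1 : byte_end].strip().lower().decode("utf-8")


# Facet offsets that are two bytes too early still yield the full handle.
print(extract_handle(b"hello @Alice.bsky.social ok", 4, 22))
```

Working on bytes rather than `str` matters here because facet indices are byte offsets into the UTF-8 encoded text.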
@@ -335,22 +381,36 @@ def normalize_post(
|
|
|
335
381
|
# Handle native polls
|
|
336
382
|
if "https://poll.blue/" in feat["uri"]:
|
|
337
383
|
if feat["uri"].endswith("/0"):
|
|
338
|
-
|
|
384
|
+
link = safe_normalize_url(feat["uri"])
|
|
385
|
+
if is_url(link):
|
|
386
|
+
links.add(link)
|
|
387
|
+
else:
|
|
388
|
+
continue
|
|
339
389
|
text += b" %s" % feat["uri"].encode("utf-8")
|
|
340
390
|
continue
|
|
341
391
|
|
|
342
|
-
|
|
392
|
+
link = safe_normalize_url(feat["uri"])
|
|
393
|
+
if is_url(link):
|
|
394
|
+
links.add(link)
|
|
395
|
+
else:
|
|
396
|
+
continue
|
|
343
397
|
# Check & fix occasional errored link positioning
|
|
344
|
-
#
|
|
398
|
+
# examples: https://bsky.app/profile/ecrime.ch/post/3lqotmopayr23
|
|
399
|
+
# https://bsky.app/profile/clustz.com/post/3lqfi7mnto52w
|
|
345
400
|
byteStart = facet["index"]["byteStart"]
|
|
346
|
-
|
|
347
|
-
|
|
401
|
+
|
|
402
|
+
if not text[byteStart : facet["index"]["byteEnd"]].startswith(b"http"):
|
|
403
|
+
new_byteStart = text.find(b"http", byteStart, facet["index"]["byteEnd"])
|
|
404
|
+
if new_byteStart != -1:
|
|
405
|
+
byteStart = new_byteStart
|
|
348
406
|
|
|
349
407
|
links_to_replace.append(
|
|
350
408
|
{
|
|
351
409
|
"uri": feat["uri"].encode("utf-8"),
|
|
352
410
|
"start": byteStart,
|
|
353
|
-
"end": byteStart
|
|
411
|
+
"end": byteStart
|
|
412
|
+
- facet["index"]["byteStart"]
|
|
413
|
+
+ facet["index"]["byteEnd"],
|
|
354
414
|
}
|
|
355
415
|
)
|
|
356
416
|
|
|
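Review note: the link-positioning fix in this hunk can be read as "if the facet span does not start with `http`, look for `http` inside the span and shift both offsets by the same amount". The sketch below isolates that logic under this reading. `locate_link` is a hypothetical name used only for illustration.

```python
def locate_link(text: bytes, facet_start: int, facet_end: int):
    """Hypothetical helper: return corrected (start, end) byte offsets of a link."""
    byte_start = facet_start
    if not text[byte_start:facet_end].startswith(b"http"):
        # Only search within the span declared by the facet.
        new_start = text.find(b"http", byte_start, facet_end)
        if new_start != -1:
            byte_start = new_start
    # Keep the declared length: move the end by the same shift as the start.
    byte_end = byte_start - facet_start + facet_end
    return byte_start, byte_end


text = b"see  https://example.com now"
start, end = locate_link(text, 3, 22)  # facet declared two bytes too early
print(text[start:end])
```

If `http` is not found inside the declared span, the offsets are returned unchanged, which matches the `!= -1` guard in the added lines.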
@@ -426,7 +486,7 @@ def normalize_post(
|
|
|
426
486
|
|
|
427
487
|
# Extra card links sometimes missing from facets & text due to manual action in post form
|
|
428
488
|
else:
|
|
429
|
-
extra_links.append(
|
|
489
|
+
extra_links.append(link)
|
|
430
490
|
# Handle link card metadata
|
|
431
491
|
if "embed" in data:
|
|
432
492
|
post = process_card_data(data["embed"]["external"], post)
|
|
@@ -442,7 +502,9 @@ def normalize_post(
|
|
|
442
502
|
# Quote & Starter-packs
|
|
443
503
|
if embed["$type"].endswith(".record"):
|
|
444
504
|
if "app.bsky.graph.starterpack" in embed["record"]["uri"]:
|
|
445
|
-
post = process_starterpack_card(
|
|
505
|
+
post = process_starterpack_card(
|
|
506
|
+
data.get("embed", {}).get("record"), post
|
|
507
|
+
)
|
|
446
508
|
if post["card_link"]:
|
|
447
509
|
extra_links.append(post["card_link"])
|
|
448
510
|
else:
|
|
@@ -499,9 +561,10 @@ def normalize_post(
|
|
|
499
561
|
|
|
500
562
|
# Process extra links
|
|
501
563
|
for link in extra_links:
|
|
502
|
-
norm_link =
|
|
564
|
+
norm_link = safe_normalize_url(link)
|
|
503
565
|
if norm_link not in links:
|
|
504
|
-
|
|
566
|
+
if is_url(norm_link):
|
|
567
|
+
links.add(norm_link)
|
|
505
568
|
text += b" " + link.encode("utf-8")
|
|
506
569
|
|
|
507
570
|
# Process medias
|
|
@@ -523,7 +586,10 @@ def normalize_post(
|
|
|
523
586
|
|
|
524
587
|
# Rewrite post's text to include links to medias within
|
|
525
588
|
text += b" " + (
|
|
526
|
-
media_thumb
|
|
589
|
+
media_thumb
|
|
590
|
+
if media_type.startswith("video")
|
|
591
|
+
and not media_type.endswith("/gif")
|
|
592
|
+
else media_url
|
|
527
593
|
).encode("utf-8")
|
|
528
594
|
|
|
529
595
|
# Process quotes
|
|
@@ -10,7 +10,7 @@ class BlueskyProfile(TypedDict):
|
|
|
10
10
|
url: str # URL of the profile accessible on the web
|
|
11
11
|
handle: str # updatable human-readable username of the account (usually like username.bsky.social or username.com)
|
|
12
12
|
display_name: Optional[str] # updatable human-readable name of the account
|
|
13
|
-
description: str
|
|
13
|
+
description: Optional[str] # profile short description written by the user
|
|
14
14
|
posts: int # total number of posts submitted by the user (at collection time)
|
|
15
15
|
followers: int # total number of followers of the user (at collection time)
|
|
16
16
|
follows: int # total number of other users followed by the user (at collection time)
|
|
@@ -18,12 +18,26 @@ class BlueskyProfile(TypedDict):
|
|
|
18
18
|
feedgens: int # total number of custom feeds created by the user (at collection time)
|
|
19
19
|
starter_packs: int # total number of starter packs created by the user (at collection time)
|
|
20
20
|
avatar: Optional[str] # URL to the image serving as avatar to the user
|
|
21
|
-
banner: str
|
|
21
|
+
banner: Optional[str] # URL to the image serving as profile banner to the user
|
|
22
22
|
pinned_post_uri: Optional[str] # ATProto's internal URI to the post potentially pinned by the user to appear at the top of his posts on his profile
|
|
23
23
|
created_at: str # datetime (potentially timezoned) of when the user created the account
|
|
24
24
|
timestamp_utc: int # Unix UTC timestamp of when the user created the account
|
|
25
25
|
collection_time: Optional[str] # datetime (potentially timezoned) of when the data was normalized
|
|
26
26
|
|
|
27
|
+
class BlueskyPartialProfile(TypedDict): # A partial version of the profile found in follower/follow profile payloads
|
|
28
|
+
did: str # persistent long-term identifier of the account
|
|
29
|
+
url: str # URL of the profile accessible on the web
|
|
30
|
+
handle: str # updatable human-readable username of the account (usually like username.bsky.social or username.com)
|
|
31
|
+
display_name: Optional[str] # updatable human-readable name of the account
|
|
32
|
+
description: Optional[str] # profile short description written by the user
|
|
33
|
+
lists: Optional[int] # total number of lists created by the user (at collection time)
|
|
34
|
+
feedgens: Optional[int] # total number of custom feeds created by the user (at collection time)
|
|
35
|
+
starter_packs: Optional[int] # total number of starter packs created by the user (at collection time)
|
|
36
|
+
avatar: Optional[str] # URL to the image serving as avatar to the user
|
|
37
|
+
created_at: str # datetime (potentially timezoned) of when the user created the account
|
|
38
|
+
timestamp_utc: int # Unix UTC timestamp of when the user created the account
|
|
39
|
+
collection_time: Optional[str] # datetime (potentially timezoned) of when the data was normalized
|
|
40
|
+
|
|
27
41
|
|
|
28
42
|
|
|
29
43
|
class BlueskyPost(TypedDict):
|
|
@@ -64,7 +78,7 @@ class BlueskyPost(TypedDict):
|
|
|
64
78
|
# user_lists: int # not available from posts payloads
|
|
65
79
|
user_langs: List[str] # languages in which the author of the posts usually writes posts (declarative)
|
|
66
80
|
user_avatar: Optional[str] # URL to the image serving as avatar to the user who authored the post
|
|
67
|
-
user_created_at: str # datetime (potentially timezoned)
|
|
81
|
+
user_created_at: str # datetime (potentially timezoned) of when the user who authored the post created the account
|
|
68
82
|
user_timestamp_utc: int # Unix UTC timestamp of when the user who authored the post created the account
|
|
69
83
|
|
|
70
84
|
# Parent post identifying fields
|
|
@@ -102,27 +116,27 @@ class BlueskyPost(TypedDict):
|
|
|
102
116
|
quoted_user_handle: Optional[str] # updatable human-readable username of the account who authored the quoted post
|
|
103
117
|
quoted_created_at: Optional[int] # datetime (potentially timezoned) of when the quoted post was submitted
|
|
104
118
|
quoted_timestamp_utc: Optional[int] # Unix UTC timestamp of when the quoted post was submitted
|
|
105
|
-
quoted_status: Optional[str] # empty or "detached" when the author of the quoted post intentionnally required the quoting post not to
|
|
119
|
+
quoted_status: Optional[str] # empty or "detached" when the author of the quoted post intentionnally required the quoting post not to appear in the list of this post's quotes
|
|
106
120
|
|
|
107
121
|
# Embedded elements metadata fields
|
|
108
122
|
links: List[str] # list of URLs of all links shared within the post (including potentially the embedded card detailed below, but not the link to a potential quoted post)
|
|
109
|
-
domains: List[str] # list of domains of the links shared within the post (here a domain
|
|
123
|
+
domains: List[str] # list of domains of the links shared within the post (here a domain refers to a full hostname, including subdomains, for instance bluesky.com or medialab.sciencespo.fr)
|
|
110
124
|
card_link: Optional[str] # URL of the link displayed as a card within the post if any
|
|
111
125
|
card_title: Optional[str] # title of the webpage corresponding to the link displayed as a card within the post if any
|
|
112
126
|
card_description: Optional[str] # description of the webpage corresponding to the link displayed as a card within the post if any
|
|
113
|
-
card_thumbnail: Optional[str] # image displayed as an illustration of the webpage corresponding to the
|
|
114
|
-
media_urls: List[str] # list of URLs to all
|
|
115
|
-
media_thumbnails: List[str] # list of URLs to small thumbnail version of all
|
|
116
|
-
media_types: List[str] # MIME types (such as image/jpeg, image/gif, video/mp4, etc.) of all
|
|
117
|
-
media_alt_texts: List[str] # description texts of all
|
|
118
|
-
mentioned_user_dids: List[str] # list of all persistent long-term
|
|
119
|
-
mentioned_user_handles: List[str] # list of all updatable human-readable
|
|
127
|
+
card_thumbnail: Optional[str] # image displayed as an illustration of the webpage corresponding to the link displayed as a card within the post if any
|
|
128
|
+
media_urls: List[str] # list of URLs to all media (images, videos, gifs) embedded in the post
|
|
129
|
+
media_thumbnails: List[str] # list of URLs to small thumbnail version of all media (images, videos, gifs) embedded in the post
|
|
130
|
+
media_types: List[str] # MIME types (such as image/jpeg, image/gif, video/mp4, etc.) of all media (images, videos, gifs) embedded in the post
|
|
131
|
+
media_alt_texts: List[str] # description texts of all media (images, videos, gifs) embedded in the post
|
|
132
|
+
mentioned_user_dids: List[str] # list of all persistent long-term identifiers of the accounts addressed within the post (does not include users to which the post replied)
|
|
133
|
+
mentioned_user_handles: List[str] # list of all updatable human-readable usernames of the accounts addressed within the post (does not include users to which the post replied)
|
|
120
134
|
hashtags: List[str] # list of all unique lowercased hashtags found within the post's text
|
|
121
135
|
|
|
122
136
|
# Conversation rules fields
|
|
123
137
|
replies_rules: Optional[List[str]] # list of specific conversation rules set by the author for the current post (can be one or a combination of: disallow, allow_from_follower, allow_from_following, allow_from_mention, or allow_from_list: followed by a list of user DIDs)
|
|
124
138
|
replies_rules_created_at: Optional[str] # datetime (potentially timezoned) of when the user set the replies_rules
|
|
125
|
-
replies_rules_timestamp_utc: Optional[int] # Unix UTC timestamp of when the
|
|
139
|
+
replies_rules_timestamp_utc: Optional[int] # Unix UTC timestamp of when the user set the replies_rules
|
|
126
140
|
hidden_replies_uris: Optional[List[str]] # list of ATProto's internal URIs to posts which replied to the post, but were intentionally marked as hidden by the current post's author
|
|
127
141
|
# quotes_rule: Optional[str] # not available from posts payloads, cf https://github.com/bluesky-social/atproto/issues/3712
|
|
128
142
|
# quotes_rules_created_at: Optional[str] # not available from posts payloads, cf https://github.com/bluesky-social/atproto/issues/3712
|
|
@@ -131,5 +145,5 @@ class BlueskyPost(TypedDict):
|
|
|
131
145
|
|
|
132
146
|
# Extra fields linked to the data collection and processing
|
|
133
147
|
collection_time: Optional[str] # datetime (potentially timezoned) of when the data was normalized
|
|
134
|
-
collected_via: Optional[List[str]] # extra field added by the normalization process to express how the data collection was ran, will be "quote" or "thread" when a post was grabbed as a referenced post within
|
|
135
|
-
match_query: Optional[bool] # extra field added by the normalization process to express whether the post was an intentionnally collected one or only came as a referenced post within
|
|
148
|
+
collected_via: Optional[List[str]] # extra field added by the normalization process to express how the data collection was ran, will be "quote" or "thread" when a post was grabbed as a referenced post within the originally collected post using the "extract_referenced_posts" option of "normalize_post"
|
|
149
|
+
match_query: Optional[bool] # extra field added by the normalization process to express whether the post was an intentionnally collected one or only came as a referenced post within the originally collected post using the "extract_referenced_posts" option of "normalize_post"
|
|
@@ -55,7 +55,9 @@ def validate_post_payload(data):
|
|
|
55
55
|
return True, None
|
|
56
56
|
|
|
57
57
|
|
|
58
|
-
re_embed_types = re.compile(
|
|
58
|
+
re_embed_types = re.compile(
|
|
59
|
+
r"\.(record|recordWithMedia|images|video|external)(?:#.*)?$"
|
|
60
|
+
)
|
|
59
61
|
|
|
60
62
|
|
|
61
63
|
def valid_embed_type(embed_type):
|
|
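Review note: the pattern added above accepts an embed `$type` whose last dotted segment is one of five known kinds, optionally followed by a `#...` suffix. The sketch below exercises the regex exactly as it appears in the added lines. `embed_kind` is a hypothetical helper, and the sample type strings are illustrative assumptions rather than an exhaustive list of what Bluesky returns.

```python
import re

# Pattern copied from the added lines of this hunk.
re_embed_types = re.compile(
    r"\.(record|recordWithMedia|images|video|external)(?:#.*)?$"
)


def embed_kind(embed_type):
    """Hypothetical helper: return the matched embed kind, or None."""
    match = re_embed_types.search(embed_type)
    return match.group(1) if match else None


for sample in (
    "app.bsky.embed.images#view",
    "app.bsky.embed.recordWithMedia",
    "app.bsky.embed.gallery",
):
    print(sample, "->", embed_kind(sample))
```

Because the pattern is anchored with `$`, `record` does not shadow `recordWithMedia`: the engine backtracks to the longer alternative when the shorter one cannot reach the end of the string.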
@@ -66,29 +68,39 @@ def format_profile_url(user_handle_or_did):
|
|
|
66
68
|
return f"https://bsky.app/profile/{user_handle_or_did}"
|
|
67
69
|
|
|
68
70
|
|
|
69
|
-
def format_post_url(user_handle_or_did, post_did):
|
|
70
|
-
return f"https://bsky.app/profile/{user_handle_or_did}
|
|
71
|
+
def format_post_url(user_handle_or_did, post_did, post_splitter="/post/"):
|
|
72
|
+
return f"https://bsky.app/profile/{user_handle_or_did}{post_splitter}{post_did}"
|
|
71
73
|
|
|
72
74
|
|
|
73
75
|
def parse_post_url(url, source):
|
|
74
76
|
"""Returns a tuple of (author_handle/did, post_did) from an https://bsky.app post URL"""
|
|
75
77
|
|
|
76
|
-
|
|
77
|
-
|
|
78
|
-
|
|
78
|
+
known_splits = ["/post/", "/lists/"]
|
|
79
|
+
|
|
80
|
+
if url.startswith("https://bsky.app/profile/"):
|
|
81
|
+
for split in known_splits:
|
|
82
|
+
if split in url[25:]:
|
|
83
|
+
return url[25:].split(split)
|
|
84
|
+
|
|
85
|
+
raise BlueskyPayloadError(source, f"{url} is not a usual Bluesky post url")
|
|
79
86
|
|
|
80
87
|
|
|
81
88
|
def parse_post_uri(uri, source=None):
|
|
82
89
|
"""Returns a tuple of (author_did, post_did) from an at:// post URI"""
|
|
83
90
|
|
|
84
|
-
|
|
85
|
-
|
|
91
|
+
known_splits = [
|
|
92
|
+
"/app.bsky.feed.post/",
|
|
93
|
+
"/app.bsky.graph.starterpack/",
|
|
94
|
+
"/app.bsky.feed.generator/",
|
|
95
|
+
"/app.bsky.graph.list/",
|
|
96
|
+
]
|
|
86
97
|
|
|
87
|
-
if
|
|
88
|
-
|
|
89
|
-
|
|
90
|
-
|
|
91
|
-
|
|
98
|
+
if uri.startswith("at://"):
|
|
99
|
+
for split in known_splits:
|
|
100
|
+
if split in uri:
|
|
101
|
+
return uri[5:].split(split)
|
|
102
|
+
|
|
103
|
+
raise BlueskyPayloadError(source or uri, f"{uri} is not a usual Bluesky post uri")
|
|
92
104
|
|
|
93
105
|
|
|
94
106
|
def format_starterpack_url(user_handle_or_did, record_did):
|
|
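Review note: with this hunk, list URLs and URIs go through the same formatting and parsing helpers as posts. The sketch below reproduces the added logic in a standalone form so the round trip can be checked. It raises `ValueError` as a stand-in for `BlueskyPayloadError`, drops the `source` argument, and uses made-up handles and identifiers; these are assumptions for illustration only.

```python
PROFILE_PREFIX = "https://bsky.app/profile/"  # 25 characters, hence url[25:]
URL_SPLITS = ["/post/", "/lists/"]
URI_SPLITS = [
    "/app.bsky.feed.post/",
    "/app.bsky.graph.starterpack/",
    "/app.bsky.feed.generator/",
    "/app.bsky.graph.list/",
]


def format_post_url(user_handle_or_did, post_did, post_splitter="/post/"):
    return f"{PROFILE_PREFIX}{user_handle_or_did}{post_splitter}{post_did}"


def parse_post_url(url):
    """Return [author_handle_or_did, record_id] from a bsky.app URL."""
    if url.startswith(PROFILE_PREFIX):
        rest = url[len(PROFILE_PREFIX):]
        for split in URL_SPLITS:
            if split in rest:
                return rest.split(split)
    raise ValueError(f"{url} is not a usual Bluesky post url")


def parse_post_uri(uri):
    """Return [author_did, record_id] from an at:// URI."""
    if uri.startswith("at://"):
        for split in URI_SPLITS:
            if split in uri:
                return uri[5:].split(split)
    raise ValueError(f"{uri} is not a usual Bluesky post uri")


url = format_post_url("alice.example", "3kexample", post_splitter="/lists/")
print(url, parse_post_url(url))
print(parse_post_uri("at://did:plc:abc123/app.bsky.graph.list/3kexample"))
```

Note that both parsers return a list rather than a tuple, since they return the result of `str.split` directly, as in the added lines.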
@@ -105,6 +117,13 @@ def format_media_url(user_did, media_cid, mime_type, source):
|
|
|
105
117
|
media_thumb = (
|
|
106
118
|
f"https://video.bsky.app/watch/{user_did}/{media_cid}/thumbnail.jpg"
|
|
107
119
|
)
|
|
120
|
+
elif mime_type == "application/octet-stream":
|
|
121
|
+
media_url = (
|
|
122
|
+
f"https://cdn.bsky.app/img/feed_fullsize/plain/{user_did}/{media_cid}@jpeg"
|
|
123
|
+
)
|
|
124
|
+
media_thumb = (
|
|
125
|
+
f"https://cdn.bsky.app/img/feed_thumbnail/plain/{user_did}/{media_cid}@jpeg"
|
|
126
|
+
)
|
|
108
127
|
else:
|
|
109
|
-
raise BlueskyPayloadError(source, f"{mime_type} is an
|
|
128
|
+
raise BlueskyPayloadError(source, f"{mime_type} is an unusual media mimeType")
|
|
110
129
|
return media_url, media_thumb
|
|
@@ -4,6 +4,8 @@
|
|
|
4
4
|
#
|
|
5
5
|
# Miscellaneous utility functions.
|
|
6
6
|
#
|
|
7
|
+
from typing import Tuple
|
|
8
|
+
|
|
7
9
|
from pytz import timezone
|
|
8
10
|
from dateutil.parser import parse as parse_date
|
|
9
11
|
from ural import normalize_url, get_normalized_hostname
|
|
@@ -25,6 +27,21 @@ UTC_TIMEZONE = timezone("UTC")
|
|
|
25
27
|
|
|
26
28
|
custom_normalize_url = partial(normalize_url, **CANONICAL_URL_KWARGS)
|
|
27
29
|
|
|
30
|
+
|
|
31
|
+
def safe_normalize_url(url):
|
|
32
|
+
# We avoid normalizing bluesky urls containing specific uri parts because
|
|
33
|
+
# Bluesky servers don't handle quoting correctly...
|
|
34
|
+
# See https://github.com/medialab/twitwi/issues/72
|
|
35
|
+
if "/did:plc:" in url:
|
|
36
|
+
return url
|
|
37
|
+
|
|
38
|
+
try:
|
|
39
|
+
return custom_normalize_url(url)
|
|
40
|
+
except Exception:
|
|
41
|
+
# In case of error, return the original URL. Possibly not a valid URL, e.g. url containing double slashes
|
|
42
|
+
return url
|
|
43
|
+
|
|
44
|
+
|
|
28
45
|
custom_get_normalized_hostname = partial(
|
|
29
46
|
get_normalized_hostname, **CANONICAL_HOSTNAME_KWARGS
|
|
30
47
|
)
|
|
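Review note: `safe_normalize_url` combines two guards, a bypass for URLs containing `/did:plc:` and a fallback to the raw URL when normalization raises. The sketch below shows the same pattern with the normalizer passed in as a parameter so it runs without `ural`. `safe_normalize` and `toy_normalizer` are hypothetical names, and the toy normalizer's behavior is an assumption that does not mirror `ural.normalize_url`.

```python
def safe_normalize(url, normalizer):
    """Hypothetical variant of the guard pattern with an injected normalizer."""
    # Leave URLs embedding a DID untouched.
    if "/did:plc:" in url:
        return url
    try:
        return normalizer(url)
    except Exception:
        # On any failure, fall back to the original string.
        return url


def toy_normalizer(url):
    """Stand-in normalizer: strips the scheme and rejects double slashes."""
    rest = url.split("://", 1)[-1]
    if "//" in rest:
        raise ValueError("unexpected double slash")
    return rest.rstrip("/")


print(safe_normalize("https://example.com/", toy_normalizer))
print(safe_normalize("https://example.com//a", toy_normalizer))
print(safe_normalize("https://bsky.app/profile/did:plc:abc/post/1", toy_normalizer))
```

Since the fallback can return a string that is not a valid URL, the call sites in `normalizers.py` check the result with `is_url` before adding it to `links`.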
@@ -34,7 +51,9 @@ def get_collection_time():
|
|
|
34
51
|
return datetime.now().strftime(FORMATTED_FULL_DATETIME_FORMAT)
|
|
35
52
|
|
|
36
53
|
|
|
37
|
-
def get_dates(
|
|
54
|
+
def get_dates(
|
|
55
|
+
date_str: str, locale=None, source: str = "v1", millisecond_timestamp: bool = False
|
|
56
|
+
) -> Tuple[int, str]:
|
|
38
57
|
if source not in ["v1", "v2", "bluesky"]:
|
|
39
58
|
raise Exception("source should be one of v1, v2 or bluesky")
|
|
40
59
|
|
|
@@ -56,8 +75,14 @@ def get_dates(date_str, locale=None, source="v1"):
|
|
|
56
75
|
utc_datetime = UTC_TIMEZONE.localize(parsed_datetime)
|
|
57
76
|
locale_datetime = utc_datetime.astimezone(locale)
|
|
58
77
|
|
|
78
|
+
timestamp = int(utc_datetime.timestamp())
|
|
79
|
+
|
|
80
|
+
if millisecond_timestamp:
|
|
81
|
+
timestamp *= 1000
|
|
82
|
+
timestamp += utc_datetime.microsecond / 1000
|
|
83
|
+
|
|
59
84
|
return (
|
|
60
|
-
int(
|
|
85
|
+
int(timestamp),
|
|
61
86
|
datetime.strftime(
|
|
62
87
|
locale_datetime,
|
|
63
88
|
FORMATTED_FULL_DATETIME_FORMAT
|
|
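Review note: the new `millisecond_timestamp` flag scales the integer second count by 1000 and then adds the sub-second part taken from `microsecond`. The sketch below isolates that arithmetic with the standard library only. `to_timestamp` is a hypothetical name, and the sample datetime is an arbitrary assumption.

```python
from datetime import datetime, timezone


def to_timestamp(utc_datetime, millisecond_timestamp=False):
    """Hypothetical helper reproducing the timestamp arithmetic of the hunk."""
    timestamp = int(utc_datetime.timestamp())
    if millisecond_timestamp:
        timestamp *= 1000
        # microsecond / 1000 is a float; the final int() truncates it.
        timestamp += utc_datetime.microsecond / 1000
    return int(timestamp)


moment = datetime(2025, 1, 1, 0, 0, 0, 123456, tzinfo=timezone.utc)
print(to_timestamp(moment))
print(to_timestamp(moment, millisecond_timestamp=True))
```

Truncating the seconds before multiplying avoids carrying floating point error from `timestamp()` into the millisecond value.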
@@ -106,6 +131,7 @@ def get_dates_from_id(tweet_id, locale=None):
|
|
|
106
131
|
locale = UTC_TIMEZONE
|
|
107
132
|
|
|
108
133
|
timestamp = get_timestamp_from_id(tweet_id)
|
|
134
|
+
assert timestamp is not None
|
|
109
135
|
|
|
110
136
|
locale_datetime = datetime.fromtimestamp(timestamp, locale)
|
|
111
137
|
|
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.4
|
|
2
2
|
Name: twitwi
|
|
3
|
-
Version: 0.
|
|
3
|
+
Version: 0.22.0
|
|
4
4
|
Summary: A collection of Twitter-related helper functions for python.
|
|
5
5
|
Home-page: http://github.com/medialab/twitwi
|
|
6
6
|
Author: Béatrice Mazoyer, Guillaume Plique, Benjamin Ooghe-Tabanou
|
|
@@ -56,18 +56,22 @@ pip install twitwi
|
|
|
56
56
|
*Normalization functions*
|
|
57
57
|
|
|
58
58
|
* [normalize_profile](#normalize_profile)
|
|
59
|
+
* [normalize_partial_profile](#normalize_partial_profile)
|
|
59
60
|
* [normalize_post](#normalize_post)
|
|
60
61
|
|
|
61
62
|
*Formatting functions*
|
|
62
63
|
|
|
63
64
|
* [transform_profile_into_csv_dict](#transform_profile_into_csv_dict)
|
|
65
|
+
* [transform_partial_profile_into_csv_dict](#transform_partial_profile_into_csv_dict)
|
|
64
66
|
* [transform_post_into_csv_dict](#transform_post_into_csv_dict)
|
|
65
67
|
* [format_profile_as_csv_row](#format_profile_as_csv_row)
|
|
68
|
+
* [format_partial_profile_as_csv_row](#format_partial_profile_as_csv_row)
|
|
66
69
|
* [format_post_as_csv_row](#format_post_as_csv_row)
|
|
67
70
|
|
|
68
71
|
*Useful constants (under `twitwi.bluesky.constants`)*
|
|
69
72
|
|
|
70
73
|
* [PROFILE_FIELDS](#profile_fields)
|
|
74
|
+
* [PARTIAL_PROFILE_FIELDS](#partial_profile_fields)
|
|
71
75
|
* [POST_FIELDS](#post_fields)
|
|
72
76
|
|
|
73
77
|
*Examples*
|
|
@@ -95,7 +99,7 @@ for post_data in posts_payload_from_API:
|
|
|
95
99
|
|
|
96
100
|
# Then, saving normalized profiles into a CSV using DictWriter:
|
|
97
101
|
|
|
98
|
-
|
|
102
|
+
import csv
|
|
99
103
|
from twitwi.bluesky.constants import POST_FIELDS
|
|
100
104
|
from twitwi.bluesky import transform_post_into_csv_dict
|
|
101
105
|
|
|
@@ -108,7 +112,6 @@ with open("normalized_bluesky_posts.csv", "w") as f:
|
|
|
108
112
|
|
|
109
113
|
# Or using the basic CSV writer:
|
|
110
114
|
|
|
111
|
-
from csv import writer
|
|
112
115
|
from twitwi.bluesky import format_post_as_csv_row
|
|
113
116
|
|
|
114
117
|
with open("normalized_bluesky_posts.csv", "w") as f:
|
|
@@ -180,7 +183,18 @@ with open("normalized_bluesky_profiles.csv", "w") as f:
|
|
|
180
183
|
|
|
181
184
|
### normalize_profile
|
|
182
185
|
|
|
183
|
-
Function taking a nested dict describing a user profile from Bluesky's JSON payload and returning a flat "normalized" dict composed of all [PROFILE_FIELDS](#profile_fields) keys.
|
|
186
|
+
Function taking a nested dict describing a user profile from Bluesky's JSON payload (in the same format as returned by the [`app.bsky.actor.getProfiles` HTTP endpoint](https://docs.bsky.app/docs/api/app-bsky-actor-get-profiles#responses)) and returning a flat "normalized" dict composed of all [PROFILE_FIELDS](#profile_fields) keys. Be careful not to confuse it with the [normalize_partial_profile](#normalize_partial_profile) function, which operates on a lighter version of the profile data, retrieved for example from [follower/follow profile payloads](https://docs.bsky.app/docs/api/app-bsky-graph-get-followers#responses).
|
|
187
|
+
|
|
188
|
+
Will return datetimes as UTC but can take an optional second `locale` argument as a [`pytz`](https://pypi.org/project/pytz/) string timezone.
|
|
189
|
+
|
|
190
|
+
*Arguments*
|
|
191
|
+
|
|
192
|
+
* **data** *(dict)*: user profile data payload coming from Bluesky API.
|
|
193
|
+
* **locale** *(pytz.timezone as str, optional)*: timezone used to convert dates. If not given, will default to UTC.
|
|
194
|
+
|
|
195
|
+
### normalize_partial_profile
|
|
196
|
+
|
|
197
|
+
Function taking a nested dict describing a user profile from Bluesky's JSON payload (in the same format as returned by the [`app.bsky.graph.getFollowers` HTTP endpoint](https://docs.bsky.app/docs/api/app-bsky-graph-get-followers#responses)) and returning a flat "normalized" dict composed of all [PARTIAL_PROFILE_FIELDS](#partial_profile_fields) keys. Be careful not to confuse it with the [normalize_profile](#normalize_profile) function, which operates on the full version of the profile data, retrieved for example from the [`app.bsky.actor.getProfiles` HTTP endpoint](https://docs.bsky.app/docs/api/app-bsky-actor-get-profiles#responses).
|
|
184
198
|
|
|
185
199
|
Will return datetimes as UTC but can take an optional second `locale` argument as a [`pytz`](https://pypi.org/project/pytz/) string timezone.
|
|
186
200
|
|
|
@@ -210,6 +224,12 @@ Function transforming (i.e. mutating, so beware) a given normalized Bluesky prof
|
|
|
210
224
|
|
|
211
225
|
Will convert list elements of the normalized data into a string with all elements separated by the `|` character, which can be changed using an optional `plural_separator` argument.
|
|
212
226
|
|
|
227
|
+
### transform_partial_profile_into_csv_dict
|
|
228
|
+
|
|
229
|
+
Function transforming (i.e. mutating, so beware) a given normalized Bluesky partial profile into a suitable dict able to be written by a `csv.DictWriter` as a row.
|
|
230
|
+
|
|
231
|
+
Will convert list elements of the normalized data into a string with all elements separated by the `|` character, which can be changed using an optional `plural_separator` argument.
|
|
232
|
+
|
|
213
233
|
### transform_post_into_csv_dict
|
|
214
234
|
|
|
215
235
|
Function transforming (i.e. mutating, so beware) a given normalized Bluesky post into a suitable dict able to be written by a `csv.DictWriter` as a row.
|
|
@@ -222,6 +242,12 @@ Function formatting the given normalized Bluesky profile as a list able to be wr
|
|
|
222
242
|
|
|
223
243
|
Will convert list elements of the normalized data into a string with all elements separated by the `|` character, which can be changed using an optional `plural_separator` argument.
|
|
224
244
|
|
|
245
|
+
### format_partial_profile_as_csv_row
|
|
246
|
+
|
|
247
|
+
Function formatting the given normalized Bluesky partial profile as a list able to be written by a `csv.writer` as a row in the order of [PARTIAL_PROFILE_FIELDS](#partial_profile_fields) (which can therefore be used as header row of the CSV).
|
|
248
|
+
|
|
249
|
+
Will convert list elements of the normalized data into a string with all elements separated by the `|` character, which can be changed using an optional `plural_separator` argument.
|
|
250
|
+
|
|
225
251
|
### format_post_as_csv_row
|
|
226
252
|
|
|
227
253
|
Function formatting the given normalized Bluesky post as a list able to be written by a `csv.writer` as a row in the order of [POST_FIELDS](#post_fields) (which can therefore be used as header row of the CSV).
|
|
@@ -230,7 +256,11 @@ Will convert list elements of the normalized data into a string with all element
|
|
|
230
256
|
|
|
231
257
|
### PROFILE_FIELDS
|
|
232
258
|
|
|
233
|
-
List of a Bluesky user profile's normalized field names. Useful to declare headers with csv writers.
|
|
259
|
+
List of a Bluesky user profile's normalized field names. Useful to declare headers with csv writers. Be careful not to confuse with [PARTIAL_PROFILE_FIELDS](#partial_profile_fields) which correspond to a lighter version of the profile data, retrieved from [follower/follow profile payloads](https://docs.bsky.app/docs/api/app-bsky-graph-get-followers#responses) for example.
|
|
260
|
+
|
|
261
|
+
### PARTIAL_PROFILE_FIELDS
|
|
262
|
+
|
|
263
|
+
List of the normalized field names of a Bluesky user's partial profile (retrieved for example from the [`app.bsky.graph.getFollowers` HTTP endpoint](https://docs.bsky.app/docs/api/app-bsky-graph-get-followers#responses)). Useful to declare headers with csv writers. Be careful not to confuse it with [PROFILE_FIELDS](#profile_fields), which corresponds to the full version of the profile data, retrieved for example from the [`app.bsky.actor.getProfiles` HTTP endpoint](https://docs.bsky.app/docs/api/app-bsky-actor-get-profiles#responses).
|
|
234
264
|
|
|
235
265
|
### POST_FIELDS
|
|
236
266
|
|
|
File without changes
|
|