twitwi 0.21.1__tar.gz → 0.22.0__tar.gz

Files changed (27)
  1. {twitwi-0.21.1/twitwi.egg-info → twitwi-0.22.0}/PKG-INFO +35 -5
  2. {twitwi-0.21.1 → twitwi-0.22.0}/README.md +34 -4
  3. {twitwi-0.21.1 → twitwi-0.22.0}/setup.py +1 -1
  4. {twitwi-0.21.1 → twitwi-0.22.0}/test/bluesky/formatters_test.py +48 -1
  5. {twitwi-0.21.1 → twitwi-0.22.0}/test/bluesky/normalizers_test.py +49 -3
  6. {twitwi-0.21.1 → twitwi-0.22.0}/twitwi/bluesky/__init__.py +10 -1
  7. {twitwi-0.21.1 → twitwi-0.22.0}/twitwi/bluesky/constants.py +3 -1
  8. {twitwi-0.21.1 → twitwi-0.22.0}/twitwi/bluesky/formatters.py +9 -0
  9. {twitwi-0.21.1 → twitwi-0.22.0}/twitwi/bluesky/normalizers.py +94 -28
  10. {twitwi-0.21.1 → twitwi-0.22.0}/twitwi/bluesky/types.py +29 -15
  11. {twitwi-0.21.1 → twitwi-0.22.0}/twitwi/bluesky/utils.py +33 -14
  12. {twitwi-0.21.1 → twitwi-0.22.0}/twitwi/utils.py +28 -2
  13. {twitwi-0.21.1 → twitwi-0.22.0/twitwi.egg-info}/PKG-INFO +35 -5
  14. {twitwi-0.21.1 → twitwi-0.22.0}/LICENSE.txt +0 -0
  15. {twitwi-0.21.1 → twitwi-0.22.0}/setup.cfg +0 -0
  16. {twitwi-0.21.1 → twitwi-0.22.0}/test/bluesky/__init__.py +0 -0
  17. {twitwi-0.21.1 → twitwi-0.22.0}/twitwi/__init__.py +0 -0
  18. {twitwi-0.21.1 → twitwi-0.22.0}/twitwi/anonymizers.py +0 -0
  19. {twitwi-0.21.1 → twitwi-0.22.0}/twitwi/constants.py +0 -0
  20. {twitwi-0.21.1 → twitwi-0.22.0}/twitwi/exceptions.py +0 -0
  21. {twitwi-0.21.1 → twitwi-0.22.0}/twitwi/formatters.py +0 -0
  22. {twitwi-0.21.1 → twitwi-0.22.0}/twitwi/normalizers.py +0 -0
  23. {twitwi-0.21.1 → twitwi-0.22.0}/twitwi.egg-info/SOURCES.txt +0 -0
  24. {twitwi-0.21.1 → twitwi-0.22.0}/twitwi.egg-info/dependency_links.txt +0 -0
  25. {twitwi-0.21.1 → twitwi-0.22.0}/twitwi.egg-info/requires.txt +0 -0
  26. {twitwi-0.21.1 → twitwi-0.22.0}/twitwi.egg-info/top_level.txt +0 -0
  27. {twitwi-0.21.1 → twitwi-0.22.0}/twitwi.egg-info/zip-safe +0 -0
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: twitwi
- Version: 0.21.1
+ Version: 0.22.0
  Summary: A collection of Twitter-related helper functions for python.
  Home-page: http://github.com/medialab/twitwi
  Author: Béatrice Mazoyer, Guillaume Plique, Benjamin Ooghe-Tabanou
@@ -56,18 +56,22 @@ pip install twitwi
56
56
  *Normalization functions*
57
57
 
58
58
  * [normalize_profile](#normalize_profile)
59
+ * [normalize_partial_profile](#normalize_partial_profile)
59
60
  * [normalize_post](#normalize_post)
60
61
 
61
62
  *Formatting functions*
62
63
 
63
64
  * [transform_profile_into_csv_dict](#transform_profile_into_csv_dict)
65
+ * [transform_partial_profile_into_csv_dict](#transform_partial_profile_into_csv_dict)
64
66
  * [transform_post_into_csv_dict](#transform_post_into_csv_dict)
65
67
  * [format_profile_as_csv_row](#format_profile_as_csv_row)
68
+ * [format_partial_profile_as_csv_row](#format_partial_profile_as_csv_row)
66
69
  * [format_post_as_csv_row](#format_post_as_csv_row)
67
70
 
68
71
  *Useful constants (under `twitwi.bluesky.constants`)*
69
72
 
70
73
  * [PROFILE_FIELDS](#profile_fields)
74
+ * [PARTIAL_PROFILE_FIELDS](#partial_profile_fields)
71
75
  * [POST_FIELDS](#post_fields)
72
76
 
73
77
  *Examples*
@@ -95,7 +99,7 @@ for post_data in posts_payload_from_API:
95
99
 
96
100
  # Then, saving normalized profiles into a CSV using DictWriter:
97
101
 
98
- from csv import DictWriter
102
+ import csv
99
103
  from twitwi.bluesky.constants import POST_FIELDS
100
104
  from twitwi.bluesky import transform_post_into_csv_dict
101
105
 
@@ -108,7 +112,6 @@ with open("normalized_bluesky_posts.csv", "w") as f:
108
112
 
109
113
  # Or using the basic CSV writer:
110
114
 
111
- from csv import writer
112
115
  from twitwi.bluesky import format_post_as_csv_row
113
116
 
114
117
  with open("normalized_bluesky_posts.csv", "w") as f:
@@ -180,7 +183,18 @@ with open("normalized_bluesky_profiles.csv", "w") as f:
180
183
 
181
184
  ### normalize_profile
182
185
 
183
- Function taking a nested dict describing a user profile from Bluesky's JSON payload and returning a flat "normalized" dict composed of all [PROFILE_FIELDS](#profile_fields) keys.
186
+ Function taking a nested dict describing a user profile from Bluesky's JSON payload (with the same format as retrieved from [`app.bsky.actor.getProfiles` HTTP endpoint](docs.bsky.app/docs/api/app-bsky-actor-get-profiles#responses)) and returning a flat "normalized" dict composed of all [PROFILE_FIELDS](#profile_fields) keys. Be careful not to confuse with the [normalize_partial_profile](#normalize_partial_profile) function which operate on a lighter version of the profile data, retrieved from [follower/follow profile payloads](https://docs.bsky.app/docs/api/app-bsky-graph-get-followers#responses) for example.
187
+
188
+ Will return datetimes as UTC but can take an optional second `locale` argument as a [`pytz`](https://pypi.org/project/pytz/) string timezone.
189
+
190
+ *Arguments*
191
+
192
+ * **data** *(dict)*: user profile data payload coming from Bluesky API.
193
+ * **locale** *(pytz.timezone as str, optional)*: timezone used to convert dates. If not given, will default to UTC.
194
+
195
+ ### normalize_partial_profile
196
+
197
+ Function taking a nested dict describing a user profile from Bluesky's JSON payload (with the same format as retrieved from [`app.bsky.graph.getFollowers` HTTP endpoint](https://docs.bsky.app/docs/api/app-bsky-graph-get-followers#responses)) and returning a flat "normalized" dict composed of all [PARTIAL_PROFILE_FIELDS](#partial_profile_fields) keys. Be careful not to confuse with the [normalize_profile](#normalize_profile) function which operate on the full version of the profile data, retrieved from [`app.bsky.actor.getProfiles` HTTP endpoint](docs.bsky.app/docs/api/app-bsky-actor-get-profiles#responses) for example.
184
198
 
185
199
  Will return datetimes as UTC but can take an optional second `locale` argument as a [`pytz`](https://pypi.org/project/pytz/) string timezone.
186
200
 
@@ -210,6 +224,12 @@ Function transforming (i.e. mutating, so beware) a given normalized Bluesky prof
210
224
 
211
225
  Will convert list elements of the normalized data into a string with all elements separated by the `|` character, which can be changed using an optional `plural_separator` argument.
212
226
 
227
+ ### transform_partial_profile_into_csv_dict
228
+
229
+ Function transforming (i.e. mutating, so beware) a given normalized Bluesky partial profile into a suitable dict able to be written by a `csv.DictWriter` as a row.
230
+
231
+ Will convert list elements of the normalized data into a string with all elements separated by the `|` character, which can be changed using an optional `plural_separator` argument.
232
+
213
233
  ### transform_post_into_csv_dict
214
234
 
215
235
  Function transforming (i.e. mutating, so beware) a given normalized Bluesky post into a suitable dict able to be written by a `csv.DictWriter` as a row.
@@ -222,6 +242,12 @@ Function formatting the given normalized Bluesky profile as a list able to be wr
222
242
 
223
243
  Will convert list elements of the normalized data into a string with all elements separated by the `|` character, which can be changed using an optional `plural_separator` argument.
224
244
 
245
+ ### format_partial_profile_as_csv_row
246
+
247
+ Function formatting the given normalized Bluesky partial profile as a list able to be written by a `csv.writer` as a row in the order of [PARTIAL_PROFILE_FIELDS](#partial_profile_fields) (which can therefore be used as header row of the CSV).
248
+
249
+ Will convert list elements of the normalized data into a string with all elements separated by the `|` character, which can be changed using an optional `plural_separator` argument.
250
+
225
251
  ### format_post_as_csv_row
226
252
 
227
253
  Function formatting the given normalized tBluesky post as a list able to be written by a `csv.writer` as a row in the order of [POST_FIELDS](#post_fields) (which can therefore be used as header row of the CSV).
@@ -230,7 +256,11 @@ Will convert list elements of the normalized data into a string with all element
230
256
 
231
257
  ### PROFILE_FIELDS
232
258
 
233
- List of a Bluesky user profile's normalized field names. Useful to declare headers with csv writers.
259
+ List of a Bluesky user profile's normalized field names. Useful to declare headers with csv writers. Be careful not to confuse with [PARTIAL_PROFILE_FIELDS](#partial_profile_fields) which correspond to a lighter version of the profile data, retrieved from [follower/follow profile payloads](https://docs.bsky.app/docs/api/app-bsky-graph-get-followers#responses) for example.
260
+
261
+ ### PARTIAL_PROFILE_FIELDS
262
+
263
+ List of a Bluesky user partial profile's (retrieved from [`app.bsky.graph.getFollowers` HTTP endpoint](https://docs.bsky.app/docs/api/app-bsky-graph-get-followers#responses) for example) normalized field names. Useful to declare headers with csv writers. Be careful not to confuse with [PROFILE_FIELDS](#profile_fields) which correspond to the full version of the profile data, retrieved from [`app.bsky.actor.getProfiles` HTTP endpoint](docs.bsky.app/docs/api/app-bsky-actor-get-profiles#responses) for example.
234
264
 
235
265
  ### POST_FIELDS
236
266
 
@@ -29,18 +29,22 @@ pip install twitwi
29
29
  *Normalization functions*
30
30
 
31
31
  * [normalize_profile](#normalize_profile)
32
+ * [normalize_partial_profile](#normalize_partial_profile)
32
33
  * [normalize_post](#normalize_post)
33
34
 
34
35
  *Formatting functions*
35
36
 
36
37
  * [transform_profile_into_csv_dict](#transform_profile_into_csv_dict)
38
+ * [transform_partial_profile_into_csv_dict](#transform_partial_profile_into_csv_dict)
37
39
  * [transform_post_into_csv_dict](#transform_post_into_csv_dict)
38
40
  * [format_profile_as_csv_row](#format_profile_as_csv_row)
41
+ * [format_partial_profile_as_csv_row](#format_partial_profile_as_csv_row)
39
42
  * [format_post_as_csv_row](#format_post_as_csv_row)
40
43
 
41
44
  *Useful constants (under `twitwi.bluesky.constants`)*
42
45
 
43
46
  * [PROFILE_FIELDS](#profile_fields)
47
+ * [PARTIAL_PROFILE_FIELDS](#partial_profile_fields)
44
48
  * [POST_FIELDS](#post_fields)
45
49
 
46
50
  *Examples*
@@ -68,7 +72,7 @@ for post_data in posts_payload_from_API:
68
72
 
69
73
  # Then, saving normalized profiles into a CSV using DictWriter:
70
74
 
71
- from csv import DictWriter
75
+ import csv
72
76
  from twitwi.bluesky.constants import POST_FIELDS
73
77
  from twitwi.bluesky import transform_post_into_csv_dict
74
78
 
@@ -81,7 +85,6 @@ with open("normalized_bluesky_posts.csv", "w") as f:
81
85
 
82
86
  # Or using the basic CSV writer:
83
87
 
84
- from csv import writer
85
88
  from twitwi.bluesky import format_post_as_csv_row
86
89
 
87
90
  with open("normalized_bluesky_posts.csv", "w") as f:
@@ -153,7 +156,18 @@ with open("normalized_bluesky_profiles.csv", "w") as f:
153
156
 
154
157
  ### normalize_profile
155
158
 
156
- Function taking a nested dict describing a user profile from Bluesky's JSON payload and returning a flat "normalized" dict composed of all [PROFILE_FIELDS](#profile_fields) keys.
159
+ Function taking a nested dict describing a user profile from Bluesky's JSON payload (with the same format as retrieved from [`app.bsky.actor.getProfiles` HTTP endpoint](docs.bsky.app/docs/api/app-bsky-actor-get-profiles#responses)) and returning a flat "normalized" dict composed of all [PROFILE_FIELDS](#profile_fields) keys. Be careful not to confuse with the [normalize_partial_profile](#normalize_partial_profile) function which operate on a lighter version of the profile data, retrieved from [follower/follow profile payloads](https://docs.bsky.app/docs/api/app-bsky-graph-get-followers#responses) for example.
160
+
161
+ Will return datetimes as UTC but can take an optional second `locale` argument as a [`pytz`](https://pypi.org/project/pytz/) string timezone.
162
+
163
+ *Arguments*
164
+
165
+ * **data** *(dict)*: user profile data payload coming from Bluesky API.
166
+ * **locale** *(pytz.timezone as str, optional)*: timezone used to convert dates. If not given, will default to UTC.
167
+
168
+ ### normalize_partial_profile
169
+
170
+ Function taking a nested dict describing a user profile from Bluesky's JSON payload (with the same format as retrieved from [`app.bsky.graph.getFollowers` HTTP endpoint](https://docs.bsky.app/docs/api/app-bsky-graph-get-followers#responses)) and returning a flat "normalized" dict composed of all [PARTIAL_PROFILE_FIELDS](#partial_profile_fields) keys. Be careful not to confuse with the [normalize_profile](#normalize_profile) function which operate on the full version of the profile data, retrieved from [`app.bsky.actor.getProfiles` HTTP endpoint](docs.bsky.app/docs/api/app-bsky-actor-get-profiles#responses) for example.
157
171
 
158
172
  Will return datetimes as UTC but can take an optional second `locale` argument as a [`pytz`](https://pypi.org/project/pytz/) string timezone.
159
173
 
@@ -183,6 +197,12 @@ Function transforming (i.e. mutating, so beware) a given normalized Bluesky prof
183
197
 
184
198
  Will convert list elements of the normalized data into a string with all elements separated by the `|` character, which can be changed using an optional `plural_separator` argument.
185
199
 
200
+ ### transform_partial_profile_into_csv_dict
201
+
202
+ Function transforming (i.e. mutating, so beware) a given normalized Bluesky partial profile into a suitable dict able to be written by a `csv.DictWriter` as a row.
203
+
204
+ Will convert list elements of the normalized data into a string with all elements separated by the `|` character, which can be changed using an optional `plural_separator` argument.
205
+
186
206
  ### transform_post_into_csv_dict
187
207
 
188
208
  Function transforming (i.e. mutating, so beware) a given normalized Bluesky post into a suitable dict able to be written by a `csv.DictWriter` as a row.
@@ -195,6 +215,12 @@ Function formatting the given normalized Bluesky profile as a list able to be wr
195
215
 
196
216
  Will convert list elements of the normalized data into a string with all elements separated by the `|` character, which can be changed using an optional `plural_separator` argument.
197
217
 
218
+ ### format_partial_profile_as_csv_row
219
+
220
+ Function formatting the given normalized Bluesky partial profile as a list able to be written by a `csv.writer` as a row in the order of [PARTIAL_PROFILE_FIELDS](#partial_profile_fields) (which can therefore be used as header row of the CSV).
221
+
222
+ Will convert list elements of the normalized data into a string with all elements separated by the `|` character, which can be changed using an optional `plural_separator` argument.
223
+
198
224
  ### format_post_as_csv_row
199
225
 
200
226
  Function formatting the given normalized tBluesky post as a list able to be written by a `csv.writer` as a row in the order of [POST_FIELDS](#post_fields) (which can therefore be used as header row of the CSV).
@@ -203,7 +229,11 @@ Will convert list elements of the normalized data into a string with all element
203
229
 
204
230
  ### PROFILE_FIELDS
205
231
 
206
- List of a Bluesky user profile's normalized field names. Useful to declare headers with csv writers.
232
+ List of a Bluesky user profile's normalized field names. Useful to declare headers with csv writers. Be careful not to confuse with [PARTIAL_PROFILE_FIELDS](#partial_profile_fields) which correspond to a lighter version of the profile data, retrieved from [follower/follow profile payloads](https://docs.bsky.app/docs/api/app-bsky-graph-get-followers#responses) for example.
233
+
234
+ ### PARTIAL_PROFILE_FIELDS
235
+
236
+ List of a Bluesky user partial profile's (retrieved from [`app.bsky.graph.getFollowers` HTTP endpoint](https://docs.bsky.app/docs/api/app-bsky-graph-get-followers#responses) for example) normalized field names. Useful to declare headers with csv writers. Be careful not to confuse with [PROFILE_FIELDS](#profile_fields) which correspond to the full version of the profile data, retrieved from [`app.bsky.actor.getProfiles` HTTP endpoint](docs.bsky.app/docs/api/app-bsky-actor-get-profiles#responses) for example.
207
237
 
208
238
  ### POST_FIELDS
209
239
 
@@ -5,7 +5,7 @@ with open("./README.md", "r") as f:

  setup(
      name="twitwi",
-     version="0.21.1",
+     version="0.22.0",
      description="A collection of Twitter-related helper functions for python.",
      long_description=long_description,
      long_description_content_type="text/markdown",
@@ -2,11 +2,13 @@ import csv
  from io import StringIO
  from twitwi.bluesky import (
      format_profile_as_csv_row,
+     format_partial_profile_as_csv_row,
      format_post_as_csv_row,
      transform_profile_into_csv_dict,
+     transform_partial_profile_into_csv_dict,
      transform_post_into_csv_dict,
  )
- from twitwi.bluesky.constants import PROFILE_FIELDS, POST_FIELDS
+ from twitwi.bluesky.constants import PROFILE_FIELDS, PARTIAL_PROFILE_FIELDS, POST_FIELDS
  from test.utils import get_json_resource, open_resource


@@ -56,6 +58,51 @@ class TestFormatters:
              buffer.seek(0)
              assert list(csv.DictReader(buffer)) == list(csv.DictReader(f))

+     def test_format_partial_profile_as_csv_row(self):
+         normalized_partial_profiles = get_json_resource(
+             "bluesky-normalized-partial-profiles.json"
+         )
+
+         buffer = StringIO(newline=None)
+         writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL)
+         writer.writerow(PARTIAL_PROFILE_FIELDS)
+
+         for profile in normalized_partial_profiles:
+             writer.writerow(format_partial_profile_as_csv_row(profile))
+
+         if OVERWRITE_TESTS:
+             written = buffer.getvalue()
+
+             with open("test/resources/bluesky-partial-profiles-export.csv", "w") as f:
+                 f.write(written)
+
+         with open_resource("bluesky-partial-profiles-export.csv") as f:
+             buffer.seek(0)
+             assert list(csv.reader(buffer)) == list(csv.reader(f))
+
+     def test_transform_partial_profile_into_csv_dict(self):
+         normalized_partial_profiles = get_json_resource(
+             "bluesky-normalized-partial-profiles.json"
+         )
+
+         buffer = StringIO(newline=None)
+         writer = csv.DictWriter(
+             buffer,
+             fieldnames=PARTIAL_PROFILE_FIELDS,
+             extrasaction="ignore",
+             restval="",
+             quoting=csv.QUOTE_MINIMAL,
+         )
+         writer.writeheader()
+
+         for profile in normalized_partial_profiles:
+             transform_partial_profile_into_csv_dict(profile)
+             writer.writerow(profile)
+
+         with open_resource("bluesky-partial-profiles-export.csv") as f:
+             buffer.seek(0)
+             assert list(csv.DictReader(buffer)) == list(csv.DictReader(f))
+
      def test_format_post_as_csv_row(self):
          normalized_posts = get_json_resource("bluesky-normalized-posts.json")

@@ -5,7 +5,7 @@ from functools import partial
  from pytz import timezone
  from copy import deepcopy

- from twitwi.bluesky import normalize_profile, normalize_post
+ from twitwi.bluesky import normalize_profile, normalize_partial_profile, normalize_post

  from test.utils import get_json_resource

@@ -15,6 +15,8 @@ OVERWRITE_TESTS = False


  FAKE_COLLECTION_TIME = "2025-01-01T00:00:00.000000"
+
+
  def set_fake_collection_time(dico):
      if "collection_time" in dico:
          dico["collection_time"] = FAKE_COLLECTION_TIME
@@ -47,7 +49,9 @@ class TestNormalizers:
          if OVERWRITE_TESTS:
              from test.utils import dump_json_resource

-             normalized_profiles = [set_fake_collection_time(fn(profile)) for profile in profiles]
+             normalized_profiles = [
+                 set_fake_collection_time(fn(profile)) for profile in profiles
+             ]
              dump_json_resource(normalized_profiles, "bluesky-normalized-profiles.json")

          expected = get_json_resource("bluesky-normalized-profiles.json")
@@ -70,6 +74,42 @@ class TestNormalizers:

          assert profile == original_arg

+     def test_normalize_partial_profile(self):
+         tz = timezone("Europe/Paris")
+
+         profiles = get_json_resource("bluesky-partial-profiles.json")
+         fn = partial(normalize_partial_profile, locale=tz)
+
+         if OVERWRITE_TESTS:
+             from test.utils import dump_json_resource
+
+             normalized_profiles = [
+                 set_fake_collection_time(fn(profile)) for profile in profiles
+             ]
+             dump_json_resource(
+                 normalized_profiles, "bluesky-normalized-partial-profiles.json"
+             )
+
+         expected = get_json_resource("bluesky-normalized-partial-profiles.json")
+
+         for idx, profile in enumerate(profiles):
+             result = fn(profile)
+             assert isinstance(result, dict)
+             assert "collection_time" in result and isinstance(
+                 result["collection_time"], str
+             )
+
+             compare_dicts(profile["handle"], result, expected[idx])
+
+     def test_normalize_partial_profile_should_not_mutate(self):
+         profile = get_json_resource("bluesky-partial-profiles.json")[0]
+
+         original_arg = deepcopy(profile)
+
+         normalize_partial_profile(profile)
+
+         assert profile == original_arg
+
      def test_normalize_post(self):
          tz = timezone("Europe/Paris")

@@ -79,7 +119,13 @@ class TestNormalizers:
          if OVERWRITE_TESTS:
              from test.utils import dump_json_resource

-             normalized_posts = [[set_fake_collection_time(p) for p in fn(post, extract_referenced_posts=True)] for post in posts]
+             normalized_posts = [
+                 [
+                     set_fake_collection_time(p)
+                     for p in fn(post, extract_referenced_posts=True)
+                 ]
+                 for post in posts
+             ]
              dump_json_resource(normalized_posts, "bluesky-normalized-posts.json")

          expected = get_json_resource("bluesky-normalized-posts.json")
@@ -1,7 +1,13 @@
- from twitwi.bluesky.normalizers import normalize_profile, normalize_post
+ from twitwi.bluesky.normalizers import (
+     normalize_profile,
+     normalize_partial_profile,
+     normalize_post,
+ )
  from twitwi.bluesky.formatters import (
      transform_profile_into_csv_dict,
      format_profile_as_csv_row,
+     transform_partial_profile_into_csv_dict,
+     format_partial_profile_as_csv_row,
      transform_post_into_csv_dict,
      format_post_as_csv_row,
  )
@@ -9,8 +15,11 @@ from twitwi.bluesky.formatters import (
  __all__ = [
      "transform_profile_into_csv_dict",
      "format_profile_as_csv_row",
+     "transform_partial_profile_into_csv_dict",
+     "format_partial_profile_as_csv_row",
      "transform_post_into_csv_dict",
      "format_post_as_csv_row",
      "normalize_profile",
+     "normalize_partial_profile",
      "normalize_post",
  ]
@@ -1,9 +1,11 @@
  from typing import List, Optional

- from twitwi.bluesky.types import BlueskyProfile, BlueskyPost
+ from twitwi.bluesky.types import BlueskyProfile, BlueskyPartialProfile, BlueskyPost

  PROFILE_FIELDS = list(BlueskyProfile.__annotations__.keys())

+ PARTIAL_PROFILE_FIELDS = list(BlueskyPartialProfile.__annotations__.keys())
+
  POST_FIELDS = list(BlueskyPost.__annotations__.keys())

  POST_PLURAL_FIELDS = [
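Note on the hunk above: `PARTIAL_PROFILE_FIELDS` is derived from the annotations of `BlueskyPartialProfile`, so the CSV column order follows the declaration order of that type. The sketch below illustrates the pattern with the standard library only. It assumes the type is a `TypedDict` (its definition lives in `twitwi/bluesky/types.py`, whose hunk is not shown here); `PartialProfileSketch`, `FIELDS_SKETCH` and `write_rows` are hypothetical names used for illustration, not part of twitwi.

```python
import csv
from io import StringIO
from typing import Any, Dict, Iterable, List, Optional, TypedDict


class PartialProfileSketch(TypedDict):
    # Hypothetical, shortened stand-in for BlueskyPartialProfile
    did: str
    handle: str
    display_name: Optional[str]


# Same pattern as in constants.py: field names come from the annotations,
# in declaration order.
FIELDS_SKETCH: List[str] = list(PartialProfileSketch.__annotations__.keys())


def write_rows(rows: Iterable[Dict[str, Any]]) -> str:
    """Write dict rows to CSV text, using the derived field list as header."""
    buffer = StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS_SKETCH, extrasaction="ignore", restval=""
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()
```

Because the header is generated from the type, adding a field to the type is enough to add a column, which is presumably why the constants are built this way.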
@@ -1,6 +1,7 @@
  from twitwi.formatters import make_transform_into_csv_dict, make_format_as_csv_row
  from twitwi.bluesky.constants import (
      PROFILE_FIELDS,
+     PARTIAL_PROFILE_FIELDS,
      POST_FIELDS,
      POST_PLURAL_FIELDS,
      POST_BOOLEAN_FIELDS,
@@ -20,10 +21,18 @@ transform_profile_into_csv_dict = make_transform_into_csv_dict([], [])

  format_profile_as_csv_row = make_format_as_csv_row(PROFILE_FIELDS, [], [])

+ transform_partial_profile_into_csv_dict = make_transform_into_csv_dict([], [])
+
+ format_partial_profile_as_csv_row = make_format_as_csv_row(
+     PARTIAL_PROFILE_FIELDS, [], []
+ )
+

  __all__ = [
      "transform_post_into_csv_dict",
      "format_post_as_csv_row",
      "transform_profile_into_csv_dict",
      "format_profile_as_csv_row",
+     "transform_partial_profile_into_csv_dict",
+     "format_partial_profile_as_csv_row",
  ]
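Note on the hunk above: the two new formatters are produced by the existing factories, called with empty plural and boolean field lists. The following is only a rough sketch of what a row-formatter factory of this kind might look like. It is not twitwi's implementation (`twitwi/formatters.py` is unchanged and not shown), and `make_row_formatter_sketch` with its parameters is a hypothetical name.

```python
from typing import Any, Callable, Dict, List, Sequence


def make_row_formatter_sketch(
    fields: Sequence[str],
    plural_fields: Sequence[str] = (),
    plural_separator: str = "|",
) -> Callable[[Dict[str, Any]], List[Any]]:
    """Build a function turning a normalized dict into a list ordered like `fields`."""
    plural = set(plural_fields)

    def format_row(item: Dict[str, Any]) -> List[Any]:
        row = []
        for field in fields:
            value = item.get(field)
            # List-valued fields are joined into a single cell
            if field in plural and value is not None:
                value = plural_separator.join(str(v) for v in value)
            # Missing or None values become empty cells
            row.append("" if value is None else value)
        return row

    return format_row
```

With empty plural lists, as in the partial-profile formatters, the function reduces to picking the values in header order.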
@@ -1,11 +1,13 @@
  from copy import deepcopy
- from typing import List, Dict, Union, Optional, Literal, overload
+ from typing import List, Dict, Union, Optional, Literal, Any, overload
+
+ from ural import is_url

  from twitwi.exceptions import BlueskyPayloadError
  from twitwi.utils import (
      get_collection_time,
      get_dates,
-     custom_normalize_url,
+     safe_normalize_url,
      custom_get_normalized_hostname,
  )
  from twitwi.bluesky.utils import (
@@ -18,10 +20,10 @@ from twitwi.bluesky.utils import (
      format_starterpack_url,
      format_media_url,
  )
- from twitwi.bluesky.types import BlueskyProfile, BlueskyPost
+ from twitwi.bluesky.types import BlueskyProfile, BlueskyPartialProfile, BlueskyPost


- def normalize_profile(data: Dict, locale: Optional[str] = None) -> BlueskyProfile:
+ def normalize_profile(data: Dict, locale: Optional[Any] = None) -> BlueskyProfile:
      associated = data["associated"]

      pinned_post_uri = None
@@ -38,23 +40,48 @@ def normalize_profile(data: Dict, locale: Optional[str] = None) -> BlueskyProfil
          "did": data["did"],
          "url": format_profile_url(data["handle"]),
          "handle": data["handle"],
-         "display_name": data.get("displayName", ""),
+         "display_name": data.get("displayName"),
          "created_at": created_at,
          "timestamp_utc": timestamp_utc,
-         "description": data["description"],
-         "avatar": data.get("avatar", ""),
+         "description": data.get("description"),
+         "avatar": data.get("avatar"),
          "posts": data["postsCount"],
          "followers": data["followersCount"],
          "follows": data["followsCount"],
          "lists": associated["lists"],
          "feedgens": associated["feedgens"],
          "starter_packs": associated["starterPacks"],
-         "banner": data["banner"],
+         "banner": data.get("banner"),
          "pinned_post_uri": pinned_post_uri,
          "collection_time": get_collection_time(),
      }


+ def normalize_partial_profile(
+     data: Dict, locale: Optional[Any] = None
+ ) -> BlueskyPartialProfile:
+     associated = data["associated"]
+
+     timestamp_utc, created_at = get_dates(
+         data["createdAt"], locale=locale, source="bluesky"
+     )
+
+     return {
+         "did": data["did"],
+         "url": format_profile_url(data["handle"]),
+         "handle": data["handle"],
+         "display_name": data.get("displayName"),
+         "created_at": created_at,
+         "timestamp_utc": timestamp_utc,
+         "description": data.get("description"),
+         "avatar": data.get("avatar"),
+         "lists": associated.get("lists"),
+         "feedgens": associated.get("feedgens"),
+         "starter_packs": associated.get("starterPacks"),
+         "collection_time": get_collection_time(),
+     }
+
+
  def prepare_native_gif_as_media(gif_data, user_did, source):
      if "thumb" in gif_data:
          media_cid = gif_data["thumb"]["ref"]["$link"]
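Note on the hunk above: besides adding `normalize_partial_profile`, it switches several optional fields from `data["key"]` (or `data.get("key", "")`) to `data.get("key")`. A minimal sketch of the resulting behavior, using a hypothetical helper `pick_optional_fields` that is not part of twitwi: required keys still raise `KeyError` when absent, while optional ones now come back as `None` instead of raising or defaulting to an empty string.

```python
from typing import Any, Dict, Optional


def pick_optional_fields(data: Dict[str, Any]) -> Dict[str, Optional[str]]:
    """Illustrative subset of the profile fields touched by this hunk."""
    return {
        # Required: indexing raises KeyError if the payload lacks the key
        "handle": data["handle"],
        # Optional: .get() without a default yields None when the key is absent
        "display_name": data.get("displayName"),
        "description": data.get("description"),
        "avatar": data.get("avatar"),
    }
```

Callers that previously relied on `""` for a missing display name or avatar would now receive `None`.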
@@ -73,8 +100,12 @@ def prepare_native_gif_as_media(gif_data, user_did, source):


  def prepare_image_as_media(image_data):
+     if "ref" not in image_data["image"] or "$link" not in image_data["image"]["ref"]:
+         image_id = image_data["image"]["cid"]
+     else:
+         image_id = image_data["image"]["ref"]["$link"]
      return {
-         "id": image_data["image"]["ref"]["$link"],
+         "id": image_id,
          "type": image_data["image"]["mimeType"],
          "alt": image_data["alt"],
      }
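Note on the hunk above: image blobs may carry their identifier either under `ref.$link` or directly under `cid`, and the new code falls back to `cid` when the former is missing. A small standalone sketch of that lookup follows; `image_id_sketch` is a hypothetical name and the payloads in the usage lines are made up for illustration.

```python
from typing import Any, Dict


def image_id_sketch(image_data: Dict[str, Any]) -> str:
    """Return the blob identifier, preferring ref.$link and falling back to cid."""
    image = image_data["image"]
    if "ref" in image and "$link" in image["ref"]:
        return image["ref"]["$link"]
    # Fallback for payloads that only expose a plain "cid" key
    return image["cid"]


# Both payload shapes resolve to an identifier
print(image_id_sketch({"image": {"ref": {"$link": "bafy-link"}}}))
print(image_id_sketch({"image": {"cid": "bafy-cid"}}))
```

As in the diff, a payload with neither key still raises `KeyError`.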
@@ -92,7 +123,9 @@ def process_starterpack_card(embed_data, post):

      card = embed_data.get("record", {})
      creator_did, pack_did = parse_post_uri(embed_data["uri"])
-     post["card_link"] = format_starterpack_url(embed_data.get("creator", {}).get("handle") or creator_did, pack_did)
+     post["card_link"] = format_starterpack_url(
+         embed_data.get("creator", {}).get("handle") or creator_did, pack_did
+     )
      post["card_title"] = card.get("name", "")
      post["card_description"] = card.get("description", "")
      post["card_thumbnail"] = card.get("thumb", "")
@@ -119,7 +152,14 @@ def prepare_quote_data(embed_quote, card_data, post, links):
119
152
  )
120
153
 
121
154
  # First store ugly quoted url with user did in case full quote data is missing (recursion > 3 or detached quote)
122
- post["quoted_url"] = format_post_url(post["quoted_user_did"], post["quoted_did"])
155
+ # Handling special posts types (only lists for now, for example: https://bsky.app/profile/lanana421.bsky.social/lists/3lxdgjtpqhf2z)
156
+ if "/app.bsky.graph.list/" in post["quoted_uri"]:
157
+ post_splitter = "/lists/"
158
+ else:
159
+ post_splitter = "/post/"
160
+ post["quoted_url"] = format_post_url(
161
+ post["quoted_user_did"], post["quoted_did"], post_splitter=post_splitter
162
+ )
123
163
 
124
164
  quoted_data = None
125
165
  if card_data:
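Note on the hunk above: quoted records that are lists rather than posts now get a `/lists/` URL instead of a `/post/` one. The sketch below shows the idea on a raw AT URI of the form `at://<did>/<collection>/<record key>`. The way the URL is assembled here is an assumption based on the example URL in the comment; the real `format_post_url` lives in `twitwi/bluesky/utils.py`, whose hunk is not shown, and `quoted_url_sketch` is a hypothetical name.

```python
def quoted_url_sketch(quoted_uri: str) -> str:
    """Build a bsky.app URL for a quoted record, choosing the path by record type."""
    # Same test as in the diff: list records live in app.bsky.graph.list
    if "/app.bsky.graph.list/" in quoted_uri:
        splitter = "/lists/"
    else:
        splitter = "/post/"

    # at://<did>/<collection>/<record key>
    user_did, _collection, record_key = quoted_uri[len("at://"):].split("/", 2)

    # Assumed URL layout, mirroring the example in the hunk's comment
    return "https://bsky.app/profile/" + user_did + splitter + record_key
```

The URL still contains the author's DID at this stage; per the comment in the hunk, it is only a fallback for when full quote data is missing.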
@@ -145,7 +185,9 @@ def prepare_quote_data(embed_quote, card_data, post, links):
145
185
 
146
186
  # Extract user handle from url
147
187
  if "did:plc:" not in post["quoted_url"]:
148
- post["quoted_user_handle"], _ = parse_post_url(post["quoted_url"], post["url"])
188
+ post["quoted_user_handle"], _ = parse_post_url(
189
+ post["quoted_url"], post["url"]
190
+ )
149
191
 
150
192
  return (post, quoted_data, links)
151
193
 
@@ -176,7 +218,7 @@ def merge_nested_posts(referenced_posts, nested, source):
176
218
 
177
219
  @overload
178
220
  def normalize_post(
179
- data: Dict,
221
+ payload: Dict,
180
222
  locale: Optional[str] = ...,
181
223
  extract_referenced_posts: Literal[True] = ...,
182
224
  collection_source: Optional[str] = ...,
@@ -185,7 +227,7 @@ def normalize_post(
185
227
 
186
228
  @overload
187
229
  def normalize_post(
188
- data: Dict,
230
+ payload: Dict,
189
231
  locale: Optional[str] = ...,
190
232
  extract_referenced_posts: Literal[False] = ...,
191
233
  collection_source: Optional[str] = ...,
@@ -194,7 +236,7 @@ def normalize_post(
194
236
 
195
237
  def normalize_post(
196
238
  payload: Dict,
197
- locale: Optional[str] = None,
239
+ locale: Optional[Any] = None,
198
240
  extract_referenced_posts: bool = False,
199
241
  collection_source: Optional[str] = None,
200
242
  ) -> Union[BlueskyPost, List[BlueskyPost]]:
@@ -308,7 +350,7 @@ def normalize_post(
308
350
  feat = facet["features"][0]
309
351
 
310
352
  # Hashtags
311
- if feat["$type"].endswith("#tag"):
353
+ if feat["$type"].endswith("#tag") or feat["$type"].endswith("#hashtag"):
312
354
  hashtags.add(feat["tag"].strip().lower())
313
355
 
314
356
  # Mentions
@@ -323,7 +365,11 @@ def normalize_post(
323
365
  byteStart = text.find(b"@", byteStart)
324
366
 
325
367
  handle = (
326
- text[byteStart + 1 : facet["index"]["byteEnd"] + byteStart - facet["index"]["byteStart"]]
368
+ text[
369
+ byteStart + 1 : facet["index"]["byteEnd"]
370
+ + byteStart
371
+ - facet["index"]["byteStart"]
372
+ ]
327
373
  .strip()
328
374
  .lower()
329
375
  .decode("utf-8")
@@ -335,22 +381,36 @@ def normalize_post(
335
381
  # Handle native polls
336
382
  if "https://poll.blue/" in feat["uri"]:
337
383
  if feat["uri"].endswith("/0"):
338
- links.add(custom_normalize_url(feat["uri"]))
384
+ link = safe_normalize_url(feat["uri"])
385
+ if is_url(link):
386
+ links.add(link)
387
+ else:
388
+ continue
339
389
  text += b" %s" % feat["uri"].encode("utf-8")
340
390
  continue
341
391
 
342
- links.add(custom_normalize_url(feat["uri"]))
392
+ link = safe_normalize_url(feat["uri"])
393
+ if is_url(link):
394
+ links.add(link)
395
+ else:
396
+ continue
343
397
  # Check & fix occasional errored link positioning
344
- # example: https://bsky.app/profile/ecrime.ch/post/3lqotmopayr23
398
+ # examples: https://bsky.app/profile/ecrime.ch/post/3lqotmopayr23
399
+ # https://bsky.app/profile/clustz.com/post/3lqfi7mnto52w
345
400
  byteStart = facet["index"]["byteStart"]
346
- if b" " in text[byteStart : facet["index"]["byteEnd"]]:
347
- byteStart = text.find(b"http", byteStart)
401
+
402
+ if not text[byteStart : facet["index"]["byteEnd"]].startswith(b"http"):
403
+ new_byteStart = text.find(b"http", byteStart, facet["index"]["byteEnd"])
404
+ if new_byteStart != -1:
405
+ byteStart = new_byteStart
348
406
 
349
407
  links_to_replace.append(
350
408
  {
351
409
  "uri": feat["uri"].encode("utf-8"),
352
410
  "start": byteStart,
353
- "end": byteStart - facet["index"]["byteStart"] + facet["index"]["byteEnd"],
411
+ "end": byteStart
412
+ - facet["index"]["byteStart"]
413
+ + facet["index"]["byteEnd"],
354
414
  }
355
415
  )
356
416
 
@@ -426,7 +486,7 @@ def normalize_post(
426
486
 
427
487
  # Extra card links sometimes missing from facets & text due to manual action in post form
428
488
  else:
429
- extra_links.append(embed["external"]["uri"])
489
+ extra_links.append(link)
430
490
  # Handle link card metadata
431
491
  if "embed" in data:
432
492
  post = process_card_data(data["embed"]["external"], post)
@@ -442,7 +502,9 @@ def normalize_post(
442
502
  # Quote & Starter-packs
443
503
  if embed["$type"].endswith(".record"):
444
504
  if "app.bsky.graph.starterpack" in embed["record"]["uri"]:
445
- post = process_starterpack_card(data.get("embed", {}).get("record"), post)
505
+ post = process_starterpack_card(
506
+ data.get("embed", {}).get("record"), post
507
+ )
446
508
  if post["card_link"]:
447
509
  extra_links.append(post["card_link"])
448
510
  else:
@@ -499,9 +561,10 @@ def normalize_post(
499
561
 
500
562
  # Process extra links
501
563
  for link in extra_links:
502
- norm_link = custom_normalize_url(link)
564
+ norm_link = safe_normalize_url(link)
503
565
  if norm_link not in links:
504
- links.add(norm_link)
566
+ if is_url(norm_link):
567
+ links.add(norm_link)
505
568
  text += b" " + link.encode("utf-8")
506
569
 
507
570
  # Process medias
@@ -523,7 +586,10 @@ def normalize_post(
523
586
 
524
587
  # Rewrite post's text to include links to medias within
525
588
  text += b" " + (
526
- media_thumb if media_type.startswith("video") and not media_type.endswith("/gif") else media_url
589
+ media_thumb
590
+ if media_type.startswith("video")
591
+ and not media_type.endswith("/gif")
592
+ else media_url
527
593
  ).encode("utf-8")
528
594
 
529
595
  # Process quotes
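Note on the link-position fix in the `normalize_post` hunks above: when a link facet does not begin with `http`, the new code searches for `http` only inside the facet's declared byte range and shifts the end by the same offset. The following is a minimal standalone sketch of that arithmetic, not the package's code; `realign_link_span` and the sample text are hypothetical names chosen for illustration.

```python
def realign_link_span(text, byte_start, byte_end):
    """Return a (start, end) byte span shifted so that it begins at b"http".

    `text` is the UTF-8 encoded post text; `byte_start` and `byte_end` are
    the positions declared by the facet. Hypothetical helper mirroring the
    logic shown in the diff.
    """
    start = byte_start
    if not text[byte_start:byte_end].startswith(b"http"):
        # Search only within the declared range, as the updated code does.
        new_start = text.find(b"http", byte_start, byte_end)
        if new_start != -1:
            start = new_start
    # The end moves by the same offset as the start.
    end = start - byte_start + byte_end
    return start, end


text = "See: https://example.com now".encode("utf-8")
# A facet declared two bytes too early (it starts on the colon).
start, end = realign_link_span(text, 3, 22)
print(text[start:end])
```

Bounding the search by the facet's end keeps the start from jumping to an unrelated link later in the post, which the previous unbounded `find` could do.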
@@ -10,7 +10,7 @@ class BlueskyProfile(TypedDict):
10
10
  url: str # URL of the profile accessible on the web
11
11
  handle: str # updatable human-readable username of the account (usually like username.bsky.social or username.com)
12
12
  display_name: Optional[str] # updatable human-readable name of the account
13
- description: str # profile short description written by the user
13
+ description: Optional[str] # profile short description written by the user
14
14
  posts: int # total number of posts submitted by the user (at collection time)
15
15
  followers: int # total number of followers of the user (at collection time)
16
16
  follows: int # total number of other users followed by the user (at collection time)
@@ -18,12 +18,26 @@ class BlueskyProfile(TypedDict):
18
18
  feedgens: int # total number of custom feeds created by the user (at collection time)
19
19
  starter_packs: int # total number of starter packs created by the user (at collection time)
20
20
  avatar: Optional[str] # URL to the image serving as avatar to the user
21
- banner: str # URL to the image serving as profile banner to the user
21
+ banner: Optional[str] # URL to the image serving as profile banner to the user
22
22
  pinned_post_uri: Optional[str] # ATProto's internal URI to the post potentially pinned by the user to appear at the top of his posts on his profile
23
23
  created_at: str # datetime (potentially timezoned) of when the user created the account
24
24
  timestamp_utc: int # Unix UTC timestamp of when the user created the account
25
25
  collection_time: Optional[str] # datetime (potentially timezoned) of when the data was normalized
26
26
 
27
+ class BlueskyPartialProfile(TypedDict): # A partial version of the profile found in follower/follow profile payloads
28
+ did: str # persistent long-term identifier of the account
29
+ url: str # URL of the profile accessible on the web
30
+ handle: str # updatable human-readable username of the account (usually like username.bsky.social or username.com)
31
+ display_name: Optional[str] # updatable human-readable name of the account
32
+ description: Optional[str] # profile short description written by the user
33
+ lists: Optional[int] # total number of lists created by the user (at collection time)
34
+ feedgens: Optional[int] # total number of custom feeds created by the user (at collection time)
35
+ starter_packs: Optional[int] # total number of starter packs created by the user (at collection time)
36
+ avatar: Optional[str] # URL to the image serving as avatar to the user
37
+ created_at: str # datetime (potentially timezoned) of when the user created the account
38
+ timestamp_utc: int # Unix UTC timestamp of when the user created the account
39
+ collection_time: Optional[str] # datetime (potentially timezoned) of when the data was normalized
40
+
27
41
 
28
42
 
29
43
  class BlueskyPost(TypedDict):
@@ -64,7 +78,7 @@ class BlueskyPost(TypedDict):
64
78
  # user_lists: int # not available from posts payloads
65
79
  user_langs: List[str] # languages in which the author of the posts usually writes posts (declarative)
66
80
  user_avatar: Optional[str] # URL to the image serving as avatar to the user who authored the post
67
- user_created_at: str # datetime (potentially timezoned) ofwhen the user who authored the post created the account
81
+ user_created_at: str # datetime (potentially timezoned) of when the user who authored the post created the account
68
82
  user_timestamp_utc: int # Unix UTC timestamp of when the user who authored the post created the account
69
83
 
70
84
  # Parent post identifying fields
@@ -102,27 +116,27 @@ class BlueskyPost(TypedDict):
102
116
  quoted_user_handle: Optional[str] # updatable human-readable username of the account who authored the quoted post
103
117
  quoted_created_at: Optional[int] # datetime (potentially timezoned) of when the quoted post was submitted
104
118
  quoted_timestamp_utc: Optional[int] # Unix UTC timestamp of when the quoted post was submitted
105
- quoted_status: Optional[str] # empty or "detached" when the author of the quoted post intentionnally required the quoting post not to be accessible from their own
119
+ quoted_status: Optional[str] # empty or "detached" when the author of the quoted post intentionally required the quoting post not to appear in the list of this post's quotes
106
120
 
107
121
  # Embedded elements metadata fields
108
122
  links: List[str] # list of URLs of all links shared within the post (including potentially the embedded card detailed below, but not the link to a potential quoted post)
109
- domains: List[str] # list of domains of the links shared within the post (here a domain refer to a full hostname, including subdomains, for instance bluesky.com or medialab.sciencespo.fr)
123
+ domains: List[str] # list of domains of the links shared within the post (here a domain refers to a full hostname, including subdomains, for instance bluesky.com or medialab.sciencespo.fr)
110
124
  card_link: Optional[str] # URL of the link displayed as a card within the post if any
111
125
  card_title: Optional[str] # title of the webpage corresponding to the link displayed as a card within the post if any
112
126
  card_description: Optional[str] # description of the webpage corresponding to the link displayed as a card within the post if any
113
- card_thumbnail: Optional[str] # image displayed as an illustration of the webpage corresponding to the linkg diplayed as a card within the post if any
114
- media_urls: List[str] # list of URLs to all medias (images, videos, gifs) embedded in the post
115
- media_thumbnails: List[str] # list of URLs to small thumbnail version of all medias (images, videos, gifs) embedded in the post
116
- media_types: List[str] # MIME types (such as image/jpeg, image/gif, video/mp4, etc.) of all medias (images, videos, gifs) embedded in the post
117
- media_alt_texts: List[str] # description texts of all medias (images, videos, gifs) embedded in the post
118
- mentioned_user_dids: List[str] # list of all persistent long-term identifier of the accounts adressed within the post (does not include users to which the post replied)
119
- mentioned_user_handles: List[str] # list of all updatable human-readable username of the accounts adressed within the post (does not include users to which the post replied)
127
+ card_thumbnail: Optional[str] # image displayed as an illustration of the webpage corresponding to the link displayed as a card within the post if any
128
+ media_urls: List[str] # list of URLs to all media (images, videos, gifs) embedded in the post
129
+ media_thumbnails: List[str] # list of URLs to small thumbnail version of all media (images, videos, gifs) embedded in the post
130
+ media_types: List[str] # MIME types (such as image/jpeg, image/gif, video/mp4, etc.) of all media (images, videos, gifs) embedded in the post
131
+ media_alt_texts: List[str] # description texts of all media (images, videos, gifs) embedded in the post
132
+ mentioned_user_dids: List[str] # list of all persistent long-term identifiers of the accounts addressed within the post (does not include users to which the post replied)
133
+ mentioned_user_handles: List[str] # list of all updatable human-readable usernames of the accounts addressed within the post (does not include users to which the post replied)
120
134
  hashtags: List[str] # list of all unique lowercased hashtags found within the post's text
121
135
 
122
136
  # Conversation rules fields
123
137
  replies_rules: Optional[List[str]] # list of specific conversation rules set by the author for the current post (can be one or a combination of: disallow, allow_from_follower, allow_from_following, allow_from_mention, or allow_from_list: followed by a list of user DIDs)
124
138
  replies_rules_created_at: Optional[str] # datetime (potentially timezoned) of when the user set the replies_rules
125
- replies_rules_timestamp_utc: Optional[int] # Unix UTC timestamp of when the userset the replies_rules
139
+ replies_rules_timestamp_utc: Optional[int] # Unix UTC timestamp of when the user set the replies_rules
126
140
  hidden_replies_uris: Optional[List[str]] # list of ATProto's internal URIs to posts that replied to the post but were intentionally marked as hidden by the current post's author
127
141
  # quotes_rule: Optional[str] # not available from posts payloads, cf https://github.com/bluesky-social/atproto/issues/3712
128
142
  # quotes_rules_created_at: Optional[str] # not available from posts payloads, cf https://github.com/bluesky-social/atproto/issues/3712
@@ -131,5 +145,5 @@ class BlueskyPost(TypedDict):
131
145
 
132
146
  # Extra fields linked to the data collection and processing
133
147
  collection_time: Optional[str] # datetime (potentially timezoned) of when the data was normalized
134
- collected_via: Optional[List[str]] # extra field added by the normalization process to express how the data collection was ran, will be "quote" or "thread" when a post was grabbed as a referenced post within a really collected post using the "extract_referenced_posts" option of "normalize_post"
135
- match_query: Optional[bool] # extra field added by the normalization process to express whether the post was an intentionnally collected one or only came as a referenced post within a really collected post using the "extract_referenced_posts" option of "normalize_post"
148
+ collected_via: Optional[List[str]] # extra field added by the normalization process to express how the data collection was run, will be "quote" or "thread" when a post was grabbed as a referenced post within the originally collected post using the "extract_referenced_posts" option of "normalize_post"
149
+ match_query: Optional[bool] # extra field added by the normalization process to express whether the post was an intentionally collected one or only came as a referenced post within the originally collected post using the "extract_referenced_posts" option of "normalize_post"
@@ -55,7 +55,9 @@ def validate_post_payload(data):
55
55
  return True, None
56
56
 
57
57
 
58
- re_embed_types = re.compile(r"\.(record|recordWithMedia|images|video|external)$")
58
+ re_embed_types = re.compile(
59
+ r"\.(record|recordWithMedia|images|video|external)(?:#.*)?$"
60
+ )
59
61
 
60
62
 
61
63
  def valid_embed_type(embed_type):
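Note on the `re_embed_types` change above: the added `(?:#.*)?` group lets the pattern accept embed types that carry a `#...` suffix after the embed kind. The following is a small standalone sketch of the effect; the pattern is copied from the hunk, while `embed_kind` and the sample type strings are illustrative assumptions rather than part of the package.

```python
import re

# Pattern copied from the updated line in the hunk above.
re_embed_types = re.compile(
    r"\.(record|recordWithMedia|images|video|external)(?:#.*)?$"
)


def embed_kind(embed_type):
    """Hypothetical helper: return the matched embed kind, or None."""
    match = re_embed_types.search(embed_type)
    return match.group(1) if match else None


# Sample strings chosen for illustration only.
print(embed_kind("app.bsky.embed.images"))
print(embed_kind("app.bsky.embed.images#view"))
print(embed_kind("app.bsky.embed.unknown"))
```

With the previous pattern, which ended in a bare `$`, the second call would not have matched because of the trailing `#view`.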
@@ -66,29 +68,39 @@ def format_profile_url(user_handle_or_did):
66
68
  return f"https://bsky.app/profile/{user_handle_or_did}"
67
69
 
68
70
 
69
- def format_post_url(user_handle_or_did, post_did):
70
- return f"https://bsky.app/profile/{user_handle_or_did}/post/{post_did}"
71
+ def format_post_url(user_handle_or_did, post_did, post_splitter="/post/"):
72
+ return f"https://bsky.app/profile/{user_handle_or_did}{post_splitter}{post_did}"
71
73
 
72
74
 
73
75
  def parse_post_url(url, source):
74
76
  """Returns a tuple of (author_handle/did, post_did) from an https://bsky.app post URL"""
75
77
 
76
- if not url.startswith("https://bsky.app/profile/") and "/post/" not in url:
77
- raise BlueskyPayloadError(source, f"{url} is not a usual Bluesky post url")
78
- return url[25:].split("/post/")
78
+ known_splits = ["/post/", "/lists/"]
79
+
80
+ if url.startswith("https://bsky.app/profile/"):
81
+ for split in known_splits:
82
+ if split in url[25:]:
83
+ return url[25:].split(split)
84
+
85
+ raise BlueskyPayloadError(source, f"{url} is not a usual Bluesky post url")
79
86
 
80
87
 
81
88
  def parse_post_uri(uri, source=None):
82
89
  """Returns a tuple of (author_did, post_did) from an at:// post URI"""
83
90
 
84
- if uri.startswith("at://") and "/app.bsky.graph.starterpack/" in uri:
85
- return uri[5:].split("/app.bsky.graph.starterpack/")
91
+ known_splits = [
92
+ "/app.bsky.feed.post/",
93
+ "/app.bsky.graph.starterpack/",
94
+ "/app.bsky.feed.generator/",
95
+ "/app.bsky.graph.list/",
96
+ ]
86
97
 
87
- if not uri.startswith("at://") and "/app.bsky.feed.post/" not in uri:
88
- raise BlueskyPayloadError(
89
- source or uri, f"{uri} is not a usual Bluesky post uri"
90
- )
91
- return uri[5:].split("/app.bsky.feed.post/")
98
+ if uri.startswith("at://"):
99
+ for split in known_splits:
100
+ if split in uri:
101
+ return uri[5:].split(split)
102
+
103
+ raise BlueskyPayloadError(source or uri, f"{uri} is not a usual Bluesky post uri")
92
104
 
93
105
 
94
106
  def format_starterpack_url(user_handle_or_did, record_did):
@@ -105,6 +117,13 @@ def format_media_url(user_did, media_cid, mime_type, source):
105
117
  media_thumb = (
106
118
  f"https://video.bsky.app/watch/{user_did}/{media_cid}/thumbnail.jpg"
107
119
  )
120
+ elif mime_type == "application/octet-stream":
121
+ media_url = (
122
+ f"https://cdn.bsky.app/img/feed_fullsize/plain/{user_did}/{media_cid}@jpeg"
123
+ )
124
+ media_thumb = (
125
+ f"https://cdn.bsky.app/img/feed_thumbnail/plain/{user_did}/{media_cid}@jpeg"
126
+ )
108
127
  else:
109
- raise BlueskyPayloadError(source, f"{mime_type} is an usual media mimeType")
128
+ raise BlueskyPayloadError(source, f"{mime_type} is an unusual media mimeType")
110
129
  return media_url, media_thumb
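Note on the `parse_post_uri` rewrite above: the function now tries a list of known collection segments instead of special-casing starter packs. The following is a minimal standalone sketch of the same splitting logic; `split_at_uri` is a hypothetical name, the sample URI is made up, and `ValueError` stands in for the package's `BlueskyPayloadError` so the block has no dependencies.

```python
# Collection segments listed in the updated hunk.
KNOWN_SPLITS = [
    "/app.bsky.feed.post/",
    "/app.bsky.graph.starterpack/",
    "/app.bsky.feed.generator/",
    "/app.bsky.graph.list/",
]


def split_at_uri(uri):
    """Split an at:// URI into [author_did, record_key] (illustrative only)."""
    if uri.startswith("at://"):
        for split in KNOWN_SPLITS:
            if split in uri:
                # Drop the 5-character "at://" prefix before splitting.
                return uri[5:].split(split)
    # ValueError is an assumption standing in for BlueskyPayloadError.
    raise ValueError(f"{uri} is not a usual Bluesky post uri")


print(split_at_uri("at://did:plc:abc123/app.bsky.feed.post/3kxyz"))
```

A side effect of the new structure is that a URI lacking the `at://` prefix is now always rejected, whereas the previous `not ... and ... not in` test only raised when both conditions failed.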
@@ -4,6 +4,8 @@
4
4
  #
5
5
  # Miscellaneous utility functions.
6
6
  #
7
+ from typing import Tuple
8
+
7
9
  from pytz import timezone
8
10
  from dateutil.parser import parse as parse_date
9
11
  from ural import normalize_url, get_normalized_hostname
@@ -25,6 +27,21 @@ UTC_TIMEZONE = timezone("UTC")
25
27
 
26
28
  custom_normalize_url = partial(normalize_url, **CANONICAL_URL_KWARGS)
27
29
 
30
+
31
+ def safe_normalize_url(url):
32
+ # We avoid normalizing bluesky urls containing specific uri parts because
33
+ # Bluesky servers don't handle quoting correctly...
34
+ # See https://github.com/medialab/twitwi/issues/72
35
+ if "/did:plc:" in url:
36
+ return url
37
+
38
+ try:
39
+ return custom_normalize_url(url)
40
+ except Exception:
41
+ # In case of error, return the original URL. Possibly not a valid URL, e.g. url containing double slashes
42
+ return url
43
+
44
+
28
45
  custom_get_normalized_hostname = partial(
29
46
  get_normalized_hostname, **CANONICAL_HOSTNAME_KWARGS
30
47
  )
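Note on `safe_normalize_url` above: it wraps the normalizer with two guards, skipping URLs that contain `/did:plc:` and falling back to the original string when normalization raises. The following is a minimal sketch of that wrapper pattern without the `ural` dependency; `make_safe_normalizer` and `toy_normalize` are hypothetical stand-ins and do not reproduce the behaviour of `ural.normalize_url`.

```python
def make_safe_normalizer(normalize):
    """Wrap `normalize` with the two guards shown in the diff (a sketch)."""

    def safe_normalize(url):
        # URLs embedding a DID are returned untouched.
        if "/did:plc:" in url:
            return url
        try:
            return normalize(url)
        except Exception:
            # On any failure, fall back to the original URL.
            return url

    return safe_normalize


def toy_normalize(url):
    """Toy normalizer (assumption): strips the scheme and trailing slash,
    and fails on double slashes in the path."""
    rest = url.split("://", 1)[-1]
    if "//" in rest:
        raise ValueError("double slash in path")
    return rest.rstrip("/")


safe = make_safe_normalizer(toy_normalize)
print(safe("https://example.com/page/"))
print(safe("https://bsky.app/profile/did:plc:abc/post/1"))
print(safe("https://example.com//bad"))
```

Because the fallback can return a string that is not a valid URL, the `normalize_post` hunks pair each call with an `is_url` check before adding the result to `links`.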
@@ -34,7 +51,9 @@ def get_collection_time():
34
51
  return datetime.now().strftime(FORMATTED_FULL_DATETIME_FORMAT)
35
52
 
36
53
 
37
- def get_dates(date_str, locale=None, source="v1"):
54
+ def get_dates(
55
+ date_str: str, locale=None, source: str = "v1", millisecond_timestamp: bool = False
56
+ ) -> Tuple[int, str]:
38
57
  if source not in ["v1", "v2", "bluesky"]:
39
58
  raise Exception("source should be one of v1, v2 or bluesky")
40
59
 
@@ -56,8 +75,14 @@ def get_dates(date_str, locale=None, source="v1"):
56
75
  utc_datetime = UTC_TIMEZONE.localize(parsed_datetime)
57
76
  locale_datetime = utc_datetime.astimezone(locale)
58
77
 
78
+ timestamp = int(utc_datetime.timestamp())
79
+
80
+ if millisecond_timestamp:
81
+ timestamp *= 1000
82
+ timestamp += utc_datetime.microsecond / 1000
83
+
59
84
  return (
60
- int(utc_datetime.timestamp()),
85
+ int(timestamp),
61
86
  datetime.strftime(
62
87
  locale_datetime,
63
88
  FORMATTED_FULL_DATETIME_FORMAT
@@ -106,6 +131,7 @@ def get_dates_from_id(tweet_id, locale=None):
106
131
  locale = UTC_TIMEZONE
107
132
 
108
133
  timestamp = get_timestamp_from_id(tweet_id)
134
+ assert timestamp is not None
109
135
 
110
136
  locale_datetime = datetime.fromtimestamp(timestamp, locale)
111
137
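Note on the `millisecond_timestamp` option added to `get_dates` above: the timestamp is first truncated to whole seconds, then scaled to milliseconds and completed with the datetime's microsecond component. The following is a standalone sketch of that arithmetic using only the standard library; `to_timestamp` is a hypothetical name, and `datetime.timezone.utc` is used here in place of the `pytz` timezone the package relies on.

```python
from datetime import datetime, timezone


def to_timestamp(utc_datetime, millisecond_timestamp=False):
    """Mirror of the timestamp arithmetic in the hunk (illustrative only)."""
    timestamp = int(utc_datetime.timestamp())

    if millisecond_timestamp:
        timestamp *= 1000
        # Add the sub-second part, expressed in milliseconds.
        timestamp += utc_datetime.microsecond / 1000

    # The final int() truncates any fractional millisecond.
    return int(timestamp)


moment = datetime(2024, 1, 1, 0, 0, 0, 123456, tzinfo=timezone.utc)
print(to_timestamp(moment))
print(to_timestamp(moment, millisecond_timestamp=True))
```

Truncating to whole seconds before multiplying keeps the default (second-resolution) result identical to the previous behaviour.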
 
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: twitwi
3
- Version: 0.21.1
3
+ Version: 0.22.0
4
4
  Summary: A collection of Twitter-related helper functions for python.
5
5
  Home-page: http://github.com/medialab/twitwi
6
6
  Author: Béatrice Mazoyer, Guillaume Plique, Benjamin Ooghe-Tabanou
@@ -56,18 +56,22 @@ pip install twitwi
56
56
  *Normalization functions*
57
57
 
58
58
  * [normalize_profile](#normalize_profile)
59
+ * [normalize_partial_profile](#normalize_partial_profile)
59
60
  * [normalize_post](#normalize_post)
60
61
 
61
62
  *Formatting functions*
62
63
 
63
64
  * [transform_profile_into_csv_dict](#transform_profile_into_csv_dict)
65
+ * [transform_partial_profile_into_csv_dict](#transform_partial_profile_into_csv_dict)
64
66
  * [transform_post_into_csv_dict](#transform_post_into_csv_dict)
65
67
  * [format_profile_as_csv_row](#format_profile_as_csv_row)
68
+ * [format_partial_profile_as_csv_row](#format_partial_profile_as_csv_row)
66
69
  * [format_post_as_csv_row](#format_post_as_csv_row)
67
70
 
68
71
  *Useful constants (under `twitwi.bluesky.constants`)*
69
72
 
70
73
  * [PROFILE_FIELDS](#profile_fields)
74
+ * [PARTIAL_PROFILE_FIELDS](#partial_profile_fields)
71
75
  * [POST_FIELDS](#post_fields)
72
76
 
73
77
  *Examples*
@@ -95,7 +99,7 @@ for post_data in posts_payload_from_API:
95
99
 
96
100
  # Then, saving normalized posts into a CSV using DictWriter:
97
101
 
98
- from csv import DictWriter
102
+ import csv
99
103
  from twitwi.bluesky.constants import POST_FIELDS
100
104
  from twitwi.bluesky import transform_post_into_csv_dict
101
105
 
@@ -108,7 +112,6 @@ with open("normalized_bluesky_posts.csv", "w") as f:
108
112
 
109
113
  # Or using the basic CSV writer:
110
114
 
111
- from csv import writer
112
115
  from twitwi.bluesky import format_post_as_csv_row
113
116
 
114
117
  with open("normalized_bluesky_posts.csv", "w") as f:
@@ -180,7 +183,18 @@ with open("normalized_bluesky_profiles.csv", "w") as f:
180
183
 
181
184
  ### normalize_profile
182
185
 
183
- Function taking a nested dict describing a user profile from Bluesky's JSON payload and returning a flat "normalized" dict composed of all [PROFILE_FIELDS](#profile_fields) keys.
186
+ Function taking a nested dict describing a user profile from Bluesky's JSON payload (in the format returned by the [`app.bsky.actor.getProfiles` HTTP endpoint](https://docs.bsky.app/docs/api/app-bsky-actor-get-profiles#responses)) and returning a flat "normalized" dict composed of all [PROFILE_FIELDS](#profile_fields) keys. Not to be confused with the [normalize_partial_profile](#normalize_partial_profile) function, which operates on a lighter version of the profile data, retrieved from [follower/follow profile payloads](https://docs.bsky.app/docs/api/app-bsky-graph-get-followers#responses) for example.
187
+
188
+ Will return datetimes as UTC but can take an optional second `locale` argument as a [`pytz`](https://pypi.org/project/pytz/) string timezone.
189
+
190
+ *Arguments*
191
+
192
+ * **data** *(dict)*: user profile data payload coming from Bluesky API.
193
+ * **locale** *(pytz.timezone as str, optional)*: timezone used to convert dates. If not given, will default to UTC.
194
+
195
+ ### normalize_partial_profile
196
+
197
+ Function taking a nested dict describing a user profile from Bluesky's JSON payload (in the format returned by the [`app.bsky.graph.getFollowers` HTTP endpoint](https://docs.bsky.app/docs/api/app-bsky-graph-get-followers#responses)) and returning a flat "normalized" dict composed of all [PARTIAL_PROFILE_FIELDS](#partial_profile_fields) keys. Not to be confused with the [normalize_profile](#normalize_profile) function, which operates on the full version of the profile data, retrieved from the [`app.bsky.actor.getProfiles` HTTP endpoint](https://docs.bsky.app/docs/api/app-bsky-actor-get-profiles#responses) for example.
184
198
 
185
199
  Will return datetimes as UTC but can take an optional second `locale` argument as a [`pytz`](https://pypi.org/project/pytz/) string timezone.
186
200
 
@@ -210,6 +224,12 @@ Function transforming (i.e. mutating, so beware) a given normalized Bluesky prof
210
224
 
211
225
  Will convert list elements of the normalized data into a string with all elements separated by the `|` character, which can be changed using an optional `plural_separator` argument.
212
226
 
227
+ ### transform_partial_profile_into_csv_dict
228
+
229
+ Function transforming (i.e. mutating, so beware) a given normalized Bluesky partial profile into a suitable dict able to be written by a `csv.DictWriter` as a row.
230
+
231
+ Will convert list elements of the normalized data into a string with all elements separated by the `|` character, which can be changed using an optional `plural_separator` argument.
232
+
213
233
  ### transform_post_into_csv_dict
214
234
 
215
235
  Function transforming (i.e. mutating, so beware) a given normalized Bluesky post into a suitable dict able to be written by a `csv.DictWriter` as a row.
@@ -222,6 +242,12 @@ Function formatting the given normalized Bluesky profile as a list able to be wr
222
242
 
223
243
  Will convert list elements of the normalized data into a string with all elements separated by the `|` character, which can be changed using an optional `plural_separator` argument.
224
244
 
245
+ ### format_partial_profile_as_csv_row
246
+
247
+ Function formatting the given normalized Bluesky partial profile as a list able to be written by a `csv.writer` as a row in the order of [PARTIAL_PROFILE_FIELDS](#partial_profile_fields) (which can therefore be used as header row of the CSV).
248
+
249
+ Will convert list elements of the normalized data into a string with all elements separated by the `|` character, which can be changed using an optional `plural_separator` argument.
250
+
225
251
  ### format_post_as_csv_row
226
252
 
227
253
  Function formatting the given normalized Bluesky post as a list able to be written by a `csv.writer` as a row in the order of [POST_FIELDS](#post_fields) (which can therefore be used as header row of the CSV).
@@ -230,7 +256,11 @@ Will convert list elements of the normalized data into a string with all element
230
256
 
231
257
  ### PROFILE_FIELDS
232
258
 
233
- List of a Bluesky user profile's normalized field names. Useful to declare headers with csv writers.
259
+ List of a Bluesky user profile's normalized field names. Useful to declare headers with csv writers. Not to be confused with [PARTIAL_PROFILE_FIELDS](#partial_profile_fields), which corresponds to a lighter version of the profile data, retrieved from [follower/follow profile payloads](https://docs.bsky.app/docs/api/app-bsky-graph-get-followers#responses) for example.
260
+
261
+ ### PARTIAL_PROFILE_FIELDS
262
+
263
+ List of the normalized field names of a Bluesky user partial profile (retrieved from the [`app.bsky.graph.getFollowers` HTTP endpoint](https://docs.bsky.app/docs/api/app-bsky-graph-get-followers#responses) for example). Useful to declare headers with csv writers. Not to be confused with [PROFILE_FIELDS](#profile_fields), which corresponds to the full version of the profile data, retrieved from the [`app.bsky.actor.getProfiles` HTTP endpoint](https://docs.bsky.app/docs/api/app-bsky-actor-get-profiles#responses) for example.
234
264
 
235
265
  ### POST_FIELDS
236
266
 
File without changes