ttd-databricks 0.1.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (34)
  1. ttd_databricks-0.1.1/PKG-INFO +464 -0
  2. ttd_databricks-0.1.1/README.md +449 -0
  3. ttd_databricks-0.1.1/pyproject.toml +60 -0
  4. ttd_databricks-0.1.1/setup.cfg +4 -0
  5. ttd_databricks-0.1.1/tests/test_placeholder.py +16 -0
  6. ttd_databricks-0.1.1/ttd_databricks.egg-info/PKG-INFO +464 -0
  7. ttd_databricks-0.1.1/ttd_databricks.egg-info/SOURCES.txt +32 -0
  8. ttd_databricks-0.1.1/ttd_databricks.egg-info/dependency_links.txt +1 -0
  9. ttd_databricks-0.1.1/ttd_databricks.egg-info/requires.txt +7 -0
  10. ttd_databricks-0.1.1/ttd_databricks.egg-info/top_level.txt +1 -0
  11. ttd_databricks-0.1.1/ttd_databricks_python/__init__.py +1 -0
  12. ttd_databricks-0.1.1/ttd_databricks_python/py.typed +0 -0
  13. ttd_databricks-0.1.1/ttd_databricks_python/ttd_databricks/__init__.py +59 -0
  14. ttd_databricks-0.1.1/ttd_databricks_python/ttd_databricks/batching.py +140 -0
  15. ttd_databricks-0.1.1/ttd_databricks_python/ttd_databricks/constants.py +5 -0
  16. ttd_databricks-0.1.1/ttd_databricks_python/ttd_databricks/contexts.py +128 -0
  17. ttd_databricks-0.1.1/ttd_databricks_python/ttd_databricks/endpoints.py +76 -0
  18. ttd_databricks-0.1.1/ttd_databricks_python/ttd_databricks/exceptions.py +41 -0
  19. ttd_databricks-0.1.1/ttd_databricks_python/ttd_databricks/handlers/__init__.py +0 -0
  20. ttd_databricks-0.1.1/ttd_databricks_python/ttd_databricks/handlers/advertiser.py +77 -0
  21. ttd_databricks-0.1.1/ttd_databricks_python/ttd_databricks/handlers/deletion_optout_advertiser.py +57 -0
  22. ttd_databricks-0.1.1/ttd_databricks_python/ttd_databricks/handlers/deletion_optout_merchant.py +56 -0
  23. ttd_databricks-0.1.1/ttd_databricks_python/ttd_databricks/handlers/deletion_optout_thirdparty.py +57 -0
  24. ttd_databricks-0.1.1/ttd_databricks_python/ttd_databricks/handlers/offline_conversion.py +115 -0
  25. ttd_databricks-0.1.1/ttd_databricks_python/ttd_databricks/handlers/third_party.py +77 -0
  26. ttd_databricks-0.1.1/ttd_databricks_python/ttd_databricks/schemas/__init__.py +130 -0
  27. ttd_databricks-0.1.1/ttd_databricks_python/ttd_databricks/schemas/advertiser.py +73 -0
  28. ttd_databricks-0.1.1/ttd_databricks_python/ttd_databricks/schemas/deletion_optout_advertiser.py +32 -0
  29. ttd_databricks-0.1.1/ttd_databricks_python/ttd_databricks/schemas/deletion_optout_merchant.py +32 -0
  30. ttd_databricks-0.1.1/ttd_databricks_python/ttd_databricks/schemas/deletion_optout_thirdparty.py +32 -0
  31. ttd_databricks-0.1.1/ttd_databricks_python/ttd_databricks/schemas/offline_conversion.py +155 -0
  32. ttd_databricks-0.1.1/ttd_databricks_python/ttd_databricks/schemas/third_party.py +63 -0
  33. ttd_databricks-0.1.1/ttd_databricks_python/ttd_databricks/ttd_client.py +476 -0
  34. ttd_databricks-0.1.1/ttd_databricks_python/ttd_databricks/utils.py +14 -0
@@ -0,0 +1,464 @@
Metadata-Version: 2.4
Name: ttd-databricks
Version: 0.1.1
Summary: Client implementation and helper functions for integrating with the TTD Databricks services.
Author: The Trade Desk
License: MIT
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: ttd-data>=0.0.1
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Requires-Dist: ruff>=0.9.0; extra == "dev"

# ttd-databricks

> **Alpha:** This SDK is in early development. APIs may change without notice between releases.

Python SDK for integrating Databricks with The Trade Desk Data API. Supports First Party Data, Third Party Data, Offline Conversion Data, and Deletion/Opt-Out workflows.

- **Ad hoc mode** — push a DataFrame directly and receive per-row results inline
- **Batch mode** — run incremental pipelines backed by Delta tables with processing checkpoints
- Built-in schema validation and per-row error tracking

## Table of Contents

- [SDK Installation](#sdk-installation)
- [Initial Setup](#initial-setup)
- [SDK Example Usage](#sdk-example-usage)
- [Authentication](#authentication)
- [Available Resources and Operations](#available-resources-and-operations)
- [Error Handling](#error-handling)
- [Server Selection](#server-selection)
- [Custom HTTP Client](#custom-http-client)

---

## SDK Installation

```bash
pip install ttd-databricks
```

Requires Python 3.10 or higher. Intended to run inside a Databricks environment where PySpark is available via the runtime.

---

## Initial Setup

### 1. Create a Client

The client is the entry point for all SDK operations. Create it once and reuse it across calls.

```python
from ttd_databricks_python.ttd_databricks import TtdDatabricksClient

# SparkSession is auto-detected from the Databricks runtime if not provided.
client = TtdDatabricksClient.from_params(api_token="<ttd-auth-token>")
```

See [Authentication](#authentication) for alternative client creation options.

---

### 2. Create a Context

A context specifies which TTD endpoint to target and carries the identifiers (advertiser ID, data provider ID, etc.) required by that endpoint. Create one context per endpoint and reuse it across calls.

```python
from ttd_databricks_python.ttd_databricks import AdvertiserContext

# Each endpoint has its own context class. See Available Resources and Operations
# for the full list.
context = AdvertiserContext(
    advertiser_id="<advertiser-id>",
    data_provider_id="<data-provider-id>",  # optional
)
```

---

### 3. Set Up Delta Tables

If you plan to use batch processing, use the following helpers to set up the input, output, and metadata Delta tables. These can be created once and reused for all future executions.

```python
from ttd_databricks_python.ttd_databricks import TTDEndpoint

# Input table: schema matches the required columns for the chosen endpoint.
# Created as a managed table in the default metastore location if no location is provided.
# Default table name: ttd_{endpoint}_input (e.g. "ttd_advertiser_input").
input_table = client.setup_input_table(endpoint=TTDEndpoint.ADVERTISER)

# Output table: mirrors the input schema plus status columns
# (success, error_code, error_message, processed_timestamp).
# Default table name: ttd_{endpoint}_output (e.g. "ttd_advertiser_output").
output_table = client.setup_output_table(endpoint=TTDEndpoint.ADVERTISER)

# Metadata table: tracks run history (last_processed_date, run_timestamp, records_processed).
# Default table name: "ttd_metadata".
metadata_table = client.setup_metadata_table()
```

You can also supply custom table names and storage locations:

```python
input_table = client.setup_input_table(
    endpoint=TTDEndpoint.ADVERTISER,
    table_name="my_catalog.my_schema.advertiser_input",
    location="abfss://container@storage.dfs.core.windows.net/advertiser_input",
)
```

---

## SDK Example Usage

The SDK supports two processing modes.

### Ad Hoc Mode (`push_data`)

Use this to process a DataFrame directly and receive results inline.

```python
from ttd_databricks_python.ttd_databricks import (
    TtdDatabricksClient,
    AdvertiserContext,
)

# Create the client using your TTD auth token.
# SparkSession is auto-detected from the Databricks runtime if not provided.
client = TtdDatabricksClient.from_params(api_token="<ttd-auth-token>")

# Create a context for the target endpoint.
# The context identifies which advertiser/provider to push data to
# and is passed to every API call.
context = AdvertiserContext(
    advertiser_id="<advertiser-id>",
    data_provider_id="<data-provider-id>",  # optional
)

# Push the DataFrame to the TTD Data API in batches.
# Returns the input DataFrame enriched with status columns.
result_df = client.push_data(
    df=input_df,
    context=context,
    batch_size=1600,  # number of rows per API request
)
# result_df contains all input columns plus:
# success, error_code, error_message, processed_timestamp
```
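For intuition, the chunking arithmetic behind `batch_size` can be sketched in plain Python. This is an illustration only, not the SDK's internal batching code:

```python
def chunk(rows, batch_size):
    """Yield successive batches of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

# 4000 rows at batch_size=1600 -> 3 API requests of 1600, 1600, and 800 rows.
rows = list(range(4000))
batches = list(chunk(rows, 1600))
print([len(b) for b in batches])  # [1600, 1600, 800]
```

A larger `batch_size` means fewer HTTP round trips per partition, at the cost of larger request payloads.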

### Batch Processing Mode (`batch_process`)

Use this for incremental, distributed processing backed by Delta tables. Supports incremental filtering to process only records added since the last run.

```python
# Tables set up during Initial Setup (see above).
# input_table, output_table, metadata_table already created.

# Run the batch pipeline. With process_new_records_only=True, only rows
# added since the last successful run (tracked via metadata_table) are sent.
client.batch_process(
    context=context,
    input_table=input_table,
    output_table=output_table,
    metadata_table=metadata_table,
    process_new_records_only=True,  # incremental; set False to reprocess all rows
    batch_size=1600,  # rows per API request
    parallelism=16,  # number of concurrent Spark tasks
)
```

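Conceptually, incremental selection compares each record's timestamp against the `last_processed_date` recorded in the metadata table. A plain-Python sketch of that comparison (the `created_at` column name and row layout here are illustrative, not the SDK's actual tracking schema):

```python
from datetime import datetime

# Value as it would be read from the metadata table after the previous run.
last_processed_date = datetime(2025, 3, 1)

rows = [
    {"id_value": "a", "created_at": datetime(2025, 2, 28)},  # already processed
    {"id_value": "b", "created_at": datetime(2025, 3, 2)},   # new since last run
]

# Keep only rows stamped after the last successful run.
new_rows = [r for r in rows if r["created_at"] > last_processed_date]
print([r["id_value"] for r in new_rows])  # ['b']
```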
To reprocess from a specific date (e.g. for a backfill), use `last_processed_date_override` to override the last processed date stored in the metadata table:

```python
from datetime import datetime

client.batch_process(
    context=context,
    input_table=input_table,
    output_table=output_table,
    metadata_table=metadata_table,
    process_new_records_only=True,
    last_processed_date_override=datetime(2025, 1, 1),  # reprocess from this date
)
```

---

## Authentication

All API calls require a TTD auth token passed at client creation time.

### Factory Method (recommended for notebooks)

```python
# spark is the SparkSession available in the Databricks notebook runtime.
client = TtdDatabricksClient.from_params(
    api_token="<ttd-auth-token>",  # your TTD platform API token
    spark=spark,  # optional; auto-detected from Databricks context
    # server_url="https://..."  # optional; see Server Selection
)
```

### Dependency Injection (recommended for testing)

Provide your own [`DataClient`](https://github.com/thetradedesk/ttd-data-python/blob/main/src/ttd_data/sdk.py) instance to control the underlying HTTP transport directly. Use this when you need to configure options not exposed by `from_params()`, or to inject a mock in tests.

```python
from ttd_data import DataClient
from ttd_databricks_python.ttd_databricks import TtdDatabricksClient

# Configure DataClient with custom HTTP settings.
data_client = DataClient(
    server_url="https://custom-server.example.com",  # override default server URL
    timeout_ms=10000,  # request timeout in milliseconds
)

client = TtdDatabricksClient(
    data_api_client=data_client,
    api_token="<ttd-auth-token>",
    spark=spark,  # optional; the SparkSession available in the Databricks notebook runtime
)
```

---

## Available Resources and Operations

Each endpoint is represented by a context dataclass that configures the API call.

### First-Party Data — `/data/advertiser`

```python
from ttd_databricks_python.ttd_databricks import AdvertiserContext

# Targets the /data/advertiser endpoint.
# advertiser_id is required; data_provider_id scopes data to a specific provider.
context = AdvertiserContext(
    advertiser_id="<advertiser-id>",
    data_provider_id="<data-provider-id>",  # optional
)
```

See [schema](#inspecting-schemas) for required and optional columns — `TTDEndpoint.ADVERTISER`

---

### Third-Party Data — `/data/thirdparty`

```python
from ttd_databricks_python.ttd_databricks import ThirdPartyContext

# Targets the /data/thirdparty endpoint.
# Set is_user_id_already_hashed=True if id_value is pre-hashed (e.g. SHA-256).
context = ThirdPartyContext(
    data_provider_id="<data-provider-id>",
    is_user_id_already_hashed=False,  # optional; default False
)
```

See [schema](#inspecting-schemas) for required and optional columns — `TTDEndpoint.THIRD_PARTY`

---

### Offline Conversion — `/providerapi/offlineconversion`

```python
from ttd_databricks_python.ttd_databricks import OfflineConversionContext

# Targets the /providerapi/offlineconversion endpoint.
context = OfflineConversionContext(
    data_provider_id="<data-provider-id>",
)
```

See [schema](#inspecting-schemas) for required and optional columns — `TTDEndpoint.OFFLINE_CONVERSION`

---

### Deletion / Opt-Out — Advertiser — `/data/deletion-optout/advertiser`

Deletion/Opt-Out endpoint scoped to a specific advertiser.

```python
from ttd_databricks_python.ttd_databricks import DeletionOptOutAdvertiserContext, PartnerDsrRequestType

# request_type controls the action:
#   PartnerDsrRequestType.DELETION — remove user data
#   PartnerDsrRequestType.OPT_OUT — suppress future targeting
context = DeletionOptOutAdvertiserContext(
    advertiser_id="<advertiser-id>",
    request_type=PartnerDsrRequestType.OPT_OUT,  # or DELETION
    data_provider_id="<data-provider-id>",  # optional
)
```

See [schema](#inspecting-schemas) for required and optional columns — `TTDEndpoint.DELETION_OPTOUT_ADVERTISER`

---

### Deletion / Opt-Out — Third Party — `/data/deletion-optout/thirdparty`

Deletion/Opt-Out endpoint scoped to a third-party data provider.

```python
from ttd_databricks_python.ttd_databricks import DeletionOptOutThirdPartyContext, PartnerDsrRequestType

context = DeletionOptOutThirdPartyContext(
    data_provider_id="<data-provider-id>",
    request_type=PartnerDsrRequestType.OPT_OUT,  # or DELETION
    brand_id="<brand-id>",  # optional
)
```

See [schema](#inspecting-schemas) for required and optional columns — `TTDEndpoint.DELETION_OPTOUT_THIRDPARTY`

---

### Deletion / Opt-Out — Merchant — `/data/deletion-optout/merchant`

Deletion/Opt-Out endpoint scoped to a merchant.

```python
from ttd_databricks_python.ttd_databricks import DeletionOptOutMerchantContext, PartnerDsrRequestType

context = DeletionOptOutMerchantContext(
    merchant_id="<merchant-id>",
    request_type=PartnerDsrRequestType.OPT_OUT,  # or DELETION
)
```

See [schema](#inspecting-schemas) for required and optional columns — `TTDEndpoint.DELETION_OPTOUT_MERCHANT`

---

### Inspecting Schemas

Retrieve the full input schema for an endpoint:

```python
from ttd_databricks_python.ttd_databricks import TTDEndpoint
from ttd_databricks_python.ttd_databricks.schemas import get_ttd_input_schema

schema = get_ttd_input_schema(TTDEndpoint.ADVERTISER)
schema.printTreeString()
```

Get just the required column names (useful for DataFrame preparation):

```python
from ttd_databricks_python.ttd_databricks.schemas import get_required_column_names

required_cols = get_required_column_names(TTDEndpoint.ADVERTISER)
# e.g. ["id_type", "id_value", "segment_name"]
```

Pre-validate a DataFrame before calling `push_data` to catch schema issues early:

```python
from ttd_databricks_python.ttd_databricks.schemas import validate_ttd_schema

# Raises TTDSchemaValidationError if any required columns are missing.
validate_ttd_schema(df=input_df, endpoint=TTDEndpoint.ADVERTISER)
```

---

## Error Handling

All SDK exceptions inherit from `TTDError`.

```python
from ttd_databricks_python.ttd_databricks.exceptions import (
    TTDError,
    TTDApiError,
    TTDConfigurationError,
    TTDSchemaValidationError,
)

try:
    result_df = client.push_data(df=input_df, context=context)
except TTDSchemaValidationError as e:
    print(f"Missing columns: {e.missing_columns}")
except TTDApiError as e:
    print(f"API error on batch {e.batch_index}: {e.status_code} — {e.response_text}")
except TTDConfigurationError as e:
    print(f"Configuration error: {e}")
```

| Exception | Cause |
|---|---|
| `TTDSchemaValidationError` | DataFrame is missing required columns for the endpoint |
| `TTDApiError` | HTTP error or no response from the TTD Data API |
| `TTDConfigurationError` | SparkSession not found or PySpark not installed |

For `push_data`, row-level errors are also captured inline in the result DataFrame via the `success`, `error_code`, and `error_message` columns — so processing is not interrupted by individual row failures.
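Failed rows can therefore be pulled out of the results by filtering on `success`. A plain-Python illustration over dict-shaped rows (in an actual pipeline you would filter the returned Spark DataFrame on its `success` column instead; the sample values here are made up):

```python
# Rows shaped like the status columns push_data appends to its result.
result_rows = [
    {"id_value": "a", "success": True,  "error_code": None,   "error_message": None},
    {"id_value": "b", "success": False, "error_code": "E400", "error_message": "bad id"},
]

# Collect only the rows the API rejected, for logging or retry.
failed = [r for r in result_rows if not r["success"]]
for r in failed:
    print(f'{r["id_value"]}: {r["error_code"]} - {r["error_message"]}')
```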

---

## Server Selection

Each endpoint has its own default server URL, sourced from the `ttd-data` SDK:

| Endpoint | Path | Default Server |
|---|---|---|
| First-Party Data | `/data/advertiser` | `https://usw-data.adsrvr.org` |
| Third-Party Data | `/data/thirdparty` | `https://bulk-data.adsrvr.org` |
| Offline Conversion | `/providerapi/offlineconversion` | `https://offlineattrib.adsrvr.org` |
| Deletion / Opt-Out — Advertiser | `/data/deletion-optout/advertiser` | `https://usw-data.adsrvr.org` |
| Deletion / Opt-Out — Third Party | `/data/deletion-optout/thirdparty` | `https://usw-data.adsrvr.org` |
| Deletion / Opt-Out — Merchant | `/data/deletion-optout/merchant` | `https://usw-data.adsrvr.org` |

These can be overridden globally at the client level, or per-request via the context.

### Global Override

Applies to all endpoints on the client:

```python
client = TtdDatabricksClient.from_params(
    api_token="<ttd-auth-token>",
    server_url="https://custom-server.example.com",
)
```

### Per-Request Override

Applies only to calls made with that context, leaving the client default unchanged for other endpoints:

```python
context = AdvertiserContext(
    advertiser_id="<advertiser-id>",
    base_url_override="https://custom-server.example.com",
)
```

---

## Custom HTTP Client

The underlying HTTP client is provided by the `ttd-data` SDK via [`DataClient`](https://github.com/thetradedesk/ttd-data-python/blob/main/src/ttd_data/sdk.py). You can inject a custom instance to configure the server URL or connection behaviour.

```python
from ttd_data import DataClient
from ttd_data.utils.retries import BackoffStrategy, RetryConfig
from ttd_databricks_python.ttd_databricks import TtdDatabricksClient

data_client = DataClient(
    server_url="https://custom-server.example.com",  # override default server URL
    timeout_ms=10000,  # request timeout in milliseconds
    retry_config=RetryConfig("backoff", BackoffStrategy(1000, 60000, 1.5, 3600000), True),  # custom retry config
)

client = TtdDatabricksClient(
    data_api_client=data_client,
    api_token="<ttd-auth-token>",
)
```

In batch processing mode, a `DataClient` singleton is maintained per Spark worker process to enable HTTP connection reuse across batches, reducing overhead during distributed execution.
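The per-worker reuse pattern can be sketched as a module-level cache: the first call on a process constructs the client, and later calls return the same instance so pooled HTTP connections survive across batches. This is an illustrative sketch only (a plain dict stands in for a real `DataClient`), not the SDK's internal code:

```python
# Module-level cache; on Spark, each worker process gets its own copy.
_client_cache = {}

def get_worker_client(api_token: str):
    """Return the process-local client, constructing it on first use."""
    client = _client_cache.get("client")
    if client is None:
        client = {"token": api_token}  # stand-in for constructing a real DataClient
        _client_cache["client"] = client
    return client

first = get_worker_client("<ttd-auth-token>")
second = get_worker_client("<ttd-auth-token>")
print(first is second)  # True: the same instance is reused
```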