fluent-plugin-openlineage 0.1.2 → 0.1.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (120)
  1. checksums.yaml +4 -4
  2. data/.github/dependabot.yml +6 -0
  3. data/.github/workflows/linux.yml +30 -0
  4. data/.gitignore +16 -0
  5. data/.idea/.gitignore +8 -0
  6. data/.idea/fluentd.iml +204 -0
  7. data/.idea/misc.xml +4 -0
  8. data/.idea/modules/benchmark-memory-0.2.iml +12 -0
  9. data/.idea/modules/bigdecimal-3.1.iml +11 -0
  10. data/.idea/modules/certstore_c-0.1.iml +15 -0
  11. data/.idea/modules/concurrent-ruby-1.3.iml +18 -0
  12. data/.idea/modules/concurrent-ruby-1.31.iml +15 -0
  13. data/.idea/modules/connection_pool-2.4.iml +11 -0
  14. data/.idea/modules/cool.io-1.8.iml +16 -0
  15. data/.idea/modules/drb-2.2.iml +14 -0
  16. data/.idea/modules/drb-2.21.iml +11 -0
  17. data/.idea/modules/ffi-1.17.iml +20 -0
  18. data/.idea/modules/ffi-win32-extensions-1.0.iml +19 -0
  19. data/.idea/modules/fluentd-1.17.iml +43 -0
  20. data/.idea/modules/http_parser.rb-0.8.iml +17 -0
  21. data/.idea/modules/json-2.7.iml +14 -0
  22. data/.idea/modules/json-2.71.iml +11 -0
  23. data/.idea/modules/msgpack-1.7.iml +15 -0
  24. data/.idea/modules/mutex_m-0.2.iml +15 -0
  25. data/.idea/modules/new_gem.iml +15 -0
  26. data/.idea/modules/power_assert-2.0.iml +19 -0
  27. data/.idea/modules/rake-13.2.iml +18 -0
  28. data/.idea/modules/rake-13.21.iml +15 -0
  29. data/.idea/modules/rake-compiler-1.2.iml +13 -0
  30. data/.idea/modules/rusty_json_schema-0.15.iml +15 -0
  31. data/.idea/modules/serverengine-2.3.iml +17 -0
  32. data/.idea/modules/sigdump-0.2.iml +16 -0
  33. data/.idea/modules/specifications.iml +14 -0
  34. data/.idea/modules/specifications1.iml +11 -0
  35. data/.idea/modules/strptime-0.2.iml +16 -0
  36. data/.idea/modules/thermite-0.13.iml +17 -0
  37. data/.idea/modules/webrick-1.8.iml +18 -0
  38. data/.idea/modules/win32-event-0.6.iml +21 -0
  39. data/.idea/modules/win32-ipc-0.7.iml +20 -0
  40. data/.idea/modules/yajl-ruby-1.4.iml +779 -0
  41. data/.idea/modules.xml +41 -0
  42. data/.rspec +2 -0
  43. data/ChangeLog +3 -0
  44. data/Gemfile +3 -0
  45. data/LICENSE +202 -0
  46. data/README.md +250 -0
  47. data/Rakefile +13 -0
  48. data/fluent-plugin-openlineage.gemspec +28 -0
  49. data/lib/fluent/plugin/parser_openlineage.rb +182 -0
  50. data/misc/fluent.conf +101 -0
  51. data/misc/test-complete.json +73 -0
  52. data/misc/test-start.json +73 -0
  53. data/spec/Naming.md +500 -0
  54. data/spec/OpenLineage.json +304 -0
  55. data/spec/Versioning.md +49 -0
  56. data/spec/events/event_full.json +206 -0
  57. data/spec/events/event_invalid_dataset_facet.json +31 -0
  58. data/spec/events/event_invalid_input_dataset_facet.json +29 -0
  59. data/spec/events/event_invalid_job_facet.json +26 -0
  60. data/spec/events/event_invalid_output_dataset_facet.json +29 -0
  61. data/spec/events/event_invalid_run_facet.json +28 -0
  62. data/spec/events/event_no_run_id.json +28 -0
  63. data/spec/events/event_simple.json +29 -0
  64. data/spec/facets/ColumnLineageDatasetFacet.json +96 -0
  65. data/spec/facets/ColumnLineageDatasetFacet.md +106 -0
  66. data/spec/facets/DataQualityAssertionsDatasetFacet.json +49 -0
  67. data/spec/facets/DataQualityMetricsInputDatasetFacet.json +76 -0
  68. data/spec/facets/DatasetVersionDatasetFacet.json +31 -0
  69. data/spec/facets/DatasourceDatasetFacet.json +32 -0
  70. data/spec/facets/DocumentationDatasetFacet.json +31 -0
  71. data/spec/facets/DocumentationJobFacet.json +30 -0
  72. data/spec/facets/ErrorMessageRunFacet.json +41 -0
  73. data/spec/facets/ExternalQueryRunFacet.json +36 -0
  74. data/spec/facets/ExternalQueryRunFacet.md +49 -0
  75. data/spec/facets/ExtractionErrorRunFacet.json +58 -0
  76. data/spec/facets/JobTypeJobFacet.json +41 -0
  77. data/spec/facets/LifecycleStateChangeDatasetFacet.json +46 -0
  78. data/spec/facets/NominalTimeRunFacet.json +38 -0
  79. data/spec/facets/OutputStatisticsOutputDatasetFacet.json +36 -0
  80. data/spec/facets/OwnershipDatasetFacet.json +45 -0
  81. data/spec/facets/OwnershipJobFacet.json +45 -0
  82. data/spec/facets/ParentRunFacet.json +54 -0
  83. data/spec/facets/ProcessingEngineRunFacet.json +41 -0
  84. data/spec/facets/SQLJobFacet.json +30 -0
  85. data/spec/facets/SchemaDatasetFacet.json +59 -0
  86. data/spec/facets/SourceCodeJobFacet.json +34 -0
  87. data/spec/facets/SourceCodeLocationJobFacet.json +60 -0
  88. data/spec/facets/StorageDatasetFacet.json +35 -0
  89. data/spec/facets/SymlinksDatasetFacet.json +47 -0
  90. data/spec/fluent/plugin/test_parser_openlineage.rb +141 -0
  91. data/spec/registry/core/registry.json +31 -0
  92. data/spec/registry/gcp/facets/GcpCommonJobFacet.json +43 -0
  93. data/spec/registry/gcp/registry.json +6 -0
  94. data/spec/spec_helper.rb +8 -0
  95. data/spec/tests/ColumnLineageDatasetFacet/1.json +172 -0
  96. data/spec/tests/DataQualityAssertionsDatasetFacet/1.json +58 -0
  97. data/spec/tests/DataQualityMetricsInputDatasetFacet/1.json +23 -0
  98. data/spec/tests/DatasetVersionDatasetFacet/1.json +7 -0
  99. data/spec/tests/DatasourceDatasetFacet/1.json +7 -0
  100. data/spec/tests/DocumentationDatasetFacet/1.json +7 -0
  101. data/spec/tests/DocumentationJobFacet/1.json +7 -0
  102. data/spec/tests/ErrorMessageRunFacet/1.json +9 -0
  103. data/spec/tests/ExternalQueryRunFacet/1.json +8 -0
  104. data/spec/tests/ExtractionErrorRunFacet/1.json +15 -0
  105. data/spec/tests/JobTypeJobFacet/1.json +9 -0
  106. data/spec/tests/LifecycleStateChangeDatasetFacet/1.json +11 -0
  107. data/spec/tests/NominalTimeRunFacet/1.json +8 -0
  108. data/spec/tests/OutputStatisticsOutputDatasetFacet/1.json +9 -0
  109. data/spec/tests/OwnershipDatasetFacet/1.json +11 -0
  110. data/spec/tests/OwnershipJobFacet/1.json +11 -0
  111. data/spec/tests/ParentRunFacet/1.json +13 -0
  112. data/spec/tests/ProcessingEngineRunFacet/1.json +9 -0
  113. data/spec/tests/SQLJobFacet/1.json +7 -0
  114. data/spec/tests/SchemaDatasetFacet/1.json +92 -0
  115. data/spec/tests/SourceCodeJobFacet/1.json +8 -0
  116. data/spec/tests/SourceCodeLocationJobFacet/1.json +8 -0
  117. data/spec/tests/StorageDatasetFacet/1.json +8 -0
  118. data/spec/tests/SymlinksDatasetFacet/1.json +13 -0
  119. data/spec/tests/example_full_event.json +24 -0
  120. metadata +188 -3
data/spec/Naming.md ADDED
@@ -0,0 +1,500 @@
# Naming

We define a unique naming strategy per resource so that it is applied uniformly, regardless of who is producing the
metadata, and lineage from various sources can be connected.

Both Jobs and Datasets are in their own namespaces. Job namespaces are related to their schedulers. The namespace for a
dataset is the unique name of its datasource.

## Datasets

The namespace and name of a datasource can be combined to form a URI (scheme:[//authority]path):

- Namespace = scheme:[//authority] (the datasource)
- Name = path (the dataset)
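
As a worked example (host, port, and table names here are hypothetical), a Postgres table would be referenced in an
OpenLineage event as a dataset object whose `namespace` is the datasource URI and whose `name` is the path within it:

```json
{
  "namespace": "postgres://db.example.com:5432",
  "name": "analytics.public.orders"
}
```

Combining the two yields the URI `postgres://db.example.com:5432/analytics.public.orders`.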

### Naming conventions for common data stores

This list is not exhaustive; if a data store is missing, please create an issue or open a PR.

### Data Warehouses

- [Athena](#Athena)
- [Azure Cosmos DB](#Azure-Cosmos-DB)
- [Azure Data Explorer](#Azure-Data-Explorer)
- [Azure Synapse](#Azure-Synapse)
- [BigQuery](#BigQuery)
- [Cassandra](#Cassandra)
- [MySQL](#MySQL)
- [Postgres](#Postgres)
- [Redshift](#Redshift)
- [Snowflake](#Snowflake)
- [Trino](#Trino)

### Distributed file systems/blob stores

- [ABFSS (Azure Data Lake Gen2)](#ABFSS "Azure Data Lake Gen2")
- [DBFS (Databricks File System)](#DBFS "Databricks File System")
- [GCS](#GCS)
- [HDFS](#HDFS)
- [Kafka](#Kafka)
- [Local file system](#Local-file-system)
- [S3](#S3)
- [WASBS (Azure Blob Storage)](#WASBS "Azure Blob Storage")

### Data Warehouses

Datasets are called tables. Tables are organized into databases and schemas.

#### Athena:

Datasource hierarchy:

- Host: athena.{region_name}.amazonaws.com

Naming hierarchy:

- Catalog
- Database
- Table

Identifier:

- Namespace: awsathena://athena.{region_name}.amazonaws.com of the service instance.
  - Scheme = awsathena
  - Authority = athena.{region_name}.amazonaws.com
- Unique name: {catalog}.{database}.{table}
  - URI = awsathena://athena.{region_name}.amazonaws.com/{catalog}.{database}.{table}

#### Azure Cosmos DB:

Datasource hierarchy: azurecosmos://{host}/dbs/{database}

- Host: \<XXXXXXXXXXXX>.documents.azure.com
- Database

Naming hierarchy:

- Schema
- Table

Identifier:

- Namespace: azurecosmos://{host}/dbs/{database}
  - Scheme = azurecosmos
  - Authority = {host}
- Unique name: /colls/{table}
  - URI = azurecosmos://{host}/dbs/{database}/colls/{table}

#### Azure Data Explorer:

Datasource hierarchy:

- Host: \<clustername>.\<clusterlocation>
- Database
- Table

Naming hierarchy:

- Database
- Table

Identifier:

- Namespace: azurekusto://{host}.kusto.windows.net/{database}
  - Scheme = azurekusto
- Unique name: {database}/{table}
  - URI = azurekusto://{host}.kusto.windows.net/{database}/{table}

#### Azure Synapse:

Datasource hierarchy:

- Host: \<XXXXXXXXXXXX>.sql.azuresynapse.net
- Port: 1433
- Database: SQLPool1

Naming hierarchy:

- Schema
- Table

Identifier:

- Namespace: sqlserver://{host}:{port};database={database};
  - Scheme = sqlserver
  - Authority = {host}:{port}
- Unique name: {database}.{schema}.{table}
  - URI = sqlserver://{host}:{port}/{database}.{schema}.{table}

#### BigQuery

See:
[Creating and managing projects | Resource Manager Documentation](https://cloud.google.com/resource-manager/docs/creating-managing-projects)
[Introduction to datasets | BigQuery](https://cloud.google.com/bigquery/docs/datasets-intro)
[Introduction to tables | BigQuery](https://cloud.google.com/bigquery/docs/tables-intro)

Datasource hierarchy:

- bigquery

Naming hierarchy:

- Project Name: {project name} => is not unique
- Project number: {project number} => numeric: is unique across Google Cloud
- Project ID: {project id} => readable: is unique across Google Cloud
- dataset: {dataset name} => is unique within a project
- table: {table name} => is unique within a dataset

Identifier:

- Namespace: bigquery
  - Scheme = bigquery
  - Authority = (none)
- Unique name: {project id}.{dataset name}.{table name}
  - URI = bigquery:{project id}.{dataset name}.{table name}
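
Because there is no authority component, the URI has no `//` separator. For example (project and dataset IDs are
hypothetical), a BigQuery table would be referenced as:

```json
{
  "namespace": "bigquery",
  "name": "my-project-id.sales.daily_orders"
}
```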

#### Cassandra:

Datasource hierarchy:

- Host
- Port

Naming hierarchy:

- Keyspace
- Table

Identifier:

- Namespace: cassandra://{host}:{port} of the service instance.
  - Scheme = cassandra
  - Authority = {host}:{port}
- Unique name: {keyspace}.{table}
  - URI = cassandra://{host}:{port}/{keyspace}.{table}

#### MySQL:

Datasource hierarchy:

- Host
- Port

Naming hierarchy:

- Database
- Table

Identifier:

- Namespace: mysql://{host}:{port} of the service instance.
  - Scheme = mysql
  - Authority = {host}:{port}
- Unique name: {database}.{table}
  - URI = mysql://{host}:{port}/{database}.{table}

#### Postgres:

Datasource hierarchy:

- Host
- Port

Naming hierarchy:

- Database
- Schema
- Table

Identifier:

- Namespace: postgres://{host}:{port} of the service instance.
  - Scheme = postgres
  - Authority = {host}:{port}
- Unique name: {database}.{schema}.{table}
  - URI = postgres://{host}:{port}/{database}.{schema}.{table}

#### Redshift:

Datasource hierarchy:

- Host: examplecluster.\<XXXXXXXXXXXX>.us-west-2.redshift.amazonaws.com
- Port: 5439

OR

- Cluster identifier
- Region name
- Port (defaults to 5439)

Naming hierarchy:

- Database
- Schema
- Table

One can interact with Redshift using SQL or the Data API. The combination of cluster identifier and region name is the
only common unique ID available to both.

Identifier:

- Namespace: redshift://{cluster_identifier}.{region_name}:{port} of the cluster instance.
  - Scheme = redshift
  - Authority = {cluster_identifier}.{region_name}:{port}
- Unique name: {database}.{schema}.{table}
  - URI = redshift://{cluster_identifier}.{region_name}:{port}/{database}.{schema}.{table}
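
For example (the cluster identifier and database objects below are hypothetical):

```json
{
  "namespace": "redshift://examplecluster.us-west-2:5439",
  "name": "dev.public.events"
}
```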

#### Snowflake

See:

- [Account Identifiers | Snowflake Documentation](https://docs.snowflake.com/en/user-guide/admin-account-identifier)
- [Object Identifiers | Snowflake Documentation](https://docs.snowflake.com/en/sql-reference/identifiers.html)

Datasource hierarchy:

- account identifier (composite of organization name and account name)

Naming hierarchy:

- Database: {database name} => unique across the account
- Schema: {schema name} => unique within the database
- Table: {table name} => unique within the schema

Identifier:

- Namespace: snowflake://{organization name}-{account name}
  - Scheme = snowflake
  - Authority = {organization name}-{account name}
- Name: {database}.{schema}.{table}
  - URI = snowflake://{organization name}-{account name}/{database}.{schema}.{table}

Snowflake resolves and stores names for databases, schemas, tables and columns differently depending on how they are
[expressed in statements](https://docs.snowflake.com/en/sql-reference/identifiers-syntax) (e.g. unquoted vs quoted). The
representation of names in OpenLineage events should be based on the canonical name that Snowflake stores. Specifically:

- For dataset names, each period-delimited part (database/schema/table) should be in the simplest form it would take in
  a statement, i.e. quoted only if necessary. For example, a table `My Table` in schema `MY_SCHEMA` and in database
  `MY_DATABASE` would be represented as `MY_DATABASE.MY_SCHEMA."My Table"`. If in doubt, check
  [Snowflake's `ACCESS_HISTORY` view](https://docs.snowflake.com/en/sql-reference/account-usage/access_history) to see
  how `objectName` is formed for a given table.
- For column names, the canonical name should always be used verbatim.
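
Putting this together (the organization and account names below are hypothetical), the mixed-case table above would be
referenced as:

```json
{
  "namespace": "snowflake://myorg-myaccount",
  "name": "MY_DATABASE.MY_SCHEMA.\"My Table\""
}
```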

#### Trino:

Datasource hierarchy:

- Host
- Port

Naming hierarchy:

- Catalog
- Schema
- Table

Identifier:

- Namespace: trino://{host}:{port} of the service instance.
  - Scheme = trino
  - Authority = {host}:{port}
- Unique name: {catalog}.{schema}.{table}
  - URI = trino://{host}:{port}/{catalog}.{schema}.{table}

### Distributed file systems/blob stores

#### ABFSS (Azure Data Lake Gen2)

Naming hierarchy:

- service name => globally unique
- Path

Identifier:

- Namespace: abfss://{container name}@{service name}
  - Scheme = abfss
  - Authority = service name
- Unique name: {path}
  - URI = abfss://{container name}@{service name}{path}

#### DBFS (Databricks File System)

Naming hierarchy:

- workspace name: globally unique
- Path

Identifier:

- Namespace: hdfs://{workspace name}
  - Scheme = hdfs
  - Authority = workspace name
- Unique name: {path}
  - URI = hdfs://{workspace name}{path}

#### GCS

Datasource hierarchy: none, naming is global

Naming hierarchy:

- bucket name => globally unique
- Path

Identifier:

- Namespace: gs://{bucket name}
  - Scheme = gs
  - Authority = {bucket name}
- Unique name: {path}
  - URI = gs://{bucket name}{path}

#### HDFS

Naming hierarchy:

- Namenode: host + port
- Path

Identifier:

- Namespace: hdfs://{namenode host}:{namenode port}
  - Scheme = hdfs
  - Authority = {namenode host}:{namenode port}
- Unique name: {path}
  - URI = hdfs://{namenode host}:{namenode port}{path}

#### Kafka

Naming hierarchy:

- Kafka bootstrap server host + port
- topic name

Identifier:

- Namespace: kafka://{bootstrap server host}:{port}
  - Scheme = kafka
  - Authority = bootstrap server
- Unique name: {topic name}
  - URI = kafka://{bootstrap server host}:{port}/{topic name}
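
For example (the broker host and topic name are hypothetical), a Kafka topic would be referenced as:

```json
{
  "namespace": "kafka://broker1.example.com:9092",
  "name": "orders"
}
```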

#### Local file system

Datasource hierarchy:

- IP
- Port

Naming hierarchy:

- Path

Identifier:

- Namespace: file://{IP}:{port}
  - Scheme = file
  - Authority = {IP}:{port}
- Unique name: {path}
  - URI = file://{IP}:{port}{path}

#### S3

Naming hierarchy:

- bucket name => globally unique
- Path

Identifier:

- Namespace: s3://{bucket name}
  - Scheme = s3
  - Authority = {bucket name}
- Unique name: {path}
  - URI = s3://{bucket name}{path}
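
For example (bucket and key are hypothetical; per the URI template above, {path} carries its leading slash):

```json
{
  "namespace": "s3://my-bucket",
  "name": "/data/2024/orders.parquet"
}
```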

#### WASBS (Azure Blob Storage)

Naming hierarchy:

- service name => globally unique
- Path

Identifier:

- Namespace: wasbs://{container name}@{service name}
  - Scheme = wasbs
  - Authority = service name
- Unique name: {path}
  - URI = wasbs://{container name}@{service name}{path}

## Jobs

### Context

A `Job` is a recurring data transformation with inputs and outputs. Each execution is captured as a `Run` with
corresponding metadata. A `Run` event identifies the `Job` it is an instance of by providing the job’s unique
identifier. The `Job` identifier is composed of a `Namespace` and a `Name`. The job name is unique within that
namespace.

The core property we want to identify about a `Job` is how it changes over time. Different schedules of the same logic
applied to different datasets (possibly with different parameters) are different jobs. The notion of a `job` is tied to
a recurring schedule with specific inputs and outputs. It could be an incremental update, a full reprocess, or even a
streaming job.

If the same code artifact (for example a Spark jar or a templated SQL query) is used in the context of different
schedules with different inputs or outputs, then they are different jobs. We are interested first in how they affect
the datasets they produce.

### Job Namespace and constructing job names

Jobs have a `name` that is unique to them in their `namespace` by construction.

The namespace is the root of the naming hierarchy. The job name is constructed to identify the job within that
namespace.

Example:

- Airflow:
  - Namespace: the namespace is assigned to the Airflow instance, e.g. airflow-staging, airflow-prod
  - Job: each task in a DAG is a job. Name: {dag name}.{task name}
- Spark:
  - Namespace: as in Airflow, the namespace is provided by configuration; if there is a parent job, its namespace is
    used instead.
  - Spark app job name: the spark.app.name
  - Spark action job name: {spark.app.name}.{node.name}
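
For instance (the DAG and task names below are hypothetical), an Airflow task would be identified as:

```json
{
  "namespace": "airflow-prod",
  "name": "daily_sales.load_orders"
}
```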

### Parent job run: a nested hierarchy of Jobs

It is often the case that jobs are part of a nested hierarchy. For example, an Airflow DAG contains tasks. An instance
of the DAG is finished when all of the tasks are finished. Similarly, a Spark job can spawn multiple actions, each of
them running independently. Additionally, a Spark job can be launched by an Airflow task within a DAG.

Since what we care about is identifying the job as rooted in a recurring schedule, we want to capture that connection
and make sure that we treat the same application logic triggered at different schedules as different jobs. For example,
if an Airflow DAG runs individual tasks per partition (e.g., market segments) using the same underlying job logic, they
will be tracked as separate jobs.

To capture this, a run event provides
[a `ParentRun` facet](https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.json#L282-L331) referring to
the parent `Job` and `Run`. This allows tracking a recurring job from the root of the schedule for which it is running.
If there is a parent job, we use the same namespace; otherwise it is provided by configuration.

Example:

```json
{
  "run": {
    "runId": "run_uuid"
  },
  "job": {
    "namespace": "job_namespace",
    "name": "job_name"
  }
}
```
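
A minimal sketch of where this facet sits in a full run event (all values are placeholders, and the facet's standard
`_producer` and `_schemaURL` fields are omitted for brevity):

```json
{
  "eventType": "START",
  "eventTime": "2024-01-01T00:00:00.000Z",
  "run": {
    "runId": "ea041791-68bc-4ae1-bd89-4c8106a157e4",
    "facets": {
      "parent": {
        "run": { "runId": "parent_run_uuid" },
        "job": { "namespace": "airflow-prod", "name": "daily_sales" }
      }
    }
  },
  "job": {
    "namespace": "airflow-prod",
    "name": "daily_sales.load_orders"
  },
  "producer": "https://example.com/my-producer"
}
```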

---

SPDX-License-Identifier: Apache-2.0\
Copyright 2018-2024 contributors to the OpenLineage project