fluent-plugin-openlineage 0.1.2 → 0.1.3
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/.github/dependabot.yml +6 -0
- data/.github/workflows/linux.yml +30 -0
- data/.gitignore +16 -0
- data/.idea/.gitignore +8 -0
- data/.idea/fluentd.iml +204 -0
- data/.idea/misc.xml +4 -0
- data/.idea/modules/benchmark-memory-0.2.iml +12 -0
- data/.idea/modules/bigdecimal-3.1.iml +11 -0
- data/.idea/modules/certstore_c-0.1.iml +15 -0
- data/.idea/modules/concurrent-ruby-1.3.iml +18 -0
- data/.idea/modules/concurrent-ruby-1.31.iml +15 -0
- data/.idea/modules/connection_pool-2.4.iml +11 -0
- data/.idea/modules/cool.io-1.8.iml +16 -0
- data/.idea/modules/drb-2.2.iml +14 -0
- data/.idea/modules/drb-2.21.iml +11 -0
- data/.idea/modules/ffi-1.17.iml +20 -0
- data/.idea/modules/ffi-win32-extensions-1.0.iml +19 -0
- data/.idea/modules/fluentd-1.17.iml +43 -0
- data/.idea/modules/http_parser.rb-0.8.iml +17 -0
- data/.idea/modules/json-2.7.iml +14 -0
- data/.idea/modules/json-2.71.iml +11 -0
- data/.idea/modules/msgpack-1.7.iml +15 -0
- data/.idea/modules/mutex_m-0.2.iml +15 -0
- data/.idea/modules/new_gem.iml +15 -0
- data/.idea/modules/power_assert-2.0.iml +19 -0
- data/.idea/modules/rake-13.2.iml +18 -0
- data/.idea/modules/rake-13.21.iml +15 -0
- data/.idea/modules/rake-compiler-1.2.iml +13 -0
- data/.idea/modules/rusty_json_schema-0.15.iml +15 -0
- data/.idea/modules/serverengine-2.3.iml +17 -0
- data/.idea/modules/sigdump-0.2.iml +16 -0
- data/.idea/modules/specifications.iml +14 -0
- data/.idea/modules/specifications1.iml +11 -0
- data/.idea/modules/strptime-0.2.iml +16 -0
- data/.idea/modules/thermite-0.13.iml +17 -0
- data/.idea/modules/webrick-1.8.iml +18 -0
- data/.idea/modules/win32-event-0.6.iml +21 -0
- data/.idea/modules/win32-ipc-0.7.iml +20 -0
- data/.idea/modules/yajl-ruby-1.4.iml +779 -0
- data/.idea/modules.xml +41 -0
- data/.rspec +2 -0
- data/ChangeLog +3 -0
- data/Gemfile +3 -0
- data/LICENSE +202 -0
- data/README.md +250 -0
- data/Rakefile +13 -0
- data/fluent-plugin-openlineage.gemspec +28 -0
- data/lib/fluent/plugin/parser_openlineage.rb +182 -0
- data/misc/fluent.conf +101 -0
- data/misc/test-complete.json +73 -0
- data/misc/test-start.json +73 -0
- data/spec/Naming.md +500 -0
- data/spec/OpenLineage.json +304 -0
- data/spec/Versioning.md +49 -0
- data/spec/events/event_full.json +206 -0
- data/spec/events/event_invalid_dataset_facet.json +31 -0
- data/spec/events/event_invalid_input_dataset_facet.json +29 -0
- data/spec/events/event_invalid_job_facet.json +26 -0
- data/spec/events/event_invalid_output_dataset_facet.json +29 -0
- data/spec/events/event_invalid_run_facet.json +28 -0
- data/spec/events/event_no_run_id.json +28 -0
- data/spec/events/event_simple.json +29 -0
- data/spec/facets/ColumnLineageDatasetFacet.json +96 -0
- data/spec/facets/ColumnLineageDatasetFacet.md +106 -0
- data/spec/facets/DataQualityAssertionsDatasetFacet.json +49 -0
- data/spec/facets/DataQualityMetricsInputDatasetFacet.json +76 -0
- data/spec/facets/DatasetVersionDatasetFacet.json +31 -0
- data/spec/facets/DatasourceDatasetFacet.json +32 -0
- data/spec/facets/DocumentationDatasetFacet.json +31 -0
- data/spec/facets/DocumentationJobFacet.json +30 -0
- data/spec/facets/ErrorMessageRunFacet.json +41 -0
- data/spec/facets/ExternalQueryRunFacet.json +36 -0
- data/spec/facets/ExternalQueryRunFacet.md +49 -0
- data/spec/facets/ExtractionErrorRunFacet.json +58 -0
- data/spec/facets/JobTypeJobFacet.json +41 -0
- data/spec/facets/LifecycleStateChangeDatasetFacet.json +46 -0
- data/spec/facets/NominalTimeRunFacet.json +38 -0
- data/spec/facets/OutputStatisticsOutputDatasetFacet.json +36 -0
- data/spec/facets/OwnershipDatasetFacet.json +45 -0
- data/spec/facets/OwnershipJobFacet.json +45 -0
- data/spec/facets/ParentRunFacet.json +54 -0
- data/spec/facets/ProcessingEngineRunFacet.json +41 -0
- data/spec/facets/SQLJobFacet.json +30 -0
- data/spec/facets/SchemaDatasetFacet.json +59 -0
- data/spec/facets/SourceCodeJobFacet.json +34 -0
- data/spec/facets/SourceCodeLocationJobFacet.json +60 -0
- data/spec/facets/StorageDatasetFacet.json +35 -0
- data/spec/facets/SymlinksDatasetFacet.json +47 -0
- data/spec/fluent/plugin/test_parser_openlineage.rb +141 -0
- data/spec/registry/core/registry.json +31 -0
- data/spec/registry/gcp/facets/GcpCommonJobFacet.json +43 -0
- data/spec/registry/gcp/registry.json +6 -0
- data/spec/spec_helper.rb +8 -0
- data/spec/tests/ColumnLineageDatasetFacet/1.json +172 -0
- data/spec/tests/DataQualityAssertionsDatasetFacet/1.json +58 -0
- data/spec/tests/DataQualityMetricsInputDatasetFacet/1.json +23 -0
- data/spec/tests/DatasetVersionDatasetFacet/1.json +7 -0
- data/spec/tests/DatasourceDatasetFacet/1.json +7 -0
- data/spec/tests/DocumentationDatasetFacet/1.json +7 -0
- data/spec/tests/DocumentationJobFacet/1.json +7 -0
- data/spec/tests/ErrorMessageRunFacet/1.json +9 -0
- data/spec/tests/ExternalQueryRunFacet/1.json +8 -0
- data/spec/tests/ExtractionErrorRunFacet/1.json +15 -0
- data/spec/tests/JobTypeJobFacet/1.json +9 -0
- data/spec/tests/LifecycleStateChangeDatasetFacet/1.json +11 -0
- data/spec/tests/NominalTimeRunFacet/1.json +8 -0
- data/spec/tests/OutputStatisticsOutputDatasetFacet/1.json +9 -0
- data/spec/tests/OwnershipDatasetFacet/1.json +11 -0
- data/spec/tests/OwnershipJobFacet/1.json +11 -0
- data/spec/tests/ParentRunFacet/1.json +13 -0
- data/spec/tests/ProcessingEngineRunFacet/1.json +9 -0
- data/spec/tests/SQLJobFacet/1.json +7 -0
- data/spec/tests/SchemaDatasetFacet/1.json +92 -0
- data/spec/tests/SourceCodeJobFacet/1.json +8 -0
- data/spec/tests/SourceCodeLocationJobFacet/1.json +8 -0
- data/spec/tests/StorageDatasetFacet/1.json +8 -0
- data/spec/tests/SymlinksDatasetFacet/1.json +13 -0
- data/spec/tests/example_full_event.json +24 -0
- metadata +188 -3
data/spec/Naming.md
ADDED
@@ -0,0 +1,500 @@
# Naming

We define the unique name strategy per resource to ensure it is followed uniformly, independently of who is producing
metadata, so we can connect lineage from various sources.

Both Jobs and Datasets are in their own namespaces. Job namespaces are related to their schedulers. The namespace for a
dataset is the unique name of its datasource.

## Datasets

The namespace and name of a datasource can be combined to form a URI (scheme:[//authority]path):

- Namespace = scheme:[//authority] (the datasource)
- Name = path (the dataset)
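For example, following the Postgres convention described below, a table `orders` in schema `public` of database
`metrics` on a hypothetical server `db.example.com:5432` would be referenced in an OpenLineage event as:

```json
{
  "namespace": "postgres://db.example.com:5432",
  "name": "metrics.public.orders"
}
```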
### Naming conventions for common data stores

This list is not exhaustive. If a data store is missing, please create an issue or open a PR.

### Data Warehouses

- [Athena](#Athena)
- [Azure Cosmos DB](#Azure-Cosmos-DB)
- [Azure Data Explorer](#Azure-Data-Explorer)
- [Azure Synapse](#Azure-Synapse)
- [BigQuery](#BigQuery)
- [Cassandra](#Cassandra)
- [MySQL](#MySQL)
- [Postgres](#Postgres)
- [Redshift](#Redshift)
- [Snowflake](#Snowflake)
- [Trino](#Trino)

### Distributed file systems/blob stores

- [ABFSS (Azure Data Lake Gen2)](#ABFSS "Azure Data Lake Gen2")
- [DBFS (Databricks File System)](#DBFS "Databricks File System")
- [GCS](#GCS)
- [HDFS](#HDFS)
- [Kafka](#Kafka)
- [Local file system](#Local-file-system)
- [S3](#S3)
- [WASBS (Azure Blob Storage)](#WASBS "Azure Blob Storage")

In data warehouses, datasets are called tables. Tables are organized into databases and schemas.

#### Athena:

Datasource hierarchy:

- Host: athena.{region_name}.amazonaws.com

Naming hierarchy:

- Catalog
- Database
- Table

Identifier:

- Namespace: awsathena://athena.{region_name}.amazonaws.com of the service instance.
  - Scheme = awsathena
  - Authority = athena.{region_name}.amazonaws.com
- Unique name: {catalog}.{database}.{table}
  - URI = awsathena://athena.{region_name}.amazonaws.com/{catalog}.{database}.{table}

#### Azure Cosmos DB:

Datasource hierarchy: azurecosmos://%s.documents.azure.com/dbs/%s

- Host: \<XXXXXXXXXXXX>.documents.azure.com
- Database

Naming hierarchy:

- Schema
- Table

Identifier:

- Namespace: azurecosmos://{host}/dbs/{database}
  - Scheme = azurecosmos
  - Authority = {host}
- Unique name: /colls/{table}
  - URI = azurecosmos://{host}.documents.azure.com/dbs/{database}/colls/{table}

#### Azure Data Explorer:

Datasource hierarchy:

- Host: \<clustername>.\<clusterlocation>
- Database
- Table

Naming hierarchy:

- Database
- Table

Identifier:

- Namespace: azurekusto://{host}.kusto.windows.net/{database}
  - Scheme = azurekusto
- Unique name: {database}/{table}
  - URI = azurekusto://{host}.kusto.windows.net/{database}/{table}

#### Azure Synapse:

Datasource hierarchy:

- Host: \<XXXXXXXXXXXX>.sql.azuresynapse.net
- Port: 1433
- Database: SQLPool1

Naming hierarchy:

- Schema
- Table

Identifier:

- Namespace: sqlserver://{host}:{port};database={database};
  - Scheme = sqlserver
  - Authority = {host}:{port}
- Unique name: {database}.{schema}.{table}
  - URI = sqlserver://{host}:{port}/{database}.{schema}.{table}

#### BigQuery

See:
[Creating and managing projects | Resource Manager Documentation](https://cloud.google.com/resource-manager/docs/creating-managing-projects)
[Introduction to datasets | BigQuery](https://cloud.google.com/bigquery/docs/datasets-intro)
[Introduction to tables | BigQuery](https://cloud.google.com/bigquery/docs/tables-intro)

Datasource hierarchy:

- bigquery

Naming hierarchy:

- Project Name: {project name} => is not unique
- Project number: {project number} => numeric: is unique across Google cloud
- Project ID: {project id} => readable: is unique across Google cloud
- dataset: {dataset name} => is unique within a project
- table: {table name} => is unique within a dataset

Identifier:

- Namespace: bigquery
  - Scheme = bigquery
  - Authority =
- Unique name: {project id}.{dataset name}.{table name}
  - URI = bigquery:{project id}.{dataset name}.{table name}
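BigQuery is a special case: the namespace has no authority, so the dataset reference carries the full hierarchy in its
name. For example, with a hypothetical project `my-project`:

```json
{
  "namespace": "bigquery",
  "name": "my-project.analytics.page_views"
}
```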
#### Cassandra:

Datasource hierarchy:

- Host
- Port

Naming hierarchy:

- Keyspace
- Table

Identifier:

- Namespace: cassandra://{host}:{port} of the service instance.
  - Scheme = cassandra
  - Authority = {host}:{port}
- Unique name: {keyspace}.{table}
  - URI = cassandra://{host}:{port}/{keyspace}.{table}

#### MySQL:

Datasource hierarchy:

- Host
- Port

Naming hierarchy:

- Database
- Table

Identifier:

- Namespace: mysql://{host}:{port} of the service instance.
  - Scheme = mysql
  - Authority = {host}:{port}
- Unique name: {database}.{table}
  - URI = mysql://{host}:{port}/{database}.{table}

#### Postgres:

Datasource hierarchy:

- Host
- Port

Naming hierarchy:

- Database
- Schema
- Table

Identifier:

- Namespace: postgres://{host}:{port} of the service instance.
  - Scheme = postgres
  - Authority = {host}:{port}
- Unique name: {database}.{schema}.{table}
  - URI = postgres://{host}:{port}/{database}.{schema}.{table}

#### Redshift:

Datasource hierarchy:

- Host: examplecluster.\<XXXXXXXXXXXX>.us-west-2.redshift.amazonaws.com
- Port: 5439

OR

- Cluster identifier
- Region name
- Port (defaults to 5439)

Naming hierarchy:

- Database
- Schema
- Table

One can interact with Redshift using SQL or the Data API. The combination of cluster identifier and region name is the
only common unique ID available to both.

Identifier:

- Namespace: redshift://{cluster_identifier}.{region_name}:{port} of the cluster instance.
  - Scheme = redshift
  - Authority = {cluster_identifier}.{region_name}:{port}
- Unique name: {database}.{schema}.{table}
  - URI = redshift://{cluster_identifier}.{region_name}:{port}/{database}.{schema}.{table}

#### Snowflake

See:

- [Account Identifiers | Snowflake Documentation](https://docs.snowflake.com/en/user-guide/admin-account-identifier)
- [Object Identifiers | Snowflake Documentation](https://docs.snowflake.com/en/sql-reference/identifiers.html)

Datasource hierarchy:

- account identifier (composite of organization name and account name)

Naming hierarchy:

- Database: {database name} => unique across the account
- Schema: {schema name} => unique within the database
- Table: {table name} => unique within the schema

Identifier:

- Namespace: snowflake://{organization name}-{account name}
  - Scheme = snowflake
  - Authority = {organization name}-{account name}
- Name: {database}.{schema}.{table}
  - URI = snowflake://{organization name}-{account name}/{database}.{schema}.{table}

Snowflake resolves and stores names for databases, schemas, tables and columns differently depending on how they are
[expressed in statements](https://docs.snowflake.com/en/sql-reference/identifiers-syntax) (e.g. unquoted vs quoted). The
representation of names in OpenLineage events should be based on the canonical name that Snowflake stores. Specifically:

- For dataset names, each period-delimited part (database/schema/table) should be in the simplest form it would take in
  a statement, i.e. quoted only if necessary. For example, a table `My Table` in schema `MY_SCHEMA` and in database
  `MY_DATABASE` would be represented as `MY_DATABASE.MY_SCHEMA."My Table"`. If in doubt, check
  [Snowflake's `ACCESS_HISTORY` view](https://docs.snowflake.com/en/sql-reference/account-usage/access_history) to see
  how `objectName` is formed for a given table.
- For column names, the canonical name should always be used verbatim.
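Combining the namespace and the canonical name, the `My Table` example above might appear in an event as follows
(assuming a hypothetical account `acme-analytics`):

```json
{
  "namespace": "snowflake://acme-analytics",
  "name": "MY_DATABASE.MY_SCHEMA.\"My Table\""
}
```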
#### Trino:

Datasource hierarchy:

- Host
- Port

Naming hierarchy:

- Catalog
- Schema
- Table

Identifier:

- Namespace: trino://{host}:{port} of the service instance.
  - Scheme = trino
  - Authority = {host}:{port}
- Unique name: {catalog}.{schema}.{table}
  - URI = trino://{host}:{port}/{catalog}.{schema}.{table}

### Distributed file systems/blob stores

#### ABFSS (Azure Data Lake Gen2)

Naming hierarchy:

- service name => globally unique
- Path

Identifier:

- Namespace: abfss://{container name}@{service name}
  - Scheme = abfss
  - Authority = service name
- Unique name: {path}
  - URI = abfss://{container name}@{service name}{path}

#### DBFS (Databricks File System)

Naming hierarchy:

- workspace name: globally unique
- Path

Identifier:

- Namespace: hdfs://{workspace name}
  - Scheme = hdfs
  - Authority = workspace name
- Unique name: {path}
  - URI = hdfs://{workspace name}{path}

#### GCS

Datasource hierarchy: none, naming is global

Naming hierarchy:

- bucket name => globally unique
- Path

Identifier:

- Namespace: gs://{bucket name}
  - Scheme = gs
  - Authority = {bucket name}
- Unique name: {path}
  - URI = gs://{bucket name}{path}

#### HDFS

Naming hierarchy:

- Namenode: host + port
- Path

Identifier:

- Namespace: hdfs://{namenode host}:{namenode port}
  - Scheme = hdfs
  - Authority = {namenode host}:{namenode port}
- Unique name: {path}
  - URI = hdfs://{namenode host}:{namenode port}{path}

#### Kafka

Naming hierarchy:

- Kafka bootstrap server host + port
- topic name

Identifier:

- Namespace: kafka://{bootstrap server host}:{port}
  - Scheme = kafka
  - Authority = bootstrap server
- Unique name: {topic name}
  - URI = kafka://{bootstrap server host}:{port}/{topic name}
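For example, a topic `orders` on a hypothetical broker `broker.example.com:9092` would be identified as:

```json
{
  "namespace": "kafka://broker.example.com:9092",
  "name": "orders"
}
```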
#### Local file system

Datasource hierarchy:

- IP
- Port

Naming hierarchy:

- Path

Identifier:

- Namespace: file://{IP}:{port}
  - Scheme = file
  - Authority = {IP}:{port}
- Unique name: {path}
  - URI = file://{IP}:{port}{path}

#### S3

Naming hierarchy:

- bucket name => globally unique
- Path

Identifier:

- Namespace: s3://{bucket name}
  - Scheme = s3
  - Authority = {bucket name}
- Unique name: {path}
  - URI = s3://{bucket name}{path}
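For blob stores, the path carries the whole dataset name. For example, an object prefix under a hypothetical bucket
`my-bucket`:

```json
{
  "namespace": "s3://my-bucket",
  "name": "/warehouse/orders/2024"
}
```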
#### WASBS (Azure Blob Storage)

Naming hierarchy:

- service name => globally unique
- Path

Identifier:

- Namespace: wasbs://{container name}@{service name}
  - Scheme = wasbs
  - Authority = service name
- Unique name: {path}
  - URI = wasbs://{container name}@{service name}{path}

## Jobs

### Context

A `Job` is a recurring data transformation with inputs and outputs. Each execution is captured as a `Run` with
corresponding metadata. A `Run` event identifies the `Job` it is an instance of by providing the job's unique
identifier. The `Job` identifier is composed of a `Namespace` and a `Name`. The job name is unique within that
namespace.

The core property we want to identify about a `Job` is how it changes over time. Different schedules of the same logic
applied to different datasets (possibly with different parameters) are different jobs. The notion of a `job` is tied to
a recurring schedule with specific inputs and outputs. It could be an incremental update, a full reprocess, or even a
streaming job.

If the same code artifact (for example, a Spark jar or a templated SQL query) is used in the context of different
schedules with different inputs or outputs, then they are different jobs. We are interested first in how they affect
the datasets they produce.

### Job Namespace and constructing job names

Jobs have a `name` that is unique to them in their `namespace` by construction.

The namespace is the root of the naming hierarchy. The job name is constructed to identify the job within that
namespace.

Example:

- Airflow:
  - Namespace: the namespace is assigned to the Airflow instance. Ex: airflow-staging, airflow-prod
  - Job: each task in a DAG is a job. Name: {dag name}.{task name}
- Spark:
  - Namespace: the namespace is provided as a configuration parameter, as in Airflow. If there is a parent job, we use
    the same namespace; otherwise it is provided by configuration.
  - Spark app job name: the spark.app.name
  - Spark action job name: {spark.app.name}.{node.name}
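Concretely, a task `load_orders` in a DAG `daily_ingest` running on the `airflow-prod` instance above would yield the
following job identifier (DAG and task names are illustrative):

```json
{
  "namespace": "airflow-prod",
  "name": "daily_ingest.load_orders"
}
```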
### Parent job run: a nested hierarchy of Jobs

It is often the case that jobs are part of a nested hierarchy. For example, an Airflow DAG contains tasks. An instance
of the DAG is finished when all of the tasks are finished. Similarly, a Spark job can spawn multiple actions, each of
them running independently. Additionally, a Spark job can be launched by an Airflow task within a DAG.

Since what we care about is identifying the job as rooted in a recurring schedule, we want to capture that connection
and make sure that we treat the same application logic triggered at different schedules as different jobs. For example,
if an Airflow DAG runs individual tasks per partition (e.g., market segments) using the same underlying job logic, they
will be tracked as separate jobs.

To capture this, a run event provides
[a `ParentRun` facet](https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.json#L282-L331) referring to
the parent `Job` and `Run`. This allows tracking a recurring job from the root of the schedule for which it is running.
If there is a parent job, we use the same namespace; otherwise it is provided by configuration.

Example:

```json
{
  "run": {
    "runId": "run_uuid"
  },
  "job": {
    "namespace": "job_namespace",
    "name": "job_name"
  }
}
```
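In a complete run event, this facet typically appears under the child run's `facets.parent` key. A minimal sketch with
illustrative names (`eventTime`, `_producer`, `_schemaURL`, and other required fields omitted for brevity):

```json
{
  "eventType": "START",
  "run": {
    "runId": "child_run_uuid",
    "facets": {
      "parent": {
        "run": { "runId": "parent_run_uuid" },
        "job": { "namespace": "airflow-prod", "name": "daily_ingest" }
      }
    }
  },
  "job": {
    "namespace": "airflow-prod",
    "name": "daily_ingest.load_orders"
  }
}
```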
---

SPDX-License-Identifier: Apache-2.0\
Copyright 2018-2024 contributors to the OpenLineage project