ingestify 0.6.4__py3-none-any.whl → 0.7.0__py3-none-any.whl

Files changed (39)
  1. ingestify/__init__.py +1 -1
  2. ingestify/application/dataset_store.py +228 -11
  3. ingestify/application/ingestion_engine.py +229 -7
  4. ingestify/application/loader.py +153 -28
  5. ingestify/cmdline.py +0 -48
  6. ingestify/domain/models/__init__.py +2 -0
  7. ingestify/domain/models/dataset/collection.py +0 -9
  8. ingestify/domain/models/dataset/dataset_repository.py +4 -0
  9. ingestify/domain/models/dataset/dataset_state.py +5 -0
  10. ingestify/domain/models/dataset/events.py +13 -0
  11. ingestify/domain/models/dataset/file.py +1 -1
  12. ingestify/domain/models/dataset/selector.py +8 -1
  13. ingestify/domain/models/event/event_bus.py +16 -1
  14. ingestify/domain/models/ingestion/ingestion_job.py +23 -4
  15. ingestify/domain/models/resources/dataset_resource.py +0 -1
  16. ingestify/infra/source/statsbomb/base.py +36 -0
  17. ingestify/infra/source/statsbomb/match.py +137 -0
  18. ingestify/infra/source/statsbomb_github.py +46 -44
  19. ingestify/infra/store/dataset/sqlalchemy/repository.py +77 -10
  20. ingestify/infra/store/dataset/sqlalchemy/tables.py +10 -0
  21. ingestify/main.py +35 -10
  22. ingestify/utils.py +2 -32
  23. ingestify-0.7.0.dist-info/METADATA +211 -0
  24. {ingestify-0.6.4.dist-info → ingestify-0.7.0.dist-info}/RECORD +28 -36
  25. ingestify/infra/source/wyscout.py +0 -175
  26. ingestify/static/templates/statsbomb_github/config.yaml.jinja2 +0 -19
  27. ingestify/static/templates/statsbomb_github/database/README.md +0 -1
  28. ingestify/static/templates/statsbomb_github/query.py +0 -14
  29. ingestify/static/templates/wyscout/.env +0 -5
  30. ingestify/static/templates/wyscout/.gitignore +0 -2
  31. ingestify/static/templates/wyscout/README.md +0 -0
  32. ingestify/static/templates/wyscout/config.yaml.jinja2 +0 -18
  33. ingestify/static/templates/wyscout/database/README.md +0 -1
  34. ingestify/static/templates/wyscout/query.py +0 -14
  35. ingestify-0.6.4.dist-info/METADATA +0 -266
  36. ingestify/{static/templates/statsbomb_github/README.md → infra/source/statsbomb/__init__.py} +0 -0
  37. {ingestify-0.6.4.dist-info → ingestify-0.7.0.dist-info}/WHEEL +0 -0
  38. {ingestify-0.6.4.dist-info → ingestify-0.7.0.dist-info}/entry_points.txt +0 -0
  39. {ingestify-0.6.4.dist-info → ingestify-0.7.0.dist-info}/top_level.txt +0 -0
@@ -1,266 +0,0 @@
- Metadata-Version: 2.1
- Name: ingestify
- Version: 0.6.4
- Summary: Data Ingestion Framework
- Author: Koen Vossen
- Author-email: info@koenvossen.nl
- License: AGPL
- Description-Content-Type: text/markdown
- Requires-Dist: requests<3,>=2.0.0
- Requires-Dist: SQLAlchemy
- Requires-Dist: dataclass-factory
- Requires-Dist: cloudpickle
- Requires-Dist: click
- Requires-Dist: jinja2
- Requires-Dist: python-dotenv
- Requires-Dist: pyaml-env
- Requires-Dist: boto3
- Requires-Dist: pytz
- Requires-Dist: pydantic>=2.0.0
- Provides-Extra: test
- Requires-Dist: pytest<7,>=6.2.5; extra == "test"
-
- # Ingestify
-
- ## Data Management Platform
-
- In general a data management platform contains:
- 1. Ingestion of data (Extract from Source into Load into Data Lake)
- 2. Transformation of data (Extract from Data Lake, Transform and Load into Data Warehouse)
- 3. Utilization of data
-
- <img src="https://www.getdbt.com/ui/img/blog/what-exactly-is-dbt/1-BogoeTTK1OXFU1hPfUyCFw.png" />
- Source: https://www.getdbt.com/blog/what-exactly-is-dbt/
-
- TODO: Improve drawings and explain more
-
- ## Ingestify
-
- Ingestify focus' on Ingestion of data.
-
- ### How does Ingestify work?
-
- 1. A `Source` is asked for all available `Datasets` using the `discover_datasets` method
- 2. All available `Datasets` are compared with what's already fetched, and if it's changed (using a `FetchPolicy`)
- 3. A `TaskQueue` is filled with `Tasks` to fetch all missing or stale `Datasets`
-
- <img src="https://raw.githubusercontent.com/PySport/ingestify/refs/heads/main/docs/overview.svg" />
-
- - [Source](blob/main/ingestify/domain/models/source.py) is the main entrance from Ingestify to external sources. A Source must always define:
- - `discover_datasets` - Creates a list of all available datasets on the Source
- - `fetch_dataset_files` - Fetches a single dataset for a Source
- - [Dataset Store](blob/main/ingestify/application/dataset_store.py) manages the access to the Metadata storage and the file storage. It keeps track of versions, and knows how to load data.
- - [Loader](blob/main/ingestify/application/loader.py) organizes the fetching process. It does this by executing the following steps:
- 1. Ask `Source` for all available datasets for a selector
- 2. Ask `Dataset Store` for all available datasets for a selector
- 3. Determines missing `Datasets`
- 4. Create tasks for data retrieval and puts in `TaskQueue`
- 5. Use multiprocessing to execute all tasks
-
- ## Get started
-
- ### Install
-
- Make sure you have installed the latest version:
- ```bash
- pip install git+https://github.com/PySport/ingestify.git
-
- # OR
-
- pip install git+ssh://git@github.com/PySport/ingestify.git
- ```
-
- ### Using a template
-
- Ingestify provides some templates to get started quickly. When using `ingestify init` a new project will be created and example files are copied.
- Currently, Ingestify offers a `statsbomb_github` and `wyscout` template.
-
- #### Statsbomb Github
-
- This uses https://github.com/statsbomb/open-data as source and syncs some competitions.
-
- ```
- bash# ingestify init --template statsbomb_github /tmp/ingestify-test
-
- 2023-05-23 08:57:51,250 [INFO] ingestify.cmdline: Initialized project at `/tmp/ingestify-test` with template `statsbomb_github`
- ```
-
- #### Wyscout
-
- This requires valid Wyscout credentials. The templates includes some security best practices like using a `.env` file for credentials which isn't part of version control.
-
- ```
- bash# ingestify init --template wyscout /tmp/ingestify-test
-
- 2023-05-23 08:58:18,720 [INFO] ingestify.cmdline: Initialized project at `/tmp/ingestify-test` with template `wyscout`
- ```
-
- ### Running Ingestify
-
- To actually run Ingestify you first change the current directory to the project directory.
-
- Then run:
- ```bash
- bash# ingestify run
-
- 2023-05-23 08:59:07,066 [INFO] ingestify.main: Initializing sources
- 2023-05-23 08:59:07,068 [INFO] ingestify.main: Initializing IngestionEngine
- 2023-05-23 08:59:07,086 [INFO] ingestify.main: Determining tasks...
- 2023-05-23 08:59:07,364 [INFO] ingestify.application.loader: Discovered 33 datasets from StatsbombGithub using selector competition_id=11/season_id=42 => 33 tasks. 0 skipped.
- 2023-05-23 08:59:07,625 [INFO] ingestify.application.loader: Discovered 35 datasets from StatsbombGithub using selector competition_id=11/season_id=90 => 35 tasks. 0 skipped.
- 2023-05-23 08:59:07,625 [INFO] ingestify.application.loader: Scheduled 68 tasks. With 10 processes
- 2023-05-23 08:59:07,654 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=42/match_id=303516)
- 2023-05-23 08:59:07,654 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=42/match_id=303731)
- 2023-05-23 08:59:07,655 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=42/match_id=303430)
- 2023-05-23 08:59:07,655 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=42/match_id=303504)
- 2023-05-23 08:59:07,655 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=42/match_id=303421)
- 2023-05-23 08:59:07,655 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=42/match_id=303400)
- 2023-05-23 08:59:07,656 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=42/match_id=303664)
- 2023-05-23 08:59:07,656 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=42/match_id=303680)
- 2023-05-23 08:59:07,657 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=42/match_id=303487)
- 2023-05-23 08:59:07,658 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=42/match_id=303615)
- 2023-05-23 08:59:08,419 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=42/match_id=303532)
- 2023-05-23 08:59:08,421 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=42/match_id=303682)
- 2023-05-23 08:59:08,444 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=42/match_id=303451)
- 2023-05-23 08:59:08,462 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=42/match_id=303596)
- 2023-05-23 08:59:08,518 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=42/match_id=303634)
- 2023-05-23 08:59:08,528 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=42/match_id=303479)
- 2023-05-23 08:59:08,541 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=42/match_id=303696)
- 2023-05-23 08:59:08,638 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=42/match_id=303725)
- 2023-05-23 08:59:08,684 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=42/match_id=303600)
- 2023-05-23 08:59:08,962 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=42/match_id=303493)
- 2023-05-23 08:59:09,270 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=42/match_id=303548)
- 2023-05-23 08:59:09,276 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=42/match_id=303674)
- 2023-05-23 08:59:09,292 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=42/match_id=303700)
- 2023-05-23 08:59:09,332 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=42/match_id=303666)
- 2023-05-23 08:59:09,411 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=42/match_id=303377)
- 2023-05-23 08:59:09,462 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=42/match_id=303517)
- 2023-05-23 08:59:09,491 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=42/match_id=303473)
- 2023-05-23 08:59:09,511 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=90/match_id=3773631)
- 2023-05-23 08:59:09,726 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=90/match_id=3773497)
- 2023-05-23 08:59:09,757 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=90/match_id=3773593)
- 2023-05-23 08:59:09,957 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=42/match_id=303652)
- 2023-05-23 08:59:09,999 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=42/match_id=303715)
- 2023-05-23 08:59:10,075 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=42/match_id=303470)
- 2023-05-23 08:59:10,103 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=42/match_id=303707)
- 2023-05-23 08:59:10,188 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=90/match_id=3773457)
- 2023-05-23 08:59:10,248 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=42/match_id=303524)
- 2023-05-23 08:59:10,282 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=90/match_id=3773665)
- 2023-05-23 08:59:10,411 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=42/match_id=303610)
- 2023-05-23 08:59:10,563 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=90/match_id=3773466)
- 2023-05-23 08:59:10,711 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=90/match_id=3773585)
- 2023-05-23 08:59:10,768 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=90/match_id=3773672)
- 2023-05-23 08:59:10,778 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=90/match_id=3773565)
- 2023-05-23 08:59:10,867 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=90/match_id=3773660)
- 2023-05-23 08:59:10,954 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=90/match_id=3773656)
- 2023-05-23 08:59:10,974 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=90/match_id=3773586)
- 2023-05-23 08:59:11,026 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=90/match_id=3773387)
- 2023-05-23 08:59:11,136 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=90/match_id=3773369)
- 2023-05-23 08:59:11,438 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=90/match_id=3773552)
- 2023-05-23 08:59:11,515 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=90/match_id=3773597)
- 2023-05-23 08:59:11,586 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=90/match_id=3773571)
- 2023-05-23 08:59:11,610 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=90/match_id=3773587)
- 2023-05-23 08:59:11,690 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=90/match_id=3773386)
- 2023-05-23 08:59:11,727 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=90/match_id=3773377)
- 2023-05-23 08:59:11,757 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=90/match_id=3773372)
- 2023-05-23 08:59:11,899 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=90/match_id=3764661)
- 2023-05-23 08:59:11,901 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=90/match_id=3773695)
- 2023-05-23 08:59:12,006 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=90/match_id=3773661)
- 2023-05-23 08:59:12,186 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=90/match_id=3773474)
- 2023-05-23 08:59:12,283 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=90/match_id=3773523)
- 2023-05-23 08:59:12,339 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=90/match_id=3773403)
- 2023-05-23 08:59:12,426 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=90/match_id=3773428)
- 2023-05-23 08:59:12,582 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=90/match_id=3773415)
- 2023-05-23 08:59:12,583 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=90/match_id=3773689)
- 2023-05-23 08:59:12,705 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=90/match_id=3773526)
- 2023-05-23 08:59:13,510 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=90/match_id=3773477)
- 2023-05-23 08:59:13,538 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=90/match_id=3764440)
- 2023-05-23 08:59:13,592 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=90/match_id=3773625)
- 2023-05-23 08:59:15,017 [INFO] ingestify.application.loader: Running task CreateDatasetTask(StatsbombGithub -> competition_id=11/season_id=90/match_id=3773547)
- 2023-05-23 08:59:15,917 [INFO] ingestify.cmdline: Done
- ```
-
- When we run it for the second time:
- ```bash
- bash# ingestify run
-
- 2023-05-23 08:59:48,001 [INFO] ingestify.main: Initializing sources
- 2023-05-23 08:59:48,002 [INFO] ingestify.main: Initializing IngestionEngine
- 2023-05-23 08:59:48,006 [INFO] ingestify.main: Determining tasks...
- 2023-05-23 08:59:48,067 [INFO] ingestify.application.loader: Discovered 33 datasets from StatsbombGithub using selector competition_id=11/season_id=42 => 0 tasks. 33 skipped.
- 2023-05-23 08:59:48,118 [INFO] ingestify.application.loader: Discovered 35 datasets from StatsbombGithub using selector competition_id=11/season_id=90 => 0 tasks. 35 skipped.
- 2023-05-23 08:59:48,118 [INFO] ingestify.application.loader: Nothing to do.
- 2023-05-23 08:59:48,119 [INFO] ingestify.cmdline: Done
- ```
-
- ## Using the data
-
- The project contains a `query.py` file with an example of how to use the data.
-
- ```bash
- bash# python query.py
-
- Loaded dataset with 3702 events
- Loaded dataset with 3994 events
- Loaded dataset with 3831 events
- Loaded dataset with 3647 events
- Loaded dataset with 4062 events
- Loaded dataset with 4051 events
-
- .....
-
- ```
-
-
- How to go from raw data to parquet files:
-
- ```python
- from ingestify.main import get_datastore
-
- store = get_datastore("config.yaml")
-
- dataset_collection = store.get_dataset_collection(
- provider="statsbomb", stage="raw"
- )
-
- # Store.map is using multiprocessing by default
- store.map(
- lambda dataset: (
- store
-
- # As it's related to https://github.com/PySport/kloppy the store can load files using kloppy
- .load_with_kloppy(dataset)
-
- # Convert it into a polars dataframe using all columns in the original data and some more additional ones
- .to_df(
- "*",
- match_id=dataset.dataset_resource_id.match_id,
- competition_id=dataset.dataset_resource_id.competition_id,
- season_id=dataset.dataset_resource_id.season_id,
-
- engine="polars"
- )
-
- # Write to parquet format
- .write_parquet(
- f"/tmp/files/blaat/{dataset.dataset_resource_id.match_id}.parquet"
- )
- ),
- dataset_collection,
- )
-
- # TODO:
- # - when a file is written in parquet format (on any other format) it should be added as such to the store.
- ```
-
-
- ## Future work
-
- Some future work include:
- - Workflow tools - Run custom workflows using with tools like [Airflow](https://airflow.apache.org/), [Dagster](https://docs.dagster.io/getting-started), [Prefect](https://www.prefect.io/), [DBT](https://www.getdbt.com/)
- - Execution engines - Run tasks on other execution engines like [AWS Lambda](https://aws.amazon.com/lambda/), [Dask](https://www.dask.org/)
- - Lineage - Keep track of lineage with tools like [SQLLineage](https://sqllineage.readthedocs.io/en/latest/index.html)
- - Data quality - Monitor data quality with tools like [Great Expectations](https://docs.greatexpectations.io/docs/tutorials/quickstart/)
- - Event Bus - Automatically publish events to external systems like [AWS Event Bridge](https://aws.amazon.com/eventbridge/), [Azure Event Grid](https://learn.microsoft.com/en-us/azure/event-grid/overview), [Google Cloud Pub/Sub](https://cloud.google.com/pubsub/docs/overview), [Kafka](https://kafka.apache.org/), [RabbitMQ](https://www.rabbitmq.com/)
- - Query Engines - Integrate with query engines to run SQL queries directly on the store using tools like [DuckDB](https://duckdb.org/), [DataBend](https://databend.rs/), [DataFusion](https://arrow.apache.org/datafusion/), [Polars](https://www.pola.rs/), [Spark](https://spark.apache.org/)
- - Streaming Data - Ingest streaming data