dremiojs 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.eslintrc.json +14 -0
- package/.prettierrc +7 -0
- package/README.md +59 -0
- package/dremiodocs/dremio-cloud/cloud-api-reference.md +748 -0
- package/dremiodocs/dremio-cloud/dremio-cloud-about.md +225 -0
- package/dremiodocs/dremio-cloud/dremio-cloud-admin.md +3754 -0
- package/dremiodocs/dremio-cloud/dremio-cloud-bring-data.md +6098 -0
- package/dremiodocs/dremio-cloud/dremio-cloud-changelog.md +32 -0
- package/dremiodocs/dremio-cloud/dremio-cloud-developer.md +1147 -0
- package/dremiodocs/dremio-cloud/dremio-cloud-explore-analyze.md +2522 -0
- package/dremiodocs/dremio-cloud/dremio-cloud-get-started.md +300 -0
- package/dremiodocs/dremio-cloud/dremio-cloud-help-support.md +869 -0
- package/dremiodocs/dremio-cloud/dremio-cloud-manage-govern.md +800 -0
- package/dremiodocs/dremio-cloud/dremio-cloud-overview.md +36 -0
- package/dremiodocs/dremio-cloud/dremio-cloud-security.md +1844 -0
- package/dremiodocs/dremio-cloud/sql-docs.md +7180 -0
- package/dremiodocs/dremio-software/dremio-software-acceleration.md +1575 -0
- package/dremiodocs/dremio-software/dremio-software-admin.md +884 -0
- package/dremiodocs/dremio-software/dremio-software-client-applications.md +3277 -0
- package/dremiodocs/dremio-software/dremio-software-data-products.md +560 -0
- package/dremiodocs/dremio-software/dremio-software-data-sources.md +8701 -0
- package/dremiodocs/dremio-software/dremio-software-deploy-dremio.md +3446 -0
- package/dremiodocs/dremio-software/dremio-software-get-started.md +848 -0
- package/dremiodocs/dremio-software/dremio-software-monitoring.md +422 -0
- package/dremiodocs/dremio-software/dremio-software-reference.md +677 -0
- package/dremiodocs/dremio-software/dremio-software-security.md +2074 -0
- package/dremiodocs/dremio-software/dremio-software-v25-api.md +32637 -0
- package/dremiodocs/dremio-software/dremio-software-v26-api.md +36757 -0
- package/jest.config.js +10 -0
- package/package.json +25 -0
- package/src/api/catalog.ts +74 -0
- package/src/api/jobs.ts +105 -0
- package/src/api/reflection.ts +77 -0
- package/src/api/source.ts +61 -0
- package/src/api/user.ts +32 -0
- package/src/client/base.ts +66 -0
- package/src/client/cloud.ts +37 -0
- package/src/client/software.ts +73 -0
- package/src/index.ts +16 -0
- package/src/types/catalog.ts +31 -0
- package/src/types/config.ts +18 -0
- package/src/types/job.ts +18 -0
- package/src/types/reflection.ts +29 -0
- package/tests/integration_manual.ts +95 -0
- package/tsconfig.json +19 -0
@@ -0,0 +1,1147 @@

# Developer Guide | Dremio Documentation

Original URL: https://docs.dremio.com/dremio-cloud/developer/

You can develop applications that connect to Dremio using Arrow Flight for high-performance data access, APIs for management operations, or by integrating with development tools and frameworks.

## Build Custom Applications

Use Arrow Flight and Python SDKs to build applications that connect to Dremio:

* [Arrow Flight](/dremio-cloud/developer/arrow-flight) – High-performance data access for analytics applications
* [Arrow Flight SQL](/dremio-cloud/developer/arrow-flight-sql) – Standardized SQL database interactions with prepared statements
* [Python](/dremio-cloud/developer/python) – Build applications using Arrow Flight or REST APIs
* [Dremio MCP Server](/dremio-cloud/developer/mcp-server) – AI Agent integration for natural language interactions

## Build Pipelines and Transformations

Use your tool of choice to build pipelines, perform transformations, and work with Dremio:

* [dbt Integration](/dremio-cloud/developer/dbt) – Transform data with version control and testing
* [VS Code Extension](/dremio-cloud/developer/vs-code) – Query Dremio from Visual Studio Code

## Customize and Automate

Use APIs to power any type of customization or automation:

* [API Reference](/dremio-cloud/api/) – Web applications and administrative automation

For sample applications, connectors, and additional integrations, see [Dremio Hub](https://github.com/dremio-hub).

## Supported Data Formats

For a deep dive into open table and data formats that Dremio supports, see [Data Formats](/dremio-cloud/developer/data-formats/).

<div style="page-break-after: always;"></div>
# dbt | Dremio Documentation

Original URL: https://docs.dremio.com/dremio-cloud/developer/dbt

dbt enables analytics engineers to transform their data using the same practices that software engineers use to build applications.

You can use Dremio's dbt connector `dbt-dremio` to transform data that is in data sources that are connected to a Dremio project.

## Prerequisites

* Download the `dbt-dremio` package from <https://github.com/dremio/dbt-dremio>.
* Ensure that Python 3.9.x or later is installed.
* Before connecting from a dbt project to Dremio, follow these prerequisite steps:
  + Ensure that you have the ID of the Dremio project that you want to use. See [Obtain the ID of a Project](/dremio-cloud/admin/projects/#obtain-the-id-of-a-project).
  + Ensure that you have a personal access token (PAT) for authenticating to Dremio. See [Create a PAT](/dremio-cloud/security/authentication/personal-access-token#create-a-pat).

## Install

Install this package from PyPI by running this command:

Install dbt-dremio package

```
pip install dbt-dremio
```

note

`dbt-dremio` works exclusively with dbt-core versions 1.8-1.9. Previous versions of dbt-core are outside of official support.

## Initialize a dbt Project

1. Run the command `dbt init <project_name>`.
2. Select `dremio` as the database to use.
3. Select the `dremio_cloud` option.
4. Provide a value for `cloud_host`.
5. Enter your username, PAT, and the ID of your Dremio project.
6. Select the `enterprise_catalog` option.
7. For `enterprise_catalog_namespace`, enter the name of an existing namespace within the catalog.
8. For `enterprise_catalog_folder`, enter the name of a folder that already exists within the namespace.

For descriptions of the configurations in the above steps, see Configurations.

After these steps are completed, you will have a profile for your new dbt project. This file is typically named `profiles.yml`.

This file can be edited to add multiple profiles, one for each `target` configuration of Dremio.
A common pattern is to have a `dev` target where a dbt project is tested, and then another `prod` target where changes to the model are promoted after testing:

Example Profile

```
[project name]:
  outputs:
    dev:
      cloud_host: api.dremio.cloud
      cloud_project_id: 1ab23456-78c9-01d2-de3f-456g7h890ij1
      enterprise_catalog_folder: sales
      enterprise_catalog_namespace: dev
      pat: A1BCDrE2FwgH3IJkLM4123qrsT5uV6WXyza7I8bcDEFgJ9hIj0Kl1MNOPq2Rstu==
      threads: 1
      type: dremio
      use_ssl: true
      user: name@company.com
    prod:
      cloud_host: api.dremio.cloud
      cloud_project_id: 1ab23456-78c9-01d2-de3f-456g7h890ij1
      enterprise_catalog_folder: sales
      enterprise_catalog_namespace: prod
      pat: A1BCDrE2FwgH3IJkLM4123qrsT5uV6WXyza7I8bcDEFgJ9hIj0Kl1MNOPq2Rstu==
      threads: 1
      type: dremio
      use_ssl: true
      user: name@company.com
  target: dev
```

Note that the `target` value inside the profiles.yml file can be overridden when invoking `dbt run`.

Specify target for dbt run command

```
dbt run --target <target_name>
```
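
If you orchestrate dbt from Python rather than the command line, dbt-core also exposes a programmatic entry point (`dbtRunner`, available since dbt-core 1.5, so including the 1.8-1.9 range that `dbt-dremio` supports). A minimal sketch, assuming `dbt-dremio` is installed and the `profiles.yml` shown above is in place; the `dev` target name is just the example from that profile:

```
# Minimal sketch: invoke dbt programmatically instead of shelling out.
# Assumes dbt-core >= 1.5 (dbtRunner API) and the example profile above.
from dbt.cli.main import dbtRunner, dbtRunnerResult

runner = dbtRunner()

# Equivalent to: dbt run --target dev
result: dbtRunnerResult = runner.invoke(["run", "--target", "dev"])

if not result.success:
    # result.exception is set when dbt itself failed to run
    raise SystemExit(result.exception)
```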

## Configurations

| Configuration | Required | Default Value | Description |
| --- | --- | --- | --- |
| `cloud_host` | Yes | `api.dremio.cloud` | US Control Plane: `api.dremio.cloud`; EU Control Plane: `api.eu.dremio.cloud` |
| `cloud_project_id` | Yes | None | The ID of the Dremio project in which to run transformations. |
| `enterprise_catalog_namespace` | Yes | None | The namespace in which to create tables, views, etc. The dbt aliases are `datalake` (for objects) and `database` (for views). |
| `enterprise_catalog_folder` | Yes | None | The path in the catalog in which to create catalog objects. The dbt aliases are `root_path` (for objects) and `schema` (for views). Nested folders in the path are separated with periods. |
| `pat` | Yes | None | The personal access token to use for authentication. See [Personal Access Tokens](/dremio-cloud/security/authentication/personal-access-token/) for instructions about obtaining a token. |
| `threads` | Yes | 1 | The number of threads the dbt project runs on. |
| `type` | Yes | `dremio` | Auto-populated when creating a Dremio project. Do not change this value. |
| `use_ssl` | Yes | `true` | The value must be `true`. |
| `user` | Yes | None | Email address used as a username in Dremio. |

## Known Limitations

[Model contracts](https://docs.getdbt.com/docs/collaborate/govern/model-contracts) are not supported.

<div style="page-break-after: always;"></div>

# Python | Dremio Documentation

Original URL: https://docs.dremio.com/dremio-cloud/developer/python

You can develop client applications in Python that use [Arrow Flight](/dremio-cloud/developer/arrow-flight/) and connect to Dremio's Arrow Flight server endpoint. For help getting started, try out the sample application.

## Sample Python Arrow Flight Client Application

This lightweight sample Python client application connects to the Dremio Arrow Flight server endpoint. You can use token-based credentials for authentication. Any datasets in Dremio that are accessible by the provided Dremio user can be queried. You can change settings in a `.yaml` configuration file before running the client.

The Sample Python Client Application

```
"""
Copyright (C) 2017-2021 Dremio Corporation

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
"""
from dremio.arguments.parse import get_config
from dremio.flight.endpoint import DremioFlightEndpoint

if __name__ == "__main__":
    # Parse the config file.
    args = get_config()

    # Instantiate DremioFlightEndpoint object
    dremio_flight_endpoint = DremioFlightEndpoint(args)

    # Connect to Dremio Arrow Flight server endpoint.
    flight_client = dremio_flight_endpoint.connect()

    # Execute query
    dataframe = dremio_flight_endpoint.execute_query(flight_client)

    # Print out the data
    print(dataframe)
```

### Steps

1. Install [Python 3](https://www.python.org/downloads/).
2. Download the [Dremio Flight endpoint .whl file](https://github.com/dremio-hub/arrow-flight-client-examples/releases).
3. Install the `.whl` file:

   Command for installing the file

   ```
   python3 -m pip install <path to .whl file>
   ```
4. Create a local folder to store the client file and config file.
5. Create a file named `example.py` in the folder that you created.
6. Copy the contents of `arrow-flight-client-examples/python/example.py` (available [here](https://github.com/dremio-hub/arrow-flight-client-examples/blob/main/python/example.py)) into `example.py`.
7. Create a file named `config.yaml` in the folder that you created.
8. Copy the contents of `arrow-flight-client-examples/python/config_template.yaml` (available [here](https://github.com/dremio-hub/arrow-flight-client-examples/blob/main/python/config_template.yaml)) into `config.yaml`.
9. Uncomment the options in `config.yaml`, as needed, appending arguments after their keys (for example, `username: my_username`). You can either delete the options that are not being used or leave them commented.

   Example config file for connecting to Dremio

   ```
   hostname: data.dremio.cloud
   port: 443
   pat: my_PAT
   tls: true
   query: SELECT * FROM Samples."samples.dremio.com"."NYC-taxi-trips" limit 10
   ```
10. Run the Python Arrow Flight Client by navigating to the folder that you created and running this command:

    Command for running the client

    ```
    python3 example.py [-config CONFIG_REL_PATH | --config-path CONFIG_REL_PATH]
    ```

    * `[-config CONFIG_REL_PATH | --config-path CONFIG_REL_PATH]`: Use either of these options to set the relative path to the config file. The default is "./config.yaml".

### Config File Options

Default content of the config file

```
hostname:
port:
username:
password:
token:
query:
tls:
disable_certificate_verification:
path_to_certs:
session_properties:
engine:
```

| Name | Type | Required? | Default | Description |
| --- | --- | --- | --- | --- |
| `hostname` | string | No | `localhost` | Must be `data.dremio.cloud`. |
| `port` | integer | No | 32010 | Dremio's Arrow Flight server port. Must be `443`. |
| `username` | string | No | N/A | Not applicable when connecting to Dremio. |
| `password` | string | No | N/A | Not applicable when connecting to Dremio. |
| `token` | string | Yes | N/A | Either a personal access token or an OAuth2 token. |
| `query` | string | Yes | N/A | The SQL query to test. |
| `tls` | boolean | No | false | Enables encryption on a connection. |
| `disable_certificate_verification` | boolean | No | false | Disables TLS server verification. |
| `path_to_certs` | string | No | System certificates | Path to trusted certificates for encrypted connections. |
| `session_properties` | list of strings | No | N/A | Key-value pairs of session properties. Example: `session_properties: - schema='Samples."samples.dremio.com"'` For a list of the available properties, see [Manage Workloads](/dremio-cloud/developer/arrow-flight#manage-workloads). |
| `engine` | string | No | N/A | The specific engine to run against. |
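
The sample application wraps the connection details, but the same flow can be written directly against `pyarrow.flight`. A minimal sketch of what the wrapper automates, assuming a valid PAT, the default project, and the US control plane endpoint; the token and query values are placeholders:

```
# Minimal sketch using pyarrow.flight directly (what the sample wrapper
# automates). Assumes: pyarrow is installed, the PAT is valid, and the
# default project is being queried. Placeholder values throughout.
import pyarrow.flight as flight

PAT = "my_PAT"  # placeholder personal access token
QUERY = 'SELECT * FROM Samples."samples.dremio.com"."NYC-taxi-trips" LIMIT 10'

# TLS endpoint for the US control plane (use data.eu.dremio.cloud for EU).
client = flight.FlightClient("grpc+tls://data.dremio.cloud:443")

# Dremio authenticates Flight calls with a bearer token header.
options = flight.FlightCallOptions(
    headers=[(b"authorization", f"Bearer {PAT}".encode())]
)

# getFlightInfo returns the endpoint/ticket for the query results.
info = client.get_flight_info(
    flight.FlightDescriptor.for_command(QUERY), options
)

# getStream fetches the results; read them into a pandas DataFrame.
reader = client.do_get(info.endpoints[0].ticket, options)
print(reader.read_pandas())
```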

<div style="page-break-after: always;"></div>

# What You Can Do

Original URL: https://docs.dremio.com/dremio-cloud/developer/vs-code

The Dremio Visual Studio (VS) Code extension transforms VS Code into an AI-ready workspace, enabling you to discover, explore, and analyze enterprise data with natural language and SQL side by side, directly in your IDE.

The VS Code extension for Dremio allows you to:

* Connect across projects – Access one or more Dremio Cloud projects from within VS Code.
* Browse & discover with context – Explore governed objects in your catalog, complete with metadata and semantic context.
* Query with intelligence – Write and run SQL with autocomplete, formatting, and syntax highlighting, or let AI agents generate SQL for you.
* Explore and get insights using natural language – Use the built-in Microsoft Copilot integration to ask questions in plain English, moving from questions to insights faster, without leaving your development environment.

## Prerequisites

Before you begin, ensure you have:

* Access to a Dremio Cloud project.
* A personal access token (PAT) for connectivity to your project. For instructions, see [Create a PAT](/cloud/security/authentication/personal-access-token/#creating-a-pat).
* Visual Studio Code installed with access to the Extensions tab in the tool.

## Install VS Code Extension for Dremio

1. Launch VS Code and click the Extensions button on the left navigation toolbar.
2. Search for and click on the **Dremio** extension.
3. On the Dremio extension page, click **Install**.

Once the installation is complete, you're ready to start querying Dremio from VS Code.

## Connect to Dremio from VS Code

To create a connection from VS Code:

1. From the extension for Dremio, click the + button that appears when you hover over the **Connections** heading on the left panel.
2. For **Select your Dremio deployment**, select **Dremio Cloud**.
3. From the **Select a control plane** menu, select **US Control Plane** or **European Control Plane** based on where your Dremio Cloud organization is located.
4. Click **Personal Access Token**, enter the PAT that you previously generated, and press Enter.
5. The connection to your Dremio Cloud project appears on the left under **Connections**.
6. To browse your data, click `<your_dremio_account_email>` under your connection.

## Use the Copilot Integration

With Copilot in VS Code set to Agent mode, you can interact with your data through plain-language queries powered by Dremio's semantic layer. For example, try asking:

* "What curated views are available for financial analysis?"
* "Summarize sales trends over the last 90 days by product category."
* "Write SQL to compare revenue growth in North America vs. Europe."

Behind the scenes, Copilot taps into Dremio's AI Semantic Layer and autonomous optimization to ensure queries run with sub-second performance, whether executed by humans or AI agents.

<div style="page-break-after: always;"></div>

# Dremio MCP Server | Dremio Documentation

Original URL: https://docs.dremio.com/dremio-cloud/developer/mcp-server

The [Dremio MCP Server](https://github.com/dremio/dremio-mcp) is an open-source project that enables AI chat clients or agents to securely interact with your Dremio deployment using natural language. Connecting to the Dremio-hosted MCP Server is the fastest path to enabling external AI chat clients to work with Dremio. The Dremio-hosted MCP Server provides OAuth support, which guarantees and propagates the user identity, authentication, and authorization for all interactions with Dremio.

Once connected, you can use natural language to explore and query data, perform analysis and create visualizations, create views, and analyze system performance. While you can fork the open-source Dremio MCP Server for customization or install it locally for use with a personal AI chat client account, we recommend using the Dremio-hosted MCP Server, available to all projects, for experimentation, development, and production when possible.

## Configure Connectivity

Review the documentation below from AI chat client providers to verify that you meet the requirements for creating custom connectors before proceeding.

* [Claude Custom Connector Documentation](https://support.claude.com/en/articles/11175166-getting-started-with-custom-connectors-using-remote-mcp#h_3d1a65aded)
* [ChatGPT Custom Connector Documentation](https://help.openai.com/en/articles/11487775-connectors-in-chatgpt#h_a454f0d0b6)

To configure connectivity to your Dremio-hosted MCP Server, you first need to set up a [Native OAuth application](/dremio-cloud/security/authentication/app-authentication/oauth-apps) and provide the redirect URLs for the AI chat client you are using.

* If you are using Claude, fill in `https://claude.ai/api/mcp/auth_callback,https://claude.com/api/mcp/auth_callback,http://localhost/callback,http://localhost` as the redirect URLs for the OAuth application.
* If you are using ChatGPT, fill in `https://chatgpt.com/connector_platform_oauth_redirect,http://localhost` as the redirect URLs for the OAuth application.
* For a custom AI chat client, speak to your administrator.

Then configure the custom connector to the Dremio-hosted MCP Server by providing the client ID from the OAuth application and the MCP endpoint for your control plane.

* For Dremio instances using the US control plane, your MCP endpoint is `mcp.dremio.cloud/mcp/{project_id}`.
* For Dremio instances using the European control plane, your MCP endpoint is `mcp.eu.dremio.cloud/mcp/{project_id}`.
* If you are unsure of your endpoint, you can copy the **MCP endpoint** from the Project Overview page in Project Settings.

<div style="page-break-after: always;"></div>

# Arrow Flight | Dremio Documentation

Original URL: https://docs.dremio.com/dremio-cloud/developer/arrow-flight

You can create client applications that use [Arrow Flight](https://arrow.apache.org/docs/format/Flight.html) to query data lakes at data-transfer speeds greater than those possible with ODBC and JDBC, without incurring the cost in time and CPU resources of deserializing data. As the volumes of data that are transferred increase in size, the performance benefits from the use of Arrow Flight rather than ODBC or JDBC also increase.

You can run queries on datasets that are in the default project of a Dremio organization. Dremio is able to determine the organization and the default project from the authentication token that a Flight client uses. To query datasets in a non-default project, you can pass in the ID for the non-default project.

Dremio provides these endpoints for Arrow Flight connections:

* In the US control plane: `data.dremio.cloud:443`
* In the EU control plane: `data.eu.dremio.cloud:443`

All traffic within a control plane between Flight clients and Dremio goes through the endpoint for that control plane. However, Dremio can scale up or down automatically to accommodate increasing and decreasing traffic on the endpoint.

Unless you pass in a different project ID, Arrow Flight clients run queries only against datasets that are in the default project or on datasources that are associated with the default project. By default, Dremio uses the oldest project in an organization as that organization's default project.

## Supported Versions of Apache Arrow

Dremio supports client applications that use Arrow Flight in Apache Arrow version 6.0.

## Supported Authentication Method

Client applications can authenticate to Dremio with personal access tokens (PATs). To create a PAT, follow the steps in the section [Creating a Token](/dremio-cloud/security/authentication/personal-access-token#create-a-pat).

## Flight Sessions

A Flight session has a duration of 120 minutes during which a Flight client interacts with Dremio. A Flight client initiates a new session by passing a `getFlightInfo()` request that does not include a Cookie header that specifies a session ID that was obtained from Dremio. All requests that pass the same session ID are considered to be in the same session.

1. The Flight client, having obtained a PAT from Dremio, sends a `getFlightInfo()` request that includes the query to run, the URI for the endpoint, and the bearer token (PAT). A single bearer token can be used for requests until it expires.
2. If Dremio is able to authenticate the Flight client by using the bearer token, it sends a response that includes FlightInfo, a Set-Cookie header with the session ID, the bearer token, and a Set-Cookie header with the ID of the default project in the organization.

   FlightInfo responses from Dremio include the single endpoint for the control plane being used and the ticket for that endpoint. There is only one endpoint listed in FlightInfo responses.

   Session IDs are generated by Dremio.
3. The client sends a `getStream()` request that includes the ticket, a Cookie header for the session ID, the bearer token, and a Cookie header for the ID of the default project.
4. Dremio returns the query results in one flight.
5. The Flight client sends another `getFlightInfo()` request using the same session ID and bearer token. If this second request did not include the session ID that Dremio sent in response to the first request, then Dremio would send a new session ID and a new session would begin.
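
Because session continuity depends on echoing the Set-Cookie values back to Dremio, clients built directly on `pyarrow.flight` need a small cookie-preserving middleware. A minimal sketch, assuming pyarrow's client middleware API; this is illustrative only and is not the Dremio sample client:

```
# Sketch of a cookie-preserving middleware for pyarrow.flight, so that
# every call after the first reuses the session ID that Dremio issued.
# Illustrative only; header parsing is deliberately simplistic.
import pyarrow.flight as flight

class CookieMiddlewareFactory(flight.ClientMiddlewareFactory):
    def __init__(self):
        self.cookies = {}  # cookie name -> value, shared across calls

    def start_call(self, info):
        return CookieMiddleware(self)

class CookieMiddleware(flight.ClientMiddleware):
    def __init__(self, factory):
        self.factory = factory

    def sending_headers(self):
        # Replay all cookies received so far (session ID, project ID).
        if self.factory.cookies:
            cookie = "; ".join(
                f"{k}={v}" for k, v in self.factory.cookies.items()
            )
            return {"cookie": cookie}
        return {}

    def received_headers(self, headers):
        # Capture Set-Cookie values such as the Dremio session ID.
        for raw in headers.get("set-cookie", []):
            name, _, value = raw.split(";", 1)[0].partition("=")
            self.factory.cookies[name] = value

client = flight.FlightClient(
    "grpc+tls://data.dremio.cloud:443",
    middleware=[CookieMiddlewareFactory()],
)
```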

### Use a Non-Default Project

To run queries on datasets and data sources in non-default projects in Dremio, the `project_id` of the projects must be passed as a session option. The `project_id` is stored in the user session, and the server responds with a `Set-Cookie` header containing the session ID. The client must include this cookie in all subsequent requests.

To enable this behavior, a cookie middleware must be added to the Flight client. This middleware is responsible for managing cookies and will add the previous session ID to all subsequent requests.

After adding the middleware when initializing the client object, the `project_id` can be passed as a session option.

Here are examples of how to implement the `project_id` in Java and Go:

* Java
* Go

Pass in the ID for a non-default project in [Java](https://arrow.apache.org/docs/java/)

```
// Create a ClientCookieMiddleware
final FlightClient.Builder flightClientBuilder = FlightClient.builder();
final ClientCookieMiddleware.Factory cookieFactory = new ClientCookieMiddleware.Factory();
flightClientBuilder.intercept(cookieFactory);

// Add the project ID to the session options
final SetSessionOptionsRequest setSessionOptionRequest =
    new SetSessionOptionsRequest(ImmutableMap.<String, SessionOptionValue>builder()
        .put("project_id",
            SessionOptionValueFactory.makeSessionOptionValue(yourprojectid))
        .build());

// Close your session later once query is done
client.closeSession(new CloseSessionRequest(), bearerToken, headerCallOption);
```

Pass in the ID for a non-default project in [Go](https://github.com/apache/arrow-go)

```
// Create a ClientCookieMiddleware
client, err := flight.NewClientWithMiddleware(
    net.JoinHostPort(config.Host, config.Port),
    nil,
    []flight.ClientMiddleware{flight.NewClientCookieMiddleware()},
    grpc.WithTransportCredentials(creds),
)

// Close the session once the query is done
defer client.CloseSession(ctx, &flight.CloseSessionRequest{})

// Add the project ID to the session options
projectIdSessionOption, err := flight.NewSessionOptionValue(projectID)
sessionOptionsRequest := flight.SetSessionOptionsRequest{
    SessionOptions: map[string]*flight.SessionOptionValue{
        "project_id": &projectIdSessionOption,
    },
}
response, err = client.SetSessionOptions(ctx, &sessionOptionsRequest)
```

note

In Dremio, the term catalog is sometimes used interchangeably with `project_id`. Therefore, using catalog instead of `project_id` will also work when selecting a non-default project. We recommend using `project_id` for clarity. Throughout this documentation, we will consistently use `project_id`.

## Manage Workloads

Dremio administrators can use the Arrow Flight server endpoint to manage query workloads by adding the following connection properties to Flight clients:

| Flight Client Property | Description |
| --- | --- |
| `ENGINE` | Name of the engine to use to process all queries issued during the current session. |
| `SCHEMA` | The name of the schema (datasource or folder, including child paths, such as `mySource.folder1` and `folder1.folder2`) to use by default when a schema is not specified in a query. |
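
In the sample Python client, these properties are supplied through the `session_properties` key in `config.yaml`. With a raw `pyarrow.flight` client, one plausible way to attach them is sketched below, under the assumption that Dremio accepts these properties as gRPC call headers alongside the bearer token; the engine name, PAT, and schema path are placeholders:

```
# Sketch: attaching workload properties to Flight calls. Assumes Dremio
# accepts ENGINE/SCHEMA as gRPC call headers next to the bearer token;
# "preview" and the schema path below are placeholder values.
import pyarrow.flight as flight

options = flight.FlightCallOptions(
    headers=[
        (b"authorization", b"Bearer my_PAT"),   # placeholder PAT
        (b"engine", b"preview"),                # placeholder engine name
        (b"schema", b'Samples."samples.dremio.com"'),
    ]
)
# Pass `options` to get_flight_info()/do_get() as in the earlier sketch.
```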

## Sample Arrow Flight Client Applications

Dremio provides sample Arrow Flight client applications in several languages at [Dremio Hub](https://github.com/dremio-hub/arrow-flight-client-examples).

Both sample clients use the hostname `local` and the port number `32010` by default. Make sure you override these defaults with the hostname `data.dremio.cloud` or `data.eu.dremio.cloud` and the port number `443`.

note

The Python sample application only supports connecting to the default project in Dremio.

<div style="page-break-after: always;"></div>

# Data Formats | Dremio Documentation

Original URL: https://docs.dremio.com/dremio-cloud/developer/data-formats/

Dremio supports the following data formats:

* File Formats
  + Delimited text files, such as comma-separated values
  + JSON
  + ORC
  + [Parquet](/dremio-cloud/developer/data-formats/parquet)
* Table Formats
  + [Apache Iceberg](/dremio-cloud/developer/data-formats/iceberg)
  + [Delta Lake](/dremio-cloud/developer/data-formats/delta-lake)

<div style="page-break-after: always;"></div>

# Arrow Flight SQL | Dremio Documentation

Original URL: https://docs.dremio.com/dremio-cloud/developer/arrow-flight-sql

You can use Apache Arrow Flight SQL to develop client applications that interact with Dremio. Apache Arrow Flight SQL is a new API developed by the Apache Arrow community for interacting with SQL databases. For more information about Apache Arrow Flight SQL, see the documentation for the [Apache Arrow project](https://arrow.apache.org/docs/format/FlightSql.html#).

Through Flight SQL, client applications can run queries, create prepared statements, and fetch metadata about the SQL dialect supported by a datasource in Dremio, available types, defined tables, and more.

The requests for running queries are:

* CommandExecute
* CommandStatementUpdate

The commands on prepared statements are:

* ActionClosePreparedStatementRequest: Closes a prepared statement.
* ActionCreatePreparedStatementRequest: Creates a prepared statement.
* CommandPreparedStatementQuery: Runs a prepared statement.
* CommandPreparedStatementUpdate: Runs a prepared statement that updates data.

The metadata requests that Dremio supports are:

* CommandGetDbSchemas: Lists the schemas that are in a catalog.
* CommandGetTables: Lists the tables that are in a catalog or schema.
* CommandGetTableTypes: Lists the table types that are supported in a catalog or schema. The types are Table, View, and System Table.
* CommandGetSqlInfo: Retrieves information about the datasource and the SQL dialect that it supports.

Two clients are already implemented and available in the Apache Arrow repository on GitHub:

* [Client in C++](https://github.com/apache/arrow/blob/dfca6a704ad7e8e87e1c8c3d0224ba13b25786ea/cpp/src/arrow/flight/sql/client.h)
* [Client in Java](https://github.com/apache/arrow/blob/dfca6a704ad7e8e87e1c8c3d0224ba13b25786ea/java/flight/flight-sql/src/main/java/org/apache/arrow/flight/sql/FlightSqlClient.java)

note

At this time, you can only connect to the default project in Dremio.

## Use the Sample Client

You can download and try out the sample client from <https://github.com/dremio-hub/arrow-flight-sql-clients>. Extract the content of the file and then, in a terminal window, change to the `flight-sql-client-example` directory.

Before running the sample client, ensure that you have met these prerequisites:

* Add the Samples data lake to your Dremio project by clicking the icon in the **Data Lakes** section of the Datasets page.
* Ensure that Java 8 or later (up to Java 15) is installed on the system on which you run the example commands.

### Command Syntax for the Sample Client

Use this syntax when sending commands to the sample client:

Sample client usage

```
Usage: java -jar flight-sql-sample-client-application.jar -host data.dremio.cloud -port 443 ...

-command,--command <arg>                 Method to run
-dsv,--disableServerVerification <arg>   Disable TLS server verification.
                                         Defaults to false.
-host,--hostname <arg>                   `data.dremio.cloud` for Dremio's US control plane
                                         `data.eu.dremio.cloud` for Dremio's European control plane
-kstpass,--keyStorePassword <arg>        The jks keystore password.
-kstpath,--keyStorePath <arg>            Path to the jks keystore.
-pat,--personalAccessToken <arg>         Personal access token
-port,--flightport <arg>                 443
-query,--query <arg>                     The query to run
-schema,--schema <arg>                   The schema to use
-sp,--sessionProperty <arg>              Key value pairs of SessionProperty,
                                         example: -sp schema='Samples."samples.dremio.com"' -sp key=value
-table,--table <arg>                     The table to query
-tls,--tls <arg>                         Enable encrypted connection.
                                         Defaults to true.
```

### Examples

The examples demonstrate what is returned for each of these requests:

* CommandGetDbSchemas
* CommandGetTables
* CommandGetTableTypes
* CommandExecute

note

These examples use the Flight endpoint for Dremio's US control plane: `data.dremio.cloud`. To use Dremio's European control plane, use this endpoint instead: `data.eu.dremio.cloud`.

#### Flight SQL Request: CommandGetDbSchemas

This command submits a `CommandGetDbSchemas` request to list the schemas in a catalog.

Example CommandGetDbSchemas request

```
java -jar flight-sql-sample-client-application.jar -tls true -host data.dremio.cloud -port 443 --pat '<personal-access-token>' -command GetSchemas
```

Example output for CommandGetDbSchemas request

```
catalog_name    db_schema_name
null            @myUserName
null            INFORMATION_SCHEMA
null            Samples
null            sys
```

#### Flight SQL Request: CommandGetTables

This command submits a `CommandGetTables` request to list the tables that are in a catalog or schema.

Example CommandGetTables request

```
java -jar flight-sql-sample-client-application.jar -tls true -host data.dremio.cloud -port 443 --pat '<personal-access-token>' -command GetTables -schema INFORMATION_SCHEMA
```

If you have a folder in your schema, you can escape it like this:

Example CommandGetTables request with folder in schema

```
java -jar flight-sql-sample-client-application.jar -tls true -host data.dremio.cloud -port 443 --pat '<personal-access-token>' -command GetTables -schema "Samples\ (1).samples.dremio.com"
```

Example output for CommandGetTables request

```
catalog_name    db_schema_name        table_name    table_type
null            INFORMATION_SCHEMA    CATALOGS      SYSTEM_TABLE
null            INFORMATION_SCHEMA    COLUMNS       SYSTEM_TABLE
null            INFORMATION_SCHEMA    SCHEMATA      SYSTEM_TABLE
null            INFORMATION_SCHEMA    TABLES        SYSTEM_TABLE
null            INFORMATION_SCHEMA    VIEWS         SYSTEM_TABLE
```

#### Flight SQL Request: CommandGetTableTypes

This command submits a `CommandGetTableTypes` request to list the table types supported.

Example CommandGetTableTypes request

```
java -jar flight-sql-sample-client-application.jar -tls true -host data.dremio.cloud -port 443 --pat '<personal-access-token>' -command GetTableTypes
```

Example output for CommandGetTableTypes request

```
table_type
TABLE
SYSTEM_TABLE
VIEW
```

#### Flight SQL Request: CommandExecute

This command submits a `CommandExecute` request to run a single SQL statement.

Example CommandExecute request

```
java -jar flight-sql-sample-client-application.jar -tls true -host data.dremio.cloud -port 443 --pat '<personal-access-token>' -command Execute -query 'SELECT * FROM Samples."samples.<Dremio-user-name>.com"."NYC-taxi-trips" limit 10'
```

Example output for CommandExecute request

```
pickup_datetime     passenger_count    trip_distance_mi    fare_amount    tip_amount    total_amount
2013-05-27T19:15    1                  1.26                7.5            0.0           8.0
2013-05-31T16:40    1                  0.73                5.0            1.2           7.7
2013-05-27T19:03    2                  9.23                27.5           5.0           38.33
2013-05-31T16:24    1                  2.27                12.0           0.0           13.5
2013-05-27T19:17    1                  0.71                5.0            0.0           5.5
2013-05-27T19:11    1                  2.52                10.5           3.15          14.15
2013-05-31T16:41    5                  1.01                6.0            1.1           8.6
2013-05-31T16:37    1                  1.25                8.5            0.0           10.0
2013-05-31T16:39    1                  2.04                10.0           1.5           13.0
2013-05-27T19:02    1                  11.73               32.5           8.12          41.12
```

## Code Samples

### Create a FlightSqlClient

Refer to [this code sample](https://github.com/dremio-hub/arrow-flight-client-examples/blob/main/java/src/main/java/com/adhoc/flight/client/AdhocFlightClient.java) to create a `FlightClient`. Then, wrap your `FlightClient` in a `FlightSqlClient`:

Wrap FlightClient in FlightSqlClient

```
// Wraps a FlightClient in a FlightSqlClient
FlightSqlClient flightSqlClient = new FlightSqlClient(flightClient);

// Be sure to close the FlightSqlClient after using it
flightSqlClient.close();
```

### Retrieve a List of Database Schemas

This code issues a CommandGetDbSchemas metadata request:

CommandGetDbSchemas metadata request

```
String catalog = null; // The catalog. (may be null)
String dbSchemaFilterPattern = null; // The schema filter pattern. (may be null)
FlightInfo flightInfo = flightSqlClient.getSchemas(catalog, dbSchemaFilterPattern);
```

### Retrieve a List of Tables

This code issues a CommandGetTables metadata request:

CommandGetTables metadata request

```
String catalog = null; // The catalog. (may be null)
String dbSchemaFilterPattern = "Samples\\ (1).samples.dremio.com"; // The schema filter pattern. (may be null)
String tableFilterPattern = null; // The table filter pattern. (may be null)
List<String> tableTypes = null; // The table types to include. (may be null)
boolean includeSchema = false; // True to include the schema upon return, false to not include the schema.
FlightInfo flightInfo = flightSqlClient.getTables(catalog, dbSchemaFilterPattern, tableFilterPattern, tableTypes, includeSchema);
```

### Retrieve a List of Table Types That a Database Supports

This code issues a CommandGetTableTypes metadata request:

CommandGetTableTypes metadata request

```
FlightInfo flightInfo = flightSqlClient.getTableTypes();
```

### Run a Query

This code issues a CommandExecute request:

CommandExecute request

```
FlightInfo flightInfo = flightSqlClient.execute("SELECT * FROM Samples.\"samples.myUserName.com\".\"NYC-taxi-trips\" limit 10");
```

### Consume Data Returned for a Query

Consume data returned for query

```
FlightInfo flightInfo; // Use a FlightSqlClient method to get a FlightInfo

// 1. Fetch each partition sequentially (though this can be done in parallel)
for (FlightEndpoint endpoint : flightInfo.getEndpoints()) {

    // 2. Get a stream of results as Arrow vectors
    try (FlightStream stream = flightSqlClient.getStream(endpoint.getTicket())) {

        // 3. Iterate through the stream until the end
        while (stream.next()) {

            // 4. Get a chunk of results (VectorSchemaRoot) and print it to the console
            VectorSchemaRoot vectorSchemaRoot = stream.getRoot();
            System.out.println(vectorSchemaRoot.contentToTSVString());
        }
    }
}
```
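
The same request flow is also reachable from higher-level drivers. As a point of comparison, here is a minimal, hedged Python sketch using the ADBC Flight SQL driver (`adbc-driver-flightsql`); the driver, its option key, and the PAT value are assumptions not covered by this page:

```
# Minimal sketch: querying Dremio over Arrow Flight SQL with the ADBC
# Flight SQL driver. Assumes `pip install adbc-driver-flightsql pyarrow`;
# the option key and PAT below are assumptions, not from this page.
import adbc_driver_flightsql.dbapi as flightsql

conn = flightsql.connect(
    "grpc+tls://data.dremio.cloud:443",
    db_kwargs={
        # Sent as the Authorization header on every request.
        "adbc.flight.sql.authorization_header": "Bearer my_PAT",
    },
)

cur = conn.cursor()
cur.execute(
    'SELECT * FROM Samples."samples.dremio.com"."NYC-taxi-trips" LIMIT 10'
)
table = cur.fetch_arrow_table()  # results as a pyarrow.Table
print(table.to_pandas())

cur.close()
conn.close()
```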

## Client Interactions with Dremio

This sequence shows an example of how an Arrow Flight SQL client initiates a Flight session and runs a query, and what messages pass between the proxy at the Arrow Flight SQL endpoint, the control plane, and the execution plane.

1. The Flight client, having obtained a PAT from Dremio, calls the `execute()` method, which then sends a `getFlightInfo()` request. This request includes the query to run, the URI for the endpoint, and the bearer token (PAT). A single bearer token can be used for requests until it expires.

   A `getFlightInfo()` request initiates a new Flight session, which has a duration of 120 minutes. A Flight session is identified by its ID. Session IDs are generated by the proxy at the Arrow Flight SQL endpoint. All requests that pass the same session ID are considered to be in the same Flight session.
2. The bearer token includes the user ID and the organization ID. From those two pieces of information, the proxy at the endpoint determines the project ID, and then passes the organization ID, project ID, and user ID in the `getFlightInfo()` request that it forwards to the control plane.
3. If the control plane is able to authenticate the Flight client by using the bearer token, it sends a response that includes FlightInfo to the proxy.

   FlightInfo responses include the single endpoint for the control plane being used and the ticket for that endpoint. There is only one endpoint listed in FlightInfo responses.
4. The proxy at the endpoint adds the session ID and the project ID, and passes the response to the client.
5. The client sends a `getStream()` request that includes the ticket, a Cookie header for the session ID, the bearer token, and a Cookie header for the ID of the default project.
6. The proxy adds the organization ID and passes the `getStream()` request to the control plane.
7. The control plane devises the query plan and sends that to the execution plane.
8. The execution plane runs the query and sends the results to the control plane in one flight.
9. The control plane passes the results to the proxy.
10. The proxy passes the results to the client.

<div style="page-break-after: always;"></div>

# Apache Iceberg | Dremio Documentation

Original URL: https://docs.dremio.com/dremio-cloud/developer/data-formats/iceberg

[Apache Iceberg](https://iceberg.apache.org/docs/latest/) enables Dremio to provide powerful, SQL database-like functionality on data lakes using industry-standard SQL commands. Dremio currently supports [Iceberg v2](https://iceberg.apache.org/spec/#version-2) tables, offering a solid foundation for building and managing data lakehouse tables. Certain features, such as Iceberg native branching and tagging, and the UUID data type, are not yet supported.

For a deeper dive into Apache Iceberg, see:

* [Apache Iceberg: An Architectural Look Under the Covers](https://www.dremio.com/apache-iceberg-an-architectural-look-under-the-covers/)
* [What is Apache Iceberg?](https://www.dremio.com/data-lake/apache-iceberg/)

### Benefits of Iceberg Tables

Iceberg tables offer the following benefits over other formats traditionally used in the data lake:

* **[Schema evolution](https://iceberg.apache.org/docs/latest/evolution/):** Supports add, drop, update, or rename column commands with no side effects or inconsistency.
* **[Partition evolution](https://iceberg.apache.org/docs/latest/evolution/#partition-evolution):** Facilitates the modification of partition layouts in a table, such as for data volume or query pattern changes, without needing to rewrite the entire table.
* **Transactional consistency:** Helps users avoid partial or uncommitted changes by tracking atomic transactions with atomicity, consistency, isolation, and durability (ACID) properties.
* **Increased performance:** Ensures data files are intelligently filtered for accelerated processing via advanced partition pruning and column-level statistics.
* **Time travel:** Allows users to query any previous versions of the table to examine and compare data or reproduce results using previous queries.
* **[Automatic optimization](/dremio-cloud/manage-govern/optimization):** Optimizes query performance to maximize the speed and efficiency with which data is retrieved.
* **Version rollback:** Corrects any discovered problems quickly by resetting tables to a known good state.

## Clustering

Clustered Iceberg tables in Dremio make use of Z-ordering to provide a more intuitive data layout with comparable or better performance characteristics than Iceberg partitioning.

Iceberg clustering sorts individual records in data files based on the clustered columns provided in the [`CREATE TABLE`](/dremio-cloud/sql/commands/create-table) or [`ALTER TABLE`](/dremio-cloud/sql/commands/alter-table/) statement. The data-file-level clustering allows Parquet metadata to be used in query planning and execution to reduce the amount of data scanned as part of the query. In addition, clustering eliminates common problems with partitioned data, such as over-partitioned tables and partition skew.

Clustering provides a general-purpose file layout that enables both efficient reads and writes. However, you may not see immediate benefits from clustering if the tables are too small.

A common pattern is to choose clustered columns that are either primary keys of the table or commonly used in query filters. These column choices effectively filter the working dataset, thereby improving query times. Order the clustered columns by filtering precedence and cardinality, with the most commonly queried, highest-cardinality columns first.

#### Supported Data Types for Clustered Columns

Dremio Iceberg clustering supports clustered columns of the following data types:

* `DECIMAL`
* `INT`
* `BIGINT`
* `FLOAT`
* `DOUBLE`
* `VARCHAR`
* `VARBINARY`
* `DATE`
* `TIME`
* `TIMESTAMP`

Automated table maintenance eliminates the need to run optimizations for clustered Iceberg tables manually, although if you use manual optimization, its behavior differs based on whether or not tables are clustered.

For clustered tables, [`OPTIMIZE TABLE`](/dremio-cloud/sql/commands/optimize-table) incrementally reorders data to achieve the optimal data layout and manages file sizes. This mechanism may take longer to run on newly loaded or unsorted tables. Additionally, you may need to run multiple `OPTIMIZE TABLE` SQL commands to converge on an optimal file layout.

For unclustered tables, `OPTIMIZE TABLE` combines small files or splits large files to achieve an optimal file size, reducing metadata overhead and runtime file open costs.

#### CTAS Behavior and Clustering

When running a [`CREATE TABLE AS`](/dremio-cloud/sql/commands/create-table-as) statement with clustering, the data is written in an unordered way. For the best performance, run an `OPTIMIZE TABLE` SQL command after creating a table with `CREATE TABLE AS`, as in the sketch below.
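
A minimal sketch of that follow-up step from Python, reusing the hedged ADBC Flight SQL connection shown on the Arrow Flight SQL page; the driver choice, option key, PAT, and table path are all assumptions:

```
# Sketch: run OPTIMIZE TABLE right after a clustered CTAS, using the
# hedged ADBC Flight SQL connection from the Arrow Flight SQL page.
# The PAT and the catalog/table path below are placeholders.
import adbc_driver_flightsql.dbapi as flightsql

conn = flightsql.connect(
    "grpc+tls://data.dremio.cloud:443",
    db_kwargs={"adbc.flight.sql.authorization_header": "Bearer my_PAT"},
)
cur = conn.cursor()

# CTAS writes data unordered, so compact/reorder it immediately after.
cur.execute('OPTIMIZE TABLE mycatalog.sales."orders_clustered"')
print(cur.fetchall())  # OPTIMIZE returns a small summary result set

cur.close()
conn.close()
```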
|
|
897
|
+
|
|
898
|
+
## Iceberg Table Management
|
|
899
|
+
|
|
900
|
+
Learn how to manage Iceberg tables in Dremio with supported Iceberg features such as expiring snapshots and optimizing tables.
|
|
901
|
+
|
|
902
|
+
### Vacuum
|
|
903
|
+
|
|
904
|
+
Each write to an Iceberg table creates a snapshot of that table, which is a timestamped version of the table. As snapshots accumulate, data files that are no longer referenced in recent snapshots take up more and more storage. Additionally, the more snapshots a table has, the larger its metadata becomes. You can expire older snapshots to delete the data files that are unique to them and to remove them from table metadata. It is recommended that you expire snapshots regularly. For the SQL command to expire snapshots, see [`VACUUM TABLE`](/dremio-cloud/sql/commands/vacuum-table/).
|
|
905
|
+
|
|
906
|
+
Sometimes failed SQL commands may leave orphan data files in the table location that are no longer referenced by any active snapshot of the table. You can remove orphan files in the table location by running `remove_orphan_files`. See [`VACUUM TABLE`](/dremio-cloud/sql/commands/vacuum-table/) for details.
|
|
907
|
+
|
|
908
|
+
### Optimization

Dremio provides [automatic optimization](/dremio-cloud/manage-govern/optimization/), which automatically maintains Iceberg tables in the Open Catalog using a dedicated engine configured by Dremio. However, for immediate optimization, you can use the [`OPTIMIZE TABLE`](/dremio-cloud/sql/commands/optimize-table) SQL command and route jobs to specific engines in your project by creating a routing rule with the `query_label()` condition and the `OPTIMIZATION` label. For more information, see [Workload Management](/dremio-cloud/admin/engines/workload-management).

When optimizing tables manually, you can use the following options (see the sketch after this list):

* [`FOR PARTITIONS`](/dremio-cloud/sql/commands/optimize-table/) to optimize selected partitions.
* [`MIN_INPUT_FILES`](/dremio-cloud/sql/commands/optimize-table/) to set the minimum number of qualified files needed for compaction. Delete files count toward determining whether the minimum threshold is reached.

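The sketch below shows each option on a hypothetical table with a hypothetical `region` partition column; see [`OPTIMIZE TABLE`](/dremio-cloud/sql/commands/optimize-table/) for the full syntax:

```
-- Compact only the selected partition.
OPTIMIZE TABLE demo.sales
FOR PARTITIONS region = 'EMEA';

-- Compact only when at least 20 qualified files exist.
OPTIMIZE TABLE demo.sales (MIN_INPUT_FILES = 20);
```
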
## Iceberg Catalogs in Dremio

The Apache Iceberg table format uses an Iceberg catalog service to track snapshots and ensure transactional consistency between tools. For more information about how Iceberg catalogs and tables work together, see [Iceberg Catalog](https://www.dremio.com/resources/guides/apache-iceberg-an-architectural-look-under-the-covers/#toc_item_Iceberg%20catalog).

note

Currently, Dremio does not support the Amazon DynamoDB or JDBC catalogs. For additional information on limitations of Apache Iceberg as implemented in Dremio, see Limitations.

The catalog is the source of truth for the current metadata pointer for a table. You can use [Dremio's Open Catalog](/dremio-cloud/developer/data-formats/iceberg/#iceberg-catalogs-in-dremio) as the catalog for all your tables. You can also add external Iceberg catalogs as sources in Dremio, which lets you work with Iceberg tables that are not cataloged in Dremio's Open Catalog. The following Iceberg catalogs can be added as sources:

* AWS Glue Data Catalog
* Iceberg REST Catalog
* Snowflake Open Catalog
* Unity Catalog

Once a table is created with a specific catalog, you must continue using that same catalog to access the table. For example, if you create a table using AWS Glue as the catalog, you cannot later access that table by adding its S3 location as a source in Dremio. You must add the AWS Glue Data Catalog as a source and access the table through it.

## Rollbacks

When you modify an Iceberg table using data definition language (DDL) or data manipulation language (DML), each change creates a new [snapshot](https://iceberg.apache.org/terms/#snapshot) in the table's metadata. The Iceberg [catalog](/dremio-cloud/developer/data-formats/iceberg/#iceberg-catalogs-in-dremio) tracks the current snapshot through a root pointer.

You can use the [`ROLLBACK TABLE`](/dremio-cloud/sql/commands/rollback-table) SQL command to roll back a table by redirecting this pointer to an earlier snapshot, which is useful for undoing recent data errors. Rollbacks can target a specific timestamp or snapshot ID.

When you perform a rollback, Dremio creates a new snapshot identical to the selected one. For example, if a table has snapshots (1) `first_snapshot`, (2) `second_snapshot`, and (3) `third_snapshot`, rolling back to `first_snapshot` restores the table to that state while preserving all snapshots for time travel queries.

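For example, a sketch on a hypothetical table, with an illustrative snapshot ID and timestamp:

```
-- Roll back to a specific snapshot ID.
ROLLBACK TABLE demo.sales TO SNAPSHOT '4593617672172810914';

-- Or roll back to the table state as of a given timestamp.
ROLLBACK TABLE demo.sales TO TIMESTAMP '2024-01-01 00:00:00.000';
```
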
## SQL Command Compatibility

Dremio supports running most combinations of concurrent SQL commands on Iceberg tables. For example, two [`INSERT`](/dremio-cloud/sql/commands/insert) commands can run concurrently on the same table, as can two [`SELECT`](/dremio-cloud/sql/commands/SELECT) commands, or an [`UPDATE`](/dremio-cloud/sql/commands/update) and an [`ALTER`](/dremio-cloud/sql/commands/alter-table) command.

However, Apache Iceberg’s serializable isolation level with non-locking table semantics can result in scenarios in which write collisions occur. In these circumstances, the SQL command that finishes second fails with an error. Such failures occur only for a subset of combinations of two SQL commands running concurrently on a single Iceberg table.

The following table shows which types of SQL commands can and cannot run concurrently with other types on a single Iceberg table:

* Y: Running these two types of commands concurrently is supported.
* N: Running these two types of commands concurrently is not supported. The second command to complete fails with an error.
* D: Running two [`OPTIMIZE`](/dremio-cloud/sql/commands/optimize-table) commands concurrently is supported if they run against different table partitions.



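To illustrate the `D` case, two `OPTIMIZE` commands such as the following can run at the same time because they target disjoint partitions (hypothetical table and partition values):

```
-- Session 1:
OPTIMIZE TABLE demo.sales FOR PARTITIONS region = 'US';

-- Session 2, running concurrently, succeeds because it touches
-- a different partition:
OPTIMIZE TABLE demo.sales FOR PARTITIONS region = 'EU';
```
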
## Table Properties

The following Apache Iceberg table properties are supported in Dremio. You can use these properties to configure aspects of Apache Iceberg tables:

| Property | Description | Default |
| --- | --- | --- |
| commit.manifest.target-size-bytes | The target size when merging manifest files. | `8 MB` |
| commit.status-check.num-retries | The number of times to check whether a commit succeeded after a connection is lost before failing due to an unknown commit state. | `3` |
| compatibility.snapshot-id-inheritance.enabled | Enables committing snapshots without explicit snapshot IDs. | `false` (always `true` if the format version is > 1) |
| format-version | The table’s format version defined in the Spec. Options: `1` or `2` | `2` |
| history.expire.max-snapshot-age-ms | The maximum age (in milliseconds) of snapshots to keep as expiring snapshots. | `432000000` (5 days) |
| history.expire.min-snapshots-to-keep | The default minimum number of snapshots to keep as expiring snapshots. | `1` |
| write.delete.mode | The table’s method for handling row-level deletes. See [Row-Level Changes on the Lakehouse: Copy-On-Write vs. Merge-On-Read in Apache Iceberg](https://www.dremio.com/blog/row-level-changes-on-the-lakehouse-copy-on-write-vs-merge-on-read-in-apache-iceberg/) for more information on which mode is best for your table’s DML operations. Options: `copy-on-write` or `merge-on-read` | `copy-on-write` |
| write.merge.mode | The table’s method for handling row-level merges. See [Row-Level Changes on the Lakehouse: Copy-On-Write vs. Merge-On-Read in Apache Iceberg](https://www.dremio.com/blog/row-level-changes-on-the-lakehouse-copy-on-write-vs-merge-on-read-in-apache-iceberg/) for more information on which mode is best for your table’s DML operations. Options: `copy-on-write` or `merge-on-read` | `copy-on-write` |
| write.metadata.compression-codec | The metadata compression codec. Options: `none` or `gzip` | `none` |
| write.metadata.delete-after-commit.enabled | Controls whether to delete the oldest tracked version metadata files after commit. | `false` |
| write.metadata.metrics.column.col1 | Metrics mode for column `col1` to allow per-column tuning. Options: `none`, `counts`, `truncate(length)`, or `full` | (not set) |
| write.metadata.metrics.default | Default metrics mode for all columns in the table. Options: `none`, `counts`, `truncate(length)`, or `full` | `truncate(16)` |
| write.metadata.metrics.max-inferred-column-defaults | Defines the maximum number of top-level columns for which metrics are collected. The number of stored metrics can be higher than this limit for a table with nested fields. | `100` |
| write.metadata.previous-versions-max | The maximum number of previous version metadata files to keep before deleting after commit. | `100` |
| write.parquet.compression-codec | The Parquet compression codec. Options: `zstd`, `gzip`, `snappy`, or `uncompressed` | `zstd` |
| write.parquet.compression-level | The Parquet compression level. Supported for `gzip` and `zstd`. | `null` |
| write.parquet.dict-size-bytes | The Parquet dictionary page size (in bytes). | `2097152` (2 MB) |
| write.parquet.page-row-limit | The Parquet page row limit. | `20000` |
| write.parquet.page-size-bytes | The Parquet page size (in bytes). | `1048576` (1 MB) |
| write.parquet.row-group-size-bytes | The Parquet row group size. Dremio uses this property as a target file size because it writes one row group per Parquet file. When set, the `store.parquet.block-size` and `dremio.iceberg.optimize.target_file_size_mb` support keys are ignored. | `134217728` (128 MB) |
| write.summary.partition-limit | Includes partition-level summary stats in snapshot summaries if the changed partition count is less than this limit. | `0` |
| write.update.mode | The table’s method for handling row-level updates. See [Row-Level Changes on the Lakehouse: Copy-On-Write vs. Merge-On-Read in Apache Iceberg](https://www.dremio.com/blog/row-level-changes-on-the-lakehouse-copy-on-write-vs-merge-on-read-in-apache-iceberg/) for more information on which mode is best for your table’s DML operations. Options: `copy-on-write` or `merge-on-read` | `copy-on-write` |

You can configure these properties when you [create](/dremio-cloud/sql/commands/create-table) or [alter](/dremio-cloud/sql/commands/alter-table) Iceberg tables.

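For instance, a sketch with a hypothetical table that sets properties at creation, changes one later, and inspects the result:

```
-- Set properties when the table is created.
CREATE TABLE demo.events (id BIGINT, payload VARCHAR)
TBLPROPERTIES (
    'write.delete.mode' = 'merge-on-read',
    'write.parquet.compression-codec' = 'zstd'
);

-- Change a property on an existing table.
ALTER TABLE demo.events
SET TBLPROPERTIES ('write.metadata.delete-after-commit.enabled' = 'true');

-- List the properties currently set on the table.
SHOW TBLPROPERTIES demo.events;
```
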
Dremio uses the Iceberg default value for table properties that are not set. See Iceberg's documentation for the full list of [table properties](https://iceberg.apache.org/docs/latest/configuration/#table-properties). To view the properties that are set for a table, use the SQL command [`SHOW TBLPROPERTIES`](/dremio-cloud/sql/commands/show-table-properties).

In cases where Dremio has a support key for a feature covered by a table property, Dremio uses the table property instead of the support key.

## Limitations

The following are limitations of Apache Iceberg as implemented in Dremio:

* Only the Parquet file format is currently supported. Other formats (such as ORC and Avro) are not supported at this time.
* Amazon DynamoDB and JDBC catalogs are currently not supported.
* DynamoDB cannot be used as a lock manager with the Hadoop catalog on Amazon S3.
* Dremio caches query plans for recently executed statements to improve query performance. However, running a rollback query that uses a snapshot ID invalidates all cached query plans that reference the affected table.
* If DML operations are running on a table when a rollback query that uses a snapshot ID executes, those DML operations can fail to complete because the rollback changes the current snapshot ID. However, `SELECT` queries that are already executing can still complete.
* Clustering keys must be columns in the table. Transformations are not supported.
* You can run only one optimize query at a time on a given Iceberg table partition.
* The optimize functionality does not support sort ordering.

## Related Topics

* [Automatic Optimization](/dremio-cloud/manage-govern/optimization/) – Learn how Dremio optimizes Iceberg tables automatically.
* [Load Data Into Tables](/dremio-cloud/bring-data/load/) – Load data from CSV, JSON, or Parquet files into existing Iceberg tables.
* [SQL Commands](/dremio-cloud/sql/commands/) – See the syntax of the SQL commands that Dremio supports for Iceberg tables.

<div style="page-break-after: always;"></div>

# Parquet | Dremio Documentation

Original URL: https://docs.dremio.com/dremio-cloud/developer/data-formats/parquet

On this page

This topic provides general information and recommendations for Parquet files.

## Read Parquet Files

Dremio's vectorized Parquet file reader improves parallelism on columnar data, reduces latencies, and enables more efficient resource and memory usage.

Dremio supports off-heap memory buffers for reading Parquet files.

Dremio supports file compression with `snappy`, `gzip`, and `zstd` for reading Parquet files.

## Parquet Limitations

Take the following limitations into consideration when generating and configuring Parquet files. Failure to adhere to these restrictions may cause errors when using Parquet files with Dremio.

* **Maximum nesting depth is restricted to 16 levels.** Structs may be nested up to a total depth of 16; exceeding this limit results in a failed query.
* **Maximum allowable elements in an array are restricted to 128.** An array with more than 128 elements results in a query failure.
* **Maximum footer size is restricted to 16 MB.** The footer consists of metadata, including the format version, the schema, extra key-value pairs, and metadata for the columns in the file. When the footer exceeds this size, a query failure occurs.

## Recommended Configuration

When using other tools to generate Parquet files for consumption in Dremio, we recommend the following configuration:

| Type | Implementation |
| --- | --- |
| Row Groups | Use a single row group per file, ideally targeting 1 MB–25 MB column stripes for most datasets. By default, Dremio uses 256 MB row groups for the Parquet files that it generates. |
| Pages | Use Snappy compression and a target page size of ~100K. |
| Statistics | Use a recent Parquet library to avoid bad-statistics issues. |

<div style="page-break-after: always;"></div>

# Delta Lake | Dremio Documentation

Original URL: https://docs.dremio.com/dremio-cloud/developer/data-formats/delta-lake

On this page

[Delta Lake](https://docs.delta.io/latest/index.html) is an open-source table format that provides transactional consistency and increased scale for datasets by creating a consistent definition of each dataset, including schema evolution changes and data mutations. With Delta Lake, all applications consuming a dataset see updates in a consistent manner, and users never see inconsistent views of data during transformations. Consistent and reliable views of datasets in a data lake are maintained even as the datasets are updated and modified over time.

Data consistency for a dataset is enabled through a series of manifest files that define the schema and data for a given point in time, together with a transaction log that records an ordered history of every transaction on the dataset. By reading the transaction log and manifest files, applications are guaranteed a consistent view of data at any point in time, and intermediate changes remain invisible until a write operation is complete.

Delta Lake provides the following benefits:

* Large-scale support: Efficient metadata handling enables applications to readily process petabyte-sized datasets with millions of files.
* Schema consistency: All applications processing a dataset operate on a consistent, shared definition of the dataset metadata, such as columns, data types, and partitions.

## Supported Data Sources

The Delta Lake table format is supported, with data files in the Parquet format, on the following sources:

* [Amazon S3](/dremio-cloud/bring-data/connect/object-storage/amazon-s3)
* [AWS Glue Data Catalog](/dremio-cloud/bring-data/connect/catalogs/aws-glue-data-catalog)

## Analyze Delta Lake Datasets

Dremio supports analyzing Delta Lake datasets on the sources listed above through a native, high-performance reader. It automatically identifies which datasets are saved in the Delta Lake format and imports table information from the Delta Lake manifest files. Dataset promotion is seamless and operates the same as for any other data format in Dremio: users can promote file system directories containing a Delta Lake dataset to a table manually, or automatically by querying the directory. With the Delta Lake format, Dremio supports datasets of any size, including petabyte-sized datasets with billions of files.

Dremio reads Delta Lake tables created or updated by another engine, such as Spark, with transactional consistency. Dremio automatically identifies tables that are in the Delta Lake format and selects the appropriate format for the user.

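For instance, a directory-backed Delta Lake dataset can be promoted simply by querying it; the source name and path below are hypothetical:

```
-- Querying the directory promotes it to a table (if not already
-- promoted) and reads it through the native Delta Lake reader.
SELECT COUNT(*)
FROM s3."my-bucket"."warehouse"."sales_delta";
```
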
### Refresh Metadata

Metadata refresh is required to query the latest version of a Delta Lake table. You can wait for an automatic refresh of metadata or refresh it manually.

#### Example of Querying a Delta Lake Table

Perform the following steps to query a Delta Lake table:

1. In Dremio, open the **Datasets** page.
2. Go to the data source that contains the Delta Lake table.
3. If the data source is not an AWS Glue Data Catalog, follow these steps:
   1. Hover over the row for the table and click  to the right. Dremio automatically identifies tables that are in the Delta Lake format and selects the appropriate format.
   2. Click **Save**.
4. If the data source is an AWS Glue Data Catalog, hover over the row for the table and click  to the right.
5. Run a query on the Delta Lake table to see the results.
6. Update the table in the data source.
7. Go back to the **Datasets** page and wait for the table metadata to refresh, or manually refresh it using the syntax below.

   Syntax to manually refresh table metadata

   ```
   ALTER TABLE `<path_of_the_dataset>`
   REFRESH METADATA
   ```

   The following statement shows how to refresh the metadata of a Delta Lake table.

   Example command to manually refresh table metadata

   ```
   ALTER TABLE s3."data.dremio.com".data.deltalake."tpcds10_delta"."call_center"
   REFRESH METADATA
   ```

8. Run the previous query on the Delta Lake table to retrieve the results from the updated table.

## Limitations

* Creating Delta Lake tables is not supported.
* DML operations are not supported.
* Incremental Reflections are not supported.
* Metadata refresh is required to query the latest version of a Delta Lake table.
* Time travel and data versioning are not supported.
* Only Delta Lake tables with minReaderVersion 1 or 2 can be read. Column Mapping is supported with minReaderVersion 2.

<div style="page-break-after: always;"></div>