dremiojs 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.eslintrc.json +14 -0
- package/.prettierrc +7 -0
- package/README.md +59 -0
- package/dremiodocs/dremio-cloud/cloud-api-reference.md +748 -0
- package/dremiodocs/dremio-cloud/dremio-cloud-about.md +225 -0
- package/dremiodocs/dremio-cloud/dremio-cloud-admin.md +3754 -0
- package/dremiodocs/dremio-cloud/dremio-cloud-bring-data.md +6098 -0
- package/dremiodocs/dremio-cloud/dremio-cloud-changelog.md +32 -0
- package/dremiodocs/dremio-cloud/dremio-cloud-developer.md +1147 -0
- package/dremiodocs/dremio-cloud/dremio-cloud-explore-analyze.md +2522 -0
- package/dremiodocs/dremio-cloud/dremio-cloud-get-started.md +300 -0
- package/dremiodocs/dremio-cloud/dremio-cloud-help-support.md +869 -0
- package/dremiodocs/dremio-cloud/dremio-cloud-manage-govern.md +800 -0
- package/dremiodocs/dremio-cloud/dremio-cloud-overview.md +36 -0
- package/dremiodocs/dremio-cloud/dremio-cloud-security.md +1844 -0
- package/dremiodocs/dremio-cloud/sql-docs.md +7180 -0
- package/dremiodocs/dremio-software/dremio-software-acceleration.md +1575 -0
- package/dremiodocs/dremio-software/dremio-software-admin.md +884 -0
- package/dremiodocs/dremio-software/dremio-software-client-applications.md +3277 -0
- package/dremiodocs/dremio-software/dremio-software-data-products.md +560 -0
- package/dremiodocs/dremio-software/dremio-software-data-sources.md +8701 -0
- package/dremiodocs/dremio-software/dremio-software-deploy-dremio.md +3446 -0
- package/dremiodocs/dremio-software/dremio-software-get-started.md +848 -0
- package/dremiodocs/dremio-software/dremio-software-monitoring.md +422 -0
- package/dremiodocs/dremio-software/dremio-software-reference.md +677 -0
- package/dremiodocs/dremio-software/dremio-software-security.md +2074 -0
- package/dremiodocs/dremio-software/dremio-software-v25-api.md +32637 -0
- package/dremiodocs/dremio-software/dremio-software-v26-api.md +36757 -0
- package/jest.config.js +10 -0
- package/package.json +25 -0
- package/src/api/catalog.ts +74 -0
- package/src/api/jobs.ts +105 -0
- package/src/api/reflection.ts +77 -0
- package/src/api/source.ts +61 -0
- package/src/api/user.ts +32 -0
- package/src/client/base.ts +66 -0
- package/src/client/cloud.ts +37 -0
- package/src/client/software.ts +73 -0
- package/src/index.ts +16 -0
- package/src/types/catalog.ts +31 -0
- package/src/types/config.ts +18 -0
- package/src/types/job.ts +18 -0
- package/src/types/reflection.ts +29 -0
- package/tests/integration_manual.ts +95 -0
- package/tsconfig.json +19 -0
@@ -0,0 +1,869 @@

# Help and Support | Dremio Documentation

Original URL: https://docs.dremio.com/dremio-cloud/help-support/

This section contains additional details, including:

* [Limits](/dremio-cloud/help-support/limits/)
* [Well-Architected Framework](/dremio-cloud/help-support/well-architected-framework/)
* [Keyboard Shortcuts](/dremio-cloud/help-support/keyboard-shortcuts/)

<div style="page-break-after: always;"></div>

# Limits | Dremio Documentation

Original URL: https://docs.dremio.com/dremio-cloud/help-support/limits

For each organization, Dremio imposes limits on the use of its resources. The following limits are grouped according to Dremio components.

Please contact Dremio to discuss if extra capacity is required.

## Organization

| Item | Enterprise Trial | Enterprise Paid |
| --- | --- | --- |
| Number of projects | 1 | 200 |
| Enterprise identity providers | 1 | 1 |
| Number of users | 5 | 15,000 |
| User invitations at once | 5 | 10 |
| Pending user invitations | 5 | 100 |
| Daily user invitations | 5 | 100 |
| Number of custom roles | 10 | 5,000 |
| Layers of nested roles | 10 | 10 |
| Direct custom role members | 5 | 1,000 |

## Projects

| Item | Enterprise Trial | Enterprise Paid |
| --- | --- | --- |
| Number of engines | 1 | 50 |
| Number of sources | 100 | 100 |
| Folder nesting depth | 8 | 8 |
| Number of tables | Unlimited | Unlimited |
| Number of views | Unlimited | Unlimited |
| Number of scripts per user | 1,000 | 1,000 |
| ACLs update rate per minute | 600 | 600 |

## Engines

| Item | Enterprise Trial | Enterprise Paid |
| --- | --- | --- |
| Replica sizes | XS | 2XS, XS, S, M, L, XL, 2XL, 3XL |
| Number of replicas | 3 | 100 |
| Query concurrency | Determined by the replica sizes, as described in [Engines](/dremio-cloud/admin/engines/). | Determined by the replica sizes, as described in [Engines](/dremio-cloud/admin/engines/). |
| Query runtime max limit | Min 30 seconds | Min 30 seconds |

## Datasets

| Item | Enterprise Trial | Enterprise Paid |
| --- | --- | --- |
| Metadata refresh time - data lake | 15 minutes | 15 minutes |
| Metadata refresh time - RDBMS | 1 hour | 1 hour |
| Reflection refresh frequency | 1 hour | 1 hour |
| Wiki character limit | 100k characters | 100k characters |
| Number of JSON files | 300,000 | 300,000 |
| Row width | 16 MB | 16 MB |

## Reflections

| Item | Enterprise Trial | Enterprise Paid |
| --- | --- | --- |
| Maximum number of Reflections (including enabled and disabled Reflections) | 500 | 500 |
| Autonomous Reflections | 100 | 100 |

## Arrow Flight SQL (ADBC, ODBC, and JDBC)

| Item | Enterprise Trial | Enterprise Paid |
| --- | --- | --- |
| Max returned data volume | 10 GB | 10 GB |
| Flight Service Data Pipeline Drain Timeout | 50 seconds | 50 seconds |

## Rate Limits

Rate limits are enforced on a single IP address and apply across all organizations and projects.

| Item | Enterprise Trial | Enterprise Paid |
| --- | --- | --- |
| Login rate per user per second | 100 | 100 |
| SCIM reads per minute - user | 180 | 180 |
| SCIM writes per minute - user | 300 | 300 |
| API calls per minute | 1,200 | 1,200 |
| API: `/job/{id}/results` calls per minute | 1,000 | 1,000 |
| API: `/job/{id}/cancel` calls per minute | 100 | 100 |
| API: `/job/{id}` calls per minute | 100 | 100 |
| API: `/login/userpass` calls per second | 45 | 45 |
| Access control list update rate per minute | 60 | 60 |

<div style="page-break-after: always;"></div>

# Keyboard Shortcuts | Dremio Documentation

Original URL: https://docs.dremio.com/dremio-cloud/help-support/keyboard-shortcuts

Keyboard shortcuts for functions supported by Dremio are available for macOS, Windows, and Linux.

## SQL Editor

While using the SQL editor on the SQL Runner page or Datasets page, you can use shortcuts for commonly used actions, as shown in the following table:

| Function | macOS Shortcut | Windows/Linux Shortcut |
| --- | --- | --- |
| Preview | Cmd + Enter | Ctrl + Enter |
| Run | Cmd + Shift + Enter | Ctrl + Shift + Enter |
| Search | Cmd + K | Ctrl + K |
| Comment Out/In | Cmd + / | Ctrl + / |
| Find | Cmd + F | Ctrl + F |
| Trigger Autocomplete | Ctrl + Space | Ctrl + Space |
| Format Query | Cmd + Shift + F | Ctrl + Shift + F |
| Delete Line | Cmd + Shift + K | Ctrl + Shift + K |
| Toggle AI Agent | Cmd + Shift + G | Ctrl + Shift + G |

<div style="page-break-after: always;"></div>

# Well-Architected Framework | Dremio Documentation

Original URL: https://docs.dremio.com/dremio-cloud/help-support/well-architected-framework/

Dremio’s well-architected framework is a resource for anyone who is designing or operating solutions with Dremio. It provides insight from lessons learned through helping hundreds of customers be successful. The framework is composed of pillars that describe design principles as well as best practices based on those principles.

The Well-Architected Framework is complementary to the [Dremio Shared Responsibility Model](/assets/files/Dremio-Cloud-Shared-Responsibility-Model-15f76b24f0b48153532ca15b25d831c4.pdf). The Shared Responsibility Model lays out Dremio's responsibilities and your responsibilities for maintaining and operating an optimal Dremio environment, while the Well-Architected Framework provides details for carrying out your responsibilities.

## Key Pillars of Dremio’s Well-Architected Framework

Dremio’s well-architected framework follows five common pillars from the cloud providers AWS, Microsoft, and Google, plus a sixth Dremio-specific pillar:

1. [Security](/dremio-cloud/help-support/well-architected-framework/security)
2. [Performance Efficiency](/dremio-cloud/help-support/well-architected-framework/performance-efficiency)
3. [Cost Optimization](/dremio-cloud/help-support/well-architected-framework/cost-optimization)
4. [Reliability](/dremio-cloud/help-support/well-architected-framework/reliability)
5. [Operational Excellence](/dremio-cloud/help-support/well-architected-framework/operational-excellence)
6. [Self-Serve Semantic Layer](/dremio-cloud/help-support/well-architected-framework/self-serve-semantic-layer)

Each pillar includes principles, best practices, and how-to articles on the pillar's theme.

Dremio's well-architected framework covers best practices related to the configuration and operation of Dremio. Read [Architecture](/dremio-cloud/about/architecture) for more information about the Dremio architecture.

<div style="page-break-after: always;"></div>

# Security | Dremio Documentation

Original URL: https://docs.dremio.com/dremio-cloud/help-support/well-architected-framework/security

The security pillar is essential to ensuring that your data is secured properly when using Dremio to query your data lakehouse. The security components are especially important when you architect and design your data platform. After your workloads are in production, you must continue to review your security components to ensure compliance and eliminate threats.

## Principles

### Leverage Industry-Standard Identity Providers and Authorization Systems

Dremio integrates with leading social and enterprise identity providers and data authorization systems. For robust enterprise integration with corporate policies, it is essential to leverage those third-party systems. We recommend systems that use multi-factor authentication methods and are connected to single sign-on (SSO) platforms.

### Design for Least-Privilege Access to Objects

When providing self-service access to your data lakehouse via Dremio’s [AI semantic layer](/dremio-cloud/help-support/well-architected-framework/self-serve-semantic-layer), grant access only to the data that the accessing role requires.

## Best Practices

### Protect Access Credentials

Where possible, leverage identity providers such as [Microsoft Entra ID](/dremio-cloud/security/authentication/idp/microsoft-entra-id) and [Okta](/dremio-cloud/security/authentication/idp/okta), in conjunction with [System for Cross-domain Identity Management (SCIM)](/dremio-cloud/security/authentication/idp/#scim) where applicable, so that you never need to share passwords with Dremio. SSO with Microsoft Entra ID or Okta is also recommended where possible.

### Leverage Role-Based Access Controls

Access to each catalog, folder, view, and table can be managed and regulated by [roles](/dremio-cloud/security/roles). Roles are used to organize privileges at scale rather than managing privileges for each individual user. You can create roles to manage privileges for users with different job functions in your organization, such as “Analyst” and “Security\_Admin” roles. Users who are members of a role gain all of the privileges granted to the role. Roles can also be nested. For example, users in the “UK” role can automatically be members of the “EMEA” role.

Access control protects the integrity of your data and simplifies the data architecture available to users based on their roles and responsibilities within your organization. Effective controls allow users to access data that is central to their work without regard for the complexities of where and how the data is physically stored and organized.

<div style="page-break-after: always;"></div>

# Reliability | Dremio Documentation

Original URL: https://docs.dremio.com/dremio-cloud/help-support/well-architected-framework/reliability

The reliability pillar focuses on ensuring that your system is up and running and can be quickly and efficiently restored in case of unexpected downtime.

## Principles

### Set Engine Routing Rules and Engine Settings

Dremio’s engine routing rules and engine settings are powerful and protect the system from being overloaded by queries that exceed currently available resources.

### Monitor and Measure Platform Activity

To ensure the reliability of your Dremio project, you must regularly monitor and measure its activity.

## Best Practices

### Initialize Engine Routing and Engine Settings

It is important to set up engine routing rules and engines with sensible concurrency, replica, and time limits. It is better to spin up replicas at sensible concurrency limits than to risk a large number of rogue queries bringing down the engine.

### Use the Monitor Page in the Dremio Console

As an administrator using the Dremio console, you can effectively monitor catalog usage and jobs within your projects. The [Monitor page](/dremio-cloud/admin/monitor/) provides detailed visualizations and metrics that allow you to track usage patterns, resource consumption, and user impact.

In the Catalog Usage tab, you can view the 10 most-queried datasets and source folders, along with relevant statistics such as linked jobs and acceleration usage. The Catalog Usage tab excludes system tables and INFORMATION\_SCHEMA datasets and focuses solely on user queries.

In the Jobs tab, you can access comprehensive metrics on job performance, including daily job counts, failure rates, and user activity. Visualizations include graphs of completed and failed jobs, job states, and the 10 longest-running jobs, providing an overview of job execution and performance trends.

We recommend that administrators review the Monitor page frequently, including daily consumption patterns and the weekly and monthly aggregates. Monitoring insights such as the most-queried datasets over time can help administrators optimize performance, adapt their Reflection strategy, and leverage the jobs-per-engine distribution to improve workload management and resource allocation.

### Perform Impact Analysis if Security Rules Change

Dremio’s control plane interacts with your own virtual private clouds for query execution. If you make changes to your security rules after they are initially set and working correctly with Dremio, perform an impact analysis to make sure that your connectivity with Dremio remains unaffected.

<div style="page-break-after: always;"></div>

# Cost Optimization | Dremio Documentation

Original URL: https://docs.dremio.com/dremio-cloud/help-support/well-architected-framework/cost-optimization

Although it's important to get the best performance possible with Dremio, it's also important to optimize the costs associated with managing the Dremio platform.

## Principles

### Minimize Running Executor Nodes

Dremio can scale to many hundreds of nodes, but any given engine should have only as many nodes as are required to satisfy the current load and meet service-level agreements.

### Dynamically Scale Executor Nodes Up and Down

When running Dremio engines, designers can leverage the concurrency per replica and the minimum and maximum number of replicas to dynamically expand and contract capacity based on load.

### Eliminate Unnecessary Data Processing

As described in the [best practices for Pillar 2: Performance Efficiency](/dremio-cloud/help-support/well-architected-framework/performance-efficiency#leverage-reflections-to-improve-performance), creating too many Reflections, especially ones that perform similar work to other Reflections or provide little added query-performance benefit, can incur unnecessary costs because Reflections need system resources to rebuild. For this reason, consider removing any unnecessary Reflections.

To avoid processing data that is not required for a query to succeed, use filters that can be pushed down to the source wherever possible. Partitioning source data in line with those filters also helps speed up data retrieval.

Also, optimize source data files by merging smaller files or splitting larger files whenever possible.

## Best Practices

### Size Engines to the Minimum Replicas Required

To avoid accruing unnecessary cost, reduce the number of active replicas in your engines to the minimum (typically 1, but 0 when the engine is not in use, such as on weekends or during non-business hours). A minimum replica count of 0 delays the first query of the day due to engine startup, which you can mitigate with an external script that executes a dummy SQL statement prior to normal daily use, as sketched below.
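
A minimal sketch of such a warm-up, assuming an external scheduler (for example, cron) submits it shortly before business hours; the table name is illustrative:

```
-- Hypothetical warm-up query: any inexpensive statement routed to the
-- engine triggers a replica start before users arrive.
SELECT * FROM lake.sales.orders LIMIT 1
```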

### Remove Unused Reflections

Analyze the results in the Dremio `sys.project.jobs_recent` system table, along with the results from the system tables [`sys.project.reflections`](/dremio-cloud/sql/system-tables/reflections) and [`sys.project.materializations`](/dremio-cloud/sql/system-tables/materializations), to learn how frequently each Reflection in Dremio is leveraged. You can further analyze Reflections that are not being leveraged to determine whether any are still being refreshed and, if they are, how many times they have been refreshed in the reporting period and how many hours of cluster execution time they have been consuming.

Checking for and removing unused Reflections is good practice because it reduces clutter in the Reflection configuration and can free up many hours of cluster execution cycles for more critical workloads.
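
As a starting point, a query along the following lines can surface how often each Reflection has been materialized; the join key and column names are assumptions and may differ from the actual system-table schemas:

```
-- Sketch: count materializations per Reflection to spot refresh activity
-- on Reflections that queries no longer use.
SELECT r.reflection_name,
       COUNT(m.reflection_id) AS materialization_count
FROM sys.project.reflections r
LEFT JOIN sys.project.materializations m
  ON m.reflection_id = r.reflection_id
GROUP BY r.reflection_name
ORDER BY materialization_count DESC
```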

### Optimize Metadata Refresh Frequency

Ensure that metadata-refresh frequencies are set appropriately based on what you know about how often metadata changes in the data source.

The default metadata refresh frequency for data sources is once per hour, which is too frequent for many data sources. For example, if data in a source is updated only once every 6 hours, it is not necessary to refresh the datasets every hour. Instead, change the refresh schedule to every 6 hours in the data source settings.

Furthermore, because metadata refreshes can be scheduled at the data source level, overridden at each individual table level, and performed programmatically, it makes sense to review each new data source to determine the most appropriate setting for it. For example, for data lake sources, you might set a long metadata refresh schedule such as 3000 weeks so that the scheduled refresh is very unlikely to fire, and then perform the refresh programmatically as part of the extract, transform, and load (ETL) process, where you know when the data generation has completed. You might set relational data sources to refresh every few days, but then override the source-level setting for tables that change more frequently.

When datasets are updated as part of overnight ETL runs, it doesn’t make sense to refresh the dataset metadata until you know the ETL process is finished. In this case, you can create a script that triggers the manual refresh of each dataset in the ETL process after you know the dataset ETL is complete.

For data sources that contain a large number of datasets but few datasets that change their structure or have new files added, it makes little sense to refresh at the source level on a fixed schedule. Instead, set the source-level metadata refresh to a long timeframe, such as 52 weeks, and use scripts to trigger a manual refresh against a specific dataset, as in the sketch below.
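
For example, such a script might issue a per-dataset refresh statement at the end of the ETL run, using Dremio's `ALTER TABLE ... REFRESH METADATA` statement; the table path is illustrative:

```
-- Manually refresh metadata for one dataset once its ETL load completes.
ALTER TABLE lake.sales.orders REFRESH METADATA
```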

If you set the metadata refresh schedule for a long timeframe and you do not have any scripting mechanism to refresh your metadata, then when a query runs and the planner notices that the metadata is stale or invalid, Dremio performs an inline metadata refresh during the query planning phase. This can lengthen query execution because the refresh duration is incorporated into planning.

<div style="page-break-after: always;"></div>

# Operational Excellence | Dremio Documentation

Original URL: https://docs.dremio.com/dremio-cloud/help-support/well-architected-framework/operational-excellence

Following a regular schedule of maintenance tasks is key to keeping your Dremio project operating at peak performance and efficiency. The operational excellence pillar describes the tasks required to maintain an operationally healthy Dremio project.

## Principles

### Regularly Evaluate Engine Resources

As workloads expand and grow on your Dremio project, it is important to evaluate engine usage to ensure that you have correctly sized engines and the right number of replicas.

### Regularly Evaluate Query Performance

Regular query performance reviews help you identify challenges and mitigate them before they become a problem. For example, if you find an unacceptably large number of queries waiting on engine or replica starts, you can adjust the minimum, maximum, and last replica auto-stop settings. If you see an unacceptable number of query execution failures, you can adjust concurrency limits per replica more appropriately, or revisit the semantic layer and introduce Reflections to improve performance.

### Clean Up Tables with Vacuum

Open Catalog automates Iceberg maintenance operations like compaction and vacuum, which maximizes query performance, minimizes storage costs, and eliminates the need to run manual data maintenance.

### Optimize Tables

When operating on Iceberg tables and using Open Catalog, you can schedule [optimization](/dremio-cloud/developer/data-formats/iceberg#optimization) jobs to help you manage the accumulation of data files that occurs through data manipulation language (DML) operations. Regular maintenance ensures optimal query performance on these tables.

### Regularly Monitor Live Metrics for Dremio

To ensure smooth operations in Dremio, collect metrics and take action when appropriate. Read [Monitor](/dremio-cloud/admin/monitor/) for more details.

## Best Practices

### Optimize Workload Management Rules

Because workloads and volumes of queries change over time, you should periodically reevaluate workload management engine routing rules and engines, and adjust for optimal size, concurrency, and replica limits.

### Configure Engines

When possible, leverage engines to segregate workloads. Configuring engines and their usage offers the following benefits:

* Platform stability: if one engine goes down, it won’t affect other engines.
* Flexibility to start and stop engines on demand at certain times of day.
* Engines can be sized differently based on workload patterns.
* Queries from different tenants can be separated into their own engines to enable a chargeback model.

We recommend separate engines for the following types of workloads:

* Reflection refreshes.
* Metadata refreshes.
* API queries.
* Queries from BI tools.
* Extract, transform, and load (ETL)-type workloads like CREATE TABLE AS (CTAS) and Iceberg DML.
* Ad hoc data science queries with long execution times.

In multi-tenant environments, such as multiple departments or geographic locations where chargeback models can be implemented for resource usage, we recommend a separate set of engines per tenant.

### Optimize Query Performance

When developing the semantic layer, it is best to create the views in each of the three layers according to best practices without using Reflections, then test queries of the application-layer views to gauge baseline performance.

For queries that appear to be running sub-optimally, we recommend analyzing the query profile to determine whether any bottlenecks can be removed to improve performance. If performance issues persist, place Reflections where they will have the most benefit. A well-architected semantic layer allows you to place Reflections at strategic locations such that large volumes of queries benefit from the fewest number of Reflections, such as in the business layer where a view is constructed by joining several other views.

### Design Reflections for Expensive Query Patterns

1. Review query history (jobs) to determine the most expensive and most frequent queries being submitted (see the sketch after this list).
2. Look in the job profiles for these queries. Tables and views referenced by multiple queries that perform expensive scans, joins, and aggregations are good candidates for Reflections.
3. Examine the SQL for the selected queries that reference the same table or view to find patterns that can help you define a Reflection on that table or view that satisfies as many of those queries as possible.
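
A sketch of step 1, grouping recent jobs by query text to find the most frequent and most expensive queries; the column names (`query_text`, `execution_time_ms`) are illustrative and may differ from the actual `sys.project.jobs_recent` schema:

```
-- Sketch: most frequent recent queries and their cumulative execution time.
SELECT query_text,
       COUNT(*)               AS run_count,
       SUM(execution_time_ms) AS total_execution_ms
FROM sys.project.jobs_recent
GROUP BY query_text
ORDER BY total_execution_ms DESC
LIMIT 20
```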

### Avoid the “More Is Always Better” Approach

Creating more Reflections than are necessary to support your data consumers can lead to the use of more resources than might be optimal for your environment, both in terms of system resources and the time and attention devoted to working with them.

### Establish Criteria for When to Create Reflections

Create Reflections only when data consumers are experiencing slow query responses, or when reports are not meeting established SLAs.

### Create Reflections Without Duplicating the Work of Other Reflections

Dremio recommends that, when you create tables and views, you create them in layers:

* The bottom or first layer consists of your tables.
* The second layer consists of views, one for each table, that do lightweight preparation of data for views in the next layers. Here, administrators might create views that do limited casting, type conversion, field renaming, and redaction of sensitive information, among other prepping operations. Administrators can also add security by subsetting both rows and fields that users in other layers are not allowed to access. The data has been lightly scrubbed and restricted to the group of people who have the business knowledge to use these views to build higher-order views that data consumers can use. Admins then grant access to these views to users who create views in the next layer, without letting them see the raw data in the tables.
* In the third layer, users create views that perform joins and other expensive operations. This layer is where the intensive work on data is performed. These users then create Reflections (raw, aggregation, or both) from their views.
* In the fourth layer, users can create lightweight views for dashboards, reports, and visualization tools. They can also create aggregation Reflections, as needed.

### Establish a Routine for Checking How Often Reflections Are Used

At regular intervals, check for Reflections that are no longer being used by the query planner and evaluate whether they should be removed. Query patterns can change over time, and frequently used Reflections can gradually become less relevant.

### Use Supporting Anchors

Anchors for Reflections are views that data consumers have access to from their business-intelligence tools. As you develop a better understanding of query patterns, you might want to support those patterns by creating Reflections from views that perform expensive joins, transformations, filters, calculations, or a combination of those operations. You would probably not want data consumers to be able to access those views directly in situations where the query optimizer did not use any of the Reflections created from those views. Repeated and concurrent queries on such views could put severe strain on system resources.

You can prevent queries run by data consumers from accessing those views directly. Anchors that perform expensive operations and to which access is restricted are called supporting anchors.

For example, suppose that you find these three very large tables are used in many queries:

* Customer
* Order
* Lineitem

You determine that there are a few common patterns in the user queries on these tables:

* The queries frequently join the three tables together.
* Queries always filter by `commit_date < ship_date`.
* There is a calculated field in most of the queries: `extended_price * (1-discount) AS revenue`.

You can create a view that applies these common patterns, and then create a raw Reflection to accelerate queries that follow these patterns.

First, you create a folder in the Dremio space that your data consumers have access to. Then, you configure this folder to be invisible and inaccessible to the data consumers.

Next, when you write the query to create the view, follow these guidelines:

* Use `SELECT *` to include all fields, making it possible for the query optimizer to accelerate the broadest set of queries. Alternatively, if you know exactly which subset of fields is used in the three tables, you can include just that subset in the view.
* Add any calculated fields, which in this case is the revenue field.
* Apply the appropriate join on the three tables.
* Apply any filters that are used by all queries, which in this case is only `commit_date < ship_date`.
* Always use the most generic predicate possible to maximize the number of queries that will match.

Next, you run the following query to create the new view:

Create a new view

```
SELECT *, extendedprice * (1 - discount) AS revenue FROM customer AS c, orders AS o, lineitem AS l WHERE c.c_custkey = o.o_custkey AND l.l_orderkey = o.o_orderkey AND o.commit_date < o.ship_date
```

Then, you save the view in the folder that you created earlier.

Finally, you create one or more raw Reflections on this new supporting anchor. If most of the queries against the view were aggregation queries, you could create an aggregation Reflection instead. In either case, you can select fields, as needed, to sort on or partition on.
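
One way to define such a Reflection is in SQL, following Dremio's `ALTER DATASET ... CREATE RAW REFLECTION` syntax; the view path, Reflection name, and field list below are illustrative:

```
-- Sketch: a raw Reflection on the supporting-anchor view.
ALTER DATASET restricted.anchors.customer_orders
  CREATE RAW REFLECTION customer_orders_raw
  USING DISPLAY (c_custkey, o_orderkey, revenue, commit_date, ship_date)
```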

The result is that, even though the data consumers do not have access to the supporting anchor, Dremio can accelerate their queries by using the new Reflections as long as they have access to the tables that the Reflections are ultimately derived from: Customer, Order, and Lineitem.

If the query optimizer determines that a query cannot be satisfied by any of the Reflections, and no other views can satisfy it, the query can run directly against the tables, as is always the case with any query.

### Horizontally Partition Reflections that Have Many Rows

If you select a field for partitioning in a data Reflection, Dremio physically groups records together into a common directory on the file system. For example, if you partition by the field Country, in which the values are two-letter abbreviations for the names of countries, such as US, UK, DE, and CA, Dremio stores the data for each country in a separate directory named US, UK, DE, CA, and so on. This allows Dremio to scan a subset of the directories based on the query, an optimization called partition pruning.

If a user queries on records for which the value of Country is US or UK, then Dremio can apply partition pruning to scan only the US and UK directories, significantly reducing the amount of data that is scanned for the query.

When you are selecting a partitioning field for a data Reflection, ask yourself these questions:

1. Is the field used in many queries?
2. Are there relatively few unique values in the field (low cardinality)?

To partition the data, Dremio must first sort all records, which consumes resources. Accordingly, partition data only on fields that can be used to optimize queries. In addition, the number of unique values for a field should be relatively small, so that Dremio creates only a relatively small number of partitions. If all values in a field are unique, the cost to partition outweighs the benefit.

In general, Dremio recommends that the total number of partitions for a Reflection be less than 10,000.

Because Reflections are created as Apache Iceberg tables, you can use partition transforms to specify transformations to apply to partition columns to produce partition values. For example, if you choose to partition on a column of timestamps, you can set partition transforms that produce partition values that are the years, months, days, or hours in those timestamps. The following table lists the partition transforms that you can choose from.

note

* If a column is listed as a partition column, it cannot also be listed as a sort column for the same Reflection.
* In aggregation Reflections, each column specified as a partition column or used in a transform must also be listed as a dimension column.
* In raw Reflections, each column specified as a partition column or used in a transform must also be listed as a display column.

| Value | Type of Partition Transform | Description |
| --- | --- | --- |
| IDENTITY | identity(<column\_name>) | Creates one partition per value. This is the default transform. If no transform is specified for a column named by the `name` property, an IDENTITY transform is performed. The column can use any supported data type. |
| YEAR | year(<column\_name>) | Partitions by year. The column must use the DATE or TIMESTAMP data type. |
| MONTH | month(<column\_name>) | Partitions by month. The column must use the DATE or TIMESTAMP data type. |
| DAY | day(<column\_name>) | Partitions on the equivalent of dateint. The column must use the DATE or TIMESTAMP data type. |
| HOUR | hour(<column\_name>) | Partitions on the equivalent of dateint and hour. The column must use the TIMESTAMP data type. |
| BUCKET | bucket(<integer>, <column\_name>) | Partitions data into the number of partitions specified by an integer. For example, if the integer value N is specified, the data is partitioned into N, or (0 to (N-1)), partitions. The partition in which an individual row is stored is determined by hashing the column value and then calculating `<hash_value> mod N`. If the result is 0, the row is placed in partition 0; if the result is 1, the row is placed in partition 1; and so on. The column can use the DECIMAL, INT, BIGINT, VARCHAR, VARBINARY, DATE, or TIMESTAMP data type. |
| TRUNCATE | truncate(<integer>, <column\_name>) | If the specified column uses the string data type, truncates strings to a maximum of the number of characters specified by an integer. For example, suppose the specified transform is truncate(1, stateUS). A value of `CA` is truncated to `C`, and the row is placed in partition C. A value of `CO` is also truncated to `C`, and the row is also placed in partition C. If the specified column uses the integer or long data type, truncates column values in the following way: for any `truncate(L, col)`, truncates the column value to the biggest multiple of L that is smaller than the column value. For example, suppose the specified transform is `truncate(10, intColumn)`. A value of 1 is truncated to 0 and the row is placed in partition 0. A value of 247 is truncated to 240 and the row is placed in partition 240. If the transform is `truncate(3, intColumn)`, a value of 13 is truncated to 12 and the row is placed in partition 12. A value of 255 is not truncated, because it is divisible by 3, and the row is placed in partition 255. The column can use the DECIMAL, INT, BIGINT, VARCHAR, or VARBINARY data type. **Note:** The truncate transform does not change column values. It uses column values to calculate the correct partitions in which to place rows. |
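
As an illustration, a raw Reflection partitioned with a day transform might be defined as follows, assuming the Reflection DDL's `PARTITION BY` clause accepts these transforms; the dataset path, Reflection name, and column names are illustrative:

```
-- Sketch: a raw Reflection partitioned by day on a timestamp column.
ALTER DATASET lake.sales.orders
  CREATE RAW REFLECTION orders_by_day
  USING DISPLAY (order_id, order_ts, amount)
  PARTITION BY (DAY(order_ts))
```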

### Partition Reflections to Allow for Partition-Based Incremental Refreshes

Incremental refreshes of data in Reflections are much faster than full refreshes. Partition-based incremental refreshes are based on Iceberg metadata that is used to identify modified partitions and to restrict the scope of the refresh to only those partitions. For more information about partition-based incremental refreshes, see Types of Refresh for Reflections on Apache Iceberg Tables, Filesystem Sources, Glue Sources, and Hive Sources in [Refresh Reflections](/dremio-cloud/admin/performance/manual-reflections/reflection-refresh).

For partition-based incremental refreshes, both the base table and its Reflections must be partitioned, and the partition transforms that they use must be compatible. The following table lists which partition transforms on the base table and which partition transforms on Reflections are compatible:

| Partition Transform on the Base Table | Compatible Partition Transforms on Reflections |
| --- | --- |
| Identity | Identity, Hour, Day, Month, Year, Truncate |
| Hour | Hour, Day, Month, Year |
| Day | Day, Month, Year |
| Month | Month, Year |
| Year | Year |
| Truncate | Truncate |

note

* If both a base table and a Reflection use the Truncate partition transform, follow these rules concerning truncation lengths:
  + If the partition column uses the String data type, the truncation length used for the Reflection must be less than or equal to the truncation length used for the base table.
  + If the partition column uses the Integer data type, the remainder from the truncation length on the Reflection (A) divided by the truncation length on the base table (B) must be equal to 0: `A MOD B = 0`
  + If the partition column uses any other data type, the truncation lengths must be identical.
* If a base table uses the Bucket partition transform, partition-based incremental refreshes are not possible.

#### Partition Aggregation Reflections on Timestamp Data in Very Large Base Tables

Suppose you want to define an aggregation Reflection on a base table that has billions of rows. The base table includes a column that either uses the TIMESTAMP data type or includes a timestamp as a string, and the base table is partitioned on that column.

In your aggregation Reflection, you plan to aggregate on timestamp data that is in the base table. However, to get the benefits of partition-based incremental refresh, you need to partition the Reflection in a way that is compatible with the partitioning on the base table. You can make the partitioning compatible in either of two ways:

* By defining a view on the base table, and then defining the aggregation Reflection on that view
* By using the advanced Reflection editor to define the aggregation Reflection on the base table

##### Define an Aggregation Reflection on a View

If the timestamp column in the base table uses the TIMESTAMP data type, use one of the functions in this table to define the corresponding column in the view. You can partition the aggregation Reflection on the view column and use the partition transform that corresponds to the function.

| Function in View Definition | Corresponding Partition Transform |
| --- | --- |
| DATE\_TRUNC('HOUR', <base\_table\_column>) | HOUR(<view\_col>) |
| DATE\_TRUNC('DAY', <base\_table\_column>) | DAY(<view\_col>) |
| DATE\_TRUNC('MONTH', <base\_table\_column>) | MONTH(<view\_col>) |
| DATE\_TRUNC('YEAR', <base\_table\_column>) | YEAR(<view\_col>) |
| CAST(<base\_table\_column> AS DATE) | DAY(<view\_col>) |
| TO\_DATE(<base\_table\_column>) | DAY(<view\_col>) |

If the timestamp column in the base table uses the STRING data type, use one of the functions in this table to define the corresponding column in the view. You can partition the aggregation Reflection on the view column and use the partition transform that corresponds to the function.

| Function in View Definition | Corresponding Partition Transform |
| --- | --- |
| LEFT(<base\_table\_column>, X) | TRUNCATE(<view\_col>, X) |
| SUBSTR(<base\_table\_column>, 0, X) | TRUNCATE(<view\_col>, X) |
| SUBSTRING(<base\_table\_column>, 0, X) | TRUNCATE(<view\_col>, X) |
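
For example, a view that truncates a TIMESTAMP column to day granularity, with a matching aggregation Reflection partitioned by `DAY`, might look like the following sketch; the view, table, and column names are illustrative, and the `USING DIMENSIONS ... MEASURES` clause is assumed to follow Dremio's aggregate-Reflection DDL:

```
-- Day-granularity column in the view enables a compatible DAY partition.
CREATE VIEW sales.orders_daily AS
SELECT *, DATE_TRUNC('DAY', order_ts) AS order_day
FROM lake.sales.orders

-- Aggregation Reflection on the view, partitioned compatibly.
ALTER DATASET sales.orders_daily
  CREATE AGGREGATE REFLECTION orders_daily_agg
  USING DIMENSIONS (order_day) MEASURES (amount)
  PARTITION BY (DAY(order_day))
```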

##### Define an Aggregation Reflection on a Base Table

When creating or editing the aggregation Reflection in the Advanced View, as described in [Manual Reflections](/dremio-cloud/admin/performance/manual-reflections/), follow these steps:

1. Set the base table's timestamp column as a dimension.

   

2. Click the down-arrow next to the green circle.
3. Select **Date** for the date granularity.

   

### Use Dimensions with Low Cardinality

Use dimensions that have relatively low cardinality in a table or view. The higher the cardinality of a dimension, the less benefit an aggregation Reflection has on query performance. Lower-cardinality aggregation Reflections require less time to scan.

### Create One Aggregation Reflection for Each Important Subset of Dimensions

* For a single table or view, create one aggregation Reflection for each important subset of dimensions in your queries, rather than one aggregation Reflection that includes all dimensions. Multiple small aggregation Reflections (versus one large one) are good for isolated pockets of query patterns on the same table or view that do not overlap. If your query patterns overlap, use fewer, larger aggregation Reflections.

  There are two cautions that accompany this advice, however:

  + Be careful of creating aggregation Reflections that have too few dimensions for your queries. If a query uses more dimensions than are included in an aggregation Reflection, the Reflection cannot satisfy the query, and the query optimizer does not run the query against it.
  + Be careful of creating more aggregation Reflections than are necessary to satisfy queries against a table or view. The more Reflections you create, the more time the query optimizer requires to plan the execution of queries. Therefore, creating more aggregation Reflections than you need can slow down query performance, even if your aggregation Reflections are low-cardinality.

### Sort Reflections on High-Cardinality Fields

The sort option is useful for optimizing queries that use filters or range predicates, especially on fields with high cardinality. If sorting is enabled, during query execution Dremio skips over large blocks of records based on filters on sorted fields.

Dremio sorts data during the execution of a query if a Reflection spans multiple nodes and is composed of multiple partitions.

Sorting on more than one field in a single data Reflection typically does not improve read performance significantly and increases the costs of maintenance tasks.

For workloads that need sorting on more than one field, consider creating multiple Reflections, each sorted on a single field.

### Create Reflections from Joins That Are Based on Joins from Multiple Queries

Joins between tables, views, or both tend to be expensive. You can reduce the costs of joins by performing them only when building and refreshing Reflections.

As an administrator, you can identify a group of queries that use similar joins. Then, you can create a general query that uses a join based on the similar joins, but does not include any additional predicates from the queries in the group. This generic query can serve as the basis of a raw Reflection, an aggregation Reflection, or both.

For example, consider the following three queries, which use similar joins on views A, B, and C:

Three queries with joins on views A, B, and C

```
SELECT a.col1, b.col1, c.col1 FROM a join b on (a.col4 = b.col4) join c on (c.col5=a.col5)
WHERE a.size = 'M' AND a.col3 > '2001-01-01' AND b.col3 IN ('red','blue','green')

SELECT a.col1, a.col2, c.col1, COUNT(b.col1) FROM a join b on (a.col4 = b.col4) join c on (c.col5=a.col5)
WHERE a.size = 'M' AND b.col2 < 10 AND c.col2 > 2 GROUP BY a.col1, a.col2, c.col1

SELECT a.col1, b.col2 FROM a join b on (a.col4 = b.col4) join c on (c.col5=a.col5)
WHERE c.col1 = 123
```

You can write and run this generic query to create a raw Reflection that accelerates all three original queries:

Create a Reflection to accelerate three queries

```
SELECT a.col1, a.col2, a.col3, b.col1, b.col2, b.col3, c.col1, c.col2 FROM a join b on (a.col4 = b.col4) join c on (c.col5=a.col5)
```

### Time Reflection Refreshes to Occur After Metadata Refreshes of Tables

Time your Reflection refreshes to occur only after the metadata for their underlying tables is refreshed. Otherwise, Reflection refreshes do not include data from any files that were added to a table since the last metadata refresh.

For example, suppose a data source that is promoted to a table consists of 10,000 files, and the metadata refresh for the table is set to happen every three hours. Subsequently, Reflections are created from views on that table, and the refresh of Reflections on the table is set to occur every hour.

Now, one thousand files are added to the table. Before the next metadata refresh, the Reflections are refreshed twice, yet the refreshes do not add data from those one thousand files. Only on the third refresh of the Reflections does data from those files get added to the Reflections.

### Rotate Personal Access Tokens

When Dremio [personal access tokens (PATs)](/dremio-cloud/security/authentication/personal-access-token/) are used in custom applications, consider scripting an automated periodic refresh to avoid job failures when the PATs expire.

### Monitor Dremio Projects

It's important to set up a good monitoring solution to maximize your investment in Dremio and to identify and resolve issues related to Dremio projects before they have a broader impact on workloads. Your monitoring solution should ensure overall cluster health and performance.

<div style="page-break-after: always;"></div>

# Performance Efficiency | Dremio Documentation
|
|
663
|
+
|
|
664
|
+
Original URL: https://docs.dremio.com/dremio-cloud/help-support/well-architected-framework/performance-efficiency
|
|
665
|
+
|
|
666
|
+
On this page
|
|
667
|
+
|
|
668
|
+
Dremio is a powerful massively parallel processing (MPP) platform that can process terabyte-scale datasets. To get the best performance from your Dremio environment, follow these design principles and best practices for implementation.
|
|
669
|
+
|
|
670
|
+
## Dimensions of Performance Optimization
|
|
671
|
+
|
|
672
|
+
When optimizing Dremio, several factors can affect workload and project performance. Queries submitted to Dremio must be planned on the control plane before being routed for execution. The resource requirements and degree of optimization of individual queries can vary widely. Those queries can be rewritten and optimized on their own without regard to a larger engine.
|
|
673
|
+
|
|
674
|
+
Beyond individual queries, executor nodes have individual constraints of memory and CPU. Executors in Dremio are also part of an engine that groups executors together to process queries in parallel across multiple machines. The size of the engine that a query runs on can affect its performance. To enhance the ability to handle additional queries beyond certain concurrency thresholds, configure replicas for the engine.
|
|
675
|
+
|
|
676
|
+
These dimensions of performance optimization can be simplified in the following decision tree, which addresses the most common scenarios. In the decision tree, `engine_start_epoch_millis > 0` implies that the engine is down.
|
|
677
|
+
|
|
678
|
+

|
|
679
|
+
|
|
680
|
+
## Principles
### Perform Regular Maintenance
Conduct regular maintenance to ensure that your project is set up for optimal performance and can handle more data, queries, and workloads. Regular maintenance will establish a solid baseline from which you can design and optimize. Dremio can be set up to automatically optimize and vacuum tables in your Open catalog.
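
As a sketch of what that maintenance looks like in SQL, the following commands show one common form of table optimization and snapshot cleanup; the table name is a placeholder, and the exact clauses are documented under the OPTIMIZE and VACUUM SQL references.

```sql
-- Rewrite small data files into optimally sized files (placeholder table name).
OPTIMIZE TABLE myCatalog.sales.orders;

-- Expire old snapshots to reclaim storage and reduce metadata overhead;
-- verify the supported clauses against the VACUUM reference.
VACUUM TABLE myCatalog.sales.orders
  EXPIRE SNAPSHOTS older_than '2025-01-01 00:00:00.000';
```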
### Optimize Queries for Efficiency

Before worrying about scaling out your engines, it is important to optimize your semantic layer and queries to be as efficient as possible. For example, if there are partition columns, you should use them. Create sorted or partitioned Reflections. Follow standard SQL writing best practices, such as applying functions to values rather than columns in WHERE clauses.
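
For example, comparing a partition column directly against computed values keeps the predicate eligible for partition pruning, whereas wrapping the column in a function does not (the table and column names here are illustrative):

```sql
-- Less efficient: the function call on the column blocks partition pruning.
SELECT order_id, amount
FROM   sales.orders
WHERE  EXTRACT(YEAR FROM order_date) = 2024;

-- More efficient: the partition column is compared directly, so Dremio
-- can prune partitions that fall outside the range.
SELECT order_id, amount
FROM   sales.orders
WHERE  order_date >= DATE '2024-01-01'
  AND  order_date <  DATE '2025-01-01';
```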
### Optimize Engines

Dremio provides several facilities to allow workload isolation and ensure your queries do not overload the engines. Multiple engines are used to keep some queries from affecting others, and concurrency rules are used to buffer queries so that no single engine is overloaded.
## Best Practices
### Design Semantic Layer for Workload Performance
Dremio’s enterprise-scale semantic layer clearly defines the boundary between your physically stored tables and your logical, governed, and self-service views. The semantic layer seamlessly allows data engineers and semantic data modelers to create views based on tables without having to make copies of the physical data.
Since interactive performance for business users is a key capability of the semantic layer, when appropriate, Dremio can leverage physically optimized representations of source data known as [Reflections](/dremio-cloud/admin/performance/autonomous-reflections). When queries are made against views that have Reflections enabled, the query optimizer can accelerate a query by using one or more Reflections to partially or entirely satisfy that query rather than processing the raw data in the underlying data source. Queries do not have to be rewritten to take advantage of Reflections. Instead, Dremio's optimizer automatically considers Reflection suitability while planning the query.
With Dremio, you can create layers of views that allow you to present data to business consumers in a format they need while satisfying the security requirements of the organization. Business consumers do not need to worry about which physical locations the data comes from or how the data is physically organized. A layered approach allows you to create sets of views that can be reused many times across multiple projects.
Leveraging Dremio’s layering best practices promotes a more-performant, low-maintenance solution that can provide agility to development teams and business users as well as better control over data.
### Improve the Performance of Poor-Performing Queries

Run `SELECT * FROM sys.project.jobs_recent` to analyze the query history and determine which queries are performing sub-optimally. This allows you to consider a number of factors, including the overall execution time of a query. Identify the 10 longest-running queries to understand why they are taking so long. For example, is it the time taken to read data from the source, a lack of CPU cycles, the query spilling to disk, the query being queued at the start, or another issue? Did the query take a long time to plan?
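
A first pass at that analysis might look like the following sketch; the duration-related column names are assumptions, so adapt them to the actual `sys.project.jobs_recent` schema in your project.

```sql
-- Hypothetical sketch: surface the 10 longest-running recent jobs.
-- Column names are assumed; verify them against the actual schema.
SELECT   job_id,
         query_type,
         planning_time,
         queued_time,
         execution_time
FROM     sys.project.jobs_recent
ORDER BY execution_time DESC
LIMIT    10;
```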
**Note:**

Read [Query Performance Analysis and Improvement](https://www.dremio.com/wp-content/uploads/2024/01/Query-Performance-Analysis-and-Improvement.pdf) for details about query performance analysis techniques. This white paper was developed based on Dremio Software, but the content applies equally to Dremio Cloud.

The query history also allows you to focus on planning times. Investigate queries to pinpoint high planning time, which could be due to the complexity of the query (which you can address by rewriting the query) or due to many Reflections being considered (which indicates that too many Reflections are defined in the environment). Read [Remove Unused Reflections](/dremio-cloud/help-support/well-architected-framework/cost-optimization#remove-unused-reflections) for more information about identifying redundant Reflections in your Dremio project.
The query history also allows you to focus on metadata refresh times, which could be due to inline metadata refresh. Read [Optimize Metadata Refresh Frequency](/dremio-cloud/help-support/well-architected-framework/cost-optimization#optimize-metadata-refresh-frequency) for more information about checking metadata refresh schedules.

Sometimes, query performance is inconsistent. A query may complete in less than 10 seconds in one instance but require a minute of execution time in another. This is a sign of resource contention in the engine, which can happen in high-volume environments or when too many jobs (including user queries, metadata refreshes, and Reflection refreshes) are running at the same time. We recommend having separate, dedicated engines for metadata refreshes, Reflection refreshes, and user queries to reduce the contention for resources when user queries need to run concurrently with refreshes.
For Reflection jobs that require excessive memory, we recommend two Reflection refresh engines of different sizes, routing the Reflections that require excessive memory to the larger engine. This is typically needed for Reflections on views that depend on the largest datasets and can be done with the [ALTER TABLE ROUTE REFLECTIONS](/dremio-cloud/sql/commands/alter-table/) command.
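
For example, routing the Reflections defined on a memory-intensive view to the larger engine might look like the following; the dataset and engine names are placeholders, and the full syntax is in the linked command reference.

```sql
-- Route Reflection refreshes for this view to the larger refresh engine.
ALTER TABLE myCatalog.analytics.large_join_view
  ROUTE REFLECTIONS TO ENGINE "reflections-xl";
```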
### Read Dremio Profiles to Pinpoint Bottlenecks
Dremio job profiles contain a lot of fine-grained information about how a query was planned, how the phases of execution were constructed, how the query was actually executed, and the decisions made about whether to use Reflections to accelerate the query.
For each phase of the query that is documented in the job profile, check the start and end times of the phase for an initial indication of the phase in which any bottlenecks are located. After you identify the phase, check the operators of that phase to identify which operator or thread within the operator may have been the specific bottleneck. This information usually helps determine why a query performs sub-optimally so that you can plan improvements. Reasons for bottlenecks and potential improvement options include:
* High metadata retrieval times from inline metadata refresh indicate that you should revisit metadata refresh settings.
* High planning time can be caused by too many Reflections or may mean that a query is too complex and should be rewritten.
* High engine start times indicate that the engine was down and had to start. The enqueued time may include replica start time. You may be able to mitigate these issues with minimum-replica and last-replica auto-stop settings.
* High setup times in table functions indicate overhead due to opening and closing too many small files. High wait times indicate a network or I/O delay in reading the data. High sleep times in certain phases could indicate CPU contention.
**Note:**

Read [Reading Dremio Job Profiles](https://www.dremio.com/wp-content/uploads/2024/01/Reading-Dremio-Job-Profiles.pdf) for details about job profile analysis techniques. This white paper was developed based on Dremio Software, but the content applies equally to Dremio Cloud.
### Engine Routing and Workload Management
Since the workloads and volumes of queries change over time, reevaluate engine routing settings, engine sizes, engine replicas, and concurrency per replica and adjust as needed to rebalance the proportion of queries that execute concurrently on a replica of an engine.
### Right-Size Engines and Executors
Analyze the query history to determine whether a change in the number of executors in your engines is necessary.

When the volume of queries being simultaneously executed by the current set of executor nodes in an engine approaches a saturation point, Dremio exhibits several symptoms. Saturation typically manifests as increased sleep time during query execution; sleep time is incurred when a running query must wait for available CPU cycles because all available CPUs are busy. Other symptoms include an increased number of queries spilling to disk and out-of-memory exceptions.

You can identify these symptoms by querying the system table with `SELECT * FROM sys.project.jobs_recent`. The resulting table lists query execution times, planning times, engine start times, enqueued times, and job failures.
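
A sketch of such a health check might flag jobs that spilled, failed, or waited too long in the queue; the column names here are assumptions to adapt to the actual schema.

```sql
-- Hypothetical sketch: surface saturation symptoms in the query history.
-- The spilled, status, and queued_time column names are assumptions.
SELECT job_id,
       status,
       queued_time,
       spilled
FROM   sys.project.jobs_recent
WHERE  spilled = true
   OR  status = 'FAILED'
   OR  queued_time > 30000;  -- queued longer than 30 seconds
```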
Failure to address these symptoms can result in increasing query failures, increasing query times, and queries spilling to disk, which in turn lead to a bad end-user experience and poor satisfaction. Spilling to disk ensures that a query succeeds because some of its high-memory-consuming operations are processed via local disks. This reduces the memory footprint of the query significantly, but the trade-off is that the query inevitably runs more slowly.

You can alleviate these issues by adding replicas to the engine and reducing the concurrency per replica, or by adding a larger engine and altering the engine routing rules to route some of the workload to it. Remember that a query executes on the nodes of a single replica of an engine, not across multiple replicas or multiple engines.
A good reason to create a new engine is when a new workload is introduced to your Dremio project, perhaps by a new department within an organization, and the queries cause the existing engine setup to degrade in performance. Creating a new engine to isolate the new workload, most likely by creating rules to route queries from users in that organization to the new engine, is a useful way of segregating workloads.
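
As an illustration, a routing rule that isolates the new department's workload could use a membership condition like the following; the group and engine names are placeholders, and you should confirm the exact condition syntax in the engine routing documentation.

```sql
-- Hypothetical routing-rule condition: route queries submitted by members
-- of the new department's group to its dedicated engine.
is_member('marketing_analytics')
```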
### Leverage Reflections to Improve Performance
When developing use cases in Dremio’s semantic layer, it’s often best to build out the use case iteratively without any Reflections to begin with. Then, as you complete iterations, run the queries and analyze the data in the query history to deduce which queries take the longest to execute and whether any common factors among a set of slow queries are contributing to the slowness.

For example, if a set of five slow queries are each derived from a view that contains a join between two relatively large tables, you might find that adding a raw Reflection on the view that performs the join speeds up all five queries, because doing so creates an Apache Iceberg materialization of the join results that is automatically used to accelerate views derived from the join. This provides the query planning and performance benefits of Apache Iceberg and allows you to partition the Reflection to accelerate queries for which the underlying data wasn't initially optimized. This is an important pattern because it means you can leverage a small number of Reflections to speed up many workloads.
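
A minimal sketch of this pattern, with placeholder names, first defines the view that performs the join and then adds a raw Reflection to it; check the linked Reflection best practices for the exact DDL supported in your project.

```sql
-- Placeholder view that performs the expensive join once.
CREATE VIEW analytics.orders_enriched AS
SELECT o.order_id, o.order_date, o.amount, c.customer_name, c.region
FROM   sales.orders o
JOIN   sales.customers c ON o.customer_id = c.customer_id;

-- Hypothetical raw Reflection on the join view; queries derived from the
-- view can then be satisfied from the Iceberg materialization.
ALTER TABLE analytics.orders_enriched
  CREATE RAW REFLECTION orders_enriched_raw
  USING DISPLAY (order_id, order_date, amount, customer_name, region)
  PARTITION BY (region);
```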
Raw Reflections can be useful when you have large volumes of JSON or CSV data. Querying such data requires processing the entire data set, which can be inefficient. Adding a raw Reflection over the JSON or CSV data again allows for an Apache Iceberg representation of that data to be created and opens up all of the planning and performance benefits that come along with it.
Another use of raw Reflections is to offload heavy queries from an operational data store. Often, database administrators do not want their operational data stores (for example, online transaction processing databases) overloaded with analytical queries while they are busy processing billions of transactions. In this situation, you can leverage Dremio raw Reflections again to create an Apache Iceberg representation of the operational table. When a query comes in that needs the data, Dremio reads the Reflection data instead of going back to the operational source.
Another very important use case that often requires raw Reflections is when you join on-premises data to cloud data. In this situation, retrieving the on-premises data often becomes a bottleneck for queries due to the latency in retrieving data from the source system. Leveraging a default raw Reflection on the view where the data is joined together often yields significant performance gains.
If you have connected Dremio to client tools that issue different sets of GROUP BY queries against a view, and the GROUP BY statements take too long to process compared to the desired service level agreement, consider adding an aggregation Reflection to the view to satisfy the combinations of dimensions and measures that are submitted from the client tool.
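
A sketch of such an aggregation Reflection, with placeholder names, declares the dimensions and measures that the client tool's GROUP BY queries combine:

```sql
-- Hypothetical aggregation Reflection covering the dimension/measure
-- combinations submitted by the client tool.
ALTER TABLE analytics.orders_enriched
  CREATE AGGREGATE REFLECTION orders_by_region_agg
  USING DIMENSIONS (region, order_date)
        MEASURES   (amount (SUM, COUNT));
```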
Read [Best Practices for Creating Raw and Aggregation Reflections](/dremio-cloud/admin/performance/manual-reflections) when you are considering how and where to apply Reflections.
Failing to make use of Dremio Reflections means you could be missing out on significant performance enhancements for some of your poorest-performing queries. However, creating too many Reflections can also have a negative impact on the system as a whole. The misconception is often that more Reflections must be better, but when you consider the overhead in maintaining and refreshing Reflections at intervals, Reflection refreshes can end up stealing valuable resources from your everyday workloads, especially if you have not created a dedicated Reflection refresh engine.
Where possible, organize your queries by pattern. The idea is to create as few Reflections as possible to service as many queries as possible, so finding points in the semantic tree through which many queries go can help you accelerate a larger number of queries. The more Reflections you have that may be able to accelerate the same query patterns, the longer the planner takes to evaluate which Reflection is best suited for accelerating the query being planned.
### Optimize Metadata Refresh Performance

Add a dedicated metadata refresh engine to your Dremio project. This ensures that all metadata refresh activities for Parquet, Optimized Row Columnar (ORC), and Avro datasets that are serviced by executors are completed in isolation from any other workloads, and it prevents metadata refresh workloads from taking CPU cycles and memory away from business-critical workloads. This gives the refreshes the best chance of finishing in a timely manner.
<div style="page-break-after: always;"></div>
# AI Semantic Layer | Dremio Documentation
Original URL: https://docs.dremio.com/dremio-cloud/help-support/well-architected-framework/self-serve-semantic-layer

Dremio has a unique capability in its AI Semantic Layer, which maps the physical structure of the underlying data storage to how the data is consumed via SQL queries. When you optimally design and maintain the semantic layer, data is more discoverable, writing queries is more straightforward, and performance is optimized.
## Principles
### Layer Views
Layering your views allows you to balance security, performance, and usability. Layered views help you expose the data in your physical tables to external consumption tools in the format the tools require, with proper security and performance. A well-architected semantic layer consists of three layers to organize your views: preparation, business, and application. Each layer serves a purpose in transforming data for consumption by external tools.
### Annotate Datasets to Enhance Discovery and Understanding
You can label and document datasets within Dremio to make data more discoverable and verifiable and allow you to apply governance.
## Best Practices
### Use the Preparation Layer to Map 1-1 to Tables
The preparation layer is closest to the data source. This layer is used to organize and expose only the required datasets from the source rather than all datasets the source contains. In the preparation layer, each view is mapped to the table that it is derived from in the data source, and there are no joins to other views.
Typically, a data engineer is responsible for preparing the data in the preparation layer. The data engineer should apply column aliasing so that all downstream views can use the normalized column names. Casting column data types should also be done in the preparation layer so that all higher-level views can leverage the correct type and conversion is done only once. Data should be cleansed in the preparation layer for central management and to ensure that all downstream views use clean data. Derived columns based on existing columns should be configured in the preparation layer so that all future layers can use the new columns.
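
For example, a preparation-layer view with placeholder names applies aliases, casts, cleansing, and derived columns once so that every downstream layer inherits them:

```sql
-- Preparation-layer view: 1-1 with the source table, no joins.
CREATE VIEW prep.orders AS
SELECT CAST(o_id AS BIGINT)          AS order_id,
       CAST(o_cust_id AS BIGINT)     AS customer_id,
       CAST(o_date AS DATE)          AS order_date,
       TRIM(o_status)                AS order_status,   -- cleansing
       CAST(o_total AS DOUBLE)       AS order_total,
       CAST(o_total AS DOUBLE) * 0.1 AS estimated_tax   -- derived column
FROM   source_db.public.orders;
```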
### Use the Business Layer to Logically Join Datasets
The business layer provides a holistic view of all data across your catalog or folder. It is the first layer where joins among and between sources should occur. All views in the business layer must be built by either querying resources in the preparation layer or querying other resources in the same business layer.
* Querying resources in the preparation layer: views in the business layer should start by selecting all columns from the corresponding preparation-layer view. This is typically a 1-1 mapping between the preparation-layer view and the business-layer view.
* Querying other resources in the same business layer: when joining two views together, join their business-layer representations, not their preparation-layer ones. This allows all changes made in the business layer to propagate to all joins (see the sketch after this list).
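
A sketch of both patterns, continuing the placeholder names from the preparation-layer example:

```sql
-- Pattern 1: business-layer views start as 1-1 selections from the
-- preparation layer.
CREATE VIEW business.orders AS
SELECT * FROM prep.orders;

CREATE VIEW business.customers AS
SELECT * FROM prep.customers;

-- Pattern 2: joins reference the business-layer views, never the
-- preparation layer, so business-layer changes propagate to the join.
CREATE VIEW business.customer_orders AS
SELECT c.customer_id, c.customer_name, c.region, o.order_id, o.order_total
FROM   business.customers c
JOIN   business.orders o ON o.customer_id = c.customer_id;
```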
Use your list of common terms to describe the key business entities in your organization, such as customer, product, and order. Typically, a data modeler works with business experts and data providers to define the views that represent the business entities.
You can create many sub-layers inside the business layer, each consisting of views for different subject areas or verticals. These views are reusable components that can and should be shared across business lines. Typically, views do not filter rows or columns in the business layer; this is deferred to the application layer.
Use the business layer to improve productivity for analytics initiatives and minimize the risk of duplicative efforts in your organization by reducing the cost of service delivery to lines of business, providing a self-service model for data engineers to quickly provision datasets, and enabling data consumers to quickly use and share datasets.
### Use the Application Layer to Arrange Datasets for Consumption

Application layer views are arranged for the needs of data consumers and organizational departments. Typically, data consumers like analysts and data scientists use the views from the business layer and work directly in the application layer to create and modify views for their own dashboards.
If the application layer provides self-service access to Dremio’s AI Semantic Layer, you should expose all business layer views in the application layer at minimum. Even if the view is created by running `SELECT * from BUSINESS_VIEW`, it provides logical separation for security and performance improvements.

If the application layer is not for self-service but for particular applications, the views in the application layer should be built on top of those self-service views in the application layer, adding any application-specific logic. Application logic should consist of row filters as needed by the application. Columns can be left as-is, with the application's SQL query selecting only the columns it needs.
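
A sketch of both flavors, with placeholder names; the row filter stands in for the application-specific logic:

```sql
-- Self-service application-layer view: a 1-1 logical wrapper over the
-- business-layer view that provides a separation point for security and
-- performance improvements.
CREATE VIEW app.customer_orders AS
SELECT * FROM business.customer_orders;

-- Application-specific view: built on the self-service view, adding a
-- row filter; the application's own query selects only the columns it needs.
CREATE VIEW app.emea_customer_orders AS
SELECT * FROM app.customer_orders
WHERE  region = 'EMEA';
```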
### Leverage Labels to Enhance Searchability
Use Dremio’s [label](/dremio-cloud/manage-govern/wikis-labels) functionality to create and assign labels to tables and views to group related objects and enhance the discoverability of data across your organization. You can search for sets of tables and views based on a label or click on a label in the Dremio console to start a search based on it. Objects can have multiple labels so that they can belong to different logical groups.
### Create Wiki Content to Describe Datasets
Use Dremio’s [wiki](/dremio-cloud/manage-govern/wikis-labels) functionality to add descriptions for catalogs, sources, folders, tables, and views. Wikis enhance understanding of data inside your organization. Wikis allow you to provide context for datasets, such as descriptions for each column, and content that helps users get started with the data, such as usage examples, notes, and points of contact for questions or issues.
Dremio wikis use [GitHub-Flavored Markdown](https://github.github.com/gfm/) and are supported by a rich text editor.

To help eliminate the need for labor-intensive manual classification and cataloging, you can use [generative AI](/dremio-cloud/manage-govern/wikis-labels) to generate labels and wikis for your datasets. Enabling the generative AI feature in Dremio allows you to generate a detailed description of each dataset's purpose and schema. Dremio's generative AI bases its descriptions on your schema and data, determining how the columns within a dataset relate to each other and to the dataset as a whole.
<div style="page-break-after: always;"></div>