@chrismo/superkit 1.0.0
- package/LICENSE.txt +29 -0
- package/README.md +26 -0
- package/dist/cli/pager.d.ts +6 -0
- package/dist/cli/pager.d.ts.map +1 -0
- package/dist/cli/pager.js +21 -0
- package/dist/cli/pager.js.map +1 -0
- package/dist/cli/skdoc.d.ts +3 -0
- package/dist/cli/skdoc.d.ts.map +1 -0
- package/dist/cli/skdoc.js +42 -0
- package/dist/cli/skdoc.js.map +1 -0
- package/dist/cli/skgrok.d.ts +3 -0
- package/dist/cli/skgrok.d.ts.map +1 -0
- package/dist/cli/skgrok.js +21 -0
- package/dist/cli/skgrok.js.map +1 -0
- package/dist/cli/skops.d.ts +3 -0
- package/dist/cli/skops.d.ts.map +1 -0
- package/dist/cli/skops.js +32 -0
- package/dist/cli/skops.js.map +1 -0
- package/dist/index.d.ts +10 -0
- package/dist/index.d.ts.map +1 -0
- package/dist/index.js +11 -0
- package/dist/index.js.map +1 -0
- package/dist/lib/docs.d.ts +11 -0
- package/dist/lib/docs.d.ts.map +1 -0
- package/dist/lib/docs.js +29 -0
- package/dist/lib/docs.js.map +1 -0
- package/dist/lib/expert-sections.d.ts +32 -0
- package/dist/lib/expert-sections.d.ts.map +1 -0
- package/dist/lib/expert-sections.js +130 -0
- package/dist/lib/expert-sections.js.map +1 -0
- package/dist/lib/grok.d.ts +15 -0
- package/dist/lib/grok.d.ts.map +1 -0
- package/dist/lib/grok.js +57 -0
- package/dist/lib/grok.js.map +1 -0
- package/dist/lib/help.d.ts +20 -0
- package/dist/lib/help.d.ts.map +1 -0
- package/dist/lib/help.js +163 -0
- package/dist/lib/help.js.map +1 -0
- package/dist/lib/recipes.d.ts +29 -0
- package/dist/lib/recipes.d.ts.map +1 -0
- package/dist/lib/recipes.js +133 -0
- package/dist/lib/recipes.js.map +1 -0
- package/dist/superkit.tar.gz +0 -0
- package/docs/grok-patterns.sup +89 -0
- package/docs/recipes/array.md +66 -0
- package/docs/recipes/array.spq +31 -0
- package/docs/recipes/character.md +110 -0
- package/docs/recipes/character.spq +57 -0
- package/docs/recipes/escape.md +159 -0
- package/docs/recipes/escape.spq +102 -0
- package/docs/recipes/format.md +51 -0
- package/docs/recipes/format.spq +24 -0
- package/docs/recipes/index.md +23 -0
- package/docs/recipes/integer.md +101 -0
- package/docs/recipes/integer.spq +53 -0
- package/docs/recipes/records.md +84 -0
- package/docs/recipes/records.spq +61 -0
- package/docs/recipes/string.md +177 -0
- package/docs/recipes/string.spq +105 -0
- package/docs/superdb-expert.md +929 -0
- package/docs/tutorials/bash_to_sup.md +123 -0
- package/docs/tutorials/chess-tiebreaks.md +233 -0
- package/docs/tutorials/debug.md +439 -0
- package/docs/tutorials/fork_for_window.md +296 -0
- package/docs/tutorials/grok.md +166 -0
- package/docs/tutorials/index.md +10 -0
- package/docs/tutorials/joins.md +79 -0
- package/docs/tutorials/moar_subqueries.md +35 -0
- package/docs/tutorials/subqueries.md +236 -0
- package/docs/tutorials/sup_to_bash.md +164 -0
- package/docs/tutorials/super_db_update.md +34 -0
- package/docs/tutorials/unnest.md +113 -0
- package/docs/zq-to-super-upgrades.md +549 -0
- package/package.json +46 -0
package/docs/tutorials/fork_for_window.md
@@ -0,0 +1,296 @@
---
title: "Fork as a Window Function Workaround"
name: fork-for-window
description: "Using fork as a workaround for window functions to do per-group selection."
layout: default
nav_order: 4
parent: Tutorials
superdb_version: "0.3.0"
last_updated: "2026-02-20"
---

# Fork as a Window Function Workaround

Window functions like `ROW_NUMBER() OVER (PARTITION BY ...)` are not yet
available in SuperDB ([brimdata/super#5921][issue]). This tutorial shows how to
use `fork` to achieve per-group selection — picking the top N items from each
group.

[issue]: https://github.com/brimdata/super/issues/5921

## The Problem

You have a pool of available EC2 instances spread across availability zones.
You need to pick instances while maximizing AZ distribution — taking an equal
number from each zone rather than filling up from one.

```mdtest-input instances.sup
{id:"i-001",az:"us-east-1a"}
{id:"i-002",az:"us-east-1a"}
{id:"i-003",az:"us-east-1a"}
{id:"i-004",az:"us-east-1b"}
{id:"i-005",az:"us-east-1c"}
{id:"i-006",az:"us-east-1c"}
{id:"i-007",az:"us-east-1c"}
{id:"i-008",az:"us-east-1c"}
```

Distribution: 3 in `us-east-1a`, 1 in `us-east-1b`, 4 in `us-east-1c`.

## What You'd Want (Window Functions)

In SQL with window functions, this would be straightforward:

```sql
SELECT * FROM (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY az ORDER BY id) as rn
  FROM instances
) WHERE rn <= 2
```

This assigns a row number within each AZ group, then filters to keep only the
first 2 per group. But SuperDB doesn't support this yet.

## The Fork Approach

`fork` splits the input stream into parallel branches. Each branch receives a
copy of **all** the input records, processes them independently, and the results
from every branch are merged back together into a single stream.

Here's the full query — we'll break it down step by step afterward:

```mdtest-command
super -s -c "
  from instances.sup
  | fork
    ( where az=='us-east-1a' | head 2 )
    ( where az=='us-east-1b' | head 2 )
    ( where az=='us-east-1c' | head 2 )
  | sort az, id
"
```
```mdtest-output
{id:"i-001",az:"us-east-1a"}
{id:"i-002",az:"us-east-1a"}
{id:"i-004",az:"us-east-1b"}
{id:"i-005",az:"us-east-1c"}
{id:"i-006",az:"us-east-1c"}
```

### Step by Step

**Step 1: `from instances.sup`** — reads all 8 records into the stream:

```mdtest-command
super -s -c "from instances.sup"
```
```mdtest-output
{id:"i-001",az:"us-east-1a"}
{id:"i-002",az:"us-east-1a"}
{id:"i-003",az:"us-east-1a"}
{id:"i-004",az:"us-east-1b"}
{id:"i-005",az:"us-east-1c"}
{id:"i-006",az:"us-east-1c"}
{id:"i-007",az:"us-east-1c"}
{id:"i-008",az:"us-east-1c"}
```

**Step 2: `fork`** — sends all 8 records into each of three branches. Each
branch sees the full input and processes it independently.

**Branch 1:** `where az=='us-east-1a'` filters to 3 records, then `head 2`
keeps the first 2:

```mdtest-command
super -s -c "from instances.sup | where az=='us-east-1a' | head 2"
```
```mdtest-output
{id:"i-001",az:"us-east-1a"}
{id:"i-002",az:"us-east-1a"}
```

(i-003 was dropped by `head 2`)

**Branch 2:** `where az=='us-east-1b'` filters to 1 record, `head 2` returns
what's available:

```mdtest-command
super -s -c "from instances.sup | where az=='us-east-1b' | head 2"
```
```mdtest-output
{id:"i-004",az:"us-east-1b"}
```

Only 1 instance exists in this AZ. `head 2` doesn't error or pad — it just
returns what's there.

**Branch 3:** `where az=='us-east-1c'` filters to 4 records, `head 2` keeps
the first 2:

```mdtest-command
super -s -c "from instances.sup | where az=='us-east-1c' | head 2"
```
```mdtest-output
{id:"i-005",az:"us-east-1c"}
{id:"i-006",az:"us-east-1c"}
```

(i-007 and i-008 were dropped by `head 2`)

**Step 3: implicit combine** — after the fork closes, results from all three
branches merge back into a single stream of 5 records. Fork branches run in
parallel and finish in nondeterministic order, so the combined output may be
interleaved differently on each run. This is why the final `sort` matters.

**Step 4: `sort az, id`** — sorts the combined results for clean, predictable
output:

```mdtest-command
super -s -c "
  from instances.sup
  | fork
    ( where az=='us-east-1a' | head 2 )
    ( where az=='us-east-1b' | head 2 )
    ( where az=='us-east-1c' | head 2 )
  | sort az, id
"
```
```mdtest-output
{id:"i-001",az:"us-east-1a"}
{id:"i-002",az:"us-east-1a"}
{id:"i-004",az:"us-east-1b"}
{id:"i-005",az:"us-east-1c"}
{id:"i-006",az:"us-east-1c"}
```

2 from `us-east-1a`, 1 from `us-east-1b` (all it had), 2 from `us-east-1c` —
as balanced as possible given the available pool.

## Why Not Just Sort and Head?

Without fork, you might try:

```mdtest-command
super -s -c "from instances.sup | sort az, id | head 5"
```
```mdtest-output
{id:"i-001",az:"us-east-1a"}
{id:"i-002",az:"us-east-1a"}
{id:"i-003",az:"us-east-1a"}
{id:"i-004",az:"us-east-1b"}
{id:"i-005",az:"us-east-1c"}
```

All 3 from `us-east-1a`, the 1 from `us-east-1b`, and only 1 from `us-east-1c`.
That's unbalanced — it fills up from the first AZ alphabetically instead of
distributing evenly.

## Verifying the Distribution

You can check the balance of your selection by piping through an aggregate:

```mdtest-command
super -s -c "
  from instances.sup
  | fork
    ( where az=='us-east-1a' | head 2 )
    ( where az=='us-east-1b' | head 2 )
    ( where az=='us-east-1c' | head 2 )
  | aggregate count:=count() by az
  | sort az
"
```
```mdtest-output
{az:"us-east-1a",count:2}
{az:"us-east-1b",count:1}
{az:"us-east-1c",count:2}
```

## Alternative: Self-Join for Row Numbering

There's a pure SQL approach that doesn't require fork and works dynamically with
any number of groups. The idea: for each record, count how many records in the
same group have an id less than or equal to it. This simulates
`ROW_NUMBER() OVER (PARTITION BY az ORDER BY id)`.

```mdtest-command
super -s -c "
  select a.id, a.az, count(*) as row_num
  from instances.sup a
  join instances.sup b on a.az = b.az and b.id <= a.id
  group by a.id, a.az
  order by a.az, a.id
"
```
```mdtest-output
{id:"i-001",az:"us-east-1a",row_num:1}
{id:"i-002",az:"us-east-1a",row_num:2}
{id:"i-003",az:"us-east-1a",row_num:3}
{id:"i-004",az:"us-east-1b",row_num:1}
{id:"i-005",az:"us-east-1c",row_num:1}
{id:"i-006",az:"us-east-1c",row_num:2}
{id:"i-007",az:"us-east-1c",row_num:3}
{id:"i-008",az:"us-east-1c",row_num:4}
```

Step by step, for record `i-006` in `us-east-1c`:

1. The self-join matches `i-006` against all `us-east-1c` records with
   `id <= 'i-006'`: that's `i-005` and `i-006` itself.
2. `count(*)` = 2, so `row_num` = 2.

Now filter to keep only the first 2 per group:

```mdtest-command
super -s -c "
  with ranked as (
    select a.id, a.az, count(*) as row_num
    from instances.sup a
    join instances.sup b on a.az = b.az and b.id <= a.id
    group by a.id, a.az
  )
  select id, az from ranked
  where row_num <= 2
  order by az, id
"
```
```mdtest-output
{id:"i-001",az:"us-east-1a"}
{id:"i-002",az:"us-east-1a"}
{id:"i-004",az:"us-east-1b"}
{id:"i-005",az:"us-east-1c"}
{id:"i-006",az:"us-east-1c"}
```

Same result as fork, but no hardcoded AZ names — works with any number of
groups dynamically.
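The counting trick above can be sketched outside SQL to make the quadratic pairing explicit. A minimal Python sketch — the data is inlined from `instances.sup` and the helper name is illustrative, not part of SuperDB:

```python
# Simulate ROW_NUMBER() OVER (PARTITION BY az ORDER BY id):
# a record's rank is the count of same-group peers with id <= its own,
# exactly what the SQL self-join + count(*) computes.
instances = [
    {"id": "i-001", "az": "us-east-1a"},
    {"id": "i-002", "az": "us-east-1a"},
    {"id": "i-003", "az": "us-east-1a"},
    {"id": "i-004", "az": "us-east-1b"},
    {"id": "i-005", "az": "us-east-1c"},
    {"id": "i-006", "az": "us-east-1c"},
    {"id": "i-007", "az": "us-east-1c"},
    {"id": "i-008", "az": "us-east-1c"},
]

def row_num(rec):
    # count(*) over the join: peers in the same az with id <= rec's id
    return sum(1 for r in instances
               if r["az"] == rec["az"] and r["id"] <= rec["id"])

picked = [r["id"]
          for r in sorted(instances, key=lambda r: (r["az"], r["id"]))
          if row_num(r) <= 2]
print(picked)  # ['i-001', 'i-002', 'i-004', 'i-005', 'i-006']
```

Note the nested loop: every record scans the whole list, which is the same quadratic cost the self-join pays.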

## Trade-offs

**Fork** is simple and fast (a linear scan per branch), but requires hardcoding
the group values. Best when groups are known and stable (like AZs in a region).

**Self-join** is dynamic and handles any number of groups automatically, but
is O(n^2) per group since every record is joined against all peers with a
smaller-or-equal key. Fine for small datasets, potentially slow for large ones.

**With window functions** ([brimdata/super#5921][issue]), the query would be
both dynamic and efficient — handling any number of groups with a single linear
pass and supporting sophisticated ranking (e.g., ordering within groups by
launch time, instance type preference, etc.).

| Approach         | Dynamic groups? | Time complexity  | Notes                                       |
|------------------|-----------------|------------------|---------------------------------------------|
| Fork             | No              | O(n) per branch  | Groups must be hardcoded                    |
| Self-join        | Yes             | O(n^2) per group | Every record joined against its group peers |
| Window functions | Yes             | O(n log n)       | Sort + single pass (not yet available)      |

For a refresher on what those mean in practice
([Big O notation](https://en.wikipedia.org/wiki/Big_O_notation)):

| Notation   | Name         | 100 records | 10,000 records | Growth         |
|------------|--------------|-------------|----------------|----------------|
| O(n)       | Linear       | 100         | 10,000         | Scales nicely  |
| O(n log n) | Linearithmic | ~664        | ~132,877       | Typical sort   |
| O(n^2)     | Quadratic    | 10,000      | 100,000,000    | Gets slow fast |

package/docs/tutorials/grok.md
@@ -0,0 +1,166 @@
---
title: "grok"
name: grok
description: "Tutorial on using the grok function for text parsing in SuperDB."
layout: default
nav_order: 5
parent: Tutorials
superdb_version: "0.3.0"
last_updated: "2026-03-28"
---

# grok

The grok function is a great choice for parsing text, but due to some gaps in
its documentation and some vague error messages, it can be difficult to use at
first.

The docs do helpfully encourage building out grok patterns incrementally, but
without knowing some of grok's gotchas, this can be discouraging.

Let's demonstrate these, starting with an example where we want to extract the
name out of this string:

```text
My name is: Muerte!
```

To start incrementally, I know I want to skip everything up through the colon,
and then extract the name minus the closing exclamation point. There's probably
not a predefined pattern for this prefix regex, and/or I'm feeling lazy enough
right now not to go looking for one, and the regex is pretty simple.

So, I'll define my own pattern in the 3rd arg to grok to handle this:

```mdtest-command
super -s -c '
  values "My name is: Muerte!"
  | grok("%{NAME_PREFIX}", this, "NAME_PREFIX .*: ")'
```

Since there aren't any errors and no field names are assigned, it returns an
empty record:

```mdtest-output
{}
```

It's a simple regex, and it seems accurate — so it's hard to see what's wrong.

The regex is fine, in fact. The real reason this returns an empty record is that
the capture pattern is **missing a field name** in which to store the value.
Without a field name, there's nothing to capture into a record field.

We probably made this mistake because we don't really want to capture "My name
is: " in a field of the record. But, no big deal, we can add one and use the
cut operator later to remove it.

```mdtest-command
super -s -c '
  values "My name is: Muerte!"
  | grok("%{NAME_PREFIX:prefix}", this, "NAME_PREFIX .*: ")'
```
```mdtest-output
{prefix:"My name is: "}
```

Success!

For our next incremental step, let's capture the name. That's all that's left.

```mdtest-command
super -s -c '
  values "My name is: Muerte!"
  | grok("%{NAME_PREFIX:prefix}%{WORD:name}", this, "NAME_PREFIX .*: ")'
```
```mdtest-output
{prefix:"My name is: ",name:"Muerte"}
```

Success again! Ok, that wasn't so bad, but it's a little arduous. It doesn't
feel like I'm getting to use the power of regex in a straightforward manner.

There are two undocumented grok "hacks" that can make a simple job like this
even simpler.

First (as seen already with the unnamed capture pattern above), not all
capture patterns need a field name, as long as _**one**_ of them has a field
name. So we can reduce our last example to this:

```mdtest-command
super -s -c '
  values "My name is: Muerte!"
  | grok("%{NAME_PREFIX}%{WORD:name}", this, "NAME_PREFIX .*: ")'
```
```mdtest-output
{name:"Muerte"}
```

Second, custom regex patterns can be _inlined_ into the pattern string without
being defined as a custom named pattern in the 3rd argument at all!

```mdtest-command
super -s -c '
  values "My name is: Muerte!"
  | grok(".*: %{WORD:name}", this)'
```
```mdtest-output
{name:"Muerte"}
```

Now that feels clean and simple!
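For intuition about what grok is doing under the hood: a grok pattern expands to a regular expression with named capture groups. A rough Python equivalent of the final example — the `\w+` stand-in for `WORD` is a simplification, not grok's actual pattern definition:

```python
import re

# ".*: %{WORD:name}" expands to roughly this named-group regex
pattern = re.compile(r".*: (?P<name>\w+)")

match = pattern.match("My name is: Muerte!")
print(match.group("name"))  # Muerte
```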

## Pairing grok with infer

Grok extracts fields as strings — even numbers and IPs. The `infer` operator
can automatically detect and cast these to native types.

Here's a log line parsed with grok — note that everything is a string:

```mdtest-command
super -s -c '
  values
    "192.168.1.1 GET /api/users 200 1234",
    "10.0.0.5 POST /api/data 404 567"
  | grok("%{IP:client} %{WORD:method} %{URIPATH:path} %{INT:status} %{INT:bytes}", this)'
```
```mdtest-output
{client:"192.168.1.1",method:"GET",path:"/api/users",status:"200",bytes:"1234"}
{client:"10.0.0.5",method:"POST",path:"/api/data",status:"404",bytes:"567"}
```

Add `| infer` and the types get cleaned up automatically:

```mdtest-command
super -s -c '
  values
    "192.168.1.1 GET /api/users 200 1234",
    "10.0.0.5 POST /api/data 404 567"
  | grok("%{IP:client} %{WORD:method} %{URIPATH:path} %{INT:status} %{INT:bytes}", this)
  | infer'
```
```mdtest-output
{client:192.168.1.1,method:"GET",path:"/api/users",status:200,bytes:1234}
{client:10.0.0.5,method:"POST",path:"/api/data",status:404,bytes:567}
```

`client` became an `ip` type, `status` and `bytes` became `int64`, while
`method` and `path` correctly stayed as strings. This means you can now do
things like `where status >= 400` or `where client in 10.0.0.0/8` without
manual casting.

## Unit tests in codebase

```mdtest-command
super -s -c 'values "1", "foo" | grok("%{INT}", this)'
```
```mdtest-output
{}
error({message:"grok: value does not match pattern",on:"foo"})
```

## as of versions

```mdtest-command
super --version
```
```mdtest-output
Version: v0.3.0
```
package/docs/tutorials/joins.md
@@ -0,0 +1,79 @@
---
title: "Joins"
name: joins
description: "Examples of outer joins, anti joins, and full outer joins in SuperDB."
layout: default
nav_order: 6
parent: Tutorials
superdb_version: "0.3.0"
last_updated: "2026-02-15"
---

# Joins

## Outer Joins

```mdtest-input za.sup
{id:1,name:"foo",src:"za"}
{id:3,name:"qux",src:"za"}
```
```mdtest-input zb.sup
{id:1,name:"foo",src:"zb"}
{id:2,name:"bar",src:"zb"}
```

**Left Join** (left-only rows + inner matches — a `where` clause is required to
eliminate the inner matches)

`select *` includes columns from both sides. The right table's columns get a
`_1` suffix to avoid name collisions, and unmatched values are
`error("missing")`.

```mdtest-command
super -s -c "select * from za.sup as za
  left join zb.sup as zb
  on za.id=zb.id
  where is_error(zb.name)"
```
```mdtest-output
{id:3,name:"qux",src:"za",id_1:error("missing"),name_1:error("missing"),src_1:error("missing")}
```

**Right Join** (right-only rows + inner matches — a `where` clause is required
to eliminate the inner matches)

```mdtest-command
super -s -c "select * from za.sup as za
  right join zb.sup as zb
  on za.id=zb.id
  where is_error(za.name)"
```
```mdtest-output
{id:error("missing"),name:error("missing"),src:error("missing"),id_1:2,name_1:"bar",src_1:"zb"}
```

**Anti Join** (left-only rows exclusively — no `where` clause required)

```mdtest-command
super -s -c "select * from za.sup as za
  anti join zb.sup as zb
  on za.id=zb.id"
```
```mdtest-output
{id:3,name:"qux",src:"za",id_1:error("missing"),name_1:error("missing"),src_1:error("missing")}
```
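The left/anti distinction can be sketched in plain Python over the same two inputs. This is an illustration only — `MISSING` stands in for SuperDB's `error("missing")`, and the function names are made up:

```python
za = [{"id": 1, "name": "foo", "src": "za"}, {"id": 3, "name": "qux", "src": "za"}]
zb = [{"id": 1, "name": "foo", "src": "zb"}, {"id": 2, "name": "bar", "src": "zb"}]

MISSING = 'error("missing")'  # stand-in for SuperDB's missing-value error

def left_join(left, right, key):
    # Every left row survives; unmatched right-side columns become MISSING,
    # and right-side columns get a _1 suffix (both schemas are identical here).
    out = []
    for l in left:
        matches = [r for r in right if r[key] == l[key]] or [{k: MISSING for k in l}]
        for r in matches:
            out.append({**l, **{k + "_1": v for k, v in r.items()}})
    return out

def anti_join(left, right, key):
    # Only left rows with no match on the right
    return [l for l in left if all(r[key] != l[key] for r in right)]

print(anti_join(za, zb, "id"))  # [{'id': 3, 'name': 'qux', 'src': 'za'}]
```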

**Full Outer Join** (left-only + right-only + inner matches) — _BUG: still
behaves like a left join and only returns left-side rows_

```mdtest-command
super -s -c "select * from za.sup as za
  full outer join zb.sup as zb
  on za.id=zb.id
  where is_error(za.name) or is_error(zb.name)"
```
```mdtest-output
{id:3,name:"qux",src:"za",id_1:error("missing"),name_1:error("missing"),src_1:error("missing")}
```

## as of versions

```mdtest-command
super --version
```
```mdtest-output
Version: v0.2.0
```
package/docs/tutorials/moar_subqueries.md
@@ -0,0 +1,35 @@
---
title: "Moar Subqueries"
name: moar-subqueries
description: "Additional subquery patterns including fork and full sub-selects."
layout: default
nav_order: 10
parent: Tutorials
superdb_version: "0.2.0"
last_updated: "2026-02-15"
---

# Moar Subqueries

## Fork

One hassle with this approach is the limit of 2 forks. Nesting forks works, but
makes constructing the query a bit more difficult.

## Full Sub-Selects

As of the 20250815 build, this is much, much slower. I'm guessing it's doing a
full reload of the data file each time.

```
select
  (select count(*)
   from './moar_subqueries.sup'
   where win is not null) as total_games,
  ...
```

## All Other SQL-Syntax Subqueries

They all take about the same amount of wall time, but CPU usage is much higher
due to re-reading the file each time.