@hamelin.sh/documentation 0.2.2-prerelease.20250910T133135

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md ADDED
@@ -0,0 +1,3 @@
1
+ # Hamelin Documentation
2
+
3
+ **NOTICE:** This package is not licensed for use. All rights are reserved. No permission is granted to use, copy, modify, merge, publish, distribute, sublicense, or sell copies of the Software.
package/dist/main.d.ts ADDED
@@ -0,0 +1,3 @@
1
+ declare const HAMELIN_DOCUMENTATION: Record<string, string>;
2
+
3
+ export { HAMELIN_DOCUMENTATION };
package/dist/main.js ADDED
@@ -0,0 +1,517 @@
1
+ // src/generated/documentation.ts
2
+ var HAMELIN_DOCUMENTATION = {
3
+ "command-reference/agg.md": "# `AGG`\n\nGroup rows and apply aggregation functions.\n\n## Syntax\n\n```\nAGG [assignmentClause [, assignmentClause]*] [,]?\n [BY groupClause [, groupClause]*] [,]?\n [SORT [BY]? sortExpression [, sortExpression]*] [,]?\n```\n\n## Parameters\n\n- **assignmentClause** - Aggregation function call or expression, optionally with assignment (`identifier = expression`)\n- **groupClause** - Field or expression to group by\n- **sortExpression** - Field or expression to sort results by, with optional `ASC` or `DESC` direction\n\n## Description\n\nThe `AGG` command groups rows based on specified grouping expressions and \napplies aggregation functions to compute summary statistics for each group. \nWhen you don't specify aggregation functions, `AGG` returns distinct values of \nthe grouping fields. When you don't specify grouping fields, `AGG` treats the \nentire dataset as a single group.\n\nAggregation functions generate automatic column names (see [Automatic Field Names](../smart-features/automatic-field-names.md)). You can assign \ncustom names using the assignment syntax.\n\nThe optional `SORT` clause orders the input to the aggregation functions by the \nexpressions you specify. This only matters for aggregation functions like \n`array_agg()` that are sensitive to input order. You can specify multiple sort \nexpressions, with each supporting optional `ASC` or `DESC` direction specifiers.\n\n\n\n## Related Commands\n\n- **[WINDOW](./window.md)** - Apply aggregation or window functions over sliding windows (sliding windows vs discrete grouping)",
4
+ "command-reference/append.md": "# `APPEND`\n\nAppend rows from the current pipeline to another dataset.\n\n## Syntax\n\n```\nAPPEND tableReference [DISTINCT BY selection (, selection)*]\n```\n\n## Parameters\n\n- **tableReference** - The target dataset to append current pipeline rows to\n- **selection** - Field identifiers you use for deduplication when DISTINCT BY is specified\n\n## Description\n\nThe `APPEND` command adds rows from the current pipeline to another dataset.\nWhen you specify DISTINCT BY, only rows that are unique based on the selected\nfields are appended, preventing duplicate entries. The command performs type\nexpansion to accommodate fields that exist in the current pipeline but not in\nthe target dataset.\n",
5
+ "command-reference/drop.md": "# `DROP`\n\nRemove specific columns from the dataset.\n\n## Syntax\n\n```\nDROP column [, column]* [,]?\n```\n\n## Parameters\n\n- **column** - Name of the column to remove from the results\n\n## Description\n\nThe `DROP` command removes specified columns from the dataset while\npreserving all other columns. Unlike `SELECT`, which defines exactly what\ncolumns to keep, `DROP` defines exactly what columns to exclude.\n\nYou can specify multiple columns in a single `DROP` command, separated by\ncommas. Each identifier must refer to a column that exists in the current\npipeline context. `DROP` also works for nested columns using dot notation\n(e.g., `user.email`).\n",
6
+ "command-reference/explode.md": "# `EXPLODE`\n\nExpand array fields into separate rows.\n\n## Syntax\n\n```\nEXPLODE assignmentClause\n```\n\n## Parameters\n\n- **assignmentClause** - Either an assignment (identifier = expression) or an expression that evaluates to an array field\n\n## Description\n\nThe `EXPLODE` command transforms rows containing array fields into multiple \nrows, with each element of the array becoming a separate row. Each array \nelement becomes a new row with all other fields from the original row \npreserved in each generated output row.\n\nWhen you use assignment syntax (`identifier = expression`), the exploded values \nare placed in the specified field name. Without assignment syntax, the exploded \nvalues replace the original array field.\n\n## Related Commands\n\n- **[ROWS](./rows.md)** - Injects rows into a pipeline (similar row generation behavior)",
7
+ "command-reference/from.md": "# `FROM`\n\nAccess rows from datasets.\n\n## Syntax\n\n```\nFROM fromClause [, fromClause]* [,]?\n```\n\n## Parameters\n\n- **fromClause** - Either a dataset identifier or an alias assignment (`alias = dataset`)\n\n## Description\n\nThe `FROM` command pulls rows from datasets. When you specify multiple \ndatasets, Hamelin combines all rows from all datasets.\n\nMultiple datasets use automatic type expansion. Fields with identical names \nare aligned, while unique fields appear with `NULL` values for rows from \ndatasets that don't contain them.\n\nWhen you use aliases with the assignment syntax (`alias = dataset`), the \nalias becomes a struct containing all fields from that dataset. This lets \nyou organize field access and track data lineage.\n\n\n\n",
8
+ "command-reference/join.md": "# `JOIN`\n\nCombine datasets using inner join logic - only matching rows are kept.\n\n## Syntax\n\n```\nJOIN fromClause [ON expression]?\n```\n\n## Parameters\n\n- **fromClause** - Either a dataset identifier or an alias assignment (`alias = dataset`)\n- **expression** - Boolean condition defining how rows should match\n\n## Description\n\nThe `JOIN` command performs an inner join operation, combining the current \ndataset with another dataset based on matching conditions you specify in the \n`ON` clause. Only rows that have matches in both datasets appear in the final \nresults.\n\nThe joined data is automatically nested as a struct to prevent field name \ncollisions. By default, the struct uses the name of the joined dataset, but \nyou can override this using the assignment syntax in the `fromClause`. You \naccess fields from the joined dataset using dot notation on the struct name.\n\nWhen you omit the `ON` clause, the join becomes a cartesian product of all \nrows from both datasets. The join condition expression can reference fields \nin the current dataset directly and fields in the joined dataset by name \nusing dot notation (e.g., `users.email`).\n\n\n\n## Related Commands\n\n- **[LOOKUP](./lookup.md)** - Left outer join that preserves all original rows",
9
+ "command-reference/let.md": "# `LET`\n\nAdd or modify columns without affecting other columns.\n\n## Syntax\n\n```\nLET assignment [, assignment]* [,]?\n```\n\n## Parameters\n\n- **assignment** - Column assignment in the form `identifier = expression`\n\n## Description\n\nThe `LET` command adds new columns or modifies existing ones while \npreserving all other columns in the dataset. Unlike `SELECT`, which replaces \nthe entire column set, `LET` makes incremental changes to the data structure.\n\nYou can specify multiple assignments in a single `LET` command, separated by \ncommas. Each assignment creates or updates the specified field with the \nresult of evaluating the expression. Expressions can reference any field \navailable at the point where the `LET` command appears in the pipeline.\n\nWhen the identifier already exists as a column, `LET` modifies that column's \nvalues. When the identifier does not exist, `LET` creates a new column with \nthat name.\n\n\n\n## Related Commands\n\n- **[SELECT](./select.md)** - Completely redefine output columns",
10
+ "command-reference/limit.md": "# `LIMIT`\n\nRestrict the number of rows returned.\n\n## Syntax\n\n```\nLIMIT expression\n```\n\n## Parameters\n\n- **expression** - Non-negative integer specifying the maximum number of rows to return\n\n## Description\n\nThe `LIMIT` command restricts the output to a specified maximum number of \nrows. The expression must evaluate to a non-negative integer value. When the \ndataset contains fewer rows than the limit, all rows are returned.\n\nYou typically use `LIMIT` in combination with `SORT` to retrieve the top or \nbottom N results from an ordered dataset. When you use it without sorting, \n`LIMIT` returns an arbitrary subset of rows, which may vary between query \nexecutions.\n\nThe limit is applied after all other operations in the pipeline, making it \nuseful for controlling output size while preserving the full computation \ncontext for earlier commands.\n\n## Related Commands\n\n- **[SORT](./sort.md)** - Order rows by expressions (commonly used with LIMIT)\n\n",
11
+ "command-reference/lookup.md": "# `LOOKUP`\n\nCombine datasets using left outer join logic - all original rows are preserved.\n\n## Syntax\n\n```\nLOOKUP fromClause [ON expression]?\n```\n\n## Parameters\n\n- **fromClause** - Either a dataset identifier or an alias assignment (`alias = dataset`)\n- **expression** - Boolean condition defining how rows should match\n\n## Description\n\nThe `LOOKUP` command performs a left outer join operation, combining the \ncurrent dataset with another dataset based on matching conditions you specify \nin the `ON` clause. All rows from the original dataset are preserved in the \nresults, regardless of whether they have matches in the lookup dataset.\n\nFor rows without matches, the looked-up data struct is set to `null`. For \nrows with matches, the looked-up data is nested as a struct to prevent field \nname collisions. By default, the struct uses the name of the lookup dataset, \nbut you can override this using the assignment syntax in the `fromClause`.\n\nWhen you omit the `ON` clause, the lookup becomes a cartesian product of all \nrows from both datasets. The lookup condition expression can reference fields \nin the current dataset directly and fields in the lookup dataset by name \nusing dot notation (e.g., `users.email`).\n\n\n\n## Related Commands\n\n- **[JOIN](./join.md)** - Inner join that only keeps rows with matches",
12
+ "command-reference/match.md": "# `MATCH`\n\nFind ordered sequences of events using pattern matching with quantifiers.\n\n## Syntax\n\n```\nMATCH pattern+ \n [WHEN matchDefine [, matchDefine]*] [,]?\n [BY groupClause [, groupClause]*] [,]?\n [SORT [BY]? sortExpression [, sortExpression]*] [,]?\n```\n\n## Parameters\n\n- **pattern** - Named dataset reference with optional quantifier (`*`, `+`, `?`, `{n}`) specifying sequence requirements\n- **matchDefine** - Named constraint expression for matched sequences (`identifier = expression`)\n- **groupClause** - Field or expression to group pattern matching by\n- **sortExpression** - Field or expression to order results by, with optional `ASC` or `DESC` direction\n\n## Description\n\nThe `MATCH` command finds ordered sequences of events across multiple named \ndatasets using regular expression-style pattern matching. Unlike `WINDOW`, \nwhich performs unordered correlation, `MATCH` requires that events occur in a \nspecific temporal sequence.\n\nYou specify patterns using named datasets (defined with `WITH` clauses) \nfollowed by optional quantifiers. Quantifiers include `*` (zero or more), `+` \n(one or more), `?` (zero or one), `{n}` (exactly n), and `{n,m}` (between n \nand m occurrences).\n\nThe optional `WHEN` clause adds timing constraints and other conditions to \nmatched sequences. The `BY` clause partitions data for independent pattern \nmatching within each group. Results contain matched sequences with access to \nfirst and last events via `first()` and `last()` functions.\n\n\n\n## Related Commands\n\n- **[WINDOW](./window.md)** - Unordered correlation and aggregation",
13
+ "command-reference/nest.md": "# `NEST`\n\nNest all currently defined fields into one sub-struct.\n\n## Syntax\n\n```\nNEST identifier\n```\n\n## Parameters\n\n- **identifier** - The field name to create for the nested structure\n\n## Description\n\nThe `NEST` command takes all currently defined fields and nests them into one \nsub-struct by creating a new field containing a struct. All original fields \nare preserved as properties of the nested struct. This operation is the \ninverse of UNNEST.\n\n## Related Commands\n\n- **[UNNEST](./unnest.md)** - Lifts struct fields into the parent or enclosing result set (inverse operation)",
14
+ "command-reference/parse.md": '# `PARSE`\n\nExtract structured data from string fields using anchor parsing.\n\n## Syntax\n\n```\nPARSE [expression] pattern AS? identifier (, identifier)* [NODROP]\n```\n\n## Parameters\n\n- **expression** - Optional source field to parse (defaults to current row context)\n- **pattern** - Anchor pattern string using star (*) characters to mark extraction points\n- **identifier** - Output field names for extracted values\n- **NODROP** - Optional flag to preserve the original source field\n\n## Description\n\nThe `PARSE` command provides a lightweight pattern matching approach that is a\nsimple alternative to complex regular expressions. It extracts structured data\nfrom string fields using anchor parsing with star (*) characters. The pattern\nstring uses literal text as anchors with star characters marking extraction\npoints (e.g., "prefix-*-suffix" extracts the value between the anchors). You\nmust provide as many output identifiers as there are star (*) characters in\nthe pattern. The command creates new fields containing the extracted values.\nBy default, rows that don\'t match the pattern are filtered out. When you\nspecify NODROP, non-matching rows are preserved with all output fields set to\nnull.\n',
15
+ "command-reference/rows.md": "# `ROWS`\n\nInject rows into a pipeline.\n\n## Syntax\n\n```\nROWS expression\n```\n\n## Parameters\n\n- **expression** - An expression that evaluates to an array-of-struct which defines the rows to inject\n\n## Description\n\nThe `ROWS` command injects rows into a pipeline by taking an array-of-struct\nexpression and creating one row for each struct element. This is mostly used\nfor examples or playgrounds and is rarely useful in real queries over larger\ndatasets. The `ROWS` command is functionally equivalent to `UNNEST` of a literal\narray.\n\n## Related Commands\n\n- **[EXPLODE](./explode.md)** - Expand array fields into separate rows (similar row generation behavior)\n- **[UNNEST](./unnest.md)** - Lift struct or array of struct fields into the parent or enclosing result set (functionally equivalent for literal arrays)\n",
16
+ "command-reference/select.md": "# `SELECT`\n\nDefine output columns by completely replacing the current column set.\n\n## Syntax\n\n```\nSELECT assignmentClause [, assignmentClause]* [,]?\n```\n\n## Parameters\n\n- **assignmentClause** - Either a simple expression or an assignment (`identifier = expression`)\n\n## Description\n\nThe `SELECT` command completely redefines the column structure of the dataset, \nreplacing all existing columns with only those you explicitly specify. Unlike \n`LET`, which adds or modifies columns while preserving others, `SELECT` gives \nyou complete control over the output schema.\n\nEach `assignmentClause` can be either a simple expression (which preserves \nthe original column name when referencing a field) or an assignment that \ncreates a new column with a specified name. When you use assignment syntax, \nthe identifier becomes the column name and the expression provides the value. \nWhen you don't use assignment syntax, Hamelin automatically generates field \nnames based on the expression (see [Automatic Field Names](../smart-features/automatic-field-names.md) \nfor details).\n\nThe `SELECT` command evaluates expressions against the current row context. \nYou can reference any fields available at that point in the pipeline, \nincluding original dataset fields, fields created by previous `LET` commands, \nand fields from joined datasets.\n\n\n\n## Related Commands\n\n- **[LET](./let.md)** - Add or modify fields while preserving others",
17
+ "command-reference/sort.md": "# `SORT`\n\nOrder rows by one or more expressions with optional direction specifiers.\n\n## Syntax\n\n```\nSORT [BY]? sortExpression [, sortExpression]* [,]?\n```\n\n## Parameters\n\n- **sortExpression** - Field or expression to sort by, with optional `ASC` or `DESC` direction (default is ascending)\n\n## Description\n\nThe `SORT` command orders rows in the dataset according to the specified sort expressions. Multiple sort expressions can be provided, with later expressions acting as tiebreakers when earlier expressions have equal values.\n\nEach sort expression can optionally specify a direction using `ASC` (ascending) or `DESC` (descending). When no direction is specified, ascending order is used by default. The optional `BY` keyword can be included for readability but has no functional impact.\n\nSort expressions can reference any field available in the current dataset context, including calculated fields from previous `LET` commands and fields from joined datasets. Expressions can also include function calls and calculations.\n\n## Related Commands\n\n- **[LIMIT](./limit.md)** - Restrict the number of rows returned (commonly used with SORT)\n\n\n",
18
+ "command-reference/union.md": "# `UNION`\n\nCombine rows from multiple datasets.\n\n## Syntax\n\n```\nUNION fromClause (, fromClause)*\n```\n\n## Parameters\n\n- **fromClause** - A dataset reference or table alias assignment\n\n## Description\n\nThe `UNION` command combines rows from multiple datasets into a single output \nstream. The command performs type expansion on input datasets, automatically \ncreating a unified schema that accommodates all fields from all sources. \nMissing fields in any dataset are filled with null values. All rows from each \ninput dataset are preserved without deduplication.\n\n",
19
+ "command-reference/unnest.md": "# `UNNEST`\n\nLift struct or array of struct fields into the parent or enclosing result set.\n\n## Syntax\n\n```\nUNNEST expression\n```\n\n## Parameters\n\n- **expression** - An expression that evaluates to a struct or array of struct\n\n## Description\n\nThe `UNNEST` command lifts struct or array of struct fields into the parent or \nenclosing result set. When given a struct, it lifts struct fields into the \nparent struct without changing row cardinality. When given an array of struct, \nit performs an explode operation followed by unnesting, creating one row per \narray element with the struct fields lifted into the parent struct.\n\n## Related Commands\n\n- **[NEST](./nest.md)** - Nests all pipeline rows into a sub-structure (inverse operation)",
20
+ "command-reference/where.md": "# `WHERE`\n\nFilter rows based on a boolean condition.\n\n## Syntax\n\n```\nWHERE expression\n```\n\n## Parameters\n\n- **expression** - Boolean expression that evaluates to true or false for each row\n\n## Description\n\nThe `WHERE` command filters rows by evaluating a boolean expression against \neach row. Only rows where the expression evaluates to `true` are kept. Rows \nwhere the expression evaluates to `false` or `null` are excluded.\n\nYou can chain multiple `WHERE` commands in a pipeline. Each command acts as \nan additional filter with logical AND behavior. The expression can reference \nany field available at the point where the `WHERE` command appears in the \npipeline.\n\nThe expression can be any valid Hamelin expression, including those that use \ncomparison, logical, membership, and null-checking operators.\n\n\n\n",
21
+ "command-reference/window.md": "# `WINDOW`\n\nApply aggregation or window functions over sliding windows of data.\n\n## Syntax\n\n```\nWINDOW assignmentClause [, assignmentClause]* [,]?\n [BY groupClause [, groupClause]*] [,]?\n [SORT [BY]? sortExpression [, sortExpression]*] [,]?\n [WITHIN withinExpression]?\n```\n\n## Parameters\n\n- **assignmentClause** - Aggregation function call or expression, optionally with assignment (`identifier = expression`)\n- **groupClause** - Field or expression to partition the data by\n- **sortExpression** - Field or expression to order rows within each partition, with optional `ASC` or `DESC` direction\n- **withinExpression** - Time range or row count defining the window size\n\n## Description\n\nThe `WINDOW` command applies aggregation or window functions over sliding windows of data,\ncomputing results for each row based on a specified subset of surrounding rows.\nUnlike `AGG`, which groups data into discrete buckets, `WINDOW` maintains the\noriginal row structure while adding computed window function results.\n\nThe `BY` clause partitions the data into separate windows based on the grouping\nexpressions you specify. The `SORT` clause determines the ordering of rows\nwithin each partition. The `WITHIN` expression defines the window size, which\nyou can specify as a relative interval (e.g., `-5min`, `-1hr`), a relative range\n(`-10min..-5min`), or relative row count (e.g., `-5r`, `10rows`).\n\nAll aggregation functions in a single `WINDOW` command use the same window\ndefinition. Window function results are added as new columns to the existing\ndataset structure.\n\n\n\n## Related Commands\n\n- **[AGG](./agg.md)** - Group rows and apply aggregation functions (discrete grouping vs sliding windows)\n- **[MATCH](./match.md)** - Pattern matching over sequences\n",
22
+ "command-reference/within.md": "# `WITHIN`\n\nFilter rows based on timestamp ranges.\n\n## Syntax\n\n```\nWITHIN expression\n```\n\n## Parameters\n\n- **expression** - An interval, a range of intervals, or a range of timestamps defining the temporal filter bounds\n\n## Description\n\nThe `WITHIN` command filters rows based on timestamp ranges, keeping only rows\nwhose timestamp falls within the specified time bounds. The expression defines\na time range using interval notation, supporting both bounded and unbounded\nranges.\n\nThe expression supports three forms:\n\n- **Interval** (e.g., `-1hr`) - calculates from that time until now\n- **Range of intervals** (e.g., `-2hr..-1hr`) - both intervals are calculated\n from now\n- **Specific timestamps** (e.g., `ts('2023-01-01')..ts('2023-01-02')`) - between those\n exact timestamps\n\nUnbounded ranges are exactly that - evaluated with no bound in one direction.\n\nThe `WITHIN` command uses the default timestamp field from the dataset.\n",
23
+ "examples/aws-security-group-detection.md": "# AWS Security Group Detection\n\nThis example shows how to build a detection for overly permissive AWS EC2 \nsecurity group modifications. Security groups act as virtual firewalls for \nEC2 instances. Misconfigurations that allow unrestricted internet access \ncreate attack vectors that lead to breaches. You can identify dangerous \nmodifications by analyzing CloudTrail events for security group changes that \nexpose sensitive ports to the internet.\n\n## Attack overview\n\nSecurity group misconfigurations lead to cloud breaches through this pattern:\n\n1. **Initial misconfiguration** - Administrators accidentally allow unrestricted access\n2. **Deliberate modification** - Attackers modify security groups after compromise\n3. **Service exposure** - Sensitive services become accessible from the internet\n4. **Lateral movement** - Exposed services provide entry points for further attacks\n\nYou can detect these misconfigurations by monitoring CloudTrail events for \nsecurity group modifications that create overly permissive rules.\n\n## Detection strategy\n\nThe detection focuses on security group modifications that allow traffic from \nany source (`0.0.0.0/0` or `::/0`) to sensitive ports. You write queries that \nparse CloudTrail JSON events to extract security group changes, identify \npermissive rules, and calculate risk scores based on the ports and protocols \nexposed.\n\nHamelin targets CloudTrail events like `AuthorizeSecurityGroupIngress` and \n`ModifySecurityGroupRules` that indicate firewall rule changes, then filters \nfor modifications that create internet-accessible endpoints.\n\n## Complete detection query\n\nThis detection query parses CloudTrail JSON events to identify security group \nmodifications that expose services to unrestricted internet access. Hamelin \nextracts relevant fields from nested JSON, identifies sensitive ports, \ncalculates risk scores based on exposure level, and creates structured output \nfor security teams:\n\n```hamelin\n// Name: AWS EC2 Security Group Permissive Changes\n// Author: Detection Engineer\n//-\n//- Description: This detection identifies potentially dangerous modifications\n//- to EC2 security groups that could expose resources to unauthorized access.\n//- It specifically looks for rules that allow traffic from any source\n//- (0.0.0.0/0 or ::/0) on sensitive ports. 
Security groups act as virtual\n//- firewalls, and overly permissive rules are a common security misconfiguration\n//- that can lead to breaches.\n//-\n//- Tags: aws, ec2, security-group, network-security, misconfiguration\n//- Mitre-Tactic: TA0005 (Defense Evasion), TA0001 (Initial Access)\n//- Mitre-Technique: T1562.007 (Impair Defenses: Disable or Modify Cloud Firewall)\n//-\n\nFROM simba.cloudtrail_events\n\n// Parse the JSON and extract fields\n| LET evt = parse_json(event.original) AS variant\n| LET eventName = evt.eventName AS string\n| LET source_ip = evt.sourceIPAddress AS string\n| LET user_identity = evt.userIdentity.arn AS string\n| LET account_id = evt.userIdentity.accountId AS string\n| LET aws_region = evt.awsRegion AS string\n| LET error_code = evt.errorCode AS string\n| LET request_params = evt.requestParameters AS variant\n| LET group_id = evt.requestParameters.groupId AS string\n| LET group_name = evt.requestParameters.groupName AS string\n\n// Filter for security group modification events\n| WHERE coalesce(eventName, '') IN [\n 'AuthorizeSecurityGroupIngress',\n 'AuthorizeSecurityGroupEgress',\n 'RevokeSecurityGroupIngress',\n 'RevokeSecurityGroupEgress',\n 'CreateSecurityGroup',\n 'ModifySecurityGroupRules'\n ]\n\n// Only process successful events\n| WHERE coalesce(error_code, '') == ''\n\n// Check for overly permissive rules in the original event\n| LET has_any_source = contains(event.original, '0.0.0.0/0') OR contains(event.original, '::/0')\n\n// Extract port information (simplified approach)\n| LET from_port = evt.requestParameters.ipPermissions[0].fromPort AS int\n| LET to_port = evt.requestParameters.ipPermissions[0].toPort AS int\n| LET ip_protocol = evt.requestParameters.ipPermissions[0].ipProtocol AS string\n\n// Define sensitive ports\n| LET sensitive_ports = [22, 23, 3389, 1433, 3306, 5432, 5984, 6379, 7000, 7001, 8020, 8086, 8888, 9042, 9160, 9200, 9300, 11211, 27017, 27018, 27019, 50070]\n| LET is_sensitive_port = coalesce(from_port, 0) IN sensitive_ports OR \n coalesce(to_port, 0) IN sensitive_ports OR\n (coalesce(from_port, 0) <= 22 AND coalesce(to_port, 0) >= 22) OR\n (coalesce(from_port, 0) <= 3389 AND coalesce(to_port, 0) >= 3389)\n\n// Check for allow all protocols\n| LET is_all_traffic = coalesce(ip_protocol, '') == '-1' OR \n (coalesce(from_port, 0) == 0 AND coalesce(to_port, 0) == 65535)\n\n// Only flag events with permissive rules\n| WHERE has_any_source\n\n// Calculate risk score\n| LET risk_score = if(\n contains(coalesce(eventName, ''), 'Authorize') AND is_all_traffic,\n 100, // Critical - allowing all traffic from anywhere\n if(\n contains(coalesce(eventName, ''), 'Authorize') AND is_sensitive_port,\n 90, // Very high - sensitive ports exposed\n if(\n contains(coalesce(eventName, ''), 'Authorize'),\n 75, // High - any port exposed to internet\n 40 // Low - revoking permissive rules (good action)\n )\n )\n )\n\n// Create human-readable message\n| LET action_type = if(contains(coalesce(eventName, ''), 'Authorize'), 'opened', 'closed')\n| LET from_port_str = coalesce(from_port, 0) AS string\n| LET to_port_str = coalesce(to_port, 0) AS string\n| LET port_info = if(\n is_all_traffic,\n 'all ports',\n 'port(s) ' + from_port_str + '-' + to_port_str\n )\n| LET message = 'Security group ' + action_type + ' to internet: ' + \n port_info + ' on group ' + coalesce(group_id, coalesce(group_name, 'unknown'))\n\n// Create labels map for context\n| LET labels = map(\n 'Severity': if(risk_score >= 90, 'CRITICAL', if(risk_score >= 75, 'HIGH', 'LOW')) AS 
string,\n 'Account': coalesce(account_id, '') AS string,\n 'Security Group': coalesce(group_id, coalesce(group_name, '')) AS string,\n 'User': coalesce(user_identity, '') AS string,\n 'Source IP': coalesce(source_ip, '') AS string,\n 'Event': coalesce(eventName, '') AS string,\n 'Region': coalesce(aws_region, '') AS string,\n 'Protocol': coalesce(ip_protocol, 'unknown') AS string,\n 'Port Range': port_info AS string,\n 'Action': action_type AS string,\n 'Tactic': 'Defense Evasion' AS string,\n 'Technique': 'T1562.007' AS string\n )\n\n// Format the final output\n| SELECT\n timestamp,\n event.start = timestamp,\n event.end = timestamp,\n message,\n labels,\n host.name = coalesce(account_id, ''),\n user.name = coalesce(user_identity, ''),\n source.ip = coalesce(source_ip, ''),\n cloud.region = coalesce(aws_region, ''),\n cloud.service.name = 'ec2',\n network.protocol = coalesce(ip_protocol, ''),\n rule = {\n name: 'AWS EC2 Security Group Permissive Changes' AS string,\n description: 'Detects modifications to EC2 security groups that allow ' +\n 'unrestricted access from the internet (0.0.0.0/0), especially ' +\n 'on sensitive ports commonly targeted by attackers.' AS string,\n mitre_tactics: ['Defense Evasion', 'Initial Access'] AS array(string),\n mitre_techniques: ['T1562.007'] AS array(string),\n severity: if(risk_score >= 90, 'critical', if(risk_score >= 75, 'high', 'low')) AS string,\n references: [\n 'https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/security-group-rules.html',\n 'https://docs.aws.amazon.com/vpc/latest/userguide/VPC_SecurityGroups.html'\n ] AS array(string)\n }\n```\n\n## Query breakdown\n\nThis detection shows several techniques that make Hamelin effective for \ncloud security monitoring. Let's examine each component:\n\n### JSON parsing and field extraction\n\nThe detection starts by parsing CloudTrail JSON events and extracting relevant \nfields. CloudTrail events contain nested JSON structures, so you need to \nextract specific fields systematically:\n\n```hamelin\n| LET evt = parse_json(event.original) AS variant\n| LET eventName = evt.eventName AS string\n| LET source_ip = evt.sourceIPAddress AS string\n| LET user_identity = evt.userIdentity.arn AS string\n| LET group_id = evt.requestParameters.groupId AS string\n```\n\nThe `parse_json()` function converts the JSON string into a variant type. \nYou can then extract specific fields using dot notation. The `coalesce()` \nfunction handles missing fields gracefully.\n\n### Event filtering and validation\n\nThe detection filters for specific CloudTrail events that indicate security \ngroup modifications. You process only successful events to avoid false \npositives from failed API calls:\n\n```hamelin\n| WHERE coalesce(eventName, '') IN [\n 'AuthorizeSecurityGroupIngress',\n 'AuthorizeSecurityGroupEgress',\n 'RevokeSecurityGroupIngress',\n 'RevokeSecurityGroupEgress',\n 'CreateSecurityGroup',\n 'ModifySecurityGroupRules'\n ]\n| WHERE coalesce(error_code, '') == ''\n```\n\nThis filtering ensures you analyze actual security group changes, not \nfailed API calls or unrelated CloudTrail events.\n\n### Permissive rule detection\n\nThe detection identifies overly permissive rules by searching for CIDR blocks \nthat allow traffic from anywhere on the internet:\n\n```hamelin\n| LET has_any_source = contains(event.original, '0.0.0.0/0') OR contains(event.original, '::/0')\n| WHERE has_any_source\n```\n\nThe `contains()` function searches the raw JSON for these permissive CIDR \nblocks. 
This catches rules that expose services to the entire internet.\n\n### Sensitive port identification\n\nThe detection defines a list of commonly targeted ports and checks whether \nthe security group modification affects these sensitive services:\n\n```hamelin\n| LET sensitive_ports = [22, 23, 3389, 1433, 3306, 5432, ...]\n| LET is_sensitive_port = coalesce(from_port, 0) IN sensitive_ports OR \n coalesce(to_port, 0) IN sensitive_ports\n```\n\nThis includes SSH (22), RDP (3389), database ports (1433, 3306, 5432), and \nother services commonly targeted by attackers.\n\n### Risk scoring logic\n\nThe detection calculates risk scores based on the type of modification and \nthe sensitivity of exposed ports:\n\n```hamelin\n| LET risk_score = if(\n contains(coalesce(eventName, ''), 'Authorize') AND is_all_traffic,\n 100, // Critical - allowing all traffic from anywhere\n if(\n contains(coalesce(eventName, ''), 'Authorize') AND is_sensitive_port,\n 90, // Very high - sensitive ports exposed\n if(\n contains(coalesce(eventName, ''), 'Authorize'),\n 75, // High - any port exposed to internet\n 40 // Low - revoking permissive rules (good action)\n )\n )\n )\n```\n\nThis nested `if()` structure prioritizes the most dangerous configurations. \nIt also recognizes when administrators remove permissive rules.\n\n### Structured output generation\n\nThe detection creates both human-readable messages and structured labels for \ndifferent consumers:\n\n```hamelin\n| LET message = 'Security group ' + action_type + ' to internet: ' + \n port_info + ' on group ' + coalesce(group_id, group_name)\n| LET labels = map(\n 'Severity': if(risk_score >= 90, 'CRITICAL', 'HIGH'),\n 'Account': account_id,\n 'Security Group': group_id,\n 'User': user_identity\n )\n```\n\nThis provides context for analysts and maintains machine-readable structure \nfor SIEM integration.\n\n## Advanced techniques demonstrated\n\nThis detection uses several techniques that make Hamelin effective for \ncloud security monitoring:\n\n**JSON parsing and navigation** handles complex nested CloudTrail event \nstructures using `parse_json()` and dot notation for field extraction.\n\n**Flexible field handling** uses `coalesce()` to handle missing or null fields \nin CloudTrail events gracefully.\n\n**String pattern matching** uses `contains()` for efficient searching of \nCIDR blocks within raw JSON content.\n\n**Conditional risk scoring** uses nested `if()` statements to create \nnuanced risk assessments based on multiple factors.\n\n**Cloud-native field mapping** creates ECS-compliant output with cloud-specific \nfields like `cloud.region` and `cloud.service.name`.\n\n**Reference documentation** includes links to official AWS documentation for \nanalyst context and verification.\n\n## Detection tuning\n\nYou can adapt this detection for different cloud environments:\n\n**Expand sensitive ports** by adding environment-specific services to the \nsensitive ports list based on applications in use.\n\n**Adjust risk scores** by modifying the scoring logic based on risk tolerance \nand compliance requirements.\n\n**Add region filtering** by including or excluding specific AWS regions based \non where resources should legitimately exist.\n\n**Customize user filtering** by adding exclusions for automated tools or \nservice accounts that legitimately modify security groups.\n\nThe modular structure with clear variable definitions makes it easy to modify \nindividual components without affecting the overall detection logic.",
24
+ "examples/clickfix-threat-detection.md": '# ClickFix Threat Detection\n\nThis example shows how to build a threat detection query that identifies \n**ClickFix** phishing attacks. ClickFix is a social engineering technique \nwhere attackers convince users to execute malicious commands through the \nWindows Run dialog or browser, often disguised as a way to "fix" a supposed \nerror.\n\n## Attack Overview\n\nClickFix attacks follow a predictable pattern:\n\n1. **Social Engineering** - Users are tricked into thinking they need to "fix" something\n2. **Command Execution** - Users paste malicious commands into Windows Run dialog (Windows + R)\n3. **Payload Download** - PowerShell downloads and executes malicious files\n4. **Command & Control** - Establishes connection to attacker infrastructure\n\nYou can detect these attacks by correlating these behaviors within tight time \nwindows to catch the complete attack chain while minimizing false positives.\n\n## Detection strategy\n\nThis detection identifies three key events within a 5-second window:\n\n- **Registry modifications** to RunMRU (tracks Windows+R usage)\n- **Suspicious process creation** (PowerShell with specific parameters)\n- **Network connections** to external command & control servers\n\nThe tight time correlation and requirement for all three behaviors helps \ndistinguish real attacks from normal administrative activity.\n\n## Complete detection query\n\nHere\'s the full detection query that correlates registry events, process \ncreation, and network connections to identify ClickFix attacks. The query \nuses `WITH` clauses to define each event type, then applies time-based \nwindowing and pattern matching to find the complete attack sequence:\n\n```hamelin\n// Detection Name: Detection - ClickFix Initial Access\n//\n// Description:\n// This detection identifies potential ClickFix phishing activity,\n// a technique where users are socially engineered to execute\n// malicious commands either through the Windows Run dialog\n// (Windows + R) or directly from their browser. The attack\n// typically involves convincing users to paste commands that lead\n// to execution of PowerShell or other scripting tools, often\n// disguised as a way to "fix" a supposed error.\n//\n// Detection Logic\n// The detection correlates three key events within a 5-second window:\n// 1. Registry modifications to RunMRU (tracks Windows+R usage) or\n// process creation from common browsers\n// 2. Creation of suspicious processes (PowerShell, CMD, MSHTA) with\n// suspicious command-line parameters\n// 3. 
Network connections to external IPs or domains\n//\n// The tight 5-second window and requirement for all three behaviors\n// help reduce false positives while catching the typical ClickFix\n// attack chain.\n\n// Identify Windows Run Dialog useage\nWITH registry_events = FROM simba.sysmon_events\n| WHERE event.code == "13"\n| LET reg_key = coalesce(winlog.parsed_event_data.target_object AS string, \'\')\n| LET reg_value = coalesce(winlog.parsed_event_data.details AS string, \'\')\n| LET image = lower(coalesce(winlog.parsed_event_data.image AS string, \'\'))\n| LET host = host.name\n| WHERE regexp_like(lower(reg_key), \'(?i).*\\\\currentversion\\\\explorer\\\\runmru.*\')\n AND regexp_like(image, \'(?i).*explorer\\.exe.*\')\n\n// Get suspicious process creations\nWITH process_events = FROM simba.sysmon_events\n| WHERE event.code == "1"\n| LET cmd_line = coalesce(winlog.parsed_event_data.command_line AS string, \'\')\n| LET image = coalesce(winlog.parsed_event_data.image AS string, \'\')\n| LET parent_image = coalesce(winlog.parsed_event_data.parent_image AS string, \'\')\n| LET host = host.name\n| LET user_name = coalesce(winlog.user.name AS string, \'\')\n| WHERE regexp_like(lower(cmd_line), \'(?i).*(iwr|invoke-webrequest|wget).*\\$env:temp.*\\.exe.*\')\n AND regexp_like(lower(cmd_line), \'(?i).*-w\\s+hidden.*\')\n AND regexp_like(lower(parent_image), \'(?i).*explorer\\.exe.*\')\n\n// Get suspicious network connections\nWITH network_events = FROM simba.sysmon_events\n| WHERE event.code == "3"\n| LET dest_ip = coalesce(winlog.parsed_event_data.destination_ip AS string, \'\')\n| LET dest_port = coalesce(winlog.parsed_event_data.destination_port AS string, \'\')\n| LET image = lower(coalesce(winlog.parsed_event_data.image AS string, \'\'))\n| LET host = host.name\n| WHERE\n dest_ip == "172.31.7.63" AND\n dest_port == "4444" AND\n regexp_like(image, \'(?i).*\\\\temp\\\\.*\\.exe\')\n\n// Main Detection Logic:\n//\n// Above we detect three distinct events happening in a Clickfix scenario:\n// - A suspicious process is launched\n// - A registry event signaling a Windows Run Dialog usage \n// - Outbound network connections to a known C2 server\n//\n// Then, we detect a process event and a registry event happening together.\n// Technically, the process event precedes the registry event.\n// But in Windows, the timeestamps for the events are not very precise\n// Therefore, we need look for those two events happening NOT one\n// after the other, but in a small time window!\nWITH process_near_registry = FROM registry_event = registry_events,\n process_event = process_events\n| WINDOW process_event = last_value(process_event),\n registry_event = last_value(registry_event)\n BY host WITHIN 1s\n| WHERE process_event IS NOT NULL AND registry_event IS NOT NULL\n\n// Finally, we are looking for a pattern where a combination of the above\n// mentioned process & registry events are on the same host followed by\n// outbound network connections to the C2 server.\nMATCH process_near_registry network_event=network_events+\nWHEN network_event = network_event.timestamp - process_near_registry.timestamp <= 5s\nBY host\n| AGG timestamp = min(timestamp),\n event.start = min(timestamp),\n event.end = max(timestamp),\n process_event = array_agg(matched.process_event)[0],\n registry_event = array_agg(matched.registry_event)[0],\n network_events = filter_null(array_agg(network_event)),\n BY\n match_number, host\n\n// Now we know it\'s Clickfix, so let\'s create a nice Signal for analysts to\n// follow up on.\n//\n// Create 
simple message and structured labels:\n| LET message = "ClickFix C2 Activity Detection - Initial Access and Command & Control"\n| LET labels = map(\n "Severity": "CRITICAL",\n "Host": host,\n "User": coalesce(process_event.user_name, "unknown"),\n "Run Dialog Activity": registry_event.reg_value,\n "Registry Key": registry_event.reg_key,\n "PowerShell Command": process_event.cmd_line,\n "Process Path": process_event.image,\n "Parent Process": process_event.parent_image,\n "Destination IP": network_events.dest_ip[0],\n "Destination Port": network_events.dest_port[0],\n "Beacon Count": len(array_distinct(network_events.dest_ip)) AS string)\n\n// Detection Output\n| SELECT\n timestamp,\n event.start = event.start,\n event.end = event.end,\n process_event,\n registry_event,\n network_events,\n host.name = host,\n process.name = process_event.image,\n process.command_line = process_event.cmd_line,\n process.parent.name = process_event.parent_image,\n registry.key = registry_event.reg_key,\n registry.value = registry_event.reg_value,\n destination.ip = network_events.dest_ip[0],\n destination.port = network_events.dest_port[0],\n user.name = process_event.user_name,\n beacons = array_distinct(network_events.dest_ip),\n event.risk_score = 100,\n message,\n labels,\n rule = {\n name: "Windows ClickFix Initial Access",\n description: message,\n mitre_tactics: [\n "Command and Control",\n "Execution",\n "Persistence"\n ],\n mitre_techniques: [\n "T1571", // Non-Standard Port\n "T1105", // Ingress Tool Transfer\n "T1572", // Protocol Tunneling\n "T1059.001" // PowerShell\n ],\n severity: "critical"\n }\n```\n\n## Query breakdown\n\nThis detection demonstrates several advanced Hamelin techniques working \ntogether. Let\'s examine each component:\n\n### Event collection phase\n\nThe query starts by defining three separate event types using `WITH` clauses. \nEach clause filters and transforms raw Sysmon events to extract the specific \nindicators needed for correlation:\n\n**Registry events** track Windows Run dialog usage by monitoring registry \nmodifications to the RunMRU key:\n```hamelin\nWITH registry_events = FROM simba.sysmon_events\n| WHERE event.code == "13" // Registry modification events\n| LET reg_key = coalesce(winlog.parsed_event_data.target_object AS string, \'\')\n| WHERE regexp_like(lower(reg_key), \'(?i).*\\\\currentversion\\\\explorer\\\\runmru.*\')\n AND regexp_like(image, \'(?i).*explorer\\.exe.*\')\n```\n\n**Process events** identify suspicious PowerShell execution with specific \ncommand-line patterns that indicate ClickFix behavior:\n```hamelin\nWITH process_events = FROM simba.sysmon_events\n| WHERE event.code == "1" // Process creation events\n| WHERE regexp_like(lower(cmd_line), \'(?i).*(iwr|invoke-webrequest|wget).*\\$env:temp.*\\.exe.*\')\n AND regexp_like(lower(cmd_line), \'(?i).*-w\\s+hidden.*\')\n```\n\n**Network events** detect connections to known command & control \ninfrastructure by filtering for specific IP addresses and ports:\n```hamelin\nWITH network_events = FROM simba.sysmon_events\n| WHERE event.code == "3" // Network connection events\n| WHERE dest_ip == "172.31.7.63" AND dest_port == "4444"\n```\n\n### Time-based correlation\n\nThe detection uses windowing to correlate process and registry events that \noccur close in time. 
Windows event timestamps aren\'t perfectly precise, so \nyou need to look for events happening within a small time window rather than \nin strict sequence:\n\n```hamelin\nWITH process_near_registry = FROM registry_event = registry_events,\n process_event = process_events\n| WINDOW process_event = last_value(process_event),\n registry_event = last_value(registry_event)\n BY host WITHIN 1s\n```\n\nThis creates pairs of process and registry events that happen within 1 second \nof each other on the same host.\n\n### Pattern matching\n\nThe `MATCH` command identifies the complete attack sequence by finding \ncorrelated events followed by network connections:\n\n```hamelin\nMATCH process_near_registry network_event=network_events+\nWHEN network_event = network_event.timestamp - process_near_registry.timestamp <= 5s\nBY host\n```\n\nThis finds patterns where process and registry events occur together, followed \nby one or more network connections within 5 seconds, all happening on the \nsame host.\n\n### Structured output\n\nThe final section creates analyst-friendly output that includes event metadata \nlike start/end times and risk scores, process details such as command lines \nand parent processes, network indicators including destination IPs and ports, \nMITRE mapping for threat intelligence, and structured labels as key-value \npairs for SIEM integration.\n\n## Advanced techniques demonstrated\n\nThis detection showcases several techniques that make Hamelin effective for \nthreat detection:\n\n**Multi-source correlation** combines registry, process, and network events \nfrom the same data source but different event types.\n\n**Time-based windowing** uses `WINDOW` operations to group events that occur \nclose in time, accounting for timestamp precision limitations.\n\n**Pattern matching** employs `MATCH` to identify ordered sequences of events \nthat follow the attack pattern.\n\n**Flexible field extraction** uses `coalesce()` to handle varying field names \nand missing data gracefully.\n\n**Regular expressions** apply `regexp_like()` for flexible pattern matching \nin command lines and file paths.\n\n**Structured output** creates both human-readable messages and \nmachine-readable labels for different consumers.\n\n## Detection tuning\n\nYou can adapt this detection for different environments:\n\n**Adjust time windows** by modifying the 1-second and 5-second windows based \non event timing characteristics in the environment.\n\n**Update IOCs** by replacing the hardcoded IP address and port with threat \nintelligence feeds.\n\n**Refine regex patterns** by customizing the command line patterns based on \nobserved ClickFix variants.\n\n**Modify thresholds** by adjusting beacon counts or adding minimum connection \nduration requirements.\n\nThe modular structure with `WITH` clauses makes it easy to test and tune \nindividual components without affecting the entire detection logic.',
25
+ "examples/registry-persistence-detection.md": `# Registry Persistence Detection
26
+
27
+ This example shows how to build a detection for Windows registry persistence
28
+ mechanisms. Attackers modify registry keys to maintain persistence on
29
+ compromised systems. You can identify suspicious registry modifications by
30
+ combining baseline anomaly detection with pattern matching for known
31
+ persistence techniques.
32
+
33
+ ## Attack overview
34
+
35
+ Registry persistence is a technique where attackers modify specific registry
36
+ keys to ensure their malware runs automatically:
37
+
38
+ 1. **Registry modification** - Attackers write to persistence locations like Run keys
39
+ 2. **Payload execution** - Modified registry entries execute malicious code at startup
40
+ 3. **Persistence establishment** - Malware survives reboots and user logoffs
41
+ 4. **Defense evasion** - Registry modifications can be subtle and hard to detect
42
+
43
+ You can identify both high-volume registry modifications and modifications to
44
+ known persistence locations with suspicious characteristics.
45
+
46
+ ## Detection strategy
47
+
48
+ The detection uses a two-pronged approach to identify persistence mechanisms
49
+ while minimizing false positives. Hamelin establishes baselines for normal
50
+ registry activity per host, then identifies modifications to known persistence
51
+ locations. The query combines multiple indicators to calculate threat severity
52
+ and excludes common legitimate registry operations.
53
+
54
+ The detection focuses on registry event codes 12, 13, and 14 from Sysmon,
55
+ which capture registry object creation/deletion, value set, and key/value rename events.
56
+
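+ Before reading the full query, it helps to see the baseline half in
+ isolation. The excerpt below is taken from the query that follows, with
+ comments added:
+
+ \`\`\`hamelin
+ WITH baseline_reg = FROM simba.sysmon_events
+ | WHERE event.code IN ["12", "13", "14"]
+ | LET computer_name = host.name AS string
+ | AGG reg_ops = count()
+   BY timestamp = timestamp@m, computer_name   // registry events per host per minute
+ | AGG avg = avg(reg_ops),
+       stddev = stddev(reg_ops)
+   BY computer_name                             // each host's normal rate and spread
+ \`\`\`
+
+ The main detection later compares each host's per-minute count against
+ \`avg + 3 * stddev\` to flag anomalous bursts of registry activity.
+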
57
+ ## Complete detection query
58
+
59
+ This detection query combines baseline analysis with pattern matching to
60
+ identify suspicious registry persistence mechanisms. The query establishes
61
+ normal registry operation baselines for each host, calculates statistical
62
+ thresholds, then identifies anomalous activity and suspicious patterns in
63
+ real-time registry modifications:
64
+
65
+ \`\`\`hamelin
66
+ // Detection Name: Windows Registry Persistence (Tuned)
67
+ //
68
+ // Detection Logic:
69
+ // - Uses simba.sysmon_events source
70
+ // - Extracts and casts event_data fields: CommandLine, Details, Image, TargetObject
71
+ // - Identifies anomalous patterns via regexp_like and filters legitimate cases
72
+ // - Calculates risk score from base and anomaly components
73
+ // - Projects ECS-compliant fields and rule metadata (name, mitre, severity)
74
+ // - TUNED: Added exclusions for common false positives
75
+ //
76
+
77
+ // Baseline for registry operations per host
78
+ WITH baseline_reg = FROM simba.sysmon_events
79
+ | WHERE event.code IN ["12", "13", "14"]
80
+ | LET computer_name = host.name AS string
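+ // Count registry events per host in one-minute buckets (timestamp@m truncates to the minute)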
81
+ | AGG reg_ops = count()
82
+ BY timestamp = timestamp@m, computer_name
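+ // Then derive each host's average and standard deviation of those per-minute counts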
83
+ | AGG avg = avg(reg_ops),
84
+ stddev = stddev(reg_ops)
85
+ BY computer_name
86
+
87
+ // Main detection
88
+ FROM simba.sysmon_events
89
+ | WHERE event.code IN ["12", "13", "14"]
90
+ | LET computer_name = host.name AS string
91
+ | LET reg_key = coalesce(winlog.parsed_event_data.target_object AS string, "") AS string
92
+ | LET reg_value = coalesce(winlog.parsed_event_data.details AS string, "") AS string
93
+ | LET process_name = coalesce(winlog.parsed_event_data.image AS string, "") AS string
94
+ | LET process_cmd = coalesce(winlog.parsed_event_data.command_line AS string, "") AS string
95
+
96
+ // Known good exclusions - EXPANDED TO REDUCE FALSE POSITIVES
97
+ | WHERE NOT (
98
+ // Windows Installer
99
+ regexp_like(lower(reg_key),
100
+ '(?i).*\\\\microsoft\\\\windows\\\\currentversion\\\\installer\\\\.*') OR
101
+ // Microsoft Office Updates
102
+ regexp_like(lower(reg_key),
103
+ '(?i).*\\\\microsoft\\\\office\\\\.*updates\\\\.*') OR
104
+ // Windows Defender
105
+ regexp_like(lower(reg_key),
106
+ '(?i).*\\\\microsoft\\\\windows defender\\\\.*') OR
107
+ // DHCP/Network Configuration by svchost
108
+ (regexp_like(lower(reg_key),
109
+ '(?i).*\\\\services\\\\tcpip\\\\parameters\\\\interfaces\\\\.*dhcp.*') AND
110
+ regexp_like(lower(process_name), '(?i).*\\\\svchost\\\\.exe$')) OR
111
+ // Windows Inventory/Telemetry
112
+ regexp_like(lower(reg_key),
113
+ '(?i).*\\\\registry\\\\a\\\\.*\\\\root\\\\inventory.*') OR
114
+ // Windows Compatibility Telemetry
115
+ (regexp_like(lower(reg_key),
116
+ '(?i).*inventory.*writeperms.*') AND
117
+ regexp_like(lower(process_name), '(?i).*\\\\compattelrunner\\\\.exe$')) OR
118
+ // LogonUI Session Data
119
+ (regexp_like(lower(reg_key),
120
+ '(?i).*\\\\authentication\\\\logonui\\\\sessiondata\\\\.*') AND
121
+ regexp_like(lower(process_name), '(?i).*\\\\logonui\\\\.exe$')) OR
122
+ // WMI/ETW AutoLogger Configuration
123
+ (regexp_like(lower(reg_key),
124
+ '(?i).*\\\\control\\\\wmi\\\\autologger\\\\.*') AND
125
+ regexp_like(lower(process_name), '(?i).*\\\\svchost\\\\.exe$')) OR
126
+ // Windows Update
127
+ regexp_like(lower(reg_key),
128
+ '(?i).*\\\\microsoft\\\\windows\\\\currentversion\\\\windowsupdate\\\\.*') OR
129
+ // Group Policy Cache
130
+ regexp_like(lower(reg_key),
131
+ '(?i).*\\\\microsoft\\\\windows\\\\currentversion\\\\group policy\\\\.*cache.*') OR
132
+ // Edge/Chrome Updates
133
+ regexp_like(lower(reg_key),
134
+ '(?i).*\\\\(microsoft|google)\\\\update\\\\.*') OR
135
+ // VSCode Inventory (from earlier analysis)
136
+ regexp_like(lower(reg_key),
137
+ '(?i).*codesetup-stable.*programid.*')
138
+ )
139
+
140
+ // Suspicious registry locations
141
+ | LET is_run_key = regexp_like(lower(reg_key),
142
+ '(?i).*(\\\\\\\\run\\\\\\\\|\\\\\\\\runonce\\\\\\\\|\\\\\\\\runonceex\\\\\\\\|\\\\\\\\explorer\\\\\\\\run).*') AS boolean
143
+ | LET is_startup_key = regexp_like(lower(reg_key),
144
+ '(?i).*(\\\\\\\\windows\\\\\\\\currentversion\\\\\\\\shell\\\\\\\\|\\\\\\\\startup\\\\\\\\|\\\\\\\\load\\\\\\\\|\\\\\\\\userinit\\\\\\\\|\\\\\\\\shell folders\\\\\\\\).*') AS boolean
145
+ | LET is_service_key = regexp_like(lower(reg_key),
146
+ '(?i).*(\\\\\\\\services\\\\\\\\[^\\\\\\\\]+\\\\\\\\(imagepath|servicedll)|\\\\\\\\currentcontrolset\\\\\\\\services\\\\\\\\[^\\\\\\\\]+$)') AS boolean
147
+ | LET is_logon_key = regexp_like(lower(reg_key),
148
+ '(?i).*(\\\\\\\\winlogon\\\\\\\\(shell|userinit)|\\\\\\\\policies\\\\\\\\system\\\\\\\\|\\\\\\\\groupolicy\\\\\\\\scripts\\\\\\\\).*') AS boolean
149
+ | LET is_extension_key = regexp_like(lower(reg_key),
150
+ '(?i).*(\\\\\\\\command\\\\\\\\|\\\\\\\\shell\\\\\\\\open\\\\\\\\command|\\\\\\\\handlers\\\\\\\\|\\\\\\\\fileexts\\\\\\\\.*\\\\\\\\openwithprogids).*') AS boolean
151
+
152
+ // Suspicious values/data
153
+ | LET sus_value_data = regexp_like(lower(reg_value),
154
+ '(?i).*(powershell|cmd|wscript|cscript|rundll32|regsvr32|mshta|certutil|bitsadmin)\\\\.exe.*') AS boolean
155
+ | LET sus_value_location = regexp_like(lower(reg_value),
156
+ '(?i).*(\\\\\\\\temp\\\\\\\\|\\\\\\\\appdata\\\\\\\\|\\\\\\\\programdata\\\\\\\\public\\\\\\\\|\\\\\\\\downloads\\\\\\\\|\\\\\\\\users\\\\\\\\all users\\\\\\\\).*') AS boolean
157
+ | LET sus_encoded_data = regexp_like(lower(reg_value),
158
+ '(?i).*(base64|encoded|encrypt|\\\\-enc\\\\s|\\\\-e\\\\s|iex\\\\s|downloadstring).*') AS boolean
159
+
160
+ // Suspicious process context
161
+ | LET sus_process = regexp_like(lower(process_name),
162
+ '(?i).*(powershell|cmd|wscript|cscript|rundll32|regsvr32|mshta|certutil|bitsadmin)\\\\.exe.*') AS boolean
163
+ | LET sus_cmdline = regexp_like(lower(process_cmd),
164
+ '(?i).*(hidden|verysilent|/q|/quiet|/s|/silent|-w hidden|-windowstyle hidden|-enc|-encode).*') AS boolean
165
+
166
+ // Additional filtering for known legitimate patterns
167
+ | WHERE NOT (
168
+ // Filter out IP addresses and GUIDs as registry values (common for network config)
169
+ (regexp_like(reg_value, '^[0-9]{1,3}\\\\.[0-9]{1,3}\\\\.[0-9]{1,3}\\\\.[0-9]{1,3}$') AND
170
+ NOT sus_value_data) OR
171
+ // Filter out DWORD values that are just numbers
172
+ (regexp_like(reg_value, '^DWORD \\\\(0x[0-9a-fA-F]+\\\\)$') AND
173
+ NOT sus_value_data)
174
+ )
175
+
176
+ // Calculate risk score
177
+ | LET risk_score = (
178
+ if(is_run_key, 30, 0) + // Increased weight for run keys
179
+ if(is_startup_key, 30, 0) + // Increased weight for startup keys
180
+ if(is_service_key, 25, 0) + // Slightly increased for services
181
+ if(is_logon_key, 25, 0) + // Slightly increased for logon
182
+ if(is_extension_key, 15, 0) + // Reduced weight for extensions
183
+ if(sus_value_data, 40, 0) + // Increased weight for suspicious data
184
+ if(sus_value_location, 25, 0) + // Increased weight for suspicious locations
185
+ if(sus_encoded_data, 40, 0) + // Increased weight for encoded data
186
+ if(sus_process, 25, 0) + // Increased weight for suspicious process
187
+ if(sus_cmdline, 25, 0) // Increased weight for suspicious cmdline
188
+ )
189
+ | LET registry = {key: reg_key, value: reg_value}
190
+
191
+ // Aggregate by minute
192
+ | AGG minute_count = count(),
193
+ related.registry = array_agg(registry),
194
+ process_names = array_agg(process_name),
195
+ command_lines = array_agg(process_cmd),
196
+ event_ids = array_agg(event.code),
197
+ users = array_agg(coalesce(winlog.user.name AS string, '') AS string),
198
+ risk_factors.has_run_keys = any(is_run_key),
199
+ risk_factors.has_startup_keys = any(is_startup_key),
200
+ risk_factors.has_service_keys = any(is_service_key),
201
+ risk_factors.has_logon_keys = any(is_logon_key),
202
+ risk_factors.has_extension_keys = any(is_extension_key),
203
+ risk_factors.has_sus_data = any(sus_value_data),
204
+ risk_factors.has_sus_location = any(sus_value_location),
205
+ risk_factors.has_encoded_data = any(sus_encoded_data),
206
+ risk_factors.has_sus_process = any(sus_process),
207
+ risk_factors.has_sus_cmdline = any(sus_cmdline),
208
+ max_risk = max(risk_score)
209
+ BY timestamp = timestamp@m, computer_name
210
+
211
+ // Join with baseline
212
+ | JOIN baseline_reg ON computer_name == baseline_reg.computer_name
213
+ | LET threshold = baseline_reg.avg + 3 * baseline_reg.stddev
214
+
215
+ // Alert criteria
216
+ | LET suspicious_condition_count = (
217
+ if(risk_factors.has_run_keys, 1, 0) +
218
+ if(risk_factors.has_startup_keys, 1, 0) +
219
+ if(risk_factors.has_service_keys, 1, 0) +
220
+ if(risk_factors.has_logon_keys, 1, 0) +
221
+ if(risk_factors.has_extension_keys, 1, 0) +
222
+ if(risk_factors.has_sus_data, 2, 0) + // Weight suspicious data more
223
+ if(risk_factors.has_sus_location, 1, 0) +
224
+ if(risk_factors.has_encoded_data, 2, 0) + // Weight encoded data more
225
+ if(risk_factors.has_sus_process, 1, 0) +
226
+ if(risk_factors.has_sus_cmdline, 1, 0)
227
+ )
228
+ // Thresholds tuned to reduce noise
229
+ | WHERE minute_count > threshold OR suspicious_condition_count >= 3 OR max_risk >= 70
230
+
231
+ // ECS required single-value fields
232
+ | LET primary_process_name = array_distinct(process_names)[0]
233
+ | LET primary_command_line = array_distinct(command_lines)[0]
234
+ | LET primary_event_code = array_distinct(event_ids)[0]
235
+ | LET primary_user = array_distinct(users)[0]
236
+ | LET registry = related.registry[0]
237
+ // Create simple message and structured labels
238
+ | LET message = "Windows Registry Persistence Signal"
239
+
240
+ // Create text values for labels
241
+ | LET run_keys_text = if(risk_factors.has_run_keys, "Yes", "No") AS string
242
+ | LET startup_keys_text = if(risk_factors.has_startup_keys, "Yes", "No") AS string
243
+ | LET service_keys_text = if(risk_factors.has_service_keys, "Yes", "No") AS string
244
+ | LET logon_keys_text = if(risk_factors.has_logon_keys, "Yes", "No") AS string
245
+ | LET extension_keys_text = if(risk_factors.has_extension_keys, "Yes", "No") AS string
246
+ | LET sus_data_text = if(risk_factors.has_sus_data, "Yes", "No") AS string
247
+ | LET sus_location_text = if(risk_factors.has_sus_location, "Yes", "No") AS string
248
+ | LET encoded_data_text = if(risk_factors.has_encoded_data, "Yes", "No") AS string
249
+ | LET sus_process_text = if(risk_factors.has_sus_process, "Yes", "No") AS string
250
+ | LET sus_cmdline_text = if(risk_factors.has_sus_cmdline, "Yes", "No") AS string
251
+ | LET minute_count_text = minute_count AS string
252
+ | LET threshold_text = threshold AS string
253
+ | LET condition_count_text = suspicious_condition_count AS string
254
+
255
+ // Create structured labels map
256
+ | LET labels = map(
257
+ "Severity": if(max_risk >= 80, 'HIGH',
258
+ if(max_risk >= 60, 'MEDIUM', 'LOW')),
259
+ "Host": computer_name,
260
+ "User": primary_user,
261
+ "Registry Key": registry.key,
262
+ "Registry Value": registry.value,
263
+ "Process": primary_process_name,
264
+ "Command Line": primary_command_line,
265
+ "Event Count": minute_count_text,
266
+ "Anomaly Threshold": threshold_text,
267
+ "Run Keys": run_keys_text,
268
+ "Startup Keys": startup_keys_text,
269
+ "Service Keys": service_keys_text,
270
+ "Logon Keys": logon_keys_text,
271
+ "Extension Keys": extension_keys_text,
272
+ "Suspicious Data": sus_data_text,
273
+ "Suspicious Location": sus_location_text,
274
+ "Encoded Data": encoded_data_text,
275
+ "Suspicious Process": sus_process_text,
276
+ "Suspicious Command Line": sus_cmdline_text,
277
+ "Suspicious Condition Count": condition_count_text
278
+ )
279
+
280
+ // Create detailed description for rule object
281
+ | LET detailed_description =
282
+ "Detection of suspicious registry persistence mechanisms" +
283
+ if(computer_name != "", " on host: " + computer_name, "") +
284
+ if(registry.key is not NULL, ". Registry key: " + registry.key, "") +
285
+ if(registry.value is not NULL, ". Registry value: " + registry.value, "")
286
+
287
+ // Final output
288
+ | SELECT
289
+ timestamp,
290
+ message,
291
+ labels,
292
+ host.name = computer_name,
293
+ event.count = minute_count,
294
+ anomaly_threshold = threshold,
295
+ registry,
296
+ process.name = primary_process_name,
297
+ related.process.name = array_distinct(process_names),
298
+ process.command_line = primary_command_line,
299
+ related.process.command_line = array_distinct(command_lines),
300
+ event.code = primary_event_code,
301
+ related.event.code = array_distinct(event_ids),
302
+ event.start = event.start,
303
+ event.end = event.end,
304
+ user.name = primary_user,
305
+ related.user.name = array_distinct(users),
306
+ condition_count = suspicious_condition_count,
307
+ risk_factors,
308
+ event.risk_score = max_risk,
309
+ rule = {
310
+ name: 'Windows Registry Persistence',
311
+ description: detailed_description,
312
+ mitre_tactics: [
313
+ 'Persistence',
314
+ 'Privilege Escalation',
315
+ 'Defense Evasion'
316
+ ],
317
+ mitre_techniques: [
318
+ 'T1547.001',
319
+ 'T1546.001',
320
+ 'T1112',
321
+ 'T1037'
322
+ ],
323
+ severity: if(max_risk >= 80, 'high',
324
+ if(max_risk >= 60, 'medium', 'low')),
325
+ risk_factors: [ // Keep descriptions in the rule object
326
+ if(risk_factors.has_run_keys, "Contains Run keys for persistence", null),
327
+ if(risk_factors.has_startup_keys, "Contains Startup keys for persistence", null),
328
+ if(risk_factors.has_service_keys, "Contains Service keys for persistence", null),
329
+ if(risk_factors.has_logon_keys, "Contains Logon keys for persistence", null),
330
+ if(risk_factors.has_extension_keys, "Contains Extension keys for persistence", null),
331
+ if(risk_factors.has_sus_data, "Contains suspicious data in registry values", null),
332
+ if(risk_factors.has_sus_location, "Contains suspicious file locations in registry values", null),
333
+ if(risk_factors.has_encoded_data, "Contains encoded or obfuscated data in registry values", null),
334
+ if(risk_factors.has_sus_process, "Registry modifications by suspicious process", null),
335
+ if(risk_factors.has_sus_cmdline, "Registry modifications with suspicious command line", null),
336
+ if(minute_count > threshold, "Anomalous number of registry operations detected", null)
337
+ ]
338
+ }
339
+ | SORT BY timestamp DESC
340
+ \`\`\`
341
+
342
+ ## Query breakdown
343
+
344
+ This detection demonstrates several advanced Hamelin techniques working
345
+ together to identify registry persistence while minimizing false positives.
346
+ Let's examine each component:
347
+
348
+ ### Baseline establishment
349
+
350
+ The detection starts by calculating normal registry activity baselines for
351
+ each host. This lets you identify unusual spikes in registry operations that
352
+ might indicate malicious activity:
353
+
354
+ \`\`\`hamelin
355
+ WITH baseline_reg = FROM simba.sysmon_events
356
+ | WHERE event.code IN ["12", "13", "14"]
357
+ | AGG reg_ops = count()
358
+ BY timestamp = timestamp@m, computer_name
359
+ | AGG avg = avg(reg_ops),
360
+ stddev = stddev(reg_ops)
361
+ BY computer_name
362
+ \`\`\`
363
+
364
+ This calculates the average and standard deviation of registry operations per
365
+ minute for each host. Hamelin uses these statistics to create baselines for
366
+ anomaly detection.
367
+
368
+ ### False positive filtering
369
+
370
+ The detection includes extensive filtering to exclude common legitimate
371
+ registry operations. Without this filtering, Windows Update, Office updates,
372
+ and other normal system processes would generate false positives:
373
+
374
+ \`\`\`hamelin
375
+ | WHERE NOT (
376
+ // Windows Installer
377
+ regexp_like(lower(reg_key),
378
+ '(?i).*\\\\microsoft\\\\windows\\\\currentversion\\\\installer\\\\.*') OR
379
+ // Microsoft Office Updates
380
+ regexp_like(lower(reg_key),
381
+ '(?i).*\\\\microsoft\\\\office\\\\.*updates\\\\.*') OR
382
+ // Windows Defender
383
+ regexp_like(lower(reg_key),
384
+ '(?i).*\\\\microsoft\\\\windows defender\\\\.*')
385
+ // ... additional exclusions
386
+ )
387
+ \`\`\`
388
+
389
+ These patterns filter out registry modifications by Windows Update, Office
390
+ updates, Windows Defender, and other legitimate system processes.
391
+
392
+ ### Persistence location detection
393
+
394
+ The detection identifies modifications to known registry persistence locations
395
+ using boolean flags. Each flag captures a different persistence mechanism
396
+ that attackers commonly exploit:
397
+
398
+ \`\`\`hamelin
399
+ | LET is_run_key = regexp_like(lower(reg_key),
400
+ '(?i).*(\\\\\\\\run\\\\\\\\|\\\\\\\\runonce\\\\\\\\|\\\\\\\\runonceex\\\\\\\\|\\\\\\\\explorer\\\\\\\\run).*')
401
+ | LET is_startup_key = regexp_like(lower(reg_key),
402
+ '(?i).*(\\\\\\\\windows\\\\\\\\currentversion\\\\\\\\shell\\\\\\\\|\\\\\\\\startup\\\\\\\\|\\\\\\\\load\\\\\\\\).*')
403
+ | LET is_service_key = regexp_like(lower(reg_key),
404
+ '(?i).*(\\\\\\\\services\\\\\\\\[^\\\\\\\\]+\\\\\\\\(imagepath|servicedll)).*')
405
+ \`\`\`
406
+
407
+ These patterns identify the most common registry-based persistence mechanisms
408
+ that attackers use to maintain access to compromised systems.
409
+
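+ As a quick sanity check on these flags, a minimal sketch (not part of the
+ detection itself; the \`*_hits\` names are illustrative) is to count how often
+ each flag fires per host with \`count_if()\`:
+
+ \`\`\`hamelin
+ // Sketch: rough per-host frequency of each persistence flag
+ | AGG run_hits = count_if(is_run_key),
+       startup_hits = count_if(is_startup_key),
+       service_hits = count_if(is_service_key)
+     BY computer_name
+ \`\`\`
+
+ Flags that fire constantly on known-good hosts are usually candidates for
+ additional exclusion patterns.
+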
410
+ ### Risk scoring system
411
+
412
+ The detection calculates a risk score based on multiple factors. Hamelin
413
+ assigns higher weights to more suspicious indicators like encoded data and
414
+ suspicious executables:
415
+
416
+ \`\`\`hamelin
417
+ | LET risk_score = (
418
+ if(is_run_key, 30, 0) + // High weight for run keys
419
+ if(sus_value_data, 40, 0) + // Very high weight for suspicious executables
420
+ if(sus_encoded_data, 40, 0) + // Very high weight for encoded data
421
+ if(sus_process, 25, 0) // Moderate weight for suspicious processes
422
+ )
423
+ \`\`\`
424
+
425
+ This scoring system lets you prioritize the most dangerous types of registry
426
+ modifications and focus analyst attention on the highest-risk events.
427
+
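+ To preview how the scores distribute before the alerting logic runs, one
+ minimal sketch (the \`50\` cutoff is an arbitrary placeholder) is to filter and
+ sort on the score:
+
+ \`\`\`hamelin
+ // Sketch: surface the highest-scoring events first (before the per-minute aggregation)
+ | WHERE risk_score >= 50
+ | SORT BY risk_score DESC
+ \`\`\`
+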
428
+ ### Anomaly detection integration
429
+
430
+ The detection combines pattern matching with statistical anomaly detection.
431
+ Hamelin establishes thresholds based on historical behavior for each host:
432
+
433
+ \`\`\`hamelin
434
+ | LET threshold = baseline_reg.avg + 3 * baseline_reg.stddev
435
+ | WHERE minute_count > threshold OR suspicious_condition_count >= 3 OR max_risk >= 70
436
+ \`\`\`
437
+
438
+ This triggers alerts when registry activity exceeds normal baselines or when
439
+ multiple suspicious conditions are present, ensuring you catch both volume-based
440
+ and pattern-based threats.
441
+
442
+ ## Advanced techniques demonstrated
443
+
444
+ This detection showcases several techniques that make Hamelin effective for
445
+ complex threat detection:
446
+
447
+ **Baseline anomaly detection** establishes statistical baselines for normal
448
+ behavior and identifies deviations that might indicate malicious activity.
449
+
450
+ **Multi-factor risk scoring** combines multiple indicators with weighted
451
+ scoring to prioritize the most dangerous threats.
452
+
453
+ **Comprehensive filtering** removes known false positives while preserving
454
+ detection of actual threats through extensive exclusion patterns.
455
+
456
+ **Boolean logic aggregation** uses array functions like \`any()\` to aggregate
457
+ boolean flags across multiple events in time windows (see the sketch below).
458
+
459
+ **Dynamic thresholding** adjusts alert criteria based on both statistical
460
+ baselines and fixed thresholds for suspicious conditions.
461
+
462
+ **Structured output** creates both human-readable labels and machine-readable
463
+ risk factors for different analysis workflows.
464
+
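+ To make the boolean logic aggregation concrete, here is a minimal sketch that
+ collapses the per-event \`is_run_key\` flag (defined earlier in the query) into
+ a single flag per host and minute:
+
+ \`\`\`hamelin
+ // Sketch: one boolean per host and minute, true if any event set the flag
+ | AGG has_run_key = any(is_run_key)
+     BY timestamp = timestamp@m, computer_name
+ \`\`\`
+
+ The same pattern works with \`all()\` when every event in the group must
+ satisfy the condition.
+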
465
+ ## Detection tuning
466
+
467
+ You can adapt this detection for different environments:
468
+
469
+ **Adjust baseline periods** by modifying the time aggregation intervals used
470
+ for calculating statistical baselines.
471
+
472
+ **Update exclusion patterns** by adding environment-specific legitimate
473
+ registry operations to the filtering logic.
474
+
475
+ **Modify risk weights** by adjusting the scoring values based on observed
476
+ attack patterns in your environment.
477
+
478
+ **Tune alert thresholds** by changing the suspicious condition count and risk
479
+ score thresholds based on acceptable false positive rates (see the sketch below).
480
+
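+ For example, a stricter alert gate is a one-line change. The numbers below are
+ placeholders to tune against your own false positive tolerance:
+
+ \`\`\`hamelin
+ // Sketch: raise the condition count and risk score cutoffs (placeholder values)
+ | WHERE minute_count > threshold OR suspicious_condition_count >= 4 OR max_risk >= 80
+ \`\`\`
+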
481
+ The modular structure makes it easy to test individual components and adjust
482
+ scoring weights without affecting the overall detection logic.`,
483
+ "function-reference/aggregation-functions.md": "# Aggregation Functions\n\nFunctions that operate on groups of rows to produce summary values when used with `AGG` or `WINDOW` commands.\n\n## `count([expression])`\n\nReturns the number of rows in each group during aggregation.\n\n### Parameters\n\n- **expression** (optional) - When provided, only counts rows where this expression evaluates to a non-null value\n\n### Description\n\nThe `count()` function returns the number of rows in each group when you use it\nwith `AGG` or `WINDOW` commands. Unlike SQL, Hamelin uses `count()` rather than\n`count(*)`. When no grouping is specified, it counts all rows in the result.\n\nWhen you provide an expression parameter, `count()` only counts rows where that\nexpression evaluates to a non-null value. This lets you count specific \nconditions without using separate filtering.\n\nWhen you omit an explicit column name, Hamelin automatically generates the \ncolumn name `count()` or `count(expression)` that you can reference using \nbackticks in subsequent commands.\n\n## `count_if(condition)`\n\nCounts the number of rows where a condition is true in each group.\n\n### Parameters\n\n- **condition** - Boolean expression that must evaluate to true for the row to be counted\n\n### Description\n\nThe `count_if()` function counts only rows where the specified condition \nevaluates to true. This provides a more concise alternative to using `WHERE` \nclauses or `case()` expressions for conditional counting.\n\nWhen you omit an explicit column name, Hamelin automatically generates the \ncolumn name `count_if(condition)` that you can reference using backticks in \nsubsequent commands.\n\n## `sum(x)`\n\nReturns the sum of all values in each group.\n\n### Parameters\n\n- **x** - Numeric expression to sum\n\n### Description\n\nThe `sum()` function adds up all non-null values of the specified expression \nwithin each group. If all values are null, it returns null. The result type \nmatches the input type for exact numeric types.\n\n## `avg(x)`\n\nReturns the average (arithmetic mean) of all values in each group.\n\n### Parameters\n\n- **x** - Numeric expression to average\n\n### Description\n\nThe `avg()` function calculates the arithmetic mean of all non-null values \nwithin each group. If all values are null, it returns null. The result is \nalways a floating-point type regardless of input type.\n\n## `stddev(x)`\n\nReturns the standard deviation of all values in each group.\n\n### Parameters\n\n- **x** - Numeric expression to calculate standard deviation for\n\n### Description\n\nThe `stddev()` function calculates the sample standard deviation of all \nnon-null values within each group. If there are fewer than two non-null \nvalues, it returns null. The result is always a floating-point type.\n\n## `approx_percentile(x, percentile)`\n\nReturns an approximate percentile value for the specified expression.\n\n### Parameters\n\n- **x** - Numeric expression to calculate percentile for\n- **percentile** - Numeric value between 0.0 and 1.0 representing the desired percentile\n\n### Description\n\nThe `approx_percentile()` function calculates an approximate percentile using \nefficient algorithms suitable for large datasets. The percentile parameter \nshould be between 0.0 (minimum) and 1.0 (maximum). 
For example, 0.5 returns \nthe median, 0.95 returns the 95th percentile.\n\n## `min(x)`\n\nReturns the minimum value in each group.\n\n### Parameters\n\n- **x** - Expression of numeric, string, or timestamp type\n\n### Description\n\nThe `min()` function finds the smallest value within each group. It works with \nnumeric types (returning the numerically smallest), strings (lexicographic \nordering), and timestamps (chronologically earliest). If all values are null, \nit returns null.\n\n## `max(x)`\n\nReturns the maximum value in each group.\n\n### Parameters\n\n- **x** - Expression of numeric, string, or timestamp type\n\n### Description\n\nThe `max()` function finds the largest value within each group. It works with \nnumeric types (returning the numerically largest), strings (lexicographic \nordering), and timestamps (chronologically latest). If all values are null, \nit returns null.\n\n## `array_agg(x)`\n\nCollects all values in each group into an array.\n\n### Parameters\n\n- **x** - Expression of any type to collect into an array\n\n### Description\n\nThe `array_agg()` function creates an array containing all non-null values \nfrom the specified expression within each group. The order of elements in the \nresulting array follows the order specified by any `SORT` clause in the \naggregation command. If there are no non-null values, it returns an empty array.\n\n## `map_agg(key, value)`\n\nCollects key-value pairs in each group into a map.\n\n### Parameters\n\n- **key** - Expression of any type to use as map keys\n- **value** - Expression of any type to use as map values\n\n### Description\n\nThe `map_agg()` function creates a map from key-value pairs within each group. \nIf the same key appears multiple times, only the last value is retained. The \norder of processing follows any `SORT` clause in the aggregation command.\n\n## `multimap_agg(key, value)`\n\nCollects key-value pairs in each group into a map where each key maps to an array of values.\n\n### Parameters\n\n- **key** - Expression of any type to use as map keys\n- **value** - Expression of any type to collect into arrays\n\n### Description\n\nThe `multimap_agg()` function creates a map where each unique key maps to an \narray of all values associated with that key within each group. This preserves \nall values for duplicate keys, unlike `map_agg()` which keeps only the last value.\n\n## `any(x)`\n\nReturns true if any value in the group is true (logical OR aggregation).\n\n### Parameters\n\n- **x** - Boolean expression to test\n\n### Description\n\nThe `any()` function performs logical OR aggregation on boolean values within \neach group. It returns true if at least one value is true, false if all values \nare false, and null if all values are null.\n\n## `all(x)`\n\nReturns true if all values in the group are true (logical AND aggregation).\n\n### Parameters\n\n- **x** - Boolean expression to test\n\n### Description\n\nThe `all()` function performs logical AND aggregation on boolean values within \neach group. It returns true if all values are true, false if at least one \nvalue is false, and null if all values are null.",
484
+ "function-reference/array-functions.md": "# Array Functions\n\nScalar functions for array processing and manipulation that can be used in any expression context.\n\n## `array_distinct(x)`\n\nRemoves duplicate elements from an array.\n\n### Parameters\n\n- **x** - Array expression\n\n### Description\n\nThe `array_distinct()` function returns a new array containing only the unique\nelements from the input array. The order of elements in the result is not\nguaranteed. If the input array is null, the function returns null.\n\n## `any(x)`\n\nTests whether any element in a boolean array is true.\n\n### Parameters\n\n- **x** - Array of boolean expressions\n\n### Description\n\nThe `any()` function returns true if at least one element in the boolean array\nis true, false if all elements are false. If the array is empty, it returns\nfalse. If the array contains only null values, it returns null. This function\nperforms logical OR aggregation across array elements.\n\n## `all(x)`\n\nTests whether all elements in a boolean array are true.\n\n### Parameters\n\n- **x** - Array of boolean expressions\n\n### Description\n\nThe `all()` function returns true if all elements in the boolean array are true,\nfalse if at least one element is false. If the array is empty, it returns true.\nIf the array contains only null values, it returns null. This function performs\nlogical AND aggregation across array elements.\n\n## `max(x)`\n\nReturns the maximum element from an array.\n\n### Parameters\n\n- **x** - Array of numeric, string, or timestamp expressions\n\n### Description\n\nThe `max()` function finds and returns the largest element in the array. For\nnumeric arrays, it returns the numerically largest value. For string arrays,\nit uses lexicographic ordering. For timestamp arrays, it returns the\nchronologically latest value. If the array is empty or contains only null\nvalues, it returns null.\n\n## `min(x)`\n\nReturns the minimum element from an array.\n\n### Parameters\n\n- **x** - Array of numeric, string, or timestamp expressions\n\n### Description\n\nThe `min()` function finds and returns the smallest element in the array. For\nnumeric arrays, it returns the numerically smallest value. For string arrays,\nit uses lexicographic ordering. For timestamp arrays, it returns the\nchronologically earliest value. If the array is empty or contains only null\nvalues, it returns null.\n\n## `sum(x)`\n\nReturns the sum of all numeric elements in an array.\n\n### Parameters\n\n- **x** - Array of numeric expressions\n\n### Description\n\nThe `sum()` function calculates the sum of all numeric elements in the array.\nNull values are ignored in the calculation. If the array is empty or contains\nonly null values, it returns null. The result type matches the element type\nfor exact numeric types.\n\n## `len(x)`\n\nReturns the number of elements in an array.\n\n### Parameters\n\n- **x** - Array expression of any element type\n\n### Description\n\nThe `len()` function returns the number of elements in the array as an integer.\nThis includes null elements in the count. If the array itself is null, the\nfunction returns null. An empty array returns 0.\n\n## `filter_null(x)`\n\nRemoves null elements from an array.\n\n### Parameters\n\n- **x** - Array expression of any element type\n\n### Description\n\nThe `filter_null()` function returns a new array containing only the non-null\nelements from the input array. The order of remaining elements is preserved.\nIf all elements are null, it returns an empty array. 
If the input array is\nnull, the function returns null.",
485
+ "function-reference/conditional-functions.md": "# Conditional Functions\n\nScalar functions for conditional logic and branching that can be used in any expression context.\n\n## `if(condition, then)` / `if(condition, then, else)`\n\nReturns different values based on a boolean condition.\n\n### Parameters\n\n- **condition** - Boolean expression to evaluate\n- **then** - Expression to return when condition is true\n- **else** (optional) - Expression to return when condition is false\n\n### Description\n\nThe `if()` function evaluates the condition and returns the `then` expression\nif the condition is true. When used with two parameters, it returns null if\nthe condition is false. When used with three parameters, it returns the `else`\nexpression if the condition is false.\n\nBoth the `then` and `else` expressions must be of the same type when the\nthree-parameter form is used. The function provides a concise way to implement\nconditional logic within expressions.\n\n## `case(when: then, when: then, ...)`\n\nReturns values based on multiple conditions evaluated in order.\n\n### Parameters\n\n- **when: then** - Variable number of condition-value pairs\n\n### Description\n\nThe `case()` function evaluates multiple condition-value pairs in order and\nreturns the value associated with the first condition that evaluates to true.\nUnlike SQL's CASE WHEN syntax, Hamelin uses function syntax with colon-separated\npairs.\n\nEach condition must be a boolean expression, and all values must be of the same\ntype. If no condition matches, the function returns null. The conditions are\nevaluated in the order they appear, so earlier conditions take precedence.\n\n## `coalesce(...)`\n\nReturns the first non-null value from a list of expressions.\n\n### Parameters\n\n- **...** - Variable number of expressions of the same type\n\n### Description\n\nThe `coalesce()` function evaluates expressions from left to right and returns\nthe first expression that is not null. If all expressions are null, it returns\nnull. All expressions must be of the same type.\n\nThis function is commonly used for providing default values or handling null\nvalues in expressions. It's particularly useful when you want to fall back\nthrough a series of potentially null values to find the first valid one.",
486
+ "function-reference/data-structure-functions.md": "# Data Structure Functions\n\nScalar functions for data structure operations and type information that can be used in any expression context.\n\n## `typeof(x)`\n\nReturns type information for any expression.\n\n### Parameters\n\n- **x** - Expression of any type\n\n### Description\n\nThe `typeof()` function returns a struct containing detailed type information\nabout the input expression. The result includes both the Hamelin type name\nand the corresponding SQL type name. This function is useful for debugging,\ntype introspection, and understanding how Hamelin maps types to the underlying\nSQL engine.\n\n## `map(keys, values)`\n\nCreates a map from separate key and value arrays.\n\n### Parameters\n\n- **keys** - Array expression containing map keys\n- **values** - Array expression containing map values\n\n### Description\n\nThe `map()` function creates a map by pairing elements from the keys array\nwith elements from the values array. Both arrays must have the same length.\nThe nth element from the keys array is paired with the nth element from the\nvalues array. If the arrays have different lengths, an error is raised.\n\n## `map(elements)`\n\nCreates a map from an array of key-value tuples.\n\n### Parameters\n\n- **elements** - Array of tuples where each tuple contains a key and value\n\n### Description\n\nThe `map()` function creates a map from an array of key-value pairs represented\nas tuples. Each tuple in the array must contain exactly two elements: the first\nelement becomes the key, and the second element becomes the value. This format\nis useful when you have structured key-value data.\n\n## `map()`\n\nCreates an empty map.\n\n### Parameters\n\nThis function takes no parameters.\n\n### Description\n\nThe `map()` function creates an empty map with unknown key and value types.\nThis is useful for initializing map variables or as a starting point for\nmap operations. The specific key and value types are inferred from subsequent\nusage context.\n\n## `map(key: value, ...)`\n\nCreates a map from literal key-value pairs.\n\n### Parameters\n\n- **key: value** - Variable number of key-value pairs using colon syntax\n\n### Description\n\nThe `map()` function creates a map from explicitly specified key-value pairs\nusing Hamelin's colon syntax. Each key must be unique within the map. All keys\nmust be of the same type, and all values must be of the same type. This provides\na concise way to create maps with known literal values.\n\n## `map_keys(map)`\n\nExtracts all keys from a map as an array.\n\n### Parameters\n\n- **map** - Map expression\n\n### Description\n\nThe `map_keys()` function returns an array containing all keys from the input\nmap. The order of keys in the resulting array is not guaranteed. If the map\nis empty, it returns an empty array. If the map is null, the function returns null.\n\n## `parse_json(json)`\n\nParses a JSON string into a variant type.\n\n### Parameters\n\n- **json** - String expression containing valid JSON\n\n### Description\n\nThe `parse_json()` function parses a JSON string and returns the result as\na variant type that can represent any JSON structure including objects, arrays,\nstrings, numbers, booleans, and null values. If the input string is not valid\nJSON, an error is raised. 
The variant type preserves the original JSON structure\nand allows dynamic access to nested elements.\n\n## `parse_json(variant)`\n\nReturns a variant value unchanged (identity function for variants).\n\n### Parameters\n\n- **variant** - Variant expression\n\n### Description\n\nWhen `parse_json()` is called with a variant input, it simply returns the\nvariant unchanged. This overload allows `parse_json()` to be safely used\non values that might already be variants without causing errors or unnecessary\nconversions.\n\n## `len(collection)`\n\nReturns the number of elements in a collection.\n\n### Parameters\n\n- **collection** - Array or map expression\n\n### Description\n\nThe `len()` function returns the number of elements in arrays or maps as an\ninteger. For arrays, it counts all elements including null values. For maps,\nit counts the number of key-value pairs. If the collection is null, the\nfunction returns null. Empty collections return 0.\n\n## `filter_null(array)`\n\nRemoves null elements from an array.\n\n### Parameters\n\n- **array** - Array expression of any element type\n\n### Description\n\nThe `filter_null()` function returns a new array containing only the non-null\nelements from the input array. The order of remaining elements is preserved.\nIf all elements are null, it returns an empty array. If the input array is\nnull, the function returns null. This function is essential for cleaning\ndata before further processing.",
487
+ "function-reference/match-group-functions.md": "# Match Group Functions\n\nFunctions for accessing events within pattern matching groups that must be used with the `MATCH` command.\n\n## `first(expression)` / `first(expression, offset)`\n\nReturns the value of an expression from the first event in a match group.\n\n### Parameters\n\n- **expression** - Expression to evaluate from the first event\n- **offset** (optional) - Integer specifying which occurrence to access (default: 0)\n\n### Description\n\nThe `first()` function retrieves the value of the specified expression from\nthe first event in the current match group. When used with the offset parameter,\nit returns the value from the first + offset event. This function is commonly\nused to access timestamps, field values, or calculated expressions from the\nbeginning of a matched event sequence.\n\n## `last(expression)` / `last(expression, offset)`\n\nReturns the value of an expression from the last event in a match group.\n\n### Parameters\n\n- **expression** - Expression to evaluate from the last event\n- **offset** (optional) - Integer specifying which occurrence to access (default: 0)\n\n### Description\n\nThe `last()` function retrieves the value of the specified expression from\nthe last event in the current match group. When used with the offset parameter,\nit returns the value from the last - offset event. This function is commonly\nused to measure durations, access final states, or extract values from the\nend of a matched event sequence.\n\n## `prev(expression)`\n\nReturns the value of an expression from the previous event in the sequence.\n\n### Parameters\n\n- **expression** - Expression to evaluate from the previous event\n\n### Description\n\nThe `prev()` function retrieves the value of the specified expression from\nthe event immediately preceding the current event in the match sequence.\nThis function provides access to the previous event's state, enabling\ncomparisons and calculations that depend on sequential relationships\nbetween events.\n\n## `next(expression)`\n\nReturns the value of an expression from the next event in the sequence.\n\n### Parameters\n\n- **expression** - Expression to evaluate from the next event\n\n### Description\n\nThe `next()` function retrieves the value of the specified expression from\nthe event immediately following the current event in the match sequence.\nThis function enables forward-looking analysis and calculations that depend\non subsequent events in the pattern.",
488
+ "function-reference/mathematical-functions.md": "# Mathematical Functions\n\nScalar functions for mathematical operations and calculations that can be used in any expression context.\n\n## `abs(x)`\n\nReturns the absolute value of a number.\n\n### Parameters\n\n- **x** - Numeric expression\n\n### Description\n\nThe `abs()` function returns the absolute value (magnitude) of the input number,\nremoving any negative sign. For positive numbers and zero, it returns the value\nunchanged. For negative numbers, it returns the positive equivalent.\n\n## `cbrt(x)`\n\nReturns the cube root of a number.\n\n### Parameters\n\n- **x** - Numeric expression\n\n### Description\n\nThe `cbrt()` function calculates the cube root of the input value. The result\nis always returned as a double-precision floating-point number. Unlike square\nroot, cube root is defined for negative numbers.\n\n## `ceil(x)` / `ceiling(x)`\n\nRounds a number up to the nearest integer.\n\n### Parameters\n\n- **x** - Numeric expression\n\n### Description\n\nThe `ceil()` and `ceiling()` functions round the input value up to the nearest\ninteger. For positive numbers, this means rounding away from zero. For negative\nnumbers, this means rounding toward zero. Both function names are equivalent.\n\n## `degrees(x)`\n\nConverts radians to degrees.\n\n### Parameters\n\n- **x** - Numeric expression representing an angle in radians\n\n### Description\n\nThe `degrees()` function converts an angle from radians to degrees. The result\nis always returned as a double-precision floating-point number. The conversion\nuses the formula: degrees = radians \xD7 (180/\u03C0).\n\n## `e()`\n\nReturns Euler's number (mathematical constant e).\n\n### Parameters\n\nThis function takes no parameters.\n\n### Description\n\nThe `e()` function returns the mathematical constant e (approximately 2.71828),\nwhich is the base of natural logarithms. The result is returned as a\ndouble-precision floating-point number.\n\n## `exp(x)`\n\nReturns e raised to the power of x.\n\n### Parameters\n\n- **x** - Numeric expression representing the exponent\n\n### Description\n\nThe `exp()` function calculates e^x, where e is Euler's number. This is the\nexponential function, which is the inverse of the natural logarithm. The result\nis always returned as a double-precision floating-point number.\n\n## `floor(x)`\n\nRounds a number down to the nearest integer.\n\n### Parameters\n\n- **x** - Numeric expression\n\n### Description\n\nThe `floor()` function rounds the input value down to the nearest integer. For\npositive numbers, this means rounding toward zero. For negative numbers, this\nmeans rounding away from zero.\n\n## `ln(x)`\n\nReturns the natural logarithm of a number.\n\n### Parameters\n\n- **x** - Numeric expression (must be positive)\n\n### Description\n\nThe `ln()` function calculates the natural logarithm (base e) of the input\nvalue. The input must be positive; negative values or zero will result in an\nerror. The result is always returned as a double-precision floating-point number.\n\n## `log(b, x)`\n\nReturns the logarithm of x with the specified base.\n\n### Parameters\n\n- **b** - Numeric expression representing the logarithm base\n- **x** - Numeric expression (must be positive)\n\n### Description\n\nThe `log()` function calculates the logarithm of x using the specified base b.\nBoth the base and the value must be positive. 
The result is always returned as\na double-precision floating-point number.\n\n## `log10(x)`\n\nReturns the base-10 logarithm of a number.\n\n### Parameters\n\n- **x** - Numeric expression (must be positive)\n\n### Description\n\nThe `log10()` function calculates the common logarithm (base 10) of the input\nvalue. The input must be positive; negative values or zero will result in an\nerror. The result is always returned as a double-precision floating-point number.\n\n## `log2(x)`\n\nReturns the base-2 logarithm of a number.\n\n### Parameters\n\n- **x** - Numeric expression (must be positive)\n\n### Description\n\nThe `log2()` function calculates the binary logarithm (base 2) of the input\nvalue. The input must be positive; negative values or zero will result in an\nerror. The result is always returned as a double-precision floating-point number.\n\n## `pi()`\n\nReturns the mathematical constant \u03C0 (pi).\n\n### Parameters\n\nThis function takes no parameters.\n\n### Description\n\nThe `pi()` function returns the mathematical constant \u03C0 (approximately 3.14159),\nwhich represents the ratio of a circle's circumference to its diameter. The\nresult is returned as a double-precision floating-point number.\n\n## `pow(x, p)` / `power(x, p)`\n\nRaises a number to the specified power.\n\n### Parameters\n\n- **x** - Numeric expression representing the base\n- **p** - Numeric expression representing the exponent\n\n### Description\n\nThe `pow()` and `power()` functions calculate x raised to the power of p (x^p).\nBoth function names are equivalent. The result is always returned as a\ndouble-precision floating-point number.\n\n## `radians(x)`\n\nConverts degrees to radians.\n\n### Parameters\n\n- **x** - Numeric expression representing an angle in degrees\n\n### Description\n\nThe `radians()` function converts an angle from degrees to radians. The result\nis always returned as a double-precision floating-point number. The conversion\nuses the formula: radians = degrees \xD7 (\u03C0/180).\n\n## `round(x)` / `round(x, d)`\n\nRounds a number to the nearest integer or specified decimal places.\n\n### Parameters\n\n- **x** - Numeric expression to round\n- **d** (optional) - Integer specifying the number of decimal places\n\n### Description\n\nThe `round()` function rounds the input value to the nearest integer when used\nwith one parameter, or to the specified number of decimal places when used with\ntwo parameters. The rounding follows standard mathematical rules (0.5 rounds up).\n\n## `sign(x)`\n\nReturns the sign of a number.\n\n### Parameters\n\n- **x** - Numeric expression\n\n### Description\n\nThe `sign()` function returns -1 for negative numbers, 0 for zero, and 1 for\npositive numbers. This function helps determine the sign of a value without\nregard to its magnitude.\n\n## `sqrt(x)`\n\nReturns the square root of a number.\n\n### Parameters\n\n- **x** - Numeric expression (must be non-negative)\n\n### Description\n\nThe `sqrt()` function calculates the square root of the input value. The input\nmust be non-negative; negative values will result in an error. The result is\nalways returned as a double-precision floating-point number.\n\n## `truncate(x)`\n\nRemoves the fractional part of a number.\n\n### Parameters\n\n- **x** - Numeric expression\n\n### Description\n\nThe `truncate()` function removes the fractional part of a number, effectively\nrounding toward zero. 
For positive numbers, this is equivalent to `floor()`.\nFor negative numbers, this is equivalent to `ceil()`.\n\n## `width_bucket(x, bound1, bound2, n)`\n\nReturns the bucket number for a value in a histogram with equal-width buckets.\n\n### Parameters\n\n- **x** - Numeric expression representing the value to bucket\n- **bound1** - Numeric expression representing the lower bound\n- **bound2** - Numeric expression representing the upper bound \n- **n** - Integer expression representing the number of buckets\n\n### Description\n\nThe `width_bucket()` function determines which bucket a value falls into when\ndividing the range between bound1 and bound2 into n equal-width buckets. Values\noutside the bounds return 0 (below bound1) or n+1 (above bound2).\n\n## `width_bucket(x, bins)`\n\nReturns the bucket number for a value using explicitly defined bucket boundaries.\n\n### Parameters\n\n- **x** - Numeric expression representing the value to bucket\n- **bins** - Array of numeric values representing bucket boundaries\n\n### Description\n\nThe `width_bucket()` function determines which bucket a value falls into using\nan array of explicitly defined bucket boundaries. The function returns the\nindex of the bucket where the value belongs, with 0 for values below the\nlowest boundary and array length + 1 for values above the highest boundary.",
489
+ "function-reference/regular-expression-functions.md": "# Regular Expression Functions\n\nScalar functions for pattern matching and advanced text processing using regular expressions.\n\n## `regexp_count(string, pattern)`\n\nCounts the number of times a regular expression pattern matches in a string.\n\n### Parameters\n\n- **string** - String expression to search within\n- **pattern** - String expression representing the regular expression pattern\n\n### Description\n\nThe `regexp_count()` function returns the number of non-overlapping matches of\nthe specified regular expression pattern within the input string. If no matches\nare found, it returns 0. The pattern uses standard regular expression syntax.\n\n## `regexp_extract_all(string, pattern)` / `regexp_extract_all(string, pattern, group)`\n\nExtracts all matches of a regular expression pattern from a string.\n\n### Parameters\n\n- **string** - String expression to search within\n- **pattern** - String expression representing the regular expression pattern\n- **group** (optional) - Integer specifying which capture group to extract\n\n### Description\n\nThe `regexp_extract_all()` function returns an array containing all matches of\nthe specified pattern. When used with two parameters, it returns the entire\nmatch. When used with three parameters, it returns the specified capture group\nfrom each match.\n\nIf no matches are found, it returns an empty array. Capture groups are numbered\nstarting from 1, with 0 representing the entire match.\n\n## `regexp_extract(string, pattern)` / `regexp_extract(string, pattern, group)`\n\nExtracts the first match of a regular expression pattern from a string.\n\n### Parameters\n\n- **string** - String expression to search within\n- **pattern** - String expression representing the regular expression pattern\n- **group** (optional) - Integer specifying which capture group to extract\n\n### Description\n\nThe `regexp_extract()` function returns the first match of the specified pattern.\nWhen used with two parameters, it returns the entire match. When used with three\nparameters, it returns the specified capture group from the first match.\n\nIf no match is found, it returns null. Capture groups are numbered starting\nfrom 1, with 0 representing the entire match.\n\n## `regexp_like(string, pattern)`\n\nTests whether a string matches a regular expression pattern.\n\n### Parameters\n\n- **string** - String expression to test\n- **pattern** - String expression representing the regular expression pattern\n\n### Description\n\nThe `regexp_like()` function returns true if the input string contains a match\nfor the specified regular expression pattern, false otherwise. This function\ntests for the presence of a match anywhere within the string, not just at the\nbeginning or end.\n\n## `regexp_position(string, pattern)` / `regexp_position(string, pattern, start)` / `regexp_position(string, pattern, start, occurrence)`\n\nReturns the position of a regular expression match within a string.\n\n### Parameters\n\n- **string** - String expression to search within\n- **pattern** - String expression representing the regular expression pattern\n- **start** (optional) - Integer specifying the starting position for the search\n- **occurrence** (optional) - Integer specifying which occurrence to find\n\n### Description\n\nThe `regexp_position()` function returns the 1-based position of the first\ncharacter of a pattern match within the string. When used with the `start`\nparameter, it begins searching from that position. 
When used with the\n`occurrence` parameter, it finds the nth occurrence of the pattern.\n\nIf no match is found, it returns 0. The start position is 1-based, and the\noccurrence count begins at 1 for the first match.\n\n## `regexp_replace(string, pattern)` / `regexp_replace(string, pattern, replacement)`\n\nReplaces matches of a regular expression pattern in a string.\n\n### Parameters\n\n- **string** - String expression to search within\n- **pattern** - String expression representing the regular expression pattern\n- **replacement** (optional) - String expression to replace matches with\n\n### Description\n\nThe `regexp_replace()` function replaces all matches of the specified pattern.\nWhen used with two parameters, it removes all matches (replaces with empty\nstring). When used with three parameters, it replaces matches with the\nspecified replacement string.\n\nThe replacement string can include capture group references using standard\nregular expression syntax. If no matches are found, the original string is\nreturned unchanged.\n\n## `regexp_split(string, pattern)`\n\nSplits a string using a regular expression pattern as the delimiter.\n\n### Parameters\n\n- **string** - String expression to split\n- **pattern** - String expression representing the regular expression pattern to use as delimiter\n\n### Description\n\nThe `regexp_split()` function splits the input string at each occurrence of\nthe specified pattern and returns an array of the resulting substrings. The\npattern matches are not included in the result array.\n\nIf the pattern is not found, the function returns an array containing the\noriginal string as a single element. If the pattern matches at the beginning\nor end of the string, empty strings may be included in the result array.",
490
+ "function-reference/string-functions.md": "# String Functions\n\nScalar functions for string processing and manipulation that can be used in any expression context.\n\n## `replace(string, pattern)`\n\nReplaces all occurrences of a pattern in a string.\n\n### Parameters\n\n- **string** - String expression to search within\n- **pattern** - String expression representing the text to replace\n\n### Description\n\nThe `replace()` function removes all occurrences of the specified pattern from\nthe input string. This function performs literal string replacement, not\npattern matching. If the pattern is not found, the original string is returned\nunchanged.\n\n## `starts_with(string, prefix)`\n\nTests whether a string starts with a specified prefix.\n\n### Parameters\n\n- **string** - String expression to test\n- **prefix** - String expression representing the prefix to check for\n\n### Description\n\nThe `starts_with()` function returns true if the input string begins with the\nspecified prefix, false otherwise. The comparison is case-sensitive. An empty\nprefix will always return true for any string.\n\n## `ends_with(string, suffix)`\n\nTests whether a string ends with a specified suffix.\n\n### Parameters\n\n- **string** - String expression to test\n- **suffix** - String expression representing the suffix to check for\n\n### Description\n\nThe `ends_with()` function returns true if the input string ends with the\nspecified suffix, false otherwise. The comparison is case-sensitive. An empty\nsuffix will always return true for any string.\n\n## `contains(string, substring)`\n\nTests whether a string contains a specified substring.\n\n### Parameters\n\n- **string** - String expression to search within\n- **substring** - String expression representing the text to search for\n\n### Description\n\nThe `contains()` function returns true if the input string contains the\nspecified substring anywhere within it, false otherwise. The comparison is\ncase-sensitive. An empty substring will always return true for any string.\n\n## `lower(string)`\n\nConverts a string to lowercase.\n\n### Parameters\n\n- **string** - String expression to convert\n\n### Description\n\nThe `lower()` function converts all uppercase characters in the input string\nto their lowercase equivalents. Characters that are already lowercase or\nnon-alphabetic characters remain unchanged.\n\n## `upper(string)`\n\nConverts a string to uppercase.\n\n### Parameters\n\n- **string** - String expression to convert\n\n### Description\n\nThe `upper()` function converts all lowercase characters in the input string\nto their uppercase equivalents. Characters that are already uppercase or\nnon-alphabetic characters remain unchanged.\n\n## `len(string)`\n\nReturns the length of a string in characters.\n\n### Parameters\n\n- **string** - String expression to measure\n\n### Description\n\nThe `len()` function returns the number of characters in the input string.\nThis counts Unicode characters, not bytes, so multi-byte characters are\ncounted as single characters. An empty string returns 0.",
491
+ "function-reference/time-date-functions.md": '# Time & Date Functions\n\nScalar functions for temporal data processing and manipulation that can be used in any expression context.\n\n## `now()`\n\nReturns the current timestamp.\n\n### Parameters\n\nThis function takes no parameters.\n\n### Description\n\nThe `now()` function returns the current date and time as a timestamp. The\nexact timestamp represents the moment when the function is evaluated during\nquery execution. All calls to `now()` within the same query execution return\nthe same timestamp value.\n\n## `today()`\n\nReturns today\'s date at midnight.\n\n### Parameters\n\nThis function takes no parameters.\n\n### Description\n\nThe `today()` function returns the current date with the time portion set to\nmidnight (00:00:00). This is equivalent to truncating `now()` to the day\nboundary. The result represents the start of the current day.\n\n## `yesterday()`\n\nReturns yesterday\'s date at midnight.\n\n### Parameters\n\nThis function takes no parameters.\n\n### Description\n\nThe `yesterday()` function returns yesterday\'s date with the time portion set\nto midnight (00:00:00). This is equivalent to subtracting one day from `today()`.\nThe result represents the start of the previous day.\n\n## `tomorrow()`\n\nReturns tomorrow\'s date at midnight.\n\n### Parameters\n\nThis function takes no parameters.\n\n### Description\n\nThe `tomorrow()` function returns tomorrow\'s date with the time portion set to\nmidnight (00:00:00). This is equivalent to adding one day to `today()`. The\nresult represents the start of the next day.\n\n## `ts(timestamp)`\n\nConverts a string to a timestamp.\n\n### Parameters\n\n- **timestamp** - String expression representing a timestamp\n\n### Description\n\nThe `ts()` function parses a string representation of a timestamp and converts\nit to a timestamp type. The function accepts various timestamp formats including\nISO 8601 format. If the string cannot be parsed as a valid timestamp, an error\nis raised.\n\n## `year(timestamp)`\n\nExtracts the year from a timestamp.\n\n### Parameters\n\n- **timestamp** - Timestamp expression\n\n### Description\n\nThe `year()` function extracts the year component from a timestamp and returns\nit as an integer. For example, a timestamp of "2023-07-15 14:30:00" would\nreturn 2023.\n\n## `month(timestamp)`\n\nExtracts the month from a timestamp.\n\n### Parameters\n\n- **timestamp** - Timestamp expression\n\n### Description\n\nThe `month()` function extracts the month component from a timestamp and returns\nit as an integer from 1 to 12, where 1 represents January and 12 represents\nDecember. For example, a timestamp of "2023-07-15 14:30:00" would return 7.\n\n## `day(timestamp)`\n\nExtracts the day of the month from a timestamp.\n\n### Parameters\n\n- **timestamp** - Timestamp expression\n\n### Description\n\nThe `day()` function extracts the day component from a timestamp and returns\nit as an integer from 1 to 31, depending on the month. For example, a timestamp\nof "2023-07-15 14:30:00" would return 15.\n\n## `hour(timestamp)`\n\nExtracts the hour from a timestamp.\n\n### Parameters\n\n- **timestamp** - Timestamp expression\n\n### Description\n\nThe `hour()` function extracts the hour component from a timestamp and returns\nit as an integer from 0 to 23, using 24-hour format. 
For example, a timestamp\nof "2023-07-15 14:30:00" would return 14.\n\n## `minute(timestamp)`\n\nExtracts the minute from a timestamp.\n\n### Parameters\n\n- **timestamp** - Timestamp expression\n\n### Description\n\nThe `minute()` function extracts the minute component from a timestamp and\nreturns it as an integer from 0 to 59. For example, a timestamp of\n"2023-07-15 14:30:00" would return 30.\n\n## `second(timestamp)`\n\nExtracts the second from a timestamp.\n\n### Parameters\n\n- **timestamp** - Timestamp expression\n\n### Description\n\nThe `second()` function extracts the second component from a timestamp and\nreturns it as an integer from 0 to 59. For example, a timestamp of\n"2023-07-15 14:30:45" would return 45.\n\n## `at_timezone(timestamp, timezone)`\n\nConverts a timestamp to a different timezone.\n\n### Parameters\n\n- **timestamp** - Timestamp expression to convert\n- **timezone** - String expression representing the target timezone\n\n### Description\n\nThe `at_timezone()` function converts a timestamp from its current timezone\nto the specified target timezone. The timezone parameter should be a valid\ntimezone identifier such as "UTC", "America/New_York", or "Europe/London".\nThe function returns a new timestamp representing the same moment in time\nbut expressed in the target timezone.\n\n## `to_millis(interval)`\n\nConverts an interval to milliseconds.\n\n### Parameters\n\n- **interval** - Interval expression to convert\n\n### Description\n\nThe `to_millis()` function converts an interval (duration) to its equivalent\nvalue in milliseconds as an integer. This is useful for calculations that\nrequire numeric representations of time durations. For example, an interval\nof "5 minutes" would return 300000 milliseconds.',
492
+ "function-reference/window-functions.md": "# Window Functions\n\nFunctions for analytical operations over data windows that must be used with the `WINDOW` command.\n\n## `row_number()`\n\nReturns a sequential row number for each row within a window partition.\n\n### Parameters\n\nThis function takes no parameters.\n\n### Description\n\nThe `row_number()` function assigns a unique sequential integer to each row\nwithin its window partition, starting from 1. The ordering is determined by\nthe `SORT` clause in the `WINDOW` command. Rows with identical sort values\nreceive different row numbers in an arbitrary but consistent order.\n\n## `rank()`\n\nReturns the rank of each row within a window partition with gaps.\n\n### Parameters\n\nThis function takes no parameters.\n\n### Description\n\nThe `rank()` function assigns a rank to each row within its window partition\nbased on the `SORT` clause ordering. Rows with identical sort values receive\nthe same rank, and subsequent ranks are skipped. For example, if two rows tie\nfor rank 2, the next row receives rank 4 (not rank 3).\n\n## `dense_rank()`\n\nReturns the rank of each row within a window partition without gaps.\n\n### Parameters\n\nThis function takes no parameters.\n\n### Description\n\nThe `dense_rank()` function assigns a rank to each row within its window\npartition based on the `SORT` clause ordering. Rows with identical sort values\nreceive the same rank, but subsequent ranks are not skipped. For example, if\ntwo rows tie for rank 2, the next row receives rank 3.\n\n## `lag(expression, offset, ignore_nulls)`\n\nReturns the value of an expression from a previous row within the window.\n\n### Parameters\n\n- **expression** - Expression to evaluate from the previous row\n- **offset** - Integer specifying how many rows back to look\n- **ignore_nulls** - Boolean indicating whether to skip null values (default: true)\n\n### Description\n\nThe `lag()` function retrieves the value of the specified expression from a\nrow that is `offset` positions before the current row within the window\npartition. When `ignore_nulls` is true, null values are skipped when counting\nthe offset. If there is no row at the specified offset, the function returns null.\n\n## `lead(expression, offset, ignore_nulls)`\n\nReturns the value of an expression from a subsequent row within the window.\n\n### Parameters\n\n- **expression** - Expression to evaluate from the subsequent row\n- **offset** - Integer specifying how many rows ahead to look\n- **ignore_nulls** - Boolean indicating whether to skip null values (default: true)\n\n### Description\n\nThe `lead()` function retrieves the value of the specified expression from a\nrow that is `offset` positions after the current row within the window\npartition. When `ignore_nulls` is true, null values are skipped when counting\nthe offset. If there is no row at the specified offset, the function returns null.\n\n## `first_value(expression, ignore_nulls)`\n\nReturns the first value of an expression within the window frame.\n\n### Parameters\n\n- **expression** - Expression to evaluate\n- **ignore_nulls** - Boolean indicating whether to skip null values (default: true)\n\n### Description\n\nThe `first_value()` function returns the value of the specified expression from\nthe first row in the current window frame. When `ignore_nulls` is true, it\nreturns the first non-null value. 
The window frame is determined by the\n`WITHIN` clause in the `WINDOW` command.\n\n## `last_value(expression, ignore_nulls)`\n\nReturns the last value of an expression within the window frame.\n\n### Parameters\n\n- **expression** - Expression to evaluate\n- **ignore_nulls** - Boolean indicating whether to skip null values (default: true)\n\n### Description\n\nThe `last_value()` function returns the value of the specified expression from\nthe last row in the current window frame. When `ignore_nulls` is true, it\nreturns the last non-null value. The window frame is determined by the\n`WITHIN` clause in the `WINDOW` command.\n\n## `nth_value(expression, n, ignore_nulls)`\n\nReturns the nth value of an expression within the window frame.\n\n### Parameters\n\n- **expression** - Expression to evaluate\n- **n** - Integer specifying which value to return (1-based)\n- **ignore_nulls** - Boolean indicating whether to skip null values (default: true)\n\n### Description\n\nThe `nth_value()` function returns the value of the specified expression from\nthe nth row in the current window frame. When `ignore_nulls` is true, null\nvalues are not counted in the position. If there is no nth row, the function\nreturns null. The position is 1-based, where 1 represents the first row.\n\n## `cume_dist()`\n\nReturns the cumulative distribution of each row within the window partition.\n\n### Parameters\n\nThis function takes no parameters.\n\n### Description\n\nThe `cume_dist()` function calculates the cumulative distribution of each row\nwithin its window partition. The result is the number of rows with values less\nthan or equal to the current row's value, divided by the total number of rows\nin the partition. Values range from 0 to 1.\n\n## `percent_rank()`\n\nReturns the percentile rank of each row within the window partition.\n\n### Parameters\n\nThis function takes no parameters.\n\n### Description\n\nThe `percent_rank()` function calculates the percentile rank of each row within\nits window partition. The result is calculated as (rank - 1) / (total rows - 1),\nwhere rank is determined by the `SORT` clause ordering. Values range from 0 to 1,\nwith 0 representing the lowest value and 1 representing the highest.",
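As a quick orientation, here is a minimal sketch of several of these functions inside a `WINDOW` command, assuming a `requests` dataset with `service_name` and `response_time` fields (names chosen only for illustration):

```hamelin
FROM requests
| WINDOW
    position = row_number(),         // sequential position within each service
    latency_rank = dense_rank(),     // ties share a rank, no gaps
    previous_latency = lag(response_time, 1, true)
  BY service_name
  SORT response_time DESC
```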
493
+ "introduction.md": "# Introducing Hamelin\n\nHamelin is a **pipe-based query language** for **event analytics** which targets\nthe specific challenges detection engineers face when analyzing security events.\nThe language makes event correlation straightforward, letting you define\npatterns, correlate them across time windows, and match ordered sequences of\nevents.\n\n## Key Features\n\n### \u{1F504} Pipe-Based\n\nYou write queries that read naturally from top to bottom. Each operation\nconnects to the next using the pipe operator `|`. Pipe-based languages let you\nbuild queries incrementally, making them easier to read, write, and test than\napproaches that rely heavily on nested subqueries.\n\n```hamelin\nFROM events\n| WHERE event.action == 'login'\n| WITHIN -1hr\n| SELECT user.email, timestamp\n```\n\n### \u{1F550} Event-Native\n\nHamelin offers shorthand for working with timestamped events. Time intervals are\nwritten as simple expressions that match how you think about time. You can\nreference relative timestamps and truncate them to specific boundaries.\n\n```hamelin\n// Reference relative time\n| WITHIN -15m // events within the last 15 minutes\n| WITHIN -1h // events within the last hour\n| WITHIN -7d // events within the last 7 days\n\n// Truncate timestamps to boundaries\n| SELECT timestamp@h // truncate to hour boundary\n| SELECT timestamp@d // truncate to day boundary\n```\n\n### \u{1FA9F} Sliding Windows\n\nSliding windows move continuously with each event, giving you insights without\ngaps or duplicates. You can aggregate data over these moving time windows to\ndetect patterns as they happen.\n\n```hamelin\nFROM events\n| WHERE event.action == 'login'\n| WINDOW count()\n BY user.id\n WITHIN -15m\n```\n\n### \u{1F3AF} Correlation of Named Subqueries\n\nNamed subqueries let you define specific event patterns and correlate them\nwithin sliding windows. You can drop these patterns into sliding windows and\nwrite correlations around them. Hamelin makes it straightforward to aggregate\nover specific patterns while also aggregating over the entire group of events.\n\n```hamelin\nWITH failed_logins = FROM events\n| WHERE event.action == 'login_failed'\n\nWITH successful_logins = FROM events\n| WHERE event.action == 'login_success'\n\nFROM failed = failed_logins, success = successful_logins\n| WINDOW failures = count(failed),\n successes = count(success),\n total = count(),\n BY user.id\n WITHIN -5m\n| WHERE successes >= 1 && failures / total > 0.2\n```\n\nThis query demonstrates correlating failed and successful login events to detect\nbrute force attacks. Named subqueries define distinct event patterns:\n`failed_logins` filters to login failure events while `successful_logins`\nfilters to login success events. The sliding window aggregates these patterns by\nuser over 5-minute periods, counting failures, successes, and total events. The\nfinal filter identifies users who had at least one successful login where failed\nattempts represent more than 20% of their total login activity within that\nwindow.\n\n### \u{1F50D} Ordered Matching of Named Subqueries\n\nYou can ask Hamelin to match ordered patterns across events. Aggregations over sliding windows work well for many use cases, but others require that you search for specific events followed by other specific events. 
You can do that in Hamelin using regular expression quantifiers applied to named subqueries.\n\n```hamelin\nWITH failed_logins = FROM events\n| WHERE event.action == 'login_failed'\n\nWITH successful_logins = FROM events\n| WHERE event.action == 'login_success'\n\nMATCH failed_logins{10,} successful_logins+\nWHEN last(successful_logins.timestamp) - first(successful_logins.timestamp) < 10m\n```\n\nThis searches for 10 failed logins followed by at least one successful login in\na ten minute period. The sliding window approach might miss attack patterns\nwhere timing and sequence matter, but ordered matching can detect the exact\nprogression of a brute force attack.\n\n### \u{1F517} Event Type Expansion\n\nYou can query across different event types without worrying about schema\ndifferences. Hamelin automatically sets missing fields to `null` when they don't\nexist in a particular event type.\n\n```hamelin\nFROM login_events, logout_events, error_events\n// Filters by user.email when if this field exists in a row.\n// Drops rows where this field does not exist\n// (because NULL does not equal any string).\n| WHERE user.email == 'john@example.com'\n```\n\n### \u{1F5C2}\uFE0F Structured Types\n\nHamelin supports structured types like structs, arrays, and maps to represent\ncomplex data. These types make data modeling more familiar, and reduce the need\nto rely too much on joins in analytic queries.\n\n```hamelin\n// Create struct literals with nested data\nLET login_metadata = {\n ip_address: '192.168.1.100',\n user_agent: 'Mozilla/5.0',\n location: 'San Francisco'\n}\n\n// Access nested fields using dot notation\n| WHERE login_metadata.ip_address != '192.168.1.100'\n\n// Use arrays to store multiple related values\n| LET failed_attempts = [\n {timestamp: '2024-01-15T14:25:00Z', reason: 'invalid_password'},\n {timestamp: '2024-01-15T14:27:00Z', reason: 'account_locked'}\n ]\n\n// Use maps when key data is high cardinality\n// Using structs for this use case creates too many columns.\n| LET host_metrics = map(\n 'web-server-01': {cpu: 85.2, memory: 72.1},\n 'web-server-02': {cpu: 91.7, memory: 68.9},\n 'db-primary-01': {cpu: 67.3, memory: 89.4}\n )\n\n// Look up map values using index notation\n| WHERE host_metrics['web-server-01'].cpu > 80\n```\n\n### \u{1F4E1} Array Broadcasting\n\nHamelin makes working with arrays simpler by offering broadcasting, which helps\nyou distribute operations over each member of an array. It does this when you\napply an operation to an array that makes more sense to be applied to each of\nits members. Broadcasting lets you work with arrays using simple, familiar\nsyntax without asking you to resort to functional programming or inefficient\nunnesting.\n\n```hamelin\n| WHERE any(failed_attempts.reason == 'invalid_password')\n```\n\nThis example demonstrates how the equality operator `==` broadcasts across the\n`reason` field of each element in the `failed_attempts` array. This example\ndemonstrates *two* broadcasts:\n\n * first, the lookup of the `reason` field changes an array-of-struct into an\n array-of-string\n * second, applying equality to the resulting array applies it to each member\n\nHamelin can do this automatically because it is type-aware. It knows that\ncomparing equality between `array(string)` and `string` makes more sense to\nbroadcast: an array can never be equal to a string, but a member of an\n`array(string)` might be.\n\n### \u{1F500} Semi-Structured Types\n\nHamelin lets you parse json into instances of the `variant` type. 
This helps you\nhandle semi-structured data that doesn't fit nicely into fixed schemas. You can\nparse JSON strings, access their fields, and convert them to more structured\ntypes. This makes working with JSON feel fairly native.\n\n```hamelin\n// Parse JSON strings into the variant type\nFROM logs\n| LET event_data = parse_json(raw_json)\n\n// Access nested fields using dot notation\n| WHERE event_data.level AS string == 'ERROR'\n\n// Access json array elements with index notation\n| LET first_tag = event_data.tags[0]\n\n// Cast variant data to structured types when you need type safety.\n// Values that do not match will be null.\n| LET user_info = event_data.user AS {id: int, name: string}\n```\n\n### \u{1F6A8} Excellent Error Messages\n\nHamelin provides clear, helpful error messages. Error messages\npoint directly to the problematic Hamelin code and explain exactly what went\nwrong, rather than showing cryptic messages about generated SQL.\n\nThis matters especially when AI assistants write queries. AI tools need precise\ndescriptions of errors to fix queries and complete tasks. Clear error messages\nlet AI assistants debug queries effectively by giving the context needed to\ncorrect mistakes.\n\n```hamelin\nFROM simba.sysmon_events\n| AGG count() BY host.hostname\n| LET hostname = lower(host.hostname)\n```\n\ngenerates the error\n\n```\nError: problem doing translation\n \u256D\u2500[ :3:24 ]\n \u2502\n 3 \u2502 | LET hostname = lower(host.hostname)\n \u2502 \u2500\u2500\u252C\u2500\n \u2502 \u2570\u2500\u2500\u2500 error while translating\n \u2502\n \u2502 Note: unbound column reference: host\n \u2502\n \u2502 the following entries in the environment are close:\n \u2502 - `host.hostname` (you must actually wrap with ``)\n\u2500\u2500\u2500\u256F\n```\n\nHere, the user has forgotten to escape an identifier that contains a dot character.\n\n```hamelin\nFROM simba.sysmon_events\n| WINDOW count(),\n all(winlog.event_data.events)\n BY host.hostname\n```\n\ngenerates the error\n\n```\nError: problem doing translation\n \u256D\u2500[ :3:10 ]\n \u2502\n 3 \u2502 all(winlog.event_data.events)\n \u2502 \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252C\u2500\u252C\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\n \u2502 \u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 could not find a matching function definition\n \u2502 \u2502\n \u2502 \u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 variant\n \u2502\n \u2502 Note: Attempted all(x=boolean)\n \u2502 - Type mismatch for x: expected boolean, got variant\n \u2502\n \u2502 Attempted all(x=array(boolean))\n \u2502 - Type mismatch for x: expected array(boolean), got variant\n \u2502\n\u2500\u2500\u2500\u256F\n```\n\nHere, the user has forgotten to cast variant to a primitive type so that it can\nbe matched against the function call. (A future version of Hamelin will probably\ncoerce this automatically!)\n",
494
+ "language-basics/aggregation.md": "# AGG: performing ordinary aggregation\n\nThe `AGG` command groups and aggregates datasets to create summary statistics\nand analytical insights. You can analyze user behavior patterns, system\nperformance metrics, or security events by grouping related records together and\napplying mathematical functions to each group.\n\n## AGG syntax\n\nThe `AGG` command follows a simple pattern that groups data and applies aggregation functions to each group:\n\n```hamelin\nAGG result = function(expression), ... BY grouping_expression, ...\n```\n\nWhen you omit the `BY` clause, Hamelin aggregates all records into a single group. This calculates overall dataset statistics and global metrics that span all records, counting all events across the entire dataset without any grouping or partitioning:\n\n```hamelin\nFROM events\n| AGG total_events = count()\n```\n\nWhen you omit explicit column names, Hamelin generates them automatically from\nthe expressions you provide. Learn more about this feature in [Automatic Field\nNames](../smart-features/automatic-field-names.md). This creates columns named\n`count()` and `avg(response_time)` that you can reference using backticks in\nsubsequent commands:\n\n```hamelin\nFROM requests\n| AGG count(), avg(response_time) BY service_name\n```\n\nWhen you omit aggregation functions entirely, you get distinct groups without any calculations. This returns the unique combinations of event_type and user_id without performing any mathematical operations:\n\n```hamelin\nFROM events\n| AGG BY event_type, user_id\n```\n\nYou can also rename columns in the BY clause and use any expression for grouping. This example groups by renamed event_type, truncated timestamp, and extracted email domain, creating clear column names for downstream analysis:\n\n```hamelin\nFROM events\n| AGG\n total_events = count(),\n avg_duration = avg(duration)\n BY event_category = event_type,\n hour_bucket = timestamp@hr,\n user_domain = split(email, '@')[1]\n```\n\n## Simple aggregation examples\n\n### Basic counting\n\nEvent counting groups events by their characteristics and calculates how many events fall into each category. Notice that Hamelin uses `count()` with no arguments, not `count(*)` like SQL. The empty parentheses count all rows in each group, providing a clean syntax for the most common aggregation operation:\n\n```hamelin\nFROM events\n| AGG event_count = count() BY event_type\n```\n\n### Multiple aggregations\n\nCalculating several metrics at once in a single `AGG` command ensures all metrics use consistent grouping logic:\n\n```hamelin\nFROM requests\n| AGG\n total_requests = count(),\n avg_response_time = avg(response_time),\n max_response_time = max(response_time),\n error_count = count_if(status_code >= 400)\n BY service_name\n```\n\n### Conditional aggregation\n\nConditional aggregation functions like `count_if()` let you count only rows that meet specific conditions without pre-filtering the dataset. Conditional aggregation maintains the full context of each group while applying different filters to different calculations:\n\n```hamelin\nFROM auth_logs\n| AGG\n failures = count_if(outcome == 'FAILURE'),\n successes = count_if(outcome == 'SUCCESS')\n BY user_name\n```\n\n## Time series aggregations\n\nTime series aggregations combine time truncation with grouping to create time-based buckets for temporal analysis. 
Time-based grouping creates time-bucketed summaries for monitoring system performance, tracking business metrics, and understanding user behavior patterns across different time scales.\n\n### Hourly summaries\n\nHourly aggregations provide detailed views of system activity and user behavior throughout the day:\n\n```hamelin\nFROM logs\n| AGG\n    hourly_events = count(),\n    avg_response = avg(response_time),\n    error_rate = count_if(status >= 400) / count()\n  BY timestamp@hr\n| SORT timestamp@hr\n```\n\n### Daily trends\n\nDaily aggregations reveal longer-term trends and enable comparison across different time periods:\n\n```hamelin\nFROM events\n| WITHIN -30d..now()\n| AGG\n    daily_events = count(),\n    unique_users = count_distinct(user_name),\n    high_severity = count_if(severity == 'HIGH')\n  BY timestamp@d\n| SORT timestamp@d DESC\n```\n",
495
+ "language-basics/five-core-commands.md": "# Five core commands\n\nHamelin uses five core commands to handle basic data operations: `FROM`,\n`WHERE`, `LET`, `SELECT`, and `DROP`. Each command operates narrowly and serves\na specific purpose. You combine these commands using the pipe character `|`.\nThese core commands serve the same purpose as SQL clauses, but you can use them\nin any order, with each command feeding its output to the next.\n\n## Command reference\n\n### `FROM` - Access the rows of a dataset\n\nThe `FROM` command pulls rows from one or more datasets. You start most queries with this command to specify the data sources.\n\nPull rows from the events dataset:\n\n```hamelin\nFROM events\n```\n\nPull rows from both the users and orders datasets:\n\n```hamelin\nFROM users, orders\n```\n\nThis operation translates to a SQL `UNION ALL`, not a join. It pulls all rows from both sources without performing any filtering or row alignment.\n\nWhen you query multiple sources, Hamelin automatically expands types to accommodate all fields from both datasets. Fields with the same name get aligned, while unique fields are added with `NULL` values for rows that don't contain them. Learn more about how this works in [Type Expansion](../smart-features/type-expansion.md).\n\nYou can organize results from multiple datasets by grouping fields into separate\nsections. This lets you keep track of which data comes from which source:\n\n```hamelin\nFROM\n allows = events.access_allowed,\n denys = events.access_denied\n```\n\nThe `allows` field becomes a struct containing all fields from\n`events.access_allowed` (set to null for rows from `events.access_denied`). The\n`denys` field works the same way for `events.access_denied`. All other\nfields are aligned normally. This makes it easy to both reference a field's\nspecific lineage and to ignore lineage when you don't need it.\n\n\n### `WHERE` - Filter rows\n\nThe `WHERE` command filters rows based on conditions you specify. You can stack\nmultiple `WHERE` commands to apply multiple filters.\n\nOnly query rows whose action is 'login':\n\n```hamelin\nFROM events\n| WHERE event.action == 'login'\n```\n\nThis has the same effect as using `AND` between each condition:\n\n```hamelin\nFROM users\n| WHERE user.role == 'admin'\n| WHERE status == 'active'\n```\n\n### `LET` - Add or modify columns\n\nThe `LET` command adds new columns or modifies existing ones without affecting\nother columns. 
This lets you create calculated fields and enrich datasets as\nyou build queries.\n\nCreate a full name by concatenating first and last names:\n\n```hamelin\nFROM users\n| LET full_name = user.first_name + ' ' + user.last_name\n```\n\nCalculate how many days ago an event occurred:\n\n```hamelin\nFROM events\n| LET days_ago = (now() - timestamp) / 1d\n```\n\nAdd a severity level based on the event action:\n\n```hamelin\nFROM events\n| LET severity = if(event.action == 'login_failed', 'high', 'low')\n```\n\nYou can set nested fields within existing structs to modify specific parts of records:\n\n```hamelin\nFROM events\n| LET user.display_name = user.first_name + ' ' + user.last_name\n```\n\nYou can also create entirely new nested structs by setting multiple nested fields:\n\n```hamelin\nFROM events\n| LET metadata.processed_at = now()\n| LET metadata.version = '2.1'\n```\n\nWhen creating new structs, using a struct literal is often more readable:\n\n```hamelin\nFROM events\n| LET metadata = {\n processed_at: now(),\n version: '2.1'\n }\n```\n\n### `SELECT` - Define output columns\n\nThe `SELECT` command completely redefines what columns appear in the results.\nThis replaces all existing columns with only the ones you specify.\n\nSelect only the user's email, timestamp, and event action from the events dataset:\n\n```hamelin\nFROM events\n| SELECT user.email, timestamp, event.action\n```\n\nSelect existing columns and add new computed columns with custom names:\n\n```hamelin\nFROM metrics\n| SELECT count, avg_time, category = 'security'\n```\n\nCreate new columns using expressions and conditional logic:\n\n```hamelin\nFROM events\n| SELECT user_id, severity = if(error_count > 10, 'high', 'low')\n```\n\nTransform existing columns while selecting them:\n\n```hamelin\nFROM logs\n| SELECT timestamp, message, log_level = upper(level)\n```\n\nWhen you don't provide explicit field names in SELECT, Hamelin automatically\ngenerates sensible names based on your expressions. This reduces the need to\nthink of names for simple calculations. Learn more about this in [Automatic Field Names](../smart-features/automatic-field-names.md).\n\n### `DROP` - Remove columns\n\nThe `DROP` command removes specific columns from the results. This is useful\nwhen you want to exclude sensitive data or reduce the size of the output.\n\nRemove unwanted columns from the dataset:\n\n```hamelin\nFROM events\n| DROP internal_id, debug_info\n```\n\n## Basic example\n\nThis example shows how you can combine the core commands to build a common query.\n\n```hamelin\nFROM events\n| WHERE event.action != null\n| LET days_ago = (now() - timestamp) / 1d\n| DROP debug_data, internal_flags\n```\n\nThis example demonstrates how the core commands work together in a typical\nworkflow. You start by pulling rows from the events dataset, filter out records\nwith missing action data, add a calculated field for how many days ago each\nevent occurred, and remove unwanted debug column\n\n```hamelin\nFROM events\n| WHERE event.action != null\n| LET days_ago = (now() - timestamp) / 1d\n| SELECT event.action, timestamp, days_ago\n```\n\nThis example shows a similar pattern. Rather than dropping specific columns, you\nselect only the ones you want to keep in the final output.\n",
496
+ "language-basics/join-combining-datasets.md": '# `JOIN` and `LOOKUP`: Combining datasets\n\nIn event analytics, event records are often narrow and require enrichment with\nadditional context. You use the `JOIN` and `LOOKUP` commands to do this\nenrichment. Hamelin gives you most of the power of SQL joins - you get inner\njoin behavior (with `JOIN`) and left outer join behavior (with `LOOKUP`).\n\n## Basic syntax\n\nJoin your main dataset with additional data by specifying a matching condition. The `ON` clause defines how records from both datasets should be linked together:\n\n```hamelin\n| JOIN other_dataset ON field_name == other_dataset.field_name\n```\n\nUse `LOOKUP` when you want to keep all your original records, even if some don\'t have matches in the second dataset:\n\n```hamelin\n| LOOKUP other_dataset ON field_name == other_dataset.field_name\n```\n\n## Nested results\n\nWhen you join datasets, Hamelin nests the joined data as a struct to prevent field name collisions. This structure keeps your original fields separate from the joined fields, making it clear which data came from which dataset.\n\nCombine user records with order data:\n\n```hamelin\nFROM users\n| WHERE timestamp > "2024-01-01"\n| JOIN orders ON user_id == orders.user_id\n```\n\nThis creates records where each user has an `orders` struct containing all the matched order information. Your original user fields remain at the top level, while order details are nested inside the `orders` structure.\n\n## Custom naming\n\nYou can control the name of the nested struct using assignment syntax. This makes your queries more readable when the default dataset name isn\'t descriptive:\n\n```hamelin\nFROM users\n| WHERE timestamp > "2024-01-01"\n| JOIN purchase_history = orders ON user_id == orders.user_id\n```\n\nNow the joined data appears under the more descriptive name `purchase_history` instead of the generic `orders` name.\n\n## Accessing joined fields\n\nYou access fields from the joined struct using dot notation. This lets you filter, select, or manipulate the joined data just like any other nested structure:\n\n```hamelin\nFROM users\n| WHERE timestamp > "2024-01-01"\n| JOIN orders ON user_id == orders.user_id\n| WHERE orders.total > 100\n```\n\nThis query finds users who have at least one order with a total over 100, demonstrating how you can filter on joined data.\n\n## Multiple joins\n\nYou can chain multiple `JOIN` operations to combine data from several datasets. Each join creates its own nested struct, letting you pull related information from multiple sources:\n\n```hamelin\nFROM transactions\n| WHERE amount > 1000\n| JOIN user_details = users ON user_id == users.id\n| JOIN account_info = accounts ON account_id == accounts.id\n| WHERE user_details.risk_score > 0.8\n```\n\nThis creates records where each transaction has both `user_details` and `account_info` structs, giving you access to related data from multiple datasets in a single query.\n\n## `JOIN` vs `LOOKUP`: Required vs optional matches\n\nThe key difference between `JOIN` and `LOOKUP` is how they handle missing matches. This choice determines whether you get only complete records or keep all your original data with optional enrichment.\n\n### `JOIN` requires matches\n\nWhen you use `JOIN`, only rows that have a match in both datasets appear in your results. 
Rows without matches get filtered out completely, giving you a dataset that only contains records with complete information.\n\nGet users who have placed orders:\n\n```hamelin\nFROM users\n| JOIN orders ON user_id == orders.user_id\n```\n\nThis returns only users who have actually placed orders. Users without any orders are excluded from the results entirely.\n\n### `LOOKUP` keeps all rows\n\nWhen you use `LOOKUP`, all rows from your main dataset stay in the results, regardless of whether they have matches. For rows without matches, the nested struct gets set to `null`, preserving your complete dataset while adding optional enrichment.\n\nGet all users and include their order information when available:\n\n```hamelin\nFROM users\n| LOOKUP orders ON user_id == orders.user_id\n```\n\nThis returns every user from your dataset. Users with orders get an `orders` struct containing their order data, while users without orders still appear with their `orders` field set to null.\n\n### When to use each\n\nUse `JOIN` when you only want records that have complete information from both datasets. Use `LOOKUP` when you want to preserve your entire main dataset and optionally enrich it with additional data that might not exist for every record.\n',
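One pattern that follows from this difference: because `LOOKUP` sets the nested struct to `null` when there is no match, you can filter on that null to find unmatched records. A small sketch of the idea, reusing the `users` and `orders` example above (the struct-to-null comparison is an assumption based on the null-handling described here):

```hamelin
// Users who have never placed an order
FROM users
| LOOKUP orders ON user_id == orders.user_id
| WHERE orders == null
```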
497
+ "language-basics/match-ordered-pattern-matching.md": "# MATCH: ordered pattern matching\n\nThe `MATCH` command finds specific sequences of events in your data. Pattern matching detects sequences like \"10 failed logins followed by a successful login\" or \"error events followed by restart events within 5 minutes.\" Unlike windowed aggregations, pattern matching requires that event patterns happen in a specific sequence.\n\n## Basic pattern matching\n\nThe `MATCH` command searches for ordered sequences using named subqueries and regular expression-style quantifiers. You define what events you're looking for, then specify the pattern and timing constraints. Create named subqueries for the events you want to match, then specify the sequence pattern. This example finds sequences where 5 or more failed logins are followed by at least one successful login:\n\n```hamelin\nWITH failed_logins =\n FROM security_logs\n | WHERE action == 'login_failed'\n\nWITH successful_logins =\n FROM security_logs\n | WHERE action == 'login_success'\n\nMATCH failed_logins{5,} successful_logins+\n```\n\n\n\n## Pattern quantifiers\n\nQuantifiers control how many of each event type to match. These work like regular expression quantifiers but apply to your named event datasets.\n\n### Exact counts\n\nSpecify exact numbers of events to match. This example finds exactly 3 error events followed by exactly 1 restart event:\n\n```hamelin\nWITH errors =\n FROM system_logs\n | WHERE level == 'ERROR'\n\nWITH restarts =\n FROM system_logs\n | WHERE action == 'service_restart'\n\nMATCH errors{3} restarts{1}\n```\n\n\n\n### Range quantifiers\n\nUse ranges to specify minimum and maximum counts. This example finds between 2 and 5 high-severity alerts followed by 1 or 2 acknowledgment events:\n\n```hamelin\nWITH alerts =\n FROM monitoring\n | WHERE severity == 'HIGH'\n\nWITH acknowledgments =\n FROM monitoring\n | WHERE action == 'acknowledge'\n\nMATCH alerts{2,5} acknowledgments{1,2}\n```\n\n\n\n### Open-ended quantifiers\n\nUse `+` for \"one or more\" and `*` for \"zero or more\". This example finds one or more failed requests followed by one or more successful requests:\n\n```hamelin\nWITH failed_requests =\n FROM api_logs\n | WHERE status_code >= 500\n\nWITH success_requests =\n FROM api_logs\n | WHERE status_code < 400\n\nMATCH failed_requests+ success_requests+\n```\n\n\n\n## Time constraints with WHEN\n\nThe `WHEN` clause adds timing constraints to patterns. Time constraints ensure that matched sequences happen within specific time windows or meet other conditions.\n\n### Time window constraints\n\nRequire that patterns complete within a specific time period. This example finds 10 or more failed logins followed by successful logins, but only when the entire sequence happens within 10 minutes:\n\n```hamelin\nWITH failed_logins =\n FROM security_logs\n | WHERE action == 'login_failed'\n\nWITH successful_logins =\n FROM security_logs\n | WHERE action == 'login_success'\n\nMATCH failed_logins{10,} successful_logins+\nWHEN last(successful_logins.timestamp) - first(failed_logins.timestamp) < 10min\n```\n\n\n\n### Using first() and last() functions\n\nThe `first()` and `last()` functions access the earliest and latest events in each matched group. 
This example finds CPU spikes followed by memory alerts, requiring that the CPU spikes started within the last hour and the entire pattern completes within 15 minutes:\n\n```hamelin\nWITH cpu_spikes =\n FROM metrics\n | WHERE cpu_usage > 90\n\nWITH memory_alerts =\n FROM metrics\n | WHERE memory_usage > 85\n\nMATCH cpu_spikes{3,} memory_alerts+\nWHEN first(cpu_spikes.timestamp) > now() - 1hr\n AND last(memory_alerts.timestamp) - first(cpu_spikes.timestamp) < 15min\n```\n\n\n\n## Complex pattern examples\n\n### Security incident detection\n\nLook for suspicious login patterns that might indicate a brute force attack. This example detects external brute force attempts followed by successful logins and optional privilege escalations, all within 30 minutes:\n\n```hamelin\nWITH failed_logins =\n FROM auth_logs\n | WHERE outcome == 'FAILURE'\n | WHERE source_ip NOT IN ('10.0.0.0/8', '192.168.0.0/16')\n\nWITH successful_logins =\n FROM auth_logs\n | WHERE outcome == 'SUCCESS'\n\nWITH privilege_escalations =\n FROM audit_logs\n | WHERE action == 'privilege_escalation'\n\nMATCH failed_logins{5,} successful_logins{1,3} privilege_escalations*\nWHEN last(privilege_escalations.timestamp) - first(failed_logins.timestamp) < 30min\n```\n\n\n\n## When to use MATCH vs WINDOW\n\nThe key difference is that `WINDOW` performs unordered correlation while `MATCH` performs ordered correlation.\n\nWhen you pull multiple event patterns into a sliding window, you can aggregate each individual pattern or aggregate across all the patterns together. However, you cannot require that certain subpatterns happen before others\u2014the window treats all events within the time frame as unordered.\n\n`MATCH` specifies that certain events must happen before others in a specific sequence. Ordered correlation matters when the timing and sequence of events affects your analysis.\n\n**Use MATCH when order matters:**\n- Security attack sequences (failed logins \u2192 successful login \u2192 privilege escalation)\n- System failure cascades (errors \u2192 timeouts \u2192 circuit breaker trips)\n- User workflow analysis (page view \u2192 form submission \u2192 purchase)\n- Compliance violations (access \u2192 modification \u2192 deletion)\n",
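To make the contrast concrete, here is a sketch of both approaches applied to the `failed_logins` and `successful_logins` subqueries defined earlier on this page, shown as two alternative query tails (the `user_id` field and the thresholds are illustrative):

```hamelin
// Unordered correlation: WINDOW counts both patterns regardless of sequence
FROM failed = failed_logins, success = successful_logins
| WINDOW failures = count(failed), successes = count(success)
  BY user_id
  WITHIN -10min
| WHERE failures >= 5 && successes >= 1

// Ordered correlation: MATCH requires the failures to precede the success
MATCH failed_logins{5,} successful_logins+
WHEN last(successful_logins.timestamp) - first(failed_logins.timestamp) < 10min
```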
498
+ "language-basics/sort-limit-top-n.md": "# `SORT` and `LIMIT`: Doing top-n\n\nYou use the `SORT` command to order your data, and the `LIMIT` command to take only the first n rows from your results. Together, these commands let you find top performers, recent events, highest values, or any other ranking-based analysis.\n\nEach command is also useful on its own. `SORT` helps you understand data patterns by revealing ordering and outliers. You might sort transaction amounts to see the distribution of values, or sort timestamps to understand event sequences. `LIMIT` is valuable for exploring large datasets by giving you manageable samples. You can take the first 100 rows to understand data structure before writing more complex queries, or limit results to avoid overwhelming outputs during development.\n\n## Basic syntax\n\nSort your data by specifying the field you want to order by. Add `DESC` for descending order (highest to lowest) or leave it blank for ascending order (lowest to highest):\n\n```hamelin\n| SORT field_name DESC\n```\n\nLimit your results to a specific number of rows using the `LIMIT` command:\n\n```hamelin\n| LIMIT 10\n```\n\n## Simple sorting\n\nOrder your data by a single field to see patterns and outliers. This is useful for finding the most recent events, highest values, or alphabetical arrangements.\n\nSort login events by timestamp to see them in chronological order:\n\n```hamelin\nFROM security_logs\n| WHERE action == 'login'\n| SORT timestamp DESC\n```\n\nThis query gets login events and sorts them by timestamp in descending order (newest first), letting you see the full sequence of login activity.\n\n## Multiple sort fields\n\nYou can sort by multiple fields to create more sophisticated ordering. List the fields in order of priority, with the most important sort field first:\n\n```hamelin\nFROM transactions\n| SORT amount DESC, timestamp DESC\n```\n\nThis sorts transactions first by amount (highest first), then by timestamp (newest first) for transactions with the same amount. This ordering reveals value patterns across all transactions, with ties broken by recency.\n\n## Top-n analysis\n\nThe combination of `SORT` and `LIMIT` creates powerful top-n analysis patterns. This lets you answer questions like \"who are my top customers\" or \"what are the most common errors\" with simple, readable queries.\n\nFind the top 5 users by transaction volume:\n\n```hamelin\nFROM transactions\n| AGG total_amount = sum(amount) BY user_id\n| SORT total_amount DESC\n| LIMIT 5\n```\n\nThis aggregates transaction amounts by user, sorts by the total in descending order, and takes the top 5 results. The pattern works for any ranking scenario where you need to identify leaders or outliers.\n\nNote: This example uses the `AGG` command which we haven't covered yet. You can learn more about aggregation in [Aggregation](aggregation.md).\n\n## Sorting with expressions\n\nYou can sort by calculated values without adding them as permanent fields. This is useful when you want to order by a computation but don't need that computation in your final results:\n\n```hamelin\nFROM events\n| SORT (now() - timestamp) / 1hr\n| LIMIT 20\n```\n\nThis sorts events by how many hours ago they occurred, giving you the most recent events first. 
The calculation happens during sorting but doesn't create a new field in your results.\n\n## Complex sorting scenarios\n\nFor more advanced sorting, you can combine multiple fields, expressions, and directions to create exactly the ordering you need.\n\nFind the most problematic users by recent failed login attempts:\n\n```hamelin\nFROM security_logs\n| WHERE action == 'login_failed' AND timestamp > now() - 24hr\n| AGG failure_count = count(), latest_failure = max(timestamp) BY user_id\n| SORT failure_count DESC, latest_failure DESC\n| LIMIT 10\n```\n\nThis query identifies users with the most failed login attempts in the last 24 hours, sorted first by failure count (most failures first), then by recency of their latest failure. This creates a prioritized list for security investigation.\n\n## Performance considerations\n\nSorting large datasets can be expensive, especially when sorting by multiple fields or complex expressions. When possible, apply filters with `WHERE` before sorting to reduce the amount of data that needs to be ordered:\n\n```hamelin\nFROM events\n| WHERE timestamp > now() - 7d -- Filter first\n| SORT severity DESC, timestamp DESC\n| LIMIT 50\n```\n\nThis pattern filters to recent events before sorting, which is more efficient than sorting all events and then filtering.\n",
499
+ "language-basics/time.md": "# Time\n\nTime is central to event analytics. In Hamelin, you write time the way you think\nabout it\u2014`1hr`, `30min`, or `yesterday`. The language supports several ways to work\nwith time: intervals for durations, absolute timestamps for specific moments,\ntime truncation for grouping, and ranges for time spans.\n\n## Time intervals\n\nYou use time intervals to express durations\u2014how long something takes or how far back to look in your data.\n\n### Basic interval syntax\n\nCreate time intervals by writing a number directly followed by a time unit, with no spaces. Use these anywhere you need to specify how long something takes or how far back to look:\n\n```hamelin\n# Time intervals - number + unit\n1sec # 1 second\n30sec # 30 seconds\n1min # 1 minute\n15min # 15 minutes\n1hr # 1 hour\n2hr # 2 hours\n1d # 1 day\n7d # 7 days\n1w # 1 week\n2w # 2 weeks\n1mon # 1 month\n6mon # 6 months\n1yr # 1 year\n```\n\n### Supported time units\n\n| Unit | Abbreviations | Examples |\n|------|---------------|----------|\n| **Seconds** | `s`, `sec`, `secs`, `second`, `seconds` | `30s`, `45sec` |\n| **Minutes** | `m`, `min`, `mins`, `minute`, `minutes` | `5m`, `15min` |\n| **Hours** | `h`, `hr`, `hrs`, `hour`, `hours` | `1h`, `2hr` |\n| **Days** | `d`, `day`, `days` | `1d`, `30days` |\n| **Weeks** | `w`, `week`, `weeks` | `1w`, `2weeks` |\n| **Months** | `mon`, `month`, `months` | `1mon`, `3months` |\n| **Years** | `y`, `yr`, `yrs`, `year`, `years` | `1y`, `2yrs` |\n\n### Using intervals in variables\n\nStore commonly used time intervals in variables to make your queries more readable and maintainable:\n\n```hamelin\n# Store intervals in variables for reuse\nWITH time_constants =\n LET short_window = 5min,\n daily_retention = 30d,\n investigation_period = 2hr,\n alert_threshold = 500ms\n```\n\n## Absolute timestamps\n\nYou can reference specific moments in time using absolute timestamps. This is useful when you know the exact time of an incident or need to analyze data from a specific date.\n\n### ISO 8601 format\n\nCreate absolute timestamps using the `ts()` function with ISO 8601 formatted strings. This format works with or without timezone information:\n\n```hamelin\n# Absolute timestamps using ISO 8601 format\nWITH timestamps =\n LET incident_start = ts('2024-01-15T14:30:00'),\n maintenance_window = ts('2024-01-15T02:00:00Z'),\n deployment_time = ts('2024-01-15T09:15:30.123Z')\n```\n\n### Current time\n\nGet the current timestamp using the `now()` function. This captures the exact moment when your query starts running:\n\n```hamelin\n# Get the current timestamp\nWITH current_times =\n LET right_now = now(),\n query_start_time = now()\n```\n\n## Time truncation with `@`\n\nThe `@` operator snaps timestamps to time boundaries. You can truncate any timestamp to the start of its hour, day, week, or other time period. 
This makes it straightforward to group events into time buckets for analysis.\n\n### Truncation syntax\n\nApply the `@` operator to any timestamp to round it down to the nearest time boundary:\n\n```hamelin\n# Truncate current time to various boundaries\nnow()@d # Today at midnight (00:00:00)\nnow()@hr # This hour at :00 minutes\nnow()@min # This minute at :00 seconds\nnow()@w # This week's Monday at midnight\nnow()@mon # First day of this month at midnight\n```\n\n### Available truncation units\n\n| Unit | Truncates To | Example Result |\n|------|--------------|----------------|\n| `@s` | Start of second | `2024-01-15T14:30:25.000` |\n| `@min` | Start of minute | `2024-01-15T14:30:00.000` |\n| `@hr` | Start of hour | `2024-01-15T14:00:00.000` |\n| `@d` | Start of day (midnight) | `2024-01-15T00:00:00.000` |\n| `@w` | Start of week (Monday) | `2024-01-15T00:00:00.000` |\n| `@mon` | Start of month | `2024-01-01T00:00:00.000` |\n| `@yr` | Start of year | `2024-01-01T00:00:00.000` |\n\n### Truncation with any timestamp\n\nYou can truncate any timestamp, not just `now()`. Create time buckets from your event data by truncating timestamp fields:\n\n```hamelin\n# Truncate any timestamp, not just now()\nWITH event_data =\n LET event_time = ts('2024-01-15T14:37:22')\n\nFROM event_data\n| LET hour_bucket = event_time@hr // 2024-01-15T14:00:00\n| LET day_bucket = event_time@d // 2024-01-15T00:00:00\n```\n\nYou can also truncate timestamp columns directly from your event datasets to group events by time periods:\n\n```hamelin\nFROM events\n| LET event_day = timestamp@d // Group events by day\n| LET event_hour = timestamp@hr // Group events by hour\n| SELECT user_id, event_day, event_hour, action\n```\n\n## Time ranges\n\nYou combine time values into ranges using the `..` operator. Time ranges let you express time spans like \"between 2 and 4 hours ago\" or \"from this morning onward.\" This makes it natural to filter events within specific time windows.\n\n### The range operator `..`\n\nThe `..` operator creates a span between two time points. You place time values on either side to define the start and end of your range.\n\nCreate a range between 2 hours ago and 1 hour ago:\n\n```hamelin\n-2hr..-1hr\n```\n\nCreate a range from a specific time until now:\n\n```hamelin\nts('2024-01-15T10:00:00')..now()\n```\n\nCreate a range from midnight today until midnight tomorrow:\n\n```hamelin\nnow()@d..(now()@d + 1d)\n```\n\n### Relative time ranges\n\nUse negative numbers to go back in time from \"now\". This pattern covers most security and operational analytics scenarios where you're investigating recent events.\n\nGet events from the last hour:\n\n```hamelin\n-1hr..now()\n```\n\nGet events between 2 and 4 hours ago:\n\n```hamelin\n-4hr..-2hr\n```\n\nGet events from this week so far:\n\n```hamelin\nnow()@w..now()\n```\n\nYou can combine truncation with ranges to create precise time windows aligned to calendar boundaries:\n\n```hamelin\n// From start of today until now\nnow()@d..now()\n\n// Yesterday (full day)\n(now()@d - 1d)..(now()@d)\n\n// Last full hour\n(now()@hr - 1hr)..(now()@hr)\n```\n\n### Unbounded ranges\n\nYou can leave either side of the range empty to create spans that extend infinitely in one direction. 
This is useful for ongoing monitoring or historical analysis without a specific end point.\n\nGet everything before 1 hour ago:\n\n```hamelin\n..-1hr\n```\n\nGet everything from a specific time onward:\n\n```hamelin\nts('2024-01-15T10:00:00')..\n```\n\nGet everything from 2 hours ago onward (includes future events):\n\n```hamelin\n-2hr..\n```\n\n### Bounded vs unbounded ranges\n\nThe choice between bounded and unbounded ranges determines how your queries behave, especially for ongoing monitoring versus historical analysis.\n\n```hamelin\n// Bounded: Only gets events that happened in the past hour\nFROM alerts | WHERE timestamp IN -1hr..now()\n\n// Unbounded: Gets past events AND future events as they arrive\nFROM alerts | WHERE timestamp IN -1hr..\n```\n\nUse **bounded ranges** when analyzing completed time periods. Use **unbounded ranges** when monitoring ongoing events as they happen.\n",
500
+ "language-basics/window-aggregating-over-sliding-windows.md": "# WINDOW: aggregating over sliding windows\n\nThe `WINDOW` command creates aggregations that slide across windows of data.\nThis lets you calculate running totals, moving averages, and time-based metrics\nwithout losing the detail of individual events. Each row gets its own\ncalculation based on a sliding window of related rows around it.\n\n## Window calculations\n\nThe `WINDOW` command supports two main types of calculations that operate on the\nsliding window of data. Aggregation functions like `count()`, `sum()`, `avg()`,\n`min()`, and `max()` calculate summary statistics across all rows in the current\nwindow frame.\n\n```hamelin\nFROM sales\n| WINDOW\n total_sales = sum(amount),\n avg_sale = avg(amount),\n sale_count = count()\n BY region\n WITHIN -7d\n```\n\nThis creates a 7-day rolling summary for each region, showing total sales, average sale amount, and number of sales within the sliding window.\n\nWindow-specific functions like `row_number()`, `rank()`, `dense_rank()`, and `lag()` analyze the position and relationships between rows within the window without aggregating the data.\n\n```hamelin\nFROM events\n| WINDOW\n event_number = row_number(),\n event_rank = rank(),\n previous_value = lag(score, 1)\n BY user_id\n SORT timestamp\n```\n\nThis assigns sequence numbers, ranks events by timestamp order, and shows the previous score value for each user's events.\n\nYou can combine multiple calculations in a single `WINDOW` command, and each calculation receives the same set of rows determined by the window frame, but produces different analytical results based on its specific function behavior:\n\n```hamelin\nFROM metrics\n| WINDOW\n recent_count = count(),\n running_total = sum(value),\n current_rank = row_number(),\n percentile_rank = percent_rank()\n BY service\n SORT timestamp\n WITHIN -1hr\n```\n\nThis example mixes aggregation functions (`count()`, `sum()`) with window-specific functions (`row_number()`, `percent_rank()`) to create comprehensive analytics for each service within a 1-hour sliding window.\n\n\nWhen explicit names aren't provided for window calculations, Hamelin automatically generates field names from expressions. Learn more about this in [Automatic Field Names](../smart-features/automatic-field-names.md).\n\n## WINDOW command parts\n\nThe `WINDOW` command has three optional clauses that control how the sliding\nwindow behaves. Each clause serves a specific purpose in defining which data\ngets included in each calculation.\n\n```hamelin\nWINDOW calculations\n BY grouping_fields // optional: partitions data\n SORT ordering_fields // optional: defines row order\n WITHIN frame_range // optional: defines window size\n```\n\n### BY clause: partitioning data\n\nThe `BY` clause divides data into separate groups, with each group getting its own independent sliding window. This lets you create per-user, per-host, or per-category calculations without mixing data across different entities.\n\n**With BY fields specified:** Hamelin creates separate windows for each unique combination of those fields. This partitioning ensures that calculations for different users, devices, or categories remain completely independent. Each partition maintains its own window state, preventing data from different entities from interfering with each other. 
Here's how to create separate counting windows for each user:\n\n```hamelin\nFROM events\n| WINDOW count()\n BY user_id\n```\n\n**Without BY fields:** Hamelin treats all data as one big group. This creates a single window that processes all events together, regardless of their source or category. The calculation accumulates across every row in the dataset, which proves useful for global metrics or overall trend analysis. This example creates one counting window that includes all events:\n\n```hamelin\nFROM events\n| WINDOW count()\n```\n\n### SORT clause: ordering rows\n\nThe `SORT` clause controls the order of rows within each window partition. This ordering determines which rows come \"before\" and \"after\" each row, affecting functions like `row_number()` and defining the direction of the sliding window.\n\n**With SORT specified:** Hamelin uses the explicit ordering. The sort order determines which rows come before and after each current row in the window calculation. When you want to analyze transactions by value rather than time, you can sort by amount to create value-based rankings and running totals:\n\n```hamelin\nFROM transactions\n| WINDOW\n running_total = sum(amount),\n transaction_rank = rank()\n BY account_id\n SORT amount DESC\n```\n\n**Without SORT specified:** Hamelin automatically orders by event timestamp. This chronological ordering makes sense for most time-series analysis where you want to track how metrics evolve over time. The automatic timestamp ordering eliminates the need to explicitly specify time-based sorting in typical analytical scenarios. This example creates a chronological sequence count for each user:\n\n```hamelin\nFROM events\n| WINDOW event_sequence = count()\n BY user_id\n```\n\n### WITHIN clause: defining the window frame\n\nThe `WITHIN` clause controls how much data gets included in the window around each row.\n\n**With WITHIN specified:** Hamelin uses the explicit frame size. This sliding frame moves with each row, always maintaining the specified time period or row count. When you need to count events within a specific time window, you can specify the exact duration:\n\n```hamelin\nFROM events\n| WINDOW count()\n BY user_id\n WITHIN -1hr\n```\n\nFor each event, this counts all events for that user in the hour leading up to that event's timestamp.\n\n**Without WITHIN specified:** Hamelin uses `..0r` (from the beginning of the partition up to the current row). This default behavior creates cumulative calculations that include all rows from the start of each partition up to the current row. The cumulative approach works well for running totals, progressive counts, and other metrics that should include all historical data. This example creates a running count for each user from their first event:\n\n```hamelin\nFROM events\n| WINDOW cumulative_count = count()\n BY user_id\n```\n\n## Window frames\n\nThe `WITHIN` clause accepts different types of frame specifications that control how much data gets included around each row. Frame specifications determine whether the window slides based on time intervals, specific row counts, or bounded ranges between two points. Understanding these frame types lets you create exactly the sliding behavior you need for different analytical scenarios.\n\n### Value-based frames\n\nIntervals like `-5min` or `-1hr` create sliding windows based on the values in the sorted column. Because the most common sort order is by timestamp, these frames typically create time-based windows that slide through data chronologically. 
The window maintains a consistent value range (usually time duration) regardless of how many events occur within that period. Value-based frames work particularly well for temporal metrics like monitoring system performance or analyzing user activity patterns over fixed time periods.\n\n```hamelin\nFROM metrics\n| WINDOW avg_cpu = avg(cpu_usage)\n BY hostname\n WITHIN -5min\n```\n\nFor each metric record, this calculates the average CPU usage for that host over the 5 minutes leading up to that metric's timestamp.\n\nYou can also use value-based frames with non-timestamp columns when the data is sorted by those values. This example calculates running statistics for orders based on order amounts, looking at orders within a $5 range of each current order:\n\n```hamelin\nFROM orders\n| WINDOW\n nearby_orders = count(),\n avg_nearby_amount = avg(amount)\n BY customer_id\n SORT amount\n WITHIN -5\n```\n\n### Row-based frames\n\nRow counts create windows based on a specific number of surrounding rows rather than time periods. This approach proves valuable when you need consistent sample sizes for statistical calculations or when events occur at irregular intervals. Row-based frames ensure that each calculation includes exactly the specified number of data points, making comparisons more reliable across different time periods. This example counts events and assigns sequence numbers using a 4-row window (current row plus 3 preceding rows):\n\n```hamelin\nFROM events\n| WINDOW\n recent_events = count(),\n event_sequence = row_number()\n BY user_id\n SORT timestamp\n WITHIN -3r\n```\n\n### Range frames\n\nRange frames like `-2hr..-1hr` create windows between two specific offsets from the current row. This capability lets you analyze data from specific time periods without including the current time period in the calculation. Range frames prove particularly useful for lag analysis, where you want to compare current metrics against historical periods, or when you need to exclude recent data that might be incomplete. This example counts events from the hour that ended one hour before each current event:\n\n```hamelin\nFROM events\n| WINDOW previous_hour_count = count()\n BY user_id\n WITHIN -2hr..-1hr\n```\n\nRange frames can be unbounded by omitting one end of the range. An unbounded range like `-2hr..` creates a window that extends infinitely in one direction from a starting point. This technique proves useful when you want all data from a specific threshold forward, such as calculating cumulative metrics that start counting only after a certain time delay. Unbounded ranges help create progressive totals that begin from meaningful starting points rather than the very beginning of the dataset. This example counts all events for each user starting from 2 hours before each current event:\n\n```hamelin\nFROM events\n| WINDOW cumulative_count = count()\n BY user_id\n WITHIN -2hr..\n```\n\n**Warning:** Interval frames like `-1hr` create sliding windows, while unbounded ranges like `-1hr..` include all future rows in the dataset. The unbounded version creates a massive window instead of the sliding window you typically want for temporal analysis.\n",
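A short sketch of the difference described in the warning above (dataset and field names are illustrative):

```hamelin
// Sliding window: for each event, counts that user's events from the
// preceding hour only
FROM events
| WINDOW recent_count = count()
  BY user_id
  WITHIN -1hr

// Unbounded range: for each event, counts that user's events from one hour
// before it through the end of the partition, including later events
FROM events
| WINDOW open_ended_count = count()
  BY user_id
  WITHIN -1hr..
```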
501
+ "language-basics/with-using-named-subqueries.md": "# WITH: using named subqueries\n\nYou use the `WITH` keyword to create named datasets that you can reference later in your query. This lets you break complex analysis into manageable steps, reuse filtered data multiple times, and make your queries more readable and maintainable.\n\n## Creating named datasets\n\nThe `WITH` command assigns a name to any dataset, letting you reference it throughout your query. This is essential when you need to combine data from multiple sources or apply different filters to the same base data:\n\n```hamelin\nWITH error_events =\n FROM events\n | WHERE status = 'ERROR'\n\nWITH service_counts =\n FROM logs\n | AGG event_count = count()\n BY service\n\nFROM error_events\n| JOIN service_counts ON service == service_counts.service\n```\n\n## Common patterns\n\nHere are three scenarios where named subqueries make complex analysis straightforward and readable.\n\n### Combining filtered event types\n\nYou can combine different types of events after applying specific filters to each type. This lets you analyze related events together while maintaining distinct filtering criteria:\n\n```hamelin\nWITH login_events =\n FROM security_logs\n | WHERE event_type = 'user_login'\n | WITHIN -1hr\n\nWITH logout_events =\n FROM security_logs\n | WHERE event_type = 'user_logout'\n | WITHIN -1hr\n\nFROM login_events, logout_events\n| AGG total_auth_events = count()\n```\n\nThis creates separate filtered datasets for logins and logouts, then combines\nthem to get a total count of authentication events from the past hour.\n\n### Security event correlation\n\nYou can analyze relationships between different security events by creating named datasets for each event type and then using window functions to find patterns:\n\n```hamelin\nWITH login_events =\n FROM security_logs\n | WHERE event_type = 'user_login'\n | WITHIN -1hr\n\nWITH priv_escalation =\n FROM security_logs\n | WHERE event_type = 'privilege_escalation'\n | WITHIN -1hr\n\nFROM login = login_events, priv = priv_escalation\n| WINDOW count(login), count(priv)\n WITHIN 5min\n BY user_id\n\n// Focus on windows where new events occur to avoid duplicate alerts\n| WHERE `count(login)` == 1 AND `count(priv)` >= 1 OR\n `count(login)` >= 1 AND `count(priv)` == 1\n```\n\nThis creates separate datasets for login and privilege escalation events, then uses a sliding window to find 5-minute periods where both types occur for the same user.\n\n### Enrichment with filtered lookup data\n\nYou can enrich your main dataset with information from lookup tables that have been pre-filtered to include only relevant records:\n\n```hamelin\nWITH active_departments =\n FROM departments\n | WHERE status = 'active' AND budget > 100000\n\nWITH user_activity =\n FROM user_events\n | WITHIN -24hr\n | AGG session_count = count() BY user_id\n\nFROM user_activity\n| JOIN users ON user_id == users.user_id\n| JOIN active_departments ON department_id == active_departments.department_id\n| WHERE session_count > 10\n```\n\nThis pattern filters the department lookup table first to include only active, well-funded departments, then enriches user activity data with this filtered information. This approach is more efficient than filtering departments after the join.\n",
502
+ "language-basics/within-filtering-by-timestamp.md": "# WITHIN: filtering by timestamp\n\nYou use the `WITHIN` command to filter events by their timestamp. This command automatically finds the primary timestamp field in your dataset and filters events to match your specified time criteria, making time-based filtering straightforward and readable.\n\n## Basic syntax\n\nThe `WITHIN` command filters your dataset to include only events that occurred within your specified time window. Hamelin automatically identifies the timestamp field in your data, so you don't need to specify column names.\n\nPull events that happened in the last hour:\n\n```hamelin\nFROM events | WITHIN -1hr\n```\n\nGet events from a specific time window in the past:\n\n```hamelin\nFROM security_events | WITHIN -4hr..-2hr\n```\n\nAnalyze only today's events using time truncation:\n\n```hamelin\nFROM system_logs | WITHIN now()@d..now()\n```\n\nLook at yesterday's complete activity:\n\n```hamelin\nFROM audit_logs | WITHIN (now()@d - 1d)..(now()@d)\n```\n\n## Time intervals vs time ranges\n\nYou can use `WITHIN` with either time intervals or time ranges.\n\n**Time intervals** get automatically converted to ranges that start or end at \"now\". This makes them perfect for recent event analysis:\n\n```hamelin\n# Negative intervals go backward from now\nFROM alerts | WITHIN -1hr # Last hour: -1hr..now()\n\n# Positive intervals go forward from now\nFROM alerts | WITHIN 2hr # Next 2 hours: now()..2hr\n```\n\n**Time ranges** work exactly as you define them, giving you precise control over both start and end points:\n\n```hamelin\n# Bounded range: specific start and end\nFROM events | WITHIN -4hr..-2hr\n\n# Unbounded range: from start onward\nFROM alerts | WITHIN -1hr..\n\n# Range ending at now\nFROM events | WITHIN -2hr..now()\n```\n\n### Backward vs forward time intervals\n\n**Negative intervals** look backward from now to capture recent events:\n\n```hamelin\nFROM events | WITHIN -30min # Events from 30 minutes ago until now\nFROM events | WITHIN -1d # Events from 1 day ago until now\n```\n\n**Positive intervals** look forward from now, which is useful for scheduled events or planned activities:\n\n```hamelin\nFROM scheduled_tasks | WITHIN 2hr # Tasks scheduled for next 2 hours\nFROM alerts | WITHIN 15min # Alerts expected in next 15 minutes\n```\n",
503
+ "smart-features/array-broadcasting.md": "# Array Broadcasting\n\nHamelin automatically broadcasts operations across arrays, letting you use\nfamiliar syntax without explicit iteration. When you write `array.field`,\nHamelin extracts that field from every element in the array. When you write\n`func(array)`, the function applies to each array element individually.\n\n## How broadcasting works\n\nBroadcasting works for any operation or function call, for example arithmetic,\ncomparisons, function calls, and field access (dereference). Instead of needing\nspecial array functions, you use familiar syntax and Hamelin handles the\niteration.\n\n## Examples\n\nLet's work through some examples assuming that `failed_attempts` has the following data:\n\n```hamelin\nLET failed_attempts = [\n {timestamp: '2024-01-15T14:25:00Z', reason: 'invalid_password', ip: '192.168.1.100'},\n {timestamp: '2024-01-15T14:26:15Z', reason: 'account_locked', ip: '192.168.1.101'},\n {timestamp: '2024-01-15T14:27:30Z', reason: 'invalid_password', ip: '192.168.1.102'}\n]\n```\n\n### Field access broadcasting\n\nAccess fields across all array elements using dot notation:\n\n```hamelin\n// Extract all unique reasons from the failed attempts array\n| LET all_reasons = array_distinct(failed_attempts.reason)\n// Result: ['invalid_password', 'account_locked']\n\n// Extract all IP addresses\n| LET all_ips = failed_attempts.ip\n// Result: ['192.168.1.100', '192.168.1.101', '192.168.1.102']\n```\n\nThe operation applies to each element in the array, extracting the specified field from every object.\n\n### Comparison broadcasting\n\nApply comparison operations across array elements:\n\n```hamelin\n// Check which attempts were due to invalid passwords\n| LET password_failures = failed_attempts.reason == 'invalid_password'\n// Result: [true, false, true]\n\n// Find attempts from suspicious IP ranges\n| LET suspicious_ips = failed_attempts.ip > '192.168.1.100'\n// Result: [false, true, true]\n```\n\nEach comparison operation returns an array of boolean values, maintaining the same structure as the original array.\n\n### Function call broadcasting\n\nNamed functions automatically broadcast over arrays:\n\n```hamelin\n// Convert all reasons to uppercase\n| LET uppercase_reasons = upper(failed_attempts.reason)\n// Result: ['INVALID_PASSWORD', 'ACCOUNT_LOCKED', 'INVALID_PASSWORD']\n\n// Extract hour from all timestamps\n| LET attempt_hours = hour(failed_attempts.timestamp)\n// Result: [14, 14, 14]\n```\n\nFunctions apply to each element in the array automatically.\n\n### Arithmetic broadcasting\n\nMathematical operations work across array elements:\n\n```hamelin\n// Extract numeric seconds from timestamp strings for calculations\n| LET timestamp_seconds = second(failed_attempts.timestamp)\n// Result: [0, 15, 30]\n\n// Add 10 seconds to each timestamp second value\n| LET adjusted_seconds = timestamp_seconds + 10\n// Result: [10, 25, 40]\n\n// Calculate minutes elapsed since the first attempt\n| LET minutes_elapsed = (timestamp_seconds - timestamp_seconds[0]) / 60\n// Result: [0, 0.25, 0.5]\n```\n\n### Combining with array functions\n\nBroadcasting pairs naturally with functions that take arrays as arguments like `any()` and `all()`:\n\n```hamelin\n// Check if any login attempt was due to invalid password\n| WHERE any(failed_attempts.reason == 'invalid_password')\n\n// Verify all attempts came from internal network\n| WHERE all(failed_attempts.ip < '192.168.2.0')\n\n// Count how many attempts were password-related\n| AGG 
password_attempt_count = count_if(any(failed_attempts.reason == 'invalid_password'))\n```\n\nThe broadcasting happens first (creating boolean arrays), then the array\nfunctions operate on those arrays.\n\n### Nested structure broadcasting\n\nBroadcasting works with the nested structure in our example data:\n\n```hamelin\n// Extract just the hour from each timestamp\n| LET attempt_hours = hour(failed_attempts.timestamp)\n// Result: [14, 14, 14]\n\n// Check if any IP is in a specific subnet\n| LET internal_network = failed_attempts.ip > '192.168.1.100'\n// Result: [false, true, true]\n\n// Create readable timestamp strings\n| LET readable_times = format_timestamp(failed_attempts.timestamp, 'HH:mm:ss')\n// Result: ['14:25:00', '14:26:15', '14:27:30']\n```\n\nBroadcasting navigates through multiple levels of nesting.\n\n### Broadcasting vs. traditional approaches\n\nWithout broadcasting, you would need either functional programming with map operations or explode/re-aggregate patterns:\n\n```hamelin\n-- Hamelin with broadcasting\nLET tag_names = upper(tags.name)\n\n-- Alternative approaches without broadcasting:\n-- 1. Functional map operation\nLET tag_names = array_map(tags, tag -> upper(tag.name))\n\n-- 2. Explode and re-aggregate pattern\nFROM events\n| EXPLODE tags AS tag\n| SELECT upper(tag.name) AS upper_name\n| AGG array_agg(upper_name) BY event_id\n```\n\nBroadcasting eliminates the need for explicit mapping or explode/aggregate operations.\n",
504
+ "smart-features/automatic-field-names.md": '# Automatic field names\n\nWhen you don\'t specify output field names in your queries, Hamelin generates \nsensible names automatically. This reduces typing and makes exploratory queries \nfaster to write. You can always override the automatic names by providing your \nown.\n\n## How automatic naming works\n\nHamelin uses the expression itself as the output field name when you don\'t \nprovide one. This creates predictable, readable column names that reflect what \nthe field contains. As an example, for simple column references, Hamelin uses \nthe column name as the field name:\n\n```hamelin\nFROM events | SELECT timestamp, user_name\n// Result columns: "timestamp", "user_name"\n```\n\nFor any expression more complex than a simple column reference, Hamelin uses \nthe full expression text as the field name:\n\n```hamelin\nFROM events | SELECT user_name, timestamp + 1hr, count()\n// Result columns: "user_name", "timestamp + 1hr", "count()"\n\nFROM logs | AGG count(error_code), avg(response_time) BY service\n// Result columns: "count(error_code)", "avg(response_time)"\n```\n\n## Referencing automatic field names\n\nUse backticks to reference automatically generated field names from complex \nexpressions:\n\n```hamelin\nFROM events\n| AGG count(), avg(response_time) BY service_name\n| WHERE `count()` > 100\n| SORT `avg(response_time)` DESC\n```\n\nThe backticks tell Hamelin you\'re referring to a field name that contains \nspecial characters or spaces.\n\n## Commands that use automatic naming\n\n### SELECT command\n\nSELECT generates field names from the expressions you provide:\n\n```hamelin\nFROM users | SELECT first_name + " " + last_name, age * 12\n// Result columns: "first_name + " " + last_name", "age * 12"\n```\n\n### AGG command\n\nAGG creates field names from both aggregation functions and grouping fields:\n\n```hamelin\nFROM requests | AGG count(), max(response_time) BY endpoint\n// Result columns: "endpoint", "count()", "max(response_time)"\n\nFROM events | AGG count() BY user_name, timestamp@hr\n// Result columns: "user_name", "timestamp@hr", "count()"\n```\n\nYou can override the automatic names for both aggregation functions and \ngrouping fields:\n\n```hamelin\nFROM requests | AGG\n request_count = count(),\n max_response = max(response_time)\n BY service_endpoint = endpoint\n// Result columns: "service_endpoint", "request_count", "max_response"\n```\n\n### WINDOW command\n\nWINDOW functions generate names from the function calls:\n\n```hamelin\nFROM metrics\n| WINDOW count(), avg(cpu_usage)\n BY host\n SORT timestamp\n WITHIN -5min\n// Result columns: "count()", "avg(cpu_usage)"\n```\n\n## Overriding automatic names\n\nProvide explicit names when you want cleaner field names or need to reference \nthem easily:\n\n```hamelin\n// Automatic names (harder to reference)\nFROM events | AGG count(), avg(response_time) BY service\n\n// Explicit names (easier to reference)\nFROM events | AGG\n total_events = count(),\n avg_response = avg(response_time)\n BY service\n```\n\nExplicit names make your queries more readable and easier to chain with \nadditional operations.\n\n## When automatic names are most useful\n\nAutomatic naming speeds up data exploration when you\'re trying to understand \nyour data. 
As an example, you can see results quickly without thinking about \nfield names:\n\n```hamelin\nFROM logs | AGG count(), count_distinct(user_name), max(timestamp) BY service\n```\n\nFor straightforward field selections, automatic names keep queries concise:\n\n```hamelin\nFROM events | SELECT user_name, timestamp, action_type\n```\n\nYou can also start with automatic names, then add explicit names as your query \nbecomes more complex:\n\n```hamelin\n// Start simple\nFROM events | AGG count() BY event_type\n\n// Add explicit names as needed\nFROM events | AGG event_count = count() BY event_type\n```\n\nAutomatic field names reduce friction in query writing while maintaining \nclarity about what each field contains.\n',
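A small follow-on sketch (assuming backtick references work inside `SELECT` the same way they do in `WHERE` and `SORT`): you can keep the automatic names while exploring and rename them downstream once the query stabilizes.

```hamelin
FROM events
| AGG count(), avg(response_time) BY service
| SELECT
    service,
    total_events = `count()`,
    avg_response = `avg(response_time)`
```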
505
+ "smart-features/lineage-sub-struct.md": "# Lineage sub-struct\n\nHamelin tracks where data comes from when you assign results to struct fields\nin `FROM` or `MATCH` clauses. This lineage tracking lets you correlate events\nfrom different sources while maintaining visibility into which upstream source\ncontributed each piece of data. You can reference this lineage information to\nbuild complex pattern detection queries.\n\n## How lineage sub-struct works\n\nHamelin creates a composite record that preserves the source of each piece of\ndata when you assign query results to struct fields. This happens automatically\nwhen you use assignment syntax in `FROM` or `MATCH` clauses. As an example,\nconsider tracking both failed and successful login events:\n\n```hamelin\nFROM failed = failed_logins, success = successful_logins\n| WINDOW failures = count(failed),\n successes = count(success),\n total = count()\n BY user.id\n WITHIN -5m\n```\n\nThe `failed = failed_logins` assignment creates a struct field that gets\npopulated for events from the failed logins source, while `success =\nsuccessful_logins` creates another struct field that gets populated for events\nfrom the successful logins source. Events from `failed_logins` will have the\n`failed` field populated and `success` as NULL. Events from `successful_logins`\nwill have the `success` field populated and `failed` as NULL. Hamelin maintains\nthis lineage information throughout the query pipeline.\n\n## Accessing lineage data\n\nYou can reference the assigned struct fields directly in queries. The field\nnames become available for filtering, aggregation, and selection:\n\n```hamelin\nFROM failed = security_alerts, success = login_events\n| WHERE failed.severity > 'medium' OR success.user_id IS NOT NULL\n| SELECT failed.alert_type, success.login_time, failed.source_ip\n```\n\nEach event gets lineage tags that indicate which source it came from. Events\nfrom `security_alerts` will have the `failed` field populated with their data\nand `success` as NULL. Events from `login_events` will have the `success` field\npopulated with their data and `failed` as NULL. This lets you access any field\nfrom the original data while knowing exactly which source contributed each\nevent.\n\n## Pattern correlation with lineage\n\nLineage tracking enables sophisticated event correlation patterns. As an\nexample, consider detecting brute force attacks by correlating failed attempts\nwith eventual successes:\n\n```hamelin\nWITH failed_logins = FROM events\n| WHERE event.action == 'login_failed'\n\nWITH successful_logins = FROM events\n| WHERE event.action == 'login_success'\n\nFROM failed = failed_logins, success = successful_logins\n| WINDOW failures = count(failed),\n successes = count(success),\n total = count()\n BY user.id\n WITHIN -5m\n| WHERE successes >= 1 && failures / total > 0.2\n| SELECT user.id,\n failed_count = failures,\n success_count = successes,\n failure_rate = failures / total,\n```\n\nThis query correlates two distinct event patterns within sliding windows. The\nlineage tracking lets you distinguish events by source - events from\n`failed_logins` have the `failed` struct populated, while events from\n`successful_logins` have the `success` struct populated. You can then access\nsource-specific fields and aggregate based on event lineage.\n\n## MATCH clause lineage\n\nThe `MATCH` command also supports lineage tracking when you assign pattern\nresults to struct fields. 
As an example, consider detecting brute force\npatterns that span multiple login attempts:\n\n```hamelin\nWITH failed_logins = FROM events\n| WHERE event.action == 'login_failed'\n\nWITH successful_logins = FROM events\n| WHERE event.action == 'login_success'\n\nMATCH failed_logins = failed_logins{10,}, successful_logins = successful_logins+\nWHEN max(successful_logins.timestamp) - min(successful_logins.timestamp) < 10m\n| AGG failed_count = count(failed_logins),\n success_count = count(successful_logins),\n first_failed_ip = min(failed_logins.source_ip),\n success_duration = max(successful_logins.timestamp) - min(successful_logins.timestamp)\n BY user_id\n```\n\nThis pattern detects sequences where at least 10 failed login attempts are\nfollowed by one or more successful logins, all occurring within a 10-minute\nwindow. The assignments (`failed_logins =` and `successful_logins =`) create\nlineage tags that identify which pattern each event matched. Events matching the\nfailed login pattern have the `failed_logins` struct populated, while events\nmatching the successful login pattern have the `successful_logins` struct\npopulated. The `AGG` command then operates on these lineage-tagged events to\ncalculate metrics specific to each pattern type. The `count(failed_logins)`\naggregation counts only events that matched the failed login pattern, while\n`count(successful_logins)` counts only events that matched the successful login\npattern. Similarly, `min(failed_logins.source_ip)` accesses the `source_ip`\nfield specifically from events in the failed login pattern, and the timestamp\ncalculations work with the `timestamp` field from events in the successful login\npattern.\n\n## Benefits of lineage tracking\n\nLineage sub-struct provides several key advantages for complex data analysis.\nYou can correlate events from multiple sources while maintaining clear\nattribution of where each piece of data originated. This eliminates confusion\nin queries where data might come from multiple upstream sources with similar\nfield names.\n\nThe feature also enables pattern detection across different event types. You\ncan write queries that aggregate and filter across multiple event patterns\nwhile accessing specific fields from each pattern type. This supports use cases\nlike security monitoring, user behavior analysis, and system performance\ncorrelation.\n",
506
+ "smart-features/type-expansion.md": "# Type expansion\n\nHamelin expands types when you query multiple datasets. If you write `FROM\nevents, logs`, Hamelin creates an *expanded type* that includes all fields from\nboth sources. This lets you write queries that work across datasets without\nworrying about schema differences.\n\n## How type expansion works\n\nHamelin constructs expanded types by combining field names from all source\ntypes. Fields with the same name get aligned into a single field in the\nexpanded type. This lets you write queries that work across datasets with\nconsistent field naming. As an example, consider searching across different\nauthentication logs with varying schemas:\n\n```hamelin\nFROM security_logs, audit_logs, access_logs\n| WHERE action = 'login' OR event_type = 'authentication'\n| SELECT timestamp, user_id, source_ip, action, event_type, session_id\n```\n\nEach logging system has its own schema:\n\nSecurity logs track authentication attempts with IP addresses:\n\n```hamelin\n{timestamp: timestamp, user_id: string, source_ip: string, action: string}\n```\n\nAudit logs capture detailed session information:\n\n```hamelin\n{timestamp: timestamp, user_id: string, event_type: string, session_id: string}\n```\n\nAccess logs record basic user activity:\n\n```hamelin\n{timestamp: timestamp, user_id: string, source_ip: string}\n```\n\nThe expanded result type becomes:\n\n```hamelin\n{\n timestamp: timestamp,\n user_id: string,\n source_ip: string,\n action: string,\n event_type: string,\n session_id: string\n}\n```\n\nRows from `security_logs` will have `NULL` for the `event_type` and\n`session_id` fields. Rows from `audit_logs` will have `NULL` for the\n`source_ip` and `action` fields. Rows from `access_logs` will have `NULL` for\nthe `action`, `event_type`, and `session_id` fields.\n\nThe result contains rows like:\n\n| timestamp | user_id | source_ip | action | event_type | session_id |\n|-----------|---------|-----------|---------|------------|------------|\n| 2024-01-15 10:30:00 | alice.smith | 192.168.1.100 | login | NULL | NULL |\n| 2024-01-15 10:31:00 | alice.smith | NULL | NULL | authentication | sess_abc123 |\n| 2024-01-15 10:32:00 | alice.smith | 192.168.1.100 | NULL | NULL | NULL |\n\nAll three datasets contribute to the same `timestamp` and `user_id` fields\nbecause they use identical field names. You can filter and select on shared\nfields without knowing which source contributed each row. This lets you write a\nsingle query to search for authentication events across all systems, even\nthough each system logs different fields. The expanded type accommodates all\npossible fields, and you can filter on any field that exists in any source.\n\n\n\n## Nested type expansion\n\nType expansion works with nested structures. Hamelin expands the type hierarchy\nto accommodate nested fields from different sources. 
As an example, consider\ncombining user data from different systems:\n\n```hamelin\nFROM user_profiles, account_settings\n| SELECT user.name, user.email, user.preferences\n```\n\nEach system has its own nested user structure:\n\nUser profiles contain basic identity information:\n\n```hamelin\n{user: {name: string, email: string}}\n```\n\nAccount settings store user preferences:\n\n```hamelin\n{user: {preferences: string}}\n```\n\nThe expanded result type becomes:\n\n```hamelin\n{user: {name: string, email: string, preferences: string}}\n```\n\nThis means you can access `user.name` from profile data and `user.preferences`\nfrom settings data in the same query, even though the original sources have\ndifferent nested structures.\n\n :::note\n\n Hamelin maintains stable field ordering when merging nested structures. Fields\n from the first source appear first, then fields from the second source are\n added in their original order. This consistent ordering means you can rely on\n the structure of expanded types being predictable across queries.\n\n :::\n\n## Array literal expansion\n\nType expansion also happens when you create array literals containing struct\ntypes with different schemas. Just like `FROM` clauses, Hamelin creates an\nexpanded type that accommodates all fields from every struct in the array. As\nan example, consider creating an array mixing user records with different\navailable fields:\n\n```hamelin\nLET mixed_users = [\n {name: 'Alice', age: 30, department: 'Engineering'},\n {name: 'Bob', email: 'bob@company.com', age: 25},\n {name: 'Carol', email: 'carol@company.com', department: 'Sales'}\n]\n```\n\nEach struct has its own schema:\n\nThe first user record has name, age, and department:\n\n```hamelin\n{name: string, age: number, department: string}\n```\n\nThe second user record has name, email, and age:\n\n```hamelin\n{name: string, email: string, age: number}\n```\n\nThe third user record has name, email, and department:\n\n```hamelin\n{name: string, email: string, department: string}\n```\n\nThe expanded array type becomes:\n\n```hamelin\n[{name: string, age: number, department: string, email: string}]\n```\n\nEach element gets `NULL` values for missing fields. The first element has\n`NULL` for `email`. The second element has `NULL` for `department`. The third\nelement has `NULL` for `age`. This lets you create arrays from structs with\ndifferent schemas while maintaining type consistency across all elements.\n\nYou can then query the expanded array just like any other dataset:\n\n```hamelin\nLET mixed_users = [\n {name: 'Alice', age: 30, department: 'Engineering'},\n {name: 'Bob', email: 'bob@company.com', age: 25},\n {name: 'Carol', email: 'carol@company.com', department: 'Sales'}\n]\n| UNNEST mixed_users\n```\n\nThe query works across all elements regardless of which fields were originally\npresent in each struct. Missing fields appear as `NULL` in the results, just\nlike with `FROM` clause expansion.\n\nThe results would be:\n\n| name | age | department | email |\n|------|-----|------------|-------|\n| Alice | 30 | Engineering | NULL |\n| Bob | 25 | NULL | bob@company.com |\n| Carol | NULL | Sales | carol@company.com |\n",
507
+ "types/array.md": '# Array\n\nArrays let you work with collections of values. Hamelin arrays work much like SQL arrays, but they integrate seamlessly with structs to handle complex nested data.\n\n## Creating arrays\n\nYou create arrays using square brackets with elements separated by commas.\n\n```hamelin\nLET event_types = ["login", "logout", "purchase"]\nLET user_ids = [1001, 1002, 1003, 1004]\nLET timestamps = [ts(\'2024-01-15T10:00:00\'), ts(\'2024-01-15T10:15:00\'), ts(\'2024-01-15T10:30:00\')]\n```\n\nArrays can contain any type of value, including numbers, strings, timestamps, and even structs.\n\n## Arrays of structs\n\nArrays become especially useful when they contain structs, as Hamelin automatically handles differences between struct fields.\n\n```hamelin\nLET user_events = [\n {user_id: 1001, event: "login", timestamp: ts(\'2024-01-15T09:00:00\')},\n {user_id: 1001, event: "purchase", timestamp: ts(\'2024-01-15T09:15:00\'), amount: 49.99},\n {user_id: 1001, event: "logout", timestamp: ts(\'2024-01-15T09:30:00\')}\n]\n```\n\nNotice how the second struct has an `amount` field that the others don\'t have. Hamelin automatically creates a combined type that includes all fields, setting missing fields to `NULL` where needed.\n\n## Accessing array elements\n\nUse square brackets with zero-based indexing to access individual elements.\n\n```hamelin\nFROM events\n| SELECT \n first_tag = tags[0],\n second_tag = tags[1],\n last_tag = tags[-1]\n```\n\nNegative indices count from the end of the array, so `[-1]` gives you the last element.\n\n## Array operations with mixed structs\n\nWhen you combine arrays containing structs with different fields, Hamelin merges the struct types intelligently.\n\n```hamelin\nLET login_events = [\n {event_type: "login", user_id: 100, timestamp: ts(\'2024-01-15T09:00:00\')},\n {event_type: "login", user_id: 101, timestamp: ts(\'2024-01-15T09:05:00\')}\n]\n\nLET purchase_events = [\n {event_type: "purchase", user_id: 100, amount: 25.99, timestamp: ts(\'2024-01-15T09:10:00\')}\n]\n\n// Combining these creates an array with all fields: event_type, user_id, timestamp, amount\n```\n\nThe resulting combined array contains structs where each element has all the fields that appear in any struct, with `NULL` values where fields are missing.\n\n## Field ordering in combined structs\n\nWhen Hamelin combines structs with different fields, it maintains the field order from the first struct encountered, then appends any new fields in the order they first appear.\n\n```hamelin\nLET events = [\n {id: 1, type: "login", user_id: 100}, // Order: id, type, user_id\n {status: "success", id: 2, type: "logout"} // New field \'status\' gets appended\n]\n\n// Result order: id, type, user_id, status\n```\n\n## Type compatibility\n\nArrays can only contain elements that can be coerced to a common type. 
Hamelin will combine compatible types automatically, but incompatible types will cause an error.\n\n```hamelin\n// This works - numbers can be in the same array\nLET mixed_numbers = [1, 2.5, 3]\n\n// This works - structs with compatible fields\nLET compatible_structs = [\n {name: "Alice", age: 30},\n {name: "Bob", age: 25, city: "Seattle"}\n]\n\n// This would fail - structs with same field name but different types\n// {name: "Alice", count: 5} and {name: "Bob", count: "many"} \n```\n\n## Practical examples\n\nArrays work well for collecting related values and organizing repeated data.\n\n```hamelin\n// Collecting user actions over time\nFROM user_logs\n| SELECT \n user_id,\n user_session = {\n daily_events: [\n {action: "login", time: login_time},\n {action: "view_page", time: page_view_time, page: page_name},\n {action: "logout", time: logout_time}\n ],\n session_duration: logout_time - login_time\n }\n```\n\nThis creates structured output where each user\'s session contains an array of different event types, each with their own specific fields.\n\n## Working with nested arrays\n\nArrays can contain structs that themselves contain arrays, creating complex nested structures.\n\n```hamelin\nFROM dns_logs\n| SELECT dns_response = {\n query: query_name,\n answers: [\n {name: answer1_name, type: answer1_type, ttl: answer1_ttl},\n {name: answer2_name, type: answer2_type, ttl: answer2_ttl}\n ],\n response_time: query_duration\n }\n```\n\nThis organizes DNS response data where each query can have multiple answers, and each answer has its own set of fields.',
508
+ "types/casting.md": "# Casting\n\nTo cast, use the infix operator `AS`. Hamelin uses the `AS` operator for explicit type casting. You write the value, then `AS`, then the type you want.\n\nYou basically don't ever have to cast, except from variant to explicit types, where casting is a very important part of interpreting JSON.\n\n## Basic syntax\n\nCast a value by putting `AS` between the value and the target type:\n\n```hamelin\n| LET x = 5 AS double\n```\n\nThis creates a double-precision value instead of an integer.\n\n## Why `AS` for casting?\n\nYou'll use explicit casting often, especially when declaring literals to influence type inference. We wanted something terse. Using `AS` for assignment confuses people (the order seems backwards). This frees up `AS` for casting, which reads cleanly: *treat this one thing as another type*.\n\n## How it works\n\nThe `AS` operator translates explicit cast expressions into the generated code. We often actually translate to `try_cast()` in order to make sure the query doesn't crash.\n\nHamelin delegates **implicit casting to the underlying engine** \u2014 if you assign a value to a typed column or pass it to a function that expects a different type, the engine decides whether and how to cast the value.\n\n## Common casting examples\n\n### String conversions\nConvert values to strings for display or storage:\n\n```hamelin\nFROM events\n| SELECT\n user_id_str = user_id AS string,\n timestamp_str = timestamp AS string,\n status_display = status_code AS string\n```\n\n### Numeric conversions\nConvert between different numeric types or from strings to numbers:\n\n```hamelin\nFROM logs\n| SELECT\n status_code = response_code AS integer,\n response_time = response_time_str AS double,\n user_count = total_users AS integer\n```\n\n### Boolean conversions\nConvert various values to boolean types:\n\n```hamelin\nFROM user_data\n| SELECT\n user_id,\n is_active = status_flag AS boolean,\n has_permissions = permission_level AS boolean\n```\n\n## Type inference with casting\n\nYou can influence type inference in variable declarations by casting literals:\n\n```hamelin\nFROM events\n| LET\n threshold = 100 AS double,\n max_retries = 5 AS integer,\n default_timeout = 30.0 AS double\n| WHERE response_time > threshold\n```\n\n## Complex type casting\n\n### Array casting\nCast arrays to specific element types:\n\n```hamelin\nFROM json_data\n| SELECT\n tags = tag_list AS array(string),\n scores = score_array AS array(double)\n```\n\n### Struct casting\nCast structured data to specific field types:\n\n```hamelin\nFROM structured_data\n| SELECT\n user_info = user_data AS {name: string, email: string},\n coordinates = location AS {x: double, y: double}\n```\n",
509
+ "types/map.md": "# Map\n\nYou use maps to store key-value pairs where you have too many different keys to create separate columns. Hamelin's **map** type matches SQL's design - it's a homogeneous structure where all keys have the same type and all values have the same type.\n\n## When to use maps\n\nYou should use maps rarely. Before using a map, consider whether you could:\n- Factor subset key spaces into separate tables\n- Leave the data as a JSON string and parse only commonly needed values into columns\n\nYou use maps only when you have a high cardinality key space that can't be handled with these alternatives.\n\n## Creating maps\n\nYou construct maps using the `map()` function, which has two overloaded forms.\n\n### Map literals\n\nYou create a map by listing key-value pairs directly:\n\n```hamelin\nLET config = map(\n 'timeout': 30,\n 'retries': 3,\n 'debug': false\n)\n```\n\n### From key and value arrays\n\nYou build a map from separate arrays of keys and values:\n\n```hamelin\nLET field_names = ['user_id', 'email', 'created_at']\nLET field_values = [12345, 'user@example.com', '2024-01-15']\nLET user_data = map(field_names, field_values)\n```\n\n\n\n## Empty maps\n\nYou create an empty map using the function without arguments:\n\n```hamelin\nLET empty_config = map()\n```\n\n## Type homogeneity\n\nMaps must be homogeneous - all values must have the same type. This example will generate an error:\n\n```hamelin\n// ERROR: mixing integer and string values\nLET broken_map = map(\n 'count': 42,\n 'name': 'example'\n)\n```\n\n## Accessing map values\n\nYou use bracket notation to retrieve values by key:\n\n```hamelin\nFROM events\n| LET metadata = map('source': 'api', 'version': 2)\n| SELECT event_source = metadata['source']\n```\n\n## Map storage\n\nThe underlying engine stores each map as a pair of related columns - one for keys and one for values. Row positions in each column relate the keys to their values.\n\n## Performance considerations\n\nUnlike structs, which add overhead by creating a column per field, maps have minimal impact on table width (only two columns). However, you pay the cost of key values being actual data rather than column names, even though they're dictionary encoded.\n",
510
+ "types/philosophy.md": "# Type Philosophy\n\nHamelin is a typed language, which prevents query mistakes, provides better\nerror messages, and simplifies translation definitions.\n\n## Design Philosophy\n\nYou mostly won't think about types when writing queries. You'll learn about\nthem when something doesn't type check and Hamelin gives you a clear error\nmessage. Hamelin catches type errors before they reach the SQL engine, so you\nget helpful feedback about your Hamelin code instead of confusing messages about\ngenerated SQL.\n\n### Umbrella Types\n\nHamelin groups related types under umbrella categories instead of exposing every\nSQL type variation. As an example, all integer types from `tinyint` to `bigint` become the\nsingle `integer` type for type checking. You can reason about your code more\neasily while still getting precise error messages.\n\n### Error Prevention\n\nTypes catch mistakes early and give you clear feedback. When something doesn't\ntype check, errors point to your Hamelin code, not to generated SQL that would\nconfuse you.\n\n### Transparent Mapping\n\nHamelin types map cleanly to SQL types without requiring you to think about\nstorage details. The system handles these details automatically while preserving\nthe semantic meaning of your data.\n\n### Function Overloading\n\nHamelin's type system allows it to use the same function name for operations on\ndifferent types. You write `sum()` whether you're aggregating a column or adding\nup an array - no need for separate `sum_numbers()` and `array_sum()` functions.\nThis makes it easier on the author. It also makes it possible for Hamelin's\nimplementation to define dialect-specific translations.\n\n## Type Inference\n\nHamelin figures out types from your expressions and data automatically. You\ndon't need to declare types. Hamelin determines them based on how you use\nvalues in operations and functions.\n\nFor example:\n- `42 + 3.14e0` results in a `double` (floating-point arithmetic)\n- `'hello' + 'world'` results in a `string` (string concatenation)\n- `timestamp > '2024-01-01'` results in a `boolean` (comparison operation)\n- `sum(revenue)` works as an aggregation function in `AGG` commands\n- `sum([1, 2, 3])` works on arrays and returns `6`\n\nType inference also powers Hamelin's function translation, ensuring that\noperations translate to the right SQL functions based on the inferred types of\ntheir arguments. This lets you focus on expressing your logic clearly while the\ntype system works behind the scenes to ensure correctness and give you helpful\nfeedback when things go wrong.\n",
511
+ "types/primitive-types.md": "# Primitive Types\n\nHamelin provides several primitive types that serve as building blocks for more complex data structures. These types map cleanly to SQL types while providing a simplified interface for type checking and operations.\n\n## Boolean\n\nThe `boolean` type translates directly to SQL's `boolean` type.\n\nExamples: `true`, `false`\n\n## Integer\n\nThe `integer` type is an umbrella for all integer types, from `tinyint` (8 bits)\nto `bigint` (64 bits). All integers are treated the same for type checking\npurposes.\n\nExamples: `1`, `42`, `-17`, `1000000`\n\n## Double (Floating Point)\n\nThe `double` type represents floating point numbers with variable precision. Use\ndoubles for calculations where approximate values are acceptable. Double\nliterals can be written in scientific notation (e.g., `1.5e0`). The `e0` means\n\"times 10 to the power of 0\", so `1.5e0` equals `1.5` but forces the result to\nbe a double type.\n\nExamples: `3.14159e0`, `2.5e0`, `1.23e-4`, `-9.87e2`\n\n## Decimal (Fixed Point)\n\nThe `decimal` type represents exact numeric values with specified precision and\nscale, written as `decimal(precision, scale)`. Decimal literals in Hamelin\ndefault to fixed point because most business calculations require exact\narithmetic rather than floating point approximations.\n\nExamples: `100.50`, `0.075`, `999.99`\n\n## String\n\nThe `string` type covers `char`, `varchar` (of any length). String concatenation\nuses the `+` operator.\n\nExamples: `'hello world'`, `\"error message\"`, `'user@example.com'`\n\n## Binary\n\nThe `binary` type translates to `varbinary` in SQL for handling binary data.\n\nExamples: Binary data representations for file contents, encrypted data, or raw bytes.\n\n## Timestamp\n\nThe `timestamp` type is an umbrella for `date`, `timestamp`, and all their\nvariants (precision and timezone). Hamelin follows SQL's assumption that\ntimestamps with and without zones can be compared.\n\nExamples: `ts('2024-01-15')`, `ts('2024-01-15 14:30:00')`, `ts('2024-01-15T14:30:00Z')`\n\n## Interval\n\nThe `interval` type covers `interval day to second` for time duration\ncalculations with fixed durations. These intervals represent exact amounts of\ntime that can be directly compared and calculated.\n\nExamples: `2h`, `30min`, `5d`\n\n## Calendar Interval\n\nThe `calendar interval` type covers `interval year to month` for calendar-based\ndurations. Calendar intervals like years and months don't represent a fixed\nnumber of days because months have different lengths and years can be leap\nyears. These intervals cannot be directly compared to day-based intervals.\n\nExamples: `3mon`, `2y`, `1q`\n\n## Range\n\nThe `range` type represents spans between two values of any type. Ranges\nare created by the `..` operator. Ranges can be bounded (with both start and end\npoints) or unbounded (extending infinitely in one direction). You see them most\ncommonly in Hamelin as `range(timestamp)` or `range(interval)`, and every query\ngenerally operates under the constraints of a `range(timestamp)`.\n\nExamples: `-2hr..-1hr`, `ts('2024-01-15')..now()`, `-1hr..`, `..now()`, `1..10`, `'a'..'z'`\n\n## Rows\n\nThe `rows` type represents a number of rows and is only useful when declaring\nwindow frames as a certain number of rows rather than a time-based frame.\n\nExamples: `5r`, `10r`, `0r`\n",
512
+ "types/struct.md": '# Struct\n\nStructs let you group related fields together into a single data structure. Unlike SQL\'s `ROW` type, Hamelin structs use field names rather than position to identify each field, making them safer and easier to work with.\n\n## Creating structs\n\nYou create structs using curly braces with field names and values.\n\n```hamelin\nLET user_data = {\n user_id: 12345,\n name: "Alice Johnson",\n email: "alice@example.com"\n}\n```\n\nThis creates a struct with three fields: `user_id`, `name`, and `email`. Each field has a name and a value.\n\n## Accessing struct fields\n\nUse dot notation to access individual fields within a struct.\n\n```hamelin\nFROM user_events\n| WHERE user_info.user_id == 12345\n| SELECT\n user_name = user_info.name,\n user_email = user_info.email\n```\n\nYou can access any field by name, regardless of the order they were defined in the struct.\n\n## Field order and naming\n\nStructs maintain the order of fields as you define them, but field identity\nduring type expansion comes from the name, not the position. You can read more\nabout that in [type expansion](../smart-features/type-expansion.md).\n\n```hamelin\n// These two structs declare different types, but they can be aligned during expansion\nLET\n profile1 = {\n id: 1001,\n status: "active",\n created: ts(\'2024-01-15T00:00:00\')\n },\n profile2 = {\n status: "inactive",\n id: 1002,\n created: ts(\'2024-01-16T00:00:00\')\n }\n```\n\n## Nested structs\n\nStructs can contain other structs for organizing complex data.\n\n```hamelin\nFROM events\n| SELECT structured_event = {\n user: {\n id: user_id,\n profile: {\n name: user_name,\n email: email_address\n }\n },\n event: {\n type: event_type,\n timestamp: event_time,\n source: data_source\n }\n }\n```\n\nAccess nested fields by chaining field names with dots:\n\n```hamelin\n| SELECT\n user_name = structured_event.user.profile.name,\n event_time = structured_event.event.timestamp\n```\n\n## Practical examples\n\nStructs work well for organizing related information that belongs together:\n\n```hamelin\n// HTTP request logging\nFROM access_logs\n| SELECT request_data = {\n request: {\n method: http_method,\n path: url_path,\n status: response_code\n },\n timing: {\n start_time: request_start,\n end_time: request_end,\n duration_ms: response_time\n },\n client: {\n ip: client_ip,\n user_agent: user_agent\n }\n }\n```\n\nThis creates clean, organized output where related fields are grouped logically rather than scattered across many columns.\n',
513
+ "types/variant.md": "# Variant\n\nYou use Hamelin's **variant** type to work with JSON and other semi-structured data. Hamelin adopts the Variant trend for representing the JSON object model, making JSON feel native and easy to work with.\n\n## Parsing JSON into variant\n\nYou parse a JSON string into a variant using the `parse_json()` function:\n\n```hamelin\nFROM api_logs\n| LET event_data = parse_json(json_payload)\n| SELECT event_data\n```\n\n## Navigating variant data\n\nYou navigate variant substructure safely and ergonomically using dots and square brackets, just like with structs and arrays:\n\n```hamelin\nFROM car_sales\n| LET json = parse_json(src)\n| SELECT \n sale_date = json.date,\n salesperson_name = json.salesperson.name,\n customer_name = json.customer[0].name\n```\n\n## Accessing nested fields\n\nYou can access deeply nested fields using the same dot and bracket notation:\n\n```hamelin\nFROM events\n| LET data = parse_json(event_json)\n| SELECT \n user_email = data.user.profile.email,\n first_item_price = data.transaction.items[0].price\n```\n\n## Safe type conversion\n\nVariants cast safely to other Hamelin types. Individual conversion failures become `NULL` instead of crashing your query:\n\n```hamelin\nFROM logs\n| LET parsed = parse_json(log_data)\n| SELECT \n log_time = parsed.timestamp AS timestamp,\n user_id = parsed.user_id AS string,\n event_count = parsed.count AS integer\n```\n\n## Casting to structured types\n\nYou can cast variants to maps, arrays, and structs. These casts are safe and null on failure:\n\n```hamelin\nFROM json_data\n| LET parsed = parse_json(raw_json)\n| SELECT \n user_info = parsed AS {name: string, age: integer},\n tag_list = parsed.tags AS [string]\n```\n\n## Creating variant objects\n\nYou create variant objects by casting structs to variant:\n\n```hamelin\nLET user_struct = {name: 'Alice', age: 30}\n| LET user_variant = user_struct AS variant\n| SELECT user_variant\n```\n\n## Creating variant arrays\n\nYou create variant arrays by casting arrays to variant:\n\n```hamelin\nLET numbers = [1, 2, 3, 4, 5]\n| LET variant_list = numbers AS variant\n| SELECT variant_list\n```\n\n## Mixed type handling\n\nVariant handles mixed types within the same structure:\n\n```hamelin\nFROM api_responses\n| LET response = parse_json(response_body)\n| SELECT \n record_id = response.data.id AS string,\n is_active = response.data.active AS boolean,\n user_score = response.data.score AS double\n```\n\n## Database system compatibility\n\nHamelin adapts to your database system's JSON capabilities:\n\n- **Full VARIANT support** (Snowflake, Databricks): Hamelin uses native variant storage and operations\n- **JSON support** (BigQuery, Postgres): Hamelin treats JSON as the variant format \n- **ANSI JSON only**: Hamelin provides parsing and access functions but no efficient storage\n\n## Working with arrays\n\nYou access array elements using zero-based indexing:\n\n```hamelin\nFROM events\n| LET data = parse_json(event_data)\n| SELECT \n first_item = data.items[0].name,\n last_item = data.items[-1].name\n```\n\n## Handling missing fields\n\nVariant navigation is safe - accessing missing fields returns `NULL`:\n\n```hamelin\nFROM logs\n| LET parsed = parse_json(log_entry)\n| SELECT \n always_present = parsed.required_field,\n might_be_null = parsed.optional_field\n```\n\n---\n\n*Variant types make JSON feel native in Hamelin, providing safe navigation and conversion without requiring upfront schema knowledge.*"
514
+ };
515
+ export {
516
+ HAMELIN_DOCUMENTATION
517
+ };
package/package.json ADDED
@@ -0,0 +1,24 @@
1
+ {
2
+ "name": "@hamelin.sh/documentation",
3
+ "version": "0.2.2-prerelease.20250910T133135",
4
+ "sideEffects": false,
5
+ "license": "UNLICENSED",
6
+ "type": "module",
7
+ "files": [
8
+ "dist",
9
+ "README.md"
10
+ ],
11
+ "exports": "./dist/main.js",
12
+ "types": "./dist/main.d.ts",
13
+ "devDependencies": {
14
+ "@total-typescript/ts-reset": "^0.6.1",
15
+ "@total-typescript/tsconfig": "^1.0.4",
16
+ "@types/node": "^22.13.10",
17
+ "tsup": "^8.3.5",
18
+ "tsx": "^4.16.0",
19
+ "typescript": "^5.5.0"
20
+ },
21
+ "scripts": {
22
+ "build": "tsx scripts/build-documentation.ts && tsup"
23
+ }
24
+ }