@hamelin.sh/documentation 0.2.14 → 0.3.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/dist/main.js +6 -5
- package/package.json +1 -1
package/dist/main.js
CHANGED
@@ -9,7 +9,7 @@ var HAMELIN_DOCUMENTATION = {
9
"command-reference/let.md": "# `LET`\n\nAdd or modify columns without affecting other columns.\n\n## Syntax\n\n```\nLET assignment [, assignment]* [,]?\n```\n\n## Parameters\n\n- **assignment** - Column assignment in the form `identifier = expression`\n\n## Description\n\nThe `LET` command adds new columns or modifies existing ones while \npreserving all other columns in the dataset. Unlike `SELECT`, which replaces \nthe entire column set, `LET` makes incremental changes to the data structure.\n\nYou can specify multiple assignments in a single `LET` command, separated by \ncommas. Each assignment creates or updates the specified field with the \nresult of evaluating the expression. Expressions can reference any field \navailable at the point where the `LET` command appears in the pipeline.\n\nWhen the identifier already exists as a column, `LET` modifies that column's \nvalues. When the identifier does not exist, `LET` creates a new column with \nthat name.\n\n\n\n## Related Commands\n\n- **[SELECT](./select.md)** - Completely redefine output columns",
10
"command-reference/limit.md": "# `LIMIT`\n\nRestrict the number of rows returned.\n\n## Syntax\n\n```\nLIMIT expression\n```\n\n## Parameters\n\n- **expression** - Non-negative integer specifying the maximum number of rows to return\n\n## Description\n\nThe `LIMIT` command restricts the output to a specified maximum number of \nrows. The expression must evaluate to a non-negative integer value. When the \ndataset contains fewer rows than the limit, all rows are returned.\n\nYou typically use `LIMIT` in combination with `SORT` to retrieve the top or \nbottom N results from an ordered dataset. When you use it without sorting, \n`LIMIT` returns an arbitrary subset of rows, which may vary between query \nexecutions.\n\nThe limit is applied after all other operations in the pipeline, making it \nuseful for controlling output size while preserving the full computation \ncontext for earlier commands.\n\n## Related Commands\n\n- **[SORT](./sort.md)** - Order rows by expressions (commonly used with LIMIT)\n\n",
11
"command-reference/lookup.md": "# `LOOKUP`\n\nCombine datasets using left outer join logic - all original rows are preserved.\n\n## Syntax\n\n```\nLOOKUP fromClause [ON expression]?\n```\n\n## Parameters\n\n- **fromClause** - Either a dataset identifier or an alias assignment (`alias = dataset`)\n- **expression** - Boolean condition defining how rows should match\n\n## Description\n\nThe `LOOKUP` command performs a left outer join operation, combining the \ncurrent dataset with another dataset based on matching conditions you specify \nin the `ON` clause. All rows from the original dataset are preserved in the \nresults, regardless of whether they have matches in the lookup dataset.\n\nFor rows without matches, the looked-up data struct is set to `null`. For \nrows with matches, the looked-up data is nested as a struct to prevent field \nname collisions. By default, the struct uses the name of the lookup dataset, \nbut you can override this using the assignment syntax in the `fromClause`.\n\nWhen you omit the `ON` clause, the lookup becomes a cartesian product of all \nrows from both datasets. The lookup condition expression can reference fields \nin the current dataset directly and fields in the lookup dataset by name \nusing dot notation (e.g., `users.email`).\n\n\n\n## Related Commands\n\n- **[JOIN](./join.md)** - Inner join that only keeps rows with matches",
12 -
"command-reference/match.md":
12 +
"command-reference/match.md": '# `MATCH`\n\nFind ordered sequences of events using pattern matching with quantifiers.\n\n## Syntax\n\n```\nMATCH pattern+ \n [BY groupClause [, groupClause]*] [,]?\n [SORT [BY]? sortExpression [, sortExpression]*] [,]?\n [WITHIN interval] [,]?\n```\n\n## Parameters\n\n- **pattern** - Named dataset reference with optional quantifier (`*`, `+`, `?`, `{n}`) specifying sequence requirements\n- **interval** - Time interval specifying the maximum duration for the entire pattern sequence (e.g., `5m`, `1h`)\n- **groupClause** - Field or expression to group pattern matching by\n- **sortExpression** - Field or expression to order results by, with optional `ASC` or `DESC` direction\n\n## Description\n\nThe `MATCH` command finds ordered sequences of events across multiple named \ndatasets using regular expression-style pattern matching. Unlike `WINDOW`, \nwhich performs unordered correlation, `MATCH` requires that events occur in a \nspecific temporal sequence.\n\nYou specify patterns using named datasets (defined with `WITH` clauses) \nfollowed by optional quantifiers. Quantifiers include `*` (zero or more), `+` \n(one or more), `?` (zero or one), `{n}` (exactly n), and `{n,m}` (between n \nand m occurrences).\n\n### Implicit Timestamp Ordering\n\n`MATCH` commands automatically sort by the `timestamp` column unless you explicitly specify a different `SORT` clause. This default ordering ensures that pattern matching operates on temporally ordered event sequences.\n\nIf you explicitly provide a `SORT` clause, the automatic timestamp ordering is disabled, and your custom ordering takes effect.\n\n### Time Constraints with WITHIN\n\nThe `WITHIN` clause constrains the total duration from the start of the first pattern to the end of the last pattern. For example, `WITHIN 5m` ensures that the entire pattern sequence completes within 5 minutes. The constraint must be a positive value and cannot be negative.\n\nWhen using `WITHIN`:\n- You must specify exactly one `SORT` expression (or use the implicit `timestamp` ordering)\n- The `WITHIN` constraint measures on the actual `SORT` column, not hardcoded to timestamp\n- The `SORT` column type must be compatible with the `WITHIN` type:\n - `TIMESTAMP` sort columns work with `INTERVAL` (e.g., `5m`) or `CALENDAR_INTERVAL` (e.g., `1y`, `3mon`)\n - Numeric sort columns require matching numeric `WITHIN` types (e.g., `INT` sort with `INT` within)\n- If you don\'t specify a `SORT` clause, the implicit `timestamp` ordering is used automatically\n\nThe `BY` clause partitions data for independent pattern matching within each group. 
Results contain matched sequences with access to \nfirst and last events via `first()` and `last()` functions.\n\n## Examples\n\n### Basic Pattern Matching with WITHIN\n\n```hamelin\nWITH login AS (FROM events WHERE event_type = "login"),\n suspicious AS (FROM events WHERE event_type = "suspicious_activity")\n\nFROM events \n| MATCH login suspicious WITHIN 5m\n```\n\nFinds sequences where a login event is followed by suspicious activity within 5 minutes.\n\n### Multiple Patterns with Time Constraint\n\n```hamelin\nWITH file_access AS (FROM events WHERE event_type = "file_access"),\n data_exfil AS (FROM events WHERE event_type = "data_transfer")\n\nFROM events \n| MATCH file_access+ data_exfil BY user_id WITHIN 10m\n```\n\nFinds sequences where one or more file access events are followed by data transfer, all occurring within 10 minutes, grouped by user.\n\n### Numeric Ordering with WITHIN\n\n```hamelin\nWITH login AS (FROM events WHERE event_type = "login"),\n action AS (FROM events WHERE event_type = "action")\n\nFROM events \n| MATCH login action SORT BY row_number WITHIN 10\n```\n\nFinds sequences where a login is followed by an action within 10 row numbers. The `WITHIN` constraint measures on the `SORT` column (`row_number`), ensuring the distance from first to last event is at most 10.\n\n## Related Commands\n\n- **[WINDOW](./window.md)** - Unordered correlation and aggregation',
13
"command-reference/nest.md": "# `NEST`\n\nNest all currently defined fields into one sub-struct.\n\n## Syntax\n\n```\nNEST identifier\n```\n\n## Parameters\n\n- **identifier** - The field name to create for the nested structure\n\n## Description\n\nThe `NEST` command takes all currently defined fields and nests them into one \nsub-struct by creating a new field containing a struct. All original fields \nare preserved as properties of the nested struct. This operation is the \ninverse of UNNEST.\n\n## Related Commands\n\n- **[UNNEST](./unnest.md)** - Lifts struct fields into the parent or enclosing result set (inverse operation)",
14
"command-reference/parse.md": '# `PARSE`\n\nExtract structured data from string fields using anchor parsing.\n\n## Syntax\n\n```\nPARSE [expression] pattern AS? identifier (, identifier)* [NODROP]\n```\n\n## Parameters\n\n- **expression** - Optional source field to parse (defaults to current row context)\n- **pattern** - Anchor pattern string using star (*) characters to mark extraction points\n- **identifier** - Output field names for extracted values\n- **NODROP** - Optional flag to preserve the original source field\n\n## Description\n\nThe `PARSE` command provides a lightweight pattern matching approach that is a\nsimple alternative to complex regular expressions. It extracts structured data\nfrom string fields using anchor parsing with star (*) characters. The pattern\nstring uses literal text as anchors with star characters marking extraction\npoints (e.g., "prefix-*-suffix" extracts the value between the anchors). You\nmust provide as many output identifiers as there are star (*) characters in\nthe pattern. The command creates new fields containing the extracted values.\nBy default, rows that don\'t match the pattern are filtered out. When you\nspecify NODROP, non-matching rows are preserved with all output fields set to\nnull.\n',
15
"command-reference/rows.md": "# `ROWS`\n\nInject rows into a pipeline.\n\n## Syntax\n\n```\nROWS expression\n```\n\n## Parameters\n\n- **expression** - An expression that evaluates to an array-of-struct which defines the rows to inject\n\n## Description\n\nThe `ROWS` command injects rows into a pipeline by taking an array-of-struct\nexpression and creating one row for each struct element. This is mostly used\nfor examples or playgrounds and is rarely useful in real queries over larger\ndatasets. The `ROWS` command is functionally equivalent to `UNNEST` of a literal\narray.\n\n## Related Commands\n\n- **[EXPLODE](./explode.md)** - Expand array fields into separate rows (similar row generation behavior)\n- **[UNNEST](./unnest.md)** - Lift struct or array of struct fields into the parent or enclosing result set (functionally equivalent for literal arrays)\n",
@@ -21,7 +21,7 @@ var HAMELIN_DOCUMENTATION = {
21
"command-reference/window.md": "# `WINDOW`\n\nApply aggregation or window functions over sliding windows of data.\n\n## Syntax\n\n```\nWINDOW assignmentClause [, assignmentClause]* [,]?\n [BY groupClause [, groupClause]*] [,]?\n [SORT [BY]? sortExpression [, sortExpression]*] [,]?\n [WITHIN withinExpression]?\n```\n\n## Parameters\n\n- **assignmentClause** - Aggregation function call or expression, optionally with assignment (`identifier = expression`)\n- **groupClause** - Field or expression to partition the data by\n- **sortExpression** - Field or expression to order rows within each partition, with optional `ASC` or `DESC` direction\n- **withinExpression** - Time range or row count defining the window size\n\n## Description\n\nThe `WINDOW` command applies aggregation or window functions over sliding windows of data,\ncomputing results for each row based on a specified subset of surrounding rows.\nUnlike `AGG`, which groups data into discrete buckets, `WINDOW` maintains the\noriginal row structure while adding computed window function results.\n\nThe `BY` clause partitions the data into separate windows based on the grouping\nexpressions you specify. The `SORT` clause determines the ordering of rows\nwithin each partition. The `WITHIN` expression defines the window size, which\nyou can specify as a relative interval (e.g., `-5min`, `-1hr`), a relative range\n(`-10min..-5min`), or relative row count (e.g., `-5r`, `10rows`).\n\nAll aggregation functions in a single `WINDOW` command use the same window\ndefinition. Window function results are added as new columns to the existing\ndataset structure.\n\n\n\n## Related Commands\n\n- **[AGG](./agg.md)** - Group rows and apply aggregation functions (discrete grouping vs sliding windows)\n- **[MATCH](./match.md)** - Pattern matching over sequences\n",
22
"command-reference/within.md": "# `WITHIN`\n\nFilter rows based on timestamp ranges.\n\n## Syntax\n\n```\nWITHIN expression\n```\n\n## Parameters\n\n- **expression** - An interval, a range of intervals, or a range of timestamps defining the temporal filter bounds\n\n## Description\n\nThe `WITHIN` command filters rows based on timestamp ranges, keeping only rows\nwhose timestamp falls within the specified time bounds. The expression defines\na time range using interval notation, supporting both bounded and unbounded\nranges.\n\nThe expression supports three forms:\n\n- **Interval** (e.g., `-1hr`) - calculates from that time until now\n- **Range of intervals** (e.g., `-2hr..-1hr`) - both intervals are calculated\n from now\n- **Specific timestamps** (e.g., `ts('2023-01-01')..ts('2023-01-02')`) - between those\n exact timestamps\n\nUnbounded ranges are exactly that - evaluated with no bound in one direction.\n\nThe `WITHIN` command uses the default timestamp field from the dataset.\n",
23
"examples/aws-security-group-detection.md": "# AWS Security Group Detection\n\nThis example shows how to build a detection for overly permissive AWS EC2 \nsecurity group modifications. Security groups act as virtual firewalls for \nEC2 instances. Misconfigurations that allow unrestricted internet access \ncreate attack vectors that lead to breaches. You can identify dangerous \nmodifications by analyzing CloudTrail events for security group changes that \nexpose sensitive ports to the internet.\n\n## Attack overview\n\nSecurity group misconfigurations lead to cloud breaches through this pattern:\n\n1. **Initial misconfiguration** - Administrators accidentally allow unrestricted access\n2. **Deliberate modification** - Attackers modify security groups after compromise\n3. **Service exposure** - Sensitive services become accessible from the internet\n4. **Lateral movement** - Exposed services provide entry points for further attacks\n\nYou can detect these misconfigurations by monitoring CloudTrail events for \nsecurity group modifications that create overly permissive rules.\n\n## Detection strategy\n\nThe detection focuses on security group modifications that allow traffic from \nany source (`0.0.0.0/0` or `::/0`) to sensitive ports. You write queries that \nparse CloudTrail JSON events to extract security group changes, identify \npermissive rules, and calculate risk scores based on the ports and protocols \nexposed.\n\nHamelin targets CloudTrail events like `AuthorizeSecurityGroupIngress` and \n`ModifySecurityGroupRules` that indicate firewall rule changes, then filters \nfor modifications that create internet-accessible endpoints.\n\n## Complete detection query\n\nThis detection query parses CloudTrail JSON events to identify security group \nmodifications that expose services to unrestricted internet access. Hamelin \nextracts relevant fields from nested JSON, identifies sensitive ports, \ncalculates risk scores based on exposure level, and creates structured output \nfor security teams:\n\n```hamelin\n// Name: AWS EC2 Security Group Permissive Changes\n// Author: Detection Engineer\n//-\n//- Description: This detection identifies potentially dangerous modifications\n//- to EC2 security groups that could expose resources to unauthorized access.\n//- It specifically looks for rules that allow traffic from any source\n//- (0.0.0.0/0 or ::/0) on sensitive ports. 
Security groups act as virtual\n//- firewalls, and overly permissive rules are a common security misconfiguration\n//- that can lead to breaches.\n//-\n//- Tags: aws, ec2, security-group, network-security, misconfiguration\n//- Mitre-Tactic: TA0005 (Defense Evasion), TA0001 (Initial Access)\n//- Mitre-Technique: T1562.007 (Impair Defenses: Disable or Modify Cloud Firewall)\n//-\n\nFROM simba.cloudtrail_events\n\n// Parse the JSON and extract fields\n| LET evt = parse_json(event.original) AS variant\n| LET eventName = evt.eventName AS string\n| LET source_ip = evt.sourceIPAddress AS string\n| LET user_identity = evt.userIdentity.arn AS string\n| LET account_id = evt.userIdentity.accountId AS string\n| LET aws_region = evt.awsRegion AS string\n| LET error_code = evt.errorCode AS string\n| LET request_params = evt.requestParameters AS variant\n| LET group_id = evt.requestParameters.groupId AS string\n| LET group_name = evt.requestParameters.groupName AS string\n\n// Filter for security group modification events\n| WHERE coalesce(eventName, '') IN [\n 'AuthorizeSecurityGroupIngress',\n 'AuthorizeSecurityGroupEgress',\n 'RevokeSecurityGroupIngress',\n 'RevokeSecurityGroupEgress',\n 'CreateSecurityGroup',\n 'ModifySecurityGroupRules'\n ]\n\n// Only process successful events\n| WHERE coalesce(error_code, '') == ''\n\n// Check for overly permissive rules in the original event\n| LET has_any_source = contains(event.original, '0.0.0.0/0') OR contains(event.original, '::/0')\n\n// Extract port information (simplified approach)\n| LET from_port = evt.requestParameters.ipPermissions[0].fromPort AS int\n| LET to_port = evt.requestParameters.ipPermissions[0].toPort AS int\n| LET ip_protocol = evt.requestParameters.ipPermissions[0].ipProtocol AS string\n\n// Define sensitive ports\n| LET sensitive_ports = [22, 23, 3389, 1433, 3306, 5432, 5984, 6379, 7000, 7001, 8020, 8086, 8888, 9042, 9160, 9200, 9300, 11211, 27017, 27018, 27019, 50070]\n| LET is_sensitive_port = coalesce(from_port, 0) IN sensitive_ports OR \n coalesce(to_port, 0) IN sensitive_ports OR\n (coalesce(from_port, 0) <= 22 AND coalesce(to_port, 0) >= 22) OR\n (coalesce(from_port, 0) <= 3389 AND coalesce(to_port, 0) >= 3389)\n\n// Check for allow all protocols\n| LET is_all_traffic = coalesce(ip_protocol, '') == '-1' OR \n (coalesce(from_port, 0) == 0 AND coalesce(to_port, 0) == 65535)\n\n// Only flag events with permissive rules\n| WHERE has_any_source\n\n// Calculate risk score\n| LET risk_score = if(\n contains(coalesce(eventName, ''), 'Authorize') AND is_all_traffic,\n 100, // Critical - allowing all traffic from anywhere\n if(\n contains(coalesce(eventName, ''), 'Authorize') AND is_sensitive_port,\n 90, // Very high - sensitive ports exposed\n if(\n contains(coalesce(eventName, ''), 'Authorize'),\n 75, // High - any port exposed to internet\n 40 // Low - revoking permissive rules (good action)\n )\n )\n )\n\n// Create human-readable message\n| LET action_type = if(contains(coalesce(eventName, ''), 'Authorize'), 'opened', 'closed')\n| LET from_port_str = coalesce(from_port, 0) AS string\n| LET to_port_str = coalesce(to_port, 0) AS string\n| LET port_info = if(\n is_all_traffic,\n 'all ports',\n 'port(s) ' + from_port_str + '-' + to_port_str\n )\n| LET message = 'Security group ' + action_type + ' to internet: ' + \n port_info + ' on group ' + coalesce(group_id, coalesce(group_name, 'unknown'))\n\n// Create labels map for context\n| LET labels = map(\n 'Severity': if(risk_score >= 90, 'CRITICAL', if(risk_score >= 75, 'HIGH', 'LOW')) AS 
string,\n 'Account': coalesce(account_id, '') AS string,\n 'Security Group': coalesce(group_id, coalesce(group_name, '')) AS string,\n 'User': coalesce(user_identity, '') AS string,\n 'Source IP': coalesce(source_ip, '') AS string,\n 'Event': coalesce(eventName, '') AS string,\n 'Region': coalesce(aws_region, '') AS string,\n 'Protocol': coalesce(ip_protocol, 'unknown') AS string,\n 'Port Range': port_info AS string,\n 'Action': action_type AS string,\n 'Tactic': 'Defense Evasion' AS string,\n 'Technique': 'T1562.007' AS string\n )\n\n// Format the final output\n| SELECT\n timestamp,\n event.start = timestamp,\n event.end = timestamp,\n message,\n labels,\n host.name = coalesce(account_id, ''),\n user.name = coalesce(user_identity, ''),\n source.ip = coalesce(source_ip, ''),\n cloud.region = coalesce(aws_region, ''),\n cloud.service.name = 'ec2',\n network.protocol = coalesce(ip_protocol, ''),\n rule = {\n name: 'AWS EC2 Security Group Permissive Changes' AS string,\n description: 'Detects modifications to EC2 security groups that allow ' +\n 'unrestricted access from the internet (0.0.0.0/0), especially ' +\n 'on sensitive ports commonly targeted by attackers.' AS string,\n mitre_tactics: ['Defense Evasion', 'Initial Access'] AS array(string),\n mitre_techniques: ['T1562.007'] AS array(string),\n severity: if(risk_score >= 90, 'critical', if(risk_score >= 75, 'high', 'low')) AS string,\n references: [\n 'https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/security-group-rules.html',\n 'https://docs.aws.amazon.com/vpc/latest/userguide/VPC_SecurityGroups.html'\n ] AS array(string)\n }\n```\n\n## Query breakdown\n\nThis detection shows several techniques that make Hamelin effective for \ncloud security monitoring. Let's examine each component:\n\n### JSON parsing and field extraction\n\nThe detection starts by parsing CloudTrail JSON events and extracting relevant \nfields. CloudTrail events contain nested JSON structures, so you need to \nextract specific fields systematically:\n\n```hamelin\n| LET evt = parse_json(event.original) AS variant\n| LET eventName = evt.eventName AS string\n| LET source_ip = evt.sourceIPAddress AS string\n| LET user_identity = evt.userIdentity.arn AS string\n| LET group_id = evt.requestParameters.groupId AS string\n```\n\nThe `parse_json()` function converts the JSON string into a variant type. \nYou can then extract specific fields using dot notation. The `coalesce()` \nfunction handles missing fields gracefully.\n\n### Event filtering and validation\n\nThe detection filters for specific CloudTrail events that indicate security \ngroup modifications. You process only successful events to avoid false \npositives from failed API calls:\n\n```hamelin\n| WHERE coalesce(eventName, '') IN [\n 'AuthorizeSecurityGroupIngress',\n 'AuthorizeSecurityGroupEgress',\n 'RevokeSecurityGroupIngress',\n 'RevokeSecurityGroupEgress',\n 'CreateSecurityGroup',\n 'ModifySecurityGroupRules'\n ]\n| WHERE coalesce(error_code, '') == ''\n```\n\nThis filtering ensures you analyze actual security group changes, not \nfailed API calls or unrelated CloudTrail events.\n\n### Permissive rule detection\n\nThe detection identifies overly permissive rules by searching for CIDR blocks \nthat allow traffic from anywhere on the internet:\n\n```hamelin\n| LET has_any_source = contains(event.original, '0.0.0.0/0') OR contains(event.original, '::/0')\n| WHERE has_any_source\n```\n\nThe `contains()` function searches the raw JSON for these permissive CIDR \nblocks. 
This catches rules that expose services to the entire internet.\n\n### Sensitive port identification\n\nThe detection defines a list of commonly targeted ports and checks whether \nthe security group modification affects these sensitive services:\n\n```hamelin\n| LET sensitive_ports = [22, 23, 3389, 1433, 3306, 5432, ...]\n| LET is_sensitive_port = coalesce(from_port, 0) IN sensitive_ports OR \n coalesce(to_port, 0) IN sensitive_ports\n```\n\nThis includes SSH (22), RDP (3389), database ports (1433, 3306, 5432), and \nother services commonly targeted by attackers.\n\n### Risk scoring logic\n\nThe detection calculates risk scores based on the type of modification and \nthe sensitivity of exposed ports:\n\n```hamelin\n| LET risk_score = if(\n contains(coalesce(eventName, ''), 'Authorize') AND is_all_traffic,\n 100, // Critical - allowing all traffic from anywhere\n if(\n contains(coalesce(eventName, ''), 'Authorize') AND is_sensitive_port,\n 90, // Very high - sensitive ports exposed\n if(\n contains(coalesce(eventName, ''), 'Authorize'),\n 75, // High - any port exposed to internet\n 40 // Low - revoking permissive rules (good action)\n )\n )\n )\n```\n\nThis nested `if()` structure prioritizes the most dangerous configurations. \nIt also recognizes when administrators remove permissive rules.\n\n### Structured output generation\n\nThe detection creates both human-readable messages and structured labels for \ndifferent consumers:\n\n```hamelin\n| LET message = 'Security group ' + action_type + ' to internet: ' + \n port_info + ' on group ' + coalesce(group_id, group_name)\n| LET labels = map(\n 'Severity': if(risk_score >= 90, 'CRITICAL', 'HIGH'),\n 'Account': account_id,\n 'Security Group': group_id,\n 'User': user_identity\n )\n```\n\nThis provides context for analysts and maintains machine-readable structure \nfor SIEM integration.\n\n## Advanced techniques demonstrated\n\nThis detection uses several techniques that make Hamelin effective for \ncloud security monitoring:\n\n**JSON parsing and navigation** handles complex nested CloudTrail event \nstructures using `parse_json()` and dot notation for field extraction.\n\n**Flexible field handling** uses `coalesce()` to handle missing or null fields \nin CloudTrail events gracefully.\n\n**String pattern matching** uses `contains()` for efficient searching of \nCIDR blocks within raw JSON content.\n\n**Conditional risk scoring** uses nested `if()` statements to create \nnuanced risk assessments based on multiple factors.\n\n**Cloud-native field mapping** creates ECS-compliant output with cloud-specific \nfields like `cloud.region` and `cloud.service.name`.\n\n**Reference documentation** includes links to official AWS documentation for \nanalyst context and verification.\n\n## Detection tuning\n\nYou can adapt this detection for different cloud environments:\n\n**Expand sensitive ports** by adding environment-specific services to the \nsensitive ports list based on applications in use.\n\n**Adjust risk scores** by modifying the scoring logic based on risk tolerance \nand compliance requirements.\n\n**Add region filtering** by including or excluding specific AWS regions based \non where resources should legitimately exist.\n\n**Customize user filtering** by adding exclusions for automated tools or \nservice accounts that legitimately modify security groups.\n\nThe modular structure with clear variable definitions makes it easy to modify \nindividual components without affecting the overall detection logic.",
24 -
"examples/clickfix-threat-detection.md": '# ClickFix Threat Detection\n\nThis example shows how to build a threat detection query that identifies \n**ClickFix** phishing attacks. ClickFix is a social engineering technique \nwhere attackers convince users to execute malicious commands through the \nWindows Run dialog or browser, often disguised as a way to "fix" a supposed \nerror.\n\n## Attack Overview\n\nClickFix attacks follow a predictable pattern:\n\n1. **Social Engineering** - Users are tricked into thinking they need to "fix" something\n2. **Command Execution** - Users paste malicious commands into Windows Run dialog (Windows + R)\n3. **Payload Download** - PowerShell downloads and executes malicious files\n4. **Command & Control** - Establishes connection to attacker infrastructure\n\nYou can detect these attacks by correlating these behaviors within tight time \nwindows to catch the complete attack chain while minimizing false positives.\n\n## Detection strategy\n\nThis detection identifies three key events within a 5-second window:\n\n- **Registry modifications** to RunMRU (tracks Windows+R usage)\n- **Suspicious process creation** (PowerShell with specific parameters)\n- **Network connections** to external command & control servers\n\nThe tight time correlation and requirement for all three behaviors helps \ndistinguish real attacks from normal administrative activity.\n\n## Complete detection query\n\nHere\'s the full detection query that correlates registry events, process \ncreation, and network connections to identify ClickFix attacks. The query \nuses `WITH` clauses to define each event type, then applies time-based \nwindowing and pattern matching to find the complete attack sequence:\n\n```hamelin\n// Detection Name: Detection - ClickFix Initial Access\n//\n// Description:\n// This detection identifies potential ClickFix phishing activity,\n// a technique where users are socially engineered to execute\n// malicious commands either through the Windows Run dialog\n// (Windows + R) or directly from their browser. The attack\n// typically involves convincing users to paste commands that lead\n// to execution of PowerShell or other scripting tools, often\n// disguised as a way to "fix" a supposed error.\n//\n// Detection Logic\n// The detection correlates three key events within a 5-second window:\n// 1. Registry modifications to RunMRU (tracks Windows+R usage) or\n// process creation from common browsers\n// 2. Creation of suspicious processes (PowerShell, CMD, MSHTA) with\n// suspicious command-line parameters\n// 3. 
Network connections to external IPs or domains\n//\n// The tight 5-second window and requirement for all three behaviors\n// help reduce false positives while catching the typical ClickFix\n// attack chain.\n\n// Identify Windows Run Dialog useage\nWITH registry_events = FROM simba.sysmon_events\n| WHERE event.code == "13"\n| LET reg_key = coalesce(winlog.parsed_event_data.target_object AS string, \'\')\n| LET reg_value = coalesce(winlog.parsed_event_data.details AS string, \'\')\n| LET image = lower(coalesce(winlog.parsed_event_data.image AS string, \'\'))\n| LET host = host.name\n| WHERE regexp_like(lower(reg_key), \'(?i).*\\\\currentversion\\\\explorer\\\\runmru.*\')\n AND regexp_like(image, \'(?i).*explorer\\.exe.*\')\n\n// Get suspicious process creations\nWITH process_events = FROM simba.sysmon_events\n| WHERE event.code == "1"\n| LET cmd_line = coalesce(winlog.parsed_event_data.command_line AS string, \'\')\n| LET image = coalesce(winlog.parsed_event_data.image AS string, \'\')\n| LET parent_image = coalesce(winlog.parsed_event_data.parent_image AS string, \'\')\n| LET host = host.name\n| LET user_name = coalesce(winlog.user.name AS string, \'\')\n| WHERE regexp_like(lower(cmd_line), \'(?i).*(iwr|invoke-webrequest|wget).*\\$env:temp.*\\.exe.*\')\n AND regexp_like(lower(cmd_line), \'(?i).*-w\\s+hidden.*\')\n AND regexp_like(lower(parent_image), \'(?i).*explorer\\.exe.*\')\n\n// Get suspicious network connections\nWITH network_events = FROM simba.sysmon_events\n| WHERE event.code == "3"\n| LET dest_ip = coalesce(winlog.parsed_event_data.destination_ip AS string, \'\')\n| LET dest_port = coalesce(winlog.parsed_event_data.destination_port AS string, \'\')\n| LET image = lower(coalesce(winlog.parsed_event_data.image AS string, \'\'))\n| LET host = host.name\n| WHERE\n dest_ip == "172.31.7.63" AND\n dest_port == "4444" AND\n regexp_like(image, \'(?i).*\\\\temp\\\\.*\\.exe\')\n\n// Main Detection Logic:\n//\n// Above we detect three distinct events happening in a Clickfix scenario:\n// - A suspicious process is launched\n// - A registry event signaling a Windows Run Dialog usage \n// - Outbound network connections to a known C2 server\n//\n// Then, we detect a process event and a registry event happening together.\n// Technically, the process event precedes the registry event.\n// But in Windows, the timeestamps for the events are not very precise\n// Therefore, we need look for those two events happening NOT one\n// after the other, but in a small time window!\nWITH process_near_registry = FROM registry_event = registry_events,\n process_event = process_events\n| WINDOW process_event = last_value(process_event),\n registry_event = last_value(registry_event)\n BY host WITHIN 1s\n| WHERE process_event IS NOT NULL AND registry_event IS NOT NULL\n\n// Finally, we are looking for a pattern where a combination of the above\n// mentioned process & registry events are on the same host followed by\n// outbound network connections to the C2 server.\nMATCH process_near_registry network_event=network_events+\nWHEN network_event = network_event.timestamp - process_near_registry.timestamp <= 5s\nBY host\n| AGG timestamp = min(timestamp),\n event.start = min(timestamp),\n event.end = max(timestamp),\n process_event = array_agg(matched.process_event)[0],\n registry_event = array_agg(matched.registry_event)[0],\n network_events = filter_null(array_agg(network_event)),\n BY\n match_number, host\n\n// Now we know it\'s Clickfix, so let\'s create a nice Signal for analysts to\n// follow up on.\n//\n// Create 
simple message and structured labels:\n| LET message = "ClickFix C2 Activity Detection - Initial Access and Command & Control"\n| LET labels = map(\n "Severity": "CRITICAL",\n "Host": host,\n "User": coalesce(process_event.user_name, "unknown"),\n "Run Dialog Activity": registry_event.reg_value,\n "Registry Key": registry_event.reg_key,\n "PowerShell Command": process_event.cmd_line,\n "Process Path": process_event.image,\n "Parent Process": process_event.parent_image,\n "Destination IP": network_events.dest_ip[0],\n "Destination Port": network_events.dest_port[0],\n "Beacon Count": len(array_distinct(network_events.dest_ip)) AS string)\n\n// Detection Output\n| SELECT\n timestamp,\n event.start = event.start,\n event.end = event.end,\n process_event,\n registry_event,\n network_events,\n host.name = host,\n process.name = process_event.image,\n process.command_line = process_event.cmd_line,\n process.parent.name = process_event.parent_image,\n registry.key = registry_event.reg_key,\n registry.value = registry_event.reg_value,\n destination.ip = network_events.dest_ip[0],\n destination.port = network_events.dest_port[0],\n user.name = process_event.user_name,\n beacons = array_distinct(network_events.dest_ip),\n event.risk_score = 100,\n message,\n labels,\n rule = {\n name: "Windows ClickFix Initial Access",\n description: message,\n mitre_tactics: [\n "Command and Control",\n "Execution",\n "Persistence"\n ],\n mitre_techniques: [\n "T1571", // Non-Standard Port\n "T1105", // Ingress Tool Transfer\n "T1572", // Protocol Tunneling\n "T1059.001" // PowerShell\n ],\n severity: "critical"\n }\n```\n\n## Query breakdown\n\nThis detection demonstrates several advanced Hamelin techniques working \ntogether. Let\'s examine each component:\n\n### Event collection phase\n\nThe query starts by defining three separate event types using `WITH` clauses. \nEach clause filters and transforms raw Sysmon events to extract the specific \nindicators needed for correlation:\n\n**Registry events** track Windows Run dialog usage by monitoring registry \nmodifications to the RunMRU key:\n```hamelin\nWITH registry_events = FROM simba.sysmon_events\n| WHERE event.code == "13" // Registry modification events\n| LET reg_key = coalesce(winlog.parsed_event_data.target_object AS string, \'\')\n| WHERE regexp_like(lower(reg_key), \'(?i).*\\\\currentversion\\\\explorer\\\\runmru.*\')\n AND regexp_like(image, \'(?i).*explorer\\.exe.*\')\n```\n\n**Process events** identify suspicious PowerShell execution with specific \ncommand-line patterns that indicate ClickFix behavior:\n```hamelin\nWITH process_events = FROM simba.sysmon_events\n| WHERE event.code == "1" // Process creation events\n| WHERE regexp_like(lower(cmd_line), \'(?i).*(iwr|invoke-webrequest|wget).*\\$env:temp.*\\.exe.*\')\n AND regexp_like(lower(cmd_line), \'(?i).*-w\\s+hidden.*\')\n```\n\n**Network events** detect connections to known command & control \ninfrastructure by filtering for specific IP addresses and ports:\n```hamelin\nWITH network_events = FROM simba.sysmon_events\n| WHERE event.code == "3" // Network connection events\n| WHERE dest_ip == "172.31.7.63" AND dest_port == "4444"\n```\n\n### Time-based correlation\n\nThe detection uses windowing to correlate process and registry events that \noccur close in time. 
Windows event timestamps aren\'t perfectly precise, so \nyou need to look for events happening within a small time window rather than \nin strict sequence:\n\n```hamelin\nWITH process_near_registry = FROM registry_event = registry_events,\n process_event = process_events\n| WINDOW process_event = last_value(process_event),\n registry_event = last_value(registry_event)\n BY host WITHIN 1s\n```\n\nThis creates pairs of process and registry events that happen within 1 second \nof each other on the same host.\n\n### Pattern matching\n\nThe `MATCH` command identifies the complete attack sequence by finding \ncorrelated events followed by network connections:\n\n```hamelin\nMATCH process_near_registry network_event=network_events+\nWHEN network_event = network_event.timestamp - process_near_registry.timestamp <= 5s\nBY host\n```\n\nThis finds patterns where process and registry events occur together, followed \nby one or more network connections within 5 seconds, all happening on the \nsame host.\n\n### Structured output\n\nThe final section creates analyst-friendly output that includes event metadata \nlike start/end times and risk scores, process details such as command lines \nand parent processes, network indicators including destination IPs and ports, \nMITRE mapping for threat intelligence, and structured labels as key-value \npairs for SIEM integration.\n\n## Advanced techniques demonstrated\n\nThis detection showcases several techniques that make Hamelin effective for \nthreat detection:\n\n**Multi-source correlation** combines registry, process, and network events \nfrom the same data source but different event types.\n\n**Time-based windowing** uses `WINDOW` operations to group events that occur \nclose in time, accounting for timestamp precision limitations.\n\n**Pattern matching** employs `MATCH` to identify ordered sequences of events \nthat follow the attack pattern.\n\n**Flexible field extraction** uses `coalesce()` to handle varying field names \nand missing data gracefully.\n\n**Regular expressions** apply `regexp_like()` for flexible pattern matching \nin command lines and file paths.\n\n**Structured output** creates both human-readable messages and \nmachine-readable labels for different consumers.\n\n## Detection tuning\n\nYou can adapt this detection for different environments:\n\n**Adjust time windows** by modifying the 1-second and 5-second windows based \non event timing characteristics in the environment.\n\n**Update IOCs** by replacing the hardcoded IP address and port with threat \nintelligence feeds.\n\n**Refine regex patterns** by customizing the command line patterns based on \nobserved ClickFix variants.\n\n**Modify thresholds** by adjusting beacon counts or adding minimum connection \nduration requirements.\n\nThe modular structure with `WITH` clauses makes it easy to test and tune \nindividual components without affecting the entire detection logic.',
24 +
"examples/clickfix-threat-detection.md": '# ClickFix Threat Detection\n\nThis example shows how to build a threat detection query that identifies \n**ClickFix** phishing attacks. ClickFix is a social engineering technique \nwhere attackers convince users to execute malicious commands through the \nWindows Run dialog or browser, often disguised as a way to "fix" a supposed \nerror.\n\n## Attack Overview\n\nClickFix attacks follow a predictable pattern:\n\n1. **Social Engineering** - Users are tricked into thinking they need to "fix" something\n2. **Command Execution** - Users paste malicious commands into Windows Run dialog (Windows + R)\n3. **Payload Download** - PowerShell downloads and executes malicious files\n4. **Command & Control** - Establishes connection to attacker infrastructure\n\nYou can detect these attacks by correlating these behaviors within tight time \nwindows to catch the complete attack chain while minimizing false positives.\n\n## Detection strategy\n\nThis detection identifies three key events within a 5-second window:\n\n- **Registry modifications** to RunMRU (tracks Windows+R usage)\n- **Suspicious process creation** (PowerShell with specific parameters)\n- **Network connections** to external command & control servers\n\nThe tight time correlation and requirement for all three behaviors helps \ndistinguish real attacks from normal administrative activity.\n\n## Complete detection query\n\nHere\'s the full detection query that correlates registry events, process \ncreation, and network connections to identify ClickFix attacks. The query \nuses `WITH` clauses to define each event type, then applies time-based \nwindowing and pattern matching to find the complete attack sequence:\n\n```hamelin\n// Detection Name: Detection - ClickFix Initial Access\n//\n// Description:\n// This detection identifies potential ClickFix phishing activity,\n// a technique where users are socially engineered to execute\n// malicious commands either through the Windows Run dialog\n// (Windows + R) or directly from their browser. The attack\n// typically involves convincing users to paste commands that lead\n// to execution of PowerShell or other scripting tools, often\n// disguised as a way to "fix" a supposed error.\n//\n// Detection Logic\n// The detection correlates three key events within a 5-second window:\n// 1. Registry modifications to RunMRU (tracks Windows+R usage) or\n// process creation from common browsers\n// 2. Creation of suspicious processes (PowerShell, CMD, MSHTA) with\n// suspicious command-line parameters\n// 3. 
Network connections to external IPs or domains\n//\n// The tight 5-second window and requirement for all three behaviors\n// help reduce false positives while catching the typical ClickFix\n// attack chain.\n\n// Identify Windows Run Dialog useage\nWITH registry_events = FROM simba.sysmon_events\n| WHERE event.code == "13"\n| LET reg_key = coalesce(winlog.parsed_event_data.target_object AS string, \'\')\n| LET reg_value = coalesce(winlog.parsed_event_data.details AS string, \'\')\n| LET image = lower(coalesce(winlog.parsed_event_data.image AS string, \'\'))\n| LET host = host.name\n| WHERE regexp_like(lower(reg_key), \'(?i).*\\\\currentversion\\\\explorer\\\\runmru.*\')\n AND regexp_like(image, \'(?i).*explorer\\.exe.*\')\n\n// Get suspicious process creations\nWITH process_events = FROM simba.sysmon_events\n| WHERE event.code == "1"\n| LET cmd_line = coalesce(winlog.parsed_event_data.command_line AS string, \'\')\n| LET image = coalesce(winlog.parsed_event_data.image AS string, \'\')\n| LET parent_image = coalesce(winlog.parsed_event_data.parent_image AS string, \'\')\n| LET host = host.name\n| LET user_name = coalesce(winlog.user.name AS string, \'\')\n| WHERE regexp_like(lower(cmd_line), \'(?i).*(iwr|invoke-webrequest|wget).*\\$env:temp.*\\.exe.*\')\n AND regexp_like(lower(cmd_line), \'(?i).*-w\\s+hidden.*\')\n AND regexp_like(lower(parent_image), \'(?i).*explorer\\.exe.*\')\n\n// Get suspicious network connections\nWITH network_events = FROM simba.sysmon_events\n| WHERE event.code == "3"\n| LET dest_ip = coalesce(winlog.parsed_event_data.destination_ip AS string, \'\')\n| LET dest_port = coalesce(winlog.parsed_event_data.destination_port AS string, \'\')\n| LET image = lower(coalesce(winlog.parsed_event_data.image AS string, \'\'))\n| LET host = host.name\n| WHERE\n dest_ip == "172.31.7.63" AND\n dest_port == "4444" AND\n regexp_like(image, \'(?i).*\\\\temp\\\\.*\\.exe\')\n\n// Main Detection Logic:\n//\n// Above we detect three distinct events happening in a Clickfix scenario:\n// - A suspicious process is launched\n// - A registry event signaling a Windows Run Dialog usage \n// - Outbound network connections to a known C2 server\n//\n// Then, we detect a process event and a registry event happening together.\n// Technically, the process event precedes the registry event.\n// But in Windows, the timeestamps for the events are not very precise\n// Therefore, we need look for those two events happening NOT one\n// after the other, but in a small time window!\nWITH process_near_registry = FROM registry_event = registry_events,\n process_event = process_events\n| WINDOW process_event = last_value(process_event),\n registry_event = last_value(registry_event)\n BY host WITHIN 1s\n| WHERE process_event IS NOT NULL AND registry_event IS NOT NULL\n\n// Finally, we are looking for a pattern where a combination of the above\n// mentioned process & registry events are on the same host followed by\n// outbound network connections to the C2 server.\nMATCH process_near_registry network_event=network_events+ BY host WITHIN 5s\n| AGG timestamp = min(timestamp),\n event.start = min(timestamp),\n event.end = max(timestamp),\n process_event = array_agg(matched.process_event)[0],\n registry_event = array_agg(matched.registry_event)[0],\n network_events = filter_null(array_agg(network_event)),\n BY\n match_number, host\n\n// Now we know it\'s Clickfix, so let\'s create a nice Signal for analysts to\n// follow up on.\n//\n// Create simple message and structured labels:\n| LET message = "ClickFix C2 Activity 
Detection - Initial Access and Command & Control"\n| LET labels = map(\n "Severity": "CRITICAL",\n "Host": host,\n "User": coalesce(process_event.user_name, "unknown"),\n "Run Dialog Activity": registry_event.reg_value,\n "Registry Key": registry_event.reg_key,\n "PowerShell Command": process_event.cmd_line,\n "Process Path": process_event.image,\n "Parent Process": process_event.parent_image,\n "Destination IP": network_events.dest_ip[0],\n "Destination Port": network_events.dest_port[0],\n "Beacon Count": len(array_distinct(network_events.dest_ip)) AS string)\n\n// Detection Output\n| SELECT\n timestamp,\n event.start = event.start,\n event.end = event.end,\n process_event,\n registry_event,\n network_events,\n host.name = host,\n process.name = process_event.image,\n process.command_line = process_event.cmd_line,\n process.parent.name = process_event.parent_image,\n registry.key = registry_event.reg_key,\n registry.value = registry_event.reg_value,\n destination.ip = network_events.dest_ip[0],\n destination.port = network_events.dest_port[0],\n user.name = process_event.user_name,\n beacons = array_distinct(network_events.dest_ip),\n event.risk_score = 100,\n message,\n labels,\n rule = {\n name: "Windows ClickFix Initial Access",\n description: message,\n mitre_tactics: [\n "Command and Control",\n "Execution",\n "Persistence"\n ],\n mitre_techniques: [\n "T1571", // Non-Standard Port\n "T1105", // Ingress Tool Transfer\n "T1572", // Protocol Tunneling\n "T1059.001" // PowerShell\n ],\n severity: "critical"\n }\n```\n\n## Query breakdown\n\nThis detection demonstrates several advanced Hamelin techniques working \ntogether. Let\'s examine each component:\n\n### Event collection phase\n\nThe query starts by defining three separate event types using `WITH` clauses. \nEach clause filters and transforms raw Sysmon events to extract the specific \nindicators needed for correlation:\n\n**Registry events** track Windows Run dialog usage by monitoring registry \nmodifications to the RunMRU key:\n```hamelin\nWITH registry_events = FROM simba.sysmon_events\n| WHERE event.code == "13" // Registry modification events\n| LET reg_key = coalesce(winlog.parsed_event_data.target_object AS string, \'\')\n| WHERE regexp_like(lower(reg_key), \'(?i).*\\\\currentversion\\\\explorer\\\\runmru.*\')\n AND regexp_like(image, \'(?i).*explorer\\.exe.*\')\n```\n\n**Process events** identify suspicious PowerShell execution with specific \ncommand-line patterns that indicate ClickFix behavior:\n```hamelin\nWITH process_events = FROM simba.sysmon_events\n| WHERE event.code == "1" // Process creation events\n| WHERE regexp_like(lower(cmd_line), \'(?i).*(iwr|invoke-webrequest|wget).*\\$env:temp.*\\.exe.*\')\n AND regexp_like(lower(cmd_line), \'(?i).*-w\\s+hidden.*\')\n```\n\n**Network events** detect connections to known command & control \ninfrastructure by filtering for specific IP addresses and ports:\n```hamelin\nWITH network_events = FROM simba.sysmon_events\n| WHERE event.code == "3" // Network connection events\n| WHERE dest_ip == "172.31.7.63" AND dest_port == "4444"\n```\n\n### Time-based correlation\n\nThe detection uses windowing to correlate process and registry events that \noccur close in time. 
Windows event timestamps aren\'t perfectly precise, so \nyou need to look for events happening within a small time window rather than \nin strict sequence:\n\n```hamelin\nWITH process_near_registry = FROM registry_event = registry_events,\n process_event = process_events\n| WINDOW process_event = last_value(process_event),\n registry_event = last_value(registry_event)\n BY host WITHIN 1s\n```\n\nThis creates pairs of process and registry events that happen within 1 second \nof each other on the same host.\n\n### Pattern matching\n\nThe `MATCH` command identifies the complete attack sequence by finding \ncorrelated events followed by network connections:\n\n```hamelin\nMATCH process_near_registry network_event=network_events+ BY host WITHIN 5s\n```\n\nThis finds patterns where process and registry events occur together, followed \nby one or more network connections, with the entire pattern completing within \n5 seconds, all happening on the same host.\n\n### Structured output\n\nThe final section creates analyst-friendly output that includes event metadata \nlike start/end times and risk scores, process details such as command lines \nand parent processes, network indicators including destination IPs and ports, \nMITRE mapping for threat intelligence, and structured labels as key-value \npairs for SIEM integration.\n\n## Advanced techniques demonstrated\n\nThis detection showcases several techniques that make Hamelin effective for \nthreat detection:\n\n**Multi-source correlation** combines registry, process, and network events \nfrom the same data source but different event types.\n\n**Time-based windowing** uses `WINDOW` operations to group events that occur \nclose in time, accounting for timestamp precision limitations.\n\n**Pattern matching** employs `MATCH` to identify ordered sequences of events \nthat follow the attack pattern.\n\n**Flexible field extraction** uses `coalesce()` to handle varying field names \nand missing data gracefully.\n\n**Regular expressions** apply `regexp_like()` for flexible pattern matching \nin command lines and file paths.\n\n**Structured output** creates both human-readable messages and \nmachine-readable labels for different consumers.\n\n## Detection tuning\n\nYou can adapt this detection for different environments:\n\n**Adjust time windows** by modifying the 1-second and 5-second windows based \non event timing characteristics in the environment.\n\n**Update IOCs** by replacing the hardcoded IP address and port with threat \nintelligence feeds.\n\n**Refine regex patterns** by customizing the command line patterns based on \nobserved ClickFix variants.\n\n**Modify thresholds** by adjusting beacon counts or adding minimum connection \nduration requirements.\n\nThe modular structure with `WITH` clauses makes it easy to test and tune \nindividual components without affecting the entire detection logic.',
25
"examples/registry-persistence-detection.md": `# Registry Persistence Detection
26
27
This example shows how to build a detection for Windows registry persistence
@@ -491,11 +491,11 @@ scoring weights without affecting the overall detection logic.`,
491
"function-reference/string-functions.md": "# String Functions\n\nScalar functions for string processing and manipulation that can be used in any expression context.\n\n## `replace(string, pattern)`\n\nReplaces all occurrences of a pattern in a string.\n\n### Parameters\n\n- **string** - String expression to search within\n- **pattern** - String expression representing the text to replace\n\n### Description\n\nThe `replace()` function removes all occurrences of the specified pattern from\nthe input string. This function performs literal string replacement, not\npattern matching. If the pattern is not found, the original string is returned\nunchanged.\n\n## `starts_with(string, prefix)`\n\nTests whether a string starts with a specified prefix.\n\n### Parameters\n\n- **string** - String expression to test\n- **prefix** - String expression representing the prefix to check for\n\n### Description\n\nThe `starts_with()` function returns true if the input string begins with the\nspecified prefix, false otherwise. The comparison is case-sensitive. An empty\nprefix will always return true for any string.\n\n## `ends_with(string, suffix)`\n\nTests whether a string ends with a specified suffix.\n\n### Parameters\n\n- **string** - String expression to test\n- **suffix** - String expression representing the suffix to check for\n\n### Description\n\nThe `ends_with()` function returns true if the input string ends with the\nspecified suffix, false otherwise. The comparison is case-sensitive. An empty\nsuffix will always return true for any string.\n\n## `contains(string, substring)`\n\nTests whether a string contains a specified substring.\n\n### Parameters\n\n- **string** - String expression to search within\n- **substring** - String expression representing the text to search for\n\n### Description\n\nThe `contains()` function returns true if the input string contains the\nspecified substring anywhere within it, false otherwise. The comparison is\ncase-sensitive. An empty substring will always return true for any string.\n\n## `lower(string)`\n\nConverts a string to lowercase.\n\n### Parameters\n\n- **string** - String expression to convert\n\n### Description\n\nThe `lower()` function converts all uppercase characters in the input string\nto their lowercase equivalents. Characters that are already lowercase or\nnon-alphabetic characters remain unchanged.\n\n## `upper(string)`\n\nConverts a string to uppercase.\n\n### Parameters\n\n- **string** - String expression to convert\n\n### Description\n\nThe `upper()` function converts all lowercase characters in the input string\nto their uppercase equivalents. Characters that are already uppercase or\nnon-alphabetic characters remain unchanged.\n\n## `len(string)`\n\nReturns the length of a string in characters.\n\n### Parameters\n\n- **string** - String expression to measure\n\n### Description\n\nThe `len()` function returns the number of characters in the input string.\nThis counts Unicode characters, not bytes, so multi-byte characters are\ncounted as single characters. An empty string returns 0.",
492
"function-reference/time-date-functions.md": '# Time & Date Functions\n\nScalar functions for temporal data processing and manipulation that can be used in any expression context.\n\n## `now()`\n\nReturns the current timestamp.\n\n### Parameters\n\nThis function takes no parameters.\n\n### Description\n\nThe `now()` function returns the current date and time as a timestamp. The\nexact timestamp represents the moment when the function is evaluated during\nquery execution. All calls to `now()` within the same query execution return\nthe same timestamp value.\n\n## `today()`\n\nReturns today\'s date at midnight.\n\n### Parameters\n\nThis function takes no parameters.\n\n### Description\n\nThe `today()` function returns the current date with the time portion set to\nmidnight (00:00:00). This is equivalent to truncating `now()` to the day\nboundary. The result represents the start of the current day.\n\n## `yesterday()`\n\nReturns yesterday\'s date at midnight.\n\n### Parameters\n\nThis function takes no parameters.\n\n### Description\n\nThe `yesterday()` function returns yesterday\'s date with the time portion set\nto midnight (00:00:00). This is equivalent to subtracting one day from `today()`.\nThe result represents the start of the previous day.\n\n## `tomorrow()`\n\nReturns tomorrow\'s date at midnight.\n\n### Parameters\n\nThis function takes no parameters.\n\n### Description\n\nThe `tomorrow()` function returns tomorrow\'s date with the time portion set to\nmidnight (00:00:00). This is equivalent to adding one day to `today()`. The\nresult represents the start of the next day.\n\n## `ts(timestamp)`\n\nConverts a string to a timestamp.\n\n### Parameters\n\n- **timestamp** - String expression representing a timestamp\n\n### Description\n\nThe `ts()` function parses a string representation of a timestamp and converts\nit to a timestamp type. The function accepts various timestamp formats including\nISO 8601 format. If the string cannot be parsed as a valid timestamp, an error\nis raised.\n\n## `year(timestamp)`\n\nExtracts the year from a timestamp.\n\n### Parameters\n\n- **timestamp** - Timestamp expression\n\n### Description\n\nThe `year()` function extracts the year component from a timestamp and returns\nit as an integer. For example, a timestamp of "2023-07-15 14:30:00" would\nreturn 2023.\n\n## `month(timestamp)`\n\nExtracts the month from a timestamp.\n\n### Parameters\n\n- **timestamp** - Timestamp expression\n\n### Description\n\nThe `month()` function extracts the month component from a timestamp and returns\nit as an integer from 1 to 12, where 1 represents January and 12 represents\nDecember. For example, a timestamp of "2023-07-15 14:30:00" would return 7.\n\n## `day(timestamp)`\n\nExtracts the day of the month from a timestamp.\n\n### Parameters\n\n- **timestamp** - Timestamp expression\n\n### Description\n\nThe `day()` function extracts the day component from a timestamp and returns\nit as an integer from 1 to 31, depending on the month. 
For example, a timestamp\nof "2023-07-15 14:30:00" would return 15.\n\n## `day_of_week(timestamp)`\n\nExtracts the day of the week from a timestamp.\n\n### Parameters\n\n- **timestamp** - Timestamp expression\n\n### Description\n\nThe `day_of_week()` function extracts the ISO day of the week from a timestamp \nand returns it as an integer from 1 (Monday) to 7 (Sunday).\n\n## `hour(timestamp)`\n\nExtracts the hour from a timestamp.\n\n### Parameters\n\n- **timestamp** - Timestamp expression\n\n### Description\n\nThe `hour()` function extracts the hour component from a timestamp and returns\nit as an integer from 0 to 23, using 24-hour format. For example, a timestamp\nof "2023-07-15 14:30:00" would return 14.\n\n## `minute(timestamp)`\n\nExtracts the minute from a timestamp.\n\n### Parameters\n\n- **timestamp** - Timestamp expression\n\n### Description\n\nThe `minute()` function extracts the minute component from a timestamp and\nreturns it as an integer from 0 to 59. For example, a timestamp of\n"2023-07-15 14:30:00" would return 30.\n\n## `second(timestamp)`\n\nExtracts the second from a timestamp.\n\n### Parameters\n\n- **timestamp** - Timestamp expression\n\n### Description\n\nThe `second()` function extracts the second component from a timestamp and\nreturns it as an integer from 0 to 59. For example, a timestamp of\n"2023-07-15 14:30:45" would return 45.\n\n## `at_timezone(timestamp, timezone)`\n\nConverts a timestamp to a different timezone.\n\n### Parameters\n\n- **timestamp** - Timestamp expression to convert\n- **timezone** - String expression representing the target timezone\n\n### Description\n\nThe `at_timezone()` function converts a timestamp from its current timezone\nto the specified target timezone. The timezone parameter should be a valid\ntimezone identifier such as "UTC", "America/New_York", or "Europe/London".\nThe function returns a new timestamp representing the same moment in time\nbut expressed in the target timezone.\n\n## `to_millis(interval)`\n\nConverts an interval to milliseconds.\n\n### Parameters\n\n- **interval** - Interval expression to convert\n\n### Description\n\nThe `to_millis()` function converts an interval (duration) to its equivalent\nvalue in milliseconds as an integer. This is useful for calculations that\nrequire numeric representations of time durations. For example, an interval\nof "5 minutes" would return 300000 milliseconds.\n\n## `to_nanos(interval)`\n\nConverts an interval to nanoseconds.\n\n### Parameters\n\n- **interval** - Interval expression to convert\n\n### Description\n\nThe `to_nanos()` function converts an interval (duration) to its equivalent\nvalue in nanoseconds as an integer. This provides the highest precision for\ntime duration calculations. The function multiplies the millisecond value\nby 1,000,000 to get nanoseconds. For example, an interval of "1 second"\nwould return 1,000,000,000 nanoseconds.\n\n## `from_millis(millis)`\n\nCreates an interval from milliseconds.\n\n### Parameters\n\n- **millis** - Integer expression representing milliseconds\n\n### Description\n\nThe `from_millis()` function creates an interval from a millisecond value.\nThis is the inverse of `to_millis()`, allowing you to convert numeric\nmillisecond values back into interval types that can be used with timestamp\narithmetic. 
For example, `from_millis(5000)` creates an interval of 5 seconds.\n\n## `from_nanos(nanos)`\n\nCreates an interval from nanoseconds.\n\n### Parameters\n\n- **nanos** - Integer expression representing nanoseconds\n\n### Description\n\nThe `from_nanos()` function creates an interval from a nanosecond value.\nThis is the inverse of `to_nanos()`, converting numeric nanosecond values\ninto interval types. The function divides the nanosecond value by 1,000,000,000\nto convert to seconds. For example, `from_nanos(1500000000)` creates an\ninterval of 1.5 seconds.\n\n## `from_unixtime_seconds(seconds)`\n\nCreates a timestamp from Unix seconds.\n\n### Parameters\n\n- **seconds** - Integer expression representing seconds since Unix epoch\n\n### Description\n\nThe `from_unixtime_seconds()` function converts a Unix timestamp (seconds\nsince January 1, 1970 UTC) into a timestamp type. This is commonly used\nwhen working with systems that store time as Unix timestamps. For example,\n`from_unixtime_seconds(1625097600)` returns the timestamp "2021-07-01 00:00:00".\n\n## `from_unixtime_millis(millis)`\n\nCreates a timestamp from Unix milliseconds.\n\n### Parameters\n\n- **millis** - Integer expression representing milliseconds since Unix epoch\n\n### Description\n\nThe `from_unixtime_millis()` function converts Unix time in milliseconds\nto a timestamp. Many systems and APIs return timestamps as milliseconds\nsince the Unix epoch. This function handles the conversion by multiplying\nthe input by 1,000,000 to convert to nanoseconds internally. For example,\n`from_unixtime_millis(1625097600000)` returns "2021-07-01 00:00:00".\n\n## `from_unixtime_micros(micros)`\n\nCreates a timestamp from Unix microseconds.\n\n### Parameters\n\n- **micros** - Integer expression representing microseconds since Unix epoch\n\n### Description\n\nThe `from_unixtime_micros()` function converts Unix time in microseconds\nto a timestamp. This provides microsecond precision for systems that require\nit. The function multiplies the input by 1,000 to convert to nanoseconds\ninternally. For example, `from_unixtime_micros(1625097600000000)` returns\n"2021-07-01 00:00:00".\n\n## `from_unixtime_nanos(nanos)`\n\nCreates a timestamp from Unix nanoseconds.\n\n### Parameters\n\n- **nanos** - Integer expression representing nanoseconds since Unix epoch\n\n### Description\n\nThe `from_unixtime_nanos()` function converts Unix time in nanoseconds\ndirectly to a timestamp. This provides the highest precision for timestamp\nconversion and is useful when working with high-frequency data or systems\nthat track time at nanosecond granularity. For example,\n`from_unixtime_nanos(1625097600000000000)` returns "2021-07-01 00:00:00".\n\n## `to_unixtime(timestamp)`\n\nConverts a timestamp to Unix seconds.\n\n### Parameters\n\n- **timestamp** - Timestamp expression to convert\n\n### Description\n\nThe `to_unixtime()` function converts a timestamp to Unix time, returning\nthe number of seconds since January 1, 1970 UTC as a double-precision\nfloating-point number. The fractional part represents sub-second precision.\nThis is useful for interoperability with systems that expect Unix timestamps.\nFor example, the timestamp "2021-07-01 00:00:00" returns 1625097600.0.',
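As an illustration of how these functions combine, here is a minimal sketch using only the functions documented above; the `events` dataset and its `created_at` field (Unix milliseconds) are hypothetical:

```hamelin
FROM events
| LET event_time = from_unixtime_millis(created_at)             // epoch millis to timestamp
| WHERE event_time > yesterday()                                 // start of the previous day
| LET event_hour = hour(event_time)                              // integer 0-23
| LET local_time = at_timezone(event_time, 'America/New_York')   // same instant, target timezone
| SELECT event_time, event_hour, local_time
```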
|
|
493
493
|
"function-reference/window-functions.md": "# Window Functions\n\nFunctions for analytical operations over data windows that must be used with the `WINDOW` command.\n\n## `row_number()`\n\nReturns a sequential row number for each row within a window partition.\n\n### Parameters\n\nThis function takes no parameters.\n\n### Description\n\nThe `row_number()` function assigns a unique sequential integer to each row\nwithin its window partition, starting from 1. The ordering is determined by\nthe `SORT` clause in the `WINDOW` command. Rows with identical sort values\nreceive different row numbers in an arbitrary but consistent order.\n\n## `rank()`\n\nReturns the rank of each row within a window partition with gaps.\n\n### Parameters\n\nThis function takes no parameters.\n\n### Description\n\nThe `rank()` function assigns a rank to each row within its window partition\nbased on the `SORT` clause ordering. Rows with identical sort values receive\nthe same rank, and subsequent ranks are skipped. For example, if two rows tie\nfor rank 2, the next row receives rank 4 (not rank 3).\n\n## `dense_rank()`\n\nReturns the rank of each row within a window partition without gaps.\n\n### Parameters\n\nThis function takes no parameters.\n\n### Description\n\nThe `dense_rank()` function assigns a rank to each row within its window\npartition based on the `SORT` clause ordering. Rows with identical sort values\nreceive the same rank, but subsequent ranks are not skipped. For example, if\ntwo rows tie for rank 2, the next row receives rank 3.\n\n## `lag(expression, offset, ignore_nulls)`\n\nReturns the value of an expression from a previous row within the window.\n\n### Parameters\n\n- **expression** - Expression to evaluate from the previous row\n- **offset** - Integer specifying how many rows back to look\n- **ignore_nulls** - Boolean indicating whether to skip null values (default: true)\n\n### Description\n\nThe `lag()` function retrieves the value of the specified expression from a\nrow that is `offset` positions before the current row within the window\npartition. When `ignore_nulls` is true, null values are skipped when counting\nthe offset. If there is no row at the specified offset, the function returns null.\n\n## `lead(expression, offset, ignore_nulls)`\n\nReturns the value of an expression from a subsequent row within the window.\n\n### Parameters\n\n- **expression** - Expression to evaluate from the subsequent row\n- **offset** - Integer specifying how many rows ahead to look\n- **ignore_nulls** - Boolean indicating whether to skip null values (default: true)\n\n### Description\n\nThe `lead()` function retrieves the value of the specified expression from a\nrow that is `offset` positions after the current row within the window\npartition. When `ignore_nulls` is true, null values are skipped when counting\nthe offset. If there is no row at the specified offset, the function returns null.\n\n## `first_value(expression, ignore_nulls)`\n\nReturns the first value of an expression within the window frame.\n\n### Parameters\n\n- **expression** - Expression to evaluate\n- **ignore_nulls** - Boolean indicating whether to skip null values (default: true)\n\n### Description\n\nThe `first_value()` function returns the value of the specified expression from\nthe first row in the current window frame. When `ignore_nulls` is true, it\nreturns the first non-null value. 
The window frame is determined by the\n`WITHIN` clause in the `WINDOW` command.\n\n## `last_value(expression, ignore_nulls)`\n\nReturns the last value of an expression within the window frame.\n\n### Parameters\n\n- **expression** - Expression to evaluate\n- **ignore_nulls** - Boolean indicating whether to skip null values (default: true)\n\n### Description\n\nThe `last_value()` function returns the value of the specified expression from\nthe last row in the current window frame. When `ignore_nulls` is true, it\nreturns the last non-null value. The window frame is determined by the\n`WITHIN` clause in the `WINDOW` command.\n\n## `nth_value(expression, n, ignore_nulls)`\n\nReturns the nth value of an expression within the window frame.\n\n### Parameters\n\n- **expression** - Expression to evaluate\n- **n** - Integer specifying which value to return (1-based)\n- **ignore_nulls** - Boolean indicating whether to skip null values (default: true)\n\n### Description\n\nThe `nth_value()` function returns the value of the specified expression from\nthe nth row in the current window frame. When `ignore_nulls` is true, null\nvalues are not counted in the position. If there is no nth row, the function\nreturns null. The position is 1-based, where 1 represents the first row.\n\n## `cume_dist()`\n\nReturns the cumulative distribution of each row within the window partition.\n\n### Parameters\n\nThis function takes no parameters.\n\n### Description\n\nThe `cume_dist()` function calculates the cumulative distribution of each row\nwithin its window partition. The result is the number of rows with values less\nthan or equal to the current row's value, divided by the total number of rows\nin the partition. Values range from 0 to 1.\n\n## `percent_rank()`\n\nReturns the percentile rank of each row within the window partition.\n\n### Parameters\n\nThis function takes no parameters.\n\n### Description\n\nThe `percent_rank()` function calculates the percentile rank of each row within\nits window partition. The result is calculated as (rank - 1) / (total rows - 1),\nwhere rank is determined by the `SORT` clause ordering. Values range from 0 to 1,\nwith 0 representing the lowest value and 1 representing the highest.",
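These functions run only inside the `WINDOW` command. A minimal sketch over a hypothetical `events` dataset; the `SORT` and `WITHIN` clauses follow the descriptions above, but the exact clause order shown here is an assumption:

```hamelin
FROM events
| WINDOW seq = row_number(),          // 1-based position within each user's window
         prev_ts = lag(timestamp, 1)  // value from the previous row, skipping nulls by default
    BY user.id
    SORT timestamp
    WITHIN -1h
```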
|
|
494
|
-
"introduction.md": "# Introducing Hamelin\n\nHamelin is a **pipe-based query language** for **event analytics** which targets\nthe specific challenges detection engineers face when analyzing security events.\nThe language makes event correlation straightforward, letting you define\npatterns, correlate them across time windows, and match ordered sequences of\nevents.\n\n## Key Features\n\n### \u{1F504} Pipe-Based\n\nYou write queries that read naturally from top to bottom. Each operation\nconnects to the next using the pipe operator `|`. Pipe-based languages let you\nbuild queries incrementally, making them easier to read, write, and test than\napproaches that rely heavily on nested subqueries.\n\n```hamelin\nFROM events\n| WHERE event.action == 'login'\n| WITHIN -1hr\n| SELECT user.email, timestamp\n```\n\n### \u{1F550} Event-Native\n\nHamelin offers shorthand for working with timestamped events. Time intervals are\nwritten as simple expressions that match how you think about time. You can\nreference relative timestamps and truncate them to specific boundaries.\n\n```hamelin\n// Reference relative time\n| WITHIN -15m // events within the last 15 minutes\n| WITHIN -1h // events within the last hour\n| WITHIN -7d // events within the last 7 days\n\n// Truncate timestamps to boundaries\n| SELECT timestamp@h // truncate to hour boundary\n| SELECT timestamp@d // truncate to day boundary\n```\n\n### \u{1FA9F} Sliding Windows\n\nSliding windows move continuously with each event, giving you insights without\ngaps or duplicates. You can aggregate data over these moving time windows to\ndetect patterns as they happen.\n\n```hamelin\nFROM events\n| WHERE event.action == 'login'\n| WINDOW count()\n BY user.id\n WITHIN -15m\n```\n\n### \u{1F3AF} Correlation of Named Subqueries\n\nNamed subqueries let you define specific event patterns and correlate them\nwithin sliding windows. You can drop these patterns into sliding windows and\nwrite correlations around them. Hamelin makes it straightforward to aggregate\nover specific patterns while also aggregating over the entire group of events.\n\n```hamelin\nWITH failed_logins = FROM events\n| WHERE event.action == 'login_failed'\n\nWITH successful_logins = FROM events\n| WHERE event.action == 'login_success'\n\nFROM failed = failed_logins, success = successful_logins\n| WINDOW failures = count(failed),\n successes = count(success),\n total = count(),\n BY user.id\n WITHIN -5m\n| WHERE successes >= 1 && failures / total > 0.2\n```\n\nThis query demonstrates correlating failed and successful login events to detect\nbrute force attacks. Named subqueries define distinct event patterns:\n`failed_logins` filters to login failure events while `successful_logins`\nfilters to login success events. The sliding window aggregates these patterns by\nuser over 5-minute periods, counting failures, successes, and total events. The\nfinal filter identifies users who had at least one successful login where failed\nattempts represent more than 20% of their total login activity within that\nwindow.\n\n### \u{1F50D} Ordered Matching of Named Subqueries\n\nYou can ask Hamelin to match ordered patterns across events. Aggregations over sliding windows work well for many use cases, but others require that you search for specific events followed by other specific events. 
You can do that in Hamelin using regular expression quantifiers applied to named subqueries.\n\n```hamelin\nWITH failed_logins = FROM events\n| WHERE event.action == 'login_failed'\n\nWITH successful_logins = FROM events\n| WHERE event.action == 'login_success'\n\nMATCH failed_logins{10,} successful_logins+\nWHEN last(successful_logins.timestamp) - first(successful_logins.timestamp) < 10m\n```\n\nThis searches for 10 failed logins followed by at least one successful login in\na ten minute period. The sliding window approach might miss attack patterns\nwhere timing and sequence matter, but ordered matching can detect the exact\nprogression of a brute force attack.\n\n### \u{1F517} Event Type Expansion\n\nYou can query across different event types without worrying about schema\ndifferences. Hamelin automatically sets missing fields to `null` when they don't\nexist in a particular event type.\n\n```hamelin\nFROM login_events, logout_events, error_events\n// Filters by user.email when if this field exists in a row.\n// Drops rows where this field does not exist\n// (because NULL does not equal any string).\n| WHERE user.email == 'john@example.com'\n```\n\n### \u{1F5C2}\uFE0F Structured Types\n\nHamelin supports structured types like structs, arrays, and maps to represent\ncomplex data. These types make data modeling more familiar, and reduce the need\nto rely too much on joins in analytic queries.\n\n```hamelin\n// Create struct literals with nested data\nLET login_metadata = {\n ip_address: '192.168.1.100',\n user_agent: 'Mozilla/5.0',\n location: 'San Francisco'\n}\n\n// Access nested fields using dot notation\n| WHERE login_metadata.ip_address != '192.168.1.100'\n\n// Use arrays to store multiple related values\n| LET failed_attempts = [\n {timestamp: '2024-01-15T14:25:00Z', reason: 'invalid_password'},\n {timestamp: '2024-01-15T14:27:00Z', reason: 'account_locked'}\n ]\n\n// Use maps when key data is high cardinality\n// Using structs for this use case creates too many columns.\n| LET host_metrics = map(\n 'web-server-01': {cpu: 85.2, memory: 72.1},\n 'web-server-02': {cpu: 91.7, memory: 68.9},\n 'db-primary-01': {cpu: 67.3, memory: 89.4}\n )\n\n// Look up map values using index notation\n| WHERE host_metrics['web-server-01'].cpu > 80\n```\n\n### \u{1F4E1} Array Broadcasting\n\nHamelin makes working with arrays simpler by offering broadcasting, which helps\nyou distribute operations over each member of an array. It does this when you\napply an operation to an array that makes more sense to be applied to each of\nits members. Broadcasting lets you work with arrays using simple, familiar\nsyntax without asking you to resort to functional programming or inefficient\nunnesting.\n\n```hamelin\n| WHERE any(failed_attempts.reason == 'invalid_password')\n```\n\nThis example demonstrates how the equality operator `==` broadcasts across the\n`reason` field of each element in the `failed_attempts` array. This example\ndemonstrates *two* broadcasts:\n\n * first, the lookup of the `reason` field changes an array-of-struct into an\n array-of-string\n * second, applying equality to the resulting array applies it to each member\n\nHamelin can do this automatically because it is type-aware. It knows that\ncomparing equality between `array(string)` and `string` makes more sense to\nbroadcast: an array can never be equal to a string, but a member of an\n`array(string)` might be.\n\n### \u{1F500} Semi-Structured Types\n\nHamelin lets you parse json into instances of the `variant` type. 
This helps you\nhandle semi-structured data that doesn't fit nicely into fixed schemas. You can\nparse JSON strings, access their fields, and convert them to more structured\ntypes. This makes working with JSON feel fairly native.\n\n```hamelin\n// Parse JSON strings into the variant type\nFROM logs\n| LET event_data = parse_json(raw_json)\n\n// Access nested fields using dot notation\n| WHERE event_data.level AS string == 'ERROR'\n\n// Access json array elements with index notation\n| LET first_tag = event_data.tags[0]\n\n// Cast variant data to structured types when you need type safety.\n// Values that do not match will be null.\n| LET user_info = event_data.user AS {id: int, name: string}\n```\n\n### \u{1F6A8} Excellent Error Messages\n\nHamelin provides clear, helpful error messages. Error messages\npoint directly to the problematic Hamelin code and explain exactly what went\nwrong, rather than showing cryptic messages about generated SQL.\n\nThis matters especially when AI assistants write queries. AI tools need precise\ndescriptions of errors to fix queries and complete tasks. Clear error messages\nlet AI assistants debug queries effectively by giving the context needed to\ncorrect mistakes.\n\n```hamelin\nFROM simba.sysmon_events\n| AGG count() BY host.hostname\n| LET hostname = lower(host.hostname)\n```\n\ngenerates the error\n\n```\nError: problem doing translation\n \u256D\u2500[ :3:24 ]\n \u2502\n 3 \u2502 | LET hostname = lower(host.hostname)\n \u2502 \u2500\u2500\u252C\u2500\n \u2502 \u2570\u2500\u2500\u2500 error while translating\n \u2502\n \u2502 Note: unbound column reference: host\n \u2502\n \u2502 the following entries in the environment are close:\n \u2502 - `host.hostname` (you must actually wrap with ``)\n\u2500\u2500\u2500\u256F\n```\n\nHere, the user has forgotten to escape an identifier that contains a dot character.\n\n```hamelin\nFROM simba.sysmon_events\n| WINDOW count(),\n all(winlog.event_data.events)\n BY host.hostname\n```\n\ngenerates the error\n\n```\nError: problem doing translation\n \u256D\u2500[ :3:10 ]\n \u2502\n 3 \u2502 all(winlog.event_data.events)\n \u2502 \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252C\u2500\u252C\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\n \u2502 \u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 could not find a matching function definition\n \u2502 \u2502\n \u2502 \u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 variant\n \u2502\n \u2502 Note: Attempted all(x=boolean)\n \u2502 - Type mismatch for x: expected boolean, got variant\n \u2502\n \u2502 Attempted all(x=array(boolean))\n \u2502 - Type mismatch for x: expected array(boolean), got variant\n \u2502\n\u2500\u2500\u2500\u256F\n```\n\nHere, the user has forgotten to cast variant to a primitive type so that it can\nbe matched against the function call. (A future version of Hamelin will probably\ncoerce this automatically!)\n",
|
|
494
|
+
"introduction.md": "# Introducing Hamelin\n\nHamelin is a **pipe-based query language** for **event analytics** which targets\nthe specific challenges detection engineers face when analyzing security events.\nThe language makes event correlation straightforward, letting you define\npatterns, correlate them across time windows, and match ordered sequences of\nevents.\n\n## Key Features\n\n### \u{1F504} Pipe-Based\n\nYou write queries that read naturally from top to bottom. Each operation\nconnects to the next using the pipe operator `|`. Pipe-based languages let you\nbuild queries incrementally, making them easier to read, write, and test than\napproaches that rely heavily on nested subqueries.\n\n```hamelin\nFROM events\n| WHERE event.action == 'login'\n| WITHIN -1hr\n| SELECT user.email, timestamp\n```\n\n### \u{1F550} Event-Native\n\nHamelin offers shorthand for working with timestamped events. Time intervals are\nwritten as simple expressions that match how you think about time. You can\nreference relative timestamps and truncate them to specific boundaries.\n\n```hamelin\n// Reference relative time\n| WITHIN -15m // events within the last 15 minutes\n| WITHIN -1h // events within the last hour\n| WITHIN -7d // events within the last 7 days\n\n// Truncate timestamps to boundaries\n| SELECT timestamp@h // truncate to hour boundary\n| SELECT timestamp@d // truncate to day boundary\n```\n\n### \u{1FA9F} Sliding Windows\n\nSliding windows move continuously with each event, giving you insights without\ngaps or duplicates. You can aggregate data over these moving time windows to\ndetect patterns as they happen.\n\n```hamelin\nFROM events\n| WHERE event.action == 'login'\n| WINDOW count()\n BY user.id\n WITHIN -15m\n```\n\n### \u{1F3AF} Correlation of Named Subqueries\n\nNamed subqueries let you define specific event patterns and correlate them\nwithin sliding windows. You can drop these patterns into sliding windows and\nwrite correlations around them. Hamelin makes it straightforward to aggregate\nover specific patterns while also aggregating over the entire group of events.\n\n```hamelin\nWITH failed_logins = FROM events\n| WHERE event.action == 'login_failed'\n\nWITH successful_logins = FROM events\n| WHERE event.action == 'login_success'\n\nFROM failed = failed_logins, success = successful_logins\n| WINDOW failures = count(failed),\n successes = count(success),\n total = count(),\n BY user.id\n WITHIN -5m\n| WHERE successes >= 1 && failures / total > 0.2\n```\n\nThis query demonstrates correlating failed and successful login events to detect\nbrute force attacks. Named subqueries define distinct event patterns:\n`failed_logins` filters to login failure events while `successful_logins`\nfilters to login success events. The sliding window aggregates these patterns by\nuser over 5-minute periods, counting failures, successes, and total events. The\nfinal filter identifies users who had at least one successful login where failed\nattempts represent more than 20% of their total login activity within that\nwindow.\n\n### \u{1F50D} Ordered Matching of Named Subqueries\n\nYou can ask Hamelin to match ordered patterns across events. Aggregations over sliding windows work well for many use cases, but others require that you search for specific events followed by other specific events. 
You can do that in Hamelin using regular expression quantifiers applied to named subqueries.\n\n```hamelin\nWITH failed_logins = FROM events\n| WHERE event.action == 'login_failed'\n\nWITH successful_logins = FROM events\n| WHERE event.action == 'login_success'\n\nMATCH failed_logins{10,} successful_logins+ WITHIN 10m\n```\n\nThis searches for 10 failed logins followed by at least one successful login, \nwith the entire sequence completing within a 10 minute period. The sliding \nwindow approach might miss attack patterns where timing and sequence matter, \nbut ordered matching can detect the exact progression of a brute force attack.\n\n### \u{1F517} Event Type Expansion\n\nYou can query across different event types without worrying about schema\ndifferences. Hamelin automatically sets missing fields to `null` when they don't\nexist in a particular event type.\n\n```hamelin\nFROM login_events, logout_events, error_events\n// Filters by user.email when this field exists in a row.\n// Drops rows where this field does not exist\n// (because NULL does not equal any string).\n| WHERE user.email == 'john@example.com'\n```\n\n### \u{1F5C2}\uFE0F Structured Types\n\nHamelin supports structured types like structs, arrays, and maps to represent\ncomplex data. These types make data modeling more familiar, and reduce the need\nto rely too much on joins in analytic queries.\n\n```hamelin\n// Create struct literals with nested data\nLET login_metadata = {\n  ip_address: '192.168.1.100',\n  user_agent: 'Mozilla/5.0',\n  location: 'San Francisco'\n}\n\n// Access nested fields using dot notation\n| WHERE login_metadata.ip_address != '192.168.1.100'\n\n// Use arrays to store multiple related values\n| LET failed_attempts = [\n    {timestamp: '2024-01-15T14:25:00Z', reason: 'invalid_password'},\n    {timestamp: '2024-01-15T14:27:00Z', reason: 'account_locked'}\n  ]\n\n// Use maps when key data is high cardinality\n// Using structs for this use case creates too many columns.\n| LET host_metrics = map(\n    'web-server-01': {cpu: 85.2, memory: 72.1},\n    'web-server-02': {cpu: 91.7, memory: 68.9},\n    'db-primary-01': {cpu: 67.3, memory: 89.4}\n  )\n\n// Look up map values using index notation\n| WHERE host_metrics['web-server-01'].cpu > 80\n```\n\n### \u{1F4E1} Array Broadcasting\n\nHamelin makes working with arrays simpler by offering broadcasting, which helps\nyou distribute operations over each member of an array. It does this when you\napply an operation to an array that makes more sense to be applied to each of\nits members. Broadcasting lets you work with arrays using simple, familiar\nsyntax without asking you to resort to functional programming or inefficient\nunnesting.\n\n```hamelin\n| WHERE any(failed_attempts.reason == 'invalid_password')\n```\n\nThis example demonstrates how the equality operator `==` broadcasts across the\n`reason` field of each element in the `failed_attempts` array. This example\ndemonstrates *two* broadcasts:\n\n * first, the lookup of the `reason` field changes an array-of-struct into an\n   array-of-string\n * second, applying equality to the resulting array applies it to each member\n\nHamelin can do this automatically because it is type-aware. It knows that\ncomparing equality between `array(string)` and `string` makes more sense to\nbroadcast: an array can never be equal to a string, but a member of an\n`array(string)` might be.\n\n### \u{1F500} Semi-Structured Types\n\nHamelin lets you parse JSON into instances of the `variant` type. 
This helps you\nhandle semi-structured data that doesn't fit nicely into fixed schemas. You can\nparse JSON strings, access their fields, and convert them to more structured\ntypes. This makes working with JSON feel fairly native.\n\n```hamelin\n// Parse JSON strings into the variant type\nFROM logs\n| LET event_data = parse_json(raw_json)\n\n// Access nested fields using dot notation\n| WHERE event_data.level AS string == 'ERROR'\n\n// Access json array elements with index notation\n| LET first_tag = event_data.tags[0]\n\n// Cast variant data to structured types when you need type safety.\n// Values that do not match will be null.\n| LET user_info = event_data.user AS {id: int, name: string}\n```\n\n### \u{1F6A8} Excellent Error Messages\n\nHamelin provides clear, helpful error messages. Error messages\npoint directly to the problematic Hamelin code and explain exactly what went\nwrong, rather than showing cryptic messages about generated SQL.\n\nThis matters especially when AI assistants write queries. AI tools need precise\ndescriptions of errors to fix queries and complete tasks. Clear error messages\nlet AI assistants debug queries effectively by giving the context needed to\ncorrect mistakes.\n\n```hamelin\nFROM simba.sysmon_events\n| AGG count() BY host.hostname\n| LET hostname = lower(host.hostname)\n```\n\ngenerates the error\n\n```\nError: problem doing translation\n \u256D\u2500[ :3:24 ]\n \u2502\n 3 \u2502 | LET hostname = lower(host.hostname)\n \u2502 \u2500\u2500\u252C\u2500\n \u2502 \u2570\u2500\u2500\u2500 error while translating\n \u2502\n \u2502 Note: unbound column reference: host\n \u2502\n \u2502 the following entries in the environment are close:\n \u2502 - `host.hostname` (you must actually wrap with ``)\n\u2500\u2500\u2500\u256F\n```\n\nHere, the user has forgotten to escape an identifier that contains a dot character.\n\n```hamelin\nFROM simba.sysmon_events\n| WINDOW count(),\n all(winlog.event_data.events)\n BY host.hostname\n```\n\ngenerates the error\n\n```\nError: problem doing translation\n \u256D\u2500[ :3:10 ]\n \u2502\n 3 \u2502 all(winlog.event_data.events)\n \u2502 \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252C\u2500\u252C\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\n \u2502 \u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 could not find a matching function definition\n \u2502 \u2502\n \u2502 \u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 variant\n \u2502\n \u2502 Note: Attempted all(x=boolean)\n \u2502 - Type mismatch for x: expected boolean, got variant\n \u2502\n \u2502 Attempted all(x=array(boolean))\n \u2502 - Type mismatch for x: expected array(boolean), got variant\n \u2502\n\u2500\u2500\u2500\u256F\n```\n\nHere, the user has forgotten to cast variant to a primitive type so that it can\nbe matched against the function call. (A future version of Hamelin will probably\ncoerce this automatically!)\n",
|
|
495
495
|
"language-basics/aggregation.md": "# AGG: performing ordinary aggregation\n\nThe `AGG` command groups and aggregates datasets to create summary statistics\nand analytical insights. You can analyze user behavior patterns, system\nperformance metrics, or security events by grouping related records together and\napplying mathematical functions to each group.\n\n## AGG syntax\n\nThe `AGG` command follows a simple pattern that groups data and applies aggregation functions to each group:\n\n```hamelin\nAGG result = function(expression), ... BY grouping_expression, ...\n```\n\nWhen you omit the `BY` clause, Hamelin aggregates all records into a single group. This calculates overall dataset statistics and global metrics that span all records, counting all events across the entire dataset without any grouping or partitioning:\n\n```hamelin\nFROM events\n| AGG total_events = count()\n```\n\nWhen you omit explicit column names, Hamelin generates them automatically from\nthe expressions you provide. Learn more about this feature in [Automatic Field\nNames](../smart-features/automatic-field-names.md). This creates columns named\n`count()` and `avg(response_time)` that you can reference using backticks in\nsubsequent commands:\n\n```hamelin\nFROM requests\n| AGG count(), avg(response_time) BY service_name\n```\n\nWhen you omit aggregation functions entirely, you get distinct groups without any calculations. This returns the unique combinations of event_type and user_id without performing any mathematical operations:\n\n```hamelin\nFROM events\n| AGG BY event_type, user_id\n```\n\nYou can also rename columns in the BY clause and use any expression for grouping. This example groups by renamed event_type, truncated timestamp, and extracted email domain, creating clear column names for downstream analysis:\n\n```hamelin\nFROM events\n| AGG\n total_events = count(),\n avg_duration = avg(duration)\n BY event_category = event_type,\n hour_bucket = timestamp@hr,\n user_domain = split(email, '@')[1]\n```\n\n## Simple aggregation examples\n\n### Basic counting\n\nEvent counting groups events by their characteristics and calculates how many events fall into each category. Notice that Hamelin uses `count()` with no arguments, not `count(*)` like SQL. The empty parentheses count all rows in each group, providing a clean syntax for the most common aggregation operation:\n\n```hamelin\nFROM events\n| AGG event_count = count() BY event_type\n```\n\n### Multiple aggregations\n\nCalculating several metrics at once in a single `AGG` command ensures all metrics use consistent grouping logic:\n\n```hamelin\nFROM requests\n| AGG\n total_requests = count(),\n avg_response_time = avg(response_time),\n max_response_time = max(response_time),\n error_count = count_if(status_code >= 400)\n BY service_name\n```\n\n### Conditional aggregation\n\nConditional aggregation functions like `count_if()` let you count only rows that meet specific conditions without pre-filtering the dataset. Conditional aggregation maintains the full context of each group while applying different filters to different calculations:\n\n```hamelin\nFROM auth_logs\n| AGG\n failures = count_if(outcome == 'FAILURE'),\n successes = count_if(outcome == 'SUCCESS')\n BY user_name\n```\n\n## Time series aggregations\n\nTime series aggregations combine time truncation with grouping to create time-based buckets for temporal analysis. 
Time-based grouping creates time-bucketed summaries for monitoring system performance, tracking business metrics, and understanding user behavior patterns across different time scales.\n\n### Hourly summaries\n\nHourly aggregations provide detailed views of system activity and user behavior throughout the day:\n\n```hamelin\nFROM logs\n| AGG\n    hourly_events = count(),\n    avg_response = avg(response_time),\n    error_rate = count_if(status >= 400) / count()\n  BY timestamp@hr\n| SORT timestamp@hr\n```\n\n### Daily trends\n\nDaily aggregations reveal longer-term trends and enable comparison across different time periods:\n\n```hamelin\nFROM events\n| WITHIN -30d..now()\n| AGG\n    daily_events = count(),\n    unique_users = count_distinct(user_name),\n    high_severity = count_if(severity == 'HIGH')\n  BY timestamp@d\n| SORT timestamp@d DESC\n```\n",
|
|
496
496
|
"language-basics/five-core-commands.md": "# Five core commands\n\nHamelin uses five core commands to handle basic data operations: `FROM`,\n`WHERE`, `LET`, `SELECT`, and `DROP`. Each command operates narrowly and serves\na specific purpose. You combine these commands using the pipe character `|`.\nThese core commands serve the same purpose as SQL clauses, but you can use them\nin any order, with each command feeding its output to the next.\n\n## Command reference\n\n### `FROM` - Access the rows of a dataset\n\nThe `FROM` command pulls rows from one or more datasets. You start most queries with this command to specify the data sources.\n\nPull rows from the events dataset:\n\n```hamelin\nFROM events\n```\n\nPull rows from both the users and orders datasets:\n\n```hamelin\nFROM users, orders\n```\n\nThis operation translates to a SQL `UNION ALL`, not a join. It pulls all rows from both sources without performing any filtering or row alignment.\n\nWhen you query multiple sources, Hamelin automatically expands types to accommodate all fields from both datasets. Fields with the same name get aligned, while unique fields are added with `NULL` values for rows that don't contain them. Learn more about how this works in [Type Expansion](../smart-features/type-expansion.md).\n\nYou can organize results from multiple datasets by grouping fields into separate\nsections. This lets you keep track of which data comes from which source:\n\n```hamelin\nFROM\n allows = events.access_allowed,\n denys = events.access_denied\n```\n\nThe `allows` field becomes a struct containing all fields from\n`events.access_allowed` (set to null for rows from `events.access_denied`). The\n`denys` field works the same way for `events.access_denied`. All other\nfields are aligned normally. This makes it easy to both reference a field's\nspecific lineage and to ignore lineage when you don't need it.\n\n\n### `WHERE` - Filter rows\n\nThe `WHERE` command filters rows based on conditions you specify. You can stack\nmultiple `WHERE` commands to apply multiple filters.\n\nOnly query rows whose action is 'login':\n\n```hamelin\nFROM events\n| WHERE event.action == 'login'\n```\n\nThis has the same effect as using `AND` between each condition:\n\n```hamelin\nFROM users\n| WHERE user.role == 'admin'\n| WHERE status == 'active'\n```\n\n### `LET` - Add or modify columns\n\nThe `LET` command adds new columns or modifies existing ones without affecting\nother columns. 
This lets you create calculated fields and enrich datasets as\nyou build queries.\n\nCreate a full name by concatenating first and last names:\n\n```hamelin\nFROM users\n| LET full_name = user.first_name + ' ' + user.last_name\n```\n\nCalculate how many days ago an event occurred:\n\n```hamelin\nFROM events\n| LET days_ago = (now() - timestamp) / 1d\n```\n\nAdd a severity level based on the event action:\n\n```hamelin\nFROM events\n| LET severity = if(event.action == 'login_failed', 'high', 'low')\n```\n\nYou can set nested fields within existing structs to modify specific parts of records:\n\n```hamelin\nFROM events\n| LET user.display_name = user.first_name + ' ' + user.last_name\n```\n\nYou can also create entirely new nested structs by setting multiple nested fields:\n\n```hamelin\nFROM events\n| LET metadata.processed_at = now()\n| LET metadata.version = '2.1'\n```\n\nWhen creating new structs, using a struct literal is often more readable:\n\n```hamelin\nFROM events\n| LET metadata = {\n    processed_at: now(),\n    version: '2.1'\n  }\n```\n\n### `SELECT` - Define output columns\n\nThe `SELECT` command completely redefines what columns appear in the results.\nThis replaces all existing columns with only the ones you specify.\n\nSelect only the user's email, timestamp, and event action from the events dataset:\n\n```hamelin\nFROM events\n| SELECT user.email, timestamp, event.action\n```\n\nSelect existing columns and add new computed columns with custom names:\n\n```hamelin\nFROM metrics\n| SELECT count, avg_time, category = 'security'\n```\n\nCreate new columns using expressions and conditional logic:\n\n```hamelin\nFROM events\n| SELECT user_id, severity = if(error_count > 10, 'high', 'low')\n```\n\nTransform existing columns while selecting them:\n\n```hamelin\nFROM logs\n| SELECT timestamp, message, log_level = upper(level)\n```\n\nWhen you don't provide explicit field names in SELECT, Hamelin automatically\ngenerates sensible names based on your expressions. This reduces the need to\nthink of names for simple calculations. Learn more about this in [Automatic Field Names](../smart-features/automatic-field-names.md).\n\n### `DROP` - Remove columns\n\nThe `DROP` command removes specific columns from the results. This is useful\nwhen you want to exclude sensitive data or reduce the size of the output.\n\nRemove unwanted columns from the dataset:\n\n```hamelin\nFROM events\n| DROP internal_id, debug_info\n```\n\n## Basic example\n\nThis example shows how you can combine the core commands to build a common query.\n\n```hamelin\nFROM events\n| WHERE event.action != null\n| LET days_ago = (now() - timestamp) / 1d\n| DROP debug_data, internal_flags\n```\n\nThis example demonstrates how the core commands work together in a typical\nworkflow. You start by pulling rows from the events dataset, filter out records\nwith missing action data, add a calculated field for how many days ago each\nevent occurred, and remove unwanted debug columns.\n\n```hamelin\nFROM events\n| WHERE event.action != null\n| LET days_ago = (now() - timestamp) / 1d\n| SELECT event.action, timestamp, days_ago\n```\n\nThis example shows a similar pattern. Rather than dropping specific columns, you\nselect only the ones you want to keep in the final output.\n",
|
|
497
497
|
"language-basics/join-combining-datasets.md": '# `JOIN` and `LOOKUP`: Combining datasets\n\nIn event analytics, event records are often narrow and require enrichment with\nadditional context. You use the `JOIN` and `LOOKUP` commands to do this\nenrichment. Hamelin gives you most of the power of SQL joins - you get inner\njoin behavior (with `JOIN`) and left outer join behavior (with `LOOKUP`).\n\n## Basic syntax\n\nJoin your main dataset with additional data by specifying a matching condition. The `ON` clause defines how records from both datasets should be linked together:\n\n```hamelin\n| JOIN other_dataset ON field_name == other_dataset.field_name\n```\n\nUse `LOOKUP` when you want to keep all your original records, even if some don\'t have matches in the second dataset:\n\n```hamelin\n| LOOKUP other_dataset ON field_name == other_dataset.field_name\n```\n\n## Nested results\n\nWhen you join datasets, Hamelin nests the joined data as a struct to prevent field name collisions. This structure keeps your original fields separate from the joined fields, making it clear which data came from which dataset.\n\nCombine user records with order data:\n\n```hamelin\nFROM users\n| WHERE timestamp > "2024-01-01"\n| JOIN orders ON user_id == orders.user_id\n```\n\nThis creates records where each user has an `orders` struct containing all the matched order information. Your original user fields remain at the top level, while order details are nested inside the `orders` structure.\n\n## Custom naming\n\nYou can control the name of the nested struct using assignment syntax. This makes your queries more readable when the default dataset name isn\'t descriptive:\n\n```hamelin\nFROM users\n| WHERE timestamp > "2024-01-01"\n| JOIN purchase_history = orders ON user_id == orders.user_id\n```\n\nNow the joined data appears under the more descriptive name `purchase_history` instead of the generic `orders` name.\n\n## Accessing joined fields\n\nYou access fields from the joined struct using dot notation. This lets you filter, select, or manipulate the joined data just like any other nested structure:\n\n```hamelin\nFROM users\n| WHERE timestamp > "2024-01-01"\n| JOIN orders ON user_id == orders.user_id\n| WHERE orders.total > 100\n```\n\nThis query finds users who have at least one order with a total over 100, demonstrating how you can filter on joined data.\n\n## Multiple joins\n\nYou can chain multiple `JOIN` operations to combine data from several datasets. Each join creates its own nested struct, letting you pull related information from multiple sources:\n\n```hamelin\nFROM transactions\n| WHERE amount > 1000\n| JOIN user_details = users ON user_id == users.id\n| JOIN account_info = accounts ON account_id == accounts.id\n| WHERE user_details.risk_score > 0.8\n```\n\nThis creates records where each transaction has both `user_details` and `account_info` structs, giving you access to related data from multiple datasets in a single query.\n\n## `JOIN` vs `LOOKUP`: Required vs optional matches\n\nThe key difference between `JOIN` and `LOOKUP` is how they handle missing matches. This choice determines whether you get only complete records or keep all your original data with optional enrichment.\n\n### `JOIN` requires matches\n\nWhen you use `JOIN`, only rows that have a match in both datasets appear in your results. 
Rows without matches get filtered out completely, giving you a dataset that only contains records with complete information.\n\nGet users who have placed orders:\n\n```hamelin\nFROM users\n| JOIN orders ON user_id == orders.user_id\n```\n\nThis returns only users who have actually placed orders. Users without any orders are excluded from the results entirely.\n\n### `LOOKUP` keeps all rows\n\nWhen you use `LOOKUP`, all rows from your main dataset stay in the results, regardless of whether they have matches. For rows without matches, the nested struct gets set to `null`, preserving your complete dataset while adding optional enrichment.\n\nGet all users and include their order information when available:\n\n```hamelin\nFROM users\n| LOOKUP orders ON user_id == orders.user_id\n```\n\nThis returns every user from your dataset. Users with orders get an `orders` struct containing their order data, while users without orders still appear with their `orders` field set to null.\n\n### When to use each\n\nUse `JOIN` when you only want records that have complete information from both datasets. Use `LOOKUP` when you want to preserve your entire main dataset and optionally enrich it with additional data that might not exist for every record.\n',
|
|
498
|
-
"language-basics/match-ordered-pattern-matching.md": "# MATCH: ordered pattern matching\n\nThe `MATCH` command finds specific sequences of events in your data. Pattern matching detects sequences like \"10 failed logins followed by a successful login\" or \"error events followed by restart events within 5 minutes.\" Unlike windowed aggregations, pattern matching requires that event patterns happen in a specific sequence.\n\n## Basic pattern matching\n\nThe `MATCH` command searches for ordered sequences using named subqueries and regular expression-style quantifiers. You define what events you're looking for, then specify the pattern and timing constraints. Create named subqueries for the events you want to match, then specify the sequence pattern. This example finds sequences where 5 or more failed logins are followed by at least one successful login:\n\n```hamelin\nWITH failed_logins =\n FROM security_logs\n | WHERE action == 'login_failed'\n\nWITH successful_logins =\n FROM security_logs\n | WHERE action == 'login_success'\n\nMATCH failed_logins{5,} successful_logins+\n```\n\n\n\n## Pattern quantifiers\n\nQuantifiers control how many of each event type to match. These work like regular expression quantifiers but apply to your named event datasets.\n\n### Exact counts\n\nSpecify exact numbers of events to match. This example finds exactly 3 error events followed by exactly 1 restart event:\n\n```hamelin\nWITH errors =\n FROM system_logs\n | WHERE level == 'ERROR'\n\nWITH restarts =\n FROM system_logs\n | WHERE action == 'service_restart'\n\nMATCH errors{3} restarts{1}\n```\n\n\n\n### Range quantifiers\n\nUse ranges to specify minimum and maximum counts. This example finds between 2 and 5 high-severity alerts followed by 1 or 2 acknowledgment events:\n\n```hamelin\nWITH alerts =\n FROM monitoring\n | WHERE severity == 'HIGH'\n\nWITH acknowledgments =\n FROM monitoring\n | WHERE action == 'acknowledge'\n\nMATCH alerts{2,5} acknowledgments{1,2}\n```\n\n\n\n### Open-ended quantifiers\n\nUse `+` for \"one or more\" and `*` for \"zero or more\". This example finds one or more failed requests followed by one or more successful requests:\n\n```hamelin\nWITH failed_requests =\n FROM api_logs\n | WHERE status_code >= 500\n\nWITH success_requests =\n FROM api_logs\n | WHERE status_code < 400\n\nMATCH failed_requests+ success_requests+\n```\n\n\n\n## Time constraints with
|
|
498
|
+
"language-basics/match-ordered-pattern-matching.md": "# MATCH: ordered pattern matching\n\nThe `MATCH` command finds specific sequences of events in your data. Pattern matching detects sequences like \"10 failed logins followed by a successful login\" or \"error events followed by restart events within 5 minutes.\" Unlike windowed aggregations, pattern matching requires that event patterns happen in a specific sequence.\n\n## Basic pattern matching\n\nThe `MATCH` command searches for ordered sequences using named subqueries and regular expression-style quantifiers. You define what events you're looking for, then specify the pattern and timing constraints. Create named subqueries for the events you want to match, then specify the sequence pattern. This example finds sequences where 5 or more failed logins are followed by at least one successful login:\n\n```hamelin\nWITH failed_logins =\n FROM security_logs\n | WHERE action == 'login_failed'\n\nWITH successful_logins =\n FROM security_logs\n | WHERE action == 'login_success'\n\nMATCH failed_logins{5,} successful_logins+\n```\n\n\n\n## Pattern quantifiers\n\nQuantifiers control how many of each event type to match. These work like regular expression quantifiers but apply to your named event datasets.\n\n### Exact counts\n\nSpecify exact numbers of events to match. This example finds exactly 3 error events followed by exactly 1 restart event:\n\n```hamelin\nWITH errors =\n FROM system_logs\n | WHERE level == 'ERROR'\n\nWITH restarts =\n FROM system_logs\n | WHERE action == 'service_restart'\n\nMATCH errors{3} restarts{1}\n```\n\n\n\n### Range quantifiers\n\nUse ranges to specify minimum and maximum counts. This example finds between 2 and 5 high-severity alerts followed by 1 or 2 acknowledgment events:\n\n```hamelin\nWITH alerts =\n FROM monitoring\n | WHERE severity == 'HIGH'\n\nWITH acknowledgments =\n FROM monitoring\n | WHERE action == 'acknowledge'\n\nMATCH alerts{2,5} acknowledgments{1,2}\n```\n\n\n\n### Open-ended quantifiers\n\nUse `+` for \"one or more\" and `*` for \"zero or more\". This example finds one or more failed requests followed by one or more successful requests:\n\n```hamelin\nWITH failed_requests =\n FROM api_logs\n | WHERE status_code >= 500\n\nWITH success_requests =\n FROM api_logs\n | WHERE status_code < 400\n\nMATCH failed_requests+ success_requests+\n```\n\n\n\n## Time constraints with WITHIN\n\nThe `WITHIN` clause adds constraints to patterns, measuring the distance from the first event to the last event in the matched sequence. For timestamp-based ordering, this represents a time window. For numeric ordering, this represents a numeric distance.\n\n### Time window constraints\n\nRequire that patterns complete within a specific time period. This example finds 10 or more failed logins followed by successful logins, but only when the entire sequence happens within 10 minutes:\n\n```hamelin\nWITH failed_logins =\n FROM security_logs\n | WHERE action == 'login_failed'\n\nWITH successful_logins =\n FROM security_logs\n | WHERE action == 'login_success'\n\nMATCH failed_logins{10,} successful_logins+ WITHIN 10m\n```\n\nThe `WITHIN` clause measures on the actual `SORT` column. When you don't specify a `SORT` clause, timestamp ordering is applied automatically. 
When using `WITHIN`, you must have exactly one `SORT` expression, and the `SORT` column type must be compatible with the `WITHIN` type:\n- `TIMESTAMP` columns work with `INTERVAL` (e.g., `5m`) or `CALENDAR_INTERVAL` (e.g., `1y`, `3mon`)\n- Numeric columns require matching numeric types (e.g., `INT` sort with `INT` within)\n\n### Numeric ordering with WITHIN\n\nYou can use `WITHIN` with numeric columns to constrain sequences by numeric distance rather than time:\n\n```hamelin\nWITH step_a =\n FROM process_log\n | WHERE step == 'A'\n\nWITH step_b =\n FROM process_log\n | WHERE step == 'B'\n\nMATCH step_a step_b SORT BY sequence_number WITHIN 100\n```\n\nThis finds sequences where step A is followed by step B, and the sequence numbers differ by at most 100. The `WITHIN` constraint measures `last(sequence_number) - first(sequence_number) <= 100`.\n\n### Using first() and last() functions\n\nThe `first()` and `last()` functions access the earliest and latest events in each matched group. This example finds CPU spikes followed by memory alerts within 15 minutes:\n\n```hamelin\nWITH cpu_spikes =\n FROM metrics\n | WHERE cpu_usage > 90\n\nWITH memory_alerts =\n FROM metrics\n | WHERE memory_usage > 85\n\nMATCH cpu_spikes{3,} memory_alerts+ WITHIN 15m\n```\n\n\n\n## Complex pattern examples\n\n### Security incident detection\n\nLook for suspicious login patterns that might indicate a brute force attack. This example detects external brute force attempts followed by successful logins and optional privilege escalations, all within 30 minutes:\n\n```hamelin\nWITH failed_logins =\n FROM auth_logs\n | WHERE outcome == 'FAILURE'\n | WHERE source_ip NOT IN ('10.0.0.0/8', '192.168.0.0/16')\n\nWITH successful_logins =\n FROM auth_logs\n | WHERE outcome == 'SUCCESS'\n\nWITH privilege_escalations =\n FROM audit_logs\n | WHERE action == 'privilege_escalation'\n\nMATCH failed_logins{5,} successful_logins{1,3} privilege_escalations* WITHIN 30m\n```\n\n\n\n## When to use MATCH vs WINDOW\n\nThe key difference is that `WINDOW` performs unordered correlation while `MATCH` performs ordered correlation.\n\nWhen you pull multiple event patterns into a sliding window, you can aggregate each individual pattern or aggregate across all the patterns together. However, you cannot require that certain subpatterns happen before others\u2014the window treats all events within the time frame as unordered.\n\n`MATCH` specifies that certain events must happen before others in a specific sequence. Ordered correlation matters when the timing and sequence of events affects your analysis.\n\n**Use MATCH when order matters:**\n- Security attack sequences (failed logins \u2192 successful login \u2192 privilege escalation)\n- System failure cascades (errors \u2192 timeouts \u2192 circuit breaker trips)\n- User workflow analysis (page view \u2192 form submission \u2192 purchase)\n- Compliance violations (access \u2192 modification \u2192 deletion)\n",
|
|
499
499
|
"language-basics/sort-limit-top-n.md": "# `SORT` and `LIMIT`: Doing top-n\n\nYou use the `SORT` command to order your data, and the `LIMIT` command to take only the first n rows from your results. Together, these commands let you find top performers, recent events, highest values, or any other ranking-based analysis.\n\nEach command is also useful on its own. `SORT` helps you understand data patterns by revealing ordering and outliers. You might sort transaction amounts to see the distribution of values, or sort timestamps to understand event sequences. `LIMIT` is valuable for exploring large datasets by giving you manageable samples. You can take the first 100 rows to understand data structure before writing more complex queries, or limit results to avoid overwhelming outputs during development.\n\n## Basic syntax\n\nSort your data by specifying the field you want to order by. Add `DESC` for descending order (highest to lowest) or leave it blank for ascending order (lowest to highest):\n\n```hamelin\n| SORT field_name DESC\n```\n\nLimit your results to a specific number of rows using the `LIMIT` command:\n\n```hamelin\n| LIMIT 10\n```\n\n## Simple sorting\n\nOrder your data by a single field to see patterns and outliers. This is useful for finding the most recent events, highest values, or alphabetical arrangements.\n\nSort login events by timestamp to see them in chronological order:\n\n```hamelin\nFROM security_logs\n| WHERE action == 'login'\n| SORT timestamp DESC\n```\n\nThis query gets login events and sorts them by timestamp in descending order (newest first), letting you see the full sequence of login activity.\n\n## Multiple sort fields\n\nYou can sort by multiple fields to create more sophisticated ordering. List the fields in order of priority, with the most important sort field first:\n\n```hamelin\nFROM transactions\n| SORT amount DESC, timestamp DESC\n```\n\nThis sorts transactions first by amount (highest first), then by timestamp (newest first) for transactions with the same amount. This ordering reveals value patterns across all transactions, with ties broken by recency.\n\n## Top-n analysis\n\nThe combination of `SORT` and `LIMIT` creates powerful top-n analysis patterns. This lets you answer questions like \"who are my top customers\" or \"what are the most common errors\" with simple, readable queries.\n\nFind the top 5 users by transaction volume:\n\n```hamelin\nFROM transactions\n| AGG total_amount = sum(amount) BY user_id\n| SORT total_amount DESC\n| LIMIT 5\n```\n\nThis aggregates transaction amounts by user, sorts by the total in descending order, and takes the top 5 results. The pattern works for any ranking scenario where you need to identify leaders or outliers.\n\nNote: This example uses the `AGG` command which we haven't covered yet. You can learn more about aggregation in [Aggregation](aggregation.md).\n\n## Sorting with expressions\n\nYou can sort by calculated values without adding them as permanent fields. This is useful when you want to order by a computation but don't need that computation in your final results:\n\n```hamelin\nFROM events\n| SORT (now() - timestamp) / 1hr\n| LIMIT 20\n```\n\nThis sorts events by how many hours ago they occurred, giving you the most recent events first. 
The calculation happens during sorting but doesn't create a new field in your results.\n\n## Complex sorting scenarios\n\nFor more advanced sorting, you can combine multiple fields, expressions, and directions to create exactly the ordering you need.\n\nFind the most problematic users by recent failed login attempts:\n\n```hamelin\nFROM security_logs\n| WHERE action == 'login_failed' AND timestamp > now() - 24hr\n| AGG failure_count = count(), latest_failure = max(timestamp) BY user_id\n| SORT failure_count DESC, latest_failure DESC\n| LIMIT 10\n```\n\nThis query identifies users with the most failed login attempts in the last 24 hours, sorted first by failure count (most failures first), then by recency of their latest failure. This creates a prioritized list for security investigation.\n\n## Performance considerations\n\nSorting large datasets can be expensive, especially when sorting by multiple fields or complex expressions. When possible, apply filters with `WHERE` before sorting to reduce the amount of data that needs to be ordered:\n\n```hamelin\nFROM events\n| WHERE timestamp > now() - 7d -- Filter first\n| SORT severity DESC, timestamp DESC\n| LIMIT 50\n```\n\nThis pattern filters to recent events before sorting, which is more efficient than sorting all events and then filtering.\n",
|
|
500
500
|
"language-basics/time.md": "# Time\n\nTime is central to event analytics. In Hamelin, you write time the way you think\nabout it\u2014`1hr`, `30min`, or `yesterday`. The language supports several ways to work\nwith time: intervals for durations, absolute timestamps for specific moments,\ntime truncation for grouping, and ranges for time spans.\n\n## Time intervals\n\nYou use time intervals to express durations\u2014how long something takes or how far back to look in your data.\n\n### Basic interval syntax\n\nCreate time intervals by writing a number directly followed by a time unit, with no spaces. Use these anywhere you need to specify how long something takes or how far back to look:\n\n```hamelin\n# Time intervals - number + unit\n1sec # 1 second\n30sec # 30 seconds\n1min # 1 minute\n15min # 15 minutes\n1hr # 1 hour\n2hr # 2 hours\n1d # 1 day\n7d # 7 days\n1w # 1 week\n2w # 2 weeks\n1mon # 1 month\n6mon # 6 months\n1yr # 1 year\n```\n\n### Supported time units\n\n| Unit | Abbreviations | Examples |\n|------|---------------|----------|\n| **Seconds** | `s`, `sec`, `secs`, `second`, `seconds` | `30s`, `45sec` |\n| **Minutes** | `m`, `min`, `mins`, `minute`, `minutes` | `5m`, `15min` |\n| **Hours** | `h`, `hr`, `hrs`, `hour`, `hours` | `1h`, `2hr` |\n| **Days** | `d`, `day`, `days` | `1d`, `30days` |\n| **Weeks** | `w`, `week`, `weeks` | `1w`, `2weeks` |\n| **Months** | `mon`, `month`, `months` | `1mon`, `3months` |\n| **Years** | `y`, `yr`, `yrs`, `year`, `years` | `1y`, `2yrs` |\n\n### Using intervals in variables\n\nStore commonly used time intervals in variables to make your queries more readable and maintainable:\n\n```hamelin\n# Store intervals in variables for reuse\nWITH time_constants =\n LET short_window = 5min,\n daily_retention = 30d,\n investigation_period = 2hr,\n alert_threshold = 500ms\n```\n\n## Absolute timestamps\n\nYou can reference specific moments in time using absolute timestamps. This is useful when you know the exact time of an incident or need to analyze data from a specific date.\n\n### ISO 8601 format\n\nCreate absolute timestamps using the `ts()` function with ISO 8601 formatted strings. This format works with or without timezone information:\n\n```hamelin\n# Absolute timestamps using ISO 8601 format\nWITH timestamps =\n LET incident_start = ts('2024-01-15T14:30:00'),\n maintenance_window = ts('2024-01-15T02:00:00Z'),\n deployment_time = ts('2024-01-15T09:15:30.123Z')\n```\n\n### Current time\n\nGet the current timestamp using the `now()` function. This captures the exact moment when your query starts running:\n\n```hamelin\n# Get the current timestamp\nWITH current_times =\n LET right_now = now(),\n query_start_time = now()\n```\n\n## Time truncation with `@`\n\nThe `@` operator snaps timestamps to time boundaries. You can truncate any timestamp to the start of its hour, day, week, or other time period. 
This makes it straightforward to group events into time buckets for analysis.\n\n### Truncation syntax\n\nApply the `@` operator to any timestamp to round it down to the nearest time boundary:\n\n```hamelin\n# Truncate current time to various boundaries\nnow()@d # Today at midnight (00:00:00)\nnow()@hr # This hour at :00 minutes\nnow()@min # This minute at :00 seconds\nnow()@w # This week's Monday at midnight\nnow()@mon # First day of this month at midnight\n```\n\n### Available truncation units\n\n| Unit | Truncates To | Example Result |\n|------|--------------|----------------|\n| `@s` | Start of second | `2024-01-15T14:30:25.000` |\n| `@min` | Start of minute | `2024-01-15T14:30:00.000` |\n| `@hr` | Start of hour | `2024-01-15T14:00:00.000` |\n| `@d` | Start of day (midnight) | `2024-01-15T00:00:00.000` |\n| `@w` | Start of week (Monday) | `2024-01-15T00:00:00.000` |\n| `@mon` | Start of month | `2024-01-01T00:00:00.000` |\n| `@yr` | Start of year | `2024-01-01T00:00:00.000` |\n\n### Truncation with any timestamp\n\nYou can truncate any timestamp, not just `now()`. Create time buckets from your event data by truncating timestamp fields:\n\n```hamelin\n# Truncate any timestamp, not just now()\nWITH event_data =\n LET event_time = ts('2024-01-15T14:37:22')\n\nFROM event_data\n| LET hour_bucket = event_time@hr // 2024-01-15T14:00:00\n| LET day_bucket = event_time@d // 2024-01-15T00:00:00\n```\n\nYou can also truncate timestamp columns directly from your event datasets to group events by time periods:\n\n```hamelin\nFROM events\n| LET event_day = timestamp@d // Group events by day\n| LET event_hour = timestamp@hr // Group events by hour\n| SELECT user_id, event_day, event_hour, action\n```\n\n## Time ranges\n\nYou combine time values into ranges using the `..` operator. Time ranges let you express time spans like \"between 2 and 4 hours ago\" or \"from this morning onward.\" This makes it natural to filter events within specific time windows.\n\n### The range operator `..`\n\nThe `..` operator creates a span between two time points. You place time values on either side to define the start and end of your range.\n\nCreate a range between 2 hours ago and 1 hour ago:\n\n```hamelin\n-2hr..-1hr\n```\n\nCreate a range from a specific time until now:\n\n```hamelin\nts('2024-01-15T10:00:00')..now()\n```\n\nCreate a range from midnight today until midnight tomorrow:\n\n```hamelin\nnow()@d..(now()@d + 1d)\n```\n\n### Relative time ranges\n\nUse negative numbers to go back in time from \"now\". This pattern covers most security and operational analytics scenarios where you're investigating recent events.\n\nGet events from the last hour:\n\n```hamelin\n-1hr..now()\n```\n\nGet events between 2 and 4 hours ago:\n\n```hamelin\n-4hr..-2hr\n```\n\nGet events from this week so far:\n\n```hamelin\nnow()@w..now()\n```\n\nYou can combine truncation with ranges to create precise time windows aligned to calendar boundaries:\n\n```hamelin\n// From start of today until now\nnow()@d..now()\n\n// Yesterday (full day)\n(now()@d - 1d)..(now()@d)\n\n// Last full hour\n(now()@hr - 1hr)..(now()@hr)\n```\n\n### Unbounded ranges\n\nYou can leave either side of the range empty to create spans that extend infinitely in one direction. 
This is useful for ongoing monitoring or historical analysis without a specific end point.\n\nGet everything before 1 hour ago:\n\n```hamelin\n..-1hr\n```\n\nGet everything from a specific time onward:\n\n```hamelin\nts('2024-01-15T10:00:00')..\n```\n\nGet everything from 2 hours ago onward (includes future events):\n\n```hamelin\n-2hr..\n```\n\n### Bounded vs unbounded ranges\n\nThe choice between bounded and unbounded ranges determines how your queries behave, especially for ongoing monitoring versus historical analysis.\n\n```hamelin\n// Bounded: Only gets events that happened in the past hour\nFROM alerts | WHERE timestamp IN -1hr..now()\n\n// Unbounded: Gets past events AND future events as they arrive\nFROM alerts | WHERE timestamp IN -1hr..\n```\n\nUse **bounded ranges** when analyzing completed time periods. Use **unbounded ranges** when monitoring ongoing events as they happen.\n",
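Truncation is most often used to bucket events inside an aggregation. A minimal sketch that combines the pieces above, filtering to a recent window and counting per hour (the `events` dataset and `timestamp` field are illustrative):

```hamelin
FROM events
| WITHIN -7d                                      // keep only the last seven days
| AGG event_count = count() BY hour_bucket = timestamp@hr
| SORT hour_bucket                                // ascending: oldest bucket first
```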
|
|
501
501
|
"language-basics/window-aggregating-over-sliding-windows.md": "# WINDOW: aggregating over sliding windows\n\nThe `WINDOW` command creates aggregations that slide across windows of data.\nThis lets you calculate running totals, moving averages, and time-based metrics\nwithout losing the detail of individual events. Each row gets its own\ncalculation based on a sliding window of related rows around it.\n\n## Window calculations\n\nThe `WINDOW` command supports two main types of calculations that operate on the\nsliding window of data. Aggregation functions like `count()`, `sum()`, `avg()`,\n`min()`, and `max()` calculate summary statistics across all rows in the current\nwindow frame.\n\n```hamelin\nFROM sales\n| WINDOW\n total_sales = sum(amount),\n avg_sale = avg(amount),\n sale_count = count()\n BY region\n WITHIN -7d\n```\n\nThis creates a 7-day rolling summary for each region, showing total sales, average sale amount, and number of sales within the sliding window.\n\nWindow-specific functions like `row_number()`, `rank()`, `dense_rank()`, and `lag()` analyze the position and relationships between rows within the window without aggregating the data.\n\n```hamelin\nFROM events\n| WINDOW\n event_number = row_number(),\n event_rank = rank(),\n previous_value = lag(score, 1)\n BY user_id\n SORT timestamp\n```\n\nThis assigns sequence numbers, ranks events by timestamp order, and shows the previous score value for each user's events.\n\nYou can combine multiple calculations in a single `WINDOW` command, and each calculation receives the same set of rows determined by the window frame, but produces different analytical results based on its specific function behavior:\n\n```hamelin\nFROM metrics\n| WINDOW\n recent_count = count(),\n running_total = sum(value),\n current_rank = row_number(),\n percentile_rank = percent_rank()\n BY service\n SORT timestamp\n WITHIN -1hr\n```\n\nThis example mixes aggregation functions (`count()`, `sum()`) with window-specific functions (`row_number()`, `percent_rank()`) to create comprehensive analytics for each service within a 1-hour sliding window.\n\n\nWhen explicit names aren't provided for window calculations, Hamelin automatically generates field names from expressions. Learn more about this in [Automatic Field Names](../smart-features/automatic-field-names.md).\n\n## WINDOW command parts\n\nThe `WINDOW` command has three optional clauses that control how the sliding\nwindow behaves. Each clause serves a specific purpose in defining which data\ngets included in each calculation.\n\n```hamelin\nWINDOW calculations\n BY grouping_fields // optional: partitions data\n SORT ordering_fields // optional: defines row order\n WITHIN frame_range // optional: defines window size\n```\n\n### BY clause: partitioning data\n\nThe `BY` clause divides data into separate groups, with each group getting its own independent sliding window. This lets you create per-user, per-host, or per-category calculations without mixing data across different entities.\n\n**With BY fields specified:** Hamelin creates separate windows for each unique combination of those fields. This partitioning ensures that calculations for different users, devices, or categories remain completely independent. Each partition maintains its own window state, preventing data from different entities from interfering with each other. 
Here's how to create separate counting windows for each user:\n\n```hamelin\nFROM events\n| WINDOW count()\n BY user_id\n```\n\n**Without BY fields:** Hamelin treats all data as one big group. This creates a single window that processes all events together, regardless of their source or category. The calculation accumulates across every row in the dataset, which proves useful for global metrics or overall trend analysis. This example creates one counting window that includes all events:\n\n```hamelin\nFROM events\n| WINDOW count()\n```\n\n### SORT clause: ordering rows\n\nThe `SORT` clause controls the order of rows within each window partition. This ordering determines which rows come \"before\" and \"after\" each row, affecting functions like `row_number()` and defining the direction of the sliding window.\n\n**With SORT specified:** Hamelin uses the explicit ordering. The sort order determines which rows come before and after each current row in the window calculation. When you want to analyze transactions by value rather than time, you can sort by amount to create value-based rankings and running totals:\n\n```hamelin\nFROM transactions\n| WINDOW\n running_total = sum(amount),\n transaction_rank = rank()\n BY account_id\n SORT amount DESC\n```\n\n**Without SORT specified:** Hamelin automatically orders by event timestamp. This chronological ordering makes sense for most time-series analysis where you want to track how metrics evolve over time. The automatic timestamp ordering eliminates the need to explicitly specify time-based sorting in typical analytical scenarios. This example creates a chronological sequence count for each user:\n\n```hamelin\nFROM events\n| WINDOW event_sequence = count()\n BY user_id\n```\n\n### WITHIN clause: defining the window frame\n\nThe `WITHIN` clause controls how much data gets included in the window around each row.\n\n**With WITHIN specified:** Hamelin uses the explicit frame size. This sliding frame moves with each row, always maintaining the specified time period or row count. When you need to count events within a specific time window, you can specify the exact duration:\n\n```hamelin\nFROM events\n| WINDOW count()\n BY user_id\n WITHIN -1hr\n```\n\nFor each event, this counts all events for that user in the hour leading up to that event's timestamp.\n\n**Without WITHIN specified:** Hamelin uses `..0r` (from the beginning of the partition up to the current row). This default behavior creates cumulative calculations that include all rows from the start of each partition up to the current row. The cumulative approach works well for running totals, progressive counts, and other metrics that should include all historical data. This example creates a running count for each user from their first event:\n\n```hamelin\nFROM events\n| WINDOW cumulative_count = count()\n BY user_id\n```\n\n## Window frames\n\nThe `WITHIN` clause accepts different types of frame specifications that control how much data gets included around each row. Frame specifications determine whether the window slides based on time intervals, specific row counts, or bounded ranges between two points. Understanding these frame types lets you create exactly the sliding behavior you need for different analytical scenarios.\n\n### Value-based frames\n\nIntervals like `-5min` or `-1hr` create sliding windows based on the values in the sorted column. Because the most common sort order is by timestamp, these frames typically create time-based windows that slide through data chronologically. 
The window maintains a consistent value range (usually time duration) regardless of how many events occur within that period. Value-based frames work particularly well for temporal metrics like monitoring system performance or analyzing user activity patterns over fixed time periods.\n\n```hamelin\nFROM metrics\n| WINDOW avg_cpu = avg(cpu_usage)\n BY hostname\n WITHIN -5min\n```\n\nFor each metric record, this calculates the average CPU usage for that host over the 5 minutes leading up to that metric's timestamp.\n\nYou can also use value-based frames with non-timestamp columns when the data is sorted by those values. This example calculates running statistics for orders based on order amounts, looking at orders within a $5 range of each current order:\n\n```hamelin\nFROM orders\n| WINDOW\n nearby_orders = count(),\n avg_nearby_amount = avg(amount)\n BY customer_id\n SORT amount\n WITHIN -5\n```\n\n### Row-based frames\n\nRow counts create windows based on a specific number of surrounding rows rather than time periods. This approach proves valuable when you need consistent sample sizes for statistical calculations or when events occur at irregular intervals. Row-based frames ensure that each calculation includes exactly the specified number of data points, making comparisons more reliable across different time periods. This example counts events and assigns sequence numbers using a 4-row window (current row plus 3 preceding rows):\n\n```hamelin\nFROM events\n| WINDOW\n recent_events = count(),\n event_sequence = row_number()\n BY user_id\n SORT timestamp\n WITHIN -3r\n```\n\n### Range frames\n\nRange frames like `-2hr..-1hr` create windows between two specific offsets from the current row. This capability lets you analyze data from specific time periods without including the current time period in the calculation. Range frames prove particularly useful for lag analysis, where you want to compare current metrics against historical periods, or when you need to exclude recent data that might be incomplete. This example counts events from the hour that ended one hour before each current event:\n\n```hamelin\nFROM events\n| WINDOW previous_hour_count = count()\n BY user_id\n WITHIN -2hr..-1hr\n```\n\nRange frames can be unbounded by omitting one end of the range. An unbounded range like `-2hr..` creates a window that extends infinitely in one direction from a starting point. This technique proves useful when you want all data from a specific threshold forward, such as calculating cumulative metrics that start counting only after a certain time delay. Unbounded ranges help create progressive totals that begin from meaningful starting points rather than the very beginning of the dataset. This example counts all events for each user starting from 2 hours before each current event:\n\n```hamelin\nFROM events\n| WINDOW cumulative_count = count()\n BY user_id\n WITHIN -2hr..\n```\n\n**Warning:** Interval frames like `-1hr` create sliding windows, while unbounded ranges like `-1hr..` include all future rows in the dataset. The unbounded version creates a massive window instead of the sliding window you typically want for temporal analysis.\n",
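To make that warning concrete, here is the same count computed with both frame types (dataset and field names are illustrative); only the first gives you a true sliding one-hour window:

```hamelin
// Sliding frame: each row counts events from the hour leading up to it
FROM events
| WINDOW hourly_count = count()
  BY user_id
  WITHIN -1hr

// Unbounded frame: each row counts events from one hour before it through
// the end of the partition, which is usually not what you want
FROM events
| WINDOW open_count = count()
  BY user_id
  WITHIN -1hr..
```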
|
|
@@ -503,7 +503,8 @@ scoring weights without affecting the overall detection logic.`,
|
|
|
503
503
|
"language-basics/within-filtering-by-timestamp.md": "# WITHIN: filtering by timestamp\n\nYou use the `WITHIN` command to filter events by their timestamp. This command automatically finds the primary timestamp field in your dataset and filters events to match your specified time criteria, making time-based filtering straightforward and readable.\n\n## Basic syntax\n\nThe `WITHIN` command filters your dataset to include only events that occurred within your specified time window. Hamelin automatically identifies the timestamp field in your data, so you don't need to specify column names.\n\nPull events that happened in the last hour:\n\n```hamelin\nFROM events | WITHIN -1hr\n```\n\nGet events from a specific time window in the past:\n\n```hamelin\nFROM security_events | WITHIN -4hr..-2hr\n```\n\nAnalyze only today's events using time truncation:\n\n```hamelin\nFROM system_logs | WITHIN now()@d..now()\n```\n\nLook at yesterday's complete activity:\n\n```hamelin\nFROM audit_logs | WITHIN (now()@d - 1d)..(now()@d)\n```\n\n## Time intervals vs time ranges\n\nYou can use `WITHIN` with either time intervals or time ranges.\n\n**Time intervals** get automatically converted to ranges that start or end at \"now\". This makes them perfect for recent event analysis:\n\n```hamelin\n# Negative intervals go backward from now\nFROM alerts | WITHIN -1hr # Last hour: -1hr..now()\n\n# Positive intervals go forward from now\nFROM alerts | WITHIN 2hr # Next 2 hours: now()..2hr\n```\n\n**Time ranges** work exactly as you define them, giving you precise control over both start and end points:\n\n```hamelin\n# Bounded range: specific start and end\nFROM events | WITHIN -4hr..-2hr\n\n# Unbounded range: from start onward\nFROM alerts | WITHIN -1hr..\n\n# Range ending at now\nFROM events | WITHIN -2hr..now()\n```\n\n### Backward vs forward time intervals\n\n**Negative intervals** look backward from now to capture recent events:\n\n```hamelin\nFROM events | WITHIN -30min # Events from 30 minutes ago until now\nFROM events | WITHIN -1d # Events from 1 day ago until now\n```\n\n**Positive intervals** look forward from now, which is useful for scheduled events or planned activities:\n\n```hamelin\nFROM scheduled_tasks | WITHIN 2hr # Tasks scheduled for next 2 hours\nFROM alerts | WITHIN 15min # Alerts expected in next 15 minutes\n```\n",
|
|
504
504
|
"smart-features/array-broadcasting.md": "# Array Broadcasting\n\nHamelin automatically broadcasts operations across arrays, letting you use\nfamiliar syntax without explicit iteration. When you write `array.field`,\nHamelin extracts that field from every element in the array. When you write\n`func(array)`, the function applies to each array element individually.\n\n## How broadcasting works\n\nBroadcasting works for any operation or function call, for example arithmetic,\ncomparisons, function calls, and field access (dereference). Instead of needing\nspecial array functions, you use familiar syntax and Hamelin handles the\niteration.\n\n## Examples\n\nLet's work through some examples assuming that `failed_attempts` has the following data:\n\n```hamelin\nLET failed_attempts = [\n {timestamp: '2024-01-15T14:25:00Z', reason: 'invalid_password', ip: '192.168.1.100'},\n {timestamp: '2024-01-15T14:26:15Z', reason: 'account_locked', ip: '192.168.1.101'},\n {timestamp: '2024-01-15T14:27:30Z', reason: 'invalid_password', ip: '192.168.1.102'}\n]\n```\n\n### Field access broadcasting\n\nAccess fields across all array elements using dot notation:\n\n```hamelin\n// Extract all unique reasons from the failed attempts array\n| LET all_reasons = array_distinct(failed_attempts.reason)\n// Result: ['invalid_password', 'account_locked']\n\n// Extract all IP addresses\n| LET all_ips = failed_attempts.ip\n// Result: ['192.168.1.100', '192.168.1.101', '192.168.1.102']\n```\n\nThe operation applies to each element in the array, extracting the specified field from every object.\n\n### Comparison broadcasting\n\nApply comparison operations across array elements:\n\n```hamelin\n// Check which attempts were due to invalid passwords\n| LET password_failures = failed_attempts.reason == 'invalid_password'\n// Result: [true, false, true]\n\n// Find attempts from suspicious IP ranges\n| LET suspicious_ips = failed_attempts.ip > '192.168.1.100'\n// Result: [false, true, true]\n```\n\nEach comparison operation returns an array of boolean values, maintaining the same structure as the original array.\n\n### Function call broadcasting\n\nNamed functions automatically broadcast over arrays:\n\n```hamelin\n// Convert all reasons to uppercase\n| LET uppercase_reasons = upper(failed_attempts.reason)\n// Result: ['INVALID_PASSWORD', 'ACCOUNT_LOCKED', 'INVALID_PASSWORD']\n\n// Extract hour from all timestamps\n| LET attempt_hours = hour(failed_attempts.timestamp)\n// Result: [14, 14, 14]\n```\n\nFunctions apply to each element in the array automatically.\n\n### Arithmetic broadcasting\n\nMathematical operations work across array elements:\n\n```hamelin\n// Extract numeric seconds from timestamp strings for calculations\n| LET timestamp_seconds = second(failed_attempts.timestamp)\n// Result: [0, 15, 30]\n\n// Add 10 seconds to each timestamp second value\n| LET adjusted_seconds = timestamp_seconds + 10\n// Result: [10, 25, 40]\n\n// Calculate minutes elapsed since the first attempt\n| LET minutes_elapsed = (timestamp_seconds - timestamp_seconds[0]) / 60\n// Result: [0, 0.25, 0.5]\n```\n\n### Combining with array functions\n\nBroadcasting pairs naturally with functions that take arrays as arguments like `any()` and `all()`:\n\n```hamelin\n// Check if any login attempt was due to invalid password\n| WHERE any(failed_attempts.reason == 'invalid_password')\n\n// Verify all attempts came from internal network\n| WHERE all(failed_attempts.ip < '192.168.2.0')\n\n// Count how many attempts were password-related\n| AGG 
password_attempt_count = count_if(any(failed_attempts.reason == 'invalid_password'))\n```\n\nThe broadcasting happens first (creating boolean arrays), then the array\nfunctions operate on those arrays.\n\n### Nested structure broadcasting\n\nBroadcasting works with the nested structure in our example data:\n\n```hamelin\n// Extract just the hour from each timestamp\n| LET attempt_hours = hour(failed_attempts.timestamp)\n// Result: [14, 14, 14]\n\n// Check if any IP is in a specific subnet\n| LET internal_network = failed_attempts.ip > '192.168.1.100'\n// Result: [false, true, true]\n\n// Create readable timestamp strings\n| LET readable_times = format_timestamp(failed_attempts.timestamp, 'HH:mm:ss')\n// Result: ['14:25:00', '14:26:15', '14:27:30']\n```\n\nBroadcasting navigates through multiple levels of nesting.\n\n### Broadcasting vs. traditional approaches\n\nWithout broadcasting, you would need either functional programming with map operations or explode/re-aggregate patterns:\n\n```hamelin\n-- Hamelin with broadcasting\nLET tag_names = upper(tags.name)\n\n-- Alternative approaches without broadcasting:\n-- 1. Functional map operation\nLET tag_names = array_map(tags, tag -> upper(tag.name))\n\n-- 2. Explode and re-aggregate pattern\nFROM events\n| EXPLODE tags AS tag\n| SELECT upper(tag.name) AS upper_name\n| AGG array_agg(upper_name) BY event_id\n```\n\nBroadcasting eliminates the need for explicit mapping or explode/aggregate operations.\n",
|
|
505
505
|
"smart-features/automatic-field-names.md": '# Automatic field names\n\nWhen you don\'t specify output field names in your queries, Hamelin generates \nsensible names automatically. This reduces typing and makes exploratory queries \nfaster to write. You can always override the automatic names by providing your \nown.\n\n## How automatic naming works\n\nHamelin uses the expression itself as the output field name when you don\'t \nprovide one. This creates predictable, readable column names that reflect what \nthe field contains. As an example, for simple column references, Hamelin uses \nthe column name as the field name:\n\n```hamelin\nFROM events | SELECT timestamp, user_name\n// Result columns: "timestamp", "user_name"\n```\n\nFor any expression more complex than a simple column reference, Hamelin uses \nthe full expression text as the field name:\n\n```hamelin\nFROM events | SELECT user_name, timestamp + 1hr, count()\n// Result columns: "user_name", "timestamp + 1hr", "count()"\n\nFROM logs | AGG count(error_code), avg(response_time) BY service\n// Result columns: "count(error_code)", "avg(response_time)"\n```\n\n## Referencing automatic field names\n\nUse backticks to reference automatically generated field names from complex \nexpressions:\n\n```hamelin\nFROM events\n| AGG count(), avg(response_time) BY service_name\n| WHERE `count()` > 100\n| SORT `avg(response_time)` DESC\n```\n\nThe backticks tell Hamelin you\'re referring to a field name that contains \nspecial characters or spaces.\n\n## Commands that use automatic naming\n\n### SELECT command\n\nSELECT generates field names from the expressions you provide:\n\n```hamelin\nFROM users | SELECT first_name + " " + last_name, age * 12\n// Result columns: "first_name + " " + last_name", "age * 12"\n```\n\n### AGG command\n\nAGG creates field names from both aggregation functions and grouping fields:\n\n```hamelin\nFROM requests | AGG count(), max(response_time) BY endpoint\n// Result columns: "endpoint", "count()", "max(response_time)"\n\nFROM events | AGG count() BY user_name, timestamp@hr\n// Result columns: "user_name", "timestamp@hr", "count()"\n```\n\nYou can override the automatic names for both aggregation functions and \ngrouping fields:\n\n```hamelin\nFROM requests | AGG\n request_count = count(),\n max_response = max(response_time)\n BY service_endpoint = endpoint\n// Result columns: "service_endpoint", "request_count", "max_response"\n```\n\n### WINDOW command\n\nWINDOW functions generate names from the function calls:\n\n```hamelin\nFROM metrics\n| WINDOW count(), avg(cpu_usage)\n BY host\n SORT timestamp\n WITHIN -5min\n// Result columns: "count()", "avg(cpu_usage)"\n```\n\n## Overriding automatic names\n\nProvide explicit names when you want cleaner field names or need to reference \nthem easily:\n\n```hamelin\n// Automatic names (harder to reference)\nFROM events | AGG count(), avg(response_time) BY service\n\n// Explicit names (easier to reference)\nFROM events | AGG\n total_events = count(),\n avg_response = avg(response_time)\n BY service\n```\n\nExplicit names make your queries more readable and easier to chain with \nadditional operations.\n\n## When automatic names are most useful\n\nAutomatic naming speeds up data exploration when you\'re trying to understand \nyour data. 
As an example, you can see results quickly without thinking about \nfield names:\n\n```hamelin\nFROM logs | AGG count(), count_distinct(user_name), max(timestamp) BY service\n```\n\nFor straightforward field selections, automatic names keep queries concise:\n\n```hamelin\nFROM events | SELECT user_name, timestamp, action_type\n```\n\nYou can also start with automatic names, then add explicit names as your query \nbecomes more complex:\n\n```hamelin\n// Start simple\nFROM events | AGG count() BY event_type\n\n// Add explicit names as needed\nFROM events | AGG event_count = count() BY event_type\n```\n\nAutomatic field names reduce friction in query writing while maintaining \nclarity about what each field contains.\n',
|
|
506
|
-
"smart-features/lineage-sub-struct.md": "# Lineage sub-struct\n\nHamelin tracks where data comes from when you assign results to struct fields\nin `FROM` or `MATCH` clauses. This lineage tracking lets you correlate events\nfrom different sources while maintaining visibility into which upstream source\ncontributed each piece of data. You can reference this lineage information to\nbuild complex pattern detection queries.\n\n## How lineage sub-struct works\n\nHamelin creates a composite record that preserves the source of each piece of\ndata when you assign query results to struct fields. This happens automatically\nwhen you use assignment syntax in `FROM` or `MATCH` clauses. As an example,\nconsider tracking both failed and successful login events:\n\n```hamelin\nFROM failed = failed_logins, success = successful_logins\n| WINDOW failures = count(failed),\n successes = count(success),\n total = count()\n BY user.id\n WITHIN -5m\n```\n\nThe `failed = failed_logins` assignment creates a struct field that gets\npopulated for events from the failed logins source, while `success =\nsuccessful_logins` creates another struct field that gets populated for events\nfrom the successful logins source. Events from `failed_logins` will have the\n`failed` field populated and `success` as NULL. Events from `successful_logins`\nwill have the `success` field populated and `failed` as NULL. Hamelin maintains\nthis lineage information throughout the query pipeline.\n\n## Accessing lineage data\n\nYou can reference the assigned struct fields directly in queries. The field\nnames become available for filtering, aggregation, and selection:\n\n```hamelin\nFROM failed = security_alerts, success = login_events\n| WHERE failed.severity > 'medium' OR success.user_id IS NOT NULL\n| SELECT failed.alert_type, success.login_time, failed.source_ip\n```\n\nEach event gets lineage tags that indicate which source it came from. Events\nfrom `security_alerts` will have the `failed` field populated with their data\nand `success` as NULL. Events from `login_events` will have the `success` field\npopulated with their data and `failed` as NULL. This lets you access any field\nfrom the original data while knowing exactly which source contributed each\nevent.\n\n## Pattern correlation with lineage\n\nLineage tracking enables sophisticated event correlation patterns. As an\nexample, consider detecting brute force attacks by correlating failed attempts\nwith eventual successes:\n\n```hamelin\nWITH failed_logins = FROM events\n| WHERE event.action == 'login_failed'\n\nWITH successful_logins = FROM events\n| WHERE event.action == 'login_success'\n\nFROM failed = failed_logins, success = successful_logins\n| WINDOW failures = count(failed),\n successes = count(success),\n total = count()\n BY user.id\n WITHIN -5m\n| WHERE successes >= 1 && failures / total > 0.2\n| SELECT user.id,\n failed_count = failures,\n success_count = successes,\n failure_rate = failures / total,\n```\n\nThis query correlates two distinct event patterns within sliding windows. The\nlineage tracking lets you distinguish events by source - events from\n`failed_logins` have the `failed` struct populated, while events from\n`successful_logins` have the `success` struct populated. You can then access\nsource-specific fields and aggregate based on event lineage.\n\n## MATCH clause lineage\n\nThe `MATCH` command also supports lineage tracking when you assign pattern\nresults to struct fields. 
As an example, consider detecting brute force\npatterns that span multiple login attempts:\n\n```hamelin\nWITH failed_logins = FROM events\n| WHERE event.action == 'login_failed'\n\nWITH successful_logins = FROM events\n| WHERE event.action == 'login_success'\n\nMATCH failed_logins = failed_logins{10,}, successful_logins = successful_logins
|
|
506
|
+
"smart-features/lineage-sub-struct.md": "# Lineage sub-struct\n\nHamelin tracks where data comes from when you assign results to struct fields\nin `FROM` or `MATCH` clauses. This lineage tracking lets you correlate events\nfrom different sources while maintaining visibility into which upstream source\ncontributed each piece of data. You can reference this lineage information to\nbuild complex pattern detection queries.\n\n## How lineage sub-struct works\n\nHamelin creates a composite record that preserves the source of each piece of\ndata when you assign query results to struct fields. This happens automatically\nwhen you use assignment syntax in `FROM` or `MATCH` clauses. As an example,\nconsider tracking both failed and successful login events:\n\n```hamelin\nFROM failed = failed_logins, success = successful_logins\n| WINDOW failures = count(failed),\n successes = count(success),\n total = count()\n BY user.id\n WITHIN -5m\n```\n\nThe `failed = failed_logins` assignment creates a struct field that gets\npopulated for events from the failed logins source, while `success =\nsuccessful_logins` creates another struct field that gets populated for events\nfrom the successful logins source. Events from `failed_logins` will have the\n`failed` field populated and `success` as NULL. Events from `successful_logins`\nwill have the `success` field populated and `failed` as NULL. Hamelin maintains\nthis lineage information throughout the query pipeline.\n\n## Accessing lineage data\n\nYou can reference the assigned struct fields directly in queries. The field\nnames become available for filtering, aggregation, and selection:\n\n```hamelin\nFROM failed = security_alerts, success = login_events\n| WHERE failed.severity > 'medium' OR success.user_id IS NOT NULL\n| SELECT failed.alert_type, success.login_time, failed.source_ip\n```\n\nEach event gets lineage tags that indicate which source it came from. Events\nfrom `security_alerts` will have the `failed` field populated with their data\nand `success` as NULL. Events from `login_events` will have the `success` field\npopulated with their data and `failed` as NULL. This lets you access any field\nfrom the original data while knowing exactly which source contributed each\nevent.\n\n## Pattern correlation with lineage\n\nLineage tracking enables sophisticated event correlation patterns. As an\nexample, consider detecting brute force attacks by correlating failed attempts\nwith eventual successes:\n\n```hamelin\nWITH failed_logins = FROM events\n| WHERE event.action == 'login_failed'\n\nWITH successful_logins = FROM events\n| WHERE event.action == 'login_success'\n\nFROM failed = failed_logins, success = successful_logins\n| WINDOW failures = count(failed),\n successes = count(success),\n total = count()\n BY user.id\n WITHIN -5m\n| WHERE successes >= 1 && failures / total > 0.2\n| SELECT user.id,\n failed_count = failures,\n success_count = successes,\n failure_rate = failures / total,\n```\n\nThis query correlates two distinct event patterns within sliding windows. The\nlineage tracking lets you distinguish events by source - events from\n`failed_logins` have the `failed` struct populated, while events from\n`successful_logins` have the `success` struct populated. You can then access\nsource-specific fields and aggregate based on event lineage.\n\n## MATCH clause lineage\n\nThe `MATCH` command also supports lineage tracking when you assign pattern\nresults to struct fields. 
As an example, consider detecting brute force\npatterns that span multiple login attempts:\n\n```hamelin\nWITH failed_logins = FROM events\n| WHERE event.action == 'login_failed'\n\nWITH successful_logins = FROM events\n| WHERE event.action == 'login_success'\n\nMATCH failed_logins = failed_logins{10,}, successful_logins = successful_logins+ WITHIN 10m\n| AGG failed_count = count(failed_logins),\n success_count = count(successful_logins),\n first_failed_ip = min(failed_logins.source_ip),\n success_duration = max(successful_logins.timestamp) - min(successful_logins.timestamp)\n BY user_id\n```\n\nThis pattern detects sequences where at least 10 failed login attempts are\nfollowed by one or more successful logins, with the entire pattern completing \nwithin a 10-minute window. The assignments (`failed_logins =` and `successful_logins =`) create\nlineage tags that identify which pattern each event matched. Events matching the\nfailed login pattern have the `failed_logins` struct populated, while events\nmatching the successful login pattern have the `successful_logins` struct\npopulated. The `AGG` command then operates on these lineage-tagged events to\ncalculate metrics specific to each pattern type. The `count(failed_logins)`\naggregation counts only events that matched the failed login pattern, while\n`count(successful_logins)` counts only events that matched the successful login\npattern. Similarly, `min(failed_logins.source_ip)` accesses the `source_ip`\nfield specifically from events in the failed login pattern, and the timestamp\ncalculations work with the `timestamp` field from events in the successful login\npattern.\n\n## Benefits of lineage tracking\n\nLineage sub-struct provides several key advantages for complex data analysis.\nYou can correlate events from multiple sources while maintaining clear\nattribution of where each piece of data originated. This eliminates confusion\nin queries where data might come from multiple upstream sources with similar\nfield names.\n\nThe feature also enables pattern detection across different event types. You\ncan write queries that aggregate and filter across multiple event patterns\nwhile accessing specific fields from each pattern type. This supports use cases\nlike security monitoring, user behavior analysis, and system performance\ncorrelation.\n",
|
|
507
|
+
"smart-features/struct-field-assignment-desugaring.md": '# Struct field assignment desugaring\n\nHamelin automatically expands struct field assignments into complete struct\nreconstructions, a process called struct field assignment desugaring. When you\nwrite `event.status = "active"` where `event` already contains a struct value,\nHamelin appears to allow you to modify the struct in place. It does this by\nexpanding the entire struct into individual field bindings, and replacing only\nthe target field with your new value. This desugaring happens whenever you\nassign to a field path where a parent identifier already holds a struct value,\nwhether that struct came from an array access, column reference, function call,\nor even a struct literal. The process ensures you can modify deeply nested data\nwithout manually specifying every field that needs to be preserved.\n\n## How struct literals become narrow bindings\n\nBefore understanding field assignment desugaring, you need to see how Hamelin\nhandles struct literals in the first place. When you assign a struct literal\nto an identifier, Hamelin immediately "cracks" it open into individual field\nbindings rather than storing it as a single struct value. Each field in the\nliteral becomes its own binding in the environment, with the field path as its\nidentifier. No parent binding is created at all - the struct literal simply\nexpands directly into its component field bindings. This flattening happens\nimmediately when the struct literal is assigned, not later when fields are\naccessed. These narrow bindings behave exactly like any other bindings in the\nenvironment, which means they can be individually modified, dropped, or used\nin expressions.\n\n```hamelin\nLET event = {user_id: "alice", action: "login", timestamp: now(), status: "pending"}\n\n// Desugars to:\nLET event.user_id = "alice"\n| LET event.action = "login" \n| LET event.timestamp = now()\n| LET event.status = "pending"\n```\n\nThe struct literal assignment creates four field bindings without any parent\nbinding. The `event` identifier never exists as a binding itself - it\'s just\nthe prefix that groups these fields conceptually. Each field binding stands\nalone and can be referenced directly as `event.user_id` or used in expressions.\nIf you SELECT `event` later, Hamelin reconstructs the struct from these\nindividual bindings. This decomposition into narrow bindings makes struct\nliterals extremely flexible for manipulation.\n\nYou can add new fields to this flattened struct simply by creating new\nbindings with the appropriate prefix. When you write `LET event.ip_address =\n"192.168.1.1"`, you\'re just adding another binding to the environment that\nhappens to share the `event` prefix. Hamelin recognizes this pattern and\nincludes the new field when reconstructing the struct. Similarly, you can\nmodify existing fields by reassigning them - `LET event.status = "success"`\nsimply overwrites the existing `event.status` binding. 
The struct literal\ndecomposition makes these operations natural extensions of regular variable\nassignment.\n\n```hamelin\nLET event = {user_id: "alice", action: "login", timestamp: now()}\n| LET event.session_id = generate_id() // Adds new field\n| LET event.action = "authenticated" // Modifies existing field\n| DROP event.timestamp // Removes field\n\n// Results in these bindings:\nLET event.user_id = "alice"\n| LET event.action = "authenticated"\n| LET event.session_id = generate_id()\n```\n\n## Non-splittable struct expressions\n\nNot all struct values get flattened into narrow bindings like struct literals\ndo. When you access a struct from an array element, column reference, or\nfunction call, Hamelin stores it as a single struct-typed binding rather than\ndecomposing it. These "non-splittable" struct expressions maintain their\nstructure as a cohesive unit in the environment. The distinction matters\nbecause modifying a field within these structs requires a different approach -\nthis is where field assignment desugaring becomes essential. Without\ndesugaring, you\'d have to manually extract every field, modify the one you\nwant, and rebuild the entire struct.\n\n```hamelin\nFROM events\n| LET first = events[0]\n\n// This creates just one binding:\n// first = <struct value from array>\n\n// NOT these bindings:\n// first.user_id = ...\n// first.action = ...\n// etc.\n```\n\nThe array access creates a single binding `first` that contains the entire\nstruct value. Unlike with struct literals, there are no individual field\nbindings for `first.user_id`, `first.action`, etc. The struct remains intact\nas a single value in the environment. If you want to modify a field within\nthis struct, you can\'t just reassign `first.action` because that binding\ndoesn\'t exist yet. This is the key difference between splittable struct\nliterals and non-splittable struct expressions.\n\n## The desugaring transformation\n\nAssigning to a field of a non-splittable struct triggers Hamelin\'s desugaring\ntransformation. The assignment `first.status = "active"` where `first` \ncontains a struct value starts the desugaring process. Hamelin recognizes that \n`first` holds a struct and expands the entire assignment into a series of \noperations that reconstruct the struct with the modification. The system doesn\'t \nerror or create an isolated field - it transforms the single assignment into \nmultiple bindings. The desugaring creates explicit bindings for every field, \nwith your target field getting the new value and all other fields preserving \ntheir original values through column references.\n\n```hamelin\nFROM events \n| LET first = events[0] // first contains {user_id, action, timestamp, status}\n| LET first.status = "active"\n\n// Desugars to:\nLET first.status = "active"\n| LET first.user_id = first.user_id\n| LET first.action = first.action\n| LET first.timestamp = first.timestamp\n```\n\nThe desugaring process transforms one field assignment into multiple bindings\nthat collectively reconstruct the struct. Your assigned field `first.status`\ngets the value `"active"` as intended. Every other field in the original\nstruct gets a binding that references its current value through a column\nreference expression. This transformation happens transparently during\ncompilation, converting your simple field assignment into the equivalent\nmanual reconstruction.\n\n## Multiple field modifications\n\nWhen you modify multiple fields of the same struct, each modification builds\non the previous state. 
The first modification expands the struct into \nindividual bindings, and subsequent modifications operate on those bindings \nrather than the original struct. Each field assignment layers on top of the\nprevious changes, creating a natural composition pattern. The order of\nmodifications matters because each desugaring uses the current environment\nstate, which includes bindings created by earlier desugarings. This incremental\napproach lets you progressively transform structs through a series of field \nassignments without complex manual reconstruction.\n\n```hamelin\nFROM security_logs\n| LET entry = logs[0] // {event_id, severity, message, timestamp}\n| LET entry.severity = "HIGH"\n| LET entry.reviewed = true\n| LET entry.reviewer = "alice"\n\n// First assignment desugars to:\nLET entry.severity = "HIGH"\n| LET entry.event_id = entry.event_id\n| LET entry.message = entry.message \n| LET entry.timestamp = entry.timestamp\n\n// Second assignment adds:\nLET entry.reviewed = true\n\n// Third assignment adds:\nLET entry.reviewer = "alice"\n```\n\nThe first field assignment triggers a complete desugaring that creates\nbindings for all original fields. The `entry.severity` field gets the new\nvalue "HIGH" while other fields get column reference bindings. When you assign\n`entry.reviewed = true`, Hamelin recognizes that `entry` already has\nfield bindings from the desugaring, so it simply adds the new binding without\ntriggering another desugaring. The third assignment similarly adds the\n`reviewer` field. The final result is a struct with all original fields plus\nthe two new fields you added.\n\n## Nested struct modification\n\nDesugaring handles nested struct modifications by identifying the correct\nlevel to expand based on which identifier holds an actual struct value. When\nyou write `event.metadata.risk_score = 100`, Hamelin examines each level of\nthe path to find where the struct boundary exists. If `event` holds a struct\nvalue but `metadata` is just a field within it (not a separate binding), the\ndesugaring happens at the `event` level. The system reconstructs the entire\n`event` struct with a modified `metadata` field that contains your new\n`risk_score`. This process works recursively, handling arbitrary levels of\nnesting as long as each level properly resolves to struct types.\n\n```hamelin\nFROM security_events\n| LET event = events[0] // {id, metadata: {risk_score: 50, category: "auth"}, timestamp}\n| LET event.metadata = {risk_score: 100, category: event.metadata.category, flagged: true}\n\n// Desugars to:\nLET event.metadata.risk_score = 100\n| LET event.metadata.category = event.metadata.category\n| LET event.metadata.flagged = true\n| LET event.id = event.id\n| LET event.timestamp = event.timestamp\n```\n\nModifying nested structs requires reconstructing the nested struct explicitly\nas shown above. The assignment to `event.metadata` replaces the entire nested\nstruct with a new one that includes the modifications. The desugaring process \nensures that all other fields of the parent `event` struct get preserved \nthrough column references. This pattern of replacing entire\nnested structs while preserving parent fields is common when working with\ndeeply nested data structures.\n\n## How DROP uses desugaring\n\nThe DROP command leverages the same desugaring mechanism but with inverted\nlogic - instead of adding or modifying fields, it reconstructs the struct with\nspecific fields excluded. 
When you DROP a field from a non-splittable struct,\nHamelin expands all fields except the ones being dropped. This creates the\nsame pattern of bindings with column references, but omits the dropped fields\nentirely. The resulting bindings represent the struct minus the dropped\nfields. This elegant reuse of the desugaring mechanism makes DROP work\nconsistently with the rest of Hamelin\'s struct manipulation features.\n\n```hamelin\nFROM user_events\n| LET user = events[0] // {id, email, password_hash, last_login, preferences}\n| DROP user.password_hash, user.email\n\n// Desugars to (includes all fields EXCEPT dropped ones):\nLET user.id = user.id\n| LET user.last_login = user.last_login\n| LET user.preferences = user.preferences\n```\n\nThe DROP desugaring creates bindings for every field except `password_hash`\nand `email`, effectively removing them from the struct. Each preserved field\ngets a column reference binding that maintains its current value. If you later\nSELECT `user`, you\'ll get a struct containing only `id`, `last_login`, and\n`preferences`. This approach to dropping fields integrates seamlessly with\nother struct operations.\n\n## Field ordering in desugared structs\n\nWhen Hamelin reconstructs a struct after desugaring, modified fields appear\nfirst, followed by preserved fields in their original order. This consistent\nordering rule makes modifications visible in the struct\'s schema and helps track\nwhat changed. You see this ordering whether modifying fields through\nassignment or adding new fields to an existing struct. Each modification moves\nthat field to the front of the field order, creating a kind of modification\nhistory encoded in the structure itself. While this changes the field order from\nthe original struct, it provides valuable visibility into what transformations\nhave been applied, and since Hamelin fields have nominal equivalence rather than\npositional equivalence, this reordering is safe.\n\n```hamelin\nFROM events\n| LET evt = events[0] // Original: {timestamp, user_id, action, session_id, ip_address}\n| LET evt.action = "logout"\n| LET evt.duration = 30\n\n// Resulting field order:\n// {duration, action, timestamp, user_id, session_id, ip_address}\n```\n\nThe modified `action` field and newly added `duration` field both appear at\nthe front of the reconstructed struct. The original fields maintain their\nrelative ordering (`timestamp` still comes before `user_id`, which comes\nbefore `session_id`). This predictable ordering helps when debugging or when\nyou need to understand what transformations have been applied to a struct. The\nconsistency of this rule across all struct operations makes Hamelin\'s behavior\npredictable even with complex nested modifications.\n',
|
|
507
508
|
"smart-features/type-expansion.md": "# Type expansion\n\nHamelin expands types when you query multiple datasets. If you write `FROM\nevents, logs`, Hamelin creates an *expanded type* that includes all fields from\nboth sources. This lets you write queries that work across datasets without\nworrying about schema differences.\n\n## How type expansion works\n\nHamelin constructs expanded types by combining field names from all source\ntypes. Fields with the same name get aligned into a single field in the\nexpanded type. This lets you write queries that work across datasets with\nconsistent field naming. As an example, consider searching across different\nauthentication logs with varying schemas:\n\n```hamelin\nFROM security_logs, audit_logs, access_logs\n| WHERE action = 'login' OR event_type = 'authentication'\n| SELECT timestamp, user_id, source_ip, action, event_type, session_id\n```\n\nEach logging system has its own schema:\n\nSecurity logs track authentication attempts with IP addresses:\n\n```hamelin\n{timestamp: timestamp, user_id: string, source_ip: string, action: string}\n```\n\nAudit logs capture detailed session information:\n\n```hamelin\n{timestamp: timestamp, user_id: string, event_type: string, session_id: string}\n```\n\nAccess logs record basic user activity:\n\n```hamelin\n{timestamp: timestamp, user_id: string, source_ip: string}\n```\n\nThe expanded result type becomes:\n\n```hamelin\n{\n timestamp: timestamp,\n user_id: string,\n source_ip: string,\n action: string,\n event_type: string,\n session_id: string\n}\n```\n\nRows from `security_logs` will have `NULL` for the `event_type` and\n`session_id` fields. Rows from `audit_logs` will have `NULL` for the\n`source_ip` and `action` fields. Rows from `access_logs` will have `NULL` for\nthe `action`, `event_type`, and `session_id` fields.\n\nThe result contains rows like:\n\n| timestamp | user_id | source_ip | action | event_type | session_id |\n|-----------|---------|-----------|---------|------------|------------|\n| 2024-01-15 10:30:00 | alice.smith | 192.168.1.100 | login | NULL | NULL |\n| 2024-01-15 10:31:00 | alice.smith | NULL | NULL | authentication | sess_abc123 |\n| 2024-01-15 10:32:00 | alice.smith | 192.168.1.100 | NULL | NULL | NULL |\n\nAll three datasets contribute to the same `timestamp` and `user_id` fields\nbecause they use identical field names. You can filter and select on shared\nfields without knowing which source contributed each row. This lets you write a\nsingle query to search for authentication events across all systems, even\nthough each system logs different fields. The expanded type accommodates all\npossible fields, and you can filter on any field that exists in any source.\n\n\n\n## Nested type expansion\n\nType expansion works with nested structures. Hamelin expands the type hierarchy\nto accommodate nested fields from different sources. 
As an example, consider\ncombining user data from different systems:\n\n```hamelin\nFROM user_profiles, account_settings\n| SELECT user.name, user.email, user.preferences\n```\n\nEach system has its own nested user structure:\n\nUser profiles contain basic identity information:\n\n```hamelin\n{user: {name: string, email: string}}\n```\n\nAccount settings store user preferences:\n\n```hamelin\n{user: {preferences: string}}\n```\n\nThe expanded result type becomes:\n\n```hamelin\n{user: {name: string, email: string, preferences: string}}\n```\n\nThis means you can access `user.name` from profile data and `user.preferences`\nfrom settings data in the same query, even though the original sources have\ndifferent nested structures.\n\n :::note\n\n Hamelin maintains stable field ordering when merging nested structures. Fields\n from the first source appear first, then fields from the second source are\n added in their original order. This consistent ordering means you can rely on\n the structure of expanded types being predictable across queries.\n\n :::\n\n## Array literal expansion\n\nType expansion also happens when you create array literals containing struct\ntypes with different schemas. Just like `FROM` clauses, Hamelin creates an\nexpanded type that accommodates all fields from every struct in the array. As\nan example, consider creating an array mixing user records with different\navailable fields:\n\n```hamelin\nLET mixed_users = [\n {name: 'Alice', age: 30, department: 'Engineering'},\n {name: 'Bob', email: 'bob@company.com', age: 25},\n {name: 'Carol', email: 'carol@company.com', department: 'Sales'}\n]\n```\n\nEach struct has its own schema:\n\nThe first user record has name, age, and department:\n\n```hamelin\n{name: string, age: number, department: string}\n```\n\nThe second user record has name, email, and age:\n\n```hamelin\n{name: string, email: string, age: number}\n```\n\nThe third user record has name, email, and department:\n\n```hamelin\n{name: string, email: string, department: string}\n```\n\nThe expanded array type becomes:\n\n```hamelin\n[{name: string, age: number, department: string, email: string}]\n```\n\nEach element gets `NULL` values for missing fields. The first element has\n`NULL` for `email`. The second element has `NULL` for `department`. The third\nelement has `NULL` for `age`. This lets you create arrays from structs with\ndifferent schemas while maintaining type consistency across all elements.\n\nYou can then query the expanded array just like any other dataset:\n\n```hamelin\nLET mixed_users = [\n {name: 'Alice', age: 30, department: 'Engineering'},\n {name: 'Bob', email: 'bob@company.com', age: 25},\n {name: 'Carol', email: 'carol@company.com', department: 'Sales'}\n]\n| UNNEST mixed_users\n```\n\nThe query works across all elements regardless of which fields were originally\npresent in each struct. Missing fields appear as `NULL` in the results, just\nlike with `FROM` clause expansion.\n\nThe results would be:\n\n| name | age | department | email |\n|------|-----|------------|-------|\n| Alice | 30 | Engineering | NULL |\n| Bob | 25 | NULL | bob@company.com |\n| Carol | NULL | Sales | carol@company.com |\n",
|
|
508
509
|
"types/array.md": '# Array\n\nArrays let you work with collections of values. Hamelin arrays work much like SQL arrays, but they integrate seamlessly with structs to handle complex nested data.\n\n## Creating arrays\n\nYou create arrays using square brackets with elements separated by commas.\n\n```hamelin\nLET event_types = ["login", "logout", "purchase"]\nLET user_ids = [1001, 1002, 1003, 1004]\nLET timestamps = [ts(\'2024-01-15T10:00:00\'), ts(\'2024-01-15T10:15:00\'), ts(\'2024-01-15T10:30:00\')]\n```\n\nArrays can contain any type of value, including numbers, strings, timestamps, and even structs.\n\n## Arrays of structs\n\nArrays become especially useful when they contain structs, as Hamelin automatically handles differences between struct fields.\n\n```hamelin\nLET user_events = [\n {user_id: 1001, event: "login", timestamp: ts(\'2024-01-15T09:00:00\')},\n {user_id: 1001, event: "purchase", timestamp: ts(\'2024-01-15T09:15:00\'), amount: 49.99},\n {user_id: 1001, event: "logout", timestamp: ts(\'2024-01-15T09:30:00\')}\n]\n```\n\nNotice how the second struct has an `amount` field that the others don\'t have. Hamelin automatically creates a combined type that includes all fields, setting missing fields to `NULL` where needed.\n\n## Accessing array elements\n\nUse square brackets with zero-based indexing to access individual elements.\n\n```hamelin\nFROM events\n| SELECT \n first_tag = tags[0],\n second_tag = tags[1],\n last_tag = tags[-1]\n```\n\nNegative indices count from the end of the array, so `[-1]` gives you the last element.\n\n## Array operations with mixed structs\n\nWhen you combine arrays containing structs with different fields, Hamelin merges the struct types intelligently.\n\n```hamelin\nLET login_events = [\n {event_type: "login", user_id: 100, timestamp: ts(\'2024-01-15T09:00:00\')},\n {event_type: "login", user_id: 101, timestamp: ts(\'2024-01-15T09:05:00\')}\n]\n\nLET purchase_events = [\n {event_type: "purchase", user_id: 100, amount: 25.99, timestamp: ts(\'2024-01-15T09:10:00\')}\n]\n\n// Combining these creates an array with all fields: event_type, user_id, timestamp, amount\n```\n\nThe resulting combined array contains structs where each element has all the fields that appear in any struct, with `NULL` values where fields are missing.\n\n## Field ordering in combined structs\n\nWhen Hamelin combines structs with different fields, it maintains the field order from the first struct encountered, then appends any new fields in the order they first appear.\n\n```hamelin\nLET events = [\n {id: 1, type: "login", user_id: 100}, // Order: id, type, user_id\n {status: "success", id: 2, type: "logout"} // New field \'status\' gets appended\n]\n\n// Result order: id, type, user_id, status\n```\n\n## Type compatibility\n\nArrays can only contain elements that can be coerced to a common type. 
Hamelin will combine compatible types automatically, but incompatible types will cause an error.\n\n```hamelin\n// This works - numbers can be in the same array\nLET mixed_numbers = [1, 2.5, 3]\n\n// This works - structs with compatible fields\nLET compatible_structs = [\n {name: "Alice", age: 30},\n {name: "Bob", age: 25, city: "Seattle"}\n]\n\n// This would fail - structs with same field name but different types\n// {name: "Alice", count: 5} and {name: "Bob", count: "many"} \n```\n\n## Practical examples\n\nArrays work well for collecting related values and organizing repeated data.\n\n```hamelin\n// Collecting user actions over time\nFROM user_logs\n| SELECT \n user_id,\n user_session = {\n daily_events: [\n {action: "login", time: login_time},\n {action: "view_page", time: page_view_time, page: page_name},\n {action: "logout", time: logout_time}\n ],\n session_duration: logout_time - login_time\n }\n```\n\nThis creates structured output where each user\'s session contains an array of different event types, each with their own specific fields.\n\n## Working with nested arrays\n\nArrays can contain structs that themselves contain arrays, creating complex nested structures.\n\n```hamelin\nFROM dns_logs\n| SELECT dns_response = {\n query: query_name,\n answers: [\n {name: answer1_name, type: answer1_type, ttl: answer1_ttl},\n {name: answer2_name, type: answer2_type, ttl: answer2_ttl}\n ],\n response_time: query_duration\n }\n```\n\nThis organizes DNS response data where each query can have multiple answers, and each answer has its own set of fields.',
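To see the combined struct type from the `user_events` example above laid out as rows, a short sketch that reuses the `LET ... | UNNEST` pattern shown on the type-expansion page (assuming it applies to this array the same way):

```hamelin
LET user_events = [
    {user_id: 1001, event: "login", timestamp: ts('2024-01-15T09:00:00')},
    {user_id: 1001, event: "purchase", timestamp: ts('2024-01-15T09:15:00'), amount: 49.99},
    {user_id: 1001, event: "logout", timestamp: ts('2024-01-15T09:30:00')}
]
| UNNEST user_events
// Only the "purchase" row has a non-NULL amount; the login and logout rows
// get NULL for it, because the combined struct type includes every field.
```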
|
|
509
510
|
"types/casting.md": "# Casting\n\nTo cast, use the infix operator `AS`. Hamelin uses the `AS` operator for explicit type casting. You write the value, then `AS`, then the type you want.\n\nThe two most common reasons to cast are:\n\n- Casting variant to explicit types after parsing JSON\n- Casting types to string to concatenate them together\n\n## Basic syntax\n\nCast a value by putting `AS` between the value and the target type:\n\n```hamelin\n| LET x = 5 AS double\n```\n\nThis creates a double-precision value instead of an integer.\n\n## Why `AS` for casting?\n\nYou'll use explicit casting often, especially when declaring literals to influence type inference. We wanted something terse. Using `AS` for assignment confuses people (the order seems backwards). This frees up `AS` for casting, which reads cleanly: *treat this one thing as another type*.\n\n## How it works\n\nThe `AS` operator translates explicit cast expressions into the generated code. We often actually translate to `try_cast()` in order to make sure the query doesn't crash.\n\nHamelin delegates **implicit casting to the underlying engine** \u2014 if you assign a value to a typed column or pass it to a function that expects a different type, the engine decides whether and how to cast the value.\n\n## Common casting examples\n\n### String conversions\nConvert values to strings for display or storage:\n\n```hamelin\nFROM events\n| SELECT\n user_id_str = user_id AS string,\n timestamp_str = timestamp AS string,\n status_display = status_code AS string\n```\n\n### Numeric conversions\nConvert between different numeric types or from strings to numbers:\n\n```hamelin\nFROM logs\n| SELECT\n status_code = response_code AS integer,\n response_time = response_time_str AS double,\n user_count = total_users AS integer\n```\n\n### Boolean conversions\nConvert various values to boolean types:\n\n```hamelin\nFROM user_data\n| SELECT\n user_id,\n is_active = status_flag AS boolean,\n has_permissions = permission_level AS boolean\n```\n\n## Type inference with casting\n\nYou can influence type inference in variable declarations by casting literals:\n\n```hamelin\nFROM events\n| LET\n threshold = 100 AS double,\n max_retries = 5 AS integer,\n default_timeout = 30.0 AS double\n| WHERE response_time > threshold\n```\n\n## Complex type casting\n\n### Array casting\nCast arrays to specific element types:\n\n```hamelin\nFROM json_data\n| SELECT\n tags = tag_list AS array(string),\n scores = score_array AS array(double)\n```\n\n### Struct casting\nCast structured data to specific field types:\n\n```hamelin\nFROM structured_data\n| SELECT\n user_info = user_data AS {name: string, email: string},\n coordinates = location AS {x: double, y: double}\n```\n",
|