@forzalabs/remora 1.1.15 → 1.2.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CHANGELOG.md +107 -0
- package/index.js +196 -2
- package/json_schemas/consumer-schema.json +107 -21
- package/package.json +1 -1
- package/workers/ExecutorWorker.js +196 -2
package/CHANGELOG.md
ADDED
|
@@ -0,0 +1,107 @@
|
|
|
1
|
+
# Changelog
|
|
2
|
+
|
|
3
|
+
All notable changes to this project will be documented in this file.
|
|
4
|
+
|
|
5
|
+
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
|
|
6
|
+
|
|
7
|
+
## Unreleased
|
|
8
|
+
|
|
9
|
+
### Added
|
|
10
|
+
- Added field-level consumer validations with support for multiple rules per field and per-rule failure actions: `fail`, `skip`, `warn`, and `set_default`
|
|
11
|
+
- Added dataset-level consumer validations for `unique_fields`, `min_rows`, `max_rows`, `no_duplicates`, and `not_empty`
|
|
12
|
+
- Added `DataValidationEngine` to centralize field and dataset validation logic
|
|
13
|
+
- Added validation result type definitions to the definitions package for shared use across engines and executors
|
|
14
|
+
- Added `warn()` logging support for non-fatal validation outcomes
|
|
15
|
+
- Added canary consumer coverage for field-level and dataset-level validations with passing, warning, skipped, defaulted, and failing scenarios
|
|
16
|
+
- Added `verify:local` to the canary package to build the local CLI and run the canary suite against it instead of the published package
|
|
17
|
+
|
|
18
|
+
### Changed
|
|
19
|
+
- Updated consumer field validation configuration from a single flat validation object to an ordered array of validation rules with explicit `onFail` behavior
|
|
20
|
+
|
|
21
|
+
### Fixed
|
|
22
|
+
- Fixed the consumer JSON schema to support the new field-level and dataset-level validation configuration
|
|
23
|
+
- Fixed AJV strict-mode compatibility for validation `in` and `not_in` rule arrays by replacing union `type` declarations with `oneOf`
|
|
24
|
+
|
|
25
|
+
## V 1.1.15 - 2026-03-26
|
|
26
|
+
|
|
27
|
+
### Added
|
|
28
|
+
- Added validation that consumer field keys exist in the referenced producer's dimensions/measures
|
|
29
|
+
- Added validation that every consumer field defines the required `key` property
|
|
30
|
+
- Added validation that `copyFrom` references a field that appears earlier in the consumer's field list
|
|
31
|
+
- Added validation that `distinctOn` keys and `orderBy` reference fields present in the consumer
|
|
32
|
+
- Added validation that join SQL `${P.field}` and `${producer.field}` references point to valid fields
|
|
33
|
+
|
|
34
|
+
### Fixed
|
|
35
|
+
- Fixed environment variable not exposed to front-end
|
|
36
|
+
- Fixed database endpoint selector
|
|
37
|
+
- Fixed worker image volume usage in a cloud environment
|
|
38
|
+
- Fixed worker-thread execution errors being logged only to the terminal by propagating them back to the orchestrator file logger
|
|
39
|
+
- Fixed `ConsumerExecutor.processRecord` error reporting to log step-specific failures for field resolution, aliasing, transformations, and filter evaluation
|
|
40
|
+
|
|
41
|
+
## V 1.1.11 - 2026-02-05
|
|
42
|
+
|
|
43
|
+
### Added
|
|
44
|
+
- Added `startRow` and `startColumn` settings for Excel producers (.xls/.xlsx), allowing users to specify the 1-indexed row and column from which to begin reading data
|
|
45
|
+
|
|
46
|
+
### Fixed
|
|
47
|
+
- Fixed `MaxListenersExceededWarning` on `WriteStream` during consumer execution by replacing shared stream merge with per-file append pipelines
|
|
48
|
+
- Fixed CLI `run` command always exiting with code 1 even on successful runs
|
|
49
|
+
- Fixed incomplete file logging caused by `process.exit()` terminating before winston could flush buffered writes; added `logger.flush()` to worker threads, orchestrator, and CLI exit paths
|
|
50
|
+
- Added `logger.flush()` before `process.exit()` in all data-processing CLI actions (sample, mock, automap, discover, debug) and worker startup
|
|
51
|
+
- Fixed CLI `discover` command exiting with code 1 on success instead of code 0
|
|
52
|
+
- Fixed per-worker `WriteStream` in `Executor.ts` never being closed, risking data loss before distinct/distinctOn post-processing passes
|
|
53
|
+
- Fixed `Dataset.ts` stream await pattern where `resolve` was never called in the `finish` handler (5 sites: transformStream, sort batches, k-way merge, append), causing promises to hang indefinitely
|
|
54
|
+
- Fixed `ExecutorWriter.ts` not awaiting intermediate stream flush during file-size-based rotation
|
|
55
|
+
- Fixed `DriverHelper.appendObjectsToUnifiedFile` and `LocalDestinationDriver.transformAndMove` not awaiting stream flush before returning
|
|
56
|
+
|
|
57
|
+
## V 1.1.9 - 2026-02-04
|
|
58
|
+
|
|
59
|
+
### Added
|
|
60
|
+
- Added `switch_case` transformation for mapping specific values to other values (similar to a switch/case statement)
|
|
61
|
+
- Added validation to detect multiple consumer fields reading from the same producer dimension (suggests using `copyFrom` instead)
|
|
62
|
+
- Added detailed logging to the executor orchestrator with usage ID tracing throughout the execution lifecycle
|
|
63
|
+
|
|
64
|
+
### Changed
|
|
65
|
+
- Cleaned up CLI execution error output to show concise messages in console while preserving full stack traces in internal logs
|
|
66
|
+
|
|
67
|
+
## V 1.1.8 - 2026-02-03
|
|
68
|
+
|
|
69
|
+
### Added
|
|
70
|
+
- Added `pivot` option to consumers, enabling row-to-column transformation with aggregation (sum, count, avg, min, max)
|
|
71
|
+
- Added `copyFrom` property to consumer fields, allowing a field to be a value copy of another field in the dataset
|
|
72
|
+
|
|
73
|
+
## V 1.1.7 - 2026-02-02
|
|
74
|
+
|
|
75
|
+
### Changed
|
|
76
|
+
- Improved the mock engine
|
|
77
|
+
- Improved logging
|
|
78
|
+
|
|
79
|
+
## V 1.1.6 - 2026-02-02
|
|
80
|
+
|
|
81
|
+
### Added
|
|
82
|
+
- Added `--limit` option to `remora run` command to process only the first N records
|
|
83
|
+
- Added descriptive error messages for failed field transformations with full stack trace preservation
|
|
84
|
+
- Added file logging with rotation (enabled via `REMORA_DEBUG_MODE=true` in production)
|
|
85
|
+
- Added structured logging across key application areas
|
|
86
|
+
|
|
87
|
+
### Changed
|
|
88
|
+
- Moved `DEBUG_MODE` from project.json settings to `REMORA_DEBUG_MODE` environment variable
|
|
89
|
+
|
|
90
|
+
## V 1.1.5 - 2026-02-01
|
|
91
|
+
|
|
92
|
+
### Added
|
|
93
|
+
- Refactored for monorepository
|
|
94
|
+
- Added output maximum file size definable from consumer
|
|
95
|
+
- Added support for nested subfolders inside remora configuration directories (sources, producers, consumers, schemas)
|
|
96
|
+
|
|
97
|
+
### Fixed
|
|
98
|
+
- Bug in parsing via GZ file
|
|
99
|
+
- Issues with concurrent requests
|
|
100
|
+
|
|
101
|
+
### Changed
|
|
102
|
+
- Dockerfile for apps in a monorepo build
|
|
103
|
+
- Package.json to workspaces compliance
|
|
104
|
+
- Refactored internal module structure
|
|
105
|
+
- Removed the _file annotations for environment variables
|
|
106
|
+
|
|
107
|
+
## V 1.0.18
|
package/index.js
CHANGED
|
@@ -13357,6 +13357,10 @@ var Logger = class {
|
|
|
13357
13357
|
console.info(message);
|
|
13358
13358
|
FileLogService_default.write("INFO", String(message));
|
|
13359
13359
|
};
|
|
13360
|
+
this.warn = (message) => {
|
|
13361
|
+
console.warn(message);
|
|
13362
|
+
FileLogService_default.write("WARN", String(message));
|
|
13363
|
+
};
|
|
13360
13364
|
this.flush = () => FileLogService_default.flush();
|
|
13361
13365
|
this.close = () => FileLogService_default.close();
|
|
13362
13366
|
this.error = (error) => {
|
|
@@ -13500,7 +13504,7 @@ var import_promises = __toESM(require("fs/promises"), 1);
|
|
|
13500
13504
|
|
|
13501
13505
|
// ../../packages/constants/src/Constants.ts
|
|
13502
13506
|
var CONSTANTS = {
|
|
13503
|
-
cliVersion: "1.1
|
|
13507
|
+
cliVersion: "1.2.1",
|
|
13504
13508
|
backendVersion: 1,
|
|
13505
13509
|
backendPort: 5088,
|
|
13506
13510
|
workerVersion: 2,
|
|
@@ -18482,6 +18486,112 @@ var TransformationEngineClass = class {
|
|
|
18482
18486
|
var TransformationEngine = new TransformationEngineClass();
|
|
18483
18487
|
var TransformationEngine_default = TransformationEngine;
|
|
18484
18488
|
|
|
18489
|
+
// ../../packages/engines/src/transform/DataValidationEngine.ts
|
|
18490
|
+
var DataValidationEngineClass = class {
|
|
18491
|
+
constructor() {
|
|
18492
|
+
this.applyValidations = (value, validations, fieldKey) => {
|
|
18493
|
+
for (const validation of validations) {
|
|
18494
|
+
const passed = this.evaluateRule(value, validation.rule);
|
|
18495
|
+
if (!passed) {
|
|
18496
|
+
return {
|
|
18497
|
+
valid: false,
|
|
18498
|
+
message: this.buildMessage(value, validation.rule, fieldKey),
|
|
18499
|
+
onFail: validation.onFail
|
|
18500
|
+
};
|
|
18501
|
+
}
|
|
18502
|
+
}
|
|
18503
|
+
return { valid: true };
|
|
18504
|
+
};
|
|
18505
|
+
this.evaluateRule = (value, rule) => {
|
|
18506
|
+
if ("required" in rule)
|
|
18507
|
+
return Algo_default.hasVal(value);
|
|
18508
|
+
if ("min" in rule) {
|
|
18509
|
+
if (!Algo_default.hasVal(value)) return true;
|
|
18510
|
+
const num = Number(value);
|
|
18511
|
+
return !isNaN(num) && num >= rule.min;
|
|
18512
|
+
}
|
|
18513
|
+
if ("max" in rule) {
|
|
18514
|
+
if (!Algo_default.hasVal(value)) return true;
|
|
18515
|
+
const num = Number(value);
|
|
18516
|
+
return !isNaN(num) && num <= rule.max;
|
|
18517
|
+
}
|
|
18518
|
+
if ("regex" in rule) {
|
|
18519
|
+
if (!Algo_default.hasVal(value)) return true;
|
|
18520
|
+
return new RegExp(rule.regex).test(String(value));
|
|
18521
|
+
}
|
|
18522
|
+
if ("min_length" in rule) {
|
|
18523
|
+
if (!Algo_default.hasVal(value)) return true;
|
|
18524
|
+
return String(value).length >= rule.min_length;
|
|
18525
|
+
}
|
|
18526
|
+
if ("max_length" in rule) {
|
|
18527
|
+
if (!Algo_default.hasVal(value)) return true;
|
|
18528
|
+
return String(value).length <= rule.max_length;
|
|
18529
|
+
}
|
|
18530
|
+
if ("in" in rule) {
|
|
18531
|
+
return rule.in.includes(value);
|
|
18532
|
+
}
|
|
18533
|
+
if ("not_in" in rule) {
|
|
18534
|
+
return !rule.not_in.includes(value);
|
|
18535
|
+
}
|
|
18536
|
+
return true;
|
|
18537
|
+
};
|
|
18538
|
+
this.buildMessage = (value, rule, fieldKey) => {
|
|
18539
|
+
const preview = Algo_default.hasVal(value) ? JSON.stringify(value) : "null/undefined";
|
|
18540
|
+
if ("required" in rule) return `Field "${fieldKey}" is required but got ${preview}`;
|
|
18541
|
+
if ("min" in rule) return `Field "${fieldKey}" value ${preview} is below minimum ${rule.min}`;
|
|
18542
|
+
if ("max" in rule) return `Field "${fieldKey}" value ${preview} exceeds maximum ${rule.max}`;
|
|
18543
|
+
if ("regex" in rule) return `Field "${fieldKey}" value ${preview} does not match pattern "${rule.regex}"`;
|
|
18544
|
+
if ("min_length" in rule) return `Field "${fieldKey}" value ${preview} is shorter than minimum length ${rule.min_length}`;
|
|
18545
|
+
if ("max_length" in rule) return `Field "${fieldKey}" value ${preview} exceeds maximum length ${rule.max_length}`;
|
|
18546
|
+
if ("in" in rule) return `Field "${fieldKey}" value ${preview} is not in the allowed values`;
|
|
18547
|
+
if ("not_in" in rule) return `Field "${fieldKey}" value ${preview} is in the disallowed values`;
|
|
18548
|
+
return `Field "${fieldKey}" failed validation`;
|
|
18549
|
+
};
|
|
18550
|
+
this.evaluateDatasetValidations = (validations, context) => {
|
|
18551
|
+
const results = [];
|
|
18552
|
+
for (const validation of validations) {
|
|
18553
|
+
const result = this.evaluateDatasetRule(validation, context);
|
|
18554
|
+
if (result) results.push(result);
|
|
18555
|
+
}
|
|
18556
|
+
return results;
|
|
18557
|
+
};
|
|
18558
|
+
this.extractUniqueFieldKeys = (validations) => {
|
|
18559
|
+
return validations.filter((v) => "unique_fields" in v.rule).flatMap((v) => v.rule.unique_fields);
|
|
18560
|
+
};
|
|
18561
|
+
this.hasRule = (validations, ruleKey) => {
|
|
18562
|
+
return validations.some((v) => ruleKey in v.rule);
|
|
18563
|
+
};
|
|
18564
|
+
this.evaluateDatasetRule = (validation, context) => {
|
|
18565
|
+
const { rule, onFail } = validation;
|
|
18566
|
+
const { rowCount, hasDuplicateRows, duplicateFields } = context;
|
|
18567
|
+
if ("not_empty" in rule) {
|
|
18568
|
+
if (rowCount === 0)
|
|
18569
|
+
return { message: "Dataset is empty", onFail };
|
|
18570
|
+
}
|
|
18571
|
+
if ("min_rows" in rule) {
|
|
18572
|
+
if (rowCount < rule.min_rows)
|
|
18573
|
+
return { message: `Dataset has ${rowCount} rows, expected at least ${rule.min_rows}`, onFail };
|
|
18574
|
+
}
|
|
18575
|
+
if ("max_rows" in rule) {
|
|
18576
|
+
if (rowCount > rule.max_rows)
|
|
18577
|
+
return { message: `Dataset has ${rowCount} rows, expected at most ${rule.max_rows}`, onFail };
|
|
18578
|
+
}
|
|
18579
|
+
if ("no_duplicates" in rule) {
|
|
18580
|
+
if (hasDuplicateRows)
|
|
18581
|
+
return { message: "Dataset contains duplicate rows", onFail };
|
|
18582
|
+
}
|
|
18583
|
+
if ("unique_fields" in rule) {
|
|
18584
|
+
const failedFields = rule.unique_fields.filter((f) => duplicateFields.includes(f));
|
|
18585
|
+
if (failedFields.length > 0)
|
|
18586
|
+
return { message: `Duplicate values found in field(s): ${failedFields.join(", ")}`, onFail };
|
|
18587
|
+
}
|
|
18588
|
+
return null;
|
|
18589
|
+
};
|
|
18590
|
+
}
|
|
18591
|
+
};
|
|
18592
|
+
var DataValidationEngine = new DataValidationEngineClass();
|
|
18593
|
+
var DataValidationEngine_default = DataValidationEngine;
|
|
18594
|
+
|
|
18485
18595
|
// ../../packages/engines/src/usage/DataframeManager.ts
|
|
18486
18596
|
var DataframeManagerClass = class {
|
|
18487
18597
|
fill(points, from, to, onlyLastValue, maintainLastValue) {
|
|
@@ -18971,6 +19081,32 @@ var ConsumerExecutorClass = class {
|
|
|
18971
19081
|
}
|
|
18972
19082
|
}
|
|
18973
19083
|
}
|
|
19084
|
+
for (const field of fields) {
|
|
19085
|
+
const { cField } = field;
|
|
19086
|
+
const fieldKey = cField.alias ?? cField.key;
|
|
19087
|
+
if (cField.validate && cField.validate.length > 0) {
|
|
19088
|
+
const result = DataValidationEngine_default.applyValidations(record[fieldKey], cField.validate, fieldKey);
|
|
19089
|
+
if (!result.valid) {
|
|
19090
|
+
const errorMessage = `Validation failed for field "${fieldKey}" (index: ${recordIndex}): ${result.message}`;
|
|
19091
|
+
switch (result.onFail) {
|
|
19092
|
+
case "set_default":
|
|
19093
|
+
record[fieldKey] = cField.default;
|
|
19094
|
+
break;
|
|
19095
|
+
case "skip":
|
|
19096
|
+
return null;
|
|
19097
|
+
case "warn":
|
|
19098
|
+
Logger_default.warn(errorMessage);
|
|
19099
|
+
break;
|
|
19100
|
+
case "fail":
|
|
19101
|
+
default: {
|
|
19102
|
+
const err = new Error(errorMessage);
|
|
19103
|
+
Logger_default.error(err);
|
|
19104
|
+
throw err;
|
|
19105
|
+
}
|
|
19106
|
+
}
|
|
19107
|
+
}
|
|
19108
|
+
}
|
|
19109
|
+
}
|
|
18974
19110
|
try {
|
|
18975
19111
|
for (const dimension of dimensions) {
|
|
18976
19112
|
const field = fields.find((x) => x.cField.key === dimension.name);
|
|
@@ -19211,6 +19347,48 @@ var ConsumerExecutorClass = class {
|
|
|
19211
19347
|
return false;
|
|
19212
19348
|
}
|
|
19213
19349
|
};
|
|
19350
|
+
this.processDatasetValidation = async (consumer, datasetPath) => {
|
|
19351
|
+
const validations = consumer.validate;
|
|
19352
|
+
if (!validations || validations.length === 0) return [];
|
|
19353
|
+
const internalRecordFormat = OutputExecutor_default._getInternalRecordFormat(consumer);
|
|
19354
|
+
const internalFields = ConsumerManager_default.getExpandedFields(consumer);
|
|
19355
|
+
let rowCount = 0;
|
|
19356
|
+
const seenRows = /* @__PURE__ */ new Set();
|
|
19357
|
+
const fieldValueSets = /* @__PURE__ */ new Map();
|
|
19358
|
+
let hasDuplicateRows = false;
|
|
19359
|
+
const duplicateFields = [];
|
|
19360
|
+
const uniqueFieldKeys = DataValidationEngine_default.extractUniqueFieldKeys(validations);
|
|
19361
|
+
const checkDuplicateRows = DataValidationEngine_default.hasRule(validations, "no_duplicates");
|
|
19362
|
+
for (const fieldKey of uniqueFieldKeys) {
|
|
19363
|
+
fieldValueSets.set(fieldKey, /* @__PURE__ */ new Set());
|
|
19364
|
+
}
|
|
19365
|
+
const reader = import_fs11.default.createReadStream(datasetPath);
|
|
19366
|
+
const lineReader = import_readline6.default.createInterface({ input: reader, crlfDelay: Infinity });
|
|
19367
|
+
for await (const line of lineReader) {
|
|
19368
|
+
rowCount++;
|
|
19369
|
+
if (checkDuplicateRows) {
|
|
19370
|
+
if (seenRows.has(line))
|
|
19371
|
+
hasDuplicateRows = true;
|
|
19372
|
+
else
|
|
19373
|
+
seenRows.add(line);
|
|
19374
|
+
}
|
|
19375
|
+
if (uniqueFieldKeys.length > 0) {
|
|
19376
|
+
const record = internalRecordFormat === "CSV" || internalRecordFormat === "TXT" ? LineParser_default._internalParseCSV(line, internalFields) : LineParser_default._internalParseJSON(line);
|
|
19377
|
+
for (const fieldKey of uniqueFieldKeys) {
|
|
19378
|
+
const valueSet = fieldValueSets.get(fieldKey);
|
|
19379
|
+
const val = String(record[fieldKey] ?? "");
|
|
19380
|
+
if (valueSet.has(val)) {
|
|
19381
|
+
if (!duplicateFields.includes(fieldKey))
|
|
19382
|
+
duplicateFields.push(fieldKey);
|
|
19383
|
+
} else {
|
|
19384
|
+
valueSet.add(val);
|
|
19385
|
+
}
|
|
19386
|
+
}
|
|
19387
|
+
}
|
|
19388
|
+
}
|
|
19389
|
+
lineReader.close();
|
|
19390
|
+
return DataValidationEngine_default.evaluateDatasetValidations(validations, { rowCount, hasDuplicateRows, duplicateFields });
|
|
19391
|
+
};
|
|
19214
19392
|
/**
|
|
19215
19393
|
* Compares two values, handling numbers, strings, and dates
|
|
19216
19394
|
* Returns: negative if a < b, positive if a > b, 0 if equal
|
|
@@ -19468,7 +19646,7 @@ var ExecutorOrchestratorClass = class {
|
|
|
19468
19646
|
const tracker = new ExecutorPerformance_default();
|
|
19469
19647
|
const _progress = new ExecutorProgress_default(logProgress);
|
|
19470
19648
|
const { usageId } = UsageManager_default.startUsage(consumer, details);
|
|
19471
|
-
const scope = { id: usageId, folder: `${consumer.name}_${usageId}`, workersId: [], limitFileSize: consumer.
|
|
19649
|
+
const scope = { id: usageId, folder: `${consumer.name}_${usageId}`, workersId: [], limitFileSize: consumer.maximumFileSize };
|
|
19472
19650
|
const pool = this.createPool();
|
|
19473
19651
|
try {
|
|
19474
19652
|
const start = performance.now();
|
|
@@ -19563,6 +19741,22 @@ var ExecutorOrchestratorClass = class {
|
|
|
19563
19741
|
postOperation.totalOutputCount = unifiedOutputCount;
|
|
19564
19742
|
Logger_default.log(`[${usageId}] Pivot complete: ${unifiedOutputCount} rows in ${Math.round(performance.now() - counter)}ms`);
|
|
19565
19743
|
}
|
|
19744
|
+
if (consumer.validate && consumer.validate.length > 0) {
|
|
19745
|
+
Logger_default.log(`[${usageId}] Running dataset-level validations`);
|
|
19746
|
+
counter = performance.now();
|
|
19747
|
+
const validationResults = await ConsumerExecutor_default.processDatasetValidation(consumer, ExecutorScope_default2.getMainPath(scope));
|
|
19748
|
+
tracker.measure("dataset-validation", performance.now() - counter);
|
|
19749
|
+
for (const result of validationResults) {
|
|
19750
|
+
if (result.onFail === "fail") {
|
|
19751
|
+
const err = new Error(`Dataset validation failed for consumer "${consumer.name}": ${result.message}`);
|
|
19752
|
+
Logger_default.error(err);
|
|
19753
|
+
throw err;
|
|
19754
|
+
} else if (result.onFail === "warn") {
|
|
19755
|
+
Logger_default.warn(`Dataset validation warning for consumer "${consumer.name}": ${result.message}`);
|
|
19756
|
+
}
|
|
19757
|
+
}
|
|
19758
|
+
Logger_default.log(`[${usageId}] Dataset validations complete in ${Math.round(performance.now() - counter)}ms`);
|
|
19759
|
+
}
|
|
19566
19760
|
counter = performance.now();
|
|
19567
19761
|
Logger_default.log(`[${usageId}] Exporting results to ${consumer.outputs.length} output(s)`);
|
|
19568
19762
|
const exportRes = await OutputExecutor_default.exportResult(consumer, ConsumerManager_default.getExpandedFields(consumer), scope);
|
|
@@ -129,31 +129,70 @@
|
|
|
129
129
|
]
|
|
130
130
|
},
|
|
131
131
|
"validate": {
|
|
132
|
-
"type": "
|
|
133
|
-
"description": "Rules to check field value compliance and data quality",
|
|
134
|
-
"
|
|
135
|
-
"
|
|
136
|
-
|
|
137
|
-
"
|
|
138
|
-
|
|
139
|
-
|
|
140
|
-
|
|
141
|
-
|
|
142
|
-
|
|
143
|
-
|
|
144
|
-
|
|
145
|
-
|
|
132
|
+
"type": "array",
|
|
133
|
+
"description": "Rules to check field value compliance and data quality. Each validation has its own rule and action to take on failure.",
|
|
134
|
+
"items": {
|
|
135
|
+
"type": "object",
|
|
136
|
+
"properties": {
|
|
137
|
+
"rule": {
|
|
138
|
+
"type": "object",
|
|
139
|
+
"description": "The validation rule to check",
|
|
140
|
+
"oneOf": [
|
|
141
|
+
{
|
|
142
|
+
"properties": { "min": { "type": "number", "description": "Minimum value for numeric fields" } },
|
|
143
|
+
"required": ["min"],
|
|
144
|
+
"additionalProperties": false
|
|
145
|
+
},
|
|
146
|
+
{
|
|
147
|
+
"properties": { "max": { "type": "number", "description": "Maximum value for numeric fields" } },
|
|
148
|
+
"required": ["max"],
|
|
149
|
+
"additionalProperties": false
|
|
150
|
+
},
|
|
151
|
+
{
|
|
152
|
+
"properties": { "regex": { "type": "string", "description": "Regular expression pattern to validate string fields" } },
|
|
153
|
+
"required": ["regex"],
|
|
154
|
+
"additionalProperties": false
|
|
155
|
+
},
|
|
156
|
+
{
|
|
157
|
+
"properties": { "required": { "type": "boolean", "const": true, "description": "Whether the field value must be present" } },
|
|
158
|
+
"required": ["required"],
|
|
159
|
+
"additionalProperties": false
|
|
160
|
+
},
|
|
161
|
+
{
|
|
162
|
+
"properties": { "min_length": { "type": "number", "description": "Minimum string length" } },
|
|
163
|
+
"required": ["min_length"],
|
|
164
|
+
"additionalProperties": false
|
|
165
|
+
},
|
|
166
|
+
{
|
|
167
|
+
"properties": { "max_length": { "type": "number", "description": "Maximum string length" } },
|
|
168
|
+
"required": ["max_length"],
|
|
169
|
+
"additionalProperties": false
|
|
170
|
+
},
|
|
171
|
+
{
|
|
172
|
+
"properties": { "in": { "type": "array", "items": { "oneOf": [{ "type": "string" }, { "type": "number" }, { "type": "boolean" }] }, "description": "Allowed values" } },
|
|
173
|
+
"required": ["in"],
|
|
174
|
+
"additionalProperties": false
|
|
175
|
+
},
|
|
176
|
+
{
|
|
177
|
+
"properties": { "not_in": { "type": "array", "items": { "oneOf": [{ "type": "string" }, { "type": "number" }, { "type": "boolean" }] }, "description": "Disallowed values" } },
|
|
178
|
+
"required": ["not_in"],
|
|
179
|
+
"additionalProperties": false
|
|
180
|
+
}
|
|
181
|
+
]
|
|
182
|
+
},
|
|
183
|
+
"onFail": {
|
|
184
|
+
"type": "string",
|
|
185
|
+
"description": "Action to take when validation fails",
|
|
186
|
+
"enum": ["fail", "skip", "warn", "set_default"]
|
|
187
|
+
}
|
|
146
188
|
},
|
|
147
|
-
"required":
|
|
148
|
-
|
|
149
|
-
|
|
150
|
-
}
|
|
151
|
-
},
|
|
152
|
-
"additionalProperties": false
|
|
189
|
+
"required": ["rule", "onFail"],
|
|
190
|
+
"additionalProperties": false
|
|
191
|
+
}
|
|
153
192
|
},
|
|
154
193
|
"onError": {
|
|
155
194
|
"type": "string",
|
|
156
|
-
"description": "Action to take if an error occurs during transformations
|
|
195
|
+
"description": "Action to take if an error occurs during transformations",
|
|
157
196
|
"enum": ["set_default", "skip", "fail"]
|
|
158
197
|
},
|
|
159
198
|
"default": {
|
|
@@ -463,6 +502,53 @@
|
|
|
463
502
|
"_version": {
|
|
464
503
|
"type": "number",
|
|
465
504
|
"description": "Version number of the consumer configuration"
|
|
505
|
+
},
|
|
506
|
+
"validate": {
|
|
507
|
+
"type": "array",
|
|
508
|
+
"description": "Dataset-level validations applied to the final result set before export",
|
|
509
|
+
"items": {
|
|
510
|
+
"type": "object",
|
|
511
|
+
"properties": {
|
|
512
|
+
"rule": {
|
|
513
|
+
"type": "object",
|
|
514
|
+
"description": "The dataset validation rule to check",
|
|
515
|
+
"oneOf": [
|
|
516
|
+
{
|
|
517
|
+
"properties": { "unique_fields": { "type": "array", "items": { "type": "string" }, "minItems": 1, "description": "Field(s) that must have unique values across the dataset" } },
|
|
518
|
+
"required": ["unique_fields"],
|
|
519
|
+
"additionalProperties": false
|
|
520
|
+
},
|
|
521
|
+
{
|
|
522
|
+
"properties": { "min_rows": { "type": "number", "description": "Minimum number of rows expected in the dataset" } },
|
|
523
|
+
"required": ["min_rows"],
|
|
524
|
+
"additionalProperties": false
|
|
525
|
+
},
|
|
526
|
+
{
|
|
527
|
+
"properties": { "max_rows": { "type": "number", "description": "Maximum number of rows allowed in the dataset" } },
|
|
528
|
+
"required": ["max_rows"],
|
|
529
|
+
"additionalProperties": false
|
|
530
|
+
},
|
|
531
|
+
{
|
|
532
|
+
"properties": { "no_duplicates": { "type": "boolean", "const": true, "description": "No fully duplicate rows allowed" } },
|
|
533
|
+
"required": ["no_duplicates"],
|
|
534
|
+
"additionalProperties": false
|
|
535
|
+
},
|
|
536
|
+
{
|
|
537
|
+
"properties": { "not_empty": { "type": "boolean", "const": true, "description": "Dataset must contain at least one row" } },
|
|
538
|
+
"required": ["not_empty"],
|
|
539
|
+
"additionalProperties": false
|
|
540
|
+
}
|
|
541
|
+
]
|
|
542
|
+
},
|
|
543
|
+
"onFail": {
|
|
544
|
+
"type": "string",
|
|
545
|
+
"description": "Action to take when dataset validation fails",
|
|
546
|
+
"enum": ["fail", "warn"]
|
|
547
|
+
}
|
|
548
|
+
},
|
|
549
|
+
"required": ["rule", "onFail"],
|
|
550
|
+
"additionalProperties": false
|
|
551
|
+
}
|
|
466
552
|
}
|
|
467
553
|
},
|
|
468
554
|
"required": [
|
package/package.json
CHANGED
|
@@ -13351,6 +13351,10 @@ var Logger = class {
|
|
|
13351
13351
|
console.info(message);
|
|
13352
13352
|
FileLogService_default.write("INFO", String(message));
|
|
13353
13353
|
};
|
|
13354
|
+
this.warn = (message) => {
|
|
13355
|
+
console.warn(message);
|
|
13356
|
+
FileLogService_default.write("WARN", String(message));
|
|
13357
|
+
};
|
|
13354
13358
|
this.flush = () => FileLogService_default.flush();
|
|
13355
13359
|
this.close = () => FileLogService_default.close();
|
|
13356
13360
|
this.error = (error) => {
|
|
@@ -13494,7 +13498,7 @@ var import_promises = __toESM(require("fs/promises"), 1);
|
|
|
13494
13498
|
|
|
13495
13499
|
// ../../packages/constants/src/Constants.ts
|
|
13496
13500
|
var CONSTANTS = {
|
|
13497
|
-
cliVersion: "1.1
|
|
13501
|
+
cliVersion: "1.2.1",
|
|
13498
13502
|
backendVersion: 1,
|
|
13499
13503
|
backendPort: 5088,
|
|
13500
13504
|
workerVersion: 2,
|
|
@@ -17812,6 +17816,112 @@ var TransformationEngineClass = class {
|
|
|
17812
17816
|
var TransformationEngine = new TransformationEngineClass();
|
|
17813
17817
|
var TransformationEngine_default = TransformationEngine;
|
|
17814
17818
|
|
|
17819
|
+
// ../../packages/engines/src/transform/DataValidationEngine.ts
|
|
17820
|
+
var DataValidationEngineClass = class {
|
|
17821
|
+
constructor() {
|
|
17822
|
+
this.applyValidations = (value, validations, fieldKey) => {
|
|
17823
|
+
for (const validation of validations) {
|
|
17824
|
+
const passed = this.evaluateRule(value, validation.rule);
|
|
17825
|
+
if (!passed) {
|
|
17826
|
+
return {
|
|
17827
|
+
valid: false,
|
|
17828
|
+
message: this.buildMessage(value, validation.rule, fieldKey),
|
|
17829
|
+
onFail: validation.onFail
|
|
17830
|
+
};
|
|
17831
|
+
}
|
|
17832
|
+
}
|
|
17833
|
+
return { valid: true };
|
|
17834
|
+
};
|
|
17835
|
+
this.evaluateRule = (value, rule) => {
|
|
17836
|
+
if ("required" in rule)
|
|
17837
|
+
return Algo_default.hasVal(value);
|
|
17838
|
+
if ("min" in rule) {
|
|
17839
|
+
if (!Algo_default.hasVal(value)) return true;
|
|
17840
|
+
const num = Number(value);
|
|
17841
|
+
return !isNaN(num) && num >= rule.min;
|
|
17842
|
+
}
|
|
17843
|
+
if ("max" in rule) {
|
|
17844
|
+
if (!Algo_default.hasVal(value)) return true;
|
|
17845
|
+
const num = Number(value);
|
|
17846
|
+
return !isNaN(num) && num <= rule.max;
|
|
17847
|
+
}
|
|
17848
|
+
if ("regex" in rule) {
|
|
17849
|
+
if (!Algo_default.hasVal(value)) return true;
|
|
17850
|
+
return new RegExp(rule.regex).test(String(value));
|
|
17851
|
+
}
|
|
17852
|
+
if ("min_length" in rule) {
|
|
17853
|
+
if (!Algo_default.hasVal(value)) return true;
|
|
17854
|
+
return String(value).length >= rule.min_length;
|
|
17855
|
+
}
|
|
17856
|
+
if ("max_length" in rule) {
|
|
17857
|
+
if (!Algo_default.hasVal(value)) return true;
|
|
17858
|
+
return String(value).length <= rule.max_length;
|
|
17859
|
+
}
|
|
17860
|
+
if ("in" in rule) {
|
|
17861
|
+
return rule.in.includes(value);
|
|
17862
|
+
}
|
|
17863
|
+
if ("not_in" in rule) {
|
|
17864
|
+
return !rule.not_in.includes(value);
|
|
17865
|
+
}
|
|
17866
|
+
return true;
|
|
17867
|
+
};
|
|
17868
|
+
this.buildMessage = (value, rule, fieldKey) => {
|
|
17869
|
+
const preview = Algo_default.hasVal(value) ? JSON.stringify(value) : "null/undefined";
|
|
17870
|
+
if ("required" in rule) return `Field "${fieldKey}" is required but got ${preview}`;
|
|
17871
|
+
if ("min" in rule) return `Field "${fieldKey}" value ${preview} is below minimum ${rule.min}`;
|
|
17872
|
+
if ("max" in rule) return `Field "${fieldKey}" value ${preview} exceeds maximum ${rule.max}`;
|
|
17873
|
+
if ("regex" in rule) return `Field "${fieldKey}" value ${preview} does not match pattern "${rule.regex}"`;
|
|
17874
|
+
if ("min_length" in rule) return `Field "${fieldKey}" value ${preview} is shorter than minimum length ${rule.min_length}`;
|
|
17875
|
+
if ("max_length" in rule) return `Field "${fieldKey}" value ${preview} exceeds maximum length ${rule.max_length}`;
|
|
17876
|
+
if ("in" in rule) return `Field "${fieldKey}" value ${preview} is not in the allowed values`;
|
|
17877
|
+
if ("not_in" in rule) return `Field "${fieldKey}" value ${preview} is in the disallowed values`;
|
|
17878
|
+
return `Field "${fieldKey}" failed validation`;
|
|
17879
|
+
};
|
|
17880
|
+
this.evaluateDatasetValidations = (validations, context) => {
|
|
17881
|
+
const results = [];
|
|
17882
|
+
for (const validation of validations) {
|
|
17883
|
+
const result = this.evaluateDatasetRule(validation, context);
|
|
17884
|
+
if (result) results.push(result);
|
|
17885
|
+
}
|
|
17886
|
+
return results;
|
|
17887
|
+
};
|
|
17888
|
+
this.extractUniqueFieldKeys = (validations) => {
|
|
17889
|
+
return validations.filter((v) => "unique_fields" in v.rule).flatMap((v) => v.rule.unique_fields);
|
|
17890
|
+
};
|
|
17891
|
+
this.hasRule = (validations, ruleKey) => {
|
|
17892
|
+
return validations.some((v) => ruleKey in v.rule);
|
|
17893
|
+
};
|
|
17894
|
+
// Evaluates one dataset-level rule against the precomputed dataset context.
// Returns a failure descriptor { message, onFail } for the first condition
// the rule violates, or null when the dataset satisfies the rule.
this.evaluateDatasetRule = (validation, context) => {
  const { rule, onFail } = validation;
  const { rowCount, hasDuplicateRows, duplicateFields } = context;

  // Empty-dataset guard.
  if ("not_empty" in rule && rowCount === 0) {
    return { message: "Dataset is empty", onFail };
  }

  // Row-count bounds.
  if ("min_rows" in rule && rowCount < rule.min_rows) {
    return { message: `Dataset has ${rowCount} rows, expected at least ${rule.min_rows}`, onFail };
  }
  if ("max_rows" in rule && rowCount > rule.max_rows) {
    return { message: `Dataset has ${rowCount} rows, expected at most ${rule.max_rows}`, onFail };
  }

  // Whole-row duplicate detection (flag computed by the streaming pass).
  if ("no_duplicates" in rule && hasDuplicateRows) {
    return { message: "Dataset contains duplicate rows", onFail };
  }

  // Per-field uniqueness: report only the fields this rule asked about
  // that the streaming pass found duplicated.
  if ("unique_fields" in rule) {
    const failedFields = rule.unique_fields.filter((f) => duplicateFields.includes(f));
    if (failedFields.length > 0) {
      return { message: `Duplicate values found in field(s): ${failedFields.join(", ")}`, onFail };
    }
  }

  return null;
};
|
|
17920
|
+
}
|
|
17921
|
+
};
|
|
17922
|
+
var DataValidationEngine = new DataValidationEngineClass();
|
|
17923
|
+
var DataValidationEngine_default = DataValidationEngine;
|
|
17924
|
+
|
|
17815
17925
|
// ../../packages/engines/src/usage/DataframeManager.ts
|
|
17816
17926
|
var DataframeManagerClass = class {
|
|
17817
17927
|
fill(points, from, to, onlyLastValue, maintainLastValue) {
|
|
@@ -18570,6 +18680,32 @@ var ConsumerExecutorClass = class {
|
|
|
18570
18680
|
}
|
|
18571
18681
|
}
|
|
18572
18682
|
}
|
|
18683
|
+
for (const field of fields) {
|
|
18684
|
+
const { cField } = field;
|
|
18685
|
+
const fieldKey = cField.alias ?? cField.key;
|
|
18686
|
+
if (cField.validate && cField.validate.length > 0) {
|
|
18687
|
+
const result = DataValidationEngine_default.applyValidations(record[fieldKey], cField.validate, fieldKey);
|
|
18688
|
+
if (!result.valid) {
|
|
18689
|
+
const errorMessage = `Validation failed for field "${fieldKey}" (index: ${recordIndex}): ${result.message}`;
|
|
18690
|
+
switch (result.onFail) {
|
|
18691
|
+
case "set_default":
|
|
18692
|
+
record[fieldKey] = cField.default;
|
|
18693
|
+
break;
|
|
18694
|
+
case "skip":
|
|
18695
|
+
return null;
|
|
18696
|
+
case "warn":
|
|
18697
|
+
Logger_default.warn(errorMessage);
|
|
18698
|
+
break;
|
|
18699
|
+
case "fail":
|
|
18700
|
+
default: {
|
|
18701
|
+
const err = new Error(errorMessage);
|
|
18702
|
+
Logger_default.error(err);
|
|
18703
|
+
throw err;
|
|
18704
|
+
}
|
|
18705
|
+
}
|
|
18706
|
+
}
|
|
18707
|
+
}
|
|
18708
|
+
}
|
|
18573
18709
|
try {
|
|
18574
18710
|
for (const dimension of dimensions) {
|
|
18575
18711
|
const field = fields.find((x) => x.cField.key === dimension.name);
|
|
@@ -18810,6 +18946,48 @@ var ConsumerExecutorClass = class {
|
|
|
18810
18946
|
return false;
|
|
18811
18947
|
}
|
|
18812
18948
|
};
|
|
18949
|
+
// Streams the consumer's produced dataset once and evaluates all
// dataset-level validations against it.
//
// Params:  consumer    - consumer definition (reads `validate`, field config)
//          datasetPath - path to the line-delimited result file
// Returns: array of failure descriptors ({ message, onFail }); empty when
//          no validations are configured or all pass.
//
// A single pass collects everything the rules need: total row count,
// whole-line duplicate detection (only when a no_duplicates rule exists),
// and per-field duplicate detection for fields named by unique_fields rules
// (records are parsed only when such fields exist).
this.processDatasetValidation = async (consumer, datasetPath) => {
  const validations = consumer.validate;
  if (!validations || validations.length === 0) return [];
  const internalRecordFormat = OutputExecutor_default._getInternalRecordFormat(consumer);
  const internalFields = ConsumerManager_default.getExpandedFields(consumer);
  let rowCount = 0;
  const seenRows = /* @__PURE__ */ new Set();
  const fieldValueSets = /* @__PURE__ */ new Map();
  let hasDuplicateRows = false;
  // Set instead of array + includes(): O(1) membership per row.
  const duplicateFieldSet = /* @__PURE__ */ new Set();
  const uniqueFieldKeys = DataValidationEngine_default.extractUniqueFieldKeys(validations);
  const checkDuplicateRows = DataValidationEngine_default.hasRule(validations, "no_duplicates");
  for (const fieldKey of uniqueFieldKeys) {
    fieldValueSets.set(fieldKey, /* @__PURE__ */ new Set());
  }
  const reader = import_fs9.default.createReadStream(datasetPath);
  const lineReader = import_readline6.default.createInterface({ input: reader, crlfDelay: Infinity });
  try {
    for await (const line of lineReader) {
      rowCount++;
      if (checkDuplicateRows) {
        if (seenRows.has(line))
          hasDuplicateRows = true;
        else
          seenRows.add(line);
      }
      if (uniqueFieldKeys.length > 0) {
        // Parsing is format-dependent; only done when a unique_fields rule
        // actually needs per-field values.
        const record = internalRecordFormat === "CSV" || internalRecordFormat === "TXT" ? LineParser_default._internalParseCSV(line, internalFields) : LineParser_default._internalParseJSON(line);
        for (const fieldKey of uniqueFieldKeys) {
          const valueSet = fieldValueSets.get(fieldKey);
          const val = String(record[fieldKey] ?? "");
          if (valueSet.has(val)) {
            duplicateFieldSet.add(fieldKey);
          } else {
            valueSet.add(val);
          }
        }
      }
    }
  } finally {
    // Always release the readline interface and the underlying stream,
    // even if parsing throws mid-file (the original leaked them on error).
    lineReader.close();
    reader.destroy();
  }
  return DataValidationEngine_default.evaluateDatasetValidations(validations, { rowCount, hasDuplicateRows, duplicateFields: [...duplicateFieldSet] });
};
|
|
18813
18991
|
/**
|
|
18814
18992
|
* Compares two values, handling numbers, strings, and dates
|
|
18815
18993
|
* Returns: negative if a < b, positive if a > b, 0 if equal
|
|
@@ -19227,7 +19405,7 @@ var ExecutorOrchestratorClass = class {
|
|
|
19227
19405
|
const tracker = new ExecutorPerformance_default();
|
|
19228
19406
|
const _progress = new ExecutorProgress_default(logProgress);
|
|
19229
19407
|
const { usageId } = UsageManager_default.startUsage(consumer, details);
|
|
19230
|
-
const scope = { id: usageId, folder: `${consumer.name}_${usageId}`, workersId: [], limitFileSize: consumer.
|
|
19408
|
+
const scope = { id: usageId, folder: `${consumer.name}_${usageId}`, workersId: [], limitFileSize: consumer.maximumFileSize };
|
|
19231
19409
|
const pool = this.createPool();
|
|
19232
19410
|
try {
|
|
19233
19411
|
const start = performance.now();
|
|
@@ -19322,6 +19500,22 @@ var ExecutorOrchestratorClass = class {
|
|
|
19322
19500
|
postOperation.totalOutputCount = unifiedOutputCount;
|
|
19323
19501
|
Logger_default.log(`[${usageId}] Pivot complete: ${unifiedOutputCount} rows in ${Math.round(performance.now() - counter)}ms`);
|
|
19324
19502
|
}
|
|
19503
|
+
// When the consumer declares dataset-level validations, run them against
// the unified output file and react per each result's onFail action:
// "fail" aborts the execution, "warn" only logs; other actions are ignored
// here (they are handled at the field level).
if (consumer.validate && consumer.validate.length > 0) {
  Logger_default.log(`[${usageId}] Running dataset-level validations`);
  counter = performance.now();
  const validationResults = await ConsumerExecutor_default.processDatasetValidation(consumer, ExecutorScope_default2.getMainPath(scope));
  tracker.measure("dataset-validation", performance.now() - counter);
  for (const result of validationResults) {
    switch (result.onFail) {
      case "fail": {
        const err = new Error(`Dataset validation failed for consumer "${consumer.name}": ${result.message}`);
        Logger_default.error(err);
        throw err;
      }
      case "warn":
        Logger_default.warn(`Dataset validation warning for consumer "${consumer.name}": ${result.message}`);
        break;
    }
  }
  Logger_default.log(`[${usageId}] Dataset validations complete in ${Math.round(performance.now() - counter)}ms`);
}
|
|
19325
19519
|
counter = performance.now();
|
|
19326
19520
|
Logger_default.log(`[${usageId}] Exporting results to ${consumer.outputs.length} output(s)`);
|
|
19327
19521
|
const exportRes = await OutputExecutor_default.exportResult(consumer, ConsumerManager_default.getExpandedFields(consumer), scope);
|