synapse-sdk 2025.9.1__py3-none-any.whl → 2025.9.4__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release: this version of synapse-sdk might be problematic.

Files changed (81)
  1. synapse_sdk/devtools/docs/docs/api/clients/annotation-mixin.md +378 -0
  2. synapse_sdk/devtools/docs/docs/api/clients/backend.md +368 -1
  3. synapse_sdk/devtools/docs/docs/api/clients/core-mixin.md +477 -0
  4. synapse_sdk/devtools/docs/docs/api/clients/data-collection-mixin.md +422 -0
  5. synapse_sdk/devtools/docs/docs/api/clients/hitl-mixin.md +554 -0
  6. synapse_sdk/devtools/docs/docs/api/clients/index.md +391 -0
  7. synapse_sdk/devtools/docs/docs/api/clients/integration-mixin.md +571 -0
  8. synapse_sdk/devtools/docs/docs/api/clients/ml-mixin.md +578 -0
  9. synapse_sdk/devtools/docs/docs/plugins/developing-upload-template.md +1463 -0
  10. synapse_sdk/devtools/docs/docs/plugins/export-plugins.md +161 -34
  11. synapse_sdk/devtools/docs/docs/plugins/upload-plugins.md +1497 -213
  12. synapse_sdk/devtools/docs/i18n/ko/docusaurus-plugin-content-docs/current/api/clients/annotation-mixin.md +289 -0
  13. synapse_sdk/devtools/docs/i18n/ko/docusaurus-plugin-content-docs/current/api/clients/backend.md +378 -11
  14. synapse_sdk/devtools/docs/i18n/ko/docusaurus-plugin-content-docs/current/api/clients/core-mixin.md +417 -0
  15. synapse_sdk/devtools/docs/i18n/ko/docusaurus-plugin-content-docs/current/api/clients/data-collection-mixin.md +356 -0
  16. synapse_sdk/devtools/docs/i18n/ko/docusaurus-plugin-content-docs/current/api/clients/hitl-mixin.md +192 -0
  17. synapse_sdk/devtools/docs/i18n/ko/docusaurus-plugin-content-docs/current/api/clients/index.md +391 -0
  18. synapse_sdk/devtools/docs/i18n/ko/docusaurus-plugin-content-docs/current/api/clients/integration-mixin.md +479 -0
  19. synapse_sdk/devtools/docs/i18n/ko/docusaurus-plugin-content-docs/current/api/clients/ml-mixin.md +284 -0
  20. synapse_sdk/devtools/docs/i18n/ko/docusaurus-plugin-content-docs/current/plugins/developing-upload-template.md +1463 -0
  21. synapse_sdk/devtools/docs/i18n/ko/docusaurus-plugin-content-docs/current/plugins/export-plugins.md +161 -34
  22. synapse_sdk/devtools/docs/i18n/ko/docusaurus-plugin-content-docs/current/plugins/upload-plugins.md +1752 -572
  23. synapse_sdk/devtools/docs/sidebars.ts +7 -0
  24. synapse_sdk/plugins/README.md +1 -2
  25. synapse_sdk/plugins/categories/base.py +7 -0
  26. synapse_sdk/plugins/categories/export/actions/__init__.py +3 -0
  27. synapse_sdk/plugins/categories/export/actions/export/__init__.py +28 -0
  28. synapse_sdk/plugins/categories/export/actions/export/action.py +160 -0
  29. synapse_sdk/plugins/categories/export/actions/export/enums.py +113 -0
  30. synapse_sdk/plugins/categories/export/actions/export/exceptions.py +53 -0
  31. synapse_sdk/plugins/categories/export/actions/export/models.py +74 -0
  32. synapse_sdk/plugins/categories/export/actions/export/run.py +195 -0
  33. synapse_sdk/plugins/categories/export/actions/export/utils.py +187 -0
  34. synapse_sdk/plugins/categories/export/templates/plugin/__init__.py +1 -1
  35. synapse_sdk/plugins/categories/upload/actions/upload/__init__.py +1 -2
  36. synapse_sdk/plugins/categories/upload/actions/upload/action.py +154 -531
  37. synapse_sdk/plugins/categories/upload/actions/upload/context.py +185 -0
  38. synapse_sdk/plugins/categories/upload/actions/upload/factory.py +143 -0
  39. synapse_sdk/plugins/categories/upload/actions/upload/models.py +66 -29
  40. synapse_sdk/plugins/categories/upload/actions/upload/orchestrator.py +182 -0
  41. synapse_sdk/plugins/categories/upload/actions/upload/registry.py +113 -0
  42. synapse_sdk/plugins/categories/upload/actions/upload/steps/__init__.py +1 -0
  43. synapse_sdk/plugins/categories/upload/actions/upload/steps/base.py +106 -0
  44. synapse_sdk/plugins/categories/upload/actions/upload/steps/cleanup.py +62 -0
  45. synapse_sdk/plugins/categories/upload/actions/upload/steps/collection.py +62 -0
  46. synapse_sdk/plugins/categories/upload/actions/upload/steps/generate.py +80 -0
  47. synapse_sdk/plugins/categories/upload/actions/upload/steps/initialize.py +66 -0
  48. synapse_sdk/plugins/categories/upload/actions/upload/steps/metadata.py +101 -0
  49. synapse_sdk/plugins/categories/upload/actions/upload/steps/organize.py +89 -0
  50. synapse_sdk/plugins/categories/upload/actions/upload/steps/upload.py +96 -0
  51. synapse_sdk/plugins/categories/upload/actions/upload/steps/validate.py +61 -0
  52. synapse_sdk/plugins/categories/upload/actions/upload/strategies/__init__.py +1 -0
  53. synapse_sdk/plugins/categories/upload/actions/upload/strategies/base.py +86 -0
  54. synapse_sdk/plugins/categories/upload/actions/upload/strategies/data_unit/__init__.py +1 -0
  55. synapse_sdk/plugins/categories/upload/actions/upload/strategies/data_unit/batch.py +39 -0
  56. synapse_sdk/plugins/categories/upload/actions/upload/strategies/data_unit/single.py +34 -0
  57. synapse_sdk/plugins/categories/upload/actions/upload/strategies/file_discovery/__init__.py +1 -0
  58. synapse_sdk/plugins/categories/upload/actions/upload/strategies/file_discovery/flat.py +233 -0
  59. synapse_sdk/plugins/categories/upload/actions/upload/strategies/file_discovery/recursive.py +253 -0
  60. synapse_sdk/plugins/categories/upload/actions/upload/strategies/metadata/__init__.py +1 -0
  61. synapse_sdk/plugins/categories/upload/actions/upload/strategies/metadata/excel.py +174 -0
  62. synapse_sdk/plugins/categories/upload/actions/upload/strategies/metadata/none.py +16 -0
  63. synapse_sdk/plugins/categories/upload/actions/upload/strategies/upload/__init__.py +1 -0
  64. synapse_sdk/plugins/categories/upload/actions/upload/strategies/upload/async_upload.py +109 -0
  65. synapse_sdk/plugins/categories/upload/actions/upload/strategies/upload/sync.py +43 -0
  66. synapse_sdk/plugins/categories/upload/actions/upload/strategies/validation/__init__.py +1 -0
  67. synapse_sdk/plugins/categories/upload/actions/upload/strategies/validation/default.py +45 -0
  68. synapse_sdk/plugins/categories/upload/actions/upload/utils.py +194 -83
  69. synapse_sdk/plugins/categories/upload/templates/config.yaml +4 -0
  70. synapse_sdk/plugins/categories/upload/templates/plugin/__init__.py +269 -0
  71. synapse_sdk/plugins/categories/upload/templates/plugin/upload.py +71 -27
  72. synapse_sdk/plugins/models.py +7 -0
  73. synapse_sdk/shared/__init__.py +21 -0
  74. {synapse_sdk-2025.9.1.dist-info → synapse_sdk-2025.9.4.dist-info}/METADATA +2 -1
  75. {synapse_sdk-2025.9.1.dist-info → synapse_sdk-2025.9.4.dist-info}/RECORD +79 -28
  76. synapse_sdk/plugins/categories/export/actions/export.py +0 -385
  77. synapse_sdk/plugins/categories/export/enums.py +0 -7
  78. {synapse_sdk-2025.9.1.dist-info → synapse_sdk-2025.9.4.dist-info}/WHEEL +0 -0
  79. {synapse_sdk-2025.9.1.dist-info → synapse_sdk-2025.9.4.dist-info}/entry_points.txt +0 -0
  80. {synapse_sdk-2025.9.1.dist-info → synapse_sdk-2025.9.4.dist-info}/licenses/LICENSE +0 -0
  81. {synapse_sdk-2025.9.1.dist-info → synapse_sdk-2025.9.4.dist-info}/top_level.txt +0 -0
@@ -32,16 +32,26 @@ Upload plugins provide file upload and data ingestion operations for processing

  ## Upload Action Architecture

- The upload system uses a modular architecture with specialized components for different aspects of file processing:
+ The upload system uses a modern, extensible architecture built on proven design patterns. The refactored implementation transforms the previous monolithic approach into a modular, strategy-based system with clear separation of concerns.
+
+ ### Design Patterns
+
+ The architecture leverages several key design patterns:
+
+ - **Strategy Pattern**: Pluggable behaviors for validation, file discovery, metadata processing, upload operations, and data unit creation
+ - **Facade Pattern**: UploadOrchestrator provides a simplified interface to coordinate complex workflows
+ - **Factory Pattern**: StrategyFactory creates appropriate strategy implementations based on runtime parameters
+ - **Context Pattern**: UploadContext maintains shared state and communication between workflow components
+
+ ### Component Architecture

  ```mermaid
  classDiagram
- %% Light/Dark mode compatible colors with semi-transparency
- classDef baseClass fill:#e1f5fe80,stroke:#0288d1,stroke-width:2px
- classDef childClass fill:#c8e6c980,stroke:#388e3c,stroke-width:2px
- classDef modelClass fill:#fff9c480,stroke:#f57c00,stroke-width:2px
- classDef utilClass fill:#f5f5f580,stroke:#616161,stroke-width:2px
- classDef enumClass fill:#ffccbc80,stroke:#d32f2f,stroke-width:2px
+ %% Light/Dark mode compatible colors
+ classDef coreClass fill:#e3f2fd,stroke:#1976d2,stroke-width:2px,color:#000000
+ classDef strategyClass fill:#e8f5e8,stroke:#388e3c,stroke-width:2px,color:#000000
+ classDef stepClass fill:#fff9c4,stroke:#f57c00,stroke-width:2px,color:#000000
+ classDef contextClass fill:#ffebee,stroke:#d32f2f,stroke-width:2px,color:#000000

  class UploadAction {
  +name: str = "upload"
@@ -51,160 +61,682 @@ classDiagram
  +params_model: UploadParams
  +progress_categories: dict
  +metrics_categories: dict
+ +strategy_factory: StrategyFactory
+ +step_registry: StepRegistry

  +start() dict
- +get_uploader(...) object
- +_discover_files_recursive(path) List[Path]
- +_discover_files_non_recursive(path) List[Path]
- +_validate_excel_security(path) None
- +_process_excel_metadata(path) dict
+ +get_workflow_summary() dict
+ +_configure_workflow() None
+ +_configure_strategies() dict
+ }
+
+ class UploadOrchestrator {
+ +context: UploadContext
+ +step_registry: StepRegistry
+ +strategies: dict
+ +executed_steps: list
+ +current_step_index: int
+ +rollback_executed: bool
+
+ +execute() dict
+ +get_workflow_summary() dict
+ +get_executed_steps() list
+ +is_rollback_executed() bool
+ +_execute_step(step) StepResult
+ +_handle_step_failure(step, error) None
+ +_rollback_executed_steps() None
+ }
+
+ class UploadContext {
+ +params: dict
+ +run: UploadRun
+ +client: Any
+ +storage: Any
+ +pathlib_cwd: Path
+ +metadata: dict
+ +file_specifications: dict
+ +organized_files: list
+ +uploaded_files: list
+ +data_units: list
+ +metrics: dict
+ +errors: list
+ +strategies: dict
+ +rollback_data: dict
+
+ +update(result: StepResult) None
+ +get_result() dict
+ +has_errors() bool
+ +update_metrics(category, metrics) None
+ }
+
+ class StepRegistry {
+ +_steps: list
+ +register(step: BaseStep) None
+ +get_steps() list
+ +get_total_progress_weight() float
+ +clear() None
  }

- class UploadRun {
- +log_message_with_code(code, args, level) None
- +log_upload_event(code, args, level) None
- +UploadEventLog: BaseModel
- +DataFileLog: BaseModel
- +DataUnitLog: BaseModel
- +TaskLog: BaseModel
- +MetricsRecord: BaseModel
+ class StrategyFactory {
+ +create_validation_strategy(params, context) BaseValidationStrategy
+ +create_file_discovery_strategy(params, context) BaseFileDiscoveryStrategy
+ +create_metadata_strategy(params, context) BaseMetadataStrategy
+ +create_upload_strategy(params, context) BaseUploadStrategy
+ +create_data_unit_strategy(params, context) BaseDataUnitStrategy
+ +get_available_strategies() dict
  }

- class UploadParams {
+ class BaseStep {
+ <<abstract>>
  +name: str
- +description: str | None
- +path: str
- +storage: int
- +collection: int
- +project: int | None
- +excel_metadata_path: str | None
- +is_recursive: bool = False
- +max_file_size_mb: int = 50
- +creating_data_unit_batch_size: int = 100
- +use_async_upload: bool = True
-
- +check_storage_exists(value) str
- +check_collection_exists(value) str
- +check_project_exists(value) str
+ +progress_weight: float
+ +execute(context: UploadContext) StepResult
+ +can_skip(context: UploadContext) bool
+ +rollback(context: UploadContext) None
+ +create_success_result(data) StepResult
+ +create_error_result(error) StepResult
+ +create_skip_result() StepResult
  }

  class ExcelSecurityConfig {
+ +max_file_size_mb: int = 10
+ +max_rows: int = 100000
+ +max_columns: int = 50
+ +max_file_size_bytes: int
  +MAX_FILE_SIZE_MB: int
  +MAX_FILE_SIZE_BYTES: int
- +MAX_MEMORY_USAGE_MB: int
- +MAX_MEMORY_USAGE_BYTES: int
  +MAX_ROWS: int
  +MAX_COLUMNS: int
- +MAX_FILENAME_LENGTH: int
- +MAX_COLUMN_NAME_LENGTH: int
- +MAX_METADATA_VALUE_LENGTH: int
+ +from_action_config(action_config) ExcelSecurityConfig
+ }
+
+ class StepResult {
+ +success: bool
+ +data: dict
+ +error: str
+ +rollback_data: dict
+ +skipped: bool
+ +original_exception: Exception
+ +timestamp: datetime
+ }
+
+ %% Strategy Base Classes
+ class BaseValidationStrategy {
+ <<abstract>>
+ +validate_files(files, context) bool
+ +validate_security(file_path) bool
+ }
+
+ class BaseFileDiscoveryStrategy {
+ <<abstract>>
+ +discover_files(path, context) list
+ +organize_files(files, specs, context) list
+ }
+
+ class BaseMetadataStrategy {
+ <<abstract>>
+ +process_metadata(context) dict
+ +extract_metadata(file_path) dict
+ }
+
+ class BaseUploadStrategy {
+ <<abstract>>
+ +upload_files(files, context) list
+ +upload_batch(batch, context) list
  }

- class ExcelMetadataUtils {
- +config: ExcelSecurityConfig
- +validate_and_truncate_string(value, max_length) str
- +is_valid_filename_length(filename) bool
+ class BaseDataUnitStrategy {
+ <<abstract>>
+ +generate_data_units(files, context) list
+ +create_data_unit_batch(batch, context) list
  }

- class LogCode {
- +VALIDATION_FAILED: str
- +NO_FILES_FOUND: str
- +EXCEL_SECURITY_VIOLATION: str
- +EXCEL_PARSING_ERROR: str
- +FILES_DISCOVERED: str
- +UPLOADING_DATA_FILES: str
- +GENERATING_DATA_UNITS: str
- +IMPORT_COMPLETED: str
+ %% Workflow Steps
+ class InitializeStep {
+ +name = "initialize"
+ +progress_weight = 0.05
  }

- class UploadStatus {
- +SUCCESS: str = "success"
- +FAILED: str = "failed"
+ class ProcessMetadataStep {
+ +name = "process_metadata"
+ +progress_weight = 0.05
+ }
+
+ class AnalyzeCollectionStep {
+ +name = "analyze_collection"
+ +progress_weight = 0.05
+ }
+
+ class OrganizeFilesStep {
+ +name = "organize_files"
+ +progress_weight = 0.10
+ }
+
+ class ValidateFilesStep {
+ +name = "validate_files"
+ +progress_weight = 0.05
+ }
+
+ class UploadFilesStep {
+ +name = "upload_files"
+ +progress_weight = 0.30
+ }
+
+ class GenerateDataUnitsStep {
+ +name = "generate_data_units"
+ +progress_weight = 0.35
+ }
+
+ class CleanupStep {
+ +name = "cleanup"
+ +progress_weight = 0.05
  }

  %% Relationships
  UploadAction --> UploadRun : uses
  UploadAction --> UploadParams : validates with
  UploadAction --> ExcelSecurityConfig : configures
- UploadAction --> ExcelMetadataUtils : processes with
+ UploadAction --> UploadOrchestrator : creates and executes
+ UploadAction --> StrategyFactory : configures strategies
+ UploadAction --> StepRegistry : manages workflow steps
  UploadRun --> LogCode : logs with
  UploadRun --> UploadStatus : tracks status
- ExcelMetadataUtils --> ExcelSecurityConfig : validates against
-
- %% Apply styles
+ UploadOrchestrator --> UploadContext : coordinates state
+ UploadOrchestrator --> StepRegistry : executes steps from
+ UploadOrchestrator --> BaseStep : executes
+ BaseStep --> StepResult : returns
+ UploadContext --> StepResult : updates with
+ StrategyFactory --> BaseValidationStrategy : creates
+ StrategyFactory --> BaseFileDiscoveryStrategy : creates
+ StrategyFactory --> BaseMetadataStrategy : creates
+ StrategyFactory --> BaseUploadStrategy : creates
+ StrategyFactory --> BaseDataUnitStrategy : creates
+ StepRegistry --> BaseStep : contains
+
+ %% Step inheritance
+ InitializeStep --|> BaseStep : extends
+ ProcessMetadataStep --|> BaseStep : extends
+ AnalyzeCollectionStep --|> BaseStep : extends
+ OrganizeFilesStep --|> BaseStep : extends
+ ValidateFilesStep --|> BaseStep : extends
+ UploadFilesStep --|> BaseStep : extends
+ GenerateDataUnitsStep --|> BaseStep : extends
+ CleanupStep --|> BaseStep : extends
+
+ %% Note: Class styling defined above - Mermaid will apply based on classDef definitions
  ```
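
Editor's aside: the step and orchestrator contracts in the diagram above can be illustrated with minimal plain-Python stand-ins. These are hypothetical sketches (not the SDK's actual classes, though the names follow the diagram) showing how step execution and reverse-order rollback fit together:

```python
from dataclasses import dataclass, field


@dataclass
class StepResult:
    """Outcome of one workflow step, as in the diagram above."""
    success: bool
    data: dict = field(default_factory=dict)
    error: str = ''


class BaseStep:
    """Stand-in for the abstract step contract: execute / can_skip / rollback."""
    name = 'base'
    progress_weight = 0.0

    def execute(self, context):
        raise NotImplementedError

    def can_skip(self, context):
        return False

    def rollback(self, context):
        pass  # default: nothing to undo


class StepRegistry:
    """Ordered collection of workflow steps."""

    def __init__(self):
        self._steps = []

    def register(self, step):
        self._steps.append(step)

    def get_steps(self):
        return list(self._steps)


class UploadOrchestrator:
    """Run registered steps in order; on failure, roll back executed steps in reverse."""

    def __init__(self, context, step_registry):
        self.context = context  # here: a plain dict standing in for UploadContext
        self.step_registry = step_registry
        self.executed_steps = []
        self.rollback_executed = False

    def execute(self):
        for step in self.step_registry.get_steps():
            if step.can_skip(self.context):
                continue
            result = step.execute(self.context)
            if not result.success:
                self._rollback_executed_steps()
                raise RuntimeError(result.error)
            self.executed_steps.append(step)
            self.context.update(result.data)
        return self.context

    def _rollback_executed_steps(self):
        self.rollback_executed = True
        for step in reversed(self.executed_steps):
            try:
                step.rollback(self.context)
            except Exception:
                continue  # graceful degradation: keep rolling back remaining steps
```

In this sketch a failing step triggers rollback of every previously executed step before the exception propagates, which is the behavior the class diagram's `_rollback_executed_steps()` implies.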

- ### Upload Processing Flow
+ ### Step-Based Workflow Execution
+
+ The refactored architecture uses a step-based workflow coordinated by the UploadOrchestrator. Each step has a defined responsibility and progress weight.
+
+ #### Workflow Steps Overview

- This flowchart shows the complete execution flow of upload operations:
+ | Step | Name | Weight | Responsibility |
+ | ---- | ------------------- | ------ | -------------------------------------------- |
+ | 1 | Initialize | 5% | Setup storage, pathlib, and basic validation |
+ | 2 | Process Metadata | 5% | Handle Excel metadata if provided |
+ | 3 | Analyze Collection | 5% | Retrieve and validate data collection specs |
+ | 4 | Organize Files | 10% | Discover and organize files by type |
+ | 5 | Validate Files | 5% | Security and content validation |
+ | 6 | Upload Files | 30% | Upload files to storage |
+ | 7 | Generate Data Units | 35% | Create data units from uploaded files |
+ | 8 | Cleanup | 5% | Clean temporary resources |
+
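
Editor's aside: because each step carries a fixed weight, overall progress is simply the sum of the weights of completed steps. A quick sketch (weights copied from the table; the helper name is hypothetical):

```python
# Step weights as documented in the table above; they must sum to 1.0.
STEP_WEIGHTS = {
    'initialize': 0.05,
    'process_metadata': 0.05,
    'analyze_collection': 0.05,
    'organize_files': 0.10,
    'validate_files': 0.05,
    'upload_files': 0.30,
    'generate_data_units': 0.35,
    'cleanup': 0.05,
}


def overall_progress(completed_steps, weights=STEP_WEIGHTS):
    """Overall progress is the sum of the weights of the completed steps."""
    return sum(weights[name] for name in completed_steps)
```

For example, after the first four steps the reported progress would be 5% + 5% + 5% + 10% = 25%.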
+ #### Execution Flow

  ```mermaid
  flowchart TD
  %% Start
- A[Upload Action Started] --> B[Validate Parameters]
- B --> C[Setup Output Paths]
- C --> D[Discover Files]
-
- %% File Discovery
- D --> E{Recursive Mode?}
- E -->|Yes| F[Scan Recursively]
- E -->|No| G[Scan Directory Only]
- F --> H[Collect All Files]
- G --> H
-
- %% Excel Processing
- H --> I{Excel Metadata?}
- I -->|Yes| J[Validate Excel Security]
- I -->|No| L[Organize Files by Type]
-
- J --> K[Process Excel Metadata]
- K --> L
-
- %% File Organization
- L --> M[Create Type Directories]
- M --> N[Batch Files for Processing]
-
- %% Upload Processing
- N --> O[Start File Upload]
- O --> P[Process File Batch]
- P --> Q{More Batches?}
- Q -->|Yes| P
- Q -->|No| R[Generate Data Units]
-
- %% Data Unit Creation
- R --> S[Create Data Unit Batch]
- S --> T{More Units?}
- T -->|Yes| S
- T -->|No| U[Complete Upload]
-
- %% Completion
- U --> V[Update Metrics]
- V --> W[Log Results]
- W --> X[Return Summary]
+ A["🚀 Upload Action Started"] --> B["📋 Create UploadContext"]
+ B --> C["⚙️ Configure Strategies"]
+ C --> D["📝 Register Workflow Steps"]
+ D --> E["🎯 Create UploadOrchestrator"]
+
+ %% Strategy Injection
+ E --> F["💉 Inject Strategies into Context"]
+ F --> G["📊 Initialize Progress Tracking"]
+
+ %% Step Execution Loop
+ G --> H["🔄 Start Step Execution Loop"]
+ H --> I["📍 Get Next Step"]
+ I --> J{"🤔 Can Step be Skipped?"}
+ J -->|Yes| K["⏭️ Skip Step"]
+ J -->|No| L["▶️ Execute Step"]
+
+ %% Step Execution
+ L --> M{"✅ Step Successful?"}
+ M -->|Yes| N["📈 Update Progress"]
+ M -->|No| O["❌ Handle Step Failure"]
+
+ %% Success Path
+ N --> P["💾 Store Step Result"]
+ P --> Q["📝 Add to Executed Steps"]
+ Q --> R{"🏁 More Steps?"}
+ R -->|Yes| I
+ R -->|No| S["🎉 Workflow Complete"]
+
+ %% Skip Path
+ K --> T["📊 Update Progress (Skip)"]
+ T --> R

  %% Error Handling
- B -->|Error| Y[Log Validation Error]
- J -->|Error| Z[Log Excel Error]
- P -->|Error| AA[Log Upload Error]
- S -->|Error| BB[Log Data Unit Error]
-
- Y --> CC[Return Error Result]
- Z --> CC
- AA --> CC
- BB --> CC
-
- %% Apply styles with light/dark mode compatibility
- classDef startNode fill:#90caf980,stroke:#1565c0,stroke-width:2px
- classDef processNode fill:#ce93d880,stroke:#6a1b9a,stroke-width:2px
- classDef decisionNode fill:#ffcc8080,stroke:#ef6c00,stroke-width:2px
- classDef errorNode fill:#ef9a9a80,stroke:#c62828,stroke-width:2px
- classDef endNode fill:#a5d6a780,stroke:#2e7d32,stroke-width:2px
-
- class A startNode
- class B,C,D,F,G,H,J,K,L,M,N,O,P,R,S,U,V,W processNode
- class E,I,Q,T decisionNode
- class Y,Z,AA,BB,CC errorNode
- class X endNode
+ O --> U["🔙 Start Rollback Process"]
+ U --> V["⏪ Rollback Executed Steps"]
+ V --> W["📝 Log Rollback Results"]
+ W --> X["💥 Propagate Exception"]
+
+ %% Final Results
+ S --> Y["📊 Collect Final Metrics"]
+ Y --> Z["📋 Generate Result Summary"]
+ Z --> AA["🔄 Return to UploadAction"]
+
+ %% Apply styles - Light/Dark mode compatible
+ classDef startNode fill:#e3f2fd,stroke:#1976d2,stroke-width:2px,color:#000000
+ classDef processNode fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#000000
+ classDef decisionNode fill:#fff3e0,stroke:#f57c00,stroke-width:2px,color:#000000
+ classDef successNode fill:#e8f5e8,stroke:#388e3c,stroke-width:2px,color:#000000
+ classDef errorNode fill:#ffebee,stroke:#d32f2f,stroke-width:2px,color:#000000
+ classDef stepNode fill:#f0f4c3,stroke:#689f38,stroke-width:1px,color:#000000
+
+ class A,B,E startNode
+ class C,D,F,G,H,I,L,N,P,Q,T,Y,Z,AA processNode
+ class J,M,R decisionNode
+ class K,S successNode
+ class O,U,V,W,X errorNode
+ ```
+
+ #### Strategy Integration Points
+
+ Strategies are injected into the workflow at specific points:
+
+ - **Validation Strategy**: Used by ValidateFilesStep
+ - **File Discovery Strategy**: Used by OrganizeFilesStep
+ - **Metadata Strategy**: Used by ProcessMetadataStep
+ - **Upload Strategy**: Used by UploadFilesStep
+ - **Data Unit Strategy**: Used by GenerateDataUnitsStep
+
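
Editor's aside: at each injection point the factory's job is plain parameter-based dispatch. A minimal stand-in sketch (the two strategy classes here are hypothetical; the `is_recursive` parameter name comes from the docs above):

```python
class FlatFileDiscoveryStrategy:
    """Stand-in: keep only top-level entries (no path separator)."""

    def discover_files(self, entries):
        return [e for e in entries if '/' not in e]


class RecursiveFileDiscoveryStrategy:
    """Stand-in: keep everything, including nested paths."""

    def discover_files(self, entries):
        return list(entries)


def create_file_discovery_strategy(params):
    """Pick a discovery strategy from runtime params, factory-style."""
    if params.get('is_recursive', False):
        return RecursiveFileDiscoveryStrategy()
    return FlatFileDiscoveryStrategy()
```

The step that consumes the strategy only sees the shared `discover_files` interface, which is what lets implementations be swapped without touching the workflow.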
+ #### Error Handling and Rollback
+
+ The orchestrator provides automatic rollback functionality:
+
+ 1. **Exception Capture**: Preserves original exceptions for debugging
+ 2. **Rollback Execution**: Calls rollback() on all successfully executed steps in reverse order
+ 3. **Graceful Degradation**: Continues rollback even if individual step rollbacks fail
+ 4. **State Preservation**: Maintains execution state for post-failure analysis
+
+ ## Development Guide
+
+ This section provides guidance for extending the upload action with custom strategies and workflow steps.
+
+ ### Creating Custom Strategies
+
+ Strategies implement specific behaviors for different aspects of the upload process. Each strategy type has a well-defined interface.
+
+ #### Custom Validation Strategy
+
+ ```python
+ from pathlib import Path
+ from typing import List
+
+ from synapse_sdk.plugins.categories.upload.actions.upload.strategies.validation.base import BaseValidationStrategy
+ from synapse_sdk.plugins.categories.upload.actions.upload.context import UploadContext
+
+ class CustomValidationStrategy(BaseValidationStrategy):
+     """Custom validation strategy with advanced security checks."""
+
+     def validate_files(self, files: List[Path], context: UploadContext) -> bool:
+         """Validate files using custom business rules."""
+         for file_path in files:
+             # Custom validation logic
+             if not self._validate_custom_rules(file_path):
+                 return False
+
+             # Call security validation
+             if not self.validate_security(file_path):
+                 return False
+         return True
+
+     def validate_security(self, file_path: Path) -> bool:
+         """Custom security validation."""
+         # Implement custom security checks
+         if file_path.suffix in ['.exe', '.bat', '.sh']:
+             return False
+
+         # Check file size
+         if file_path.stat().st_size > 100 * 1024 * 1024:  # 100MB
+             return False
+
+         return True
+
+     def _validate_custom_rules(self, file_path: Path) -> bool:
+         """Implement domain-specific validation rules."""
+         # Custom business logic
+         return True
+ ```
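
Editor's aside: stripped of the SDK base class, the security checks above reduce to a pure function that is easy to unit-test in isolation (the helper name is hypothetical; the blocklist and size cap are taken from the example):

```python
from pathlib import Path

# Same checks as the example above: executable suffixes and a 100MB cap.
BLOCKED_SUFFIXES = {'.exe', '.bat', '.sh'}
MAX_SIZE_BYTES = 100 * 1024 * 1024


def is_file_safe(path: Path, size_bytes: int) -> bool:
    """Return False for blocked extensions or files over the size cap."""
    if path.suffix.lower() in BLOCKED_SUFFIXES:
        return False
    return size_bytes <= MAX_SIZE_BYTES
```

Passing the size in explicitly (rather than calling `stat()` inside) keeps the check testable without touching the filesystem.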
+
+ #### Custom File Discovery Strategy
+
+ ```python
+ from pathlib import Path
+ from typing import List, Dict, Any
+
+ from synapse_sdk.plugins.categories.upload.actions.upload.strategies.file_discovery.base import BaseFileDiscoveryStrategy
+ from synapse_sdk.plugins.categories.upload.actions.upload.context import UploadContext
+
+ class CustomFileDiscoveryStrategy(BaseFileDiscoveryStrategy):
+     """Custom file discovery with advanced filtering."""
+
+     def discover_files(self, path: Path, context: UploadContext) -> List[Path]:
+         """Discover files with custom filtering rules."""
+         if context.get_param('is_recursive', False):
+             files = list(path.rglob('*'))
+         else:
+             files = list(path.iterdir())
+
+         # Apply custom filtering
+         return self._apply_custom_filters(files, context)
+
+     def organize_files(self, files: List[Path], specs: Dict[str, Any], context: UploadContext) -> List[Dict[str, Any]]:
+         """Organize files using custom categorization."""
+         organized = []
+
+         for file_path in files:
+             if file_path.is_file():
+                 category = self._determine_category(file_path)
+                 organized.append({
+                     'file_path': file_path,
+                     'category': category,
+                     'metadata': self._extract_file_metadata(file_path)
+                 })
+
+         return organized
+
+     def _apply_custom_filters(self, files: List[Path], context: UploadContext) -> List[Path]:
+         """Apply domain-specific file filters."""
+         filtered = []
+         for file_path in files:
+             if self._should_include_file(file_path):
+                 filtered.append(file_path)
+         return filtered
+
+     def _determine_category(self, file_path: Path) -> str:
+         """Determine file category using custom logic."""
+         # Custom categorization logic
+         ext = file_path.suffix.lower()
+         if ext in ['.jpg', '.png', '.gif']:
+             return 'images'
+         elif ext in ['.pdf', '.doc', '.docx']:
+             return 'documents'
+         else:
+             return 'other'
+ ```
+
+ #### Custom Upload Strategy
+
+ ```python
+ import time
+ from typing import List, Dict, Any
+
+ from synapse_sdk.plugins.categories.upload.actions.upload.strategies.upload.base import BaseUploadStrategy
+ from synapse_sdk.plugins.categories.upload.actions.upload.context import UploadContext
+
+ class CustomUploadStrategy(BaseUploadStrategy):
+     """Custom upload strategy with advanced retry logic."""
+
+     def upload_files(self, files: List[Dict[str, Any]], context: UploadContext) -> List[Dict[str, Any]]:
+         """Upload files with custom batching and retry logic."""
+         uploaded_files = []
+         batch_size = context.get_param('upload_batch_size', 10)
+
+         # Process in custom batches
+         for i in range(0, len(files), batch_size):
+             batch = files[i:i + batch_size]
+             batch_results = self.upload_batch(batch, context)
+             uploaded_files.extend(batch_results)
+
+         return uploaded_files
+
+     def upload_batch(self, batch: List[Dict[str, Any]], context: UploadContext) -> List[Dict[str, Any]]:
+         """Upload a batch of files with retry logic."""
+         results = []
+
+         for file_info in batch:
+             max_retries = 3
+             for attempt in range(max_retries):
+                 try:
+                     result = self._upload_single_file(file_info, context)
+                     results.append(result)
+                     break
+                 except Exception as e:
+                     if attempt == max_retries - 1:
+                         # Final attempt failed
+                         context.add_error(f"Failed to upload {file_info['file_path']}: {e}")
+                     else:
+                         # Wait before retry
+                         time.sleep(2 ** attempt)
+
+         return results
+
+     def _upload_single_file(self, file_info: Dict[str, Any], context: UploadContext) -> Dict[str, Any]:
+         """Upload a single file with custom logic."""
+         # Custom upload implementation
+         file_path = file_info['file_path']
+
+         # Use the storage from context
+         storage = context.storage
+
+         # Custom upload logic here
+         uploaded_file = {
+             'file_path': str(file_path),
+             'storage_path': f"uploads/{file_path.name}",
+             'size': file_path.stat().st_size,
+             'checksum': self._calculate_checksum(file_path)
+         }
+
+         return uploaded_file
+ ```
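
Editor's aside: the retry loop above can be isolated into a reusable helper. This sketch (names hypothetical) injects the sleep function so the exponential backoff is testable without real delays:

```python
import time


def upload_with_retry(upload_fn, file_info, max_retries=3, sleep=time.sleep):
    """Call upload_fn until it succeeds, sleeping 2**attempt seconds between tries.

    Re-raises the last exception if every attempt fails.
    """
    last_exc = None
    for attempt in range(max_retries):
        try:
            return upload_fn(file_info)
        except Exception as exc:
            last_exc = exc
            if attempt < max_retries - 1:
                sleep(2 ** attempt)  # 1s, 2s, ... exponential backoff
    raise last_exc
```

Injecting `sleep` is a small design choice that makes the backoff schedule observable in tests; production callers simply use the default `time.sleep`.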
+
+ ### Creating Custom Workflow Steps
+
+ Custom workflow steps extend the base step class and implement the required interface.
+
+ #### Custom Processing Step
+
+ ```python
+ from datetime import datetime
+ from pathlib import Path
+ from typing import Dict, List
+
+ from synapse_sdk.plugins.categories.upload.actions.upload.steps.base import BaseStep
+ from synapse_sdk.plugins.categories.upload.actions.upload.context import UploadContext, StepResult
+
+ class CustomProcessingStep(BaseStep):
+     """Custom processing step for specialized file handling."""
+
+     @property
+     def name(self) -> str:
+         return 'custom_processing'
+
+     @property
+     def progress_weight(self) -> float:
+         return 0.15  # 15% of total workflow
+
+     def execute(self, context: UploadContext) -> StepResult:
+         """Execute custom processing logic."""
+         try:
+             # Custom processing logic
+             processed_files = self._process_files(context)
+
+             # Update context with results
+             return self.create_success_result({
+                 'processed_files': processed_files,
+                 'processing_stats': self._get_processing_stats()
+             })
+
+         except Exception as e:
+             return self.create_error_result(f'Custom processing failed: {e}')
+
+     def can_skip(self, context: UploadContext) -> bool:
+         """Skip the step when there are no files to process."""
+         return len(context.organized_files) == 0
+
+     def rollback(self, context: UploadContext) -> None:
+         """Roll back custom processing operations."""
+         # Clean up any resources created during processing
+         self._cleanup_processing_resources(context)
+
+     def _process_files(self, context: UploadContext) -> List[Dict]:
+         """Implement custom file processing."""
+         processed = []
+
+         for file_info in context.organized_files:
+             # Custom processing logic
+             result = self._process_single_file(file_info)
+             processed.append(result)
+
+         return processed
+
+     def _process_single_file(self, file_info: Dict) -> Dict:
+         """Process a single file."""
+         # Custom processing implementation
+         return {
+             'original': file_info,
+             'processed': True,
+             'timestamp': datetime.now()
+         }
+ ```
+
+ ### Strategy Factory Extension
+
+ To make custom strategies available, extend the StrategyFactory:
+
+ ```python
+ from typing import Dict
+
+ from synapse_sdk.plugins.categories.upload.actions.upload.factory import StrategyFactory
+
+ class CustomStrategyFactory(StrategyFactory):
+     """Extended factory with custom strategies."""
+
+     def create_validation_strategy(self, params: Dict, context=None):
+         """Create a validation strategy with custom options."""
+         validation_type = params.get('custom_validation_type', 'default')
+
+         if validation_type == 'strict':
+             return CustomValidationStrategy()
+         else:
+             return super().create_validation_strategy(params, context)
+
+     def create_file_discovery_strategy(self, params: Dict, context=None):
+         """Create a file discovery strategy with custom options."""
+         discovery_mode = params.get('discovery_mode', 'default')
+
+         if discovery_mode == 'advanced':
+             return CustomFileDiscoveryStrategy()
+         else:
+             return super().create_file_discovery_strategy(params, context)
+ ```
630
+
631
+ ### Custom Upload Action
632
+
633
+ For comprehensive customization, extend the UploadAction itself:
634
+
635
+ ```python
636
+ from synapse_sdk.plugins.categories.upload.actions.upload.action import UploadAction
637
+ from synapse_sdk.plugins.categories.decorators import register_action
638
+
639
+ @register_action
640
+ class CustomUploadAction(UploadAction):
641
+ """Custom upload action with extended workflow."""
642
+
643
+ name = 'custom_upload'
644
+
645
+ def __init__(self, *args, **kwargs):
646
+ super().__init__(*args, **kwargs)
647
+ # Use custom strategy factory
648
+ self.strategy_factory = CustomStrategyFactory()
649
+
650
+ def _configure_workflow(self) -> None:
651
+ """Configure custom workflow with additional steps."""
652
+ # Register standard steps
653
+ super()._configure_workflow()
654
+
655
+ # Add custom processing step
656
+ self.step_registry.register(CustomProcessingStep())
657
+
658
+ def _configure_strategies(self, context=None) -> Dict[str, Any]:
659
+ """Configure strategies with custom parameters."""
660
+ strategies = super()._configure_strategies(context)
661
+
662
+ # Add custom strategy
663
+ strategies['custom_processing'] = self._create_custom_processing_strategy()
664
+
665
+ return strategies
666
+
667
+ def _create_custom_processing_strategy(self):
668
+ """Create custom processing strategy."""
669
+ return CustomProcessingStrategy(self.params)
670
+ ```
671
+
672
+ ### Testing Custom Components
673
+
674
+ #### Testing Custom Strategies
675
+
676
+ ```python
677
+ import pytest
678
+ from unittest.mock import Mock
679
+ from pathlib import Path
680
+
681
+ class TestCustomValidationStrategy:
682
+
683
+ def setup_method(self):
684
+ self.strategy = CustomValidationStrategy()
685
+ self.context = Mock()
686
+
687
+ def test_validate_files_success(self):
688
+ """Test successful file validation."""
689
+ files = [Path('/test/file1.txt'), Path('/test/file2.jpg')]
690
+ result = self.strategy.validate_files(files, self.context)
691
+ assert result is True
692
+
693
+ def test_validate_files_security_failure(self):
694
+ """Test validation failure for security reasons."""
695
+ files = [Path('/test/malware.exe')]
696
+ result = self.strategy.validate_files(files, self.context)
697
+ assert result is False
698
+
699
+ def test_validate_large_file_failure(self):
700
+ """Test validation failure for large files."""
701
+ # Mock file stat to return large size
702
+ large_file = Mock(spec=Path)
703
+ large_file.suffix = '.txt'
704
+ large_file.stat.return_value.st_size = 200 * 1024 * 1024 # 200MB
705
+
706
+ result = self.strategy.validate_security(large_file)
707
+ assert result is False
708
+ ```
709
+
710
+ #### Testing Custom Steps
711
+
712
+ ```python
713
+ class TestCustomProcessingStep:
714
+
715
+ def setup_method(self):
716
+ self.step = CustomProcessingStep()
717
+ self.context = Mock()
718
+ self.context.organized_files = [
719
+ {'file_path': '/test/file1.txt'},
720
+ {'file_path': '/test/file2.jpg'}
721
+ ]
722
+
723
+ def test_execute_success(self):
724
+ """Test successful step execution."""
725
+ result = self.step.execute(self.context)
726
+
727
+ assert result.success is True
728
+ assert 'processed_files' in result.data
729
+ assert len(result.data['processed_files']) == 2
730
+
731
+ def test_can_skip_with_no_files(self):
732
+ """Test step skipping logic."""
733
+ self.context.organized_files = []
734
+ assert self.step.can_skip(self.context) is True
735
+
736
+ def test_rollback_cleanup(self):
737
+ """Test rollback cleanup."""
738
+ # This should not raise an exception
739
+ self.step.rollback(self.context)
208
740
  ```
209
741
 
210
742
  ## Upload Parameters
@@ -213,12 +745,12 @@ The upload action uses `UploadParams` for comprehensive parameter validation:
213
745
 
214
746
  ### Required Parameters
215
747
 
216
- | Parameter | Type | Description | Validation |
217
- | ------------ | ----- | -------------------------- | ------------------ |
218
- | `name` | `str` | Human-readable upload name | Must be non-blank |
219
- | `path` | `str` | Source file/directory path | Must be valid path |
220
- | `storage` | `int` | Target storage ID | Must exist via API |
221
- | `collection` | `int` | Data collection ID | Must exist via API |
748
+ | Parameter | Type | Description | Validation |
749
+ | ----------------- | ----- | -------------------------- | ------------------ |
750
+ | `name` | `str` | Human-readable upload name | Must be non-blank |
751
+ | `path` | `str` | Source file/directory path | Must be valid path |
752
+ | `storage` | `int` | Target storage ID | Must exist via API |
753
+ | `data_collection` | `int` | Data collection ID | Must exist via API |
222
754
 
223
755
  ### Optional Parameters
224
756
 
@@ -252,53 +784,136 @@ def check_storage_exists(cls, value: str, info) -> str:
252
784
 
253
785
  ## Excel Metadata Processing
254
786
 
255
- Upload plugins support Excel files for enhanced metadata annotation:
787
+ Upload plugins provide advanced Excel metadata processing with comprehensive filename matching, flexible header support, and optimized performance:
256
788
 
257
789
  ### Excel File Format
258
790
 
259
- The Excel file should follow this structure:
791
+ The Excel file supports flexible header formats and comprehensive filename matching:
792
+
793
+ #### Supported Header Formats
260
794
 
795
+ Both header formats are supported with case-insensitive matching:
796
+
797
+ **Option 1: "filename" header**
261
798
  | filename | category | description | custom_field |
262
799
  | ---------- | -------- | ------------------ | ------------ |
263
800
  | image1.jpg | nature | Mountain landscape | high_res |
264
801
  | image2.png | urban | City skyline | processed |
265
802
 
803
+ **Option 2: "file_name" header**
804
+ | file_name | category | description | custom_field |
805
+ | ---------- | -------- | ------------------ | ------------ |
806
+ | image1.jpg | nature | Mountain landscape | high_res |
807
+ | image2.png | urban | City skyline | processed |
808
+
809
+ #### Filename Matching Strategy
810
+
811
+ The system uses a comprehensive 5-tier priority matching algorithm to associate files with metadata:
812
+
813
+ 1. **Exact stem match** (highest priority): `image1` matches `image1.jpg`
814
+ 2. **Exact filename match**: `image1.jpg` matches `image1.jpg`
815
+ 3. **Metadata key stem match**: `path/image1.ext` stem matches `image1`
816
+ 4. **Partial path matching**: `/uploads/image1.jpg` contains `image1`
817
+ 5. **Full path matching**: Complete path matching for complex structures
818
+
819
+ This robust matching ensures metadata is correctly associated regardless of file organization or naming conventions.
820
+
266
821
  ### Security Validation
267
822
 
268
823
  Excel files undergo comprehensive security validation:
269
824
 
270
825
  ```python
271
826
  class ExcelSecurityConfig:
272
- MAX_FILE_SIZE_MB = 10 # File size limit
273
- MAX_MEMORY_USAGE_MB = 30 # Memory usage limit
274
- MAX_ROWS = 10000 # Row count limit
275
- MAX_COLUMNS = 50 # Column count limit
276
- MAX_FILENAME_LENGTH = 255 # Filename length limit
277
- MAX_COLUMN_NAME_LENGTH = 100 # Column name length
278
- MAX_METADATA_VALUE_LENGTH = 1000 # Metadata value length
827
+ max_file_size_mb: int = 10 # File size limit in MB
828
+ max_rows: int = 100000 # Row count limit
829
+ max_columns: int = 50 # Column count limit
279
830
  ```
280
831
 
281
- ### Environment Configuration
832
+ #### Advanced Security Features
833
+
834
+ - **File format validation**: Checks Excel file signatures (PK for .xlsx, compound document for .xls)
835
+ - **Memory estimation**: Prevents memory exhaustion from oversized spreadsheets
836
+ - **Content sanitization**: Automatic truncation of overly long values
837
+ - **Error resilience**: Graceful handling of corrupted or inaccessible files
282
838
 
283
- Security limits can be configured via environment variables:
839
+ ### Configuration via config.yaml
284
840
 
285
- ```bash
286
- export EXCEL_MAX_FILE_SIZE_MB=20
287
- export EXCEL_MAX_MEMORY_MB=50
288
- export EXCEL_MAX_ROWS=20000
289
- export EXCEL_MAX_COLUMNS=100
290
- export EXCEL_MAX_FILENAME_LENGTH=500
291
- export EXCEL_MAX_COLUMN_NAME_LENGTH=200
292
- export EXCEL_MAX_METADATA_VALUE_LENGTH=2000
841
+ Security limits and processing options can be configured:
842
+
843
+ ```yaml
844
+ actions:
845
+ upload:
846
+ excel_config:
847
+ max_file_size_mb: 10 # Maximum Excel file size in MB
848
+ max_rows: 100000 # Maximum number of rows allowed
849
+ max_columns: 50 # Maximum number of columns allowed
293
850
  ```
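A plugin could merge these settings over the documented defaults roughly as follows; the helper name is an assumption, and the defaults mirror the limits shown above:

```python
EXCEL_DEFAULTS = {"max_file_size_mb": 10, "max_rows": 100000, "max_columns": 50}

def load_excel_config(plugin_config: dict) -> dict:
    """Overlay config.yaml overrides on top of the documented defaults."""
    overrides = (
        plugin_config.get("actions", {})
        .get("upload", {})
        .get("excel_config", {})
    )
    return {**EXCEL_DEFAULTS, **overrides}
```

Unset keys fall back to their defaults, so a config.yaml only needs to list the limits it changes.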
294
851
 
852
+ ### Performance Optimizations
853
+
854
+ The Excel metadata processing includes several performance enhancements:
855
+
856
+ #### Metadata Indexing
857
+ - **O(1) hash lookups** for exact stem and filename matches
858
+ - **Pre-built indexes** for common matching patterns
859
+ - **Fallback algorithms** for complex path matching scenarios
860
+
861
+ #### Efficient Processing
862
+ - **Optimized row processing**: Skip empty rows early
863
+ - **Memory-conscious operation**: Process files in batches
864
+ - **Smart file discovery**: Cache path strings to avoid repeated conversions
865
+
295
866
  ### Metadata Processing Flow
296
867
 
297
- 1. **Security Validation**: File size, memory estimation
298
- 2. **Format Validation**: Header structure, column count
299
- 3. **Content Processing**: Row-by-row metadata extraction
300
- 4. **Data Sanitization**: Length limits, string truncation
301
- 5. **Mapping Creation**: Filename to metadata mapping
868
+ 1. **Security Validation**: File size, format, and content limits
869
+ 2. **Header Validation**: Support for both "filename" and "file_name" with case-insensitive matching
870
+ 3. **Index Building**: Create O(1) lookup structures for performance
871
+ 4. **Content Processing**: Row-by-row metadata extraction with optimization
872
+ 5. **Data Sanitization**: Automatic truncation and validation
873
+ 6. **Pattern Matching**: 5-tier filename association algorithm
874
+ 7. **Mapping Creation**: Optimized filename to metadata mapping
875
+
876
+ ### Excel Metadata Parameter
877
+
878
+ You can specify a custom Excel metadata file path:
879
+
880
+ ```python
881
+ params = {
882
+ "name": "Excel Metadata Upload",
883
+ "path": "/data/files",
884
+ "storage": 1,
885
+ "data_collection": 5,
886
+ "excel_metadata_path": "/data/custom_metadata.xlsx" # Custom Excel file
887
+ }
888
+ ```
889
+
890
+ #### Path Resolution
891
+ - **Absolute paths**: Used directly if they exist and are accessible
892
+ - **Relative paths**: Resolved relative to the upload path
893
+ - **Default discovery**: Automatically searches for `meta.xlsx` or `meta.xls` if no path specified
894
+ - **Storage integration**: Uses storage configuration for proper path resolution
895
+
896
+ ### Error Handling
897
+
898
+ Comprehensive error handling ensures robust operation:
899
+
900
+ ```python
901
+ # Excel processing errors are handled gracefully
902
+ try:
903
+ metadata = process_excel_metadata(excel_path)
904
+ except ExcelSecurityError as e:
905
+ # Security violation - file too large, too many rows, etc.
906
+ log_security_violation(e)
907
+ except ExcelParsingError as e:
908
+ # Parsing failure - corrupted file, invalid format, etc.
909
+ log_parsing_error(e)
910
+ ```
911
+
912
+ #### Error Recovery
913
+ - **Graceful degradation**: Continue processing with empty metadata if Excel fails
914
+ - **Detailed logging**: Specific error codes for different failure types
915
+ - **Path validation**: Comprehensive validation during parameter processing
916
+ - **Fallback behavior**: Smart defaults when metadata cannot be processed
302
917
 
303
918
  ## File Organization
304
919
 
@@ -405,20 +1020,97 @@ run.log_message_with_code(
405
1020
  run.log_upload_event(LogCode.UPLOADING_DATA_FILES, batch_size)
406
1021
  ```
407
1022
 
1023
+ ## Migration Guide
1024
+
1025
+ ### From Legacy to Refactored Architecture
1026
+
1027
+ The upload action has been refactored using modern design patterns while keeping behavior and results backward compatible. Apart from renaming the `collection` parameter to `data_collection`, existing code continues to work without changes.
1028
+
1029
+ #### Key Changes
1030
+
1031
+ **Before (Legacy Monolithic):**
1032
+
1033
+ - Single 900+ line action class with all logic
1034
+ - Hard-coded behaviors for validation, file discovery, etc.
1035
+ - No extensibility or customization options
1036
+ - Manual error handling throughout
1037
+
1038
+ **After (Strategy/Facade Patterns):**
1039
+
1040
+ - Clean separation of concerns with 8 workflow steps
1041
+ - Pluggable strategies for different behaviors
1042
+ - Extensible architecture for custom implementations
1043
+ - Automatic rollback and comprehensive error handling
1044
+
1045
+ #### Backward Compatibility
1046
+
1047
+ ```python
1048
+ # Legacy usage works the same, apart from the renamed 'data_collection' parameter
1049
+ from synapse_sdk.plugins.categories.upload.actions.upload.action import UploadAction
1050
+
1051
+ params = {
1052
+ "name": "My Upload",
1053
+ "path": "/data/files",
1054
+ "storage": 1,
1055
+ "data_collection": 5 # Changed from 'collection' to 'data_collection'
1056
+ }
1057
+
1058
+ action = UploadAction(params=params, plugin_config=config)
1059
+ result = action.start() # Works identically to before
1060
+ ```
1061
+
1062
+ #### Enhanced Capabilities
1063
+
1064
+ The refactored architecture provides new capabilities:
1065
+
1066
+ ```python
1067
+ # Get detailed workflow information
1068
+ action = UploadAction(params=params, plugin_config=config)
1069
+ workflow_info = action.get_workflow_summary()
1070
+ print(f"Configured with {workflow_info['step_count']} steps")
1071
+ print(f"Available strategies: {workflow_info['available_strategies']}")
1072
+
1073
+ # Execute and get detailed results
1074
+ result = action.start()
1075
+ print(f"Success: {result['success']}")
1076
+ print(f"Uploaded files: {result['uploaded_files_count']}")
1077
+ print(f"Generated data units: {result['generated_data_units_count']}")
1078
+ print(f"Errors: {result['errors']}")
1079
+ print(f"Metrics: {result['metrics']}")
1080
+ ```
1081
+
1082
+ #### Parameter Changes
1083
+
1084
+ Only one parameter name changed:
1085
+
1086
+ | Legacy | Refactored | Status |
1087
+ | -------------------- | ----------------- | ------------------- |
1088
+ | `collection` | `data_collection` | **Required change** |
1089
+ | All other parameters | Unchanged | Fully compatible |
1090
+
1091
+ #### Benefits of Migration
1092
+
1093
+ - **Better Error Handling**: Automatic rollback on failures
1094
+ - **Progress Tracking**: Detailed progress metrics across workflow steps
1095
+ - **Extensibility**: Add custom strategies and steps
1096
+ - **Testing**: Better testability with mock-friendly architecture
1097
+ - **Maintainability**: Clean separation of concerns
1098
+ - **Performance**: More efficient resource management
1099
+
408
1100
  ## Usage Examples
409
1101
 
410
- ### Basic File Upload
1102
+ ### Basic File Upload (Refactored Architecture)
411
1103
 
412
1104
  ```python
413
- from synapse_sdk.plugins.categories.upload.actions.upload import UploadAction
1105
+ from synapse_sdk.plugins.categories.upload.actions.upload.action import UploadAction
414
1106
 
415
- # Basic upload configuration
1107
+ # Basic upload configuration with new architecture
416
1108
  params = {
417
1109
  "name": "Dataset Upload",
418
1110
  "description": "Training dataset for ML model",
419
1111
  "path": "/data/training_images",
420
1112
  "storage": 1,
421
- "collection": 5,
1113
+ "data_collection": 5, # Note: 'data_collection' instead of 'collection'
422
1114
  "is_recursive": True,
423
1115
  "max_file_size_mb": 100
424
1116
  }
@@ -428,20 +1120,30 @@ action = UploadAction(
428
1120
  plugin_config=plugin_config
429
1121
  )
430
1122
 
431
- result = action.run_action()
1123
+ # Execute with automatic step-based workflow and rollback
1124
+ result = action.start()
1125
+
1126
+ # Enhanced result information
1127
+ print(f"Upload successful: {result['success']}")
432
1128
  print(f"Uploaded {result['uploaded_files_count']} files")
433
- print(f"Created {result['generated_data_units_count']} data units")
1129
+ print(f"Generated {result['generated_data_units_count']} data units")
1130
+ print(f"Workflow errors: {result['errors']}")
1131
+
1132
+ # Access detailed metrics
1133
+ workflow_metrics = result['metrics'].get('workflow', {})
1134
+ print(f"Total steps executed: {workflow_metrics.get('current_step', 0)}")
1135
+ print(f"Progress completed: {workflow_metrics.get('progress_percentage', 0)}%")
434
1136
  ```
435
1137
 
436
- ### Excel Metadata Upload
1138
+ ### Excel Metadata Upload with Progress Tracking
437
1139
 
438
1140
  ```python
439
- # Upload with Excel metadata
1141
+ # Upload with Excel metadata and progress monitoring
440
1142
  params = {
441
1143
  "name": "Annotated Dataset Upload",
442
1144
  "path": "/data/images",
443
1145
  "storage": 1,
444
- "collection": 5,
1146
+ "data_collection": 5,
445
1147
  "excel_metadata_path": "/data/metadata.xlsx",
446
1148
  "is_recursive": False,
447
1149
  "creating_data_unit_batch_size": 50
@@ -452,37 +1154,152 @@ action = UploadAction(
452
1154
  plugin_config=plugin_config
453
1155
  )
454
1156
 
455
- result = action.run_action()
1157
+ # Get workflow summary before execution
1158
+ workflow_info = action.get_workflow_summary()
1159
+ print(f"Workflow configured with {workflow_info['step_count']} steps")
1160
+ print(f"Total progress weight: {workflow_info['total_progress_weight']}")
1161
+ print(f"Steps: {workflow_info['steps']}")
1162
+
1163
+ # Execute with enhanced error handling
1164
+ try:
1165
+ result = action.start()
1166
+ if result['success']:
1167
+ print("Upload completed successfully!")
1168
+ print(f"Files: {result['uploaded_files_count']}")
1169
+ print(f"Data units: {result['generated_data_units_count']}")
1170
+ else:
1171
+ print("Upload failed with errors:")
1172
+ for error in result['errors']:
1173
+ print(f" - {error}")
1174
+ except Exception as e:
1175
+ print(f"Upload action failed: {e}")
456
1176
  ```
457
1177
 
458
- ### Custom Configuration
1178
+ ### Custom Strategy Upload
459
1179
 
460
1180
  ```python
461
- # Custom environment setup
462
- import os
1181
+ from synapse_sdk.plugins.categories.upload.actions.upload.action import UploadAction
1182
+ from my_custom_strategies import CustomValidationStrategy
1183
+
1184
+ # Create action with custom factory
1185
+ class CustomUploadAction(UploadAction):
1186
+ def _configure_strategies(self, context=None):
1187
+ strategies = super()._configure_strategies(context)
463
1188
 
464
- os.environ['EXCEL_MAX_FILE_SIZE_MB'] = '20'
465
- os.environ['EXCEL_MAX_ROWS'] = '20000'
1189
+ # Override with custom validation
1190
+ if self.params.get('use_strict_validation'):
1191
+ strategies['validation'] = CustomValidationStrategy()
466
1192
 
467
- # Large file upload
1193
+ return strategies
1194
+
1195
+ # Use custom action
468
1196
  params = {
469
- "name": "Large Dataset Upload",
1197
+ "name": "Strict Validation Upload",
1198
+ "path": "/data/sensitive_files",
1199
+ "storage": 1,
1200
+ "data_collection": 5,
1201
+ "use_strict_validation": True,
1202
+ "max_file_size_mb": 10 # Stricter limits
1203
+ }
1204
+
1205
+ action = CustomUploadAction(
1206
+ params=params,
1207
+ plugin_config=plugin_config
1208
+ )
1209
+
1210
+ result = action.start()
1211
+ ```
1212
+
1213
+ ### Batch Processing with Custom Configuration
1214
+
1215
+ ```python
1216
+ # Custom plugin configuration with config.yaml
1217
+ plugin_config = {
1218
+ "actions": {
1219
+ "upload": {
1220
+ "excel_config": {
1221
+ "max_file_size_mb": 20,
1222
+ "max_rows": 50000,
1223
+ "max_columns": 100
1224
+ }
1225
+ }
1226
+ }
1227
+ }
1228
+
1229
+ # Large batch upload with custom settings
1230
+ params = {
1231
+ "name": "Large Batch Upload",
470
1232
  "path": "/data/large_dataset",
471
1233
  "storage": 2,
472
- "collection": 10,
1234
+ "data_collection": 10,
1235
+ "is_recursive": True,
473
1236
  "max_file_size_mb": 500,
474
1237
  "creating_data_unit_batch_size": 200,
475
- "use_async_upload": True,
1238
+ "use_async_upload": True
1239
+ }
1240
+
1241
+ action = UploadAction(
1242
+ params=params,
1243
+ plugin_config=plugin_config
1244
+ )
1245
+
1246
+ # Execute with progress monitoring
1247
+ result = action.start()
1248
+
1249
+ # Analyze results
1250
+ print(f"Batch upload summary:")
1251
+ print(f" Success: {result['success']}")
1252
+ print(f" Files processed: {result['uploaded_files_count']}")
1253
+ print(f" Data units created: {result['generated_data_units_count']}")
1254
+
1255
+ # Check metrics by category
1256
+ metrics = result['metrics']
1257
+ if 'data_files' in metrics:
1258
+ files_metrics = metrics['data_files']
1259
+ print(f" Files - Success: {files_metrics.get('success', 0)}")
1260
+ print(f" Files - Failed: {files_metrics.get('failed', 0)}")
1261
+
1262
+ if 'data_units' in metrics:
1263
+ units_metrics = metrics['data_units']
1264
+ print(f" Units - Success: {units_metrics.get('success', 0)}")
1265
+ print(f" Units - Failed: {units_metrics.get('failed', 0)}")
1266
+ ```
1267
+
1268
+ ### Error Handling and Rollback
1269
+
1270
+ ```python
1271
+ # Demonstrate enhanced error handling with automatic rollback
1272
+ params = {
1273
+ "name": "Error Recovery Example",
1274
+ "path": "/data/problematic_files",
1275
+ "storage": 1,
1276
+ "data_collection": 5,
476
1277
  "is_recursive": True
477
1278
  }
478
1279
 
479
1280
  action = UploadAction(
480
1281
  params=params,
481
- plugin_config=plugin_config,
482
- debug=True
1282
+ plugin_config=plugin_config
483
1283
  )
484
1284
 
485
- result = action.run_action()
1285
+ try:
1286
+ result = action.start()
1287
+
1288
+ if not result['success']:
1289
+ print("Upload failed, but cleanup was automatic:")
1290
+ print(f"Errors encountered: {len(result['errors'])}")
1291
+ for i, error in enumerate(result['errors'], 1):
1292
+ print(f" {i}. {error}")
1293
+
1294
+ # Check if rollback was performed (via orchestrator internals)
1295
+ workflow_metrics = result['metrics'].get('workflow', {})
1296
+ current_step = workflow_metrics.get('current_step', 0)
1297
+ total_steps = workflow_metrics.get('total_steps', 0)
1298
+ print(f"Workflow stopped at step {current_step} of {total_steps}")
1299
+
1300
+ except Exception as e:
1301
+ print(f"Critical upload failure: {e}")
1302
+ # Rollback was automatically performed before exception propagation
486
1303
  ```
487
1304
 
488
1305
  ## Error Handling
@@ -523,11 +1340,11 @@ except ValidationError as e:
523
1340
 
524
1341
  ## API Reference
525
1342
 
526
- ### Main Classes
1343
+ ### Core Components
527
1344
 
528
1345
  #### UploadAction
529
1346
 
530
- Main upload action class for file processing operations.
1347
+ Main upload action class implementing Strategy and Facade patterns for file processing operations.
531
1348
 
532
1349
  **Class Attributes:**
533
1350
 
@@ -536,17 +1353,256 @@ Main upload action class for file processing operations.
536
1353
  - `method = RunMethod.JOB` - Execution method
537
1354
  - `run_class = UploadRun` - Specialized run management
538
1355
  - `params_model = UploadParams` - Parameter validation model
1356
+ - `strategy_factory: StrategyFactory` - Creates strategy implementations
1357
+ - `step_registry: StepRegistry` - Manages workflow steps
1358
+
1359
+ **Key Methods:**
1360
+
1361
+ - `start() -> Dict[str, Any]` - Execute orchestrated upload workflow
1362
+ - `get_workflow_summary() -> Dict[str, Any]` - Get configured workflow summary
1363
+ - `_configure_workflow() -> None` - Register workflow steps in execution order
1364
+ - `_configure_strategies(context=None) -> Dict[str, Any]` - Create strategy instances
1365
+
1366
+ **Progress Categories:**
1367
+
1368
+ ```python
1369
+ progress_categories = {
1370
+ 'analyze_collection': {'proportion': 2},
1371
+ 'upload_data_files': {'proportion': 38},
1372
+ 'generate_data_units': {'proportion': 60},
1373
+ }
1374
+ ```
1375
+
1376
+ #### UploadOrchestrator
1377
+
1378
+ Facade component coordinating the complete upload workflow with automatic rollback.
1379
+
1380
+ **Class Attributes:**
1381
+
1382
+ - `context: UploadContext` - Shared state across workflow
1383
+ - `step_registry: StepRegistry` - Registry of workflow steps
1384
+ - `strategies: Dict[str, Any]` - Strategy implementations
1385
+ - `executed_steps: List[BaseStep]` - Successfully executed steps
1386
+ - `current_step_index: int` - Current position in workflow
1387
+ - `rollback_executed: bool` - Whether rollback was performed
1388
+
1389
+ **Key Methods:**
1390
+
1391
+ - `execute() -> Dict[str, Any]` - Execute complete workflow with error handling
1392
+ - `get_workflow_summary() -> Dict[str, Any]` - Get execution summary and metrics
1393
+ - `get_executed_steps() -> List[BaseStep]` - Get list of successfully executed steps
1394
+ - `is_rollback_executed() -> bool` - Check if rollback was performed
1395
+ - `_execute_step(step: BaseStep) -> StepResult` - Execute individual workflow step
1396
+ - `_handle_step_failure(step: BaseStep, error: Exception) -> None` - Handle step failures
1397
+ - `_rollback_executed_steps() -> None` - Rollback executed steps in reverse order
1398
+
1399
+ #### UploadContext
1400
+
1401
+ Context object maintaining shared state and communication between workflow components.
1402
+
1403
+ **State Attributes:**
1404
+
1405
+ - `params: Dict` - Upload parameters
1406
+ - `run: UploadRun` - Run management instance
1407
+ - `client: Any` - API client for external operations
1408
+ - `storage: Any` - Storage configuration object
1409
+ - `pathlib_cwd: Path` - Current working directory path
1410
+ - `metadata: Dict[str, Dict[str, Any]]` - File metadata mappings
1411
+ - `file_specifications: Dict[str, Any]` - Data collection file specs
1412
+ - `organized_files: List[Dict[str, Any]]` - Organized file information
1413
+ - `uploaded_files: List[Dict[str, Any]]` - Successfully uploaded files
1414
+ - `data_units: List[Dict[str, Any]]` - Generated data units
1415
+
1416
+ **Progress and Metrics:**
1417
+
1418
+ - `metrics: Dict[str, Any]` - Workflow metrics and statistics
1419
+ - `errors: List[str]` - Accumulated error messages
1420
+ - `step_results: List[StepResult]` - Results from executed steps
1421
+
1422
+ **Strategy and Rollback:**
1423
+
1424
+ - `strategies: Dict[str, Any]` - Injected strategy implementations
1425
+ - `rollback_data: Dict[str, Any]` - Data for rollback operations
1426
+
1427
+ **Key Methods:**
1428
+
1429
+ - `update(result: StepResult) -> None` - Update context with step results
1430
+ - `get_result() -> Dict[str, Any]` - Generate final result dictionary
1431
+ - `has_errors() -> bool` - Check for accumulated errors
1432
+ - `get_last_step_result() -> Optional[StepResult]` - Get most recent step result
1433
+ - `update_metrics(category: str, metrics: Dict[str, Any]) -> None` - Update metrics
1434
+ - `add_error(error: str) -> None` - Add error to context
1435
+ - `get_param(key: str, default: Any = None) -> Any` - Get parameter with default
1436
+
1437
+ #### StepRegistry
1438
+
1439
+ Registry managing the collection and execution order of workflow steps.
1440
+
1441
+ **Attributes:**
1442
+
1443
+ - `_steps: List[BaseStep]` - Registered workflow steps in execution order
1444
+
1445
+ **Key Methods:**
1446
+
1447
+ - `register(step: BaseStep) -> None` - Register a workflow step
1448
+ - `get_steps() -> List[BaseStep]` - Get all registered steps in order
1449
+ - `get_total_progress_weight() -> float` - Calculate total progress weight
1450
+ - `clear() -> None` - Clear all registered steps
1451
+ - `__len__() -> int` - Get number of registered steps
1452
+
1453
+ #### StrategyFactory
1454
+
1455
+ Factory component creating appropriate strategy implementations based on parameters.
539
1456
 
540
1457
  **Key Methods:**
541
1458
 
542
- - `start()` - Main upload processing logic
543
- - `get_uploader()` - Get configured uploader instance
544
- - `_discover_files_recursive()` - Recursive file discovery
545
- - `_process_excel_metadata()` - Excel metadata processing
1459
+ - `create_validation_strategy(params: Dict, context=None) -> BaseValidationStrategy` - Create validation strategy
1460
+ - `create_file_discovery_strategy(params: Dict, context=None) -> BaseFileDiscoveryStrategy` - Create file discovery strategy
1461
+ - `create_metadata_strategy(params: Dict, context=None) -> BaseMetadataStrategy` - Create metadata processing strategy
1462
+ - `create_upload_strategy(params: Dict, context: UploadContext) -> BaseUploadStrategy` - Create upload strategy (requires context)
1463
+ - `create_data_unit_strategy(params: Dict, context: UploadContext) -> BaseDataUnitStrategy` - Create data unit strategy (requires context)
1464
+ - `get_available_strategies() -> Dict[str, List[str]]` - Get available strategy types and implementations
1465
+
1466
+ ### Workflow Steps
1467
+
1468
+ #### BaseStep (Abstract)
1469
+
1470
+ Base class for all workflow steps providing common interface and utilities.
1471
+
1472
+ **Abstract Properties:**
1473
+
1474
+ - `name: str` - Unique step identifier
1475
+ - `progress_weight: float` - Weight for progress calculation (sum should equal 1.0)
1476
+
1477
+ **Abstract Methods:**
1478
+
1479
+ - `execute(context: UploadContext) -> StepResult` - Execute step logic
1480
+ - `can_skip(context: UploadContext) -> bool` - Determine if step can be skipped
1481
+ - `rollback(context: UploadContext) -> None` - Rollback step operations
1482
+
1483
+ **Utility Methods:**
1484
+
1485
+ - `create_success_result(data: Dict = None) -> StepResult` - Create success result
1486
+ - `create_error_result(error: str, original_exception: Exception = None) -> StepResult` - Create error result
1487
+ - `create_skip_result() -> StepResult` - Create skip result
1488
+
1489
+ #### StepResult
+
+ Result object returned by workflow step execution.
+
+ **Attributes:**
+
+ - `success: bool` - Whether the step executed successfully
+ - `data: Dict[str, Any]` - Step result data
+ - `error: str` - Error message if the step failed
+ - `rollback_data: Dict[str, Any]` - Data needed for rollback
+ - `skipped: bool` - Whether the step was skipped
+ - `original_exception: Optional[Exception]` - Original exception for debugging
+ - `timestamp: datetime` - Execution timestamp
+
+ **Usage:**
+
+ ```python
+ # Boolean evaluation
+ if step_result:
+     # Step was successful
+     process_success(step_result.data)
+ ```
+
+ #### Concrete Steps
+
+ **InitializeStep** (`name: "initialize"`, `weight: 0.05`)
+
+ - Sets up the storage connection and pathlib working directory
+ - Validates basic upload prerequisites
+
+ **ProcessMetadataStep** (`name: "process_metadata"`, `weight: 0.05`)
+
+ - Processes Excel metadata if provided
+ - Validates metadata security and format
+
+ **AnalyzeCollectionStep** (`name: "analyze_collection"`, `weight: 0.05`)
+
+ - Retrieves and validates data collection file specifications
+ - Sets up file organization rules
+
+ **OrganizeFilesStep** (`name: "organize_files"`, `weight: 0.10`)
+
+ - Discovers files using the file discovery strategy
+ - Organizes files by type and specification
+
+ **ValidateFilesStep** (`name: "validate_files"`, `weight: 0.05`)
+
+ - Validates files using the validation strategy
+ - Performs security and content checks
+
+ **UploadFilesStep** (`name: "upload_files"`, `weight: 0.30`)
+
+ - Uploads files using the upload strategy
+ - Handles batching and progress tracking
+
+ **GenerateDataUnitsStep** (`name: "generate_data_units"`, `weight: 0.35`)
+
+ - Creates data units using the data unit strategy
+ - Links uploaded files to data units
+
+ **CleanupStep** (`name: "cleanup"`, `weight: 0.05`)
+
+ - Cleans up temporary resources and files
+ - Performs final validation
+
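The weights listed above sum to 1.0, so overall progress can be computed as a running total of finished steps' weights; a small sketch of that calculation:

```python
# Default step weights from the table above; together they sum to 1.0,
# so overall progress is just the cumulative weight of completed steps.
STEP_WEIGHTS = {
    "initialize": 0.05,
    "process_metadata": 0.05,
    "analyze_collection": 0.05,
    "organize_files": 0.10,
    "validate_files": 0.05,
    "upload_files": 0.30,
    "generate_data_units": 0.35,
    "cleanup": 0.05,
}

def overall_progress(completed_steps):
    """Return overall progress (0.0-1.0) given the names of finished steps."""
    return sum(STEP_WEIGHTS[name] for name in completed_steps)

print(round(sum(STEP_WEIGHTS.values()), 2))                  # sanity check: 1.0
print(overall_progress(["initialize", "process_metadata"]))  # 0.1 after two steps
```

This is why a custom step inserted into the workflow should either carry `progress_weight: 0.0` or come with rebalanced weights for the other steps.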
+ ### Strategy Base Classes
+
+ #### BaseValidationStrategy (Abstract)
+
+ Base class for file validation strategies.
+
+ **Abstract Methods:**
+
+ - `validate_files(files: List[Path], context: UploadContext) -> bool` - Validate a collection of files
+ - `validate_security(file_path: Path) -> bool` - Validate individual file security
+
+ #### BaseFileDiscoveryStrategy (Abstract)
+
+ Base class for file discovery and organization strategies.
+
+ **Abstract Methods:**
+
+ - `discover_files(path: Path, context: UploadContext) -> List[Path]` - Discover files from a path
+ - `organize_files(files: List[Path], specs: Dict[str, Any], context: UploadContext) -> List[Dict[str, Any]]` - Organize discovered files
+
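As an illustration of the discovery contract, here is a rough standalone sketch. The real strategy would subclass `BaseFileDiscoveryStrategy` from the SDK; the grouping-by-extension rule is an assumption for the example, since actual organization is driven by the data collection's file specifications.

```python
import tempfile
from pathlib import Path

# Illustrative sketch of a file discovery strategy (plain class, no SDK import,
# so it runs standalone).
class SimpleDiscoveryStrategy:
    def discover_files(self, path, context=None):
        # Recursively collect regular files under `path`.
        return sorted(p for p in Path(path).rglob("*") if p.is_file())

    def organize_files(self, files, specs, context=None):
        # Pair each file with a type key; a real implementation would match
        # files against the data collection's file specifications in `specs`.
        return [{"path": f, "type": f.suffix.lstrip(".") or "unknown"} for f in files]


with tempfile.TemporaryDirectory() as d:
    (Path(d) / "a.txt").write_text("x")
    (Path(d) / "sub").mkdir()
    (Path(d) / "sub" / "b.jpg").write_text("y")

    strategy = SimpleDiscoveryStrategy()
    found = strategy.discover_files(d)
    organized = strategy.organize_files(found, specs={})
    print([item["type"] for item in organized])  # ['txt', 'jpg']
```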
+ #### BaseMetadataStrategy (Abstract)
+
+ Base class for metadata processing strategies.
+
+ **Abstract Methods:**
+
+ - `process_metadata(context: UploadContext) -> Dict[str, Any]` - Process metadata from the context
+ - `extract_metadata(file_path: Path) -> Dict[str, Any]` - Extract metadata from a file
+
+ #### BaseUploadStrategy (Abstract)
+
+ Base class for file upload strategies.
+
+ **Abstract Methods:**
+
+ - `upload_files(files: List[Dict[str, Any]], context: UploadContext) -> List[Dict[str, Any]]` - Upload a collection of files
+ - `upload_batch(batch: List[Dict[str, Any]], context: UploadContext) -> List[Dict[str, Any]]` - Upload a file batch
+
+ #### BaseDataUnitStrategy (Abstract)
+
+ Base class for data unit creation strategies.
+
+ **Abstract Methods:**
+
+ - `generate_data_units(files: List[Dict[str, Any]], context: UploadContext) -> List[Dict[str, Any]]` - Generate data units
+ - `create_data_unit_batch(batch: List[Dict[str, Any]], context: UploadContext) -> List[Dict[str, Any]]` - Create a data unit batch
+
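The `upload_files` / `upload_batch` split suggests a common shape: the collection-level method chops the list into batches and delegates each one. A standalone sketch of that pattern (the batch size and the `uploaded` flag are assumptions for illustration, not SDK behavior):

```python
# Illustrative stand-in showing upload_files delegating to upload_batch.
# A real strategy would subclass BaseUploadStrategy and do network I/O.
class ChunkedUploadStrategy:
    def __init__(self, batch_size=2):
        self.batch_size = batch_size

    def upload_files(self, files, context=None):
        results = []
        for start in range(0, len(files), self.batch_size):
            batch = files[start:start + self.batch_size]
            results.extend(self.upload_batch(batch, context))
        return results

    def upload_batch(self, batch, context=None):
        # The real implementation would perform the transfer here.
        return [{**item, "uploaded": True} for item in batch]


strategy = ChunkedUploadStrategy(batch_size=2)
uploaded = strategy.upload_files([{"path": f"f{i}"} for i in range(5)])
print(len(uploaded))  # 5
```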
+ ### Legacy Components

  #### UploadRun

- Specialized run management for upload operations.
+ Specialized run management for upload operations (unchanged from legacy).

  **Logging Methods:**

@@ -563,11 +1619,28 @@ Specialized run management for upload operations.

  #### UploadParams

- Parameter validation model with Pydantic integration.
+ Parameter validation model with Pydantic integration (unchanged from legacy).
+
+ **Required Parameters:**
+
+ - `name: str` - Upload name
+ - `path: str` - Source path
+ - `storage: int` - Storage ID
+ - `data_collection: int` - Data collection ID
+
+ **Optional Parameters:**
+
+ - `description: str | None = None` - Upload description
+ - `project: int | None = None` - Project ID
+ - `excel_metadata_path: str | None = None` - Excel metadata file path
+ - `is_recursive: bool = False` - Recursive file discovery
+ - `max_file_size_mb: int = 50` - Maximum file size in megabytes
+ - `creating_data_unit_batch_size: int = 100` - Data unit batch size
+ - `use_async_upload: bool = True` - Async upload processing

  **Validation Features:**

- - Real-time API validation for storage/collection/project
+ - Real-time API validation for storage/data_collection/project
  - String sanitization and length validation
  - Type checking and conversion
  - Custom validator methods
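Putting the required and optional parameters together, a typical payload could look like this (the IDs and paths are placeholders, not real values):

```python
# Hypothetical upload parameter payload; storage/data_collection IDs and the
# source path are placeholders for illustration.
params = {
    # required
    "name": "traffic-cam-batch-01",
    "path": "/data/incoming/traffic_cams",
    "storage": 3,
    "data_collection": 12,
    # optional overrides
    "is_recursive": True,            # walk subdirectories
    "max_file_size_mb": 100,         # raise the 50 MB default
    "creating_data_unit_batch_size": 200,
    "use_async_upload": True,
}

required = {"name", "path", "storage", "data_collection"}
print(required.issubset(params))  # True: all required keys are present
```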
@@ -580,19 +1653,13 @@ Security configuration for Excel file processing.

  **Configuration Attributes:**

- - File size and memory limits
- - Row and column count limits
- - String length restrictions
- - Environment variable overrides
-
- #### ExcelMetadataUtils
-
- Utility methods for Excel metadata processing.
+ - `max_file_size_mb` - File size limit in megabytes (default: 10)
+ - `max_rows` - Row count limit (default: 100000)
+ - `max_columns` - Column count limit (default: 50)

  **Key Methods:**

- - `validate_and_truncate_string()` - String sanitization
- - `is_valid_filename_length()` - Filename validation
+ - `from_action_config(action_config)` - Create config from config.yaml

  #### PathAwareJSONEncoder

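Judging by its name, `PathAwareJSONEncoder` exists so `pathlib.Path` values survive JSON serialization. A minimal stand-in illustrating the idea (the SDK's actual implementation may cover more types):

```python
import json
from pathlib import Path

# Minimal stand-in for the idea behind PathAwareJSONEncoder: convert
# pathlib.Path values to strings so json.dumps does not raise TypeError.
class PathAwareJSONEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, Path):
            return str(obj)
        return super().default(obj)


payload = {"source": Path("/data/incoming"), "count": 2}
encoded = json.dumps(payload, cls=PathAwareJSONEncoder)
print(encoded)  # {"source": "/data/incoming", "count": 2}
```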
@@ -651,30 +1718,247 @@ Raised when Excel files cannot be parsed.

  ## Best Practices

+ ### Architecture Patterns
+
+ 1. **Strategy Selection**: Choose appropriate strategies based on use case requirements
+
+    - Use `RecursiveFileDiscoveryStrategy` for deep directory structures
+    - Use `BasicValidationStrategy` for standard file validation
+    - Use `AsyncUploadStrategy` for large file sets
+
+ 2. **Step Ordering**: Maintain logical step dependencies
+
+    - Initialize → Process Metadata → Analyze Collection → Organize Files → Validate → Upload → Generate Data Units → Cleanup
+    - Custom steps should be inserted at appropriate points in the workflow
+
+ 3. **Context Management**: Leverage UploadContext for state sharing
+
+    - Store intermediate results in context for downstream steps
+    - Use context for cross-step communication
+    - Preserve rollback data for cleanup operations
+
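The context-management pattern above can be sketched with a dict-backed stand-in (the real `UploadContext` API may differ; the single `update()` mutation point mirrors the "don't mutate state directly" rule from the anti-patterns section):

```python
# Dict-backed stand-in illustrating cross-step state sharing: earlier steps
# publish results that later steps and rollback handlers consume.
class FakeContext:
    def __init__(self):
        self._state = {}

    def update(self, key, value):
        # Single mutation point for all state changes.
        self._state[key] = value

    def get(self, key, default=None):
        return self._state.get(key, default)


context = FakeContext()

# An organize step publishes its output ...
context.update("organized_files", ["a.txt", "b.jpg"])
# ... and records what a rollback would need to undo.
context.update("rollback_data", {"created_dirs": ["/tmp/upload-work"]})

# A downstream upload step consumes the shared state.
files_to_upload = context.get("organized_files", [])
print(len(files_to_upload))  # 2
```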
  ### Performance Optimization

- 1. **Batch Processing**: Use appropriate batch sizes for large uploads
- 2. **Async Operations**: Enable async processing for better throughput
- 3. **Memory Management**: Configure Excel security limits appropriately
- 4. **Progress Monitoring**: Track progress categories for user feedback
+ 1. **Batch Processing**: Configure optimal batch sizes based on system resources

- ### Security Considerations
+    ```python
+    params = {
+        "creating_data_unit_batch_size": 200,  # Adjust based on memory
+        "upload_batch_size": 10,  # Custom parameter for upload strategies
+    }
+    ```

- 1. **File Validation**: Always validate file sizes and types
- 2. **Excel Security**: Configure appropriate security limits
- 3. **Path Sanitization**: Validate and sanitize file paths
- 4. **Content Filtering**: Implement content-based security checks
+ 2. **Async Operations**: Enable async processing for I/O-bound operations

- ### Error Handling
+    ```python
+    params = {
+        "use_async_upload": True,  # Better throughput for network operations
+    }
+    ```
+
+ 3. **Memory Management**: Monitor memory usage in custom strategies
+
+    - Process files in chunks rather than loading all into memory
+    - Use generators for large file collections
+    - Configure Excel security limits appropriately
+
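The chunking advice can be sketched with a generator (a hypothetical helper, not an SDK API): batches are produced lazily, so the full file list never has to be materialized at once.

```python
from itertools import islice
from pathlib import Path

# Hypothetical helper illustrating the chunking advice: yield files in
# fixed-size batches instead of loading the whole collection into memory.
def iter_file_batches(paths, batch_size=100):
    it = iter(paths)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


paths = [Path(f"file_{i}.bin") for i in range(250)]
batch_sizes = [len(b) for b in iter_file_batches(paths, batch_size=100)]
print(batch_sizes)  # [100, 100, 50]
```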
+ 4. **Progress Monitoring**: Implement detailed progress tracking
+
+    ```python
+    # Custom step with progress updates
+    def execute(self, context):
+        total_files = len(context.organized_files)
+        for i, file_info in enumerate(context.organized_files):
+            # Process the file, then report progress
+            progress = (i + 1) / total_files * 100
+            context.update_metrics('custom_step', {'progress': progress})
+    ```
+
+ ### Security Considerations

- 1. **Graceful Degradation**: Handle partial upload failures
- 2. **Detailed Logging**: Use LogCode enum for consistent logging
- 3. **User Feedback**: Provide clear error messages
- 4. **Recovery Options**: Implement retry mechanisms where appropriate
+ 1. **Input Validation**: Validate all input parameters and file paths
+
+    ```python
+    # Custom validation in a strategy
+    def validate_files(self, files, context):
+        for file_path in files:
+            if not self._is_safe_path(file_path):
+                return False
+        return True
+    ```
+
+ 2. **File Content Security**: Implement content-based security checks
+
+    - Scan for malicious file signatures
+    - Validate that file headers match extensions
+    - Check for embedded executables
+
+ 3. **Excel Security**: Configure appropriate security limits
+
+    ```python
+    import os
+    os.environ['EXCEL_MAX_FILE_SIZE_MB'] = '10'
+    os.environ['EXCEL_MAX_MEMORY_MB'] = '30'
+    ```
+
+ 4. **Path Sanitization**: Validate and sanitize all file paths
+
+    - Prevent path traversal attacks
+    - Validate file extensions
+    - Check file permissions
+
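The `_is_safe_path` check used in the validation snippet above is left abstract; a common way to implement the traversal check (this is a generic technique, not an SDK function) is to resolve the candidate against the allowed root:

```python
from pathlib import Path

# Generic path-sanitization check: resolve the candidate path and verify it
# stays inside the allowed upload root, which blocks '../' traversal.
def is_safe_path(candidate, allowed_root):
    root = Path(allowed_root).resolve()
    resolved = (root / candidate).resolve()
    return resolved == root or root in resolved.parents


print(is_safe_path("images/cat.jpg", "/srv/uploads"))    # True
print(is_safe_path("../../etc/passwd", "/srv/uploads"))  # False
```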
+ ### Error Handling and Recovery
+
+ 1. **Graceful Degradation**: Design for partial failure scenarios
+
+    ```python
+    class RobustUploadStrategy(BaseUploadStrategy):
+        def upload_files(self, files, context):
+            successful_uploads = []
+            failed_uploads = []
+
+            for file_info in files:
+                try:
+                    result = self._upload_file(file_info)
+                    successful_uploads.append(result)
+                except Exception as e:
+                    failed_uploads.append({'file': file_info, 'error': str(e)})
+                    # Continue with other files instead of failing completely
+
+            # Update context with partial results
+            context.add_uploaded_files(successful_uploads)
+            if failed_uploads:
+                context.add_error(f"Failed to upload {len(failed_uploads)} files")
+
+            return successful_uploads
+    ```
+
+ 2. **Rollback Design**: Implement comprehensive rollback strategies
+
+    ```python
+    def rollback(self, context):
+        # Clean up in reverse order of operations
+        if hasattr(self, '_created_temp_files'):
+            for temp_file in self._created_temp_files:
+                try:
+                    temp_file.unlink()
+                except Exception:
+                    pass  # Don't fail rollback due to cleanup issues
+    ```
+
+ 3. **Detailed Logging**: Use structured logging for debugging
+
+    ```python
+    def execute(self, context):
+        try:
+            context.run.log_message_with_code(
+                'CUSTOM_STEP_STARTED',
+                {'step': self.name, 'file_count': len(context.organized_files)}
+            )
+            # Step logic here
+        except Exception as e:
+            context.run.log_message_with_code(
+                'CUSTOM_STEP_FAILED',
+                {'step': self.name, 'error': str(e)},
+                level=Context.DANGER
+            )
+            raise
+    ```

  ### Development Guidelines

- 1. **Modular Structure**: Follow the established modular pattern
- 2. **Type Safety**: Use Pydantic models and enum logging
- 3. **Testing**: Comprehensive unit test coverage
- 4. **Documentation**: Document custom validators and methods
+ 1. **Custom Strategy Development**: Follow established patterns
+
+    ```python
+    # Always extend the appropriate base class
+    class MyCustomStrategy(BaseValidationStrategy):
+        def __init__(self, config=None):
+            self.config = config or {}
+
+        def validate_files(self, files, context):
+            # Implement validation logic
+            return True
+
+        def validate_security(self, file_path):
+            # Implement security validation
+            return True
+    ```
+
+ 2. **Testing Strategy**: Comprehensive test coverage
+
+    ```python
+    # Test both success and failure scenarios
+    class TestCustomStrategy:
+        def test_success_case(self):
+            strategy = MyCustomStrategy()
+            result = strategy.validate_files([Path('valid_file.txt')], mock_context)
+            assert result is True
+
+        def test_security_failure(self):
+            strategy = MyCustomStrategy()
+            result = strategy.validate_security(Path('malware.exe'))
+            assert result is False
+
+        def test_rollback_cleanup(self):
+            step = MyCustomStep()
+            step.rollback(mock_context)
+            # Assert cleanup was performed
+    ```
+
+ 3. **Extension Points**: Use factory pattern for extensibility
+
+    ```python
+    class CustomStrategyFactory(StrategyFactory):
+        def create_validation_strategy(self, params, context=None):
+            validation_type = params.get('validation_type', 'basic')
+
+            strategy_map = {
+                'basic': BasicValidationStrategy,
+                'strict': StrictValidationStrategy,
+                'custom': MyCustomValidationStrategy,
+            }
+
+            strategy_class = strategy_map.get(validation_type, BasicValidationStrategy)
+            return strategy_class(params)
+    ```
+
+ 4. **Configuration Management**: Use environment variables and parameters
+
+    ```python
+    import os
+
+    class ConfigurableStep(BaseStep):
+        def __init__(self):
+            # Allow runtime configuration
+            self.batch_size = int(os.getenv('STEP_BATCH_SIZE', '50'))
+            self.timeout = int(os.getenv('STEP_TIMEOUT_SECONDS', '300'))
+
+        def execute(self, context):
+            # Use configured values
+            batch_size = context.get_param('step_batch_size', self.batch_size)
+            timeout = context.get_param('step_timeout', self.timeout)
+    ```
+
+ ### Anti-Patterns to Avoid
+
+ 1. **Tight Coupling**: Don't couple strategies to specific implementations
+ 2. **State Mutation**: Don't modify context state directly outside of the `update()` method
+ 3. **Exception Swallowing**: Don't catch and ignore exceptions without proper handling
+ 4. **Blocking Operations**: Don't perform long-running synchronous operations without progress updates
+ 5. **Memory Leaks**: Don't hold references to large objects in step instances
+
+ ### Troubleshooting Guide
+
+ 1. **Step Failures**: Check step execution order and dependencies
+ 2. **Strategy Issues**: Verify strategy factory configuration and parameter passing
+ 3. **Context Problems**: Ensure proper context updates and state management
+ 4. **Rollback Failures**: Design idempotent rollback operations
+ 5. **Performance Issues**: Profile batch sizes and async operation usage
+
+ ### Migration Checklist
+
+ When upgrading from the legacy implementation:
+
+ - [ ] Update parameter name from `collection` to `data_collection`
+ - [ ] Test existing workflows for compatibility
+ - [ ] Review custom extensions for new architecture opportunities
+ - [ ] Update error handling to leverage new rollback capabilities
+ - [ ] Consider implementing custom strategies for specialized requirements
+ - [ ] Update test cases to validate new workflow steps
+ - [ ] Review logging and metrics collection for enhanced information
+
+ For detailed information on developing custom upload plugins using the BaseUploader template, see the [Developing Upload Templates](./developing-upload-template.md) guide.