activerecord-graph-extractor 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/docs/dry_run.md ADDED
@@ -0,0 +1,410 @@
+ # Dry Run Analysis
+
+ The dry run feature allows you to analyze what would be extracted without performing the actual extraction. This is particularly valuable for large datasets where you want to understand the scope, performance implications, and resource requirements before committing to a full extraction.
+
+ ## Overview
+
+ Dry run analysis provides:
+
+ - **Scope Analysis**: Which models and how many records would be included
+ - **File Size Estimation**: Predicted size of the extraction file
+ - **Performance Estimates**: Expected extraction time and memory usage
+ - **Relationship Mapping**: Understanding of the object graph structure
+ - **Warnings & Recommendations**: Actionable insights for optimization
+ - **Circular Reference Detection**: Identification of potential issues
+
+ ## Usage
+
+ ### Ruby API
+
+ #### Basic Dry Run
+
+ ```ruby
+ require 'activerecord_graph_extractor'
+
+ # Create an extractor instance
+ extractor = ActiveRecordGraphExtractor::Extractor.new
+
+ # Find the object to analyze
+ user = User.find(123)
+
+ # Perform dry run analysis
+ analysis = extractor.dry_run(user)
+
+ # Access analysis results
+ puts "Total estimated records: #{analysis['extraction_scope']['total_estimated_records']}"
+ puts "Estimated file size: #{analysis['estimated_file_size']['human_readable']}"
+ puts "Estimated extraction time: #{analysis['performance_estimates']['estimated_extraction_time_human']}"
+ ```
+
+ #### Dry Run with Options
+
+ ```ruby
+ # Analyze with custom max_depth
+ analysis = extractor.dry_run(user, max_depth: 2)
+
+ # Analyze multiple objects
+ users = User.limit(5)
+ analysis = extractor.dry_run(users, max_depth: 3)
+ ```
+
+ #### Using DryRunAnalyzer Directly
+
+ ```ruby
+ # For more control, use the analyzer directly
+ analyzer = ActiveRecordGraphExtractor::DryRunAnalyzer.new
+ analysis = analyzer.analyze(user, max_depth: 2)
+ ```
+
+ ### CLI Usage
+
+ #### Basic Dry Run
+
+ ```bash
+ # Analyze a specific record
+ arge dry_run User 123
+
+ # With custom max depth
+ arge dry_run User 123 --max-depth 2
+
+ # Save analysis report to file
+ arge dry_run User 123 --output analysis_report.json
+ ```
+
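+ The file written by `--output` is plain JSON, so it can be reloaded later for comparison or documentation. A minimal sketch, assuming the file contains the report structure described below:
+
+ ```ruby
+ require 'json'
+
+ # Reload a previously saved dry run report
+ report = JSON.parse(File.read('analysis_report.json'))
+ puts "Estimated records: #{report['extraction_scope']['total_estimated_records']}"
+ ```
+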
+ #### Example CLI Output
+
+ ```
+ 🔍 Performing dry run analysis...
+
+ Model: User
+ ID: 123
+ Max Depth: default
+
+ ✅ Dry run analysis completed!
+
+ 📊 Analysis Summary:
+    Analysis time: 0.245 seconds
+    Root objects: 1
+    Models involved: 8
+    Total estimated records: 1,247
+    Estimated file size: 2.3 MB
+
+ ⏱️ Performance Estimates:
+    Extraction time: 1.2 seconds
+    Memory usage: 1.3 MB
+
+ 📋 Records by Model:
+    Order        856 (68.7%)
+    Product      234 (18.8%)
+    User           1 (0.1%)
+    Address        2 (0.2%)
+    Profile        1 (0.1%)
+    Photo        145 (11.6%)
+    Category       6 (0.5%)
+    AdminAction    2 (0.2%)
+
+ 🌳 Depth Analysis:
+    Level 1: User
+    Level 2: Order, Address, Profile
+    Level 3: Product, AdminAction
+    Level 4: Photo, Category
+
+ 💡 Recommendations:
+    S3: Large file detected - consider uploading directly to S3
+       → Use extract_to_s3 or extract_and_upload_to_s3 methods
+ ```
+
+ ## Analysis Report Structure
+
+ The dry run analysis returns a comprehensive JSON structure:
+
+ ```json
+ {
+   "dry_run": true,
+   "analysis_time": 0.245,
+   "root_objects": {
+     "models": ["User"],
+     "ids": [123],
+     "count": 1
+   },
+   "extraction_scope": {
+     "max_depth": 3,
+     "total_models": 8,
+     "total_estimated_records": 1247,
+     "models_involved": ["User", "Order", "Product", "Address", "Profile", "Photo", "Category", "AdminAction"]
+   },
+   "estimated_counts_by_model": {
+     "Order": 856,
+     "Product": 234,
+     "Photo": 145,
+     "Category": 6,
+     "Address": 2,
+     "AdminAction": 2,
+     "User": 1,
+     "Profile": 1
+   },
+   "estimated_file_size": {
+     "bytes": 2415616,
+     "human_readable": "2.3 MB"
+   },
+   "depth_analysis": {
+     "1": ["User"],
+     "2": ["Order", "Address", "Profile"],
+     "3": ["Product", "AdminAction"],
+     "4": ["Photo", "Category"]
+   },
+   "relationship_analysis": {
+     "total_relationships": 15,
+     "circular_references": [],
+     "circular_references_count": 0
+   },
+   "performance_estimates": {
+     "estimated_extraction_time_seconds": 1.2,
+     "estimated_extraction_time_human": "1.2 seconds",
+     "estimated_memory_usage_mb": 1.3,
+     "estimated_memory_usage_human": "1.3 MB"
+   },
+   "warnings": [
+     {
+       "type": "medium_file",
+       "message": "Estimated file size is large (2.3 MB). Ensure adequate disk space.",
+       "severity": "medium"
+     }
+   ],
+   "recommendations": [
+     {
+       "type": "s3",
+       "message": "Large file detected - consider uploading directly to S3",
+       "action": "Use extract_to_s3 or extract_and_upload_to_s3 methods"
+     }
+   ]
+ }
+ ```
+
+ ## Key Analysis Components
+
+ ### Extraction Scope
+
+ - **total_models**: Number of different model classes involved
+ - **total_estimated_records**: Total number of records across all models
+ - **models_involved**: List of model class names that would be included
+
+ ### File Size Estimation
+
+ The analyzer estimates file size based on:
+
+ - Column types and their typical sizes
+ - Number of columns per model
+ - JSON structure overhead
+ - Relationship data overhead
+
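+ A rough, illustrative version of this kind of heuristic, using made-up per-type byte weights (the gem's internal estimator may weigh columns differently):
+
+ ```ruby
+ # Hypothetical average encoded sizes per column type, in bytes
+ TYPE_SIZES = { string: 30, text: 200, integer: 8, datetime: 25, boolean: 5 }.freeze
+ FIELD_OVERHEAD = 15 # JSON key name, quotes, and punctuation per field
+
+ def rough_row_size(model)
+   model.columns.sum { |col| TYPE_SIZES.fetch(col.type, 20) + FIELD_OVERHEAD }
+ end
+
+ # A file size estimate is then roughly the per-row size times the
+ # estimated record count for each model, summed across models.
+ rough_row_size(Order) # => e.g. a few hundred bytes per Order row
+ ```
+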
+ ### Performance Estimates
+
+ - **Extraction Time**: Based on an estimated records-per-second processing rate
+ - **Memory Usage**: Estimated peak memory consumption during extraction
+
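+ As a worked example with the sample numbers above, 1,247 records at an assumed rate of roughly 1,000 records per second yields the 1.2-second estimate shown earlier (the rate here is illustrative, not a constant taken from the gem):
+
+ ```ruby
+ estimated_records = 1_247
+ assumed_rate = 1_000.0 # records per second; an assumption for illustration
+
+ estimated_records / assumed_rate # => ~1.2 seconds
+ ```
+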
+ ### Depth Analysis
+
+ Shows which models appear at each relationship depth level, helping you understand the object graph structure.
+
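+ Because `depth_analysis` is a plain hash in the report, a quick tree summary is one loop away:
+
+ ```ruby
+ analysis['depth_analysis'].each do |depth, models|
+   puts "Level #{depth}: #{models.join(', ')}"
+ end
+ # Level 1: User
+ # Level 2: Order, Address, Profile
+ # ...
+ ```
+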
+ ### Warnings
+
+ Automatic warnings are generated for:
+
+ - **Large datasets** (>10,000 records): High severity for >100,000 records
+ - **Large files** (>100MB): High severity for >1GB files
+ - **Deep nesting** (>5 levels): Performance impact warnings
+ - **Circular references**: Potential infinite loops
+
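+ A sketch of how these cutoffs could map to severities, using the thresholds listed above (the exact boundary conditions are an assumption here, not taken from the gem's source):
+
+ ```ruby
+ def record_count_severity(count)
+   return 'high'   if count > 100_000
+   return 'medium' if count > 10_000
+   nil # below the warning threshold
+ end
+
+ def file_size_severity(bytes)
+   return 'high'   if bytes > 1024**3       # > 1 GB
+   return 'medium' if bytes > 100 * 1024**2 # > 100 MB
+   nil
+ end
+ ```
+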
+ ### Recommendations
+
+ Actionable suggestions based on analysis:
+
+ - **Performance**: Batch processing for large datasets
+ - **Depth**: Reducing max_depth for better performance
+ - **Filtering**: Excluding large models or using custom filters
+ - **S3**: Direct S3 upload for large files
+ - **Memory**: RAM considerations for large extractions
+
+ ## Best Practices
+
+ ### 1. Always Dry Run Large Extractions
+
+ ```ruby
+ # Before extracting a potentially large dataset
+ analysis = extractor.dry_run(user)
+
+ if analysis['extraction_scope']['total_estimated_records'] > 10_000
+   puts "⚠️ Large extraction detected!"
+   puts "Consider reducing max_depth or using filters"
+ end
+ ```
+
+ ### 2. Use Analysis for Decision Making
+
+ ```ruby
+ analysis = extractor.dry_run(user)
+
+ file_size_mb = analysis['estimated_file_size']['bytes'] / (1024.0 * 1024)
+ extraction_time = analysis['performance_estimates']['estimated_extraction_time_seconds']
+
+ if file_size_mb > 50
+   # Use S3 for large files
+   extractor.extract_to_s3(user, s3_client, 'my-key')
+ elsif extraction_time > 300 # 5 minutes
+   # Schedule for off-peak hours
+   puts "Consider running during off-peak hours"
+ else
+   # Proceed with normal extraction
+   extractor.extract(user, 'output.json')
+ end
+ ```
+
+ ### 3. Compare Different Depths
+
+ ```ruby
+ [1, 2, 3, 4].each do |depth|
+   analysis = extractor.dry_run(user, max_depth: depth)
+
+   puts "Depth #{depth}:"
+   puts "  Records: #{analysis['extraction_scope']['total_estimated_records']}"
+   puts "  File size: #{analysis['estimated_file_size']['human_readable']}"
+   puts "  Time: #{analysis['performance_estimates']['estimated_extraction_time_human']}"
+   puts
+ end
+ ```
+
+ ### 4. Save Analysis Reports
+
+ ```ruby
+ # Save detailed analysis for documentation
+ analysis = extractor.dry_run(user)
+ timestamp = Time.now.strftime('%Y%m%d_%H%M%S')
+ File.write("analysis_#{timestamp}.json", JSON.pretty_generate(analysis))
+ ```
+
+ ### 5. Monitor Warnings and Recommendations
+
+ ```ruby
+ analysis = extractor.dry_run(user)
+
+ # Check for high-severity warnings
+ high_warnings = analysis['warnings'].select { |w| w['severity'] == 'high' }
+ if high_warnings.any?
+   puts "🚨 High severity warnings detected:"
+   high_warnings.each { |w| puts "  - #{w['message']}" }
+ end
+
+ # Follow recommendations
+ analysis['recommendations'].each do |rec|
+   puts "💡 #{rec['type'].upcase}: #{rec['message']}"
+   puts "   Action: #{rec['action']}"
+ end
+ ```
+
+ ## Integration Examples
+
+ ### With Rails Console
+
+ ```ruby
+ # In Rails console
+ user = User.find(123)
+ extractor = ActiveRecordGraphExtractor::Extractor.new
+ analysis = extractor.dry_run(user, max_depth: 2)
+
+ # Quick summary
+ puts "#{analysis['extraction_scope']['total_estimated_records']} records, #{analysis['estimated_file_size']['human_readable']}"
+ ```
+
+ ### With Rake Tasks
+
+ ```ruby
+ # lib/tasks/data_analysis.rake
+ namespace :data do
+   desc "Analyze extraction scope for a user"
+   task :analyze_user, [:user_id] => :environment do |t, args|
+     user = User.find(args[:user_id])
+     extractor = ActiveRecordGraphExtractor::Extractor.new
+
+     analysis = extractor.dry_run(user)
+
+     puts "Analysis for User #{user.id}:"
+     puts "  Total records: #{analysis['extraction_scope']['total_estimated_records']}"
+     puts "  File size: #{analysis['estimated_file_size']['human_readable']}"
+     puts "  Extraction time: #{analysis['performance_estimates']['estimated_extraction_time_human']}"
+
+     # Save report
+     File.write("user_#{user.id}_analysis.json", JSON.pretty_generate(analysis))
+   end
+ end
+ ```
+
+ ### With Background Jobs
+
+ ```ruby
+ class DataExtractionJob < ApplicationJob
+   def perform(user_id)
+     user = User.find(user_id)
+     extractor = ActiveRecordGraphExtractor::Extractor.new
+
+     # Always dry run first
+     analysis = extractor.dry_run(user)
+
+     # Check if extraction is feasible
+     if analysis['extraction_scope']['total_estimated_records'] > 100_000
+       Rails.logger.warn "Large extraction requested for user #{user_id}"
+       # Maybe split into smaller jobs or use a different strategy
+       return
+     end
+
+     # Proceed with actual extraction
+     result = extractor.extract(user, "user_#{user_id}_data.json")
+     Rails.logger.info "Extracted #{result.total_records} records"
+   end
+ end
+ ```
+
+ ## Troubleshooting
+
+ ### Inaccurate Estimates
+
+ Estimates are based on sampling and heuristics. They may be inaccurate if:
+
+ - Your data has unusual distribution patterns
+ - Relationships have highly variable cardinality
+ - Models have very large or very small average record sizes
+
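+ If you suspect the estimates are off for your data, a simple sanity check is to compare a dry run against an actual extraction of a representative record, using the APIs shown earlier:
+
+ ```ruby
+ analysis = extractor.dry_run(user)
+ estimated = analysis['extraction_scope']['total_estimated_records']
+
+ result = extractor.extract(user, 'estimate_check.json')
+ actual = result.total_records
+
+ puts "Estimated #{estimated}, actually extracted #{actual}"
+ ```
+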
+ ### Performance Issues
+
+ If the dry run analysis itself is slow:
+
+ - Reduce the max_depth for the analysis
+ - Check for database performance issues
+ - Make sure the foreign keys behind your relationships are properly indexed
+
+ ### Memory Usage During Analysis
+
+ Dry run analysis uses minimal memory because it loads only record counts and metadata, never the actual record data.
+
+ ## Configuration
+
+ Dry run analysis respects the same configuration options as regular extraction:
+
+ ```ruby
+ ActiveRecordGraphExtractor.configure do |config|
+   config.max_depth = 3
+   config.include_models = ['User', 'Order', 'Product']
+   config.exclude_relationships = ['audit_logs', 'temp_data']
+ end
+
+ # Analysis will use these configuration settings
+ analysis = extractor.dry_run(user)
+ ```
+
+ ## Limitations
+
+ 1. **Estimates Only**: Results are estimates based on sampling and heuristics
+ 2. **Database Dependent**: Accuracy depends on database statistics and data distribution
+ 3. **Static Analysis**: Cannot account for dynamic filtering or custom serializers
+ 4. **Relationship Complexity**: Complex polymorphic relationships may not be fully analyzed
+
+ ## See Also
+
+ - [Usage Guide](usage.md) - General extraction usage
+ - [S3 Integration](s3_integration.md) - S3 upload capabilities
+ - [Examples](examples.md) - More usage examples
data/docs/examples.md ADDED
@@ -0,0 +1,239 @@
+ # Usage Examples
+
+ ## Basic Usage
+
+ ### Extract an Order and Related Records
+
+ ```ruby
+ # Find the order to extract
+ order = Order.find(12345)
+
+ # Create an extractor instance
+ extractor = ActiveRecordGraphExtractor::Extractor.new
+
+ # Extract the order and all related records
+ data = extractor.extract(order)
+
+ # Export to a JSON file
+ File.write('order_12345.json', data.to_json)
+ ```
+
+ ### Import the Extracted Data
+
+ ```ruby
+ # Create an importer instance
+ importer = ActiveRecordGraphExtractor::Importer.new
+
+ # Import from the JSON file
+ importer.import_from_file('order_12345.json')
+ ```
+
+ ## Configuration Examples
+
+ ### Extraction Configuration
+
+ ```ruby
+ extractor = ActiveRecordGraphExtractor::Extractor.new(
+   # Control which relationships to include
+   include_relationships: %w[products customer shipping_address],
+
+   # Set maximum depth of relationship traversal
+   max_depth: 3,
+
+   # Custom serialization for specific models
+   custom_serializers: {
+     'User' => ->(record) {
+       {
+         id: record.id,
+         full_name: "#{record.first_name} #{record.last_name}",
+         email: record.email
+       }
+     }
+   }
+ )
+ ```
+
+ ### Import Configuration
+
+ ```ruby
+ importer = ActiveRecordGraphExtractor::Importer.new(
+   # Skip records that already exist
+   skip_existing: true,
+
+   # Update existing records instead of skipping them
+   update_existing: false,
+
+   # Wrap the import in a transaction
+   transaction: true,
+
+   # Validate records before saving
+   validate: true,
+
+   # Custom finder methods for specific models
+   custom_finders: {
+     'Product' => ->(attrs) {
+       Product.find_by(product_number: attrs['product_number'])
+     }
+   }
+ )
+ ```
+
+ ## S3 Integration
+
+ ### Upload Extractions to S3
+
+ ```ruby
+ # Extract and upload to S3 in one step
+ extractor = ActiveRecordGraphExtractor::Extractor.new
+ result = extractor.extract_and_upload_to_s3(
+   order,
+   bucket_name: 'my-extraction-bucket',
+   s3_key: 'extractions/order_12345.json',
+   region: 'us-east-1'
+ )
+
+ puts "Uploaded to: #{result['s3_upload'][:url]}"
+ ```
+
+ ### Using S3Client Directly
+
+ ```ruby
+ # Create an S3 client
+ s3_client = ActiveRecordGraphExtractor::S3Client.new(
+   bucket_name: 'my-extraction-bucket',
+   region: 'us-east-1'
+ )
+
+ # Extract to S3
+ result = extractor.extract_to_s3(order, s3_client, 'extractions/order_12345.json')
+
+ # List files in the bucket
+ files = s3_client.list_files(prefix: 'extractions/')
+ files.each { |file| puts "#{file[:key]} (#{file[:size]} bytes)" }
+
+ # Download a file
+ s3_client.download_file('extractions/order_12345.json', 'local_order.json')
+ ```
+
+ For detailed S3 configuration and usage, see the [S3 Integration Guide](s3_integration.md).
+
+ ## CLI Usage
+
+ ### Extract Records
+
+ ```bash
+ # Basic extraction
+ $ arge extract Order 12345
+
+ # With specific relationships
+ $ arge extract Order 12345 --include-relationships products,customer,shipping_address
+
+ # With maximum depth
+ $ arge extract Order 12345 --max-depth 3
+
+ # With an output file
+ $ arge extract Order 12345 --output order_12345.json
+ ```
+
+ ### Import Records
+
+ ```bash
+ # Basic import
+ $ arge import order_12345.json
+
+ # With batch size
+ $ arge import order_12345.json --batch-size 1000
+
+ # With validation
+ $ arge import order_12345.json --validate
+
+ # With transaction
+ $ arge import order_12345.json --transaction
+ ```
+
+ ### S3 Operations
+
+ ```bash
+ # Extract and upload to S3
+ $ arge extract_to_s3 Order 12345 --bucket my-extraction-bucket --key extractions/order_12345.json
+
+ # List files in S3 bucket
+ $ arge s3_list --bucket my-extraction-bucket --prefix extractions/
+
+ # Download from S3
+ $ arge s3_download extractions/order_12345.json --bucket my-extraction-bucket --output local_order.json
+ ```
+
+ ## Advanced Examples
+
+ ### Custom Traversal Rules
+
+ ```ruby
+ extractor = ActiveRecordGraphExtractor::Extractor.new(
+   traversal_rules: {
+     'Order' => {
+       'products' => { max_depth: 2 },
+       'customer' => { max_depth: 1 },
+       'shipping_address' => { max_depth: 1 }
+     }
+   }
+ )
+ ```
+
+ ### Progress Tracking
+
+ ```ruby
+ # Extract with progress tracking
+ extractor = ActiveRecordGraphExtractor::Extractor.new
+ extractor.extract(order) do |progress|
+   puts "Processed #{progress.current} of #{progress.total} records"
+ end
+
+ # Import with progress tracking
+ importer = ActiveRecordGraphExtractor::Importer.new
+ importer.import_from_file('order_12345.json') do |progress|
+   puts "Imported #{progress.current} of #{progress.total} records"
+ end
+ ```
+
+ ### Error Handling
+
+ ```ruby
+ # Handle extraction errors
+ begin
+   extractor.extract(order)
+ rescue ActiveRecordGraphExtractor::ExtractionError => e
+   puts "Extraction failed: #{e.message}"
+   puts "Failed records: #{e.failed_records}"
+ end
+
+ # Handle import errors
+ begin
+   importer.import_from_file('order_12345.json')
+ rescue ActiveRecordGraphExtractor::ImportError => e
+   puts "Import failed: #{e.message}"
+   puts "Failed records: #{e.failed_records}"
+ end
+ ```
+
+ ### Circular Dependencies
+
+ ```ruby
+ begin
+   extractor.extract(order)
+ rescue ActiveRecordGraphExtractor::CircularDependencyError => e
+   puts "Circular dependency detected: #{e.message}"
+   puts "Dependency chain: #{e.dependency_chain}"
+ end
+ ```
+
+ ## Best Practices
+
+ 1. Always specify the maximum depth to prevent infinite loops
+ 2. Use custom serializers to control what data is exported
+ 3. Use custom finders to handle complex record matching
+ 4. Enable transactions for atomic imports
+ 5. Monitor progress for large extractions/imports
+ 6. Handle errors appropriately
+ 7. Validate data before importing
+ 8. Use batch processing for large datasets
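+
+ A minimal sketch combining several of these practices (an explicit depth limit, progress monitoring, error handling, and a validated, transactional import), built only from the APIs shown above:
+
+ ```ruby
+ extractor = ActiveRecordGraphExtractor::Extractor.new(max_depth: 3)
+
+ begin
+   data = extractor.extract(order) do |progress|
+     puts "Processed #{progress.current} of #{progress.total} records"
+   end
+   File.write('order_12345.json', data.to_json)
+ rescue ActiveRecordGraphExtractor::ExtractionError => e
+   puts "Extraction failed: #{e.message}"
+ end
+
+ # Atomic, validated import of the file written above
+ importer = ActiveRecordGraphExtractor::Importer.new(transaction: true, validate: true)
+ importer.import_from_file('order_12345.json')
+ ```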