@cdklabs/cdk-appmod-catalog-blueprints 1.5.0 → 1.6.0

Files changed (92)
  1. package/.jsii +2537 -204
  2. package/lib/document-processing/adapter/adapter.d.ts +4 -2
  3. package/lib/document-processing/adapter/adapter.js +1 -1
  4. package/lib/document-processing/adapter/queued-s3-adapter.d.ts +9 -2
  5. package/lib/document-processing/adapter/queued-s3-adapter.js +29 -15
  6. package/lib/document-processing/agentic-document-processing.d.ts +4 -0
  7. package/lib/document-processing/agentic-document-processing.js +20 -10
  8. package/lib/document-processing/base-document-processing.d.ts +54 -2
  9. package/lib/document-processing/base-document-processing.js +136 -82
  10. package/lib/document-processing/bedrock-document-processing.d.ts +202 -2
  11. package/lib/document-processing/bedrock-document-processing.js +717 -77
  12. package/lib/document-processing/chunking-config.d.ts +614 -0
  13. package/lib/document-processing/chunking-config.js +5 -0
  14. package/lib/document-processing/default-document-processing-config.js +1 -1
  15. package/lib/document-processing/index.d.ts +1 -0
  16. package/lib/document-processing/index.js +2 -1
  17. package/lib/document-processing/resources/aggregation/handler.py +567 -0
  18. package/lib/document-processing/resources/aggregation/requirements.txt +7 -0
  19. package/lib/document-processing/resources/aggregation/test_handler.py +362 -0
  20. package/lib/document-processing/resources/cleanup/handler.py +276 -0
  21. package/lib/document-processing/resources/cleanup/requirements.txt +5 -0
  22. package/lib/document-processing/resources/cleanup/test_handler.py +436 -0
  23. package/lib/document-processing/resources/default-bedrock-invoke/index.py +85 -3
  24. package/lib/document-processing/resources/default-bedrock-invoke/test_index.py +622 -0
  25. package/lib/document-processing/resources/pdf-chunking/README.md +313 -0
  26. package/lib/document-processing/resources/pdf-chunking/chunking_strategies.py +460 -0
  27. package/lib/document-processing/resources/pdf-chunking/error_handling.py +491 -0
  28. package/lib/document-processing/resources/pdf-chunking/handler.py +958 -0
  29. package/lib/document-processing/resources/pdf-chunking/metrics.py +435 -0
  30. package/lib/document-processing/resources/pdf-chunking/requirements.txt +3 -0
  31. package/lib/document-processing/resources/pdf-chunking/strategy_selection.py +420 -0
  32. package/lib/document-processing/resources/pdf-chunking/structured_logging.py +457 -0
  33. package/lib/document-processing/resources/pdf-chunking/test_chunking_strategies.py +353 -0
  34. package/lib/document-processing/resources/pdf-chunking/test_error_handling.py +487 -0
  35. package/lib/document-processing/resources/pdf-chunking/test_handler.py +609 -0
  36. package/lib/document-processing/resources/pdf-chunking/test_integration.py +694 -0
  37. package/lib/document-processing/resources/pdf-chunking/test_metrics.py +532 -0
  38. package/lib/document-processing/resources/pdf-chunking/test_strategy_selection.py +471 -0
  39. package/lib/document-processing/resources/pdf-chunking/test_structured_logging.py +449 -0
  40. package/lib/document-processing/resources/pdf-chunking/test_token_estimation.py +374 -0
  41. package/lib/document-processing/resources/pdf-chunking/token_estimation.py +189 -0
  42. package/lib/document-processing/tests/agentic-document-processing-nag.test.js +4 -3
  43. package/lib/document-processing/tests/agentic-document-processing.test.js +488 -4
  44. package/lib/document-processing/tests/base-document-processing-nag.test.js +9 -2
  45. package/lib/document-processing/tests/base-document-processing-schema.test.d.ts +1 -0
  46. package/lib/document-processing/tests/base-document-processing-schema.test.js +337 -0
  47. package/lib/document-processing/tests/base-document-processing.test.js +114 -8
  48. package/lib/document-processing/tests/bedrock-document-processing-chunking-nag.test.d.ts +1 -0
  49. package/lib/document-processing/tests/bedrock-document-processing-chunking-nag.test.js +382 -0
  50. package/lib/document-processing/tests/bedrock-document-processing-nag.test.js +4 -3
  51. package/lib/document-processing/tests/bedrock-document-processing-security.test.d.ts +1 -0
  52. package/lib/document-processing/tests/bedrock-document-processing-security.test.js +389 -0
  53. package/lib/document-processing/tests/bedrock-document-processing.test.js +808 -8
  54. package/lib/document-processing/tests/chunking-config.test.d.ts +1 -0
  55. package/lib/document-processing/tests/chunking-config.test.js +238 -0
  56. package/lib/document-processing/tests/queued-s3-adapter-nag.test.js +9 -2
  57. package/lib/document-processing/tests/queued-s3-adapter.test.js +17 -6
  58. package/lib/framework/agents/base-agent.js +1 -1
  59. package/lib/framework/agents/batch-agent.js +1 -1
  60. package/lib/framework/agents/default-agent-config.js +1 -1
  61. package/lib/framework/bedrock/bedrock.js +1 -1
  62. package/lib/framework/custom-resource/default-runtimes.js +1 -1
  63. package/lib/framework/foundation/access-log.js +1 -1
  64. package/lib/framework/foundation/eventbridge-broker.js +1 -1
  65. package/lib/framework/foundation/network.js +1 -1
  66. package/lib/framework/tests/access-log.test.js +5 -2
  67. package/lib/framework/tests/batch-agent.test.js +5 -2
  68. package/lib/framework/tests/bedrock.test.js +5 -2
  69. package/lib/framework/tests/eventbridge-broker.test.js +5 -2
  70. package/lib/framework/tests/framework-nag.test.js +16 -8
  71. package/lib/framework/tests/network.test.js +9 -4
  72. package/lib/tsconfig.tsbuildinfo +1 -1
  73. package/lib/utilities/data-loader.js +1 -1
  74. package/lib/utilities/lambda-iam-utils.js +1 -1
  75. package/lib/utilities/observability/cloudfront-distribution-observability-property-injector.js +1 -1
  76. package/lib/utilities/observability/default-observability-config.js +1 -1
  77. package/lib/utilities/observability/lambda-observability-property-injector.js +1 -1
  78. package/lib/utilities/observability/log-group-data-protection-utils.js +1 -1
  79. package/lib/utilities/observability/powertools-config.d.ts +10 -1
  80. package/lib/utilities/observability/powertools-config.js +19 -3
  81. package/lib/utilities/observability/state-machine-observability-property-injector.js +1 -1
  82. package/lib/utilities/test-utils.d.ts +43 -0
  83. package/lib/utilities/test-utils.js +56 -0
  84. package/lib/utilities/tests/data-loader-nag.test.js +3 -2
  85. package/lib/utilities/tests/data-loader.test.js +3 -2
  86. package/lib/webapp/frontend-construct.js +1 -1
  87. package/lib/webapp/tests/frontend-construct-nag.test.js +3 -2
  88. package/lib/webapp/tests/frontend-construct.test.js +3 -2
  89. package/package.json +6 -5
  90. package/lib/document-processing/resources/default-error-handler/index.js +0 -46
  91. package/lib/document-processing/resources/default-pdf-processor/index.js +0 -46
  92. package/lib/document-processing/resources/default-pdf-validator/index.js +0 -36
@@ -0,0 +1,313 @@
1
+ # PDF Analysis and Chunking Lambda
2
+
3
+ This Lambda function is the first step in the Step Functions workflow for chunked document processing. It analyzes PDFs to determine if chunking is needed, and if so, splits the PDF into chunks and uploads them to S3.
4
+
5
+ ## Overview
6
+
7
+ This is a **single Lambda function** that performs both analysis and chunking to avoid downloading the PDF twice. It:
8
+
9
+ 1. Analyzes the PDF to determine token count and page count
10
+ 2. Determines if chunking is required based on strategy and thresholds
11
+ 3. If no chunking needed: returns analysis metadata only
12
+ 4. If chunking needed: splits PDF and uploads chunks to S3
13
+
14
+ ## Files
15
+
16
+ - `handler.py` - Main Lambda handler function
17
+ - `token_estimation.py` - Token estimation module (word-based heuristic)
18
+ - `chunking_strategies.py` - Chunking algorithms (fixed-pages, token-based, hybrid)
19
+ - `requirements.txt` - Python dependencies (PyPDF2, boto3)
20
+ - `test_handler.py` - Unit tests for handler
21
+ - `test_token_estimation.py` - Unit tests for token estimation
22
+ - `test_chunking_strategies.py` - Unit tests for chunking strategies
23
+ - `test_integration.py` - Integration tests (requires AWS setup)
24
+
25
+ ## Configuration
26
+
27
+ The Lambda supports three chunking strategies:
28
+
29
+ ### 1. Fixed-Pages Strategy (Legacy)
30
+ Simple page-based chunking. Fast but doesn't account for token density.
31
+
32
+ ```python
33
+ config = {
34
+ 'strategy': 'fixed-pages',
35
+ 'pageThreshold': 100,
36
+ 'chunkSize': 50,
37
+ 'overlapPages': 5
38
+ }
39
+ ```
40
+
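As a rough illustration of the fixed-pages strategy, the sketch below computes page ranges for the configuration above. The helper name `fixed_page_ranges` and the overlap rule (each chunk after the first starts `overlapPages` pages before the previous chunk ends) are assumptions made for illustration; the README does not show how `chunking_strategies.py` actually applies the overlap.

```python
def fixed_page_ranges(total_pages, chunk_size=50, overlap_pages=5):
    """Return inclusive, 0-indexed (start, end) page ranges.

    Hypothetical helper: each chunk holds at most `chunk_size` pages and
    repeats the last `overlap_pages` pages of the previous chunk.
    """
    if chunk_size <= overlap_pages:
        raise ValueError("chunk_size must be larger than overlap_pages")
    step = chunk_size - overlap_pages  # pages of new material per chunk
    ranges = []
    start = 0
    while start < total_pages:
        end = min(start + chunk_size, total_pages) - 1
        ranges.append((start, end))
        if end == total_pages - 1:
            break  # the last page is covered; stop before emitting a redundant chunk
        start += step
    return ranges


print(fixed_page_ranges(150))
# → [(0, 49), (45, 94), (90, 139), (135, 149)]
```

With the defaults, every chunk boundary is covered twice, which is the usual reason for configuring an overlap at all.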
41
+ ### 2. Token-Based Strategy
42
+ Token-aware chunking that respects model limits. Ideal for variable-density documents.
43
+
44
+ ```python
45
+ config = {
46
+ 'strategy': 'token-based',
47
+ 'tokenThreshold': 150000,
48
+ 'maxTokensPerChunk': 100000,
49
+ 'overlapTokens': 5000
50
+ }
51
+ ```
52
+
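One way to picture the token-based strategy is a greedy packer over per-page token estimates. The sketch below is an assumption-laden illustration, not the package's implementation: the name `token_based_ranges`, the greedy packing rule, and the way `overlapTokens` is converted into repeated trailing pages are all hypothetical.

```python
def token_based_ranges(tokens_per_page, max_tokens_per_chunk=100_000,
                       overlap_tokens=5_000):
    """Return inclusive, 0-indexed (start, end) page ranges (hypothetical helper)."""
    ranges = []
    n = len(tokens_per_page)
    start = 0
    while start < n:
        total = 0
        end = start
        # Greedily add pages while the budget allows; always take at least
        # one page so an oversized page cannot stall the loop.
        while end < n and (end == start
                           or total + tokens_per_page[end] <= max_tokens_per_chunk):
            total += tokens_per_page[end]
            end += 1
        ranges.append((start, end - 1))
        if end >= n:
            break
        # Step back over trailing pages worth up to `overlap_tokens`,
        # but always advance by at least one page.
        next_start = end
        overlap = 0
        while (next_start - 1 > start
               and overlap + tokens_per_page[next_start - 1] <= overlap_tokens):
            next_start -= 1
            overlap += tokens_per_page[next_start]
        start = next_start
    return ranges


# A uniform 100-page document at 2,000 estimated tokens per page.
print(token_based_ranges([2000] * 100))
# → [(0, 49), (48, 97), (96, 99)]
```

Because whole pages are the unit of splitting, the overlap is approximate: it repeats as many trailing pages as fit within `overlap_tokens`.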
53
+ ### 3. Hybrid Strategy (RECOMMENDED)
54
+ Best of both worlds - targets token count but respects page limits.
55
+
56
+ ```python
57
+ config = {
58
+ 'strategy': 'hybrid',
59
+ 'pageThreshold': 100,
60
+ 'tokenThreshold': 150000,
61
+ 'targetTokensPerChunk': 80000,
62
+ 'maxPagesPerChunk': 99, # Bedrock has a hard limit of 100 pages
63
+ 'overlapTokens': 5000
64
+ }
65
+ ```
66
+
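The hybrid idea (aim for a token target, but never exceed a page cap) can be sketched as follows. This is a simplified illustration under stated assumptions: `hybrid_ranges` is a hypothetical name, overlap is left out for brevity, and the real strategy in `chunking_strategies.py` may differ.

```python
def hybrid_ranges(tokens_per_page, target_tokens_per_chunk=80_000,
                  max_pages_per_chunk=99):
    """Return inclusive, 0-indexed (start, end) page ranges (hypothetical helper)."""
    if max_pages_per_chunk < 1:
        raise ValueError("max_pages_per_chunk must be at least 1")
    ranges = []
    n = len(tokens_per_page)
    start = 0
    while start < n:
        end = start
        total = 0
        # Stop at the page cap, or when the next page would pass the token target.
        while end < n and end - start < max_pages_per_chunk:
            if end > start and total + tokens_per_page[end] > target_tokens_per_chunk:
                break
            total += tokens_per_page[end]
            end += 1
        ranges.append((start, end - 1))
        start = end
    return ranges


# Sparse document: the page cap decides the chunk boundaries.
print(hybrid_ranges([100] * 250))   # → [(0, 98), (99, 197), (198, 249)]
# Dense document: the token target decides the chunk boundaries.
print(hybrid_ranges([2000] * 100))  # → [(0, 39), (40, 79), (80, 99)]
```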
67
+ ## Event Format
68
+
69
+ ### Input Event (from SQS Consumer)
70
+
71
+ The Lambda receives events from the SQS Consumer in this exact format:
72
+
73
+ ```json
74
+ {
75
+ "documentId": "invoice-2024-001-1705315800000",
76
+ "contentType": "file",
77
+ "content": {
78
+ "location": "s3",
79
+ "bucket": "my-document-bucket",
80
+ "key": "raw/invoice-2024-001.pdf",
81
+ "filename": "invoice-2024-001.pdf"
82
+ },
83
+ "eventTime": "2024-01-15T10:30:00.000Z",
84
+ "eventName": "ObjectCreated:Put",
85
+ "source": "sqs-consumer"
86
+ }
87
+ ```
88
+
89
+ **Required Fields:**
90
+ - `documentId` - Unique document identifier (generated by SQS consumer)
91
+ - `content.bucket` - S3 bucket name
92
+ - `content.key` - S3 object key (must be under the `raw/` prefix)
93
+
94
+ **Optional Fields:**
95
+ - `contentType` - Must be "file" if provided; omitting it gives the same file-based behavior
96
+ - `content.location` - Always "s3" (informational)
97
+ - `content.filename` - Original filename (informational)
98
+ - `eventTime` - S3 event timestamp (informational)
99
+ - `eventName` - S3 event name (informational)
100
+ - `source` - Event source identifier (informational)
101
+ - `config` - Optional chunking configuration override
102
+
103
+ ### Input Event (with Custom Configuration)
104
+
105
+ You can override chunking configuration per document:
106
+
107
+ ```json
108
+ {
109
+ "documentId": "doc-123",
110
+ "contentType": "file",
111
+ "content": {
112
+ "bucket": "document-bucket",
113
+ "key": "raw/document.pdf",
114
+ "filename": "document.pdf"
115
+ },
116
+ "config": {
117
+ "strategy": "hybrid",
118
+ "pageThreshold": 100,
119
+ "tokenThreshold": 150000,
120
+ "targetTokensPerChunk": 80000,
121
+ "maxPagesPerChunk": 99
122
+ }
123
+ }
124
+ ```
125
+
126
+ ### Output (No Chunking)
127
+ ```json
128
+ {
129
+ "documentId": "doc-123",
130
+ "requiresChunking": false,
131
+ "tokenAnalysis": {
132
+ "totalTokens": 45000,
133
+ "totalPages": 30,
134
+ "avgTokensPerPage": 1500
135
+ },
136
+ "reason": "Document has 30 pages, below threshold of 100"
137
+ }
138
+ ```
139
+
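The `requiresChunking` flag could be derived from the analysis totals roughly as sketched below. The README only says the decision depends on "strategy and thresholds", so the specifics here are assumptions: the function name, the strict `>` comparison, and the rule that the hybrid strategy chunks when either threshold is exceeded.

```python
def requires_chunking(total_pages, total_tokens, config):
    """Decide whether a document should be chunked (hypothetical helper)."""
    strategy = config.get("strategy", "hybrid")
    over_pages = total_pages > config.get("pageThreshold", 100)
    over_tokens = total_tokens > config.get("tokenThreshold", 150_000)
    if strategy == "fixed-pages":
        return over_pages          # only the page count matters
    if strategy == "token-based":
        return over_tokens         # only the token estimate matters
    return over_pages or over_tokens  # assumed rule for "hybrid"


# The two sample outputs in this README, fed through the sketch:
print(requires_chunking(30, 45_000, {"strategy": "hybrid"}))    # → False
print(requires_chunking(150, 200_000, {"strategy": "hybrid"}))  # → True
```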
140
+ ### Output (Chunking)
141
+ ```json
142
+ {
143
+ "documentId": "doc-456",
144
+ "requiresChunking": true,
145
+ "tokenAnalysis": {
146
+ "totalTokens": 200000,
147
+ "totalPages": 150,
148
+ "avgTokensPerPage": 1333,
149
+ "tokensPerPage": [...]
150
+ },
151
+ "strategy": "hybrid",
152
+ "chunks": [
153
+ {
154
+ "chunkId": "doc-456_chunk_0",
155
+ "chunkIndex": 0,
156
+ "totalChunks": 2,
157
+ "startPage": 0,
158
+ "endPage": 74,
159
+ "pageCount": 75,
160
+ "estimatedTokens": 100000,
161
+ "bucket": "document-bucket",
162
+ "key": "chunks/doc-456_chunk_0.pdf"
163
+ }
164
+ ],
165
+ "config": {
166
+ "strategy": "hybrid",
167
+ "totalPages": 150,
168
+ "totalTokens": 200000,
169
+ "targetTokensPerChunk": 80000,
170
+ "maxPagesPerChunk": 99
171
+ }
172
+ }
173
+ ```
174
+
175
+ ## Validation
176
+
177
+ The Lambda performs several validation checks:
178
+
179
+ ### 1. Payload Validation
180
+ - **documentId** - Must be present
181
+ - **content.bucket** - Must be present
182
+ - **content.key** - Must be present
183
+ - **contentType** - Must be "file" if provided (only file-based processing is supported)
184
+
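The payload checks listed above might look like this in code. It is a minimal sketch: `validate_payload` is a hypothetical name, and only the `documentId` message is taken from the README's error example; the other messages are invented for illustration.

```python
def validate_payload(event):
    """Return (document_id, bucket, key) or raise ValueError (hypothetical helper)."""
    if not event.get("documentId"):
        raise ValueError("Missing required field: documentId")
    content = event.get("content") or {}
    for field in ("bucket", "key"):
        if not content.get(field):
            # Message wording is an assumption modelled on the documentId case.
            raise ValueError(f"Missing required field: content.{field}")
    # contentType is optional, but only "file" is accepted when present.
    content_type = event.get("contentType", "file")
    if content_type != "file":
        raise ValueError(f"Unsupported contentType: {content_type}")
    return event["documentId"], content["bucket"], content["key"]


event = {
    "documentId": "doc-123",
    "contentType": "file",
    "content": {"bucket": "document-bucket", "key": "raw/document.pdf"},
}
print(validate_payload(event))
# → ('doc-123', 'document-bucket', 'raw/document.pdf')
```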
185
+ ### 2. File Extension Check
186
+ - Logs a warning if the file doesn't have a `.pdf` extension
187
+ - Still processes the file (validates using magic bytes)
188
+ - Useful for catching misnamed files
189
+
190
+ ### 3. PDF Magic Bytes Validation
191
+ - Validates file starts with `%PDF-` before processing
192
+ - Prevents wasting resources on non-PDF files
193
+ - Rejects HTML, text, images, and other formats
194
+
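The magic-bytes check amounts to a prefix comparison on the first five bytes of the file. A minimal sketch, with `looks_like_pdf` as a hypothetical name:

```python
PDF_MAGIC = b"%PDF-"  # hex: 25 50 44 46 2D


def looks_like_pdf(first_bytes: bytes) -> bool:
    """Return True when the data starts with the PDF signature."""
    return first_bytes.startswith(PDF_MAGIC)


print(looks_like_pdf(b"%PDF-1.7\n..."))       # → True
print(looks_like_pdf(b"<!DOCTYPE html>"))     # → False
```

A file that passes this check can still be corrupted or encrypted, which is why the structural validation in the next step is still needed.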
195
+ ### 4. PDF Format Validation
196
+ - Uses PyPDF2 to validate PDF structure
197
+ - Detects corrupted or invalid PDFs
198
+ - Rejects encrypted PDFs (not supported)
199
+
200
+ ### Error Responses
201
+
202
+ All validation errors return a standardized error response:
203
+
204
+ ```json
205
+ {
206
+ "documentId": "doc-123",
207
+ "requiresChunking": false,
208
+ "error": {
209
+ "type": "ValueError",
210
+ "message": "Missing required field: documentId"
211
+ }
212
+ }
213
+ ```
214
+
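A response of this shape could be built from a caught exception as sketched below. The helper name `error_response` is an assumption; the field names follow the JSON example above.

```python
import json


def error_response(document_id, exc):
    """Build the standardized error payload from an exception (hypothetical helper)."""
    return {
        "documentId": document_id,
        "requiresChunking": False,
        "error": {"type": type(exc).__name__, "message": str(exc)},
    }


try:
    raise ValueError("Missing required field: documentId")
except ValueError as exc:
    print(json.dumps(error_response("doc-123", exc), indent=2))
```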
215
+ The Lambda handles various error scenarios:
216
+
217
+ 1. **Non-PDF files** - Validates file starts with PDF magic bytes (%PDF-) before processing
218
+ 2. **Invalid PDF format** - Returns error response if PyPDF2 cannot parse the file
219
+ 3. **Corrupted PDF files** - Returns error response with details
220
+ 4. **S3 access denied** - Returns error with specific message
221
+ 5. **Corrupted pages** - Skips page, logs warning, continues with remaining pages
222
+ 6. **S3 write failures** - Retries with exponential backoff (3 attempts)
223
+
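The retry behavior for S3 write failures (item 6) can be sketched generically. The README only states "exponential backoff (3 attempts)", so the base delay, the doubling factor, and the helper name `with_retries` are assumptions; in the Lambda, `operation` would wrap the chunk upload.

```python
import time


def with_retries(operation, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call `operation` up to `attempts` times, doubling the delay between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1.0s, ...


# Usage with a stand-in operation that fails twice before succeeding.
calls = {"count": 0}

def flaky_upload():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return "uploaded"

print(with_retries(flaky_upload, sleep=lambda seconds: None))  # → uploaded
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting.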
224
+ ### PDF Validation
225
+
226
+ Before attempting to process a file, the Lambda validates it's actually a PDF by checking the magic bytes:
227
+ - Valid PDFs must start with `%PDF-` (hex: 25 50 44 46 2D)
228
+ - Files without this signature are rejected immediately
229
+ - This prevents wasting resources on non-PDF files (HTML, text, images, etc.)
230
+
231
+ ## Testing
232
+
233
+ ### Unit Tests
234
+ ```bash
235
+ python test_handler.py
236
+ ```
237
+
238
+ ### Integration Tests
239
+ Requires AWS credentials and a test bucket:
240
+
241
+ ```bash
242
+ export RUN_INTEGRATION_TESTS=true
243
+ export TEST_BUCKET=your-test-bucket-name
244
+ python test_integration.py
245
+ ```
246
+
247
+ Test PDFs should be uploaded to:
248
+ - `s3://your-test-bucket-name/test-data/small-document.pdf`
249
+ - `s3://your-test-bucket-name/test-data/large-document.pdf`
250
+ - `s3://your-test-bucket-name/test-data/invalid.pdf`
251
+
252
+ ## Performance
253
+
254
+ - **Token analysis**: 2-5 seconds for a 100-page PDF
255
+ - **Chunking**: ~1 second per chunk
256
+ - **Memory**: 2048 MB recommended
257
+ - **Timeout**: 10 minutes recommended
258
+
259
+ ## IAM Permissions Required
260
+
261
+ ```json
262
+ {
263
+ "Version": "2012-10-17",
264
+ "Statement": [
265
+ {
266
+ "Effect": "Allow",
267
+ "Action": [
268
+ "s3:GetObject"
269
+ ],
270
+ "Resource": "arn:aws:s3:::bucket-name/raw/*"
271
+ },
272
+ {
273
+ "Effect": "Allow",
274
+ "Action": [
275
+ "s3:PutObject"
276
+ ],
277
+ "Resource": "arn:aws:s3:::bucket-name/chunks/*"
278
+ }
279
+ ]
280
+ }
281
+ ```
282
+
283
+ ## Environment Variables
284
+
285
+ - `CHUNKING_STRATEGY` - Default strategy (default: 'hybrid')
286
+ - `PAGE_THRESHOLD` - Page count threshold (default: 100)
287
+ - `TOKEN_THRESHOLD` - Token count threshold (default: 150000)
288
+ - `CHUNK_SIZE` - Pages per chunk for fixed-pages (default: 50)
289
+ - `OVERLAP_PAGES` - Overlap pages for fixed-pages (default: 5)
290
+ - `MAX_TOKENS_PER_CHUNK` - Max tokens for token-based (default: 100000)
291
+ - `OVERLAP_TOKENS` - Overlap tokens (default: 5000)
292
+ - `TARGET_TOKENS_PER_CHUNK` - Target tokens for hybrid (default: 80000)
293
+ - `MAX_PAGES_PER_CHUNK` - Max pages for hybrid (default: 99, Bedrock limit is 100)
294
+ - `LOG_LEVEL` - Logging level (default: 'INFO')
295
+
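Reading these variables into the configuration shape used earlier in this README might look like the sketch below. The mapping from variable names to config keys and the precedence (a per-document `config` overriding the environment) are assumptions; `load_config` is a hypothetical name.

```python
import os


def load_config(overrides=None, env=os.environ):
    """Build a chunking config from environment defaults plus optional overrides."""
    config = {
        "strategy": env.get("CHUNKING_STRATEGY", "hybrid"),
        "pageThreshold": int(env.get("PAGE_THRESHOLD", "100")),
        "tokenThreshold": int(env.get("TOKEN_THRESHOLD", "150000")),
        "chunkSize": int(env.get("CHUNK_SIZE", "50")),
        "overlapPages": int(env.get("OVERLAP_PAGES", "5")),
        "maxTokensPerChunk": int(env.get("MAX_TOKENS_PER_CHUNK", "100000")),
        "overlapTokens": int(env.get("OVERLAP_TOKENS", "5000")),
        "targetTokensPerChunk": int(env.get("TARGET_TOKENS_PER_CHUNK", "80000")),
        "maxPagesPerChunk": int(env.get("MAX_PAGES_PER_CHUNK", "99")),
    }
    # Assumed precedence: values from the event's `config` win over the environment.
    config.update(overrides or {})
    return config


config = load_config(overrides={"strategy": "token-based"}, env={"PAGE_THRESHOLD": "200"})
print(config["strategy"], config["pageThreshold"], config["maxPagesPerChunk"])
# → token-based 200 99
```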
296
+ ## Architecture Integration
297
+
298
+ This Lambda is invoked by Step Functions as the **first step** in the workflow (before Init Metadata). The SQS Consumer is **unchanged**: it simply triggers Step Functions as before.
299
+
300
+ The workflow structure:
301
+ ```
302
+ SQS Consumer → Step Functions → PDF Analysis & Chunking Lambda → Init Metadata → ...
303
+ ```
304
+
305
+ ## Token Estimation
306
+
307
+ Uses a word-based heuristic for fast estimation:
308
+ - Count words using regex `\b\w+\b`
309
+ - Apply 1.3 tokens per word multiplier
310
+ - Accuracy: ~85-90% for English text
311
+ - Speed: ~0.2 seconds per 100 pages
312
+
313
+ The heuristic can be replaced with a tokenizer library such as tiktoken if production workloads need more accurate counts.
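The heuristic described in this section fits in a few lines. The regex and the 1.3 multiplier come from the README; the function name `estimate_tokens` and the choice to round up are assumptions, and `token_estimation.py` may round differently.

```python
import math
import re

WORD_PATTERN = re.compile(r"\b\w+\b")
TOKENS_PER_WORD = 1.3


def estimate_tokens(text: str) -> int:
    """Estimate the token count as word count times 1.3, rounded up."""
    words = len(WORD_PATTERN.findall(text))
    return math.ceil(words * TOKENS_PER_WORD)


print(estimate_tokens("Invoice total due"))  # 3 words → 4
print(estimate_tokens(""))                   # → 0
```

Summing this estimate page by page yields the per-page token list that the chunking strategies consume.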