@cdklabs/cdk-appmod-catalog-blueprints 1.5.0 → 1.6.0

Files changed (92)

package/lib/document-processing/resources/pdf-chunking/README.md ADDED

@@ -0,0 +1,313 @@
+# PDF Analysis and Chunking Lambda
+This Lambda function is the first step in the Step Functions workflow for chunked document processing. It analyzes PDFs to determine if chunking is needed, and if so, splits the PDF into chunks and uploads them to S3.
+## Overview
+This is a **single Lambda function** that performs both analysis and chunking to avoid downloading the PDF twice. It:
+1. Analyzes the PDF to determine token count and page count
+2. Determines if chunking is required based on strategy and thresholds
+3. If no chunking is needed: returns analysis metadata only
+4. If chunking is needed: splits the PDF and uploads the chunks to S3
+## Files
+- `handler.py` - Main Lambda handler function
+- `token_estimation.py` - Token estimation module (word-based heuristic)
+- `chunking_strategies.py` - Chunking algorithms (fixed-pages, token-based, hybrid)
+- `requirements.txt` - Python dependencies (PyPDF2, boto3)
+- `test_handler.py` - Unit tests for handler
+- `test_token_estimation.py` - Unit tests for token estimation
+- `test_chunking_strategies.py` - Unit tests for chunking strategies
+- `test_integration.py` - Integration tests (requires AWS setup)
+## Configuration
+The Lambda supports three chunking strategies:
+### 1. Fixed-Pages Strategy (Legacy)
+Simple page-based chunking. Fast but doesn't account for token density.
+```python
+config = {
+    'strategy': 'fixed-pages',
+    'pageThreshold': 100,
+    'chunkSize': 50,
+    'overlapPages': 5
+}
+```
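A minimal sketch of how fixed-size page ranges with overlap could be computed from the settings above. The function name `fixed_page_chunks` and the exact overlap semantics (each chunk re-reads the last `overlapPages` pages of the previous one) are assumptions for illustration, not taken from `chunking_strategies.py`.

```python
def fixed_page_chunks(total_pages, chunk_size=50, overlap_pages=5):
    """Return inclusive, zero-based (start_page, end_page) ranges.

    Hypothetical helper: assumes each chunk after the first starts
    `overlap_pages` pages before the previous chunk ended.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_pages < chunk_size:
        raise ValueError("overlap_pages must be in [0, chunk_size)")

    ranges = []
    start = 0
    while start < total_pages:
        end = min(start + chunk_size, total_pages) - 1
        ranges.append((start, end))
        if end == total_pages - 1:
            break  # last page reached
        start = end + 1 - overlap_pages  # step back to create the overlap
    return ranges
```

Under these assumptions, a 120-page document with the defaults shown would be split into the ranges 0-49, 45-94, and 90-119.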
+### 2. Token-Based Strategy
+Token-aware chunking that respects model limits. Ideal for documents whose token density varies from page to page.
+```python
+config = {
+    'strategy': 'token-based',
+    'tokenThreshold': 150000,
+    'maxTokensPerChunk': 100000,
+    'overlapTokens': 5000
+}
+```
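One way to read the token-based settings is as greedy page packing: pages are added to a chunk until the next page would exceed `maxTokensPerChunk`, and the next chunk re-reads trailing pages worth at most `overlapTokens`. The sketch below follows that reading; `token_based_chunks` and its per-page input list are hypothetical, and the packaged implementation may differ.

```python
def token_based_chunks(tokens_per_page, max_tokens_per_chunk=100000,
                       overlap_tokens=5000):
    """Return (start_page, end_page, estimated_tokens) tuples.

    Hypothetical greedy packing over a list of per-page token estimates.
    """
    ranges = []
    n = len(tokens_per_page)
    start = 0
    while start < n:
        end = start
        total = tokens_per_page[start]
        # Always take at least one page, even if it alone exceeds the limit.
        while end + 1 < n and total + tokens_per_page[end + 1] <= max_tokens_per_chunk:
            end += 1
            total += tokens_per_page[end]
        ranges.append((start, end, total))
        if end == n - 1:
            break
        # Walk back from the end of this chunk to collect overlap pages,
        # but never back to the chunk's own first page (guarantees progress).
        next_start = end + 1
        carried = 0
        while (next_start - 1 > start
               and carried + tokens_per_page[next_start - 1] <= overlap_tokens):
            next_start -= 1
            carried += tokens_per_page[next_start]
        start = next_start
    return ranges
```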
+### 3. Hybrid Strategy (RECOMMENDED)
+Best of both worlds - targets token count but respects page limits.
+```python
+config = {
+    'strategy': 'hybrid',
+    'pageThreshold': 100,
+    'tokenThreshold': 150000,
+    'targetTokensPerChunk': 80000,
+    'maxPagesPerChunk': 99,  # Bedrock has a hard limit of 100 pages
+    'overlapTokens': 5000
+}
+```
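The hybrid idea (aim for a token target, but never exceed a page cap) can be sketched as a single pass over per-page token estimates. This is an illustrative simplification that omits overlap handling; the name `hybrid_chunks` is hypothetical and the real strategy may make different choices.

```python
def hybrid_chunks(tokens_per_page, target_tokens_per_chunk=80000,
                  max_pages_per_chunk=99):
    """Return inclusive, zero-based (start_page, end_page) ranges.

    Hypothetical sketch: a chunk is closed when adding the next page would
    pass the token target, or when the page cap is reached. Overlap omitted.
    """
    ranges = []
    start = 0
    total = 0
    for page, tokens in enumerate(tokens_per_page):
        page_count = page - start  # pages already in the open chunk
        over_target = total + tokens > target_tokens_per_chunk
        at_page_cap = page_count >= max_pages_per_chunk
        if page_count > 0 and (over_target or at_page_cap):
            ranges.append((start, page - 1))
            start, total = page, 0
        total += tokens
    if tokens_per_page:
        ranges.append((start, len(tokens_per_page) - 1))
    return ranges
```

With dense pages the token target closes chunks first; with sparse pages the 99-page cap does, which keeps each chunk under the 100-page Bedrock limit noted above.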
+## Event Format
+### Input Event (from SQS Consumer)
+The Lambda receives events from the SQS Consumer in this exact format:
+```json
+{
+  "documentId": "invoice-2024-001-1705315800000",
+  "contentType": "file",
+  "content": {
+    "location": "s3",
+    "bucket": "my-document-bucket",
+    "key": "raw/invoice-2024-001.pdf",
+    "filename": "invoice-2024-001.pdf"
+  },
+  "eventTime": "2024-01-15T10:30:00.000Z",
+  "eventName": "ObjectCreated:Put",
+  "source": "sqs-consumer"
+}
+```
+**Required Fields:**
+- `documentId` - Unique document identifier (generated by SQS consumer)
+- `content.bucket` - S3 bucket name
+- `content.key` - S3 object key (must be under the `raw/` prefix)
+**Optional Fields:**
+- `contentType` - Must be "file" if provided (default behavior)
+- `content.location` - Always "s3" (informational)
+- `content.filename` - Original filename (informational)
+- `eventTime` - S3 event timestamp (informational)
+- `eventName` - S3 event name (informational)
+- `source` - Event source identifier (informational)
+- `config` - Optional chunking configuration override
+### Input Event (with Custom Configuration)
+You can override chunking configuration per document:
+```json
+{
+  "documentId": "doc-123",
+  "contentType": "file",
+  "content": {
+    "bucket": "document-bucket",
+    "key": "raw/document.pdf",
+    "filename": "document.pdf"
+  },
+  "config": {
+    "strategy": "hybrid",
+    "pageThreshold": 100,
+    "tokenThreshold": 150000,
+    "targetTokensPerChunk": 80000,
+    "maxPagesPerChunk": 99
+  }
+}
+```
+### Output (No Chunking)
+```json
+{
+  "documentId": "doc-123",
+  "requiresChunking": false,
+  "tokenAnalysis": {
+    "totalTokens": 45000,
+    "totalPages": 30,
+    "avgTokensPerPage": 1500
+  },
+  "reason": "Document has 30 pages, below threshold of 100"
+}
+```
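The decision behind `requiresChunking` might look like the sketch below. It assumes that fixed-pages compares only the page count, token-based compares only the token count, hybrid triggers on either, and that a document exactly at a threshold is not chunked; none of these details are confirmed by this README, and `requires_chunking` is a hypothetical name.

```python
def requires_chunking(total_pages, total_tokens, config):
    """Decide whether a document should be split (assumed semantics)."""
    strategy = config.get("strategy", "hybrid")
    over_pages = total_pages > config.get("pageThreshold", 100)
    over_tokens = total_tokens > config.get("tokenThreshold", 150000)

    if strategy == "fixed-pages":
        return over_pages
    if strategy == "token-based":
        return over_tokens
    if strategy == "hybrid":
        return over_pages or over_tokens
    raise ValueError(f"Unknown chunking strategy: {strategy}")
```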
+### Output (Chunking)
+```json
+{
+  "documentId": "doc-456",
+  "requiresChunking": true,
+  "tokenAnalysis": {
+    "totalTokens": 200000,
+    "totalPages": 150,
+    "avgTokensPerPage": 1333,
+    "tokensPerPage": [...]
+  },
+  "strategy": "hybrid",
+  "chunks": [
+    {
+      "chunkId": "doc-456_chunk_0",
+      "chunkIndex": 0,
+      "totalChunks": 2,
+      "startPage": 0,
+      "endPage": 74,
+      "pageCount": 75,
+      "estimatedTokens": 100000,
+      "bucket": "document-bucket",
+      "key": "chunks/doc-456_chunk_0.pdf"
+    }
+  ],
+  "config": {
+    "strategy": "hybrid",
+    "totalPages": 150,
+    "totalTokens": 200000,
+    "targetTokensPerChunk": 80000,
+    "maxPagesPerChunk": 99
+  }
+}
+```
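The chunk entries in this output follow a visible naming pattern (`<documentId>_chunk_<index>` and `chunks/<chunkId>.pdf`). A sketch of assembling them from page ranges, assuming a hypothetical helper `build_chunk_metadata` and a list of per-page token estimates:

```python
def build_chunk_metadata(document_id, bucket, page_ranges, tokens_per_page):
    """Build chunk descriptors shaped like the example output.

    `page_ranges` holds inclusive, zero-based (start_page, end_page) pairs.
    """
    chunks = []
    for index, (start, end) in enumerate(page_ranges):
        chunk_id = f"{document_id}_chunk_{index}"
        chunks.append({
            "chunkId": chunk_id,
            "chunkIndex": index,
            "totalChunks": len(page_ranges),
            "startPage": start,
            "endPage": end,
            "pageCount": end - start + 1,
            "estimatedTokens": sum(tokens_per_page[start:end + 1]),
            "bucket": bucket,
            "key": f"chunks/{chunk_id}.pdf",
        })
    return chunks
```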
+## Validation
+The Lambda performs several validation checks:
+### 1. Payload Validation
+- **documentId** - Must be present
+- **content.bucket** - Must be present
+- **content.key** - Must be present
+- **contentType** - Must be "file" if provided (only file-based processing is supported)
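The payload checks listed above could be expressed roughly as follows. Only the `Missing required field: documentId` message appears in this README; the other messages and the function name `validate_payload` are illustrative assumptions.

```python
def validate_payload(event):
    """Return (document_id, bucket, key) or raise ValueError."""
    if not event.get("documentId"):
        raise ValueError("Missing required field: documentId")

    content = event.get("content") or {}
    for field in ("bucket", "key"):
        if not content.get(field):
            # Message wording is an assumption for illustration.
            raise ValueError(f"Missing required field: content.{field}")

    # contentType is optional, but only "file" is accepted when present.
    content_type = event.get("contentType", "file")
    if content_type != "file":
        raise ValueError(f"Unsupported contentType: {content_type}")

    return event["documentId"], content["bucket"], content["key"]
```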
+### 2. File Extension Check
+- Logs a warning if the file doesn't have a `.pdf` extension
+- Still processes the file (validates using magic bytes)
+- Useful for catching misnamed files
+### 3. PDF Magic Bytes Validation
+- Validates file starts with `%PDF-` before processing
+- Prevents wasting resources on non-PDF files
+- Rejects HTML, text, images, and other formats
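The magic-bytes check itself is small enough to sketch directly. The helper name `is_pdf` is hypothetical, and how the handler actually obtains the leading bytes (full download or a ranged read) is not stated here.

```python
PDF_MAGIC = b"%PDF-"  # hex: 25 50 44 46 2D


def is_pdf(header: bytes) -> bool:
    """Return True if the leading bytes of a file carry the PDF signature."""
    return header.startswith(PDF_MAGIC)
```

Only the first five bytes are needed, so a caller could pass just the start of the object rather than the whole file.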
+### 4. PDF Format Validation
+- Uses PyPDF2 to validate PDF structure
+- Detects corrupted or invalid PDFs
+- Rejects encrypted PDFs (not supported)
+### Error Responses
+All validation errors return a standardized error response:
+```json
+{
+  "documentId": "doc-123",
+  "requiresChunking": false,
+  "error": {
+    "type": "ValueError",
+    "message": "Missing required field: documentId"
+  }
+}
+```
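A response of this shape can be built from any caught exception. The sketch below assumes a hypothetical helper named `error_response`; the field names are the ones shown in the example.

```python
def error_response(document_id, exc):
    """Wrap an exception in the standardized error structure."""
    return {
        "documentId": document_id,
        "requiresChunking": False,
        "error": {
            "type": type(exc).__name__,  # e.g. "ValueError"
            "message": str(exc),
        },
    }
```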
+The Lambda handles various error scenarios:
+1. **Non-PDF files** - Validates file starts with PDF magic bytes (%PDF-) before processing
+2. **Invalid PDF format** - Returns error response if PyPDF2 cannot parse the file
+3. **Corrupted PDF files** - Returns error response with details
+4. **S3 access denied** - Returns error with specific message
+5. **Corrupted pages** - Skips page, logs warning, continues with remaining pages
+6. **S3 write failures** - Retries with exponential backoff (3 attempts)
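The retry behavior in item 6 can be illustrated with a generic wrapper. This is a sketch under assumptions: the base delay, the doubling factor, and the name `retry_with_backoff` are not specified by this README, and the wrapped `operation` stands in for whatever S3 upload call the handler makes.

```python
import time


def retry_with_backoff(operation, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `operation` up to `attempts` times, doubling the wait each time.

    The last failure is re-raised so the caller can turn it into an
    error response.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, ... with the defaults
```

Passing `sleep` as a parameter keeps the wrapper testable without real delays.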
+### PDF Validation
+Before attempting to process a file, the Lambda validates it's actually a PDF by checking the magic bytes:
+- Valid PDFs must start with `%PDF-` (hex: 25 50 44 46 2D)
+- Files without this signature are rejected immediately
+- This prevents wasting resources on non-PDF files (HTML, text, images, etc.)
+## Testing
+### Unit Tests
+```bash
+python test_handler.py
+```
+### Integration Tests
+Requires AWS credentials and test bucket:
+```bash
+export RUN_INTEGRATION_TESTS=true
+export TEST_BUCKET=your-test-bucket-name
+python test_integration.py
+```
+Test PDFs should be uploaded to:
+- `s3://your-test-bucket/test-data/small-document.pdf`
+- `s3://your-test-bucket/test-data/large-document.pdf`
+- `s3://your-test-bucket/test-data/invalid.pdf`
+## Performance
+- **Token analysis**: 2-5 seconds for a 100-page PDF
+- **Chunking**: ~1 second per chunk
+- **Memory**: 2048 MB recommended
+- **Timeout**: 10 minutes recommended
+## IAM Permissions Required
+```json
+{
+  "Version": "2012-10-17",
+  "Statement": [
+    {
+      "Effect": "Allow",
+      "Action": [
+        "s3:GetObject"
+      ],
+      "Resource": "arn:aws:s3:::bucket-name/raw/*"
+    },
+    {
+      "Effect": "Allow",
+      "Action": [
+        "s3:PutObject"
+      ],
+      "Resource": "arn:aws:s3:::bucket-name/chunks/*"
+    }
+  ]
+}
+```
+## Environment Variables
+- `CHUNKING_STRATEGY` - Default strategy (default: 'hybrid')
+- `PAGE_THRESHOLD` - Page count threshold (default: 100)
+- `TOKEN_THRESHOLD` - Token count threshold (default: 150000)
+- `CHUNK_SIZE` - Pages per chunk for fixed-pages (default: 50)
+- `OVERLAP_PAGES` - Overlap pages for fixed-pages (default: 5)
+- `MAX_TOKENS_PER_CHUNK` - Max tokens for token-based (default: 100000)
+- `OVERLAP_TOKENS` - Overlap tokens (default: 5000)
+- `TARGET_TOKENS_PER_CHUNK` - Target tokens for hybrid (default: 80000)
+- `MAX_PAGES_PER_CHUNK` - Max pages for hybrid (default: 99, Bedrock limit is 100)
+- `LOG_LEVEL` - Logging level (default: 'INFO')
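A sketch of how these variables could be turned into a config dictionary, with a per-document `config` override applied on top. The mapping from variable names to config keys is inferred from the configuration examples earlier in this README and is an assumption, as is the function name `load_config`; `LOG_LEVEL` is left out because it is not a chunking setting.

```python
import os

# (environment variable, config key, parser, default); key names are assumed.
ENV_SETTINGS = [
    ("CHUNKING_STRATEGY", "strategy", str, "hybrid"),
    ("PAGE_THRESHOLD", "pageThreshold", int, 100),
    ("TOKEN_THRESHOLD", "tokenThreshold", int, 150000),
    ("CHUNK_SIZE", "chunkSize", int, 50),
    ("OVERLAP_PAGES", "overlapPages", int, 5),
    ("MAX_TOKENS_PER_CHUNK", "maxTokensPerChunk", int, 100000),
    ("OVERLAP_TOKENS", "overlapTokens", int, 5000),
    ("TARGET_TOKENS_PER_CHUNK", "targetTokensPerChunk", int, 80000),
    ("MAX_PAGES_PER_CHUNK", "maxPagesPerChunk", int, 99),
]


def load_config(environ=None, override=None):
    """Read defaults from the environment, then apply event-level overrides."""
    environ = os.environ if environ is None else environ
    config = {
        key: parser(environ.get(name, default))
        for name, key, parser, default in ENV_SETTINGS
    }
    config.update(override or {})
    return config
```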
+## Architecture Integration
+This Lambda is invoked by Step Functions as the **first step** in the workflow (before Init Metadata). The SQS Consumer is **unchanged** - it simply triggers Step Functions as before.
+The workflow structure:
+```
+SQS Consumer → Step Functions → PDF Analysis & Chunking Lambda → Init Metadata → ...
+```
+## Token Estimation
+Uses a word-based heuristic for fast estimation:
+- Count words using regex `\b\w+\b`
+- Apply 1.3 tokens per word multiplier
+- Accuracy: ~85-90% for English text
+- Speed: ~0.2 seconds per 100 pages
+Can be upgraded to tiktoken for production if needed.
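The heuristic described above fits in a few lines. The sketch assumes the estimate is truncated to an integer; the rounding rule and the name `estimate_tokens` are not taken from `token_estimation.py`.

```python
import re

WORD_RE = re.compile(r"\b\w+\b")


def estimate_tokens(text, tokens_per_word=1.3):
    """Estimate the token count as word count times a fixed multiplier."""
    word_count = len(WORD_RE.findall(text))
    return int(word_count * tokens_per_word)  # truncation is an assumption
```

Note that `\b\w+\b` splits on apostrophes and hyphens, so a contraction such as "don't" counts as two words.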