@atomic-ehr/fhirpath 0.0.1-canary.0c6931e.20250727185306

Files changed (85)
  1. package/README.md +473 -0
  2. package/dist/index.d.ts +462 -0
  3. package/dist/index.js +10307 -0
  4. package/dist/index.js.map +1 -0
  5. package/package.json +58 -0
  6. package/src/analyzer/analyzer.ts +499 -0
  7. package/src/analyzer/model-provider.ts +244 -0
  8. package/src/analyzer/schemas/index.ts +2 -0
  9. package/src/analyzer/schemas/types.ts +40 -0
  10. package/src/analyzer/types.ts +142 -0
  11. package/src/api/builder.ts +157 -0
  12. package/src/api/errors.ts +145 -0
  13. package/src/api/expression.ts +156 -0
  14. package/src/api/index.ts +122 -0
  15. package/src/api/inspect.ts +99 -0
  16. package/src/api/registry.ts +128 -0
  17. package/src/api/types.ts +210 -0
  18. package/src/compiler/compiler.ts +546 -0
  19. package/src/compiler/index.ts +2 -0
  20. package/src/compiler/prototype-context-adapter.ts +99 -0
  21. package/src/compiler/types.ts +24 -0
  22. package/src/index.ts +107 -0
  23. package/src/interpreter/README.md +78 -0
  24. package/src/interpreter/interpreter.ts +475 -0
  25. package/src/interpreter/types.ts +108 -0
  26. package/src/lexer/char-tables.ts +37 -0
  27. package/src/lexer/errors.ts +31 -0
  28. package/src/lexer/index.ts +5 -0
  29. package/src/lexer/lexer.ts +745 -0
  30. package/src/lexer/token.ts +104 -0
  31. package/src/lexer2/index.md +232 -0
  32. package/src/lexer2/index.perf.test.ts +68 -0
  33. package/src/lexer2/index.test.ts +549 -0
  34. package/src/lexer2/index.ts +1251 -0
  35. package/src/lexer2/notes.md +173 -0
  36. package/src/lexer2/optimization-summary.md +718 -0
  37. package/src/parser/ast-factory.ts +220 -0
  38. package/src/parser/ast.ts +144 -0
  39. package/src/parser/collection-parser.ts +89 -0
  40. package/src/parser/diagnostic-messages.ts +216 -0
  41. package/src/parser/diagnostics.ts +85 -0
  42. package/src/parser/error-reporter.ts +230 -0
  43. package/src/parser/index.ts +3 -0
  44. package/src/parser/literal-parser.ts +103 -0
  45. package/src/parser/parse-error.ts +16 -0
  46. package/src/parser/parser-error-factory.ts +141 -0
  47. package/src/parser/parser-state.ts +134 -0
  48. package/src/parser/parser.ts +1272 -0
  49. package/src/parser/pprint.ts +169 -0
  50. package/src/parser/precedence-manager.ts +64 -0
  51. package/src/parser/source-mapper.ts +248 -0
  52. package/src/parser/special-constructs.ts +142 -0
  53. package/src/parser/token-navigator.ts +110 -0
  54. package/src/parser/types.ts +60 -0
  55. package/src/parser2/index.md +177 -0
  56. package/src/parser2/index.perf.test.ts +184 -0
  57. package/src/parser2/index.test.ts +305 -0
  58. package/src/parser2/index.ts +578 -0
  59. package/src/parser2/optimization-summary.md +176 -0
  60. package/src/registry/default-analyzers.ts +257 -0
  61. package/src/registry/default-compilers.ts +31 -0
  62. package/src/registry/index.ts +96 -0
  63. package/src/registry/operations/arithmetic.ts +506 -0
  64. package/src/registry/operations/collection.ts +425 -0
  65. package/src/registry/operations/comparison.ts +432 -0
  66. package/src/registry/operations/existence.ts +703 -0
  67. package/src/registry/operations/filtering.ts +358 -0
  68. package/src/registry/operations/literals.ts +341 -0
  69. package/src/registry/operations/logical.ts +439 -0
  70. package/src/registry/operations/math.ts +128 -0
  71. package/src/registry/operations/membership.ts +132 -0
  72. package/src/registry/operations/navigation.ts +52 -0
  73. package/src/registry/operations/string.ts +507 -0
  74. package/src/registry/operations/subsetting.ts +174 -0
  75. package/src/registry/operations/type-checking.ts +162 -0
  76. package/src/registry/operations/type-conversion.ts +404 -0
  77. package/src/registry/operations/type-operators.ts +308 -0
  78. package/src/registry/operations/utility.ts +644 -0
  79. package/src/registry/registry.ts +146 -0
  80. package/src/registry/types.ts +161 -0
  81. package/src/registry/utils/evaluation-helpers.ts +93 -0
  82. package/src/registry/utils/index.ts +3 -0
  83. package/src/registry/utils/type-system.ts +173 -0
  84. package/src/runtime/context.ts +158 -0
  85. package/src/runtime/debug-context.ts +135 -0
@@ -0,0 +1,718 @@
1
+ # Lexer Optimization Summary
2
+
3
+ ## 1. Lookup Tables for Character Classification
4
+
5
+ ### What was changed:
6
+ - Created lookup tables (`IS_DIGIT`, `IS_LETTER`, `IS_LETTER_OR_DIGIT`, `IS_HEX_DIGIT`) as `Uint8Array(256)`
7
+ - Replaced all function calls like `isDigit(charCode)` with direct array lookups like `IS_DIGIT[charCode]`
8
+ - Added bounds checking where needed (`charCode >= 0 && charCode < 256`)
9
+
10
+ ### Performance Impact:
11
+ - Before: ~1,477K expressions/second
12
+ - After: ~1,546K expressions/second
13
+ - Improvement: ~4.7%
14
+
15
+ ### Why it works:
16
+ - Array lookups are O(1) and avoid function call overhead
17
+ - Modern JavaScript engines optimize array access very well
18
+ - Lookup tables fit in CPU cache (only 256 bytes per table)
19
+ - Eliminates branching in character classification logic
20
+
21
+ ### Trade-offs:
22
+ - Slightly more memory usage (4 × 256 = 1KB for lookup tables)
23
+ - One-time initialization cost when module loads
24
+ - Need bounds checking for safety (charCode < 256)
25
+
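The lookup-table idea described above can be sketched as follows. This is a minimal illustration, not the package's actual source: `buildTable` and `countLeadingDigits` are hypothetical helpers, and treating `_` as a letter is an assumption based on the identifier grammar quoted later in this summary.

```typescript
// Hypothetical helper: builds a 256-entry table where entry i is 1 when
// predicate(i) holds. Table names mirror those mentioned in the summary.
function buildTable(predicate: (code: number) => boolean): Uint8Array {
  const table = new Uint8Array(256); // typed arrays start zero-filled
  for (let code = 0; code < 256; code++) {
    if (predicate(code)) table[code] = 1;
  }
  return table;
}

const IS_DIGIT = buildTable((c) => c >= 48 && c <= 57); // 0-9
// Assumption: underscore (95) counts as a letter for identifier purposes.
const IS_LETTER = buildTable(
  (c) => (c >= 65 && c <= 90) || (c >= 97 && c <= 122) || c === 95
);
const IS_LETTER_OR_DIGIT = buildTable(
  (c) => IS_DIGIT[c] === 1 || IS_LETTER[c] === 1
);

// Counts leading digits using one table read per character.
function countLeadingDigits(input: string): number {
  let pos = 0;
  while (pos < input.length) {
    const code = input.charCodeAt(pos);
    // Explicit upper bound, matching the bounds check described above.
    if (code >= 256 || IS_DIGIT[code] !== 1) break;
    pos++;
  }
  return pos;
}
```

The tables are built once at module load, which is the one-time initialization cost listed under the trade-offs.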
26
+ ## 2. Switch-based Keyword Lookup
27
+
28
+ ### What was changed:
29
+ - Replaced object/hash map keyword lookup with nested switch statements
30
+ - First switch on keyword length (2-12 characters)
31
+ - Then switch on the actual keyword value
32
+ - Early exit for non-keyword lengths
33
+
34
+ ### Performance Impact:
35
+ - Before: ~1,546K expressions/second
36
+ - After: ~2,192K expressions/second
37
+ - Improvement: ~42%
38
+
39
+ ### Why it works:
40
+ - A switch on small integers (here, the keyword length) can compile to a jump table in V8
41
+ - Length check filters out most identifiers immediately
42
+ - No hash computation or object property lookup
43
+ - Better CPU branch prediction
44
+ - Compiler can optimize switch statements very aggressively
45
+
46
+ ### Trade-offs:
47
+ - More verbose code (but still maintainable)
48
+ - Fixed set of keywords (can't add dynamically)
49
+ - Slightly larger code size
50
+
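A length-first keyword lookup along these lines might look like the sketch below. The function name `lookupKeyword` and the string-valued `KeywordKind` are illustrative assumptions; the keyword set shown is a subset of FHIRPath's operators and literals, and the real lexer returns its own `TokenType` values.

```typescript
// Illustrative result type; the real lexer uses its own TokenType.
type KeywordKind =
  | "AND" | "OR" | "XOR" | "IS" | "AS" | "IN"
  | "DIV" | "MOD" | "TRUE" | "FALSE" | "IMPLIES" | "CONTAINS";

function lookupKeyword(word: string): KeywordKind | null {
  // First switch: the length filters out most identifiers immediately.
  switch (word.length) {
    case 2:
      // Second switch: compare against the few keywords of this length.
      switch (word) {
        case "or": return "OR";
        case "is": return "IS";
        case "as": return "AS";
        case "in": return "IN";
      }
      return null;
    case 3:
      switch (word) {
        case "and": return "AND";
        case "xor": return "XOR";
        case "div": return "DIV";
        case "mod": return "MOD";
      }
      return null;
    case 4:
      return word === "true" ? "TRUE" : null;
    case 5:
      return word === "false" ? "FALSE" : null;
    case 7:
      return word === "implies" ? "IMPLIES" : null;
    case 8:
      return word === "contains" ? "CONTAINS" : null;
    default:
      return null; // early exit: no keyword in this subset has this length
  }
}
```

An identifier such as `Patient` has length 7, so it is rejected after one length check and a single string comparison.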
51
+ ## 3. Character Code Switch for readWhitespace
52
+
53
+ ### What was changed:
54
+ - Replaced string comparisons (`char === ' '`) with character code switch
55
+ - Single `charCodeAt()` call instead of multiple string comparisons
56
+ - Switch statement on integer values (32, 9, 13, 10)
57
+
58
+ ### Performance Impact:
59
+ - Before: ~2,192K expressions/second
60
+ - After: ~2,240K expressions/second
61
+ - Improvement: ~2.2%
62
+
63
+ ### Why it works:
64
+ - Integer comparison is faster than string comparison
65
+ - Switch on integers compiles to efficient jump table
66
+ - Single charCodeAt call vs multiple charAt/string comparisons
67
+ - Better branch prediction with switch statement
68
+
69
+ ### Trade-offs:
70
+ - Slightly less readable (need to know ASCII codes)
71
+ - More verbose code structure
72
+
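As a rough sketch of the character-code switch described above, assuming a free-standing function rather than the lexer's actual `readWhitespace` method (the name `skipWhitespace` and its signature are hypothetical):

```typescript
// Returns the offset of the first non-whitespace character at or after start.
function skipWhitespace(input: string, start: number): number {
  let pos = start;
  while (pos < input.length) {
    // One charCodeAt() call per character, then an integer switch.
    switch (input.charCodeAt(pos)) {
      case 32: // space
      case 9:  // tab
      case 13: // carriage return
      case 10: // line feed
        pos++;
        break;
      default:
        return pos;
    }
  }
  return pos; // reached end of input
}
```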
73
+ ## 4. Reusable Token Attempt (Reverted)
74
+
75
+ ### What was tried:
76
+ - Added a reusable token object to avoid allocations
77
+ - Created `setToken()` method to update and return the same object
78
+ - Modified `tokenize()` to clone tokens when storing in array
79
+
80
+ ### Performance Impact:
81
+ - Before: ~2,240K expressions/second
82
+ - After: ~1,995K expressions/second
83
+ - **Degradation: ~11%**
84
+
85
+ ### Why it failed:
86
+ - Method call overhead for `setToken()` on every token
87
+ - Still need to clone tokens in `tokenize()`, negating the benefit
88
+ - The cloning actually made it worse than direct object creation
89
+ - V8's object allocation is already highly optimized for small objects
90
+
91
+ ### Lesson learned:
92
+ - Not all traditional optimizations work in modern JavaScript engines
93
+ - V8's object allocation is very fast for small, short-lived objects
94
+ - Method call overhead can outweigh allocation savings
95
+
96
+ ## 5. Token Representation Benchmarks
97
+
98
+ ### What was tested:
99
+ Compared different approaches for token representation:
100
+ - Plain object literals `{ type, start, end }`
101
+ - Arrays `[type, start, end]`
102
+ - Classes with constructor
103
+ - Classes with fields
104
+ - Object.create(null)
105
+ - Object.assign and spread operators
106
+
107
+ ### Performance Results:
108
+
109
+ #### Token Creation (5M tokens):
110
+ - **Object literal**: 8.09ms (618M tokens/sec) ✅
111
+ - **Class with constructor**: 14.45ms (346M tokens/sec) - 78% slower
112
+ - **Arrays**: ~28ms - 244% slower in real usage
113
+ - **Object.assign**: 224.68ms - 2678% slower
114
+ - **Spread operator**: 109.61ms - 1255% slower
115
+
116
+ #### Real Lexer Usage (100K iterations):
117
+ - **Object literals**: 54.06ms (9.2M expressions/sec) ✅
118
+ - **Arrays**: 180.79ms (2.8M expressions/sec)
119
+ - **Classes**: 169.79ms (2.9M expressions/sec)
120
+
121
+ ### Why Object Literals Win:
122
+ 1. **V8 Hidden Classes** - Objects with consistent shape share optimized hidden classes
123
+ 2. **Inline Caching** - Property access is optimized through monomorphic inline caches
124
+ 3. **No Constructor Overhead** - Direct allocation without function calls
125
+ 4. **JIT Optimization** - V8 heavily optimizes object literal creation
126
+ 5. **Better Property Access** - Named properties are faster than array indices for this use case
127
+
128
+ ### Key Insights:
129
+ - Arrays are slower despite using less memory due to worse access patterns
130
+ - Classes add constructor overhead without benefits
131
+ - V8 is extremely optimized for plain object literals with consistent shapes
132
+ - The current implementation is the fastest of the approaches tested
133
+
134
+ ### Conclusion:
135
+ The current token representation using plain object literals was the fastest approach benchmarked. No change needed.
136
+
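A comparison of this kind can be set up with a small harness like the one below. It is only a sketch: `timeCreation` is a hypothetical helper, `Date.now()` has millisecond resolution, and the engine may optimize away short-lived allocations, so the timings it prints will not match the figures reported above.

```typescript
type TokenObject = { type: number; start: number; end: number };
type TokenTuple = [number, number, number];

// Runs make() count times and returns elapsed wall-clock milliseconds.
// The checksum uses each token so the work is not trivially dead code.
function timeCreation(
  count: number,
  make: (i: number) => number
): { ms: number; checksum: number } {
  let checksum = 0;
  const t0 = Date.now();
  for (let i = 0; i < count; i++) checksum += make(i);
  return { ms: Date.now() - t0, checksum };
}

const N = 1000000;

// Plain object literal with a consistent shape.
const literal = timeCreation(N, (i) => {
  const t: TokenObject = { type: 6, start: i, end: i + 1 };
  return t.end - t.start;
});

// Array (tuple) representation of the same data.
const tuple = timeCreation(N, (i) => {
  const t: TokenTuple = [6, i, i + 1];
  return t[2] - t[1];
});

console.log(`object literal: ${literal.ms}ms, array: ${tuple.ms}ms`);
```

For meaningful numbers, run several warm-up rounds first and measure inside the real lexer, as the "Real Lexer Usage" results above do.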
137
+ ## 6. Line and Column Tracking Added Back
138
+
139
+ ### What was changed:
140
+ - Added `line` and `column` fields back to Token interface
141
+ - Track line/column during advance() method
142
+ - Capture startLine and startColumn for each token
143
+ - Update line/column in readWhitespace and readSpecialIdentifier
144
+
145
+ ### Performance Impact:
146
+ - Before: ~2,240K expressions/second (without line/column)
147
+ - After: ~2,147K expressions/second (with line/column)
148
+ - **Cost: ~4.2% performance decrease**
149
+
150
+ ### Why it's acceptable:
151
+ - Line/column information is valuable for error reporting and debugging
152
+ - 4.2% overhead is reasonable for this functionality
153
+ - Still maintaining over 2M expressions/second
154
+ - Throughput is still ~45% above the original implementation (~1,477K → ~2,147K expressions/second)
155
+
156
+ ### Implementation notes:
157
+ - Line increments on '\n', column resets to 1
158
+ - Column increments for all other characters
159
+ - Special handling in readSpecialIdentifier for multi-character advances
160
+
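The tracking rules in the implementation notes can be sketched as below. `Cursor`, `advance`, and `locate` are hypothetical stand-ins; the real lexer keeps these fields on its own instance and its `advance()` may differ in detail.

```typescript
// Hypothetical minimal cursor state (1-based line and column).
interface Cursor {
  input: string;
  position: number;
  line: number;
  column: number;
}

function advance(c: Cursor): void {
  if (c.input.charCodeAt(c.position) === 10) {
    // '\n': move to the next line and reset the column to 1.
    c.line++;
    c.column = 1;
  } else {
    // Any other character: stay on the line, move one column right.
    c.column++;
  }
  c.position++;
}

// Walks from the start of the input and reports the line/column of offset.
function locate(input: string, offset: number): { line: number; column: number } {
  const c: Cursor = { input, position: 0, line: 1, column: 1 };
  while (c.position < offset && c.position < input.length) advance(c);
  return { line: c.line, column: c.column };
}
```

With these rules, offset 3 in `foo.bar` maps to line 1, column 4.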
161
+ ## 7. Numeric Enum Token Types
162
+
163
+ ### What was changed:
164
+ - Converted TokenType from string enum to numeric enum
165
+ - String enum: `IDENTIFIER = 'IDENTIFIER'` → Numeric enum: `IDENTIFIER = 6`
166
+ - Added `tokenTypeToString()` helper function for debugging
167
+ - Added `debugTokens()` method for human-readable token output
168
+
169
+ ### Performance Impact:
170
+ - Before: ~2,147K expressions/second (string enums)
171
+ - After: ~2,200K expressions/second (numeric enums)
172
+ - **Improvement: ~2.5%**
173
+
174
+ ### Why it works:
175
+ - Numeric comparisons are faster than string comparisons
176
+ - Smaller memory footprint (4 bytes vs string length)
177
+ - Better CPU cache utilization
178
+ - Switch statements optimize better with numeric values
179
+
180
+ ### Trade-offs:
181
+ - Less readable in raw debugging output (see `6` instead of `"IDENTIFIER"`)
182
+ - Need helper functions for human-readable output
183
+ - Breaking change if external code depends on string values
184
+
185
+ ### Debug Support:
186
+ ```typescript
187
+ // Convert numeric type to string
188
+ tokenTypeToString(TokenType.IDENTIFIER) // "IDENTIFIER"
189
+
190
+ // Debug tokens
191
+ lexer.debugTokens() // "IDENTIFIER(foo) [1:1]\nDOT(.) [1:4]..."
192
+ ```
193
+
194
+ ## 8. CharCode-based Dispatch in nextToken()
195
+
196
+ ### What was changed:
197
+ - Replaced char-based switch (`switch (firstChar)`) with charCode-based switch (`switch (firstCharCode)`)
198
+ - Use numeric character codes instead of string literals (e.g., `case 39:` instead of `case "'"`)
199
+ - Updated all case statements to use ASCII codes with comments showing the character
200
+ - Modified error handling to convert charCode back to string when needed
201
+
202
+ ### Performance Impact:
203
+ - Before: ~2,212K expressions/second (char-based dispatch)
204
+ - After: ~2,305K expressions/second (charCode-based dispatch)
205
+ - **Improvement: ~4.2%** (93,132 more expressions/sec)
206
+
207
+ ### Why it works:
208
+ - Integer comparison is faster than string comparison in switch statements
209
+ - JavaScript engines optimize numeric switches into jump tables more efficiently
210
+ - Avoids string indexing overhead when accessing characters
211
+ - Better CPU branch prediction with numeric values
212
+ - Single charCodeAt() call is more efficient than charAt() or string indexing
213
+
214
+ ### Implementation Details:
215
+ ```typescript
216
+ // Before
217
+ switch (firstChar) {
218
+ case "'": return this.readString();
219
+ case ".": return { type: TokenType.DOT, ... };
220
+ }
221
+
222
+ // After
223
+ switch (firstCharCode) {
224
+ case 39: // '
225
+ return this.readString();
226
+ case 46: // .
227
+ return { type: TokenType.DOT, ... };
228
+ }
229
+ ```
230
+
231
+ ### Trade-offs:
232
+ - Less readable code (need ASCII code comments)
233
+ - Need to convert charCode back to string for error messages
234
+ - Slightly more complex error handling
235
+
236
+ ### Isolated Test Results:
237
+ In isolated benchmarks, charCode dispatch showed 25% improvement over char dispatch. The smaller gain in the real lexer (4.2%) is due to:
238
+ - Other operations diluting the pure dispatch improvement
239
+ - Real-world expressions include various token types, not just operators
240
+ - Position tracking, token creation, and other overhead
241
+
242
+ ## 9. Lookup Table for nextToken() Dispatch (Attempted)
243
+
244
+ ### What was tried:
245
+ - Created a lookup table mapping character codes to token types for single-character tokens
246
+ - Used array indexing instead of switch statement for dispatch
247
+ - Aimed to reduce branching and improve performance
248
+
249
+ ### Implementation approach:
250
+ ```typescript
251
+ // Lookup table initialization
252
+ private static readonly CHAR_TOKEN_TABLE: (TokenType | null)[] = new Array(256).fill(null);
253
+ static {
254
+ this.CHAR_TOKEN_TABLE[46] = TokenType.DOT; // .
255
+ this.CHAR_TOKEN_TABLE[40] = TokenType.LPAREN; // (
256
+ // ... etc
257
+ }
258
+
259
+ // Dispatch logic
260
+ const tokenType = CHAR_TOKEN_TABLE[firstCharCode];
261
+ if (tokenType !== null) {
262
+ this.advance();
263
+ return { type: tokenType, ... };
264
+ }
265
+ // Fall back to switch for complex cases
266
+ ```
267
+
268
+ ### Performance Impact:
269
+ - Switch-based: ~7,164K expressions/second
270
+ - Lookup table: ~3,990K expressions/second
271
+ - **Result: 79.5% SLOWER** (the lookup table takes ~1.8x as long per expression, a ~44% drop in throughput)
272
+
273
+ ### Why it failed:
274
+ 1. **V8 Optimization** - Modern JavaScript engines compile numeric switches into highly optimized jump tables
275
+ 2. **Array Access Overhead**:
276
+ - Array bounds checking on every access
277
+ - Memory indirection (fetch from array)
278
+ - Potential cache misses
279
+ - Additional null check required
280
+ 3. **Two-Stage Dispatch** - Still need switch/if-else fallback for complex tokens
281
+ 4. **CPU Branch Prediction** - Modern CPUs predict switch branches very well
282
+ 5. **Inline Optimization** - V8 can inline entire switch statement, but not array lookups
283
+
284
+ ### Detailed Test Results:
285
+ - **Regular expressions test**: 79.5% slower with lookup table
286
+ - **Operator-heavy expressions**: 43.9% slower with lookup table
287
+ - The more operators (which benefit most from lookup), the less severe the penalty
288
+ - But still significantly slower in all cases
289
+
290
+ ### Lesson learned:
291
+ - Not all traditional C/C++ optimizations translate to JavaScript
292
+ - V8's switch statement optimization is extremely efficient for numeric cases
293
+ - Array lookups in hot paths can be slower than well-optimized switches
294
+ - The current switch-based implementation outperformed the lookup table in every test run
295
+
296
+ ### Current Performance Summary:
297
+ - Original: ~1,477K expressions/second
298
+ - Current: ~2,305K expressions/second
299
+ - **Total improvement: ~56%**
300
+
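The percentages in this summary follow two conventions, which the sketch below makes explicit. The helper names are hypothetical; the inputs are throughput figures (in thousands of expressions/second) taken from the sections above.

```typescript
// Gain in throughput, in percent: (after / before - 1) * 100.
function throughputGainPct(before: number, after: number): number {
  return (after / before - 1) * 100;
}

// "X% slower" as extra time per expression, given two throughputs.
function extraTimePct(fast: number, slow: number): number {
  return (fast / slow - 1) * 100;
}

// The same comparison expressed as lost throughput.
function throughputDropPct(fast: number, slow: number): number {
  return (1 - slow / fast) * 100;
}

// Rounds to one decimal place for display.
function round1(x: number): number {
  return Math.round(x * 10) / 10;
}

console.log(round1(throughputGainPct(1477, 2305))); // 56.1 (total improvement)
console.log(round1(extraTimePct(7164, 3990)));      // 79.5 (lookup table vs switch)
console.log(round1(throughputDropPct(7164, 3990))); // 44.3 (same result as throughput)
```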
301
+ ### Conclusion:
302
+ The current switch-based implementation with charCode dispatch was the fastest approach measured for the nextToken() method. Further optimization efforts should focus on other areas of the lexer.
303
+
304
+ ## 10. Unicode Support (Added then Removed)
305
+
306
+ ### What was attempted:
307
+ - Added full Unicode support for identifiers using regex patterns `/\p{L}|\p{Nl}/u` and `/\p{L}|\p{N}|\p{M}|\p{Pc}/u`
308
+ - Allowed Unicode characters in regular identifiers (e.g., `café`, `日本語`)
309
+ - Extended support to environment variable identifiers
310
+
311
+ ### Performance Impact:
312
+ - Before Unicode: ~2,305K expressions/second
313
+ - With Unicode: ~1,908K expressions/second
314
+ - **Result: 17% performance degradation**
315
+
316
+ ### Why it was removed:
317
+ 1. **Not spec compliant** - FHIRPath grammar explicitly defines identifiers as `[A-Za-z_][A-Za-z0-9_]*`
318
+ 2. **Unicode is only allowed in**:
319
+ - String literals: `'café'`, `'日本語'`
320
+ - Delimited identifiers: `` `café` ``, `` `日本語` ``
321
+ - Environment variable strings: `%'café'`, `` %`日本語` ``
322
+ 3. **Performance cost** - Regex checks with Unicode property escapes are expensive
323
+
324
+ ### Implementation Details:
325
+ The original implementation used regex for every character:
326
+ ```typescript
327
+ function isUnicodeIdentifierStart(char: string): boolean {
328
+ return /\p{L}|\p{Nl}/u.test(char); // Expensive!
329
+ }
330
+ ```
331
+
332
+ ### Final Resolution:
333
+ - Removed all Unicode support from regular identifiers
334
+ - Kept ASCII-only lookup tables for maximum performance
335
+ - Unicode remains supported in strings and delimited identifiers as per spec
336
+ - Performance restored to ~2,303K expressions/second
337
+
338
+ ### Lesson learned:
339
+ - Always verify spec compliance before implementing features
340
+ - Unicode regex operations have significant performance overhead
341
+ - Following the spec can lead to better performance
342
+
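An ASCII-only identifier scan matching the grammar quoted above (`[A-Za-z_][A-Za-z0-9_]*`) could look like this sketch. The function names are hypothetical, and plain comparisons are used here instead of the lexer's lookup tables to keep the example short.

```typescript
// First character: A-Z, a-z, or underscore.
function isIdentStart(code: number): boolean {
  return (code >= 65 && code <= 90) || (code >= 97 && code <= 122) || code === 95;
}

// Subsequent characters: anything valid at the start, plus 0-9.
function isIdentPart(code: number): boolean {
  return isIdentStart(code) || (code >= 48 && code <= 57);
}

// Returns the end offset of the identifier beginning at start,
// or start itself when no identifier begins there.
function scanIdentifier(input: string, start: number): number {
  if (start >= input.length || !isIdentStart(input.charCodeAt(start))) {
    return start;
  }
  let pos = start + 1;
  while (pos < input.length && isIdentPart(input.charCodeAt(pos))) pos++;
  return pos;
}
```

Under this scan, `café` stops after `caf`, so non-ASCII names have to be written as delimited identifiers.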
343
+ ### Current Performance Summary:
344
+ - Original: ~1,477K expressions/second
345
+ - Current: ~2,893K expressions/second (measured after optimization 11 below)
346
+ - **Total improvement: ~96%**
347
+
348
+ ## 11. Optimize readSpecialIdentifier (Avoid substring)
349
+
350
+ ### What was changed:
351
+ - Replaced `substring()` call with direct character code comparisons
352
+ - Check each character individually using `charCodeAt()`
353
+ - Avoid allocating a new string object for lookahead
354
+
355
+ ### Implementation:
356
+ ```typescript
357
+ // Before: const ahead = this.input.substring(this.position, this.position + 6);
358
+ // After: Direct charCode checks
359
+ if (pos + 4 < len &&
360
+ this.input.charCodeAt(pos + 1) === 116 && // t
361
+ this.input.charCodeAt(pos + 2) === 104 && // h
362
+ this.input.charCodeAt(pos + 3) === 105 && // i
363
+ this.input.charCodeAt(pos + 4) === 115) { // s
364
+ // $this
365
+ }
366
+ ```
367
+
368
+ ### Performance Impact:
369
+ - Before: ~2,303K expressions/second
370
+ - After: ~2,893K expressions/second
371
+ - **Improvement: ~25.6%**
372
+
373
+ ### Why it works:
374
+ - Avoids string allocation overhead
375
+ - Direct integer comparisons are faster than string operations
376
+ - No temporary objects created
377
+ - Better memory locality
378
+
379
+ ## 12. Remove Redundant Bounds Checking for Lookup Tables
380
+
381
+ ### What was changed:
382
+ - Removed unnecessary upper bound checks (`< 256`) from lookup table accesses
383
+ - Changed from `charCode >= 0 && charCode < 256 && IS_DIGIT[charCode]` to `charCode !== -1 && IS_DIGIT[charCode]`
384
+ - Applied to all lookup table uses: IS_DIGIT, IS_LETTER, IS_LETTER_OR_DIGIT, IS_HEX_DIGIT
385
+
386
+ ### Why the optimization works:
387
+ 1. **charCodeAt() never returns a negative value** - It returns either 0-65535 or NaN; the only negative value is the -1 EOF sentinel produced by peekCharCode()
388
+ 2. **Out-of-bounds reads are safe** - Indexing a `Uint8Array` past its length returns `undefined`, which is falsy
389
+ 3. **Only need to check for -1** - Our EOF sentinel value from peekCharCode()
390
+ 4. **Reduces comparisons** - Two range comparisons per lookup become a single sentinel check
391
+
392
+ ### Performance Impact:
393
+ - Before: ~2,893K expressions/second
394
+ - After: ~4,454K expressions/second
395
+ - **Improvement: ~54%**
396
+
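The reasoning above can be checked with a small sketch. The free-standing `peekCharCode` and `isDigitAt` functions are assumptions made for illustration; in the lexer these are methods working on instance state.

```typescript
const IS_DIGIT = new Uint8Array(256);
for (let c = 48; c <= 57; c++) IS_DIGIT[c] = 1; // 0-9

// Returns the code at pos, or the -1 EOF sentinel past the end of input.
function peekCharCode(input: string, pos: number): number {
  return pos < input.length ? input.charCodeAt(pos) : -1;
}

function isDigitAt(input: string, pos: number): boolean {
  const code = peekCharCode(input, pos);
  // No "< 256" check: codes of 256 or more read past the table and
  // yield undefined, which never equals 1.
  return code !== -1 && IS_DIGIT[code] === 1;
}
```

A character such as `日` (code 26085) simply falls outside the table and is classified as a non-digit.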
397
+ ## 13. Add Dedicated EOF Case to Switch Statement
398
+
399
+ ### What was changed:
400
+ - Added explicit `case -1:` for EOF in the main switch statement
401
+ - Removed redundant EOF check from the default case
402
+ - Simplified default case by removing `firstCharCode !== -1` checks
403
+
404
+ ### Implementation:
405
+ ```typescript
406
+ // Before:
407
+ default:
408
+ if (firstCharCode !== -1 && IS_DIGIT[firstCharCode]) { ... }
409
+ if (firstCharCode !== -1 && IS_LETTER[firstCharCode]) { ... }
410
+ if (firstCharCode === -1) { return EOF; }
411
+
412
+ // After:
413
+ case -1:
414
+ return { type: TokenType.EOF, ... };
415
+ default:
416
+ if (IS_DIGIT[firstCharCode]) { ... }
417
+ if (IS_LETTER[firstCharCode]) { ... }
418
+ ```
419
+
420
+ ### Performance Impact:
421
+ - Before: ~4,454K expressions/second
422
+ - After: ~4,497K expressions/second
423
+ - **Improvement: ~1%**
424
+
425
+ ### Why it works:
426
+ - Eliminates redundant EOF checks in the default case
427
+ - Switch statement handles EOF directly without falling through
428
+ - Cleaner code path for the common case
429
+
430
+ ## 14. Avoid Substring Allocation in readIdentifierOrKeyword
431
+
432
+ ### What was changed:
433
+ - Replaced `substring()` call with direct character code comparisons for common keywords
434
+ - Check keywords directly from the input buffer using charCodeAt()
435
+ - Only use substring for longer, less common keywords (6+ characters)
436
+
437
+ ### Implementation:
438
+ ```typescript
439
+ // Before:
440
+ const value = this.input.substring(start, this.position);
441
+ switch (value) {
442
+ case 'true': type = TokenType.TRUE; break;
443
+ // ... many string comparisons
444
+ }
445
+
446
+ // After:
447
+ // For length 4 example:
448
+ const c0_4 = input.charCodeAt(start);
449
+ if (c0_4 === 116 && // 't'
450
+ input.charCodeAt(start + 1) === 114 && // 'r'
451
+ input.charCodeAt(start + 2) === 117 && // 'u'
452
+ input.charCodeAt(start + 3) === 101) { // 'e'
453
+ type = TokenType.TRUE;
454
+ }
455
+ ```
456
+
457
+ ### Performance Impact:
458
+ - Before: ~4,497K expressions/second
459
+ - After: ~5,656K expressions/second
460
+ - **Improvement: ~26%**
461
+
462
+ ### Why it works:
463
+ 1. **Avoids string allocation** - No substring object created for most keywords
464
+ 2. **Direct integer comparisons** - Faster than string equality checks
465
+ 3. **Optimized for common cases** - Most keywords are 2-5 characters
466
+ 4. **Falls back gracefully** - Still uses substring for rare long keywords
467
+
468
+ ### Trade-offs:
469
+ - More verbose code (but still maintainable with comments)
470
+ - Larger code size due to unrolled comparisons
471
+ - Best for hot paths with frequent keyword checking
472
+
473
+ ## 15. Inline advance() Method in Hot Paths
474
+
475
+ ### What was changed:
476
+ - Inlined `advance()` method in the most frequently called code paths
477
+ - Replaced `this.advance()` with direct `this.position++; this.column++`
478
+ - Applied to: single-character tokens, two-character operators, identifier/number reading loops
479
+
480
+ ### Where inlined:
481
+ 1. **nextToken() switch cases** - All single-character tokens (., (, ), +, -, *, =, etc.)
482
+ 2. **Two-character operators** - <, >, !, and their combinations (<=, >=, !=, !~)
483
+ 3. **Tight loops** - readIdentifierOrKeyword, readNumber digit scanning
484
+
485
+ ### Performance Impact:
486
+ - Before: ~5,656K expressions/second
487
+ - After: ~6,042K expressions/second
488
+ - **Improvement: ~7%**
489
+
490
+ ### Why it works:
491
+ 1. **Eliminates function call overhead** - No stack frame allocation/deallocation
492
+ 2. **Better inlining by JIT** - Compiler can optimize the inlined code better
493
+ 3. **Removes unused work** - advance() returns a value that's rarely used
494
+ 4. **Hot path optimization** - These paths are executed millions of times
495
+
496
+ ### Trade-offs:
497
+ - Code duplication (position++ and column++ repeated)
498
+ - Harder to maintain if line/column tracking logic changes
499
+ - Must remember to handle newlines separately where needed
500
+
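The difference between calling `advance()` and inlining it can be sketched as follows. `MiniLexer` is a hypothetical stand-in for the real lexer, and the identifier character set is assumed to be ASCII letters, digits, and underscore. Identifiers cannot contain newlines, so the inlined loop needs no line handling.

```typescript
// Assumption: identifier characters are ASCII letters, digits, underscore.
function isIdentPart(code: number): boolean {
  return (
    (code >= 97 && code <= 122) || // a-z
    (code >= 65 && code <= 90) ||  // A-Z
    (code >= 48 && code <= 57) ||  // 0-9
    code === 95                    // _
  );
}

class MiniLexer {
  input: string;
  position = 0;
  column = 1;

  constructor(input: string) {
    this.input = input;
  }

  // General-purpose step; returns the consumed code, which loops rarely use.
  advance(): number {
    const code = this.input.charCodeAt(this.position);
    this.position++;
    this.column++;
    return code;
  }

  // Version 1: one method call per character.
  readIdentifierWithCalls(): string {
    const start = this.position;
    while (
      this.position < this.input.length &&
      isIdentPart(this.input.charCodeAt(this.position))
    ) {
      this.advance();
    }
    return this.input.substring(start, this.position);
  }

  // Version 2: the body of advance() is written out inside the loop.
  readIdentifierInlined(): string {
    const start = this.position;
    while (
      this.position < this.input.length &&
      isIdentPart(this.input.charCodeAt(this.position))
    ) {
      this.position++;
      this.column++;
    }
    return this.input.substring(start, this.position);
  }
}
```

Both versions leave `position` and `column` in the same state; only the per-character call is removed.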
501
+ ## 16. Use CharCode Instead of Char in String/Delimiter Reading
502
+
503
+ ### What was changed:
504
+ - Replaced string comparisons with character code comparisons in string reading methods
505
+ - Applied to: `readString()`, `readDelimitedIdentifier()`, `readEnvVar()`
506
+ - Changed `char === "'"` to `charCode === 39` etc.
507
+
508
+ ### Implementation:
509
+ ```typescript
510
+ // Before:
511
+ if (char === quoteChar) { ... }
512
+ if (char === '\\') { ... }
513
+ switch (escaped) {
514
+ case "'": ...
515
+ case "n": ...
516
+ }
517
+
518
+ // After:
519
+ if (charCode === quoteCharCode) { ... }
520
+ if (charCode === 92) { ... } // \
521
+ switch (escapedCode) {
522
+ case 39: // '
523
+ case 110: // n
524
+ }
525
+ ```
526
+
527
+ ### Performance Impact:
528
+ - Before: ~6,042K expressions/second
529
+ - After: ~6,093K expressions/second
530
+ - **Improvement: ~1%**
531
+
532
+ ### Why it works:
533
+ - Integer comparisons faster than string comparisons
534
+ - Consistent with the optimization pattern used in nextToken()
535
+ - Avoids string allocation for single character comparisons
536
+ - Better for hot paths that process many characters
537
+
538
+ ## 17. Inline peek() and advance() in readString
539
+
540
+ ### What was changed:
541
+ - Inlined `peekCharCode()` and `advance()` calls in the `readString()` method
542
+ - Replaced method calls with direct buffer access and position/column updates
543
+ - Applied specifically to string parsing which is a hot path for string-heavy expressions
544
+
545
+ ### Implementation:
546
+ ```typescript
547
+ // Before:
548
+ const charCode = this.peekCharCode();
549
+ if (charCode === quoteCharCode) {
550
+ this.advance();
551
+ return { type: TokenType.STRING, ... };
552
+ }
553
+
554
+ // After:
555
+ const charCode = this.input.charCodeAt(this.position);
556
+ if (charCode === quoteCharCode) {
557
+ this.position++;
558
+ this.column++;
559
+ return { type: TokenType.STRING, ... };
560
+ }
561
+ ```
562
+
563
+ ### Performance Impact:
564
+ - **String-heavy expressions**: ~31% improvement (3.0M → 4.0M expressions/second)
565
+ - **Overall benchmark**: Mixed results (slight decrease in some tests)
566
+ - Decision: Applied to `readString()` only, not to other methods
567
+
568
+ ### Why it works for strings:
569
+ 1. **String parsing is advance-heavy** - Each character requires an advance() call
570
+ 2. **Eliminates method call overhead** - Direct buffer access is faster
571
+ 3. **Better locality** - All operations on adjacent memory
572
+ 4. **Escape sequence handling** - Multiple advance() calls for escape sequences benefit greatly
573
+
574
+ ### Trade-offs:
575
+ - **Code size increase** - Method becomes larger and more complex
576
+ - **Maintenance burden** - Must handle line/column tracking manually
577
+ - **Mixed performance impact** - Benefits string-heavy code but may hurt instruction cache
578
+ - **JIT optimization** - Very large methods may be harder for JIT to optimize
579
+
580
+ ### String-Heavy Performance Test Results:
581
+
582
+ Comprehensive benchmarks show the significant impact of this optimization on string processing:
583
+
584
+ **Test Categories and Results:**
585
+ - **Long strings**: 2,171,821 expr/sec (strings with 100+ characters)
586
+ - **Escaped strings**: 2,512,563 expr/sec (strings with escape sequences like \n, \t, \\)
587
+ - **Unicode strings**: 2,061,856 expr/sec (strings with \uXXXX escape sequences)
588
+ - **Multiple strings**: 816,327 expr/sec (expressions with multiple string concatenations)
589
+ - **FHIR expressions**: 1,179,941 expr/sec (real-world FHIR queries with string literals)
590
+ - **Environment variables**: 1,494,768 expr/sec (string-based environment variables)
591
+
592
+ **Overall String-Heavy Performance**: 1,483,680 expressions/second
593
+
594
+ **Comparison with Simple Expressions**:
595
+ - Simple expressions (identifiers, operators): 4,132,231 expr/sec
596
+ - String-heavy slowdown: ~2.79x (4,132,231 / 1,483,680)
597
+ - The ~31% improvement from inlining is crucial for maintaining acceptable string performance
598
+
599
+ **Conclusion**:
600
+ The peek/advance inlining in readString provides essential performance for string-heavy workloads. Without this optimization, string processing would be ~31% slower, making string-heavy expressions run at only ~1.13M expr/sec instead of ~1.48M expr/sec.
601
+
602
+ ### Current Performance Summary:
603
+ - Original: ~1,477K expressions/second
604
+ - Current: ~6,093K expressions/second
605
+ - **Total improvement: ~313%** (4.1x faster than original)
606
+
607
+ ## 18. Added Optional Trivia/Channel Support
608
+
609
+ ### What was changed:
610
+ - Added `Channel` enum with `REGULAR` and `HIDDEN` values
611
+ - Added optional `channel` property to Token interface
612
+ - Added `preserveTrivia` option to LexerOptions
613
+ - Modified whitespace and comment tokens to include `Channel.HIDDEN` when `preserveTrivia` is true
614
+ - Updated `nextToken` to return trivia tokens when `preserveTrivia` is enabled
615
+
616
+ ### Implementation:
617
+ ```typescript
618
+ export enum Channel {
619
+ REGULAR = 0,
620
+ HIDDEN = 1,
621
+ }
622
+
623
+ export interface Token {
624
+ // ... existing properties
625
+ channel?: Channel;
626
+ }
627
+
628
+ export interface LexerOptions {
629
+ skipWhitespace?: boolean;
630
+ skipComments?: boolean;
631
+ preserveTrivia?: boolean; // New option
632
+ }
633
+ ```
634
+
635
+ ### Performance Impact:
636
+ - **Without preserveTrivia**: 2,730,350 expressions/second (default)
637
+ - **With preserveTrivia**: 1,737,645 expressions/second
638
+ - **Overhead**: 36.4% slower when enabled
639
+ - Channel assignment only happens when explicitly requested
640
+ - No performance penalty for users who don't need trivia
641
+
642
+ The overhead is expected because:
643
+ 1. Whitespace and comment tokens are returned instead of skipped
644
+ 2. Each trivia token requires channel assignment
645
+ 3. More tokens are created and added to the result array
646
+ 4. Typical expressions have 30-50% trivia tokens
647
+
648
+ ### Why it's important:
649
+ 1. **Code Formatters** - Need to preserve whitespace and comments
650
+ 2. **Refactoring Tools** - Must maintain original formatting
651
+ 3. **Documentation Generators** - Extract comments for API docs
652
+ 4. **Round-trip Parsing** - Parse → Modify → Serialize preserves formatting
653
+ 5. **Compatibility** - Matches original lexer's channel support
654
+
655
+ ### Usage:
656
+ ```typescript
657
+ // Preserve all trivia
658
+ const lexer = new Lexer(code, { preserveTrivia: true });
659
+ const tokens = lexer.tokenize();
660
+
661
+ // Filter by channel
662
+ const regularTokens = tokens.filter(t => t.channel !== Channel.HIDDEN);
663
+ const triviaTokens = tokens.filter(t => t.channel === Channel.HIDDEN);
664
+ ```
665
+
666
+ ### Current Performance Summary:
667
+ - Original: ~1,477K expressions/second
668
+ - Current: ~2,730K expressions/second (real-world expressions)
669
+ - Current: ~6,093K expressions/second (simple benchmark)
670
+ - **Total improvement: ~85%** (real-world) to **313%** (simple)
671
+ - With trivia enabled: ~1,738K expressions/second (36.4% overhead)
672
+
673
+ ## 19. Performance Comparison with ANTLR
674
+
675
+ ### What was tested:
676
+ - Compared hand-optimized Lexer2 with ANTLR-generated lexer
677
+ - Used same test suite: 1,539 real-world FHIRPath expressions
678
+ - 10,000 iterations per expression (~15.4 million tokenizer runs in total)
679
+ - ANTLR lexer generated from official FHIRPath grammar (spec/fhirpath.g4)
680
+
681
+ ### Performance Results:
682
+ - **Lexer2**: 2,820,268 expressions/second
683
+ - **ANTLR**: 725,684 expressions/second
684
+ - **Performance advantage: 3.89x faster**
685
+
686
+ ### Why Lexer2 Outperforms ANTLR:
687
+
688
+ #### Lexer2 Optimizations:
689
+ 1. **Lookup tables** - O(1) character classification
690
+ 2. **CharCode dispatch** - Integer switches instead of string comparisons
691
+ 3. **Inlined hot paths** - Reduced function call overhead
692
+ 4. **Optimized keywords** - Switch by length then value
693
+ 5. **Minimal allocations** - Simple object literals
694
+
695
+ #### ANTLR Overhead:
696
+ 1. **Generic DFA/NFA** - Flexible but less optimized
697
+ 2. **Heavy tokens** - Complex Token class with many properties
698
+ 3. **Stream abstractions** - Multiple indirection layers
699
+ 4. **Error recovery** - Built-in but adds overhead
700
+ 5. **Memory usage** - Larger token objects and buffering
701
+
702
+ ### Trade-offs:
703
+ - **Lexer2**: Maximum performance, requires manual maintenance
704
+ - **ANTLR**: Grammar-based, easier to modify, good tooling
705
+
706
+ ### Conclusion:
707
+ The hand-optimized approach provides substantial performance benefits (3.89x) for performance-critical applications like FHIRPath evaluation. ANTLR remains valuable for prototyping and when 725K expr/sec is sufficient.
708
+
709
+ ### Final Performance Summary:
710
+ - **vs Original implementation**: ~85% improvement (real-world)
711
+ - **vs ANTLR-generated lexer**: ~289% improvement (3.89x faster)
712
+ - **Absolute performance**: 2.82M expressions/second
713
+
714
+ ### Remaining Optimization Opportunities:
715
+ 1. **Optimize readDateTime/readTimeFormat** - Reduce redundant charCode lookups
716
+ 2. **Consider selective inlining** - Apply only where measurable benefit exists
717
+ 3. **Profile-guided optimization** - Focus on actual hot paths in real usage
718
+ 4. **Add offset property** - For complete position tracking compatibility