oak 0.4.1 → 0.4.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 1894c4827e6cb478e373686a2c5c198530a6eabf
4
- data.tar.gz: 157bf6e8b64b962cefdabfee31eadd0955b58fed
3
+ metadata.gz: c5a9c1b9b0290b6e92de16b26030631e9f795fec
4
+ data.tar.gz: 898af2d521ee3fba60258c21411fe4dcc839ce16
5
5
  SHA512:
6
- metadata.gz: fb82da115acd3abd4cc992bfa0896c0ad8ec0a13e56a4843ce332672ba7dc8c1e5b6608dcc0ac7215debf16644e3ac2bf709cb043c826d35aeaf8c07a6a92cb8
7
- data.tar.gz: 30200fc38b86dd7e5953a9c70ae0d84825f6008042c1283a66554ad47480eae06658ef626eb6eaa7b3d93156d718952669cd4677add032b3b232408a8acdbf3f
6
+ metadata.gz: f724eac8c05c5f948192ac73c70c136623beae95289ddff8f2d1b024d0351a2f55032841c347e54b0caf6c8d3b604f698b4e86981f716a988599b709c79e95b9
7
+ data.tar.gz: 75137b8bdf2d4807c018253f7535335bad3d74b94afa7c71ebd085a03de423015dfa6cd12e95d3d3c8f61ce4c8d366ca2cc7ecb5f58e89686e6e2a7cf4ca1451
@@ -4,9 +4,11 @@ before_install:
4
4
  - gem install bundler -v 1.16.1
5
5
  rvm:
6
6
  - 2.1.6
7
- - 2.2.9
8
- - 2.4.3
9
- - 2.5.0
7
+ - 2.2.7
8
+ - 2.2.10
9
+ - 2.3.7
10
+ - 2.4.4
11
+ - 2.5.1
10
12
  script:
11
13
  #
12
14
  # Run several tests in parallel, and be happy if they are all happy.
@@ -1,3 +1,12 @@
1
+ ## 0.4.2 (TBD)
2
+
3
+ - `oak`, `oak.rb` patched to play nicer with bundler
4
+ - Fix exception type bug in ruby >= 2.3 and openssl >= 2.0.0 when
5
+ parsing some corrupt encrypted data.
6
+ - Updated .travis.yml to update and expand the versions of Ruby covered.
7
+ - Migrate DESIGN.md and ENCRYPTION.md in from Google Docs.
8
+
9
+
1
10
  ## 0.4.1 (2018-10-01)
2
11
 
3
12
  - `oak`, `oak.rb` published as executables from gem.
@@ -8,7 +17,7 @@
8
17
 
9
18
  ## 0.4.0 (2018-09-24)
10
19
 
11
- - First export from ProsperWorks/ALI.
20
+ - First export from [ProsperWorks/ALI](https://github.com/ProsperWorks/ALI).
12
21
  - First conversion to gem.
13
22
  - Not open (yet).
14
23
  - OAK3 emitted by default.
@@ -266,10 +266,15 @@ than hypothesis.
266
266
 
267
267
  ## Encryption Choices
268
268
 
269
- Encryption is the first extension of OAK since it went live in
270
- ProsperWorks's Redis layer on 2016-06-02 and in the S3 Correspondence
271
- bodies since 2016-07-06. There had been only Rubocop updates and nary
272
- a bugfix since 2016-07-01.
269
+ OAK has been live in `ALI` (copper.com's primary web service) in our
270
+ Redis cache layer since 2016-06-02 and for archiving correspondence
270
+ bodies in S3 since 2016-07-06.
272
+
273
+ There had been only Rubocop updates and nary a bugfix since
274
+ 2016-07-01.
275
+
276
+ Encryption is the first extension since OAK first went live.
277
+
273
278
 
274
279
  ### Encryption-in-OAK Design Decisions (see arch doc for discussion):
275
280
 
@@ -0,0 +1,410 @@
1
+ # OAK: The Object ArKive
2
+
3
+ ## All that to avoid JSON?
4
+
5
+ OAK is a serialization and envelope format which encodes simple Ruby
6
+ objects as strings. It bundles together a variety of well-understood
7
+ encoding libraries into a succinct self-describing package.
8
+
9
+ This document covers the existing OAK format, and proposed encryption
10
+ extension.
11
+
12
+ OAK compares to JSON, YAML, and Marshal. OAK is more precise than
13
+ JSON or YAML, but slightly more Ruby-esque, and supports fewer types
14
+ than Marshal. OAK also has features similar to OpenPGP
15
+ ([https://tools.ietf.org/html/rfc4880](https://tools.ietf.org/html/rfc4880))
16
+ (though not, so far, encryption).
17
+
18
+ The main value proposition for OAK is operational flexibility. OAK
19
+ lets you defer choices between compression, checksumming, and 7-bit
20
+ cleanliness algorithms until after a system is live and under load.
21
+
22
+ As of 2017-09-13, OAK is used by `ALI` (copper.com's primary web
23
+ service) for volatile caches in Redis, durable archives in S3, and for
24
+ 7-bit clean encoding of complex configuration data.
25
+
26
+ Author: [JHW](https://github.com/jhwillett)
27
+ Advisors: Marshall, Gerald, Kelly, Neil
28
+
29
+ Here is a sneak preview of some OAK strings:
30
+ ```
31
+ $ echo 'HelloWorld!' | bin/oak.rb --format none
32
+ oak_3CNN_1336599037_18_F1SU11_HelloWorld!_ok
33
+
34
+ $ echo 'HelloWorld!' | bin/oak.rb
35
+ oak_3CNB_1336599037_24_RjFTVTExX0hlbGxvV29ybGQh_ok
36
+
37
+ $ echo 'HelloWorld!' | bin/oak.rb --compression lz4 --force
38
+ oak_3C4B_1336599037_28_EvADRjFTVTExX0hlbGxvV29ybGQh_ok
39
+
40
+ $ echo 'HelloWorld!' | bin/oak.rb | bin/oak.rb --mode decode-lines
41
+ HelloWorld!
42
+ ```
43
+
44
+ ## OAK Version History
45
+
46
+ OAK had been live for 16 months by the time this arch document was
47
+ prepared retroactively.
48
+
49
+ * [https://github.com/ProsperWorks/ALI/pull/1245](https://github.com/ProsperWorks/ALI/pull/1245) **oak**
50
+ * Merged to major_2016_04_dragonfruit.
51
+ * Initial implementation. Not integrated or active.
52
+ * Version oak_1
53
+ * [https://github.com/ProsperWorks/ALI/pull/1350](https://github.com/ProsperWorks/ALI/pull/1350) **oak-in-summary_accessor**
54
+ * Merged to major_2016_04_dragonfruit.
55
+ * Reworked SummaryAccessor with ENV flag to switch to OAK serialization.
56
+ * Exposure to live data revealed some issues missed in the lab.
57
+ * [https://github.com/ProsperWorks/ALI/pull/1631](https://github.com/ProsperWorks/ALI/pull/1631) **oak-remove-json-and-yaml**
58
+ * Merged to major_2016_06_goat.
59
+ * Simplified down to only the one serialization algorithm, "FRIZZY".
60
+ * Version oak_2
61
+ * [https://github.com/ProsperWorks/ALI/pull/1618](https://github.com/ProsperWorks/ALI/pull/1618) **fix-oak-utf8**
62
+ * Merged in major_2016_06_goat.
63
+ * Fixed issues uncovered in SummaryAccessor ramp.
64
+ * Version oak_3
65
+ * [https://github.com/ProsperWorks/ALI/pull/1655](https://github.com/ProsperWorks/ALI/pull/1655) **volatile_cache_accessor**
66
+ * Merged in major_2016_06_goat.
67
+ * Introduced RedisCache. Later PRs use RedisCache for
68
+ * SummaryAccessor (OAK)
69
+ * RussianDoll caches (Marshal+OAK)
70
+ * S3 cache (OAK)
71
+ * COMPANY_ID_CACHE (JSON)
72
+ * Commits to OAK for volatile use cases.
73
+ * [https://github.com/ProsperWorks/ALI/pull/1757](https://github.com/ProsperWorks/ALI/pull/1757) **oak-woe-2016-06-30**
74
+ * Merged in major_2016_07_hotpocket.
75
+ * Fixed a regexp which had broken some Float parsing.
76
+ * [https://github.com/ProsperWorks/ALI/pull/1724](https://github.com/ProsperWorks/ALI/pull/1724) **correspondences-in-s3-fixes**
77
+ * Merged in major_2016_07_hotpocket.
78
+ * Stores Correspondence bodies in S3 as OAK strings.
79
+ * Commits to OAK for durable use cases at oak_3.
80
+
81
+ See also [OAK: Encryption-in-OAK](ENCRYPTION.md) for later
82
+ developments and the introduction of OAK_4.
83
+
84
+ ## Overview of OAK strings.
85
+
86
+ `OAK.encode` includes a manifest of its options explicitly in the
87
+ OAK string output. There is no need for an options back channel to
88
+ `OAK.decode`.
89
+
90
+ We could encode every OAK string with different options, and
91
+ `OAK.decode` can reverse all of them with no extra info.
92
+ ```
93
+ >> OAK.encode('HelloWorld',redundancy: :none)
94
+ => "oak_3NNB_0_23_RjFTVTEwX0hlbGxvV29ybGQ_ok"
95
+
96
+ >> OAK.encode('HelloWorld',format: :none,redundancy: :none)
97
+ => "oak_3NNN_0_17_F1SU10_HelloWorld_ok"
98
+
99
+ >> OAK.encode('HelloWorld',compression: :zlib,force: true)
100
+ => "oak_3CZB_3789329355_34_eJxzMwwONTSI90jNyckPzy_KSQEAL2gF3A_ok"
101
+
102
+ >> OAK.decode(OAK.encode('HelloWorld',redundancy: :none))
103
+ => "HelloWorld"
104
+
105
+ >> OAK.decode(OAK.encode('HelloWorld',format: :none,redundancy: :none))
106
+ => "HelloWorld"
107
+
108
+ >> OAK.decode(OAK.encode('HelloWorld',compression: :zlib,force: true))
109
+ => "HelloWorld"
110
+ ```
111
+
112
+ We use this to defer our choice of time-space tradeoffs until runtime.
113
+ `ALI`'s `Caches::RedisCache` mechanism enshrines this pattern by parsing
114
+ OAK options from the ENV:
115
+ ```
116
+ # in Caches::RedisCache#_serialize
117
+ OAK.encode(
118
+ pre_obj,
119
+ redundancy: (ENV["CACHE_OAK_REDUNDANCY_#{name}"] || 'sha1').intern,
120
+ compression: (ENV["CACHE_OAK_COMPRESSION_#{name}"]|| 'bzip2').intern,
121
+ force: (ENV["CACHE_OAK_FORCE_#{name}"] == 'true'),
122
+ format: (ENV["CACHE_OAK_FORMAT_#{name}"] || 'base64').intern,
123
+ )
124
+ ```
125
+ These defaults differ from those in `OAK.encode`.
126
+
127
+ Here is a quick parse of some OAK strings.
128
+ ```
129
+ >> OAK.encode('Hi',format: :none)
130
+
131
+ => "oak_3CNN_3475096913_8_F1SU2_Hi_ok"
132
+ oak_3 # OAK ver 3
133
+ C # checksum Crc32
134
+ N # compression None
135
+ N # format None
136
+ 3475096913 # checksum value
137
+ 8 # 8 data bytes
138
+ F1SU2_Hi # data
139
+ ok # end of sequence
140
+
141
+ >> OAK.encode([1,'2'],redundancy: :none,format: :none)
142
+
143
+ => "oak_3NNN_0_15_F3A2_1_2I1SU1_2_ok"
144
+ oak_3 # OAK ver 3
145
+ N # checksum None
146
+ N # compression None
147
+ N # format None
148
+ 0 # checksum value
149
+ 15 # 15 data bytes
150
+ F3A2_1_2I1SU1_2 # data
151
+ ok # end of sequence
152
+ ```
153
+ The FRIZZY format encodes all the objects in the graph as a vector,
154
+ with the element 0 implicitly the top-level object and compound
155
+ objects encoded with indices into the main object vector.
156
+ ```
157
+ F # FRIZZY serializer
158
+ 3 # 3 objects
159
+ A2_1_2 # obj 0 an Array
160
+ # w/ 2 slots:
161
+ # obj 1 and
162
+ # obj 2
163
+ I1 # obj 1 Int 1
164
+ SU1_2 # obj 2 Str
165
+ # UTF-8
166
+ # 1 string bytes
167
+ # bytes '2'
168
+ ```
169
+
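As a rough illustration of the envelope layout walked through above, the sketch below splits an OAK string into its fields. `OAK_ENVELOPE` and `split_oak_envelope` are hypothetical names, not part of the oak gem, and the pattern is inferred only from the sample strings in this document.

```ruby
# Hypothetical helper, NOT the oak gem's parser. The layout is assumed from
# the examples: oak_<ver><redundancy><compression><format>_<checksum>_<size>_<data>_ok
OAK_ENVELOPE = /\Aoak_(\d)(.)(.)(.)_([^_]+)_(\d+)_(.*)_ok\z/m

def split_oak_envelope(str)
  m = OAK_ENVELOPE.match(str)
  raise ArgumentError, 'not an OAK string' if m.nil?

  fields = {
    version:     m[1].to_i,
    redundancy:  m[2],      # e.g. 'C', 'N', 'S'
    compression: m[3],      # e.g. 'N', '4', 'Z', 'B', 'M'
    format:      m[4],      # e.g. 'B', 'N'
    checksum:    m[5],
    size:        m[6].to_i,
    data:        m[7],
  }
  # The declared size lets a reader notice truncated data.
  unless fields[:data].bytesize == fields[:size]
    raise ArgumentError, 'data length does not match declared size'
  end
  fields
end

fields = split_oak_envelope('oak_3CNN_3475096913_8_F1SU2_Hi_ok')
puts fields[:data]   # prints "F1SU2_Hi"
```

The sketch only separates the envelope fields. It does not verify the checksum, undo compression or base64, or decode the FRIZZY payload.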
170
+ ## option :redundancy => :crc32, :none, or :sha1
171
+
172
+ The `:redundancy` option selects which algorithm is used to compute
173
+ the checksum included by OAK.encode. This checksum lets OAK.decode
174
+ detect stream errors. The choice of `:redundancy` at encode time is
175
+ recorded in the 6th character of the OAK string.
176
+
177
+ `:redundancy => :crc32`, the default, is flagged as a `C` and is
178
+ `'%d' % Zlib.crc32(str)`.
179
+
180
+ Advantages:
181
+
182
+ * Encodes in only 12 bytes.
183
+ * Plenty good enough for all natural stream errors.
184
+
185
+ Disadvantages:
186
+
187
+ * Encodes in 12 whole bytes!
188
+ * Easily spoofed: not cryptographically secure.
189
+
190
+ `:redundancy => :none`, is flagged as a `N` and is simply `_0`.
191
+
192
+ I chose to leave an explicit place-holder field even at the cost of 2
193
+ useless bytes to keep the number of meta-data fields constant.
194
+
195
+ Advantages:
196
+
197
+ * Encodes in only 2 bytes.
198
+ * OAK will still catch truncation errors.
199
+
200
+ Disadvantages:
201
+
202
+ * Encodes in 2 whole bytes!
203
+ * OAK will not catch twiddled bits.
204
+ * All compression algorithms do their own checksumming.
205
+ * So this is only a disadvantage with `:compression => :none`.
206
+ * Caution: `:force => false` is default.
207
+ * So fallback to `:compression => :none` is likely.
208
+ * So `:compression` is not a substitute for `:redundancy`.
209
+
210
+ `:redundancy => :sha1`, is flagged as a `S`. It is very large and is
211
+ not recommended for most use cases.
212
+
213
+ Advantages:
214
+
215
+ * Harder for a malicious hacker to fool.
216
+
217
+ Disadvantages:
218
+
219
+ * Encodes in 41 bytes!
220
+
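To make the size tradeoff between the three `:redundancy` choices concrete, here is a small sketch that builds a checksum field for each. The CRC-32 formatting follows the `'%d' % Zlib.crc32(str)` expression quoted above. Rendering SHA-1 as a hex digest is an assumption, and `payload` is only an illustrative value.

```ruby
require 'zlib'
require 'digest'

payload = 'F1SU2_Hi'   # illustrative payload, taken from an example above

# :crc32, decimal rendering as quoted in this document
crc_field  = '%d' % Zlib.crc32(payload)
# :sha1, ASSUMED here to be a hex digest (40 characters)
sha1_field = Digest::SHA1.hexdigest(payload)
# :none, the constant placeholder
none_field = '0'

puts "crc32: #{crc_field.bytesize} digits"   # at most 10 decimal digits
puts "sha1:  #{sha1_field.bytesize} chars"   # 40 hex characters
puts "none:  #{none_field.bytesize} char"
```

Under the hex assumption, a 40-character SHA-1 digest plus one `_` separator gives the 41 bytes quoted above.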
221
+ ## option :compression => :none, :lz4, :zlib, :bzip2, or :lzma
222
+
223
+ `:compression` is recorded in the 7th char and selects which
224
+ algorithm compresses the payload.
225
+
226
+ `:compression => :none`, the default, is flagged as a `N`. No compression.
227
+
228
+ * No compression or decompression costs.
229
+ * Human-readable if the source content is human readable and format is none.
230
+
231
+ `:compression => :lz4` is flagged as a `4`. [LZ4](https://github.com/lz4/lz4) is in the [Lempel-Ziv](https://en.wikipedia.org/wiki/LZ77_and_LZ78) family of dictionary-based redundancy eaters which are popular for low-latency online systems.
232
+
233
+ * Low compression costs, low decompression costs.
234
+ * Compression ratios in 1.8-2.1 for English.
235
+
236
+ `:compression => :zlib` is flagged as a `Z`. [RFC 1951 Zlib](https://en.wikipedia.org/wiki/DEFLATE) is the widely-used compression used in pkzip, zip, and gzip. It crunches LZ77 with a follow-on Huffman step.
237
+
238
+ * Medium compression costs, medium decompression costs.
239
+ * Compression ratios around 4.0 for English.
240
+
241
+ `:compression => :bzip2` is flagged as a `B`. It uses the [Burrows-Wheeler transform](https://en.wikipedia.org/wiki/Bzip2) with some Huffman, delta, and sparse array encoding thrown in for good measure.
242
+
243
+ * Higher compression and decompression costs.
244
+ * Compression ratios around 5.0 for English.
245
+
246
+ `:compression => :lzma` is flagged as an `M` and uses the [Lempel-Ziv-Markov chain algorithm](https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Markov_chain_algorithm). It is an unusual choice for an online system.
247
+
248
+ Advantages:
249
+
250
+ * Very high compression costs, medium-low decompression costs.
251
+ * Compression ratios around 5.2 for English.
252
+
253
+ `option :force => false, true`
254
+
255
+ By default, `OAK.encode` will fall back to `:compression => :none` if
256
+ the compressed string is larger than the source string.
257
+
258
+ `:force => true` overrides this fail safe.
259
+
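The `:force` fallback described above can be sketched as follows. This is a minimal illustration using Ruby's standard `Zlib::Deflate`, not the gem's actual code. `compress_with_fallback` is a hypothetical name.

```ruby
require 'zlib'

# Hypothetical sketch of the fallback rule: keep the compressed bytes unless
# they are larger than the source, in which case store the source uncompressed.
# force: true skips the size comparison.
def compress_with_fallback(source, force: false)
  packed = Zlib::Deflate.deflate(source)
  if force || packed.bytesize <= source.bytesize
    [:zlib, packed]
  else
    [:none, source]
  end
end

p compress_with_fallback('Hi').first               # => :none
p compress_with_fallback('Hi', force: true).first  # => :zlib
p compress_with_fallback('a' * 1000).first         # => :zlib
```

Short inputs such as `'Hi'` grow under zlib because of header and checksum overhead, which is why the fallback matters for small cache values.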
260
+ ## option :format => :base64 or :none
261
+
262
+ The `:format` option selects the character set used in the main body
263
+ of the OAK string - the payload part which follows the flags and
264
+ checksum and before the `_ok` terminator. The choice of `:format`
265
+ at encode time is recorded in the 8th character of the OAK string.
266
+
267
+ `:format => :base64`, the default, is flagged as a `B` and is
268
+ `Base64.urlsafe_encode64(str)` with the trailing `=` padding stripped.
269
+
270
+ Advantages:
271
+
272
+ * 7-bit clean (not that it matters in the TCP age)
273
+ * Prints prettily in ASCII terminals and editors.
274
+ * Easy to eyeball: no spaces, commas, slashes, colons, etc.
275
+ * In most text GUIs, the auto-highlighting feature when you
276
+ double-click an OAK string will exactly select it.
277
+ * Personally, this is my favorite feature of OAK.
278
+
279
+ Disadvantages:
280
+
281
+ * Not human readable.
282
+ * Only matters when compression is off.
283
+ * Size bloat in the ratio of 8 bits to 6 bits, i.e. output is about 133% of the input size.
284
+ * Reverses some compression gains.
285
+
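The `:base64` format can be reproduced with Ruby's standard `Base64` module. The sketch below is an illustration rather than the gem's code. The helper names are hypothetical, and the sample input and expected output come from the `oak_3NNB_0_23_...` example earlier in this document.

```ruby
require 'base64'

# Hypothetical helpers: URL-safe base64 with the trailing '=' padding removed,
# and the inverse, which restores padding before decoding.
def oak_style_base64(bytes)
  Base64.urlsafe_encode64(bytes).delete('=')
end

def oak_style_unbase64(text)
  padded = text + '=' * ((4 - text.length % 4) % 4)
  Base64.urlsafe_decode64(padded)
end

raw     = 'F1SU10_HelloWorld'        # 17 bytes of serialized payload
encoded = oak_style_base64(raw)
puts encoded                         # prints "RjFTVTEwX0hlbGxvV29ybGQ"
puts encoded.bytesize                # prints 23
puts oak_style_unbase64(encoded)     # prints "F1SU10_HelloWorld"
```

Here 17 payload bytes become 23 characters, which shows the roughly 4/3 size cost listed among the disadvantages.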
286
+ `:format => :none` is flagged as a `N` and does nothing. The source
287
+ string, including zeros, bells, form feeds, and umlauts, is
288
+ catenated nakedly into the OAK string.
289
+
290
+ Advantages:
291
+
292
+ * Human-readable if the source content is human readable.
293
+ * No size bloat.
294
+
295
+ Disadvantages:
296
+
297
+ * Hard to parse visually in most cases.
298
+ * Nasty in logs when data is binary or compressed.
299
+
300
+ ## Why FRIZZY? Why not JSON, YAML, XML, or Marshal?
301
+
302
+ JSON treats everything as a value type - it knows nothing about object
303
+ identity.
304
+
305
+ In Ruby, each distinct string literal is a distinct String
306
+ object. There is a subtle difference between a pair of equivalent
307
+ Strings and a pair of identical strings:
308
+ ```
309
+ >> arr = ['x','x'] ; arr[0].object_id == arr[1].object_id
310
+ => false
311
+
312
+ >> str = 'x'; arr = [str,str] ; arr[0].object_id == arr[1].object_id
313
+ => true
314
+ ```
315
+ JSON is the same for `['x','x']` and `[str,str]`. The difference
316
+ is lost in translation.
317
+ ```
318
+ >> str = 'x' ; JSON.dump([str,str]) == JSON.dump(['x','x'])
319
+ => true
320
+
321
+ >> JSON.dump(['x','x'])
322
+ => "[\"x\",\"x\"]"
323
+
324
+ >> str = 'x' ; JSON.dump([str,str])
325
+ => "[\"x\",\"x\"]"
326
+
327
+ >> arr = JSON.load(JSON.dump([str,str]))
328
+ => ["x", "x"]
329
+
330
+ >> arr[0].object_id == arr[1].object_id
331
+ => false
332
+ ```
333
+ With OAK, vive la différence:
334
+ ```
335
+ >> str = 'x' ; OAK.encode([str,str]) == OAK.encode(['x','x'])
336
+ => false
337
+
338
+ >> OAK.encode(['x','x'],format: :none)
339
+ => "oak_3CNN_3737537744_16_F3A2_1_2SU1_xsU0_ok"
340
+
341
+ >> str = 'x' ; OAK.encode([str,str],format: :none)
342
+ => "oak_3CNN_2865617390_13_F2A2_1_1SU1_x_ok"
343
+ ```
344
+
345
+ The JSON format does not support `Infinity`, `-Infinity`, or `NaN` -
346
+ though Ruby's JSON encoder transcodes these via a nonstandard
347
+ extension.
348
+
349
+ YAML handles `Infinity`, `-Infinity`, and `NaN`. YAML also handles
350
+ DAGs - but not cycles.
351
+
352
+ XML is ... XML. And huge. And Nokogiri is weird.
353
+
354
+ Who cares? These are just strings. It's better to treat them as
355
+ immutable anyhow, right? What about compound objects like lists or
356
+ hashes?
357
+
358
+ It turns out that capturing identity is the key to serializing any
359
+ non-tree objects.
360
+ ```
361
+ >> a = ['a','TBD']
362
+ => ["a", "TBD"]
363
+
364
+ >> b = ['b',a]
365
+ => ["b", ["a", "TBD"]]
366
+
367
+ >> a[1] = b # a cycle!
368
+ => ["b", ["a", [...]]]
369
+
370
+ >> JSON.dump(a)
371
+ SystemStackError: stack level too deep
372
+
373
+ >> OAK.encode(a,format: :none)
374
+ => "oak_3CNN_3573295141_24_F4A2_1_2SU1_aA2_3_0SU1_b_ok"
375
+ ```
376
+ The essence of serializing non-tree objects is capturing identity.
377
+
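To show why capturing identity makes cycles tractable, here is a minimal sketch that flattens an object graph into a vector of slots, with compound objects referring to their children by index. It mirrors the idea behind the FRIZZY vector but is not the gem's implementation. `flatten_graph` and the `:array` / `:leaf` tags are hypothetical, and only Arrays are treated as compound objects.

```ruby
# Hypothetical sketch: assign each distinct object (by identity, not by value)
# one slot, and describe Arrays by the slot numbers of their elements.
def flatten_graph(root)
  index   = {}.compare_by_identity   # object -> slot number
  entries = []

  visit = lambda do |obj|
    next index[obj] if index.key?(obj)   # already seen: reuse its slot
    slot = entries.size
    index[obj] = slot
    entries << nil                       # reserve the slot before recursing
    entries[slot] =
      if obj.is_a?(Array)
        [:array, obj.map { |child| visit.call(child) }]
      else
        [:leaf, obj]
      end
    slot
  end

  visit.call(root)
  entries
end

a = ['a', 'TBD']
b = ['b', a]
a[1] = b                                 # a cycle, as in the example above
p flatten_graph(a)
# => [[:array, [1, 2]], [:leaf, "a"], [:array, [3, 0]], [:leaf, "b"]]
```

The slot numbers line up with the `A2_1_2 ... A2_3_0` references in the encoded cycle above. Because a slot is reserved before recursing, the back-reference to slot 0 ends the traversal instead of recursing forever.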
378
+ Does this matter in `ALI`? Honestly, I don't know. Cycles and DAGs are
379
+ irrelevant for Correspondence bodies. We do have Summaries which are
380
+ DAGgy on Strings but that is probably irrelevant in all logic.
381
+
382
+ But "Do we need it?" is the wrong question. Data in the wild is
383
+ diverse and surprising. The right question is, "Can we *prove* that
384
+ we do not need it now or tomorrow?" With a cycle-aware serializer, I
385
+ don't *need* to prove nonexistence or constrain the future.
386
+
387
+ What about
388
+ [Marshal](http://jakegoulding.com/blog/2013/01/15/a-little-dip-into-rubys-marshal-format/)?
389
+ It handles (almost) all Ruby types including user-defined classes, has
390
+ no problems with cycles, and is widely available and accepted.
391
+
392
+ My reasons for not using Marshal are:
393
+
394
+ * Security
395
+ * [https://ruby-doc.org/core-2.2.2/Marshal.html](https://ruby-doc.org/core-2.2.2/Marshal.html)
396
+ * "By design, `load` can deserialize almost any class loaded into
397
+ the Ruby process. In many cases this can lead to remote code
398
+ execution if the Marshal data is loaded from an untrusted
399
+ source."
400
+ * Marshal can have problems if a user-defined class changes between
401
+ encoding time and decoding time.
402
+ * I wanted OAK to refuse to encode objects whose structure it
403
+ could not guarantee to recover with perfect fidelity.
404
+ * I have ambitions for language portability with OAK.
405
+ * Specious: porting a subset of Marshal would be no harder than
406
+ porting OAK.
407
+
408
+ To be fair, we use Marshal anyhow, wrapped in OAK, in our cache layers
409
+ which store full ActiveRecord model objects. So any arguments about
410
+ architectural purity vis-a-vis OAK are part hype.