oak 0.4.1 → 0.4.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 1894c4827e6cb478e373686a2c5c198530a6eabf
4
- data.tar.gz: 157bf6e8b64b962cefdabfee31eadd0955b58fed
3
+ metadata.gz: c5a9c1b9b0290b6e92de16b26030631e9f795fec
4
+ data.tar.gz: 898af2d521ee3fba60258c21411fe4dcc839ce16
5
5
  SHA512:
6
- metadata.gz: fb82da115acd3abd4cc992bfa0896c0ad8ec0a13e56a4843ce332672ba7dc8c1e5b6608dcc0ac7215debf16644e3ac2bf709cb043c826d35aeaf8c07a6a92cb8
7
- data.tar.gz: 30200fc38b86dd7e5953a9c70ae0d84825f6008042c1283a66554ad47480eae06658ef626eb6eaa7b3d93156d718952669cd4677add032b3b232408a8acdbf3f
6
+ metadata.gz: f724eac8c05c5f948192ac73c70c136623beae95289ddff8f2d1b024d0351a2f55032841c347e54b0caf6c8d3b604f698b4e86981f716a988599b709c79e95b9
7
+ data.tar.gz: 75137b8bdf2d4807c018253f7535335bad3d74b94afa7c71ebd085a03de423015dfa6cd12e95d3d3c8f61ce4c8d366ca2cc7ecb5f58e89686e6e2a7cf4ca1451
@@ -4,9 +4,11 @@ before_install:
4
4
  - gem install bundler -v 1.16.1
5
5
  rvm:
6
6
  - 2.1.6
7
- - 2.2.9
8
- - 2.4.3
9
- - 2.5.0
7
+ - 2.2.7
8
+ - 2.2.10
9
+ - 2.3.7
10
+ - 2.4.4
11
+ - 2.5.1
10
12
  script:
11
13
  #
12
14
  # Run several tests in parallel, and be happy if they are all happy.
@@ -1,3 +1,12 @@
1
+ ## 0.4.2 (TBD)
2
+
3
+ - `oak`, `oak.rb` patched to play nicer with bundler
4
+ - Fix exception type bug in ruby >= 2.3 and openssl >= 2.0.0 when
5
+ parsing some corrupt encrypted data.
6
+ - Updated .travis.yml to update and expand the versions of Ruby covered.
7
+ - Migrate DESIGN.md and ENCRYPTION.md in from Google Docs.
8
+
9
+
1
10
  ## 0.4.1 (2018-10-01)
2
11
 
3
12
  - `oak`, `oak.rb` published as executables from gem.
@@ -8,7 +17,7 @@
8
17
 
9
18
  ## 0.4.0 (2018-09-24)
10
19
 
11
- - First export from ProsperWorks/ALI.
20
+ - First export from [ProsperWorks/ALI](https://github.com/ProsperWorks/ALI).
12
21
  - First conversion to gem.
13
22
  - Not open (yet).
14
23
  - OAK3 emitted by default.
@@ -266,10 +266,15 @@ than hypothesis.
266
266
 
267
267
  ## Encryption Choices
268
268
 
269
- Encryption is the first extension of OAK since it went live in
270
- ProsperWorks's Redis layer on 2016-06-02 and in the S3 Correspondence
271
- bodies since 2016-07-06. There had been only Rubocop updates and nary
272
- a bugfix since 2016-07-01.
269
+ OAK has been live in `ALI` (copper.com's primary web service) in our
+ Redis cache layer since 2016-06-02 and for archiving correspondence
+ bodies in S3 since 2016-07-06.
272
+
273
+ There had been only Rubocop updates and nary a bugfix since
274
+ 2016-07-01.
275
+
276
+ Encryption is the first extension since OAK first went live.
277
+
273
278
 
274
279
  ### Encryption-in-OAK Design Decisions (see arch doc for discussion):
275
280
 
@@ -0,0 +1,410 @@
1
+ # OAK: The Object ArKive
2
+
3
+ ## All that to avoid JSON?
4
+
5
+ OAK is a serialization and envelope format which encodes simple Ruby
6
+ objects as strings. It bundles together a variety of well-understood
7
+ encoding libraries into a succinct self-describing package.
8
+
9
+ This document covers the existing OAK format and the proposed encryption
+ extension.
11
+
12
+ OAK compares to JSON, YAML, and Marshal. OAK is more precise than
13
+ JSON or YAML, but slightly more Ruby-esque, and supports fewer types
14
+ than Marshal. OAK also has features similar to OpenPGP
15
+ ([https://tools.ietf.org/html/rfc4880](https://tools.ietf.org/html/rfc4880))
16
+ (though not, so far, encryption).
17
+
18
+ The main value proposition for OAK is operational flexibility. OAK
19
+ lets you defer choices between compression, checksumming, and 7-bit
20
+ cleanliness algorithms until after a system is live and under load.
21
+
22
+ As of 2017-09-13, OAK is used by `ALI` (copper.com's primary web
23
+ service) for volatile caches in Redis, durable archives in S3, and for
24
+ 7-bit clean encoding of complex configuration data.
25
+
26
+ Author: [JHW](https://github.com/jhwillett)
27
+ Advisors: Marshall, Gerald, Kelly, Neil
28
+
29
+ Here is a sneak preview of some OAK strings:
30
+ ```
31
+ $ echo 'HelloWorld!' | bin/oak.rb --format none
32
+ oak_3CNN_1336599037_18_F1SU11_HelloWorld!_ok
33
+
34
+ $ echo 'HelloWorld!' | bin/oak.rb
35
+ oak_3CNB_1336599037_24_RjFTVTExX0hlbGxvV29ybGQh_ok
36
+
37
+ $ echo 'HelloWorld!' | bin/oak.rb --compression lz4 --force
38
+ oak_3C4B_1336599037_28_EvADRjFTVTExX0hlbGxvV29ybGQh_ok
39
+
40
+ $ echo 'HelloWorld!' | bin/oak.rb | bin/oak.rb --mode decode-lines
41
+ HelloWorld!
42
+ ```
43
+
44
+ ## OAK Version History
45
+
46
+ OAK had been live for 16 months by the time this arch document was
+ prepared retroactively.
48
+
49
+ * [https://github.com/ProsperWorks/ALI/pull/1245](https://github.com/ProsperWorks/ALI/pull/1245) **oak**
50
+ * Merged to major_2016_04_dragonfruit.
51
+ * Initial implementation. Not integrated or active.
52
+ * Version oak_1
53
+ * [https://github.com/ProsperWorks/ALI/pull/1350](https://github.com/ProsperWorks/ALI/pull/1350) **oak-in-summary_accessor**
54
+ * Merged to major_2016_04_dragonfruit.
55
+ * Reworked SummaryAccessor with ENV flag to switch to OAK serialization.
56
+ * Exposure to live data revealed some issues missed in the lab.
57
+ * [https://github.com/ProsperWorks/ALI/pull/1631](https://github.com/ProsperWorks/ALI/pull/1631) **oak-remove-json-and-yaml**
58
+ * Merged to major_2016_06_goat.
59
+ * Simplified down to only the one serialization algorithm, "FRIZZY".
60
+ * Version oak_2
61
+ * [https://github.com/ProsperWorks/ALI/pull/1618](https://github.com/ProsperWorks/ALI/pull/1618) **fix-oak-utf8**
62
+ * Merged in major_2016_06_goat.
63
+ * Fixed issues uncovered in SummaryAccessor ramp.
64
+ * Version oak_3
65
+ * [https://github.com/ProsperWorks/ALI/pull/1655](https://github.com/ProsperWorks/ALI/pull/1655) **volatile_cache_accessor**
66
+ * Merged in major_2016_06_goat.
67
+ * Introduced RedisCache. Later PRs use RedisCache for
68
+ * SummaryAccessor (OAK)
69
+ * RussianDoll caches (Marshal+OAK)
70
+ * S3 cache (OAK)
71
+ * COMPANY_ID_CACHE (JSON)
72
+ * Commits to OAK for volatile use cases.
73
+ * [https://github.com/ProsperWorks/ALI/pull/1757](https://github.com/ProsperWorks/ALI/pull/1757) **oak-woe-2016-06-30**
74
+ * Merged in major_2016_07_hotpocket.
75
+ * Fixed a regexp which had broken some Float parsing.
76
+ * [https://github.com/ProsperWorks/ALI/pull/1724](https://github.com/ProsperWorks/ALI/pull/1724) **correspondences-in-s3-fixes**
77
+ * Merged in major_2016_07_hotpocket.
78
+ * Stores Correspondence bodies in S3 as OAK strings.
79
+ * Commits to OAK for durable use cases at oak_3.
80
+
81
+ See also [OAK: Encryption-in-OAK](ENCRYPTION.md) for later
82
+ developments and the introduction of OAK_4.
83
+
84
+ ## Overview of OAK strings.
85
+
86
+ `OAK.encode` includes a manifest of its options explicitly in the
87
+ OAK string output. There is no need for an options back channel to
88
+ `OAK.decode`.
89
+
90
+ We could encode every OAK string with different options, and
91
+ `OAK.decode` can reverse all of them with no extra info.
92
+ ```
93
+ >> OAK.encode('HelloWorld',redundancy: :none)
94
+ => "oak_3NNB_0_23_RjFTVTEwX0hlbGxvV29ybGQ_ok"
95
+
96
+ >> OAK.encode('HelloWorld',format: :none,redundancy: :none)
97
+ => "oak_3NNN_0_17_F1SU10_HelloWorld_ok"
98
+
99
+ >> OAK.encode('HelloWorld',compression: :zlib,force: true)
100
+ => "oak_3CZB_3789329355_34_eJxzMwwONTSI90jNyckPzy_KSQEAL2gF3A_ok"
101
+
102
+ >> OAK.decode(OAK.encode('HelloWorld',redundancy: :none))
103
+ => "HelloWorld"
104
+
105
+ >> OAK.decode(OAK.encode('HelloWorld',format: :none,redundancy: :none))
106
+ => "HelloWorld"
107
+
108
+ >> OAK.decode(OAK.encode('HelloWorld',compression: :zlib,force: true))
109
+ => "HelloWorld"
110
+ ```
111
+
112
+ We use this to defer our choice of time-space tradeoffs until runtime.
113
+ `ALI`'s `Caches::RedisCache` mechanism enshrines this pattern by parsing
114
+ OAK options from the ENV:
115
+ ```
116
+ # in Caches::RedisCache#_serialize
+ OAK.encode(
+   pre_obj,
+   redundancy:  (ENV["CACHE_OAK_REDUNDANCY_#{name}"]  || 'sha1').intern,
+   compression: (ENV["CACHE_OAK_COMPRESSION_#{name}"] || 'bzip2').intern,
+   force:       (ENV["CACHE_OAK_FORCE_#{name}"]       == 'true'),
+   format:      (ENV["CACHE_OAK_FORMAT_#{name}"]      || 'base64').intern,
+ )
124
+ ```
125
+ These defaults differ from those in `OAK.encode`.
126
+
127
+ Here is a quick parse of some OAK strings.
128
+ ```
129
+ >> OAK.encode('Hi',format: :none)
+
+ => "oak_3CNN_3475096913_8_F1SU2_Hi_ok"
+ oak_3        # OAK ver 3
+ C            # checksum Crc32
+ N            # compression None
+ N            # format None
+ 3475096913   # checksum value
+ 8            # 8 data bytes
+ F1SU2_Hi     # data
+ ok           # end of sequence
+
+ >> OAK.encode([1,'2'],redundancy: :none,format: :none)
+
+ => "oak_3NNN_0_15_F3A2_1_2I1SU1_2_ok"
+ oak_3             # OAK ver 3
+ N                 # checksum None
+ N                 # compression None
+ N                 # format None
+ 0                 # checksum value
+ 15                # 15 data bytes
+ F3A2_1_2I1SU1_2   # data
+ ok                # end of sequence
152
+ ```
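As a rough illustration of the layout parsed above, the envelope of an oak_3 string can be split with one regular expression. This is only a sketch: `split_oak3` and `OAK3_PATTERN` are hypothetical names, not part of the OAK gem, and the sketch neither verifies the checksum nor decodes the payload. It assumes the length field counts the bytes of the payload exactly as it appears in the string, which matches the examples in this document.

```ruby
# Flag characters as described in this document.
REDUNDANCY  = { 'C' => :crc32, 'N' => :none, 'S' => :sha1 }.freeze
COMPRESSION = { 'N' => :none, '4' => :lz4, 'Z' => :zlib,
                'B' => :bzip2, 'M' => :lzma }.freeze
FORMAT      = { 'B' => :base64, 'N' => :none }.freeze

# Hypothetical pattern: header, three flags, checksum, length, payload, "_ok".
# The payload may itself contain underscores, so it is matched greedily.
OAK3_PATTERN = /\Aoak_3([CNS])([N4ZBM])([BN])_([^_]+)_(\d+)_(.*)_ok\z/m

def split_oak3(str)
  m = OAK3_PATTERN.match(str)
  raise ArgumentError, 'not an oak_3 string' if m.nil?

  data = m[6]
  # Assumption: the length field is the byte count of the payload as written.
  raise ArgumentError, 'length mismatch' unless data.bytesize == m[5].to_i

  {
    redundancy:  REDUNDANCY.fetch(m[1]),
    compression: COMPRESSION.fetch(m[2]),
    format:      FORMAT.fetch(m[3]),
    checksum:    m[4],
    data:        data,
  }
end

p split_oak3('oak_3CNN_3475096913_8_F1SU2_Hi_ok')
```

A truncated string fails either the pattern match or the length check, which is the kind of error the envelope is meant to expose even without a checksum.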
153
+ The FRIZZY format encodes all the objects in the graph as a vector,
154
+ with the element 0 implicitly the top-level object and compound
155
+ objects encoded with indices into the main object vector.
156
+ ```
157
+ F         # FRIZZY serializer
+ 3         # 3 objects
+ A2_1_2    # obj 0 an Array
+           # w/ 2 slots:
+           # obj 1 and
+           # obj 2
+ I1        # obj 1 Int 1
+ SU1_2     # obj 2 Str
+           # UTF-8
+           # 1 string bytes
+           # bytes '2'
168
+ ```
169
+
170
+ ## option :redundancy => :crc32, :none, or :sha1
171
+
172
+ The `:redundancy` option selects which algorithm is used to compute
173
+ the checksum included by OAK.encode. This checksum lets OAK.decode
174
+ detect stream errors. The choice of `:redundancy` at encode time is
175
+ recorded in the 6th character of the OAK string.
176
+
177
+ `:redundancy => :crc32`, the default, is flagged as a `C` and is
178
+ `'%d' % Zlib.crc32(str)`.
179
+
180
+ Advantages:
181
+
182
+ * Encodes in only 12 bytes.
183
+ * Plenty good enough for all natural stream errors.
184
+
185
+ Disadvantages:
186
+
187
+ * Encodes in 12 whole bytes!
188
+ * Easily spoofed: not cryptographically secure.
189
+
190
+ `:redundancy => :none` is flagged as an `N` and is simply `_0`.
191
+
192
+ I chose to leave an explicit place-holder field even at the cost of 2
193
+ useless bytes to keep the number of meta-data fields constant.
194
+
195
+ Advantages:
196
+
197
+ * Encodes in only 2 bytes.
198
+ * OAK will still catch truncation errors.
199
+
200
+ Disadvantages:
201
+
202
+ * Encodes in 2 whole bytes!
203
+ * OAK will not catch twiddled bits.
204
+ * All compression algorithms do their own checksumming.
205
+ * So this is only a disadvantage with `:compression => :none`.
206
+ * Caution: `:force => false` is default.
207
+ * So fallback to `:compression => :none` is likely.
208
+ * So `:compression` is not a substitute for `:redundancy`.
209
+
210
+ `:redundancy => :sha1` is flagged as an `S`. It is very large and is
211
+ not recommended for most use cases.
212
+
213
+ Advantages:
214
+
215
+ * Harder for a malicious hacker to fool.
216
+
217
+ Disadvantages:
218
+
219
+ * Encodes in 41 bytes!
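To make the size tradeoff between the three settings concrete, here is a small sketch of how each checksum field could be rendered. `checksum_field` is a hypothetical helper, not OAK's own code. The CRC32 rendering follows the `'%d' % Zlib.crc32(str)` expression quoted above; rendering SHA1 as lowercase hex is an assumption that happens to agree with the 41-byte figure (separator plus 40 hex digits).

```ruby
require 'zlib'
require 'digest'

# Hypothetical helper: renders the checksum field, including its leading "_".
def checksum_field(kind, str)
  case kind
  when :crc32 then '_' + ('%d' % Zlib.crc32(str))     # up to 10 decimal digits
  when :none  then '_0'                               # explicit placeholder
  when :sha1  then '_' + Digest::SHA1.hexdigest(str)  # assumed hex rendering
  else raise ArgumentError, "unknown redundancy: #{kind.inspect}"
  end
end

%i[crc32 none sha1].each do |kind|
  field = checksum_field(kind, 'F1SU2_Hi')
  puts format('%-6s %2d bytes  %s', kind, field.bytesize, field)
end
```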
220
+
221
+ ## option :compression => :none, :lz4, :zlib, :bzip2, or :lzma
222
+
223
+ `:compression` is recorded in the 7th char and selects which
224
+ algorithm compresses the payload.
225
+
226
+ `:compression => :none`, the default, is flagged as an `N`. No compression.
227
+
228
+ * No compression or decompression costs.
229
+ * Human-readable if the source content is human readable and format is none.
230
+
231
+ `:compression => :lz4` is flagged as a `4`. [LZ4](https://github.com/lz4/lz4) is in the [Lempel-Ziv](https://en.wikipedia.org/wiki/LZ77_and_LZ78) family of dictionary-based redundancy eaters which are popular for low-latency online systems.
232
+
233
+ * Low compression costs, low decompression costs.
234
+ * Compression ratios in 1.8-2.1 for English.
235
+
236
+ `:compression => :zlib` is flagged as a `Z`. [RFC 1951 Zlib](https://en.wikipedia.org/wiki/DEFLATE) is the widely used compression found in pkzip, zip, and gzip. It crunches LZ77 with a follow-on Huffman step.
237
+
238
+ * Medium compression costs, medium decompression costs.
239
+ * Compression ratios around 4.0 for English.
240
+
241
+ `:compression => :bzip2` is flagged as a `B`. It uses the [Burrows-Wheeler transform](https://en.wikipedia.org/wiki/Bzip2) with some Huffman, delta, and sparse array encoding thrown in for good measure.
242
+
243
+ * Higher compression and decompression costs.
244
+ * Compression ratios around 5.0 for English.
245
+
246
+ `:compression => :lzma` is flagged as an `M` and uses the [Lempel-Ziv-Markov chain algorithm](https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Markov_chain_algorithm). It is an unusual choice for an online system.
247
+
248
+ Advantages:
249
+
250
+ * Very high compression costs, medium-low decompression costs.
251
+ * Compression ratios around 5.2 for English.
252
+
253
+ `option :force => false, true`
254
+
255
+ By default, `OAK.encode` will fall back to `:compression => :none` if
256
+ the compressed string is larger than the source string.
257
+
258
+ `:force => true` overrides this fail safe.
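The fallback rule can be sketched with Ruby's standard `Zlib` alone. `choose_compression` is a hypothetical name and this is not OAK's implementation; it only illustrates the rule as stated: keep the compressed form unless it is larger than the source, and let `force: true` skip that check.

```ruby
require 'zlib'

# Hypothetical helper: returns [algorithm_used, bytes_to_store].
def choose_compression(str, force: false)
  packed = Zlib::Deflate.deflate(str)
  if !force && packed.bytesize > str.bytesize
    [:none, str]      # compression made things worse, so fall back
  else
    [:zlib, packed]
  end
end

p choose_compression('Hi').first               # tiny input: falls back
p choose_compression('Hi', force: true).first  # forced: stays compressed
p choose_compression('a' * 1000).first         # repetitive input compresses well
```

Short strings such as `'Hi'` always grow under zlib because of its fixed header and trailer, which is why the earlier examples needed `force: true` to show a compressed OAK string at all.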
259
+
260
+ ## option :format => :base64 or :none
261
+
262
+ The `:format` option selects the character set used in the main body
263
+ of the OAK string - the payload part which follows the flags and
264
+ checksum and before the `_ok` terminator. The choice of `:format`
265
+ at encode time is recorded in the 8th character of the OAK string.
266
+
267
+ `:format => :base64`, the default, is flagged as a `B` and is
268
+ `Base64.urlsafe_encode64(str)` with the trailing `=` padding stripped.
269
+
270
+ Advantages:
271
+
272
+ * 7-bit clean (not that it matters in the TCP age)
273
+ * Prints prettily in ASCII terminals and editors.
274
+ * Easy to eyeball: no spaces, commas, slashes, colons, etc.
275
+ * In most text GUIs, the auto-highlighting feature when you
276
+ double-click an OAK string will exactly select it.
277
+ * Personally, this is my favorite feature of OAK.
278
+
279
+ Disadvantages:
280
+
281
+ * Not human readable.
282
+ * Only matters when compression is off.
283
+ * Size bloat in the ratio of 8 bits to 6 bits, i.e. to about 133% of the original size.
284
+ * Reverses some compression gains.
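The base64 step is easy to reproduce with the standard library. The sketch below uses hypothetical helper names (`to_oak_base64`, `from_oak_base64`) and the payload from the `'HelloWorld'` examples earlier in this document; it shows only the character-set conversion, not the rest of `OAK.encode`.

```ruby
require 'base64'

# URL-safe base64 with the trailing "=" padding stripped.
def to_oak_base64(str)
  Base64.urlsafe_encode64(str, padding: false)
end

# urlsafe_decode64 accepts unpadded input, so no padding needs to be restored.
def from_oak_base64(str)
  Base64.urlsafe_decode64(str)
end

raw     = 'F1SU10_HelloWorld'   # 17 bytes with format: :none
encoded = to_oak_base64(raw)    # 23 characters with format: :base64
puts encoded
puts from_oak_base64(encoded) == raw
```

The growth from 17 to 23 bytes is the 4/3 size bloat listed among the disadvantages.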
285
+
286
+ `:format => :none` is flagged as an `N` and does nothing. The source
+ string, including zeros, bells, form feeds, and umlauts, is
+ concatenated nakedly into the OAK string.
289
+
290
+ Advantages:
291
+
292
+ * Human-readable if the source content is human readable.
293
+ * No size bloat.
294
+
295
+ Disadvantages:
296
+
297
+ * Hard to parse visually in most cases.
298
+ * Nasty in logs when data is binary or compressed.
299
+
300
+ ## Why FRIZZY? Why not JSON, YAML, XML, or Marshal?
301
+
302
+ JSON treats everything as a value type - it knows nothing about object
303
+ identity.
304
+
305
+ In Ruby, each distinct string literal is a distinct String
306
+ object. There is a subtle difference between a pair of equivalent
307
+ Strings and a pair of identical strings:
308
+ ```
309
+ >> arr = ['x','x'] ; arr[0].object_id == arr[1].object_id
310
+ => false
311
+
312
+ >> str = 'x'; arr = [str,str] ; arr[0].object_id == arr[1].object_id
313
+ => true
314
+ ```
315
+ JSON is the same for `['x','x']` and `[str,str]`. The difference
316
+ is lost in translation.
317
+ ```
318
+ >> str = 'x' ; JSON.dump([str,str]) == JSON.dump(['x','x'])
319
+ => true
320
+
321
+ >> JSON.dump(['x','x'])
322
+ => "[\"x\",\"x\"]"
323
+
324
+ >> str = 'x' ; JSON.dump([str,str])
325
+ => "[\"x\",\"x\"]"
326
+
327
+ >> arr = JSON.load(JSON.dump([str,str]))
328
+ => ["x", "x"]
329
+
330
+ >> arr[0].object_id == arr[1].object_id
331
+ => false
332
+ ```
333
+ With OAK, vive la différence:
334
+ ```
335
+ >> str = 'x' ; OAK.encode([str,str]) == OAK.encode(['x','x'])
336
+ => false
337
+
338
+ >> OAK.encode(['x','x'],format: :none)
339
+ => "oak_3CNN_3737537744_16_F3A2_1_2SU1_xsU0_ok"
340
+
341
+ >> str = 'x' ; OAK.encode([str,str],format: :none)
342
+ => "oak_3CNN_2865617390_13_F2A2_1_1SU1_x_ok"
343
+ ```
344
+
345
+ The JSON format does not support `Infinity`, `-Infinity`, or `NaN` -
346
+ though Ruby's JSON encoder transcodes these via a nonstandard
347
+ extension.
348
+
349
+ YAML handles `Infinity`, `-Infinity`, and `NaN`. YAML also handles
350
+ DAGs - but not cycles.
351
+
352
+ XML is ... XML. And huge. And Nokogiri is weird.
353
+
354
+ Who cares? These are just strings. It's better to treat them as
355
+ immutable anyhow, right? What about compound objects like lists or
356
+ hashes?
357
+
358
+ It turns out that capturing identity is the key to serializing any
359
+ non-tree objects.
360
+ ```
361
+ >> a = ['a','TBD']
362
+ => ["a", "TBD"]
363
+
364
+ >> b = ['b',a]
365
+ => ["b", ["a", "TBD"]]
366
+
367
+ >> a[1] = b # a cycle!
368
+ => ["b", ["a", [...]]]
369
+
370
+ >> JSON.dump(a)
371
+ SystemStackError: stack level too deep
372
+
373
+ >> OAK.encode(a,format: :none)
374
+ => "oak_3CNN_3573295141_24_F4A2_1_2SU1_aA2_3_0SU1_b_ok"
375
+ ```
376
+ The essence of serializing non-tree objects is capturing identity.
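One way to see why identity is enough: assign each distinct object a slot in a flat table the first time it is seen, and have compound objects refer to slots by index. The sketch below is a toy, not the FRIZZY encoder; `flatten_graph` and its `[:array, refs]` entries are hypothetical, and only Arrays, Strings, and Integers are handled.

```ruby
# Toy flattener: returns one table entry per distinct object in the graph.
def flatten_graph(root)
  index = {}.compare_by_identity   # object identity -> slot number
  slots = [root]
  index[root] = 0
  table = []

  i = 0
  while i < slots.size
    obj = slots[i]
    case obj
    when Array
      refs = obj.map do |child|
        # Reuse the slot if this exact object was seen before.
        index[child] ||= begin
          slots << child
          slots.size - 1
        end
      end
      table << [:array, refs]
    when String  then table << [:string, obj]
    when Integer then table << [:int, obj]
    else raise ArgumentError, "unsupported type: #{obj.class}"
    end
    i += 1
  end
  table
end

a = ['a', 'TBD']
b = ['b', a]
a[1] = b                 # a cycle!
p flatten_graph(a)
```

For the cyclic `a`, the table has four entries and the second array points back at slot 0. That mirrors the slot layout visible in the `F4A2_1_2SU1_aA2_3_0SU1_b` payload above, though the toy makes no attempt to emit that string format.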
377
+
378
+ Does this matter in `ALI`? Honestly, I don't know. Cycles and DAGs are
379
+ irrelevant for Correspondence bodies. We do have Summaries which are
380
+ DAGgy on Strings but that is probably irrelevant in all logic.
381
+
382
+ But "Do we need it?" is the wrong question. Data in the wild is
383
+ diverse and surprising. The right question is, "Can we *prove* that
384
+ we do not need it now or tomorrow?" With a cycle-aware serializer, I
385
+ don't *need* to prove nonexistence or constrain the future.
386
+
387
+ What about
388
+ [Marshal](http://jakegoulding.com/blog/2013/01/15/a-little-dip-into-rubys-marshal-format/)?
389
+ It handles (almost) all Ruby types including user-defined classes, has
390
+ no problems with cycles, and is widely available and accepted.
391
+
392
+ My reasons for not using Marshal are:
393
+
394
+ * Security
395
+ * [https://ruby-doc.org/core-2.2.2/Marshal.html](https://ruby-doc.org/core-2.2.2/Marshal.html)
396
+ * "By design, `load` can deserialize almost any class loaded into
397
+ the Ruby process. In many cases this can lead to remote code
398
+ execution if the Marshal data is loaded from an untrusted
399
+ source."
400
+ * Marshal can have problems if a user-defined class changes between
401
+ encoding time and decoding time.
402
+ * I wanted OAK to refuse to encode objects whose structure it
403
+ could not guarantee to recover with perfect fidelity.
404
+ * I have ambitions for language portability with OAK.
405
+ * Specious: porting a subset of Marshal would be no harder than
406
+ porting OAK.
407
+
408
+ To be fair, we use Marshal anyhow, wrapped in OAK, in our cache layers
409
+ which store full ActiveRecord model objects. So any arguments about
410
+ architectural purity vis-a-vis OAK are part hype.