oak 0.0.3 → 0.4.1

@@ -0,0 +1,7 @@
+ ---
+ SHA1:
+   metadata.gz: 1894c4827e6cb478e373686a2c5c198530a6eabf
+   data.tar.gz: 157bf6e8b64b962cefdabfee31eadd0955b58fed
+ SHA512:
+   metadata.gz: fb82da115acd3abd4cc992bfa0896c0ad8ec0a13e56a4843ce332672ba7dc8c1e5b6608dcc0ac7215debf16644e3ac2bf709cb043c826d35aeaf8c07a6a92cb8
+   data.tar.gz: 30200fc38b86dd7e5953a9c70ae0d84825f6008042c1283a66554ad47480eae06658ef626eb6eaa7b3d93156d718952669cd4677add032b3b232408a8acdbf3f
@@ -0,0 +1,51 @@
+ *.gem
+ *.rbc
+ /.config
+ /coverage/
+ /InstalledFiles
+ /pkg/
+ /spec/reports/
+ /spec/examples.txt
+ /test/tmp/
+ /test/version_tmp/
+ /tmp/
+
+ # Used by dotenv library to load environment variables.
+ # .env
+
+ ## Specific to RubyMotion:
+ .dat*
+ .repl_history
+ build/
+ *.bridgesupport
+ build-iPhoneOS/
+ build-iPhoneSimulator/
+
+ ## Specific to RubyMotion (use of CocoaPods):
+ #
+ # We recommend against adding the Pods directory to your .gitignore. However
+ # you should judge for yourself, the pros and cons are mentioned at:
+ # https://guides.cocoapods.org/using/using-cocoapods.html#should-i-check-the-pods-directory-into-source-control
+ #
+ # vendor/Pods/
+
+ ## Documentation cache and generated files:
+ /.yardoc/
+ /_yardoc/
+ /doc/
+ /rdoc/
+
+ ## Environment normalization:
+ /.bundle/
+ /vendor/bundle
+ /lib/bundler/man/
+
+ # For a library or gem, you might want to ignore these files since the code is
+ # intended to run in multiple environments; otherwise, check them in:
+ #
+ Gemfile.lock
+ .ruby-version
+ .ruby-gemset
+
+ # unless supporting rvm < 1.11.0 or doing something fancy, ignore this:
+ .rvmrc
@@ -0,0 +1,74 @@
+ AllCops:
+   Include:
+     - Rakefile
+     - Gemfile
+     - '*.gemspec'
+
+ # I like the Metrics family in principle, but OAK was built in a
+ # lower-level style much like C. The default thresholds for these are
+ # too tight for the style I chose for OAK.
+ #
+ # Moreover, IMO test code is not the place to get pedantic about class
+ # length, method complexity, etc. One should be encouraged to add
+ # more tests with minimal friction, not forced to make a hard choice
+ # between cutting tests and splitting up my test suites.
+ #
+ Metrics/ParameterLists:
+   Max: 10
+ Metrics/BlockLength:
+   Max: 150
+   Exclude:
+     - 'test/**/*.rb'
+ Metrics/ClassLength:
+   Max: 400
+   Exclude:
+     - 'test/**/*.rb'
+ Metrics/MethodLength:
+   Max: 150
+   Exclude:
+     - 'test/**/*.rb'
+ Metrics/ModuleLength:
+   Max: 1000
+   Exclude:
+     - 'test/**/*.rb'
+ Metrics/AbcSize:
+   Max: 150
+   Exclude:
+     - 'test/**/*.rb'
+ Metrics/BlockNesting:
+   Max: 5
+ Metrics/CyclomaticComplexity:
+   Max: 50
+ Metrics/PerceivedComplexity:
+   Max: 25
+ #
+ # Normally I am a pedantic adherent to 80-column lines.
+ #
+ # Over in test/oak.rb however, there are necessarily a lot of OAK
+ # strings which are much larger than 80 characters.
+ #
+ # I have decided that disablement in .rubocop.yml is less disruptive
+ # than repeated use of inline rubocop: comments.
+ #
+ Metrics/LineLength:
+   Exclude:
+     - 'test/**/*.rb'
+ Naming/UncommunicativeMethodParamName:
+   Enabled: false
+
+ # I put extra spaces in a lot of expressions for a lot of different
+ # reasons, including especially readability.
+ #
+ # I reject these cops.
+ #
+ Layout:
+   Enabled: false
+
+ # As a group, the Style cops are bewilderingly opinionated.
+ #
+ # In some cases IMO they are harmful, e.g. Style/TernaryParentheses.
+ #
+ # I reject these cops.
+ #
+ Style:
+   Enabled: false
@@ -0,0 +1,17 @@
+ sudo: true
+ language: ruby
+ before_install:
+   - gem install bundler -v 1.16.1
+ rvm:
+   - 2.1.6
+   - 2.2.9
+   - 2.4.3
+   - 2.5.0
+ script:
+   #
+   # Run several tests in parallel, and be happy if they are all happy.
+   #
+   # If any fail, rerun serially so we get clean output from the ones
+   # which failed.
+   #
+   - make test -j 3 || make test
@@ -0,0 +1,24 @@
+ ## 0.4.1 (2018-10-01)
+
+ - `oak`, `oak.rb` published as executables from gem.
+ - Removed heavier dep on `contracts`, switched to manual checks and a looser spec.
+ - Documentation reorg and cleanup.
+ - Open-sourced with MIT License, published as https://rubygems.org/gems/oak!
+
+
+ ## 0.4.0 (2018-09-24)
+
+ - First export from ProsperWorks/ALI.
+ - First conversion to gem.
+ - Not open (yet).
+ - OAK3 emitted by default.
+ - OAK4 with AES-256-GCM encryption with random IV supported.
+
+
+ ## 0.0.3 (2011-11-07) and earlier
+
+ - https://rubygems.org/gems/oak had an earlier incarnation as a
+   secret management utility, https://github.com/imonyse/oak.
+ - Special thanks and regards to https://github.com/imonyse, who
+   generously released the gem name `oak` so it could have a second
+   life.
@@ -0,0 +1,318 @@
+ # oak design desiderata
+
+ Some design goals with which I started this project.
+
+ - P1 means "top priority"
+ - P2 means "very important"
+ - P3 means "nice to have"
+ - P4 means "not harmful if cheap"
+
+ - `+` means "accomplished"
+ - `-` means "not accomplished"
+ - `?` means "accomplished, but only for some combinations of arguments"
+
+ Desiderata for the structure layer:
+
+ - P1 + losslessly handle nil, true, false, Integer, and String
+ - P1 + losslessly handle List with arbitrary values and deep nesting
+ - P1 + losslessly handle Hash with string keys and deep nesting in values
+ - P1 + detect cycles and DAGs in input structures, fail or handle
+ - P1 + handle all Integer types without loss
+ - P1 - handle Floats with no more than a small quantified loss
+ - P2 + Hash key ordering is preserved in Ruby-Ruby transcoding
+ - P3 - convenient: vaguely human-readable representations available
+ - P3 + encode cycles and DAGs
+ - P3 + handle Hash with non-string keys and deep nesting in keys
+ - P3 + losslessly handle Symbol distinct from String
+ - P3 - handle Times and Dates
+
+ Desiderata for the byte layer:
+
+ - P1 + reversible: original string can be reconstructed from only OAK string
+ - P1 + unambiguous: no OAK string is the prefix of any other OAK string
+ - P1 + extensible: OAK strings contain ids for ver, format, compression, etc.
+ - P1 + robust: error detection in OAK strings
+ - P2 + flexible: multiple compression modes available
+ - P3 + convenient: available representation without `{}`, comma, whitespace
+ - P3 + convenient: 7-bit clean representations available
+ - P3 + convenient: representations which are selectable with double-click
+ - P3 + convenient: vaguely human-readable representations available
+ - P3 - streamable: reversing can be accomplished with definite-size buffers
+ - P4 - embeddable: reversing can be accomplished with fixed-size buffers
+ - P4 - defensive: error correction available (no good libs found)
+
+ Techniques used in the byte layer to accomplish these goals:
+
+ - manifest type id for self-identification
+ - manifest version id in case format changes in future
+ - salient encoding algorithm choices stored in output stream
+ - error detection algorithm aka redundancy
+ - compression
+ - formatting
+ - microchoices made to confine metadata characters to [_0-9a-z]
+ - algorithm menu constructed to offer data characters in [-_0-9a-z]
+
+
+ ## Serialization Choices
+
+ A survey of alternatives considered for the serialization layer.
+
+ ### Considering Marshal
+
+ The Marshal format has some major drawbacks which I believe make it
+ a nonstarter:
+
+ - strictly Ruby-specific
+ - readability across major versions not guaranteed
+ - too powerful: can be used to execute arbitrary code
+ - binary and non-human-readable
+   - many option combos for oak make oak strings also non-human-readable
+   - still, it is nice to have a layer which is at least potentially clear
+
+ Marshal does offer major advantages:
+
+ - transcodes all Ruby value types and user-defined value-like classes
+ - reported to be much faster than JSON or YAML for serializing
+
+ ### Considering JSON
+
+ JSON is awesome most of the time, especially in highly constrained
+ environments such as API specifications and simple ad-hoc caching
+ situations.
+
+ JSON offers advantages:
+
+ - a portable object model
+ - easy to read
+ - widely deployed
+ - the go-to choice for interchange in recent years
+
+ But it has some shortcomings which lead me to reject it for the
+ structural level in OAK:
+
+ - floating point precision is implementation-dependent
+ - always decodes as a tree - fails to transcode DAGiness
+ - cannot represent cycles - encoders reject, overflow the stack, or loop forever
+ - no native date or time handling
+ - table keys may only be strings
+   - e.g. `{'123'=>'x'} == JSON.parse(JSON.dump({123=>'x'}))`
+ - type information symbol-vs-string lost, symbols transcode to strings
+   - e.g. `'foo' == JSON.parse(JSON.dump(:foo))`
+   - e.g. `{'foo'=>'x'} == JSON.parse(JSON.dump({:foo=>'x'}))`
+ - official grammar only allows {} or [] as top-level object
+   - e.g. lenient parsers give `123 == JSON.parse('123')`, but strict parsers raise `ParserError`
+   - many parsers in the wild support only this strict official grammar
+ - JSON is suitable only for document encoding, not streams
+   - allows only one object per file
+   - multiple objects must be members of a list
+   - lists must be fully scanned and parsed before being processed
+   - no possibility of streamy processing
+
+ The biggest limitation of JSON, IMO, is that Hash keys can only be strings:
+ ```
+ 2.1.6 :008 > obj = {'str'=>'bar',[1,2,3]=>'baz'}
+ => {"str"=>"bar", [1, 2, 3]=>"baz"}
+ 2.1.6 :009 > JSON.dump(obj)
+ => "{\"str\":\"bar\",\"[1, 2, 3]\":\"baz\"}"
+ 2.1.6 :010 > JSON.parse(JSON.dump(obj))
+ => {"str"=>"bar", "[1, 2, 3]"=>"baz"}
+ 2.1.6 :011 > JSON.parse(JSON.dump(obj)) == obj
+ => false
+ ```
+
+ ### Considering YAML
+
+ YAML is strong where JSON is strong, and also strong in many places
+ where JSON is weak. In fact, YAML includes JSON as a subformat: JSON
+ strings *are* YAML strings!
+
+ Some of the advantages of YAML over JSON are:
+
+ - handles any directed graph, including DAGy bits and cycles
+ - arguably more human-readable than JSON
+ - YAML spec subsumes JSON spec: JSON files are YAML files
+ - supports non-string keys
+   - e.g. `{123=>'x'} == YAML.load(YAML.dump({123=>'x'}))`
+ - supports symbols
+   - e.g. `:foo == YAML.load(YAML.dump(:foo))`
+   - e.g. `{:foo=>'x'} == YAML.load(YAML.dump({:foo=>'x'}))`
+ - allows integer or string as top-level object
+
+ YAML overcomes the biggest limitation of JSON by supporting non-string
+ hash keys:
+ ```
+ 2.1.6 :008 > obj = {'str'=>'bar',[1,2,3]=>'baz'}
+ => {"str"=>"bar", [1, 2, 3]=>"baz"}
+ 2.1.6 :012 > YAML.dump(obj)
+ => "---\nstr: bar\n? - 1\n - 2\n - 3\n: baz\n"
+ 2.1.6 :013 > YAML.load(YAML.dump(obj))
+ => {"str"=>"bar", [1, 2, 3]=>"baz"}
+ 2.1.6 :014 > YAML.load(YAML.dump(obj)) == obj
+ => true
+ ```
+
+ Note: YAML's support for Symbols is due to Psych, not strictly the
+ YAML format itself. I've taken both `YAML.dump(:foo)` and
+ `YAML.dump(':foo')` into Python and done `yaml.load()` on them. Both
+ result in `':foo'`. So this nicety is not portable.
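The Ruby half of that observation is easy to check with the stdlib; a small sketch (the PyYAML side behaves as described above and is not shown):

```ruby
require 'yaml'

# Psych emits a Symbol as the bare, Ruby-specific scalar ":foo", while a
# String that merely starts with a colon gets quoted.  Only Psych maps
# the bare form back to a Symbol; other YAML loaders see the string ":foo".
sym_yaml = YAML.dump(:foo)
str_yaml = YAML.dump(':foo')
puts sym_yaml.inspect   # "--- :foo\n"
puts str_yaml.inspect   # quoted form, distinct from the Symbol form
```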
+
+ But YAML still has some shortcomings:
+
+ - floating point precision is implementation-dependent
+ - no native date or time handling
+ - unclear whether available parsers support stream processing
+ - DAGs and cycles of Arrays and Hashes are handled, but Strings are not
+
+ ### Considering FRIZZY
+
+ FRIZZY is a home-grown serialization format which I ended up committing
+ to for OAK.
+
+ The name FRIZZY means nothing, and survives only as the rogue `F`
+ character at the start of a serialized object:
+
+ ```
+ 2.1.6 :006 > OAK.encode('Hello, World!',redundancy: :none,format: :none)
+ => "oak_3NNN_0_20_F1SU13_Hello, World!_ok"
+ ```
+
+ Advantages:
+
+ - Recognizes when Strings are identical, not just equivalent.
+ - It is much more compact than YAML.
+ - Has built-in folding of String and Symbol representation.
+
+ Disadvantages:
+
+ - Home grown.
+ - Very much not human readable.
+ - Floating point precision is incompletely specified.
+ - Current implementation crudely uses `Numeric#to_s` and `String#to_f`.
+
+ I decided to reinvent the wheel and go with FRIZZY. We have
+ discovered Summaries which are DAGs on strings. It might be
+ acceptable to lose that information, but I did not want to have to
+ *prove* it was acceptable to lose it.
+
+ It may have been an ego-driven sin to go custom here, but I did not
+ want to pessimize future use cases on fidelity or control.
+
+
+ ## Compression Choices
+
+ A survey of alternatives considered for the compression layer.
+
+ ### Considering LZO, LZF, and LZ4
+
+ These compression formats are similar in performance and algorithm.
+ All are in the Lempel-Ziv family of dictionary-based
+ redundancy-eaters. They are all cheap to compress, cheap to
+ uncompress, but deliver only modest compression ratios.
+
+ This family of algorithms is unfamiliar to those accustomed to
+ archive formats, but they are used widely in low-latency applications
+ (such as server caches).
+
+ To keep things simple, I settled on supporting only LZ4 because its
+ gem, `lz4-ruby`, seems to have more mindshare and momentum. It is
+ weaker but faster than the other weak+fast options - which seems like
+ the way to be.
+
+ Based on previous experience, I expect this to be a clear win for use
+ in Redis caches vs being uncompressed.
+
+ ### Considering ZLIB
+
+ Including ZLIB felt like a no-brainer. ZLIB is familiar,
+ widely-deployed, and standardized in RFC 1951. It uses the L-Z
+ process with an additional Huffman encoding phase. It delivers
+ intermediate cost for intermediate compression.
+
+ Based on previous experience, I expect this option will usually be
+ dominated by either LZ4 for low-latency applications or BZIP2 for
+ archival applications, but I'm including it for comparisons and
+ because it would feel strange not to.
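Since ZLIB ships in Ruby's stdlib, the cost/ratio tradeoff is easy to poke at directly; a minimal sketch using the underlying `Zlib` calls (not the OAK API):

```ruby
require 'zlib'

# Redundant input is exactly what the L-Z + Huffman process eats.
plain    = 'Hello, World! ' * 100
deflated = Zlib::Deflate.deflate(plain, Zlib::BEST_COMPRESSION)
inflated = Zlib::Inflate.inflate(deflated)

raise 'round trip failed' unless inflated == plain
puts "#{plain.bytesize} bytes -> #{deflated.bytesize} bytes"
```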
+
+ ### Considering BZIP2
+
+ BZIP2 is an aggressive compression which uses the Burrows–Wheeler,
+ move-to-front, and run-length-encoding transforms with Huffman coding.
+ It will be several times slower but tens of percent stronger than
+ ZLIB. I chose the gem `bzip2-ffi` over the more flexible `rbzip2` to
+ make absolutely certain that we use the native libbz2 implementation
+ and do not fall back silently to a Ruby version which is 100x slower
+ if/when Heroku does not offer FFI.
+
+ Based on previous experience, I expect this option will dominate where
+ data is generally cold or where storage is very expensive compared to
+ CPU.
+
+ ### Considering LZMA
+
+ LZMA is the Lempel-Ziv-Markov chains algorithm. It will be an order
+ of magnitude more expensive to compress than BZIP2, but will
+ decompress slightly faster and will yield better compression ratios by
+ a few percent.
+
+ This will be useful only for cases where read-write ratios are over 10
+ and storage:cpu cost ratios are high. When read-write ratios are
+ close to unity, LZO will dominate where storage:cpu is low and BZIP2
+ will dominate where storage:cpu is high.
+
+ Nonetheless, I have a soft spot in my heart for this algorithm so I am
+ including it - if only so we can rule it out by demonstration rather
+ than hypothesis.
+
+
+ ## Encryption Choices
+
+ Encryption is the first extension of OAK since it went live in
+ ProsperWorks's Redis layer on 2016-06-02 and in the S3 Correspondence
+ bodies since 2016-07-06. There had been only Rubocop updates and nary
+ a bugfix since 2016-07-01.
+
+ ### Encryption-in-OAK Design Decisions (see arch doc for discussion)
+
+ - Encryption is the only change in OAK4.
+ - OAK4 will only support AES-256-GCM with random IVs chosen for
+   each encryption event.
+ - OAK4 will use no salt other than the random IV.
+ - Encrypted OAK strings will be nondeterministic.
+   - This crushes the desideratum of making OAK.encode a pure function.
+   - This is unavoidable to avoid a blatant security hole.
+ - OAK4 dramatically changes how headers are managed from OAK3.
+   - Encrypts all headers which are not required for decryption.
+   - Authenticates all headers and the encrypted stream.
+ - Key rotation is supported.
+   - Via an ENV-specified key chain.
+   - Can hold multiple master keys.
+
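The cipher choice above can be sketched with Ruby's stdlib `OpenSSL::Cipher`; this shows only the AES-256-GCM primitive with a random IV and authenticated associated data, not OAK4's actual header layout or keychain handling:

```ruby
require 'openssl'
require 'securerandom'

# AES-256-GCM with a fresh random IV per encryption event, as in the
# OAK4 design above.  The random IV makes output nondeterministic; the
# auth_data is authenticated but not encrypted (like OAK4's headers).
key = SecureRandom.random_bytes(32)   # one master key, e.g. from a keychain

def gcm_encrypt(key, plaintext, auth_data)
  cipher           = OpenSSL::Cipher.new('aes-256-gcm').encrypt
  cipher.key       = key
  iv               = cipher.random_iv     # random IV => nondeterministic
  cipher.auth_data = auth_data
  [iv, cipher.update(plaintext) + cipher.final, cipher.auth_tag]
end

def gcm_decrypt(key, iv, ciphertext, tag, auth_data)
  cipher           = OpenSSL::Cipher.new('aes-256-gcm').decrypt
  cipher.key       = key
  cipher.iv        = iv
  cipher.auth_tag  = tag
  cipher.auth_data = auth_data
  cipher.update(ciphertext) + cipher.final  # raises if tag fails to verify
end

iv, ct, tag = gcm_encrypt(key, 'secret', 'oak_4')
puts gcm_decrypt(key, iv, ct, tag, 'oak_4')  # "secret"
```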
+ ### Encryption-in-OAK Backward Compatibility
+
+ Before encryption was added, the format identifier for OAK strings
+ was `'oak_3'`.
+
+ To indicate we are making a non-backward-compatible change, I am
+ bumping that up to `'oak_4'` for encrypted strings.
+
+ Legacy OAK3 strings are still supported, both on read and on write.
+
+ By default, OAK4 is used only when encryption is requested.
+
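A reader can dispatch on that version prefix; a hypothetical sketch (the helper name is an illustrative stand-in, not the oak gem's real internals):

```ruby
# Hypothetical dispatch on the OAK format identifier.  Real decoding
# would hand off to version-specific logic after this check.
def oak_version(str)
  case str
  when /\Aoak_3/ then 3   # legacy, unencrypted
  when /\Aoak_4/ then 4   # encrypted envelope
  else raise ArgumentError, "not an OAK string: #{str.inspect}"
  end
end

puts oak_version('oak_3NNN_0_20_F1SU13_Hello, World!_ok')  # 3
```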
+ ### Encryption-in-OAK Regarding Compression vs Encryption
+
+ Note that compression of encrypted strings is next to useless. By
+ design, encryption algorithms obscure exploitable redundancy in
+ plaintext and produce incompressible ciphertext.
+
+ On the other hand, in the wild there have been a handful of successful
+ chosen-plaintext attacks on compress-then-encrypt encodings. See:
+
+ - https://blog.appcanary.com/2016/encrypt-or-compress.html
+ - https://en.wikipedia.org/wiki/CRIME
+
+ OAK4 supports compression and does compression-then-encryption.
+
+ The extremely paranoid are encouraged to use `compression: :none`.
+ Note however that the source data may be compressed. Furthermore, for
+ larger objects FRIZZY itself is, in part, a compression algorithm.
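The compress-then-encrypt ordering can be sketched with stdlib pieces (again just the primitives, not OAK4's actual framing):

```ruby
require 'zlib'
require 'openssl'
require 'securerandom'

# Order matters: deflate first (GCM ciphertext is effectively random
# and would not compress), then encrypt.  A sketch of the ordering only.
key   = SecureRandom.random_bytes(32)
plain = 'Hello, World! ' * 100

packed  = Zlib::Deflate.deflate(plain)       # 1. compress
enc     = OpenSSL::Cipher.new('aes-256-gcm').encrypt
enc.key = key
iv      = enc.random_iv
sealed  = enc.update(packed) + enc.final     # 2. then encrypt
tag     = enc.auth_tag

# Decoding reverses the order: decrypt, then inflate.
dec          = OpenSSL::Cipher.new('aes-256-gcm').decrypt
dec.key      = key
dec.iv       = iv
dec.auth_tag = tag
plain2 = Zlib::Inflate.inflate(dec.update(sealed) + dec.final)

raise 'round trip failed' unless plain2 == plain
```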