RubyGems - wordmap - Versions diffs - 0.1.0 → 0.3.0 - Mend

wordmap 0.1.0 → 0.3.0

Files changed (10)

checksums.yaml +4 -4
data/.github/workflows/rspec.yml +10 -10
data/CHANGELOG.md +11 -0
data/LICENSE +1 -1
data/README.md +44 -14
data/lib/wordmap/file_access.rb +1 -2
data/lib/wordmap/version.rb +1 -1
data/lib/wordmap.rb +1 -1
data/wordmap.gemspec +5 -5
metadata +13 -12

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: f357e18c0e2199383ef82f3646e810a102841393b5f12f78455736d945205987
-  data.tar.gz: 18d38ade82bbaf981ffd5279f9f4edb8e9b6a38cbfe521aa45e3bf1f6c41d3ff
+  metadata.gz: ced71b07912a404954c7bab9a29f604cf6acb976a95259c37b1de93a963ac7bc
+  data.tar.gz: 26b615b55c6ed54196ac9fb8967ac6d33aba54f7c1607e2e923c059c16fc3099
 SHA512:
-  metadata.gz: 3f29ca3def2655f7acc36af95f7328620689c7cf7c366e015c196316e4027d1d283f1d36f0403abbc2367c986ae2683c4938dcdb7d25185135fa59fb9c72a3ed
-  data.tar.gz: 60ba8f842cea3d5d9269f4ab0eec323880f7827d4396ffacdc89d1154692d7ee72950ab745f30de79f09ab832894e05d13d6b1d16238d1438d1feac5659d2e09
+  metadata.gz: 0cf9d213226291a8b7c65cba9d253cfe289b03f47645088add2d2fd6d3048dd966c9f311e3357b1933dae3ba9691674ccef8d79bae77f2eb7869b16024979094
+  data.tar.gz: '0149b00b14fd57a13abe21e2edde18a9a6e577be977ed8e03568a0e40deccbd0211e7fffed35229a47ab45df1af307c5d8a4dddb5e4166920843345cce976f20'

data/.github/workflows/rspec.yml CHANGED

@@ -9,16 +9,16 @@ on:
 jobs:
   test:
     runs-on: ubuntu-latest
+    name: Ruby ${{ matrix.ruby }}
     strategy:
       matrix:
-        ruby: [ '2.4', '2.5', '2.6', '2.7' ]
-    name: Ruby ${{ matrix.ruby }}
+        ruby: [ '2.5', '2.6', '2.7', '3.0', '3.1', '3.2' ]
     steps:
-      - uses: actions/checkout@v2
-      - uses: actions/setup-ruby@v1
-        with:
-          ruby-version: ${{ matrix.ruby }}
-      - run: gem install bundler
-      - run: bundle install
-      - run: bundle exec rake
+    - uses: actions/checkout@v3
+    - name: Set up Ruby
+      uses: ruby/setup-ruby@v1
+      with:
+        ruby-version: ${{ matrix.ruby }}
+        bundler-cache: true
+    - name: Run RSpec
+      run: bundle exec rspec

data/CHANGELOG.md ADDED

@@ -0,0 +1,11 @@
+This project follows [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## Unreleased
+## 0.3.0 - 2023-08-03
+* Add support for Ruby 3.x
+## 0.2.0 - 2020-09-16
+* Make file access thread safe

data/LICENSE CHANGED

@@ -186,7 +186,7 @@
       same "printed page" as the copyright notice for easier
       identification within third-party archives.
-   Copyright [yyyy] [name of copyright owner]
+   Copyright 2023 Max Chernyak
    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.

data/README.md CHANGED

@@ -1,4 +1,4 @@
-![RSpec](https://github.com/scottscheapflights/wordmap/workflows/RSpec/badge.svg)
+![RSpec](https://github.com/maxim/wordmap/workflows/RSpec/badge.svg)
 # Wordmap
@@ -12,6 +12,8 @@ Useful in cases where:
 ## Installation
+Note: Requires at least ruby 2.5 to support `File#pread` function.
 Add this line to your application's Gemfile:
 ```ruby
@@ -28,7 +30,7 @@ Or install it yourself as:
 ## Usage
-Before we can query a wordmap, we must create one first.
+Before we can query a wordmap, we must create one.
 ### Creating
@@ -76,6 +78,9 @@ fruits.query(%w[banana lemon]).to_a # => ["14", "49"]
 # Give me prices for all yellow fruits.
 fruits.query([:color, 'yellow']).to_a # => ["14", "49"]
+# Give me prices for all citrus and musa fruits.
+fruits.query([:genus, 'citrus', 'musa']).to_a # => ["14", "49"]
 # Give me prices for all yellow citruses.
 fruits.query([:genus, 'citrus'], [:color, 'yellow']).to_a # => ["49"]
@@ -83,9 +88,9 @@ fruits.query([:genus, 'citrus'], [:color, 'yellow']).to_a # => ["49"]
 fruits.query(%w[lemon banana], [:genus, 'citrus']).to_a # => ["49"]
 ```
-Each query is an array of arrays (outer array is omitted in the examples, because it works either way). Inner arrays are treated like unions (everything in them is `OR`'ed). Outer array is treated as an intersection (results of inner arrays are `AND`'ed with one another).
+Each query is an array of arrays (outer array is omitted in the examples, because it works either way). Inner arrays are treated like unions (everything in them is `OR`'ed). Outer array is treated as an intersection (results of inner arrays are `AND`'ed with one another). Order of arrays doesn't matter.
-If an inner array starts with a symbol, the symbol is treated as an index name you want to look in.
+If an inner array starts with a symbol, then we're looking up an index of that name, otherwise — by key(s).
 Tip: if you are only supplying 1 array (as in the first and second examples above), you can drop all array wrappers entirely.
@@ -125,33 +130,58 @@ fruits.each(:genus).to_a # => ["citrus", "musa"]
 ### Multi-dimensional keys
-In the above examples the keys are simply `'banana'` and `'lemon'` — strings. If you make your key an array of strings, that'd make a multi-dimensional key. This can come helpful for some data where 2 keys make sense (we have such use cases at Scott's). Internally, each dimension is a different vector. However if you go that route, keep in mind that all the "unused" key combinations will create gaps in the data file, therefore inflating its size. For example, if you make a key out of genus + name of a fruit, like `%w[citrus lemon]` and `%w[musa banana]`, your file will become inflated with empty cells created for `%w[citrus banana]`, `%w[musa lemon]`. That space is taken (padded with null bytes) even if there are no values for these keys.
+In the above examples the keys are simply `'banana'` and `'lemon'` — strings. If you make your key an array of strings, that'd make a multi-dimensional key. This can come helpful for some data where 2 keys make sense. Internally, each dimension is a different vector. However if you go that route, keep in mind that all the "unused" key combinations will create gaps in the data file, therefore inflating its size. For example, if you make a key out of genus + name of a fruit, like `%w[citrus lemon]` and `%w[musa banana]`, your file will become inflated with empty cells created for keys `%w[citrus banana]` and `%w[musa lemon]`. That space is taken (padded with null bytes) even though there are no values for these keys.
 ## Anatomy
-A wordmap on disk is just a directory with a few files in it.
+For those interested, here's some high level implementation and structure overview.
+### Staying out of RAM
+When you initialize a wordmap object in ruby, it opens a few file descriptors, and reads a few integers of metadata from each file. Nothing else is loaded.
+When performing a lookup, wordmap seeks and reads just the needed bytes in the file using the `File#pread` function. This avoids any caching or preloading of data into RAM.
+### Structure
+A wordmap on disk is just a directory with a few files in it. The files are formatted in a content addressable way similar to "words" in computer memory.
 ### `data` file
-The data file is where the actual entries are stored. When a wordmap is created, it looks through all the entries you want to store, and finds one with the maximum bytesize. Then it makes all entries that size by padding them with null bytes in front, and dumping all of them into the file. Since this makes each entry in the file the same size, we can easily seek to any single entry by knowing its index, because it's just index times entry size. We call such padded entry a "cell".
+The data file is where your entries are stored. When a wordmap is created, it iterates through your input hash of data, and finds the longest entry. This entry determines the size of a single cell in the data file, which means that all other entries are padded to this size. (A cell is just a padded entry. It's like a spreadsheet where all cells must be equal length.) Once we dump all the cells with your entries into the data file, we can easily find each cell by its sequential index, because its offset is just index times cell size.
+For example, let's take solar system's planet names. The longest name is 7 chars, so all other names are left-padded to 7 chars. Here I'm padding with spaces, but in wordmap they'd be padded with null bytes instead.
+```
+Mercury  Earth   MarsJupiter Saturn UranusNeptune
+```
+Now to find the 3rd item, we can just compute 2 * 7 = 14. We seek to the 14th byte position and read 7 bytes to get `   Mars`. Then we trim the padding to get `Mars`.
 The important part is the order of data in this file. When a wordmap is created, all the keys are sorted lexicographically, and for every key, entry is written in the order of how the corresponding keys are sorted. This means that if we know index of where a key is positioned sequentially, we also know index of where the cell is in the data file.
 ### `vec` files
-Vector files are where keys are stored. If you used a string as a lookup key, then it creates just one vector file where every key is written in a cell padded to maximum key length just like the case with the data file. Since this file is sorted, we can easily binary-search a key in this file, and then seek to corresponding position in the data file to find the entry.
+Vector files are structured the same as data file, but they store keys instead of entries. If you used a 1-dimensional key, then it creates just one vector file. Since this file is sorted, we can apply binary-search to find a key in this file, and then seek to corresponding position in the data file to find the entry.
-For multi-dimensional keys, multiple vector files are created (one per dimension). Let's say we have 2-dimensional key (a key that's an array of 2 strings). The first vector will contain all the first strings, and second all the second strings. Now when wordmap is doing a lookup by key, it will first bsearch the first vector to find a "page" of entries in the data file, then it will bsearch the second vector to find an exact entry position in that page of entries. Then it will know exactly where to seek to grab the entry from the data file.
+For multi-dimensional keys, multiple vector files are created (one per dimension). Let's say we have a 2-dimensional key (a key that's an array of 2 strings). The first vector will contain all the first strings, and second all the second strings of all keys. Now when wordmap is doing a lookup by key, it will first bsearch the first vector, then bsearch the second vector. The 2 found positions are then multiplied by entry's cell size and added together to get the exact location of the cell in the data file.
 ### Metadata
-Data and vector files each have a couple of numbers at the beginning that specify cells' bytesize and count. This is the only part that wordmap reads into RAM when instantiated: 2 integers per file. Having read metadata we can derive 2 additional pieces of information: 1. the bytesize of the metadata itself, so that we can skip over it, and 2. how many cells we should read every time we read a lot of cells (to optimize sequential reads). The latter is always trying to be near ~10kb per read (unless a single cell is longer than 10kb, then it's using single cell's size).
+Data and vector files each have a couple of numbers at the beginning that specify cells' bytesize and count. This is the only part that wordmap reads into RAM when instantiated. Having read these 2 integers, we can derive 2 additional pieces of information:
+1. the bytesize of the metadata itself, so that we can skip over it
+2. how many cells we should read every time we read a lot of cells (to optimize sequential reads)
+The latter is always trying to be near ~10kb per read (unless a single cell is longer than 10kb, then it's using single cell's size).
 ### Indexes
-Indexes are just wordmaps nested inside the wordmap you create. These inner wordmaps have index keys as the keys, and lists of locations as values. The values of indexes are invisible to the end user, but since this section is about anatomy, it makes sense to mention them. The locations are stored as a comma-separated list of [delta encoded](https://en.wikipedia.org/wiki/Delta_encoding) sorted integers and ranges. For example, if we are storing locations `1,3,5,6,7,8,12,15` the stored value will look like this: `1,2,2+3,4,3`. You can unpack this value by saying "first position is **1**, second position is 1 + 2 = **3**, third position is 3 + 2 = **5**, now add 3 more successively: **6,7,8**, then 8 + 4 = **12**, and 12 + 3 = **15**".
+Indexes are just recursively-nested wordmaps inside the wordmap you create. These nested wordmaps have index keys as the keys, and lists of locations as values. The values of indexes are invisible to the end user, but since this section is about anatomy, it makes sense to mention them.
+The locations are stored as a comma-separated list of [delta encoded](https://en.wikipedia.org/wiki/Delta_encoding) sorted integers and ranges. For example, if we are storing locations `1,3,5,6,7,8,12,15` the stored value will look like this: `1,2,2+3,4,3`. You can unpack this value by saying "first position is **1**, second position is 1 + 2 = **3**, third position is 3 + 2 = **5**, now add 3 more successively: **6,7,8**, then 8 + 4 = **12**, and 12 + 3 = **15**".
-When processing a query, wordmap produces lazy iterators for unioning and intersecting data. These iterators lazily walk indexed locations, or keys in a vector file, and return each found entry from the data file.
+When looking up a query, wordmap produces lazy iterators for unioning and intersecting data. These iterators lazily walk indexed locations, or keys in a vector file, and return each found entry from the data file.
 ## Development
@@ -161,9 +191,9 @@ To install this gem onto your local machine, run `bundle exec rake install`. To
 ## Contributing
-Bug reports and pull requests are welcome on GitHub at https://github.com/[USERNAME]/wordmap. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](https://github.com/[USERNAME]/wordmap/blob/master/CODE_OF_CONDUCT.md).
+Bug reports and pull requests are welcome on GitHub at https://github.com/maxim/wordmap. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](https://github.com/[USERNAME]/wordmap/blob/main/CODE_OF_CONDUCT.md).
 ## Code of Conduct
-Everyone interacting in the Wordmap project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/[USERNAME]/wordmap/blob/master/CODE_OF_CONDUCT.md).
+Everyone interacting in the Wordmap project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/[USERNAME]/wordmap/blob/main/CODE_OF_CONDUCT.md).
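
The delta-encoded location list described in the README's Indexes section can be unpacked with a few lines of Ruby. This is a minimal sketch, not wordmap's own code: `decode_locations` is a hypothetical helper, and the token grammar (`delta` or `delta+run`, comma-separated) is inferred only from the single example `1,2,2+3,4,3` given in the README.

```ruby
# Hypothetical decoder for the format shown in the README example.
# Each token is either "delta" or "delta+run":
#   - "delta" advances the previous position by that amount
#   - "+run" then appends that many consecutive positions after it
def decode_locations(encoded)
  position = 0
  encoded.split(',').flat_map do |token|
    delta, run = token.split('+').map(&:to_i)
    position += delta
    first = position
    position += run.to_i # run is nil (=> 0) when the token has no "+"
    (first..position).to_a
  end
end

p decode_locations('1,2,2+3,4,3') # => [1, 3, 5, 6, 7, 8, 12, 15]
```

Storing differences instead of absolute positions keeps the numbers small, and folding consecutive positions into a single `+run` suffix shortens long contiguous ranges further.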

data/lib/wordmap/file_access.rb CHANGED

@@ -68,8 +68,7 @@ class Wordmap
     def read_at(file, pos, bytes)
       # puts "Seeking in #{file.path.split('.wmap', 2)[1][1..-1]} to #{pos}, " \
       #      "and reading #{bytes} bytes"
-      file.sysseek(pos)
-      file.sysread(bytes)
+      file.pread(bytes, pos)
     end
   end
 end
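
To illustrate what the `pread` change does in practice, here is a minimal sketch of a fixed-size cell lookup using the planet names from the README. The helpers `write_cells` and `read_cell` are hypothetical and are not part of wordmap; the sketch also leaves out the metadata header that real wordmap files carry. `IO#pread(length, offset)` reads at an absolute offset in a single call without moving the file position, which is what lets several threads share one file object.

```ruby
require 'tempfile'

# Hypothetical helper: write entries as equal-size cells, left-padded with
# null bytes, and return the cell size in bytes.
def write_cells(io, entries)
  cell_size = entries.map(&:bytesize).max
  entries.each do |entry|
    io.write(("\0" * (cell_size - entry.bytesize)) + entry)
  end
  io.flush
  cell_size
end

# Hypothetical helper: one positioned read, then trim the leading padding.
def read_cell(io, index, cell_size)
  io.pread(cell_size, index * cell_size).sub(/\A\0+/, '')
end

PLANETS = %w[Mercury Earth Mars Jupiter Saturn Uranus Neptune].freeze

Tempfile.create('cells') do |file|
  size = write_cells(file, PLANETS)
  puts read_cell(file, 2, size) # third cell, byte offset 2 * 7 = 14
end
```

With the earlier `sysseek` followed by `sysread`, another thread could move the file position between the two calls; a single `pread` has no such window.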

data/lib/wordmap/version.rb CHANGED

@@ -1,3 +1,3 @@
 class Wordmap
-  VERSION = '0.1.0'
+  VERSION = '0.3.0'
 end

data/lib/wordmap.rb CHANGED

@@ -14,7 +14,7 @@ class Wordmap
   class << self
     def create(path, hash, index_names = [])
-      raise ArgumentError, "Path already exists: #{path}" if Dir.exists?(path)
+      raise ArgumentError, "Path already exists: #{path}" if Dir.exist?(path)
       index_data = index_names.map { |name| [name, {}] }.to_h
       vecs = Builder.build_vectors(hash)
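
The switch from `Dir.exists?` to `Dir.exist?` matters for the Ruby 3.x support noted in the changelog: `Dir.exists?` was a deprecated alias that Ruby 3.2 removed, so the old guard would raise `NoMethodError` there. A minimal sketch of the same guard in isolation, where `ensure_path_is_free!` is a hypothetical stand-in for the first line of `Wordmap.create`:

```ruby
require 'tmpdir'

# Hypothetical stand-in for the guard at the top of Wordmap.create.
def ensure_path_is_free!(path)
  raise ArgumentError, "Path already exists: #{path}" if Dir.exist?(path)

  path
end

Dir.mktmpdir do |dir|
  fresh = File.join(dir, 'fruits.wmap')
  ensure_path_is_free!(fresh) # passes: nothing exists at this path yet

  begin
    ensure_path_is_free!(dir) # the temp dir itself already exists
  rescue ArgumentError => e
    puts e.message
  end
end
```

`Dir.exist?` is available on every Ruby version in the updated CI matrix, so the guard behaves the same from 2.5 through 3.2.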

data/wordmap.gemspec CHANGED

@@ -3,17 +3,17 @@ require_relative 'lib/wordmap/version'
 Gem::Specification.new do |spec|
   spec.name    = 'wordmap'
   spec.version = Wordmap::VERSION
-  spec.authors = ['Maxim Chernyak']
-  spec.email   = ['madfancier@gmail.com']
+  spec.authors = ['Max Chernyak']
+  spec.email   = ['hello@max.engineer']
   spec.summary     = 'Look up data from disk without using your RAM.'
-  spec.description = 'Wordmap is a simple way to lookup data directly from disk, bypassing RAM completely. It uses sysseek and sysread (no buffering), and takes advantage of SSD\'s constant seek time. The data is stored in equal size "cells" making it easy to calculate where things are located based on vectors.'
-  spec.homepage    = 'https://github.com/scottscheapflights/wordmap'
+  spec.description = 'Wordmap is a simple way to lookup data directly from disk, bypassing RAM. It uses pread (no buffering), and takes advantage of SSD\'s constant seek time. The data is stored in equal size "cells" making it easy to calculate where things are located based on vectors.'
+  spec.homepage    = 'https://github.com/maxim/wordmap'
   spec.license     = 'Apache-2.0'
   spec.metadata['homepage_uri'] = spec.homepage
   spec.metadata['source_code_uri'] = spec.homepage
-  spec.metadata['changelog_uri'] = 'https://github.com/scottscheapflights/wordmap/blob/master/CHANGELOG.md'
+  spec.metadata['changelog_uri'] = 'https://github.com/maxim/wordmap/blob/master/CHANGELOG.md'
   spec.required_ruby_version = Gem::Requirement.new('>= 2.4.0')
   spec.files = Dir.chdir(File.expand_path('..', __FILE__)) do

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: wordmap
 version: !ruby/object:Gem::Version
-  version: 0.1.0
+  version: 0.3.0
 platform: ruby
 authors:
-- Maxim Chernyak
+- Max Chernyak
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2020-09-09 00:00:00.000000000 Z
+date: 2023-08-04 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler
@@ -67,11 +67,11 @@ dependencies:
       - !ruby/object:Gem::Version
         version: '0.13'
 description: Wordmap is a simple way to lookup data directly from disk, bypassing
-  RAM completely. It uses sysseek and sysread (no buffering), and takes advantage
-  of SSD's constant seek time. The data is stored in equal size "cells" making it
-  easy to calculate where things are located based on vectors.
+  RAM. It uses pread (no buffering), and takes advantage of SSD's constant seek time.
+  The data is stored in equal size "cells" making it easy to calculate where things
+  are located based on vectors.
 email:
-- madfancier@gmail.com
+- hello@max.engineer
 executables: []
 extensions: []
 extra_rdoc_files: []
@@ -79,6 +79,7 @@ files:
 - ".github/workflows/rspec.yml"
 - ".gitignore"
 - ".rspec"
+- CHANGELOG.md
 - CODE_OF_CONDUCT.md
 - Gemfile
 - LICENSE
@@ -93,13 +94,13 @@ files:
 - lib/wordmap/index_value.rb
 - lib/wordmap/version.rb
 - wordmap.gemspec
-homepage: https://github.com/scottscheapflights/wordmap
+homepage: https://github.com/maxim/wordmap
 licenses:
 - Apache-2.0
 metadata:
-  homepage_uri: https://github.com/scottscheapflights/wordmap
-  source_code_uri: https://github.com/scottscheapflights/wordmap
-  changelog_uri: https://github.com/scottscheapflights/wordmap/blob/master/CHANGELOG.md
+  homepage_uri: https://github.com/maxim/wordmap
+  source_code_uri: https://github.com/maxim/wordmap
+  changelog_uri: https://github.com/maxim/wordmap/blob/master/CHANGELOG.md
 post_install_message:
 rdoc_options: []
 require_paths:
@@ -115,7 +116,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubygems_version: 3.1.2
+rubygems_version: 3.4.10
 signing_key:
 specification_version: 4
 summary: Look up data from disk without using your RAM.