wordmap 0.1.0 → 0.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: f357e18c0e2199383ef82f3646e810a102841393b5f12f78455736d945205987
4
- data.tar.gz: 18d38ade82bbaf981ffd5279f9f4edb8e9b6a38cbfe521aa45e3bf1f6c41d3ff
3
+ metadata.gz: 1a93834b2b238fcd5ca8c11c828b72ccbe7bfa16c98f9629593dea4780b0480a
4
+ data.tar.gz: 68b78491386a8691e3bad3b653d74dbeb8a21c42edce3f056fc9bb28e044ba19
5
5
  SHA512:
6
- metadata.gz: 3f29ca3def2655f7acc36af95f7328620689c7cf7c366e015c196316e4027d1d283f1d36f0403abbc2367c986ae2683c4938dcdb7d25185135fa59fb9c72a3ed
7
- data.tar.gz: 60ba8f842cea3d5d9269f4ab0eec323880f7827d4396ffacdc89d1154692d7ee72950ab745f30de79f09ab832894e05d13d6b1d16238d1438d1feac5659d2e09
6
+ metadata.gz: 38567abd4ea0abc3db1bd390438efa6c8364bc94551ecb127350579eda73c21a071b540ab36bf135927963a57c92dd745dca97236bd4319f817bc53f51dd6873
7
+ data.tar.gz: 9ca7c3a15e4a5201f0029746b76bab5441d017ace05bdcbc0caee7c7355672e5661efd61e67542db090f9f0653ec64e0b80dff2e8e0c5d626d5ab8dbdb99c5ea
@@ -11,7 +11,7 @@ jobs:
11
11
  runs-on: ubuntu-latest
12
12
  strategy:
13
13
  matrix:
14
- ruby: [ '2.4', '2.5', '2.6', '2.7' ]
14
+ ruby: [ '2.5', '2.6', '2.7' ]
15
15
 
16
16
  name: Ruby ${{ matrix.ruby }}
17
17
  steps:
@@ -0,0 +1,7 @@
1
+ This project follows [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
2
+
3
+ ## Unreleased
4
+
5
+ ## 0.2.0 - 2020-09-16
6
+
7
+ * Make file access thread safe
data/README.md CHANGED
@@ -12,6 +12,8 @@ Useful in cases where:
12
12
 
13
13
  ## Installation
14
14
 
15
+ Note: Requires Ruby 2.5 or newer for `File#pread` support (`IO#pread` was added in Ruby 2.5).
16
+
15
17
  Add this line to your application's Gemfile:
16
18
 
17
19
  ```ruby
@@ -28,7 +30,7 @@ Or install it yourself as:
28
30
 
29
31
  ## Usage
30
32
 
31
- Before we can query a wordmap, we must create one first.
33
+ Before we can query a wordmap, we must create one.
32
34
 
33
35
  ### Creating
34
36
 
@@ -76,6 +78,9 @@ fruits.query(%w[banana lemon]).to_a # => ["14", "49"]
76
78
  # Give me prices for all yellow fruits.
77
79
  fruits.query([:color, 'yellow']).to_a # => ["14", "49"]
78
80
 
81
+ # Give me prices for all citrus and musa fruits.
82
+ fruits.query([:genus, 'citrus', 'musa']).to_a # => ["14", "49"]
83
+
79
84
  # Give me prices for all yellow citruses.
80
85
  fruits.query([:genus, 'citrus'], [:color, 'yellow']).to_a # => ["49"]
81
86
 
@@ -83,9 +88,9 @@ fruits.query([:genus, 'citrus'], [:color, 'yellow']).to_a # => ["49"]
83
88
  fruits.query(%w[lemon banana], [:genus, 'citrus']).to_a # => ["49"]
84
89
  ```
85
90
 
86
- Each query is an array of arrays (outer array is omitted in the examples, because it works either way). Inner arrays are treated like unions (everything in them is `OR`'ed). Outer array is treated as an intersection (results of inner arrays are `AND`'ed with one another).
91
+ Each query is an array of arrays (outer array is omitted in the examples, because it works either way). Inner arrays are treated like unions (everything in them is `OR`'ed). Outer array is treated as an intersection (results of inner arrays are `AND`'ed with one another). Order of arrays doesn't matter.
87
92
 
88
- If an inner array starts with a symbol, the symbol is treated as an index name you want to look in.
93
+ If an inner array starts with a symbol, then we're looking up an index of that name, otherwise by key(s).
89
94
 
90
95
  Tip: if you are only supplying 1 array (as in the first and second examples above), you can drop all array wrappers entirely.
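The union/intersection rules above can be sketched with plain Ruby arrays. This is only an illustration, not wordmap's implementation: `FRUITS`, `INDEXES`, `keys_for` and `toy_query` are hypothetical names, the data is inferred from the examples above, and everything is evaluated eagerly in memory.

```ruby
# Hypothetical in-memory stand-ins for a wordmap and its indexes.
FRUITS = { 'banana' => '14', 'lemon' => '49' }.freeze
INDEXES = {
  genus: { 'citrus' => %w[lemon], 'musa' => %w[banana] },
  color: { 'yellow' => %w[banana lemon] }
}.freeze

# Resolve one inner array to a list of keys. Everything inside it is OR'ed.
def keys_for(inner)
  if inner.first.is_a?(Symbol)
    index, *values = inner
    values.flat_map { |value| INDEXES.fetch(index).fetch(value, []) }.uniq
  else
    inner & FRUITS.keys # plain keys: keep only the ones that exist
  end
end

# AND the inner arrays together, then return entries in key order.
def toy_query(*inner_arrays)
  inner_arrays.map { |inner| keys_for(inner) }
              .reduce(:&)
              .sort
              .map { |key| FRUITS.fetch(key) }
end

p toy_query([:genus, 'citrus', 'musa'])            # => ["14", "49"]
p toy_query([:genus, 'citrus'], [:color, 'yellow']) # => ["49"]
```

Because the outer step is an intersection, swapping the order of the inner arrays gives the same result.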
91
96
 
@@ -125,33 +130,58 @@ fruits.each(:genus).to_a # => ["citrus", "musa"]
125
130
 
126
131
  ### Multi-dimensional keys
127
132
 
128
- In the above examples the keys are simply `'banana'` and `'lemon'` — strings. If you make your key an array of strings, that'd make a multi-dimensional key. This can come helpful for some data where 2 keys make sense (we have such use cases at Scott's). Internally, each dimension is a different vector. However if you go that route, keep in mind that all the "unused" key combinations will create gaps in the data file, therefore inflating its size. For example, if you make a key out of genus + name of a fruit, like `%w[citrus lemon]` and `%w[musa banana]`, your file will become inflated with empty cells created for `%w[citrus banana]`, `%w[musa lemon]`. That space is taken (padded with null bytes) even if there are no values for these keys.
133
+ In the above examples the keys are simply `'banana'` and `'lemon'` — strings. If you make your key an array of strings, that'd make a multi-dimensional key. This can come helpful for some data where 2 keys make sense (we have such use cases at Scott's). Internally, each dimension is a different vector. However if you go that route, keep in mind that all the "unused" key combinations will create gaps in the data file, therefore inflating its size. For example, if you make a key out of genus + name of a fruit, like `%w[citrus lemon]` and `%w[musa banana]`, your file will become inflated with empty cells created for keys `%w[citrus banana]` and `%w[musa lemon]`. That space is taken (padded with null bytes) even though there are no values for these keys.
129
134
 
130
135
  ## Anatomy
131
136
 
132
- A wordmap on disk is just a directory with a few files in it.
137
+ For those interested, here's some high level implementation and structure overview.
138
+
139
+ ### Staying out of RAM
140
+
141
+ When you initialize a wordmap object in ruby, it opens a few file descriptors, and reads a few integers of metadata from each file. Nothing else is loaded.
142
+
143
+ When performing a lookup, wordmap reads just the needed bytes at the needed offset in the file using the `File#pread` function. This avoids buffering or preloading data into RAM on the Ruby side.
144
+
145
+ ### Structure
146
+
147
+ A wordmap on disk is just a directory with a few files in it. The files are formatted in a content addressable way similar to "words" in computer memory.
133
148
 
134
149
  ### `data` file
135
150
 
136
- The data file is where the actual entries are stored. When a wordmap is created, it looks through all the entries you want to store, and finds one with the maximum bytesize. Then it makes all entries that size by padding them with null bytes in front, and dumping all of them into the file. Since this makes each entry in the file the same size, we can easily seek to any single entry by knowing its index, because it's just index times entry size. We call such padded entry a "cell".
151
+ The data file is where your entries are stored. When a wordmap is created, it iterates through your input hash of data, and finds the longest entry. This entry determines the size of a single cell in the data file, which means that all other entries are padded to this size. (A cell is just a padded entry. It's like a spreadsheet where all cells must be equal length.) Once we dump all the cells with your entries into the data file, we can easily find each cell by its sequential index, because it's just index times cell size.
152
+
153
+ For example, let's take solar system's planet names. The longest name is 7 chars, so all other names are left-padded to 7 chars. Here I'm padding with spaces, but in wordmap they'd be padded with null bytes instead.
154
+
155
+ ```
156
+ Mercury  Earth   MarsJupiter Saturn UranusNeptune
157
+ ```
158
+
159
+ Now to find the 3rd item, we multiply its zero-based index by the cell size: 2 * 7 = 14. We seek to byte position 14 and read 7 bytes to get `   Mars`. Then we trim the padding to get `Mars`.
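The planet example can be sketched in a few lines of Ruby. This is a minimal illustration, not wordmap's actual code: `write_cells` and `read_cell` are hypothetical helpers, the file holds no metadata header, and entries are assumed to be ASCII so that character length equals bytesize.

```ruby
require 'tempfile'

CELL_SIZE = 7 # bytesize of the longest entry ("Mercury", "Jupiter", "Neptune")
PLANETS = %w[Mercury Earth Mars Jupiter Saturn Uranus Neptune].freeze

# Left-pad every entry with null bytes to cell_size and write the cells back to back.
def write_cells(file, entries, cell_size)
  entries.each { |entry| file.write(entry.rjust(cell_size, "\0")) }
  file.flush
end

# Read the cell at a zero-based index, then strip the leading null padding.
def read_cell(file, index, cell_size)
  file.pread(cell_size, index * cell_size).sub(/\A\x00+/, '')
end

Tempfile.create('planets') do |file|
  write_cells(file, PLANETS, CELL_SIZE)
  puts read_cell(file, 2, CELL_SIZE) # offset 2 * 7 = 14, prints "Mars"
end
```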
137
160
 
138
161
  The important part is the order of data in this file. When a wordmap is created, all the keys are sorted lexicographically, and for every key, entry is written in the order of how the corresponding keys are sorted. This means that if we know index of where a key is positioned sequentially, we also know index of where the cell is in the data file.
139
162
 
140
163
  ### `vec` files
141
164
 
142
- Vector files are where keys are stored. If you used a string as a lookup key, then it creates just one vector file where every key is written in a cell padded to maximum key length just like the case with the data file. Since this file is sorted, we can easily binary-search a key in this file, and then seek to corresponding position in the data file to find the entry.
165
+ Vector files are structured the same as data file, but they store keys instead of entries. If you used a 1-dimensional key, then it creates just one vector file. Since this file is sorted, we can apply binary-search to find a key in this file, and then seek to corresponding position in the data file to find the entry.
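A rough sketch of the 1-dimensional lookup described above: binary-search the sorted vector file, then read the cell at the same index from the data file. The helper names and the `PRICES` sample (including the `orange` entry) are made up for illustration, metadata headers are left out, and keys are assumed to be ASCII.

```ruby
require 'tempfile'

# Pad each string with leading null bytes to the longest bytesize, write the
# cells back to back, and return the cell size.
def write_cells(file, strings)
  cell_size = strings.map(&:bytesize).max
  strings.each { |s| file.write(("\0" * (cell_size - s.bytesize)) + s) }
  file.flush
  cell_size
end

# Read one fixed-width cell and strip its null padding.
def read_cell(file, index, cell_size)
  file.pread(cell_size, index * cell_size).sub(/\A\x00+/, '')
end

# Binary-search the sorted vector file; returns the key's index or nil.
def find_index(vec, key, cell_size, cell_count)
  (0...cell_count).bsearch { |i| key <=> read_cell(vec, i, cell_size) }
end

# The key's index in the vector file is also the entry's index in the data file.
def lookup(vec, data, key, key_size, entry_size, count)
  index = find_index(vec, key, key_size, count)
  index && read_cell(data, index, entry_size)
end

# Hypothetical sample data; keys are written in sorted order.
PRICES = { 'banana' => '14', 'lemon' => '49', 'orange' => '25' }.freeze

Tempfile.create('vec') do |vec|
  Tempfile.create('data') do |data|
    keys = PRICES.keys.sort
    key_size = write_cells(vec, keys)
    entry_size = write_cells(data, keys.map { |k| PRICES[k] })
    puts lookup(vec, data, 'lemon', key_size, entry_size, keys.size) # prints "49"
  end
end
```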
143
166
 
144
- For multi-dimensional keys, multiple vector files are created (one per dimension). Let's say we have 2-dimensional key (a key that's an array of 2 strings). The first vector will contain all the first strings, and second all the second strings. Now when wordmap is doing a lookup by key, it will first bsearch the first vector to find a "page" of entries in the data file, then it will bsearch the second vector to find an exact entry position in that page of entries. Then it will know exactly where to seek to grab the entry from the data file.
167
+ For multi-dimensional keys, multiple vector files are created (one per dimension). Let's say we have a 2-dimensional key (a key that's an array of 2 strings). The first vector will contain all the first strings, and second all the second strings of all keys. Now when wordmap is doing a lookup by key, it will first bsearch the first vector, then bsearch the second vector. The 2 found positions are then multiplied by entry's cell size and added together to get the exact location of the cell in the data file.
145
168
 
146
169
  ### Metadata
147
170
 
148
- Data and vector files each have a couple of numbers at the beginning that specify cells' bytesize and count. This is the only part that wordmap reads into RAM when instantiated: 2 integers per file. Having read metadata we can derive 2 additional pieces of information: 1. the bytesize of the metadata itself, so that we can skip over it, and 2. how many cells we should read every time we read a lot of cells (to optimize sequential reads). The latter is always trying to be near ~10kb per read (unless a single cell is longer than 10kb, then it's using single cell's size).
171
+ Data and vector files each have a couple of numbers at the beginning that specify cells' bytesize and count. This is the only part that wordmap reads into RAM when instantiated. Having read these 2 integers, we can derive 2 additional pieces of information:
172
+
173
+ 1. the bytesize of the metadata itself, so that we can skip over it
174
+ 2. how many cells we should read every time we read a lot of cells (to optimize sequential reads)
175
+
176
+ The latter is always trying to be near ~10kb per read (unless a single cell is longer than 10kb, then it's using single cell's size).
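The batch-size rule can be written as a one-liner. This is only a guess at the arithmetic: the exact byte target (10,000 here) and the rounding are assumptions, and `cells_per_read` is a hypothetical name.

```ruby
# Assumed target; the text only says "near ~10kb".
TARGET_READ_BYTES = 10_000

# How many whole cells fit in one sequential read, never fewer than one.
def cells_per_read(cell_size)
  [TARGET_READ_BYTES / cell_size, 1].max
end

p cells_per_read(7)      # => 1428
p cells_per_read(20_000) # => 1 (a single cell is already larger than the target)
```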
149
177
 
150
178
  ### Indexes
151
179
 
152
- Indexes are just wordmaps nested inside the wordmap you create. These inner wordmaps have index keys as the keys, and lists of locations as values. The values of indexes are invisible to the end user, but since this section is about anatomy, it makes sense to mention them. The locations are stored as a comma-separated list of [delta encoded](https://en.wikipedia.org/wiki/Delta_encoding) sorted integers and ranges. For example, if we are storing locations `1,3,5,6,7,8,12,15` the stored value will look like this: `1,2,2+3,4,3`. You can unpack this value by saying "first position is **1**, second position is 1 + 2 = **3**, third position is 3 + 2 = **5**, now add 3 more successively: **6,7,8**, then 8 + 4 = **12**, and 12 + 3 = **15**".
180
+ Indexes are just recursively-nested wordmaps inside the wordmap you create. These nested wordmaps have index keys as the keys, and lists of locations as values. The values of indexes are invisible to the end user, but since this section is about anatomy, it makes sense to mention them.
181
+
182
+ The locations are stored as a comma-separated list of [delta encoded](https://en.wikipedia.org/wiki/Delta_encoding) sorted integers and ranges. For example, if we are storing locations `1,3,5,6,7,8,12,15` the stored value will look like this: `1,2,2+3,4,3`. You can unpack this value by saying "first position is **1**, second position is 1 + 2 = **3**, third position is 3 + 2 = **5**, now add 3 more successively: **6,7,8**, then 8 + 4 = **12**, and 12 + 3 = **15**".
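The unpacking steps can be sketched as a small decoder. It assumes the format exactly as described above (a plain number is a delta, and `delta+n` means "apply the delta, then take the next `n` consecutive positions"). `decode_locations` is a hypothetical name, not necessarily what wordmap calls it.

```ruby
# Decode a comma-separated list of delta-encoded integers and ranges.
def decode_locations(encoded)
  position = 0
  encoded.split(',').flat_map do |token|
    delta, run = token.split('+').map(&:to_i)
    position += delta      # jump to the next stored position
    first = position
    position += run.to_i   # run is nil for plain deltas, so this adds 0
    (first..position).to_a # a single position, or a consecutive range
  end
end

p decode_locations('1,2,2+3,4,3') # => [1, 3, 5, 6, 7, 8, 12, 15]
```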
153
183
 
154
- When processing a query, wordmap produces lazy iterators for unioning and intersecting data. These iterators lazily walk indexed locations, or keys in a vector file, and return each found entry from the data file.
184
+ When looking up a query, wordmap produces lazy iterators for unioning and intersecting data. These iterators lazily walk indexed locations, or keys in a vector file, and return each found entry from the data file.
155
185
 
156
186
  ## Development
157
187
 
@@ -161,9 +191,9 @@ To install this gem onto your local machine, run `bundle exec rake install`. To
161
191
 
162
192
  ## Contributing
163
193
 
164
- Bug reports and pull requests are welcome on GitHub at https://github.com/[USERNAME]/wordmap. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](https://github.com/[USERNAME]/wordmap/blob/master/CODE_OF_CONDUCT.md).
194
+ Bug reports and pull requests are welcome on GitHub at https://github.com/[USERNAME]/wordmap. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](https://github.com/[USERNAME]/wordmap/blob/main/CODE_OF_CONDUCT.md).
165
195
 
166
196
 
167
197
  ## Code of Conduct
168
198
 
169
- Everyone interacting in the Wordmap project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/[USERNAME]/wordmap/blob/master/CODE_OF_CONDUCT.md).
199
+ Everyone interacting in the Wordmap project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/[USERNAME]/wordmap/blob/main/CODE_OF_CONDUCT.md).
@@ -68,8 +68,7 @@ class Wordmap
68
68
  def read_at(file, pos, bytes)
69
69
  # puts "Seeking in #{file.path.split('.wmap', 2)[1][1..-1]} to #{pos}, " \
70
70
  # "and reading #{bytes} bytes"
71
- file.sysseek(pos)
72
- file.sysread(bytes)
71
+ file.pread(bytes, pos)
73
72
  end
74
73
  end
75
74
  end
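Why this change makes file access thread safe: `sysseek` followed by `sysread` is two calls that share the file descriptor's offset, so another thread can move the offset in between. `pread` takes the position as an argument and leaves the shared offset alone. The sketch below contrasts the two. `read_with_seek`, `read_with_pread` and `concurrent_cells` are hypothetical names used only for this illustration.

```ruby
require 'tempfile'

# 0.1.0 style: two calls that depend on the descriptor's shared offset.
def read_with_seek(file, pos, bytes)
  file.sysseek(pos)
  file.sysread(bytes)
end

# 0.2.0 style: one positional read that does not touch the shared offset.
def read_with_pread(file, pos, bytes)
  file.pread(bytes, pos)
end

# Hypothetical stress helper: each thread repeatedly reads its own 4-byte cell
# from the same File object and reports the distinct values it saw.
def concurrent_cells(file, cell_count, rounds: 200)
  threads = cell_count.times.map do |i|
    Thread.new { Array.new(rounds) { read_with_pread(file, i * 4, 4) }.uniq }
  end
  threads.map(&:value)
end

Tempfile.create('cells') do |tmp|
  tmp.write('aaaabbbbccccdddd')
  tmp.flush
  p concurrent_cells(tmp, 4) # => [["aaaa"], ["bbbb"], ["cccc"], ["dddd"]]
end
```

With `read_with_seek` in place of `read_with_pread`, the same test could occasionally return a neighbouring cell, depending on thread scheduling.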
@@ -1,3 +1,3 @@
1
1
  class Wordmap
2
- VERSION = '0.1.0'
2
+ VERSION = '0.2.0'
3
3
  end
@@ -7,7 +7,7 @@ Gem::Specification.new do |spec|
7
7
  spec.email = ['madfancier@gmail.com']
8
8
 
9
9
  spec.summary = 'Look up data from disk without using your RAM.'
10
- spec.description = 'Wordmap is a simple way to lookup data directly from disk, bypassing RAM completely. It uses sysseek and sysread (no buffering), and takes advantage of SSD\'s constant seek time. The data is stored in equal size "cells" making it easy to calculate where things are located based on vectors.'
10
+ spec.description = 'Wordmap is a simple way to lookup data directly from disk, bypassing RAM. It uses pread (no buffering), and takes advantage of SSD\'s constant seek time. The data is stored in equal size "cells" making it easy to calculate where things are located based on vectors.'
11
11
  spec.homepage = 'https://github.com/scottscheapflights/wordmap'
12
12
  spec.license = 'Apache-2.0'
13
13
 
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: wordmap
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.1.0
4
+ version: 0.2.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Maxim Chernyak
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2020-09-09 00:00:00.000000000 Z
11
+ date: 2020-09-16 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: bundler
@@ -67,9 +67,9 @@ dependencies:
67
67
  - !ruby/object:Gem::Version
68
68
  version: '0.13'
69
69
  description: Wordmap is a simple way to lookup data directly from disk, bypassing
70
- RAM completely. It uses sysseek and sysread (no buffering), and takes advantage
71
- of SSD's constant seek time. The data is stored in equal size "cells" making it
72
- easy to calculate where things are located based on vectors.
70
+ RAM. It uses pread (no buffering), and takes advantage of SSD's constant seek time.
71
+ The data is stored in equal size "cells" making it easy to calculate where things
72
+ are located based on vectors.
73
73
  email:
74
74
  - madfancier@gmail.com
75
75
  executables: []
@@ -79,6 +79,7 @@ files:
79
79
  - ".github/workflows/rspec.yml"
80
80
  - ".gitignore"
81
81
  - ".rspec"
82
+ - CHANGELOG.md
82
83
  - CODE_OF_CONDUCT.md
83
84
  - Gemfile
84
85
  - LICENSE