wordmap 0.1.0 → 0.2.0
- checksums.yaml +4 -4
- data/.github/workflows/rspec.yml +1 -1
- data/CHANGELOG.md +7 -0
- data/README.md +43 -13
- data/lib/wordmap/file_access.rb +1 -2
- data/lib/wordmap/version.rb +1 -1
- data/wordmap.gemspec +1 -1
- metadata +6 -5
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-metadata.gz:
-data.tar.gz:
+metadata.gz: 1a93834b2b238fcd5ca8c11c828b72ccbe7bfa16c98f9629593dea4780b0480a
+data.tar.gz: 68b78491386a8691e3bad3b653d74dbeb8a21c42edce3f056fc9bb28e044ba19
 SHA512:
-metadata.gz:
-data.tar.gz:
+metadata.gz: 38567abd4ea0abc3db1bd390438efa6c8364bc94551ecb127350579eda73c21a071b540ab36bf135927963a57c92dd745dca97236bd4319f817bc53f51dd6873
+data.tar.gz: 9ca7c3a15e4a5201f0029746b76bab5441d017ace05bdcbc0caee7c7355672e5661efd61e67542db090f9f0653ec64e0b80dff2e8e0c5d626d5ab8dbdb99c5ea
data/.github/workflows/rspec.yml
CHANGED
data/CHANGELOG.md
ADDED
data/README.md
CHANGED
@@ -12,6 +12,8 @@ Useful in cases where:
 
 ## Installation
 
+Note: Requires at least ruby 2.5 to support `File#pread` function.
+
 Add this line to your application's Gemfile:
 
 ```ruby
@@ -28,7 +30,7 @@ Or install it yourself as:
 
 ## Usage
 
-Before we can query a wordmap, we must create one
+Before we can query a wordmap, we must create one.
 
 ### Creating
 
@@ -76,6 +78,9 @@ fruits.query(%w[banana lemon]).to_a # => ["14", "49"]
 # Give me prices for all yellow fruits.
 fruits.query([:color, 'yellow']).to_a # => ["14", "49"]
 
+# Give me prices for all citrus and musa fruits.
+fruits.query([:genus, 'citrus', 'musa']).to_a # => ["14", "49"]
+
 # Give me prices for all yellow citruses.
 fruits.query([:genus, 'citrus'], [:color, 'yellow']).to_a # => ["49"]
 
@@ -83,9 +88,9 @@ fruits.query([:genus, 'citrus'], [:color, 'yellow']).to_a # => ["49"]
 fruits.query(%w[lemon banana], [:genus, 'citrus']).to_a # => ["49"]
 ```
 
-Each query is an array of arrays (outer array is omitted in the examples, because it works either way). Inner arrays are treated like unions (everything in them is `OR`'ed). Outer array is treated as an intersection (results of inner arrays are `AND`'ed with one another).
+Each query is an array of arrays (outer array is omitted in the examples, because it works either way). Inner arrays are treated like unions (everything in them is `OR`'ed). Outer array is treated as an intersection (results of inner arrays are `AND`'ed with one another). Order of arrays doesn't matter.
 
-If an inner array starts with a symbol,
+If an inner array starts with a symbol, then we're looking up an index of that name, otherwise — by key(s).
 
 Tip: if you are only supplying 1 array (as in the first and second examples above), you can drop all array wrappers entirely.
 
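The union/intersection rule described in this hunk can be mimicked in plain Ruby. The following is only a rough in-memory sketch: `ENTRIES`, `INDEXES`, `resolve_inner`, and `toy_query` are hypothetical names over an assumed two-fruit dataset, not part of wordmap's API or internals.

```ruby
require 'set'

# Toy stand-ins for a wordmap's entries and indexes (assumed data).
ENTRIES = { 'banana' => '14', 'lemon' => '49' }.freeze
INDEXES = {
  genus: { 'citrus' => ['lemon'], 'musa' => ['banana'] },
  color: { 'yellow' => %w[banana lemon] }
}.freeze

# Resolve one inner array to a set of keys; everything in it is OR'ed.
def resolve_inner(inner)
  if inner.first.is_a?(Symbol)
    index_name, *values = inner
    values.flat_map { |value| INDEXES.fetch(index_name).fetch(value, []) }.to_set
  else
    inner.select { |key| ENTRIES.key?(key) }.to_set
  end
end

# AND the inner results together, then return entries in sorted key order.
def toy_query(*inners)
  keys = inners.map { |inner| resolve_inner(inner) }.reduce(:&)
  keys.sort.map { |key| ENTRIES.fetch(key) }
end

toy_query([:genus, 'citrus', 'musa'])             # => ["14", "49"]
toy_query([:genus, 'citrus'], [:color, 'yellow']) # => ["49"]
toy_query(%w[lemon banana], [:genus, 'citrus'])   # => ["49"]
```

Because set intersection is commutative, swapping the inner arrays gives the same result, which matches the "order of arrays doesn't matter" note.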
@@ -125,33 +130,58 @@ fruits.each(:genus).to_a # => ["citrus", "musa"]
 
 ### Multi-dimensional keys
 
-In the above examples the keys are simply `'banana'` and `'lemon'` — strings. If you make your key an array of strings, that'd make a multi-dimensional key. This can come helpful for some data where 2 keys make sense (we have such use cases at Scott's). Internally, each dimension is a different vector. However if you go that route, keep in mind that all the "unused" key combinations will create gaps in the data file, therefore inflating its size. For example, if you make a key out of genus + name of a fruit, like `%w[citrus lemon]` and `%w[musa banana]`, your file will become inflated with empty cells created for `%w[citrus banana]
+In the above examples the keys are simply `'banana'` and `'lemon'` — strings. If you make your key an array of strings, that'd make a multi-dimensional key. This can come helpful for some data where 2 keys make sense (we have such use cases at Scott's). Internally, each dimension is a different vector. However if you go that route, keep in mind that all the "unused" key combinations will create gaps in the data file, therefore inflating its size. For example, if you make a key out of genus + name of a fruit, like `%w[citrus lemon]` and `%w[musa banana]`, your file will become inflated with empty cells created for keys `%w[citrus banana]` and `%w[musa lemon]`. That space is taken (padded with null bytes) even though there are no values for these keys.
 
 ## Anatomy
 
-
+For those interested, here's some high level implementation and structure overview.
+
+### Staying out of RAM
+
+When you initialize a wordmap object in ruby, it opens a few file descriptors, and reads a few integers of metadata from each file. Nothing else is loaded.
+
+When making a look up, wordmap seeks and reads just the needed bytes in the file using `File#pread` function. This avoids any caching or preloading of data into RAM.
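For readers who haven't used it, here is a small standalone illustration of `pread` (defined on `IO`, so `File` inherits it). It reads a byte range at an explicit offset and bypasses Ruby's own IO buffering, without moving the file position. The temp file and the `pread_demo` helper are made up for this sketch and have nothing to do with wordmap's file layout.

```ruby
require 'tempfile'

# Write ten bytes to a scratch file, then read 3 bytes at offset 4 with pread.
# Returns the chunk plus whether the file position was left untouched.
def pread_demo
  Tempfile.create('pread-demo') do |file|
    file.write('abcdefghij')
    file.flush
    position_before = file.pos
    chunk = file.pread(3, 4) # 3 bytes starting at byte offset 4
    [chunk, file.pos == position_before]
  end
end

pread_demo # => ["efg", true]
```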
+
+### Structure
+
+A wordmap on disk is just a directory with a few files in it. The files are formatted in a content addressable way similar to "words" in computer memory.
 
 ### `data` file
 
-The data file is where
+The data file is where your entries are stored. When a wordmap is created, it iterates through your input hash of data, and finds the longest entry. This entry determines the size of a single cell in the data file, which means that all other entries are padded to this size. (A cell is just a padded entry. It's like a spreadsheet where all cells must be equal length.) Once we dump all the cells with your entries into the data file, we can easily find each cell by its sequential index, because it's just index times cell size.
+
+For example, let's take solar system's planet names. The longest name is 7 chars, so all other names are left-padded to 7 chars. Here I'm padding with spaces, but in wordmap they'd be padded with null bytes instead.
+
+```
+Mercury  Earth   MarsJupiter Saturn Uranus Neptune
+```
+
+Now to find the 3rd item, we can just 2 * 7 = 14. We seek to 14th byte position and read 7 bytes to get `   Mars`. Then we trim the padding to get `Mars`.
 
 The important part is the order of data in this file. When a wordmap is created, all the keys are sorted lexicographically, and for every key, entry is written in the order of how the corresponding keys are sorted. This means that if we know index of where a key is positioned sequentially, we also know index of where the cell is in the data file.
 
 ### `vec` files
 
-Vector files are
+Vector files are structured the same as data file, but they store keys instead of entries. If you used a 1-dimensional key, then it creates just one vector file. Since this file is sorted, we can apply binary-search to find a key in this file, and then seek to corresponding position in the data file to find the entry.
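To make the binary search concrete, here is a sketch of looking up a key in a sorted file of fixed-size cells. It assumes null-byte left-padding and an in-memory `StringIO` in place of a real vector file; `KEY_SIZE`, `SORTED_KEYS`, `VECTOR`, `read_key`, and `find_index` are hypothetical names, not wordmap's implementation.

```ruby
require 'stringio'

KEY_SIZE = 6
SORTED_KEYS = %w[apple banana cherry lemon].freeze # already in sorted order

# One fixed-size cell per key, left-padded with null bytes.
VECTOR = StringIO.new(SORTED_KEYS.map { |key| key.rjust(KEY_SIZE, "\0") }.join)

# Read the key stored in the cell at a zero-based index.
def read_key(io, index, size)
  io.seek(index * size)
  io.read(size).delete("\0")
end

# Binary-search the cell indexes; returns the key's index, or nil if absent.
# The same index then locates the entry's cell in the data file.
def find_index(io, count, size, key)
  (0...count).bsearch { |i| key <=> read_key(io, i, size) }
end

find_index(VECTOR, SORTED_KEYS.size, KEY_SIZE, 'cherry') # => 2
find_index(VECTOR, SORTED_KEYS.size, KEY_SIZE, 'kiwi')   # => nil
```

Only about log2(count) cells are read per lookup, so the vector never has to be loaded as a whole.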
 
-For multi-dimensional keys, multiple vector files are created (one per dimension). Let's say we have 2-dimensional key (a key that's an array of 2 strings). The first vector will contain all the first strings, and second all the second strings. Now when wordmap is doing a lookup by key, it will first bsearch the first vector
+For multi-dimensional keys, multiple vector files are created (one per dimension). Let's say we have a 2-dimensional key (a key that's an array of 2 strings). The first vector will contain all the first strings, and second all the second strings of all keys. Now when wordmap is doing a lookup by key, it will first bsearch the first vector, then bsearch the second vector. The 2 found positions are then multiplied by entry's cell size and added together to get the exact location of the cell in the data file.
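The text doesn't spell out how each dimension's position is scaled, so the following is only one plausible reading: a row-major grid, where the first position is additionally multiplied by the number of keys in the second vector. `cell_offset`, `GENERA`, `NAMES`, and the 2-byte cell size are assumptions made up for this sketch.

```ruby
# Assumed row-major layout:
#   cell index  = first_pos * second_count + second_pos
#   byte offset = cell index * cell_size
def cell_offset(first_pos, second_pos, second_count, cell_size)
  (first_pos * second_count + second_pos) * cell_size
end

GENERA = %w[citrus musa].freeze  # first vector, sorted
NAMES  = %w[banana lemon].freeze # second vector, sorted

# With 2-byte cells, the 2 x 2 grid occupies offsets 0, 2, 4 and 6.
cell_offset(GENERA.index('citrus'), NAMES.index('lemon'), NAMES.size, 2) # => 2
cell_offset(GENERA.index('musa'), NAMES.index('banana'), NAMES.size, 2)  # => 4
```

Under this reading, offsets 0 and 6 belong to `%w[citrus banana]` and `%w[musa lemon]`, the padded gaps mentioned in the multi-dimensional keys section.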
 
 ### Metadata
 
-Data and vector files each have a couple of numbers at the beginning that specify cells' bytesize and count. This is the only part that wordmap reads into RAM when instantiated
+Data and vector files each have a couple of numbers at the beginning that specify cells' bytesize and count. This is the only part that wordmap reads into RAM when instantiated. Having read these 2 integers, we can derive 2 additional pieces of information:
+
+1. the bytesize of the metadata itself, so that we can skip over it
+2. how many cells we should read every time we read a lot of cells (to optimize sequential reads)
+
+The latter is always trying to be near ~10kb per read (unless a single cell is longer than 10kb, then it's using single cell's size).
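The cells-per-read rule might look roughly like this. The exact target (10 * 1024 bytes here) and the rounding are guesses on my part, and `TARGET_READ_BYTES` and `cells_per_read` are hypothetical names, not taken from wordmap's source.

```ruby
TARGET_READ_BYTES = 10 * 1024 # assumed meaning of "~10kb"

# How many whole cells fit in one read of roughly the target size.
# A cell larger than the target is still read one at a time.
def cells_per_read(cell_size, target = TARGET_READ_BYTES)
  [target / cell_size, 1].max
end

cells_per_read(7)      # => 1462
cells_per_read(20_000) # => 1
```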
 
 ### Indexes
 
-Indexes are just wordmaps
+Indexes are just recursively-nested wordmaps inside the wordmap you create. These nested wordmaps have index keys as the keys, and lists of locations as values. The values of indexes are invisible to the end user, but since this section is about anatomy, it makes sense to mention them.
+
+The locations are stored as a comma-separated list of [delta encoded](https://en.wikipedia.org/wiki/Delta_encoding) sorted integers and ranges. For example, if we are storing locations `1,3,5,6,7,8,12,15` the stored value will look like this: `1,2,2+3,4,3`. You can unpack this value by saying "first position is **1**, second position is 1 + 2 = **3**, third position is 3 + 2 = **5**, now add 3 more successively: **6,7,8**, then 8 + 4 = **12**, and 12 + 3 = **15**".
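The unpacking steps above translate almost directly into Ruby. This decoder is a sketch written from the description alone, reading each token as `delta` or `delta+run`; `decode_locations` is a made-up name and not necessarily how wordmap implements it.

```ruby
# Decode a comma-separated, delta-encoded location list.
# A plain token "d" means: next location = previous + d.
# A token "d+n" means: next location = previous + d, followed by n more
# consecutive locations.
def decode_locations(encoded)
  current = 0
  encoded.split(',').flat_map do |token|
    delta, run = token.split('+').map(&:to_i)
    current += delta
    first = current
    current += run.to_i # run is nil for plain tokens, and nil.to_i is 0
    (first..current).to_a
  end
end

decode_locations('1,2,2+3,4,3') # => [1, 3, 5, 6, 7, 8, 12, 15]
```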
 
-When
+When looking up a query, wordmap produces lazy iterators for unioning and intersecting data. These iterators lazily walk indexed locations, or keys in a vector file, and return each found entry from the data file.
 
 ## Development
 
@@ -161,9 +191,9 @@ To install this gem onto your local machine, run `bundle exec rake install`. To
 
 ## Contributing
 
-Bug reports and pull requests are welcome on GitHub at https://github.com/[USERNAME]/wordmap. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](https://github.com/[USERNAME]/wordmap/blob/
+Bug reports and pull requests are welcome on GitHub at https://github.com/[USERNAME]/wordmap. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](https://github.com/[USERNAME]/wordmap/blob/main/CODE_OF_CONDUCT.md).
 
 
 ## Code of Conduct
 
-Everyone interacting in the Wordmap project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/[USERNAME]/wordmap/blob/
+Everyone interacting in the Wordmap project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/[USERNAME]/wordmap/blob/main/CODE_OF_CONDUCT.md).
data/lib/wordmap/file_access.rb
CHANGED
data/lib/wordmap/version.rb
CHANGED
data/wordmap.gemspec
CHANGED
@@ -7,7 +7,7 @@ Gem::Specification.new do |spec|
 spec.email = ['madfancier@gmail.com']
 
 spec.summary = 'Look up data from disk without using your RAM.'
-spec.description = 'Wordmap is a simple way to lookup data directly from disk, bypassing RAM
+spec.description = 'Wordmap is a simple way to lookup data directly from disk, bypassing RAM. It uses pread (no buffering), and takes advantage of SSD\'s constant seek time. The data is stored in equal size "cells" making it easy to calculate where things are located based on vectors.'
 spec.homepage = 'https://github.com/scottscheapflights/wordmap'
 spec.license = 'Apache-2.0'
 
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: wordmap
 version: !ruby/object:Gem::Version
-version: 0.
+version: 0.2.0
 platform: ruby
 authors:
 - Maxim Chernyak
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2020-09-
+date: 2020-09-16 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
 name: bundler
@@ -67,9 +67,9 @@ dependencies:
 - !ruby/object:Gem::Version
 version: '0.13'
 description: Wordmap is a simple way to lookup data directly from disk, bypassing
-RAM
-
-
+RAM. It uses pread (no buffering), and takes advantage of SSD's constant seek time.
+The data is stored in equal size "cells" making it easy to calculate where things
+are located based on vectors.
 email:
 - madfancier@gmail.com
 executables: []
@@ -79,6 +79,7 @@ files:
 - ".github/workflows/rspec.yml"
 - ".gitignore"
 - ".rspec"
+- CHANGELOG.md
 - CODE_OF_CONDUCT.md
 - Gemfile
 - LICENSE