curate-indexer 0.2.2 → 0.2.3
- checksums.yaml +4 -4
- data/README.md +54 -0
- data/curate-indexer.gemspec +1 -0
- data/lib/curate/indexer.rb +18 -5
- data/lib/curate/indexer/configuration.rb +4 -1
- data/lib/curate/indexer/documents.rb +46 -3
- data/lib/curate/indexer/railtie.rb +1 -1
- data/lib/curate/indexer/relationship_reindexer.rb +4 -1
- data/lib/curate/indexer/version.rb +1 -1
- metadata +4 -4
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA1:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: f90a339727f8494a69331c91dabfead7a4d3fea5
|
4
|
+
data.tar.gz: b79f7029a6767747b52d48500f7fc8d567301f60
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: a5ece9e30c7760998a9d428c2cab96183eb00aa6f331b550ac0ca914c0140bed986a68694f2574fd75338e9b38d28a0a26865eefbc529ceabe8dfc7eaab63583
|
7
|
+
data.tar.gz: bc7a658c728db9aa803c3b44bc6ff04e94f2429dcdf500adff16a130652de6d1b49bc053b5e35aaade8388c1a8feceeaaef9f002df7ee697a3e22ea2b81211e9
|
data/README.md
CHANGED
@@ -6,4 +6,58 @@
|
|
6
6
|
[![Documentation Status](http://inch-ci.org/github/ndlib/curate-indexer.svg?branch=master)](http://inch-ci.org/github/ndlib/curate-indexer)
|
7
7
|
[![APACHE 2 License](http://img.shields.io/badge/APACHE2-license-blue.svg)](./LICENSE)
|
8
8
|
|
9
|
+
The Curate::Indexer gem is responsible for indexing the graph relationships between objects. It maps a PreservationDocument to an IndexDocument by expanding the PreservationDocument's direct parents into the paths that lead from a root document to that PreservationDocument.
|
10
|
+
|
11
|
+
# Background
|
12
|
+
|
9
13
|
This is a sandbox to work through the reindexing strategy as it relates to [CurateND Collections](https://github.com/ndlib/curate_nd/issues/420). At this point the code is separate to allow for rapid testing and prototyping (no sense spinning up SOLR and Fedora to walk an arbitrary graph).
|
14
|
+
|
15
|
+
# Concepts
|
16
|
+
|
17
|
+
As we are indexing objects, we have two types of documents:
|
18
|
+
|
19
|
+
1. [PreservationDocument](./lib/curate/indexer/documents.rb) - a light-weight representation of a Fedora object
|
20
|
+
2. [IndexDocument](./lib/curate/indexer/documents.rb) - a light-weight representation of a SOLR document object
|
21
|
+
|
22
|
+
We have four attributes to consider for indexing the graph:
|
23
|
+
|
24
|
+
1. pid - the unique identifier for a document
|
25
|
+
2. parent_pids - the pids for all of the parents of a given document
|
26
|
+
3. pathnames - the paths to traverse from a root document to the given document
|
27
|
+
4. ancestors - the pathnames of each of the ancestors
|
28
|
+
|
29
|
+
See [Curate::Indexer::Documents::IndexDocument](./lib/curate/indexer/documents.rb) for further discussion.
|
30
|
+
|
31
|
+
To reindex a single document, we leverage the [`Curate::Indexer.reindex_relationships`](./lib/curate/indexer.rb) method.
|
32
|
+
|
33
|
+
# Examples
|
34
|
+
|
35
|
+
Given the following PreservationDocuments:
|
36
|
+
|
37
|
+
| PID | Parents |
|
38
|
+
|-----|---------|
|
39
|
+
| A | - |
|
40
|
+
| B | - |
|
41
|
+
| C | A |
|
42
|
+
| D | A, B |
|
43
|
+
| E | C |
|
44
|
+
|
45
|
+
If we reindex the above PreservationDocuments, we generate the following IndexDocuments:
|
46
|
+
|
47
|
+
| PID | Parents | Pathnames | Ancestors |
|
48
|
+
|-----|---------|------------|-----------|
|
49
|
+
| A | - | [A] | [] |
|
50
|
+
| B | - | [B] | [] |
|
51
|
+
| C | A | [A/C] | [A] |
|
52
|
+
| D | A, B | [A/D, B/D] | [A, B] |
|
53
|
+
| E | C | [A/C/E] | [A/C] |
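
The mapping in the table above can be sketched in a few lines of plain Ruby. This is only an illustration, not the gem's implementation: `PreservationDoc`, `IndexDoc`, and `build_index` are hypothetical names, the sketch does no cycle detection, and it assumes that a document's ancestors are the pathnames of its direct parents, which is what the table shows. The gem's own builder may record more than that.

```ruby
# Hypothetical stand-ins for the gem's documents; these are NOT the gem's classes.
PreservationDoc = Struct.new(:pid, :parent_pids)
IndexDoc = Struct.new(:pid, :parent_pids, :pathnames, :ancestors)

# Build an IndexDoc for every PreservationDoc, resolving parents before children.
# Assumption: the graph has no cycles (the gem guards against cycles with a time to live).
def build_index(preservation_docs)
  by_pid = preservation_docs.each_with_object({}) { |doc, memo| memo[doc.pid] = doc }
  index = {}
  resolve = lambda do |pid|
    index[pid] ||= begin
      doc = by_pid.fetch(pid)
      parents = doc.parent_pids.map { |parent_pid| resolve.call(parent_pid) }
      if parents.empty?
        # A root document is reachable only by its own pid and has no ancestors.
        IndexDoc.new(pid, [], [pid], [])
      else
        # Every path to a parent, extended by this pid, is a path to this document.
        parent_paths = parents.flat_map(&:pathnames)
        pathnames = parent_paths.map { |path| "#{path}/#{pid}" }
        IndexDoc.new(pid, doc.parent_pids, pathnames, parent_paths)
      end
    end
  end
  by_pid.each_key { |pid| resolve.call(pid) }
  index
end

docs = [
  PreservationDoc.new("A", []),
  PreservationDoc.new("B", []),
  PreservationDoc.new("C", ["A"]),
  PreservationDoc.new("D", ["A", "B"]),
  PreservationDoc.new("E", ["C"])
]

build_index(docs).each_value do |doc|
  puts "#{doc.pid}: pathnames=#{doc.pathnames.inspect} ancestors=#{doc.ancestors.inspect}"
end
```

Resolving parents first means each document is computed once and then reused by all of its children, which is why `D` picks up one pathname per parent.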
|
54
|
+
|
55
|
+
For more scenarios, look at the [Reindex PID and Descendants specs](./spec/features/reindex_pid_and_descendants_spec.rb).
|
56
|
+
|
57
|
+
# Adapters
|
58
|
+
|
59
|
+
An [AbstractAdapter](./lib/curate/indexer/adapters/abstract_adapter.rb) provides the method interface for others to build against.
|
60
|
+
|
61
|
+
The [InMemory adapter](./lib/curate/indexer/adapters/in_memory_adapter.rb) is a reference implementation (and used to ease testing overhead).
|
62
|
+
|
63
|
+
CurateND has implemented the [following adapter](https://github.com/ndlib/curate_nd/blob/master/lib/curate/library_collection_indexing_adapter.rb) for its LibraryCollection indexing.
|
data/curate-indexer.gemspec
CHANGED
@@ -12,6 +12,7 @@ Gem::Specification.new do |spec|
|
|
12
12
|
spec.summary = %q{A playground for CurateND collections indexing}
|
13
13
|
spec.description = %q{A playground for CurateND collections indexing}
|
14
14
|
spec.homepage = "https://github.com/ndlib/curate-indexer"
|
15
|
+
spec.license = "Apache-2.0"
|
15
16
|
|
16
17
|
spec.files = `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec|features)/}) }
|
17
18
|
spec.bindir = "bin"
|
data/lib/curate/indexer.rb
CHANGED
@@ -5,7 +5,7 @@ require 'curate/indexer/configuration'
|
|
5
5
|
require 'curate/indexer/railtie' if defined?(Rails)
|
6
6
|
|
7
7
|
module Curate
|
8
|
-
# Responsible for
|
8
|
+
# Responsible for indexing an object and its related child objects.
|
9
9
|
module Indexer
|
10
10
|
# This assumes a rather deep graph
|
11
11
|
DEFAULT_TIME_TO_LIVE = 15
|
@@ -18,7 +18,7 @@ module Curate
|
|
18
18
|
# @return [Boolean] - It was successful
|
19
19
|
# @raise Curate::Exceptions::CycleDetectionError - A potential cycle was detected
|
20
20
|
def self.reindex_relationships(pid, time_to_live = DEFAULT_TIME_TO_LIVE)
|
21
|
-
RelationshipReindexer.call(pid: pid, time_to_live: time_to_live, adapter:
|
21
|
+
RelationshipReindexer.call(pid: pid, time_to_live: time_to_live, adapter: adapter)
|
22
22
|
true
|
23
23
|
end
|
24
24
|
|
@@ -34,10 +34,15 @@ module Curate
|
|
34
34
|
# @return [Boolean] - It was successful
|
35
35
|
# @raise Curate::Exceptions::CycleDetectionError - A potential cycle was detected
|
36
36
|
def self.reindex_all!(time_to_live = DEFAULT_TIME_TO_LIVE)
|
37
|
-
RepositoryReindexer
|
37
|
+
# While the RepositoryReindexer is responsible for reindexing everything, I
|
38
|
+
# want to inject the lambda that will reindex a single item.
|
39
|
+
pid_reindexer = method(:reindex_relationships)
|
40
|
+
RepositoryReindexer.call(time_to_live: time_to_live, pid_reindexer: pid_reindexer, adapter: adapter)
|
38
41
|
true
|
39
42
|
end
|
40
43
|
|
44
|
+
# @api public
|
45
|
+
#
|
41
46
|
# Contains the Curate::Indexer configuration information that is referenceable from wit
|
42
47
|
# @see Curate::Indexer::Configuration
|
43
48
|
def self.configuration
|
@@ -45,23 +50,31 @@ module Curate
|
|
45
50
|
end
|
46
51
|
|
47
52
|
# @api public
|
53
|
+
#
|
54
|
+
# Exposes the data adapter to use for the reindexing process.
|
55
|
+
#
|
56
|
+
# @see Curate::Indexer::Adapters::AbstractAdapter
|
57
|
+
# @return Object that implements the Curate::Indexer::Adapters::AbstractAdapter method interface
|
48
58
|
def self.adapter
|
49
59
|
configuration.adapter
|
50
60
|
end
|
51
61
|
|
52
62
|
# @api public
|
63
|
+
#
|
64
|
+
# Capture the configuration information
|
65
|
+
#
|
53
66
|
# @see Curate::Indexer::Configuration
|
54
67
|
# @see .configuration
|
68
|
+
# @see Curate::Indexer::Railtie
|
55
69
|
def self.configure(&block)
|
56
70
|
@configuration_block = block
|
57
|
-
configure!
|
58
71
|
# The Rails load sequence means that some of the configured Targets may
|
59
72
|
# not be loaded; As such I am not calling configure! instead relying on
|
60
73
|
# Curate::Indexer::Railtie to handle the configure! call
|
61
74
|
configure! unless defined?(Rails)
|
62
75
|
end
|
63
76
|
|
64
|
-
# @api
|
77
|
+
# @api private
|
65
78
|
def self.configure!
|
66
79
|
return false unless @configuration_block.respond_to?(:call)
|
67
80
|
@configuration_block.call(configuration)
|
@@ -11,8 +11,11 @@ module Curate
|
|
11
11
|
|
12
12
|
private
|
13
13
|
|
14
|
+
IN_MEMORY_ADAPTER_WARNING_MESSAGE =
|
15
|
+
"WARNING: You are using the default Curate::Indexer::Adapters::InMemoryAdapter for the Curate::Indexer.adapter.".freeze
|
16
|
+
|
14
17
|
def default_adapter
|
15
|
-
$stdout.puts
|
18
|
+
$stdout.puts IN_MEMORY_ADAPTER_WARNING_MESSAGE unless defined?(SUPPRESS_MEMORY_ADAPTER_WARNING)
|
16
19
|
require 'curate/indexer/adapters/in_memory_adapter'
|
17
20
|
Adapters::InMemoryAdapter
|
18
21
|
end
|
@@ -12,10 +12,21 @@ module Curate
|
|
12
12
|
@pid = keywords.fetch(:pid).to_s
|
13
13
|
@parent_pids = Array(keywords.fetch(:parent_pids))
|
14
14
|
end
|
15
|
-
|
15
|
+
|
16
|
+
# @api public
|
17
|
+
# @return String The Fedora object's PID
|
18
|
+
attr_reader :pid
|
19
|
+
|
20
|
+
# @api public
|
21
|
+
#
|
22
|
+
# All of the direct parents of the Fedora document associated with the given PID.
|
23
|
+
#
|
24
|
+
# This does not include grandparents, great-grandparents, etc.
|
25
|
+
# @return Array<String>
|
26
|
+
attr_reader :parent_pids
|
16
27
|
end
|
17
28
|
|
18
|
-
# @api
|
29
|
+
# @api public
|
19
30
|
#
|
20
31
|
# A rudimentary representation of what is needed to reindex Solr documents
|
21
32
|
class IndexDocument
|
@@ -28,7 +39,39 @@ module Curate
|
|
28
39
|
@pathnames = Array(keywords.fetch(:pathnames))
|
29
40
|
@ancestors = Array(keywords.fetch(:ancestors))
|
30
41
|
end
|
31
|
-
|
42
|
+
|
43
|
+
# @api public
|
44
|
+
# @return String The Fedora object's PID
|
45
|
+
attr_reader :pid
|
46
|
+
|
47
|
+
# @api public
|
48
|
+
#
|
49
|
+
# All of the direct parents of the Fedora document associated with the given PID.
|
50
|
+
#
|
51
|
+
# This does not include grandparents, great-grandparents, etc.
|
52
|
+
# @return Array<String>
|
53
|
+
attr_reader :parent_pids
|
54
|
+
|
55
|
+
# @api public
|
56
|
+
#
|
57
|
+
# All nodes in the graph are addressable by one or more pathnames.
|
58
|
+
#
|
59
|
+
# If I have A, with parent B, and B has parents C and D, we have the
|
60
|
+
# following pathnames:
|
61
|
+
# [D/B/A, C/B/A]
|
62
|
+
#
|
63
|
+
# In the graph representation, we can get to A by going from D to B to A, or by going from C to B to A.
|
64
|
+
# @return Array<String>
|
65
|
+
attr_reader :pathnames
|
66
|
+
|
67
|
+
# @api public
|
68
|
+
#
|
69
|
+
# All of the :pathnames of each of the document's ancestors. If I have A, with parent B, and B has
|
70
|
+
# parents C and D then we have the following ancestors:
|
71
|
+
# [D/B], [C/B]
|
72
|
+
#
|
73
|
+
# @return Array<String>
|
74
|
+
attr_reader :ancestors
|
32
75
|
|
33
76
|
def sorted_parent_pids
|
34
77
|
parent_pids.sort
|
@@ -21,6 +21,7 @@ module Curate
|
|
21
21
|
end
|
22
22
|
attr_reader :pid, :time_to_live, :queue, :adapter
|
23
23
|
|
24
|
+
# Perform a breadth-first tree traversal of the initial document and its descendants.
|
24
25
|
def call
|
25
26
|
enqueue(initial_index_document, time_to_live)
|
26
27
|
processing_document = dequeue
|
@@ -68,7 +69,9 @@ module Curate
|
|
68
69
|
end
|
69
70
|
|
70
71
|
# A small object that helps encapsulate the logic of building the hash of information regarding
|
71
|
-
# the initialization of an
|
72
|
+
# the initialization of a Curate::Indexer::Documents::IndexDocument
|
73
|
+
#
|
74
|
+
# @see Curate::Indexer::Documents::IndexDocument for details on pathnames, ancestors, and parent_pids.
|
72
75
|
class ParentAndPathAndAncestorsBuilder
|
73
76
|
def initialize(preservation_document, adapter)
|
74
77
|
@preservation_document = preservation_document
|
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: curate-indexer
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.2.
|
4
|
+
version: 0.2.3
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Jeremy Friesen
|
8
8
|
autorequire:
|
9
9
|
bindir: bin
|
10
10
|
cert_chain: []
|
11
|
-
date: 2016-
|
11
|
+
date: 2016-12-05 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: bundler
|
@@ -266,7 +266,8 @@ files:
|
|
266
266
|
- lib/curate/indexer/repository_reindexer.rb
|
267
267
|
- lib/curate/indexer/version.rb
|
268
268
|
homepage: https://github.com/ndlib/curate-indexer
|
269
|
-
licenses:
|
269
|
+
licenses:
|
270
|
+
- Apache-2.0
|
270
271
|
metadata: {}
|
271
272
|
post_install_message:
|
272
273
|
rdoc_options: []
|
@@ -289,4 +290,3 @@ signing_key:
|
|
289
290
|
specification_version: 4
|
290
291
|
summary: A playground for CurateND collections indexing
|
291
292
|
test_files: []
|
292
|
-
has_rdoc:
|