curate-indexer 0.2.2 → 0.2.3
- checksums.yaml +4 -4
- data/README.md +54 -0
- data/curate-indexer.gemspec +1 -0
- data/lib/curate/indexer.rb +18 -5
- data/lib/curate/indexer/configuration.rb +4 -1
- data/lib/curate/indexer/documents.rb +46 -3
- data/lib/curate/indexer/railtie.rb +1 -1
- data/lib/curate/indexer/relationship_reindexer.rb +4 -1
- data/lib/curate/indexer/version.rb +1 -1
- metadata +4 -4
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: f90a339727f8494a69331c91dabfead7a4d3fea5
+  data.tar.gz: b79f7029a6767747b52d48500f7fc8d567301f60
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: a5ece9e30c7760998a9d428c2cab96183eb00aa6f331b550ac0ca914c0140bed986a68694f2574fd75338e9b38d28a0a26865eefbc529ceabe8dfc7eaab63583
+  data.tar.gz: bc7a658c728db9aa803c3b44bc6ff04e94f2429dcdf500adff16a130652de6d1b49bc053b5e35aaade8388c1a8feceeaaef9f002df7ee697a3e22ea2b81211e9
data/README.md
CHANGED
@@ -6,4 +6,58 @@
 [](http://inch-ci.org/github/ndlib/curate-indexer)
 [](./LICENSE)

+The Curate::Indexer gem is responsible for indexing the graph relationship of objects. It maps a PreservationDocument to an IndexDocument by mapping a PreservationDocument's direct parents into the paths to get from a root document to the given PreservationDocument.
+
+# Background
+
 This is a sandbox to work through the reindexing strategy as it relates to [CurateND Collections](https://github.com/ndlib/curate_nd/issues/420). At this point the code is separate to allow for rapid testing and prototyping (no sense spinning up SOLR and Fedora to walk an arbitrary graph).
+
+# Concepts
+
+As we are indexing objects, we have two types of documents:
+
+1. [PreservationDocument](./lib/curate/indexer/documents.rb) - a light-weight representation of a Fedora object
+2. [IndexDocument](./lib/curate/indexer/documents.rb) - a light-weight representation of a SOLR document object
+
+We have four attributes to consider for indexing the graph:
+
+1. pid - the unique identifier for a document
+2. parent_pids - the pids for all of the parents of a given document
+3. pathnames - the paths to traverse from a root document to the given document
+4. ancestors - the pathnames of each of the ancestors
+
+See [Curate::Indexer::Documents::IndexDocument](./lib/curate/indexer/documents.rb) for further discussion.
+
+To reindex a single document, we leverage the [`Curate::Indexer.reindex_relationships`](./lib/curate/indexer.rb) method.
+
+# Examples
+
+Given the following PreservationDocuments:
+
+| PID | Parents |
+|-----|---------|
+| A | - |
+| B | - |
+| C | A |
+| D | A, B |
+| E | C |
+
+If we were to reindex the above PreservationDocuments, we will generate the following IndexDocuments:
+
+| PID | Parents | Pathnames | Ancestors |
+|-----|---------|------------|-----------|
+| A | - | [A] | [] |
+| B | - | [B] | [] |
+| C | A | [A/C] | [A] |
+| D | A, B | [A/D, B/D] | [A, B] |
+| E | C | [A/C/E] | [A/C] |
+
+For more scenarios, look at the [Reindex PID and Descendants specs](./spec/features/reindex_pid_and_descendants_spec.rb).
+
+# Adapters
+
+An [AbstractAdapter](./lib/curate/indexer/adapters/abstract_adapter.rb) provides the method interface for others to build against.
+
+The [InMemory adapter](./lib/curate/indexer/adapters/in_memory_adapter.rb) is a reference implementation (and used to ease testing overhead).
+
+CurateND has implemented the [following adapter](https://github.com/ndlib/curate_nd/blob/master/lib/curate/library_collection_indexing_adapter.rb) for its LibraryCollection indexing.
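Note on the README hunk above: the Pathnames and Ancestors columns of its example table can be reproduced with a short sketch. This is not the gem's code. The `PARENTS` hash and the `pathnames_for` and `ancestors_for` helpers are hypothetical names introduced here, the graph is assumed to be acyclic, and "ancestors" is taken to mean the pathnames of the direct parents, because that is what the table shows.

```ruby
# Hypothetical stand-in for the PreservationDocuments in the README table:
# each PID maps to the PIDs of its direct parents.
PARENTS = {
  'A' => [],
  'B' => [],
  'C' => ['A'],
  'D' => ['A', 'B'],
  'E' => ['C']
}.freeze

# Every path from a root document down to +pid+, joined with "/".
# Assumes the graph has no cycles (the gem guards against cycles separately).
def pathnames_for(pid, parents = PARENTS)
  parent_pids = parents.fetch(pid)
  return [pid] if parent_pids.empty?

  parent_pids.flat_map do |parent_pid|
    pathnames_for(parent_pid, parents).map { |path| "#{path}/#{pid}" }
  end
end

# The pathnames of each direct parent of +pid+, matching the Ancestors
# column of the README table.
def ancestors_for(pid, parents = PARENTS)
  parents.fetch(pid).flat_map { |parent_pid| pathnames_for(parent_pid, parents) }
end

pathnames_for('D') # => ["A/D", "B/D"]
ancestors_for('E') # => ["A/C"]
```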
data/curate-indexer.gemspec
CHANGED
@@ -12,6 +12,7 @@ Gem::Specification.new do |spec|
   spec.summary = %q{A playground for CurateND collections indexing}
   spec.description = %q{A playground for CurateND collections indexing}
   spec.homepage = "https://github.com/ndlib/curate-indexer"
+  spec.license = "Apache-2.0"

   spec.files = `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec|features)/}) }
   spec.bindir = "bin"
data/lib/curate/indexer.rb
CHANGED
@@ -5,7 +5,7 @@ require 'curate/indexer/configuration'
 require 'curate/indexer/railtie' if defined?(Rails)

 module Curate
-  # Responsible for
+  # Responsible for indexing an object and its related child objects.
   module Indexer
     # This assumes a rather deep graph
     DEFAULT_TIME_TO_LIVE = 15
@@ -18,7 +18,7 @@ module Curate
     # @return [Boolean] - It was successful
     # @raise Curate::Exceptions::CycleDetectionError - A potential cycle was detected
     def self.reindex_relationships(pid, time_to_live = DEFAULT_TIME_TO_LIVE)
-      RelationshipReindexer.call(pid: pid, time_to_live: time_to_live, adapter:
+      RelationshipReindexer.call(pid: pid, time_to_live: time_to_live, adapter: adapter)
       true
     end

@@ -34,10 +34,15 @@ module Curate
     # @return [Boolean] - It was successful
     # @raise Curate::Exceptions::CycleDetectionError - A potential cycle was detected
     def self.reindex_all!(time_to_live = DEFAULT_TIME_TO_LIVE)
-      RepositoryReindexer
+      # While the RepositoryReindexer is responsible for reindexing everything, I
+      # want to inject the lambda that will reindex a single item.
+      pid_reindexer = method(:reindex_relationships)
+      RepositoryReindexer.call(time_to_live: time_to_live, pid_reindexer: pid_reindexer, adapter: adapter)
       true
     end

+    # @api public
+    #
     # Contains the Curate::Indexer configuration information that is referenceable from wit
     # @see Curate::Indexer::Configuration
     def self.configuration
@@ -45,23 +50,31 @@ module Curate
     end

     # @api public
+    #
+    # Exposes the data adapter to use for the reindexing process.
+    #
+    # @see Curate::Indexer::Adapters::AbstractAdapter
+    # @return Object that implementes the Curate::Indexer::Adapters::AbstractAdapter method interface
     def self.adapter
       configuration.adapter
     end

     # @api public
+    #
+    # Capture the configuration information
+    #
     # @see Curate::Indexer::Configuration
     # @see .configuration
+    # @see Curate::Indexer::Railtie
     def self.configure(&block)
       @configuration_block = block
-      configure!
       # The Rails load sequence means that some of the configured Targets may
       # not be loaded; As such I am not calling configure! instead relying on
       # Curate::Indexer::Railtie to handle the configure! call
       configure! unless defined?(Rails)
     end

-    # @api
+    # @api private
     def self.configure!
       return false unless @configuration_block.respond_to?(:call)
       @configuration_block.call(configuration)
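Note on the `reindex_all!` change above: the new code passes `method(:reindex_relationships)` as `pid_reindexer:`. Strictly, `method` returns a `Method` object rather than a lambda, but both respond to `call`, which is all the receiving side needs. A minimal sketch of the same injection pattern follows. The `SketchIndexer` module and its methods are hypothetical stand-ins, not the gem's classes.

```ruby
# Hypothetical stand-in module illustrating the injection pattern only.
module SketchIndexer
  # Plays the role of the single-item reindexer.
  def self.reindex_one(pid)
    "reindexed #{pid}"
  end

  # Plays the role of a repository-wide reindexer: it only assumes that the
  # injected object responds to #call.
  def self.reindex_each(pids, pid_reindexer:)
    pids.map { |pid| pid_reindexer.call(pid) }
  end

  def self.reindex_all
    # Object#method wraps the singleton method in a callable Method object.
    reindex_each(%w[A B], pid_reindexer: method(:reindex_one))
  end
end

SketchIndexer.reindex_all # => ["reindexed A", "reindexed B"]
```

Injecting the callable keeps the repository-wide loop unaware of how a single item is reindexed, which also makes it easy to substitute a stub in specs.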
data/lib/curate/indexer/configuration.rb
CHANGED
@@ -11,8 +11,11 @@ module Curate

       private

+      IN_MEMORY_ADAPTER_WARNING_MESSAGE =
+        "WARNING: You are using the default Curate::Indexer::Adapters::InMemoryAdapter for the Curate::Indexer.adapter.".freeze
+
       def default_adapter
-        $stdout.puts
+        $stdout.puts IN_MEMORY_ADAPTER_WARNING_MESSAGE unless defined?(SUPPRESS_MEMORY_ADAPTER_WARNING)
         require 'curate/indexer/adapters/in_memory_adapter'
         Adapters::InMemoryAdapter
       end
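Note on the `default_adapter` change above: the warning is now skipped whenever a constant named `SUPPRESS_MEMORY_ADAPTER_WARNING` is defined, regardless of its value. The sketch below imitates that guard in isolation. `SketchConfig`, its shortened message, and the `io` parameter are hypothetical and exist only so the behaviour can be observed without the gem.

```ruby
require 'stringio'

# Hypothetical stand-in that mirrors the `unless defined?(...)` guard.
module SketchConfig
  WARNING = 'WARNING: using the in-memory adapter'.freeze

  def self.default_adapter(io = $stdout)
    # defined? only checks that the constant exists; its value is never read.
    io.puts WARNING unless defined?(SUPPRESS_MEMORY_ADAPTER_WARNING)
    :in_memory_adapter
  end
end

noisy = StringIO.new
SketchConfig.default_adapter(noisy) # constant not defined yet: warning is written

SUPPRESS_MEMORY_ADAPTER_WARNING = true
quiet = StringIO.new
SketchConfig.default_adapter(quiet) # constant defined: nothing is written
```

Under the same assumption, a test suite that relies on the in-memory adapter would presumably define the constant before the adapter is first resolved.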
data/lib/curate/indexer/documents.rb
CHANGED
@@ -12,10 +12,21 @@ module Curate
           @pid = keywords.fetch(:pid).to_s
           @parent_pids = Array(keywords.fetch(:parent_pids))
         end
-
+
+        # @api public
+        # @return String The Fedora object's PID
+        attr_reader :pid
+
+        # @api public
+        #
+        # All of the direct parents of the Fedora document associated with the given PID.
+        #
+        # This does not include grandparents, great-grandparents, etc.
+        # @return Array<String>
+        attr_reader :parent_pids
       end

-      # @api
+      # @api public
       #
       # A rudimentary representation of what is needed to reindex Solr documents
       class IndexDocument
@@ -28,7 +39,39 @@ module Curate
           @pathnames = Array(keywords.fetch(:pathnames))
           @ancestors = Array(keywords.fetch(:ancestors))
         end
-
+
+        # @api public
+        # @return String The Fedora object's PID
+        attr_reader :pid
+
+        # @api public
+        #
+        # All of the direct parents of the Fedora document associated with the given PID.
+        #
+        # This does not include grandparents, great-grandparents, etc.
+        # @return Array<String>
+        attr_reader :parent_pids
+
+        # @api public
+        #
+        # All nodes in the graph are addressable by one or more pathnames.
+        #
+        # If I have A, with parent B, and B has parents C and D, we have the
+        # following pathnames:
+        #   [D/B/A, C/B/A]
+        #
+        # In the graph representation, we can get to A by going from D to B to A, or by going from C to B to A.
+        # @return Array<String>
+        attr_reader :pathnames
+
+        # @api public
+        #
+        # All of the :pathnames of each of the documents ancestors. If I have A, with parent B, and B has
+        # parents C and D then we have the following ancestors:
+        #   [D/B], [C/B]
+        #
+        # @return Array<String>
+        attr_reader :ancestors

         def sorted_parent_pids
           parent_pids.sort
data/lib/curate/indexer/relationship_reindexer.rb
CHANGED
@@ -21,6 +21,7 @@ module Curate
       end
       attr_reader :pid, :time_to_live, :queue, :adapter

+      # Perform a bread-first tree traversal of the initial document and its descendants.
       def call
         enqueue(initial_index_document, time_to_live)
         processing_document = dequeue
@@ -68,7 +69,9 @@ module Curate
       end

       # A small object that helps encapsulate the logic of building the hash of information regarding
-      # the initialization of an
+      # the initialization of an Curate::Indexer::Documents::IndexDocument
+      #
+      # @see Curate::Indexer::Documents::IndexDocument for details on pathnames, ancestors, and parent_pids.
       class ParentAndPathAndAncestorsBuilder
         def initialize(preservation_document, adapter)
           @preservation_document = preservation_document
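Note on the new `call` comment above: it describes a breadth-first traversal of the initial document and its descendants, and the visible lines enqueue each document together with a `time_to_live`. The sketch below shows one way such a traversal can use a shrinking time-to-live as a cycle guard. It is an assumption-laden illustration, not the gem's implementation: the `CHILDREN` hash, `SketchCycleError`, and `breadth_first_pids` are hypothetical, and the real reindexer consults its adapter and raises `Curate::Exceptions::CycleDetectionError`.

```ruby
# Hypothetical child lookup; the gem asks its adapter instead.
CHILDREN = {
  'A' => %w[C D],
  'B' => %w[D],
  'C' => %w[E],
  'D' => [],
  'E' => []
}.freeze

# Hypothetical error class standing in for the gem's cycle detection error.
class SketchCycleError < StandardError; end

# Visit +pid+ and its descendants level by level. Each queued entry carries a
# time-to-live that shrinks by one per level; running out of it is treated as
# a likely cycle.
def breadth_first_pids(pid, time_to_live: 15, children: CHILDREN)
  visited = []
  queue = [[pid, time_to_live]]
  until queue.empty?
    current_pid, ttl = queue.shift
    raise SketchCycleError, "possible cycle at #{current_pid}" if ttl <= 0

    visited << current_pid
    children.fetch(current_pid, []).each do |child_pid|
      queue << [child_pid, ttl - 1]
    end
  end
  visited
end

breadth_first_pids('A') # => ["A", "C", "D", "E"]
```

A time-to-live is a blunt guard: it also rejects legitimate graphs deeper than the limit, which is presumably why the default of 15 is described in the source as assuming "a rather deep graph".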
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: curate-indexer
 version: !ruby/object:Gem::Version
-  version: 0.2.
+  version: 0.2.3
 platform: ruby
 authors:
 - Jeremy Friesen
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2016-
+date: 2016-12-05 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler
@@ -266,7 +266,8 @@ files:
 - lib/curate/indexer/repository_reindexer.rb
 - lib/curate/indexer/version.rb
 homepage: https://github.com/ndlib/curate-indexer
-licenses:
+licenses:
+- Apache-2.0
 metadata: {}
 post_install_message:
 rdoc_options: []
@@ -289,4 +290,3 @@ signing_key:
 specification_version: 4
 summary: A playground for CurateND collections indexing
 test_files: []
-has_rdoc: