scout-gear 10.8.4 → 10.9.0
- checksums.yaml +4 -4
- data/.vimproject +13 -0
- data/README.md +352 -0
- data/VERSION +1 -1
- data/doc/Association.md +288 -0
- data/doc/Entity.md +296 -0
- data/doc/KnowledgeBase.md +433 -0
- data/doc/Persist.md +356 -0
- data/doc/Semaphore.md +171 -0
- data/doc/TSV.md +449 -0
- data/doc/WorkQueue.md +359 -0
- data/doc/Workflow.md +586 -0
- data/lib/scout/association.rb +4 -2
- data/lib/scout/entity/identifiers.rb +1 -1
- data/lib/scout/entity/object.rb +1 -1
- data/lib/scout/entity/property.rb +5 -5
- data/lib/scout/entity.rb +1 -1
- data/lib/scout/knowledge_base/description.rb +1 -1
- data/lib/scout/knowledge_base/list.rb +7 -2
- data/lib/scout/knowledge_base/registry.rb +2 -2
- data/lib/scout/knowledge_base.rb +20 -2
- data/lib/scout/monitor.rb +10 -6
- data/lib/scout/persist/engine/packed_index.rb +2 -2
- data/lib/scout/persist/engine/sharder.rb +1 -1
- data/lib/scout/persist/tsv.rb +1 -0
- data/lib/scout/semaphore.rb +1 -1
- data/lib/scout/tsv/dumper.rb +3 -3
- data/lib/scout/tsv/open.rb +1 -0
- data/lib/scout/tsv/parser.rb +1 -1
- data/lib/scout/tsv/transformer.rb +1 -0
- data/lib/scout/tsv/util.rb +2 -2
- data/lib/scout/work_queue/socket.rb +1 -1
- data/lib/scout/work_queue/worker.rb +7 -5
- data/lib/scout/workflow/entity.rb +22 -1
- data/lib/scout/workflow/step/config.rb +3 -3
- data/lib/scout/workflow/step/file.rb +4 -0
- data/lib/scout/workflow/step/info.rb +8 -2
- data/lib/scout/workflow/step.rb +10 -5
- data/lib/scout/workflow/task/inputs.rb +1 -1
- data/lib/scout/workflow/usage.rb +3 -2
- data/lib/scout/workflow/util.rb +22 -0
- data/scout-gear.gemspec +16 -5
- data/scout_commands/cat +86 -0
- data/scout_commands/doc +3 -1
- data/scout_commands/entity +151 -0
- data/scout_commands/system/status +238 -0
- data/scout_commands/workflow/info +23 -10
- data/scout_commands/workflow/install +1 -1
- data/test/scout/entity/test_property.rb +1 -1
- data/test/scout/knowledge_base/test_registry.rb +19 -0
- data/test/scout/test_work_queue.rb +1 -1
- data/test/scout/work_queue/test_worker.rb +12 -10
- metadata +15 -4
- data/doc/lib/scout/path.md +0 -35
- data/doc/lib/scout/workflow/task.md +0 -13
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: 3900d5a0e546d494ae3224176db5feb92344dc1f6ba311461c74d02f5b137a9c
|
|
4
|
+
data.tar.gz: 35421139fea183944db794bad361bdfa9db78b086e13fab6367d6bbcfebc29e3
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: 679d05d7ae724825e18a2ce12862d08a3e75179f575a19ff960f4f379be6f2ce844326beca6e47d4bce3f585a4e637c71164b8642a2d61a85b2a54236f345153
|
|
7
|
+
data.tar.gz: c952339dfaa42d948a297a7f2c456d63bb4624d0ec71ee26499b79a2a25be8f32ec8c2b5ea2bd9cc17a664c50edb8f6cf22dec458ea103eb6e82614774973bdc
|
data/.vimproject
CHANGED
|
@@ -1,5 +1,15 @@
|
|
|
1
1
|
scout-gear=/$PWD filter="*.rb *.yaml" {
|
|
2
2
|
Rakefile
|
|
3
|
+
README.md
|
|
4
|
+
chats=chats filter="*"{
|
|
5
|
+
debug
|
|
6
|
+
document.rb
|
|
7
|
+
pipes
|
|
8
|
+
test_doc
|
|
9
|
+
doc=doc{
|
|
10
|
+
documenter.rb
|
|
11
|
+
}
|
|
12
|
+
}
|
|
3
13
|
bin=bin filter="*"{
|
|
4
14
|
scout
|
|
5
15
|
}
|
|
@@ -137,7 +147,9 @@ scout-gear=/$PWD filter="*.rb *.yaml" {
|
|
|
137
147
|
scout_commands=scout_commands filter="*"{
|
|
138
148
|
rbbt
|
|
139
149
|
alias
|
|
150
|
+
entity
|
|
140
151
|
find
|
|
152
|
+
cat
|
|
141
153
|
glob
|
|
142
154
|
log
|
|
143
155
|
doc
|
|
@@ -174,6 +186,7 @@ scout-gear=/$PWD filter="*.rb *.yaml" {
|
|
|
174
186
|
}
|
|
175
187
|
system=system{
|
|
176
188
|
clean
|
|
189
|
+
status
|
|
177
190
|
}
|
|
178
191
|
}
|
|
179
192
|
test=test {
|
data/README.md
ADDED
|
@@ -0,0 +1,352 @@
|
|
|
1
|
+
# Scout Gear
|
|
2
|
+
|
|
3
|
+
Scout Gear is the core, higher-level module set of the Scout framework. It bundles rich, production-grade data and workflow tooling built on top of the lower-level primitives in scout-essentials, and adds domain abstractions such as TSV processing, workflows, knowledge bases, entity typing, parallel work queues, and more.
|
|
4
|
+
|
|
5
|
+
Layering:
|
|
6
|
+
- scout-essentials: foundational utilities used everywhere (Path, Open, CMD, IndiferentHash, Persist, Resource, etc.)
|
|
7
|
+
- scout-gear (this repo): TSV, Workflow, KnowledgeBase, Entity/Association, WorkQueue, Semaphore, and glue code
|
|
8
|
+
- Additional packages:
|
|
9
|
+
- scout-camp: remote servers, cloud deployments, web interfaces, cross-site operations
|
|
10
|
+
- scout-ai: model training and chat agents
|
|
11
|
+
- scout-rig: connect with other languages (e.g., Python)
|
|
12
|
+
|
|
13
|
+
Related ecosystem:
|
|
14
|
+
- Rbbt (Ruby bioinformatics): Many of Scout’s ideas and utilities originated in Rbbt. It still provides a broad set of bioinformatics workflows and tools. See the Rbbt-Workflows organization for many real-world examples and usage patterns:
|
|
15
|
+
- https://github.com/Rbbt-Workflows
|
|
16
|
+
|
|
17
|
+
For module-specific guides, see doc/*.md in this repository (linked below).
|
|
18
|
+
|
|
19
|
+
- TSV: doc/TSV.md
|
|
20
|
+
- Workflow: doc/Workflow.md
|
|
21
|
+
- KnowledgeBase: doc/KnowledgeBase.md
|
|
22
|
+
- Association: doc/Association.md
|
|
23
|
+
- Entity: doc/Entity.md
|
|
24
|
+
- WorkQueue: doc/WorkQueue.md
|
|
25
|
+
- Semaphore: doc/Semaphore.md
|
|
26
|
+
|
|
27
|
+
Additionally, Scout Gear reuses and exposes core facilities from scout-essentials. Summaries of those core modules are included below for convenience.
|
|
28
|
+
|
|
29
|
+
---
|
|
30
|
+
|
|
31
|
+
## How command-line interfaces work (scout …)
|
|
32
|
+
|
|
33
|
+
Scout provides a single “scout” command that discovers and runs nested subcommands from any installed Scout package. Scripts are discovered using the Path subsystem across PATH-like roots, enabling workflows or packages to inject their own commands.
|
|
34
|
+
|
|
35
|
+
Basics:
|
|
36
|
+
- The CLI resolves terms left-to-right until a file is found under a scout_commands tree.
|
|
37
|
+
- Example: scout workflow task runs scout_commands/workflow/task
|
|
38
|
+
- Example: all TSV-related scripts are under scout_commands/tsv and can be listed with scout tsv
|
|
39
|
+
- If the path resolves to a directory instead of a script, a list of available subcommands in that directory is shown.
|
|
40
|
+
- Remaining ARGV is parsed by the selected script using SimpleOPT (SOPT) or compatible parsers.
|
|
41
|
+
- Because discovery uses Path maps, commands contributed by other packages or installed workflows are automatically found.
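
The left-to-right resolution described in the list above can be sketched in plain Ruby. This is only an illustration under stated assumptions: `resolve_command` is a hypothetical helper, it searches a single root directory instead of the Path maps the real CLI uses, and it is not the scout implementation.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical helper (not scout code): walk the terms left-to-right under
# one commands root until a regular file, the script, is found.
def resolve_command(root, argv)
  path = root
  argv.each_with_index do |term, i|
    candidate = File.join(path, term)
    if File.file?(candidate)
      # Script found: everything after it is left for the script to parse
      return { script: candidate, args: argv[(i + 1)..] }
    elsif File.directory?(candidate)
      path = candidate
    else
      break
    end
  end
  # No script reached: list what is available in the deepest directory found
  { subcommands: Dir.children(path).sort }
end

root = Dir.mktmpdir # stands in for a scout_commands tree
FileUtils.mkdir_p(File.join(root, "workflow"))
FileUtils.touch(File.join(root, "workflow", "task"))

p resolve_command(root, %w[workflow task Baking say])
p resolve_command(root, %w[workflow])
```

The first call resolves to the `workflow/task` script with the remaining terms as its arguments; the second stops at a directory and lists its subcommands.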
|
|
42
|
+
|
|
43
|
+
See the per-module CLI sections below for TSV, Workflow, and KnowledgeBase.
|
|
44
|
+
|
|
45
|
+
---
|
|
46
|
+
|
|
47
|
+
## Scout Essentials: Core building blocks
|
|
48
|
+
|
|
49
|
+
Scout Gear depends on the following main modules from scout-essentials. You’ll use these directly for filesystem/resource orchestration, external command execution, caching, and options handling.
|
|
50
|
+
|
|
51
|
+
### Path
|
|
52
|
+
|
|
53
|
+
doc/Path.md
|
|
54
|
+
|
|
55
|
+
Path is a lightweight, annotation-enabled “smart string” for composing and locating project resources across multiple search maps (current/user/global/lib/tmp, etc.). It integrates with Open and Persist.
|
|
56
|
+
|
|
57
|
+
Highlights:
|
|
58
|
+
- Path.setup("str") turns a String into a Path with join via [], /, or method_missing (path.foo.bar)
|
|
59
|
+
- Map logical locations to physical roots with path maps; find the first match across map order with path.find (and path.find_all)
|
|
60
|
+
- Filename helpers: get/set/replace/unset extensions; sanitize filenames; relative paths
|
|
61
|
+
- Directory helpers: glob and glob_all over maps; dirname/basename; realpath; newer?
|
|
62
|
+
- Digest summaries: path.digest_str summarizes files/dirs for logging/debugging
|
|
63
|
+
|
|
64
|
+
Usage:
|
|
65
|
+
```ruby
|
|
66
|
+
p = Path.setup('share/data/myfile')
|
|
67
|
+
p.find # resolve across configured maps
|
|
68
|
+
p[:subdir, :file] # joins => share/data/subdir/file
|
|
69
|
+
```
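
To make the "first match across map order" idea concrete, here is a minimal sketch that uses only the Ruby standard library. `find_first` and `find_all` are hypothetical helpers and the two temporary directories merely stand in for search maps; none of this is the Path API itself.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical helper (not the Path API): return the first existing
# location of `relative` when tried against each root in order.
def find_first(relative, roots)
  roots.map { |root| File.join(root, relative) }
       .find { |candidate| File.exist?(candidate) }
end

# Hypothetical variant that returns every existing location, in root order
def find_all(relative, roots)
  roots.map { |root| File.join(root, relative) }
       .select { |candidate| File.exist?(candidate) }
end

current = Dir.mktmpdir("current") # stands in for a "current" map
global  = Dir.mktmpdir("global")  # stands in for a "global" map
FileUtils.mkdir_p(File.join(global, "share/data"))
FileUtils.touch(File.join(global, "share/data/myfile"))

found = find_first("share/data/myfile", [current, global])
puts found.start_with?(global) # only the second root holds the file
```

Because the roots are tried in order, placing a file in an earlier root shadows copies in later ones.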
|
|
70
|
+
|
|
71
|
+
### Open
|
|
72
|
+
|
|
73
|
+
doc/Open.md
|
|
74
|
+
|
|
75
|
+
Open unifies file/stream/remote I/O, atomic writes, pipes/tees/FIFOs, (bg)zip helpers, rsync/sync, and lock handling.
|
|
76
|
+
|
|
77
|
+
Highlights:
|
|
78
|
+
- Open.open/read/write with auto-(de)compression for .gz/.bgz/.zip and remote urls (wget/ssh)
|
|
79
|
+
- Streams: open_pipe, tee_stream, consume_stream, with_fifo
|
|
80
|
+
- Safe writes: sensible_write (tmp + atomic rename + optional locks)
|
|
81
|
+
- Remote: wget with caching, ssh/scp, digest_url, remote cache
|
|
82
|
+
- Filesystem: mkdir/mkfiledir, mv/cp/ln/link_dir, rm/rm_rf, same_file?, exists?, writable?
|
|
83
|
+
- Locking: Open.lock wraps a robust Lockfile (NFS-safe) with refresh/timeout/steal
|
|
84
|
+
|
|
85
|
+
Example:
|
|
86
|
+
```ruby
|
|
87
|
+
Open.sensible_write("out.txt", Open.open("http://example.com"))
|
|
88
|
+
Open.with_fifo { |fifo| ... }
|
|
89
|
+
Open.rsync("src/", "user@server:dst/", delete: true)
|
|
90
|
+
```
|
|
91
|
+
|
|
92
|
+
### CMD
|
|
93
|
+
|
|
94
|
+
doc/CMD.md
|
|
95
|
+
|
|
96
|
+
CMD wraps Open3.popen3 with robust patterns for streaming, stderr logging, stdin feeding, auto-join of producers, and tool discovery/installation.
|
|
97
|
+
|
|
98
|
+
Highlights:
|
|
99
|
+
- CMD.cmd("tool args", pipe: true, in: io_or_string, stderr: Log::HIGH, autojoin: true)
|
|
100
|
+
- ConcurrentStream-enabled stdout with join/error propagation
|
|
101
|
+
- Convenience: CMD.bash("bash -l -c '...'"), cmd_pid/cmd_log
|
|
102
|
+
- Tool registry: CMD.tool, CMD.get_tool (auto-install via conda or producers), version scanning
|
|
103
|
+
|
|
104
|
+
Example:
|
|
105
|
+
```ruby
|
|
106
|
+
io = CMD.cmd("cut", "-f" => 2, "-d" => " ", in: "a b", pipe: true)
|
|
107
|
+
io.read # => "b\n"; io.join
|
|
108
|
+
```
|
|
109
|
+
|
|
110
|
+
### IndiferentHash
|
|
111
|
+
|
|
112
|
+
doc/IndiferentHash.md
|
|
113
|
+
|
|
114
|
+
Hash mixin for indifferent access (string/symbol keys equal), deep-merge, options parsing, and string<->hash conversions.
|
|
115
|
+
|
|
116
|
+
Highlights:
|
|
117
|
+
- IndiferentHash.setup(hash) to extend a single hash instance
|
|
118
|
+
- Access with h[:a] == h["a"]; delete/include? are indifferent
|
|
119
|
+
- Helpers: deep_merge, values_at with indifferent keys, slice, except
|
|
120
|
+
- Options utilities: parse_options, process_options, positional2hash, hash2string/string2hash
|
|
121
|
+
|
|
122
|
+
Example:
|
|
123
|
+
```ruby
|
|
124
|
+
opts = IndiferentHash.parse_options('limit=10 title="A title"')
|
|
125
|
+
opts[:title] # => "A title"
|
|
126
|
+
```
|
|
127
|
+
|
|
128
|
+
### Persist (core serialization/caching)
|
|
129
|
+
|
|
130
|
+
doc/Persist.md (essentials)
|
|
131
|
+
|
|
132
|
+
Typed serialization (json/yaml/marshal/binary/arrays), atomic saves, and the high-level persist pattern with locking and streaming.
|
|
133
|
+
|
|
134
|
+
Highlights:
|
|
135
|
+
- Persist.save/load(obj, file, type)
|
|
136
|
+
- Persist.persist(name, type, dir: ...) { compute_or_stream }
|
|
137
|
+
- Locking and tmp-to-final atomic writes
|
|
138
|
+
- Streaming tee: one copy to file, one to caller
|
|
139
|
+
- Memory cache: Persist.memory(name) { ... }
|
|
140
|
+
- Helpers to parse YAML/JSON/Marshal via Open
|
|
141
|
+
|
|
142
|
+
Example:
|
|
143
|
+
```ruby
|
|
144
|
+
val = Persist.persist("expensive", :json) { compute_hash }
|
|
145
|
+
# subsequent calls load cached JSON unless :update or stale
|
|
146
|
+
```
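
The compute-once, load-afterwards behaviour can be approximated with the standard library alone. In this sketch `persist_json` is a hypothetical stand-in, not the scout-essentials code: it covers only JSON and the tmp-then-rename write, and it leaves out locking, streaming, and staleness checks.

```ruby
require "json"
require "tmpdir"
require "fileutils"

# Hypothetical stand-in for the persist pattern: compute once, save as JSON,
# and load the saved copy on later calls unless update: true is passed.
def persist_json(name, dir:, update: false)
  file = File.join(dir, "#{name}.json")
  return JSON.parse(File.read(file)) if File.exist?(file) && !update

  value = yield
  FileUtils.mkdir_p(dir)
  tmp = "#{file}.tmp.#{Process.pid}"
  File.write(tmp, JSON.generate(value)) # write to a temporary file first
  File.rename(tmp, file)                # then rename into place
  value
end

cache_dir = Dir.mktmpdir
calls = 0
first  = persist_json("expensive", dir: cache_dir) { calls += 1; { "answer" => 42 } }
second = persist_json("expensive", dir: cache_dir) { calls += 1; { "answer" => 0 } }
puts first == second # the second call loads the cached JSON
puts calls           # the block ran only once
```

Writing to a temporary file and renaming it means readers never observe a half-written cache file, which is the same idea the tmp-to-final writes mentioned above rely on.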
|
|
147
|
+
|
|
148
|
+
### Resource
|
|
149
|
+
|
|
150
|
+
doc/Resource.md
|
|
151
|
+
|
|
152
|
+
Resource system to claim and produce files on demand (string/proc/url/rake/installers), integrated with Path/Open and locking.
|
|
153
|
+
|
|
154
|
+
Highlights:
|
|
155
|
+
- claim path => (:string, :proc, :url, :rake, :install)
|
|
156
|
+
- Produce on demand via path.produce and path.open/read
|
|
157
|
+
- Rake integration: drive file tasks/rules to generate outputs
|
|
158
|
+
- Install software into a per-resource “software” dir and update env
|
|
159
|
+
|
|
160
|
+
Example:
|
|
161
|
+
```ruby
|
|
162
|
+
module MyPkg
|
|
163
|
+
extend Resource
|
|
164
|
+
claim root.tmp.test.hello, :string, "Hello"
|
|
165
|
+
end
|
|
166
|
+
MyPkg.tmp.test.hello.read # produces if missing, then reads
|
|
167
|
+
```
|
|
168
|
+
|
|
169
|
+
Other essentials you’ll encounter:
|
|
170
|
+
- Annotation / AnnotatedArray / NamedArray: lightweight typed attributes on objects and arrays; named tuple-style rows
|
|
171
|
+
- ConcurrentStream: concurrency-aware streams with join/abort/callbacks
|
|
172
|
+
- SimpleOPT (SOPT): tiny CLI option DSL/parser; used by scout commands
|
|
173
|
+
- Log: leveled, colored logging; progress bars; fingerprint utilities
|
|
174
|
+
- TmpFile: temp files/dirs and stable tmp path generator for caches
|
|
175
|
+
|
|
176
|
+
---
|
|
177
|
+
|
|
178
|
+
## Scout Gear modules
|
|
179
|
+
|
|
180
|
+
Scout Gear builds on essentials to deliver domain abstractions and engines.
|
|
181
|
+
|
|
182
|
+
### TSV
|
|
183
|
+
|
|
184
|
+
doc/TSV.md
|
|
185
|
+
|
|
186
|
+
A flexible, typed table abstraction with robust parser, streaming dumper/transformer, parallel traversal, joins/attachments, identifier translation, on-disk persistence (TokyoCabinet/Tkrzw), and range/position indices.
|
|
187
|
+
|
|
188
|
+
Highlights:
|
|
189
|
+
- Shapes: :double, :list, :flat, :single; key_field + fields
|
|
190
|
+
- Parse TSV/CSV from files/streams/strings with rich header options (sep, type, cast, merge)
|
|
191
|
+
- Dumper/Transformer for streaming pipelines
|
|
192
|
+
- TSV.traverse(obj, cpus: N, into: …) for parallel iteration
|
|
193
|
+
- Attach, change_key, change_id, translate via identifier indices
|
|
194
|
+
- Persistence via TSVAdapter over HDB/BDB/Tkrzw/FWT/PKI/Sharder
|
|
195
|
+
- Streaming paste/concat/collapse utilities; filters with persisted sets
|
|
196
|
+
|
|
197
|
+
Example:
|
|
198
|
+
```ruby
|
|
199
|
+
tsv = TSV.open(path, persist: true, type: :double)
|
|
200
|
+
tsv.attach(other, complete: true)
|
|
201
|
+
index = TSV.index(tsv, target: "FieldA")
|
|
202
|
+
```
|
|
203
|
+
|
|
204
|
+
CLI (scout tsv):
|
|
205
|
+
- Scripts live under scout_commands/tsv; list with scout tsv
|
|
206
|
+
- Run a specific subcommand: scout tsv <subcommand> [options] [args...]
|
|
207
|
+
- If you hit a directory, available subcommands are listed
|
|
208
|
+
- Subcommands parse options with SOPT (see each script’s help)
|
|
209
|
+
|
|
210
|
+
### Workflow
|
|
211
|
+
|
|
212
|
+
doc/Workflow.md
|
|
213
|
+
|
|
214
|
+
A lightweight workflow engine. Define tasks with typed inputs and dependencies, create jobs (Steps), and run them with persistence, streaming, provenance, and orchestration under resource rules.
|
|
215
|
+
|
|
216
|
+
Highlights:
|
|
217
|
+
- input/dep/task DSL with helper methods; task_alias and overrides
|
|
218
|
+
- Jobs (Step): run/load/stream/join, info files, files_dir, provenance
|
|
219
|
+
- Orchestrator: schedule dependent jobs under cpus/IO constraints; retry recoverable errors; archive/erase deps per rules
|
|
220
|
+
- EntityWorkflow: entity-centric tasks and properties
|
|
221
|
+
- Queue helpers to enqueue and process jobs
|
|
222
|
+
|
|
223
|
+
Example:
|
|
224
|
+
```ruby
|
|
225
|
+
module Baking
|
|
226
|
+
extend Workflow
|
|
227
|
+
task :say => :string do |name| "Hi #{name}" end
|
|
228
|
+
end
|
|
229
|
+
|
|
230
|
+
Baking.job(:say, "Miguel").run # => "Hi Miguel"
|
|
231
|
+
```
|
|
232
|
+
|
|
233
|
+
CLI (scout workflow):
|
|
234
|
+
- List workflows: scout workflow list
|
|
235
|
+
- Run a task: scout workflow task <workflow> <task> [--jobname NAME] [input options...]
|
|
236
|
+
- Options include --fork, --nostream, --update, --printpath, --provenance, --clean, --recursive_clean, --override_deps, --deploy (serial|local|queue|SLURM|server)
|
|
237
|
+
- Show job info: scout workflow info <step_path> [--inputs|--recursive_inputs]
|
|
238
|
+
- Provenance: scout workflow prov <step_path> [--plot file.png] […]
|
|
239
|
+
- Trace execution: scout workflow trace <job-result> [options]
|
|
240
|
+
- Process queue: scout workflow process [filters] [--continuous] [--produce_cpus N] […]
|
|
241
|
+
|
|
242
|
+
You can also dispatch workflow-specific custom commands via:
|
|
243
|
+
- scout workflow cmd <workflow> <subcommand> … (discovers scripts under <workflow>/share/scout_commands/workflow)
|
|
244
|
+
|
|
245
|
+
### KnowledgeBase
|
|
246
|
+
|
|
247
|
+
doc/KnowledgeBase.md
|
|
248
|
+
|
|
249
|
+
A thin orchestrator around Association, TSV, Entity, and Persist to register multiple association databases, normalize/index them, query/traverse across them, manage entity lists, and generate markdown descriptions.
|
|
250
|
+
|
|
251
|
+
Highlights:
|
|
252
|
+
- Register databases with source/target specs and identifier files
|
|
253
|
+
- get_database/get_index (BDB-backed) with undirected options
|
|
254
|
+
- Query: all, subset (children/parents/neighbours), identify/translate entities
|
|
255
|
+
- Lists: save/load/delete/enumerate typed lists
|
|
256
|
+
- Traversal DSL: multi-hop path finding with wildcards/conditions
|
|
257
|
+
- Markdown descriptions from registry/README files
|
|
258
|
+
|
|
259
|
+
Example:
|
|
260
|
+
```ruby
|
|
261
|
+
kb = KnowledgeBase.new(Path.setup("var/kb"), "Hsa")
|
|
262
|
+
kb.register :brothers, datafile_test(:person).brothers, undirected: true
|
|
263
|
+
kb.children(:brothers, "Miki") # => ["Miki~Isa", ...]
|
|
264
|
+
```
|
|
265
|
+
|
|
266
|
+
CLI (scout kb):
|
|
267
|
+
- Configure KB: scout kb config [options] <name>
|
|
268
|
+
- Register DB: scout kb register [options] <name> <filename>
|
|
269
|
+
- Declare entities: scout kb entities <entity> <identifier_files>
|
|
270
|
+
- Show info: scout kb show [<name>]
|
|
271
|
+
- Query: scout kb query <name> <entity_spec>
|
|
272
|
+
- Lists: scout kb list [<list_name>]
|
|
273
|
+
- Traverse: scout kb traverse [options] "<rules,comma,separated>"
|
|
274
|
+
|
|
275
|
+
### Association
|
|
276
|
+
|
|
277
|
+
doc/Association.md
|
|
278
|
+
|
|
279
|
+
Utilities to normalize source/target field specifications from TSVs, open normalized association databases with optional identifier translation, and build pairwise “source~target” indices (optionally undirected). Also includes AssociationItem for entity-like behavior over pair strings and utilities to build incidence/adjacency matrices.
|
|
280
|
+
|
|
281
|
+
Example:
|
|
282
|
+
```ruby
|
|
283
|
+
idx = Association.index(file, source: "=>Name", target: "Parent=>Name", undirected: true)
|
|
284
|
+
idx.match("Clei") # => ["Clei~Guille"]
|
|
285
|
+
idx.to_matrix # boolean incidence matrix
|
|
286
|
+
```
|
|
287
|
+
|
|
288
|
+
### Entity
|
|
289
|
+
|
|
290
|
+
doc/Entity.md
|
|
291
|
+
|
|
292
|
+
Annotate plain values or arrays as entities with behavior-rich “properties”, automatic format mapping, identifier translation (Entity::Identified), array-aware property batching/caching, and persistence for property results via Persist.
|
|
293
|
+
|
|
294
|
+
Example:
|
|
295
|
+
```ruby
|
|
296
|
+
module Person
|
|
297
|
+
extend Entity
|
|
298
|
+
property :greet => :single do "Hi #{self}" end
|
|
299
|
+
end
|
|
300
|
+
Person.setup("Miki").greet
|
|
301
|
+
```
|
|
302
|
+
|
|
303
|
+
### WorkQueue
|
|
304
|
+
|
|
305
|
+
doc/WorkQueue.md
|
|
306
|
+
|
|
307
|
+
A multi-process work pipeline (forked workers + semaphore-guarded sockets) to parallelize processing over a stream of inputs, with robust error propagation.
|
|
308
|
+
|
|
309
|
+
Example:
|
|
310
|
+
```ruby
|
|
311
|
+
q = WorkQueue.new(4){|x| x * 2}
|
|
312
|
+
out = []; q.process{|y| out << y}
|
|
313
|
+
(1..100).each{|i| q.write i}; q.close; q.join
|
|
314
|
+
```
|
|
315
|
+
|
|
316
|
+
### Semaphore (ScoutSemaphore)
|
|
317
|
+
|
|
318
|
+
doc/Semaphore.md
|
|
319
|
+
|
|
320
|
+
Concurrency helpers based on POSIX named semaphores (via RubyInline C bindings), plus higher-level helpers to bound concurrency with forks/threads.
|
|
321
|
+
|
|
322
|
+
Example:
|
|
323
|
+
```ruby
|
|
324
|
+
ScoutSemaphore.with_semaphore(2) do |sem|
|
|
325
|
+
ScoutSemaphore.synchronize(sem){ critical_work }
|
|
326
|
+
end
|
|
327
|
+
```
|
|
328
|
+
|
|
329
|
+
---
|
|
330
|
+
|
|
331
|
+
## Examples and further reading
|
|
332
|
+
|
|
333
|
+
- This repository’s docs directory provides in-depth guides for each module:
|
|
334
|
+
- TSV: doc/TSV.md
|
|
335
|
+
- Workflow: doc/Workflow.md
|
|
336
|
+
- KnowledgeBase: doc/KnowledgeBase.md
|
|
337
|
+
- Association: doc/Association.md
|
|
338
|
+
- Entity: doc/Entity.md
|
|
339
|
+
- WorkQueue: doc/WorkQueue.md
|
|
340
|
+
- Semaphore: doc/Semaphore.md
|
|
341
|
+
- For numerous end-to-end examples and real datasets, explore the Rbbt-Workflows organization:
|
|
342
|
+
- https://github.com/Rbbt-Workflows
|
|
343
|
+
- For foundational utilities (Path, Open, CMD, IndiferentHash, Persist, Resource, etc.), consult the scout-essentials documentation:
|
|
344
|
+
- Those modules are summarized above and used pervasively across Scout Gear.
|
|
345
|
+
|
|
346
|
+
---
|
|
347
|
+
|
|
348
|
+
## Notes
|
|
349
|
+
|
|
350
|
+
- Streaming everywhere: many APIs return ConcurrentStream-enabled IOs. Always read to EOF and join (or rely on autojoin) to ensure producers exit and errors are surfaced.
|
|
351
|
+
- Atomicity and locking: Open.sensible_write and Persist.persist use tmp+mv and lockfiles to provide robust cross-process behavior.
|
|
352
|
+
- Discovery and composition: the Path subsystem and Resource claims make it easy to build portable projects with on-demand production of resources and discoverable commands.
|
data/VERSION
CHANGED
|
@@ -1 +1 @@
|
|
|
1
|
-
10.
|
|
1
|
+
10.9.0
|
data/doc/Association.md
ADDED
|
@@ -0,0 +1,288 @@
|
|
|
1
|
+
# Association
|
|
2
|
+
|
|
3
|
+
Association provides a compact toolkit to open, normalize, and index pairwise relationships from TSV-like sources. With it you can:
|
|
4
|
+
|
|
5
|
+
- Parse declarative source/target field specifications (including format remapping).
|
|
6
|
+
- Open an “association database” (TSV) that standardizes keys/fields and optional identifier translation via Entity/TSV indices.
|
|
7
|
+
- Build a fast BDB-backed index over pair “edges” using “source~target” keys, optionally undirected.
|
|
8
|
+
- Work with association “items” (pairs) as Entities with useful properties and conversions.
|
|
9
|
+
- Produce incidence/adjacency matrices and perform filtering/subsetting over pairs.
|
|
10
|
+
|
|
11
|
+
It integrates with:
|
|
12
|
+
- TSV (parsing, reordering, indices)
|
|
13
|
+
- Entity (format registry and identifier translation)
|
|
14
|
+
- Persist (caching/DB backends)
|
|
15
|
+
|
|
16
|
+
Sections:
|
|
17
|
+
- Field specification syntax and normalization
|
|
18
|
+
- Opening association databases
|
|
19
|
+
- Building and using association indices
|
|
20
|
+
- AssociationItem: entity properties over pairs
|
|
21
|
+
- Matrix utilities
|
|
22
|
+
- Examples
|
|
23
|
+
|
|
24
|
+
---
|
|
25
|
+
|
|
26
|
+
## Field specification syntax and normalization
|
|
27
|
+
|
|
28
|
+
Association accepts flexible “field specs” to declare which columns are source and target, optionally including header aliases and format conversions.
|
|
29
|
+
|
|
30
|
+
Syntax patterns (strings):
|
|
31
|
+
|
|
32
|
+
- "FieldName"
|
|
33
|
+
- Use the column named FieldName.
|
|
34
|
+
- "FieldName=~Header"
|
|
35
|
+
- Use field FieldName but present it as Header in outputs.
|
|
36
|
+
- "=~Header"
|
|
37
|
+
- No explicit field (infer from header or Entity format), but present as Header.
|
|
38
|
+
- "FieldName=>TargetFormat"
|
|
39
|
+
- Use FieldName and translate identifiers to TargetFormat (via TSV.translation_index / Entity identifiers).
|
|
40
|
+
- "FieldName=~Header=>TargetFormat"
|
|
41
|
+
- Full form; pick field, rename header, and convert identifiers.
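
As an illustration of the syntax patterns above, the following sketch splits a spec string into the documented `[field, header, final_format]` triple. `parse_spec` is a hypothetical helper written for this example; the library's own `Association.parse_field_specification` may handle more cases.

```ruby
# Hypothetical parser for the spec strings listed above (not the library code).
# Missing or empty parts come back as nil.
def parse_spec(spec)
  rest, final_format = spec.split("=>", 2) # optional "=>TargetFormat" suffix
  field, header = rest.to_s.split("=~", 2) # optional "=~Header" part
  [field, header, final_format].map { |part| part.nil? || part.empty? ? nil : part }
end

p parse_spec("FieldName")
p parse_spec("FieldName=~Header")
p parse_spec("=~Header")
p parse_spec("FieldName=>TargetFormat")
p parse_spec("FieldName=~Header=>TargetFormat")
```

Each call prints a three-element array; positions that the spec leaves out are `nil`, so callers can fall back to inference for them.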
|
|
42
|
+
|
|
43
|
+
Parsing and normalization helpers:
|
|
44
|
+
- Association.parse_field_specification(spec) -> [field, header, final_format]
|
|
45
|
+
- Association.normalize_specs(spec, all_fields=nil) -> normalized [field, header, format]
|
|
46
|
+
- If a field is not directly present but is a recognized Entity format, it tries to find a matching column within all_fields by that Entity.
|
|
47
|
+
|
|
48
|
+
Extract source/target specs:
|
|
49
|
+
- specs = Association.extract_specs(all_fields, options)
|
|
50
|
+
- options keys: :source, :target, :source_format, :target_format, :format (hash of entity_type -> default_target_format)
|
|
51
|
+
- Returns a Hash with:
|
|
52
|
+
- :source => [field, header, final_format]
|
|
53
|
+
- :target => [field, header, final_format]
|
|
54
|
+
- Infers default source/target when not provided:
|
|
55
|
+
- If both nil → source := key_field; target := first data field
|
|
56
|
+
- If source nil but target is key → source := first data field; and vice versa
|
|
57
|
+
|
|
58
|
+
Resolve headers and positions:
|
|
59
|
+
- Association.headers(all_fields, info_fields=nil, options)
|
|
60
|
+
- all_fields: [key_field, field1, ...]
|
|
61
|
+
- info_fields: extra value columns to keep besides target (defaults to “all” except source and target).
|
|
62
|
+
- Returns:
|
|
63
|
+
- [source_pos, field_pos, source_header, field_headers, source_format, target_format]
|
|
64
|
+
- Handles :format hash defaults per entity type, and honors explicit source/target formats.
|
|
65
|
+
|
|
66
|
+
---
|
|
67
|
+
|
|
68
|
+
## Opening association databases
|
|
69
|
+
|
|
70
|
+
Association.open coerces a TSV (file/Path/TSV) into a normalized association database with optional identifier translation.
|
|
71
|
+
|
|
72
|
+
```ruby
|
|
73
|
+
db = Association.open(
|
|
74
|
+
file_or_tsv,
|
|
75
|
+
source: "Wife (ID)=>Alias",
|
|
76
|
+
target: "Husband (ID)=>Name",
|
|
77
|
+
namespace: "person", # optional; replaces NAMESPACE placeholders in paths
|
|
78
|
+
type: :list # optional TSV type; inferred when not set
|
|
79
|
+
)
|
|
80
|
+
```
|
|
81
|
+
|
|
82
|
+
Behavior:
|
|
83
|
+
- Reads header and infers positions via headers(...).
|
|
84
|
+
- If target/source formats are specified:
|
|
85
|
+
- Builds translation indices from:
|
|
86
|
+
- TSV.identifier_files(file), Entity.identifier_files(format), and options[:identifiers].
|
|
87
|
+
- Rewrites keys/values to requested formats (e.g., “(ID)=>Name”).
|
|
88
|
+
- Produces a TSV with:
|
|
89
|
+
- key_field: resolved source field name (with “(format)” suffix if translated).
|
|
90
|
+
- fields: [resolved target field (with “(format)” if translated), plus remaining info_fields].
|
|
91
|
+
- type: inherited/passed (:double, :list, :flat, :single).
|
|
92
|
+
|
|
93
|
+
Namespace placeholder:
|
|
94
|
+
- When opening from a path string containing “NAMESPACE”, passing namespace: will substitute it:
|
|
95
|
+
- Example: ".../NAMESPACE/identifiers.tsv" -> ".../person/identifiers.tsv"
|
|
96
|
+
|
|
97
|
+
Persisted variant:
|
|
98
|
+
- Association.database(file, ...) wraps Association.open with Persist.tsv and a “BDB” engine:
|
|
99
|
+
- Returns a persistence-backed TSV (keys/fields/type saved with TSVAdapter).
|
|
100
|
+
- Options: any Association.open options plus :persist / persist_* (via IndiferentHash).
|
|
101
|
+
|
|
102
|
+
Examples:
|
|
103
|
+
- Simple open:
|
|
104
|
+
```ruby
|
|
105
|
+
db = Association.database(datadir.person.marriages,
|
|
106
|
+
source: "Wife", target: "Husband", persist: true)
|
|
107
|
+
db["Clei"]["Husband"] # => "Miguel"
|
|
108
|
+
db["Clei"]["Date"] # => "2021"
|
|
109
|
+
```
|
|
110
|
+
|
|
111
|
+
- Partial field + format:
|
|
112
|
+
```ruby
|
|
113
|
+
db = Association.database(datadir.person.marriages,
|
|
114
|
+
source: "Wife=>Alias", target: "Husband=>Name")
|
|
115
|
+
```
|
|
116
|
+
|
|
117
|
+
- Flat TSV:
|
|
118
|
+
```ruby
|
|
119
|
+
flat = datadir.person.parents.tsv(type: :flat, fields: ["Parent"])
|
|
120
|
+
db = Association.database(flat)
|
|
121
|
+
db["Miki"] # => %w(Juan Mariluz)
|
|
122
|
+
```
|
|
123
|
+
|
|
124
|
+
---
|
|
125
|
+
|
|
126
|
+
## Building and using association indices
|
|
127
|
+
|
|
128
|
+
Association.index materializes a BDB index over pairwise relations with keys of the form “source~target”. The index entries store the “info fields” (everything but the two endpoints) as a :list TSV.
|
|
129
|
+
|
|
130
|
+
```ruby
|
|
131
|
+
idx = Association.index(file_or_tsv,
|
|
132
|
+
source: "=>Name",
|
|
133
|
+
target: "Parent=>Name",
|
|
134
|
+
undirected: false, # true duplicates (source~target) and (target~source)
|
|
135
|
+
persist: true)
|
|
136
|
+
```
|
|
137
|
+
|
|
138
|
+
- Under the hood:
|
|
139
|
+
- Opens/normalizes the database with Association.open (or uses provided DB).
|
|
140
|
+
- Builds keys “[source]~[target]” and writes values (info fields) as a list.
|
|
141
|
+
- If undirected true (or same source/target column), writes both “[s]~[t]” and “[t]~[s]”.
|
|
142
|
+
|
|
143
|
+
- Return value:
|
|
144
|
+
- A BDB TSV extended with Association::Index, annotated with:
|
|
145
|
+
- source_field, target_field, undirected
|
|
146
|
+
- The index sets key_field to “SourceField~TargetField[~undirected]”.
|
|
147
|
+
|
|
148
|
+
- Methods on Association::Index:
|
|
149
|
+
- parse_key_field → sets source_field/target_field/undirected from key_field.
|
|
150
|
+
- match(entity) → returns all “source~target” keys whose source starts with entity (prefix-based).
|
|
151
|
+
- subset(source_list, target_spec)
|
|
152
|
+
- source_list: list of source entities or :all.
|
|
153
|
+
- target_spec: :all or list to filter by target side.
|
|
154
|
+
- Returns matching keys, handling undirected symmetry.
|
|
155
|
+
- reverse → returns a reversed index (keys swapped to “target~source”) persisted in a side file (.reverse).
|
|
156
|
+
- filter(value_field=nil, target_value=nil, &block)
|
|
157
|
+
- Without block: filter keys whose value_field is present (or equals target_value).
|
|
158
|
+
- With block: custom predicate over values (or key+values if value_field nil).
|
|
159
|
+
- to_matrix(value_field=nil) { |values| ... }
|
|
160
|
+
- Produces an incidence matrix TSV (rows: sources, columns: targets):
|
|
161
|
+
- If value_field provided, uses that column (or block mapping).
|
|
162
|
+
- Else boolean incidence.
|
|
163
|
+
|
|
164
|
+
Note:
|
|
165
|
+
- reverse persists its own DB with swapped key_field; it carries over annotations, unnamed flag, and undirected.
|
|
166
|
+
|
|
167
|
+
Example:
|
|
168
|
+
```ruby
|
|
169
|
+
idx = Association.index(datadir.person.brothers, undirected: true)
|
|
170
|
+
idx.match("Clei") # => ["Clei~Guille"]
|
|
171
|
+
idx.reverse.match("Clei") # => ["Clei~Guille"] (same when undirected)
|
|
172
|
+
idx.filter("Type", "mother")
|
|
173
|
+
idx.subset(["Miki","Guille"], :all) # some “source~target” keys
|
|
174
|
+
```
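
The key layout can also be shown without a database. The sketch below uses a plain Hash in place of the BDB-backed index; `build_pair_index`, `match_source`, and `reverse_index` are hypothetical names chosen for this example and only mimic the behaviour described above.

```ruby
# Hypothetical in-memory stand-in for a pair index (the real one is BDB-backed).
# Each row is [source, target, *info_fields].
def build_pair_index(rows, undirected: false)
  index = {}
  rows.each do |source, target, *info|
    index["#{source}~#{target}"] = info
    index["#{target}~#{source}"] = info if undirected # mirrored edge
  end
  index
end

# Prefix lookup on "entity~": keys whose source endpoint is `entity`
def match_source(index, entity)
  index.keys.select { |key| key.start_with?("#{entity}~") }
end

# Swap the endpoints of every key: "source~target" becomes "target~source"
def reverse_index(index)
  index.to_h { |key, info| [key.split("~", 2).reverse.join("~"), info] }
end

rows = [["Clei", "Guille", "sibling"], ["Miki", "Isa", "sibling"]]
idx = build_pair_index(rows, undirected: true)
p match_source(idx, "Clei")                # => ["Clei~Guille"]
p match_source(reverse_index(idx), "Clei") # => ["Clei~Guille"]
```

With `undirected: true` both orientations of each pair are stored, which is why the reversed index returns the same match.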
|
|
175
|
+
|
|
176
|
+
---
|
|
177
|
+
|
|
178
|
+
## AssociationItem: entity properties over pairs
|
|
179
|
+
|
|
180
|
+
AssociationItem is an Entity module that represents “pairs” as annotated strings “source~target”. You typically obtain such lists from index.keys, and then call properties on the annotated list.
|
|
181
|
+
|
|
182
|
+
Annotate:
|
|
183
|
+
- Association.index(file).keys returns raw strings; annotate them with AssociationItem.setup if needed, or use Index helpers that return annotated lists where applicable.
|
|
184
|
+
|
|
185
|
+
Properties (selected):
|
|
186
|
+
- name (single): "source~target" (returns friendly names using entity .name where available).
|
|
187
|
+
- full_name: database-prefixed “db:source~target” when database set.
|
|
188
|
+
- invert: swap endpoints (works on single or array); toggles reverse flag.
|
|
189
|
+
- namespace: forwarded from knowledge_base (if present).
|
|
190
|
+
- part (array2single): returns [source, "~", target] tuples for each pair.
|
|
191
|
+
- target / source (array2single): returns just target or source identifiers.
|
|
192
|
+
- target_type / source_type (both): resolve entity type names via knowledge_base target/source (requires a KnowledgeBase integration providing #source/#target/#undirected/#get_index/#index_fields).
|
|
193
|
+
- target_entity / source_entity: wrap target/source into Entity-typed values according to knowledge_base types.
|
|
194
|
+
- index(database=nil): resolve underlying index (delegates to knowledge_base.get_index).
|
|
195
|
+
- value (array2single): fetch info values for each pair from the index; returns NamedArrays.
|
|
196
|
+
- info_fields / info: helper for value lookups; info builds a Hash for each pair.
|
|
197
|
+
- tsv (array): emit a TSV for the pair list with columns: source_type, target_type, info_fields.
|
|
198
|
+
- filter(*args, &block): filter this pair list by calling select on the generated tsv.
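
To make the pair-oriented properties above concrete, here is a small plain-Ruby sketch of what part, source, target and invert compute on a "source~target" string. `Pair` is a hypothetical stand-in, not the AssociationItem module, and it ignores the knowledge_base integration entirely.

```ruby
# Hypothetical stand-in for a few AssociationItem properties; not the library's code.
Pair = Struct.new(:source, :target) do
  # Build a Pair from a "source~target" key.
  def self.parse(key)
    new(*key.split("~", 2))
  end

  # Mirrors the documented part tuple: [source, "~", target].
  def part
    [source, "~", target]
  end

  # Swap the endpoints.
  def invert
    Pair.new(target, source)
  end

  def to_s
    "#{source}~#{target}"
  end
end

pair = Pair.parse("Clei~Guille")
pair.part        # => ["Clei", "~", "Guille"]
pair.target      # => "Guille"
pair.invert.to_s # => "Guille~Clei"
```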
|
|
199
|
+
|
|
200
|
+
Utilities:
|
|
201
|
+
- AssociationItem.incidence(pairs, key_field="Source") { |pair| optional_value }
|
|
202
|
+
- Returns a TSV (list) with sources as rows and targets as columns; each cell holds the block's value, or a boolean when no block is given.
|
|
203
|
+
- AssociationItem.adjacency(pairs, key_field="Source") { |pair| value }
|
|
204
|
+
- Returns TSV (double) mapping source -> [Target, values].
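
The two shapes can be compared side by side in plain Ruby. This is a sketch that uses nested Hashes in place of TSVs; `incidence` and `adjacency` here are hypothetical top-level helpers that only imitate the documented row and column layout, and filling absent cells with false is an assumption.

```ruby
# Hypothetical helpers using plain Hashes in place of TSVs.

# Rows are sources, columns are targets; a cell is the block's value, or true.
def incidence(pairs)
  split = pairs.map { |pair| pair.split("~", 2) }
  targets = split.map(&:last).uniq
  matrix = {}
  pairs.zip(split).each do |pair, (source, target)|
    row = (matrix[source] ||= targets.to_h { |t| [t, false] })
    row[target] = block_given? ? yield(pair) : true
  end
  matrix
end

# Maps each source to two parallel lists: [targets, values].
def adjacency(pairs)
  pairs.each_with_object({}) do |pair, adj|
    source, target = pair.split("~", 2)
    targets, values = (adj[source] ||= [[], []])
    targets << target
    values << (block_given? ? yield(pair) : nil)
  end
end

pairs = ["Clei~Guille", "Miki~Isa", "Miki~Guille"]
inc = incidence(pairs)
adj = adjacency(pairs) { |pair| pair.length }

inc["Miki"] # => {"Guille"=>true, "Isa"=>true}
adj["Miki"] # => [["Isa", "Guille"], [8, 11]]
```

The incidence form is convenient for matrix-style lookups, while the adjacency form stays compact when most source and target combinations are absent.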
|
|
205
|
+
|
|
206
|
+
Convenience:
|
|
207
|
+
- TSV.incidence(tsv, **kwargs) builds an index with Association.index(...), takes its keys, and passes them to AssociationItem.incidence.
|
|
208
|
+
|
|
209
|
+
---
|
|
210
|
+
|
|
211
|
+
## Matrix utilities
|
|
212
|
+
|
|
213
|
+
Given an index:
|
|
214
|
+
- idx.to_matrix(value_field=nil) { |values| ... } → TSV list
|
|
215
|
+
- value_field omitted and no block → boolean incidence.
|
|
216
|
+
- With value_field → use that column (vector) as the cell value.
|
|
217
|
+
- With block → compute cell values programmatically.
|
|
218
|
+
|
|
219
|
+
Standalone:
|
|
220
|
+
- AssociationItem.incidence(pairs), as described above, for an incidence matrix.
|
|
221
|
+
- AssociationItem.adjacency for adjacency list.
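
The three cell modes of to_matrix can be sketched without the library. The code below assumes a Hash of "source~target" keys to info values plus a list of field names; the top-level `to_matrix`, the field names and the sample data are all hypothetical, and the choice of nil or false for absent cells is an assumption.

```ruby
# Hypothetical stand-in for an index: "source~target" => info values.
fields = ["Effect", "Score"]
index = {
  "SG1~TG1" => ["up", "0.9"],
  "SG1~TG2" => ["down", "0.4"],
  "SG2~TG2" => ["up", "0.7"],
}

def to_matrix(index, fields, value_field = nil)
  position = value_field && fields.index(value_field)
  # Absent cells: false for boolean incidence, nil otherwise (sketch assumption).
  empty = (position || block_given?) ? nil : false
  targets = index.keys.map { |key| key.split("~", 2).last }.uniq

  index.each_with_object({}) do |(key, values), matrix|
    source, target = key.split("~", 2)
    cell = position ? values[position] : values
    cell =
      if block_given?
        yield(cell)   # block mode: compute the cell programmatically
      elsif position
        cell          # value_field mode: use that column as is
      else
        true          # neither: boolean incidence
      end
    row = (matrix[source] ||= targets.to_h { |t| [t, empty] })
    row[target] = cell
  end
end

boolean = to_matrix(index, fields)
effects = to_matrix(index, fields, "Effect")
scores  = to_matrix(index, fields, "Score") { |value| value.to_f }
```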
|
|
222
|
+
|
|
223
|
+
---
|
|
224
|
+
|
|
225
|
+
## Examples
|
|
226
|
+
|
|
227
|
+
Parse specs:
|
|
228
|
+
```ruby
|
|
229
|
+
Association.parse_field_specification("=~Associated Gene Name=>Ensembl Gene ID")
|
|
230
|
+
# => [nil, "Associated Gene Name", "Ensembl Gene ID"]
|
|
231
|
+
|
|
232
|
+
Association.normalize_specs("TG=~Associated Gene Name=>Ensembl Gene ID", %w(SG TG Effect))
|
|
233
|
+
# => ["TG", "Associated Gene Name", "Ensembl Gene ID"]
|
|
234
|
+
|
|
235
|
+
Association.extract_specs(%w(SG TG Effect), source: "SG", target: "TG")
|
|
236
|
+
# => { source: ["SG", nil, nil], target: ["TG", nil, nil] }
|
|
237
|
+
```
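
The spec syntax in these examples reads as "field=~format=>target_format", with each part optional. The sketch below is inferred only from the documented outputs above; `parse_spec` is a hypothetical helper, not the library's implementation, and it does not cover any other input forms the real method may accept.

```ruby
# Sketch inferred from the documented outputs; not scout-gear's source.
def parse_spec(spec)
  # Everything after "=>" is the format to translate into.
  field_part, target_format = spec.split("=>", 2)
  # Before "=>": "field=~format", where either side may be missing.
  field, format = field_part.split("=~", 2)
  field = nil if field.nil? || field.empty?
  [field, format, target_format]
end

parse_spec("=~Associated Gene Name=>Ensembl Gene ID")
# => [nil, "Associated Gene Name", "Ensembl Gene ID"]
parse_spec("TG=~Associated Gene Name=>Ensembl Gene ID")
# => ["TG", "Associated Gene Name", "Ensembl Gene ID"]
parse_spec("SG")
# => ["SG", nil, nil]
```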
|
|
238
|
+
|
|
239
|
+
Open database (translate to human-readable names):
|
|
240
|
+
```ruby
|
|
241
|
+
db = Association.database(datadir.person.marriages,
|
|
242
|
+
source: "Wife (ID)=>Alias",
|
|
243
|
+
target: "Husband (ID)=>Name")
|
|
244
|
+
db["Clei"]["Husband"] # => "Miguel"
|
|
245
|
+
db["Clei"]["Date"] # => "2021"
|
|
246
|
+
```
|
|
247
|
+
|
|
248
|
+
Index and match:
|
|
249
|
+
```ruby
|
|
250
|
+
idx = Association.index(datadir.person.brothers, undirected: true)
|
|
251
|
+
idx.match("Clei") # => ["Clei~Guille"]
|
|
252
|
+
idx.subset(["Clei"], :all) # => ["Clei~Guille"]
|
|
253
|
+
idx.reverse.subset(["Guille"], :all) # => ["Guille~Clei"]
|
|
254
|
+
```
|
|
255
|
+
|
|
256
|
+
Filter:
|
|
257
|
+
```ruby
|
|
258
|
+
idx = Association.index(datadir.person.parents)
|
|
259
|
+
idx.filter('Type of parent', 'mother') # keys whose info field contains 'mother'
|
|
260
|
+
```
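
The filter modes described earlier (field present, field equal to a value, or a block predicate) can be sketched on a plain Hash. `filter_keys`, the sample keys and the exact equality test are assumptions made for illustration, not the library's behaviour.

```ruby
# Hypothetical stand-in for an index with a single info field.
parent_fields = ["Type of parent"]
parents = {
  "Miki~Juan"    => ["father"],
  "Miki~Mariluz" => ["mother"],
  "Clei~Unknown" => [""],
}

def filter_keys(index, fields, value_field = nil, target_value = nil)
  position = value_field && fields.index(value_field)
  raise ArgumentError, "unknown field #{value_field}" if value_field && position.nil?
  raise ArgumentError, "need a value_field or a block" unless position || block_given?

  selected = index.select do |key, values|
    value = position && values[position]
    if block_given?
      # Block sees the field value, or key and all values without a field.
      position ? yield(value) : yield(key, values)
    elsif target_value
      value == target_value
    else
      !(value.nil? || value.empty?)
    end
  end
  selected.keys
end

mothers = filter_keys(parents, parent_fields, "Type of parent", "mother")
typed   = filter_keys(parents, parent_fields, "Type of parent")
clei    = filter_keys(parents, parent_fields) { |key, _values| key.start_with?("Clei") }
```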
|
|
261
|
+
|
|
262
|
+
Incidence matrix:
|
|
263
|
+
```ruby
|
|
264
|
+
pairs = Association.index(datadir.person.brothers, undirected: true).keys
|
|
265
|
+
inc = AssociationItem.incidence(pairs)
|
|
266
|
+
inc["Clei"]["Guille"] # => true
|
|
267
|
+
```
|
|
268
|
+
|
|
269
|
+
List serializer handling:
|
|
270
|
+
```ruby
|
|
271
|
+
tsv = TSV.open <<~EOF
|
|
272
|
+
#: :sep=,#:type=:list
|
|
273
|
+
#lowcase,upcase,double,triple
|
|
274
|
+
a,A,aa,aaa
|
|
275
|
+
b,B,bb,bbb
|
|
276
|
+
EOF
|
|
277
|
+
i = Association.index(tsv)
|
|
278
|
+
i["a~A"] # => ['aa', 'aaa']
|
|
279
|
+
```
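
The shape of that result can be reproduced in plain Ruby. This sketch assumes the first column is the source, the second the target, and the remaining columns the info values, which matches the documented `i["a~A"]` output; it does not use TSV or Association at all.

```ruby
# Rows as they appear in the list-type TSV above, already split on ",".
rows = [%w[a A aa aaa], %w[b B bb bbb]]

# Key each row by "source~target"; the remaining columns become the values.
sketch_index = rows.to_h { |source, target, *info| ["#{source}~#{target}", info] }

sketch_index["a~A"] # => ["aa", "aaa"]
```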
|
|
280
|
+
|
|
281
|
+
---
|
|
282
|
+
|
|
283
|
+
## Notes and edge cases
|
|
284
|
+
|
|
285
|
+
- undirected default: if source_field == target_field, undirected is assumed true; else false unless set.
|
|
286
|
+
- When specifying formats, ensure identifier TSVs are reachable. You can pass :identifiers (TSV/Path) or rely on TSV.identifier_files(file) and Entity.identifier_files(format).
|
|
287
|
+
- Association.index returns a BDB-backed TSV; reverse indexing persists to a side .reverse database next to the main DB.
|
|
288
|
+
- In paths, the placeholder [NAMESPACE] or NAMESPACE is replaced with options[:namespace].
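
Two of these notes are simple enough to sketch in plain Ruby: the undirected default and the namespace substitution. Both helpers below are hypothetical, as are the example path and namespace value; they illustrate the stated rules rather than the library's code.

```ruby
# undirected defaults to true only when source and target fields coincide,
# unless the caller sets it explicitly.
def undirected?(source_field, target_field, undirected = nil)
  undirected.nil? ? source_field == target_field : undirected
end

# Replace either placeholder form with the namespace.
def resolve_namespace(path, namespace)
  path.gsub(/\[?NAMESPACE\]?/, namespace)
end

undirected?("Gene", "Gene")              # => true
undirected?("Gene", "Disease")           # => false
undirected?("Gene", "Gene", false)       # => false
resolve_namespace("share/[NAMESPACE]/pairs.tsv", "Hsa") # => "share/Hsa/pairs.tsv"
```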
|