pacer-xml 0.2.1-java → 0.2.2-java
- data/.gitignore +5 -0
- data/Gemfile +6 -0
- data/Rakefile +2 -0
- data/Readme.markdown +172 -0
- data/lib/pacer-xml/build_graph.rb +216 -0
- data/lib/pacer-xml/nokogiri_node.rb +148 -0
- data/lib/pacer-xml/sample.rb +107 -0
- data/lib/pacer-xml/string_route.rb +50 -0
- data/lib/pacer-xml/version.rb +7 -0
- data/lib/pacer-xml/xml_route.rb +129 -0
- data/lib/pacer-xml.rb +48 -0
- data/pacer-xml.gemspec +24 -0
- metadata +15 -3
data/Gemfile
ADDED
data/Rakefile
ADDED
data/Readme.markdown
ADDED
|
@@ -0,0 +1,172 @@
|
|
|
1
|
+
pacer-xml
|
|
2
|
+
=========
|
|
3
|
+
|
|
4
|
+
This Pacer plugin is designed to make it dead-simple to import any
|
|
5
|
+
arbitrary XML file (no matter how bizarre) into any graph database
|
|
6
|
+
supported by Pacer.
|
|
7
|
+
|
|
8
|
+
This library evolved out of my need to be able to easily pull in sample
|
|
9
|
+
data when demoing Pacer. GraphML is pretty rare and what I've been able
|
|
10
|
+
to find is mostly pretty lame anyway, but raw XML seems to be everywhere
|
|
11
|
+
(just check out [DATA.GOV](http://www.data.gov/)).
|
|
12
|
+
|
|
13
|
+
|
|
14
|
+
Usage
|
|
15
|
+
-----
|
|
16
|
+
|
|
17
|
+
I suggest looking at the implementation of the sample below to see how
|
|
18
|
+
I've used pacer-xml there.
|
|
19
|
+
|
|
20
|
+
There are 2 key methods:
|
|
21
|
+
|
|
22
|
+
`Pacer.xml(file, start_section = nil, end_section = nil)`
|
|
23
|
+
|
|
24
|
+
```
|
|
25
|
+
file: String | IO
|
|
26
|
+
String path to an xml file to read
|
|
27
|
+
IO an open resource that responds to #each_line
|
|
28
|
+
start_section: String | Symbol | Regex | Proc (optional)
|
|
29
|
+
String | Symbol name of xml tag to use as the root node of each
|
|
30
|
+
section of xml. The end_section will automatically be
|
|
31
|
+
set to the closing tag. This uses very simple regex
|
|
32
|
+
matching.
|
|
33
|
+
Regex If it matches, start the section from this line
|
|
34
|
+
Proc proc { |line| }
|
|
35
|
+
If it results in a truthy value, starts collecting
|
|
36
|
+
lines for the next section of xml.
|
|
37
|
+
end_section: Regex | Proc (optional)
|
|
38
|
+
Regex If it matches, end the section including this line
|
|
39
|
+
Proc proc { |line, lines| }
|
|
40
|
+
- Return a truthy value to indicate that the
|
|
41
|
+
current line is the last line in a section.
|
|
42
|
+
- Return an Array to pass the result of
|
|
43
|
+
joining the array to Nokogiri for the next section.
|
|
44
|
+
```
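
To make the String | Symbol form of `start_section` concrete, here is a minimal plain-Ruby sketch of line-by-line sectioning. It does not use Pacer; `xml_sections` is a hypothetical helper named for this example, and it only assumes what is described above: a simple regex for the opening tag, with the closing tag used automatically as the end of the section.

```ruby
require 'stringio'

# Hypothetical helper, not part of pacer-xml: collect the text of every
# section that starts on a line containing <tag and ends on the line
# containing </tag. The closing line is included in the section.
def xml_sections(io, tag)
  name       = Regexp.escape(tag.to_s)
  start_rule = /<#{name}\b/
  end_rule   = /<\/#{name}\b/
  sections = []
  lines = nil
  io.each_line do |line|
    # Begin collecting when the opening tag shows up.
    lines = [] if lines.nil? && line =~ start_rule
    next unless lines
    lines << line
    # Close the section on the line holding the closing tag.
    if line =~ end_rule
      sections << lines.join
      lines = nil
    end
  end
  sections
end

xml = StringIO.new(<<XML)
<?xml version="1.0"?>
<patents>
<patent id="1">
  <title>First</title>
</patent>
<patent id="2">
  <title>Second</title>
</patent>
</patents>
XML

xml_sections(xml, :patent).each { |section| puts section, '--' }
```

The `\b` in both patterns keeps `<patents>` from being mistaken for `<patent`. Each string produced this way could then be handed to `Nokogiri::XML` on its own, which is the streaming idea the parameters above describe.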
|
|
45
|
+
|
|
46
|
+
If the parser is building a section when it gets to the end of the file,
|
|
47
|
+
it will call `end_section.call(nil, lines)`. To prevent the final
|
|
48
|
+
section from being processed, return `[]`.
|
|
49
|
+
|
|
50
|
+
Returns a Pacer Route to a series of Nokogiri::XML::Elements. Each
|
|
51
|
+
element is the root element of its document. By default, chunks are
|
|
52
|
+
delimited by the presence of `<?xml`.
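
The default chunking can be pictured with plain Ruby. This is a sketch of the idea only, not the plugin's code path: `split_documents` is a hypothetical helper, and unlike a real route it works on an in-memory array rather than a stream.

```ruby
# Hypothetical helper: group lines into documents, starting a new document
# at every line that contains an XML declaration.
def split_documents(lines)
  lines.slice_before { |line| line =~ /<\?xml/ }.map(&:join)
end

# Several XML documents concatenated into one file, as in the patent dumps.
lines = [
  %(<?xml version="1.0"?>\n),
  %(<grant id="1"/>\n),
  %(<?xml version="1.0"?>\n),
  %(<grant id="2"/>\n)
]

split_documents(lines).each_with_index do |doc, i|
  puts "document #{i + 1}: #{doc.lines.length} lines"
end
```

One difference to keep in mind: in this sketch any lines that appear before the first `<?xml` would form a chunk of their own.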
|
|
53
|
+
|
|
54
|
+
|
|
55
|
+
`xml_route.import(graph, opts = {})`
|
|
56
|
+
|
|
57
|
+
```
|
|
58
|
+
graph: PacerGraph The graph to load the data into.
|
|
59
|
+
opts: Hash
|
|
60
|
+
:cache false | Hash
|
|
61
|
+
false disable caching
|
|
62
|
+
stats: true enable occasional dump of cache info
|
|
63
|
+
:rename Hash map of { 'old-name' => 'new-name' }
|
|
64
|
+
:html Array set of tag names to treat as containing HTML
|
|
65
|
+
:skip Array set of tag or attribute names to skip
|
|
66
|
+
```
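
As a rough illustration of what `:rename` and `:skip` do to the fields of a single element, here is a simplified plain-Ruby sketch. `clean_fields` is a hypothetical name; the defaulting rename table (with `'id'` mapped to `'identifier'`) follows `build_rename` in `lib/pacer-xml/build_graph.rb`, but the rest is condensed and leaves out the `:html` handling.

```ruby
require 'set'

# Rename table that falls back to the original name for unknown keys.
def build_rename(custom = {})
  renames = Hash.new { |h, k| h[k] = k.to_s }
  renames['id'] = 'identifier'
  renames.merge!(custom || {})
end

# Hypothetical helper: apply rename and skip rules to one element's fields.
def clean_fields(fields, rename: {}, skip: [])
  renames = build_rename(rename)
  skipped = skip.map(&:to_s).to_set
  result = {}
  fields.each do |name, value|
    new_name = renames[name]
    # The element type is itself a name, so it is renamed as well.
    value = renames[value] if name == 'type'
    result[new_name] = value unless skipped.include?(new_name)
  end
  result
end

fields = { 'type' => 'us-patent-grant', 'id' => 'US-1', 'lang' => 'EN', 'status' => 'x' }
p clean_fields(fields, rename: { 'us-patent-grant' => 'patent-version' }, skip: [:status])
```

In this sketch the skip names are checked after renaming, so a skipped name refers to the key as it would appear in the graph.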
|
|
67
|
+
|
|
68
|
+
Baked-in Sample
|
|
69
|
+
---------------
|
|
70
|
+
|
|
71
|
+
This library started out with me tackling a chunk of [Patent Grants](https://explore.data.gov/Business-Enterprise/Patent-Grant-Bibliographic-Text-1976-Present-/8du5-jxih)
|
|
72
|
+
data, and my first attempt at importing it was with a hand-crafted set
|
|
73
|
+
of rules that walked the XML, creating graph elements along the way.
|
|
74
|
+
That was fairly painful and turned out to be very slow as well. My
|
|
75
|
+
second attempt evolved into this tool. The cool thing is that by the
|
|
76
|
+
end, everything specific to the patent grants data set was just a few
|
|
77
|
+
lines of configuration on top of a very powerful streaming XML parsing
|
|
78
|
+
tool.
|
|
79
|
+
|
|
80
|
+
I encourage you to check out the sample data. Simply install this gem
|
|
81
|
+
and start up IRB, then:
|
|
82
|
+
|
|
83
|
+
```ruby
|
|
84
|
+
require 'pacer-xml'
|
|
85
|
+
|
|
86
|
+
graph = PacerXml::Sample.load_100
|
|
87
|
+
```
|
|
88
|
+
|
|
89
|
+
That will download and extract a 100 MB XML file full of 2 weeks of patent
|
|
90
|
+
grants data, then create a graph with the first 100 patents, including
|
|
91
|
+
every piece of data in the file.
|
|
92
|
+
|
|
93
|
+
I encourage you to take a look at [how it was done](https://github.com/xnlogic/pacer-xml/blob/master/lib/pacer-xml/sample.rb).
|
|
94
|
+
|
|
95
|
+
Once you've created a graph from the data, it may be useful for you to
|
|
96
|
+
check out how it's structured. Pacer's got a handy tool built in to do
|
|
97
|
+
that, `Pacer::Utils::GraphAnalysis.structure graph`, but let's go one
|
|
98
|
+
step further and visually analyze the graph. If we run the command
|
|
99
|
+
below, we'll see the same results as the GraphAnalysis, but it will
|
|
100
|
+
export a graphml file that we can load into yEd, an excellent free graph
|
|
101
|
+
visualization tool:
|
|
102
|
+
|
|
103
|
+
```ruby
|
|
104
|
+
PacerXml::Sample.structure! graph
|
|
105
|
+
# ... lots of output ...
|
|
106
|
+
#=> #<PacerGraph tinkergraph[vertices:90 edges:112]>
|
|
107
|
+
```
|
|
108
|
+
|
|
109
|
+
The new file in your working directory is called
|
|
110
|
+
`patent-structure.graphml`. Open that file in yEd. You'll see a single
|
|
111
|
+
box... Fortunately, laying it out is fairly simple:
|
|
112
|
+
|
|
113
|
+
1. Tools / Fit Node To Label
|
|
114
|
+
1. OK
|
|
115
|
+
1. Layout / Hierarchical...
|
|
116
|
+
1. Labelling Tab / set Edge Labelling to Hierarchic
|
|
117
|
+
1. OK
|
|
118
|
+
|
|
119
|
+
Cool!
|
|
120
|
+
|
|
121
|
+
Contextual Help
|
|
122
|
+
---------------
|
|
123
|
+
|
|
124
|
+
Back to Pacer: there's lots to learn about it. The best way to do
|
|
125
|
+
that is to use Pacer's own inline help:
|
|
126
|
+
|
|
127
|
+
* Use `Pacer.help` for general help
|
|
128
|
+
* Get into a general section with `Pacer.help :section`
|
|
129
|
+
* Get contextual help with `graph.v.map.help`
|
|
130
|
+
* Get more contextual help with `graph.v.map.help :section`
|
|
131
|
+
|
|
132
|
+
Contextual help was only added recently, so it's not complete yet, but
|
|
133
|
+
it's developing quickly and contributions are very welcome!
|
|
134
|
+
|
|
135
|
+
More
|
|
136
|
+
-----
|
|
137
|
+
|
|
138
|
+
To play with the xml tools themselves, try out the following commands:
|
|
139
|
+
|
|
140
|
+
```ruby
|
|
141
|
+
xml_route = PacerXml::Sample.xml(nil, start_rule, end_rule)
|
|
142
|
+
|
|
143
|
+
importer = PacerXml::Sample.importer
|
|
144
|
+
```
|
|
145
|
+
|
|
146
|
+
Performance Notes
|
|
147
|
+
-----------------
|
|
148
|
+
|
|
149
|
+
This section uses the `PacerXml::Sample.load_all` method. The `load_100`
|
|
150
|
+
method runs in just a couple of seconds.
|
|
151
|
+
|
|
152
|
+
The default sample file contains 3019840 lines representing 4479
|
|
153
|
+
documents. Running under the simple `bundle exec irb` command on a MBP
|
|
154
|
+
2.3 GHz i7, here are some quick timings (in seconds) for operations on
|
|
155
|
+
the entire file:
|
|
156
|
+
|
|
157
|
+
```
|
|
158
|
+
=> 8.36 iterate through 3019840 lines
|
|
159
|
+
=> 28.534 reduce the lines to 4479 arrays of lines
|
|
160
|
+
=> 29.753 join each array of lines into a string
|
|
161
|
+
=> 34.788 parse each string into a Nokogiri XML document
|
|
162
|
+
=> 812.732 create a graph, producing 494659 vertices and 629690 edges
|
|
163
|
+
```
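
To put those figures in perspective, the sketch below turns them into rough throughput numbers. The counts and timings are copied from the listing above; `per_second` is a helper named for this example. Whether each timing includes the earlier stages is not stated, so each figure is treated here simply as the wall-clock time of the run that produced it.

```ruby
# Counts and timings (in seconds) copied from the listing above.
DOCUMENTS = 4479
VERTICES  = 494_659
EDGES     = 629_690
PARSE_SECONDS  = 34.788
IMPORT_SECONDS = 812.732

# Items processed per second of wall-clock time.
def per_second(count, seconds)
  count / seconds.to_f
end

puts format('%.1f documents/s when only parsing', per_second(DOCUMENTS, PARSE_SECONDS))
puts format('%.1f documents/s when importing',    per_second(DOCUMENTS, IMPORT_SECONDS))
puts format('%.0f vertices/s when importing',     per_second(VERTICES, IMPORT_SECONDS))
puts format('%.0f edges/s when importing',        per_second(EDGES, IMPORT_SECONDS))
puts format('%.0f vertices per document',         VERTICES / DOCUMENTS.to_f)
```

That works out to roughly 5.5 documents per second for the full import, so graph creation rather than line reading or XML parsing accounts for nearly all of the run time.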
|
|
164
|
+
|
|
165
|
+
Starting up with `bundle exec jruby --server -J-Xmx2048m -S irb`
|
|
166
|
+
slightly improves performance of the import but does not appear to
|
|
167
|
+
affect Pacer or Nokogiri's performance:
|
|
168
|
+
|
|
169
|
+
```
|
|
170
|
+
=> 34.857 parsed XML documents
|
|
171
|
+
=> 780.828 created graph
|
|
172
|
+
```
|
|
@@ -0,0 +1,216 @@
|
|
|
1
|
+
require 'set'
|
|
2
|
+
|
|
3
|
+
module PacerXml
|
|
4
|
+
class GraphVisitor
|
|
5
|
+
class << self
|
|
6
|
+
def build_rename(custom = {})
|
|
7
|
+
h = Hash.new { |h, k| h[k] = k.to_s }
|
|
8
|
+
h['id'] = 'identifier'
|
|
9
|
+
h.merge! custom if custom
|
|
10
|
+
h
|
|
11
|
+
end
|
|
12
|
+
end
|
|
13
|
+
|
|
14
|
+
attr_reader :graph
|
|
15
|
+
attr_accessor :depth, :documents
|
|
16
|
+
attr_reader :rename, :html, :skip
|
|
17
|
+
|
|
18
|
+
def initialize(graph, opts = {})
|
|
19
|
+
@documents = 0
|
|
20
|
+
@graph = graph
|
|
21
|
+
# treat tag as a property containing html
|
|
22
|
+
@html = (opts[:html] || []).map(&:to_s).to_set
|
|
23
|
+
# skip property or tag
|
|
24
|
+
@skip = (opts[:skip] || []).map(&:to_s).to_set
|
|
25
|
+
# rename type or property
|
|
26
|
+
@rename = self.class.build_rename(opts[:rename])
|
|
27
|
+
end
|
|
28
|
+
|
|
29
|
+
def build(doc)
|
|
30
|
+
self.documents += 1
|
|
31
|
+
self.depth = 0
|
|
32
|
+
if doc.is_a? Nokogiri::XML::Document
|
|
33
|
+
visit_element doc.first_element_child
|
|
34
|
+
elsif doc.element?
|
|
35
|
+
visit_element doc
|
|
36
|
+
elsif doc.is_a? Enumerable
|
|
37
|
+
doc.select(&:element?).each { |e| visit_element e }
|
|
38
|
+
else
|
|
39
|
+
fail "Don't know what you want to do"
|
|
40
|
+
end
|
|
41
|
+
end
|
|
42
|
+
|
|
43
|
+
def visit_vertex_fields(e)
|
|
44
|
+
h = e.fields
|
|
45
|
+
h['type'] = rename[h['type']]
|
|
46
|
+
rename.each do |from, to|
|
|
47
|
+
if h.key? from
|
|
48
|
+
h[to] = h.delete from
|
|
49
|
+
end
|
|
50
|
+
end
|
|
51
|
+
html.each do |name|
|
|
52
|
+
name = rename[name]
|
|
53
|
+
child = e.at_xpath(name)
|
|
54
|
+
h[name] = child.inner_html if child
|
|
55
|
+
end
|
|
56
|
+
skip.each do |name|
|
|
57
|
+
h.delete name
|
|
58
|
+
end
|
|
59
|
+
h
|
|
60
|
+
end
|
|
61
|
+
|
|
62
|
+
def visit_edge_fields(e)
|
|
63
|
+
h = visit_vertex_fields(e)
|
|
64
|
+
h.delete 'type'
|
|
65
|
+
h
|
|
66
|
+
end
|
|
67
|
+
|
|
68
|
+
def tell(x)
|
|
69
|
+
print(' ' * depth) if depth
|
|
70
|
+
if x.is_a? Hash or x.is_a? Array
|
|
71
|
+
p x
|
|
72
|
+
else
|
|
73
|
+
puts x
|
|
74
|
+
end
|
|
75
|
+
end
|
|
76
|
+
|
|
77
|
+
def skip?(e)
|
|
78
|
+
skip.include? e.name or html.include? e.name
|
|
79
|
+
end
|
|
80
|
+
|
|
81
|
+
def level
|
|
82
|
+
self.depth += 1
|
|
83
|
+
yield
|
|
84
|
+
ensure
|
|
85
|
+
self.depth -= 1
|
|
86
|
+
end
|
|
87
|
+
end
|
|
88
|
+
|
|
89
|
+
class BuildGraph < GraphVisitor
|
|
90
|
+
def visit_element(e)
|
|
91
|
+
return nil if skip? e
|
|
92
|
+
level do
|
|
93
|
+
vertex = graph.create_vertex visit_vertex_fields(e)
|
|
94
|
+
e.one_rels.each do |rel|
|
|
95
|
+
visit_one_rel e, vertex, rel
|
|
96
|
+
end
|
|
97
|
+
e.many_rels.each do |rel|
|
|
98
|
+
visit_many_rels e, vertex, rel
|
|
99
|
+
end
|
|
100
|
+
if block_given?
|
|
101
|
+
yield vertex
|
|
102
|
+
else
|
|
103
|
+
vertex
|
|
104
|
+
end
|
|
105
|
+
end
|
|
106
|
+
end
|
|
107
|
+
|
|
108
|
+
def visit_one_rel(e, from, rel)
|
|
109
|
+
to = visit_element(rel)
|
|
110
|
+
if from and to
|
|
111
|
+
graph.create_edge nil, from, to, rename[rel.name]
|
|
112
|
+
end
|
|
113
|
+
end
|
|
114
|
+
|
|
115
|
+
def visit_many_rels(from_e, from, rel)
|
|
116
|
+
return nil if skip? rel
|
|
117
|
+
level do
|
|
118
|
+
attrs = visit_edge_fields rel
|
|
119
|
+
attrs.delete :type
|
|
120
|
+
rel.contained_rels.map do |to_e|
|
|
121
|
+
visit_many_rel(from_e, from, rel, to_e, attrs)
|
|
122
|
+
end
|
|
123
|
+
end
|
|
124
|
+
end
|
|
125
|
+
|
|
126
|
+
def visit_many_rel(from_e, from, rel, to_e, attrs)
|
|
127
|
+
to = visit_element(to_e)
|
|
128
|
+
if from and to
|
|
129
|
+
graph.create_edge nil, from, to, rename[rel.name], attrs
|
|
130
|
+
end
|
|
131
|
+
end
|
|
132
|
+
end
|
|
133
|
+
|
|
134
|
+
|
|
135
|
+
class BuildGraphCached < BuildGraph
|
|
136
|
+
class << self
|
|
137
|
+
def empty_cache
|
|
138
|
+
cache = Hash.new { |h, k| h[k] = {} }
|
|
139
|
+
cache[:hits] = Hash.new 0
|
|
140
|
+
cache[:size] = 0
|
|
141
|
+
cache[:kill] = nil
|
|
142
|
+
cache[:skip] = Set[]
|
|
143
|
+
cache
|
|
144
|
+
end
|
|
145
|
+
end
|
|
146
|
+
|
|
147
|
+
attr_reader :cache
|
|
148
|
+
attr_accessor :fields
|
|
149
|
+
|
|
150
|
+
def initialize(graph, opts = {})
|
|
151
|
+
if opts[:cache]
|
|
152
|
+
@cache = self.class.empty_cache.merge! opts[:cache]
|
|
153
|
+
else
|
|
154
|
+
@cache = self.class.empty_cache
|
|
155
|
+
end
|
|
156
|
+
super
|
|
157
|
+
end
|
|
158
|
+
|
|
159
|
+
def build(doc)
|
|
160
|
+
result = super
|
|
161
|
+
#tell "CACHE size #{ cache[:size] }, hits:"
|
|
162
|
+
if cache[:stats] and documents % 100 == 99
|
|
163
|
+
tell '-----------------'
|
|
164
|
+
cache.each do |k, adds|
|
|
165
|
+
next unless k.is_a? String
|
|
166
|
+
adds = adds.length
|
|
167
|
+
hits = cache[:hits][k]
|
|
168
|
+
tell("%40s: %6s / %6s = %5.4f" % [k, hits, adds, (hits/adds.to_f)])
|
|
169
|
+
end
|
|
170
|
+
end
|
|
171
|
+
result
|
|
172
|
+
end
|
|
173
|
+
|
|
174
|
+
def cacheable?(e)
|
|
175
|
+
not cache[:skip].include?(rename[e.name]) and not visit_vertex_fields(e).empty?
|
|
176
|
+
end
|
|
177
|
+
|
|
178
|
+
def get_cached(e)
|
|
179
|
+
if cacheable?(e)
|
|
180
|
+
id = cache[rename[e.name]][visit_vertex_fields(e).hash]
|
|
181
|
+
#tell "cache hit: #{ e.description }" if el
|
|
182
|
+
if id
|
|
183
|
+
cache[:hits][rename[e.name]] += 1
|
|
184
|
+
graph.vertex(id)
|
|
185
|
+
end
|
|
186
|
+
end
|
|
187
|
+
end
|
|
188
|
+
|
|
189
|
+
def set_cached(e, el)
|
|
190
|
+
return unless el
|
|
191
|
+
if cacheable?(e)
|
|
192
|
+
ct = cache[rename[e.name]]
|
|
193
|
+
kill = cache[:kill]
|
|
194
|
+
if kill and cache[:hits][rename[e.name]] == 0 and ct.length > kill
|
|
195
|
+
tell "cache kill #{ e.description }"
|
|
196
|
+
cache[:skip] << rename[e.name]
|
|
197
|
+
cache[:size] -= ct.length
|
|
198
|
+
cache[rename[e.name]] = []
|
|
199
|
+
else
|
|
200
|
+
ct[visit_vertex_fields(e).hash] = el.element_id
|
|
201
|
+
cache[:size] += 1
|
|
202
|
+
end
|
|
203
|
+
end
|
|
204
|
+
el
|
|
205
|
+
end
|
|
206
|
+
|
|
207
|
+
def visit_vertex_fields(e)
|
|
208
|
+
self.fields ||= super
|
|
209
|
+
end
|
|
210
|
+
|
|
211
|
+
def visit_element(e)
|
|
212
|
+
self.fields = nil
|
|
213
|
+
get_cached(e) || set_cached(e, super)
|
|
214
|
+
end
|
|
215
|
+
end
|
|
216
|
+
end
|
|
@@ -0,0 +1,148 @@
|
|
|
1
|
+
class Nokogiri::XML::Text
|
|
2
|
+
def tree(_ = nil)
|
|
3
|
+
text unless text =~ /\A\s*\Z/
|
|
4
|
+
end
|
|
5
|
+
|
|
6
|
+
def inspect
|
|
7
|
+
if text =~ /\A\s*\Z/
|
|
8
|
+
"#<(whitespace)>"
|
|
9
|
+
else
|
|
10
|
+
"#<Text #{ text }>"
|
|
11
|
+
end
|
|
12
|
+
end
|
|
13
|
+
end
|
|
14
|
+
|
|
15
|
+
|
|
16
|
+
class Nokogiri::XML::Node
|
|
17
|
+
def tree(key_map = {})
|
|
18
|
+
c = elements.map { |x| x.tree(key_map) }.compact
|
|
19
|
+
if c.empty?
|
|
20
|
+
key_map.fetch(name, name)
|
|
21
|
+
else
|
|
22
|
+
ct = {}
|
|
23
|
+
texts = []
|
|
24
|
+
attrs = {}
|
|
25
|
+
if respond_to? :attributes
|
|
26
|
+
attrs = Hash[attributes.map { |k, a|
|
|
27
|
+
k = key_map.fetch(k, k)
|
|
28
|
+
[k, a.value] if k
|
|
29
|
+
}.compact]
|
|
30
|
+
end
|
|
31
|
+
c.each do |h|
|
|
32
|
+
if h.is_a? String
|
|
33
|
+
texts << h
|
|
34
|
+
next
|
|
35
|
+
end
|
|
36
|
+
h.each do |name, value|
|
|
37
|
+
if ct.key? name
|
|
38
|
+
if ct[name].is_a? Array
|
|
39
|
+
ct[name] << value unless ct[name].include? value
|
|
40
|
+
elsif ct[name] != value
|
|
41
|
+
ct[name] = [ct[name], value]
|
|
42
|
+
end
|
|
43
|
+
else
|
|
44
|
+
ct[name] = value
|
|
45
|
+
end
|
|
46
|
+
end
|
|
47
|
+
end
|
|
48
|
+
ct.merge! attrs
|
|
49
|
+
key = key_map.fetch(name, name)
|
|
50
|
+
if key
|
|
51
|
+
if ct.empty?
|
|
52
|
+
if texts.count < 2
|
|
53
|
+
{ key => texts.first }
|
|
54
|
+
else
|
|
55
|
+
{ key => texts.uniq }
|
|
56
|
+
end
|
|
57
|
+
elsif texts.any?
|
|
58
|
+
{ key => ct }
|
|
59
|
+
else
|
|
60
|
+
{ key => ct }
|
|
61
|
+
end
|
|
62
|
+
end
|
|
63
|
+
end
|
|
64
|
+
end
|
|
65
|
+
|
|
66
|
+
def inspect
|
|
67
|
+
if children.all? &:text?
|
|
68
|
+
"#<Property #{ name }>"
|
|
69
|
+
else
|
|
70
|
+
"#<Element #{ name } [#{ elements.map(&:name).uniq.join(', ') }]>"
|
|
71
|
+
end
|
|
72
|
+
end
|
|
73
|
+
|
|
74
|
+
def description
|
|
75
|
+
s = if property?
|
|
76
|
+
"property"
|
|
77
|
+
elsif container?
|
|
78
|
+
'container'
|
|
79
|
+
elsif vertex?
|
|
80
|
+
'vertex'
|
|
81
|
+
else
|
|
82
|
+
'other'
|
|
83
|
+
end
|
|
84
|
+
"#{ s } #{ name }"
|
|
85
|
+
end
|
|
86
|
+
|
|
87
|
+
def property?
|
|
88
|
+
children.all? &:text?
|
|
89
|
+
end
|
|
90
|
+
|
|
91
|
+
def container?
|
|
92
|
+
not property? and
|
|
93
|
+
elements.map(&:name).uniq.length == 1 and
|
|
94
|
+
elements.all? { |e| e.vertex? or e.container? }
|
|
95
|
+
end
|
|
96
|
+
|
|
97
|
+
def vertex?
|
|
98
|
+
not property? and not container?
|
|
99
|
+
end
|
|
100
|
+
|
|
101
|
+
def properties
|
|
102
|
+
elements.select(&:property?)
|
|
103
|
+
end
|
|
104
|
+
|
|
105
|
+
def attrs
|
|
106
|
+
if respond_to? :attributes
|
|
107
|
+
attributes
|
|
108
|
+
else
|
|
109
|
+
{}
|
|
110
|
+
end
|
|
111
|
+
end
|
|
112
|
+
|
|
113
|
+
def fields
|
|
114
|
+
result = {}
|
|
115
|
+
attrs.each do |name, attr|
|
|
116
|
+
result[name] = attr.value
|
|
117
|
+
end
|
|
118
|
+
properties.each do |e|
|
|
119
|
+
result[e.name] = e.text
|
|
120
|
+
end
|
|
121
|
+
result['type'] = name
|
|
122
|
+
result
|
|
123
|
+
end
|
|
124
|
+
|
|
125
|
+
def one_rels
|
|
126
|
+
elements.select &:vertex?
|
|
127
|
+
end
|
|
128
|
+
|
|
129
|
+
def contained_rels
|
|
130
|
+
if container?
|
|
131
|
+
elements.select(&:vertex?) +
|
|
132
|
+
elements.select(&:container?).flat_map(&:contained_rels)
|
|
133
|
+
else
|
|
134
|
+
[]
|
|
135
|
+
end
|
|
136
|
+
end
|
|
137
|
+
|
|
138
|
+
def many_rels
|
|
139
|
+
elements.select &:container?
|
|
140
|
+
end
|
|
141
|
+
|
|
142
|
+
def rels_hash
|
|
143
|
+
result = Hash.new { |h, k| h[k] = [] }
|
|
144
|
+
one_rels.each { |e| result[e.name] << e }
|
|
145
|
+
many_rels.each { |e| result[e.name] += e.contained_rels }
|
|
146
|
+
result
|
|
147
|
+
end
|
|
148
|
+
end
|
|
@@ -0,0 +1,107 @@
|
|
|
1
|
+
require 'set'
|
|
2
|
+
|
|
3
|
+
module PacerXml
|
|
4
|
+
module Sample
|
|
5
|
+
class << self
|
|
6
|
+
# Will actually load 101. To avoid this side-effect of
|
|
7
|
+
# prefetching, the route should be defined as:
|
|
8
|
+
# xml_route.limit(100).import(...)
|
|
9
|
+
def load_100(*args)
|
|
10
|
+
i = importer(*args).limit(100)
|
|
11
|
+
i.run!
|
|
12
|
+
i.graph
|
|
13
|
+
end
|
|
14
|
+
|
|
15
|
+
# Uses a Neo4j graph because the data is too big to fit in memory
|
|
16
|
+
# without configuring the JVM to use more than its small default
|
|
17
|
+
# footprint.
|
|
18
|
+
#
|
|
19
|
+
# Alternatively, to start the JVM with more memory, try:
|
|
20
|
+
# bundle exec jruby -J-Xmx2048m -S irb
|
|
21
|
+
def load_all(graph = nil, *args)
|
|
22
|
+
require 'pacer-neo4j'
|
|
23
|
+
n = Time.now.to_i % 1000000
|
|
24
|
+
graph ||= Pacer.neo4j "sample.#{n}.graph"
|
|
25
|
+
i = importer(graph, *args)
|
|
26
|
+
i.run!
|
|
27
|
+
i.graph
|
|
28
|
+
end
|
|
29
|
+
|
|
30
|
+
def structure(g)
|
|
31
|
+
Pacer::Utils::GraphAnalysis.structure g
|
|
32
|
+
end
|
|
33
|
+
|
|
34
|
+
def structure!(g, fn = 'patent-structure.graphml')
|
|
35
|
+
s = structure g
|
|
36
|
+
if fn
|
|
37
|
+
e = Pacer::Utils::YFilesExport.new
|
|
38
|
+
e.vertex_label = s.vertex_name
|
|
39
|
+
e.edge_label = s.edge_name
|
|
40
|
+
e.export s, fn
|
|
41
|
+
puts
|
|
42
|
+
puts "Wrote #{ fn }"
|
|
43
|
+
end
|
|
44
|
+
s
|
|
45
|
+
end
|
|
46
|
+
|
|
47
|
+
# Sample of using the xml import function with some advanced options to
|
|
48
|
+
# clean up the resulting graph.
|
|
49
|
+
#
|
|
50
|
+
# Import can successfully be run with no options specified, but this patent
|
|
51
|
+
# xml is particularly hairy.
|
|
52
|
+
def importer(graph = nil, fn = nil, start_rule = nil, end_rule = nil)
|
|
53
|
+
html = [:abstract]
|
|
54
|
+
rename = {
|
|
55
|
+
'classification-national' => 'classification',
|
|
56
|
+
'assistant-examiner' => 'examiner',
|
|
57
|
+
'primary-examiner' => 'examiner',
|
|
58
|
+
'us-term-of-grant' => 'term',
|
|
59
|
+
'addressbook' => 'entity',
|
|
60
|
+
'document-id' => 'document',
|
|
61
|
+
'us-related-documents' => 'related-document',
|
|
62
|
+
'us-patent-grant' => 'patent-version',
|
|
63
|
+
'us-bibliographic-data-grant' => 'patent'
|
|
64
|
+
}
|
|
65
|
+
cache = { stats: true }
|
|
66
|
+
graph ||= Pacer.tg
|
|
67
|
+
graph.create_key_index :type, :vertex
|
|
68
|
+
xml_route = xml(fn, start_rule, end_rule)
|
|
69
|
+
xml_route.
|
|
70
|
+
process { print '.' }.
|
|
71
|
+
import(graph, html: html, rename: rename, cache: cache)
|
|
72
|
+
end
|
|
73
|
+
|
|
74
|
+
def xml(fn = nil, *args)
|
|
75
|
+
fn ||= a_week
|
|
76
|
+
path = download_patent_grant fn
|
|
77
|
+
Pacer.xml path, *args
|
|
78
|
+
end
|
|
79
|
+
|
|
80
|
+
def cleanup(fn = nil)
|
|
81
|
+
fn ||= a_week
|
|
82
|
+
name, week = fn.split '_'
|
|
83
|
+
Dir["/tmp/#{name}*"].each { |f| File.delete f }
|
|
84
|
+
end
|
|
85
|
+
|
|
86
|
+
private
|
|
87
|
+
|
|
88
|
+
def a_week
|
|
89
|
+
'ipgb20120103_wk01'
|
|
90
|
+
end
|
|
91
|
+
|
|
92
|
+
def download_patent_grant(fn)
|
|
93
|
+
puts "Downloading a sample xml file from"
|
|
94
|
+
puts "http://www.google.com/googlebooks/uspto-patents-grants-biblio.html"
|
|
95
|
+
name, week = fn.split '_'
|
|
96
|
+
result = "/tmp/#{name}.xml"
|
|
97
|
+
Dir.chdir '/tmp' do
|
|
98
|
+
unless File.exist? result
|
|
99
|
+
system "curl http://storage.googleapis.com/patents/grantbib/2012/#{fn}.zip > #{fn}.zip"
|
|
100
|
+
system "unzip #{fn}.zip"
|
|
101
|
+
end
|
|
102
|
+
end
|
|
103
|
+
result
|
|
104
|
+
end
|
|
105
|
+
end
|
|
106
|
+
end
|
|
107
|
+
end
|
|
@@ -0,0 +1,50 @@
|
|
|
1
|
+
module Pacer
|
|
2
|
+
module Core
|
|
3
|
+
module StringRoute
|
|
4
|
+
def xml_stream(enter = nil, leave = nil)
|
|
5
|
+
enter ||= /<\?xml/
|
|
6
|
+
leave ||= enter
|
|
7
|
+
enter = build_rule :enter, enter
|
|
8
|
+
leave = build_rule :leave, leave
|
|
9
|
+
r = reducer(element_type: :array, enter: enter, leave: leave) do |s, lines|
|
|
10
|
+
lines << s
|
|
11
|
+
end.route
|
|
12
|
+
joined = r.map(element_type: :string, info: 'join', &:join).route
|
|
13
|
+
joined.xml
|
|
14
|
+
end
|
|
15
|
+
|
|
16
|
+
def xml
|
|
17
|
+
map(element_type: :xml) do |s|
|
|
18
|
+
Nokogiri::XML(s).first_element_child
|
|
19
|
+
end
|
|
20
|
+
end
|
|
21
|
+
|
|
22
|
+
private
|
|
23
|
+
|
|
24
|
+
def build_rule(type, rule)
|
|
25
|
+
rule = rule.to_s if rule.is_a? Symbol
|
|
26
|
+
if rule.is_a? String
|
|
27
|
+
if type == :leave
|
|
28
|
+
rule = "/#{rule}"
|
|
29
|
+
add_close_tag = true
|
|
30
|
+
end
|
|
31
|
+
rule = /<#{rule}\b/
|
|
32
|
+
end
|
|
33
|
+
if rule.is_a? Proc
|
|
34
|
+
rule
|
|
35
|
+
elsif add_close_tag
|
|
36
|
+
proc do |line, lines, set_value|
|
|
37
|
+
if line.nil? or rule =~ line
|
|
38
|
+
set_value.call(lines << line)
|
|
39
|
+
true
|
|
40
|
+
end
|
|
41
|
+
end
|
|
42
|
+
else
|
|
43
|
+
proc do |line|
|
|
44
|
+
[] if line.nil? or rule =~ line
|
|
45
|
+
end
|
|
46
|
+
end
|
|
47
|
+
end
|
|
48
|
+
end
|
|
49
|
+
end
|
|
50
|
+
end
|
|
@@ -0,0 +1,129 @@
|
|
|
1
|
+
module PacerXml
|
|
2
|
+
module XmlRoute
|
|
3
|
+
def help(section = nil)
|
|
4
|
+
case section
|
|
5
|
+
when nil
|
|
6
|
+
puts <<HELP
|
|
7
|
+
This is included via the pacer-xml gem plugin.
|
|
8
|
+
|
|
9
|
+
pacer-xml uses Nokogiri for its xml parsing. Each element in an xml route
|
|
10
|
+
is the first child element of the Nokogiri::XML::Document element. To get at
|
|
11
|
+
the document element, simply call #parent on the element.
|
|
12
|
+
|
|
13
|
+
An xml route can be created, transformed, filtered and otherwise
|
|
14
|
+
processed by all standard Pacer routes. For instance, if a graph element
|
|
15
|
+
has a property with xml data in it, we could process it as follows:
|
|
16
|
+
|
|
17
|
+
g.v.map(element_type: :xml) { |v| Nokogiri(v[:xml]) }
|
|
18
|
+
|
|
19
|
+
Method help sections:
|
|
20
|
+
:xml
|
|
21
|
+
:import
|
|
22
|
+
|
|
23
|
+
HELP
|
|
24
|
+
when :xml
|
|
25
|
+
puts <<HELP
|
|
26
|
+
|
|
27
|
+
|
|
28
|
+
|
|
29
|
+
Turn an xml file into a stream of xml nodes. Scans the xml file
|
|
30
|
+
line-by-line and uses arguments defined in start_section and end_section
|
|
31
|
+
to extract sections from the file.
|
|
32
|
+
|
|
33
|
+
Pacer.xml(file, start_section = nil, end_section = nil)
|
|
34
|
+
|
|
35
|
+
file: String | IO
|
|
36
|
+
String path to an xml file to read
|
|
37
|
+
IO an open resource that responds to #each_line
|
|
38
|
+
start_section: String | Symbol | Regex | Proc (optional)
|
|
39
|
+
String | Symbol name of xml tag to use as the root node of each
|
|
40
|
+
section of xml. The end_section will automatically be
|
|
41
|
+
set to the closing tag. This uses very simple regex
|
|
42
|
+
matching.
|
|
43
|
+
Regex If it matches, start the section from this line
|
|
44
|
+
Proc proc { |line| }
|
|
45
|
+
If it results in a truthy value, starts collecting
|
|
46
|
+
lines for the next section of xml.
|
|
47
|
+
end_section: Proc (optional)
|
|
48
|
+
Regex If it matches, end the section including this line
|
|
49
|
+
Proc proc { |line, lines| }
|
|
50
|
+
- If it results in a truthy value to indicate that the
|
|
51
|
+
current line is the last line in a section.
|
|
52
|
+
- if it results in an Array, pass the result of
|
|
53
|
+
joining the array to Nokogiri for the next section.
|
|
54
|
+
|
|
55
|
+
HELP
|
|
56
|
+
when :import
|
|
57
|
+
puts <<HELP
|
|
58
|
+
Turn the tree of xml in each node in the stream into graph elements.
|
|
59
|
+
|
|
60
|
+
xml_route.import(graph, opts = {})
|
|
61
|
+
|
|
62
|
+
graph: PacerGraph The graph to load the data into.
|
|
63
|
+
opts: Hash
|
|
64
|
+
:cache false | Hash
|
|
65
|
+
false disable caching
|
|
66
|
+
stats: true enable occasional dump of cache info
|
|
67
|
+
:rename Hash map of { 'old-name' => 'new-name' }
|
|
68
|
+
:html Array set of tag names to treat as containing HTML
|
|
69
|
+
:skip Array set of tag or attribute names to skip
|
|
70
|
+
|
|
71
|
+
Produces a vertex route where each vertex is the root vertex for each xml tree.
|
|
72
|
+
|
|
73
|
+
Look at the source of lib/pacer-xml/sample.rb for a good example.
|
|
74
|
+
|
|
75
|
+
HELP
|
|
76
|
+
else
|
|
77
|
+
super
|
|
78
|
+
end
|
|
79
|
+
description
|
|
80
|
+
end
|
|
81
|
+
|
|
82
|
+
def children
|
|
83
|
+
flat_map(element_type: :xml) { |x| x.children.to_a }
|
|
84
|
+
end
|
|
85
|
+
|
|
86
|
+
def names
|
|
87
|
+
map element_type: :string, &:name
|
|
88
|
+
end
|
|
89
|
+
|
|
90
|
+
def text_nodes
|
|
91
|
+
select &:text?
|
|
92
|
+
end
|
|
93
|
+
|
|
94
|
+
def elements
|
|
95
|
+
select &:element?
|
|
96
|
+
end
|
|
97
|
+
|
|
98
|
+
def fields
|
|
99
|
+
elements.map element_type: :hash, &:fields
|
|
100
|
+
end
|
|
101
|
+
|
|
102
|
+
def import(graph, opts = {})
|
|
103
|
+
if opts[:cache] == false
|
|
104
|
+
builder = BuildGraph.new(graph, opts)
|
|
105
|
+
else
|
|
106
|
+
builder = BuildGraphCached.new(graph, opts)
|
|
107
|
+
end
|
|
108
|
+
graph.vertex_name ||= proc { |v| v[:type] }
|
|
109
|
+
to_route.map(route_name: 'import', graph: graph, element_type: :vertex, modules: [ImportHelp]) do |node|
|
|
110
|
+
graph.transaction do
|
|
111
|
+
builder.build(node)
|
|
112
|
+
end
|
|
113
|
+
end.route
|
|
114
|
+
end
|
|
115
|
+
|
|
116
|
+
module ImportHelp
|
|
117
|
+
def help(section = nil)
|
|
118
|
+
case section
|
|
119
|
+
when nil
|
|
120
|
+
back.help :import
|
|
121
|
+
else
|
|
122
|
+
super
|
|
123
|
+
end
|
|
124
|
+
description
|
|
125
|
+
end
|
|
126
|
+
end
|
|
127
|
+
end
|
|
128
|
+
Pacer::RouteBuilder.current.element_types[:xml] = [XmlRoute]
|
|
129
|
+
end
|
data/lib/pacer-xml.rb
ADDED
|
@@ -0,0 +1,48 @@
|
|
|
1
|
+
require_relative 'pacer-xml/version'
|
|
2
|
+
require 'nokogiri'
|
|
3
|
+
require 'pacer'
|
|
4
|
+
|
|
5
|
+
module PacerXml
|
|
6
|
+
class << self
|
|
7
|
+
# Returns the time pacer-xml was last reloaded (or when it was started).
|
|
8
|
+
def reload_time
|
|
9
|
+
if defined? @reload_time
|
|
10
|
+
@reload_time
|
|
11
|
+
else
|
|
12
|
+
START_TIME
|
|
13
|
+
end
|
|
14
|
+
end
|
|
15
|
+
|
|
16
|
+
# Reload all Ruby modified files in the pacer-xml library. Useful for debugging
|
|
17
|
+
# in the console. Does not do any of the fancy stuff that Rails reloading
|
|
18
|
+
# does. Certain types of changes will still require restarting the session.
|
|
19
|
+
def reload!
|
|
20
|
+
require 'pathname'
|
|
21
|
+
Pathname.new(File.expand_path(__FILE__)).parent.find do |path|
|
|
22
|
+
if path.extname == '.rb' and path.mtime > reload_time
|
|
23
|
+
puts path.to_s
|
|
24
|
+
load path.to_s
|
|
25
|
+
end
|
|
26
|
+
end
|
|
27
|
+
@reload_time = Time.now
|
|
28
|
+
end
|
|
29
|
+
end
|
|
30
|
+
end
|
|
31
|
+
|
|
32
|
+
require_relative 'pacer-xml/build_graph'
|
|
33
|
+
require_relative 'pacer-xml/nokogiri_node'
|
|
34
|
+
require_relative 'pacer-xml/xml_route'
|
|
35
|
+
require_relative 'pacer-xml/string_route'
|
|
36
|
+
require_relative 'pacer-xml/sample'
|
|
37
|
+
|
|
38
|
+
module Pacer
|
|
39
|
+
class << self
|
|
40
|
+
def xml(file, enter = nil, leave = nil)
|
|
41
|
+
if file.is_a? String
|
|
42
|
+
file = File.open file
|
|
43
|
+
end
|
|
44
|
+
lines = file.each_line.to_route(element_type: :string, info: 'lines').route
|
|
45
|
+
lines.xml_stream(enter, leave).route
|
|
46
|
+
end
|
|
47
|
+
end
|
|
48
|
+
end
|
data/pacer-xml.gemspec
ADDED
|
@@ -0,0 +1,24 @@
|
|
|
1
|
+
# -*- encoding: utf-8 -*-
|
|
2
|
+
$:.push File.expand_path("../lib", __FILE__)
|
|
3
|
+
require "pacer-xml/version"
|
|
4
|
+
|
|
5
|
+
Gem::Specification.new do |s|
|
|
6
|
+
s.name = "pacer-xml"
|
|
7
|
+
s.version = PacerXml::VERSION
|
|
8
|
+
s.platform = 'java'
|
|
9
|
+
s.authors = ["Darrick Wiebe"]
|
|
10
|
+
s.email = ["dw@xnlogic.com"]
|
|
11
|
+
s.homepage = "http://xnlogic.com"
|
|
12
|
+
s.summary = %q{XML streaming and graph import for Pacer}
|
|
13
|
+
s.description = s.summary
|
|
14
|
+
|
|
15
|
+
s.add_dependency 'pacer', PacerXml::PACER_VERSION
|
|
16
|
+
s.add_dependency 'pacer-neo4j', ">= 2.1"
|
|
17
|
+
s.add_dependency 'nokogiri'
|
|
18
|
+
|
|
19
|
+
s.rubyforge_project = "pacer-xml"
|
|
20
|
+
|
|
21
|
+
s.files = `git ls-files`.split("\n")
|
|
22
|
+
s.test_files = `git ls-files -- {test,spec,features}/*`.split("\n")
|
|
23
|
+
s.require_paths = ["lib"]
|
|
24
|
+
end
|
metadata
CHANGED
|
@@ -2,14 +2,14 @@
|
|
|
2
2
|
name: pacer-xml
|
|
3
3
|
version: !ruby/object:Gem::Version
|
|
4
4
|
prerelease:
|
|
5
|
-
version: 0.2.
|
|
5
|
+
version: 0.2.2
|
|
6
6
|
platform: java
|
|
7
7
|
authors:
|
|
8
8
|
- Darrick Wiebe
|
|
9
9
|
autorequire:
|
|
10
10
|
bindir: bin
|
|
11
11
|
cert_chain: []
|
|
12
|
-
date: 2012-10-
|
|
12
|
+
date: 2012-10-31 00:00:00.000000000 Z
|
|
13
13
|
dependencies:
|
|
14
14
|
- !ruby/object:Gem::Dependency
|
|
15
15
|
name: pacer
|
|
@@ -67,7 +67,19 @@ email:
|
|
|
67
67
|
executables: []
|
|
68
68
|
extensions: []
|
|
69
69
|
extra_rdoc_files: []
|
|
70
|
-
files:
|
|
70
|
+
files:
|
|
71
|
+
- .gitignore
|
|
72
|
+
- Gemfile
|
|
73
|
+
- Rakefile
|
|
74
|
+
- Readme.markdown
|
|
75
|
+
- lib/pacer-xml.rb
|
|
76
|
+
- lib/pacer-xml/build_graph.rb
|
|
77
|
+
- lib/pacer-xml/nokogiri_node.rb
|
|
78
|
+
- lib/pacer-xml/sample.rb
|
|
79
|
+
- lib/pacer-xml/string_route.rb
|
|
80
|
+
- lib/pacer-xml/version.rb
|
|
81
|
+
- lib/pacer-xml/xml_route.rb
|
|
82
|
+
- pacer-xml.gemspec
|
|
71
83
|
homepage: http://xnlogic.com
|
|
72
84
|
licenses: []
|
|
73
85
|
post_install_message:
|