pacer-xml 0.2.1-java → 0.2.2-java
- data/.gitignore +5 -0
- data/Gemfile +6 -0
- data/Rakefile +2 -0
- data/Readme.markdown +172 -0
- data/lib/pacer-xml/build_graph.rb +216 -0
- data/lib/pacer-xml/nokogiri_node.rb +148 -0
- data/lib/pacer-xml/sample.rb +107 -0
- data/lib/pacer-xml/string_route.rb +50 -0
- data/lib/pacer-xml/version.rb +7 -0
- data/lib/pacer-xml/xml_route.rb +129 -0
- data/lib/pacer-xml.rb +48 -0
- data/pacer-xml.gemspec +24 -0
- metadata +15 -3
data/Gemfile
ADDED
data/Rakefile
ADDED
data/Readme.markdown
ADDED
@@ -0,0 +1,172 @@
|
|
1
|
+
pacer-xml
|
2
|
+
=========
|
3
|
+
|
4
|
+
This Pacer plugin is designed to make it dead-simple to import any
|
5
|
+
arbitrary XML file (no matter how bizarre) into any graph database
|
6
|
+
supported by Pacer.
|
7
|
+
|
8
|
+
This library evolved out of my need to be able to easily pull in sample
|
9
|
+
data when demoing Pacer. GraphML is pretty rare and what I've been able
|
10
|
+
to find is mostly pretty lame anyway, but raw XML seems to be everywhere
|
11
|
+
(just check out [DATA.GOV](http://www.data.gov/)).
|
12
|
+
|
13
|
+
|
14
|
+
Usage
|
15
|
+
-----
|
16
|
+
|
17
|
+
I suggest looking at the implementation of the below sample to see how
|
18
|
+
I've used pacer-xml there.
|
19
|
+
|
20
|
+
There are 2 key methods:
|
21
|
+
|
22
|
+
`Pacer.xml(file, start_section = nil, end_section = nil)`
|
23
|
+
|
24
|
+
```
|
25
|
+
file: String | IO
|
26
|
+
String path to an xml file to read
|
27
|
+
IO an open resource that responds to #each_line
|
28
|
+
start_section: String | Symbol | Regex | Proc (optional)
|
29
|
+
String | Symbol name of xml tag to use as the root node of each
|
30
|
+
section of xml. The end_section will automatically be
|
31
|
+
set to the closing tag. This uses very simple regex
|
32
|
+
matching.
|
33
|
+
Regex If it matches, start the section from this line
|
34
|
+
Proc proc { |line| }
|
35
|
+
If it results in a truthy value, starts collecting
|
36
|
+
lines for the next section of xml.
|
37
|
+
end_section: Regex | Proc (optional)
|
38
|
+
Regex If it matches, end the section including this line
|
39
|
+
Proc proc { |line, lines| }
|
40
|
+
- If it results in a truthy value to indicate that the
|
41
|
+
current line is the last line in a section.
|
42
|
+
- if it results in an Array, pass the result of
|
43
|
+
joining the array to Nokogiri for the next section.
|
44
|
+
```
|
45
|
+
|
46
|
+
If the parser is building a section when it gets to the end of the file,
|
47
|
+
it will call the `end_section.call(nil, lines)`. To prevent the final
|
48
|
+
section from being processed, return `[]`.
|
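As a rough illustration of the `end_section` contract described above, here is a minimal plain-Ruby sketch of a rule proc. It is written from the README text only; the `</record>` tag name is a hypothetical choice, and nothing here calls into pacer-xml itself.

```ruby
# Hypothetical end_section rule; the </record> tag name is an assumption.
end_rule = proc do |line, lines|
  if line.nil?
    []      # end of file reached mid-section: discard the partial section
  elsif line =~ %r{</record>}
    true    # the current line is the last line of this section
  end
end

p end_rule.call("<name>x</name>\n", [])  # nil: keep collecting lines
p end_rule.call("</record>\n", [])       # true: close the section here
p end_rule.call(nil, ["<record>\n"])     # []: drop the unfinished section
```

Returning `nil` for ordinary lines keeps the section open, which is why the proc has no `else` branch.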
49
|
+
|
50
|
+
Returns a Pacer Route to a series of Nokogiri::XML::Elements. Each
|
51
|
+
element is the root element of its document. By default, chunks are
|
52
|
+
delimited by the presence of `<?xml`.
|
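The default chunking can be pictured with a small plain-Ruby sketch. This is not the library's implementation (which is built on Pacer routes); `split_sections` is a hypothetical helper that only shows the idea of starting a new chunk at each line containing `<?xml`.

```ruby
require 'stringio'

# Hypothetical helper, not part of pacer-xml: groups lines into sections,
# starting a new section whenever a line matches start_pattern.
def split_sections(io, start_pattern = /<\?xml/)
  sections = []
  current = nil
  io.each_line do |line|
    if line =~ start_pattern
      sections << current.join if current
      current = []
    end
    # Lines seen before the first match are ignored.
    current << line if current
  end
  sections << current.join if current
  sections
end

input = StringIO.new(<<~XML)
  <?xml version="1.0"?>
  <doc><a>1</a></doc>
  <?xml version="1.0"?>
  <doc><a>2</a></doc>
XML

sections = split_sections(input)
puts sections.length
```

Each resulting string is a complete document that could then be handed to an XML parser on its own, which is what keeps memory use bounded on very large concatenated files.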
53
|
+
|
54
|
+
|
55
|
+
`xml_route.import(graph, opts = {})`
|
56
|
+
|
57
|
+
```
|
58
|
+
graph: PacerGraph The graph to load the data into.
|
59
|
+
opts: Hash
|
60
|
+
:cache false | Hash
|
61
|
+
false disable caching
|
62
|
+
stats: true enable occasional dump of cache info
|
63
|
+
:rename Hash map of { 'old-name' => 'new-name' }
|
64
|
+
:html Array set of tag names to treat as containing HTML
|
65
|
+
:skip Array set of tag or attribute names to skip
|
66
|
+
```
|
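The `:rename` option is backed by a Hash with a default block, as `build_rename` in `lib/pacer-xml/build_graph.rb` shows. The sketch below copies that logic into a standalone method so its behaviour can be tried without Pacer; the example tag names are taken from the sample importer.

```ruby
# Standalone copy of the build_rename logic from build_graph.rb.
def build_rename(custom = {})
  # Unknown names map to themselves (as strings) and are memoized.
  h = Hash.new { |hash, key| hash[key] = key.to_s }
  h['id'] = 'identifier'   # built-in rename applied to every import
  h.merge! custom if custom
  h
end

rename = build_rename('primary-examiner' => 'examiner')

puts rename['primary-examiner']  # custom mapping
puts rename['abstract']          # falls through to the default block
puts rename['id']                # built-in mapping
```

Because the default block stores each looked-up name, the hash grows with every distinct tag name it is asked about.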
67
|
+
|
68
|
+
Baked-in Sample
|
69
|
+
---------------
|
70
|
+
|
71
|
+
This library started out with me tackling a chunk of [Patent Grants](https://explore.data.gov/Business-Enterprise/Patent-Grant-Bibliographic-Text-1976-Present-/8du5-jxih)
|
72
|
+
data, and my first attempt at importing it was with a hand-crafted set
|
73
|
+
of rules that walked the XML, creating graph elements along the way.
|
74
|
+
That was fairly painful and turned out to be very slow as well. My
|
75
|
+
second attempt evolved into this tool. The cool thing is that by the
|
76
|
+
end, everything specific to the patent grants data set was just a few
|
77
|
+
lines of configuration on top of a very powerful streaming XML parsing
|
78
|
+
tool.
|
79
|
+
|
80
|
+
I encourage you to check out the sample data. Simply install this gem
|
81
|
+
and start up IRB, then:
|
82
|
+
|
83
|
+
```ruby
|
84
|
+
require 'pacer-xml'
|
85
|
+
|
86
|
+
graph = PacerXml::Sample.load_100
|
87
|
+
```
|
88
|
+
|
89
|
+
That will download and extract a 100M xml file full of 2 weeks of patent
|
90
|
+
grants data, then create a graph with the first 100 patents, including
|
91
|
+
every piece of data in the file.
|
92
|
+
|
93
|
+
I encourage you to take a look at [how it was done](https://github.com/xnlogic/pacer-xml/blob/master/lib/pacer-xml/sample.rb).
|
94
|
+
|
95
|
+
Once you've created a graph from the data, it may be useful for you to
|
96
|
+
check out how it's structured. Pacer's got a handy tool built in to do
|
97
|
+
that, `Pacer::Utils::GraphAnalysis.structure graph`, but let's go one
|
98
|
+
step further and visually analyze the graph. If we run the command
|
99
|
+
below, we'll see the same results as the GraphAnalysis, but it will
|
100
|
+
export a graphml file that we can load into yEd, an excellent free graph
|
101
|
+
visualization tool:
|
102
|
+
|
103
|
+
```ruby
|
104
|
+
PacerXml::Sample.structure! graph
|
105
|
+
# ... lots of output ...
|
106
|
+
#=> #<PacerGraph tinkergraph[vertices:90 edges:112]>
|
107
|
+
```
|
108
|
+
|
109
|
+
The new file in your working directory is called
|
110
|
+
`patent-structure.graphml`. Open that file in yEd. You'll see a single
|
111
|
+
box... Fortunately, laying it out is fairly simple:
|
112
|
+
|
113
|
+
1. Tools / Fit Node To Label
|
114
|
+
1. OK
|
115
|
+
1. Layout / Hierarchical...
|
116
|
+
1. Labelling Tab / set Edge Labelling to Hierarchic
|
117
|
+
1. OK
|
118
|
+
|
119
|
+
Cool!
|
120
|
+
|
121
|
+
Contextual Help
|
122
|
+
---------------
|
123
|
+
|
124
|
+
Back to Pacer: there's lots to learn about it. The best way to do
|
125
|
+
that is to use Pacer's own inline help:
|
126
|
+
|
127
|
+
* Use `Pacer.help` for general help
|
128
|
+
* Get into a general section with `Pacer.help :section`
|
129
|
+
* Get contextual help with `graph.v.map.help`
|
130
|
+
* Get more contextual help with `graph.v.map.help :section`
|
131
|
+
|
132
|
+
Contextual help was only added recently so it's not complete yet but
|
133
|
+
it's developing quickly and contributions are very welcome!
|
134
|
+
|
135
|
+
More
|
136
|
+
-----
|
137
|
+
|
138
|
+
To play with the xml tools themselves, try out the following commands:
|
139
|
+
|
140
|
+
```ruby
|
141
|
+
xml_route = PacerXml::Sample.xml(nil, start_rule, end_rule)
|
142
|
+
|
143
|
+
importer = PacerXml::Sample.importer
|
144
|
+
```
|
145
|
+
|
146
|
+
Performance Notes
|
147
|
+
-----------------
|
148
|
+
|
149
|
+
This section uses the `PacerXml::Sample.load_all` method. The `load_100`
|
150
|
+
method runs in just a couple of seconds.
|
151
|
+
|
152
|
+
The default sample file contains 3019840 lines representing 4479
|
153
|
+
documents. Running under the simple `bundle exec irb` command on a MBP
|
154
|
+
2.3 GHz i7, here are some quick timings (in seconds) for operations on
|
155
|
+
the entire file:
|
156
|
+
|
157
|
+
```
|
158
|
+
=> 8.36 iterate through 3019840 lines
|
159
|
+
=> 28.534 reduce the lines to 4479 arrays of lines
|
160
|
+
=> 29.753 join each array of lines into a string
|
161
|
+
=> 34.788 parse each string into a Nokogiri XML document
|
162
|
+
=> 812.732 create a graph, producing 494659 vertices and 629690 edges
|
163
|
+
```
|
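To put the timings above in perspective, the counts can be divided by the reported seconds. The sketch below assumes each timing covers the whole pipeline up to and including that step, which the README does not state explicitly, so the rates are end-to-end figures rather than per-stage costs.

```ruby
lines, docs = 3_019_840, 4_479
vertices, edges = 494_659, 629_690

lines_per_sec    = lines / 8.36                    # raw line iteration
docs_parsed_rate = docs / 34.788                   # documents parsed per second
docs_import_rate = docs / 812.732                  # documents imported per second
elements_per_sec = (vertices + edges) / 812.732    # graph elements created per second

printf("%.0f lines/s read\n", lines_per_sec)
printf("%.1f documents/s parsed\n", docs_parsed_rate)
printf("%.2f documents/s imported\n", docs_import_rate)
printf("%.0f graph elements/s created\n", elements_per_sec)
```

On these numbers, graph creation rather than line reading or XML parsing accounts for most of the run time.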
164
|
+
|
165
|
+
Starting up with `bundle exec jruby --server -J-Xmx2048m -S irb`
|
166
|
+
slightly improves performance of the import but does not appear to
|
167
|
+
affect Pacer or Nokogiri's performance:
|
168
|
+
|
169
|
+
```
|
170
|
+
=> 34.857 parsed XML documents
|
171
|
+
=> 780.828 created graph
|
172
|
+
```
|
data/lib/pacer-xml/build_graph.rb
ADDED
@@ -0,0 +1,216 @@
|
|
1
|
+
require 'set'
|
2
|
+
|
3
|
+
module PacerXml
|
4
|
+
class GraphVisitor
|
5
|
+
class << self
|
6
|
+
def build_rename(custom = {})
|
7
|
+
h = Hash.new { |h, k| h[k] = k.to_s }
|
8
|
+
h['id'] = 'identifier'
|
9
|
+
h.merge! custom if custom
|
10
|
+
h
|
11
|
+
end
|
12
|
+
end
|
13
|
+
|
14
|
+
attr_reader :graph
|
15
|
+
attr_accessor :depth, :documents
|
16
|
+
attr_reader :rename, :html, :skip
|
17
|
+
|
18
|
+
def initialize(graph, opts = {})
|
19
|
+
@documents = 0
|
20
|
+
@graph = graph
|
21
|
+
# treat tag as a property containing html
|
22
|
+
@html = (opts[:html] || []).map(&:to_s).to_set
|
23
|
+
# skip property or tag
|
24
|
+
@skip = (opts[:skip] || []).map(&:to_s).to_set
|
25
|
+
# rename type or property
|
26
|
+
@rename = self.class.build_rename(opts[:rename])
|
27
|
+
end
|
28
|
+
|
29
|
+
def build(doc)
|
30
|
+
self.documents += 1
|
31
|
+
self.depth = 0
|
32
|
+
if doc.is_a? Nokogiri::XML::Document
|
33
|
+
visit_element doc.first_element_child
|
34
|
+
elsif doc.element?
|
35
|
+
visit_element doc
|
36
|
+
elsif doc.is_a? Enumerable
|
37
|
+
doc.select(&:element?).each { |e| visit_element e }
|
38
|
+
else
|
39
|
+
fail "Don't know what you want to do"
|
40
|
+
end
|
41
|
+
end
|
42
|
+
|
43
|
+
def visit_vertex_fields(e)
|
44
|
+
h = e.fields
|
45
|
+
h['type'] = rename[h['type']]
|
46
|
+
rename.each do |from, to|
|
47
|
+
if h.key? from
|
48
|
+
h[to] = h.delete from
|
49
|
+
end
|
50
|
+
end
|
51
|
+
html.each do |name|
|
52
|
+
name = rename[name]
|
53
|
+
child = e.at_xpath(name)
|
54
|
+
h[name] = child.inner_html if child
|
55
|
+
end
|
56
|
+
skip.each do |name|
|
57
|
+
h.delete name
|
58
|
+
end
|
59
|
+
h
|
60
|
+
end
|
61
|
+
|
62
|
+
def visit_edge_fields(e)
|
63
|
+
h = visit_vertex_fields(e)
|
64
|
+
h.delete 'type'
|
65
|
+
h
|
66
|
+
end
|
67
|
+
|
68
|
+
def tell(x)
|
69
|
+
print(' ' * depth) if depth
|
70
|
+
if x.is_a? Hash or x.is_a? Array
|
71
|
+
p x
|
72
|
+
else
|
73
|
+
puts x
|
74
|
+
end
|
75
|
+
end
|
76
|
+
|
77
|
+
def skip?(e)
|
78
|
+
skip.include? e.name or html.include? e.name
|
79
|
+
end
|
80
|
+
|
81
|
+
def level
|
82
|
+
self.depth += 1
|
83
|
+
yield
|
84
|
+
ensure
|
85
|
+
self.depth -= 1
|
86
|
+
end
|
87
|
+
end
|
88
|
+
|
89
|
+
class BuildGraph < GraphVisitor
|
90
|
+
def visit_element(e)
|
91
|
+
return nil if skip? e
|
92
|
+
level do
|
93
|
+
vertex = graph.create_vertex visit_vertex_fields(e)
|
94
|
+
e.one_rels.each do |rel|
|
95
|
+
visit_one_rel e, vertex, rel
|
96
|
+
end
|
97
|
+
e.many_rels.each do |rel|
|
98
|
+
visit_many_rels e, vertex, rel
|
99
|
+
end
|
100
|
+
if block_given?
|
101
|
+
yield vertex
|
102
|
+
else
|
103
|
+
vertex
|
104
|
+
end
|
105
|
+
end
|
106
|
+
end
|
107
|
+
|
108
|
+
def visit_one_rel(e, from, rel)
|
109
|
+
to = visit_element(rel)
|
110
|
+
if from and to
|
111
|
+
graph.create_edge nil, from, to, rename[rel.name]
|
112
|
+
end
|
113
|
+
end
|
114
|
+
|
115
|
+
def visit_many_rels(from_e, from, rel)
|
116
|
+
return nil if skip? rel
|
117
|
+
level do
|
118
|
+
attrs = visit_edge_fields rel
|
119
|
+
attrs.delete :type
|
120
|
+
rel.contained_rels.map do |to_e|
|
121
|
+
visit_many_rel(from_e, from, rel, to_e, attrs)
|
122
|
+
end
|
123
|
+
end
|
124
|
+
end
|
125
|
+
|
126
|
+
def visit_many_rel(from_e, from, rel, to_e, attrs)
|
127
|
+
to = visit_element(to_e)
|
128
|
+
if from and to
|
129
|
+
graph.create_edge nil, from, to, rename[rel.name], attrs
|
130
|
+
end
|
131
|
+
end
|
132
|
+
end
|
133
|
+
|
134
|
+
|
135
|
+
class BuildGraphCached < BuildGraph
|
136
|
+
class << self
|
137
|
+
def empty_cache
|
138
|
+
cache = Hash.new { |h, k| h[k] = {} }
|
139
|
+
cache[:hits] = Hash.new 0
|
140
|
+
cache[:size] = 0
|
141
|
+
cache[:kill] = nil
|
142
|
+
cache[:skip] = Set[]
|
143
|
+
cache
|
144
|
+
end
|
145
|
+
end
|
146
|
+
|
147
|
+
attr_reader :cache
|
148
|
+
attr_accessor :fields
|
149
|
+
|
150
|
+
def initialize(graph, opts = {})
|
151
|
+
if opts[:cache]
|
152
|
+
@cache = self.class.empty_cache.merge! opts[:cache]
|
153
|
+
else
|
154
|
+
@cache = self.class.empty_cache
|
155
|
+
end
|
156
|
+
super
|
157
|
+
end
|
158
|
+
|
159
|
+
def build(doc)
|
160
|
+
result = super
|
161
|
+
#tell "CACHE size #{ cache[:size] }, hits:"
|
162
|
+
if cache[:stats] and documents % 100 == 99
|
163
|
+
tell '-----------------'
|
164
|
+
cache.each do |k, adds|
|
165
|
+
next unless k.is_a? String
|
166
|
+
adds = adds.length
|
167
|
+
hits = cache[:hits][k]
|
168
|
+
tell("%40s: %6s / %6s = %5.4f" % [k, hits, adds, (hits/adds.to_f)])
|
169
|
+
end
|
170
|
+
end
|
171
|
+
result
|
172
|
+
end
|
173
|
+
|
174
|
+
def cacheable?(e)
|
175
|
+
not cache[:skip].include?(rename[e.name]) and not visit_vertex_fields(e).empty?
|
176
|
+
end
|
177
|
+
|
178
|
+
def get_cached(e)
|
179
|
+
if cacheable?(e)
|
180
|
+
id = cache[rename[e.name]][visit_vertex_fields(e).hash]
|
181
|
+
#tell "cache hit: #{ e.description }" if el
|
182
|
+
if id
|
183
|
+
cache[:hits][rename[e.name]] += 1
|
184
|
+
graph.vertex(id)
|
185
|
+
end
|
186
|
+
end
|
187
|
+
end
|
188
|
+
|
189
|
+
def set_cached(e, el)
|
190
|
+
return unless el
|
191
|
+
if cacheable?(e)
|
192
|
+
ct = cache[rename[e.name]]
|
193
|
+
kill = cache[:kill]
|
194
|
+
if kill and cache[:hits][rename[e.name]] == 0 and ct.length > kill
|
195
|
+
tell "cache kill #{ e.description }"
|
196
|
+
cache[:skip] << rename[e.name]
|
197
|
+
cache[:size] -= ct.length
|
198
|
+
cache[rename[e.name]] = []
|
199
|
+
else
|
200
|
+
ct[visit_vertex_fields(e).hash] = el.element_id
|
201
|
+
cache[:size] += 1
|
202
|
+
end
|
203
|
+
end
|
204
|
+
el
|
205
|
+
end
|
206
|
+
|
207
|
+
def visit_vertex_fields(e)
|
208
|
+
self.fields ||= super
|
209
|
+
end
|
210
|
+
|
211
|
+
def visit_element(e)
|
212
|
+
self.fields = nil
|
213
|
+
get_cached(e) || set_cached(e, super)
|
214
|
+
end
|
215
|
+
end
|
216
|
+
end
|
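The caching idea in `BuildGraphCached` above (reuse a vertex when an element of the same type has identical fields) can be sketched in plain Ruby. `TinyStore` is a hypothetical stand-in for a graph, and this simplified version leaves out the `:kill` threshold, the `:skip` set and the size counter.

```ruby
# Hypothetical in-memory stand-in for a graph: vertex ids are array indexes.
class TinyStore
  attr_reader :vertices, :hits

  def initialize
    @vertices = []
    @cache = Hash.new { |h, type| h[type] = {} }  # type => { fields.hash => id }
    @hits = Hash.new 0                            # type => number of cache hits
  end

  # Returns the id of an existing vertex with identical fields, or creates one.
  def find_or_create(fields)
    type = fields['type']
    key = fields.hash
    if (id = @cache[type][key])
      @hits[type] += 1
      id
    else
      @vertices << fields
      @cache[type][key] = @vertices.length - 1
    end
  end
end

store = TinyStore.new
a = store.find_or_create('type' => 'examiner', 'name' => 'Smith')
b = store.find_or_create('type' => 'examiner', 'name' => 'Smith')
c = store.find_or_create('type' => 'examiner', 'name' => 'Jones')
puts [a, b, c].inspect
```

Keying on `fields.hash` keeps the cache small but, as in the original, two different field sets that happen to share a hash value would be treated as the same vertex.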
data/lib/pacer-xml/nokogiri_node.rb
ADDED
@@ -0,0 +1,148 @@
|
|
1
|
+
class Nokogiri::XML::Text
|
2
|
+
def tree(_ = nil)
|
3
|
+
text unless text =~ /\A\s*\Z/
|
4
|
+
end
|
5
|
+
|
6
|
+
def inspect
|
7
|
+
if text =~ /\A\s*\Z/
|
8
|
+
"#<(whitespace)>"
|
9
|
+
else
|
10
|
+
"#<Text #{ text }>"
|
11
|
+
end
|
12
|
+
end
|
13
|
+
end
|
14
|
+
|
15
|
+
|
16
|
+
class Nokogiri::XML::Node
|
17
|
+
def tree(key_map = {})
|
18
|
+
c = elements.map { |x| x.tree(key_map) }.compact
|
19
|
+
if c.empty?
|
20
|
+
key_map.fetch(name, name)
|
21
|
+
else
|
22
|
+
ct = {}
|
23
|
+
texts = []
|
24
|
+
attrs = {}
|
25
|
+
if respond_to? :attributes
|
26
|
+
attrs = Hash[attributes.map { |k, a|
|
27
|
+
k = key_map.fetch(k, k)
|
28
|
+
[k, a.value] if k
|
29
|
+
}.compact]
|
30
|
+
end
|
31
|
+
c.each do |h|
|
32
|
+
if h.is_a? String
|
33
|
+
texts << h
|
34
|
+
next
|
35
|
+
end
|
36
|
+
h.each do |name, value|
|
37
|
+
if ct.key? name
|
38
|
+
if ct[name].is_a? Array
|
39
|
+
ct[name] << value unless ct[name].include? value
|
40
|
+
elsif ct[name] != value
|
41
|
+
ct[name] = [ct[name], value]
|
42
|
+
end
|
43
|
+
else
|
44
|
+
ct[name] = value
|
45
|
+
end
|
46
|
+
end
|
47
|
+
end
|
48
|
+
ct.merge! attrs
|
49
|
+
key = key_map.fetch(name, name)
|
50
|
+
if key
|
51
|
+
if ct.empty?
|
52
|
+
if texts.count < 2
|
53
|
+
{ key => texts.first }
|
54
|
+
else
|
55
|
+
{ key => texts.uniq }
|
56
|
+
end
|
57
|
+
elsif texts.any?
|
58
|
+
{ key => ct }
|
59
|
+
else
|
60
|
+
{ key => ct }
|
61
|
+
end
|
62
|
+
end
|
63
|
+
end
|
64
|
+
end
|
65
|
+
|
66
|
+
def inspect
|
67
|
+
if children.all? &:text?
|
68
|
+
"#<Property #{ name }>"
|
69
|
+
else
|
70
|
+
"#<Element #{ name } [#{ elements.map(&:name).uniq.join(', ') }]>"
|
71
|
+
end
|
72
|
+
end
|
73
|
+
|
74
|
+
def description
|
75
|
+
s = if property?
|
76
|
+
"property"
|
77
|
+
elsif container?
|
78
|
+
'container'
|
79
|
+
elsif vertex?
|
80
|
+
'vertex'
|
81
|
+
else
|
82
|
+
'other'
|
83
|
+
end
|
84
|
+
"#{ s } #{ name }"
|
85
|
+
end
|
86
|
+
|
87
|
+
def property?
|
88
|
+
children.all? &:text?
|
89
|
+
end
|
90
|
+
|
91
|
+
def container?
|
92
|
+
not property? and
|
93
|
+
elements.map(&:name).uniq.length == 1 and
|
94
|
+
elements.all? { |e| e.vertex? or e.container? }
|
95
|
+
end
|
96
|
+
|
97
|
+
def vertex?
|
98
|
+
not property? and not container?
|
99
|
+
end
|
100
|
+
|
101
|
+
def properties
|
102
|
+
elements.select(&:property?)
|
103
|
+
end
|
104
|
+
|
105
|
+
def attrs
|
106
|
+
if respond_to? :attributes
|
107
|
+
attributes
|
108
|
+
else
|
109
|
+
{}
|
110
|
+
end
|
111
|
+
end
|
112
|
+
|
113
|
+
def fields
|
114
|
+
result = {}
|
115
|
+
attrs.each do |name, attr|
|
116
|
+
result[name] = attr.value
|
117
|
+
end
|
118
|
+
properties.each do |e|
|
119
|
+
result[e.name] = e.text
|
120
|
+
end
|
121
|
+
result['type'] = name
|
122
|
+
result
|
123
|
+
end
|
124
|
+
|
125
|
+
def one_rels
|
126
|
+
elements.select &:vertex?
|
127
|
+
end
|
128
|
+
|
129
|
+
def contained_rels
|
130
|
+
if container?
|
131
|
+
elements.select(&:vertex?) +
|
132
|
+
elements.select(&:container?).flat_map(&:contained_rels)
|
133
|
+
else
|
134
|
+
[]
|
135
|
+
end
|
136
|
+
end
|
137
|
+
|
138
|
+
def many_rels
|
139
|
+
elements.select &:container?
|
140
|
+
end
|
141
|
+
|
142
|
+
def rels_hash
|
143
|
+
result = Hash.new { |h, k| h[k] = [] }
|
144
|
+
one_rels.each { |e| result[e.name] << e }
|
145
|
+
many_rels.each { |e| result[e.name] += e.contained_rels }
|
146
|
+
result
|
147
|
+
end
|
148
|
+
end
|
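The `property?`, `container?` and `vertex?` predicates above decide how each XML element maps onto the graph. The following plain-Ruby sketch mirrors those three rules on a hypothetical `Node` struct instead of Nokogiri nodes, so the classification can be tried without any gems; the element names are invented for illustration.

```ruby
# Hypothetical stand-in for an XML element: a name plus child nodes,
# where a child is either another Node or a String of text.
Node = Struct.new(:name, :children) do
  def elements
    children.select { |c| c.is_a? Node }
  end

  # A property holds only text (or nothing at all).
  def property?
    children.all? { |c| c.is_a? String }
  end

  # A container wraps a single kind of child element,
  # each of which is itself a vertex or a container.
  def container?
    !property? &&
      elements.map(&:name).uniq.length == 1 &&
      elements.all? { |e| e.vertex? || e.container? }
  end

  # Everything else becomes a vertex.
  def vertex?
    !property? && !container?
  end
end

inventor  = Node.new('inventor', [Node.new('name', ['Ada'])])
inventors = Node.new('inventors', [inventor, Node.new('inventor', [Node.new('name', ['Grace'])])])
patent    = Node.new('patent', [Node.new('title', ['Widget']), inventors])

[patent, inventors, inventor, inventor.elements.first].each do |n|
  kind = n.property? ? 'property' : n.container? ? 'container' : 'vertex'
  puts "#{n.name}: #{kind}"
end
```

In the importer, properties become vertex fields, vertices become graph vertices, and containers become the edges that link a parent vertex to each contained vertex.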
data/lib/pacer-xml/sample.rb
ADDED
@@ -0,0 +1,107 @@
|
|
1
|
+
require 'set'
|
2
|
+
|
3
|
+
module PacerXml
|
4
|
+
module Sample
|
5
|
+
class << self
|
6
|
+
# Will actually load 101. To avoid this side-effect of
|
7
|
+
# prefetching, the route should be defined as:
|
8
|
+
# xml_route.limit(100).import(...)
|
9
|
+
def load_100(*args)
|
10
|
+
i = importer(*args).limit(100)
|
11
|
+
i.run!
|
12
|
+
i.graph
|
13
|
+
end
|
14
|
+
|
15
|
+
# Uses a Neo4j graph because the data is too big to fit in memory
|
16
|
+
# without configuring the JVM to use more than its small default
|
17
|
+
# footprint.
|
18
|
+
#
|
19
|
+
# Alternatively, to start the JVM with more memory, try:
|
20
|
+
# bundle exec jruby -J-Xmx2048m -S irb
|
21
|
+
def load_all(graph = nil, *args)
|
22
|
+
require 'pacer-neo4j'
|
23
|
+
n = Time.now.to_i % 1000000
|
24
|
+
graph ||= Pacer.neo4j "sample.#{n}.graph"
|
25
|
+
i = importer(graph, *args)
|
26
|
+
i.run!
|
27
|
+
i.graph
|
28
|
+
end
|
29
|
+
|
30
|
+
def structure(g)
|
31
|
+
Pacer::Utils::GraphAnalysis.structure g
|
32
|
+
end
|
33
|
+
|
34
|
+
def structure!(g, fn = 'patent-structure.graphml')
|
35
|
+
s = structure g
|
36
|
+
if fn
|
37
|
+
e = Pacer::Utils::YFilesExport.new
|
38
|
+
e.vertex_label = s.vertex_name
|
39
|
+
e.edge_label = s.edge_name
|
40
|
+
e.export s, fn
|
41
|
+
puts
|
42
|
+
puts "Wrote #{ fn }"
|
43
|
+
end
|
44
|
+
s
|
45
|
+
end
|
46
|
+
|
47
|
+
# Sample of using the xml import function with some advanced options to
|
48
|
+
# clean up the resulting graph.
|
49
|
+
#
|
50
|
+
# Import can successfully be run with no options specified, but this patent
|
51
|
+
# xml is particularly hairy.
|
52
|
+
def importer(graph = nil, fn = nil, start_rule = nil, end_rule = nil)
|
53
|
+
html = [:abstract]
|
54
|
+
rename = {
|
55
|
+
'classification-national' => 'classification',
|
56
|
+
'assistant-examiner' => 'examiner',
|
57
|
+
'primary-examiner' => 'examiner',
|
58
|
+
'us-term-of-grant' => 'term',
|
59
|
+
'addressbook' => 'entity',
|
60
|
+
'document-id' => 'document',
|
61
|
+
'us-related-documents' => 'related-document',
|
62
|
+
'us-patent-grant' => 'patent-version',
|
63
|
+
'us-bibliographic-data-grant' => 'patent'
|
64
|
+
}
|
65
|
+
cache = { stats: true }
|
66
|
+
graph ||= Pacer.tg
|
67
|
+
graph.create_key_index :type, :vertex
|
68
|
+
xml_route = xml(fn, start_rule, end_rule)
|
69
|
+
xml_route.
|
70
|
+
process { print '.' }.
|
71
|
+
import(graph, html: html, rename: rename, cache: cache)
|
72
|
+
end
|
73
|
+
|
74
|
+
def xml(fn = nil, *args)
|
75
|
+
fn ||= a_week
|
76
|
+
path = download_patent_grant fn
|
77
|
+
Pacer.xml path, *args
|
78
|
+
end
|
79
|
+
|
80
|
+
def cleanup(fn = nil)
|
81
|
+
fn ||= a_week
|
82
|
+
name, week = fn.split '_'
|
83
|
+
Dir["/tmp/#{name}*"].each { |f| File.delete f }
|
84
|
+
end
|
85
|
+
|
86
|
+
private
|
87
|
+
|
88
|
+
def a_week
|
89
|
+
'ipgb20120103_wk01'
|
90
|
+
end
|
91
|
+
|
92
|
+
def download_patent_grant(fn)
|
93
|
+
puts "Downloading a sample xml file from"
|
94
|
+
puts "http://www.google.com/googlebooks/uspto-patents-grants-biblio.html"
|
95
|
+
name, week = fn.split '_'
|
96
|
+
result = "/tmp/#{name}.xml"
|
97
|
+
Dir.chdir '/tmp' do
|
98
|
+
unless File.exist? result
|
99
|
+
system "curl http://storage.googleapis.com/patents/grantbib/2012/#{fn}.zip > #{fn}.zip"
|
100
|
+
system "unzip #{fn}.zip"
|
101
|
+
end
|
102
|
+
end
|
103
|
+
result
|
104
|
+
end
|
105
|
+
end
|
106
|
+
end
|
107
|
+
end
|
data/lib/pacer-xml/string_route.rb
ADDED
@@ -0,0 +1,50 @@
|
|
1
|
+
module Pacer
|
2
|
+
module Core
|
3
|
+
module StringRoute
|
4
|
+
def xml_stream(enter = nil, leave = nil)
|
5
|
+
enter ||= /<\?xml/
|
6
|
+
leave ||= enter
|
7
|
+
enter = build_rule :enter, enter
|
8
|
+
leave = build_rule :leave, leave
|
9
|
+
r = reducer(element_type: :array, enter: enter, leave: leave) do |s, lines|
|
10
|
+
lines << s
|
11
|
+
end.route
|
12
|
+
joined = r.map(element_type: :string, info: 'join', &:join).route
|
13
|
+
joined.xml
|
14
|
+
end
|
15
|
+
|
16
|
+
def xml
|
17
|
+
map(element_type: :xml) do |s|
|
18
|
+
Nokogiri::XML(s).first_element_child
|
19
|
+
end
|
20
|
+
end
|
21
|
+
|
22
|
+
private
|
23
|
+
|
24
|
+
def build_rule(type, rule)
|
25
|
+
rule = rule.to_s if rule.is_a? Symbol
|
26
|
+
if rule.is_a? String
|
27
|
+
if type == :leave
|
28
|
+
rule = "/#{rule}"
|
29
|
+
add_close_tag = true
|
30
|
+
end
|
31
|
+
rule = /<#{rule}\b/
|
32
|
+
end
|
33
|
+
if rule.is_a? Proc
|
34
|
+
rule
|
35
|
+
elsif add_close_tag
|
36
|
+
proc do |line, lines, set_value|
|
37
|
+
if line.nil? or rule =~ line
|
38
|
+
set_value.call(lines << line)
|
39
|
+
true
|
40
|
+
end
|
41
|
+
end
|
42
|
+
else
|
43
|
+
proc do |line|
|
44
|
+
[] if line.nil? or rule =~ line
|
45
|
+
end
|
46
|
+
end
|
47
|
+
end
|
48
|
+
end
|
49
|
+
end
|
50
|
+
end
|
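When a String or Symbol is given as the start rule, `build_rule` above turns it into simple opening-tag and closing-tag patterns. A standalone sketch of just that conversion follows; `tag_patterns` is a hypothetical helper name and `record` is an invented tag.

```ruby
# Hypothetical helper showing the patterns build_rule derives from a tag name.
def tag_patterns(tag)
  tag = tag.to_s
  [/<#{tag}\b/, /<\/#{tag}\b/]   # [opening-tag pattern, closing-tag pattern]
end

enter, leave = tag_patterns(:record)

p enter.match?('<record id="1">')  # opening tag with attributes
p enter.match?('<records>')        # \b stops a longer tag name from matching
p leave.match?('</record>')        # closing tag
```

Because matching is per line and purely textual, these patterns take no account of nesting, comments or CDATA, which is the "very simple regex matching" the README warns about.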
data/lib/pacer-xml/xml_route.rb
ADDED
@@ -0,0 +1,129 @@
|
|
1
|
+
module PacerXml
|
2
|
+
module XmlRoute
|
3
|
+
def help(section = nil)
|
4
|
+
case section
|
5
|
+
when nil
|
6
|
+
puts <<HELP
|
7
|
+
This is included via the pacer-xml gem plugin.
|
8
|
+
|
9
|
+
pacer-xml uses Nokogiri for its xml parsing. Each element in an xml route
|
10
|
+
is the first child element of the Nokogiri::XML::Document element. To get at
|
11
|
+
the document element, simply call #parent on the element.
|
12
|
+
|
13
|
+
An xml route can be created, transformed, filtered and otherwise
|
14
|
+
processed by all standard Pacer routes. For instance, if a graph element
|
15
|
+
has a property with xml data in it, we could process it as follows:
|
16
|
+
|
17
|
+
g.v.map(element_type: :xml) { |v| Nokogiri(v[:xml]) }
|
18
|
+
|
19
|
+
Method help sections:
|
20
|
+
:xml
|
21
|
+
:import
|
22
|
+
|
23
|
+
HELP
|
24
|
+
when :xml
|
25
|
+
puts <<HELP
|
26
|
+
|
27
|
+
|
28
|
+
|
29
|
+
Turn an xml file into a stream of xml nodes. Scans the xml file
|
30
|
+
line-by-line and uses arguments defined in start_section and end_section
|
31
|
+
to extract sections from the file.
|
32
|
+
|
33
|
+
Pacer.xml(file, start_section = nil, end_section = nil)
|
34
|
+
|
35
|
+
file: String | IO
|
36
|
+
String path to an xml file to read
|
37
|
+
IO an open resource that responds to #each_line
|
38
|
+
start_section: String | Symbol | Regex | Proc (optional)
|
39
|
+
String | Symbol name of xml tag to use as the root node of each
|
40
|
+
section of xml. The end_section will automatically be
|
41
|
+
set to the closing tag. This uses very simple regex
|
42
|
+
matching.
|
43
|
+
Regex If it matches, start the section from this line
|
44
|
+
Proc proc { |line| }
|
45
|
+
If it results in a truthy value, starts collecting
|
46
|
+
lines for the next section of xml.
|
47
|
+
end_section: Regex | Proc (optional)
|
48
|
+
Regex If it matches, end the section including this line
|
49
|
+
Proc proc { |line, lines| }
|
50
|
+
- If it results in a truthy value to indicate that the
|
51
|
+
current line is the last line in a section.
|
52
|
+
- if it results in an Array, pass the result of
|
53
|
+
joining the array to Nokogiri for the next section.
|
54
|
+
|
55
|
+
HELP
|
56
|
+
when :import
|
57
|
+
puts <<HELP
|
58
|
+
Turn the tree of xml in each node in the stream into vertices and edges in a graph.
|
59
|
+
|
60
|
+
xml_route.import(graph, opts = {})
|
61
|
+
|
62
|
+
graph: PacerGraph The graph to load the data into.
|
63
|
+
opts: Hash
|
64
|
+
:cache false | Hash
|
65
|
+
false disable caching
|
66
|
+
stats: true enable occasional dump of cache info
|
67
|
+
:rename Hash map of { 'old-name' => 'new-name' }
|
68
|
+
:html Array set of tag names to treat as containing HTML
|
69
|
+
:skip Array set of tag or attribute names to skip
|
70
|
+
|
71
|
+
Produces a vertex route where each vertex is the root vertex for each xml tree.
|
72
|
+
|
73
|
+
Look at the source of lib/pacer-xml/sample.rb for a good example.
|
74
|
+
|
75
|
+
HELP
|
76
|
+
else
|
77
|
+
super
|
78
|
+
end
|
79
|
+
description
|
80
|
+
end
|
81
|
+
|
82
|
+
def children
|
83
|
+
flat_map(element_type: :xml) { |x| x.children.to_a }
|
84
|
+
end
|
85
|
+
|
86
|
+
def names
|
87
|
+
map element_type: :string, &:name
|
88
|
+
end
|
89
|
+
|
90
|
+
def text_nodes
|
91
|
+
select &:text?
|
92
|
+
end
|
93
|
+
|
94
|
+
def elements
|
95
|
+
select &:element?
|
96
|
+
end
|
97
|
+
|
98
|
+
def fields
|
99
|
+
elements.map element_type: :hash, &:fields
|
100
|
+
end
|
101
|
+
|
102
|
+
def import(graph, opts = {})
|
103
|
+
if opts[:cache] == false
|
104
|
+
builder = BuildGraph.new(graph, opts)
|
105
|
+
else
|
106
|
+
builder = BuildGraphCached.new(graph, opts)
|
107
|
+
end
|
108
|
+
graph.vertex_name ||= proc { |v| v[:type] }
|
109
|
+
to_route.map(route_name: 'import', graph: graph, element_type: :vertex, modules: [ImportHelp]) do |node|
|
110
|
+
graph.transaction do
|
111
|
+
builder.build(node)
|
112
|
+
end
|
113
|
+
end.route
|
114
|
+
end
|
115
|
+
|
116
|
+
module ImportHelp
|
117
|
+
def help(section = nil)
|
118
|
+
case section
|
119
|
+
when nil
|
120
|
+
back.help :import
|
121
|
+
else
|
122
|
+
super
|
123
|
+
end
|
124
|
+
description
|
125
|
+
end
|
126
|
+
end
|
127
|
+
end
|
128
|
+
Pacer::RouteBuilder.current.element_types[:xml] = [XmlRoute]
|
129
|
+
end
|
data/lib/pacer-xml.rb
ADDED
@@ -0,0 +1,48 @@
|
|
1
|
+
require_relative 'pacer-xml/version'
|
2
|
+
require 'nokogiri'
|
3
|
+
require 'pacer'
|
4
|
+
|
5
|
+
module PacerXml
|
6
|
+
class << self
|
7
|
+
# Returns the time pacer-xml was last reloaded (or when it was started).
|
8
|
+
def reload_time
|
9
|
+
if defined? @reload_time
|
10
|
+
@reload_time
|
11
|
+
else
|
12
|
+
START_TIME
|
13
|
+
end
|
14
|
+
end
|
15
|
+
|
16
|
+
# Reload all Ruby modified files in the pacer-xml library. Useful for debugging
|
17
|
+
# in the console. Does not do any of the fancy stuff that Rails reloading
|
18
|
+
# does. Certain types of changes will still require restarting the session.
|
19
|
+
def reload!
|
20
|
+
require 'pathname'
|
21
|
+
Pathname.new(File.expand_path(__FILE__)).parent.find do |path|
|
22
|
+
if path.extname == '.rb' and path.mtime > reload_time
|
23
|
+
puts path.to_s
|
24
|
+
load path.to_s
|
25
|
+
end
|
26
|
+
end
|
27
|
+
@reload_time = Time.now
|
28
|
+
end
|
29
|
+
end
|
30
|
+
end
|
31
|
+
|
32
|
+
require_relative 'pacer-xml/build_graph'
|
33
|
+
require_relative 'pacer-xml/nokogiri_node'
|
34
|
+
require_relative 'pacer-xml/xml_route'
|
35
|
+
require_relative 'pacer-xml/string_route'
|
36
|
+
require_relative 'pacer-xml/sample'
|
37
|
+
|
38
|
+
module Pacer
|
39
|
+
class << self
|
40
|
+
def xml(file, enter = nil, leave = nil)
|
41
|
+
if file.is_a? String
|
42
|
+
file = File.open file
|
43
|
+
end
|
44
|
+
lines = file.each_line.to_route(element_type: :string, info: 'lines').route
|
45
|
+
lines.xml_stream(enter, leave).route
|
46
|
+
end
|
47
|
+
end
|
48
|
+
end
|
data/pacer-xml.gemspec
ADDED
@@ -0,0 +1,24 @@
|
|
1
|
+
# -*- encoding: utf-8 -*-
|
2
|
+
$:.push File.expand_path("../lib", __FILE__)
|
3
|
+
require "pacer-xml/version"
|
4
|
+
|
5
|
+
Gem::Specification.new do |s|
|
6
|
+
s.name = "pacer-xml"
|
7
|
+
s.version = PacerXml::VERSION
|
8
|
+
s.platform = 'java'
|
9
|
+
s.authors = ["Darrick Wiebe"]
|
10
|
+
s.email = ["dw@xnlogic.com"]
|
11
|
+
s.homepage = "http://xnlogic.com"
|
12
|
+
s.summary = %q{XML streaming and graph import for Pacer}
|
13
|
+
s.description = s.summary
|
14
|
+
|
15
|
+
s.add_dependency 'pacer', PacerXml::PACER_VERSION
|
16
|
+
s.add_dependency 'pacer-neo4j', ">= 2.1"
|
17
|
+
s.add_dependency 'nokogiri'
|
18
|
+
|
19
|
+
s.rubyforge_project = "pacer-xml"
|
20
|
+
|
21
|
+
s.files = `git ls-files`.split("\n")
|
22
|
+
s.test_files = `git ls-files -- {test,spec,features}/*`.split("\n")
|
23
|
+
s.require_paths = ["lib"]
|
24
|
+
end
|
metadata
CHANGED
@@ -2,14 +2,14 @@
|
|
2
2
|
name: pacer-xml
|
3
3
|
version: !ruby/object:Gem::Version
|
4
4
|
prerelease:
|
5
|
-
version: 0.2.
|
5
|
+
version: 0.2.2
|
6
6
|
platform: java
|
7
7
|
authors:
|
8
8
|
- Darrick Wiebe
|
9
9
|
autorequire:
|
10
10
|
bindir: bin
|
11
11
|
cert_chain: []
|
12
|
-
date: 2012-10-
|
12
|
+
date: 2012-10-31 00:00:00.000000000 Z
|
13
13
|
dependencies:
|
14
14
|
- !ruby/object:Gem::Dependency
|
15
15
|
name: pacer
|
@@ -67,7 +67,19 @@ email:
|
|
67
67
|
executables: []
|
68
68
|
extensions: []
|
69
69
|
extra_rdoc_files: []
|
70
|
-
files:
|
70
|
+
files:
|
71
|
+
- .gitignore
|
72
|
+
- Gemfile
|
73
|
+
- Rakefile
|
74
|
+
- Readme.markdown
|
75
|
+
- lib/pacer-xml.rb
|
76
|
+
- lib/pacer-xml/build_graph.rb
|
77
|
+
- lib/pacer-xml/nokogiri_node.rb
|
78
|
+
- lib/pacer-xml/sample.rb
|
79
|
+
- lib/pacer-xml/string_route.rb
|
80
|
+
- lib/pacer-xml/version.rb
|
81
|
+
- lib/pacer-xml/xml_route.rb
|
82
|
+
- pacer-xml.gemspec
|
71
83
|
homepage: http://xnlogic.com
|
72
84
|
licenses: []
|
73
85
|
post_install_message:
|