pacer-xml 0.2.1-java → 0.2.2-java

data/.gitignore ADDED
@@ -0,0 +1,5 @@
1
+ *.graph
2
+ *.lock
3
+ *.xml
4
+ pkg
5
+ *.graphml
data/Gemfile ADDED
@@ -0,0 +1,6 @@
1
+ source "http://rubygems.org"
2
+
3
+ # Specify your gem's dependencies in pacer-graph.gemspec
4
+ gemspec
5
+
6
+ gem 'pacer', path: '~/xn/pacer'
data/Rakefile ADDED
@@ -0,0 +1,2 @@
1
+ require 'bundler'
2
+ Bundler::GemHelper.install_tasks
data/Readme.markdown ADDED
@@ -0,0 +1,172 @@
1
+ pacer-xml
2
+ =========
3
+
4
+ This Pacer plugin is designed to make it dead-simple to import any
5
+ arbitrary XML file (no matter how bizarre) into any graph database
6
+ supported by Pacer.
7
+
8
+ This library evolved out of my need to be able to easily pull in sample
9
+ data when demoing Pacer. GraphML is pretty rare and what I've been able
10
+ to find is mostly pretty lame anyway, but raw XML seems to be everywhere
11
+ (just check out [DATA.GOV](http://www.data.gov/)).
12
+
13
+
14
+ Usage
15
+ -----
16
+
17
+ I suggest looking at the implementation of the sample below to see how
18
+ I've used pacer-xml there.
19
+
20
+ There are 2 key methods:
21
+
22
+ `Pacer.xml(file, start_section = nil, end_section = nil)`
23
+
24
+ ```
25
+ file: String | IO
26
+ String path to an xml file to read
27
+ IO an open resource that responds to #each_line
28
+ start_section: String | Symbol | Regex | Proc (optional)
29
+ String | Symbol name of xml tag to use as the root node of each
30
+ section of xml. The end_section will automatically be
31
+ set to the closing tag. This uses very simple regex
32
+ matching.
33
+ Regex If it matches, start the section from this line
34
+ Proc proc { |line| }
35
+ If it results in a truthy value, starts collecting
36
+ lines for the next section of xml.
37
+ end_section: Regex | Proc (optional)
38
+ Regex If it matches, end the section including this line
39
+ Proc proc { |line, lines| }
40
+ - Return a truthy value to indicate that the
41
+ current line is the last line in a section.
42
+ - Return an Array to pass the result of
43
+ joining the array to Nokogiri for the next section.
44
+ ```
45
+
46
+ If the parser is building a section when it gets to the end of the file,
47
+ it will call `end_section.call(nil, lines)`. To prevent the final
48
+ section from being processed, return `[]`.
49
+
50
+ Returns a Pacer Route to a series of Nokogiri::XML::Elements. Each
51
+ element is the root element of its document. By default, chunks are
52
+ delimited by the presence of `<?xml`.
53
+
54
+
55
+ `xml_route.import(graph, opts = {})`
56
+
57
+ ```
58
+ graph: PacerGraph The graph to load the data into.
59
+ opts: Hash
60
+ :cache false | Hash
61
+ false disable caching
62
+ stats: true enable occasional dump of cache info
63
+ :rename Hash map of { 'old-name' => 'new-name' }
64
+ :html Array set of tag names to treat as containing HTML
65
+ :skip Array set of tag or attribute names to skip
66
+ ```
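The `:rename` option feeds the lookup table built by `GraphVisitor.build_rename` in `lib/pacer-xml/build_graph.rb`. Copied into plain Ruby so it runs without Pacer, the sketch below shows how that table behaves: custom entries win, `'id'` is always mapped to `'identifier'`, and any other name falls through to itself. The sample tag names are only illustrative.

```ruby
# Same construction as GraphVisitor.build_rename, reproduced standalone.
def build_rename(custom = {})
  # Unknown keys default to their own string form (and are stored on lookup).
  h = Hash.new { |hash, key| hash[key] = key.to_s }
  h['id'] = 'identifier'
  h.merge! custom if custom
  h
end

rename = build_rename('primary-examiner' => 'examiner')

puts rename['primary-examiner'] # custom mapping
puts rename['title']            # falls through unchanged
puts rename['id']               # built-in mapping
```

Because the default block stores every key it is asked about, the table grows as new tag names are looked up.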
67
+
68
+ Baked-in Sample
69
+ ---------------
70
+
71
+ This library started out with me tackling a chunk of [Patent Grants](https://explore.data.gov/Business-Enterprise/Patent-Grant-Bibliographic-Text-1976-Present-/8du5-jxih)
72
+ data, and my first attempt at importing it was with a hand-crafted set
73
+ of rules that walked the XML, creating graph elements along the way.
74
+ That was fairly painful and turned out to be very slow as well. My
75
+ second attempt evolved into this tool. The cool thing is that by the
76
+ end, everything specific to the patent grants data set was just a few
77
+ lines of configuration on top of a very powerful streaming XML parsing
78
+ tool.
79
+
80
+ I encourage you to check out the sample data. Simply install this gem
81
+ and start up IRB, then:
82
+
83
+ ```ruby
84
+ require 'pacer-xml'
85
+
86
+ graph = PacerXml::Sample.load_100
87
+ ```
88
+
89
+ That will download and extract a 100M xml file full of 2 weeks of patent
90
+ grants data, then create a graph with the first 100 patents, including
91
+ every piece of data in the file.
92
+
93
+ I encourage you to take a look at [how it was done](https://github.com/xnlogic/pacer-xml/blob/master/lib/pacer-xml/sample.rb).
94
+
95
+ Once you've created a graph from the data, it may be useful for you to
96
+ check out how it's structured. Pacer's got a handy tool built in to do
97
+ that, `Pacer::Utils::GraphAnalysis.structure graph`, but let's go one
98
+ step further and visually analyze the graph. If we run the command
99
+ below, we'll see the same results as the GraphAnalysis, but it will
100
+ export a graphml file that we can load into yEd, an excellent free graph
101
+ visualization tool:
102
+
103
+ ```ruby
104
+ PacerXml::Sample.structure! graph
105
+ # ... lots of output ...
106
+ #=> #<PacerGraph tinkergraph[vertices:90 edges:112]>
107
+ ```
108
+
109
+ The new file in your working directory is called
110
+ `patent-structure.graphml`. Open that file in yEd. You'll see a single
111
+ box... Fortunately, laying it out is fairly simple:
112
+
113
+ 1. Tools / Fit Node To Label
114
+ 1. OK
115
+ 1. Layout / Hierarchical...
116
+ 1. Labelling Tab / set Edge Labelling to Hierarchic
117
+ 1. OK
118
+
119
+ Cool!
120
+
121
+ Contextual Help
122
+ ---------------
123
+
124
+ Back to Pacer: there's lots to learn about it. The best way to do
125
+ that is to use Pacer's own inline help:
126
+
127
+ * Use `Pacer.help` for general help
128
+ * Get into a general section with `Pacer.help :section`
129
+ * Get contextual help with `graph.v.map.help`
130
+ * Get more contextual help with `graph.v.map.help :section`
131
+
132
+ Contextual help was only added recently, so it's not complete yet, but
133
+ it's developing quickly and contributions are very welcome!
134
+
135
+ More
136
+ -----
137
+
138
+ To play with the xml tools themselves, try out the following commands:
139
+
140
+ ```ruby
141
+ xml_route = PacerXml::Sample.xml(nil, start_rule, end_rule)
142
+
143
+ importer = PacerXml::Sample.importer
144
+ ```
145
+
146
+ Performance Notes
147
+ -----------------
148
+
149
+ This section uses the `PacerXml::Sample.load_all` method. The `load_100`
150
+ method runs in just a couple of seconds.
151
+
152
+ The default sample file contains 3019840 lines representing 4479
153
+ documents. Running under the simple `bundle exec irb` command on an MBP
154
+ 2.3 GHz i7, here are some quick timings (in seconds) for operations on
155
+ the entire file:
156
+
157
+ ```
158
+ => 8.36 iterate through 3019840 lines
159
+ => 28.534 reduce the lines to 4479 arrays of lines
160
+ => 29.753 join each array of lines into a string
161
+ => 34.788 parse each string into a Nokogiri XML document
162
+ => 812.732 create a graph, producing 494659 vertices and 629690 edges
163
+ ```
164
+
165
+ Starting up with `bundle exec jruby --server -J-Xmx2048m -S irb`
166
+ slightly improves performance of the import but does not appear to
167
+ affect Pacer or Nokogiri's performance:
168
+
169
+ ```
170
+ => 34.857 parsed XML documents
171
+ => 780.828 created graph
172
+ ```
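To put the timings above in perspective, the arithmetic below converts them into throughput. It uses only the figures quoted in the Performance Notes and assumes the 812.732 s and 780.828 s values are end-to-end times for the full import; the variable names are illustrative.

```ruby
# Figures quoted in the Performance Notes.
documents = 4479
vertices  = 494_659
edges     = 629_690
plain_irb = 812.732 # seconds under `bundle exec irb`
server    = 780.828 # seconds under `jruby --server -J-Xmx2048m`

docs_per_sec     = documents / plain_irb
vertices_per_sec = vertices / plain_irb
edges_per_sec    = edges / plain_irb
# Relative improvement from the --server JVM settings.
speedup_percent  = (plain_irb - server) / plain_irb * 100

puts format('%.2f documents/s', docs_per_sec)
puts format('%.1f vertices/s', vertices_per_sec)
puts format('%.1f edges/s', edges_per_sec)
puts format('%.1f%% faster with --server', speedup_percent)
```

Under that assumption the import runs at roughly 5.5 documents per second, and the `--server` settings buy about 4%.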
data/lib/pacer-xml/build_graph.rb ADDED
@@ -0,0 +1,216 @@
1
+ require 'set'
2
+
3
+ module PacerXml
4
+ class GraphVisitor
5
+ class << self
6
+ def build_rename(custom = {})
7
+ h = Hash.new { |h, k| h[k] = k.to_s }
8
+ h['id'] = 'identifier'
9
+ h.merge! custom if custom
10
+ h
11
+ end
12
+ end
13
+
14
+ attr_reader :graph
15
+ attr_accessor :depth, :documents
16
+ attr_reader :rename, :html, :skip
17
+
18
+ def initialize(graph, opts = {})
19
+ @documents = 0
20
+ @graph = graph
21
+ # treat tag as a property containing html
22
+ @html = (opts[:html] || []).map(&:to_s).to_set
23
+ # skip property or tag
24
+ @skip = (opts[:skip] || []).map(&:to_s).to_set
25
+ # rename type or property
26
+ @rename = self.class.build_rename(opts[:rename])
27
+ end
28
+
29
+ def build(doc)
30
+ self.documents += 1
31
+ self.depth = 0
32
+ if doc.is_a? Nokogiri::XML::Document
33
+ visit_element doc.first_element_child
34
+ elsif doc.element?
35
+ visit_element doc
36
+ elsif doc.is_a? Enumerable
37
+ doc.select(&:element?).each { |e| visit_element e }
38
+ else
39
+ fail "Don't know what you want to do"
40
+ end
41
+ end
42
+
43
+ def visit_vertex_fields(e)
44
+ h = e.fields
45
+ h['type'] = rename[h['type']]
46
+ rename.each do |from, to|
47
+ if h.key? from
48
+ h[to] = h.delete from
49
+ end
50
+ end
51
+ html.each do |name|
52
+ name = rename[name]
53
+ child = e.at_xpath(name)
54
+ h[name] = child.inner_html if child
55
+ end
56
+ skip.each do |name|
57
+ h.delete name
58
+ end
59
+ h
60
+ end
61
+
62
+ def visit_edge_fields(e)
63
+ h = visit_vertex_fields(e)
64
+ h.delete 'type'
65
+ h
66
+ end
67
+
68
+ def tell(x)
69
+ print(' ' * depth) if depth
70
+ if x.is_a? Hash or x.is_a? Array
71
+ p x
72
+ else
73
+ puts x
74
+ end
75
+ end
76
+
77
+ def skip?(e)
78
+ skip.include? e.name or html.include? e.name
79
+ end
80
+
81
+ def level
82
+ self.depth += 1
83
+ yield
84
+ ensure
85
+ self.depth -= 1
86
+ end
87
+ end
88
+
89
+ class BuildGraph < GraphVisitor
90
+ def visit_element(e)
91
+ return nil if skip? e
92
+ level do
93
+ vertex = graph.create_vertex visit_vertex_fields(e)
94
+ e.one_rels.each do |rel|
95
+ visit_one_rel e, vertex, rel
96
+ end
97
+ e.many_rels.each do |rel|
98
+ visit_many_rels e, vertex, rel
99
+ end
100
+ if block_given?
101
+ yield vertex
102
+ else
103
+ vertex
104
+ end
105
+ end
106
+ end
107
+
108
+ def visit_one_rel(e, from, rel)
109
+ to = visit_element(rel)
110
+ if from and to
111
+ graph.create_edge nil, from, to, rename[rel.name]
112
+ end
113
+ end
114
+
115
+ def visit_many_rels(from_e, from, rel)
116
+ return nil if skip? rel
117
+ level do
118
+ attrs = visit_edge_fields rel
119
+ attrs.delete :type
120
+ rel.contained_rels.map do |to_e|
121
+ visit_many_rel(from_e, from, rel, to_e, attrs)
122
+ end
123
+ end
124
+ end
125
+
126
+ def visit_many_rel(from_e, from, rel, to_e, attrs)
127
+ to = visit_element(to_e)
128
+ if from and to
129
+ graph.create_edge nil, from, to, rename[rel.name], attrs
130
+ end
131
+ end
132
+ end
133
+
134
+
135
+ class BuildGraphCached < BuildGraph
136
+ class << self
137
+ def empty_cache
138
+ cache = Hash.new { |h, k| h[k] = {} }
139
+ cache[:hits] = Hash.new 0
140
+ cache[:size] = 0
141
+ cache[:kill] = nil
142
+ cache[:skip] = Set[]
143
+ cache
144
+ end
145
+ end
146
+
147
+ attr_reader :cache
148
+ attr_accessor :fields
149
+
150
+ def initialize(graph, opts = {})
151
+ if opts[:cache]
152
+ @cache = self.class.empty_cache.merge! opts[:cache]
153
+ else
154
+ @cache = self.class.empty_cache
155
+ end
156
+ super
157
+ end
158
+
159
+ def build(doc)
160
+ result = super
161
+ #tell "CACHE size #{ cache[:size] }, hits:"
162
+ if cache[:stats] and documents % 100 == 99
163
+ tell '-----------------'
164
+ cache.each do |k, adds|
165
+ next unless k.is_a? String
166
+ adds = adds.length
167
+ hits = cache[:hits][k]
168
+ tell("%40s: %6s / %6s = %5.4f" % [k, hits, adds, (hits/adds.to_f)])
169
+ end
170
+ end
171
+ result
172
+ end
173
+
174
+ def cacheable?(e)
175
+ not cache[:skip].include?(rename[e.name]) and not visit_vertex_fields(e).empty?
176
+ end
177
+
178
+ def get_cached(e)
179
+ if cacheable?(e)
180
+ id = cache[rename[e.name]][visit_vertex_fields(e).hash]
181
+ #tell "cache hit: #{ e.description }" if el
182
+ if id
183
+ cache[:hits][rename[e.name]] += 1
184
+ graph.vertex(id)
185
+ end
186
+ end
187
+ end
188
+
189
+ def set_cached(e, el)
190
+ return unless el
191
+ if cacheable?(e)
192
+ ct = cache[rename[e.name]]
193
+ kill = cache[:kill]
194
+ if kill and cache[:hits][rename[e.name]] == 0 and ct.length > kill
195
+ tell "cache kill #{ e.description }"
196
+ cache[:skip] << rename[e.name]
197
+ cache[:size] -= ct.length
198
+ cache[rename[e.name]] = []
199
+ else
200
+ ct[visit_vertex_fields(e).hash] = el.element_id
201
+ cache[:size] += 1
202
+ end
203
+ end
204
+ el
205
+ end
206
+
207
+ def visit_vertex_fields(e)
208
+ self.fields ||= super
209
+ end
210
+
211
+ def visit_element(e)
212
+ self.fields = nil
213
+ get_cached(e) || set_cached(e, super)
214
+ end
215
+ end
216
+ end
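`BuildGraphCached` avoids creating duplicate vertices by keeping one table per element type, keyed by the hash of the element's fields. The sketch below shows that idea in isolation. `ToyGraph` and `find_or_create` are hypothetical stand-ins rather than Pacer or pacer-xml APIs, and the kill/skip logic is left out.

```ruby
# Hypothetical stand-in for a graph: vertices are stored by integer id.
class ToyGraph
  attr_reader :vertices

  def initialize
    @vertices = {}
  end

  def create_vertex(fields)
    id = @vertices.size
    @vertices[id] = fields
    id
  end
end

# One table per element type, keyed by the hash of the field values,
# similar in shape to BuildGraphCached.empty_cache.
cache = Hash.new { |h, k| h[k] = {} }
hits  = Hash.new(0)
graph = ToyGraph.new

find_or_create = lambda do |fields|
  type = fields['type']
  id = cache[type][fields.hash]
  if id
    hits[type] += 1 # an identical element was seen before: reuse its vertex
    id
  else
    cache[type][fields.hash] = graph.create_vertex(fields)
  end
end

a = find_or_create.call('type' => 'examiner', 'name' => 'Ann')
b = find_or_create.call('type' => 'examiner', 'name' => 'Ann')
c = find_or_create.call('type' => 'examiner', 'name' => 'Bob')

puts "vertices: #{graph.vertices.size}, hits: #{hits['examiner']}"
```

One trade-off worth noting: keying on `Hash#hash` alone means two different field sets whose hash values collide would be treated as the same vertex.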
data/lib/pacer-xml/nokogiri_node.rb ADDED
@@ -0,0 +1,148 @@
1
+ class Nokogiri::XML::Text
2
+ def tree(_ = nil)
3
+ text unless text =~ /\A\s*\Z/
4
+ end
5
+
6
+ def inspect
7
+ if text =~ /\A\s*\Z/
8
+ "#<(whitespace)>"
9
+ else
10
+ "#<Text #{ text }>"
11
+ end
12
+ end
13
+ end
14
+
15
+
16
+ class Nokogiri::XML::Node
17
+ def tree(key_map = {})
18
+ c = elements.map { |x| x.tree(key_map) }.compact
19
+ if c.empty?
20
+ key_map.fetch(name, name)
21
+ else
22
+ ct = {}
23
+ texts = []
24
+ attrs = {}
25
+ if respond_to? :attributes
26
+ attrs = Hash[attributes.map { |k, a|
27
+ k = key_map.fetch(k, k)
28
+ [k, a.value] if k
29
+ }.compact]
30
+ end
31
+ c.each do |h|
32
+ if h.is_a? String
33
+ texts << h
34
+ next
35
+ end
36
+ h.each do |name, value|
37
+ if ct.key? name
38
+ if ct[name].is_a? Array
39
+ ct[name] << value unless ct[name].include? value
40
+ elsif ct[name] != value
41
+ ct[name] = [ct[name], value]
42
+ end
43
+ else
44
+ ct[name] = value
45
+ end
46
+ end
47
+ end
48
+ ct.merge! attrs
49
+ key = key_map.fetch(name, name)
50
+ if key
51
+ if ct.empty?
52
+ if texts.count < 2
53
+ { key => texts.first }
54
+ else
55
+ { key => texts.uniq }
56
+ end
57
+ elsif texts.any?
58
+ { key => ct }
59
+ else
60
+ { key => ct }
61
+ end
62
+ end
63
+ end
64
+ end
65
+
66
+ def inspect
67
+ if children.all? &:text?
68
+ "#<Property #{ name }>"
69
+ else
70
+ "#<Element #{ name } [#{ elements.map(&:name).uniq.join(', ') }]>"
71
+ end
72
+ end
73
+
74
+ def description
75
+ s = if property?
76
+ "property"
77
+ elsif container?
78
+ 'container'
79
+ elsif vertex?
80
+ 'vertex'
81
+ else
82
+ 'other'
83
+ end
84
+ "#{ s } #{ name }"
85
+ end
86
+
87
+ def property?
88
+ children.all? &:text?
89
+ end
90
+
91
+ def container?
92
+ not property? and
93
+ elements.map(&:name).uniq.length == 1 and
94
+ elements.all? { |e| e.vertex? or e.container? }
95
+ end
96
+
97
+ def vertex?
98
+ not property? and not container?
99
+ end
100
+
101
+ def properties
102
+ elements.select(&:property?)
103
+ end
104
+
105
+ def attrs
106
+ if respond_to? :attributes
107
+ attributes
108
+ else
109
+ {}
110
+ end
111
+ end
112
+
113
+ def fields
114
+ result = {}
115
+ attrs.each do |name, attr|
116
+ result[name] = attr.value
117
+ end
118
+ properties.each do |e|
119
+ result[e.name] = e.text
120
+ end
121
+ result['type'] = name
122
+ result
123
+ end
124
+
125
+ def one_rels
126
+ elements.select &:vertex?
127
+ end
128
+
129
+ def contained_rels
130
+ if container?
131
+ elements.select(&:vertex?) +
132
+ elements.select(&:container?).flat_map(&:contained_rels)
133
+ else
134
+ []
135
+ end
136
+ end
137
+
138
+ def many_rels
139
+ elements.select &:container?
140
+ end
141
+
142
+ def rels_hash
143
+ result = Hash.new { |h, k| h[k] = [] }
144
+ one_rels.each { |e| result[e.name] << e }
145
+ many_rels.each { |e| result[e.name] += e.contained_rels }
146
+ result
147
+ end
148
+ end
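The `property?`, `container?` and `vertex?` predicates above decide how each XML element is mapped into the graph. The sketch below applies the same three rules to a toy tree so they can be tried without Nokogiri. The `Node` struct, the `kind` helper and the sample tag names are hypothetical, and text is modelled as plain Strings.

```ruby
# Toy stand-in for an XML node: children are Strings (text) or Nodes.
Node = Struct.new(:name, :children) do
  def elements
    children.grep(Node)
  end

  # Only text below this node: it becomes a property of its parent.
  def property?
    children.all? { |c| c.is_a? String }
  end

  # All child elements share one name and are themselves vertices or
  # containers: this node only groups a list of related elements.
  def container?
    !property? &&
      elements.map(&:name).uniq.length == 1 &&
      elements.all? { |e| e.vertex? || e.container? }
  end

  # Everything else becomes a vertex.
  def vertex?
    !property? && !container?
  end

  def kind
    return 'property' if property?
    return 'container' if container?
    'vertex'
  end
end

title     = Node.new('title', ['Widget'])
ann       = Node.new('inventor', [Node.new('name', ['Ann'])])
bob       = Node.new('inventor', [Node.new('name', ['Bob'])])
inventors = Node.new('inventors', [ann, bob])
patent    = Node.new('patent', [title, inventors])

[title, ann, inventors, patent].each do |n|
  puts "#{n.name}: #{n.kind}"
end
```

Note that `inventor` is a vertex rather than a container: its only child element is a property, so the container rule does not apply.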
data/lib/pacer-xml/sample.rb ADDED
@@ -0,0 +1,107 @@
1
+ require 'set'
2
+
3
+ module PacerXml
4
+ module Sample
5
+ class << self
6
+ # Will actually load 101. To avoid this side-effect of
7
+ # prefetching, the route should be defined as:
8
+ # xml_route.limit(100).import(...)
9
+ def load_100(*args)
10
+ i = importer(*args).limit(100)
11
+ i.run!
12
+ i.graph
13
+ end
14
+
15
+ # Uses a Neo4j graph because the data is too big to fit in memory
16
+ # without configuring the JVM to use more than its small default
17
+ # footprint.
18
+ #
19
+ # Alternatively, to start the JVM with more memory, try:
20
+ # bundle exec jruby -J-Xmx2048m -S irb
21
+ def load_all(graph = nil, *args)
22
+ require 'pacer-neo4j'
23
+ n = Time.now.to_i % 1000000
24
+ graph ||= Pacer.neo4j "sample.#{n}.graph"
25
+ i = importer(graph, *args)
26
+ i.run!
27
+ i.graph
28
+ end
29
+
30
+ def structure(g)
31
+ Pacer::Utils::GraphAnalysis.structure g
32
+ end
33
+
34
+ def structure!(g, fn = 'patent-structure.graphml')
35
+ s = structure g
36
+ if fn
37
+ e = Pacer::Utils::YFilesExport.new
38
+ e.vertex_label = s.vertex_name
39
+ e.edge_label = s.edge_name
40
+ e.export s, fn
41
+ puts
42
+ puts "Wrote #{ fn }"
43
+ end
44
+ s
45
+ end
46
+
47
+ # Sample of using the xml import function with some advanced options to
48
+ # clean up the resulting graph.
49
+ #
50
+ # Import can successfully be run with no options specified, but this patent
51
+ # xml is particularly hairy.
52
+ def importer(graph = nil, fn = nil, start_rule = nil, end_rule = nil)
53
+ html = [:abstract]
54
+ rename = {
55
+ 'classification-national' => 'classification',
56
+ 'assistant-examiner' => 'examiner',
57
+ 'primary-examiner' => 'examiner',
58
+ 'us-term-of-grant' => 'term',
59
+ 'addressbook' => 'entity',
60
+ 'document-id' => 'document',
61
+ 'us-related-documents' => 'related-document',
62
+ 'us-patent-grant' => 'patent-version',
63
+ 'us-bibliographic-data-grant' => 'patent'
64
+ }
65
+ cache = { stats: true }
66
+ graph ||= Pacer.tg
67
+ graph.create_key_index :type, :vertex
68
+ xml_route = xml(fn, start_rule, end_rule)
69
+ xml_route.
70
+ process { print '.' }.
71
+ import(graph, html: html, rename: rename, cache: cache)
72
+ end
73
+
74
+ def xml(fn = nil, *args)
75
+ fn ||= a_week
76
+ path = download_patent_grant fn
77
+ Pacer.xml path, *args
78
+ end
79
+
80
+ def cleanup(fn = nil)
81
+ fn ||= a_week
82
+ name, week = fn.split '_'
83
+ Dir["/tmp/#{name}*"].each { |f| File.delete f }
84
+ end
85
+
86
+ private
87
+
88
+ def a_week
89
+ 'ipgb20120103_wk01'
90
+ end
91
+
92
+ def download_patent_grant(fn)
93
+ puts "Downloading a sample xml file from"
94
+ puts "http://www.google.com/googlebooks/uspto-patents-grants-biblio.html"
95
+ name, week = fn.split '_'
96
+ result = "/tmp/#{name}.xml"
97
+ Dir.chdir '/tmp' do
98
+ unless File.exists? result
99
+ system "curl http://storage.googleapis.com/patents/grantbib/2012/#{fn}.zip > #{fn}.zip"
100
+ system "unzip #{fn}.zip"
101
+ end
102
+ end
103
+ result
104
+ end
105
+ end
106
+ end
107
+ end
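`download_patent_grant` derives both the download URL and the extracted file's path from a single name such as `ipgb20120103_wk01`. The lines below trace that derivation in plain Ruby without touching the network; the local variable names are illustrative.

```ruby
fn = 'ipgb20120103_wk01'

# The part before the underscore names the extracted XML file.
name, week = fn.split '_'

local_xml = "/tmp/#{name}.xml"
zip_url   = "http://storage.googleapis.com/patents/grantbib/2012/#{fn}.zip"

puts local_xml
puts zip_url
```

The `week` part is not used for the path, which is why `cleanup` can delete everything matching `/tmp/#{name}*`.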
data/lib/pacer-xml/string_route.rb ADDED
@@ -0,0 +1,50 @@
1
+ module Pacer
2
+ module Core
3
+ module StringRoute
4
+ def xml_stream(enter = nil, leave = nil)
5
+ enter ||= /<\?xml/
6
+ leave ||= enter
7
+ enter = build_rule :enter, enter
8
+ leave = build_rule :leave, leave
9
+ r = reducer(element_type: :array, enter: enter, leave: leave) do |s, lines|
10
+ lines << s
11
+ end.route
12
+ joined = r.map(element_type: :string, info: 'join', &:join).route
13
+ joined.xml
14
+ end
15
+
16
+ def xml
17
+ map(element_type: :xml) do |s|
18
+ Nokogiri::XML(s).first_element_child
19
+ end
20
+ end
21
+
22
+ private
23
+
24
+ def build_rule(type, rule)
25
+ rule = rule.to_s if rule.is_a? Symbol
26
+ if rule.is_a? String
27
+ if type == :leave
28
+ rule = "/#{rule}"
29
+ add_close_tag = true
30
+ end
31
+ rule = /<#{rule}\b/
32
+ end
33
+ if rule.is_a? Proc
34
+ rule
35
+ elsif add_close_tag
36
+ proc do |line, lines, set_value|
37
+ if line.nil? or rule =~ line
38
+ set_value.call(lines << line)
39
+ true
40
+ end
41
+ end
42
+ else
43
+ proc do |line|
44
+ [] if line.nil? or rule =~ line
45
+ end
46
+ end
47
+ end
48
+ end
49
+ end
50
+ end
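When a tag name is given, `build_rule` turns it into the simple patterns `/<tag\b/` and `/<\/tag\b/`, and `xml_stream` collects the lines between matches. The sketch below reproduces that line-based sectioning in plain Ruby without Pacer's reducer. `tag_rules` and `each_section` are hypothetical helpers, and the sample XML is made up.

```ruby
require 'stringio'

# Hypothetical helper: the same kind of open/close patterns that
# build_rule constructs from a tag name.
def tag_rules(tag)
  [/<#{tag}\b/, /<\/#{tag}\b/]
end

# Collect lines from the one matching the opening pattern through the one
# matching the closing pattern, yielding one joined string per section.
def each_section(io, tag)
  return enum_for(:each_section, io, tag) unless block_given?
  enter, leave = tag_rules(tag)
  lines = nil
  io.each_line do |line|
    lines = [] if lines.nil? && line =~ enter
    next unless lines
    lines << line
    if line =~ leave
      yield lines.join
      lines = nil
    end
  end
end

xml = StringIO.new(<<~XML)
  <?xml version="1.0"?>
  <patents>
  <patent id="1">
    <title>First</title>
  </patent>
  <patent id="2">
    <title>Second</title>
  </patent>
  </patents>
XML

sections = each_section(xml, 'patent').to_a
puts sections.length
puts sections.first
```

The `\b` word boundary is what keeps `<patents>` from being mistaken for `<patent>`. Like the original, this relies on tags starting and ending on line boundaries, so it is not a general XML splitter.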
data/lib/pacer-xml/version.rb ADDED
@@ -0,0 +1,7 @@
1
+ module PacerXml
2
+ unless const_defined? :VERSION
3
+ START_TIME = Time.now
4
+ VERSION = '0.2.2'
5
+ PACER_VERSION = '>= 1.1.1'
6
+ end
7
+ end
data/lib/pacer-xml/xml_route.rb ADDED
@@ -0,0 +1,129 @@
1
+ module PacerXml
2
+ module XmlRoute
3
+ def help(section = nil)
4
+ case section
5
+ when nil
6
+ puts <<HELP
7
+ This is included via the pacer-xml gem plugin.
8
+
9
+ pacer-xml uses Nokogiri for its xml parsing. Each element in an xml route
10
+ is the first child element of the Nokogiri::XML::Document element. To get at
11
+ the document element, simply call #parent on the element.
12
+
13
+ An xml route can be created, transformed, filtered and otherwise
14
+ processed by all standard Pacer routes. For instance, if a graph element
15
+ has a property with xml data in it, we could process it as follows:
16
+
17
+ g.v.map(element_type: :xml) { |v| Nokogiri(v[:xml]) }
18
+
19
+ Method help sections:
20
+ :xml
21
+ :import
22
+
23
+ HELP
24
+ when :xml
25
+ puts <<HELP
26
+
27
+
28
+
29
+ Turn an xml file into a stream of xml nodes. Scans the xml file
30
+ line-by-line and uses arguments defined in start_section and end_section
31
+ to extract sections from the file.
32
+
33
+ Pacer.xml(file, start_section = nil, end_section = nil)
34
+
35
+ file: String | IO
36
+ String path to an xml file to read
37
+ IO an open resource that responds to #each_line
38
+ start_section: String | Symbol | Regex | Proc (optional)
39
+ String | Symbol name of xml tag to use as the root node of each
40
+ section of xml. The end_section will automatically be
41
+ set to the closing tag. This uses very simple regex
42
+ matching.
43
+ Regex If it matches, start the section from this line
44
+ Proc proc { |line| }
45
+ If it results in a truthy value, starts collecting
46
+ lines for the next section of xml.
47
+ end_section: Regex | Proc (optional)
48
+ Regex If it matches, end the section including this line
49
+ Proc proc { |line, lines| }
50
+ - Return a truthy value to indicate that the
51
+ current line is the last line in a section.
52
+ - Return an Array to pass the result of
53
+ joining the array to Nokogiri for the next section.
54
+
55
+ HELP
56
+ when :import
57
+ puts <<HELP
58
+ Turn the tree of xml in each node in the stream into graph vertices and edges.
59
+
60
+ xml_route.import(graph, opts = {})
61
+
62
+ graph: PacerGraph The graph to load the data into.
63
+ opts: Hash
64
+ :cache false | Hash
65
+ false disable caching
66
+ stats: true enable occasional dump of cache info
67
+ :rename Hash map of { 'old-name' => 'new-name' }
68
+ :html Array set of tag names to treat as containing HTML
69
+ :skip Array set of tag or attribute names to skip
70
+
71
+ Produces a vertex route where each vertex is the root vertex for each xml tree.
72
+
73
+ Look at the source of lib/pacer-xml/sample.rb for a good example.
74
+
75
+ HELP
76
+ else
77
+ super
78
+ end
79
+ description
80
+ end
81
+
82
+ def children
83
+ flat_map(element_type: :xml) { |x| x.children.to_a }
84
+ end
85
+
86
+ def names
87
+ map element_type: :string, &:name
88
+ end
89
+
90
+ def text_nodes
91
+ select &:text?
92
+ end
93
+
94
+ def elements
95
+ select &:element?
96
+ end
97
+
98
+ def fields
99
+ elements.map element_type: :hash, &:fields
100
+ end
101
+
102
+ def import(graph, opts = {})
103
+ if opts[:cache] == false
104
+ builder = BuildGraph.new(graph, opts)
105
+ else
106
+ builder = BuildGraphCached.new(graph, opts)
107
+ end
108
+ graph.vertex_name ||= proc { |v| v[:type] }
109
+ to_route.map(route_name: 'import', graph: graph, element_type: :vertex, modules: [ImportHelp]) do |node|
110
+ graph.transaction do
111
+ builder.build(node)
112
+ end
113
+ end.route
114
+ end
115
+
116
+ module ImportHelp
117
+ def help(section = nil)
118
+ case section
119
+ when nil
120
+ back.help :import
121
+ else
122
+ super
123
+ end
124
+ description
125
+ end
126
+ end
127
+ end
128
+ Pacer::RouteBuilder.current.element_types[:xml] = [XmlRoute]
129
+ end
data/lib/pacer-xml.rb ADDED
@@ -0,0 +1,48 @@
1
+ require_relative 'pacer-xml/version'
2
+ require 'nokogiri'
3
+ require 'pacer'
4
+
5
+ module PacerXml
6
+ class << self
7
+ # Returns the time pacer-xml was last reloaded (or when it was started).
8
+ def reload_time
9
+ if defined? @reload_time
10
+ @reload_time
11
+ else
12
+ START_TIME
13
+ end
14
+ end
15
+
16
+ # Reload all Ruby modified files in the pacer-xml library. Useful for debugging
17
+ # in the console. Does not do any of the fancy stuff that Rails reloading
18
+ # does. Certain types of changes will still require restarting the session.
19
+ def reload!
20
+ require 'pathname'
21
+ Pathname.new(File.expand_path(__FILE__)).parent.find do |path|
22
+ if path.extname == '.rb' and path.mtime > reload_time
23
+ puts path.to_s
24
+ load path.to_s
25
+ end
26
+ end
27
+ @reload_time = Time.now
28
+ end
29
+ end
30
+ end
31
+
32
+ require_relative 'pacer-xml/build_graph'
33
+ require_relative 'pacer-xml/nokogiri_node'
34
+ require_relative 'pacer-xml/xml_route'
35
+ require_relative 'pacer-xml/string_route'
36
+ require_relative 'pacer-xml/sample'
37
+
38
+ module Pacer
39
+ class << self
40
+ def xml(file, enter = nil, leave = nil)
41
+ if file.is_a? String
42
+ file = File.open file
43
+ end
44
+ lines = file.each_line.to_route(element_type: :string, info: 'lines').route
45
+ lines.xml_stream(enter, leave).route
46
+ end
47
+ end
48
+ end
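`Pacer.xml` is documented as accepting either a String path or an IO that responds to `#each_line`. The sketch below shows one way to normalize those two inputs in plain Ruby. `each_xml_line` is a hypothetical helper, not part of pacer-xml, and it leaves out the Pacer route construction.

```ruby
require 'stringio'
require 'tempfile'

# Hypothetical helper: treat a String as a path to open, and anything
# else as an object that already responds to #each_line.
def each_xml_line(file)
  file = File.open(file) if file.is_a? String
  file.each_line
end

# An in-memory IO works directly.
from_io = each_xml_line(StringIO.new("<a/>\n<b/>\n")).to_a

# A path is opened first, then read the same way.
from_path = Tempfile.create('pacer-xml-sketch') do |tmp|
  tmp.write("<a/>\n<b/>\n")
  tmp.flush
  each_xml_line(tmp.path).to_a
end

puts from_io == from_path
```

As in the original, a file opened from a path is never explicitly closed here; a longer-lived program would probably want to manage that handle.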
data/pacer-xml.gemspec ADDED
@@ -0,0 +1,24 @@
1
+ # -*- encoding: utf-8 -*-
2
+ $:.push File.expand_path("../lib", __FILE__)
3
+ require "pacer-xml/version"
4
+
5
+ Gem::Specification.new do |s|
6
+ s.name = "pacer-xml"
7
+ s.version = PacerXml::VERSION
8
+ s.platform = 'java'
9
+ s.authors = ["Darrick Wiebe"]
10
+ s.email = ["dw@xnlogic.com"]
11
+ s.homepage = "http://xnlogic.com"
12
+ s.summary = %q{XML streaming and graph import for Pacer}
13
+ s.description = s.summary
14
+
15
+ s.add_dependency 'pacer', PacerXml::PACER_VERSION
16
+ s.add_dependency 'pacer-neo4j', ">= 2.1"
17
+ s.add_dependency 'nokogiri'
18
+
19
+ s.rubyforge_project = "pacer-xml"
20
+
21
+ s.files = `git ls-files`.split("\n")
22
+ s.test_files = `git ls-files -- {test,spec,features}/*`.split("\n")
23
+ s.require_paths = ["lib"]
24
+ end
metadata CHANGED
@@ -2,14 +2,14 @@
2
2
  name: pacer-xml
3
3
  version: !ruby/object:Gem::Version
4
4
  prerelease:
5
- version: 0.2.1
5
+ version: 0.2.2
6
6
  platform: java
7
7
  authors:
8
8
  - Darrick Wiebe
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2012-10-27 00:00:00.000000000 Z
12
+ date: 2012-10-31 00:00:00.000000000 Z
13
13
  dependencies:
14
14
  - !ruby/object:Gem::Dependency
15
15
  name: pacer
@@ -67,7 +67,19 @@ email:
67
67
  executables: []
68
68
  extensions: []
69
69
  extra_rdoc_files: []
70
- files: []
70
+ files:
71
+ - .gitignore
72
+ - Gemfile
73
+ - Rakefile
74
+ - Readme.markdown
75
+ - lib/pacer-xml.rb
76
+ - lib/pacer-xml/build_graph.rb
77
+ - lib/pacer-xml/nokogiri_node.rb
78
+ - lib/pacer-xml/sample.rb
79
+ - lib/pacer-xml/string_route.rb
80
+ - lib/pacer-xml/version.rb
81
+ - lib/pacer-xml/xml_route.rb
82
+ - pacer-xml.gemspec
71
83
  homepage: http://xnlogic.com
72
84
  licenses: []
73
85
  post_install_message: