pacer-xml 0.2.1-java → 0.2.2-java

data/.gitignore ADDED
@@ -0,0 +1,5 @@
1
+ *.graph
2
+ *.lock
3
+ *.xml
4
+ pkg
5
+ *.graphml
data/Gemfile ADDED
@@ -0,0 +1,6 @@
1
+ source "http://rubygems.org"
2
+
3
+ # Specify your gem's dependencies in pacer-graph.gemspec
4
+ gemspec
5
+
6
+ gem 'pacer', path: '~/xn/pacer'
data/Rakefile ADDED
@@ -0,0 +1,2 @@
1
+ require 'bundler'
2
+ Bundler::GemHelper.install_tasks
data/Readme.markdown ADDED
@@ -0,0 +1,172 @@
1
+ pacer-xml
2
+ =========
3
+
4
+ This Pacer plugin is designed to make it dead-simple to import any
5
+ arbitrary XML file (no matter how bizarre) into any graph database
6
+ supported by Pacer.
7
+
8
+ This library evolved out of my need to be able to easily pull in sample
9
+ data when demoing Pacer. GraphML is pretty rare and what I've been able
10
+ to find is mostly pretty lame anyway, but raw XML seems to be everywhere
11
+ (just check out [DATA.GOV](http://www.data.gov/)).
12
+
13
+
14
+ Usage
15
+ -----
16
+
17
+ I suggest looking at the implementation of the below sample to see how
18
+ I've used pacer-xml there.
19
+
20
+ There are 2 key methods:
21
+
22
+ `Pacer.xml(file, start_section = nil, end_section = nil)`
23
+
24
+ ```
25
+ file: String | IO
26
+ String path to an xml file to read
27
+ IO an open resource that responds to #each_line
28
+ start_section: String | Symbol | Regex | Proc (optional)
29
+ String | Symbol name of xml tag to use as the root node of each
30
+ section of xml. The end_section will automatically be
31
+ set to the closing tag. This uses very simple regex
32
+ matching.
33
+ Regex If it matches, start the section from this line
34
+ Proc proc { |line| }
35
+ If it results in a truthy value, starts collecting
36
+ lines for the next section of xml.
37
+ end_section: Regex | Proc (optional)
38
+ Regex If it matches, end the section including this line
39
+ Proc proc { |line, lines| }
40
+ - If it returns a truthy value, the current line is the
41
+ last line in a section.
42
+ - If it returns an Array, the result of joining the
43
+ array is passed to Nokogiri as the next section.
44
+ ```
45
+
46
+ If the parser is building a section when it gets to the end of the file,
47
+ it will call `end_section.call(nil, lines)`. To prevent the final
48
+ section from being processed, return `[]`.
49
+
50
+ Returns a Pacer Route to a series of Nokogiri::XML::Elements. Each
51
+ element is the root element of its document. By default, chunks are
52
+ delimited by the presence of `<?xml`.
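The default chunking rule can be illustrated without Pacer or Nokogiri. The sketch below is a plain-Ruby approximation, not the gem's implementation: `split_sections` is a hypothetical helper that starts a new chunk whenever a line matches `<?xml`, and it assumes that lines appearing before the first delimiter are simply dropped.

```ruby
# Hypothetical helper (not part of pacer-xml): group lines into chunks,
# starting a new chunk each time a line matches the delimiter.
def split_sections(lines, delimiter = /<\?xml/)
  sections = []
  current = nil
  lines.each do |line|
    if line =~ delimiter
      sections << current.join if current
      current = []
    end
    current << line if current # lines before the first delimiter are dropped
  end
  sections << current.join if current
  sections
end

input = [
  "<?xml version=\"1.0\"?>\n",
  "<a>1</a>\n",
  "<?xml version=\"1.0\"?>\n",
  "<b>2</b>\n"
]
sections = split_sections(input)
```

Each string in `sections` is one complete document that could then be handed to an XML parser.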
53
+
54
+
55
+ `xml_route.import(graph, opts = {})`
56
+
57
+ ```
58
+ graph: PacerGraph The graph to load the data into.
59
+ opts: Hash
60
+ :cache false | Hash
61
+ false disable caching
62
+ stats: true enable occasional dump of cache info
63
+ :rename Hash map of { 'old-name' => 'new-name' }
64
+ :html Array set of tag names to treat as containing HTML
65
+ :skip Array set of tag or attribute names to skip
66
+ ```
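To see how a `:rename` map behaves for names it does not mention, here is a standalone sketch modelled on `GraphVisitor.build_rename` from this release. It uses a hash with a default block, so any unknown name maps to itself; the sample names passed in are only illustrative.

```ruby
# Modelled on GraphVisitor.build_rename: unknown names map to themselves.
def build_rename(custom = {})
  map = Hash.new { |hash, key| hash[key] = key.to_s }
  map['id'] = 'identifier' # built-in rename applied before custom entries
  map.merge!(custom) if custom
  map
end

rename = build_rename('primary-examiner' => 'examiner')
renamed   = rename['primary-examiner'] # found in the custom map
unchanged = rename['abstract']         # falls through to the default block
```

Because `merge!` keeps the receiver's default block, custom entries and the fall-through behaviour coexist in one hash.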
67
+
68
+ Baked-in Sample
69
+ ---------------
70
+
71
+ This library started out with me tackling a chunk of [Patent Grants](https://explore.data.gov/Business-Enterprise/Patent-Grant-Bibliographic-Text-1976-Present-/8du5-jxih)
72
+ data, and my first attempt at importing it was with a hand-crafted set
73
+ of rules that walked the XML, creating graph elements along the way.
74
+ That was fairly painful and turned out to be very slow as well. My
75
+ second attempt evolved into this tool. The cool thing is that by the
76
+ end, everything specific to the patent grants data set was just a few
77
+ lines of configuration on top of a very powerful streaming XML parsing
78
+ tool.
79
+
80
+ I encourage you to check out the sample data. Simply install this gem
81
+ and start up IRB, then:
82
+
83
+ ```ruby
84
+ require 'pacer-xml'
85
+
86
+ graph = PacerXml::Sample.load_100
87
+ ```
88
+
89
+ That will download and extract a 100 MB XML file full of 2 weeks of patent
90
+ grants data, then create a graph with the first 100 patents, including
91
+ every piece of data in the file.
92
+
93
+ I encourage you to take a look at [how it was done](https://github.com/xnlogic/pacer-xml/blob/master/lib/pacer-xml/sample.rb).
94
+
95
+ Once you've created a graph from the data, it may be useful for you to
96
+ check out how it's structured. Pacer's got a handy tool built in to do
97
+ that, `Pacer::Utils::GraphAnalysis.structure graph`, but let's go one
98
+ step further and visually analyze the graph. If we run the command
99
+ below, we'll see the same results as the GraphAnalysis, but it will
100
+ export a graphml file that we can load into yEd, an excellent free graph
101
+ visualization tool:
102
+
103
+ ```ruby
104
+ PacerXml::Sample.structure! graph
105
+ # ... lots of output ...
106
+ #=> #<PacerGraph tinkergraph[vertices:90 edges:112]>
107
+ ```
108
+
109
+ The new file in your working directory is called
110
+ `patent-structure.graphml`. Open that file in yEd. You'll see a single
111
+ box... Fortunately, laying it out is fairly simple:
112
+
113
+ 1. Tools / Fit Node To Label
114
+ 1. OK
115
+ 1. Layout / Hierarchical...
116
+ 1. Labelling Tab / set Edge Labelling to Hierarchic
117
+ 1. OK
118
+
119
+ Cool!
120
+
121
+ Contextual Help
122
+ ---------------
123
+
124
+ Back to Pacer: there's lots to learn about it. The best way to do
125
+ that is to use Pacer's own inline help:
126
+
127
+ * Use `Pacer.help` for general help
128
+ * Get into a general section with `Pacer.help :section`
129
+ * Get contextual help with `graph.v.map.help`
130
+ * Get more contextual help with `graph.v.map.help :section`
131
+
132
+ Contextual help was only added recently, so it's not complete yet, but
133
+ it's developing quickly and contributions are very welcome!
134
+
135
+ More
136
+ -----
137
+
138
+ To play with the xml tools themselves, try out the following commands:
139
+
140
+ ```ruby
141
+ xml_route = PacerXml::Sample.xml(nil, start_rule, end_rule)
142
+
143
+ importer = PacerXml::Sample.importer
144
+ ```
145
+
146
+ Performance Notes
147
+ -----------------
148
+
149
+ This section uses the `PacerXml::Sample.load_all` method. The `load_100`
150
+ method runs in just a couple of seconds.
151
+
152
+ The default sample file contains 3019840 lines representing 4479
153
+ documents. Running under the simple `bundle exec irb` command on a MBP
154
+ 2.3 GHz i7, here are some quick timings (in seconds) for operations on
155
+ the entire file:
156
+
157
+ ```
158
+ => 8.36 iterate through 3019840 lines
159
+ => 28.534 reduce the lines to 4479 arrays of lines
160
+ => 29.753 join each array of lines into a string
161
+ => 34.788 parse each string into a Nokogiri XML document
162
+ => 812.732 create a graph, producing 494659 vertices and 629690 edges
163
+ ```
164
+
165
+ Starting up with `bundle exec jruby --server -J-Xmx2048m -S irb`
166
+ slightly improves performance of the import but does not appear to
167
+ affect Pacer or Nokogiri's performance:
168
+
169
+ ```
170
+ => 34.857 parsed XML documents
171
+ => 780.828 created graph
172
+ ```
data/lib/pacer-xml/build_graph.rb ADDED
@@ -0,0 +1,216 @@
1
+ require 'set'
2
+
3
+ module PacerXml
4
+ class GraphVisitor
5
+ class << self
6
+ def build_rename(custom = {})
7
+ h = Hash.new { |h, k| h[k] = k.to_s }
8
+ h['id'] = 'identifier'
9
+ h.merge! custom if custom
10
+ h
11
+ end
12
+ end
13
+
14
+ attr_reader :graph
15
+ attr_accessor :depth, :documents
16
+ attr_reader :rename, :html, :skip
17
+
18
+ def initialize(graph, opts = {})
19
+ @documents = 0
20
+ @graph = graph
21
+ # treat tag as a property containing html
22
+ @html = (opts[:html] || []).map(&:to_s).to_set
23
+ # skip property or tag
24
+ @skip = (opts[:skip] || []).map(&:to_s).to_set
25
+ # rename type or property
26
+ @rename = self.class.build_rename(opts[:rename])
27
+ end
28
+
29
+ def build(doc)
30
+ self.documents += 1
31
+ self.depth = 0
32
+ if doc.is_a? Nokogiri::XML::Document
33
+ visit_element doc.first_element_child
34
+ elsif doc.element?
35
+ visit_element doc
36
+ elsif doc.is_a? Enumerable
37
+ doc.select(&:element?).each { |e| visit_element e }
38
+ else
39
+ fail "Don't know what you want to do"
40
+ end
41
+ end
42
+
43
+ def visit_vertex_fields(e)
44
+ h = e.fields
45
+ h['type'] = rename[h['type']]
46
+ rename.each do |from, to|
47
+ if h.key? from
48
+ h[to] = h.delete from
49
+ end
50
+ end
51
+ html.each do |name|
52
+ name = rename[name]
53
+ child = e.at_xpath(name)
54
+ h[name] = child.inner_html if child
55
+ end
56
+ skip.each do |name|
57
+ h.delete name
58
+ end
59
+ h
60
+ end
61
+
62
+ def visit_edge_fields(e)
63
+ h = visit_vertex_fields(e)
64
+ h.delete 'type'
65
+ h
66
+ end
67
+
68
+ def tell(x)
69
+ print(' ' * depth) if depth
70
+ if x.is_a? Hash or x.is_a? Array
71
+ p x
72
+ else
73
+ puts x
74
+ end
75
+ end
76
+
77
+ def skip?(e)
78
+ skip.include? e.name or html.include? e.name
79
+ end
80
+
81
+ def level
82
+ self.depth += 1
83
+ yield
84
+ ensure
85
+ self.depth -= 1
86
+ end
87
+ end
88
+
89
+ class BuildGraph < GraphVisitor
90
+ def visit_element(e)
91
+ return nil if skip? e
92
+ level do
93
+ vertex = graph.create_vertex visit_vertex_fields(e)
94
+ e.one_rels.each do |rel|
95
+ visit_one_rel e, vertex, rel
96
+ end
97
+ e.many_rels.each do |rel|
98
+ visit_many_rels e, vertex, rel
99
+ end
100
+ if block_given?
101
+ yield vertex
102
+ else
103
+ vertex
104
+ end
105
+ end
106
+ end
107
+
108
+ def visit_one_rel(e, from, rel)
109
+ to = visit_element(rel)
110
+ if from and to
111
+ graph.create_edge nil, from, to, rename[rel.name]
112
+ end
113
+ end
114
+
115
+ def visit_many_rels(from_e, from, rel)
116
+ return nil if skip? rel
117
+ level do
118
+ attrs = visit_edge_fields rel
119
+ attrs.delete :type
120
+ rel.contained_rels.map do |to_e|
121
+ visit_many_rel(from_e, from, rel, to_e, attrs)
122
+ end
123
+ end
124
+ end
125
+
126
+ def visit_many_rel(from_e, from, rel, to_e, attrs)
127
+ to = visit_element(to_e)
128
+ if from and to
129
+ graph.create_edge nil, from, to, rename[rel.name], attrs
130
+ end
131
+ end
132
+ end
133
+
134
+
135
+ class BuildGraphCached < BuildGraph
136
+ class << self
137
+ def empty_cache
138
+ cache = Hash.new { |h, k| h[k] = {} }
139
+ cache[:hits] = Hash.new 0
140
+ cache[:size] = 0
141
+ cache[:kill] = nil
142
+ cache[:skip] = Set[]
143
+ cache
144
+ end
145
+ end
146
+
147
+ attr_reader :cache
148
+ attr_accessor :fields
149
+
150
+ def initialize(graph, opts = {})
151
+ if opts[:cache]
152
+ @cache = self.class.empty_cache.merge! opts[:cache]
153
+ else
154
+ @cache = self.class.empty_cache
155
+ end
156
+ super
157
+ end
158
+
159
+ def build(doc)
160
+ result = super
161
+ #tell "CACHE size #{ cache[:size] }, hits:"
162
+ if cache[:stats] and documents % 100 == 99
163
+ tell '-----------------'
164
+ cache.each do |k, adds|
165
+ next unless k.is_a? String
166
+ adds = adds.length
167
+ hits = cache[:hits][k]
168
+ tell("%40s: %6s / %6s = %5.4f" % [k, hits, adds, (hits/adds.to_f)])
169
+ end
170
+ end
171
+ result
172
+ end
173
+
174
+ def cacheable?(e)
175
+ not cache[:skip].include?(rename[e.name]) and not visit_vertex_fields(e).empty?
176
+ end
177
+
178
+ def get_cached(e)
179
+ if cacheable?(e)
180
+ id = cache[rename[e.name]][visit_vertex_fields(e).hash]
181
+ #tell "cache hit: #{ e.description }" if el
182
+ if id
183
+ cache[:hits][rename[e.name]] += 1
184
+ graph.vertex(id)
185
+ end
186
+ end
187
+ end
188
+
189
+ def set_cached(e, el)
190
+ return unless el
191
+ if cacheable?(e)
192
+ ct = cache[rename[e.name]]
193
+ kill = cache[:kill]
194
+ if kill and cache[:hits][rename[e.name]] == 0 and ct.length > kill
195
+ tell "cache kill #{ e.description }"
196
+ cache[:skip] << rename[e.name]
197
+ cache[:size] -= ct.length
198
+ cache[rename[e.name]] = []
199
+ else
200
+ ct[visit_vertex_fields(e).hash] = el.element_id
201
+ cache[:size] += 1
202
+ end
203
+ end
204
+ el
205
+ end
206
+
207
+ def visit_vertex_fields(e)
208
+ self.fields ||= super
209
+ end
210
+
211
+ def visit_element(e)
212
+ self.fields = nil
213
+ get_cached(e) || set_cached(e, super)
214
+ end
215
+ end
216
+ end
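The caching idea in `BuildGraphCached` is to key each vertex by its type and by the hash of its field values, so that repeated identical elements resolve to a single vertex. The following is a minimal plain-Ruby sketch of that idea only: `FieldCache` is a hypothetical class, and integer ids stand in for graph vertices.

```ruby
# Hypothetical stand-in for the per-type cache used by BuildGraphCached.
class FieldCache
  attr_reader :hits

  def initialize
    @tables = Hash.new { |hash, type| hash[type] = {} }
    @hits = Hash.new(0)
    @next_id = 0
  end

  # Returns the id for these fields, allocating a new one on first sight.
  def fetch(type, fields)
    table = @tables[type]
    key = fields.hash
    if table.key?(key)
      @hits[type] += 1
      table[key]
    else
      table[key] = (@next_id += 1)
    end
  end
end

cache = FieldCache.new
a = cache.fetch('examiner', 'name' => 'Smith')
b = cache.fetch('examiner', 'name' => 'Smith') # same fields, same id
c = cache.fetch('examiner', 'name' => 'Jones')
```

Like the original, the sketch keys on `Hash#hash` rather than on the fields themselves, so a hash collision would conflate two distinct elements. Keying on the fields hash directly would avoid that at the cost of more memory.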
data/lib/pacer-xml/nokogiri_node.rb ADDED
@@ -0,0 +1,148 @@
1
+ class Nokogiri::XML::Text
2
+ def tree(_ = nil)
3
+ text unless text =~ /\A\s*\Z/
4
+ end
5
+
6
+ def inspect
7
+ if text =~ /\A\s*\Z/
8
+ "#<(whitespace)>"
9
+ else
10
+ "#<Text #{ text }>"
11
+ end
12
+ end
13
+ end
14
+
15
+
16
+ class Nokogiri::XML::Node
17
+ def tree(key_map = {})
18
+ c = elements.map { |x| x.tree(key_map) }.compact
19
+ if c.empty?
20
+ key_map.fetch(name, name)
21
+ else
22
+ ct = {}
23
+ texts = []
24
+ attrs = {}
25
+ if respond_to? :attributes
26
+ attrs = Hash[attributes.map { |k, a|
27
+ k = key_map.fetch(k, k)
28
+ [k, a.value] if k
29
+ }.compact]
30
+ end
31
+ c.each do |h|
32
+ if h.is_a? String
33
+ texts << h
34
+ next
35
+ end
36
+ h.each do |name, value|
37
+ if ct.key? name
38
+ if ct[name].is_a? Array
39
+ ct[name] << value unless ct[name].include? value
40
+ elsif ct[name] != value
41
+ ct[name] = [ct[name], value]
42
+ end
43
+ else
44
+ ct[name] = value
45
+ end
46
+ end
47
+ end
48
+ ct.merge! attrs
49
+ key = key_map.fetch(name, name)
50
+ if key
51
+ if ct.empty?
52
+ if texts.count < 2
53
+ { key => texts.first }
54
+ else
55
+ { key => texts.uniq }
56
+ end
57
+ elsif texts.any?
58
+ { key => ct }
59
+ else
60
+ { key => ct }
61
+ end
62
+ end
63
+ end
64
+ end
65
+
66
+ def inspect
67
+ if children.all? &:text?
68
+ "#<Property #{ name }>"
69
+ else
70
+ "#<Element #{ name } [#{ elements.map(&:name).uniq.join(', ') }]>"
71
+ end
72
+ end
73
+
74
+ def description
75
+ s = if property?
76
+ "property"
77
+ elsif container?
78
+ 'container'
79
+ elsif vertex?
80
+ 'vertex'
81
+ else
82
+ 'other'
83
+ end
84
+ "#{ s } #{ name }"
85
+ end
86
+
87
+ def property?
88
+ children.all? &:text?
89
+ end
90
+
91
+ def container?
92
+ not property? and
93
+ elements.map(&:name).uniq.length == 1 and
94
+ elements.all? { |e| e.vertex? or e.container? }
95
+ end
96
+
97
+ def vertex?
98
+ not property? and not container?
99
+ end
100
+
101
+ def properties
102
+ elements.select(&:property?)
103
+ end
104
+
105
+ def attrs
106
+ if respond_to? :attributes
107
+ attributes
108
+ else
109
+ {}
110
+ end
111
+ end
112
+
113
+ def fields
114
+ result = {}
115
+ attrs.each do |name, attr|
116
+ result[name] = attr.value
117
+ end
118
+ properties.each do |e|
119
+ result[e.name] = e.text
120
+ end
121
+ result['type'] = name
122
+ result
123
+ end
124
+
125
+ def one_rels
126
+ elements.select &:vertex?
127
+ end
128
+
129
+ def contained_rels
130
+ if container?
131
+ elements.select(&:vertex?) +
132
+ elements.select(&:container?).flat_map(&:contained_rels)
133
+ else
134
+ []
135
+ end
136
+ end
137
+
138
+ def many_rels
139
+ elements.select &:container?
140
+ end
141
+
142
+ def rels_hash
143
+ result = Hash.new { |h, k| h[k] = [] }
144
+ one_rels.each { |e| result[e.name] << e }
145
+ many_rels.each { |e| result[e.name] += e.contained_rels }
146
+ result
147
+ end
148
+ end
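The `property?` / `container?` / `vertex?` rules above decide how each XML element is mapped into the graph. A small sketch makes the classification concrete without requiring Nokogiri. `Node` here is a hypothetical miniature tree type whose children are either Strings (text) or other Nodes, and the sample document is invented for illustration.

```ruby
# Hypothetical miniature node type; children are Strings (text) or Nodes.
Node = Struct.new(:name, :children) do
  def elements
    children.select { |c| c.is_a?(Node) }
  end

  # Only text inside: becomes a property on the parent vertex.
  def property?
    children.all? { |c| c.is_a?(String) }
  end

  # Wraps same-named vertices or containers: becomes a set of edges.
  def container?
    !property? &&
      elements.map(&:name).uniq.length == 1 &&
      elements.all? { |e| e.vertex? || e.container? }
  end

  # Anything else becomes a vertex of its own.
  def vertex?
    !property? && !container?
  end
end

inventor = ->(name) { Node.new('inventor', [Node.new('name', [name])]) }
patent = Node.new('patent', [
  Node.new('title', ['Widget']),
  Node.new('inventors', [inventor.call('A'), inventor.call('B')])
])
title, inventors = patent.elements
```

In this tree `title` is a property, `inventors` is a container of two `inventor` vertices, and `patent` is a vertex.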
data/lib/pacer-xml/sample.rb ADDED
@@ -0,0 +1,107 @@
1
+ require 'set'
2
+
3
+ module PacerXml
4
+ module Sample
5
+ class << self
6
+ # Will actually load 101. To avoid this side-effect of
7
+ # prefetching, the route should be defined as:
8
+ # xml_route.limit(100).import(...)
9
+ def load_100(*args)
10
+ i = importer(*args).limit(100)
11
+ i.run!
12
+ i.graph
13
+ end
14
+
15
+ # Uses a Neo4j graph because the data is too big to fit in memory
16
+ # without configuring the JVM to use more than its small default
17
+ # footprint.
18
+ #
19
+ # Alternatively, to start the JVM with more memory, try:
20
+ # bundle exec jruby -J-Xmx2048m -S irb
21
+ def load_all(graph = nil, *args)
22
+ require 'pacer-neo4j'
23
+ n = Time.now.to_i % 1000000
24
+ graph ||= Pacer.neo4j "sample.#{n}.graph"
25
+ i = importer(graph, *args)
26
+ i.run!
27
+ i.graph
28
+ end
29
+
30
+ def structure(g)
31
+ Pacer::Utils::GraphAnalysis.structure g
32
+ end
33
+
34
+ def structure!(g, fn = 'patent-structure.graphml')
35
+ s = structure g
36
+ if fn
37
+ e = Pacer::Utils::YFilesExport.new
38
+ e.vertex_label = s.vertex_name
39
+ e.edge_label = s.edge_name
40
+ e.export s, fn
41
+ puts
42
+ puts "Wrote #{ fn }"
43
+ end
44
+ s
45
+ end
46
+
47
+ # Sample of using the xml import function with some advanced options to
48
+ # clean up the resulting graph.
49
+ #
50
+ # Import can successfully be run with no options specified, but this patent
51
+ # xml is particularly hairy.
52
+ def importer(graph = nil, fn = nil, start_rule = nil, end_rule = nil)
53
+ html = [:abstract]
54
+ rename = {
55
+ 'classification-national' => 'classification',
56
+ 'assistant-examiner' => 'examiner',
57
+ 'primary-examiner' => 'examiner',
58
+ 'us-term-of-grant' => 'term',
59
+ 'addressbook' => 'entity',
60
+ 'document-id' => 'document',
61
+ 'us-related-documents' => 'related-document',
62
+ 'us-patent-grant' => 'patent-version',
63
+ 'us-bibliographic-data-grant' => 'patent'
64
+ }
65
+ cache = { stats: true }
66
+ graph ||= Pacer.tg
67
+ graph.create_key_index :type, :vertex
68
+ xml_route = xml(fn, start_rule, end_rule)
69
+ xml_route.
70
+ process { print '.' }.
71
+ import(graph, html: html, rename: rename, cache: cache)
72
+ end
73
+
74
+ def xml(fn = nil, *args)
75
+ fn ||= a_week
76
+ path = download_patent_grant fn
77
+ Pacer.xml path, *args
78
+ end
79
+
80
+ def cleanup(fn = nil)
81
+ fn ||= a_week
82
+ name, week = fn.split '_'
83
+ Dir["/tmp/#{name}*"].each { |f| File.delete f }
84
+ end
85
+
86
+ private
87
+
88
+ def a_week
89
+ 'ipgb20120103_wk01'
90
+ end
91
+
92
+ def download_patent_grant(fn)
93
+ puts "Downloading a sample xml file from"
94
+ puts "http://www.google.com/googlebooks/uspto-patents-grants-biblio.html"
95
+ name, week = fn.split '_'
96
+ result = "/tmp/#{name}.xml"
97
+ Dir.chdir '/tmp' do
98
+ unless File.exist? result
99
+ system "curl http://storage.googleapis.com/patents/grantbib/2012/#{fn}.zip > #{fn}.zip"
100
+ system "unzip #{fn}.zip"
101
+ end
102
+ end
103
+ result
104
+ end
105
+ end
106
+ end
107
+ end
data/lib/pacer-xml/string_route.rb ADDED
@@ -0,0 +1,50 @@
1
+ module Pacer
2
+ module Core
3
+ module StringRoute
4
+ def xml_stream(enter = nil, leave = nil)
5
+ enter ||= /<\?xml/
6
+ leave ||= enter
7
+ enter = build_rule :enter, enter
8
+ leave = build_rule :leave, leave
9
+ r = reducer(element_type: :array, enter: enter, leave: leave) do |s, lines|
10
+ lines << s
11
+ end.route
12
+ joined = r.map(element_type: :string, info: 'join', &:join).route
13
+ joined.xml
14
+ end
15
+
16
+ def xml
17
+ map(element_type: :xml) do |s|
18
+ Nokogiri::XML(s).first_element_child
19
+ end
20
+ end
21
+
22
+ private
23
+
24
+ def build_rule(type, rule)
25
+ rule = rule.to_s if rule.is_a? Symbol
26
+ if rule.is_a? String
27
+ if type == :leave
28
+ rule = "/#{rule}"
29
+ add_close_tag = true
30
+ end
31
+ rule = /<#{rule}\b/
32
+ end
33
+ if rule.is_a? Proc
34
+ rule
35
+ elsif add_close_tag
36
+ proc do |line, lines, set_value|
37
+ if line.nil? or rule =~ line
38
+ set_value.call(lines << line)
39
+ true
40
+ end
41
+ end
42
+ else
43
+ proc do |line|
44
+ [] if line.nil? or rule =~ line
45
+ end
46
+ end
47
+ end
48
+ end
49
+ end
50
+ end
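When a String or Symbol is given as the rule, `build_rule` turns the tag name into simple open-tag and close-tag regexes. The sketch below shows the same derivation in isolation. `tag_rules` is a hypothetical helper, and unlike the original it escapes the tag name with `Regexp.escape`, which is an assumption added here for safety.

```ruby
# Hypothetical helper: derive open/close matchers from a tag name,
# in the spirit of build_rule for String or Symbol rules.
def tag_rules(tag)
  name = Regexp.escape(tag.to_s)
  [/<#{name}\b/, /<\/#{name}\b/]
end

enter, leave = tag_rules(:patent)
opens  = enter.match?('<patent id="1">') # start of a section
closes = leave.match?('</patent>')       # end of a section
```

The trailing `\b` keeps `<patent` from matching `<patents>`, but a hyphen also counts as a word boundary, so `<patent-family>` would still match. The same caveat applies to the `/<#{rule}\b/` pattern above, which is presumably what the README means by "very simple regex matching".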
data/lib/pacer-xml/version.rb ADDED
@@ -0,0 +1,7 @@
1
+ module PacerXml
2
+ unless const_defined? :VERSION
3
+ START_TIME = Time.now
4
+ VERSION = '0.2.2'
5
+ PACER_VERSION = '>= 1.1.1'
6
+ end
7
+ end
data/lib/pacer-xml/xml_route.rb ADDED
@@ -0,0 +1,129 @@
1
+ module PacerXml
2
+ module XmlRoute
3
+ def help(section = nil)
4
+ case section
5
+ when nil
6
+ puts <<HELP
7
+ This is included via the pacer-xml gem plugin.
8
+
9
+ pacer-xml uses Nokogiri for its xml parsing. Each element in an xml route
10
+ is the first child element of the Nokogiri::XML::Document element. To get at
11
+ the document element, simply call #parent on the element.
12
+
13
+ An xml route can be created, transformed, filtered and otherwise
14
+ processed by all standard Pacer routes. For instance, if a graph element
15
+ has a property with xml data in it, we could process it as follows:
16
+
17
+ g.v.map(element_type: :xml) { |v| Nokogiri(v[:xml]) }
18
+
19
+ Method help sections:
20
+ :xml
21
+ :import
22
+
23
+ HELP
24
+ when :xml
25
+ puts <<HELP
26
+
27
+
28
+
29
+ Turn an xml file into a stream of xml nodes. Scans the xml file
30
+ line-by-line and uses arguments defined in start_section and end_section
31
+ to extract sections from the file.
32
+
33
+ Pacer.xml(file, start_section = nil, end_section = nil)
34
+
35
+ file: String | IO
36
+ String path to an xml file to read
37
+ IO an open resource that responds to #each_line
38
+ start_section: String | Symbol | Regex | Proc (optional)
39
+ String | Symbol name of xml tag to use as the root node of each
40
+ section of xml. The end_section will automatically be
41
+ set to the closing tag. This uses very simple regex
42
+ matching.
43
+ Regex If it matches, start the section from this line
44
+ Proc proc { |line| }
45
+ If it results in a truthy value, starts collecting
46
+ lines for the next section of xml.
47
+ end_section: Regex | Proc (optional)
48
+ Regex If it matches, end the section including this line
49
+ Proc proc { |line, lines| }
50
+ - If it returns a truthy value, the current line is the
51
+ last line in a section.
52
+ - If it returns an Array, the result of joining the
53
+ array is passed to Nokogiri as the next section.
54
+
55
+ HELP
56
+ when :import
57
+ puts <<HELP
58
+ Turn the tree of xml in each node in the stream into graph elements.
59
+
60
+ xml_route.import(graph, opts = {})
61
+
62
+ graph: PacerGraph The graph to load the data into.
63
+ opts: Hash
64
+ :cache false | Hash
65
+ false disable caching
66
+ stats: true enable occasional dump of cache info
67
+ :rename Hash map of { 'old-name' => 'new-name' }
68
+ :html Array set of tag names to treat as containing HTML
69
+ :skip Array set of tag or attribute names to skip
70
+
71
+ Produces a vertex route where each vertex is the root vertex for each xml tree.
72
+
73
+ Look at the source of lib/pacer-xml/sample.rb for a good example.
74
+
75
+ HELP
76
+ else
77
+ super
78
+ end
79
+ description
80
+ end
81
+
82
+ def children
83
+ flat_map(element_type: :xml) { |x| x.children.to_a }
84
+ end
85
+
86
+ def names
87
+ map element_type: :string, &:name
88
+ end
89
+
90
+ def text_nodes
91
+ select &:text?
92
+ end
93
+
94
+ def elements
95
+ select &:element?
96
+ end
97
+
98
+ def fields
99
+ elements.map element_type: :hash, &:fields
100
+ end
101
+
102
+ def import(graph, opts = {})
103
+ if opts[:cache] == false
104
+ builder = BuildGraph.new(graph, opts)
105
+ else
106
+ builder = BuildGraphCached.new(graph, opts)
107
+ end
108
+ graph.vertex_name ||= proc { |v| v[:type] }
109
+ to_route.map(route_name: 'import', graph: graph, element_type: :vertex, modules: [ImportHelp]) do |node|
110
+ graph.transaction do
111
+ builder.build(node)
112
+ end
113
+ end.route
114
+ end
115
+
116
+ module ImportHelp
117
+ def help(section = nil)
118
+ case section
119
+ when nil
120
+ back.help :import
121
+ else
122
+ super
123
+ end
124
+ description
125
+ end
126
+ end
127
+ end
128
+ Pacer::RouteBuilder.current.element_types[:xml] = [XmlRoute]
129
+ end
data/lib/pacer-xml.rb ADDED
@@ -0,0 +1,48 @@
1
+ require_relative 'pacer-xml/version'
2
+ require 'nokogiri'
3
+ require 'pacer'
4
+
5
+ module PacerXml
6
+ class << self
7
+ # Returns the time pacer-xml was last reloaded (or when it was started).
8
+ def reload_time
9
+ if defined? @reload_time
10
+ @reload_time
11
+ else
12
+ START_TIME
13
+ end
14
+ end
15
+
16
+ # Reload all modified Ruby files in the pacer-xml library. Useful for debugging
17
+ # in the console. Does not do any of the fancy stuff that Rails reloading
18
+ # does. Certain types of changes will still require restarting the session.
19
+ def reload!
20
+ require 'pathname'
21
+ Pathname.new(File.expand_path(__FILE__)).parent.find do |path|
22
+ if path.extname == '.rb' and path.mtime > reload_time
23
+ puts path.to_s
24
+ load path.to_s
25
+ end
26
+ end
27
+ @reload_time = Time.now
28
+ end
29
+ end
30
+ end
31
+
32
+ require_relative 'pacer-xml/build_graph'
33
+ require_relative 'pacer-xml/nokogiri_node'
34
+ require_relative 'pacer-xml/xml_route'
35
+ require_relative 'pacer-xml/string_route'
36
+ require_relative 'pacer-xml/sample'
37
+
38
+ module Pacer
39
+ class << self
40
+ def xml(file, enter = nil, leave = nil)
41
+ if file.is_a? String
42
+ file = File.open file
43
+ end
44
+ lines = file.each_line.to_route(element_type: :string, info: 'lines').route
45
+ lines.xml_stream(enter, leave).route
46
+ end
47
+ end
48
+ end
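The `reload!` method above walks the library directory and reloads every `.rb` file whose mtime is newer than the last reload. The file-selection half of that can be tried in isolation. In this sketch `modified_ruby_files` is a hypothetical helper, and the temporary files and their timestamps are set up purely for the demonstration.

```ruby
require 'pathname'
require 'tmpdir'

# Hypothetical helper: list .rb files under dir modified after `since`.
def modified_ruby_files(dir, since)
  found = []
  Pathname.new(dir).find do |path|
    found << path if path.file? && path.extname == '.rb' && path.mtime > since
  end
  found
end

changed = Dir.mktmpdir do |dir|
  old_file = File.join(dir, 'old.rb')
  new_file = File.join(dir, 'new.rb')
  File.write(old_file, '')
  File.write(new_file, '')
  cutoff = Time.now
  File.utime(cutoff - 60, cutoff - 60, old_file) # pretend it is stale
  File.utime(cutoff + 60, cutoff + 60, new_file) # pretend it was just edited
  modified_ruby_files(dir, cutoff).map { |p| p.basename.to_s }
end
```

Only the file with an mtime after the cutoff is selected. A reloader would then call `load` on each selected path and record the new cutoff time.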
data/pacer-xml.gemspec ADDED
@@ -0,0 +1,24 @@
1
+ # -*- encoding: utf-8 -*-
2
+ $:.push File.expand_path("../lib", __FILE__)
3
+ require "pacer-xml/version"
4
+
5
+ Gem::Specification.new do |s|
6
+ s.name = "pacer-xml"
7
+ s.version = PacerXml::VERSION
8
+ s.platform = 'java'
9
+ s.authors = ["Darrick Wiebe"]
10
+ s.email = ["dw@xnlogic.com"]
11
+ s.homepage = "http://xnlogic.com"
12
+ s.summary = %q{XML streaming and graph import for Pacer}
13
+ s.description = s.summary
14
+
15
+ s.add_dependency 'pacer', PacerXml::PACER_VERSION
16
+ s.add_dependency 'pacer-neo4j', ">= 2.1"
17
+ s.add_dependency 'nokogiri'
18
+
19
+ s.rubyforge_project = "pacer-xml"
20
+
21
+ s.files = `git ls-files`.split("\n")
22
+ s.test_files = `git ls-files -- {test,spec,features}/*`.split("\n")
23
+ s.require_paths = ["lib"]
24
+ end
metadata CHANGED
@@ -2,14 +2,14 @@
2
2
  name: pacer-xml
3
3
  version: !ruby/object:Gem::Version
4
4
  prerelease:
5
- version: 0.2.1
5
+ version: 0.2.2
6
6
  platform: java
7
7
  authors:
8
8
  - Darrick Wiebe
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2012-10-27 00:00:00.000000000 Z
12
+ date: 2012-10-31 00:00:00.000000000 Z
13
13
  dependencies:
14
14
  - !ruby/object:Gem::Dependency
15
15
  name: pacer
@@ -67,7 +67,19 @@ email:
67
67
  executables: []
68
68
  extensions: []
69
69
  extra_rdoc_files: []
70
- files: []
70
+ files:
71
+ - .gitignore
72
+ - Gemfile
73
+ - Rakefile
74
+ - Readme.markdown
75
+ - lib/pacer-xml.rb
76
+ - lib/pacer-xml/build_graph.rb
77
+ - lib/pacer-xml/nokogiri_node.rb
78
+ - lib/pacer-xml/sample.rb
79
+ - lib/pacer-xml/string_route.rb
80
+ - lib/pacer-xml/version.rb
81
+ - lib/pacer-xml/xml_route.rb
82
+ - pacer-xml.gemspec
71
83
  homepage: http://xnlogic.com
72
84
  licenses: []
73
85
  post_install_message: