creeker 2.1

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: c4c47c4c2ab4eab11da48520d2fff4aa20ef18ef
4
+ data.tar.gz: 45a5fc20d28267e258d818edd306c5bd8724a071
5
+ SHA512:
6
+ metadata.gz: af24552f0fa299657a56991061c966e13fdbcb45a8398370e4c9e9dc6ef211c9e928b9711360b8d410b1edd3ef98eac25f8c47759a0947362ee2dfb315b3fa5a
7
+ data.tar.gz: e2916b4131c6b2646b0e994ae245b5597fe16c389a24472725deaa78f10d61bda6cd6cbce011484b2fecee212af8383a9974b65e1229240e47bb401dbf51dbf1
data/LICENSE.txt ADDED
@@ -0,0 +1,22 @@
1
+ Copyright (c) 2017 Ramtin Vaziri
2
+
3
+ MIT License
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining
6
+ a copy of this software and associated documentation files (the
7
+ "Software"), to deal in the Software without restriction, including
8
+ without limitation the rights to use, copy, modify, merge, publish,
9
+ distribute, sublicense, and/or sell copies of the Software, and to
10
+ permit persons to whom the Software is furnished to do so, subject to
11
+ the following conditions:
12
+
13
+ The above copyright notice and this permission notice shall be
14
+ included in all copies or substantial portions of the Software.
15
+
16
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
17
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
18
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
19
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
20
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
21
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
22
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,117 @@
1
+ # Creeker - Stream parser for large Excel (xlsx and xlsm) files.
2
+
3
+ Creeker is a Ruby gem that provides a fast, simple and efficient method of parsing large Excel (xlsx and xlsm) files.
4
+
5
+
6
+ ## Installation
7
+
8
+ Creeker can be used from the command line or as part of a Ruby web framework. To install the gem from a terminal, run the following command:
9
+
10
+ ```
11
+ gem install creeker
12
+ ```
13
+
14
+ To use it in Rails, add this line to your Gemfile:
15
+
16
+ ```ruby
17
+ gem 'creeker'
18
+ ```
19
+
20
+ ## Basic Usage
21
+ Creeker parses an Excel file by looping through a sheet's rows enumerator:
22
+
23
+ ```ruby
24
+ require 'creeker'
25
+ creeker = Creeker::Book.new 'spec/fixtures/sample.xlsx'
26
+ sheet = creeker.sheets[0]
27
+
28
+ sheet.rows.each do |row|
29
+ puts row # => {"A1"=>"Content 1", "B1"=>nil, "C1"=>nil, "D1"=>"Content 3"}
30
+ end
31
+
32
+ sheet.rows_with_meta_data.each do |row|
33
+ puts row # => {"collapsed"=>"false", "customFormat"=>"false", "customHeight"=>"true", "hidden"=>"false", "ht"=>"12.1", "outlineLevel"=>"0", "r"=>"1", "cells"=>{"A1"=>"Content 1", "B1"=>nil, "C1"=>nil, "D1"=>"Content 3"}}
34
+ end
35
+
36
+ sheet.state # => 'visible'
37
+ sheet.name # => 'Sheet1'
38
+ sheet.rid # => 'rId2'
39
+ ```
40
+
41
+ ## Filename considerations
42
+ By default, Creeker checks that the file extension is either `.xlsx` or `.xlsm`, but this check can be disabled as needed:
43
+
44
+ ```ruby
45
+ path = 'sample-as-zip.zip'
46
+ Creeker::Book.new path, :check_file_extension => false
47
+ ```
48
+
49
+ By default, the Rails [file_field_tag](http://api.rubyonrails.org/classes/ActionView/Helpers/FormTagHelper.html#method-i-file_field_tag) uploads to a temporary location and stores the original filename with the StringIO object. (See [this section](http://guides.rubyonrails.org/form_helpers.html#uploading-files) of the Rails Guides for more information.)
50
+
51
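+ Creeker can parse this directly, without file upload gems such as Carrierwave or Paperclip, by passing the temporary file's path. The example below disables the extension check; the `original_filename` option can be passed instead so that the check runs against the uploaded file's name: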
52
+
53
+ ```ruby
54
+ # Import endpoint in Rails controller
55
+ def import
56
+ file = params[:file]
57
+ Creeker::Book.new file.path, check_file_extension: false
58
+ end
59
+ ```
60
+
61
+ ## Parsing images
62
+ Creeker does not parse images by default. If you want to parse the images,
63
+ call the `with_images` method before iterating over rows to preload image information. If you don't call this method, Creeker will not return images anywhere.
64
+
65
+ A cell that contains images is returned as an array of Pathname objects.
66
+ If an image is spread across multiple cells, the same Pathname object is returned for each cell.
67
+
68
+ ```ruby
69
+ sheet.with_images.rows.each do |row|
70
+ puts row # => {"A1"=>[#<Pathname:/var/folders/ck/l64nmm3d4k75pvxr03ndk1tm0000gn/T/creeker__drawing20161101-53599-274q0vimage1.jpeg>], "B2"=>"Fluffy"}
71
+ end
72
+ ```
73
+
74
+ Images for a specific cell can be obtained with the `images_at` method:
75
+
76
+ ```ruby
77
+ puts sheet.images_at('A1') # => [#<Pathname:/var/folders/ck/l64nmm3d4k75pvxr03ndk1tm0000gn/T/creeker__drawing20161101-53599-274q0vimage1.jpeg>]
78
+
79
+ # no images in a cell
80
+ puts sheet.images_at('C1') # => nil
81
+ ```
82
+
83
+ Creeker will most likely return nil for a cell with images if the row contains no text cells. In that case, use the `images_at` method to retrieve the images in that cell.
84
+
85
+ ## Remote files
86
+
87
+ ```ruby
88
+ remote_url = 'http://dev-builds.libreoffice.org/tmp/test.xlsx'
89
+ Creeker::Book.new remote_url, remote: true
90
+ ```
91
+
92
+ ## Contributing
93
+
94
+ Contributions are welcome. Fork the repository, add your changes on a branch of your fork, make sure all existing unit tests pass, add unit tests that cover your changes, and then open a pull request.
95
+
96
+ After forking and cloning the repository locally, install Bundler and then use it
97
+ to install the development gem dependencies:
98
+
99
+ ```
100
+ gem install bundler
101
+ bundle install
102
+ ```
103
+
104
+ Once this is complete, you should be able to run the test suite:
105
+
106
+ ```
107
+ rake
108
+ ```
109
+
110
+ ## Bug Reporting
111
+
112
+ Please use the [Issues](https://github.com/huntcode/creeker/issues) page to report bugs or suggest new enhancements.
113
+
114
+
115
+ ## License
116
+
117
+ Creeker is published under the [MIT License](https://github.com/huntcode/creeker/blob/master/LICENSE.txt).
data/Rakefile ADDED
@@ -0,0 +1,7 @@
1
+ require "bundler/gem_tasks"
2
+ require 'rspec/core/rake_task'
3
+
4
+ RSpec::Core::RakeTask.new('spec')
5
+
6
+ # If you want to make this the default task
7
+ task :default => :spec
data/lib/creeker.rb ADDED
@@ -0,0 +1,14 @@
1
+ require 'creeker/version'
2
+ require 'creeker/book'
3
+ require 'creeker/styles/constants'
4
+ require 'creeker/styles/style_types'
5
+ require 'creeker/styles/converter'
6
+ require 'creeker/utils'
7
+ require 'creeker/styles'
8
+ require 'creeker/drawing'
9
+ require 'creeker/sheet'
10
+ require 'creeker/shared_strings'
11
+
12
+ module Creeker
13
+ # Your code goes here...
14
+ end
@@ -0,0 +1,84 @@
1
+ require 'zip/filesystem'
2
+ require 'nokogiri'
3
+ require 'date'
4
+ require 'httparty'
5
+
6
+ module Creeker
7
+
8
+ class Creeker::Book
9
+
10
+ attr_reader :files,
11
+ :sheets,
12
+ :shared_strings
13
+
14
+ DATE_1900 = Date.new(1899, 12, 30).freeze
15
+ DATE_1904 = Date.new(1904, 1, 1).freeze
16
+
17
+ def initialize path, options = {}
18
+ check_file_extension = options.fetch(:check_file_extension, true)
19
+ if check_file_extension
20
+ extension = File.extname(options[:original_filename] || path).downcase
21
+ raise 'Not a valid file format.' unless (['.xlsx', '.xlsm'].include? extension)
22
+ end
23
+ if options[:remote]
24
+ zipfile = Tempfile.new("file")
25
+ zipfile.binmode
26
+ zipfile.write(HTTParty.get(path).body)
27
+ zipfile.close
28
+ path = zipfile.path
29
+ end
30
+
31
+ @files = Zip::File.open(path)
32
+ @shared_strings = SharedStrings.new(self, options[:multi_thread])
33
+ end
34
+
35
+ def sheets
36
+ doc = @files.file.open "xl/workbook.xml"
37
+ xml = Nokogiri::XML::Document.parse doc
38
+ namespaces = xml.namespaces
39
+
40
+ cssPrefix = ''
41
+ namespaces.each do |namespace|
42
+ if namespace[1] == 'http://schemas.openxmlformats.org/spreadsheetml/2006/main' && namespace[0] != 'xmlns' then
43
+ cssPrefix = namespace[0].split(':')[1]+'|'
44
+ end
45
+ end
46
+
47
+ rels_doc = @files.file.open "xl/_rels/workbook.xml.rels"
48
+ rels = Nokogiri::XML::Document.parse(rels_doc).css("Relationship")
49
+ @sheets = xml.css(cssPrefix+'sheet').map do |sheet|
50
+ sheetfile = rels.find { |el| sheet.attr("r:id") == el.attr("Id") }.attr("Target")
51
+ Sheet.new(self, sheet.attr("name"), sheet.attr("sheetid"), sheet.attr("state"), sheet.attr("visible"), sheet.attr("r:id"), sheetfile)
52
+ end
53
+ end
54
+
55
+ def style_types
56
+ @style_types ||= Creeker::Styles.new(self).style_types
57
+ end
58
+
59
+ def close
60
+ @files.close
61
+ end
62
+
63
+ def base_date
64
+ @base_date ||=
65
+ begin
66
+ # Default to 1900 (minus one day due to excel quirk) but use 1904 if
67
+ # it's set in the Workbook's workbookPr
68
+ # http://msdn.microsoft.com/en-us/library/ff530155(v=office.12).aspx
69
+ result = DATE_1900 # default
70
+
71
+ doc = @files.file.open "xl/workbook.xml"
72
+ xml = Nokogiri::XML::Document.parse doc
73
+ xml.css('workbookPr[date1904]').each do |workbookPr|
74
+ if workbookPr['date1904'] =~ /true|1/i
75
+ result = DATE_1904
76
+ break
77
+ end
78
+ end
79
+
80
+ result
81
+ end
82
+ end
83
+ end
84
+ end
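
Note on the extension check in `initialize` above: it can be exercised on its own. The sketch below is a minimal illustration, assuming the same `check_file_extension` and `original_filename` options read in the hunk; the helper name `acceptable_extension?` and the constant `ALLOWED_EXTENSIONS` are hypothetical and not part of the gem.

```ruby
# Hypothetical stand-alone helper mirroring the extension check in
# Book#initialize; not part of the gem.
ALLOWED_EXTENSIONS = ['.xlsx', '.xlsm'].freeze

def acceptable_extension?(path, options = {})
  # The caller can opt out of the check entirely.
  return true unless options.fetch(:check_file_extension, true)

  # Prefer the original upload name over the (possibly temporary) path.
  extension = File.extname(options[:original_filename] || path).downcase
  ALLOWED_EXTENSIONS.include?(extension)
end

puts acceptable_extension?('report.XLSX')                                     # true
puts acceptable_extension?('upload.tmp')                                      # false
puts acceptable_extension?('upload.tmp', original_filename: 'report.xlsm')    # true
puts acceptable_extension?('sample-as-zip.zip', check_file_extension: false)  # true
```

The third call shows why `original_filename` is useful for uploads: the check runs against the name the user supplied while the file is still read from the temporary path.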
@@ -0,0 +1,109 @@
1
+ require 'pathname'
2
+
3
+ module Creeker
4
+ class Creeker::Drawing
5
+ include Creeker::Utils
6
+
7
+ COLUMNS = ('A'..'AZ').to_a
8
+
9
+ def initialize(book, drawing_filepath)
10
+ @book = book
11
+ @drawing_filepath = drawing_filepath
12
+ @drawings = []
13
+ @drawings_rels = []
14
+ @images_pathnames = Hash.new { |hash, key| hash[key] = [] }
15
+
16
+ if file_exist?(@drawing_filepath)
17
+ load_drawings_and_rels
18
+ load_images_pathnames_by_cells if has_images?
19
+ end
20
+ end
21
+
22
+ ##
23
+ # Returns false if there are no images in the drawing file or the drawing file does not exist, true otherwise.
24
+ def has_images?
25
+ @has_images ||= !@drawings.empty?
26
+ end
27
+
28
+ ##
29
+ # Extracts images from excel to tmpdir for a cell, if the images are not already extracted (multiple calls or same image file in multiple cells).
30
+ # Returns array of images as Pathname objects or nil.
31
+ def images_at(cell_name)
32
+ coordinate = calc_coordinate(cell_name)
33
+ pathnames_at_coordinate = @images_pathnames[coordinate]
34
+ return if pathnames_at_coordinate.empty?
35
+
36
+ pathnames_at_coordinate.map do |image_pathname|
37
+ if image_pathname.exist?
38
+ image_pathname
39
+ else
40
+ excel_image_path = "xl/media#{image_pathname.to_path.split(tmpdir).last}"
41
+ IO.copy_stream(@book.files.file.open(excel_image_path), image_pathname.to_path)
42
+ image_pathname
43
+ end
44
+ end
45
+ end
46
+
47
+ private
48
+
49
+ ##
50
+ # Transforms cell name to [row, col], e.g. A1 => [0, 0], B3 => [2, 1]
51
+ # Rows and cols start with 0.
52
+ def calc_coordinate(cell_name)
53
+ col = COLUMNS.index(cell_name.slice /[A-Z]+/)
54
+ row = (cell_name.slice /\d+/).to_i - 1 # rows in drawings start with 0
55
+ [row, col]
56
+ end
57
+
58
+ ##
59
+ # Creates/loads temporary directory for extracting images from excel
60
+ def tmpdir
61
+ @tmpdir ||= ::Dir.mktmpdir('creeker__drawing')
62
+ end
63
+
64
+ ##
65
+ # Parses drawing and drawing's relationships xmls.
66
+ # Drawing xml contains relationships ID's and coordinates (row, col).
67
+ # Drawing relationships xml contains images' locations.
68
+ def load_drawings_and_rels
69
+ @drawings = parse_xml(@drawing_filepath).css('xdr|twoCellAnchor')
70
+ drawing_rels_filepath = expand_to_rels_path(@drawing_filepath)
71
+ @drawings_rels = parse_xml(drawing_rels_filepath).css('Relationships')
72
+ end
73
+
74
+ ##
75
+ # Iterates through the drawings and saves images' paths as Pathname objects to a hash with [row, col] keys.
76
+ # As multiple images can be located in a single cell, hash values are array of Pathname objects.
77
+ # One image can be spread across multiple cells (defined with from-row/to-row/from-col/to-col attributes) - same Pathname object is associated to each row-col combination for the range.
78
+ def load_images_pathnames_by_cells
79
+ image_selector = 'xdr:pic/xdr:blipFill/a:blip'.freeze
80
+ row_from_selector = 'xdr:from/xdr:row'.freeze
81
+ row_to_selector = 'xdr:to/xdr:row'.freeze
82
+ col_from_selector = 'xdr:from/xdr:col'.freeze
83
+ col_to_selector = 'xdr:to/xdr:col'.freeze
84
+
85
+ @drawings.xpath('//xdr:twoCellAnchor').each do |drawing|
86
+ embed = drawing.xpath(image_selector).first.attributes['embed']
87
+ next if embed.nil?
88
+
89
+ rid = embed.value
90
+ path = Pathname.new("#{tmpdir}/#{extract_drawing_path(rid).slice(/[^\/]*$/)}")
91
+
92
+ row_from = drawing.xpath(row_from_selector).text.to_i
93
+ col_from = drawing.xpath(col_from_selector).text.to_i
94
+ row_to = drawing.xpath(row_to_selector).text.to_i
95
+ col_to = drawing.xpath(col_to_selector).text.to_i
96
+
97
+ (col_from..col_to).each do |col|
98
+ (row_from..row_to).each do |row|
99
+ @images_pathnames[[row, col]].push(path)
100
+ end
101
+ end
102
+ end
103
+ end
104
+
105
+ def extract_drawing_path(rid)
106
+ @drawings_rels.css("Relationship[@Id=#{rid}]").first.attributes['Target'].value
107
+ end
108
+ end
109
+ end
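
Note on `calc_coordinate` above: it looks the column letters up in `COLUMNS`, which only covers `A` through `AZ`, so `COLUMNS.index` returns nil for any column past `AZ`. The sketch below shows one way to compute the same `[row, col]` pair arithmetically. The method name `cell_to_coordinate` is hypothetical and this is not how the gem does it.

```ruby
# Hypothetical alternative to calc_coordinate; not part of the gem.
# Converts a cell name such as "B3" to a zero-based [row, col] pair.
def cell_to_coordinate(cell_name)
  letters = cell_name[/[A-Z]+/]
  digits  = cell_name[/\d+/]
  raise ArgumentError, "invalid cell name: #{cell_name.inspect}" if letters.nil? || digits.nil?

  # Column letters form a bijective base-26 number: A=1 ... Z=26, AA=27.
  col = letters.each_char.reduce(0) { |acc, ch| acc * 26 + (ch.ord - 'A'.ord + 1) } - 1
  row = digits.to_i - 1
  [row, col]
end

p cell_to_coordinate('A1')   # => [0, 0]
p cell_to_coordinate('B3')   # => [2, 1]
p cell_to_coordinate('AZ12') # => [11, 51]
p cell_to_coordinate('BA1')  # => [0, 52]
```

For `A` through `AZ` the column values match the indices in `COLUMNS`, so the two approaches agree wherever the lookup table is defined.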
@@ -0,0 +1,78 @@
1
+ require 'zip/filesystem'
2
+ require 'nokogiri'
3
+
4
+ module Creeker
5
+
6
+ class Creeker::SharedStrings
7
+
8
+ attr_reader :book, :dictionary
9
+
10
+ def initialize book, multi_thread = false
11
+ @book = book
12
+ @multi_thread = multi_thread
13
+ parse_shared_shared_strings
14
+ end
15
+
16
+ def parse_shared_shared_strings
17
+ path = "xl/sharedStrings.xml"
18
+ if @book.files.file.exist?(path)
19
+ doc = @book.files.file.open path
20
+ xml = Nokogiri::XML::Document.parse doc
21
+ parse_shared_string_from_document(xml)
22
+ end
23
+ end
24
+
25
+ def parse_shared_string_from_document(xml)
26
+ @dictionary = self.class.parse_shared_string_from_document(xml)
27
+ end
28
+
29
+ def self.parse_shared_string_from_document(xml)
30
+ dictionary = Hash.new
31
+ # (1..10).each do |i|
32
+ # thread = Thread.new do
33
+ # xml.css('si').first(i * 10000).each_with_index do |si, idx|
34
+ # text_nodes = si.css('t')
35
+ # if text_nodes.count == 1 # plain text node
36
+ # dictionary[idx] = text_nodes.first.content
37
+ # else # rich text nodes with text fragments
38
+ # dictionary[idx] = text_nodes.map(&:content).join('')
39
+ # end
40
+ # end
41
+ # end
42
+
43
+ # sleep 1*i
44
+ # GC.start
45
+ # end
46
+
47
+ # Creeker::Book.new(Upload.last.document.path)
48
+ if @multi_thread
49
+ xml.css('si').to_a.in_groups_of(10000).each do |group|
50
+ thread = Thread.new do
51
+ group.each_with_index do |si, idx|
52
+ text_nodes = si.css('t')
53
+ if text_nodes.count == 1 # plain text node
54
+ dictionary[idx] = text_nodes.first.content
55
+ else # rich text nodes with text fragments
56
+ dictionary[idx] = text_nodes.map(&:content).join('')
57
+ end
58
+ end
59
+ end
60
+
61
+ # sleep 5
62
+ GC.start
63
+ end
64
+ else
65
+ xml.css('si').each_with_index do |si, idx|
66
+ text_nodes = si.css('t')
67
+ if text_nodes.count == 1 # plain text node
68
+ dictionary[idx] = text_nodes.first.content
69
+ else # rich text nodes with text fragments
70
+ dictionary[idx] = text_nodes.map(&:content).join('')
71
+ end
72
+ end
73
+ end
74
+ dictionary
75
+ end
76
+
77
+ end
78
+ end
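
Note on the threaded branch above: `each_with_index` restarts at 0 for every group, the threads are never joined before `dictionary` is returned, and `in_groups_of` comes from ActiveSupport rather than the Ruby standard library. The sketch below shows one possible way to build the dictionary in groups while keeping indices global. It uses plain strings as stand-ins for the text of each `si` node, and the method name `build_dictionary` and its `group_size` parameter are hypothetical.

```ruby
# Hypothetical sketch, not the gem's implementation. `strings` stands in
# for the text content of each <si> node, in document order.
def build_dictionary(strings, group_size: 10_000)
  threads = strings.each_slice(group_size).each_with_index.map do |group, group_idx|
    Thread.new do
      offset = group_idx * group_size
      # Each thread builds its own partial hash, so no shared state is mutated.
      group.each_with_index.map { |text, idx| [offset + idx, text] }.to_h
    end
  end

  # Thread#value waits for the thread and returns the block's result.
  threads.map(&:value).reduce({}, :merge)
end

strings = %w[alpha beta gamma delta epsilon]
dictionary = build_dictionary(strings, group_size: 2)
puts dictionary.size # 5
puts dictionary[3]   # delta
```

On MRI the global VM lock means threads like these are unlikely to speed up CPU-bound parsing, so the sketch is about correctness of the indices, not about performance.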
@@ -0,0 +1,164 @@
1
+ require 'zip/filesystem'
2
+ require 'nokogiri'
3
+
4
+ module Creeker
5
+ class Creeker::Sheet
6
+ include Creeker::Utils
7
+
8
+ attr_reader :book,
9
+ :name,
10
+ :sheetid,
11
+ :state,
12
+ :visible,
13
+ :rid,
14
+ :index
15
+
16
+
17
+ def initialize book, name, sheetid, state, visible, rid, sheetfile
18
+ @book = book
19
+ @name = name
20
+ @sheetid = sheetid
21
+ @visible = visible
22
+ @rid = rid
23
+ @state = state
24
+ @sheetfile = sheetfile
25
+ @images_present = false
26
+ end
27
+
28
+ ##
29
+ # Preloads images info (coordinates and paths) from related drawing.xml and drawing rels.
30
+ # Must be called before #rows method if you want to have images included.
31
+ # Returns self so you can chain the calls (sheet.with_images.rows).
32
+ def with_images
33
+ @drawingfile = extract_drawing_filepath
34
+ if @drawingfile
35
+ @drawing = Creeker::Drawing.new(@book, @drawingfile.sub('..', 'xl'))
36
+ @images_present = @drawing.has_images?
37
+ end
38
+ self
39
+ end
40
+
41
+ ##
42
+ # Extracts images for a cell to a temporary folder.
43
+ # Returns array of Pathnames for the cell.
44
+ # Returns nil if images are not found for the cell or images were not preloaded with #with_images.
45
+ def images_at(cell)
46
+ @drawing.images_at(cell) if @images_present
47
+ end
48
+
49
+ ##
50
+ # Provides an Enumerator that returns a hash representing each row.
51
+ # The key of the hash is the Cell id and the value is the value of the cell.
52
+ def rows
53
+ rows_generator
54
+ end
55
+
56
+ ##
57
+ # Provides an Enumerator that returns a hash representing each row.
58
+ # The hash contains meta data of the row and an embedded 'cells' hash which contains the cell contents.
59
+ def rows_with_meta_data
60
+ rows_generator true
61
+ end
62
+
63
+ private
64
+
65
+ ##
66
+ # Returns a hash per row that includes the cell ids and values.
67
+ # Empty cells will be also included in the hash with a nil value.
68
+ def rows_generator include_meta_data=false
69
+ path = if @sheetfile.start_with? "/xl/" or @sheetfile.start_with? "xl/" then @sheetfile else "xl/#{@sheetfile}" end
70
+ if @book.files.file.exist?(path)
71
+ # Streaming parse with Nokogiri::XML::Reader. Each element in the stream comes through as two events:
72
+ # one to open the element and one to close it.
73
+ opener = Nokogiri::XML::Reader::TYPE_ELEMENT
74
+ closer = Nokogiri::XML::Reader::TYPE_END_ELEMENT
75
+ Enumerator.new do |y|
76
+ row, cells, cell = nil, {}, nil
77
+ cell_type = nil
78
+ cell_style_idx = nil
79
+ @book.files.file.open(path) do |xml|
80
+ Nokogiri::XML::Reader.from_io(xml).each do |node|
81
+ if (node.name.eql? 'row') and (node.node_type.eql? opener)
82
+ row = node.attributes
83
+ row['cells'] = Hash.new
84
+ cells = Hash.new
85
+ y << (include_meta_data ? row : cells) if node.self_closing?
86
+ elsif (node.name.eql? 'row') and (node.node_type.eql? closer)
87
+ processed_cells = fill_in_empty_cells(cells, row['r'], cell)
88
+
89
+ if @images_present
90
+ processed_cells.each do |cell_name, cell_value|
91
+ next unless cell_value.nil?
92
+ processed_cells[cell_name] = images_at(cell_name)
93
+ end
94
+ end
95
+
96
+ row['cells'] = processed_cells
97
+ y << (include_meta_data ? row : processed_cells)
98
+ elsif (node.name.eql? 'c') and (node.node_type.eql? opener)
99
+ cell_type = node.attributes['t']
100
+ cell_style_idx = node.attributes['s']
101
+ cell = node.attributes['r']
102
+ elsif (node.name.eql? 'v') and (node.node_type.eql? opener)
103
+ unless cell.nil?
104
+ cells[cell] = convert(node.inner_xml, cell_type, cell_style_idx)
105
+ end
106
+ elsif (node.name.eql? 't') and (node.node_type.eql? opener)
107
+ unless cell.nil?
108
+ cells[cell] = convert(node.inner_xml, cell_type, cell_style_idx)
109
+ end
110
+ end
111
+ end
112
+ end
113
+ end
114
+ end
115
+ end
116
+
117
+ def convert(value, type, style_idx)
118
+ style = @book.style_types[style_idx.to_i]
119
+ Creeker::Styles::Converter.call(value, type, style, converter_options)
120
+ end
121
+
122
+ def converter_options
123
+ @converter_options ||= {
124
+ shared_strings: @book.shared_strings.dictionary,
125
+ base_date: @book.base_date
126
+ }
127
+ end
128
+
129
+ ##
130
+ # The unzipped XML file does not contain any node for empty cells.
131
+ # Empty cells are padded in by this method.
132
+ def fill_in_empty_cells(cells, row_number, last_col)
133
+ new_cells = Hash.new
134
+
135
+ unless cells.empty?
136
+ last_col = last_col.gsub(row_number, '')
137
+
138
+ ("A"..last_col).to_a.each do |column|
139
+ id = "#{column}#{row_number}"
140
+ new_cells[id] = cells[id]
141
+ end
142
+ end
143
+
144
+ new_cells
145
+ end
146
+
147
+ ##
148
+ # Find drawing filepath for the current sheet.
149
+ # Sheet xml contains drawing relationship ID.
150
+ # Sheet relationships xml contains drawing file's location.
151
+ def extract_drawing_filepath
152
+ # Read drawing relationship ID from the sheet.
153
+ sheet_filepath = "xl/#{@sheetfile}"
154
+ drawing = parse_xml(sheet_filepath).css('drawing').first
155
+ return if drawing.nil?
156
+
157
+ drawing_rid = drawing.attributes['id'].value
158
+
159
+ # Read sheet rels to find drawing file's location.
160
+ sheet_rels_filepath = expand_to_rels_path(sheet_filepath)
161
+ parse_xml(sheet_rels_filepath).css("Relationship[@Id='#{drawing_rid}']").first.attributes['Target'].value
162
+ end
163
+ end
164
+ end
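
Note on `fill_in_empty_cells` above: the padding step can be illustrated without a workbook. The sketch below is a minimal stand-alone version under the assumption that cell ids look like `D1` (column letters followed by the row number); the name `pad_row` is hypothetical, and it strips the trailing digits with a regular expression where the gem uses `gsub(row_number, '')`.

```ruby
# Hypothetical stand-alone version of the padding step; not part of the gem.
# cells:      hash of cell id => value for the cells present in the XML
# row_number: the row's "r" attribute, as a string
# last_cell:  id of the last cell seen in the row, e.g. "D1"
def pad_row(cells, row_number, last_cell)
  return {} if cells.empty?

  last_col = last_cell.sub(/\d+\z/, '') # "D1" -> "D"
  ('A'..last_col).each_with_object({}) do |column, padded|
    id = "#{column}#{row_number}"
    padded[id] = cells[id] # nil for cells that were absent from the XML
  end
end

row = pad_row({ 'A1' => 'Content 1', 'D1' => 'Content 3' }, '1', 'D1')
puts row.keys.join(', ') # A1, B1, C1, D1
puts row['B1'].inspect   # nil
```

Because the range stops at the last cell that was present in the row, columns to the right of it are never padded.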
@@ -0,0 +1,27 @@
1
+ module Creeker
2
+ class Styles
3
+ attr_accessor :book
4
+ def initialize(book)
5
+ @book = book
6
+ end
7
+
8
+ def path
9
+ "xl/styles.xml"
10
+ end
11
+
12
+ def styles_xml
13
+ @styles_xml ||= begin
14
+ if @book.files.file.exist?(path)
15
+ doc = @book.files.file.open path
16
+ Nokogiri::XML::Document.parse doc
17
+ end
18
+ end
19
+ end
20
+
21
+ def style_types
22
+ @style_types ||= begin
23
+ Creeker::Styles::StyleTypes.new(styles_xml).call
24
+ end
25
+ end
26
+ end
27
+ end
@@ -0,0 +1,41 @@
1
+ module Creeker
2
+ class Styles
3
+ module Constants
4
+ # Map of non-custom numFmtId to casting symbol
5
+ NumFmtMap = {
6
+ 0 => :string, # General
7
+ 1 => :fixnum, # 0
8
+ 2 => :float, # 0.00
9
+ 3 => :fixnum, # #,##0
10
+ 4 => :float, # #,##0.00
11
+ 5 => :unsupported, # $#,##0_);($#,##0)
12
+ 6 => :unsupported, # $#,##0_);[Red]($#,##0)
13
+ 7 => :unsupported, # $#,##0.00_);($#,##0.00)
14
+ 8 => :unsupported, # $#,##0.00_);[Red]($#,##0.00)
15
+ 9 => :percentage, # 0%
16
+ 10 => :percentage, # 0.00%
17
+ 11 => :bignum, # 0.00E+00
18
+ 12 => :unsupported, # # ?/?
19
+ 13 => :unsupported, # # ??/??
20
+ 14 => :date, # mm-dd-yy
21
+ 15 => :date, # d-mmm-yy
22
+ 16 => :date, # d-mmm
23
+ 17 => :date, # mmm-yy
24
+ 18 => :time, # h:mm AM/PM
25
+ 19 => :time, # h:mm:ss AM/PM
26
+ 20 => :time, # h:mm
27
+ 21 => :time, # h:mm:ss
28
+ 22 => :date_time, # m/d/yy h:mm
29
+ 37 => :unsupported, # #,##0 ;(#,##0)
30
+ 38 => :unsupported, # #,##0 ;[Red](#,##0)
31
+ 39 => :unsupported, # #,##0.00;(#,##0.00)
32
+ 40 => :unsupported, # #,##0.00;[Red](#,##0.00)
33
+ 45 => :time, # mm:ss
34
+ 46 => :time, # [h]:mm:ss
35
+ 47 => :time, # mmss.0
36
+ 48 => :bignum, # ##0.0E+0
37
+ 49 => :unsupported # @
38
+ }
39
+ end
40
+ end
41
+ end
@@ -0,0 +1,101 @@
1
+ require 'set'
2
+
3
+ module Creeker
4
+ class Styles
5
+ class Converter
6
+ include Creeker::Styles::Constants
7
+ ##
8
+ # The heart of typecasting. The ruby type is determined either explicitly
9
+ # from the cell xml or implicitly from the cell style, and this
10
+ # method expects that work to have been done already. This, then,
11
+ # takes the type we determined it to be and casts the cell value
12
+ # to that type.
13
+ #
14
+ # types:
15
+ # - s: shared string (see #shared_string)
16
+ # - n: number (cast to a float)
17
+ # - b: boolean
18
+ # - str: string
19
+ # - inlineStr: string
20
+ # - ruby symbol: for when type has been determined by style
21
+ #
22
+ # options:
23
+ # - shared_strings: needed for 's' (shared string) type
24
+ # - base_date: from what date to begin, see method #base_date
25
+
26
+ DATE_TYPES = [:date, :time, :date_time].to_set
27
+ def self.call(value, type, style, options = {})
28
+ return nil if value.nil? || value.empty?
29
+
30
+ # Sometimes the type is dictated by the style alone
31
+ if type.nil? || (type == 'n' && DATE_TYPES.include?(style))
32
+ type = style
33
+ end
34
+
35
+ case type
36
+
37
+ ##
38
+ # There are a few built-in types
39
+ ##
40
+
41
+ when 's' # shared string
42
+ options[:shared_strings][value.to_i]
43
+ when 'n' # number
44
+ value.to_f
45
+ when 'b'
46
+ value.to_i == 1
47
+ when 'str'
48
+ value
49
+ when 'inlineStr'
50
+ value
51
+
52
+ ##
53
+ # Type can also be determined by a style,
54
+ # detected earlier and cast here by its standardized symbol
55
+ ##
56
+
57
+ when :string, :unsupported
58
+ value
59
+ when :fixnum
60
+ value.to_i
61
+ when :float, :percentage
62
+ value.to_f
63
+ when :date, :time, :date_time
64
+ convert_date(value, options)
65
+ when :bignum
66
+ convert_bignum(value)
67
+
68
+ ## Nothing matched
69
+ else
70
+ value
71
+ end
72
+ end
73
+
74
+ # The trickiest conversion. Note that all these formats can vary on
75
+ # whether they actually contain a date, time, or datetime.
76
+ def self.convert_date(value, options)
77
+ value = value.to_f
78
+ days_since_date_system_start = value.to_i
79
+ fraction_of_24 = value - days_since_date_system_start
80
+
81
+ # http://stackoverflow.com/questions/10559767/how-to-convert-ms-excel-date-from-float-to-date-format-in-ruby
82
+ date = options.fetch(:base_date, Date.new(1899, 12, 30)) + days_since_date_system_start
83
+
84
+ if fraction_of_24 > 0 # there is a time associated
85
+ seconds = (fraction_of_24 * 86400).round
86
+ return Time.utc(date.year, date.month, date.day) + seconds
87
+ else
88
+ return date
89
+ end
90
+ end
91
+
92
+ def self.convert_bignum(value)
93
+ if defined?(BigDecimal)
94
+ BigDecimal(value)
95
+ else
96
+ value.to_f
97
+ end
98
+ end
99
+ end
100
+ end
101
+ end
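
Note on `convert_date` above: the serial-number arithmetic can be checked in isolation. The sketch below is a minimal stand-alone version that assumes the same two base dates defined in the workbook class (`1899-12-30` for the 1900 system and `1904-01-01` for the 1904 system); the method and constant names are hypothetical.

```ruby
require 'date'

# Hypothetical stand-alone version of the date conversion; not part of the gem.
BASE_1900 = Date.new(1899, 12, 30)
BASE_1904 = Date.new(1904, 1, 1)

def serial_to_date(serial, base_date = BASE_1900)
  serial   = serial.to_f
  days     = serial.to_i      # whole days since the base date
  fraction = serial - days    # fraction of a 24-hour day
  date     = base_date + days

  # A whole-number serial is a plain date; otherwise attach the time in UTC.
  return date unless fraction > 0

  Time.utc(date.year, date.month, date.day) + (fraction * 86_400).round
end

puts serial_to_date('43084')        # 2017-12-15
puts serial_to_date('43084.5')      # 2017-12-15 12:00:00 UTC
puts serial_to_date('0', BASE_1904) # 1904-01-01
```

The return type depends on the value: a `Date` when there is no fractional part and a `Time` otherwise, which matches the behaviour of the method in the hunk.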
@@ -0,0 +1,85 @@
1
+ # https://github.com/hmcgowan/roo/blob/master/lib/roo/excelx.rb
2
+ # https://github.com/woahdae/simple_xlsx_reader/blob/master/lib/simple_xlsx_reader.rb#L231
3
+ module Creeker
4
+ class Styles
5
+ class StyleTypes
6
+ include Creeker::Styles::Constants
7
+ attr_accessor :styles_xml_doc
8
+ def initialize(styles_xml_doc)
9
+ @styles_xml_doc = styles_xml_doc
10
+ end
11
+
12
+ # Excel doesn't record types for some cells, only its display style, so
13
+ # we have to back out the type from that style.
14
+ #
15
+ # Some of these styles can be determined from a known set (see NumFmtMap),
16
+ # while others are 'custom' and we have to make a best guess.
17
+ #
18
+ # This is the array of types corresponding to the styles a spreadsheet
19
+ # uses, and includes both the known style types and the custom styles.
20
+ #
21
+ # Note that the xml sheet cells that use this don't reference the
22
+ # numFmtId, but instead the array index of a style in the stored list of
23
+ # only the styles used in the spreadsheet (which can be either known or
24
+ # custom). Hence this style types array, rather than a map of numFmtId to
25
+ # type.
26
+ def call
27
+ @style_types ||= begin
28
+ styles_xml_doc.css('styleSheet cellXfs xf').map do |xstyle|
29
+ a = num_fmt_id(xstyle)
30
+ style_type_by_num_fmt_id( a )
31
+ end
32
+ end
33
+ end
34
+
35
+ # Returns the numFmtId value if it's available
36
+ def num_fmt_id(xstyle)
37
+ return nil unless xstyle.attributes['numFmtId']
38
+ xstyle.attributes['numFmtId'].value
39
+ end
40
+
41
+ # Finds the type we think a style is; for example, fmtId 14 is a date
42
+ # style, so this would return :date.
43
+ #
44
+ # Note, custom styles usually (are supposed to?) have a numFmtId >= 164,
45
+ # but in practice can sometimes be simply out of the usual "Any Language"
46
+ # id range that goes up to 49. For example, I have seen a numFmtId of
47
+ # 59 specified as a date. In Thai, 59 is a number format, so this seems
48
+ # like a bad idea, but we try to be flexible and just go with it.
49
+ def style_type_by_num_fmt_id(id)
50
+ return nil unless id
51
+ id = id.to_i
52
+ NumFmtMap[id] || custom_style_types[id]
53
+ end
54
+
55
+ # Map of (numFmtId >= 164) (custom styles) to our best guess at the type
56
+ # ex. {164 => :date_time}
57
+ def custom_style_types
58
+ @custom_style_types ||= begin
59
+ styles_xml_doc.css('styleSheet numFmts numFmt').inject({}) do |acc, xstyle|
60
+ index = xstyle.attributes['numFmtId'].value.to_i
61
+ value = xstyle.attributes['formatCode'].value
62
+ acc[index] = determine_custom_style_type(value)
63
+ acc
64
+ end
65
+ end
66
+ end
67
+
68
+ # This is the least deterministic part of reading xlsx files. Due to
69
+ # custom styles, you can't know for sure when a date is a date other than
70
+ # looking at its format and guessing. It's not impossible to guess right,
71
+ # though.
72
+ #
73
+ # http://stackoverflow.com/questions/4948998/determining-if-an-xlsx-cell-is-date-formatted-for-excel-2007-spreadsheets
74
+ def determine_custom_style_type(string)
75
+ return :float if string[0] == '_'
76
+ return :float if string[0] == ' 0'
77
+
78
+ # Looks for one of ymdhis outside of meta-stuff like [Red]
79
+ return :date_time if string =~ /(^|\])[^\[]*[ymdhis]/i
80
+
81
+ return :unsupported
82
+ end
83
+ end
84
+ end
85
+ end
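
Note on `determine_custom_style_type` above: the date heuristic is a single regular expression, so its behaviour is easy to probe. The sketch below reuses that expression in a hypothetical `guess_type` method (the leading-underscore rule is kept, the `' 0'` comparison is left out) and runs it on a few format codes chosen for illustration.

```ruby
# Hypothetical probe of the heuristic; not part of the gem.
# Matches one of y, m, d, h, i, s outside bracketed sections like [Red].
DATE_LIKE = /(^|\])[^\[]*[ymdhis]/i

def guess_type(format_code)
  return :float if format_code.start_with?('_')
  return :date_time if format_code =~ DATE_LIKE

  :unsupported
end

puts guess_type('yyyy-mm-dd')       # date_time
puts guess_type('[$-409]d-mmm-yy')  # date_time
puts guess_type('[Red]0.00')        # unsupported
puts guess_type('_(* #,##0.00_)')   # float
puts guess_type('"USD" 0.00')       # date_time (a false positive: quoted text contains S and D)
```

The last case shows why the source comments call this a best guess: literal text in a number format can contain the same letters as date tokens.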
@@ -0,0 +1,16 @@
1
+ module Creeker
2
+ module Utils
3
+ def expand_to_rels_path(filepath)
4
+ filepath.sub(/(\/[^\/]+$)/, '/_rels\1.rels')
5
+ end
6
+
7
+ def file_exist?(path)
8
+ @book.files.file.exist?(path)
9
+ end
10
+
11
+ def parse_xml(xml_path)
12
+ doc = @book.files.file.open(xml_path)
13
+ Nokogiri::XML::Document.parse(doc)
14
+ end
15
+ end
16
+ end
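
Note on `expand_to_rels_path` above: the substitution inserts a `_rels` directory before the last path segment and appends `.rels`. The sketch below copies the method as a top-level function and applies it to two illustrative paths; the paths themselves are assumptions, not taken from a real workbook.

```ruby
# Same substitution as in the hunk above, defined at top level for illustration.
def expand_to_rels_path(filepath)
  filepath.sub(/(\/[^\/]+$)/, '/_rels\1.rels')
end

puts expand_to_rels_path('xl/worksheets/sheet1.xml')
# xl/worksheets/_rels/sheet1.xml.rels

# A relative drawing target is first rewritten the way with_images does it.
drawing_path = '../drawings/drawing1.xml'.sub('..', 'xl')
puts drawing_path                      # xl/drawings/drawing1.xml
puts expand_to_rels_path(drawing_path) # xl/drawings/_rels/drawing1.xml.rels
```

A path without any `/` is returned unchanged, because the pattern needs a slash before the final segment.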
@@ -0,0 +1,3 @@
1
+ module Creeker
2
+ VERSION = "2.1"
3
+ end
metadata ADDED
@@ -0,0 +1,157 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: creeker
3
+ version: !ruby/object:Gem::Version
4
+ version: '2.1'
5
+ platform: ruby
6
+ authors:
7
+ - huntcode
8
+ autorequire:
9
+ bindir: bin
10
+ cert_chain: []
11
+ date: 2017-12-15 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: bundler
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - "~>"
18
+ - !ruby/object:Gem::Version
19
+ version: '1.3'
20
+ type: :development
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - "~>"
25
+ - !ruby/object:Gem::Version
26
+ version: '1.3'
27
+ - !ruby/object:Gem::Dependency
28
+ name: rake
29
+ requirement: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - ">="
32
+ - !ruby/object:Gem::Version
33
+ version: '0'
34
+ type: :development
35
+ prerelease: false
36
+ version_requirements: !ruby/object:Gem::Requirement
37
+ requirements:
38
+ - - ">="
39
+ - !ruby/object:Gem::Version
40
+ version: '0'
41
+ - !ruby/object:Gem::Dependency
42
+ name: rspec
43
+ requirement: !ruby/object:Gem::Requirement
44
+ requirements:
45
+ - - "~>"
46
+ - !ruby/object:Gem::Version
47
+ version: 3.6.0
48
+ type: :development
49
+ prerelease: false
50
+ version_requirements: !ruby/object:Gem::Requirement
51
+ requirements:
52
+ - - "~>"
53
+ - !ruby/object:Gem::Version
54
+ version: 3.6.0
55
+ - !ruby/object:Gem::Dependency
56
+ name: pry
57
+ requirement: !ruby/object:Gem::Requirement
58
+ requirements:
59
+ - - ">="
60
+ - !ruby/object:Gem::Version
61
+ version: '0'
62
+ type: :development
63
+ prerelease: false
64
+ version_requirements: !ruby/object:Gem::Requirement
65
+ requirements:
66
+ - - ">="
67
+ - !ruby/object:Gem::Version
68
+ version: '0'
69
+ - !ruby/object:Gem::Dependency
70
+ name: nokogiri
71
+ requirement: !ruby/object:Gem::Requirement
72
+ requirements:
73
+ - - ">="
74
+ - !ruby/object:Gem::Version
75
+ version: 1.7.0
76
+ type: :runtime
77
+ prerelease: false
78
+ version_requirements: !ruby/object:Gem::Requirement
79
+ requirements:
80
+ - - ">="
81
+ - !ruby/object:Gem::Version
82
+ version: 1.7.0
83
+ - !ruby/object:Gem::Dependency
84
+ name: rubyzip
85
+ requirement: !ruby/object:Gem::Requirement
86
+ requirements:
87
+ - - ">="
88
+ - !ruby/object:Gem::Version
89
+ version: 1.0.0
90
+ type: :runtime
91
+ prerelease: false
92
+ version_requirements: !ruby/object:Gem::Requirement
93
+ requirements:
94
+ - - ">="
95
+ - !ruby/object:Gem::Version
96
+ version: 1.0.0
97
+ - !ruby/object:Gem::Dependency
98
+ name: httparty
99
+ requirement: !ruby/object:Gem::Requirement
100
+ requirements:
101
+ - - "~>"
102
+ - !ruby/object:Gem::Version
103
+ version: 0.15.5
104
+ type: :runtime
105
+ prerelease: false
106
+ version_requirements: !ruby/object:Gem::Requirement
107
+ requirements:
108
+ - - "~>"
109
+ - !ruby/object:Gem::Version
110
+ version: 0.15.5
111
+ description: A Ruby gem that streams and parses large Excel(xlsx and xlsm) files fast
112
+ and efficiently. Based on Creek gem, but with Multi-threading and Garbage Collection
113
+ email:
114
+ - vivektripathi_cse@hotmail.com
115
+ executables: []
116
+ extensions: []
117
+ extra_rdoc_files: []
118
+ files:
119
+ - LICENSE.txt
120
+ - README.md
121
+ - Rakefile
122
+ - lib/creeker.rb
123
+ - lib/creeker/book.rb
124
+ - lib/creeker/drawing.rb
125
+ - lib/creeker/shared_strings.rb
126
+ - lib/creeker/sheet.rb
127
+ - lib/creeker/styles.rb
128
+ - lib/creeker/styles/constants.rb
129
+ - lib/creeker/styles/converter.rb
130
+ - lib/creeker/styles/style_types.rb
131
+ - lib/creeker/utils.rb
132
+ - lib/creeker/version.rb
133
+ homepage: https://github.com/huntcode/creeker
134
+ licenses:
135
+ - MIT
136
+ metadata: {}
137
+ post_install_message:
138
+ rdoc_options: []
139
+ require_paths:
140
+ - lib
141
+ required_ruby_version: !ruby/object:Gem::Requirement
142
+ requirements:
143
+ - - ">="
144
+ - !ruby/object:Gem::Version
145
+ version: 2.0.0
146
+ required_rubygems_version: !ruby/object:Gem::Requirement
147
+ requirements:
148
+ - - ">="
149
+ - !ruby/object:Gem::Version
150
+ version: '0'
151
+ requirements: []
152
+ rubyforge_project:
153
+ rubygems_version: 2.6.14
154
+ signing_key:
155
+ specification_version: 4
156
+ summary: A Ruby gem for parsing large Excel(xlsx and xlsm) files.
157
+ test_files: []