LinkGrouper 1.0.1-universal-linux

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA1:
+   metadata.gz: df01bc2b149a8f1e15377f457b633157c444acd0
+   data.tar.gz: cad193cef4ee4aaa27969a34c1bc3c83c3fec7fb
+ SHA512:
+   metadata.gz: 4e96a95a5377c5dc25ec77a76b6c5e7216ecfac290109f79d4af3e056aaadf892f8b6154bf5a967246c6adf006a9c925d3d400e2e7340eb82548efb67d8552e1
+   data.tar.gz: 44d584956cf46b0e8f9cc78d45a86d1715fd2bf4b84b2db55908b1f0dbf0cf108e37496437bc696cedcd4a686abdc77a65d6d79d14cbbd5716997aaa52e989af
data/README ADDED
@@ -0,0 +1,39 @@
+ Given a list of links (as a text file), this module will group similar links (or find unique links) and write them to separate files.
+ e.g. URL list --
+ /level1/uid8
+ /level1/level2/gid6
+ /status
+ /level1/uid10
+ /level1/level2/gid0
+ /xyz
+ /level1
+ /level1
+ /level1
+ /level1/level2
+
+ e.g. output (given a groupsize of 2)
+ Contents of file 1 (URLs which could not be grouped) --
+ /status
+ /xyz
+ Contents of file 2 --
+ /level1/uid8
+ /level1/uid10
+ Contents of file 3 --
+ /level1/level2/gid0
+ /level1/level2/gid6
+ Contents of file 4 --
+ /level1/level2
+ Contents of file 5 --
+ /level1
+ /level1
+ /level1
+
+ Use the LinkGrouper.uniqcounter method as the entry point.
+ 1st arg -- The directory in which the output files will be created; each file represents a group.
+ 2nd arg -- A file which contains a \n separated list of absolute URLs
+ 3rd arg -- The 'groupsize'. An identified category/class/group of URLs must contain at least this many URLs to be considered a category/class/group.
+
+ LinkGrouper_app.rb is an app which uses the module.
+ # 1st arg -- Path of the directory (plist) to write into. It will contain files each of which represents a class/category/group of log entries. Each file is named with a random UUID.
+ # 2nd arg -- A file which contains a \n separated list of absolute URLs
+ # 3rd arg -- The 'groupsize'. An identified category/class/group of URLs must contain at least this many URLs to be considered a category/class/group.
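Editorial sketch (not part of the gem): the three arguments described in the README could be prepared roughly as follows. The constant SAMPLE_LINKS, the helper absolute_links? and the file names are assumptions made for illustration. The call to LinkGrouper.uniqcounter is left as a comment because it needs the installed gem and the sort command.

```ruby
require "tmpdir"

# The ten links from the README example, in the order listed there.
SAMPLE_LINKS = %w[
  /level1/uid8 /level1/level2/gid6 /status /level1/uid10 /level1/level2/gid0
  /xyz /level1 /level1 /level1 /level1/level2
].freeze

# Hypothetical helper: the input must contain only absolute links,
# i.e. every line has to start with "/".
def absolute_links?(path)
  File.foreach(path).all? { |line| line.start_with?("/") }
end

workdir = Dir.mktmpdir                         # any scratch directory would do
list_file = File.join(workdir, "links.txt")    # 2nd arg: newline separated list
out_dir = File.join(workdir, "groups")         # 1st arg: must already exist
Dir.mkdir(out_dir)
File.write(list_file, SAMPLE_LINKS.join("\n") + "\n")
input_ok = absolute_links?(list_file)

# With the gem installed, the call described above would then be:
#   require "LinkGrouper.rb"
#   status = LinkGrouper.uniqcounter(out_dir, list_file, 2)
```

Checking the input first is only a convenience; the library performs its own validation and reports a status code when a line is not absolute.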
data/bin/LinkGrouper_app.rb ADDED
@@ -0,0 +1,7 @@
+ #! /usr/bin/ruby
+ # An app using the module.
+ # 1st arg -- Path of the directory (plist) to write into. It will contain files each of which represents a class/category/group of log entries. Each file is named with a random UUID.
+ # 2nd arg -- A file which contains a \n separated list of absolute URLs
+ # 3rd arg -- The 'groupsize'. An identified category/class/group of URLs must contain at least this many URLs to be considered a category/class/group.
+ require "LinkGrouper.rb"
+ LinkGrouper.uniqcounter(ARGV[0], ARGV[1], ARGV[2].to_i)
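Editorial sketch (not part of the gem): the app above discards the value returned by uniqcounter, which the library source documents as a one-entry hash mapping a status code to a human readable message. A caller could unpack it along these lines. The helper name report_status is an assumption, and the two sample hashes are copied from return statements in lib/LinkGrouper.rb rather than produced by a real run.

```ruby
# Hypothetical helper: unpack the single-entry status hash returned by
# LinkGrouper.uniqcounter, e.g. {0 => "All done."}, and report it.
def report_status(result)
  code, message = result.first
  # Status 0 is the only success code in the library source.
  stream = code.zero? ? $stdout : $stderr
  stream.puts("LinkGrouper: #{message} (status #{code})")
  code
end

# Stand-in values; a real script would pass the result of
# LinkGrouper.uniqcounter(ARGV[0], ARGV[1], ARGV[2].to_i) instead.
ok = report_status({ 0 => "All done." })
failed = report_status({ 4 => "The write directory does not exist." })
```

Returning the numeric code makes it easy to hand it to exit in a command line script.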
data/lib/LinkGrouper.rb ADDED
@@ -0,0 +1,308 @@
+ #! /usr/bin/ruby
+ # Licence: GPLv3
+ module LinkGrouper
+ require "fileutils.rb"
+ # Takes as argument a file which contains a \n separated list of URLs
+ # uniqcounter is the entry point function.
+ # Will write into a directory (plist) whose name is supplied as an argument. This contains files each of which represents a class/category/group of log entries.
+ # takes in as argument the groupsize
+ # e.g. URL list --
+ # /level1/uid8
+ # /level1/level2/gid6
+ # /status
+ # /level1/uid10
+ # /level1/level2/gid0
+ # /xyz
+ # /level1
+ # /level1
+ # /level1
+ # /level1/level2
+ # Contents of file 1 (URLs which could not be grouped) --
+ # /status
+ # /xyz
+ # Contents of file 2 --
+ # /level1/uid8
+ # /level1/uid10
+ # Contents of file 3 --
+ # /level1/level2/gid0
+ # /level1/level2/gid6
+ # Contents of file 4 --
+ # /level1/level2
+ # Contents of file 5 --
+ # /level1
+ # /level1
+ # /level1
+ # 'groupsize' -- An identified category of URLs must contain at least this many URLs to be considered a group.
+ # level1 and level1/ are 2 different URLs. bundle
+
+ # uniqcounter returns a hash whose key is a number (a kind of status code) and whose value is a human readable description of the status
+
+ # Algo (see unique_link_classifier.txt in solution process to see the method) --
+ # 1) Sort the file using previously packaged code. The sorted version of the file (named sorted_links) will be placed in plist [uniqcounter calls sorted_links which uses unix commands to achieve the same].
+ # 2) Start creating the bundle [uniqcounter calls nextBundle]. In case you've been redirected from 3), the next bundle will include the last line of the previous bundle [nextBundle calls nextBundleCalculate(true)]. For each line being added to the bundle, check if the no. of / is the same (procedure as stated in unique_link_classifier.txt) [nextBundle calls bundleSlashCount using the array returned by bundleLines]; if the no. of / differs, move to 3). If this is not the first time 2) is called, move to the next bundle in the same way as discussed in unique_link_classifier.txt [uniqcounter calls nextBundle with no arguments and then calls bundleLines]. If 10) was executed, the next bundle will start from the lines after the lines that were written [uniqcounter calls to_slIncrement() with the no. of successive lines detected in the bundle as argument. to_slIncrement is called (with argument 0) even when no successive lines were added, so that behavior is correct when nextBundleCalculate returns nil] and restart from the beginning of 2). If a bundle cannot be formed because of missing lines, terminate the program [uniqcounter exits if nextBundle returns false] after writing all remaining lines to separate files [nextBundleCalculate returns nil because of missing lines, and before that it writes the remaining lines to separate files].
+ # 3) In case a bundle couldn't be formed, write all these links except the last link to separate files [nextBundle calls writeLink] and move to 2).
+ # 5) Identify common parts in the bundle [uniqcounter calls bundleIdnetifyGroup].
+ # 6) If a common part is found [bundleIdnetifyGroup returns the common part], move to 7), otherwise write each of the links to different files [uniqcounter calls writeLink passing the array containing the lines of the bundle] and move to 2).
+ # 7) Expand the bundle size by 1 [uniqcounter calls bundleLines with argument x]; if there are no new lines, terminate the program after writing the bundle [if bundleLines returns nil, uniqcounter calls writeLinks with the array returned by the last successful call to bundleLines and terminates].
+ # 8) Check if the newly added line in the bundle has the same no. of / [uniqcounter calls bundleSlashCount with the value returned by bundleLines just above]. If not, move to 10) [if bundleSlashCount returns false].
+ # 9) Check if the newly added line in the bundle has the same prefix as the other lines of the bundle [uniqcounter calls bundleIdnetifyGroup with the array returned by bundleLines], in other words check if the new line can be a part of the group. If so, move to 7) [if bundleIdnetifyGroup returns the prefix, set x to its previous value + 1. The initial value of x is 0.], otherwise move to 10) [bundleIdnetifyGroup returned nil].
+ # 10) Write the bundle to a file if a group was made from the bundle without increasing x [uniqcounter calls writeLinks with the last array which did not make bundleIdnetifyGroup return nil] and move to 2). If no group could be formed in the 1st place, write the bundle to individual files [uniqcounter calls writeLink with the bundled array as argument].
+
+ # function/variable list
+ # $from_sl -- Points to the line from which the current bundle starts in sorted_links. First line is 1. Initial value is 1.
+ # $to_sl -- Points to the line at which the current bundle ends in sorted_links. Initial value is 1.
+ # $slinksIO -- IO object which opens sorted_links read-only.
+ # $plist
+ # $slNo -- no. of lines in sorted_links
+ # $groupsize
+ # $to_slCaller -- Name of the last function which called to_slIncrement.
+ # sorted_links -- Writes file sorted_links. Takes a single argument, the original list.
+ # to_slIncrement(<int>) -- Will increase $to_sl by <int>, which defaults to 1. Will update $to_slCaller to point to the last function which called this function.
+ # boolean bundleSlashCount(<Array>) -- Takes in an array and returns true only if the no. of / in all its elements is identical, otherwise it returns false.
+ # bool nextBundle() -- Calls nextBundleCalculate to check if enough lines exist to create the next bundle; if it returns nil, this will return false, otherwise the returned targets are used to create the next bundle. The values of from_sl and to_sl will be increased 1 at a time via to_slIncrement and for each addition, the corresponding lines will be made into an array (using bundleLines) and sent to bundleSlashCount; if bundleSlashCount returns true, another line will be added (using bundleLines). This process continues until the target/new from_sl and to_sl are reached. In case bundleSlashCount returns false at any time, will call writeLink with an array from index from_sl to to_sl - 1, then call nextBundleCalculate(true) (i.e. start the new bundle with an overlap) and then restart from the beginning.
+ # Array nextBundleCalculate(<boolean>) -- It'll increase to_sl by to_sl+<int>, then set from_sl to to_sl+1, and then set to_sl to to_sl+1+<groupsize> after checking if line no. to_sl+1+<groupsize> exists. If it does not, will return nil, otherwise will return an array [from_sl, to_sl] without actually increasing them. The algorithm differs if $from_sl and $to_sl are at their initial values; it'll just increase to_sl to accommodate the $groupsize. If <boolean> is set to true, the array returned will be an 'overlap' from the previous bundle, i.e. the first line will be common between the previous bundle and this new target bundle.
+ # Whenever this returns nil because there are not enough lines, each remaining link will be written to a file via function writeEnd() which determines what to write.
+ # writeEnd -- Writes the remaining lines of sorted_links based on a condition; if nextBundleCalculate was called because the no. of / differed for the lines which were being added to the bundle under formation (by nextBundle), then we need to write all lines from $to_sl onwards. If this function was called after a successful bundle was identified (and $to_sl was increased by uniqcounter via function to_slIncrement), then we need to write all lines after $to_sl. Look at the value of $to_slCaller to determine who incremented $to_sl last (either uniqcounter or nextBundle).
+ # Array bundleLines($slinksIO,<int>) -- Returns an array which contains all lines from from_sl to to_sl inclusive, if the argument is nil (default arg). If not, will return an array which also contains the <int> lines following to_sl. When <int> is not nil, will also check if these extra lines exist. If they do not, will return nil; when using <int>, will not update to_sl.
+ # writeLinks(<Array>, <string>) -- Will generate a uuid and write a file named after the generated uuid, if that file does not exist, in the directory specified by <string> (which defaults to plist). If the file named uuid exists, will exit the program with a message.
+ # writeLink(<Array>, <string>) -- For each element of <Array> will call writeLinks passing a single element array as argument. If <string> is not present (default argument nil), will execute writeLinks without a second argument.
+ # <string> bundleIdnetifyGroup(<Array>, prefix=nil) -- Will try to find common parts in the <Array> (the same one as returned by bundleLines) and return nil if there are no common parts; if a common part is found, will return the common part. The second argument, if present, makes it check whether all the elements of the array have that prefix; if so it will return the prefix itself, otherwise nil.
+ # uniqcounter -- Coordinator function, calls various functions with the correct arguments. Its workings are described in the algo itself.
+
+ require "securerandom"
+
+ $from_sl = 1
+ $to_sl = 1
+
+ def LinkGrouper.to_slIncrement(incby)
+   $to_sl = $to_sl+incby
+   # Record the name of the calling method, parsed from the backtrace entry.
+   $to_slCaller = caller(1,1)[0].sub(/.*`([a-zA-Z]+)'$/, "\\1")
+ end
+
+ def LinkGrouper.sorted_links(flist)
+   unless system("sort -r #{flist} > #{$plist + '/sorted_links'}")
+     return false
+   end
+   return true
+ end
+
+ def LinkGrouper.bundleSlashCount(links)
+   counter = nil
+   links.each do |i|
+     unless counter
+       counter = i.count('/')
+     end
+     return false if i.count('/') != counter
+   end
+   true
+ end
+
+ # TODO: Existing file detection not implemented.
+ def LinkGrouper.writeLinks(links, plist=$plist)
+   IO.write("#{plist}/#{SecureRandom.uuid}", links.join("\n"))
+ end
+
+ # TODO: Plist 2nd argument not handled.
+ def LinkGrouper.writeLink(links, plist=nil)
+   links.each do |line|
+     writeLinks([line])
+   end
+ end
+
+ def LinkGrouper.nextBundle
+   # Increment $to_sl 1 by 1 until we reach the targets. Retrieving targets.
+   targets = nextBundleCalculate
+   # no. of lines not sufficient.
+   return false unless targets
+
+   $from_sl, $to_sl = targets[0], targets[0]
+   # for each added line, check if bundleSlashCount returns true for the bundle under creation.
+   while $to_sl < targets[1]
+     to_slIncrement(1)
+     bundledLines = bundleLines($slinksIO)
+     # if no. of / in the newly added line differs...
+     unless bundleSlashCount(bundledLines)
+       # Write all the lines returned by bundleLines except the last.
+       writeLink(bundledLines[0...-1])
+       targets = nextBundleCalculate(true)
+       return false unless targets
+       $from_sl, $to_sl = targets[0], targets[0]
+     end
+   end
+   true
+ end
+
+ def LinkGrouper.bundleLines(linksIO, extra=nil)
+   if extra && (extra + $to_sl > $slNo)
+     return nil
+   end
+   linesArray = Array.new
+   linksIO.pos = 0
+   linksIO.lineno = 0
+   linescount=($to_sl - $from_sl) + 1
+   linksIO.each_line do |line|
+     line = line.chomp
+     if linksIO.lineno > $to_sl + (extra == nil ? 0 : extra)
+       break
+     end
+     if linksIO.lineno >= $from_sl
+       linesArray.push(line)
+     end
+   end
+   linesArray
+ end
+
+ def LinkGrouper.writeEnd
+   if $to_slCaller == "uniqcounter"
+     $from_sl = $to_sl + 1
+     if $from_sl <= $slNo
+       $to_sl = $slNo
+       writeLink(bundleLines($slinksIO))
+     end
+   else
+     $from_sl = $to_sl
+     if $from_sl <= $slNo
+       $to_sl = $slNo
+       writeLink(bundleLines($slinksIO))
+     end
+   end
+ end
+
+ def LinkGrouper.nextBundleCalculate(overlap=nil)
+   # check if there are enough lines remaining. If not, write the remaining lines.
+   if !overlap
+     if ($to_sl+1)+($groupsize - 1) > $slNo
+       writeEnd
+       return nil
+     end
+   else
+     if ($to_sl)+($groupsize - 1) > $slNo
+       writeEnd
+       return nil
+     end
+   end
+   # if it's not the initial value
+   if ($from_sl != 1 || $to_sl != 1)
+     from_slNext = $to_sl+1
+     to_slNext = from_slNext + ($groupsize - 1)
+   # if it is...
+   else
+     from_slNext = $from_sl
+     to_slNext = $to_sl + ($groupsize - 1)
+   end
+   unless overlap
+     return [from_slNext, to_slNext]
+   end
+   # For the initial values of from_sl and to_sl, this will return 0, 0 which is invalid.
+   if ($from_sl != 1 || $to_sl != 1)
+     return [from_slNext-1, to_slNext-1]
+   else
+     return nil
+   end
+ end
+
+ def LinkGrouper.bundleIdnetifyGroup(links, prefix=nil)
+   # Algo
+   # 1) Identify everything before the 1st separator for the first log entry. Call this string D.
+   # 2) Search for D in other logs in the same position. If even one of the log entries doesn't have the 1st separator, return false. If this is not the 1st time 2) is called, return the D that was common to all log entries.
+   # 3) Identify the next separator, call it D and move to 2)
+   if prefix
+     # TODO: If we search only the last element of links for the prefix, it'll be a lot faster. We don't need to search other links.
+     # The prefix is escaped so that regexp metacharacters in a URL (e.g. '.' or '?') match literally.
+     return nil if links.grep(/^#{Regexp.escape(prefix)}/).length != links.length
+     return prefix
+   end
+   dir = nil
+   # loop till we reach the end of the first link; i.e. everything in the 1st link was found to be common.
+   while ( (dir == nil ? 0 : dir.length) != links[0].length )
+     # identify next directory/prefix
+     direscaped = nil
+     direscaped = Regexp.escape(dir) if dir
+     dir = links[0].sub(/^(#{direscaped}\/*[^\/]*\/*).*/, "\\1")
+     # break if this prefix is missing from even a single link.
+     (links.grep(/^#{Regexp.escape(dir)}/).length != links.length) ? break : predir = dir
+   end
+   # this was the last prefix which was common for all the links.
+   predir
+ end
+ # signature -- plist (explained above), flist (input file), groupsize
+ def LinkGrouper.uniqcounter(plist, flist, groupsize)
+   if groupsize < 2
+     return {3 => "We dont deal with groupsize below 2."}
+   end
+
+   unless Dir.exist?(plist)
+     return {4 => "The write directory does not exist."}
+   end
+
+   plist = plist.chomp('/')
+
+   $groupsize = groupsize
+   $plist = plist
+
+   linectr = 0
+   IO.foreach(flist) do |line|
+     unless line =~ /^\//
+       return {1 => "Found line(s)(#{line}) which are not absolute links. Sanitize file."}
+     end
+     linectr += 1
+   end
+   $slNo = linectr
+
+   unless sorted_links(flist)
+     # TODO -- Force this.
+     FileUtils.rm "#{$plist}/sorted_links"
+     return {2 => "sort command was not successful"}
+   end
+
+   $slinksIO = flistio = IO.new(IO.sysopen($plist+'/sorted_links', 'r'))
+
+   while nextBundle
+     bundledLines = bundleLines($slinksIO)
+     bundledLinesExtra = bundledLines
+     workingPrefix = nil
+     testnext = 0
+     # TODO -- optimization, first call bundleSlashCount, then call bundleIdnetifyGroup, possibly both of them in parallel
+     while (prefix = bundleIdnetifyGroup(bundledLinesExtra, workingPrefix)) && bundleSlashCount(bundledLinesExtra)
+       # bundledLines always contains lines in which a group was found. bundledLinesExtra contains the extra line which needs to be tested against the prefix already identified for the group.
+       workingPrefix = prefix
+       bundledLines = bundledLinesExtra
+       testnext += 1
+       bundledLinesExtra = bundleLines($slinksIO, testnext)
+       # logs ran out... exit
+       unless bundledLinesExtra
+         writeLinks(bundledLines)
+         $slinksIO.close
+         FileUtils.rm "#{$plist}/sorted_links"
+         return {0 => "Logs ran out while a group was identified"}
+       end
+     end
+     # Write only if a group was found for the bundle. Also increase $to_sl to the last successfully added line.
+     if workingPrefix
+       writeLinks(bundledLines)
+       to_slIncrement(testnext - 1)
+     else
+       # Write to individual files
+       writeLink(bundledLines)
+       # for proper functioning of writeEnd/nextBundleCalculate
+       to_slIncrement(0)
+     end
+   end
+   $slinksIO.close
+   FileUtils.rm "#{$plist}/sorted_links"
+   return {0 => "All done."}
+ end
+ end
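Editorial sketch (not part of the gem): the two checks that decide whether a bundle forms a group, equal slash counts and a shared leading prefix, can be illustrated in isolation. The helper names same_depth? and shared_prefix are assumptions, and the prefix helper uses String#start_with? in place of the regular expressions in the library, so it is a simplified stand-in and not the library's exact logic.

```ruby
# Hypothetical helpers for illustration; they are not part of the gem.

# True when every link contains the same number of "/" characters.
def same_depth?(links)
  links.map { |link| link.count("/") }.uniq.size <= 1
end

# Longest leading run of path chunks shared by every link, or nil when
# even the first chunk of the first link is not shared.
def shared_prefix(links)
  prefix = nil
  candidate = ""
  # The chunks of "/level1/uid8" are "/level1/" and "uid8".
  links.first.scan(%r{/*[^/]+/*}) do |chunk|
    candidate += chunk
    break unless links.all? { |link| link.start_with?(candidate) }
    prefix = candidate
  end
  prefix
end

pair = ["/level1/uid8", "/level1/uid10"]
depth_ok = same_depth?(pair)                          # both links contain two "/"
pair_prefix = shared_prefix(pair)                     # shared part is "/level1/"
no_prefix = shared_prefix(["/status", "/xyz"])        # nothing shared
mixed_depth = same_depth?(["/level1", "/level1/level2"])
```

In the README example this is why /level1/uid8 and /level1/uid10 land in one file while /status and /xyz are written separately.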
metadata ADDED
@@ -0,0 +1,50 @@
+ --- !ruby/object:Gem::Specification
+ name: LinkGrouper
+ version: !ruby/object:Gem::Version
+   version: 1.0.1
+ platform: universal-linux
+ authors:
+ - dE
+ autorequire:
+ bindir: bin
+ cert_chain: []
+ date: 2017-03-25 00:00:00.000000000 Z
+ dependencies: []
+ description: Given a text file as input, will group similar links (or find unique
+   links) and write them to separate files. Read the README of the gem install directory
+   for the usage
+ email: de.techno@gmail.com
+ executables:
+ - LinkGrouper_app.rb
+ extensions: []
+ extra_rdoc_files: []
+ files:
+ - README
+ - bin/LinkGrouper_app.rb
+ - lib/LinkGrouper.rb
+ homepage: http://delogics.blogspot.com
+ licenses:
+ - Apache License, Version 2.0
+ metadata: {}
+ post_install_message:
+ rdoc_options: []
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '0'
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '0'
+ requirements:
+ - The sort command in the OS search path
+ rubyforge_project:
+ rubygems_version: 2.2.2
+ signing_key:
+ specification_version: 4
+ summary: Library and application to group similar logs
+ test_files: []