bookshark 1.0.0.alpha.2

@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: cd72b117f3ad276a049e9e273461125a5080de60
4
+ data.tar.gz: dc0ece907de78acddf1e2077c98ef56b8c45bd0b
5
+ SHA512:
6
+ metadata.gz: 7469001e1d4a3117d10fa05da6316b704033300ebc8fed1da636886789cd90b3536d1d1fdf9aa7cfecc2a743b1a294ee8fc1020f9529694dcf3992df89f926eb
7
+ data.tar.gz: daa1103bd41e5f5d0b374d794b078a9ae38dbbfddcf86ff0c1acb2ac9224032a626b92c1b69dce2e6c8f1eca29e3aebcfa0f89d366c50937dc19544afae5d49f
@@ -0,0 +1,20 @@
1
+ *.gem
2
+ /.bundle/
3
+ /.yardoc
4
+ /Gemfile.lock
5
+ /_yardoc/
6
+ /coverage/
7
+ /doc/
8
+ /pkg/
9
+ /spec/reports/
10
+ /tmp/
11
+ /lib/bookshark/storage/html_publisher_pages/
12
+ /lib/bookshark/logs/*.log
13
+ *.bundle
14
+ *.so
15
+ *.o
16
+ *.a
17
+ mkmf.log
18
+ *.log
19
+ *.sql
20
+ *.sqlite
data/.rspec ADDED
@@ -0,0 +1,3 @@
1
+ --color
2
+ --require spec_helper
3
+ --format documentation
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in bookshark.gemspec
4
+ gemspec
@@ -0,0 +1,22 @@
1
+ Copyright (c) 2014 Dimitris Klisiaris
2
+
3
+ MIT License
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining
6
+ a copy of this software and associated documentation files (the
7
+ "Software"), to deal in the Software without restriction, including
8
+ without limitation the rights to use, copy, modify, merge, publish,
9
+ distribute, sublicense, and/or sell copies of the Software, and to
10
+ permit persons to whom the Software is furnished to do so, subject to
11
+ the following conditions:
12
+
13
+ The above copyright notice and this permission notice shall be
14
+ included in all copies or substantial portions of the Software.
15
+
16
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
17
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
18
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
19
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
20
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
21
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
22
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
@@ -0,0 +1,453 @@
1
+ # Bookshark
2
+ ![Bookshark Logo](https://dl.dropboxusercontent.com/u/4888041/bookshark/logo.png)
3
+
4
+ A Ruby library for extracting bibliographic metadata from biblionet.gr:
5
+ books, authors, publishers and DDC categories.
6
+ The representation of bibliographic metadata in JSON is inspired by [BibJSON](http://okfnlabs.org/bibjson/) but some tags may be different.
7
+
8
+ ## Installation
9
+
10
+ Add this line to your application's Gemfile:
11
+
12
+ ```ruby
13
+ gem 'bookshark', "~> 1.0.0.alpha"
14
+ ```
15
+
16
+ And then execute:
17
+
18
+ $ bundle install
19
+
20
+ Or install it yourself as:
21
+
22
+ $ gem install bookshark --pre
23
+
24
+ ## Usage
25
+ Include bookshark in your class/module.
26
+ ```ruby
27
+ include Bookshark
28
+ ```
29
+ Alternatively you can use this syntax
30
+ ```ruby
31
+ Bookshark::Extractor.new
32
+
33
+ # Instead of this
34
+ include Bookshark
35
+ Extractor.new
36
+ ```
37
+
38
+ ### Extractor
39
+
40
+ Create an extractor object
41
+ ```ruby
42
+ Extractor.new
43
+ Extractor.new(format: 'json')
44
+ Extractor.new(format: 'hash', site: 'biblionet')
45
+ ```
46
+ **Extractor Options**:
47
+
48
+ * format : The format in which the extracted data are returned
49
+   * hash (default)
50
+   * json
51
+   * pretty_json
52
+ * site : The site from where the metadata will be extracted
53
+   * biblionet (default and currently the only available, so it can be skipped)
54
+
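The three `format` values can be pictured with Ruby's standard `json` library. This is only a sketch of how the three representations relate to each other; the sample hash is made up for illustration and nothing here is claimed to be bookshark's own implementation.

```ruby
require 'json'

# Hypothetical sample data, shaped like a heavily trimmed extraction result.
data = { 'author' => [{ 'name' => 'Tolkien, John Ronald Reuel', 'b_id' => '10207' }] }

as_hash        = data                        # comparable to format: 'hash'
as_json        = JSON.generate(data)         # comparable to format: 'json' (compact)
as_pretty_json = JSON.pretty_generate(data)  # comparable to format: 'pretty_json' (indented)

puts as_json
puts as_pretty_json
```

The hash is the most convenient form inside Ruby code, while the two json forms are plain strings meant for storage or transfer.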
55
+ #### Extract Book Data
56
+
57
+ You need the book's id on the biblionet website, or its uri.
58
+ More advanced search functions based on title and author are not available yet, but they will be by the stable 1.0.0 release.
59
+
60
+ First create an extractor object:
61
+ ```ruby
62
+ # Create a new extractor object with pretty json format.
63
+ extractor = Extractor.new(format: 'pretty_json')
64
+ ```
65
+ Then you can extract books
66
+ ```ruby
67
+ # Extract book with id 103788 from website
68
+ extractor.book(id: 103788)
69
+
70
+ # Extract book from the provided webpage
71
+ extractor.book(uri: 'http://biblionet.gr/book/103788/')
72
+
73
+ # Extract book with id 103788 from local storage
74
+ extractor.book(id: 103788, local: true)
75
+ ```
76
+
77
+ **Book Options**
78
+ (The recommended option is to use just the id and let bookshark generate the uri):
79
+
80
+ * id : The id of book on the corresponding site (Integer)
81
+ * uri : The url of book web page or the path to local file.
82
+ * local : Boolean value. Has page been saved locally? (default is false)
83
+ * format : The format in which the extracted data are returned
84
+   * hash (default)
85
+   * json
86
+   * pretty_json
87
+ * eager : Perform eager extraction? (Boolean - default is false)
88
+
89
+ **Eager Extraction:**
90
+
91
+ Each book has some attributes, such as authors, contributors and categories, which are actually references to other objects.
92
+ By default when extracting a book, you get only names of these objects and references to their pages.
93
+ With eager option set to true, each of these objects' data is extracted and the produced output contains complete information about every object.
94
+ Eager extraction doesn't work with local option enabled.
95
+
96
+ ```ruby
97
+ # Extract book with id 103788 with eager extraction option enabled
98
+ extractor.book(id: 103788, eager: true)
99
+ ```
100
+
101
+ The expected result of a book extraction is something like this:
102
+ ```json
103
+ {
104
+ "book": [
105
+ {
106
+ "title": "Σημεία και τέρατα της οικονομίας",
107
+ "subtitle": "Η κρυφή πλευρά των πάντων",
108
+ "image": "http://www.biblionet.gr/images/covers/b103788.jpg",
109
+ "author": [
110
+ {
111
+ "name": "Steven D. Levitt",
112
+ "b_id": "59782"
113
+ },
114
+ {
115
+ "name": "Stephen J. Dubner",
116
+ "b_id": "59783"
117
+ }
118
+ ],
119
+ "contributors": {
120
+ "μετάφραση": [
121
+ {
122
+ "name": "Άγγελος Φιλιππάτος",
123
+ "b_id": "851"
124
+ }
125
+ ]
126
+ },
127
+ "publisher": {
128
+ "name": "Εκδοτικός Οίκος Α. Α. Λιβάνη",
129
+ "b_id": "271"
130
+ },
131
+
132
+ "publication_year": "2006",
133
+ "pages": "326",
134
+ "isbn": "960-14-1157-7",
135
+ "isbn_13": "978-960-14-1157-6",
136
+ "status": "Κυκλοφορεί",
137
+ "price": "16,31",
138
+ "award": [
139
+
140
+ ],
141
+ "description": "Τι είναι πιο επικίνδυνο, ένα όπλο ή μια πισίνα; Τι κοινό έχουν οι δάσκαλοι με τους παλαιστές του σούμο;...",
142
+ "category": [
143
+ {
144
+ "ddc": "330",
145
+ "text": "Οικονομία",
146
+ "b_id": "142"
147
+ }
148
+ ],
149
+ "b_id": "103788"
150
+ }
151
+ ]
152
+ }
153
+ ```
154
+ Here is a [Book Sample](https://gist.github.com/dklisiaris/a6f3d6f37806186f3c79) extracted with eager option enabled.
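Once a result like the one above is at hand, reading fields out of it needs nothing beyond Ruby's standard `json` library. The sketch below assumes the result arrived as a JSON string (for example from an extractor created with `format: 'json'`) and uses a trimmed copy of the sample; the variable names are illustrative only.

```ruby
require 'json'

# Trimmed copy of the sample output above, assumed to be a JSON string.
book_json = <<~JSON
  {
    "book": [
      {
        "title": "Σημεία και τέρατα της οικονομίας",
        "author": [
          { "name": "Steven D. Levitt", "b_id": "59782" },
          { "name": "Stephen J. Dubner", "b_id": "59783" }
        ],
        "publisher": { "name": "Εκδοτικός Οίκος Α. Α. Λιβάνη", "b_id": "271" },
        "isbn_13": "978-960-14-1157-6",
        "b_id": "103788"
      }
    ]
  }
JSON

# The book hash is always wrapped in an array, so take the first element.
book = JSON.parse(book_json).fetch('book').first

author_names = book.fetch('author').map { |author| author['name'] }
# Ids are strings in the sample output; convert them when integers are needed.
author_ids = book.fetch('author').map { |author| Integer(author['b_id']) }

puts "#{book['title']} (#{book['isbn_13']})"
puts author_names.join(', ')
```

The integer ids collected this way are in the form that `extractor.author(id: ...)` expects.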
155
+
156
+ #### Extract Author Data
157
+
158
+ You need the author's id on the biblionet website, or the author's uri.
159
+ ```ruby
160
+ Extractor.new.author(id: 10207)
161
+ Extractor.new(format: 'json').author(uri: 'http://www.biblionet.gr/author/10207/')
162
+ ```
163
+ Extraction from locally saved html pages is also possible, but not recommended.
164
+ ```ruby
165
+ extractor = Extractor.new(format: 'json')
166
+ extractor.author(uri: 'storage/html_author_pages/2/author_2423.html', local: true)
167
+ ```
168
+ **Author Options**: (The recommended option is to use just the id and let bookshark generate the uri):
169
+ * id : The id of author on the corresponding site (Integer)
170
+ * uri : The url of author web page or the path to local file.
171
+ * local : Boolean value. Has page been saved locally? (default is false)
172
+
173
+ The expected result of an author extraction is something like this:
174
+ ```json
175
+ {
176
+ "author": [
177
+ {
178
+ "name": "Tolkien, John Ronald Reuel",
179
+ "firstname": "John Ronald Reuel",
180
+ "lastname": "Tolkien",
181
+ "lifetime": "1892-1973",
182
+ "image": "http://www.biblionet.gr/images/persons/10207.jpg",
183
+ "bio": "Ο John Ronald Reuel Tolkien, άγγλος φιλόλογος και συγγραφέας, γεννήθηκε το 1892 στην πόλη Μπλουμφοντέιν...",
184
+ "award": [
185
+ {
186
+ "name": "The Benson Medal [The Royal Society of Literature]",
187
+ "year": "1966"
188
+ }
189
+ ],
190
+ "b_id": "10207"
191
+ }
192
+ ]
193
+ }
194
+ ```
195
+ By convention a result never holds a bare author hash; the author hash is always stored inside an array.
196
+ This makes it easy to include metadata for multiple authors, or even for multiple types of entities such as publishers or books, in the same json file.
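Because every entity sits inside an array under its type name, separate results can be folded into one document. The following is a minimal sketch of that idea, assuming results with string keys (as they would be after `JSON.parse`); the sample data is trimmed and the merging approach is a suggestion, not something bookshark provides.

```ruby
require 'json'

# Hypothetical, heavily trimmed results with string keys.
authors_a  = { 'author' => [{ 'name' => 'Tolkien, John Ronald Reuel', 'b_id' => '10207' }] }
authors_b  = { 'author' => [{ 'name' => 'Steven D. Levitt', 'b_id' => '59782' }] }
publishers = { 'publisher' => [{ 'name' => 'Εκδόσεις Πατάκη', 'b_id' => '20' }] }

# Arrays stored under the same key are concatenated; new keys are simply added.
combined = [authors_a, authors_b, publishers].reduce({}) do |acc, result|
  acc.merge(result) { |_key, existing, incoming| existing + incoming }
end

puts JSON.pretty_generate(combined)
```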
197
+
198
+ #### Extract Publisher Data
199
+ The methods are much the same as for authors:
200
+ ```ruby
201
+ # Create a new extractor object with pretty json format.
202
+ extractor = Extractor.new(format: 'pretty_json')
203
+
204
+ # Extract publisher with id 20 from website
205
+ extractor.publisher(id: 20)
206
+
207
+ # Extract publisher from the provided webpage
208
+ extractor.publisher(uri: 'http://biblionet.gr/com/20/')
209
+
210
+ # Extract publisher with id 20 from local storage
211
+ extractor.publisher(id: 20, local: true)
212
+ ```
213
+ **Publisher Options**: (The recommended option is to use just the id and let bookshark generate the uri):
214
+
215
+ * id : The id of publisher on the corresponding site (Integer)
216
+ * uri : The url of publisher web page or the path to local file.
217
+ * local : Boolean value. Has page been saved locally? (default is false)
218
+ * format : The format in which the extracted data are returned
219
+   * hash (default)
220
+   * json
221
+   * pretty_json
222
+
223
+ The expected result of a publisher extraction is something like this:
224
+ ```json
225
+ {
226
+ "publisher": [
227
+ {
228
+ "name": "Εκδόσεις Πατάκη",
229
+ "owner": "Στέφανος Πατάκης",
230
+ "bookstores": {
231
+ "Κεντρική διάθεση": {
232
+ "address": [
233
+ "Εμμ. Μπενάκη 16",
234
+ "106 78 Αθήνα"
235
+ ],
236
+ "telephone": [
237
+ "210 3831078"
238
+ ]
239
+ },
240
+ "Γενικό βιβλιοπωλείο Πατάκη": {
241
+ "address": [
242
+ "Ακαδημίας 65",
243
+ "106 78 Αθήνα"
244
+ ],
245
+ "telephone": [
246
+ "210 3811850",
247
+ "210 3811740"
248
+ ]
249
+ },
250
+ "Έδρα": {
251
+ "address": [
252
+ "Παναγή Τσαλδάρη 38 (πρ. Πειραιώς)",
253
+ "104 37 Αθήνα"
254
+ ],
255
+ "telephone": [
256
+ "210 3650000",
257
+ "210 5205600"
258
+ ],
259
+ "fax": "210 3650069",
260
+ "email": "info@patakis.gr",
261
+ "website": "www.patakis.gr"
262
+ }
263
+ },
264
+ "b_id": "20"
265
+ }
266
+ ]
267
+ }
268
+ ```
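The `bookstores` entry is a hash keyed by store name, with `address` and `telephone` stored as arrays. As a rough illustration of how such a structure could be flattened for display, here is a sketch that works on a trimmed copy of the sample with string keys; it assumes nothing about bookshark beyond the output shown above.

```ruby
# Trimmed copy of the "bookstores" hash from the sample above (string keys).
bookstores = {
  'Κεντρική διάθεση' => {
    'address'   => ['Εμμ. Μπενάκη 16', '106 78 Αθήνα'],
    'telephone' => ['210 3831078']
  },
  'Έδρα' => {
    'address'   => ['Παναγή Τσαλδάρη 38 (πρ. Πειραιώς)', '104 37 Αθήνα'],
    'telephone' => ['210 3650000', '210 5205600'],
    'fax'       => '210 3650069'
  }
}

# One line per store: name, joined address and every listed phone number.
lines = bookstores.map do |store_name, details|
  address = details.fetch('address', []).join(', ')
  phones  = details.fetch('telephone', []).join(' / ')
  "#{store_name}: #{address} (#{phones})"
end

# Every phone number of the publisher, regardless of store.
all_phones = bookstores.values.flat_map { |details| details.fetch('telephone', []) }

puts lines
```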
269
+ #### Extract Categories
270
+ Biblionet's categories are based on [Dewey Decimal Classification](http://en.wikipedia.org/wiki/Dewey_Decimal_Classification). These categories can also be extracted, as seen below.
271
+ ```ruby
272
+ # Create a new extractor object with pretty json format.
273
+ extractor = Extractor.new(format: 'pretty_json')
274
+
275
+ # Extract category with id 1041 from website
276
+ extractor.category(id: 1041)
277
+
278
+ # Extract category from the provided webpage
279
+ extractor.category(uri: 'http://biblionet.gr/index/1041/')
280
+
281
+ # Extract category with id 1041 from local storage
282
+ extractor.category(id: 1041, local: true)
283
+ ```
284
+ **Categories Options**: (Pretty much the same as previous cases)
285
+
286
+ * id : The id of category on the corresponding site (Integer)
287
+ * uri : The url of category web page or the path to local file.
288
+ * local : Boolean value. Has page been saved locally? (default is false)
289
+ * format : The format in which the extracted data are returned
290
+   * hash (default)
291
+   * json
292
+   * pretty_json
293
+
294
+ Notice that when you are extracting a category you also extract parent categories and subcategories, thus you never extract just one category.
295
+
296
+ The expected result of a category extraction is something like this:
297
+ (Here the extracted category is 1041, but its parent categories and subcategories were also extracted.)
298
+ ```json
299
+ {
300
+ "category": [
301
+ {
302
+ "192": {
303
+ "ddc": "500",
304
+ "name": "Φυσικές και θετικές επιστήμες",
305
+ "parent": null
306
+ },
307
+ "1040": {
308
+ "ddc": "520",
309
+ "name": "Αστρονομία",
310
+ "parent": "192"
311
+ },
312
+ "1041": {
313
+ "ddc": "523",
314
+ "name": "Πλανήτες",
315
+ "parent": "1040"
316
+ },
317
+ "780": {
318
+ "ddc": "523.01",
319
+ "name": "Αστροφυσική",
320
+ "parent": "1041"
321
+ },
322
+ "2105": {
323
+ "ddc": "523.083",
324
+ "name": "Πλανήτες - Βιβλία για παιδιά",
325
+ "parent": "1041"
326
+ },
327
+ "576": {
328
+ "ddc": "523.1",
329
+ "name": "Κοσμολογία",
330
+ "parent": "1041"
331
+ },
332
+ "current": {
333
+ "ddc": "523",
334
+ "name": "Πλανήτες",
335
+ "parent": "1040",
336
+ "b_id": "1041"
337
+ }
338
+ }
339
+ ]
340
+ }
341
+ ```
342
+
343
+ Notice that the last item is the current category. The rest is the category tree.
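Since every entry carries a `parent` id, the tree can be walked without any extra requests. The sketch below takes the inner hash of a category result (with string keys, as after `JSON.parse`) and derives the ancestors and direct subcategories of the current category. The helper names `ancestors` and `subcategories` are hypothetical and not part of bookshark.

```ruby
# Trimmed copy of the sample tree above, as a Ruby hash with string keys.
tree = {
  '192'     => { 'ddc' => '500', 'name' => 'Φυσικές και θετικές επιστήμες', 'parent' => nil },
  '1040'    => { 'ddc' => '520', 'name' => 'Αστρονομία', 'parent' => '192' },
  '1041'    => { 'ddc' => '523', 'name' => 'Πλανήτες', 'parent' => '1040' },
  '780'     => { 'ddc' => '523.01', 'name' => 'Αστροφυσική', 'parent' => '1041' },
  'current' => { 'ddc' => '523', 'name' => 'Πλανήτες', 'parent' => '1040', 'b_id' => '1041' }
}

# Follow the "parent" links from the current category up to the root.
def ancestors(tree)
  path = []
  id = tree.fetch('current')['parent']
  while id
    node = tree.fetch(id)
    path.unshift(node['name'])
    id = node['parent']
  end
  path
end

# Direct subcategories are the entries whose parent is the current b_id.
def subcategories(tree)
  current_id = tree.fetch('current')['b_id']
  tree.reject { |key, _node| key == 'current' }
      .select { |_key, node| node['parent'] == current_id }
      .map { |_key, node| node['name'] }
end

breadcrumb = ancestors(tree) + [tree['current']['name']]
puts breadcrumb.join(' > ')
puts subcategories(tree).join(', ')
```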
344
+ ### Book Search
345
+ Instead of providing the exact book id and extracting that book directly, a search function can be used to get one or more books based on some parameters.
346
+ ```ruby
347
+ # Create a new extractor object with pretty json format.
348
+ extractor = Extractor.new(format: 'pretty_json')
349
+
350
+ # Extract books with these words in title
351
+ extractor.search(title: 'σημεια και τερατα')
352
+
353
+ # Extract books with these words in title and this name in author
354
+ extractor.search(title: 'χομπιτ', author: 'τολκιν', results_type: 'metadata')
355
+
356
+ # Extract books from a specific author, published in 2010 or later
357
+ extractor.search(author: 'arthur doyle', after_year: '2010')
358
+
359
+ # Extract ids of books with these words in title and this name in author
360
+ extractor.search(title: 'αρχοντας', author: 'τολκιν', results_type: 'ids')
361
+ ```
362
+ Searching and extracting several books can be very slow at times, so instead of extracting every single book you may prefer only the ids of found books. In that case pass the option `results_type: 'ids'`.
363
+
364
+ **Search Options**:
365
+ With enough options you can customize your query to your needs. It is recommended to use at least two of the search options.
366
+
367
+ * title (The title of book to search)
368
+ * author (The author's last name is enough to filter the search)
369
+ * publisher
370
+ * category
371
+ * title_split
372
+   * 0 (The exact title phrase must be matched)
373
+   * 1 (Default - All the words in title must be matched in whatever order)
374
+   * 2 (At least one word should match)
375
+ * book_id (Providing id means only one book should be returned)
376
+ * isbn
377
+ * author_id (ID of the selected author)
378
+ * publisher_id
379
+ * category_id
380
+ * after_year (Published this year or later)
381
+ * before_year (Published this year or before)
382
+ * results_type
383
+   * metadata (Default - Every book is extracted and an array of metadata is returned)
384
+   * ids (Only ids are returned)
385
+ * format : The format in which the extracted data are returned
386
+   * hash (default)
387
+   * json
388
+   * pretty_json
389
+
390
+ Results with the ids option look like this:
391
+ ```json
392
+ {
393
+ "book": [
394
+ "119000",
395
+ "103788",
396
+ "87815",
397
+ "87812",
398
+ "15839",
399
+ "77381",
400
+ "46856",
401
+ "46763",
402
+ "33301"
403
+ ]
404
+ }
405
+ ```
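A two-step approach follows naturally from the ids result: search for ids first, then extract only the books that are really needed. The sketch below assumes the ids result is available as a JSON string in the shape shown above. `books_for_ids` and its `limit` option are hypothetical helpers written for this example, and `extractor` can be any object that responds to `book(id:)`.

```ruby
require 'json'

# Fetch full metadata for each id returned by a search with results_type: 'ids'.
# `limit` is a convenience of this sketch to cap the number of slow extractions.
def books_for_ids(extractor, ids_json, limit: nil)
  ids = JSON.parse(ids_json).fetch('book').map { |id| Integer(id) }
  ids = ids.first(limit) if limit
  ids.map { |id| extractor.book(id: id) }
end

# Possible usage with a real extractor (not executed here):
#   extractor = Extractor.new(format: 'json')
#   ids_json  = extractor.search(title: 'σημεια και τερατα', results_type: 'ids')
#   books     = books_for_ids(extractor, ids_json, limit: 3)
```

Capping the number of extractions keeps the slow part of the work under control while still letting you inspect the complete list of ids.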
406
+ By default the results are multiple books, like the ones returned by the book extractor:
407
+ ```json
408
+ {
409
+ "book": [
410
+ {
411
+ "title": "Στης Χλόης τα απόκρυφα",
412
+ "subtitle": "…και άλλα σημεία και τέρατα",
413
+ "... Rest of Metadata ...": "... condensed ..."
414
+ },
415
+ {
416
+ "title": "Σημεία και τέρατα της οικονομίας",
417
+ "subtitle": "Η κρυφή πλευρά των πάντων",
418
+ "... Rest of Metadata ...": "... condensed ..."
419
+ },
420
+ {
421
+ "title": "Και άλλα σημεία και τέρατα από την ιστορία",
422
+ "subtitle": null,
423
+ "... Rest of Metadata ...": "... condensed ..."
424
+ },
425
+ {
426
+ "title": "Σημεία και τέρατα από την ιστορία",
427
+ "subtitle": null,
428
+ "... Rest of Metadata ...": "... condensed ..."
429
+ }
430
+ ]
431
+ }
432
+ ```
433
+
434
+ ### Where do IDs point?
435
+ The id of each data type points to the corresponding type webpage.
436
+ Take a look at this table:
437
+
438
+ | ID | Data Type | Target Webpage |
439
+ |---------|:-----------:|----------------------------------|
440
+ | 103788 | book | http://biblionet.gr/book/103788 |
441
+ | 10207 | author | http://biblionet.gr/author/10207 |
442
+ | 20 | publisher | http://biblionet.gr/com/20 |
443
+ | 1041 | category | http://biblionet.gr/index/1041 |
444
+
445
+ So if you want to use the uri option, provide the target webpage's url as seen above, without any slugs after the id.
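The mapping in the table is small enough to express as a lookup. The following is a minimal sketch; `biblionet_uri` is a hypothetical helper written for this example and not part of bookshark, which builds the uri itself when only an id is given.

```ruby
# Hypothetical helper: builds the target webpage url for an id and data type.
def biblionet_uri(type, id)
  # Path segment used for each data type, taken from the table above.
  segments = { book: 'book', author: 'author', publisher: 'com', category: 'index' }
  segment = segments.fetch(type) { raise ArgumentError, "unknown data type: #{type}" }
  "http://biblionet.gr/#{segment}/#{Integer(id)}"
end

puts biblionet_uri(:book, 103788)
puts biblionet_uri(:publisher, '20')
```

`Integer(id)` rejects anything that is not a plain number, so slugs cannot slip into the generated url.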
446
+
447
+ ## Contributing
448
+
449
+ 1. Fork it ( https://github.com/[my-github-username]/bookshark/fork )
450
+ 2. Create your feature branch (`git checkout -b my-new-feature`)
451
+ 3. Commit your changes (`git commit -am 'Add some feature'`)
452
+ 4. Push to the branch (`git push origin my-new-feature`)
453
+ 5. Create a new Pull Request