sinew 2.0.1 → 2.0.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: fde4bbaa95fce45f3a7ae7aeacab1672615ea1ace852845b0395ce9cce32f861
4
- data.tar.gz: 5743800570722443f704c5fc7bc421346cc4f2fb116b8fe9f615bf84fb95f826
3
+ metadata.gz: 33506a03f47a88cae5bf7e0f4675d7cf83d86ba3c96f0880f5c473a7b23b167b
4
+ data.tar.gz: 990bd4690f9fe799774c349314a32ab2c08979d555f03891316c5e0be8a4ad3d
5
5
  SHA512:
6
- metadata.gz: 94009061e7f4e36cc23528be3866c6a372df51a83e096144cafbd923259439e6d44a7d656fbdcfe09c2e059b48deb553caca3ec5d332b33845afd1e91550371a
7
- data.tar.gz: 5a8baf7fbdba371065c796c9fdce4312039558b27b4a16676b3df16d5138916ce84db0677dce6ede1831be8040df9112a0491e421813af5e5fd0b0b747d49239
6
+ metadata.gz: 9644097a2e11d8cba59a7985dfe770f27b00d5d18b676d0cacdee3e73a21f1b6c237b3bb58d68489d2a67fc981f7a7f8bb27a6e6fb23781f318cde78b392d7cd
7
+ data.tar.gz: 667c301e7896b27162a77cff5165f264a0c2b73afbe5c35f541181709118185a13241d187ff8f8d5964e302537064ff927d9fa64ece2cb10ca65ba7dd89ce807
data/LICENSE CHANGED
@@ -1,4 +1,4 @@
1
- Copyright (c) 2012 Adam Doppelt
1
+ Copyright (c) 2012-2018 Adam Doppelt
2
2
 
3
3
  Permission is hereby granted, free of charge, to any person obtaining
4
4
  a copy of this software and associated documentation files (the
data/README.md CHANGED
@@ -1,11 +1,13 @@
1
+ ![Travis](https://travis-ci.org/gurgeous/sinew.svg?branch=master)
2
+
1
3
  ## Welcome to Sinew
2
4
 
3
5
  Sinew collects structured data from web sites (screen scraping). It provides a Ruby DSL built for crawling, a robust caching system, and integration with [Nokogiri](http://nokogiri.org). Though small, this project is the culmination of years of effort based on crawling systems built at several different companies.
4
6
 
5
7
  Sinew is distributed as a ruby gem:
6
8
 
7
- ```ruby
8
- gem install sinew
9
+ ```sh
10
+ $ gem install sinew
9
11
  ```
10
12
 
11
13
  or in your Gemfile:
@@ -16,17 +18,16 @@ gem 'sinew'
16
18
 
17
19
  ## Table of Contents
18
20
 
19
- <!---
20
- markdown-toc --no-firsth1 --maxdepth 1 readme.md
21
- -->
21
+ <!--- markdown-toc --no-firsth1 --maxdepth 1 readme.md -->
22
22
 
23
- * [Sinew 2 (May 2018)](#sinew-2-may-2018)
23
+ * [Sinew 2](#sinew-2-may-2018)
24
24
  * [Quick Example](#quick-example)
25
25
  * [How it Works](#how-it-works)
26
26
  * [DSL Reference](#dsl-reference)
27
27
  * [Hints](#hints)
28
28
  * [Limitations](#limitations)
29
29
  * [Changelog](#changelog)
30
+ * [License](#license)
30
31
 
31
32
  ## Sinew 2 (May 2018)
32
33
 
@@ -34,7 +35,7 @@ I am pleased to announce the release of Sinew 2.0, a complete rewrite of Sinew f
34
35
 
35
36
  * Remove dependencies on active_support, curl and tidy. We use HTTParty now.
36
37
  * Much easier to customize requests in `.sinew` files. For example, setting User-Agent or Bearer tokens.
37
- * More operations like `post_json` or the generic `http`. These methods are thing wrappers around HTTParty.
38
+ * More operations like `post_json` or the generic `http`. These methods are thin wrappers around HTTParty.
38
39
  * New end-of-run report.
39
40
  * Tests, rubocop, vscode settings, travis, etc.
40
41
 
@@ -124,15 +125,18 @@ Because all requests are cached, you can run Sinew repeatedly with confidence. R
124
125
  #### Making requests
125
126
 
126
127
  * `get(url, query = {})` - fetch a url with HTTP GET. URL parameters can be added using `query`.
127
- * `post(url, form = {})` - fetch a url with HTTP POST, using `form` as the POST body.
128
+ * `post(url, form = {})` - fetch a url with HTTP POST, using `form` as the URL encoded POST body.
128
129
  * `post_json(url, json = {})` - fetch a url with HTTP POST, using `json` as the POST body.
129
130
  * `http(method, url, options = {})` - use this for more complex requests
130
131
 
131
132
  #### Parsing the response
132
133
 
134
+ These variables are set after each HTTP request.
135
+
133
136
  * `raw` - the raw response from the last request
134
137
  * `html` - like `raw`, but with a handful of HTML-specific whitespace cleanups
135
- * `noko` - a [Nokogiri](http://nokogiri.org) document built from the tidied HTML
138
+ * `noko` - parse the response as HTML and return a [Nokogiri](http://nokogiri.org) document
139
+ * `xml` - parse the response as XML and return a [Nokogiri](http://nokogiri.org) document
136
140
  * `json` - parse the response as JSON, with symbolized keys
137
141
  * `url` - the url of the last request. If the request goes through a redirect, `url` will reflect the final url.
138
142
  * `uri` - the URI of the last request. This is useful for resolving relative URLs.
@@ -169,19 +173,24 @@ noko.css("table")[4].css("td").select { |i| i[:width].to_i > 80 }.map(&:text)
169
173
 
170
174
  ## Changelog
171
175
 
172
- #### 2.0.0 (May 2018)
176
+ #### 2.0.2 (May 2018)
173
177
 
174
- * Complete rewrite. See above.
178
+ * Support for `--limit`, `--proxy` and the `xml` variable
179
+ * Dedup - warn and ignore if row[:url] has already been emitted
180
+ * Auto gunzip if contents are compressed
175
181
 
176
- #### 1.0.3
182
+ #### 2.0.1 (May 2018)
177
183
 
178
- * Friendlier message if curl or tidy are missing.
184
+ * Support for legacy cached `head` files from Sinew 1
185
+
186
+ #### 2.0.0 (May 2018)
187
+
188
+ * Complete rewrite. See above.
179
189
 
180
- #### 1.0.2
190
+ #### 1.0.3 (June 2012)
181
191
 
182
- * Remove entity options from tidy, which didn't work on MacOS (thanks Rex!)
192
+ ...
183
193
 
184
- #### 1.0.1
194
+ ## License
185
195
 
186
- * Trying to run on 1.8 produces a fatal error. Onward!
187
- * Added first batch of unit tests
196
+ This extension is [licensed under the MIT License](LICENSE).
data/bin/sinew CHANGED
@@ -11,11 +11,13 @@ require 'slop'
11
11
 
12
12
  options = Slop.parse do |o|
13
13
  o.banner = 'Usage: sinew [options] <gub.sinew>'
14
- o.bool '-v', '--verbose', 'dump every row'
15
- o.bool '--version', 'show version'
14
+ o.bool '-v', '--verbose', 'dump emitted rows while running'
16
15
  o.bool '-q', '--quiet', 'suppress some output'
17
- o.string '--cache', 'Set the cache directory (defaults to ~/.sinew)', default: "#{ENV['HOME']}/.sinew"
18
- o.on '--help' do
16
+ o.integer '-l', '--limit', 'quit after emitting this many rows'
17
+ o.string '-c', '--cache', 'set custom cache directory', default: "#{ENV['HOME']}/.sinew"
18
+ o.string '--proxy', 'use host[:port] as HTTP proxy'
19
+ o.bool '--version', 'show version and exit'
20
+ o.on('--help', 'show this help') do
19
21
  puts o
20
22
  exit
21
23
  end
@@ -7,6 +7,9 @@ require 'cgi'
7
7
 
8
8
  module Sinew
9
9
  class DSL
10
+ # this is used to break out of --limit
11
+ class LimitError < StandardError; end
12
+
10
13
  attr_reader :sinew, :raw, :uri, :elapsed
11
14
 
12
15
  def initialize(sinew)
@@ -15,8 +18,12 @@ module Sinew
15
18
 
16
19
  def run
17
20
  tm = Time.now
18
- recipe = sinew.options[:recipe]
19
- instance_eval(File.read(recipe, mode: 'rb'), recipe)
21
+ begin
22
+ recipe = sinew.options[:recipe]
23
+ instance_eval(File.read(recipe, mode: 'rb'), recipe)
24
+ rescue LimitError
25
+ # ignore - this is flow control for --limit
26
+ end
20
27
  @elapsed = Time.now - tm
21
28
  end
22
29
 
@@ -46,14 +53,13 @@ module Sinew
46
53
 
47
54
  def http(method, url, options = {})
48
55
  # reset
49
- @html = @noko = @json = @url = nil
56
+ instance_variables.each do |i|
57
+ instance_variable_set(i, nil) if i != :@sinew
58
+ end
50
59
 
51
- # fetch
60
+ # fetch and make response available to callers
52
61
  response = sinew.http(method, url, options)
53
-
54
- # respond
55
- @uri = response.uri
56
- @raw = response.body
62
+ @uri, @raw = response.uri, response.body
57
63
  end
58
64
 
59
65
  #
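The hunk above replaces a hand-maintained list of variables with a loop that clears every instance variable except `@sinew` before each request, so lazily built values such as `@noko`, `@xml` or `@json` cannot leak from one response into the next. A minimal sketch of that reset-then-memoize pattern follows; the `Page` class and its methods are hypothetical and only illustrate the idea.

```ruby
# Hypothetical stand-in for a DSL object that memoizes parsed views of a response.
class Page
  # Store a new raw body, first clearing anything derived from the old one.
  def load(raw)
    instance_variables.each { |name| instance_variable_set(name, nil) }
    @raw = raw
  end

  # Lazily computed and cached until the next call to load.
  def words
    @words ||= @raw.split
  end
end

page = Page.new
page.load('first response')
page.words
page.load('second')
p page.words # prints ["second"]
```

Clearing by iteration means a newly added memoized accessor is reset automatically, at the cost of needing an explicit exception for state that must survive across requests.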
@@ -75,6 +81,10 @@ module Sinew
75
81
  @noko ||= Nokogiri::HTML(html)
76
82
  end
77
83
 
84
+ def xml
85
+ @xml ||= Nokogiri::XML(html)
86
+ end
87
+
78
88
  def json
79
89
  @json ||= JSON.parse(raw, symbolize_names: true)
80
90
  end
@@ -93,6 +103,9 @@ module Sinew
93
103
 
94
104
  def csv_emit(row)
95
105
  sinew.output.emit(row)
106
+ if sinew.output.count == sinew.options[:limit]
107
+ raise LimitError.new
108
+ end
96
109
  end
97
110
  end
98
111
  end
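The `--limit` change uses an exception as flow control: `csv_emit` raises once the row count reaches the limit, and `run` rescues it so the recipe simply stops. The sketch below reproduces that shape in isolation. `TinyEmitter` is a hypothetical class, not part of Sinew, and it counts rows itself rather than asking an output object.

```ruby
# Raised to unwind out of the recipe once enough rows were emitted.
class LimitError < StandardError; end

# Hypothetical emitter; a nil limit means "no limit".
class TinyEmitter
  attr_reader :rows

  def initialize(limit = nil)
    @limit = limit
    @rows = []
  end

  # Record the row, then abort the run if the limit has been reached.
  def emit(row)
    @rows << row
    raise LimitError if @rows.size == @limit
  end

  # Run recipe code, treating LimitError as a normal way to finish.
  def run
    yield self
  rescue LimitError
    # expected: stop quietly
  end
end

emitter = TinyEmitter.new(3)
emitter.run { |e| (1..5).each { |i| e.emit(i: i) } }
p emitter.rows.size # prints 3
```

Because the check uses `==`, a missing limit (`nil`) never matches a row count, which is the same reason the diff needs no separate "is a limit set?" test.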
@@ -15,6 +15,12 @@ module Sinew
15
15
  @runtime_options = RuntimeOptions.new
16
16
  @request_tm = Time.at(0)
17
17
  @request_count = 0
18
+
19
+ if options[:proxy]
20
+ addr, port = options[:proxy].split(':')
21
+ runtime_options.httparty_options[:http_proxyaddr] = addr
22
+ runtime_options.httparty_options[:http_proxyport] = port || 80
23
+ end
18
24
  end
19
25
 
20
26
  def run
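The proxy hunk splits a `host[:port]` string and stores the pieces under the `http_proxyaddr` and `http_proxyport` keys. Note that in the diff a supplied port stays a String while the default is the Integer `80`. The sketch below is a hypothetical helper (`parse_proxy` is not a Sinew method) that makes the port an Integer in both cases; whether that conversion is needed depends on how the HTTP client treats the value.

```ruby
# Hypothetical helper: turn "host[:port]" into proxy options.
# Assumption: the caller wants an Integer port; the diff itself keeps the String.
def parse_proxy(value, default_port = 80)
  addr, port = value.split(':', 2)
  {
    http_proxyaddr: addr,
    http_proxyport: port ? Integer(port, 10) : default_port,
  }
end

p parse_proxy('proxy.example.com:3128')
p parse_proxy('localhost')
```

`Integer(port, 10)` raises `ArgumentError` on input such as `"host:abc"`, which surfaces a bad `--proxy` value early instead of at request time.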
@@ -1,4 +1,5 @@
1
1
  require 'csv'
2
+ require 'set'
2
3
  require 'stringex'
3
4
 
4
5
  #
@@ -7,11 +8,12 @@ require 'stringex'
7
8
 
8
9
  module Sinew
9
10
  class Output
10
- attr_reader :sinew, :columns, :rows, :csv
11
+ attr_reader :sinew, :columns, :rows, :urls, :csv
11
12
 
12
13
  def initialize(sinew)
13
14
  @sinew = sinew
14
15
  @rows = []
16
+ @urls = Set.new
15
17
  end
16
18
 
17
19
  def filename
@@ -41,6 +43,8 @@ module Sinew
41
43
  # implicit header if necessary
42
44
  header(row.keys) if !csv
43
45
 
46
+ # don't allow duplicate urls
47
+ return if dup_url?(row)
44
48
  rows << row.dup
45
49
 
46
50
  # map columns to row, and normalize along the way
@@ -94,6 +98,9 @@ module Sinew
94
98
  s.to_s
95
99
  end
96
100
 
101
+ # strip html tags. Note that we replace tags with spaces
102
+ s = s.gsub(/<[^>]+>/, ' ')
103
+
97
104
  #
98
105
  # Below uses stringex
99
106
  #
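The new `gsub` swaps each tag for a space instead of deleting it, so text on either side of a tag such as `<br>` no longer runs together. A small sketch of the effect is below. `strip_tags` is a hypothetical helper, and the trailing `squeeze`/`strip` is an assumption standing in for whatever whitespace cleanup the rest of `normalize` performs.

```ruby
# Hypothetical helper: replace tags with spaces, then tidy the whitespace.
def strip_tags(text)
  text.gsub(/<[^>]+>/, ' ').squeeze(' ').strip
end

p strip_tags('<tag>hello<br>world</tag>') # prints "hello world"
p strip_tags('<a>b</a>')                  # prints "b"
```

The pattern `<[^>]+>` is a deliberately simple match; it is not an HTML parser and will misfire on a literal `<` or `>` inside text.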
@@ -101,9 +108,6 @@ module Sinew
101
108
  # github.com/rsl/stringex/blob/master/lib/stringex/localization/conversion_expressions.rb
102
109
  #
103
110
 
104
- # <a>b</a> => b
105
- s = s.strip_html_tags
106
-
107
111
  # Converts MS Word 'smart punctuation' to ASCII
108
112
  s = s.convert_smart_punctuation
109
113
 
@@ -122,5 +126,17 @@ module Sinew
122
126
  s
123
127
  end
124
128
  protected :normalize
129
+
130
+ def dup_url?(row)
131
+ if url = row[:url]
132
+ if urls.include?(url)
133
+ sinew.warning("duplicate url: #{url}") if !sinew.quiet?
134
+ return true
135
+ end
136
+ urls << url
137
+ end
138
+ false
139
+ end
140
+ protected :dup_url?
125
141
  end
126
142
  end
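`dup_url?` keeps a `Set` of every `:url` already emitted and reports a repeat as a duplicate, while rows without a `:url` are never deduplicated. The sketch below shows the same check in a standalone form. `UrlDeduper` is a hypothetical class, and it relies on `Set#add?`, which returns `nil` when the element was already present.

```ruby
require 'set'

# Hypothetical stand-in for the dedup bookkeeping in the output class.
class UrlDeduper
  def initialize
    @seen = Set.new
  end

  # True when this row's :url was seen before; rows without :url always pass.
  def duplicate?(row)
    url = row[:url]
    return false unless url

    !@seen.add?(url)
  end
end

deduper = UrlDeduper.new
rows = Array.new(5) { { url: 'https://gub' } }
kept = rows.reject { |row| deduper.duplicate?(row) }
p kept.size # prints 1
```

This matches the `dups.sinew` recipe in the diff, where five identical rows produce a single line of output.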
@@ -28,7 +28,10 @@ module Sinew
28
28
  def perform
29
29
  validate!
30
30
 
31
- # merge global/options headers
31
+ # merge options
32
+ options = self.options.merge(sinew.runtime_options.httparty_options)
33
+
34
+ # merge headers
32
35
  headers = sinew.runtime_options.headers
33
36
  headers = headers.merge(options[:headers]) if options[:headers]
34
37
  options[:headers] = headers
@@ -1,3 +1,6 @@
1
+ require 'stringio'
2
+ require 'zlib'
3
+
1
4
  #
2
5
  # An HTTP response. Mostly a wrapper around HTTParty.
3
6
  #
@@ -16,13 +19,7 @@ module Sinew
16
19
  response.uri = party_response.request.last_uri
17
20
  response.code = party_response.code
18
21
  response.headers = party_response.headers.to_h
19
-
20
- # force to utf-8 as best we can
21
- body = party_response.body
22
- if body.encoding != Encoding::UTF_8
23
- body = body.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
24
- end
25
- response.body = body
22
+ response.body = process_body(party_response)
26
23
  end
27
24
  end
28
25
 
@@ -60,21 +57,43 @@ module Sinew
60
57
  end
61
58
 
62
59
  def self.from_legacy_head(response, head)
63
- response.tap do |response|
60
+ response.tap do |r|
64
61
  case head
65
62
  when /\ACURLER_ERROR/
66
63
  # error
67
- response.code = 999
64
+ r.code = 999
68
65
  when /\AHTTP/
69
66
  # redirect
70
67
  location = head.scan(/Location: ([^\r\n]+)/).flatten.last
71
- response.uri += location
68
+ r.uri += location
72
69
  else
73
- $stderr.puts "unknown cached /head for #{response.uri}"
70
+ $stderr.puts "unknown cached /head for #{r.uri}"
74
71
  end
75
72
  end
76
73
  end
77
74
 
75
+ # helper for decoding bodies before parsing
76
+ def self.process_body(response)
77
+ body = response.body
78
+
79
+ # inflate if necessary
80
+ bits = body[0, 10].force_encoding('BINARY')
81
+ if bits =~ /\A\x1f\x8b/n
82
+ body = Zlib::GzipReader.new(StringIO.new(body)).read
83
+ end
84
+
85
+ # force to utf-8 if we think this could be text
86
+ if body.encoding != Encoding::UTF_8
87
+ if content_type = response.headers['content-type']
88
+ if content_type =~ /\b(html|javascript|json|text|xml)\b/
89
+ body = body.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
90
+ end
91
+ end
92
+ end
93
+
94
+ body
95
+ end
96
+
78
97
  #
79
98
  # accessors
80
99
  #
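`process_body` decides whether to inflate by looking for the gzip magic bytes `0x1f 0x8b` at the start of the body rather than trusting a header. The sketch below isolates that check. `maybe_gunzip` is a hypothetical helper, and it compares raw bytes with `getbyte` instead of the regexp used in the diff, which sidesteps string-encoding questions.

```ruby
require 'stringio'
require 'zlib'

# Hypothetical helper: inflate only if the body starts with the gzip magic bytes.
def maybe_gunzip(body)
  return body unless body.getbyte(0) == 0x1f && body.getbyte(1) == 0x8b

  Zlib::GzipReader.new(StringIO.new(body)).read
end

compressed = Zlib.gzip("gub\n")
p maybe_gunzip(compressed) == "gub\n" # prints true
p maybe_gunzip('plain text')          # prints "plain text"
```

Sniffing the content this way also handles cached files that were stored compressed and have no headers to consult, at the small risk of misreading binary data that happens to begin with the same two bytes.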
@@ -7,6 +7,7 @@ module Sinew
7
7
  attr_accessor :retries
8
8
  attr_accessor :rate_limit
9
9
  attr_accessor :headers
10
+ attr_accessor :httparty_options
10
11
  attr_accessor :before_generate_cache_key
11
12
 
12
13
  def initialize
@@ -15,6 +16,7 @@ module Sinew
15
16
  self.headers = {
16
17
  'User-Agent' => "sinew/#{VERSION}",
17
18
  }
19
+ self.httparty_options = {}
18
20
  self.before_generate_cache_key = ->(i) { i }
19
21
 
20
22
  # for testing
@@ -1,4 +1,4 @@
1
1
  module Sinew
2
2
  # Gem version
3
- VERSION = '2.0.1'.freeze
3
+ VERSION = '2.0.2'.freeze
4
4
  end
@@ -0,0 +1,6 @@
1
+ csv_header(%i[n a p])
2
+ csv_emit(n: 'n1', a: 'a1')
3
+
4
+ # OUTPUT
5
+ # n,a,p
6
+ # n1,a1,""
@@ -0,0 +1,8 @@
1
+ get 'http://httpbin.org/html'
2
+ raw.scan(/<h1>([^<]+)/) do
3
+ csv_emit(h1: $1)
4
+ end
5
+
6
+ # OUTPUT
7
+ # h1
8
+ # Herman Melville - Moby-Dick
@@ -0,0 +1,7 @@
1
+ 5.times do
2
+ csv_emit(url: 'https://gub')
3
+ end
4
+
5
+ # OUTPUT
6
+ # url
7
+ # https://gub
@@ -0,0 +1,5 @@
1
+ csv_emit(name: 'bob', address: 'main')
2
+
3
+ # OUTPUT
4
+ # name,address
5
+ # bob,main
@@ -0,0 +1,11 @@
1
+ # OPTIONS { limit: 3 }
2
+
3
+ (1..5).each do |i|
4
+ csv_emit(i: i)
5
+ end
6
+
7
+ # OUTPUT
8
+ # i
9
+ # 1
10
+ # 2
11
+ # 3
@@ -0,0 +1,9 @@
1
+ get 'http://httpbin.org/xml'
2
+ noko.css('slide title').each do |title|
3
+ csv_emit(title: title.text)
4
+ end
5
+
6
+ # OUTPUT
7
+ # title
8
+ # Wake up to WonderWidgets!
9
+ # Overview
@@ -0,0 +1,11 @@
1
+ # This tests get by URI, URI math, and csv_emit with uri
2
+ get(URI.parse('http://httpbin.org/html'))
3
+ csv_emit(url: uri)
4
+
5
+ get(uri + '../get')
6
+ csv_emit(url: uri)
7
+
8
+ # OUTPUT
9
+ # url
10
+ # http://httpbin.org/html
11
+ # http://httpbin.org/get
@@ -0,0 +1,8 @@
1
+ get 'http://httpbin.org/html'
2
+ noko.css('h1').each do |h1|
3
+ csv_emit(h1: h1.text)
4
+ end
5
+
6
+ # OUTPUT
7
+ # h1
8
+ # Herman Melville - Moby-Dick
@@ -12,8 +12,6 @@ require 'sinew'
12
12
 
13
13
  class MiniTest::Test
14
14
  TMP = '/tmp/_test_sinew'.freeze
15
- RECIPE = "#{TMP}/test.sinew".freeze
16
- CSV = "#{TMP}/test.csv".freeze
17
15
  HTML = File.read("#{__dir__}/test.html")
18
16
 
19
17
  def setup
@@ -27,16 +25,10 @@ class MiniTest::Test
27
25
  end
28
26
 
29
27
  def sinew
30
- @sinew ||= Sinew::Main.new(cache: TMP, quiet: true, recipe: RECIPE)
28
+ @sinew ||= Sinew::Main.new(cache: TMP, quiet: true, recipe: "#{TMP}/ignore.sinew")
31
29
  end
32
30
  protected :sinew
33
31
 
34
- def run_recipe(recipe)
35
- File.write(RECIPE, recipe)
36
- sinew.run
37
- end
38
- protected :run_recipe
39
-
40
32
  def test_network?
41
33
  !!ENV['SINEW_TEST_NETWORK']
42
34
  end
@@ -50,6 +42,7 @@ class MiniTest::Test
50
42
  stub_request(:get, %r{http://[^/]+/status/\d+}).to_return(method(:respond_status))
51
43
  stub_request(:get, %r{http://[^/]+/(relative-)?redirect/\d+}).to_return(method(:respond_redirect))
52
44
  stub_request(:get, %r{http://[^/]+/delay/\d+}).to_timeout
45
+ stub_request(:get, %r{http://[^/]+/xml}).to_return(method(:respond_xml))
53
46
  end
54
47
  protected :stub_network
55
48
 
@@ -58,7 +51,7 @@ class MiniTest::Test
58
51
  #
59
52
 
60
53
  def respond_html(_request)
61
- # this html was carefully chosen to match httpbin.org/html
54
+ # this html was carefully chosen to somewhat match httpbin.org/html
62
55
  html = <<~EOF
63
56
  <body>
64
57
  <h1>Herman Melville - Moby-Dick</h1>
@@ -68,20 +61,37 @@ class MiniTest::Test
68
61
  end
69
62
  protected :respond_html
70
63
 
64
+ def respond_xml(_request)
65
+ # this xml was carefully chosen to somewhat match httpbin.org/xml
66
+ xml = <<~EOF
67
+ <!-- A SAMPLE set of slides -->
68
+ <slideshow>
69
+ <slide type="all">
70
+ <title>Wake up to WonderWidgets!</title>
71
+ </slide>
72
+ <slide type="all">
73
+ <title>Overview</title>
74
+ </slide>
75
+ </slideshow>
76
+ EOF
77
+ { body: xml }
78
+ end
79
+ protected :respond_xml
80
+
71
81
  def respond_echo(request)
72
82
  response = {}
73
83
  response[:headers] = request.headers
74
84
 
75
85
  # args
76
86
  response[:args] = if request.uri.query
77
- CGI.parse(request.uri.query).map { |k, v| [k, v.first] }.to_h
87
+ CGI.parse(request.uri.query).map { |k, v| [ k, v.first ] }.to_h
78
88
  else
79
89
  {}
80
90
  end
81
91
 
82
92
  # form
83
93
  if request.headers['Content-Type'] == 'application/x-www-form-urlencoded'
84
- response[:form] = CGI.parse(request.body).map { |k, v| [k, v.first] }.to_h
94
+ response[:form] = CGI.parse(request.body).map { |k, v| [ k, v.first ] }.to_h
85
95
  end
86
96
 
87
97
  # json
@@ -12,8 +12,10 @@ class TestLegacy < MiniTest::Test
12
12
  end
13
13
 
14
14
  def test_legacy
15
- sinew.dsl.get('http://eu.httpbin.org/status/500')
16
- assert_equal "\n", sinew.dsl.raw
15
+ assert_output(/failed with 999/) do
16
+ sinew.dsl.get('http://eu.httpbin.org/status/500')
17
+ assert_equal "\n", sinew.dsl.raw
18
+ end
17
19
 
18
20
  sinew.dsl.get('http://eu.httpbin.org/redirect/3')
19
21
  assert_equal 'http://eu.httpbin.org/get', sinew.dsl.url
@@ -1,26 +1,8 @@
1
1
  require_relative 'test_helper'
2
2
 
3
- class TestMain < MiniTest::Test
4
- def test_noko
5
- run_recipe <<~'EOF'
6
- get 'http://httpbin.org/html'
7
- noko.css("h1").each do |h1|
8
- csv_emit(h1: h1.text)
9
- end
10
- EOF
11
- assert_equal("h1\nHerman Melville - Moby-Dick\n", File.read(CSV))
12
- end
13
-
14
- def test_raw
15
- run_recipe <<~'EOF'
16
- get "http://httpbin.org/html"
17
- raw.scan(/<h1>([^<]+)/) do
18
- csv_emit(h1: $1)
19
- end
20
- EOF
21
- assert_equal("h1\nHerman Melville - Moby-Dick\n", File.read(CSV))
22
- end
3
+ require 'base64'
23
4
 
5
+ class TestMain < MiniTest::Test
24
6
  def test_rate_limit
25
7
  # true network requests call sleep for timeouts, which interferes with our
26
8
  # instrumentation of Kernel#sleep
@@ -43,4 +25,10 @@ class TestMain < MiniTest::Test
43
25
  Kernel.send(:alias_method, :sleep, :old_sleep)
44
26
  Kernel.send(:undef_method, :old_sleep)
45
27
  end
28
+
29
+ def test_gunzip
30
+ body = Base64.decode64('H4sICBRI61oAA2d1Yi50eHQASy9N4gIAJlqRYgQAAAA=')
31
+ body = Sinew::Response.process_body(OpenStruct.new(body: body))
32
+ assert_equal 'gub', body.strip
33
+ end
46
34
  end
@@ -1,25 +1,6 @@
1
1
  require_relative 'test_helper'
2
2
 
3
3
  class TestOutput < MiniTest::Test
4
- def test_output
5
- sinew.dsl.csv_header(:n, :a, :p)
6
- sinew.dsl.csv_emit(n: 'n1', a: 'a1')
7
- sinew.dsl.csv_emit(n: 'n2', a: 'a2')
8
- assert_equal 2, sinew.output.count
9
- assert_equal "n,a,p\nn1,a1,\"\"\nn2,a2,\"\"\n", File.read(CSV)
10
- end
11
-
12
- def test_implicit_header
13
- sinew.dsl.csv_emit(name: 'bob', address: 'main')
14
- assert_equal "name,address\nbob,main\n", File.read(CSV)
15
- end
16
-
17
- def test_array_header
18
- sinew.dsl.csv_header(%i[n a p])
19
- sinew.dsl.csv_emit(n: 'n1', a: 'a1')
20
- assert_equal "n,a,p\nn1,a1,\"\"\n", File.read(CSV)
21
- end
22
-
23
4
  def test_filenames
24
5
  sinew = Sinew::Main.new(recipe: 'gub.sinew')
25
6
  assert_equal 'gub.csv', sinew.output.filename
@@ -59,6 +40,8 @@ class TestOutput < MiniTest::Test
59
40
 
60
41
  # strip_html_tags
61
42
  assert_equal('gub', output.send(:normalize, '<tag>gub</tag>'))
43
+ # strip_html_tags and replace with spaces
44
+ assert_equal('hello world', output.send(:normalize, '<tag>hello<br>world</tag>'))
62
45
  # convert_smart_punctuation
63
46
  assert_equal('"gub"', output.send(:normalize, "\302\223gub\302\224"))
64
47
  # convert_accented_html_entities
@@ -0,0 +1,60 @@
1
+ require_relative 'test_helper'
2
+
3
+ class TestRecipe < MiniTest::Test
4
+ DIR = File.expand_path('recipes', __dir__)
5
+ TEST_SINEW = "#{TMP}/test.sinew".freeze
6
+ TEST_CSV = "#{TMP}/test.csv".freeze
7
+
8
+ def test_recipes
9
+ Dir.chdir(DIR) do
10
+ Dir['*.sinew'].sort.each do |filename|
11
+ recipe = IO.read(filename)
12
+
13
+ # get ready
14
+ IO.write(TEST_SINEW, recipe)
15
+ sinew = Sinew::Main.new(cache: TMP, quiet: true, recipe: TEST_SINEW)
16
+
17
+ # read OPTIONS
18
+ if options = options_from(recipe)
19
+ options.each do |key, value|
20
+ sinew.options[key] = value
21
+ end
22
+ end
23
+
24
+ # read OUTPUT
25
+ output = output_from(recipe, filename)
26
+
27
+ # run
28
+ sinew.run
29
+
30
+ # assert
31
+ csv = IO.read(TEST_CSV)
32
+ assert_equal(output, csv, "Output didn't match for recipes/#{filename}")
33
+ end
34
+ end
35
+ end
36
+
37
+ def options_from(recipe)
38
+ if options = recipe[/^#\s*OPTIONS\s*(\{.*\})/, 1]
39
+ # rubocop:disable Security/Eval
40
+ eval(options)
41
+ # rubocop:enable Security/Eval
42
+ end
43
+ end
44
+ protected :options_from
45
+
46
+ def output_from(recipe, filename)
47
+ lines = recipe.split("\n")
48
+ first_line = lines.index { |i| i =~ /^# OUTPUT/ }
49
+ if !first_line
50
+ raise "# OUTPUT not found in recipes/#{filename}"
51
+ end
52
+
53
+ output = lines[first_line + 1..-1]
54
+ output = output.map { |i| i.gsub(/^# /, '') }
55
+ output = output.join("\n")
56
+ output += "\n"
57
+ output
58
+ end
59
+ protected :output_from
60
+ end
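The recipe tests are self-describing: each `.sinew` file ends with a `# OUTPUT` comment followed by the expected CSV, one commented line per row. The sketch below shows one way to pull that expectation out of a recipe string. `expected_output` is a hypothetical helper modelled on `output_from`, not a copy of it.

```ruby
# Hypothetical helper: return the CSV text that follows the "# OUTPUT" marker.
def expected_output(recipe)
  lines = recipe.lines(chomp: true)
  marker = lines.index { |line| line.start_with?('# OUTPUT') }
  raise ArgumentError, '# OUTPUT marker not found' unless marker

  # Drop the marker itself, strip the "# " comment prefix, restore the final newline.
  lines.drop(marker + 1).map { |line| line.sub(/\A# ?/, '') }.join("\n") + "\n"
end

recipe = <<~RECIPE
  csv_emit(name: 'bob', address: 'main')

  # OUTPUT
  # name,address
  # bob,main
RECIPE

print expected_output(recipe)
```

Keeping the expectation inside the recipe means adding a test case is a matter of dropping one more file into `test/recipes`, with no Ruby test method to write.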
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: sinew
3
3
  version: !ruby/object:Gem::Version
4
- version: 2.0.1
4
+ version: 2.0.2
5
5
  platform: ruby
6
6
  authors:
7
7
  - Adam Doppelt
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2018-05-02 00:00:00.000000000 Z
11
+ date: 2018-05-03 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: awesome_print
@@ -186,6 +186,14 @@ files:
186
186
  - test/legacy/eu.httpbin.org/redirect,3
187
187
  - test/legacy/eu.httpbin.org/status,500
188
188
  - test/legacy/legacy.sinew
189
+ - test/recipes/array_header.sinew
190
+ - test/recipes/basic.sinew
191
+ - test/recipes/dups.sinew
192
+ - test/recipes/implicit_header.sinew
193
+ - test/recipes/limit.sinew
194
+ - test/recipes/noko.sinew
195
+ - test/recipes/uri.sinew
196
+ - test/recipes/xml.sinew
189
197
  - test/test.html
190
198
  - test/test_cache.rb
191
199
  - test/test_helper.rb
@@ -193,6 +201,7 @@ files:
193
201
  - test/test_main.rb
194
202
  - test/test_nokogiri_ext.rb
195
203
  - test/test_output.rb
204
+ - test/test_recipes.rb
196
205
  - test/test_requests.rb
197
206
  - test/test_utf8.rb
198
207
  homepage: http://github.com/gurgeous/sinew
@@ -225,6 +234,14 @@ test_files:
225
234
  - test/legacy/eu.httpbin.org/redirect,3
226
235
  - test/legacy/eu.httpbin.org/status,500
227
236
  - test/legacy/legacy.sinew
237
+ - test/recipes/array_header.sinew
238
+ - test/recipes/basic.sinew
239
+ - test/recipes/dups.sinew
240
+ - test/recipes/implicit_header.sinew
241
+ - test/recipes/limit.sinew
242
+ - test/recipes/noko.sinew
243
+ - test/recipes/uri.sinew
244
+ - test/recipes/xml.sinew
228
245
  - test/test.html
229
246
  - test/test_cache.rb
230
247
  - test/test_helper.rb
@@ -232,5 +249,6 @@ test_files:
232
249
  - test/test_main.rb
233
250
  - test/test_nokogiri_ext.rb
234
251
  - test/test_output.rb
252
+ - test/test_recipes.rb
235
253
  - test/test_requests.rb
236
254
  - test/test_utf8.rb