jakal 0.1.1 → 0.1.2
- data/{README.rdoc → README.md} +35 -4
- data/features/calais.feature +3 -36
- data/features/http.feature +4 -10
- data/features/sanitize-text.feature +63 -45
- data/features/step_definitions/calais_steps.rb +1 -41
- data/features/step_definitions/http_steps.rb +1 -18
- data/features/step_definitions/sanitize-text_steps.rb +62 -15
- data/features/support/env.rb +9 -0
- data/lib/jkl.rb +18 -4
- data/lib/jkl/calais_client.rb +5 -57
- data/lib/jkl/rest_client.rb +6 -2
- data/lib/jkl/rss_client.rb +1 -1
- data/lib/jkl/text_client.rb +38 -0
- metadata +6 -7
- data/features/step_definitions/require_steps.rb +0 -12
- data/lib/jkl/url_doc_handler.rb +0 -35
data/{README.rdoc → README.md}
RENAMED
@@ -1,12 +1,43 @@
-
+# jkl
 
-
+jkl (Jakal) does these things:
 
-
+* Connects to URLs.
+* Gets stuff out of RSS feeds.
+* Gets the main content from web pages
+* Gets a set of metadata from a web page (using the calais gem)
+
+# Sample usage
+
+For example - if you had a RSS feed:
+
+require "jkl"
+
+feed = "http://www.topix.net/rss/search/article?x=0&y=0&q=London"
+
+You could collect some metadata from the links in that feed, thus:
+
+tags = []
+Jkl::links(feed).each do |link|
+tags << Jkl::tags("my_calais_key",link)
+end
+
+A metadata sample might look something like this:
+
+{
+"Person"=>["Barack Obama", "Hillary Clinton"],
+"Position"=>["Secretary of State"]
+}
+
+It is hosted at [gemcutter](http://gemcutter.org/gems/jakal)
+
+gem install jakal
+
+# LICENSE:
 
 (The MIT License)
 
-Copyright (c) 2009
+Copyright (c) 2009 sshingler
 
 Permission is hereby granted, free of charge, to any person obtaining
 a copy of this software and associated documentation files (the
data/features/calais.feature
CHANGED
@@ -3,41 +3,8 @@ Feature: Calais-Specific features
|
|
3
3
|
As a developer
|
4
4
|
I want to make some requests and inspect some responses
|
5
5
|
|
6
|
-
@
|
7
|
-
Scenario:
|
8
|
-
|
9
|
-
When I post to calais
|
10
|
-
Then I should get a response
|
11
|
-
And I should receive some tags
|
12
|
-
|
13
|
-
@connection_needed
|
14
|
-
Scenario: Post a mock story to calais, inspect the response
|
15
|
-
Given I have a sanitized sample BBC story
|
16
|
-
When I post to calais
|
17
|
-
Then I should get a response
|
18
|
-
And I should receive some tags
|
19
|
-
|
20
|
-
@connection_needed
|
21
|
-
Scenario: Get nested tags from calais
|
22
|
-
Given I have some simple text
|
6
|
+
@live
|
7
|
+
Scenario: Get nested tags from calais
|
8
|
+
Given I have some text
|
23
9
|
When I request the nested entities from calais
|
24
10
|
Then I should receive the entities grouped into categories
|
25
|
-
|
26
|
-
Scenario: Clean up blank items from a calais response
|
27
|
-
Given I have a mock calais response
|
28
|
-
When I remove the unwanted items
|
29
|
-
Then I should receive some tags
|
30
|
-
And there should no longer be any "instances"
|
31
|
-
And there should no longer be any "relevance"
|
32
|
-
And there should no longer be any "blank"
|
33
|
-
And there should no longer be any "not_available"
|
34
|
-
|
35
|
-
Scenario: Go through the calais response tags in a bit more detail
|
36
|
-
Given I have a mock calais response
|
37
|
-
When I remove the unwanted items
|
38
|
-
Then I should receive some tags
|
39
|
-
And there should be some "Organization" tags
|
40
|
-
|
41
|
-
Scenario: Go through the calais response tags as a single array
|
42
|
-
Given I have a mock calais response
|
43
|
-
Then I should be able to see the whole lot of tags as one block
|
data/features/http.feature
CHANGED
@@ -3,24 +3,18 @@ Feature: http features
|
|
3
3
|
As a developer
|
4
4
|
I want to make some requests and inspect some responses
|
5
5
|
|
6
|
-
@
|
6
|
+
@live
|
7
7
|
Scenario: Make a restful post to yahoo
|
8
8
|
When I post some data to yahoo
|
9
9
|
Then I should get a response
|
10
10
|
|
11
|
-
@
|
11
|
+
@live
|
12
12
|
Scenario: Make a restful get
|
13
13
|
When I make a restful get request
|
14
14
|
Then I should get a response
|
15
15
|
And I should see some text
|
16
16
|
|
17
|
-
@
|
17
|
+
@live
|
18
18
|
Scenario: Get some trends
|
19
|
-
When I request some trends
|
19
|
+
When I request some twitter trends
|
20
20
|
Then I should get a response
|
21
|
-
|
22
|
-
@connection_needed
|
23
|
-
Scenario: Get some RSS
|
24
|
-
When I request some RSS
|
25
|
-
Then I should get a response
|
26
|
-
And I should receive some headlines
|
data/features/sanitize-text.feature
CHANGED
@@ -3,51 +3,69 @@ Feature: Processing features
|
|
3
3
|
As a developer
|
4
4
|
I want to make some requests and inspect some responses
|
5
5
|
|
6
|
-
|
7
|
-
|
8
|
-
|
9
|
-
|
10
|
-
|
11
|
-
|
12
|
-
|
13
|
-
|
14
|
-
Scenario: Sanitize some short text
|
15
|
-
Given I have a keyphrase 'the cat sat'
|
16
|
-
When I sanitize this text
|
17
|
-
Then it should say ''
|
18
|
-
|
19
|
-
@unit @text @wip
|
20
|
-
Scenario: Sanitize some text with tabs and spaces
|
21
|
-
Given I have a keyphrase 'the cat sat on the mat '
|
22
|
-
When I sanitize this text
|
23
|
-
Then it should say 'the cat sat on the mat'
|
24
|
-
|
25
|
-
@unit @text @wip
|
26
|
-
Scenario: Sanitize some short text with tabs and spaces
|
27
|
-
Given I have a keyphrase 'the cat sat on '
|
28
|
-
When I sanitize this text
|
29
|
-
Then it should say ''
|
6
|
+
@unit @text
|
7
|
+
Scenario: No changes needed
|
8
|
+
Given I have the text "the cat sat on the mat"
|
9
|
+
When I sanitize this text
|
10
|
+
Then there should be no script tags
|
11
|
+
And there should be no tags
|
12
|
+
And there should be no blank lines
|
13
|
+
And it should say "the cat sat on the mat"
|
30
14
|
|
31
|
-
|
32
|
-
|
33
|
-
|
34
|
-
|
35
|
-
|
15
|
+
@unit @text
|
16
|
+
Scenario: Remove simple tags
|
17
|
+
Given I have the text "<a href=\"a-link.html\">the cat sat on the mat</a>"
|
18
|
+
When I sanitize this text
|
19
|
+
Then there should be no script tags
|
20
|
+
And there should be no tags
|
21
|
+
And there should be no blank lines
|
22
|
+
Then it should say "the cat sat on the mat"
|
36
23
|
|
37
|
-
|
38
|
-
|
39
|
-
|
40
|
-
|
41
|
-
|
42
|
-
|
43
|
-
|
44
|
-
|
45
|
-
Scenario: Remove script tags
|
46
|
-
Given I have some script tag data
|
47
|
-
When I sanitize this text
|
48
|
-
Then it should say ' some para stuff here '
|
24
|
+
@unit @text @wip
|
25
|
+
Scenario: Remove script tags
|
26
|
+
Given I have some script tag data
|
27
|
+
When I sanitize this text
|
28
|
+
Then there should be no script tags
|
29
|
+
And there should be no tags
|
30
|
+
And there should be no blank lines
|
31
|
+
Then it should say "the cat sat on the mat"
|
49
32
|
|
50
|
-
|
51
|
-
|
52
|
-
|
53
|
-
|
33
|
+
@mock
|
34
|
+
Scenario: Remove script tags
|
35
|
+
Given I have a sample web page
|
36
|
+
When I remove the script tags
|
37
|
+
Then there should be no script tags
|
38
|
+
|
39
|
+
@mock
|
40
|
+
Scenario: Remove all tags
|
41
|
+
Given I have a sample web page
|
42
|
+
When I remove the script tags
|
43
|
+
And I strip all the tags
|
44
|
+
Then there should be no script tags
|
45
|
+
And there should be no tags
|
46
|
+
|
47
|
+
@mock
|
48
|
+
Scenario: Remove empty lines
|
49
|
+
Given a stripped web page
|
50
|
+
When I remove the blank lines
|
51
|
+
Then there should be no blank lines
|
52
|
+
|
53
|
+
@mock
|
54
|
+
Scenario: Remove a short line
|
55
|
+
Given I have the text "the cat sat on the"
|
56
|
+
When I remove the short lines
|
57
|
+
Then it should say ""
|
58
|
+
|
59
|
+
@mock
|
60
|
+
Scenario: Don't remove a long line
|
61
|
+
Given I have the text "the cat sat on the mat"
|
62
|
+
When I remove the short lines
|
63
|
+
Then it should say "the cat sat on the mat"
|
64
|
+
|
65
|
+
@mock
|
66
|
+
Scenario: Santize a sample BBC page
|
67
|
+
Given I have a sample BBC story
|
68
|
+
When I sanitize this text
|
69
|
+
Then there should be no script tags
|
70
|
+
And there should be no tags
|
71
|
+
And there should be no blank lines
|
data/features/step_definitions/calais_steps.rb
CHANGED
@@ -1,48 +1,8 @@
|
|
1
1
|
|
2
|
-
Given /^I have some
|
2
|
+
Given /^I have some text$/ do
|
3
3
|
@text = "Barack Obama said today that he expects there to be conflict within his new security team after confirming Hillary Clinton as his choice for US Secretary of State."
|
4
4
|
end
|
5
5
|
|
6
|
-
Given /^I have a sanitized sample BBC story$/ do
|
7
|
-
Given "I have a sample BBC story"
|
8
|
-
When "I sanitize this text"
|
9
|
-
end
|
10
|
-
|
11
|
-
Given /^I have a mock calais response$/ do
|
12
|
-
@response = File.open('features/mocks/calais.json','r') {|f| f.readlines.to_s}
|
13
|
-
end
|
14
|
-
|
15
|
-
When /^I post to calais$/ do
|
16
|
-
key = YAML::load_file('config/keys.yml')['calais']
|
17
|
-
@response = Jkl::Extraction::get_from_calais(key, @text)
|
18
|
-
end
|
19
|
-
|
20
|
-
When /^I remove the unwanted items$/ do
|
21
|
-
@processed_json = Jkl::clean_unwanted_items_from_hash(JSON.parse(@response))
|
22
|
-
end
|
23
|
-
|
24
|
-
Then /^there should no longer be any "([^\"]*)"$/ do |arg1|
|
25
|
-
@processed_json[arg1].should be_nil
|
26
|
-
end
|
27
|
-
|
28
|
-
Then /^I should receive some tags$/ do
|
29
|
-
Jkl::get_tag_from_json(@response) do |tag|
|
30
|
-
tag.should_not be_nil
|
31
|
-
end
|
32
|
-
end
|
33
|
-
|
34
|
-
Then /^there should be some "([^\"]*)" tags$/ do |arg1|
|
35
|
-
Jkl::get_tag_from_json(@response) {|tag|
|
36
|
-
#puts tag.inspect
|
37
|
-
tag.each{|k,v| puts "#{k} : #{v}" if k=='_type'}
|
38
|
-
}
|
39
|
-
end
|
40
|
-
|
41
|
-
Then /^I should be able to see the whole lot of tags as one block$/ do
|
42
|
-
tags = Jkl::get_tag_from_json(@response)
|
43
|
-
tags.length.should > 0
|
44
|
-
end
|
45
|
-
|
46
6
|
When /^I request the nested entities from calais$/ do
|
47
7
|
key = YAML::load_file('config/keys.yml')['calais']
|
48
8
|
@response = Jkl::Extraction::tags key, @text
|
data/features/step_definitions/http_steps.rb
CHANGED
@@ -6,12 +6,6 @@ When /^I post some data to yahoo$/ do
|
|
6
6
|
@response = Jkl::post_to @url, post_args
|
7
7
|
end
|
8
8
|
|
9
|
-
When /^I request some RSS$/ do
|
10
|
-
keyphrase = @keyphrase || "iraq"
|
11
|
-
url = "#{YAML::load_file('config/config.yml')['topix']}#{CGI::escape(keyphrase)}"
|
12
|
-
@response = Jkl::get_xml_from url
|
13
|
-
end
|
14
|
-
|
15
9
|
Given /^I have some RSS$/ do
|
16
10
|
raw = File.open('features/mocks/topix_rss.xml','r') {|f| f.readlines.to_s}
|
17
11
|
@response = Hpricot.XML raw
|
@@ -22,7 +16,7 @@ When /^I make a restful get request$/ do
|
|
22
16
|
@response = Jkl::get_from url
|
23
17
|
end
|
24
18
|
|
25
|
-
When /^I request some trends$/ do
|
19
|
+
When /^I request some twitter trends$/ do
|
26
20
|
twitter_json_url = YAML::load_file('config/config.yml')['twitter']
|
27
21
|
output = JSON.parse Jkl::get_from twitter_json_url
|
28
22
|
@response = output['trends']
|
@@ -30,17 +24,6 @@ end
|
|
30
24
|
|
31
25
|
Then /^I should get a response$/ do
|
32
26
|
@response.should_not == nil
|
33
|
-
#puts @response.inspect
|
34
|
-
end
|
35
|
-
|
36
|
-
Then /^I should receive some headlines$/ do
|
37
|
-
@items = Jkl::Rss::items @response
|
38
|
-
@links = []
|
39
|
-
@items.each do |item|
|
40
|
-
@links << Jkl::Rss::attribute_from(item, :link)
|
41
|
-
end
|
42
|
-
@links.should_not == nil
|
43
|
-
@links.length.should > 0
|
44
27
|
end
|
45
28
|
|
46
29
|
Then /^I should be able to get the copy from the first headline$/ do
|
data/features/step_definitions/sanitize-text_steps.rb
CHANGED
@@ -1,4 +1,4 @@
|
|
1
|
-
Given "I have
|
1
|
+
Given "I have the text \"$text\"" do |text|
|
2
2
|
@text = text
|
3
3
|
end
|
4
4
|
|
@@ -6,22 +6,9 @@ Given /^I have a sample BBC story$/ do
|
|
6
6
|
@text = File.open('features/mocks/bbc_story.html','r') {|f| f.readlines.to_s}
|
7
7
|
end
|
8
8
|
|
9
|
-
When /^I sanitize this text$/ do
|
10
|
-
@text = Jkl::sanitize @text
|
11
|
-
end
|
12
|
-
|
13
|
-
Then /^it should be ok$/ do
|
14
|
-
@text.should_not be_nil
|
15
|
-
@text.should_not == ""
|
16
|
-
end
|
17
|
-
|
18
|
-
Then "it should say '$text'" do |text|
|
19
|
-
@text.should == text
|
20
|
-
end
|
21
|
-
|
22
9
|
Given /^I have some script tag data$/ do
|
23
10
|
@text = <<-EOF;
|
24
|
-
|
11
|
+
the cat sat on the mat
|
25
12
|
<script type="text/javascript" charset="utf-8">
|
26
13
|
function nofunction(){var bob;}
|
27
14
|
</script>
|
@@ -30,3 +17,63 @@ Given /^I have some script tag data$/ do
|
|
30
17
|
EOF
|
31
18
|
end
|
32
19
|
|
20
|
+
Given /^I have a sample web page$/ do
|
21
|
+
@text = File.open('features/mocks/sample-web-page.html','r') {|f| f.readlines.to_s}
|
22
|
+
end
|
23
|
+
|
24
|
+
Given /^a stripped web page$/ do
|
25
|
+
Given "I have a sample web page"
|
26
|
+
When "I remove the script tags"
|
27
|
+
And "I strip all the tags"
|
28
|
+
Then "there should be no script tags"
|
29
|
+
And "there should be no tags"
|
30
|
+
end
|
31
|
+
|
32
|
+
When /^I sanitize this text$/ do
|
33
|
+
@text = Jkl::Text::sanitize @text
|
34
|
+
end
|
35
|
+
|
36
|
+
When /^I examine the text$/ do
|
37
|
+
text = Jkl::Text::remove_tabs @text
|
38
|
+
end
|
39
|
+
|
40
|
+
Then "it should say \"$text\"" do |text|
|
41
|
+
@text.to_s.should == text
|
42
|
+
end
|
43
|
+
|
44
|
+
Then /^I can read it$/ do
|
45
|
+
Jkl::Text::document_from(@response).should_not be_nil
|
46
|
+
end
|
47
|
+
|
48
|
+
When /^I remove the script tags$/ do
|
49
|
+
@text = Jkl::Text::remove_script_tags @text
|
50
|
+
end
|
51
|
+
|
52
|
+
When /^I remove the blank lines$/ do
|
53
|
+
@text = Jkl::Text::remove_blank_lines @text
|
54
|
+
end
|
55
|
+
|
56
|
+
When /^I remove the short lines$/ do
|
57
|
+
@text = Jkl::Text::remove_short_lines @text
|
58
|
+
end
|
59
|
+
|
60
|
+
When /^I clean it up$/ do
|
61
|
+
@text = Jkl::Text::remove_short_lines Jkl::Text:: strip_all_tags Jkl::Text::remove_script_tags @text
|
62
|
+
end
|
63
|
+
|
64
|
+
When /^I strip all the tags$/ do
|
65
|
+
@text = Jkl::Text::strip_all_tags @text
|
66
|
+
end
|
67
|
+
|
68
|
+
Then /^there should be no tags$/ do
|
69
|
+
@text.match(/</).should be_nil
|
70
|
+
end
|
71
|
+
|
72
|
+
Then /^there should be no script tags$/ do
|
73
|
+
@text.match(/<script/).should be_nil
|
74
|
+
end
|
75
|
+
|
76
|
+
Then /^there should be no blank lines$/ do
|
77
|
+
@text.match(/\r/).should be_nil
|
78
|
+
@text.match(/\n/).should be_nil
|
79
|
+
end
|
data/features/support/env.rb
CHANGED
@@ -2,6 +2,15 @@ gem 'rack-test'
 
 require 'spec/expectations'
 require 'rack/test'
+require 'hpricot'
+require 'json'
+require 'restclient'
+require 'haml'
+require 'cgi'
+
+require 'lib/jkl.rb'
+
+include Jkl
 
 class MyWorld
   include Rack::Test::Methods
data/lib/jkl.rb
CHANGED
@@ -1,8 +1,22 @@
-require "jkl/rest_client
-require "jkl/rss_client
-require "jkl/calais_client
-require "jkl/
+require "jkl/rest_client"
+require "jkl/rss_client"
+require "jkl/calais_client"
+require "jkl/text_client"
 
 module Jkl
+  class << self
+
+    def links(feed)
+      links = Jkl::Rss::links(Jkl::Rss::items(Jkl::get_xml_from(feed)))
+      links.each do |link|
+        yield link if block_given?
+      end
+    end
+
+    def tags(key, link)
+      text = Jkl::Text::sanitize(Jkl::get_from(link))
+      Jkl::Extraction::tags(key, text)
+    end
 
+  end
 end
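Note on the new `Jkl::links` helper: it yields each link when a block is given, and because `Array#each` returns its receiver it also returns the list of links, which is what allows the README to chain `.each` onto it. The following is a minimal Ruby sketch of that pattern only. `LinksSketch`, `fetch_links` and the example.com URLs are hypothetical stand-ins for the gem's RSS lookup, not part of jakal.

```ruby
# Sketch of the "yield if a block is given, return the list either way" pattern.
# LinksSketch and fetch_links are hypothetical; the real gem reads an RSS feed.
module LinksSketch
  module_function

  # Stand-in for Jkl::Rss::links(Jkl::Rss::items(Jkl::get_xml_from(feed))).
  def fetch_links(_feed)
    ["http://example.com/a", "http://example.com/b"]
  end

  def links(feed)
    found = fetch_links(feed)
    # Array#each returns the array itself, so this is also the return value.
    found.each do |link|
      yield link if block_given?
    end
  end
end

# With a block: every link is yielded once.
seen = []
returned = LinksSketch.links("feed-url") { |link| seen << link }

# Without a block: the same list comes back and can be chained, as in the README.
chained = LinksSketch.links("feed-url").map { |link| link.upcase }
```

In the README usage the `do |link| ... end` block binds to `.each`, not to `links`, so the helper's own `yield` is skipped there and only the returned array is used.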
data/lib/jkl/calais_client.rb
CHANGED
@@ -1,21 +1,17 @@
|
|
1
|
-
require "json"
|
2
1
|
require "calais"
|
3
2
|
|
4
|
-
require "rest_client"
|
5
|
-
|
6
3
|
module Jkl
|
7
4
|
module Extraction
|
8
5
|
class << self
|
9
|
-
|
10
|
-
|
11
|
-
def calais_response(key, pages)
|
6
|
+
|
7
|
+
def calais_response(key, text)
|
12
8
|
Calais.process_document(
|
13
|
-
:content =>
|
9
|
+
:content => text,
|
14
10
|
:content_type => :text,
|
15
11
|
:license_id => key
|
16
12
|
)
|
17
13
|
end
|
18
|
-
|
14
|
+
|
19
15
|
def tags(key, text)
|
20
16
|
nested_list = {}
|
21
17
|
entities(key,text).each do |a|
|
@@ -23,58 +19,10 @@ module Jkl
|
|
23
19
|
end
|
24
20
|
nested_list
|
25
21
|
end
|
26
|
-
|
22
|
+
|
27
23
|
def entities(key,text)
|
28
24
|
calais_response(key, text).entities.map{|e| {e.type => [e.attributes["name"]]}}
|
29
25
|
end
|
30
|
-
|
31
|
-
#not using calais gem, experimenting with json response
|
32
|
-
def get_from_calais(key, content)
|
33
|
-
post_args = {
|
34
|
-
"licenseID" => key,
|
35
|
-
"content" => content,
|
36
|
-
"paramsXML" => paramsXML("application/json")
|
37
|
-
}
|
38
|
-
Jkl::post_to(URI.parse("http://api.opencalais.com/enlighten/rest/"), post_args)
|
39
|
-
end
|
40
|
-
|
41
|
-
def get_tag_from_json(response)
|
42
|
-
result = JSON.parse response
|
43
|
-
result.delete_if {|key, value| key == "doc" } # ditching the doc
|
44
|
-
cleaned_result = []
|
45
|
-
result.each do |key,tag|
|
46
|
-
tag = Jkl::clean_unwanted_items_from_hash tag
|
47
|
-
cleaned_result << tag
|
48
|
-
yield tag if block_given?
|
49
|
-
end
|
50
|
-
cleaned_result
|
51
|
-
end
|
52
|
-
|
53
|
-
def clean_unwanted_items_from_hash h
|
54
|
-
h.delete_if {|k, v| k == "relevance" }
|
55
|
-
h.delete_if {|k, v| k == "instances" }
|
56
|
-
h.delete_if {|k, v| v == "N/A"}
|
57
|
-
h.delete_if {|k, v| v == []}
|
58
|
-
h.delete_if {|k, v| v == ""}
|
59
|
-
h.delete_if {|k, v| k == "_typeGroup"}
|
60
|
-
h
|
61
|
-
end
|
62
|
-
|
63
|
-
private
|
64
|
-
|
65
|
-
def paramsXML(format)
|
66
|
-
<<-paramsXML;
|
67
|
-
<c:params xmlns:c="http://s.opencalais.com/1/pred/"
|
68
|
-
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
|
69
|
-
<c:processingDirectives
|
70
|
-
c:contentType="text/txt"
|
71
|
-
c:outputFormat="#{format}">
|
72
|
-
</c:processingDirectives>
|
73
|
-
<c:userDirectives />
|
74
|
-
<c:externalMetadata />
|
75
|
-
</c:params>
|
76
|
-
paramsXML
|
77
|
-
end
|
78
26
|
|
79
27
|
end
|
80
28
|
end
|
data/lib/jkl/rest_client.rb
CHANGED
@@ -19,8 +19,8 @@ module Jkl
 
     def get_from(uri)
       begin
-
-
+        response = Net::HTTP.get_response(URI.parse(uri))
+        response.body
       rescue URI::InvalidURIError => e
         puts("WARN: Invalid URI: #{e}")
       rescue SocketError => e
@@ -33,6 +33,10 @@ module Jkl
     def get_xml_from(uri)
       Hpricot.XML get_from uri
     end
+
+    def document_from(text)
+      Hpricot(text)
+    end
 
   end
 end
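Note on the rewritten `get_from` body: it fetches the response with `Net::HTTP.get_response` and returns its body, while the rescue clauses print a warning and so fall through to `nil`. A rough standalone sketch of that behaviour follows. `fetch_body` is a hypothetical name, only the two rescue clauses visible in the hunk are reproduced, and the `SocketError` message text is an assumption because the hunk cuts off before it.

```ruby
require "net/http"
require "uri"

# fetch_body is a hypothetical stand-in for the gem's get_from.
def fetch_body(uri)
  response = Net::HTTP.get_response(URI.parse(uri))
  response.body
rescue URI::InvalidURIError => e
  # warn returns nil, so the caller receives nil on failure.
  warn("WARN: Invalid URI: #{e}")
rescue SocketError => e
  # Message text assumed; the original line is not shown in the diff.
  warn("WARN: Could not connect: #{e}")
end

# A URI containing a space fails to parse, so no request is made.
result = fetch_body("http://bad uri")
```

Because a failed fetch yields `nil`, a caller that passes the result straight into a `gsub`-based cleaner would raise `NoMethodError`, so a `nil` guard at the call site may be worth considering.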
data/lib/jkl/rss_client.rb
CHANGED
data/lib/jkl/text_client.rb
ADDED
@@ -0,0 +1,38 @@
+module Jkl
+  module Text
+    class << self
+
+      def sanitize(text)
+        remove_short_lines strip_all_tags remove_script_tags text
+      end
+
+      def strip_all_tags(text)
+        text.gsub(/<\/?[^>]*>/, "")
+      end
+
+      def remove_blank_lines(text)
+        text.gsub(/\n\r|\r\n|\n|\r/, "")
+      end
+
+      def remove_html_comments(text)
+        text.gsub(/<!--(.|\s)*?-->/, "")
+      end
+
+      def remove_script_tags(text)
+        text = remove_html_comments(text)
+        text.gsub(/((<[\s\/]*script\b[^>]*>)([^>]*)(<\/script>))/i, "")
+      end
+
+      def remove_short_lines(text)
+        text = text.gsub(/\s\s/, "\n")
+        str = ""
+        # remove short lines - ususally just navigation
+        text.split("\n").each do |l|
+          str << l unless l.count(" ") < 5
+        end
+        str
+      end
+
+    end
+  end
+end
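Note on the new `Jkl::Text` pipeline: `sanitize` removes HTML comments and script blocks, strips the remaining tags, then drops any line with fewer than five spaces. The sketch below reproduces that chain in a standalone module so it can be run without the gem. `TextSketch` is a hypothetical name, the regexes are copied from the hunk above, and the string accumulator is swapped for `reject`/`join`, which should behave the same way.

```ruby
# Standalone sketch of the sanitize chain; TextSketch is a hypothetical name.
module TextSketch
  module_function

  def remove_html_comments(text)
    text.gsub(/<!--(.|\s)*?-->/, "")
  end

  def remove_script_tags(text)
    text = remove_html_comments(text)
    text.gsub(/((<[\s\/]*script\b[^>]*>)([^>]*)(<\/script>))/i, "")
  end

  def strip_all_tags(text)
    text.gsub(/<\/?[^>]*>/, "")
  end

  # Runs of two whitespace characters become line breaks; lines with
  # fewer than five spaces are treated as navigation and dropped.
  def remove_short_lines(text)
    text.gsub(/\s\s/, "\n").split("\n").reject { |line| line.count(" ") < 5 }.join
  end

  def sanitize(text)
    remove_short_lines(strip_all_tags(remove_script_tags(text)))
  end
end

linked   = TextSketch.sanitize('<a href="a-link.html">the cat sat on the mat</a>')
scripted = TextSketch.sanitize('<script type="text/javascript">function f(){var b;}</script>the cat sat on the mat')
short    = TextSketch.remove_short_lines("the cat sat on the")
```

Here `linked` and `scripted` both come out as `"the cat sat on the mat"` (five spaces, so the line is kept), while `short` is `""` (four spaces), matching the "Remove a short line" scenario. One caveat of the copied pattern: `([^>]*)` cannot span a script body that itself contains `>`, so such a script would lose its tags but keep its body text.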
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: jakal
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.1.
|
4
|
+
version: 0.1.2
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- sshingler
|
@@ -13,21 +13,21 @@ date: 2009-08-27 00:00:00 +01:00
|
|
13
13
|
default_executable:
|
14
14
|
dependencies: []
|
15
15
|
|
16
|
-
description: Jakal is a Ruby library which contains some
|
16
|
+
description: Jakal is a Ruby library which contains some utilities for connecting to internet based APIs.
|
17
17
|
email: "'shingler@gmail.com'"
|
18
18
|
executables: []
|
19
19
|
|
20
20
|
extensions: []
|
21
21
|
|
22
22
|
extra_rdoc_files:
|
23
|
-
- README.
|
23
|
+
- README.md
|
24
24
|
- License.txt
|
25
25
|
files:
|
26
26
|
- lib/jkl.rb
|
27
27
|
- lib/jkl/calais_client.rb
|
28
28
|
- lib/jkl/rest_client.rb
|
29
29
|
- lib/jkl/rss_client.rb
|
30
|
-
- lib/jkl/
|
30
|
+
- lib/jkl/text_client.rb
|
31
31
|
- features/calais.feature
|
32
32
|
- features/http.feature
|
33
33
|
- features/sanitize-text.feature
|
@@ -37,11 +37,10 @@ files:
|
|
37
37
|
- features/mocks/twitter.json
|
38
38
|
- features/step_definitions/calais_steps.rb
|
39
39
|
- features/step_definitions/http_steps.rb
|
40
|
-
- features/step_definitions/require_steps.rb
|
41
40
|
- features/step_definitions/sanitize-text_steps.rb
|
42
41
|
- features/step_definitions/twitter_steps.rb
|
43
42
|
- features/support/env.rb
|
44
|
-
- README.
|
43
|
+
- README.md
|
45
44
|
- License.txt
|
46
45
|
has_rdoc: true
|
47
46
|
homepage: http://github.com/sshingler/jkl
|
@@ -71,6 +70,6 @@ rubyforge_project:
|
|
71
70
|
rubygems_version: 1.3.5
|
72
71
|
signing_key:
|
73
72
|
specification_version: 3
|
74
|
-
summary: Jakal is a Ruby library which contains some
|
73
|
+
summary: Jakal is a Ruby library which contains some utilities for connecting to internet based APIs.
|
75
74
|
test_files: []
|
76
75
|
|
data/features/step_definitions/require_steps.rb
DELETED
@@ -1,12 +0,0 @@
-require 'hpricot'
-require 'json'
-require 'restclient'
-require 'haml'
-require 'cgi'
-require 'lib/jkl.rb'
-require 'lib/jkl/calais_client.rb'
-require 'lib/jkl/rest_client.rb'
-require 'lib/jkl/rss_client.rb'
-require 'lib/jkl/url_doc_handler.rb'
-
-include Jkl
data/lib/jkl/url_doc_handler.rb
DELETED
@@ -1,35 +0,0 @@
-require 'hpricot'
-require 'rest_client'
-
-module Jkl
-
-  class << self
-
-    def sanitize(text)
-      str = ""
-      text = text.to_s.gsub(/((<[\s\/]*script\b[^>]*>)([^>]*)(<\/script>))/i,"") #remove script tags - with contents
-      text.to_s.gsub(/<\/?[^>]*>/, "").split("\r").each do |l| # remove all tags
-        l = l.gsub(/^[ \t]/,"") #remove tabs
-        l = l.gsub(/^[ \s]/,"")
-        l.split("\n").each do |l|
-          str << l unless l.count(" ") < 5 # remove short lines - ususally just navigation
-        end
-      end
-      str
-    end
-
-    def from_doc(response)
-      begin
-        Hpricot(response)
-      rescue URI::InvalidURIError => e
-        puts("WARN: Problem with getting a connection: #{e}")
-      rescue SocketError => e
-        puts("WARN: Could not connect to feed: #{e}")
-      rescue Errno::ECONNREFUSED => e
-        puts("WARN: Connection refused: #{e}")
-      end
-    end
-
-  end
-
-end