jakal 0.1.1 → 0.1.2
- data/{README.rdoc → README.md} +35 -4
- data/features/calais.feature +3 -36
- data/features/http.feature +4 -10
- data/features/sanitize-text.feature +63 -45
- data/features/step_definitions/calais_steps.rb +1 -41
- data/features/step_definitions/http_steps.rb +1 -18
- data/features/step_definitions/sanitize-text_steps.rb +62 -15
- data/features/support/env.rb +9 -0
- data/lib/jkl.rb +18 -4
- data/lib/jkl/calais_client.rb +5 -57
- data/lib/jkl/rest_client.rb +6 -2
- data/lib/jkl/rss_client.rb +1 -1
- data/lib/jkl/text_client.rb +38 -0
- metadata +6 -7
- data/features/step_definitions/require_steps.rb +0 -12
- data/lib/jkl/url_doc_handler.rb +0 -35
data/{README.rdoc → README.md}
RENAMED
@@ -1,12 +1,43 @@
-
+# jkl
 
-
+jkl (Jakal) does these things:
 
-
+* Connects to URLs.
+* Gets stuff out of RSS feeds.
+* Gets the main content from web pages
+* Gets a set of metadata from a web page (using the calais gem)
+
+# Sample usage
+
+For example - if you had a RSS feed:
+
+    require "jkl"
+
+    feed = "http://www.topix.net/rss/search/article?x=0&y=0&q=London"
+
+You could collect some metadata from the links in that feed, thus:
+
+    tags = []
+    Jkl::links(feed).each do |link|
+      tags << Jkl::tags("my_calais_key",link)
+    end
+
+A metadata sample might look something like this:
+
+    {
+      "Person"=>["Barack Obama", "Hillary Clinton"],
+      "Position"=>["Secretary of State"]
+    }
+
+It is hosted at [gemcutter](http://gemcutter.org/gems/jakal)
+
+    gem install jakal
+
+# LICENSE:
 
 (The MIT License)
 
-Copyright (c) 2009
+Copyright (c) 2009 sshingler
 
 Permission is hereby granted, free of charge, to any person obtaining
 a copy of this software and associated documentation files (the
data/features/calais.feature
CHANGED
@@ -3,41 +3,8 @@ Feature: Calais-Specific features
   As a developer
   I want to make some requests and inspect some responses
 
-  @
-  Scenario:
-
-    When I post to calais
-    Then I should get a response
-    And I should receive some tags
-
-  @connection_needed
-  Scenario: Post a mock story to calais, inspect the response
-    Given I have a sanitized sample BBC story
-    When I post to calais
-    Then I should get a response
-    And I should receive some tags
-
-  @connection_needed
-  Scenario: Get nested tags from calais
-    Given I have some simple text
+  @live
+  Scenario: Get nested tags from calais
+    Given I have some text
     When I request the nested entities from calais
     Then I should receive the entities grouped into categories
-
-  Scenario: Clean up blank items from a calais response
-    Given I have a mock calais response
-    When I remove the unwanted items
-    Then I should receive some tags
-    And there should no longer be any "instances"
-    And there should no longer be any "relevance"
-    And there should no longer be any "blank"
-    And there should no longer be any "not_available"
-
-  Scenario: Go through the calais response tags in a bit more detail
-    Given I have a mock calais response
-    When I remove the unwanted items
-    Then I should receive some tags
-    And there should be some "Organization" tags
-
-  Scenario: Go through the calais response tags as a single array
-    Given I have a mock calais response
-    Then I should be able to see the whole lot of tags as one block
data/features/http.feature
CHANGED
@@ -3,24 +3,18 @@ Feature: http features
   As a developer
   I want to make some requests and inspect some responses
 
-  @
+  @live
   Scenario: Make a restful post to yahoo
     When I post some data to yahoo
     Then I should get a response
 
-  @
+  @live
   Scenario: Make a restful get
     When I make a restful get request
     Then I should get a response
     And I should see some text
 
-  @
+  @live
   Scenario: Get some trends
-    When I request some trends
+    When I request some twitter trends
     Then I should get a response
-
-  @connection_needed
-  Scenario: Get some RSS
-    When I request some RSS
-    Then I should get a response
-    And I should receive some headlines
data/features/sanitize-text.feature
CHANGED
@@ -3,51 +3,69 @@ Feature: Processing features
   As a developer
   I want to make some requests and inspect some responses
 
-
-
-
-
-
-
-
-
-  Scenario: Sanitize some short text
-    Given I have a keyphrase 'the cat sat'
-    When I sanitize this text
-    Then it should say ''
-
-  @unit @text @wip
-  Scenario: Sanitize some text with tabs and spaces
-    Given I have a keyphrase 'the cat sat on the mat '
-    When I sanitize this text
-    Then it should say 'the cat sat on the mat'
-
-  @unit @text @wip
-  Scenario: Sanitize some short text with tabs and spaces
-    Given I have a keyphrase 'the cat sat on '
-    When I sanitize this text
-    Then it should say ''
+  @unit @text
+  Scenario: No changes needed
+    Given I have the text "the cat sat on the mat"
+    When I sanitize this text
+    Then there should be no script tags
+    And there should be no tags
+    And there should be no blank lines
+    And it should say "the cat sat on the mat"
 
-
-
-
-
-
+  @unit @text
+  Scenario: Remove simple tags
+    Given I have the text "<a href=\"a-link.html\">the cat sat on the mat</a>"
+    When I sanitize this text
+    Then there should be no script tags
+    And there should be no tags
+    And there should be no blank lines
+    Then it should say "the cat sat on the mat"
 
-
-
-
-
-
-
-
-
-  Scenario: Remove script tags
-    Given I have some script tag data
-    When I sanitize this text
-    Then it should say ' some para stuff here '
+  @unit @text @wip
+  Scenario: Remove script tags
+    Given I have some script tag data
+    When I sanitize this text
+    Then there should be no script tags
+    And there should be no tags
+    And there should be no blank lines
+    Then it should say "the cat sat on the mat"
 
-
-
-
-
+  @mock
+  Scenario: Remove script tags
+    Given I have a sample web page
+    When I remove the script tags
+    Then there should be no script tags
+
+  @mock
+  Scenario: Remove all tags
+    Given I have a sample web page
+    When I remove the script tags
+    And I strip all the tags
+    Then there should be no script tags
+    And there should be no tags
+
+  @mock
+  Scenario: Remove empty lines
+    Given a stripped web page
+    When I remove the blank lines
+    Then there should be no blank lines
+
+  @mock
+  Scenario: Remove a short line
+    Given I have the text "the cat sat on the"
+    When I remove the short lines
+    Then it should say ""
+
+  @mock
+  Scenario: Don't remove a long line
+    Given I have the text "the cat sat on the mat"
+    When I remove the short lines
+    Then it should say "the cat sat on the mat"
+
+  @mock
+  Scenario: Santize a sample BBC page
+    Given I have a sample BBC story
+    When I sanitize this text
+    Then there should be no script tags
+    And there should be no tags
+    And there should be no blank lines
data/features/step_definitions/calais_steps.rb
CHANGED
@@ -1,48 +1,8 @@
 
-Given /^I have some
+Given /^I have some text$/ do
   @text = "Barack Obama said today that he expects there to be conflict within his new security team after confirming Hillary Clinton as his choice for US Secretary of State."
 end
 
-Given /^I have a sanitized sample BBC story$/ do
-  Given "I have a sample BBC story"
-  When "I sanitize this text"
-end
-
-Given /^I have a mock calais response$/ do
-  @response = File.open('features/mocks/calais.json','r') {|f| f.readlines.to_s}
-end
-
-When /^I post to calais$/ do
-  key = YAML::load_file('config/keys.yml')['calais']
-  @response = Jkl::Extraction::get_from_calais(key, @text)
-end
-
-When /^I remove the unwanted items$/ do
-  @processed_json = Jkl::clean_unwanted_items_from_hash(JSON.parse(@response))
-end
-
-Then /^there should no longer be any "([^\"]*)"$/ do |arg1|
-  @processed_json[arg1].should be_nil
-end
-
-Then /^I should receive some tags$/ do
-  Jkl::get_tag_from_json(@response) do |tag|
-    tag.should_not be_nil
-  end
-end
-
-Then /^there should be some "([^\"]*)" tags$/ do |arg1|
-  Jkl::get_tag_from_json(@response) {|tag|
-    #puts tag.inspect
-    tag.each{|k,v| puts "#{k} : #{v}" if k=='_type'}
-  }
-end
-
-Then /^I should be able to see the whole lot of tags as one block$/ do
-  tags = Jkl::get_tag_from_json(@response)
-  tags.length.should > 0
-end
-
 When /^I request the nested entities from calais$/ do
   key = YAML::load_file('config/keys.yml')['calais']
   @response = Jkl::Extraction::tags key, @text
data/features/step_definitions/http_steps.rb
CHANGED
@@ -6,12 +6,6 @@ When /^I post some data to yahoo$/ do
   @response = Jkl::post_to @url, post_args
 end
 
-When /^I request some RSS$/ do
-  keyphrase = @keyphrase || "iraq"
-  url = "#{YAML::load_file('config/config.yml')['topix']}#{CGI::escape(keyphrase)}"
-  @response = Jkl::get_xml_from url
-end
-
 Given /^I have some RSS$/ do
   raw = File.open('features/mocks/topix_rss.xml','r') {|f| f.readlines.to_s}
   @response = Hpricot.XML raw
@@ -22,7 +16,7 @@ When /^I make a restful get request$/ do
   @response = Jkl::get_from url
 end
 
-When /^I request some trends$/ do
+When /^I request some twitter trends$/ do
   twitter_json_url = YAML::load_file('config/config.yml')['twitter']
   output = JSON.parse Jkl::get_from twitter_json_url
   @response = output['trends']
@@ -30,17 +24,6 @@ end
 
 Then /^I should get a response$/ do
   @response.should_not == nil
-  #puts @response.inspect
-end
-
-Then /^I should receive some headlines$/ do
-  @items = Jkl::Rss::items @response
-  @links = []
-  @items.each do |item|
-    @links << Jkl::Rss::attribute_from(item, :link)
-  end
-  @links.should_not == nil
-  @links.length.should > 0
 end
 
 Then /^I should be able to get the copy from the first headline$/ do
data/features/step_definitions/sanitize-text_steps.rb
CHANGED
@@ -1,4 +1,4 @@
-Given "I have
+Given "I have the text \"$text\"" do |text|
   @text = text
 end
 
@@ -6,22 +6,9 @@ Given /^I have a sample BBC story$/ do
   @text = File.open('features/mocks/bbc_story.html','r') {|f| f.readlines.to_s}
 end
 
-When /^I sanitize this text$/ do
-  @text = Jkl::sanitize @text
-end
-
-Then /^it should be ok$/ do
-  @text.should_not be_nil
-  @text.should_not == ""
-end
-
-Then "it should say '$text'" do |text|
-  @text.should == text
-end
-
 Given /^I have some script tag data$/ do
   @text = <<-EOF;
-
+    the cat sat on the mat
     <script type="text/javascript" charset="utf-8">
     function nofunction(){var bob;}
     </script>
@@ -30,3 +17,63 @@ Given /^I have some script tag data$/ do
   EOF
 end
 
+Given /^I have a sample web page$/ do
+  @text = File.open('features/mocks/sample-web-page.html','r') {|f| f.readlines.to_s}
+end
+
+Given /^a stripped web page$/ do
+  Given "I have a sample web page"
+  When "I remove the script tags"
+  And "I strip all the tags"
+  Then "there should be no script tags"
+  And "there should be no tags"
+end
+
+When /^I sanitize this text$/ do
+  @text = Jkl::Text::sanitize @text
+end
+
+When /^I examine the text$/ do
+  text = Jkl::Text::remove_tabs @text
+end
+
+Then "it should say \"$text\"" do |text|
+  @text.to_s.should == text
+end
+
+Then /^I can read it$/ do
+  Jkl::Text::document_from(@response).should_not be_nil
+end
+
+When /^I remove the script tags$/ do
+  @text = Jkl::Text::remove_script_tags @text
+end
+
+When /^I remove the blank lines$/ do
+  @text = Jkl::Text::remove_blank_lines @text
+end
+
+When /^I remove the short lines$/ do
+  @text = Jkl::Text::remove_short_lines @text
+end
+
+When /^I clean it up$/ do
+  @text = Jkl::Text::remove_short_lines Jkl::Text:: strip_all_tags Jkl::Text::remove_script_tags @text
+end
+
+When /^I strip all the tags$/ do
+  @text = Jkl::Text::strip_all_tags @text
+end
+
+Then /^there should be no tags$/ do
+  @text.match(/</).should be_nil
+end
+
+Then /^there should be no script tags$/ do
+  @text.match(/<script/).should be_nil
+end
+
+Then /^there should be no blank lines$/ do
+  @text.match(/\r/).should be_nil
+  @text.match(/\n/).should be_nil
+end
data/features/support/env.rb
CHANGED
@@ -2,6 +2,15 @@ gem 'rack-test'
 
 require 'spec/expectations'
 require 'rack/test'
+require 'hpricot'
+require 'json'
+require 'restclient'
+require 'haml'
+require 'cgi'
+
+require 'lib/jkl.rb'
+
+include Jkl
 
 class MyWorld
   include Rack::Test::Methods
data/lib/jkl.rb
CHANGED
@@ -1,8 +1,22 @@
-require "jkl/rest_client
-require "jkl/rss_client
-require "jkl/calais_client
-require "jkl/
+require "jkl/rest_client"
+require "jkl/rss_client"
+require "jkl/calais_client"
+require "jkl/text_client"
 
 module Jkl
+  class << self
+
+    def links(feed)
+      links = Jkl::Rss::links(Jkl::Rss::items(Jkl::get_xml_from(feed)))
+      links.each do |link|
+        yield link if block_given?
+      end
+    end
+
+    def tags(key, link)
+      text = Jkl::Text::sanitize(Jkl::get_from(link))
+      Jkl::Extraction::tags(key, text)
+    end
 
+  end
 end
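The new `Jkl::links` yields each link when a block is given and, because `Array#each` returns its receiver, also returns the array of links. That is why the README can chain `.each` onto it. A minimal sketch of the pattern follows; `FEED_LINKS` and `links_sketch` are hypothetical stand-ins for the network fetch and RSS parsing, not part of the gem.

```ruby
# Hypothetical stand-in data: replaces the feed that Jkl::links fetches
# and parses via Jkl::get_xml_from and Jkl::Rss.
FEED_LINKS = ["http://example.com/a", "http://example.com/b"].freeze

# Same shape as the new Jkl::links: yield each link if a block is given,
# and return the array either way (Array#each returns its receiver).
def links_sketch(feed_links = FEED_LINKS)
  feed_links.each do |link|
    yield link if block_given?
  end
end

seen = []
links_sketch { |link| seen << link }  # block form
returned = links_sketch               # no block: just the array

puts seen.inspect
puts returned.inspect
```

Both calling styles visit the same links, so the README's `Jkl::links(feed).each do |link| ... end` and a direct block passed to `Jkl::links` should behave alike.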
data/lib/jkl/calais_client.rb
CHANGED
@@ -1,21 +1,17 @@
-require "json"
 require "calais"
 
-require "rest_client"
-
 module Jkl
   module Extraction
     class << self
-
-
-      def calais_response(key, pages)
+
+      def calais_response(key, text)
         Calais.process_document(
-          :content =>
+          :content => text,
           :content_type => :text,
           :license_id => key
         )
       end
-
+
       def tags(key, text)
         nested_list = {}
         entities(key,text).each do |a|
@@ -23,58 +19,10 @@ module Jkl
         end
         nested_list
       end
-
+
       def entities(key,text)
         calais_response(key, text).entities.map{|e| {e.type => [e.attributes["name"]]}}
       end
-
-      #not using calais gem, experimenting with json response
-      def get_from_calais(key, content)
-        post_args = {
-          "licenseID" => key,
-          "content" => content,
-          "paramsXML" => paramsXML("application/json")
-        }
-        Jkl::post_to(URI.parse("http://api.opencalais.com/enlighten/rest/"), post_args)
-      end
-
-      def get_tag_from_json(response)
-        result = JSON.parse response
-        result.delete_if {|key, value| key == "doc" } # ditching the doc
-        cleaned_result = []
-        result.each do |key,tag|
-          tag = Jkl::clean_unwanted_items_from_hash tag
-          cleaned_result << tag
-          yield tag if block_given?
-        end
-        cleaned_result
-      end
-
-      def clean_unwanted_items_from_hash h
-        h.delete_if {|k, v| k == "relevance" }
-        h.delete_if {|k, v| k == "instances" }
-        h.delete_if {|k, v| v == "N/A"}
-        h.delete_if {|k, v| v == []}
-        h.delete_if {|k, v| v == ""}
-        h.delete_if {|k, v| k == "_typeGroup"}
-        h
-      end
-
-      private
-
-      def paramsXML(format)
-        <<-paramsXML;
-        <c:params xmlns:c="http://s.opencalais.com/1/pred/"
-        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
-        <c:processingDirectives
-        c:contentType="text/txt"
-        c:outputFormat="#{format}">
-        </c:processingDirectives>
-        <c:userDirectives />
-        <c:externalMetadata />
-        </c:params>
-        paramsXML
-      end
 
     end
   end
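The line inside the `each` loop of `tags` falls between the two hunks and is not shown. One way the single-key hashes returned by `entities` could be folded into the grouped hash from the README sample is sketched below. The merge step is an assumption rather than the gem's code, and the input data is made up for illustration.

```ruby
# Hypothetical input in the shape produced by `entities`:
# one single-key hash per entity, {type => [name]}.
entities = [
  { "Person"   => ["Barack Obama"] },
  { "Position" => ["Secretary of State"] },
  { "Person"   => ["Hillary Clinton"] }
]

# Assumed merge step (not visible in the diff): when a type is already
# present, concatenate the name arrays instead of overwriting.
nested_list = {}
entities.each do |entity|
  nested_list.merge!(entity) { |_type, names, new_names| names + new_names }
end

puts nested_list.inspect
```

With this merge, repeated entity types accumulate under one key, which matches the shape of the metadata sample in the README.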
data/lib/jkl/rest_client.rb
CHANGED
@@ -19,8 +19,8 @@ module Jkl
 
     def get_from(uri)
       begin
-
-
+        response = Net::HTTP.get_response(URI.parse(uri))
+        response.body
       rescue URI::InvalidURIError => e
         puts("WARN: Invalid URI: #{e}")
       rescue SocketError => e
@@ -33,6 +33,10 @@ module Jkl
     def get_xml_from(uri)
       Hpricot.XML get_from uri
     end
+
+    def document_from(text)
+      Hpricot(text)
+    end
 
   end
 end
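In the visible part of `get_from`, a rescue branch ends with `puts`, which returns nil, so a caller such as `Jkl::tags` may receive nil instead of a body. The rest of the method is not shown, so this is an inference. A minimal sketch of the same fetch with the nil made explicit follows; `fetch_body` is a hypothetical name, not part of the gem.

```ruby
require "net/http"
require "uri"

# Hypothetical variant of get_from: returns the response body, or nil
# (explicitly) when the URI is invalid or the connection fails.
def fetch_body(uri)
  Net::HTTP.get_response(URI.parse(uri)).body
rescue URI::InvalidURIError => e
  warn("WARN: Invalid URI: #{e}")
  nil
rescue SocketError, Errno::ECONNREFUSED => e
  warn("WARN: Could not connect: #{e}")
  nil
end

# A string containing spaces is rejected by URI.parse before any
# network access happens, so this call returns nil.
body = fetch_body("not a uri")
puts body.inspect
```

A caller can then guard on the result, for example `text = body && body.strip`, before handing it to a sanitizer that expects a string.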
data/lib/jkl/rss_client.rb
CHANGED
data/lib/jkl/text_client.rb
ADDED
@@ -0,0 +1,38 @@
+module Jkl
+  module Text
+    class << self
+
+      def sanitize(text)
+        remove_short_lines strip_all_tags remove_script_tags text
+      end
+
+      def strip_all_tags(text)
+        text.gsub(/<\/?[^>]*>/, "")
+      end
+
+      def remove_blank_lines(text)
+        text.gsub(/\n\r|\r\n|\n|\r/, "")
+      end
+
+      def remove_html_comments(text)
+        text.gsub(/<!--(.|\s)*?-->/, "")
+      end
+
+      def remove_script_tags(text)
+        text = remove_html_comments(text)
+        text.gsub(/((<[\s\/]*script\b[^>]*>)([^>]*)(<\/script>))/i, "")
+      end
+
+      def remove_short_lines(text)
+        text = text.gsub(/\s\s/, "\n")
+        str = ""
+        # remove short lines - ususally just navigation
+        text.split("\n").each do |l|
+          str << l unless l.count(" ") < 5
+        end
+        str
+      end
+
+    end
+  end
+end
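The new `Jkl::Text.sanitize` chains three passes: remove comments and script blocks, strip the remaining tags, then drop lines with fewer than five spaces. The sketch below copies those regexes into a hypothetical `SanitizeSketch` module so it runs without the gem or its dependencies; the sample strings are made up for illustration.

```ruby
# Hypothetical stand-in for Jkl::Text, with the regexes copied from the hunk.
module SanitizeSketch
  module_function

  def remove_html_comments(text)
    text.gsub(/<!--(.|\s)*?-->/, "")
  end

  # Drops comments first, then whole <script>...</script> blocks.
  def remove_script_tags(text)
    remove_html_comments(text)
      .gsub(/((<[\s\/]*script\b[^>]*>)([^>]*)(<\/script>))/i, "")
  end

  def strip_all_tags(text)
    text.gsub(/<\/?[^>]*>/, "")
  end

  # Turns each pair of whitespace characters into a line break, then keeps
  # only lines containing at least five spaces. Kept lines are joined
  # with no separator.
  def remove_short_lines(text)
    kept = String.new
    text.gsub(/\s\s/, "\n").split("\n").each do |line|
      kept << line unless line.count(" ") < 5
    end
    kept
  end

  def sanitize(text)
    remove_short_lines(strip_all_tags(remove_script_tags(text)))
  end
end

linked   = SanitizeSketch.sanitize('<a href="a-link.html">the cat sat on the mat</a>')
scripted = SanitizeSketch.sanitize(
  "the cat sat on the mat\n<script type=\"text/javascript\">\nvar bob;\n</script>\n"
)
short    = SanitizeSketch.remove_short_lines("the cat sat on the")

puts linked    # the cat sat on the mat
puts scripted  # the cat sat on the mat
puts short.inspect  # ""
```

Because kept lines are appended without a separator, two adjacent long lines would run together in the output; whether that matters depends on how the text is consumed downstream.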
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: jakal
 version: !ruby/object:Gem::Version
-  version: 0.1.
+  version: 0.1.2
 platform: ruby
 authors:
 - sshingler
@@ -13,21 +13,21 @@ date: 2009-08-27 00:00:00 +01:00
 default_executable:
 dependencies: []
 
-description: Jakal is a Ruby library which contains some
+description: Jakal is a Ruby library which contains some utilities for connecting to internet based APIs.
 email: "'shingler@gmail.com'"
 executables: []
 
 extensions: []
 
 extra_rdoc_files:
-- README.
+- README.md
 - License.txt
 files:
 - lib/jkl.rb
 - lib/jkl/calais_client.rb
 - lib/jkl/rest_client.rb
 - lib/jkl/rss_client.rb
-- lib/jkl/
+- lib/jkl/text_client.rb
 - features/calais.feature
 - features/http.feature
 - features/sanitize-text.feature
@@ -37,11 +37,10 @@ files:
 - features/mocks/twitter.json
 - features/step_definitions/calais_steps.rb
 - features/step_definitions/http_steps.rb
-- features/step_definitions/require_steps.rb
 - features/step_definitions/sanitize-text_steps.rb
 - features/step_definitions/twitter_steps.rb
 - features/support/env.rb
-- README.
+- README.md
 - License.txt
 has_rdoc: true
 homepage: http://github.com/sshingler/jkl
@@ -71,6 +70,6 @@ rubyforge_project:
 rubygems_version: 1.3.5
 signing_key:
 specification_version: 3
-summary: Jakal is a Ruby library which contains some
+summary: Jakal is a Ruby library which contains some utilities for connecting to internet based APIs.
 test_files: []
 
data/features/step_definitions/require_steps.rb
DELETED
@@ -1,12 +0,0 @@
-require 'hpricot'
-require 'json'
-require 'restclient'
-require 'haml'
-require 'cgi'
-require 'lib/jkl.rb'
-require 'lib/jkl/calais_client.rb'
-require 'lib/jkl/rest_client.rb'
-require 'lib/jkl/rss_client.rb'
-require 'lib/jkl/url_doc_handler.rb'
-
-include Jkl
data/lib/jkl/url_doc_handler.rb
DELETED
@@ -1,35 +0,0 @@
-require 'hpricot'
-require 'rest_client'
-
-module Jkl
-
-  class << self
-
-    def sanitize(text)
-      str = ""
-      text = text.to_s.gsub(/((<[\s\/]*script\b[^>]*>)([^>]*)(<\/script>))/i,"") #remove script tags - with contents
-      text.to_s.gsub(/<\/?[^>]*>/, "").split("\r").each do |l| # remove all tags
-        l = l.gsub(/^[ \t]/,"") #remove tabs
-        l = l.gsub(/^[ \s]/,"")
-        l.split("\n").each do |l|
-          str << l unless l.count(" ") < 5 # remove short lines - ususally just navigation
-        end
-      end
-      str
-    end
-
-    def from_doc(response)
-      begin
-        Hpricot(response)
-      rescue URI::InvalidURIError => e
-        puts("WARN: Problem with getting a connection: #{e}")
-      rescue SocketError => e
-        puts("WARN: Could not connect to feed: #{e}")
-      rescue Errno::ECONNREFUSED => e
-        puts("WARN: Connection refused: #{e}")
-      end
-    end
-
-  end
-
-end