relaton-ecma 1.14.0 → 1.14.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/Gemfile +6 -0
- data/README.adoc +59 -1
- data/grammars/basicdoc.rng +0 -1
- data/grammars/biblio.rng +12 -2
- data/lib/relaton_ecma/data_fetcher.rb +97 -0
- data/lib/relaton_ecma/data_parser.rb +215 -0
- data/lib/relaton_ecma/ecma_bibliography.rb +52 -6
- data/lib/relaton_ecma/processor.rb +13 -0
- data/lib/relaton_ecma/version.rb +1 -1
- data/lib/relaton_ecma.rb +3 -1
- data/relaton_ecma.gemspec +2 -7
- metadata +20 -61
- data/lib/relaton_ecma/scrapper.rb +0 -29
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 5bc67bcba0ff063e94f85b12e208b74231c8ffd77736a047e910797eb1da211c
|
4
|
+
data.tar.gz: 34396c96fdf0d4b8d8d1f1cff04f72f135a103f3a830f6784521ff46f5e24367
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 83fde052717c206e86a00e893581538878f03ae25a06c24f9288d8adbc69c92ec4e3a1003b39ff7e8a834be4360e7a81eb802e42924c7d14c178a18c39f013c1
|
7
|
+
data.tar.gz: e4adc23d844076f2c555b677db11a2e799041ae5b9fc3ec43f7fac51fd583db964bef5cf874dc656fd5c5b3d23151bb60752f3f270f7d631f62ce3b05412cbdc
|
data/Gemfile
CHANGED
data/README.adoc
CHANGED
@@ -29,25 +29,64 @@ Or install it yourself as:
|
|
29
29
|
|
30
30
|
== Usage
|
31
31
|
|
32
|
-
===
|
32
|
+
=== Fetch documents
|
33
|
+
|
34
|
+
Documents can be fetched by reference. The structure of the reference depends on the type of the document. There are three types of documents:
|
35
|
+
- ECMA standards
|
36
|
+
- ECMA technical reports
|
37
|
+
- ECMA mementos
|
38
|
+
|
39
|
+
ECMA standards have the following reference structure: `ECMA-{NUMBER}[ ed{EDITION}][ vol{VOLUME}]`. Where: `NUMBER` is a number of the standard, `EDITION` is an edition of the standard, and `VOLUME` is a volume of the standard. The `EDITION` and `VOLUME` are optional. If `EDITION` is not specified, the latest edition of the standard will be fetched. If `VOLUME` is not specified, the first volume of the standard will be fetched. +
|
40
|
+
ECMA technical reports have the following reference structure: `ECMA TR/{NUMBER}[ ed{EDITION}]`. Where: `NUMBER` is a number of the technical report, and `EDITION` is an edition of the technical report. The `EDITION` is optional. If `EDITION` is not specified, the latest edition of the technical report will be fetched. +
|
41
|
+
ECMA mementos have the following reference structure: `ECMA MEM/{YEAR}`. Where: `YEAR` is an year of the memento.
|
33
42
|
|
34
43
|
[source,ruby]
|
35
44
|
----
|
36
45
|
require 'relaton_ecma'
|
37
46
|
=> true
|
38
47
|
|
48
|
+
# fetch ECMA standard
|
39
49
|
item = RelatonEcma::EcmaBibliography.get 'ECMA-6'
|
40
50
|
[relaton-ecma] ("ECMA-6") fetching...
|
41
51
|
[relaton-ecma] ("ECMA-6") found ECMA-6
|
42
52
|
#<RelatonEcma::BibliographicItem:0x00007fc645b11c10
|
43
53
|
...
|
44
54
|
|
55
|
+
# fetch ECMA standard with edition and volume
|
56
|
+
RelatonEcma::EcmaBibliography.get "ECMA-269 ed3 vol2"
|
57
|
+
[relaton-ecma] ("ECMA-269 ed3 vol2") fetching...
|
58
|
+
[relaton-ecma] ("ECMA-269 ed3 vol2") found ECMA-269
|
59
|
+
=> #<RelatonEcma::BibliographicItem:0x0000000106ac8210
|
60
|
+
...
|
61
|
+
|
62
|
+
# fetch the last edition of ECMA standard
|
63
|
+
bib = RelatonEcma::EcmaBibliography.get "ECMA-269"
|
64
|
+
[relaton-ecma] ("ECMA-269") fetching...
|
65
|
+
[relaton-ecma] ("ECMA-269") found ECMA-269
|
66
|
+
=> #<RelatonEcma::BibliographicItem:0x000000010a408480
|
67
|
+
...
|
68
|
+
|
69
|
+
bib.edition.content
|
70
|
+
=> "9"
|
71
|
+
|
72
|
+
# fetch the first volume of ECMA standard
|
73
|
+
bib = RelatonEcma::EcmaBibliography.get "ECMA-269 ed3"
|
74
|
+
[relaton-ecma] ("ECMA-269 ed3") fetching...
|
75
|
+
[relaton-ecma] ("ECMA-269 ed3") found ECMA-269
|
76
|
+
=> #<RelatonEcma::BibliographicItem:0x000000010a3ed0e0
|
77
|
+
...
|
78
|
+
|
79
|
+
bib.extent.first.reference_from
|
80
|
+
=> "1"
|
81
|
+
|
82
|
+
# fetch ECMA technical report
|
45
83
|
RelatonEcma::EcmaBibliography.get 'ECMA TR/18'
|
46
84
|
[relaton-ecma] ("ECMA TR/18") fetching...
|
47
85
|
[relaton-ecma] ("ECMA TR/18") found ECMA TR/18
|
48
86
|
=> #<RelatonEcma::BibliographicItem:0x00007fc645c00cc0
|
49
87
|
...
|
50
88
|
|
89
|
+
# fetch ECMA memento
|
51
90
|
RelatonEcma::EcmaBibliography.get "ECMA MEM/2021"
|
52
91
|
[relaton-ecma] ("ECMA MEM/2021") fetching...
|
53
92
|
[relaton-ecma] ("ECMA MEM/2021") found ECMA MEM/2021
|
@@ -113,6 +152,25 @@ item = RelatonEcma::XMLParser.from_xml File.read("spec/fixtures/bibdata.xml")
|
|
113
152
|
...
|
114
153
|
----
|
115
154
|
|
155
|
+
=== Fetch data
|
156
|
+
|
157
|
+
This gem uses a https://github.com/relaton/relaton-data-ecma[ecma-standards] prefetched dataset as a data source. The dataset contains documents from ECMA https://www.ecma-international.org/publications-and-standards/standards/[Standards], https://www.ecma-international.org/publications-and-standards/technical-reports/[Technical Reports], and https://www.ecma-international.org/publications-and-standards/mementos/[Mementos] pages.
|
158
|
+
|
159
|
+
The method `RelatonEcma::DataFetcher.new(output: "data", format: "yaml").fetch` fetches all the documents from the pages and saves them to the `./data` folder in YAML format.
|
160
|
+
Arguments:
|
161
|
+
|
162
|
+
- `output` - folder to save documents (default './data').
|
163
|
+
- `format` - the format in which the documents are saved. Possible formats are: `yaml`, `xml`, `bibxxml` (default `yaml`).
|
164
|
+
|
165
|
+
[source,ruby]
|
166
|
+
----
|
167
|
+
RelatonEcma::DataFetcher.new.fetch
|
168
|
+
Started at: 2022-06-23 09:36:55 +0200
|
169
|
+
Stopped at: 2022-06-23 09:36:58 +0200
|
170
|
+
Done in: 752 sec.
|
171
|
+
=> nil
|
172
|
+
----
|
173
|
+
|
116
174
|
== Development
|
117
175
|
|
118
176
|
After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
|
data/grammars/basicdoc.rng
CHANGED
data/grammars/biblio.rng
CHANGED
@@ -216,6 +216,9 @@
|
|
216
216
|
<optional>
|
217
217
|
<ref name="fullname"/>
|
218
218
|
</optional>
|
219
|
+
<zeroOrMore>
|
220
|
+
<ref name="credential"/>
|
221
|
+
</zeroOrMore>
|
219
222
|
<zeroOrMore>
|
220
223
|
<ref name="affiliation"/>
|
221
224
|
</zeroOrMore>
|
@@ -232,6 +235,11 @@
|
|
232
235
|
<ref name="FullNameType"/>
|
233
236
|
</element>
|
234
237
|
</define>
|
238
|
+
<define name="credential">
|
239
|
+
<element name="credential">
|
240
|
+
<text/>
|
241
|
+
</element>
|
242
|
+
</define>
|
235
243
|
<define name="FullNameType">
|
236
244
|
<choice>
|
237
245
|
<group>
|
@@ -305,7 +313,9 @@
|
|
305
313
|
<zeroOrMore>
|
306
314
|
<ref name="affiliationdescription"/>
|
307
315
|
</zeroOrMore>
|
308
|
-
<
|
316
|
+
<optional>
|
317
|
+
<ref name="organization"/>
|
318
|
+
</optional>
|
309
319
|
</element>
|
310
320
|
</define>
|
311
321
|
<define name="affiliationname">
|
@@ -1316,7 +1326,7 @@
|
|
1316
1326
|
<value>commentaryOf</value>
|
1317
1327
|
<value>hasCommentary</value>
|
1318
1328
|
<value>related</value>
|
1319
|
-
<value>
|
1329
|
+
<value>hasComplement</value>
|
1320
1330
|
<value>complementOf</value>
|
1321
1331
|
<value>obsoletes</value>
|
1322
1332
|
<value>obsoletedBy</value>
|
@@ -0,0 +1,97 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
require "English"
|
4
|
+
require "mechanize"
|
5
|
+
require "relaton_ecma"
|
6
|
+
|
7
|
+
module RelatonEcma
|
8
|
+
class DataFetcher
|
9
|
+
URL = "https://www.ecma-international.org/publications-and-standards/"
|
10
|
+
|
11
|
+
# @param [String] :output directory to output documents
|
12
|
+
# @param [String] :format output format (xml, yaml, bibxml)
|
13
|
+
def initialize(output: "data", format: "yaml")
|
14
|
+
@output = output
|
15
|
+
@format = format
|
16
|
+
@ext = format.sub(/^bib/, "")
|
17
|
+
@files = []
|
18
|
+
@index = Relaton::Index.find_or_create :ECMA
|
19
|
+
@agent = Mechanize.new
|
20
|
+
@agent.user_agent_alias = Mechanize::AGENT_ALIASES.keys[rand(21)]
|
21
|
+
end
|
22
|
+
|
23
|
+
# @param bib [RelatonItu::ItuBibliographicItem]
|
24
|
+
def write_file(bib) # rubocop:disable Metrics/AbcSize, Metrics/MethodLength
|
25
|
+
id = bib.docidentifier[0].id.gsub(%r{[/\s]}, "_")
|
26
|
+
id += "-#{bib.edition.content.gsub('.', '_')}" if bib.edition
|
27
|
+
extent = bib.extent.detect { |e| e.type == "volume" }
|
28
|
+
id += "-#{extent.reference_from}" if extent
|
29
|
+
file = "#{@output}/#{id}.#{@ext}"
|
30
|
+
if @files.include? file
|
31
|
+
warn "Duplicate file #{file}"
|
32
|
+
else
|
33
|
+
@files << file
|
34
|
+
File.write file, render_doc(bib), encoding: "UTF-8"
|
35
|
+
@index.add_or_update index_id(bib), file
|
36
|
+
end
|
37
|
+
end
|
38
|
+
|
39
|
+
def index_id(bib)
|
40
|
+
{ id: bib.docidentifier[0].id }.tap do |i|
|
41
|
+
i[:ed] = bib.edition.content if bib.edition
|
42
|
+
extent = bib.extent.detect { |e| e.type == "volume" }
|
43
|
+
i[:vol] = extent.reference_from if extent
|
44
|
+
end
|
45
|
+
end
|
46
|
+
|
47
|
+
def render_doc(bib)
|
48
|
+
case @format
|
49
|
+
when "yaml" then bib.to_hash.to_yaml
|
50
|
+
when "xml" then bib.to_xml bibdata: true
|
51
|
+
when "bibxml" then bib.to_bibxml
|
52
|
+
end
|
53
|
+
end
|
54
|
+
|
55
|
+
# @param hit [Nokogiri::XML::Element]
|
56
|
+
def parse_page(hit) # rubocop:disable Metrics/AbcSize, Metrics/MethodLength
|
57
|
+
DataParser.new(hit).parse.each { |item| write_file item }
|
58
|
+
end
|
59
|
+
|
60
|
+
# @param type [String]
|
61
|
+
def html_index(type) # rubocop:disable Metrics/MethodLength
|
62
|
+
result = @agent.get "#{URL}#{type}/"
|
63
|
+
# @last_call_time = Time.now
|
64
|
+
result.xpath(
|
65
|
+
"//li/span[1]/a",
|
66
|
+
"//div[contains(@class, 'entry-content-wrapper')][.//a[.='Download']]",
|
67
|
+
).each do |hit|
|
68
|
+
# workers << hit
|
69
|
+
parse_page(hit)
|
70
|
+
rescue StandardError => e
|
71
|
+
warn e.message
|
72
|
+
warn e.backtrace
|
73
|
+
end
|
74
|
+
end
|
75
|
+
|
76
|
+
#
|
77
|
+
# Fetch data from Ecma website.
|
78
|
+
#
|
79
|
+
# @return [void]
|
80
|
+
#
|
81
|
+
def fetch
|
82
|
+
t1 = Time.now
|
83
|
+
puts "Started at: #{t1}"
|
84
|
+
|
85
|
+
FileUtils.mkdir_p @output
|
86
|
+
|
87
|
+
html_index "standards"
|
88
|
+
html_index "technical-reports"
|
89
|
+
html_index "mementos"
|
90
|
+
@index.save
|
91
|
+
|
92
|
+
t2 = Time.now
|
93
|
+
puts "Stopped at: #{t2}"
|
94
|
+
puts "Done in: #{(t2 - t1).round} sec."
|
95
|
+
end
|
96
|
+
end
|
97
|
+
end
|
@@ -0,0 +1,215 @@
|
|
1
|
+
module RelatonEcma
|
2
|
+
class DataParser
|
3
|
+
MATTRS = %i[docid title date link].freeze
|
4
|
+
ATTRS = MATTRS + %i[abstract relation edition].freeze
|
5
|
+
|
6
|
+
#
|
7
|
+
# Initialize parser
|
8
|
+
#
|
9
|
+
# @param [Nokogiri::XML::Element] hit document hit
|
10
|
+
#
|
11
|
+
def initialize(hit)
|
12
|
+
@hit = hit
|
13
|
+
@bib = {
|
14
|
+
type: "standard", language: ["en"], script: ["Latn"], place: ["Geneva"], doctype: "document"
|
15
|
+
}
|
16
|
+
@agent = Mechanize.new
|
17
|
+
end
|
18
|
+
|
19
|
+
def parse # rubocop:disable Metrics/AbcSize,Metrics/MethodLength
|
20
|
+
if @hit[:href]
|
21
|
+
@agent.user_agent_alias = Mechanize::AGENT_ALIASES.keys[rand(21)]
|
22
|
+
@doc = get_page @hit[:href]
|
23
|
+
ATTRS.each { |a| @bib[a] = send "fetch_#{a}" }
|
24
|
+
else
|
25
|
+
MATTRS.each { |a| @bib[a] = send "fetch_mem_#{a}" }
|
26
|
+
end
|
27
|
+
@bib[:contributor] = contributor
|
28
|
+
items = [BibliographicItem.new(**@bib)]
|
29
|
+
items + parse_editions
|
30
|
+
end
|
31
|
+
|
32
|
+
#
|
33
|
+
# Get page with retries
|
34
|
+
#
|
35
|
+
# @param [String] url url to fetch
|
36
|
+
#
|
37
|
+
# @return [Mechanize::Page] document
|
38
|
+
#
|
39
|
+
def get_page(url)
|
40
|
+
3.times do |n|
|
41
|
+
sleep n
|
42
|
+
doc = @agent.get url
|
43
|
+
return doc
|
44
|
+
rescue StandardError => e
|
45
|
+
warn e.message
|
46
|
+
end
|
47
|
+
end
|
48
|
+
|
49
|
+
#
|
50
|
+
# Parse editions
|
51
|
+
#
|
52
|
+
# @param [Mechanize::Page] doc document
|
53
|
+
# @param [Hash] bib bibliographic item the last edition
|
54
|
+
#
|
55
|
+
# @return [void]
|
56
|
+
#
|
57
|
+
def parse_editions # rubocop:disable Metrics/AbcSize, Metrics/CyclomaticComplexity, Metrics/MethodLength, Metrics/PerceivedComplexity
|
58
|
+
return [] unless @doc
|
59
|
+
|
60
|
+
docid = @bib[:docid]
|
61
|
+
@doc.xpath('//div[@id="main"]/div[1]/div/main/article/div/div/standard/div/ul/li').map do |hit|
|
62
|
+
id, ed, @bib[:date], vol = edition_id_parts hit.at("./span", "./a").text
|
63
|
+
@bib[:link] = edition_link(hit) + edition_translation_link(ed)
|
64
|
+
next if ed.nil? || ed.empty?
|
65
|
+
|
66
|
+
@bib[:docid] = id.nil? || id.empty? ? docid : fetch_docid(id)
|
67
|
+
@bib[:edition] = RelatonBib::Edition.new(content: ed)
|
68
|
+
@bib[:extent] = vol && [RelatonBib::Locality.new("volume", vol)]
|
69
|
+
BibliographicItem.new(**@bib)
|
70
|
+
end.compact
|
71
|
+
end
|
72
|
+
|
73
|
+
def edition_link(hit)
|
74
|
+
{ "src" => hit.at("./a"), "pdf" => hit.at("./span/a") }.map do |type, a|
|
75
|
+
RelatonBib::TypedUri.new(type: type, content: a[:href]) if a
|
76
|
+
end.compact
|
77
|
+
end
|
78
|
+
|
79
|
+
#
|
80
|
+
# Parse edition and date
|
81
|
+
#
|
82
|
+
# @param [String] text identifier text
|
83
|
+
#
|
84
|
+
# @return [Array<String,nil,Array<RelatonBib::BibliographicDate>>] edition and date
|
85
|
+
#
|
86
|
+
def edition_id_parts(text) # rubocop:disable Metrics/MethodLength
|
87
|
+
%r{^
|
88
|
+
(?<id>\w+(?:[\d-]+|\sTR/\d+)),?\s
|
89
|
+
(?:Volume\s(?<vol>[\d.]+),?\s)?
|
90
|
+
(?<ed>[\d.]+)(?:st|nd|rd|th)?\sedition
|
91
|
+
(?:[,.]\s(?<dt>\w+\s\d+))?
|
92
|
+
}x =~ text
|
93
|
+
date = [dt].compact.map do |d|
|
94
|
+
on = Date.strptime(d, "%B %Y").strftime("%Y-%m")
|
95
|
+
RelatonBib::BibliographicDate.new(type: "published", on: on)
|
96
|
+
end
|
97
|
+
[id, ed, date, vol]
|
98
|
+
end
|
99
|
+
|
100
|
+
# @return [Array<RelatonBib::DocumentIdentifier>]
|
101
|
+
def fetch_docid(id = nil)
|
102
|
+
id ||= @hit.text
|
103
|
+
[RelatonBib::DocumentIdentifier.new(type: "ECMA", id: id, primary: true)]
|
104
|
+
end
|
105
|
+
|
106
|
+
# @return [Array<RelatonBib::TypedUri>]
|
107
|
+
def fetch_link # rubocop:disable Metrics/AbcSize
|
108
|
+
link = []
|
109
|
+
link << RelatonBib::TypedUri.new(type: "src", content: @hit[:href]) if @hit[:href]
|
110
|
+
ref = @doc.at('//div[@class="ecma-item-content-wrapper"]/span/a',
|
111
|
+
'//div[@class="ecma-item-content-wrapper"]/a')
|
112
|
+
link << RelatonBib::TypedUri.new(type: "pdf", content: ref[:href]) if ref
|
113
|
+
link + edition_translation_link(@bib[:edition]&.content)
|
114
|
+
end
|
115
|
+
|
116
|
+
def fetch_mem_link
|
117
|
+
@hit.xpath("./div/section/div/p/a").map do |a|
|
118
|
+
RelatonBib::TypedUri.new(type: "pdf", content: a[:href])
|
119
|
+
end
|
120
|
+
end
|
121
|
+
|
122
|
+
def edition_translation_link(edition)
|
123
|
+
translation_link.select { |l| l[:ed] == edition }.map { |l| l[:link] }
|
124
|
+
end
|
125
|
+
|
126
|
+
def translation_link
|
127
|
+
return [] unless @doc
|
128
|
+
|
129
|
+
@translation_link ||= @doc.xpath("//main/article/div/div/standard/div[2]/ul/li").map do |l|
|
130
|
+
a = l.at("span/a")
|
131
|
+
id = l.at("span").text
|
132
|
+
%r{\w+[\d-]+,\s(?<lang>\w+)\sversion,\s(?<ed>[\d.]+)(?:st|nd|rd|th)\sedition} =~ id
|
133
|
+
case lang
|
134
|
+
when "Japanese"
|
135
|
+
{ ed: ed, link: RelatonBib::TypedUri.new(type: "pdf", language: "ja", script: "Jpan", content: a[:href]) }
|
136
|
+
end
|
137
|
+
end.compact
|
138
|
+
end
|
139
|
+
|
140
|
+
# @return [Array<Hash>]
|
141
|
+
def fetch_title
|
142
|
+
@doc.xpath('//p[@class="ecma-item-short-description"]').map do |t|
|
143
|
+
{ content: t.text.strip, language: "en", script: "Latn" }
|
144
|
+
end
|
145
|
+
end
|
146
|
+
|
147
|
+
# @return [Array<RelatonBib::FormattedString>]
|
148
|
+
def fetch_abstract
|
149
|
+
content = @doc.xpath('//div[@class="ecma-item-content"]/p').map do |a|
|
150
|
+
a.text.strip.squeeze(" ").gsub(/\r\n/, "")
|
151
|
+
end.join "\n"
|
152
|
+
return [] if content.empty?
|
153
|
+
|
154
|
+
[RelatonBib::FormattedString.new(content: content, language: "en", script: "Latn")]
|
155
|
+
end
|
156
|
+
|
157
|
+
# @return [Array<RelatonBib::BibliographicDate>]
|
158
|
+
def fetch_date
|
159
|
+
@doc.xpath('//p[@class="ecma-item-edition"]').map do |d|
|
160
|
+
date = d.text.split(", ").last
|
161
|
+
RelatonBib::BibliographicDate.new type: "published", on: date
|
162
|
+
end
|
163
|
+
end
|
164
|
+
|
165
|
+
# @return [Array<Hash>]
|
166
|
+
def fetch_relation # rubocop:disable Metrics/AbcSize, Metrics/MethodLength, Metrics/CyclomaticComplexity
|
167
|
+
@doc.xpath("//ul[@class='ecma-item-archives']/li").map do |rel|
|
168
|
+
ref, ed, date, vol = edition_id_parts rel.at("span").text
|
169
|
+
next if ed.nil? || ed.empty?
|
170
|
+
|
171
|
+
fref = RelatonBib::FormattedRef.new content: ref, language: "en", script: "Latn"
|
172
|
+
docid = RelatonBib::DocumentIdentifier.new(type: "ECMA", id: ref, primary: true)
|
173
|
+
link = rel.xpath("span/a").map { |l| RelatonBib::TypedUri.new type: "pdf", content: l[:href] }
|
174
|
+
edition = RelatonBib::Edition.new content: ed
|
175
|
+
extent = vol && [RelatonBib::Locality.new("volume", vol)]
|
176
|
+
bibitem = BibliographicItem.new(
|
177
|
+
docid: [docid], formattedref: fref, date: date, edition: edition,
|
178
|
+
link: link, extent: extent
|
179
|
+
)
|
180
|
+
{ type: "updates", bibitem: bibitem }
|
181
|
+
end.compact
|
182
|
+
end
|
183
|
+
|
184
|
+
#
|
185
|
+
# @return [RelatonBib::Edition, nil]
|
186
|
+
#
|
187
|
+
def fetch_edition
|
188
|
+
cnt = @doc.at('//p[@class="ecma-item-edition"]')&.text&.match(/^\d+(?=(?:st|nd|th|rd))/)&.to_s
|
189
|
+
RelatonBib::Edition.new(content: cnt) if cnt && !cnt.empty?
|
190
|
+
end
|
191
|
+
|
192
|
+
def contributor
|
193
|
+
org = RelatonBib::Organization.new name: "Ecma International"
|
194
|
+
[{ entity: org, role: [{ type: "publisher" }] }]
|
195
|
+
end
|
196
|
+
|
197
|
+
# @return [Array<RelatonBib::DocumentIdentifier>]
|
198
|
+
def fetch_mem_docid
|
199
|
+
code = "ECMA MEM/#{@hit.at('div[1]//p').text}"
|
200
|
+
fetch_docid code
|
201
|
+
end
|
202
|
+
|
203
|
+
def fetch_mem_date
|
204
|
+
date = @hit.at("div[2]//p").text
|
205
|
+
on = Date.strptime(date, "%B %Y").strftime "%Y-%m"
|
206
|
+
[RelatonBib::BibliographicDate.new(type: "published", on: on)]
|
207
|
+
end
|
208
|
+
|
209
|
+
def fetch_mem_title
|
210
|
+
year = @hit.at("div[1]//p").text
|
211
|
+
content = "\"Memento #{year}\" for year #{year}"
|
212
|
+
[{ content: content, language: "en", script: "Latn" }]
|
213
|
+
end
|
214
|
+
end
|
215
|
+
end
|
@@ -3,11 +3,36 @@
|
|
3
3
|
module RelatonEcma
|
4
4
|
# IETF bibliography module
|
5
5
|
module EcmaBibliography
|
6
|
+
ENDPOINT = "https://raw.githubusercontent.com/relaton/relaton-data-ecma/master/"
|
7
|
+
|
6
8
|
class << self
|
7
|
-
#
|
8
|
-
#
|
9
|
-
|
10
|
-
|
9
|
+
#
|
10
|
+
# Search for a reference on the IETF website.
|
11
|
+
#
|
12
|
+
# @param ref [String] the ECMA standard reference to look up (e..g "ECMA-6")
|
13
|
+
#
|
14
|
+
# @return [Array<Hash>]
|
15
|
+
#
|
16
|
+
def search(ref)
|
17
|
+
refparts = parse_ref ref
|
18
|
+
return false unless refparts
|
19
|
+
|
20
|
+
index = Relaton::Index.find_or_create :ECMA, url: "#{ENDPOINT}index.zip"
|
21
|
+
index.search { |row| match_ref refparts, row }
|
22
|
+
end
|
23
|
+
|
24
|
+
def parse_ref(ref)
|
25
|
+
%r{^
|
26
|
+
(?<id>ECMA(?:[\d-]+|\s\w+/\d+))
|
27
|
+
(?:\sed(?<ed>[\d.]+))?
|
28
|
+
(?:\svol(?<vol>\d+))?
|
29
|
+
}x.match ref
|
30
|
+
end
|
31
|
+
|
32
|
+
def match_ref(refparts, row)
|
33
|
+
row[:id][:id] == refparts[:id] &&
|
34
|
+
(refparts[:ed].nil? || row[:id][:ed] == refparts[:ed]) &&
|
35
|
+
(refparts[:vol].nil? || row[:id][:vol] == refparts[:vol])
|
11
36
|
end
|
12
37
|
|
13
38
|
# @param code [String] the ECMA standard Code to look up (e..g "ECMA-6")
|
@@ -16,15 +41,36 @@ module RelatonEcma
|
|
16
41
|
# @return [RelatonEcma::BibliographicItem] Relaton of reference
|
17
42
|
def get(code, _year = nil, _opts = {})
|
18
43
|
warn "[relaton-ecma] (\"#{code}\") fetching..."
|
19
|
-
result =
|
44
|
+
result = fetch_doc(code)
|
20
45
|
if result
|
21
46
|
warn "[relaton-ecma] (\"#{code}\") found #{result.docidentifier.first.id}"
|
47
|
+
# item
|
22
48
|
else
|
23
|
-
warn "[relaton-ecma] WARNING no match found online for #{code}. "\
|
49
|
+
warn "[relaton-ecma] WARNING no match found online for #{code}. " \
|
24
50
|
"The code must be exactly like it is on the standards website."
|
25
51
|
end
|
26
52
|
result
|
27
53
|
end
|
54
|
+
|
55
|
+
def compare_edition_volume(aaa, bbb)
|
56
|
+
comp = bbb[:id][:ed] <=> aaa[:id][:ed]
|
57
|
+
comp.zero? ? aaa[:id][:vol] <=> bbb[:id][:vol] : comp
|
58
|
+
end
|
59
|
+
|
60
|
+
def fetch_doc(code) # rubocop:disable Metrics/AbcSize
|
61
|
+
row = search(code).min { |a, b| compare_edition_volume a, b }
|
62
|
+
return unless row
|
63
|
+
|
64
|
+
url = "#{ENDPOINT}#{row[:file]}"
|
65
|
+
doc = OpenURI.open_uri url
|
66
|
+
hash = YAML.safe_load doc
|
67
|
+
hash["fetched"] = Date.today.to_s
|
68
|
+
BibliographicItem.from_hash hash
|
69
|
+
rescue OpenURI::HTTPError => e
|
70
|
+
return if e.io.status.first == "404"
|
71
|
+
|
72
|
+
raise RelatonBib::RequestError, "No document found for #{code} reference. #{e.message}"
|
73
|
+
end
|
28
74
|
end
|
29
75
|
end
|
30
76
|
end
|
@@ -7,6 +7,7 @@ module RelatonEcma
|
|
7
7
|
@prefix = "ECMA"
|
8
8
|
@defaultprefix = /^ECMA(-|\s)/
|
9
9
|
@idtype = "ECMA"
|
10
|
+
@datasets = %w[ecma-standards]
|
10
11
|
end
|
11
12
|
|
12
13
|
# @param code [String]
|
@@ -17,6 +18,18 @@ module RelatonEcma
|
|
17
18
|
::RelatonEcma::EcmaBibliography.get(code, date, opts)
|
18
19
|
end
|
19
20
|
|
21
|
+
#
|
22
|
+
# Fetch all the documents from a source
|
23
|
+
#
|
24
|
+
# @param [String] source source name (iec-harmonized-all, iec-harmonized-latest)
|
25
|
+
# @param [Hash] opts
|
26
|
+
# @option opts [String] :output directory to output documents
|
27
|
+
# @option opts [String] :format output format (xml, yaml, bibxml)
|
28
|
+
#
|
29
|
+
def fetch_data(_source, opts)
|
30
|
+
DataFetcher.new(**opts).fetch
|
31
|
+
end
|
32
|
+
|
20
33
|
# @param xml [String]
|
21
34
|
# @return [RelatonEcma::BibliographicItem]
|
22
35
|
def from_xml(xml)
|
data/lib/relaton_ecma/version.rb
CHANGED
data/lib/relaton_ecma.rb
CHANGED
@@ -1,13 +1,15 @@
|
|
1
1
|
require "nokogiri"
|
2
2
|
require "open-uri"
|
3
3
|
require "yaml"
|
4
|
+
require "relaton/index"
|
4
5
|
require "relaton_bib"
|
5
6
|
require "relaton_ecma/version"
|
6
7
|
require "relaton_ecma/bibliographic_item"
|
7
8
|
require "relaton_ecma/xml_parser"
|
8
9
|
require "relaton_ecma/hash_converter"
|
9
|
-
require "relaton_ecma/scrapper"
|
10
10
|
require "relaton_ecma/ecma_bibliography"
|
11
|
+
require "relaton_ecma/data_fetcher"
|
12
|
+
require "relaton_ecma/data_parser"
|
11
13
|
|
12
14
|
module RelatonEcma
|
13
15
|
# Returns hash of XML reammar
|
data/relaton_ecma.gemspec
CHANGED
@@ -27,15 +27,10 @@ Gem::Specification.new do |spec| # rubocop:disable Metrics/BlockLength
|
|
27
27
|
spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
|
28
28
|
spec.require_paths = ["lib"]
|
29
29
|
|
30
|
-
# spec.add_development_dependency "debase"
|
31
30
|
spec.add_development_dependency "equivalent-xml", "~> 0.6"
|
32
|
-
spec.add_development_dependency "pry-byebug"
|
33
31
|
spec.add_development_dependency "rake", "~> 10.0"
|
34
|
-
# spec.add_development_dependency "ruby-debug-ide"
|
35
|
-
spec.add_development_dependency "ruby-jing"
|
36
|
-
spec.add_development_dependency "simplecov"
|
37
|
-
spec.add_development_dependency "vcr"
|
38
|
-
spec.add_development_dependency "webmock"
|
39
32
|
|
33
|
+
spec.add_dependency "mechanize", "~> 2.7"
|
40
34
|
spec.add_dependency "relaton-bib", "~> 1.14.0"
|
35
|
+
spec.add_dependency "relaton-index", "~> 0.1.6"
|
41
36
|
end
|
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: relaton-ecma
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 1.14.
|
4
|
+
version: 1.14.1
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Ribose Inc.
|
8
8
|
autorequire:
|
9
9
|
bindir: exe
|
10
10
|
cert_chain: []
|
11
|
-
date:
|
11
|
+
date: 2023-04-27 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: equivalent-xml
|
@@ -24,20 +24,6 @@ dependencies:
|
|
24
24
|
- - "~>"
|
25
25
|
- !ruby/object:Gem::Version
|
26
26
|
version: '0.6'
|
27
|
-
- !ruby/object:Gem::Dependency
|
28
|
-
name: pry-byebug
|
29
|
-
requirement: !ruby/object:Gem::Requirement
|
30
|
-
requirements:
|
31
|
-
- - ">="
|
32
|
-
- !ruby/object:Gem::Version
|
33
|
-
version: '0'
|
34
|
-
type: :development
|
35
|
-
prerelease: false
|
36
|
-
version_requirements: !ruby/object:Gem::Requirement
|
37
|
-
requirements:
|
38
|
-
- - ">="
|
39
|
-
- !ruby/object:Gem::Version
|
40
|
-
version: '0'
|
41
27
|
- !ruby/object:Gem::Dependency
|
42
28
|
name: rake
|
43
29
|
requirement: !ruby/object:Gem::Requirement
|
@@ -53,75 +39,47 @@ dependencies:
|
|
53
39
|
- !ruby/object:Gem::Version
|
54
40
|
version: '10.0'
|
55
41
|
- !ruby/object:Gem::Dependency
|
56
|
-
name:
|
42
|
+
name: mechanize
|
57
43
|
requirement: !ruby/object:Gem::Requirement
|
58
44
|
requirements:
|
59
|
-
- - "
|
60
|
-
- !ruby/object:Gem::Version
|
61
|
-
version: '0'
|
62
|
-
type: :development
|
63
|
-
prerelease: false
|
64
|
-
version_requirements: !ruby/object:Gem::Requirement
|
65
|
-
requirements:
|
66
|
-
- - ">="
|
67
|
-
- !ruby/object:Gem::Version
|
68
|
-
version: '0'
|
69
|
-
- !ruby/object:Gem::Dependency
|
70
|
-
name: simplecov
|
71
|
-
requirement: !ruby/object:Gem::Requirement
|
72
|
-
requirements:
|
73
|
-
- - ">="
|
74
|
-
- !ruby/object:Gem::Version
|
75
|
-
version: '0'
|
76
|
-
type: :development
|
77
|
-
prerelease: false
|
78
|
-
version_requirements: !ruby/object:Gem::Requirement
|
79
|
-
requirements:
|
80
|
-
- - ">="
|
81
|
-
- !ruby/object:Gem::Version
|
82
|
-
version: '0'
|
83
|
-
- !ruby/object:Gem::Dependency
|
84
|
-
name: vcr
|
85
|
-
requirement: !ruby/object:Gem::Requirement
|
86
|
-
requirements:
|
87
|
-
- - ">="
|
45
|
+
- - "~>"
|
88
46
|
- !ruby/object:Gem::Version
|
89
|
-
version: '
|
90
|
-
type: :
|
47
|
+
version: '2.7'
|
48
|
+
type: :runtime
|
91
49
|
prerelease: false
|
92
50
|
version_requirements: !ruby/object:Gem::Requirement
|
93
51
|
requirements:
|
94
|
-
- - "
|
52
|
+
- - "~>"
|
95
53
|
- !ruby/object:Gem::Version
|
96
|
-
version: '
|
54
|
+
version: '2.7'
|
97
55
|
- !ruby/object:Gem::Dependency
|
98
|
-
name:
|
56
|
+
name: relaton-bib
|
99
57
|
requirement: !ruby/object:Gem::Requirement
|
100
58
|
requirements:
|
101
|
-
- - "
|
59
|
+
- - "~>"
|
102
60
|
- !ruby/object:Gem::Version
|
103
|
-
version:
|
104
|
-
type: :
|
61
|
+
version: 1.14.0
|
62
|
+
type: :runtime
|
105
63
|
prerelease: false
|
106
64
|
version_requirements: !ruby/object:Gem::Requirement
|
107
65
|
requirements:
|
108
|
-
- - "
|
66
|
+
- - "~>"
|
109
67
|
- !ruby/object:Gem::Version
|
110
|
-
version:
|
68
|
+
version: 1.14.0
|
111
69
|
- !ruby/object:Gem::Dependency
|
112
|
-
name: relaton-
|
70
|
+
name: relaton-index
|
113
71
|
requirement: !ruby/object:Gem::Requirement
|
114
72
|
requirements:
|
115
73
|
- - "~>"
|
116
74
|
- !ruby/object:Gem::Version
|
117
|
-
version: 1.
|
75
|
+
version: 0.1.6
|
118
76
|
type: :runtime
|
119
77
|
prerelease: false
|
120
78
|
version_requirements: !ruby/object:Gem::Requirement
|
121
79
|
requirements:
|
122
80
|
- - "~>"
|
123
81
|
- !ruby/object:Gem::Version
|
124
|
-
version: 1.
|
82
|
+
version: 0.1.6
|
125
83
|
description: "RelatonEcma: retrieve ECMA Standards for bibliographic use \nusing the
|
126
84
|
BibliographicItem model.\n"
|
127
85
|
email:
|
@@ -148,10 +106,11 @@ files:
|
|
148
106
|
- grammars/relaton-ecma.rng
|
149
107
|
- lib/relaton_ecma.rb
|
150
108
|
- lib/relaton_ecma/bibliographic_item.rb
|
109
|
+
- lib/relaton_ecma/data_fetcher.rb
|
110
|
+
- lib/relaton_ecma/data_parser.rb
|
151
111
|
- lib/relaton_ecma/ecma_bibliography.rb
|
152
112
|
- lib/relaton_ecma/hash_converter.rb
|
153
113
|
- lib/relaton_ecma/processor.rb
|
154
|
-
- lib/relaton_ecma/scrapper.rb
|
155
114
|
- lib/relaton_ecma/version.rb
|
156
115
|
- lib/relaton_ecma/xml_parser.rb
|
157
116
|
- relaton_ecma.gemspec
|
@@ -175,7 +134,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
|
|
175
134
|
- !ruby/object:Gem::Version
|
176
135
|
version: '0'
|
177
136
|
requirements: []
|
178
|
-
rubygems_version: 3.
|
137
|
+
rubygems_version: 3.4.9
|
179
138
|
signing_key:
|
180
139
|
specification_version: 4
|
181
140
|
summary: 'RelatonIetf: retrieve ECMA Standards for bibliographic use using the BibliographicItem
|
@@ -1,29 +0,0 @@
|
|
1
|
-
module RelatonEcma
|
2
|
-
module Scrapper
|
3
|
-
ENDPOINT = "https://raw.githubusercontent.com/relaton/relaton-data-ecma/master/data/".freeze
|
4
|
-
|
5
|
-
class << self
|
6
|
-
# @param code [String]
|
7
|
-
# @return [RelatonBib::BibliographicItem]
|
8
|
-
def scrape_page(code)
|
9
|
-
url = "#{ENDPOINT}#{code.gsub(/[\/\s]/, '_').upcase}.yaml"
|
10
|
-
parse_page url
|
11
|
-
rescue OpenURI::HTTPError => e
|
12
|
-
return if e.io.status.first == "404"
|
13
|
-
|
14
|
-
raise RelatonBib::RequestError, "No document found for #{code} reference. #{e.message}"
|
15
|
-
end
|
16
|
-
|
17
|
-
private
|
18
|
-
|
19
|
-
# @param url [String]
|
20
|
-
# @retrurn [RelatonEcma::BibliographicItem]
|
21
|
-
def parse_page(url)
|
22
|
-
doc = OpenURI.open_uri url
|
23
|
-
hash = YAML.safe_load(doc)
|
24
|
-
hash["fetched"] = Date.today.to_s
|
25
|
-
BibliographicItem.from_hash hash
|
26
|
-
end
|
27
|
-
end
|
28
|
-
end
|
29
|
-
end
|