relaton-gb 1.20.1 → 1.20.2
- checksums.yaml +4 -4
- data/CLAUDE.md +74 -0
- data/lib/relaton_gb/t_scrapper.rb +17 -13
- data/lib/relaton_gb/version.rb +1 -1
- metadata +3 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 545e535b0db62ecab737c3067cd5b3f8f2104560e104b1dc2cbc7c920ab7238b
+  data.tar.gz: 3672fda5a7bb420fb5cc1086b60d46ce3b94f0ae4d9cf7759101f9b67cce2aac
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 74581dbf698e664f97cdfa66c1d4f948f2a14713eeb772ce20838a03f01fba1eb864afe5b6a4153f8fc80a32b30d3893bdbfa713667edd83c193f6bada37e533
+  data.tar.gz: 3141ac12e00f11a00a44d4563a48e7aa16b7b762514dc63ca828f104191614c55ec75ab6cf2df7ccc0e860ab50ccc4d42665ce61e98fd49ae51e81d2fcee55e2
data/CLAUDE.md
ADDED
@@ -0,0 +1,74 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Project Overview
+
+relaton-gb is a Ruby gem for searching and fetching Chinese GB (Guobiao) standards bibliographic data. It's part of the Relaton family of gems and scrapes standards from Chinese government websites.
+
+## Common Commands
+
+```bash
+# Install dependencies
+bin/setup
+
+# Run all tests
+bundle exec rake spec
+
+# Run a single test file
+bundle exec rspec spec/relaton_gb_spec.rb
+
+# Run a specific test by line number
+bundle exec rspec spec/relaton_gb_spec.rb:31
+
+# Interactive console for experimenting
+bin/console
+
+# Lint with RuboCop (uses Ribose OSS style guide)
+bundle exec rubocop
+
+# Install gem locally
+bundle exec rake install
+```
+
+## Architecture
+
+### Entry Point
+`RelatonGb::GbBibliography` is the main API class:
+- `search(text)` - Returns `HitCollection` of search results
+- `get(code, year, opts)` - Fetches a specific standard by identifier
+
+### Scrapers (lib/relaton_gb/)
+Each scraper handles a different standard source:
+- `GbScrapper` - National standards (GB/GJ/GS prefix) from openstd.samr.gov.cn
+- `TScrapper` - Social organization standards (T/XX prefix) from www.ttbz.org.cn
+- `Scrapper` - Common scraping methods shared via `extend`
+
+The scrapers use Mechanize for HTTP requests and Nokogiri for HTML parsing.
+
+### Domain Models
+- `GbBibliographicItem` - Main bibliographic item class, extends `RelatonIsoBib::IsoBibliographicItem`
+- `Hit` / `HitCollection` - Search result wrappers with lazy fetching via `hit.fetch`
+- `GbStandardType` - Standard classification (scope, mandate, prefix)
+- `GbTechnicalCommittee` - Technical committee information
+
+### Data Flow
+1. `GbBibliography.search` routes to appropriate scraper based on standard prefix
+2. Scraper returns `HitCollection` with basic metadata
+3. Calling `hit.fetch` scrapes the full document page and returns `GbBibliographicItem`
+4. `GbBibliographicItem` can serialize to XML, hash, or AsciiBib
+
+## Testing
+
+Tests use RSpec with VCR to record/replay HTTP interactions:
+- VCR cassettes stored in `spec/vcr_cassettes/`
+- Cassettes auto-expire after 7 days (`re_record_interval`)
+- XML output validated against RelaxNG schemas in `grammars/`
+
+To re-record a VCR cassette, delete the corresponding YAML file and run the test.
+
+## Important Notes
+
+- GB standard searches **require the year** in the identifier (e.g., `GB/T 20223-2006`, not `GB/T 20223`)
+- Standard prefixes define the type: GB/GJ/GS = national, T/XX = social organization
+- The `/T` suffix in prefix indicates "recommended" (推荐), `/Z` indicates "guidelines"
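The prefix rules listed under Important Notes can be sketched as a small parser. This is an illustration only, not code from relaton-gb: `parse_gb_ref`, `GB_REF`, and `MANDATES` are hypothetical names, the regular expression covers only the prefixes CLAUDE.md mentions, and `T/CAS 1-2020` is a made-up identifier.

```ruby
# Hypothetical sketch: classify a reference using only the rules stated in
# CLAUDE.md (GB/GJ/GS = national, T/XX = social organization, /T and /Z).
GB_REF = %r{
  \A
  (?:(?<org>T/[A-Z]+)                       # social organization, e.g. T/XX
    |(?<nat>GB|GJ|GS)(?:/(?<mandate>[TZ]))? # national, optional /T or /Z
  )
  \s+(?<number>[\d.]+)
  (?:-(?<year>\d{4}))?                      # year is optional in the syntax
  \z
}x

MANDATES = { "T" => :recommended, "Z" => :guidelines }.freeze

# Returns a hash describing the reference, or nil if it does not match.
def parse_gb_ref(ref)
  m = GB_REF.match(ref)
  return nil unless m

  {
    scope: m[:org] ? :social_organization : :national,
    mandate: MANDATES[m[:mandate]], # nil when no /T or /Z suffix is present
    number: m[:number],
    year: m[:year],                 # nil when the year is missing
  }
end

p parse_gb_ref("GB/T 20223-2006")
p parse_gb_ref("GB/T 20223")
```

Under these assumptions a caller could check `parse_gb_ref(ref)[:year]` before searching, since CLAUDE.md states that GB searches need the year in the identifier.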
data/lib/relaton_gb/t_scrapper.rb
CHANGED
@@ -1,8 +1,6 @@
 # encoding: UTF-8
 # frozen_string_literal: true
 
-require "open-uri"
-require "net/http"
 require "nokogiri"
 require "relaton_gb/scrapper"
 require "relaton_gb/gb_bibliographic_item"
@@ -19,22 +17,24 @@ module RelatonGb
       # @param text [String]
       # @return [RelatonGb::HitCollection]
       def scrape_page(text)
-
-
-
-        ).read
-        header = Nokogiri::HTML search_html
+        url = "http://www.ttbz.org.cn/Home/Standard?searchType=2&key=" \
+              "#{CGI.escape(text.tr('-', [8212].pack('U')))}"
+        doc = agent.get(url)
         xpath = '//table[contains(@class, "standard_list_table")]/tr/td/a'
         t_xpath = "../preceding-sibling::td[4]"
-        hits =
+        hits = doc.xpath(xpath).map do |h|
           docref = h.at(t_xpath).text.gsub(/â\u0080\u0094/, "-")
           status = h.at("../preceding-sibling::td[1]").text.delete "\r\n"
           pid = h[:href].sub(%r{/$}, "")
           Hit.new pid: pid, docref: docref, status: status, scrapper: self
         end
         HitCollection.new hits
-      rescue
-
+      rescue Mechanize::ResponseCodeError => e
+        return nil if e.response_code == "404"
+
+        raise RelatonBib::RequestError, "Cannot access #{url}: #{e.message}"
+      rescue Mechanize::Error => e
+        raise RelatonBib::RequestError, "Cannot access #{url}: #{e.message}"
       end
       # rubocop:enable Metrics/MethodLength, Metrics/AbcSize
 
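The search URL built in `scrape_page` can be tried in isolation. A minimal sketch, assuming only Ruby's standard `cgi` library: `ttbz_search_url` is a hypothetical helper name that does not exist in relaton-gb, and `T/AAA 001-2020` is a made-up reference used only to show the encoding.

```ruby
require "cgi"

# Hypothetical helper mirroring the URL construction added in scrape_page.
def ttbz_search_url(text)
  em_dash = [8212].pack("U") # code point 8212 is U+2014
  # Swap ASCII hyphens for U+2014, then URL-encode the search key.
  key = CGI.escape(text.tr("-", em_dash))
  "http://www.ttbz.org.cn/Home/Standard?searchType=2&key=#{key}"
end

puts ttbz_search_url("T/AAA 001-2020")
```

In the printed URL the hyphen appears as `%E2%80%94`, the percent-encoded UTF-8 bytes of U+2014, and the space appears as `+`.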
@@ -42,10 +42,14 @@ module RelatonGb
       # @return [RelatonGb::GbBibliographicItem]
       def scrape_doc(hit)
         src = "http://www.ttbz.org.cn#{hit.pid}"
-        doc =
+        doc = agent.get(src)
         GbBibliographicItem.new(**scrapped_data(doc, src, hit))
-      rescue
-        raise RelatonBib::RequestError, "Cannot access #{src}"
+      rescue Mechanize::Error => e
+        raise RelatonBib::RequestError, "Cannot access #{src}: #{e.message}"
+      end
+
+      def agent
+        @agent ||= Mechanize.new
       end
 
       private
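The new rescue clauses in `scrape_page` treat a 404 as "no result" and wrap every other fetch failure in a single request error. The sketch below reproduces that control flow with stand-in classes so it runs without the mechanize gem. `StubFetchError`, `StubResponseCodeError`, `StubRequestError`, and `fetch_or_nil` are hypothetical names standing in for `Mechanize::Error`, `Mechanize::ResponseCodeError`, `RelatonBib::RequestError`, and the scraper method. The subclass relationship between the two stub errors is an assumption about how the real classes relate.

```ruby
# Stand-in for the general fetch error.
class StubFetchError < StandardError; end

# Stand-in for the HTTP status error; assumed to be a subclass of the
# general error, so it has to be rescued first.
class StubResponseCodeError < StubFetchError
  attr_reader :response_code

  def initialize(response_code)
    @response_code = response_code
    super("#{response_code} response")
  end
end

# Stand-in for the gem-level request error.
class StubRequestError < StandardError; end

# Yields the URL to a block that performs the fetch. Returns nil on a 404
# and re-raises any other fetch failure as StubRequestError.
def fetch_or_nil(url)
  yield url
rescue StubResponseCodeError => e
  return nil if e.response_code == "404"

  raise StubRequestError, "Cannot access #{url}: #{e.message}"
rescue StubFetchError => e
  raise StubRequestError, "Cannot access #{url}: #{e.message}"
end

p fetch_or_nil("http://example.test/a") { |u| "body of #{u}" }
p fetch_or_nil("http://example.test/b") { raise StubResponseCodeError, "404" }
```

Rescuing the narrower class first matters here: with the clauses reversed, the general handler would catch the 404 and raise instead of returning `nil`.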
data/lib/relaton_gb/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: relaton-gb
 version: !ruby/object:Gem::Version
-  version: 1.20.
+  version: 1.20.2
 platform: ruby
 authors:
 - Ribose Inc.
 autorequire:
 bindir: exe
 cert_chain: []
-date:
+date: 2026-01-22 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: cnccs
@@ -108,6 +108,7 @@ files:
 - ".hound.yml"
 - ".rspec"
 - ".rubocop.yml"
+- CLAUDE.md
 - Gemfile
 - LICENSE.txt
 - README.adoc