@yottagraph-app/data-model-skill 0.0.20 → 0.0.22
package/package.json
CHANGED
@@ -0,0 +1,140 @@
# Data Dictionary: DOT Authority

## Source Overview

FMCSA Carrier Authority data from the "Carrier All With History" Socrata dataset (`6eyk-hxee`), published by the Federal Motor Carrier Safety Administration. Contains operating authority records for motor carriers, brokers, and freight forwarders, including docket numbers, authority statuses (common, contract, broker), insurance requirements, and business contact information.

Bulk-refreshed periodically by FMCSA; no per-row timestamps. The streamer performs a single full download per run.

| Pipeline | `Record.Source` |
|----------|----------------|
| Carrier Authority | `dotauthority` |

This is a companion dataset to DOT Census (`dotcensus`), which covers company registration, fleet size, and safety ratings. Authority covers operating authority statuses and insurance/bond requirements. Both sources share the `usdot_number` strong ID, enabling entity merging across datasets.

---

## Entity Types

### `organization`

A motor carrier, broker, or freight forwarder registered with FMCSA.

- Primary key: `usdot_number` (USDOT number assigned by FMCSA)
- Entity resolver: named entity. Strong ID = `usdot_number`. Disambiguation via legal name and business address.
- Entity name: DBA name if present (with legal name as the owner entity); otherwise the legal name.

### `person`

A named individual who is the legal owner of a carrier operating under a DBA name.

- Only created when the carrier has a DBA name different from the legal name, and the legal name appears to be a person (not an organization).
- Entity resolver: named entity. No strong ID.

### `location`

The business location of a carrier, derived from city, state, and country fields.

- Entity resolver: named entity. No strong ID.
- Entity name: formatted as "City, State" or "City, State, Country".

---

## Properties

### Organization: Identity and Registration

* `usdot_number`
    * Definition: USDOT number uniquely identifying the registered motor carrier, broker, or shipper.
    * Examples: `1234567`, `3456789`
    * Derivation: `dot_number` field from the Socrata API. Carriers with DOT number `00000000` or empty are treated as not having a DOT number.

* `dot_docket_number`
    * Definition: FMCSA docket number (MC/FF/MX number) for the carrier's operating authority.
    * Examples: `MC012892`, `MC599911`
    * Derivation: `docket_number` field from the Socrata API.

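
The `usdot_number` derivation above can be sketched as a small normalizer. This is a minimal illustration, not the streamer's code: the function name is hypothetical, and trimming surrounding whitespace is an assumption the dictionary does not state.

```python
from typing import Optional


def normalize_usdot_number(raw: Optional[str]) -> Optional[str]:
    """Return the DOT number, or None when the row has no usable value."""
    if raw is None:
        return None
    value = raw.strip()  # assumption: surrounding whitespace is not significant
    # Empty values and the all-zero placeholder mean "no DOT number".
    if value == "" or value == "00000000":
        return None
    return value
```

A row that normalizes to `None` has no strong ID, so it cannot be merged with DOT Census records through `usdot_number`.
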
### Organization: Business Contact

* `address`
    * Definition: Formatted business street address of the carrier.
    * Examples: `1200 SEABOARD DR, HIALEAH, FL 33010`
    * Derivation: Composed from `bus_street_po`, `bus_city`, `bus_state_code`, `bus_zip_code`, and `bus_ctry_code` fields, formatted as "Street, City, State Zip".

* `dot_phone_number`
    * Definition: Primary business phone number of the carrier.
    * Examples: `5551234567`
    * Derivation: `bus_telno` field from the Socrata API.

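
As a rough sketch of the "Street, City, State Zip" composition described for `address`: the helper name and the handling of missing parts are assumptions. The dictionary lists `bus_ctry_code` as an input but does not say how it appears in the output, so this sketch leaves it out.

```python
def format_address(row: dict) -> str:
    """Compose 'Street, City, State Zip' from the business address fields."""
    street = (row.get("bus_street_po") or "").strip()
    city = (row.get("bus_city") or "").strip()
    state = (row.get("bus_state_code") or "").strip()
    zip_code = (row.get("bus_zip_code") or "").strip()
    # State and zip share one comma-separated segment; skip whichever is missing.
    state_zip = " ".join(part for part in (state, zip_code) if part)
    # Drop empty segments so no stray separators appear (assumed behavior).
    return ", ".join(part for part in (street, city, state_zip) if part)
```

Given the example row from the dictionary, the function reproduces the documented format.
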
### Organization: Authority Status

* `dot_common_authority_status`
    * Definition: Status of the carrier's common carrier authority.
    * Examples: `A (Active)`, `I (Inactive)`, `N (None)`
    * Derivation: `common_stat` field. Single-letter code expanded to include the human-readable label.

* `dot_contract_authority_status`
    * Definition: Status of the carrier's contract carrier authority.
    * Examples: `A (Active)`, `I (Inactive)`, `N (None)`
    * Derivation: `contract_stat` field. Same code expansion as common authority.

* `dot_broker_authority_status`
    * Definition: Status of the carrier's broker authority.
    * Examples: `A (Active)`, `I (Inactive)`, `N (None)`
    * Derivation: `broker_stat` field. Same code expansion as common authority.

* `dot_authority_type`
    * Definition: Authority type flags indicating which categories the carrier is authorized for.
    * Examples: `Property`, `Passenger`, `Household Goods`, `Property; Passenger`
    * Derivation: Composed from five checkbox fields (`property_chk`, `passenger_chk`, `hhg_chk`, `private_auth_chk`, `enterprise_chk`). Values with `Y` are included, joined with `; `.

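
The code expansion and flag composition above might look like the following sketch. The function names are hypothetical. The labels `Private` and `Enterprise` are assumptions (the dictionary only shows `Property`, `Passenger`, and `Household Goods` in its examples), as is returning `None` for unrecognized status codes.

```python
from typing import Optional

STATUS_LABELS = {"A": "Active", "I": "Inactive", "N": "None"}

# (source field, emitted label); the last two labels are assumed, not documented.
AUTHORITY_FLAGS = [
    ("property_chk", "Property"),
    ("passenger_chk", "Passenger"),
    ("hhg_chk", "Household Goods"),
    ("private_auth_chk", "Private"),
    ("enterprise_chk", "Enterprise"),
]


def expand_status(code: Optional[str]) -> Optional[str]:
    """Expand a single-letter status code, e.g. 'A' becomes 'A (Active)'."""
    code = (code or "").strip().upper()
    label = STATUS_LABELS.get(code)
    return f"{code} ({label})" if label else None


def authority_type(row: dict) -> Optional[str]:
    """Join the labels of all checkbox fields set to 'Y' with '; '."""
    labels = [
        label
        for field, label in AUTHORITY_FLAGS
        if (row.get(field) or "").strip().upper() == "Y"
    ]
    return "; ".join(labels) if labels else None
```

The same `expand_status` helper would serve `common_stat`, `contract_stat`, and `broker_stat`, since all three share one code expansion.
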
### Organization: Insurance and Bonding

* `dot_min_coverage_amount`
    * Definition: Minimum insurance coverage amount required, in thousands of dollars.
    * Examples: `00750`, `05000`
    * Derivation: `min_cov_amount` field. Values of `00000` are suppressed.

* `dot_cargo_insurance_required`
    * Definition: Whether cargo insurance is required for this carrier.
    * Examples: `1.0` (yes), `0.0` (no)
    * Derivation: `cargo_req` field. `Y` → 1.0, `N` → 0.0. Other values are not emitted.
    * Note: Stored as float per KG boolean convention.

* `dot_bond_required`
    * Definition: Whether a surety bond is required for this carrier.
    * Examples: `1.0` (yes), `0.0` (no)
    * Derivation: `bond_req` field. `Y` → 1.0, `N` → 0.0. Other values are not emitted.
    * Note: Stored as float per KG boolean convention.

* `dot_bipd_insurance_on_file`
    * Definition: Bodily injury / property damage insurance filing amount on file.
    * Examples: `01000`, `05000`
    * Derivation: `bipd_file` field. Values of `00000` are suppressed.

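
The insurance derivations above reduce to two small conversions, sketched here with hypothetical helper names. The last helper only illustrates the "thousands of dollars" unit stated for `dot_min_coverage_amount`; the property itself is stored as the zero-padded string.

```python
from typing import Optional


def coverage_amount(raw: Optional[str]) -> Optional[str]:
    """Keep the zero-padded amount string; suppress empty and '00000' values."""
    value = (raw or "").strip()
    return None if value in ("", "00000") else value


def yes_no_to_float(raw: Optional[str]) -> Optional[float]:
    """Map 'Y' to 1.0 and 'N' to 0.0; any other value is not emitted."""
    value = (raw or "").strip().upper()
    if value == "Y":
        return 1.0
    if value == "N":
        return 0.0
    return None


def thousands_to_dollars(amount: str) -> int:
    """Interpret a stored amount such as '00750' as whole dollars."""
    return int(amount) * 1000
```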

---

## Entity Relationships Summary

```
organization ──[is_located_at]──→ location
person ──[doing_business_as]──→ organization
organization ──[doing_business_as]──→ organization
```

The `doing_business_as` relationship is created when a carrier's legal name differs from its DBA name. The legal entity (person or organization, determined by name heuristics) is the subject; the DBA organization is the target. The DBA organization is also the primary record subject carrying all authority properties.

The `is_located_at` relationship is created when both city and state are present. The target is a location entity named "City, State" (or "City, State, Country" if the country code is present).

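
To make the two relationship rules concrete, here is a minimal sketch under stated assumptions. The function names and the tuple layout are hypothetical, and the person-versus-organization name heuristic is not described in this dictionary, so the caller supplies its result as `legal_is_person`.

```python
from typing import List, Optional, Tuple

# (subject flavor, subject name, relationship, target flavor, target name)
Edge = Tuple[str, str, str, str, str]


def location_name(city: str, state: str, country: str = "") -> Optional[str]:
    """Return 'City, State' or 'City, State, Country'; None without city and state."""
    city, state, country = ((s or "").strip() for s in (city, state, country))
    if not city or not state:
        return None
    return ", ".join([city, state] + ([country] if country else []))


def build_edges(legal_name: str, dba_name: str, city: str, state: str,
                country: str, legal_is_person: bool) -> Tuple[str, List[Edge]]:
    """Return the primary organization name and the edges for one carrier row."""
    legal_name = (legal_name or "").strip()
    dba_name = (dba_name or "").strip()
    primary = dba_name or legal_name  # the DBA organization carries the properties
    edges: List[Edge] = []
    if dba_name and legal_name and dba_name != legal_name:
        owner_flavor = "person" if legal_is_person else "organization"
        edges.append((owner_flavor, legal_name, "doing_business_as", "organization", primary))
    loc = location_name(city, state, country)
    if loc:
        edges.append(("organization", primary, "is_located_at", "location", loc))
    return primary, edges
```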

---

## Source Fields Not Mapped

The following Socrata fields are present in the API response but not currently mapped to KG properties:

- `common_app_pend`, `contract_app_pend`, `broker_app_pend` -- application pending flags
- `common_rev_pend`, `contract_rev_pend`, `broker_rev_pend` -- revocation pending flags
- `cargo_file` -- cargo insurance filing amount
- `bond_file` -- bond filing amount

These were omitted as lower priority. They could be added as future enhancements if needed for analysis.
@@ -0,0 +1,161 @@
# Dataset schema for FMCSA Carrier Authority (All With History).
#
# Source: https://data.transportation.gov/Trucking-and-Motorcoaches/Carrier-All-With-History/6eyk-hxee
# Bulk-refreshed periodically by FMCSA; no per-row timestamps.
#
# This schema describes motor carrier operating authority records including
# docket information, authority statuses, insurance requirements, and
# business contact information.
name: "dotauthority"
description: "FMCSA carrier operating authority data including docket numbers, authority statuses, insurance requirements, and business contact information from the Carrier All With History file"

extraction:
  flavors: closed
  properties: closed
  relationships: closed
  attributes: closed
  events: closed

flavors:
  - name: "organization"
    description: "A particular business, institution, or organization such as a corporation, university, government agency, or non-profit"
    display_name: "Organization"
    mergeability: not_mergeable
    strong_id_properties: ["usdot_number"]
    passive: true

  - name: "person"
    description: "A named individual such as a business owner, executive, or public figure"
    display_name: "Person"
    mergeability: not_mergeable
    passive: true

  - name: "location"
    description: "A specific named geographic location such as a city, country, region, or landmark"
    display_name: "Location"
    mergeability: not_mergeable
    examples: ["New York City", "San Francisco", "North America", "Bakery Square"]
    passive: true

properties:
  - name: "usdot_number"
    type: string
    description: "USDOT number assigned by FMCSA, uniquely identifying a registered motor carrier, broker, or shipper"
    display_name: "USDOT Number"
    mergeability: not_mergeable
    domain_flavors: ["organization"]
    passive: true

  - name: "dot_docket_number"
    type: string
    description: "FMCSA docket number (MC number) for operating authority"
    display_name: "Docket Number"
    mergeability: not_mergeable
    domain_flavors: ["organization"]
    examples: ["MC012892", "MC599911"]
    passive: true

  - name: "address"
    type: string
    description: "Physical street address of the entity"
    display_name: "Address"
    mergeability: not_mergeable
    domain_flavors: ["organization"]
    examples: ["1200 SEABOARD DR, HIALEAH, FL 33010"]
    passive: true

  - name: "dot_phone_number"
    type: string
    description: "Primary phone number of the carrier"
    display_name: "Phone Number"
    mergeability: not_mergeable
    domain_flavors: ["organization"]
    examples: ["5551234567"]
    passive: true

  - name: "dot_common_authority_status"
    type: string
    description: "Status of common carrier authority (A=Active, I=Inactive, N=None)"
    display_name: "Common Authority Status"
    mergeability: not_mergeable
    domain_flavors: ["organization"]
    examples: ["A (Active)", "I (Inactive)", "N (None)"]
    passive: true

  - name: "dot_contract_authority_status"
    type: string
    description: "Status of contract carrier authority (A=Active, I=Inactive, N=None)"
    display_name: "Contract Authority Status"
    mergeability: not_mergeable
    domain_flavors: ["organization"]
    examples: ["A (Active)", "I (Inactive)", "N (None)"]
    passive: true

  - name: "dot_broker_authority_status"
    type: string
    description: "Status of broker authority (A=Active, I=Inactive, N=None)"
    display_name: "Broker Authority Status"
    mergeability: not_mergeable
    domain_flavors: ["organization"]
    examples: ["A (Active)", "I (Inactive)", "N (None)"]
    passive: true

  - name: "dot_authority_type"
    type: string
    description: "Carrier authority type flags indicating property, passenger, household goods, private, or enterprise authorization"
    display_name: "Authority Type"
    mergeability: not_mergeable
    domain_flavors: ["organization"]
    examples: ["Property", "Passenger", "Household Goods", "Property; Passenger"]
    passive: true

  - name: "dot_min_coverage_amount"
    type: string
    description: "Minimum insurance coverage amount required (in thousands of dollars)"
    display_name: "Minimum Coverage Amount"
    mergeability: not_mergeable
    domain_flavors: ["organization"]
    examples: ["00750", "05000"]
    passive: true

  - name: "dot_cargo_insurance_required"
    type: float
    description: "Whether cargo insurance is required (1.0 = yes, 0.0 = no)"
    display_name: "Cargo Insurance Required"
    mergeability: not_mergeable
    domain_flavors: ["organization"]
    passive: true

  - name: "dot_bond_required"
    type: float
    description: "Whether a surety bond is required (1.0 = yes, 0.0 = no)"
    display_name: "Bond Required"
    mergeability: not_mergeable
    domain_flavors: ["organization"]
    passive: true

  - name: "dot_bipd_insurance_on_file"
    type: string
    description: "Bodily injury / property damage insurance filing amount"
    display_name: "BIPD Insurance On File"
    mergeability: not_mergeable
    domain_flavors: ["organization"]
    examples: ["01000", "05000"]
    passive: true

relationships:
  - name: "doing_business_as"
    description: "A legal entity is doing business as (DBA) an organization"
    display_name: "Doing Business As"
    mergeability: not_mergeable
    domain_flavors: ["person", "organization"]
    target_flavors: ["organization"]
    passive: true

  - name: "is_located_at"
    description: "An entity is located at, operates in, resides in, is headquartered in, was born in, visits, or died in a location"
    display_name: "Located At"
    mergeability: not_mergeable
    domain_flavors: ["organization"]
    target_flavors: ["location"]
    passive: true
@@ -0,0 +1,172 @@
# Data Dictionary: Patents (Google Patents Public Datasets / BigQuery)

*Last updated: 2026-04-29; aligned with `patentsBQQueryBase` / streamer in `moongoose/fetch/patents_streamer.go` (US-only SQL, English title/abstract, no publication dedupe).*

## Purpose / Source Overview

This source ingests **granted US patent publications** from the Google-hosted BigQuery table `patents-public-data.patents.publications`. Each row is atomized into one **patent** record carrying its own metadata (publication number, grant date, title, abstract), CPC classifications (multi-valued: code + human-readable description), and **person** / **organization** edges for inventors and assignees. Data is structured from BigQuery only (no LLM extraction). The CPC code → description map is loaded by the streamer from the public BigQuery table `patents-public-data.cpc.definition` and refreshed at the start of each polling cycle (see "Reference Data" below).

The upstream query scans `publications` with **`WHERE p.country_code = 'US'`** (hardcoded in SQL, not a stream arg), **`grant_date`** in the current poll window, **`ORDER BY grant_date, publication_number`**, and an optional **`LIMIT`** when `maxPatents` is set. It selects **`publication_number`**, **`country_code`**, dates, **`kind_code`**, English-only **`title`** / **`title_language`** and **`abstract`** / **`abstract_language`** (subqueries over `title_localized` / `abstract_localized` with `WHERE language = 'en'`), and aggregated CPC, inventor, assignee, and citation lists. There is **no `QUALIFY`** deduplication in SQL and **no Go-side deduplication** of publication numbers: each BigQuery row returned for the window is processed as one publication. (In the public dataset, `publication_number` is unique per row at table scale; if upstream ever returned duplicates within a window, they would be emitted twice.)

| `Record.Source` value | Meaning |
|----------------------|---------|
| `patents` | Patent publication record and its atoms |

Poll cadence and grant-date windows are configured per stream (`pollTimeMin`, `windowDays`, required `initialGrantDateMin`). Optional: `maxPatents`, `batchSize`, `projectId`.

## Entity Types

### `patent` (patent publication)

Represents one patent publication identified by Google’s publication number (for example `US-12345678-B2`). The `patent` flavor is distinct from the generic `document` flavor used by other sources (news, EDGAR filings) so that queries can filter to patents directly.

- **Primary key:** `patent_publication_number` (strong ID on the subject entity)

### `person` (inventor)

A named inventor appearing on the publication’s harmonized inventor list.

- **Primary key:** none in source data; entity resolution uses mergeable name + disambiguation snippet (patent publication context).

### `organization` (assignee)

A named assignee from the harmonized assignee list (typically a company).

- **Primary key:** none in source data; mergeable name + snippet for resolution.

## Properties

### On `patent`

* `patent_publication_number`
    * **Definition:** Canonical publication identifier from the `publication_number` field.
    * **Examples:** `US-12345678-B2`
    * **Derivation:** BigQuery `publication_number`, copied verbatim.

* `patent_grant_date`
    * **Definition:** Grant date in `YYYY-MM-DD` (UTC calendar interpretation of the integer `grant_date`).
    * **Examples:** `2025-10-15`
    * **Derivation:** BigQuery `grant_date` (YYYYMMDD) reformatted.

* `patent_filing_date`
    * **Definition:** Application filing date in `YYYY-MM-DD`.
    * **Derivation:** BigQuery `filing_date` (YYYYMMDD) reformatted. Omitted when null.

* `patent_priority_date`
    * **Definition:** Earliest priority date claimed by the patent in `YYYY-MM-DD`.
    * **Derivation:** BigQuery `priority_date` (YYYYMMDD) reformatted. Omitted when null.

* `patent_kind_code`
    * **Definition:** Document kind code (e.g. `A1`, `B1`, `B2`, `C1`, `S`, `P`).
    * **Derivation:** BigQuery `kind_code`, copied verbatim.

* `patent_country`
    * **Definition:** WIPO country/office code identifying the issuing patent office. Two-letter ISO 3166-1 alpha-2 for national offices (`US`, `JP`, `DE`, …) plus regional/international codes (`EP` for the European Patent Office, `WO` for WIPO/PCT).
    * **Derivation:** BigQuery `country_code` on the selected row. The patents stream only ingests **`US`** (`WHERE` clause); the atom reflects the column when present, with a streamer fallback to `US` if empty.

* `title`
    * **Definition:** Title of the patent publication **when an English row exists** in `title_localized`; otherwise empty (no fallback to other languages).
    * **Examples:** “Example fusion reactor control”
    * **Derivation:** `(SELECT t.text FROM UNNEST(p.title_localized) AS t WHERE t.language = 'en' LIMIT 1)`.

* `patent_title_language`
    * **Definition:** ISO 639-1 lower-case code of the language used for the title atom (here **`en`** when English text exists).
    * **Derivation:** `(SELECT t.language FROM UNNEST(p.title_localized) AS t WHERE t.language = 'en' LIMIT 1)`. Emitted only when a title is present.

* `patent_abstract`
    * **Definition:** Abstract text **when an English row exists** in `abstract_localized`; otherwise empty. Not truncated in the current BigQuery SQL (full English text as stored).
    * **Derivation:** `(SELECT a.text FROM UNNEST(p.abstract_localized) AS a WHERE a.language = 'en' LIMIT 1)`.

* `patent_abstract_language`
    * **Definition:** ISO 639-1 lower-case code of the language used for the abstract atom (here **`en`** when English text exists).
    * **Derivation:** `(SELECT a.language FROM UNNEST(p.abstract_localized) AS a WHERE a.language = 'en' LIMIT 1)`. Emitted only when an abstract is present.

* `cpc_code` (multi-valued)
    * **Definition:** A direct CPC symbol assigned to this patent. One atom per code.
    * **Derivation:** `STRING_AGG` of `cpc.code` from unnested `cpc`, then split on `,`.
    * **Attribute `cpc_description`:** When the code exists in the in-memory CPC map (loaded from `patents-public-data.cpc.definition` and refreshed at the start of each polling cycle), the human-readable taxonomy path is attached as quad attribute **`cpc_description`** on that `cpc_code` atom (kgschema quad attr id 24). Codes not present in the map emit `cpc_code` only with no attribute.

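
Two of the derivations above lend themselves to a short sketch: reformatting the integer `YYYYMMDD` dates and splitting the aggregated CPC list into atoms. The function names and the dictionary shape of an atom are hypothetical, and treating `0` like null is an assumption; the streamer itself is written in Go, so this only illustrates the logic.

```python
from datetime import datetime
from typing import Dict, List, Optional


def format_bq_date(value: Optional[int]) -> Optional[str]:
    """Turn an integer like 20251015 into '2025-10-15'; None (or 0) yields None."""
    if not value:
        return None  # the property is omitted when the source date is missing
    return datetime.strptime(str(value), "%Y%m%d").strftime("%Y-%m-%d")


def cpc_atoms(codes_csv: str, cpc_map: Dict[str, str]) -> List[dict]:
    """Split the comma-joined code list and attach descriptions where known."""
    atoms = []
    for code in codes_csv.split(","):
        code = code.strip()  # trims list padding only; internal spacing is kept
        if not code:
            continue
        atom = {"property": "cpc_code", "value": code}
        description = cpc_map.get(code)
        if description:
            atom["attributes"] = {"cpc_description": description}
        atoms.append(atom)
    return atoms
```

A code missing from the map still produces a `cpc_code` atom, just without the `cpc_description` attribute.
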
## Entity Relationships Summary

```
patent ──[has_inventor]──→ person
patent ──[has_assignee]──→ organization
patent ──[cites_patent]──→ patent (other publications cited as prior art)
patent (multi-valued `cpc_code`; optional `cpc_description` quad attribute per code)
```

The `patent` subject's primary citation text is always the canonical Google
Patents URL: `https://patents.google.com/patent/<UNHYPHENATED_PUBNUM>` (Google
Patents URLs require the publication number with hyphens removed, e.g.
`US12433179B2`, `JPH01160014U`). The same form is used for `cites_patent` target
citations.

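
The URL rule above amounts to stripping hyphens from the publication number. A minimal sketch, with a hypothetical function name:

```python
def google_patents_url(publication_number: str) -> str:
    """Build the canonical Google Patents URL from a hyphenated publication number."""
    return "https://patents.google.com/patent/" + publication_number.replace("-", "")
```

For example, `US-12433179-B2` maps to the `US12433179B2` form quoted above.
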
Inventor and assignee names come from `STRING_AGG` lists on `inventor_harmonized` and `assignee_harmonized`, split on `;` after export.

## Reference Data

### CPC code → description mapping

The streamer needs a CPC code → human-readable description map to populate
the `cpc_description` quad attributes on `cpc_code` atoms. It hydrates the map by querying the public
BigQuery table **`patents-public-data.cpc.definition`**, the same project /
credentials already used to scan `patents-public-data.patents.publications`.
The query is hardcoded inside the streamer (`cpcTaxonomySQL` in
`patents_streamer.go`).

The map is **refreshed at the start of every polling cycle** so that newly
published CPC codes (the EPO updates the taxonomy a few times per year) are
picked up without requiring a streamer restart. If a refresh fails due to a
transient BQ outage and a previous successful load is cached, the streamer
logs a warning and continues with the cached map for that cycle; if the
first-ever load fails (no cache yet), the cycle is skipped without advancing
the checkpoint. Either way, no `grant_date` window is silently lost.

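
The refresh-with-fallback behavior described above can be sketched as follows. This is an illustration under assumptions, not the Go implementation: `CPCMapCache`, `run_cycle`, the `loader` callable, and the checkpoint representation are all hypothetical names.

```python
import logging
from typing import Callable, Dict, Optional

logger = logging.getLogger("cpc_refresh_sketch")


class CPCMapCache:
    """Holds the last successfully loaded CPC map."""

    def __init__(self, loader: Callable[[], Dict[str, str]]):
        self._loader = loader  # hypothetical: runs the taxonomy query
        self._cached: Optional[Dict[str, str]] = None

    def refresh(self) -> Optional[Dict[str, str]]:
        """Return the map to use this cycle, or None if the cycle must be skipped."""
        try:
            self._cached = self._loader()
        except Exception as exc:
            if self._cached is None:
                logger.error("first CPC load failed, skipping cycle: %s", exc)
                return None
            logger.warning("CPC refresh failed, using cached map: %s", exc)
        return self._cached


def run_cycle(cache: CPCMapCache, checkpoint, process_window):
    """Process one window; on a skipped cycle the checkpoint is returned unchanged."""
    cpc_map = cache.refresh()
    if cpc_map is None:
        return checkpoint
    return process_window(checkpoint, cpc_map)
```

Because a skipped cycle returns the same checkpoint, the next poll retries the same `grant_date` window rather than moving past it.
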
Codes referenced by patents but absent from the loaded map emit
`cpc_code` only, without a `cpc_description` attribute. The streamer never fails on
missing codes; gaps in the taxonomy are tolerated and reported only by
the absence of descriptions in the output.

### How the description query works

CPC is a tree: each symbol stores its own short title fragment in
`titlePart` plus a `parents` array listing every ancestor up to the root
(e.g. `H05H 1/02` → `H05H 1/00` → `H05H` → `H05` → `H`). To produce a
useful, *self-contained* description for any one code, the query walks
that ancestor chain for every leaf and concatenates each ancestor's
`titlePart` into a single root → leaf path string. So `H05H 1/02` becomes
something like *"PHYSICS > NUCLEAR PHYSICS > Plasma technique >
Generating plasma > Glow discharges"* rather than just *"Glow
discharges"*. Each patent gets one description per code without the
streamer needing the rest of the taxonomy at atomization time.

The mechanics:

1. Compute a normalized `sym_key` (whitespace stripped) for each row so
   ancestor lookups are robust to spacing differences (`H05H 1/02` vs
   `H05H1/02`).
2. For each leaf, build an ordered `path_keys` array of normalized
   ancestor keys, root-first, ending with the leaf itself.
3. Unnest `path_keys` so each (leaf, ancestor, position) is one row.
4. Join back to the definition table to recover each ancestor's title
   fragment and group on the leaf, ordering by position so the output
   reads from root to leaf.

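
The four steps above can be mirrored in memory as a rough sketch. The real logic is a BigQuery query (`cpcTaxonomySQL`), so this Python version only illustrates the idea. It assumes each row is a dict with `symbol`, `titlePart` (list of strings), and `parents` (list of ancestor symbols), and it further assumes `parents` is ordered nearest ancestor first; the function names are hypothetical.

```python
import re
from typing import Dict, List


def sym_key(symbol: str) -> str:
    """Step 1: whitespace-stripped key, so 'H05H 1/02' and 'H05H1/02' match."""
    return re.sub(r"\s+", "", symbol)


def build_descriptions(rows: List[dict]) -> Dict[str, str]:
    """Map each symbol, as published, to its root-to-leaf title path."""
    # Collapse multi-segment titles into one string per node.
    titles = {sym_key(r["symbol"]): " ".join(r["titlePart"]) for r in rows}
    descriptions = {}
    for r in rows:
        # Step 2: root-first ancestor keys, ending with the leaf itself.
        # Assumption: `parents` lists the nearest ancestor first.
        path_keys = [sym_key(p) for p in reversed(r["parents"])] + [sym_key(r["symbol"])]
        # Steps 3-4: look up each fragment in path order and drop empty ones.
        fragments = [titles.get(key, "") for key in path_keys]
        descriptions[r["symbol"]] = " > ".join(f for f in fragments if f)
    return descriptions
```

The result is keyed by the symbol with its original spacing, since only the internal lookup uses the normalized key.
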
The streamer consumes the `code` and `description` columns; `parent` is
informational and currently retained on the in-memory `CPCNode` but not
emitted as an atom.

See `cpcTaxonomySQL` in `moongoose/fetch/patents_streamer.go` for the
exact BigQuery query.

Notes:

- The whitespace-stripped `sym_key` is only used internally as the JOIN
  key between leaves and ancestors. The emitted `code` keeps its original
  spacing because patents publications retain it too (e.g. `G21B 1/00`,
  not `G21B1/00`); the streamer matches against codes-as-published.
- `ARRAY_TO_STRING(titlePart, ' ')` collapses CPC's multi-segment node
  titles into a single sentence per ancestor before they're joined with
  `> ` separators.
- `NULLIF(..., '')` filters empty fragments so trailing or leading
  separators don't leak into the final description.
@@ -0,0 +1,156 @@
# Dataset schema for patent grant publications from Google Patents Public Datasets.
#
# Structured atomization only; no LLM extraction.
name: "patents"
description: "Patent grant publications from Google Patents Public Datasets (titles, abstracts, CPC, inventors, assignees, countries, and languages)."

extraction:
  flavors: closed
  properties: closed
  relationships: closed
  attributes: closed
  events: closed

flavors:
  - name: "patent"
    description: "A granted patent publication identified by its publication number (WIPO-style number including office prefix, e.g. US-12345678-B2, EP-1234567-B1)"
    display_name: "Patent"
    mergeability: not_mergeable
    strong_id_properties: ["patent_publication_number"]
    passive: true

  - name: "person"
    description: "A real person as opposed to a fictional character, such as a CEO, politician, or public figure"
    display_name: "Person"
    mergeability: not_mergeable
    passive: true

  - name: "organization"
    description: "A particular business, institution, or organization such as a corporation, university, government agency, or non-profit"
    display_name: "Organization"
    mergeability: not_mergeable
    passive: true

properties:
  - name: "patent_publication_number"
    type: string
    description: "Canonical patent publication identifier from the publications table (office prefix and kind suffix vary by jurisdiction, e.g. US-12345678-B2, JP-6001234-A)"
    display_name: "Patent Publication Number"
    mergeability: not_mergeable
    domain_flavors: ["patent"]
    passive: true

  - name: "patent_grant_date"
    type: string
    description: "Patent grant date in YYYY-MM-DD format."
    display_name: "Patent Grant Date"
    mergeability: not_mergeable
    domain_flavors: ["patent"]
    passive: true

  - name: "patent_filing_date"
    type: string
    description: "Application filing date in YYYY-MM-DD format."
    display_name: "Patent Filing Date"
    mergeability: not_mergeable
    domain_flavors: ["patent"]
    passive: true

  - name: "patent_priority_date"
    type: string
    description: "Earliest priority date claimed by the patent in YYYY-MM-DD format."
    display_name: "Patent Priority Date"
    mergeability: not_mergeable
    domain_flavors: ["patent"]
    passive: true

  - name: "patent_kind_code"
    type: string
    description: "Publication kind code (e.g. A1, B1, B2, C1, S, P) distinguishing applications from grants and other document types"
    display_name: "Patent Kind Code"
    mergeability: not_mergeable
    domain_flavors: ["patent"]
    passive: true

  - name: "patent_country"
    type: string
    description: "WIPO country/office code identifying the issuing patent office (e.g. US, EP, JP, WO). Two-letter ISO 3166-1 alpha-2 for national offices, plus regional/international codes (EP for the European Patent Office, WO for WIPO/PCT)."
    display_name: "Patent Country"
    mergeability: not_mergeable
    domain_flavors: ["patent"]
    passive: true

  - name: "title"
    type: string
    description: "Title of the entity"
    display_name: "Title"
    mergeability: not_mergeable
    domain_flavors: ["patent"]
    passive: true

  - name: "patent_title_language"
    type: string
    description: "ISO 639-1 lower-case code of the language used for the patent's title atom (e.g. 'en', 'zh', 'ja'). Emitted only when a title is present."
    display_name: "Patent Title Language"
    mergeability: not_mergeable
    domain_flavors: ["patent"]
    passive: true

  - name: "patent_abstract"
    type: string
    description: "Patent publication abstract text."
    display_name: "Patent Abstract"
    mergeability: not_mergeable
    domain_flavors: ["patent"]
    passive: true

  - name: "patent_abstract_language"
    type: string
    description: "ISO 639-1 lower-case code of the language used for the patent's abstract atom (e.g. 'en', 'zh', 'ja'). Emitted only when an abstract is present."
    display_name: "Patent Abstract Language"
    mergeability: not_mergeable
    domain_flavors: ["patent"]
    passive: true

  - name: "cpc_code"
    type: string
    description: "CPC symbol classifying this patent (multi-valued: one atom per direct code on the publication)"
    display_name: "CPC Code"
    mergeability: not_mergeable
    domain_flavors: ["patent"]
    passive: true

relationships:
  - name: "has_inventor"
    description: "A patent lists a person as an inventor"
    display_name: "Has Inventor"
    mergeability: not_mergeable
    domain_flavors: ["patent"]
    target_flavors: ["person"]
    passive: true

  - name: "has_assignee"
    description: "A patent lists an organization as an assignee"
    display_name: "Has Assignee"
    mergeability: not_mergeable
    domain_flavors: ["patent"]
    target_flavors: ["organization"]
    passive: true

  - name: "cites_patent"
    description: "A patent cites another patent publication as prior art (non-patent literature citations are not represented as edges)"
    display_name: "Cites Patent"
    mergeability: not_mergeable
    domain_flavors: ["patent"]
    target_flavors: ["patent"]
    passive: true

# Human-readable CPC code description. Stored as a quad attribute on each
# cpc_code atom when known.
attributes:
  - property: "cpc_code"
    name: "cpc_description"
    type: string
    description: "CPC taxonomy title path for this code (from patents-public-data.cpc.definition); omitted when the code is missing from the map"
    display_name: "CPC Description"
    mergeability: not_mergeable