@yottagraph-app/data-model-skill 0.0.20 → 0.0.22

package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@yottagraph-app/data-model-skill",
3
- "version": "0.0.20",
3
+ "version": "0.0.22",
4
4
  "description": "Data model skill documentation for AI agents - entity types, properties, and schemas from Lovelace fetch sources",
5
5
  "repository": {
6
6
  "type": "git",
@@ -0,0 +1,140 @@
1
+ # Data Dictionary: DOT Authority
2
+
3
+ ## Source Overview
4
+
5
+ FMCSA Carrier Authority data from the "Carrier All With History" Socrata dataset (`6eyk-hxee`), published by the Federal Motor Carrier Safety Administration. Contains operating authority records for motor carriers, brokers, and freight forwarders, including docket numbers, authority statuses (common, contract, broker), insurance requirements, and business contact information.
6
+
7
+ Bulk-refreshed periodically by FMCSA; no per-row timestamps. The streamer performs a single full download per run.
8
+
9
+ | Pipeline | `Record.Source` |
10
+ |----------|----------------|
11
+ | Carrier Authority | `dotauthority` |
12
+
13
+ This is a companion dataset to DOT Census (`dotcensus`), which covers company registration, fleet size, and safety ratings. Authority covers operating authority statuses and insurance/bond requirements. Both sources share the `usdot_number` strong ID, enabling entity merging across datasets.
14
+
15
+ ---
16
+
17
+ ## Entity Types
18
+
19
+ ### `organization`
20
+
21
+ A motor carrier, broker, or freight forwarder registered with FMCSA.
22
+
23
+ - Primary key: `usdot_number` (USDOT number assigned by FMCSA)
24
+ - Entity resolver: named entity. Strong ID = `usdot_number`. Disambiguation via legal name and business address.
25
+ - Entity name: DBA name if present (with legal name as the owner entity); otherwise the legal name.
26
+
27
+ ### `person`
28
+
29
+ A named individual who is the legal owner of a carrier operating under a DBA name.
30
+
31
+ - Created only when the carrier's DBA name differs from its legal name and the legal name appears to be a person rather than an organization.
32
+ - Entity resolver: named entity. No strong ID.
33
+
34
+ ### `location`
35
+
36
+ The business location of a carrier, derived from city, state, and country fields.
37
+
38
+ - Entity resolver: named entity. No strong ID.
39
+ - Entity name: formatted as "City, State" or "City, State, Country".
40
+
41
+ ---
42
+
43
+ ## Properties
44
+
45
+ ### Organization: Identity and Registration
46
+
47
+ * `usdot_number`
48
+ * Definition: USDOT number uniquely identifying the registered motor carrier, broker, or shipper.
49
+ * Examples: `1234567`, `3456789`
50
+ * Derivation: `dot_number` field from the Socrata API. Carriers with DOT number `00000000` or empty are treated as not having a DOT number.
51
+
52
+ * `dot_docket_number`
53
+ * Definition: FMCSA docket number (MC/FF/MX number) for the carrier's operating authority.
54
+ * Examples: `MC012892`, `MC599911`
55
+ * Derivation: `docket_number` field from the Socrata API.
56
+
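The `usdot_number` derivation above treats an empty or all-zero DOT number as missing. A minimal sketch of that rule follows; the function name `normalize_usdot` and the whitespace trimming are assumptions, not taken from the streamer.

```python
from typing import Optional


def normalize_usdot(dot_number: Optional[str]) -> Optional[str]:
    """Return the DOT number, or None when the carrier has no usable one.

    Hypothetical helper: only the `00000000` and empty cases are documented;
    trimming surrounding whitespace is an assumption.
    """
    value = (dot_number or "").strip()
    if value in ("", "00000000"):
        return None
    return value
```

A record whose `usdot_number` comes back as `None` would have no strong ID, so it could only be resolved by name and address.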
57
+ ### Organization: Business Contact
58
+
59
+ * `address`
60
+ * Definition: Formatted business street address of the carrier.
61
+ * Examples: `1200 SEABOARD DR, HIALEAH, FL 33010`
62
+ * Derivation: Composed from `bus_street_po`, `bus_city`, `bus_state_code`, `bus_zip_code`, and `bus_ctry_code` fields, formatted as "Street, City, State Zip".
63
+
64
+ * `dot_phone_number`
65
+ * Definition: Primary business phone number of the carrier.
66
+ * Examples: `5551234567`
67
+ * Derivation: `bus_telno` field from the Socrata API.
68
+
69
+ ### Organization: Authority Status
70
+
71
+ * `dot_common_authority_status`
72
+ * Definition: Status of the carrier's common carrier authority.
73
+ * Examples: `A (Active)`, `I (Inactive)`, `N (None)`
74
+ * Derivation: `common_stat` field. Single-letter code expanded to include the human-readable label.
75
+
76
+ * `dot_contract_authority_status`
77
+ * Definition: Status of the carrier's contract carrier authority.
78
+ * Examples: `A (Active)`, `I (Inactive)`, `N (None)`
79
+ * Derivation: `contract_stat` field. Same code expansion as common authority.
80
+
81
+ * `dot_broker_authority_status`
82
+ * Definition: Status of the carrier's broker authority.
83
+ * Examples: `A (Active)`, `I (Inactive)`, `N (None)`
84
+ * Derivation: `broker_stat` field. Same code expansion as common authority.
85
+
86
+ * `dot_authority_type`
87
+ * Definition: Authority type flags indicating which categories the carrier is authorized for.
88
+ * Examples: `Property`, `Passenger`, `Household Goods`, `Property; Passenger`
89
+ * Derivation: Composed from five checkbox fields (`property_chk`, `passenger_chk`, `hhg_chk`, `private_auth_chk`, `enterprise_chk`). The label for each field whose value is `Y` is included, and the labels are joined with `; `.
90
+
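The code expansion and flag composition described in this section can be sketched as below. The function names are hypothetical. The labels `Private` and `Enterprise` are guesses, since only the first three labels appear in the documented examples, and returning `None` for an unknown status code is also an assumption.

```python
from typing import Mapping, Optional

# Labels taken from the documented examples.
STATUS_LABELS = {"A": "Active", "I": "Inactive", "N": "None"}

# (source field, label) pairs. "Private" and "Enterprise" are assumed labels.
AUTHORITY_FLAGS = [
    ("property_chk", "Property"),
    ("passenger_chk", "Passenger"),
    ("hhg_chk", "Household Goods"),
    ("private_auth_chk", "Private"),
    ("enterprise_chk", "Enterprise"),
]


def expand_status(code: Optional[str]) -> Optional[str]:
    """Expand a single-letter status code, e.g. 'A' -> 'A (Active)'."""
    code = (code or "").strip().upper()
    label = STATUS_LABELS.get(code)
    # Assumption: unknown or empty codes produce no value.
    return f"{code} ({label})" if label else None


def authority_type(row: Mapping[str, str]) -> Optional[str]:
    """Join the labels of all checkbox fields set to 'Y' with '; '."""
    labels = [label for field, label in AUTHORITY_FLAGS if row.get(field) == "Y"]
    return "; ".join(labels) or None
```

Keeping the flags in an ordered list makes the output order of a multi-flag value such as `Property; Passenger` deterministic.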
91
+ ### Organization: Insurance and Bonding
92
+
93
+ * `dot_min_coverage_amount`
94
+ * Definition: Minimum insurance coverage amount required, in thousands of dollars.
95
+ * Examples: `00750` ($750,000), `05000` ($5,000,000)
96
+ * Derivation: `min_cov_amount` field. Values of `00000` are suppressed.
97
+
98
+ * `dot_cargo_insurance_required`
99
+ * Definition: Whether cargo insurance is required for this carrier.
100
+ * Examples: `1.0` (yes), `0.0` (no)
101
+ * Derivation: `cargo_req` field. `Y` → 1.0, `N` → 0.0. Other values are not emitted.
102
+ * Note: Stored as float per KG boolean convention.
103
+
104
+ * `dot_bond_required`
105
+ * Definition: Whether a surety bond is required for this carrier.
106
+ * Examples: `1.0` (yes), `0.0` (no)
107
+ * Derivation: `bond_req` field. `Y` → 1.0, `N` → 0.0. Other values are not emitted.
108
+ * Note: Stored as float per KG boolean convention.
109
+
110
+ * `dot_bipd_insurance_on_file`
111
+ * Definition: Bodily injury / property damage insurance filing amount on file.
112
+ * Examples: `01000`, `05000`
113
+ * Derivation: `bipd_file` field. Values of `00000` are suppressed.
114
+
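The insurance derivations above reduce to two small rules: `Y`/`N` flags become floats, and all-zero amounts are suppressed. A minimal sketch, assuming hypothetical helper names and assuming that an empty amount is suppressed in the same way as `00000`:

```python
from typing import Optional


def yn_to_float(value: Optional[str]) -> Optional[float]:
    """Map 'Y' -> 1.0 and 'N' -> 0.0; any other value is not emitted."""
    if value == "Y":
        return 1.0
    if value == "N":
        return 0.0
    return None


def amount_or_none(value: Optional[str]) -> Optional[str]:
    """Keep the zero-padded amount string as-is; suppress '00000'.

    Assumption: an empty string is treated like '00000'.
    """
    value = (value or "").strip()
    return None if value in ("", "00000") else value


def thousands_to_dollars(value: str) -> int:
    """Convert a 'thousands of dollars' string such as '00750' to dollars."""
    return int(value) * 1000
```

The stored property stays a string, so `thousands_to_dollars` is only a reading aid for consumers of `dot_min_coverage_amount`, which the dictionary documents as being in thousands of dollars.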
115
+ ---
116
+
117
+ ## Entity Relationships Summary
118
+
119
+ ```
120
+ organization ──[is_located_at]──→ location
121
+ person ──[doing_business_as]──→ organization
122
+ organization ──[doing_business_as]──→ organization
123
+ ```
124
+
125
+ The `doing_business_as` relationship is created when a carrier's legal name differs from its DBA name. The legal entity (person or organization, determined by name heuristics) is the subject; the DBA organization is the target. The DBA organization is also the primary record subject carrying all authority properties.
126
+
127
+ The `is_located_at` relationship is created when both city and state are present. The target is a location entity named "City, State" (or "City, State, Country" if the country code is present).
128
+
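The two relationship rules above can be sketched as follows. This is an illustration only: the name heuristic is not documented, so it is passed in as a caller-supplied function, and the `flavor:name` string encoding, the exact-match name comparison, and the function names are all assumptions.

```python
from typing import Callable, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, relationship, target)


def location_name(city: Optional[str], state: Optional[str],
                  country: Optional[str] = None) -> Optional[str]:
    """Format 'City, State' or 'City, State, Country'; needs city and state."""
    if not city or not state:
        return None
    parts = [city, state] + ([country] if country else [])
    return ", ".join(parts)


def build_relationships(legal_name: str, dba_name: Optional[str],
                        city: Optional[str], state: Optional[str],
                        country: Optional[str],
                        looks_like_person: Callable[[str], bool]) -> List[Triple]:
    """Return the doing_business_as and is_located_at triples for one carrier."""
    # The DBA organization is the primary subject when a DBA name exists.
    primary = dba_name or legal_name
    triples: List[Triple] = []
    if dba_name and dba_name != legal_name:
        owner = "person" if looks_like_person(legal_name) else "organization"
        triples.append((f"{owner}:{legal_name}", "doing_business_as",
                        f"organization:{dba_name}"))
    loc = location_name(city, state, country)
    if loc:
        triples.append((f"organization:{primary}", "is_located_at",
                        f"location:{loc}"))
    return triples
```

Passing the heuristic as a parameter keeps the sketch honest about what the dictionary leaves unspecified: only the outcome (person or organization) is documented, not how it is decided.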
129
+ ---
130
+
131
+ ## Source Fields Not Mapped
132
+
133
+ The following Socrata fields are present in the API response but not currently mapped to KG properties:
134
+
135
+ - `common_app_pend`, `contract_app_pend`, `broker_app_pend` -- application pending flags
136
+ - `common_rev_pend`, `contract_rev_pend`, `broker_rev_pend` -- revocation pending flags
137
+ - `cargo_file` -- cargo insurance filing amount
138
+ - `bond_file` -- bond filing amount
139
+
140
+ These fields were omitted as lower priority and could be added later if they are needed for analysis.
@@ -0,0 +1,161 @@
1
+ # Dataset schema for FMCSA Carrier Authority (All With History).
2
+ #
3
+ # Source: https://data.transportation.gov/Trucking-and-Motorcoaches/Carrier-All-With-History/6eyk-hxee
4
+ # Bulk-refreshed periodically by FMCSA; no per-row timestamps.
5
+ #
6
+ # This schema describes motor carrier operating authority records including
7
+ # docket information, authority statuses, insurance requirements, and
8
+ # business contact information.
9
+ name: "dotauthority"
10
+ description: "FMCSA carrier operating authority data including docket numbers, authority statuses, insurance requirements, and business contact information from the Carrier All With History file"
11
+
12
+ extraction:
13
+ flavors: closed
14
+ properties: closed
15
+ relationships: closed
16
+ attributes: closed
17
+ events: closed
18
+
19
+ flavors:
20
+ - name: "organization"
21
+ description: "A particular business, institution, or organization such as a corporation, university, government agency, or non-profit"
22
+ display_name: "Organization"
23
+ mergeability: not_mergeable
24
+ strong_id_properties: ["usdot_number"]
25
+ passive: true
26
+
27
+ - name: "person"
28
+ description: "A named individual such as a business owner, executive, or public figure"
29
+ display_name: "Person"
30
+ mergeability: not_mergeable
31
+ passive: true
32
+
33
+ - name: "location"
34
+ description: "A specific named geographic location such as a city, country, region, or landmark"
35
+ display_name: "Location"
36
+ mergeability: not_mergeable
37
+ examples: ["New York City", "San Francisco", "North America", "Bakery Square"]
38
+ passive: true
39
+
40
+ properties:
41
+ - name: "usdot_number"
42
+ type: string
43
+ description: "USDOT number assigned by FMCSA, uniquely identifying a registered motor carrier, broker, or shipper"
44
+ display_name: "USDOT Number"
45
+ mergeability: not_mergeable
46
+ domain_flavors: ["organization"]
47
+ passive: true
48
+
49
+ - name: "dot_docket_number"
50
+ type: string
51
+ description: "FMCSA docket number (MC number) for operating authority"
52
+ display_name: "Docket Number"
53
+ mergeability: not_mergeable
54
+ domain_flavors: ["organization"]
55
+ examples: ["MC012892", "MC599911"]
56
+ passive: true
57
+
58
+ - name: "address"
59
+ type: string
60
+ description: "Physical street address of the entity"
61
+ display_name: "Address"
62
+ mergeability: not_mergeable
63
+ domain_flavors: ["organization"]
64
+ examples: ["1200 SEABOARD DR, HIALEAH, FL 33010"]
65
+ passive: true
66
+
67
+ - name: "dot_phone_number"
68
+ type: string
69
+ description: "Primary phone number of the carrier"
70
+ display_name: "Phone Number"
71
+ mergeability: not_mergeable
72
+ domain_flavors: ["organization"]
73
+ examples: ["5551234567"]
74
+ passive: true
75
+
76
+ - name: "dot_common_authority_status"
77
+ type: string
78
+ description: "Status of common carrier authority (A=Active, I=Inactive, N=None)"
79
+ display_name: "Common Authority Status"
80
+ mergeability: not_mergeable
81
+ domain_flavors: ["organization"]
82
+ examples: ["A (Active)", "I (Inactive)", "N (None)"]
83
+ passive: true
84
+
85
+ - name: "dot_contract_authority_status"
86
+ type: string
87
+ description: "Status of contract carrier authority (A=Active, I=Inactive, N=None)"
88
+ display_name: "Contract Authority Status"
89
+ mergeability: not_mergeable
90
+ domain_flavors: ["organization"]
91
+ examples: ["A (Active)", "I (Inactive)", "N (None)"]
92
+ passive: true
93
+
94
+ - name: "dot_broker_authority_status"
95
+ type: string
96
+ description: "Status of broker authority (A=Active, I=Inactive, N=None)"
97
+ display_name: "Broker Authority Status"
98
+ mergeability: not_mergeable
99
+ domain_flavors: ["organization"]
100
+ examples: ["A (Active)", "I (Inactive)", "N (None)"]
101
+ passive: true
102
+
103
+ - name: "dot_authority_type"
104
+ type: string
105
+ description: "Carrier authority type flags indicating property, passenger, household goods, private, or enterprise authorization"
106
+ display_name: "Authority Type"
107
+ mergeability: not_mergeable
108
+ domain_flavors: ["organization"]
109
+ examples: ["Property", "Passenger", "Household Goods", "Property; Passenger"]
110
+ passive: true
111
+
112
+ - name: "dot_min_coverage_amount"
113
+ type: string
114
+ description: "Minimum insurance coverage amount required (in thousands of dollars)"
115
+ display_name: "Minimum Coverage Amount"
116
+ mergeability: not_mergeable
117
+ domain_flavors: ["organization"]
118
+ examples: ["00750", "05000"]
119
+ passive: true
120
+
121
+ - name: "dot_cargo_insurance_required"
122
+ type: float
123
+ description: "Whether cargo insurance is required (1.0 = yes, 0.0 = no)"
124
+ display_name: "Cargo Insurance Required"
125
+ mergeability: not_mergeable
126
+ domain_flavors: ["organization"]
127
+ passive: true
128
+
129
+ - name: "dot_bond_required"
130
+ type: float
131
+ description: "Whether a surety bond is required (1.0 = yes, 0.0 = no)"
132
+ display_name: "Bond Required"
133
+ mergeability: not_mergeable
134
+ domain_flavors: ["organization"]
135
+ passive: true
136
+
137
+ - name: "dot_bipd_insurance_on_file"
138
+ type: string
139
+ description: "Bodily injury / property damage insurance filing amount"
140
+ display_name: "BIPD Insurance On File"
141
+ mergeability: not_mergeable
142
+ domain_flavors: ["organization"]
143
+ examples: ["01000", "05000"]
144
+ passive: true
145
+
146
+ relationships:
147
+ - name: "doing_business_as"
148
+ description: "A legal entity is doing business as (DBA) an organization"
149
+ display_name: "Doing Business As"
150
+ mergeability: not_mergeable
151
+ domain_flavors: ["person", "organization"]
152
+ target_flavors: ["organization"]
153
+ passive: true
154
+
155
+ - name: "is_located_at"
156
+ description: "An entity is located at, operates in, resides in, is headquartered in, was born in, visits, or died in a location"
157
+ display_name: "Located At"
158
+ mergeability: not_mergeable
159
+ domain_flavors: ["organization"]
160
+ target_flavors: ["location"]
161
+ passive: true
@@ -0,0 +1,172 @@
1
+ # Data Dictionary: Patents (Google Patents Public Datasets / BigQuery)
2
+
3
+ *Last updated: 2026-04-29 — aligned with `patentsBQQueryBase` / streamer in `moongoose/fetch/patents_streamer.go` (US-only SQL, English title/abstract, no publication dedupe).*
4
+
5
+ ## Purpose / Source Overview
6
+
7
+ This source ingests **granted US patent publications** from the Google-hosted BigQuery table `patents-public-data.patents.publications`. Each row is atomized into one **patent** record carrying its own metadata (publication number, grant date, title, abstract), CPC classifications (multi-valued: code + human-readable description), and **person** / **organization** edges for inventors and assignees. Data is structured from BigQuery only (no LLM extraction). The CPC code → description map is loaded at streamer startup from the public BigQuery table `patents-public-data.cpc.definition` (see "Reference Data" below).
8
+
9
+ The upstream query scans `publications` with **`WHERE p.country_code = 'US'`** (hardcoded in SQL, not a stream arg), **`grant_date`** in the current poll window, **`ORDER BY grant_date, publication_number`**, and an optional **`LIMIT`** when `maxPatents` is set. It selects **`publication_number`**, **`country_code`**, dates, **`kind_code`**, English-only **`title`** / **`title_language`** and **`abstract`** / **`abstract_language`** (subqueries over `title_localized` / `abstract_localized` with `WHERE language = 'en'`), aggregated CPC and inventor/assignee/citation lists. There is **no `QUALIFY`** deduplication in SQL and **no Go-side deduplication** of publication numbers: each BigQuery row returned for the window is processed as one publication. (In the public dataset, `publication_number` is unique per row at table scale; if upstream ever returned duplicates within a window, they would be emitted twice.)
10
+
11
+ | `Record.Source` value | Meaning |
12
+ |----------------------|---------|
13
+ | `patents` | Patent publication record and its atoms |
14
+
15
+ Poll cadence and grant-date windows are configured per stream (`pollTimeMin`, `windowDays`, required `initialGrantDateMin`). Optional: `maxPatents`, `batchSize`, `projectId`.
16
+
17
+ ## Entity Types
18
+
19
+ ### `patent` (patent publication)
20
+
21
+ Represents one patent publication identified by Google’s publication number (for example `US-12345678-B2`). The `patent` flavor is distinct from the generic `document` flavor used by other sources (news, EDGAR filings) so that queries can filter to patents directly.
22
+
23
+ - **Primary key:** `patent_publication_number` (strong ID on the subject entity)
24
+
25
+ ### `person` (inventor)
26
+
27
+ A named inventor appearing on the publication’s harmonized inventor list.
28
+
29
+ - **Primary key:** none in source data; entity resolution uses mergeable name + disambiguation snippet (patent publication context).
30
+
31
+ ### `organization` (assignee)
32
+
33
+ A named assignee from the harmonized assignee list (typically a company).
34
+
35
+ - **Primary key:** none in source data; mergeable name + snippet for resolution.
36
+
37
+ ## Properties
38
+
39
+ ### On `patent`
40
+
41
+ * `patent_publication_number`
42
+ * **Definition:** Canonical publication identifier from the `publication_number` field.
43
+ * **Examples:** `US-12345678-B2`
44
+ * **Derivation:** BigQuery `publication_number`, copied verbatim.
45
+
46
+ * `patent_grant_date`
47
+ * **Definition:** Grant date in `YYYY-MM-DD` (UTC calendar interpretation of the integer `grant_date`).
48
+ * **Examples:** `2025-10-15`
49
+ * **Derivation:** BigQuery `grant_date` (YYYYMMDD) reformatted.
50
+
51
+ * `patent_filing_date`
52
+ * **Definition:** Application filing date in `YYYY-MM-DD`.
53
+ * **Derivation:** BigQuery `filing_date` (YYYYMMDD) reformatted. Omitted when null.
54
+
55
+ * `patent_priority_date`
56
+ * **Definition:** Earliest priority date claimed by the patent in `YYYY-MM-DD`.
57
+ * **Derivation:** BigQuery `priority_date` (YYYYMMDD) reformatted. Omitted when null.
58
+
59
+ * `patent_kind_code`
60
+ * **Definition:** Document kind code (e.g. `A1`, `B1`, `B2`, `C1`, `S`, `P`).
61
+ * **Derivation:** BigQuery `kind_code`, copied verbatim.
62
+
63
+ * `patent_country`
64
+ * **Definition:** WIPO country/office code identifying the issuing patent office. Two-letter ISO 3166-1 alpha-2 for national offices (`US`, `JP`, `DE`, …) plus regional/international codes (`EP` for the European Patent Office, `WO` for WIPO/PCT).
65
+ * **Derivation:** BigQuery `country_code` on the selected row. The patents stream only ingests **`US`** (`WHERE` clause); the atom reflects the column when present, with a streamer fallback to `US` if empty.
66
+
67
+ * `title`
68
+ * **Definition:** Title of the patent publication **when an English row exists** in `title_localized`; otherwise empty (no fallback to other languages).
69
+ * **Examples:** “Example fusion reactor control”
70
+ * **Derivation:** `(SELECT t.text FROM UNNEST(p.title_localized) AS t WHERE t.language = 'en' LIMIT 1)`.
71
+
72
+ * `patent_title_language`
73
+ * **Definition:** ISO 639-1 lower-case code of the language used for the title atom (here **`en`** when English text exists).
74
+ * **Derivation:** `(SELECT t.language FROM UNNEST(p.title_localized) AS t WHERE t.language = 'en' LIMIT 1)`. Emitted only when a title is present.
75
+
76
+ * `patent_abstract`
77
+ * **Definition:** Abstract text **when an English row exists** in `abstract_localized`; otherwise empty. Not truncated in the current BigQuery SQL (full English text as stored).
78
+ * **Derivation:** `(SELECT a.text FROM UNNEST(p.abstract_localized) AS a WHERE a.language = 'en' LIMIT 1)`.
79
+
80
+ * `patent_abstract_language`
81
+ * **Definition:** ISO 639-1 lower-case code of the language used for the abstract atom (here **`en`** when English text exists).
82
+ * **Derivation:** `(SELECT a.language FROM UNNEST(p.abstract_localized) AS a WHERE a.language = 'en' LIMIT 1)`. Emitted only when an abstract is present.
83
+
84
+ * `cpc_code` (multi-valued)
85
+ * **Definition:** A direct CPC symbol assigned to this patent. One atom per code.
86
+ * **Derivation:** `STRING_AGG` of `cpc.code` from unnested `cpc`, then split on `,`.
87
+ * **Attribute `cpc_description`:** When the code exists in the in-memory CPC map (loaded at streamer startup from `patents-public-data.cpc.definition`), the human-readable taxonomy path is attached as quad attribute **`cpc_description`** on that `cpc_code` atom (kgschema quad attr id 24). Codes not present in the map emit `cpc_code` only with no attribute.
88
+
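The three date properties above share one derivation: an integer `YYYYMMDD` value reformatted as `YYYY-MM-DD` and omitted when missing. A minimal sketch, assuming a hypothetical helper name and assuming that `0` should be treated like null:

```python
from datetime import datetime
from typing import Optional


def format_bq_date(value: Optional[int]) -> Optional[str]:
    """Reformat an integer YYYYMMDD date as 'YYYY-MM-DD'.

    Returns None for null input. Treating 0 as missing is an assumption.
    Raises ValueError if the digits do not form a valid calendar date.
    """
    if not value:
        return None
    return datetime.strptime(str(value), "%Y%m%d").strftime("%Y-%m-%d")
```

Parsing through `datetime` rather than slicing the string means a malformed value fails loudly instead of producing a plausible-looking but invalid date.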
89
+ ## Entity Relationships Summary
90
+
91
+ ```
92
+ patent ──[has_inventor]──→ person
93
+ patent ──[has_assignee]──→ organization
94
+ patent ──[cites_patent]──→ patent (other publications cited as prior art)
95
+ patent (multi-valued `cpc_code`; optional `cpc_description` quad attribute per code)
96
+ ```
97
+
98
+ The `patent` subject's primary citation text is always the canonical Google
99
+ Patents URL: `https://patents.google.com/patent/<UNHYPHENATED_PUBNUM>` (Google
100
+ Patents URLs require the publication number with hyphens removed — e.g.
101
+ `US12433179B2`, `JPH01160014U`). The same form is used for `cites_patent` target
102
+ citations.
103
+
104
+ Inventor and assignee names come from `STRING_AGG` lists on `inventor_harmonized` and `assignee_harmonized`, split on `;` after export.
105
+
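The citation URL rule and the name splitting described above can be sketched as below. The function names are hypothetical, and trimming whitespace and dropping empty entries after the split are assumptions.

```python
from typing import List, Optional

GOOGLE_PATENTS_BASE = "https://patents.google.com/patent/"


def patent_url(publication_number: str) -> str:
    """Build the Google Patents URL by removing hyphens from the number."""
    return GOOGLE_PATENTS_BASE + publication_number.replace("-", "")


def split_names(aggregated: Optional[str]) -> List[str]:
    """Split a ';'-joined inventor or assignee list into individual names."""
    return [name.strip() for name in (aggregated or "").split(";") if name.strip()]
```

For example, `patent_url("US-12433179-B2")` yields a URL ending in `US12433179B2`, the unhyphenated form the dictionary cites.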
106
+ ## Reference Data
107
+
108
+ ### CPC code → description mapping
109
+
110
+ The streamer needs a CPC code → human-readable description map to populate
111
+ the `cpc_description` quad attributes on `cpc_code` atoms. It hydrates the map by querying the public
112
+ BigQuery table **`patents-public-data.cpc.definition`** — the same project /
113
+ credentials already used to scan `patents-public-data.patents.publications`.
114
+ The query is hardcoded inside the streamer (`cpcTaxonomySQL` in
115
+ `patents_streamer.go`).
116
+
117
+ The map is **refreshed at the start of every polling cycle** so that newly
118
+ published CPC codes (the EPO updates the taxonomy a few times per year) are
119
+ picked up without requiring a streamer restart. If a refresh fails due to a
120
+ transient BQ outage and a previous successful load is cached, the streamer
121
+ logs a warning and continues with the cached map for that cycle; if the
122
+ first-ever load fails (no cache yet), the cycle is skipped without advancing
123
+ the checkpoint. Either way, no grant_date window is silently lost.
124
+
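The refresh-with-fallback behaviour in the paragraph above can be sketched as follows. The class name `CPCMapCache` and the `loader` callable are hypothetical stand-ins for the streamer's BigQuery call; the real streamer is written in Go, so this is only an illustration of the control flow.

```python
import logging
from typing import Callable, Dict, Optional

CPCMap = Dict[str, str]  # CPC code -> description


class CPCMapCache:
    """Holds the last successfully loaded CPC map across polling cycles."""

    def __init__(self, loader: Callable[[], CPCMap]) -> None:
        self._loader = loader  # hypothetical: runs the taxonomy query
        self._cached: Optional[CPCMap] = None

    def refresh(self) -> Optional[CPCMap]:
        """Return the map for this cycle, or None if the cycle must be skipped."""
        try:
            self._cached = self._loader()
        except Exception as exc:
            if self._cached is None:
                # First-ever load failed: skip the cycle, do not advance.
                logging.error("CPC load failed with no cache: %s", exc)
                return None
            logging.warning("CPC refresh failed, using cached map: %s", exc)
        return self._cached
```

A caller would advance its checkpoint only when `refresh()` returns a map, which is what keeps a grant-date window from being skipped.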
125
+ Codes referenced by patents but absent from the loaded map emit
126
+ `cpc_code` only, without a `cpc_description` attribute. The streamer never fails on
127
+ missing codes — gaps in the taxonomy are tolerated and reported only by
128
+ the absence of descriptions in the output.
129
+
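The tolerance for missing codes can be sketched like this. The dictionary-shaped atom is a hypothetical stand-in for the streamer's real atom type, and trimming whitespace around each code after the `,` split is an assumption.

```python
from typing import Dict, List, Mapping, Optional


def cpc_atoms(aggregated_codes: Optional[str],
              cpc_map: Mapping[str, str]) -> List[Dict[str, object]]:
    """Emit one cpc_code atom per code; attach cpc_description when known."""
    atoms: List[Dict[str, object]] = []
    for code in (aggregated_codes or "").split(","):
        code = code.strip()  # internal spacing (e.g. 'G21B 1/00') is kept
        if not code:
            continue
        atom: Dict[str, object] = {"property": "cpc_code", "value": code}
        description = cpc_map.get(code)
        if description:
            # Codes absent from the map get no attribute at all.
            atom["attributes"] = {"cpc_description": description}
        atoms.append(atom)
    return atoms
```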
130
+ ### How the description query works
131
+
132
+ CPC is a tree: each symbol stores its own short title fragment in
133
+ `titlePart` plus a `parents` array listing every ancestor up to the root
134
+ (e.g. `H05H 1/02` → `H05H 1/00` → `H05H` → `H05` → `H`). To produce a
135
+ useful, *self-contained* description for any one code, the query walks
136
+ that ancestor chain for every leaf and concatenates each ancestor's
137
+ `titlePart` into a single root → leaf path string. So `H05H 1/02` becomes
138
+ something like *"PHYSICS > NUCLEAR PHYSICS > Plasma technique >
139
+ Generating plasma > Glow discharges"* rather than just *"Glow
140
+ discharges"*. Each patent gets one description per code without the
141
+ streamer needing the rest of the taxonomy at atomization time.
142
+
143
+ The mechanics:
144
+
145
+ 1. Compute a normalized `sym_key` (whitespace stripped) for each row so
146
+ ancestor lookups are robust to spacing differences (`H05H 1/02` vs
147
+ `H05H1/02`).
148
+ 2. For each leaf, build an ordered `path_keys` array of normalized
149
+ ancestor keys, root-first, ending with the leaf itself.
150
+ 3. Unnest `path_keys` so each (leaf, ancestor, position) is one row.
151
+ 4. Join back to the definition table to recover each ancestor's title
152
+ fragment and group on the leaf, ordering by position so the output
153
+ reads from root to leaf.
154
+
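The four steps above can be sketched in memory as follows. This is not the `cpcTaxonomySQL` query itself, only an illustration of the same walk. The row key `symbol`, the function names, and the assumption that `parents` lists the nearest ancestor first are not confirmed by the source.

```python
from typing import Dict, List, Mapping, Sequence


def sym_key(symbol: str) -> str:
    """Normalize a CPC symbol by stripping all whitespace."""
    return "".join(symbol.split())


def build_descriptions(rows: Sequence[Mapping[str, object]]) -> Dict[str, str]:
    """Map each symbol (as published) to its root-to-leaf title path.

    Each row is assumed to carry 'symbol' (str), 'titlePart' (list of str)
    and 'parents' (list of str, nearest ancestor first).
    """
    # One title per node: multi-segment titles are joined with spaces.
    titles = {sym_key(str(r["symbol"])): " ".join(r["titlePart"]) for r in rows}
    descriptions: Dict[str, str] = {}
    for r in rows:
        symbol = str(r["symbol"])
        # Root-first ancestor keys, ending with the leaf itself.
        path_keys: List[str] = [sym_key(p) for p in reversed(list(r["parents"]))]
        path_keys.append(sym_key(symbol))
        fragments = [titles.get(key, "") for key in path_keys]
        # Empty fragments are dropped so no stray separators appear.
        descriptions[symbol] = " > ".join(f for f in fragments if f)
    return descriptions
```

The output is keyed by the symbol as published, while `sym_key` is used only for ancestor lookups, so `H05H 1/00` and `H05H1/00` resolve to the same node.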
155
+ The streamer consumes the `code` and `description` columns; `parent` is
156
+ informational and currently retained on the in-memory `CPCNode` but not
157
+ emitted as an atom.
158
+
159
+ See `cpcTaxonomySQL` in `moongoose/fetch/patents_streamer.go` for the
160
+ exact BigQuery query.
161
+
162
+ Notes:
163
+
164
+ - The whitespace-stripped `sym_key` is only used internally as the JOIN
165
+ key between leaves and ancestors. The emitted `code` keeps its original
166
+ spacing because patent publications retain it too (e.g. `G21B 1/00`,
167
+ not `G21B1/00`); the streamer matches against codes-as-published.
168
+ - `ARRAY_TO_STRING(titlePart, ' ')` collapses CPC's multi-segment node
169
+ titles into a single sentence per ancestor before they're joined with
170
+ `> ` separators.
171
+ - `NULLIF(..., '')` filters empty fragments so trailing or leading
172
+ separators don't leak into the final description.
@@ -0,0 +1,156 @@
1
+ # Dataset schema for patent grant publications from Google Patents Public Datasets.
2
+ #
3
+ # Structured atomization only — no LLM extraction.
4
+ name: "patents"
5
+ description: "Patent grant publications from Google Patents Public Datasets (titles, abstracts, CPC, inventors, assignees, countries, and languages)."
6
+
7
+ extraction:
8
+ flavors: closed
9
+ properties: closed
10
+ relationships: closed
11
+ attributes: closed
12
+ events: closed
13
+
14
+ flavors:
15
+ - name: "patent"
16
+ description: "A granted patent publication identified by its publication number (WIPO-style number including office prefix, e.g. US-12345678-B2, EP-1234567-B1)"
17
+ display_name: "Patent"
18
+ mergeability: not_mergeable
19
+ strong_id_properties: ["patent_publication_number"]
20
+ passive: true
21
+
22
+ - name: "person"
23
+ description: "A real person as opposed to a fictional character, such as a CEO, politician, or public figure"
24
+ display_name: "Person"
25
+ mergeability: not_mergeable
26
+ passive: true
27
+
28
+ - name: "organization"
29
+ description: "A particular business, institution, or organization such as a corporation, university, government agency, or non-profit"
30
+ display_name: "Organization"
31
+ mergeability: not_mergeable
32
+ passive: true
33
+
34
+ properties:
35
+ - name: "patent_publication_number"
36
+ type: string
37
+ description: "Canonical patent publication identifier from the publications table (office prefix and kind suffix vary by jurisdiction, e.g. US-12345678-B2, JP-6001234-A)"
38
+ display_name: "Patent Publication Number"
39
+ mergeability: not_mergeable
40
+ domain_flavors: ["patent"]
41
+ passive: true
42
+
43
+ - name: "patent_grant_date"
44
+ type: string
45
+ description: "Patent grant date in YYYY-MM-DD format."
46
+ display_name: "Patent Grant Date"
47
+ mergeability: not_mergeable
48
+ domain_flavors: ["patent"]
49
+ passive: true
50
+
51
+ - name: "patent_filing_date"
52
+ type: string
53
+ description: "Application filing date in YYYY-MM-DD format."
54
+ display_name: "Patent Filing Date"
55
+ mergeability: not_mergeable
56
+ domain_flavors: ["patent"]
57
+ passive: true
58
+
59
+ - name: "patent_priority_date"
60
+ type: string
61
+ description: "Earliest priority date claimed by the patent in YYYY-MM-DD format."
62
+ display_name: "Patent Priority Date"
63
+ mergeability: not_mergeable
64
+ domain_flavors: ["patent"]
65
+ passive: true
66
+
67
+ - name: "patent_kind_code"
68
+ type: string
69
+ description: "Publication kind code (e.g. A1, B1, B2, C1, S, P) distinguishing applications from grants and other document types"
70
+ display_name: "Patent Kind Code"
71
+ mergeability: not_mergeable
72
+ domain_flavors: ["patent"]
73
+ passive: true
74
+
75
+ - name: "patent_country"
76
+ type: string
77
+ description: "WIPO country/office code identifying the issuing patent office (e.g. US, EP, JP, WO). Two-letter ISO 3166-1 alpha-2 for national offices, plus regional/international codes (EP for the European Patent Office, WO for WIPO/PCT)."
78
+ display_name: "Patent Country"
79
+ mergeability: not_mergeable
80
+ domain_flavors: ["patent"]
81
+ passive: true
82
+
83
+ - name: "title"
84
+ type: string
85
+ description: "Title of the entity"
86
+ display_name: "Title"
87
+ mergeability: not_mergeable
88
+ domain_flavors: ["patent"]
89
+ passive: true
90
+
91
+ - name: "patent_title_language"
92
+ type: string
93
+ description: "ISO 639-1 lower-case code of the language used for the patent's title atom (e.g. 'en', 'zh', 'ja'). Emitted only when a title is present."
94
+ display_name: "Patent Title Language"
95
+ mergeability: not_mergeable
96
+ domain_flavors: ["patent"]
97
+ passive: true
98
+
99
+ - name: "patent_abstract"
100
+ type: string
101
+ description: "Patent publication abstract text."
102
+ display_name: "Patent Abstract"
103
+ mergeability: not_mergeable
104
+ domain_flavors: ["patent"]
105
+ passive: true
106
+
107
+ - name: "patent_abstract_language"
108
+ type: string
109
+ description: "ISO 639-1 lower-case code of the language used for the patent's abstract atom (e.g. 'en', 'zh', 'ja'). Emitted only when an abstract is present."
110
+ display_name: "Patent Abstract Language"
111
+ mergeability: not_mergeable
112
+ domain_flavors: ["patent"]
113
+ passive: true
114
+
115
+ - name: "cpc_code"
116
+ type: string
117
+ description: "CPC symbol classifying this patent (multi-valued: one atom per direct code on the publication)"
118
+ display_name: "CPC Code"
119
+ mergeability: not_mergeable
120
+ domain_flavors: ["patent"]
121
+ passive: true
122
+
123
+ relationships:
124
+ - name: "has_inventor"
125
+ description: "A patent lists a person as an inventor"
126
+ display_name: "Has Inventor"
127
+ mergeability: not_mergeable
128
+ domain_flavors: ["patent"]
129
+ target_flavors: ["person"]
130
+ passive: true
131
+
132
+ - name: "has_assignee"
133
+ description: "A patent lists an organization as an assignee"
134
+ display_name: "Has Assignee"
135
+ mergeability: not_mergeable
136
+ domain_flavors: ["patent"]
137
+ target_flavors: ["organization"]
138
+ passive: true
139
+
140
+ - name: "cites_patent"
141
+ description: "A patent cites another patent publication as prior art (non-patent literature citations are not represented as edges)"
142
+ display_name: "Cites Patent"
143
+ mergeability: not_mergeable
144
+ domain_flavors: ["patent"]
145
+ target_flavors: ["patent"]
146
+ passive: true
147
+
148
+ # Human-readable CPC code description. Stored as a quad attribute on each
149
+ # cpc_code atom when known.
150
+ attributes:
151
+ - property: "cpc_code"
152
+ name: "cpc_description"
153
+ type: string
154
+ description: "CPC taxonomy title path for this code (from patents-public-data.cpc.definition); omitted when the code is missing from the map"
155
+ display_name: "CPC Description"
156
+ mergeability: not_mergeable