@yottagraph-app/data-model-skill 0.0.20 → 0.0.21
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/package.json +1 -1
- package/skill/patents/DATA_DICTIONARY.md +172 -0
- package/skill/patents/schema.yaml +156 -0
package/package.json
CHANGED
package/skill/patents/DATA_DICTIONARY.md
@@ -0,0 +1,172 @@
# Data Dictionary: Patents (Google Patents Public Datasets / BigQuery)

*Last updated: 2026-04-29. Aligned with `patentsBQQueryBase` / streamer in `moongoose/fetch/patents_streamer.go` (US-only SQL, English title/abstract, no publication dedupe).*

## Purpose / Source Overview

This source ingests **granted US patent publications** from the Google-hosted BigQuery table `patents-public-data.patents.publications`. Each row is atomized into one **patent** record carrying its own metadata (publication number, grant date, title, abstract), CPC classifications (multi-valued: code + human-readable description), and **person** / **organization** edges for inventors and assignees. Data is structured from BigQuery only (no LLM extraction). The CPC code → description map is loaded at streamer startup from the public BigQuery table `patents-public-data.cpc.definition` (see "Reference Data" below).

The upstream query scans `publications` with **`WHERE p.country_code = 'US'`** (hardcoded in SQL, not a stream arg), **`grant_date`** in the current poll window, **`ORDER BY grant_date, publication_number`**, and an optional **`LIMIT`** when `maxPatents` is set. It selects **`publication_number`**, **`country_code`**, dates, **`kind_code`**, English-only **`title`** / **`title_language`** and **`abstract`** / **`abstract_language`** (subqueries over `title_localized` / `abstract_localized` with `WHERE language = 'en'`), aggregated CPC and inventor/assignee/citation lists. There is **no `QUALIFY`** deduplication in SQL and **no Go-side deduplication** of publication numbers: each BigQuery row returned for the window is processed as one publication. (In the public dataset, `publication_number` is unique per row at table scale; if upstream ever returned duplicates within a window, they would be emitted twice.)

| `Record.Source` value | Meaning |
|----------------------|---------|
| `patents` | Patent publication record and its atoms |

Poll cadence and grant-date windows are configured per stream (`pollTimeMin`, `windowDays`, required `initialGrantDateMin`). Optional: `maxPatents`, `batchSize`, `projectId`.

## Entity Types

### `patent` (patent publication)

Represents one patent publication identified by Google’s publication number (for example `US-12345678-B2`). The `patent` flavor is distinct from the generic `document` flavor used by other sources (news, EDGAR filings) so that queries can filter to patents directly.

- **Primary key:** `patent_publication_number` (strong ID on the subject entity)

### `person` (inventor)

A named inventor appearing on the publication’s harmonized inventor list.

- **Primary key:** none in source data; entity resolution uses mergeable name + disambiguation snippet (patent publication context).

### `organization` (assignee)

A named assignee from the harmonized assignee list (typically a company).

- **Primary key:** none in source data; mergeable name + snippet for resolution.

## Properties

### On `patent`

* `patent_publication_number`
    * **Definition:** Canonical publication identifier from the `publication_number` field.
    * **Examples:** `US-12345678-B2`
    * **Derivation:** BigQuery `publication_number`, copied verbatim.

* `patent_grant_date`
    * **Definition:** Grant date in `YYYY-MM-DD` (UTC calendar interpretation of the integer `grant_date`).
    * **Examples:** `2025-10-15`
    * **Derivation:** BigQuery `grant_date` (YYYYMMDD) reformatted.

* `patent_filing_date`
    * **Definition:** Application filing date in `YYYY-MM-DD`.
    * **Derivation:** BigQuery `filing_date` (YYYYMMDD) reformatted. Omitted when null.

* `patent_priority_date`
    * **Definition:** Earliest priority date claimed by the patent in `YYYY-MM-DD`.
    * **Derivation:** BigQuery `priority_date` (YYYYMMDD) reformatted. Omitted when null.

* `patent_kind_code`
    * **Definition:** Document kind code (e.g. `A1`, `B1`, `B2`, `C1`, `S`, `P`).
    * **Derivation:** BigQuery `kind_code`, copied verbatim.

* `patent_country`
    * **Definition:** WIPO country/office code identifying the issuing patent office. Two-letter ISO 3166-1 alpha-2 for national offices (`US`, `JP`, `DE`, …) plus regional/international codes (`EP` for the European Patent Office, `WO` for WIPO/PCT).
    * **Derivation:** BigQuery `country_code` on the selected row. The patents stream only ingests **`US`** (`WHERE` clause); the atom reflects the column when present, with a streamer fallback to `US` if empty.

* `title`
    * **Definition:** Title of the patent publication **when an English row exists** in `title_localized`; otherwise empty (no fallback to other languages).
    * **Examples:** “Example fusion reactor control”
    * **Derivation:** `(SELECT t.text FROM UNNEST(p.title_localized) AS t WHERE t.language = 'en' LIMIT 1)`.

* `patent_title_language`
    * **Definition:** ISO 639-1 lower-case code of the language used for the title atom (here **`en`** when English text exists).
    * **Derivation:** `(SELECT t.language FROM UNNEST(p.title_localized) AS t WHERE t.language = 'en' LIMIT 1)`. Emitted only when a title is present.

* `patent_abstract`
    * **Definition:** Abstract text **when an English row exists** in `abstract_localized`; otherwise empty. Not truncated in the current BigQuery SQL (full English text as stored).
    * **Derivation:** `(SELECT a.text FROM UNNEST(p.abstract_localized) AS a WHERE a.language = 'en' LIMIT 1)`.

* `patent_abstract_language`
    * **Definition:** ISO 639-1 lower-case code of the language used for the abstract atom (here **`en`** when English text exists).
    * **Derivation:** `(SELECT a.language FROM UNNEST(p.abstract_localized) AS a WHERE a.language = 'en' LIMIT 1)`. Emitted only when an abstract is present.

* `cpc_code` (multi-valued)
    * **Definition:** A direct CPC symbol assigned to this patent. One atom per code.
    * **Derivation:** `STRING_AGG` of `cpc.code` from unnested `cpc`, then split on `,`.
    * **Attribute `cpc_description`:** When the code exists in the in-memory CPC map (loaded at streamer startup from `patents-public-data.cpc.definition`), the human-readable taxonomy path is attached as quad attribute **`cpc_description`** on that `cpc_code` atom (kgschema quad attr id 24). Codes not present in the map emit `cpc_code` only with no attribute.
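
The derivations above can be illustrated with a short Python sketch. It is not the streamer's Go code: the function names, the dictionary shape used for atoms, and the choice to treat a `0` date as missing are all assumptions made for this example.

```python
from datetime import date
from typing import Dict, List, Mapping, Optional, Sequence


def format_bq_date(value: Optional[int]) -> Optional[str]:
    """Turn a YYYYMMDD integer into 'YYYY-MM-DD'; None means "omit the atom"."""
    if not value:  # assumption: None and 0 both mean "no date"
        return None
    year, rest = divmod(value, 10000)
    month, day = divmod(rest, 100)
    # date() validates the fields and raises ValueError on impossible dates
    return date(year, month, day).isoformat()


def english_text_atoms(localized: Sequence[Mapping[str, str]],
                       text_prop: str, lang_prop: str) -> Dict[str, str]:
    """Mirror the `WHERE language = 'en' LIMIT 1` subqueries: English or nothing."""
    for entry in localized:
        if entry.get("language") == "en":
            text = entry.get("text", "")
            # The language atom is emitted only alongside a non-empty text atom.
            return {text_prop: text, lang_prop: "en"} if text else {}
    return {}


def cpc_atoms(cpc_csv: str, descriptions: Mapping[str, str]) -> List[dict]:
    """Split the aggregated CPC string on ',' and attach descriptions when known."""
    atoms = []
    for raw in cpc_csv.split(","):
        code = raw.strip()
        if not code:
            continue
        atom = {"property": "cpc_code", "value": code}
        if code in descriptions:
            # Hypothetical atom shape; the real quad attribute encoding differs.
            atom["attributes"] = {"cpc_description": descriptions[code]}
        atoms.append(atom)
    return atoms
```

For instance, `english_text_atoms(row_titles, "title", "patent_title_language")` returns an empty dict for a publication with only a Japanese title, which matches the "no fallback to other languages" rule.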

## Entity Relationships Summary

```
patent ──[has_inventor]──→ person
patent ──[has_assignee]──→ organization
patent ──[cites_patent]──→ patent (other publications cited as prior art)
patent (multi-valued `cpc_code`; optional `cpc_description` quad attribute per code)
```

The `patent` subject's primary citation text is always the canonical Google
Patents URL: `https://patents.google.com/patent/<UNHYPHENATED_PUBNUM>` (Google
Patents URLs require the publication number with hyphens removed, e.g.
`US12433179B2`, `JPH01160014U`). The same form is used for `cites_patent` target
citations.
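
A minimal sketch of that URL construction, assuming nothing beyond the hyphen removal described above (the function and constant names are illustrative, not taken from the streamer):

```python
GOOGLE_PATENTS_BASE = "https://patents.google.com/patent/"


def google_patents_url(publication_number: str) -> str:
    """Build the citation URL by dropping the hyphens from the publication number."""
    return GOOGLE_PATENTS_BASE + publication_number.replace("-", "")


print(google_patents_url("US-12433179-B2"))
```

The same helper would serve both the subject's own citation and each `cites_patent` target.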

Inventor and assignee names come from `STRING_AGG` lists on `inventor_harmonized` and `assignee_harmonized`, split on `;` after export.
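
As a rough illustration of that split, the sketch below turns the two aggregated strings into relationship triples. The tuple shape and helper names are hypothetical, and trimming whitespace around each name is an assumption rather than documented streamer behavior.

```python
from typing import List, Tuple


def split_names(aggregated: str) -> List[str]:
    """Split a ';'-delimited STRING_AGG value, dropping empty entries."""
    return [name.strip() for name in aggregated.split(";") if name.strip()]


def patent_edges(publication_number: str, inventors: str,
                 assignees: str) -> List[Tuple[str, str, str]]:
    """Return (patent, relationship, target name) triples for one publication."""
    edges = [(publication_number, "has_inventor", n) for n in split_names(inventors)]
    edges += [(publication_number, "has_assignee", n) for n in split_names(assignees)]
    return edges
```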

## Reference Data

### CPC code → description mapping

The streamer needs a CPC code → human-readable description map to populate
the `cpc_description` quad attributes on `cpc_code` atoms. It hydrates the map by querying the public
BigQuery table **`patents-public-data.cpc.definition`**, using the same project /
credentials already used to scan `patents-public-data.patents.publications`.
The query is hardcoded inside the streamer (`cpcTaxonomySQL` in
`patents_streamer.go`).

The map is **refreshed at the start of every polling cycle** so that newly
published CPC codes (the EPO updates the taxonomy a few times per year) are
picked up without requiring a streamer restart. If a refresh fails due to a
transient BQ outage and a previous successful load is cached, the streamer
logs a warning and continues with the cached map for that cycle; if the
first-ever load fails (no cache yet), the cycle is skipped without advancing
the checkpoint. Either way, no `grant_date` window is silently lost.
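
The refresh-with-fallback behavior can be sketched as follows. This is a Python illustration of the policy described above, not the Go implementation; `CpcMapCache`, `run_cycle`, and the integer checkpoint are hypothetical names and simplifications.

```python
import logging
from typing import Callable, Dict, Optional

log = logging.getLogger("patents_streamer_sketch")


class CpcMapCache:
    """Holds the last successfully loaded CPC map (hypothetical helper)."""

    def __init__(self, loader: Callable[[], Dict[str, str]]):
        self._loader = loader
        self._cached: Optional[Dict[str, str]] = None

    def refresh(self) -> Optional[Dict[str, str]]:
        """Return the map to use this cycle, or None if the cycle must be skipped."""
        try:
            self._cached = self._loader()
        except Exception as err:
            if self._cached is None:
                log.error("first CPC load failed, skipping cycle: %s", err)
                return None
            log.warning("CPC refresh failed, reusing cached map: %s", err)
        return self._cached


def run_cycle(cache: CpcMapCache, checkpoint: int,
              process_window: Callable[[int, Dict[str, str]], int]) -> int:
    """Run one polling cycle and return the (possibly unchanged) checkpoint."""
    cpc_map = cache.refresh()
    if cpc_map is None:
        return checkpoint  # unchanged, so the same window is retried next cycle
    return process_window(checkpoint, cpc_map)
```

Returning the checkpoint unchanged on a skipped cycle is what keeps a failed first load from dropping a window.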

Codes referenced by patents but absent from the loaded map emit
`cpc_code` only, without a `cpc_description` attribute. The streamer never fails on
missing codes: gaps in the taxonomy are tolerated and reported only by
the absence of descriptions in the output.

### How the description query works

CPC is a tree: each symbol stores its own short title fragment in
`titlePart` plus a `parents` array listing every ancestor up to the root
(e.g. `H05H 1/02` → `H05H 1/00` → `H05H` → `H05` → `H`). To produce a
useful, *self-contained* description for any one code, the query walks
that ancestor chain for every leaf and concatenates each ancestor's
`titlePart` into a single root → leaf path string. So `H05H 1/02` becomes
something like *"ELECTRICITY > ELECTRIC TECHNIQUES NOT OTHERWISE PROVIDED FOR >
PLASMA TECHNIQUE > Generating plasma; Handling plasma > Arrangements for
confining plasma"* rather than just *"Arrangements for confining plasma"*.
Each patent gets one description per code without the
streamer needing the rest of the taxonomy at atomization time.

The mechanics:

1. Compute a normalized `sym_key` (whitespace stripped) for each row so
   ancestor lookups are robust to spacing differences (`H05H 1/02` vs
   `H05H1/02`).
2. For each leaf, build an ordered `path_keys` array of normalized
   ancestor keys, root-first, ending with the leaf itself.
3. Unnest `path_keys` so each (leaf, ancestor, position) is one row.
4. Join back to the definition table to recover each ancestor's title
   fragment and group on the leaf, ordering by position so the output
   reads from root to leaf.
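
The four steps can be mirrored in plain Python to show the shape of the result. This is a sketch under assumptions, not a translation of `cpcTaxonomySQL`: the `CpcRow` type is hypothetical, `parents` is assumed to be ordered nearest ancestor first, the separator is assumed to be ` > `, and steps 3 and 4 collapse into a dictionary lookup.

```python
from typing import Dict, Iterable, List, NamedTuple


class CpcRow(NamedTuple):
    symbol: str             # code as published, e.g. "H05H 1/02"
    title_part: List[str]   # title fragments for this node
    parents: List[str]      # ancestors, assumed nearest-first


def sym_key(symbol: str) -> str:
    """Step 1: strip all whitespace so 'H05H 1/02' and 'H05H1/02' match."""
    return "".join(symbol.split())


def build_descriptions(rows: Iterable[CpcRow]) -> Dict[str, str]:
    rows = list(rows)
    # One title per node: join its fragments with spaces, skipping empty ones.
    titles = {sym_key(r.symbol): " ".join(p for p in r.title_part if p) for r in rows}

    descriptions = {}
    for row in rows:
        # Step 2: root-first ancestor keys, ending with the node itself.
        path_keys = [sym_key(p) for p in reversed(row.parents)] + [sym_key(row.symbol)]
        # Steps 3-4: look up each ancestor's title in path order, dropping gaps.
        fragments = [titles.get(key, "") for key in path_keys]
        descriptions[row.symbol] = " > ".join(f for f in fragments if f)
    return descriptions


# Placeholder titles, not the real CPC taxonomy text.
demo = build_descriptions([
    CpcRow("H", ["ROOT"], []),
    CpcRow("H05H", ["Branch"], ["H"]),
    CpcRow("H05H 1/00", ["Leaf"], ["H05H", "H"]),
])
print(demo["H05H 1/00"])
```

The output is keyed by the code as published, while the whitespace-stripped key is used only for the ancestor lookups.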

The streamer consumes the `code` and `description` columns; `parent` is
informational and currently retained on the in-memory `CPCNode` but not
emitted as an atom.

See `cpcTaxonomySQL` in `moongoose/fetch/patents_streamer.go` for the
exact BigQuery query.

Notes:

- The whitespace-stripped `sym_key` is only used internally as the JOIN
  key between leaves and ancestors. The emitted `code` keeps its original
  spacing because patents publications retain it too (e.g. `G21B 1/00`,
  not `G21B1/00`); the streamer matches against codes-as-published.
- `ARRAY_TO_STRING(titlePart, ' ')` collapses CPC's multi-segment node
  titles into a single sentence per ancestor before they're joined with
  `> ` separators.
- `NULLIF(..., '')` filters empty fragments so trailing or leading
  separators don't leak into the final description.
package/skill/patents/schema.yaml
@@ -0,0 +1,156 @@
# Dataset schema for patent grant publications from Google Patents Public Datasets.
#
# Structured atomization only; no LLM extraction.
name: "patents"
description: "Patent grant publications from Google Patents Public Datasets (titles, abstracts, CPC, inventors, assignees, countries, and languages)."

extraction:
  flavors: closed
  properties: closed
  relationships: closed
  attributes: closed
  events: closed

flavors:
  - name: "patent"
    description: "A granted patent publication identified by its publication number (WIPO-style number including office prefix, e.g. US-12345678-B2, EP-1234567-B1)"
    display_name: "Patent"
    mergeability: not_mergeable
    strong_id_properties: ["patent_publication_number"]
    passive: true

  - name: "person"
    description: "A real person as opposed to a fictional character, such as a CEO, politician, or public figure"
    display_name: "Person"
    mergeability: not_mergeable
    passive: true

  - name: "organization"
    description: "A particular business, institution, or organization such as a corporation, university, government agency, or non-profit"
    display_name: "Organization"
    mergeability: not_mergeable
    passive: true

properties:
  - name: "patent_publication_number"
    type: string
    description: "Canonical patent publication identifier from the publications table (office prefix and kind suffix vary by jurisdiction, e.g. US-12345678-B2, JP-6001234-A)"
    display_name: "Patent Publication Number"
    mergeability: not_mergeable
    domain_flavors: ["patent"]
    passive: true

  - name: "patent_grant_date"
    type: string
    description: "Patent grant date in YYYY-MM-DD format."
    display_name: "Patent Grant Date"
    mergeability: not_mergeable
    domain_flavors: ["patent"]
    passive: true

  - name: "patent_filing_date"
    type: string
    description: "Application filing date in YYYY-MM-DD format."
    display_name: "Patent Filing Date"
    mergeability: not_mergeable
    domain_flavors: ["patent"]
    passive: true

  - name: "patent_priority_date"
    type: string
    description: "Earliest priority date claimed by the patent in YYYY-MM-DD format."
    display_name: "Patent Priority Date"
    mergeability: not_mergeable
    domain_flavors: ["patent"]
    passive: true

  - name: "patent_kind_code"
    type: string
    description: "Publication kind code (e.g. A1, B1, B2, C1, S, P) distinguishing applications from grants and other document types"
    display_name: "Patent Kind Code"
    mergeability: not_mergeable
    domain_flavors: ["patent"]
    passive: true

  - name: "patent_country"
    type: string
    description: "WIPO country/office code identifying the issuing patent office (e.g. US, EP, JP, WO). Two-letter ISO 3166-1 alpha-2 for national offices, plus regional/international codes (EP for the European Patent Office, WO for WIPO/PCT)."
    display_name: "Patent Country"
    mergeability: not_mergeable
    domain_flavors: ["patent"]
    passive: true

  - name: "title"
    type: string
    description: "Title of the entity"
    display_name: "Title"
    mergeability: not_mergeable
    domain_flavors: ["patent"]
    passive: true

  - name: "patent_title_language"
    type: string
    description: "ISO 639-1 lower-case code of the language used for the patent's title atom (e.g. 'en', 'zh', 'ja'). Emitted only when a title is present."
    display_name: "Patent Title Language"
    mergeability: not_mergeable
    domain_flavors: ["patent"]
    passive: true

  - name: "patent_abstract"
    type: string
    description: "Patent publication abstract text."
    display_name: "Patent Abstract"
    mergeability: not_mergeable
    domain_flavors: ["patent"]
    passive: true

  - name: "patent_abstract_language"
    type: string
    description: "ISO 639-1 lower-case code of the language used for the patent's abstract atom (e.g. 'en', 'zh', 'ja'). Emitted only when an abstract is present."
    display_name: "Patent Abstract Language"
    mergeability: not_mergeable
    domain_flavors: ["patent"]
    passive: true

  - name: "cpc_code"
    type: string
    description: "CPC symbol classifying this patent (multi-valued: one atom per direct code on the publication)"
    display_name: "CPC Code"
    mergeability: not_mergeable
    domain_flavors: ["patent"]
    passive: true

relationships:
  - name: "has_inventor"
    description: "A patent lists a person as an inventor"
    display_name: "Has Inventor"
    mergeability: not_mergeable
    domain_flavors: ["patent"]
    target_flavors: ["person"]
    passive: true

  - name: "has_assignee"
    description: "A patent lists an organization as an assignee"
    display_name: "Has Assignee"
    mergeability: not_mergeable
    domain_flavors: ["patent"]
    target_flavors: ["organization"]
    passive: true

  - name: "cites_patent"
    description: "A patent cites another patent publication as prior art (non-patent literature citations are not represented as edges)"
    display_name: "Cites Patent"
    mergeability: not_mergeable
    domain_flavors: ["patent"]
    target_flavors: ["patent"]
    passive: true

# Human-readable CPC code description. Stored as a quad attribute on each
# cpc_code atom when known.
attributes:
  - property: "cpc_code"
    name: "cpc_description"
    type: string
    description: "CPC taxonomy title path for this code (from patents-public-data.cpc.definition); omitted when the code is missing from the map"
    display_name: "CPC Description"
    mergeability: not_mergeable