@yottagraph-app/data-model-skill 0.0.20 → 0.0.21

package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@yottagraph-app/data-model-skill",
3
- "version": "0.0.20",
3
+ "version": "0.0.21",
4
4
  "description": "Data model skill documentation for AI agents - entity types, properties, and schemas from Lovelace fetch sources",
5
5
  "repository": {
6
6
  "type": "git",
@@ -0,0 +1,172 @@
1
+ # Data Dictionary: Patents (Google Patents Public Datasets / BigQuery)
2
+
3
+ *Last updated: 2026-04-29 — aligned with `patentsBQQueryBase` / streamer in `moongoose/fetch/patents_streamer.go` (US-only SQL, English title/abstract, no publication dedupe).*
4
+
5
+ ## Purpose / Source Overview
6
+
7
+ This source ingests **granted US patent publications** from the Google-hosted BigQuery table `patents-public-data.patents.publications`. Each row is atomized into one **patent** record carrying its own metadata (publication number, grant date, title, abstract), CPC classifications (multi-valued: code + human-readable description), and **person** / **organization** edges for inventors and assignees. Data is structured from BigQuery only (no LLM extraction). The CPC code → description map is loaded from the public BigQuery table `patents-public-data.cpc.definition` and refreshed at the start of every polling cycle (see "Reference Data" below).
8
+
9
+ The upstream query scans `publications` with **`WHERE p.country_code = 'US'`** (hardcoded in SQL, not a stream arg), **`grant_date`** in the current poll window, **`ORDER BY grant_date, publication_number`**, and an optional **`LIMIT`** when `maxPatents` is set. It selects **`publication_number`**, **`country_code`**, dates, **`kind_code`**, English-only **`title`** / **`title_language`** and **`abstract`** / **`abstract_language`** (subqueries over `title_localized` / `abstract_localized` with `WHERE language = 'en'`), and aggregated CPC, inventor, assignee, and citation lists. There is **no `QUALIFY`** deduplication in SQL and **no Go-side deduplication** of publication numbers: each BigQuery row returned for the window is processed as one publication. (In the public dataset, `publication_number` is unique across the whole table; if upstream ever returned the same publication number more than once within a window, each copy would be emitted.)
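
As a rough illustration of the query shape described above, the following Python sketch assembles a BigQuery SQL string with the same filters and ordering. It is not the streamer's actual `patentsBQQueryBase`: the function name `build_patents_query`, the `@window_start` / `@window_end` parameter names, and the inclusive `BETWEEN` window are assumptions, and the aggregated CPC, inventor, assignee, and citation columns are left out for brevity.

```python
from typing import Optional


def build_patents_query(max_patents: Optional[int] = None) -> str:
    """Assemble a US-only, grant-date-windowed query (illustrative only)."""
    sql = """
SELECT
  p.publication_number,
  p.country_code,
  p.grant_date,
  p.filing_date,
  p.priority_date,
  p.kind_code,
  (SELECT t.text FROM UNNEST(p.title_localized) AS t
    WHERE t.language = 'en' LIMIT 1) AS title,
  (SELECT a.text FROM UNNEST(p.abstract_localized) AS a
    WHERE a.language = 'en' LIMIT 1) AS abstract
FROM `patents-public-data.patents.publications` AS p
WHERE p.country_code = 'US'
  AND p.grant_date BETWEEN @window_start AND @window_end
ORDER BY p.grant_date, p.publication_number
""".strip()
    # The LIMIT clause is appended only when a cap is configured
    # (the document calls this setting `maxPatents`).
    if max_patents is not None:
        if max_patents <= 0:
            raise ValueError("max_patents must be positive")
        sql += f"\nLIMIT {int(max_patents)}"
    return sql
```

Because there is no deduplication step in this shape, the stable `ORDER BY` is what keeps a capped window deterministic from one poll to the next.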
10
+
11
+ | `Record.Source` value | Meaning |
12
+ |----------------------|---------|
13
+ | `patents` | Patent publication record and its atoms |
14
+
15
+ Poll cadence and grant-date windows are configured per stream (`pollTimeMin`, `windowDays`, required `initialGrantDateMin`). Optional: `maxPatents`, `batchSize`, `projectId`.
16
+
17
+ ## Entity Types
18
+
19
+ ### `patent` (patent publication)
20
+
21
+ Represents one patent publication identified by Google’s publication number (for example `US-12345678-B2`). The `patent` flavor is distinct from the generic `document` flavor used by other sources (news, EDGAR filings) so that queries can filter to patents directly.
22
+
23
+ - **Primary key:** `patent_publication_number` (strong ID on the subject entity)
24
+
25
+ ### `person` (inventor)
26
+
27
+ A named inventor appearing on the publication’s harmonized inventor list.
28
+
29
+ - **Primary key:** none in source data; entity resolution uses mergeable name + disambiguation snippet (patent publication context).
30
+
31
+ ### `organization` (assignee)
32
+
33
+ A named assignee from the harmonized assignee list (typically a company).
34
+
35
+ - **Primary key:** none in source data; mergeable name + snippet for resolution.
36
+
37
+ ## Properties
38
+
39
+ ### On `patent`
40
+
41
+ * `patent_publication_number`
42
+ * **Definition:** Canonical publication identifier from the `publication_number` field.
43
+ * **Examples:** `US-12345678-B2`
44
+ * **Derivation:** BigQuery `publication_number`, copied verbatim.
45
+
46
+ * `patent_grant_date`
47
+ * **Definition:** Grant date in `YYYY-MM-DD` (UTC calendar interpretation of the integer `grant_date`).
48
+ * **Examples:** `2025-10-15`
49
+ * **Derivation:** BigQuery `grant_date` (YYYYMMDD) reformatted.
50
+
51
+ * `patent_filing_date`
52
+ * **Definition:** Application filing date in `YYYY-MM-DD`.
53
+ * **Derivation:** BigQuery `filing_date` (YYYYMMDD) reformatted. Omitted when null.
54
+
55
+ * `patent_priority_date`
56
+ * **Definition:** Earliest priority date claimed by the patent in `YYYY-MM-DD`.
57
+ * **Derivation:** BigQuery `priority_date` (YYYYMMDD) reformatted. Omitted when null.
58
+
59
+ * `patent_kind_code`
60
+ * **Definition:** Document kind code (e.g. `A1`, `B1`, `B2`, `C1`, `S`, `P`).
61
+ * **Derivation:** BigQuery `kind_code`, copied verbatim.
62
+
63
+ * `patent_country`
64
+ * **Definition:** WIPO country/office code identifying the issuing patent office. Two-letter ISO 3166-1 alpha-2 for national offices (`US`, `JP`, `DE`, …) plus regional/international codes (`EP` for the European Patent Office, `WO` for WIPO/PCT).
65
+ * **Derivation:** BigQuery `country_code` on the selected row. The patents stream only ingests **`US`** (`WHERE` clause); the atom reflects the column when present, with a streamer fallback to `US` if empty.
66
+
67
+ * `title`
68
+ * **Definition:** Title of the patent publication **when an English row exists** in `title_localized`; otherwise empty (no fallback to other languages).
69
+ * **Examples:** “Example fusion reactor control”
70
+ * **Derivation:** `(SELECT t.text FROM UNNEST(p.title_localized) AS t WHERE t.language = 'en' LIMIT 1)`.
71
+
72
+ * `patent_title_language`
73
+ * **Definition:** ISO 639-1 lower-case code of the language used for the title atom (here **`en`** when English text exists).
74
+ * **Derivation:** `(SELECT t.language FROM UNNEST(p.title_localized) AS t WHERE t.language = 'en' LIMIT 1)`. Emitted only when a title is present.
75
+
76
+ * `patent_abstract`
77
+ * **Definition:** Abstract text **when an English row exists** in `abstract_localized`; otherwise empty. Not truncated in the current BigQuery SQL (full English text as stored).
78
+ * **Derivation:** `(SELECT a.text FROM UNNEST(p.abstract_localized) AS a WHERE a.language = 'en' LIMIT 1)`.
79
+
80
+ * `patent_abstract_language`
81
+ * **Definition:** ISO 639-1 lower-case code of the language used for the abstract atom (here **`en`** when English text exists).
82
+ * **Derivation:** `(SELECT a.language FROM UNNEST(p.abstract_localized) AS a WHERE a.language = 'en' LIMIT 1)`. Emitted only when an abstract is present.
83
+
84
+ * `cpc_code` (multi-valued)
85
+ * **Definition:** A direct CPC symbol assigned to this patent. One atom per code.
86
+ * **Derivation:** `STRING_AGG` of `cpc.code` from unnested `cpc`, then split on `,`.
87
+ * **Attribute `cpc_description`:** When the code exists in the in-memory CPC map (loaded from `patents-public-data.cpc.definition` and refreshed each polling cycle), the human-readable taxonomy path is attached as quad attribute **`cpc_description`** on that `cpc_code` atom (kgschema quad attr id 24). Codes not present in the map emit `cpc_code` only with no attribute.
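
The date properties above all follow the same rule: an integer in `YYYYMMDD` form is rewritten as `YYYY-MM-DD`, and the atom is omitted when the value is missing. A minimal Python sketch of that rule follows. The helper name `format_bq_date` is hypothetical, and treating `0` the same as null is an assumption rather than something the source states.

```python
from datetime import date
from typing import Optional


def format_bq_date(value: Optional[int]) -> Optional[str]:
    """Turn a YYYYMMDD integer into 'YYYY-MM-DD'; return None to omit the atom."""
    # Assumption: 0 is treated like null (an "unknown date" placeholder).
    if value is None or value == 0:
        return None
    year, rest = divmod(value, 10_000)
    month, day = divmod(rest, 100)
    # datetime.date validates the calendar date and raises ValueError on bad input.
    return date(year, month, day).isoformat()


print(format_bq_date(20251015))  # 2025-10-15
```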
88
+
89
+ ## Entity Relationships Summary
90
+
91
+ ```
92
+ patent ──[has_inventor]──→ person
93
+ patent ──[has_assignee]──→ organization
94
+ patent ──[cites_patent]──→ patent (other publications cited as prior art)
95
+ patent (multi-valued `cpc_code`; optional `cpc_description` quad attribute per code)
96
+ ```
97
+
98
+ The `patent` subject's primary citation text is always the canonical Google
99
+ Patents URL: `https://patents.google.com/patent/<UNHYPHENATED_PUBNUM>` (Google
100
+ Patents URLs require the publication number with hyphens removed — e.g.
101
+ `US12433179B2`, `JPH01160014U`). The same form is used for `cites_patent` target
102
+ citations.
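
The URL rule above is simple enough to show directly. This Python sketch strips the hyphens from a publication number and appends it to the base URL given in the text; the function and constant names are hypothetical.

```python
GOOGLE_PATENTS_BASE = "https://patents.google.com/patent/"


def google_patents_url(publication_number: str) -> str:
    """Build the canonical Google Patents URL from a hyphenated publication number."""
    # Google Patents URLs use the publication number with hyphens removed.
    return GOOGLE_PATENTS_BASE + publication_number.replace("-", "")


print(google_patents_url("US-12433179-B2"))
# https://patents.google.com/patent/US12433179B2
```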
103
+
104
+ Inventor and assignee names come from `STRING_AGG` lists on `inventor_harmonized` and `assignee_harmonized`, split on `;` after export.
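
Splitting an aggregated list back into individual values can be sketched as below. The helper name `split_aggregated` and the sample names are hypothetical, and trimming whitespace and dropping empty items are assumptions about how the streamer cleans the pieces. The same helper covers the `,`-separated CPC code list.

```python
from typing import List


def split_aggregated(value: str, sep: str) -> List[str]:
    """Split a STRING_AGG result, trimming whitespace and dropping empty items."""
    if not value:
        return []
    return [item.strip() for item in value.split(sep) if item.strip()]


inventors = split_aggregated("JANE DOE; JOHN ROE", ";")
cpc_codes = split_aggregated("G21B 1/00,H05H 1/02", ",")
print(inventors)  # ['JANE DOE', 'JOHN ROE']
print(cpc_codes)  # ['G21B 1/00', 'H05H 1/02']
```

One consequence of any delimiter-based round trip is that a name containing the separator itself would be split into two entries.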
105
+
106
+ ## Reference Data
107
+
108
+ ### CPC code → description mapping
109
+
110
+ The streamer needs a CPC code → human-readable description map to populate
111
+ the `cpc_description` quad attributes on `cpc_code` atoms. It hydrates the map by querying the public
112
+ BigQuery table **`patents-public-data.cpc.definition`** — the same project /
113
+ credentials already used to scan `patents-public-data.patents.publications`.
114
+ The query is hardcoded inside the streamer (`cpcTaxonomySQL` in
115
+ `patents_streamer.go`).
116
+
117
+ The map is **refreshed at the start of every polling cycle** so that newly
118
+ published CPC codes (the EPO updates the taxonomy a few times per year) are
119
+ picked up without requiring a streamer restart. If a refresh fails due to a
120
+ transient BQ outage and a previous successful load is cached, the streamer
121
+ logs a warning and continues with the cached map for that cycle; if the
122
+ first-ever load fails (no cache yet), the cycle is skipped without advancing
123
+ the checkpoint. Either way, no grant_date window is silently lost.
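
The refresh-with-fallback behaviour described above can be sketched in a few lines of Python. This is an illustration of the stated policy, not the streamer's Go code: the class name `CpcMapCache`, the `load` callback, and the boolean return convention are all assumptions.

```python
import logging
from typing import Callable, Dict, Optional


class CpcMapCache:
    """Holds the last successfully loaded CPC code -> description map."""

    def __init__(self) -> None:
        self.current: Optional[Dict[str, str]] = None

    def refresh(self, load: Callable[[], Dict[str, str]]) -> bool:
        """Try to reload the map; return True if the polling cycle may proceed."""
        try:
            self.current = load()
            return True
        except Exception as err:  # any loader failure is treated as transient here
            if self.current is not None:
                # A previous load succeeded: warn and keep using the cached map.
                logging.warning("CPC refresh failed, reusing cached map: %s", err)
                return True
            # No cache yet: the caller should skip this cycle and leave
            # the checkpoint where it is, so the window is retried later.
            return False
```

A caller would invoke `refresh` at the top of each cycle and advance its checkpoint only when the call returns `True` and the cycle itself completes.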
124
+
125
+ Codes referenced by patents but absent from the loaded map emit
126
+ `cpc_code` only, without a `cpc_description` attribute. The streamer never fails on
127
+ missing codes — gaps in the taxonomy are tolerated and reported only by
128
+ the absence of descriptions in the output.
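
The tolerance for unknown codes described above might look like the following Python sketch. The dictionary layout used for an atom is purely hypothetical (the real quad encoding is not shown in this document), as is the function name `cpc_atoms`.

```python
from typing import Dict, List, Mapping


def cpc_atoms(codes: List[str], descriptions: Mapping[str, str]) -> List[Dict[str, object]]:
    """Emit one cpc_code atom per code, attaching cpc_description only when known."""
    atoms: List[Dict[str, object]] = []
    for code in codes:
        atom: Dict[str, object] = {"property": "cpc_code", "value": code}
        description = descriptions.get(code)
        if description:
            # Known code: attach the taxonomy path as an attribute on this atom.
            atom["attributes"] = {"cpc_description": description}
        # Unknown code: still emitted, just without the attribute.
        atoms.append(atom)
    return atoms
```

Looking codes up with `get` rather than indexing is what keeps a gap in the taxonomy from raising an error.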
129
+
130
+ ### How the description query works
131
+
132
+ CPC is a tree: each symbol stores its own short title fragment in
133
+ `titlePart` plus a `parents` array listing every ancestor up to the root
134
+ (e.g. `H05H 1/02` → `H05H 1/00` → `H05H` → `H05` → `H`). To produce a
135
+ useful, *self-contained* description for any one code, the query walks
136
+ that ancestor chain for every leaf and concatenates each ancestor's
137
+ `titlePart` into a single root → leaf path string. So `H05H 1/02` becomes
138
+ something like *"ELECTRICITY > ELECTRIC TECHNIQUES NOT OTHERWISE PROVIDED FOR > Plasma technique >
139
+ Generating plasma > Glow discharges"* rather than just *"Glow
140
+ discharges"*. Each patent gets one description per code without the
141
+ streamer needing the rest of the taxonomy at atomization time.
142
+
143
+ The mechanics:
144
+
145
+ 1. Compute a normalized `sym_key` (whitespace stripped) for each row so
146
+ ancestor lookups are robust to spacing differences (`H05H 1/02` vs
147
+ `H05H1/02`).
148
+ 2. For each leaf, build an ordered `path_keys` array of normalized
149
+ ancestor keys, root-first, ending with the leaf itself.
150
+ 3. Unnest `path_keys` so each (leaf, ancestor, position) is one row.
151
+ 4. Join back to the definition table to recover each ancestor's title
152
+ fragment and group on the leaf, ordering by position so the output
153
+ reads from root to leaf.
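
The four steps above can be mirrored in memory to make the data flow concrete. The Python sketch below is an analogue of the SQL, not a transcription of `cpcTaxonomySQL`. The `Definition` type, the function names, and the placeholder titles are hypothetical, and the `parents` array is assumed to be ordered nearest ancestor first.

```python
import re
from typing import Dict, List, NamedTuple


class Definition(NamedTuple):
    symbol: str
    title_part: List[str]  # mirrors `titlePart`
    parents: List[str]     # assumed order: nearest ancestor first, root last


def sym_key(symbol: str) -> str:
    """Step 1: whitespace-stripped key used only for ancestor lookups."""
    return re.sub(r"\s+", "", symbol)


def build_descriptions(rows: List[Definition]) -> Dict[str, str]:
    by_key = {sym_key(r.symbol): r for r in rows}
    out: Dict[str, str] = {}
    for row in rows:
        # Step 2: root-first ancestor keys, ending with the row itself.
        path_keys = [sym_key(p) for p in reversed(row.parents)] + [sym_key(row.symbol)]
        fragments = []
        # Steps 3 and 4: visit each position in order and look up its title.
        for key in path_keys:
            node = by_key.get(key)
            title = " ".join(node.title_part).strip() if node else ""
            if title:  # drop empty fragments so no stray separators appear
                fragments.append(title)
        # The output keeps the original symbol spacing as its key.
        out[row.symbol] = " > ".join(fragments)
    return out


rows = [
    Definition("H", ["Section"], []),
    Definition("H05", ["Class"], ["H"]),
    Definition("H05H", ["Subclass"], ["H05", "H"]),
    Definition("H05H1/00", ["Main group"], ["H05H", "H05", "H"]),
    Definition("H05H 1/02", ["Subgroup", "details"], ["H05H 1/00", "H05H", "H05", "H"]),
]
print(build_descriptions(rows)["H05H 1/02"])
# Section > Class > Subclass > Main group > Subgroup details
```

The sample rows deliberately store the main group as `H05H1/00` while the subgroup refers to it as `H05H 1/00`, which is the spacing mismatch the normalized key exists to absorb.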
154
+
155
+ The streamer consumes the `code` and `description` columns; `parent` is
156
+ informational and currently retained on the in-memory `CPCNode` but not
157
+ emitted as an atom.
158
+
159
+ See `cpcTaxonomySQL` in `moongoose/fetch/patents_streamer.go` for the
160
+ exact BigQuery query.
161
+
162
+ Notes:
163
+
164
+ - The whitespace-stripped `sym_key` is only used internally as the JOIN
165
+ key between leaves and ancestors. The emitted `code` keeps its original
166
+ spacing because patents publications retain it too (e.g. `G21B 1/00`,
167
+ not `G21B1/00`); the streamer matches against codes-as-published.
168
+ - `ARRAY_TO_STRING(titlePart, ' ')` collapses CPC's multi-segment node
169
+ titles into a single sentence per ancestor before they're joined with
170
+ `> ` separators.
171
+ - `NULLIF(..., '')` filters empty fragments so trailing or leading
172
+ separators don't leak into the final description.
@@ -0,0 +1,156 @@
1
+ # Dataset schema for patent grant publications from Google Patents Public Datasets.
2
+ #
3
+ # Structured atomization only — no LLM extraction.
4
+ name: "patents"
5
+ description: "Patent grant publications from Google Patents Public Datasets (titles, abstracts, CPC, inventors, assignees, countries, and languages)."
6
+
7
+ extraction:
8
+ flavors: closed
9
+ properties: closed
10
+ relationships: closed
11
+ attributes: closed
12
+ events: closed
13
+
14
+ flavors:
15
+ - name: "patent"
16
+ description: "A granted patent publication identified by its publication number (WIPO-style number including office prefix, e.g. US-12345678-B2, EP-1234567-B1)"
17
+ display_name: "Patent"
18
+ mergeability: not_mergeable
19
+ strong_id_properties: ["patent_publication_number"]
20
+ passive: true
21
+
22
+ - name: "person"
23
+ description: "A real person as opposed to a fictional character, such as a CEO, politician, or public figure"
24
+ display_name: "Person"
25
+ mergeability: not_mergeable
26
+ passive: true
27
+
28
+ - name: "organization"
29
+ description: "A particular business, institution, or organization such as a corporation, university, government agency, or non-profit"
30
+ display_name: "Organization"
31
+ mergeability: not_mergeable
32
+ passive: true
33
+
34
+ properties:
35
+ - name: "patent_publication_number"
36
+ type: string
37
+ description: "Canonical patent publication identifier from the publications table (office prefix and kind suffix vary by jurisdiction, e.g. US-12345678-B2, JP-6001234-A)"
38
+ display_name: "Patent Publication Number"
39
+ mergeability: not_mergeable
40
+ domain_flavors: ["patent"]
41
+ passive: true
42
+
43
+ - name: "patent_grant_date"
44
+ type: string
45
+ description: "Patent grant date in YYYY-MM-DD format."
46
+ display_name: "Patent Grant Date"
47
+ mergeability: not_mergeable
48
+ domain_flavors: ["patent"]
49
+ passive: true
50
+
51
+ - name: "patent_filing_date"
52
+ type: string
53
+ description: "Application filing date in YYYY-MM-DD format."
54
+ display_name: "Patent Filing Date"
55
+ mergeability: not_mergeable
56
+ domain_flavors: ["patent"]
57
+ passive: true
58
+
59
+ - name: "patent_priority_date"
60
+ type: string
61
+ description: "Earliest priority date claimed by the patent in YYYY-MM-DD format."
62
+ display_name: "Patent Priority Date"
63
+ mergeability: not_mergeable
64
+ domain_flavors: ["patent"]
65
+ passive: true
66
+
67
+ - name: "patent_kind_code"
68
+ type: string
69
+ description: "Publication kind code (e.g. A1, B1, B2, C1, S, P) distinguishing applications from grants and other document types"
70
+ display_name: "Patent Kind Code"
71
+ mergeability: not_mergeable
72
+ domain_flavors: ["patent"]
73
+ passive: true
74
+
75
+ - name: "patent_country"
76
+ type: string
77
+ description: "WIPO country/office code identifying the issuing patent office (e.g. US, EP, JP, WO). Two-letter ISO 3166-1 alpha-2 for national offices, plus regional/international codes (EP for the European Patent Office, WO for WIPO/PCT)."
78
+ display_name: "Patent Country"
79
+ mergeability: not_mergeable
80
+ domain_flavors: ["patent"]
81
+ passive: true
82
+
83
+ - name: "title"
84
+ type: string
85
+ description: "Title of the entity"
86
+ display_name: "Title"
87
+ mergeability: not_mergeable
88
+ domain_flavors: ["patent"]
89
+ passive: true
90
+
91
+ - name: "patent_title_language"
92
+ type: string
93
+ description: "ISO 639-1 lower-case code of the language used for the patent's title atom (e.g. 'en', 'zh', 'ja'). Emitted only when a title is present."
94
+ display_name: "Patent Title Language"
95
+ mergeability: not_mergeable
96
+ domain_flavors: ["patent"]
97
+ passive: true
98
+
99
+ - name: "patent_abstract"
100
+ type: string
101
+ description: "Patent publication abstract text."
102
+ display_name: "Patent Abstract"
103
+ mergeability: not_mergeable
104
+ domain_flavors: ["patent"]
105
+ passive: true
106
+
107
+ - name: "patent_abstract_language"
108
+ type: string
109
+ description: "ISO 639-1 lower-case code of the language used for the patent's abstract atom (e.g. 'en', 'zh', 'ja'). Emitted only when an abstract is present."
110
+ display_name: "Patent Abstract Language"
111
+ mergeability: not_mergeable
112
+ domain_flavors: ["patent"]
113
+ passive: true
114
+
115
+ - name: "cpc_code"
116
+ type: string
117
+ description: "CPC symbol classifying this patent (multi-valued: one atom per direct code on the publication)"
118
+ display_name: "CPC Code"
119
+ mergeability: not_mergeable
120
+ domain_flavors: ["patent"]
121
+ passive: true
122
+
123
+ relationships:
124
+ - name: "has_inventor"
125
+ description: "A patent lists a person as an inventor"
126
+ display_name: "Has Inventor"
127
+ mergeability: not_mergeable
128
+ domain_flavors: ["patent"]
129
+ target_flavors: ["person"]
130
+ passive: true
131
+
132
+ - name: "has_assignee"
133
+ description: "A patent lists an organization as an assignee"
134
+ display_name: "Has Assignee"
135
+ mergeability: not_mergeable
136
+ domain_flavors: ["patent"]
137
+ target_flavors: ["organization"]
138
+ passive: true
139
+
140
+ - name: "cites_patent"
141
+ description: "A patent cites another patent publication as prior art (non-patent literature citations are not represented as edges)"
142
+ display_name: "Cites Patent"
143
+ mergeability: not_mergeable
144
+ domain_flavors: ["patent"]
145
+ target_flavors: ["patent"]
146
+ passive: true
147
+
148
+ # Human-readable CPC code description. Stored as a quad attribute on each
149
+ # cpc_code atom when known.
150
+ attributes:
151
+ - property: "cpc_code"
152
+ name: "cpc_description"
153
+ type: string
154
+ description: "CPC taxonomy title path for this code (from patents-public-data.cpc.definition); omitted when the code is missing from the map"
155
+ display_name: "CPC Description"
156
+ mergeability: not_mergeable