commonmeta-ruby 3.2.15 → 3.3

Files changed (25)
  1. checksums.yaml +4 -4
  2. data/Gemfile.lock +1 -1
  3. data/bin/commonmeta +1 -1
  4. data/lib/commonmeta/author_utils.rb +1 -1
  5. data/lib/commonmeta/cli.rb +14 -0
  6. data/lib/commonmeta/crossref_utils.rb +56 -14
  7. data/lib/commonmeta/readers/json_feed_reader.rb +25 -1
  8. data/lib/commonmeta/utils.rb +34 -0
  9. data/lib/commonmeta/version.rb +1 -1
  10. data/spec/cli_spec.rb +12 -3
  11. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/doi_prefix/doi_prefix_by_blog.yml +997 -0
  12. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/doi_prefix/doi_prefix_by_uuid.yml +256 -0
  13. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/encode/by_blog.yml +997 -0
  14. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/encode/by_uuid.yml +256 -0
  15. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_doi_prefix_for_blog/by_blog_id.yml +997 -0
  16. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_doi_prefix_for_blog/by_blog_post_uuid.yml +389 -0
  17. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_doi_prefix_for_blog/by_blog_post_uuid_specific_prefix.yml +389 -0
  18. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item/by_uuid.yml +136 -0
  19. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/blog_post_with_non-url_id.yml +136 -0
  20. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/ghost_post_with_organizational_author.yml +91 -0
  21. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_crossref/json_feed_item_from_rogue_scholar_with_organizational_author.yml +91 -0
  22. data/spec/readers/json_feed_reader_spec.rb +68 -0
  23. data/spec/utils_spec.rb +8 -0
  24. data/spec/writers/crossref_xml_writer_spec.rb +28 -0
  25. metadata +13 -2
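The fixture names above (Commonmeta_Metadata/get_json_feed_item_metadata, write_metadata_as_crossref, crossref_xml_writer_spec) point at the code paths this release extends: reading a Rogue Scholar JSON feed item, including posts with organizational authors, and writing it out as Crossref XML. A minimal Ruby sketch of that round trip follows; the Commonmeta::Metadata class name is taken from the spec paths, but the input URL shape and the crossref_xml writer method are assumptions, not confirmed API.

```ruby
# Minimal sketch, assuming the Commonmeta::Metadata API suggested by the spec
# file names above; the crossref_xml writer method and the Rogue Scholar
# /api/posts/ endpoint are assumptions, not confirmed against the gem's docs.
require "commonmeta"

# UUID of the "Obsidian, markdown, and taxonomic trees" post that appears in
# the recorded cassette below; the URL shape is illustrative only.
input = "https://rogue-scholar.org/api/posts/20b9d31e-513f-496b-b399-4215306e1588"

metadata = Commonmeta::Metadata.new(input: input)
puts metadata.crossref_xml # Crossref XML output, the writer exercised by the new specs
```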
@@ -0,0 +1,997 @@
+ ---
+ http_interactions:
+ - request:
+     method: get
+     uri: https://rogue-scholar.org/api/blogs/tyfqw20
+     body:
+       encoding: UTF-8
+       string: ''
+     headers:
+       Connection:
+       - close
+       Host:
+       - rogue-scholar.org
+       User-Agent:
+       - http.rb/5.1.1
+   response:
+     status:
+       code: 200
+       message: OK
+     headers:
+       Age:
+       - '0'
+       Cache-Control:
+       - public, max-age=0, must-revalidate
+       Content-Length:
+       - '88356'
+       Content-Type:
+       - application/json; charset=utf-8
+       Date:
+       - Sun, 18 Jun 2023 06:09:04 GMT
+       Etag:
+       - '"xxw11pz9vb1w14"'
+       Server:
+       - Vercel
+       Strict-Transport-Security:
+       - max-age=63072000
+       X-Matched-Path:
+       - "/api/blogs/[slug]"
+       X-Vercel-Cache:
+       - MISS
+       X-Vercel-Id:
+       - fra1::iad1::jpsj8-1687068539414-5ae2f138cf70
+       Connection:
+       - close
+     body:
+       encoding: UTF-8
+ string: '{"id":"tyfqw20","title":"iPhylo","description":"Rants, raves (and occasionally
48
+ considered opinions) on phyloinformatics, taxonomy, and biodiversity informatics. For
49
+ more ranty and less considered opinions, see my <a href=\"https://twitter.com/rdmpage\">Twitter
50
+ feed</a>.<br>ISSN 2051-8188. Written content on this site is licensed under
51
+ a <a href=\"https://creativecommons.org/licenses/by/4.0/\">Creative Commons
52
+ Attribution 4.0 International license</a>.","language":"en","favicon":null,"feed_url":"https://iphylo.blogspot.com/feeds/posts/default","feed_format":"application/atom+xml","home_page_url":"https://iphylo.blogspot.com/","indexed_at":"2023-02-06","modified_at":"2023-05-31T17:26:00+00:00","license":"https://creativecommons.org/licenses/by/4.0/legalcode","generator":"Blogger
53
+ 7.00","category":"Natural Sciences","backlog":true,"prefix":"10.59350","items":[{"id":"https://doi.org/10.59350/en7e9-5s882","uuid":"20b9d31e-513f-496b-b399-4215306e1588","url":"https://iphylo.blogspot.com/2022/04/obsidian-markdown-and-taxonomic-trees.html","title":"Obsidian,
54
+ markdown, and taxonomic trees","summary":"Returning to the subject of personal
55
+ knowledge graphs Kyle Scheer has an interesting repository of Markdown files
56
+ that describe academic disciplines at https://github.com/kyletscheer/academic-disciplines
57
+ (see his blog post for more background). If you add these files to Obsidian
58
+ you get a nice visualisation of a taxonomy of academic disciplines. The applications
59
+ of this to biological taxonomy seem obvious, especially as a tool like Obsidian
60
+ enables all sorts of interesting links to be added...","date_published":"2022-04-07T21:07:00Z","date_modified":"2022-04-07T21:15:34Z","date_indexed":"1909-06-16T09:41:45+00:00","authors":[{"url":null,"name":"Roderic
61
+ Page"}],"image":null,"content_html":"<p>Returning to the subject of <a href=\"https://iphylo.blogspot.com/2020/08/personal-knowledge-graphs-obsidian-roam.html\">personal
62
+ knowledge graphs</a> Kyle Scheer has an interesting repository of Markdown
63
+ files that describe academic disciplines at <a href=\"https://github.com/kyletscheer/academic-disciplines\">https://github.com/kyletscheer/academic-disciplines</a>
64
+ (see <a href=\"https://kyletscheer.medium.com/on-creating-a-tree-of-knowledge-f099c1028bf6\">his
65
+ blog post</a> for more background).</p>\n\n<p>If you add these files to <a
66
+ href=\"https://obsidian.md/\">Obsidian</a> you get a nice visualisation of
67
+ a taxonomy of academic disciplines. The applications of this to biological
68
+ taxonomy seem obvious, especially as a tool like Obsidian enables all sorts
69
+ of interesting links to be added (e.g., we could add links to the taxonomic
70
+ research behind each node in the taxonomic tree, the people doing that research,
71
+ etc. - although that would mean we''d no longer have a simple tree).</p>\n\n<p>The
72
+ more I look at these sort of simple Markdown-based tools the more I wonder
73
+ whether we could make more use of them to create simple but persistent databases.
74
+ Text files seem the most stable, long-lived digital format around, maybe this
75
+ would be a way to minimise the inevitable obsolescence of database and server
76
+ software. Time for some experiments I feel... can we take a taxonomic group,
77
+ such as mammals, and create a richly connected database purely in Markdown?</p>\n\n<div
78
+ class=\"separator\" style=\"clear: both; text-align: center;\"><iframe allowfullscreen=''allowfullscreen''
79
+ webkitallowfullscreen=''webkitallowfullscreen'' mozallowfullscreen=''mozallowfullscreen''
80
+ width=''400'' height=''322'' src=''https://www.blogger.com/video.g?token=AD6v5dxZtweOTJTdg6aqvICq_tKF0la1QZuDAEpwPPCVQKtG5vjB-DzuQv-ApL8JnpyZ1FffYtWo6ymizNQ''
81
+ class=''b-hbp-video b-uploaded'' frameborder=''0''></iframe></div>","tags":["markdown","obsidian"],"language":"en","references":[]},{"id":"https://doi.org/10.59350/m48f7-c2128","uuid":"8aea47e4-f227-45f4-b37b-0454a8a7a3ff","url":"https://iphylo.blogspot.com/2023/04/chatgpt-semantic-search-and-knowledge.html","title":"ChatGPT,
82
+ semantic search, and knowledge graphs","summary":"One thing about ChatGPT
83
+ is it has opened my eyes to some concepts I was dimly aware of but am only
84
+ now beginning to fully appreciate. ChatGPT enables you ask it questions, but
85
+ the answers depend on what ChatGPT “knows”. As several people have noted,
86
+ what would be even better is to be able to run ChatGPT on your own content.
87
+ Indeed, ChatGPT itself now supports this using plugins. Paul Graham GPT However,
88
+ it’s still useful to see how to add ChatGPT functionality to your own content
89
+ from...","date_published":"2023-04-03T15:30:00Z","date_modified":"2023-04-03T15:32:04Z","date_indexed":"1909-06-16T09:02:34+00:00","authors":[{"url":null,"name":"Roderic
90
+ Page"}],"image":null,"content_html":"<p>One thing about ChatGPT is it has
91
+ opened my eyes to some concepts I was dimly aware of but am only now beginning
92
+ to fully appreciate. ChatGPT enables you ask it questions, but the answers
93
+ depend on what ChatGPT “knows”. As several people have noted, what would be
94
+ even better is to be able to run ChatGPT on your own content. Indeed, ChatGPT
95
+ itself now supports this using <a href=\"https://openai.com/blog/chatgpt-plugins\">plugins</a>.</p>\n<h4
96
+ id=\"paul-graham-gpt\">Paul Graham GPT</h4>\n<p>However, it’s still useful
97
+ to see how to add ChatGPT functionality to your own content from scratch.
98
+ A nice example of this is <a href=\"https://paul-graham-gpt.vercel.app/\">Paul
99
+ Graham GPT</a> by <a href=\"https://twitter.com/mckaywrigley\">Mckay Wrigley</a>.
100
+ Mckay Wrigley took essays by Paul Graham (a well known venture capitalist)
101
+ and built a question and answer tool very like ChatGPT.</p>\n<iframe width=\"560\"
102
+ height=\"315\" src=\"https://www.youtube.com/embed/ii1jcLg-eIQ\" title=\"YouTube
103
+ video player\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write;
104
+ encrypted-media; gyroscope; picture-in-picture; web-share\" allowfullscreen></iframe>\n<p>Because
105
+ you can send a block of text to ChatGPT (as part of the prompt) you can get
106
+ ChatGPT to summarise or transform that information, or answer questions based
107
+ on that information. But there is a limit to how much information you can
108
+ pack into a prompt. You can’t put all of Paul Graham’s essays into a prompt
109
+ for example. So a solution is to do some preprocessing. For example, given
110
+ a question such as “How do I start a startup?” we could first find the essays
111
+ that are most relevant to this question, then use them to create a prompt
112
+ for ChatGPT. A quick and dirty way to do this is simply do a text search over
113
+ the essays and take the top hits. But we aren’t searching for words, we are
114
+ searching for answers to a question. The essay with the best answer might
115
+ not include the phrase “How do I start a startup?”.</p>\n<h4 id=\"semantic-search\">Semantic
116
+ search</h4>\n<p>Enter <a href=\"https://en.wikipedia.org/wiki/Semantic_search\">Semantic
117
+ search</a>. The key concept behind semantic search is that we are looking
118
+ for documents with similar meaning, not just similarity of text. One approach
119
+ to this is to represent documents by “embeddings”, that is, a vector of numbers
120
+ that encapsulate features of the document. Documents with similar vectors
121
+ are potentially related. In semantic search we take the query (e.g., “How
122
+ do I start a startup?”), compute its embedding, then search among the documents
123
+ for those with similar embeddings.</p>\n<p>To create Paul Graham GPT Mckay
124
+ Wrigley did the following. First he sent each essay to the OpenAI API underlying
125
+ ChatGPT, and in return he got the embedding for that essay (a vector of 1536
126
+ numbers). Each embedding was stored in a database (Mckay uses Postgres with
127
+ <a href=\"https://github.com/pgvector/pgvector\">pgvector</a>). When a user
128
+ enters a query such as “How do I start a startup?” that query is also sent
129
+ to the OpenAI API to retrieve its embedding vector. Then we query the database
130
+ of embeddings for Paul Graham’s essays and take the top five hits. These hits
131
+ are, one hopes, the most likely to contain relevant answers. The original
132
+ question and the most similar essays are then bundled up and sent to ChatGPT
133
+ which then synthesises an answer. See his <a href=\"https://github.com/mckaywrigley/paul-graham-gpt\">GitHub
134
+ repo</a> for more details. Note that we are still using ChatGPT, but on a
135
+ set of documents it doesn’t already have.</p>\n<h4 id=\"knowledge-graphs\">Knowledge
136
+ graphs</h4>\n<p>I’m a fan of knowledge graphs, but they are not terribly easy
137
+ to use. For example, I built a knowledge graph of Australian animals <a href=\"https://ozymandias-demo.herokuapp.com\">Ozymandias</a>
138
+ that contains a wealth of information on taxa, publications, and people, wrapped
139
+ up in a web site. If you want to learn more you need to figure out how to
140
+ write queries in SPARQL, which is not fun. Maybe we could use ChatGPT to write
141
+ the SPARQL queries for us, but it would be much more fun to be simply ask
142
+ natural language queries (e.g., “who are the experts on Australian ants?”).
143
+ I made some naïve notes on these ideas <a href=\"https://iphylo.blogspot.com/2015/09/possible-project-natural-language.html\">Possible
144
+ project: natural language queries, or answering “how many species are there?”</a>
145
+ and <a href=\"https://iphylo.blogspot.com/2019/05/ozymandias-meets-wikipedia-with-notes.html\">Ozymandias
146
+ meets Wikipedia, with notes on natural language generation</a>.</p>\n<p>Of
147
+ course, this is a well known problem. Tools such as <a href=\"http://rdf2vec.org\">RDF2vec</a>
148
+ can take RDF from a knowledge graph and create embeddings which could in tern
149
+ be used to support semantic search. But it seems to me that we could simply
150
+ this process a bit by making use of ChatGPT.</p>\n<p>Firstly we would generate
151
+ natural language statements from the knowledge graph (e.g., “species x belongs
152
+ to genus y and was described in z”, “this paper on ants was authored by x”,
153
+ etc.) that cover the basic questions we expect people to ask. We then get
154
+ embeddings for these (e.g., using OpenAI). We then have an interface where
155
+ people can ask a question (“is species x a valid species?”, “who has published
156
+ on ants”, etc.), we get the embedding for that question, retrieve natural
157
+ language statements that the closest in embedding “space”, package everything
158
+ up and ask ChatGPT to summarise the answer.</p>\n<p>The trick, of course,
159
+ is to figure out how t generate natural language statements from the knowledge
160
+ graph (which amounts to deciding what paths to traverse in the knowledge graph,
161
+ and how to write those paths is something approximating English). We also
162
+ want to know something about the sorts of questions people are likely to ask
163
+ so that we have a reasonable chance of having the answers (for example, are
164
+ people going to ask about individual species, or questions about summary statistics
165
+ such as numbers of species in a genus, etc.).</p>\n<p>What makes this attractive
166
+ is that it seems a straightforward way to go from a largely academic exercise
167
+ (build a knowledge graph) to something potentially useful (a question and
168
+ answer machine). Imagine if something like the defunct BBC wildlife site (see
169
+ <a href=\"https://iphylo.blogspot.com/2017/12/blue-planet-ii-bbc-and-semantic-web.html\">Blue
170
+ Planet II, the BBC, and the Semantic Web: a tale of lessons forgotten and
171
+ opportunities lost</a>) revived <a href=\"https://aspiring-look.glitch.me\">here</a>
172
+ had a question and answer interface where we could ask questions rather than
173
+ passively browse.</p>\n<h4 id=\"summary\">Summary</h4>\n<p>I have so much
174
+ more to learn, and need to think about ways to incorporate semantic search
175
+ and ChatGPT-like tools into knowledge graphs.</p>\n<blockquote>\n<p>Written
176
+ with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":[],"language":"en","references":[]},{"id":"https://doi.org/10.59350/zc4qc-77616","uuid":"30c78d9d-2e50-49db-9f4f-b3baa060387b","url":"https://iphylo.blogspot.com/2022/09/does-anyone-cite-taxonomic-treatments.html","title":"Does
177
+ anyone cite taxonomic treatments?","summary":"Taxonomic treatments have come
178
+ up in various discussions I''m involved in, and I''m curious as to whether
179
+ they are actually being used, in particular, whether they are actually being
180
+ cited. Consider the following quote: The taxa are described in taxonomic treatments,
181
+ well defined sections of scientific publications (Catapano 2019). They include
182
+ a nomenclatural section and one or more sections including descriptions, material
183
+ citations referring to studied specimens, or notes ecology and...","date_published":"2022-09-01T16:49:00Z","date_modified":"2022-09-01T16:49:51Z","date_indexed":"1909-06-16T09:31:50+00:00","authors":[{"url":null,"name":"Roderic
184
+ Page"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
185
+ both;\"><a href=\"https://zenodo.org/record/5731100/thumb100\" style=\"display:
186
+ block; padding: 1em 0; text-align: center; clear: right; float: right;\"><img
187
+ alt=\"\" border=\"0\" height=\"128\" data-original-height=\"106\" data-original-width=\"100\"
188
+ src=\"https://zenodo.org/record/5731100/thumb250\"/></a></div>\nTaxonomic
189
+ treatments have come up in various discussions I''m involved in, and I''m
190
+ curious as to whether they are actually being used, in particular, whether
191
+ they are actually being cited. Consider the following quote:\n\n<blockquote>\nThe
192
+ taxa are described in taxonomic treatments, well defined sections of scientific
193
+ publications (Catapano 2019). They include a nomenclatural section and one
194
+ or more sections including descriptions, material citations referring to studied
195
+ specimens, or notes ecology and behavior. In case the treatment does not describe
196
+ a new discovered taxon, previous treatments are cited in the form of treatment
197
+ citations. This citation can refer to a previous treatment and add additional
198
+ data, or it can be a statement synonymizing the taxon with another taxon.
199
+ This allows building a citation network, and ultimately is a constituent part
200
+ of the catalogue of life. - Taxonomic Treatments as Open FAIR Digital Objects
201
+ <a href=\"https://doi.org/10.3897/rio.8.e93709\">https://doi.org/10.3897/rio.8.e93709</a>\n</blockquote>\n\n<p>\n
202
+ \"Traditional\" academic citation is from article to article. For example,
203
+ consider these two papers:\n\n<blockquote>\nLi Y, Li S, Lin Y (2021) Taxonomic
204
+ study on fourteen symphytognathid species from Asia (Araneae, Symphytognathidae).
205
+ ZooKeys 1072: 1-47. https://doi.org/10.3897/zookeys.1072.67935\n</blockquote>\n\n<blockquote>\nMiller
206
+ J, Griswold C, Yin C (2009) The symphytognathoid spiders of the Gaoligongshan,
207
+ Yunnan, China (Araneae: Araneoidea): Systematics and diversity of micro-orbweavers.
208
+ ZooKeys 11: 9-195. https://doi.org/10.3897/zookeys.11.160\n</blockquote>\n</p>\n\n<p>Li
209
+ et al. 2021 cites Miller et al. 2009 (although Pensoft seems to have broken
210
+ the citation such that it does appear correctly either on their web page or
211
+ in CrossRef).</p>\n\n<p>So, we have this link: [article]10.3897/zookeys.1072.67935
212
+ --cites--> [article]10.3897/zookeys.11.160. One article cites another.</p>\n\n<p>In
213
+ their 2021 paper Li et al. discuss <i>Patu jidanweishi</i> Miller, Griswold
214
+ & Yin, 2009:\n\n<div class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPavMHXqQNX1ls_zXo9kIcMLHPxc7ZpV9NCSof5wLumrg3ovPoi6nYKzZsINuqFtYoEvW1QrerePD-MEf2DJaYUXlT37d81x3L6ILls7u229rg0_Nc0uUmgW-ICzr6MI_QCZfgQbYGTxuofu-fuPVoygbCnm3vQVYOhLDLtp1EtQ9jRZHDvw/s1040/Screenshot%202022-09-01%20at%2017.12.27.png\"
215
+ style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
216
+ border=\"0\" width=\"400\" data-original-height=\"314\" data-original-width=\"1040\"
217
+ src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPavMHXqQNX1ls_zXo9kIcMLHPxc7ZpV9NCSof5wLumrg3ovPoi6nYKzZsINuqFtYoEvW1QrerePD-MEf2DJaYUXlT37d81x3L6ILls7u229rg0_Nc0uUmgW-ICzr6MI_QCZfgQbYGTxuofu-fuPVoygbCnm3vQVYOhLDLtp1EtQ9jRZHDvw/s400/Screenshot%202022-09-01%20at%2017.12.27.png\"/></a></div>\n\n<p>There
218
+ is a treatment for the original description of <i>Patu jidanweishi</i> at
219
+ <a href=\"https://doi.org/10.5281/zenodo.3792232\">https://doi.org/10.5281/zenodo.3792232</a>,
220
+ which was created by Plazi with a time stamp \"2020-05-06T04:59:53.278684+00:00\".
221
+ The original publication date was 2009, the treatments are being added retrospectively.</p>\n\n<p>In
222
+ an ideal world my expectation would be that Li et al. 2021 would have cited
223
+ the treatment, instead of just providing the text string \"Patu jidanweishi
224
+ Miller, Griswold & Yin, 2009: 64, figs 65A–E, 66A, B, 67A–D, 68A–F, 69A–F,
225
+ 70A–F and 71A–F (♂♀).\" Isn''t the expectation under the treatment model that
226
+ we would have seen this relationship:</p>\n\n<p>[article]10.3897/zookeys.1072.67935
227
+ --cites--> [treatment]https://doi.org/10.5281/zenodo.3792232</p>\n\n<p>Furthermore,
228
+ if it is the case that \"[i]n case the treatment does not describe a new discovered
229
+ taxon, previous treatments are cited in the form of treatment citations\"
230
+ then we should also see a citation between treatments, in other words Li et
231
+ al.''s 2021 treatment of <i>Patu jidanweishi</i> (which doesn''t seem to have
232
+ a DOI but is available on Plazi'' web site as <a href=\"https://tb.plazi.org/GgServer/html/1CD9FEC313A35240938EC58ABB858E74\">https://tb.plazi.org/GgServer/html/1CD9FEC313A35240938EC58ABB858E74</a>)
233
+ should also cite the original treatment? It doesn''t - but it does cite the
234
+ Miller et al. paper.</p>\n\n<p>So in this example we don''t see articles citing
235
+ treatments, nor do we see treatments citing treatments. Playing Devil''s advocate,
236
+ why then do we have treatments? Does''t the lack of citations suggest that
237
+ - despite some taxonomists saying this is the unit that matters - they actually
238
+ don''t. If we pay attention to what people do rather than what they say they
239
+ do, they cite articles.</p>\n\n<p>Now, there are all sorts of reasons why
240
+ we don''t see [article] -> [treatment] citations, or [treatment] -> [treatment]
241
+ citations. Treatments are being added after the fact by Plazi, not by the
242
+ authors of the original work. And in many cases the treatments that could
243
+ be cited haven''t appeared until after that potentially citing work was published.
244
+ In the example above the Miller et al. paper dates from 2009, but the treatment
245
+ extracted only went online in 2020. And while there is a long standing culture
246
+ of citing publications (ideally using DOIs) there isn''t an equivalent culture
247
+ of citing treatments (beyond the simple text strings).</p>\n\n<p>Obviously
248
+ this is but one example. I''d need to do some exploration of the citation
249
+ graph to get a better sense of citations patterns, perhaps using <a href=\"https://www.crossref.org/documentation/event-data/\">CrossRef''s
250
+ event data</a>. But my sense is that taxonomists don''t cite treatments.</p>\n\n<p>I''m
251
+ guessing Plazi would respond by saying treatments are cited, for example (indirectly)
252
+ in GBIF downloads. This is true, although arguably people aren''t citing the
253
+ treatment, they''re citing specimen data in those treatments, and that specimen
254
+ data could be extracted at the level of articles rather than treatments. In
255
+ other words, it''s not the treatments themselves that people are citing.</p>\n\n<p>To
256
+ be clear, I think there is value in being able to identify those \"well defined
257
+ sections\" of a publication that deal with a given taxon (i.e., treatments),
258
+ but it''s not clear to me that these are actually the citable units people
259
+ might hope them to be. Likewise, journals such as <i>ZooKeys</i> have DOIs
260
+ for individual figures. Does anyone actually cite those?</p>","tags":[],"language":"en","references":[]},{"id":"https://doi.org/10.59350/pmhat-5ky65","uuid":"5891c709-d139-440f-bacb-06244424587a","url":"https://iphylo.blogspot.com/2021/10/problems-with-plazi-parsing-how.html","title":"Problems
261
+ with Plazi parsing: how reliable are automated methods for extracting specimens
262
+ from the literature?","summary":"The Plazi project has become one of the major
263
+ contributors to GBIF with some 36,000 datasets yielding some 500,000 occurrences
264
+ (see Plazi''s GBIF page for details). These occurrences are extracted from
265
+ taxonomic publication using automated methods. New data is published almost
266
+ daily (see latest treatments). The map below shows the geographic distribution
267
+ of material citations provided to GBIF by Plazi, which gives you a sense of
268
+ the size of the dataset. By any metric Plazi represents a...","date_published":"2021-10-25T11:10:00Z","date_modified":"2021-10-28T16:08:18Z","date_indexed":"1970-01-01T00:00:00+00:00","authors":[{"url":null,"name":"Roderic
269
+ Page"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
270
+ both;\"><a href=\"https://1.bp.blogspot.com/-oiqSkA53FI4/YXaRAUpRgFI/AAAAAAAAgzY/PbPEOh_VlhIE0aaqqhVQPLAOE5pRULwJACLcBGAsYHQ/s240/Rf7UoXTw_400x400.jpg\"
271
+ style=\"display: block; padding: 1em 0; text-align: center; clear: right;
272
+ float: right;\"><img alt=\"\" border=\"0\" width=\"128\" data-original-height=\"240\"
273
+ data-original-width=\"240\" src=\"https://1.bp.blogspot.com/-oiqSkA53FI4/YXaRAUpRgFI/AAAAAAAAgzY/PbPEOh_VlhIE0aaqqhVQPLAOE5pRULwJACLcBGAsYHQ/s200/Rf7UoXTw_400x400.jpg\"/></a></div><p>The
274
+ <a href=\"http://plazi.org\">Plazi</a> project has become one of the major
275
+ contributors to GBIF with some 36,000 datasets yielding some 500,000 occurrences
276
+ (see <a href=\"https://www.gbif.org/publisher/7ce8aef0-9e92-11dc-8738-b8a03c50a862\">Plazi''s
277
+ GBIF page</a> for details). These occurrences are extracted from taxonomic
278
+ publication using automated methods. New data is published almost daily (see
279
+ <a href=\"https://tb.plazi.org/GgServer/static/newToday.html\">latest treatments</a>).
280
+ The map below shows the geographic distribution of material citations provided
281
+ to GBIF by Plazi, which gives you a sense of the size of the dataset.</p>\n\n<div
282
+ class=\"separator\" style=\"clear: both;\"><a href=\"https://1.bp.blogspot.com/-DCJ4HR8eej8/YXaRQnz22bI/AAAAAAAAgz4/AgRcree6jVgjtQL2ch7IXgtb_Xtx7fkngCPcBGAYYCw/s1030/Screenshot%2B2021-10-24%2Bat%2B20.43.23.png\"
283
+ style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
284
+ border=\"0\" width=\"400\" data-original-height=\"514\" data-original-width=\"1030\"
285
+ src=\"https://1.bp.blogspot.com/-DCJ4HR8eej8/YXaRQnz22bI/AAAAAAAAgz4/AgRcree6jVgjtQL2ch7IXgtb_Xtx7fkngCPcBGAYYCw/s400/Screenshot%2B2021-10-24%2Bat%2B20.43.23.png\"/></a></div>\n\n<p>By
286
+ any metric Plazi represents a considerable achievement. But often when I browse
287
+ individual records on Plazi I find records that seem clearly incorrect. Text
288
+ mining the literature is a challenging problem, but at the moment Plazi seems
289
+ something of a \"black box\". PDFs go in, the content is mined, and data comes
290
+ up to be displayed on the Plazi web site and uploaded to GBIF. Nowhere does
291
+ there seem to be an evaluation of how accurate this text mining actually is.
292
+ Anecdotally it seems to work well in some cases, but in others it produces
293
+ what can only be described as bogus records.</p>\n\n<h2>Finding errors</h2>\n\n<p>A
294
+ treatment in Plazi is a block of text (and sometimes illustrations) that refers
295
+ to a single taxon. Often that text will include a description of the taxon,
296
+ and list one or more specimens that have been examined. These lists of specimens
297
+ (\"material citations\") are one of the key bits of information that Plaza
298
+ extracts from a treatment as these citations get fed into GBIF as occurrences.</p>\n\n<p>To
299
+ help explore treatments I''ve constructed a simple web site that takes the
300
+ Plazi identifier for a treatment and displays that treatment with the material
301
+ citations highlighted. For example, for the Plazi treatment <a href=\"https://tb.plazi.org/GgServer/html/03B5A943FFBB6F02FE27EC94FABEEAE7\">03B5A943FFBB6F02FE27EC94FABEEAE7</a>
302
+ you can view the marked up version at <a href=\"https://plazi-tester.herokuapp.com/?uri=622F7788-F0A4-449D-814A-5B49CD20B228\">https://plazi-tester.herokuapp.com/?uri=622F7788-F0A4-449D-814A-5B49CD20B228</a>.
303
+ Below is an example of a material citation with its component parts tagged:</p>\n\n<div
304
+ class=\"separator\" style=\"clear: both;\"><a href=\"https://1.bp.blogspot.com/-NIGuo9BggdA/YXaRQQrv0QI/AAAAAAAAgz4/SZDcA1jZSN47JMRTWDwSMRpHUShrCeOdgCPcBGAYYCw/s693/Screenshot%2B2021-10-24%2Bat%2B20.59.56.png\"
305
+ style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
306
+ border=\"0\" width=\"400\" data-original-height=\"94\" data-original-width=\"693\"
307
+ src=\"https://1.bp.blogspot.com/-NIGuo9BggdA/YXaRQQrv0QI/AAAAAAAAgz4/SZDcA1jZSN47JMRTWDwSMRpHUShrCeOdgCPcBGAYYCw/s400/Screenshot%2B2021-10-24%2Bat%2B20.59.56.png\"/></a></div>\n\n<p>This
308
+ is an example where Plazi has successfully parsed the specimen. But I keep
309
+ coming across cases where specimens have not been parsed correctly, resulting
310
+ in issues such as single specimens being split into multiple records (e.g., <a
311
+ href=\"https://plazi-tester.herokuapp.com/?uri=5244B05EFFC8E20F7BC32056C178F496\">https://plazi-tester.herokuapp.com/?uri=5244B05EFFC8E20F7BC32056C178F496</a>),
312
+ geographical coordinates being misinterpreted (e.g., <a href=\"https://plazi-tester.herokuapp.com/?uri=0D228E6AFFC2FFEFFF4DE8118C4EE6B9\">https://plazi-tester.herokuapp.com/?uri=0D228E6AFFC2FFEFFF4DE8118C4EE6B9</a>),
313
+ or collector''s initials being confused with codes for natural history collections
314
+ (e.g., <a href=\"https://plazi-tester.herokuapp.com/?uri=252C87918B362C05FF20F8C5BFCB3D4E\">https://plazi-tester.herokuapp.com/?uri=252C87918B362C05FF20F8C5BFCB3D4E</a>).</p>\n\n<p>Parsing
315
+ specimens is a hard problem so it''s not unexpected to find errors. But they
316
+ do seem common enough to be easily found, which raises the question of just
317
+ what percentage of these material citations are correct? How much of the
318
+ data Plazi feeds to GBIF is correct? How would we know?</p>\n\n<h2>Systemic
319
+ problems</h2>\n\n<p>Some of the errors I''ve found concern the interpretation
320
+ of the parsed data. For example, it is striking that despite including marine
321
+ taxa <b>no</b> Plazi record has a value for depth below sea level (see <a
322
+ href=\"https://www.gbif.org/occurrence/map?depth=0,9999&publishing_org=7ce8aef0-9e92-11dc-8738-b8a03c50a862&advanced=1\">GBIF
323
+ search on depth range 0-9999 for Plazi</a>). But <a href=\"https://www.gbif.org/occurrence/map?elevation=0,9999&publishing_org=7ce8aef0-9e92-11dc-8738-b8a03c50a862&advanced=1\">many
324
+ records do have an elevation</a>, including records from marine environments.
325
+ Any record that has a depth value is interpreted by Plazi as being elevation,
326
+ so we have aerial crustacea and fish.</p>\n\n<h3>Map of Plazi records with
327
+ depth 0-9999m</h3>\n<div class=\"separator\" style=\"clear: both;\"><a href=\"https://1.bp.blogspot.com/-GD4pPtPCxVc/YXaRQ9bdn1I/AAAAAAAAgz8/A9YsypSvHfwWKAjDxSdeFVUkou88LGItACPcBGAYYCw/s673/Screenshot%2B2021-10-25%2Bat%2B12.03.51.png\"
328
+ style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
329
+ border=\"0\" width=\"400\" data-original-height=\"258\" data-original-width=\"673\"
330
+ src=\"https://1.bp.blogspot.com/-GD4pPtPCxVc/YXaRQ9bdn1I/AAAAAAAAgz8/A9YsypSvHfwWKAjDxSdeFVUkou88LGItACPcBGAYYCw/s320/Screenshot%2B2021-10-25%2Bat%2B12.03.51.png\"/></a></div>\n\n<h3>Map
331
+ of Plazi records with elevation 0-9999m </h3>\n<div class=\"separator\" style=\"clear:
332
+ both;\"><a href=\"https://1.bp.blogspot.com/-BReSHtXTOkA/YXaRRKFW7dI/AAAAAAAAg0A/-FcBkMwyswIp0siWGVX3MNMANs7UkZFtwCPcBGAYYCw/s675/Screenshot%2B2021-10-25%2Bat%2B12.04.06.png\"
333
+ style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
334
+ border=\"0\" width=\"400\" data-original-height=\"256\" data-original-width=\"675\"
335
+ src=\"https://1.bp.blogspot.com/-BReSHtXTOkA/YXaRRKFW7dI/AAAAAAAAg0A/-FcBkMwyswIp0siWGVX3MNMANs7UkZFtwCPcBGAYYCw/s320/Screenshot%2B2021-10-25%2Bat%2B12.04.06.png\"/></a></div>\n\n<p>Anecdotally
336
+ I''ve also noticed that Plazi seems to do well on zoological data, especially
337
+ journals like <i>Zootaxa</i>, but it often struggles with botanical specimens.
338
+ Botanists tend to cite specimens rather differently to zoologists (botanists
339
+ emphasise collector numbers rather than specimen codes). Hence data quality
340
+ in Plazi is likely to taxonomic biased.</p>\n\n<p>Plazi is <a href=\"https://github.com/plazi/community/issues\">using
341
+ GitHub to track issues with treatments</a> so feedback on erroneous records
342
+ is possible, but this seems inadequate to the task. There are tens of thousands
343
+ of data sets, with more being released daily, and hundreds of thousands of
344
+ occurrences, and relying on GitHub issues devolves the responsibility for
345
+ error checking onto the data users. I don''t have a measure of how many records
346
+ in Plazi have problems, but because I suspect it is a significant fraction
347
+ because for any given day''s output I can typically find errors.</p>\n\n<h2>What
348
+ to do?</h2>\n\n<p>Faced with a process that generates noisy data there are
349
+ several of things we could do:</p>\n\n<ol>\n<li>Have tools to detect and flag
350
+ errors made in generating the data.</li>\n<li>Have the data generator give
351
+ estimates the confidence of its results.</li>\n<li>Improve the data generator.</li>\n</ol>\n\n<p>I
352
+ think a comparison with the problem of parsing bibliographic references might
353
+ be instructive here. There is a long history of people developing tools to
354
+ parse references (<a href=\"https://iphylo.blogspot.com/2021/05/finding-citations-of-specimens.html\">I''ve
355
+ even had a go</a>). State-of-the art tools such as <a href=\"https://anystyle.io\">AnyStyle</a>
356
+ feature machine learning, and are tested against <a href=\"https://github.com/inukshuk/anystyle/blob/master/res/parser/core.xml\">human
357
+ curated datasets</a> of tagged bibliographic records. This means we can evaluate
358
+ the performance of a method (how well does it retrieve the same results as
359
+ human experts?) and also improve the method by expanding the corpus of training
360
+ data. Some of these tools can provide a measures of how confident they are
361
+ when classifying a string as, say, a person''s name, which means we could
362
+ flag potential issues for anyone wanting to use that record.</p>\n\n<p>We
363
+ don''t have equivalent tools for parsing specimens in the literature, and
364
+ hence have no easy way to quantify how good existing methods are, nor do we
365
+ have a public corpus of material citations that we can use as training data.
366
+ I <a href=\"https://iphylo.blogspot.com/2021/05/finding-citations-of-specimens.html\">blogged
367
+ about this</a> a few months ago and was considering using Plazi as a source
368
+ of marked up specimen data to use for training. However based on what I''ve
369
+ looked at so far Plazi''s data would need to be carefully scrutinised before
370
+ it could be used as training data.</p>\n\n<p>Going forward, I think it would
371
+ be desirable to have a set of records that can be used to benchmark specimen
372
+ parsers, and ideally have the parsers themselves available as web services
373
+ so that anyone can evaluate them. Even better would be a way to contribute
374
+ to the training data so that these tools improve over time.</p>\n\n<p>Plazi''s
375
+ data extraction tools are mostly desktop-based, that is, you need to download
376
+ software to use their methods. However, there are experimental web services
377
+ available as well. I''ve created a simple wrapper around the material citation
378
+ parser, you can try it at <a href=\"https://plazi-tester.herokuapp.com/parser.php\">https://plazi-tester.herokuapp.com/parser.php</a>.
379
+ It takes a single material citation and returns a version with elements such
380
+ as specimen code and collector name tagged in different colours.</p>\n\n<h2>Summary</h2>\n\n<p>Text
381
+ mining the taxonomic literature is clearly a gold mine of data, but at the
382
+ same time it is potentially fraught as we try and extract structured data
383
+ from semi-structured text. Plazi has demonstrated that it is possible to extract
384
+ a lot of data from the literature, but at the same time the quality of that
385
+ data seems highly variable. Even minor issues in parsing text can have big
386
+ implications for data quality (e.g., marine organisms apparently living above
387
+ sea level). Historically in biodiversity informatics we have favoured data
388
+ quantity over data quality. Quantity has an obvious metric, and has milestones
389
+ we can celebrate (e.g., <a href=\"GBIF at 1 billion - what''s next?\">one
390
+ billion specimens</a>). There aren''t really any equivalent metrics for data
391
+ quality.</p>\n\n<p>Adding new types of data can sometimes initially result
392
+ in a new set of quality issues (e.g., <a href=\"https://iphylo.blogspot.com/2019/12/gbif-metagenomics-and-metacrap.html\">GBIF
393
+ metagenomics and metacrap</a>) that take time to resolve. In the case of Plazi,
394
+ I think it would be worthwhile to quantify just how many records have errors,
395
+ and develop benchmarks that we can use to test methods for extracting specimen
396
+ data from text. If we don''t do this then there will remain uncertainty as
397
+ to how much trust we can place in data mined from the taxonomic literature.</p>\n\n<h2>Update</h2>\n\nPlazi
398
+ has responded, see <a href=\"http://plazi.org/posts/2021/10/liberation-first-step-toward-quality/\">Liberating
399
+ material citations as a first step to more better data</a>. My reading of
400
+ their repsonse is that it essentially just reiterates Plazi''s approach and
401
+ doesn''t tackle the underlying issue: their method for extracting material
402
+ citations is error prone, and many of those errors end up in GBIF.","tags":["data
403
+ quality","parsing","Plazi","specimen","text mining"],"language":"en","references":[]},{"id":"https://doi.org/10.59350/j77nc-e8x98","uuid":"c6b101f4-bfbc-4d01-921d-805c43c85757","url":"https://iphylo.blogspot.com/2022/08/linking-taxonomic-names-to-literature.html","title":"Linking
404
+ taxonomic names to the literature","summary":"Just some thoughts as I work
405
+ through some datasets linking taxonomic names to the literature. In the diagram
406
+ above I''ve tried to capture the different situatios I encounter. Much of
407
+ the work I''ve done on this has focussed on case 1 in the diagram: I want
408
+ to link a taxonomic name to an identifier for the work in which that name
409
+ was published. In practise this means linking names to DOIs. This has the
410
+ advantage of linking to a citable indentifier, raising questions such as whether
411
+ citations...","date_published":"2022-08-22T17:19:00Z","date_modified":"2022-08-22T17:19:08Z","date_indexed":"1909-06-16T08:21:41+00:00","authors":[{"url":null,"name":"Roderic
412
+ Page"}],"image":null,"content_html":"Just some thoughts as I work through
413
+ some datasets linking taxonomic names to the literature.\n\n<div class=\"separator\"
414
+ style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5ZkbnumEWKZpf0isei_hUJucFlOOK08-SKnJknD8B4qhAX36u-vMQRnZdhRJK5rPb7HNYcnqB7qE4agbeStqyzMkWHrzUj14gPkz2ohmbOVg8P_nHo0hM94PD1wH3SPJsaLAumN8vih3ch9pjH2RaVWZBLwwhGhNu0FS1m5z6j5xt2NeZ4w/s2140/linking%20to%20names144.jpg\"
415
+ style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
416
+ border=\"0\" height=\"600\" data-original-height=\"2140\" data-original-width=\"1604\"
417
+ src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5ZkbnumEWKZpf0isei_hUJucFlOOK08-SKnJknD8B4qhAX36u-vMQRnZdhRJK5rPb7HNYcnqB7qE4agbeStqyzMkWHrzUj14gPkz2ohmbOVg8P_nHo0hM94PD1wH3SPJsaLAumN8vih3ch9pjH2RaVWZBLwwhGhNu0FS1m5z6j5xt2NeZ4w/s600/linking%20to%20names144.jpg\"/></a></div>\n\n<p>In
418
+ the diagram above I''ve tried to capture the different situatios I encounter.
419
+ Much of the work I''ve done on this has focussed on case 1 in the diagram:
420
+ I want to link a taxonomic name to an identifier for the work in which that
421
+ name was published. In practise this means linking names to DOIs. This has
422
+ the advantage of linking to a citable indentifier, raising questions such
423
+ as whether citations of taxonmic papers by taxonomic databases could become
424
+ part of a <a href=\"https://iphylo.blogspot.com/2022/08/papers-citing-data-that-cite-papers.html\">taxonomist''s
425
+ Google Scholar profile</a>.</p>\n\n<p>In many taxonomic databases full work-level
426
+ citations are not the norm, instead taxonomists cite one or more pages within
427
+ a work that are relevant to a taxonomic name. These \"microcitations\" (what
428
+ the U.S. legal profession refer to as \"point citations\" or \"pincites\", see
429
+ <a href=\"https://rasmussen.libanswers.com/faq/283203\">What are pincites,
430
+ pinpoints, or jump legal references?</a>) require some work to map to the
431
+ work itself (which is typically the thing that has a citatble identifier such
432
+ as a DOI).</p>\n\n<p>Microcitations (case 2 in the diagram above) can be quite
433
+ complex. Some might simply mention a single page, but others might list a
434
+ series of (not necessarily contiguous) pages, as well as figures, plates etc.
435
+ Converting these to citable identifiers can be tricky, especially as in most
436
+ cases we don''t have page-level identifiers. The Biodiversity Heritage Library
437
+ (BHL) does have URLs for each scanned page, and we have a standard for referring
438
+ to pages in a PDF (<code>page=&lt;pageNum&gt;</code>, see <a href=\"https://datatracker.ietf.org/doc/html/rfc8118\">RFC
439
+ 8118</a>). But how do we refer to a set of pages? Do we pick the first page?
440
+ Do we try and represent a set of pages, and if so, how?</p>\n\n<p>Another
441
+ issue with page-level identifiers is that not everything on a given page may
442
+ be relevant to the taxonomic name. In case 2 above I''ve shaded in the parts
443
+ of the pages and figure that refer to the taxonomic name. An example where
444
+ this can be problematic is the recent test case I created for BHL where a
445
+ page image was included for the taxonomic name <a href=\"https://www.gbif.org/species/195763322\"><i>Aphrophora
446
+ impressa</i></a>. The image includes the species description and a illustration,
447
+ as well as text that relates to other species.</p>\n\n<div class=\"separator\"
448
+ style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1SAv1XiVBPHXeMLPfdsnTh4Sj4m0AQyqjXM0faXyvbNhBXFDl7SawjFIkMo3qz4RQhDEyZXkKAh5nF4gtfHbVXA6cK3NJ46CuIWECC6HmJKtTjoZ0M3QQXvLY_X-2-RecjBqEy68M0cEr0ph3l6KY51kA9BvGt9d4id314P71PBitpWMATg/s3467/https---www.biodiversitylibrary.org-pageimage-29138463.jpeg\"
449
+ style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
450
+ border=\"0\" height=\"400\" data-original-height=\"3467\" data-original-width=\"2106\"
451
+ src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1SAv1XiVBPHXeMLPfdsnTh4Sj4m0AQyqjXM0faXyvbNhBXFDl7SawjFIkMo3qz4RQhDEyZXkKAh5nF4gtfHbVXA6cK3NJ46CuIWECC6HmJKtTjoZ0M3QQXvLY_X-2-RecjBqEy68M0cEr0ph3l6KY51kA9BvGt9d4id314P71PBitpWMATg/s400/https---www.biodiversitylibrary.org-pageimage-29138463.jpeg\"/></a></div>\n\n<p>Given
452
+ that not everything on a page need be relevant, we could extract just the
453
+ relevant blocks of text and illustrations (e.g., paragraphs of text, panels
454
+ within a figure, etc.) and treat that set of elements as the thing to cite.
455
+ This is, of course, what <a href=\"http://plazi.org\">Plazi</a> are doing.
456
+ The set of extracted blocks is glued together as a \"treatment\", assigned
457
+ an identifier (often a DOI), and treated as a citable unit. It would be interesting
458
+ to see to what extent these treatments are actually cited, for example, do
459
+ subsequent revisions that cite work that include treatments cite those treatments,
460
+ or just the work itself? Put another way, are we creating <a href=\"https://iphylo.blogspot.com/2012/09/decoding-nature-encode-ipad-app-omg-it.html\">\"threads\"</a>
461
+ between taxonomic revisions?</p>\n\n<p>One reason for these notes is that
462
+ I''m exploring uploading taxonomic name - literature links to <a href=\"https://www.checklistbank.org\">ChecklistBank</a>
463
+ and case 1 above is easy, as is case 3 (if we have treatment-level identifiers).
464
+ But case 2 is problematic because we are linking to a set of things that may
465
+ not have an identifier, which means a decision has to be made about which
466
+ page to link to, and how to refer to that page.</p>","tags":[],"language":"en","references":[]},{"id":"https://doi.org/10.59350/ymc6x-rx659","uuid":"0807f515-f31d-4e2c-9e6f-78c3a9668b9d","url":"https://iphylo.blogspot.com/2022/09/dna-barcoding-as-intergenerational.html","title":"DNA
467
+ barcoding as intergenerational transfer of taxonomic knowledge","summary":"I
468
+ tweeted about this but want to bookmark it for later as well. The paper “A
469
+ molecular-based identification resource for the arthropods of Finland” doi:10.1111/1755-0998.13510
470
+ contains the following: …the annotated barcode records assembled by FinBOL
471
+ participants represent a tremendous intergenerational transfer of taxonomic
472
+ knowledge … the time contributed by current taxonomists in identifying and
473
+ contributing voucher specimens represents a great gift to future generations
474
+ who will benefit...","date_published":"2022-09-14T10:12:00Z","date_modified":"2022-09-29T13:57:30Z","date_indexed":"1909-06-16T11:02:21+00:00","authors":[{"url":null,"name":"Roderic
475
+ Page"}],"image":null,"content_html":"<p>I <a href=\"https://twitter.com/rdmpage/status/1569738844416638981?s=21&amp;t=9OVXuoUEwZtQt-Ldzlutfw\">tweeted
476
+ about this</a> but want to bookmark it for later as well. The paper “A molecular-based
477
+ identification resource for the arthropods of Finland” <a href=\"https://doi.org/10.1111/1755-0998.13510\">doi:10.1111/1755-0998.13510</a>
478
+ contains the following:</p>\n<blockquote>\n<p>…the annotated barcode records
479
+ assembled by FinBOL participants represent a tremendous <mark>intergenerational
480
+ transfer of taxonomic knowledge</mark> … the time contributed by current taxonomists
481
+ in identifying and contributing voucher specimens represents a great gift
482
+ to future generations who will benefit from their expertise when they are
483
+ no longer able to process new material.</p>\n</blockquote>\n<p>I think this
484
+ is a very clever way to characterise the project. In an age of machine learning
485
+ this may be commonest way to share knowledge , namely as expert-labelled training
486
+ data used to build tools for others. Of course, this means the expertise itself
487
+ may be lost, which has implications for updating the models if the data isn’t
488
+ complete. But it speaks to Charles Godfrey’s theme of <a href=\"https://biostor.org/reference/250587\">“Taxonomy
489
+ as information science”</a>.</p>\n<p>Note that the knowledge is also transformed
490
+ in the sense that the underlying expertise of interpreting morphology, ecology,
491
+ behaviour, genomics, and the past literature is not what is being passed on.
492
+ Instead it is probabilities that a DNA sequence belongs to a particular taxon.</p>\n<p>This
493
+ feels is different to, say iNaturalist, where there is a machine learning
494
+ model to identify images. In that case, the model is built on something the
495
+ community itself has created, and continues to create. Yes, the underlying
496
+ idea is that same: “experts” have labelled the data, a model is trained, the
497
+ model is used. But the benefits of the <a href=\"https://www.inaturalist.org\">iNaturalist</a>
498
+ model are immediately applicable to the people whose data built the model.
499
+ In the case of barcoding, because the technology itself is still not in the
500
+ hands of many (relative to, say, digital imaging), the benefits are perhaps
501
+ less tangible. Obviously researchers working with environmental DNA will find
502
+ it very useful, but broader impact may await the arrival of citizen science
503
+ DNA barcoding.</p>\n<p>The other consideration is whether the barcoding helps
504
+ taxonomists. Is it to be used to help prioritise future work (“we are getting
505
+ lots of unknown sequences in these taxa, lets do some taxonomy there”), or
506
+ is it simply capturing the knowledge of a generation that won’t be replaced:</p>\n<blockquote>\n<p>The
507
+ need to capture such knowledge is essential because there are, for example,
508
+ no young Finnish taxonomists who can critically identify species in many key
509
+ groups of ar- thropods (e.g., aphids, chewing lice, chalcid wasps, gall midges,
510
+ most mite lineages).</p>\n</blockquote>\n<p>The cycle of collect data, test
511
+ and refine model, collect more data, rinse and repeat that happens with iNaturalist
512
+ creates a feedback loop. It’s not clear that a similar cycle exists for DNA
513
+ barcoding.</p>\n<blockquote>\n<p>Written with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":[],"language":"en","references":[]},{"id":"https://doi.org/10.59350/d3dc0-7an69","uuid":"545c177f-cea5-4b79-b554-3ccae9c789d7","url":"https://iphylo.blogspot.com/2021/10/reflections-on-macroscope-tool-for-21st.html","title":"Reflections
514
+ on \"The Macroscope\" - a tool for the 21st Century?","summary":"This is a
515
+ guest post by Tony Rees. It would be difficult to encounter a scientist, or
516
+ anyone interested in science, who is not familiar with the microscope, a tool
517
+ for making objects visible that are otherwise too small to be properly seen
518
+ by the unaided eye, or to reveal otherwise invisible fine detail in larger
519
+ objects. A select few with a particular interest in microscopy may also have
520
+ encountered the Wild-Leica \"Macroscope\", a specialised type of benchtop
521
+ microscope optimised for...","date_published":"2021-10-07T12:38:00Z","date_modified":"2021-10-08T10:26:22Z","date_indexed":"1909-06-16T10:02:25+00:00","authors":[{"url":null,"name":"Roderic
522
+ Page"}],"image":null,"content_html":"<p><img src=\"https://lh3.googleusercontent.com/-A99btr6ERMs/Vl1Wvjp2OtI/AAAAAAAAEFI/7bKdRjNG5w0/ytNkVT2U.jpg?imgmax=800\"
523
+ alt=\"YtNkVT2U\" title=\"ytNkVT2U.jpg\" border=\"0\" width=\"128\" height=\"128\"
524
+ style=\"float:right;\" /> This is a guest post by <a href=\"https://about.me/TonyRees\">Tony
525
+ Rees</a>.</p>\n\n<p>It would be difficult to encounter a scientist, or anyone
526
+ interested in science, who is not familiar with the microscope, a tool for
527
+ making objects visible that are otherwise too small to be properly seen by
528
+ the unaided eye, or to reveal otherwise invisible fine detail in larger objects.
529
+ A select few with a particular interest in microscopy may also have encountered
530
+ the Wild-Leica \"Macroscope\", a specialised type of benchtop microscope optimised
531
+ for low-power macro-photography. However in this overview I discuss the \"Macroscope\"
532
+ in a different sense, which is that of the antithesis to the microscope: namely
533
+ a method for visualizing subjects too large to be encompassed by a single
534
+ field of vision, such as the Earth or some subset of its phenomena (the biosphere,
535
+ for example), or conceptually, the universe.</p>\n\n<p><div class=\"separator\"
536
+ style=\"clear: both;\"><a href=\"https://1.bp.blogspot.com/-xXxH9aJn1tY/YV7vnKGQWPI/AAAAAAAAgyQ/7CSJJ663Ry4IXtav54nLcLiI0kda5_L7ACLcBGAsYHQ/s500/2020045672.jpg\"
537
+ style=\"display: block; padding: 1em 0; text-align: center; clear: right;
538
+ float: right;\"><img alt=\"\" border=\"0\" height=\"320\" data-original-height=\"500\"
539
+ data-original-width=\"303\" src=\"https://1.bp.blogspot.com/-xXxH9aJn1tY/YV7vnKGQWPI/AAAAAAAAgyQ/7CSJJ663Ry4IXtav54nLcLiI0kda5_L7ACLcBGAsYHQ/s320/2020045672.jpg\"/></a></div>My
540
+ introduction to the term was via addresses given by Jesse Ausubel in the formative
541
+ years of the 2001-2010 <a href=\"http://www.coml.org\">Census of Marine Life</a>,
542
+ for which he was a key proponent. In Ausubel''s view, the Census would perform
543
+ the function of a macroscope, permitting a view of everything that lives in
544
+ the global ocean (or at least, that subset which could realistically be sampled
545
+ in the time frame available) as opposed to more limited subsets available
546
+ via previous data collection efforts. My view (which could, of course, be
547
+ wrong) was that his thinking had been informed by a work entitled \"Le macroscope,
548
+ vers une vision globale\" published in 1975 by the French thinker Joël de
549
+ Rosnay, who had expressed such a concept as being globally applicable in many
550
+ fields, including the physical and natural worlds but also extending to human
551
+ society, the growth of cities, and more. Yet again, some ecologists may also
552
+ have encountered the term, sometimes in the guise of \"Odum''s macroscope\",
553
+ as an approach for obtaining \"big picture\" analyses of macroecological processes
554
+ suitable for mathematical modelling, typically by elimination of fine detail
555
+ so that only the larger patterns remain, as initially advocated by Howard
556
+ T. Odum in his 1971 book \"Environment, Power, and Society\".</p>\n\n<p>From
557
+ the standpoint of the 21st century, it seems that we are closer to achieving
558
+ a \"macroscope\" (or possibly, multiple such tools) than ever before, based
559
+ on the availability of existing and continuing new data streams, improved
560
+ technology for data assembly and storage, and advanced ways to query and combine
561
+ these large streams of data to produce new visualizations, data products,
562
+ and analytical findings. I devote the remainder of this article to examples
563
+ where either particular workers have employed \"macroscope\" terminology to
564
+ describe their activities, or where potentially equivalent actions are taking
565
+ place without the explicit \"macroscope\" association, but are equally worthy
566
+ of consideration. To save space here, references cited here (most or all)
567
+ can be found via a Wikipedia article entitled \"<a href=\"https://en.wikipedia.org/wiki/Macroscope_(science_concept)\">Macroscope
568
+ (science concept)</a>\" that I authored on the subject around a year ago,
569
+ and have continued to add to on occasion as new thoughts or information come
570
+ to hand (see <a href=\"https://en.wikipedia.org/w/index.php?title=Macroscope_(science_concept)&offset=&limit=500&action=history\">edit
571
+ history for the article</a>).</p>\n\n<p>First, one can ask, what constitutes
572
+ a macroscope, in the present context? In the Wikipedia article I point to
573
+ a book \"Big Data - Related Technologies, Challenges and Future Prospects\"
574
+ by Chen <em>et al.</em> (2014) (<a href=\"https://doi.org/10.1007/978-3-319-06245-7\">doi:10.1007/978-3-319-06245-7</a>),
575
+ in which the \"value chain of big data\" is characterised as divisible into
576
+ four phases, namely data generation, data acquisition (aka data assembly),
577
+ data storage, and data analysis. To my mind, data generation (which others
578
+ may term acquisition, differently from the usage by Chen <em>et al.</em>)
579
+ is obviously the first step, but does not in itself constitute the macroscope,
580
+ except in rare cases - such as Landsat imagery, perhaps - where on its own,
581
+ a single co-ordinated data stream is sufficient to meet the need for a particular
582
+ type of \"global view\". A variant of this might be a coordinated data collection
583
+ program - such as that of the ten year Census of Marine Life - which might
584
+ produce the data required for the desired global view; but again, in reality,
585
+ such data are collected in a series of discrete chunks, in many and often
586
+ disparate data formats, and must be \"wrangled\" into a more coherent whole
587
+ before any meaningful \"macroscope\" functionality becomes available.</p>\n\n<p>Here
588
+ we come to what, in my view, constitutes the heart of the \"macroscope\":
589
+ an intelligently organized (i.e. indexable and searchable), coherent data
590
+ store or repository (where \"data\" may include imagery and other non numeric
591
+ data forms, but much else besides). Taking the Census of Marine Life example,
592
+ the data repository for that project''s data (plus other available sources
593
+ as inputs) is the <a href=\"https://obis.org\">Ocean Biodiversity Information
594
+ System</a> or OBIS (previously the Ocean Biogeographic Information System),
595
+ which according to this view forms the \"macroscope\" for which the Census
596
+ data is a feed. (For non habitat-specific biodiversity data, <a href=\"https://www.gbif.org\">GBIF</a>
597
+ is an equivalent, and more extensive, operation). Other planetary scale \"macroscopes\",
598
+ by this definition (which may or may not have an explicit geographic, i.e.
599
+ spatial, component) would include inventories of biological taxa such as the
600
+ <a href=\"https://www.catalogueoflife.org\">Catalogue of Life</a> and so on,
601
+ all the way back to the pioneering compendia published by Linnaeus in the
602
+ eighteenth century; while for cartography and topographic imagery, the current
603
+ \"blockbuster\" of <a href=\"http://earth.google.com\">Google Earth</a> and
604
+ its predecessors also come well into public consciousness.</p>\n\n<p>In the
605
+ view of some workers and/or operations, both of these phases are precursors
606
+ to the real \"work\" of the macroscope which is to reveal previously unseen
607
+ portions of the \"big picture\" by means either of the availability of large,
608
+ synoptic datasets, or fusion between different data streams to produce novel
609
+ insights. Companies such as IBM and Microsoft have used phraseology such as:</p>\n\n<blockquote>By
610
+ 2022 we will use machine-learning algorithms and software to help us organize
611
+ information about the physical world, helping bring the vast and complex data
612
+ gathered by billions of devices within the range of our vision and understanding.
613
+ We call this a \"macroscope\" – but unlike the microscope to see the very
614
+ small, or the telescope that can see far away, it is a system of software
615
+ and algorithms to bring all of Earth''s complex data together to analyze it
616
+ by space and time for meaning.\" (IBM)</blockquote>\n\n<blockquote>As the
617
+ Earth becomes increasingly instrumented with low-cost, high-bandwidth sensors,
618
+ we will gain a better understanding of our environment via a virtual, distributed
619
+ whole-Earth \"macroscope\"... Massive-scale data analytics will enable real-time
620
+ tracking of disease and targeted responses to potential pandemics. Our virtual
621
+ \"macroscope\" can now be used on ourselves, as well as on our planet.\" (Microsoft)
622
+ (references available via the Wikipedia article cited above).</blockquote>\n\n<p>Whether
623
+ or not the analytical capabilities described here are viewed as being an integral
624
+ part of the \"macroscope\" concept, or are maybe an add-on, is ultimately
625
+ a question of semantics and perhaps, personal opinion. Continuing the Census
626
+ of Marine Life/OBIS example, OBIS offers some (arguably rather basic) visualization
627
+ and summary tools, but also makes its data available for download to users
628
+ wishing to analyse it further according to their own particular interests;
629
+ using OBIS data in this manner, Mark Costello et al. in 2017 were able to
630
+ demarcate a finite number of data-supported marine biogeographic realms for
631
+ the first time (Costello et al. 2017: Nature Communications. 8: 1057. <a href=\"https://doi.org/10.1038/s41467-017-01121-2\">doi:10.1038/s41467-017-01121-2</a>),
632
+ a project which I was able to assist in a small way in an advisory capacity.
633
+ In a case such as this, perhaps the final function of the macroscope, namely
634
+ data visualization and analysis, was outsourced to the authors'' own research
635
+ institution. Similarly at an earlier phase, \"data aggregation\" can also
636
+ be virtual rather than actual, i.e. avoiding using a single physical system
637
+ to hold all the data, enabled by open web mapping standards WMS (web map service)
638
+ and WFS (web feature service) to access a set of distributed data stores,
639
+ e.g. as implemented on the portal for the <a href=\"https://portal.aodn.org.au/\">Australian
640
+ Ocean Data Network</a>.</p>\n\n<p>So, as we pass through the third decade
641
+ of the twenty first century, what developments await us in the \"macroscope\"
642
+ area\"? In the biodiversity space, one can reasonably presume that the existing
643
+ \"macroscopic\" data assembly projects such as OBIS and GBIF will continue,
644
+ and hopefully slowly fill current gaps in their coverage - although in the
645
+ marine area, strategic new data collection exercises may be required (Census
646
+ 2020, or 2025, anyone?), while (again hopefully), the Catalogue of Life will
647
+ continue its progress towards a \"complete\" species inventory for the biosphere.
648
+ The Landsat project, with imagery dating back to 1972, continues with the
649
+ launch of its latest satellite Landsat 9 just this year (21 September 2021)
650
+ with a planned mission duration for the next 5 years, so the \"macroscope\"
651
+ functionality of that project seems set to continue for the medium term at
652
+ least. Meanwhile the ongoing development of sensor networks, both on land
653
+ and in the ocean, offers an exciting new method of \"instrumenting the earth\"
654
+ to obtain much more real time data than has ever been available in the past,
655
+ offering scope for many more, use case-specific \"macroscopes\" to be constructed
656
+ that can fuse (e.g.) satellite imagery with much more that is happening at
657
+ a local level.</p>\n\n<p>So, the \"macroscope\" concept appears to be alive
658
+ and well, even though the nomenclature can change from time to time (IBM''s
659
+ \"Macroscope\", foreshadowed in 2017, became the \"IBM Pairs Geoscope\" on
660
+ implementation, and is now simply the \"Geospatial Analytics component within
661
+ the IBM Environmental Intelligence Suite\" according to available IBM publicity
662
+ materials). In reality this illustrates a new dichotomy: even if \"everyone\"
663
+ in principle has access to huge quantities of publicly available data, maybe
664
+ only a few well funded entities now have the computational ability to make
665
+ sense of it, and can charge clients a good fee for their services...</p>\n\n<p>I
666
+ present this account partly to give a brief picture of \"macroscope\" concepts
667
+ today and in the past, for those who may be interested, and partly to present
668
+ a few personal views which would be out of scope in a \"neutral point of view\"
669
+ article such as is required on Wikipedia; also to see if readers of this blog
670
+ would like to contribute further to discussion of any of the concepts traversed
671
+ herein.</p>","tags":["guest post","macroscope"],"language":"en","references":[]},{"id":"https://doi.org/10.59350/gf1dw-n1v47","uuid":"a41163e0-9c9a-41e0-a141-f772663f2f32","url":"https://iphylo.blogspot.com/2023/03/dugald-stuart-page-1936-2022.html","title":"Dugald
672
+ Stuart Page 1936-2022","summary":"My dad died last weekend. Below is a notice
673
+ in today''s New Zealand Herald. I''m in New Zealand for his funeral. Don''t
674
+ really have the words for this right now.","date_published":"2023-03-14T03:00:00Z","date_modified":"2023-03-22T07:25:56Z","date_indexed":"1909-06-16T10:41:55+00:00","authors":[{"url":null,"name":"Roderic
675
+ Page"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
676
+ both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZweukxntl7R5jnk3knVFVrqZ5RxC7mPZBV4gKeDIglbFzs2O442nbxqs8t8jV2tLqCU24K6gS32jW-Pe8q3O_5JR1Ms3qW1aQAZ877cKkFfcUydqUba9HsgNlX-zS9Ne92eLxRGS8F-lStTecJw2oalp3u58Yoc0oM7CUin5LKPeFIJ7Rzg/s3454/_DSC5106.jpg\"
677
+ style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
678
+ border=\"0\" width=\"400\" data-original-height=\"2582\" data-original-width=\"3454\"
679
+ src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZweukxntl7R5jnk3knVFVrqZ5RxC7mPZBV4gKeDIglbFzs2O442nbxqs8t8jV2tLqCU24K6gS32jW-Pe8q3O_5JR1Ms3qW1aQAZ877cKkFfcUydqUba9HsgNlX-zS9Ne92eLxRGS8F-lStTecJw2oalp3u58Yoc0oM7CUin5LKPeFIJ7Rzg/s400/_DSC5106.jpg\"/></a></div>\n\nMy
680
+ dad died last weekend. Below is a notice in today''s New Zealand Herald. I''m
681
+ in New Zealand for his funeral. Don''t really have the words for this right
682
+ now.\n\n<div class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRUTOFF1VWHCl8dg3FQuaWy5LM7aX8IivdRpTtzgrdQTEymsA5bLTZE3cSQf1WQIP3XrC46JsLScP8BxTK9C5a-B1i51yg8WGSJD0heJVaoDLnerv0lD1o3qloDjqEuuyfX4wagHB5YYBmjWnGeVQvyYVngvDDf9eM6pmMtZ7x94Y4jSVrug/s3640/IMG_2870.jpeg\"
683
+ style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
684
+ border=\"0\" height=\"320\" data-original-height=\"3640\" data-original-width=\"1391\"
685
+ src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRUTOFF1VWHCl8dg3FQuaWy5LM7aX8IivdRpTtzgrdQTEymsA5bLTZE3cSQf1WQIP3XrC46JsLScP8BxTK9C5a-B1i51yg8WGSJD0heJVaoDLnerv0lD1o3qloDjqEuuyfX4wagHB5YYBmjWnGeVQvyYVngvDDf9eM6pmMtZ7x94Y4jSVrug/s320/IMG_2870.jpeg\"/></a></div>","tags":[],"language":"en","references":[]},{"id":"https://doi.org/10.59350/cbzgz-p8428","uuid":"a93134aa-8b33-4dc7-8cd4-76cdf64732f4","url":"https://iphylo.blogspot.com/2023/04/library-interfaces-knowledge-graphs-and.html","title":"Library
686
+ interfaces, knowledge graphs, and Miller columns","summary":"Some quick notes
687
+ on interface ideas for digital libraries and/or knowledge graphs. Recently
688
+ there’s been something of an explosion in bibliographic tools to explore the
689
+ literature. Examples include: Elicit which uses AI to search for and summarise
690
+ papers _scite which uses AI to do sentiment analysis on citations (does paper
691
+ A cite paper B favourably or not?) ResearchRabbit which uses lists, networks,
692
+ and timelines to discover related research Scispace which navigates connections
693
+ between...","date_published":"2023-04-25T13:01:00Z","date_modified":"2023-04-27T14:51:08Z","date_indexed":"1909-06-16T11:25:14+00:00","authors":[{"url":null,"name":"Roderic
694
+ Page"}],"image":null,"content_html":"<p>Some quick notes on interface ideas
695
+ for digital libraries and/or knowledge graphs.</p>\n<p>Recently there’s been
696
+ something of an explosion in bibliographic tools to explore the literature.
697
+ Examples include:</p>\n<ul>\n<li><a href=\"https://elicit.org\">Elicit</a>
698
+ which uses AI to search for and summarise papers</li>\n<li><a href=\"https://scite.ai\">_scite</a>
699
+ which uses AI to do sentiment analysis on citations (does paper A cite paper
700
+ B favourably or not?)</li>\n<li><a href=\"https://www.researchrabbit.ai\">ResearchRabbit</a>
701
+ which uses lists, networks, and timelines to discover related research</li>\n<li><a
702
+ href=\"https://typeset.io\">Scispace</a> which navigates connections between
703
+ papers, authors, topics, etc., and provides AI summaries.</li>\n</ul>\n<p>As
704
+ an aside, I think these (and similar tools) are a great example of how bibliographic
705
+ data such as abstracts, the citation graph and - to a lesser extent - full
706
+ text - have become commodities. That is, what was once proprietary information
707
+ is now free to anyone, which in turns means a whole ecosystem of new tools
708
+ can emerge. If I was clever I’d be building a <a href=\"https://en.wikipedia.org/wiki/Wardley_map\">Wardley
709
+ map</a> to explore this. Note that a decade or so ago reference managers like
710
+ <a href=\"https://www.zotero.org\">Zotero</a> were made possible by publishers
711
+ exposing basic bibliographic data on their articles. As we move to <a href=\"https://i4oc.org\">open
712
+ citations</a> we are seeing the next generation of tools.</p>\n<p>Back to
713
+ my main topic. As usual, rather than focus on what these tools do I’m more
714
+ interested in how they <strong>look</strong>. I have history here, when the
715
+ iPad came out I was intrigued by the possibilities it offered for displaying
716
+ academic articles, as discussed <a href=\"https://iphylo.blogspot.com/2010/08/viewing-scientific-articles-on-ipad.html\">here</a>,
717
+ <a href=\"https://iphylo.blogspot.com/2010/09/viewing-scientific-articles-on-ipad.html\">here</a>,
718
+ <a href=\"https://iphylo.blogspot.com/2010/08/viewing-scientific-articles-on-ipad_24.html\">here</a>,
719
+ <a href=\"https://iphylo.blogspot.com/2010/08/viewing-scientific-articles-on-ipad_3052.html\">here</a>,
720
+ and <a href=\"https://iphylo.blogspot.com/2010/08/viewing-scientific-articles-on-ipad_31.html\">here</a>.
721
+ ResearchRabbit looks like this:</p>\n<div style=\"padding:86.91% 0 0 0;position:relative;\"><iframe
722
+ src=\"https://player.vimeo.com/video/820871442?h=23b05b0dae&amp;badge=0&amp;autopause=0&amp;player_id=0&amp;app_id=58479\"
723
+ frameborder=\"0\" allow=\"autoplay; fullscreen; picture-in-picture\" allowfullscreen
724
+ style=\"position:absolute;top:0;left:0;width:100%;height:100%;\" title=\"ResearchRabbit\"></iframe></div><script
725
+ src=\"https://player.vimeo.com/api/player.js\"></script>\n<p>Scispace’s <a
726
+ href=\"https://typeset.io/explore/journals/parassitologia-1ieodjwe\">“trace”
727
+ view</a> looks like this:</p>\n<div style=\"padding:84.55% 0 0 0;position:relative;\"><iframe
728
+ src=\"https://player.vimeo.com/video/820871348?h=2db7b661ef&amp;badge=0&amp;autopause=0&amp;player_id=0&amp;app_id=58479\"
729
+ frameborder=\"0\" allow=\"autoplay; fullscreen; picture-in-picture\" allowfullscreen
730
+ style=\"position:absolute;top:0;left:0;width:100%;height:100%;\" title=\"Scispace
731
+ screencast\"></iframe></div><script src=\"https://player.vimeo.com/api/player.js\"></script>\n<p>What
732
+ is interesting about both is that they display content from left to right
733
+ in vertical columns, rather than the more common horizontal rows. This sort
734
+ of display is sometimes called <a href=\"https://en.wikipedia.org/wiki/Miller_columns\">Miller
735
+ columns</a> or a <a href=\"https://web.archive.org/web/20210726134921/http://designinginterfaces.com/firstedition/index.php?page=Cascading_Lists\">cascading
736
+ list</a>.</p>\n\n<div class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBPnV9fRBcvm-BX5PjzfG5Cff9PerCLsTW8d5ZbsL6b41t7ypD7ovmcgfTf3b4b34mbq8NM4sfwOHkgEq32FLYnD497RFQD4HQmYmh5Eveu1zWdDVyKyDtPyE98QoxTaOEnLA5kK0fnl3dOOEgUvtVKlTZ8bt1gj2v_8tDRWl9f50ybyei3A/s1024/GNUstep-liveCD.png\"
737
+ style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
738
+ border=\"0\" width=\"400\" data-original-height=\"768\" data-original-width=\"1024\"
739
+ src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBPnV9fRBcvm-BX5PjzfG5Cff9PerCLsTW8d5ZbsL6b41t7ypD7ovmcgfTf3b4b34mbq8NM4sfwOHkgEq32FLYnD497RFQD4HQmYmh5Eveu1zWdDVyKyDtPyE98QoxTaOEnLA5kK0fnl3dOOEgUvtVKlTZ8bt1gj2v_8tDRWl9f50ybyei3A/s400/GNUstep-liveCD.png\"/></a></div>\n\n<p>By
740
+ Gürkan Sengün (talk) - Own work, Public Domain, <a href=\"https://commons.wikimedia.org/w/index.php?curid=594715\">https://commons.wikimedia.org/w/index.php?curid=594715</a></p>\n<p>I’ve
741
+ always found displaying a knowledge graph to be a challenge, as discussed
742
+ <a href=\"https://iphylo.blogspot.com/2019/07/notes-on-collections-knowledge-graphs.html\">elsewhere
743
+ on this blog</a> and in my paper on <a href=\"https://peerj.com/articles/6739/#p-29\">Ozymandias</a>.
744
+ Miller columns enable one to drill down in increasing depth, but it doesn’t
745
+ need to be a tree, it can be a path within a network. What I like about ResearchRabbit
746
+ and the original Scispace interface is that they present the current item
747
+ together with a list of possible connections (e.g., authors, citations) that
748
+ you can drill down on. Clicking on these will result in a new column being
749
+ appended to the right, with a view (typically a list) of the next candidates
750
+ to visit. In graph terms, these are adjacent nodes to the original item. The
751
+ clickable badges on each item can be thought of as sets of edges that have
752
+ the same label (e.g., “authored by”, “cites”, “funded”, “is about”, etc.).
753
+ Each of these nodes itself becomes a starting point for further exploration.
754
+ Note that the original starting point isn’t privileged, other than being the
755
+ starting point. That is, each time we drill down we are seeing the same type
756
+ of information displayed in the same way. Note also that the navigation can
757
+ be though of as a <strong>card</strong> for a node, with <strong>buttons</strong>
758
+ grouping the adjacent nodes. When we click on an individual button, it expands
759
+ into a <strong>list</strong> in the next column. This can be thought of as
760
+ a preview for each adjacent node. Clicking on an element in the list generates
761
+ a new card (we are viewing a single node) and we get another set of buttons
762
+ corresponding to the adjacent nodes.</p>\n<p>One important behaviour in a
763
+ Miller column interface is that the current path can be pruned at any point.
764
+ If we go back (i.e., scroll to the left) and click on another tab on an item,
765
+ everything downstream of that item (i.e., to the right) gets deleted and replaced
766
+ by a new set of nodes. This could make retrieving a particular history of
767
+ browsing a bit tricky, but encourages exploration. Both Scispace and ResearchRabbit have
768
+ the ability to add items to a collection, so you can keep track of things
769
+ you discover.</p>\n<p>Lots of food for thought, I’m assuming that there is
770
+ some user interface/experience research on Miller columns. One thing to remember
771
+ is that Miller columns are most often associated with trees, but in this case
772
+ we are exploring a network. That means that potentially there is no limit
773
+ to the number of columns being generated as we wander through the graph. It
774
+ will be interesting to think about what the average depth is likely to be,
775
+ in other words, how deep down the rabbit hole will be go?</p>\n\n<h3>Update</h3>\n<p>Should
776
+ add link to David Regev''s explorations of <a href=\"https://medium.com/david-regev-on-ux/flow-browser-b730daf0f717\">Flow
777
+ Browser</a>.\n\n<blockquote>\n<p>Written with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":["cards","flow","Knowledge
778
+ Graph","Miller column","RabbitResearch"],"language":"en","references":[]},{"id":"https://doi.org/10.59350/t6fb9-4fn44","uuid":"8bc3fea6-cb86-4344-8dad-f312fbf58041","url":"https://iphylo.blogspot.com/2021/12/the-business-of-extracting-knowledge.html","title":"The
779
+ Business of Extracting Knowledge from Academic Publications","summary":"Markus
780
+ Strasser (@mkstra write a fascinating article entitled \"The Business of Extracting
781
+ Knowledge from Academic Publications\". I spent months working on domain-specific
782
+ search engines and knowledge discovery apps for biomedicine and eventually
783
+ figured that synthesizing \"insights\" or building knowledge graphs by machine-reading
784
+ the academic literature (papers) is *barely useful* :https://t.co/eciOg30Odc—
785
+ Markus Strasser (@mkstra) December 7, 2021 His TL;DR: TL;DR: I worked on biomedical...","date_published":"2021-12-11T00:01:00Z","date_modified":"2021-12-11T00:01:21Z","date_indexed":"1909-06-16T11:32:09+00:00","authors":[{"url":null,"name":"Roderic
786
+ Page"}],"image":null,"content_html":"<p>Markus Strasser (<a href=\"https://twitter.com/mkstra\">@mkstra</a>
787
+ write a fascinating article entitled <a href=\"https://markusstrasser.org/extracting-knowledge-from-literature/\">\"The
788
+ Business of Extracting Knowledge from Academic Publications\"</a>.</p>\n\n<blockquote
789
+ class=\"twitter-tweet\"><p lang=\"en\" dir=\"ltr\">I spent months working
790
+ on domain-specific search engines and knowledge discovery apps for biomedicine
791
+ and eventually figured that synthesizing &quot;insights&quot; or building
792
+ knowledge graphs by machine-reading the academic literature (papers) is *barely
793
+ useful* :<a href=\"https://t.co/eciOg30Odc\">https://t.co/eciOg30Odc</a></p>&mdash;
794
+ Markus Strasser (@mkstra) <a href=\"https://twitter.com/mkstra/status/1468334482113523716?ref_src=twsrc%5Etfw\">December
795
+ 7, 2021</a></blockquote> <script async src=\"https://platform.twitter.com/widgets.js\"
796
+ charset=\"utf-8\"></script>\n\n<p>His TL;DR:</p>\n\n<p><blockquote>\nTL;DR:
797
+ I worked on biomedical literature search, discovery and recommender web applications
798
+ for many months and concluded that extracting, structuring or synthesizing
799
+ \"insights\" from academic publications (papers) or building knowledge bases
800
+ from a domain corpus of literature has negligible value in industry.</p>\n\n<p>Close
801
+ to nothing of what makes science actually work is published as text on the
802
+ web.\n</blockquote></p>\n\n<p>After recounting the many problems of knowledge
803
+ extraction - including a swipe at nanopubs which \"are ... dead in my view
804
+ (without admitting it)\" - he concludes:</p>\n\n<p><blockquote>\nI’ve been
805
+ flirting with this entire cluster of ideas including open source web annotation,
806
+ semantic search and semantic web, public knowledge graphs, nano-publications,
807
+ knowledge maps, interoperable protocols and structured data, serendipitous
808
+ discovery apps, knowledge organization, communal sense making and academic
809
+ literature/publishing toolchains for a few years on and off ... nothing of
810
+ it will go anywhere.</p>\n\n<p>Don’t take that as a challenge. Take it as
811
+ a red flag and run. Run towards better problems.\n</blockquote></p>\n\n<p>Well
812
+ worth a read, and much food for thought.</p>","tags":["ai","business model","text
813
+ mining"],"language":"en","references":[]},{"id":"https://doi.org/10.59350/463yw-pbj26","uuid":"dc829ab3-f0f1-40a4-b16d-a36dc0e34166","url":"https://iphylo.blogspot.com/2022/12/david-remsen.html","title":"David
814
+ Remsen","summary":"I heard yesterday from Martin Kalfatovic (BHL) that David
815
+ Remsen has died. Very sad news. It''s starting to feel like iPhylo might end
816
+ up being a list of obituaries of people working on biodiversity informatics
817
+ (e.g., Scott Federhen). I spent several happy visits at MBL at Woods Hole
818
+ talking to Dave at the height of the uBio project, which really kickstarted
819
+ large scale indexing of taxonomic names, and the use of taxonomic name finding
820
+ tools to index the literature. His work on uBio with David...","date_published":"2022-12-16T17:54:00Z","date_modified":"2022-12-17T08:12:23Z","date_indexed":"1909-06-16T11:41:39+00:00","authors":[{"url":null,"name":"Roderic
821
+ Page"}],"image":null,"content_html":"<p>I heard yesterday from Martin Kalfatovic
822
+ (BHL) that David Remsen has died. Very sad news. It''s starting to feel like
823
+ iPhylo might end up being a list of obituaries of people working on biodiversity
824
+ informatics (e.g., <a href=\"https://iphylo.blogspot.com/2016/05/scott-federhen-rip.html\">Scott
825
+ Federhen</a>).</p>\n\n<p>I spent several happy visits at MBL at Woods Hole
826
+ talking to Dave at the height of the uBio project, which really kickstarted
827
+ large scale indexing of taxonomic names, and the use of taxonomic name finding
828
+ tools to index the literature. His work on uBio with David (\"Paddy\") Patterson
829
+ led to the <a href=\"https://eol.org\">Encyclopedia of Life</a> (EOL).</p>\n\n<p>A
830
+ number of the things I''m currently working on are things Dave started. For
831
+ example, I recently uploaded a version of his dataset for Nomenclator Zoologicus[1]
832
+ to <a href=\"https://www.checklistbank.org/dataset/126539/about\">ChecklistBank</a>
833
+ where I''m working on augmenting that original dataset by adding links to
834
+ the taxonomic literature. My <a href=\"https://biorss.herokuapp.com/?feed=Y291bnRyeT1XT1JMRCZwYXRoPSU1QiUyMkJJT1RBJTIyJTVE\">BioRSS
835
+ project</a> is essentially an attempt to revive uBioRSS[2] (see <a href=\"https://iphylo.blogspot.com/2021/11/revisiting-rss-to-monitor-latests.html\">Revisiting
836
+ RSS to monitor the latest taxonomic research</a>).</p>\n\n<p>I have fond memories
837
+ of those visits to Woods Hole. A very sad day indeed.</p>\n\n<p><b>Update:</b>
838
+ The David Remsen Memorial Fund has been set up on <a href=\"https://www.gofundme.com/f/david-remsen-memorial-fund\">GoFundMe</a>.</p>\n\n<p>1.
839
+ Remsen, D. P., Norton, C., & Patterson, D. J. (2006). Taxonomic Informatics
840
+ Tools for the Electronic Nomenclator Zoologicus. The Biological Bulletin,
841
+ 210(1), 18–24. https://doi.org/10.2307/4134533</p>\n\n<p>2. Patrick R. Leary,
842
+ David P. Remsen, Catherine N. Norton, David J. Patterson, Indra Neil Sarkar,
843
+ uBioRSS: Tracking taxonomic literature using RSS, Bioinformatics, Volume 23,
844
+ Issue 11, June 2007, Pages 1434–1436, https://doi.org/10.1093/bioinformatics/btm109</p>","tags":["David
845
+ Remsen","obituary","uBio"],"language":"en","references":[]},{"id":"https://doi.org/10.59350/3s376-6bm21","uuid":"62e7b438-67a3-44ac-a66d-3f5c278c949e","url":"https://iphylo.blogspot.com/2022/02/deduplicating-bibliographic-data.html","title":"Deduplicating
846
+ bibliographic data","summary":"There are several instances where I have a
847
+ collection of references that I want to deduplicate and merge. For example,
848
+ in Zootaxa has no impact factor I describe a dataset of the literature cited
849
+ by articles in the journal Zootaxa. This data is available on Figshare (https://doi.org/10.6084/m9.figshare.c.5054372.v4),
850
+ as is the equivalent dataset for Phytotaxa (https://doi.org/10.6084/m9.figshare.c.5525901.v1).
851
+ Given that the same articles may be cited many times, these datasets have
852
+ lots of...","date_published":"2022-02-03T15:09:00Z","date_modified":"2022-02-03T15:11:29Z","date_indexed":"1909-06-16T10:22:30+00:00","authors":[{"url":null,"name":"Roderic
853
+ Page"}],"image":null,"content_html":"<p>There are several instances where
854
+ I have a collection of references that I want to deduplicate and merge. For
855
+ example, in <a href=\"https://iphylo.blogspot.com/2020/07/zootaxa-has-no-impact-factor.html\">Zootaxa
856
+ has no impact factor</a> I describe a dataset of the literature cited by articles
857
+ in the journal <i>Zootaxa</i>. This data is available on Figshare (<a href=\"https://doi.org/10.6084/m9.figshare.c.5054372.v4\">https://doi.org/10.6084/m9.figshare.c.5054372.v4</a>),
858
+ as is the equivalent dataset for <i>Phytotaxa</i> (<a href=\"https://doi.org/10.6084/m9.figshare.c.5525901.v1\">https://doi.org/10.6084/m9.figshare.c.5525901.v1</a>).
859
+ Given that the same articles may be cited many times, these datasets have
860
+ lots of duplicates. Similarly, articles in <a href=\"https://species.wikimedia.org\">Wikispecies</a>
861
+ often have extensive lists of references cited, and the same reference may
862
+ appear on multiple pages (for an initial attempt to extract these references
863
+ see <a href=\"https://doi.org/10.5281/zenodo.5801661\">https://doi.org/10.5281/zenodo.5801661</a>
864
+ and <a href=\"https://github.com/rdmpage/wikispecies-parser\">https://github.com/rdmpage/wikispecies-parser</a>).</p>\n\n<p>There
865
+ are several reasons I want to merge these references. If I want to build a
866
+ citation graph for <i>Zootaxa</i> or <i>Phytotaxa</i> I need to merge references
867
+ that are the same so that I can accurate count citations. I am also interested
868
+ in harvesting the metadata to help find those articles in the <a href=\"https://www.biodiversitylibrary.org\">Biodiversity
869
+ Heritage Library</a> (BHL), and the literature cited section of scientific
870
+ articles is a potential goldmine of bibliographic metadata, as is Wikispecies.</p>\n\n<p>After
871
+ various experiments and false starts I''ve created a repository <a href=\"https://github.com/rdmpage/bib-dedup\">https://github.com/rdmpage/bib-dedup</a>
872
+ to host a series of PHP scripts to deduplicate bibliographics data. I''ve
873
+ settled on using CSL-JSON as the format for bibliographic data. Because deduplication
874
+ relies on comparing pairs of references, the standard format for most of the
875
+ scripts is a JSON array containing a pair of CSL-JSON objects to compare.
876
+ Below are the steps the code takes.</p>\n\n<h2>Generating pairs to compare</h2>\n\n<p>The
877
+ first step is to take a list of references and generate the pairs that will
878
+ be compared. I started with this approach as I wanted to explore machine learning
879
+ and wanted a simple format for training data, such as an array of two CSL-JSON
880
+ objects and an integer flag representing whether the two references were the
881
+ same of different.</p>\n\n<p>There are various ways to generate CSL-JSON for
882
+ a reference. I use a tool I wrote (see <a href=\"https://iphylo.blogspot.com/2021/07/citation-parsing-released.html\">Citation
883
+ parsing tool released</a>) that has a simple API where you parse one or more
884
+ references and it returns that reference as structured data in CSL-JSON.</p>\n\n<p>Attempting
885
+ to do all possible pairwise comparisons rapidly gets impractical as the number
886
+ of references increases, so we need some way to restrict the number of comparisons
887
+ we make. One approach I''ve explored is the “sorted neighbourhood method”
888
+ where we sort the references 9for example by their title) then move a sliding
889
+ window down the list of references, comparing all references within that window.
890
+ This greatly reduces the number of pairwise comparisons. So the first step
891
+ is to sort the references, then run a sliding window over them, output all
892
+ the pairs in each window (ignoring in pairwise comparisons already made in
893
+ a previous window). Other methods of \"blocking\" could also be used, such
894
+ as only including references in a particular year, or a particular journal.</p>\n\n<p>So,
895
+ the output of this step is a set of JSON arrays, each with a pair of references
896
+ in CSL-JSON format. Each array is stored on a single line in the same file
897
+ in <a href=\"https://en.wikipedia.org/wiki/JSON_streaming#Line-delimited_JSON_2\">line-delimited
898
+ JSON</a> (JSONL).</p>\n\n<h2>Comparing pairs</h2>\n\n<p>The next step is to
899
+ compare each pair of references and decide whether they are a match or not.
900
+ Initially I explored a machine learning approach used in the following paper:</p>\n\n<blockquote>\nWilson
901
+ DR. 2011. Beyond probabilistic record linkage: Using neural networks and complex
902
+ features to improve genealogical record linkage. In: The 2011 International
903
+ Joint Conference on Neural Networks. 9–14. DOI: <a href=\"https://doi.org/10.1109/IJCNN.2011.6033192\">10.1109/IJCNN.2011.6033192</a>\n</blockquote>\n\n<p>Initial
904
+ experiments using <a href=\"https://github.com/jtet/Perceptron\">https://github.com/jtet/Perceptron</a>
905
+ were promising and I want to play with this further, but I deciding to skip
906
+ this for now and just use simple string comparison. So for each CSL-JSON object
907
+ I generate a citation string in the same format using CiteProc, then compute
908
+ the <a href=\"https://en.wikipedia.org/wiki/Levenshtein_distance\">Levenshtein
909
+ distance</a> between the two strings. By normalising this distance by the
910
+ length of the two strings being compared I can use an arbitrary threshold
911
+ to decide if the references are the same or not.</p>\n\n<h2>Clustering</h2>\n\n<p>For
912
+ this step we read the JSONL file produced above and record whether the two
913
+ references are a match or not. Assuming each reference has a unique identifier
914
+ (needs only be unique within the file) then we can use those identifier to
915
+ record the clusters each reference belongs to. I do this using a <a href=\"https://en.wikipedia.org/wiki/Disjoint-set_data_structure\">Disjoint-set
916
+ data structure</a>. For each reference start with a graph where each node
917
+ represents a reference, and each node has a pointer to a parent node. Initially
918
+ the reference is its own parent. A simple implementation is to have an array
919
+ index by reference identifiers and where the value of each cell in the array
920
+ is the node''s parent.</p>\n\n<p>As we discover pairs we update the parents
921
+ of the nodes to reflect this, such that once all the comparisons are done
922
+ we have a one or more sets of clusters corresponding to the references that
923
+ we think are the same. Another way to think of this is that we are getting
924
+ the components of a graph where each node is a reference and pair of references
925
+ that match are connected by an edge.</p>\n\n<p>In the code I''m using I write
926
+ this graph in <a href=\"https://en.wikipedia.org/wiki/Trivial_Graph_Format\">Trivial
927
+ Graph Format</a> (TGF) which can be visualised using a tools such as <a href=\"https://www.yworks.com/products/yed\">yEd</a>.</p>\n\n<h2>Merging</h2>\n\n<p>Now
928
+ that we have a graph representing the sets of references that we think are
929
+ the same we need to merge them. This is where things get interesting as the
930
+ references are similar (by definition) but may differ in some details. The
931
+ paper below describes a simple Bayesian approach for merging records:</p>\n\n<blockquote>\nCouncill
932
+ IG, Li H, Zhuang Z, Debnath S, Bolelli L, Lee WC, Sivasubramaniam A, Giles
933
+ CL. 2006. Learning Metadata from the Evidence in an On-line Citation Matching
934
+ Scheme. In: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital
935
+ Libraries. JCDL ’06. New York, NY, USA: ACM, 276–285. DOI: <a href=\"https://doi.org/10.1145/1141753.1141817\">10.1145/1141753.1141817</a>.\n</blockquote>\n\n<p>So
936
+ the next step is to read the graph with the clusters, generate the sets of
937
+ bibliographic references that correspond to each cluster, then use the method
938
+ described in Councill et al. to produce a single bibliographic record for
939
+ that cluster. These records could then be used to, say locate the corresponding
940
+ article in BHL, or populate Wikidata with missing references.</p>\n\n<p>Obviously
941
+ there is always the potential for errors, such as trying to merge references
942
+ that are not the same. As a quick and dirty check I flag as dubious any cluster
943
+ where the page numbers vary among members of the cluster. More sophisticated
944
+ checks are possible, especially if I go down the ML route (i.e., I would have
945
+ evidence for the probability that the same reference can disagree on some
946
+ aspects of metadata).</p>\n\n<h2>Summary</h2>\n\n<p>At this stage the code
947
+ is working well enough for me to play with and explore some example datasets.
948
+ The focus is on structured bibliographic metadata, but I may simplify things
949
+ and have a version that handles simple string matching, for example to cluster
950
+ together different abbreviations of the same journal name.</p>","tags":["data
951
+ cleaning","deduplication","Phytotaxa","Wikispecies","Zootaxa"],"language":"en","references":[]},{"id":"https://doi.org/10.59350/c79vq-7rr11","uuid":"3cb94422-5506-4e24-a41c-a250bb521ee0","url":"https://iphylo.blogspot.com/2021/12/graphql-for-wikidata-wikicite.html","title":"GraphQL
952
+ for WikiData (WikiCite)","summary":"I''ve released a very crude GraphQL endpoint
953
+ for WikiData. More precisely, the endpoint is for a subset of the entities
954
+ that are of interest to WikiCite, such as scholarly articles, people, and
955
+ journals. There is a crude demo at https://wikicite-graphql.herokuapp.com.
956
+ The endpoint itself is at https://wikicite-graphql.herokuapp.com/gql.php.
957
+ There are various ways to interact with the endpoint, personally I like the
958
+ Altair GraphQL Client by Samuel Imolorhe. As I''ve mentioned earlier it''s
959
+ taken...","date_published":"2021-12-20T13:16:00Z","date_modified":"2021-12-20T13:20:05Z","date_indexed":"1909-06-16T10:52:00+00:00","authors":[{"url":null,"name":"Roderic
960
+ Page"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
961
+ both;\"><a href=\"https://blogger.googleusercontent.com/img/a/AVvXsEh7WW8TlfN2u3xe_E-sH5IK6AWYnAoWaKrP2b32UawUeqMpPlq6ZFk5BtJVZsMmCNh5j3QRsTj5H0Ee55RRGntnc9yjj_mNB8KHmH2dzocCgyLS2VFOxsBji6u4Ey6qxlDAnT-zrsBpnDcTchbhgt1x0Sf7RkmIMkS1y4-_3KCQian-SeIF-g=s1000\"
962
+ style=\"display: block; padding: 1em 0; text-align: center; clear: right;
963
+ float: right;\"><img alt=\"\" border=\"0\" width=\"128\" data-original-height=\"1000\"
964
+ data-original-width=\"1000\" src=\"https://blogger.googleusercontent.com/img/a/AVvXsEh7WW8TlfN2u3xe_E-sH5IK6AWYnAoWaKrP2b32UawUeqMpPlq6ZFk5BtJVZsMmCNh5j3QRsTj5H0Ee55RRGntnc9yjj_mNB8KHmH2dzocCgyLS2VFOxsBji6u4Ey6qxlDAnT-zrsBpnDcTchbhgt1x0Sf7RkmIMkS1y4-_3KCQian-SeIF-g=s200\"/></a></div><p>I''ve
965
+ released a very crude GraphQL endpoint for WikiData. More precisely, the endpoint
966
+ is for a subset of the entities that are of interest to WikiCite, such as
967
+ scholarly articles, people, and journals. There is a crude demo at <a href=\"https://wikicite-graphql.herokuapp.com\">https://wikicite-graphql.herokuapp.com</a>.
968
+ The endpoint itself is at <a href=\"https://wikicite-graphql.herokuapp.com/gql.php\">https://wikicite-graphql.herokuapp.com/gql.php</a>.
969
+ There are various ways to interact with the endpoint, personally I like the
970
+ <a href=\"https://altair.sirmuel.design\">Altair GraphQL Client</a> by <a
971
+ href=\"https://github.com/imolorhe\">Samuel Imolorhe</a>.</p>\n\n<p>As I''ve
972
+ <a href=\"https://iphylo.blogspot.com/2021/04/it-been-while.html\">mentioned
973
+ earlier</a> it''s taken me a while to see the point of GraphQL. But it is
974
+ clear it is gaining traction in the biodiversity world (see for example the
975
+ <a href=\"https://dev.gbif.org/hosted-portals.html\">GBIF Hosted Portals</a>)
976
+ so it''s worth exploring. My take on GraphQL is that it is a way to create
977
+ a self-describing API that someone developing a web site can use without them
978
+ having to bury themselves in the gory details of how data is internally modelled.
979
+ For example, WikiData''s query interface uses SPARQL, a powerful language
980
+ that has a steep learning curve (in part because of the administrative overhead
981
+ brought by RDF namespaces, etc.). In my previous SPARQL-based projects such
982
+ as <a href=\"https://ozymandias-demo.herokuapp.com\">Ozymandias</a> and <a
983
+ href=\"http://alec-demo.herokuapp.com\">ALEC</a> I have either returned SPARQL
984
+ results directly (Ozymandias) or formatted SPARQL results as <a href=\"https://schema.org/DataFeed\">schema.org
985
+ DataFeeds</a> (equivalent to RSS feeds) (ALEC). Both approaches work, but
986
+ they are project-specific and if anyone else tried to build based on these
987
+ projects they might struggle for figure out what was going on. I certainly
988
+ struggle, and I wrote them!</p>\n\n<p>So it seems worthwhile to explore this
989
+ approach a little further and see if I can develop a GraphQL interface that
990
+ can be used to build the sort of rich apps that I want to see. The demo I''ve
991
+ created uses SPARQL under the hood to provide responses to the GraphQL queries.
992
+ So in this sense it''s not replacing SPARQL, it''s simply providing a (hopefully)
993
+ simpler overlay on top of SPARQL so that we can retrieve the data we want
994
+ without having to learn the intricacies of SPARQL, nor how Wikidata models
995
+ publications and people.</p>","tags":["GraphQL","SPARQL","WikiCite","Wikidata"],"language":"en","references":[]}]}'
996
+ recorded_at: Sun, 18 Jun 2023 06:09:05 GMT
997
+ recorded_with: VCR 6.1.0