commonmeta-ruby 3.2.15 → 3.3.1
- checksums.yaml +4 -4
- data/Gemfile.lock +1 -1
- data/bin/commonmeta +1 -1
- data/lib/commonmeta/author_utils.rb +1 -1
- data/lib/commonmeta/cli.rb +17 -0
- data/lib/commonmeta/crossref_utils.rb +56 -14
- data/lib/commonmeta/readers/json_feed_reader.rb +25 -1
- data/lib/commonmeta/utils.rb +37 -0
- data/lib/commonmeta/version.rb +1 -1
- data/spec/cli_spec.rb +27 -3
- data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/doi_prefix/doi_prefix_by_blog.yml +997 -0
- data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/doi_prefix/doi_prefix_by_uuid.yml +256 -0
- data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/encode/by_blog.yml +997 -0
- data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/encode/by_blog_unknown_blog_id.yml +49 -0
- data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/encode/by_uuid.yml +256 -0
- data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/encode/by_uuid_unknown_uuid.yml +49 -0
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_doi_prefix_for_blog/by_blog_id.yml +997 -0
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_doi_prefix_for_blog/by_blog_post_uuid.yml +389 -0
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_doi_prefix_for_blog/by_blog_post_uuid_specific_prefix.yml +389 -0
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item/by_uuid.yml +136 -0
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/blog_post_with_non-url_id.yml +136 -0
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/ghost_post_with_organizational_author.yml +91 -0
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_crossref/json_feed_item_from_rogue_scholar_with_organizational_author.yml +91 -0
- data/spec/readers/json_feed_reader_spec.rb +68 -0
- data/spec/utils_spec.rb +8 -0
- data/spec/writers/crossref_xml_writer_spec.rb +28 -0
- metadata +15 -2
@@ -0,0 +1,997 @@
+---
+http_interactions:
+- request:
+    method: get
+    uri: https://rogue-scholar.org/api/blogs/tyfqw20
+    body:
+      encoding: UTF-8
+      string: ''
+    headers:
+      Connection:
+      - close
+      Host:
+      - rogue-scholar.org
+      User-Agent:
+      - http.rb/5.1.1
+  response:
+    status:
+      code: 200
+      message: OK
+    headers:
+      Age:
+      - '0'
+      Cache-Control:
+      - public, max-age=0, must-revalidate
+      Content-Length:
+      - '88356'
+      Content-Type:
+      - application/json; charset=utf-8
+      Date:
+      - Sun, 18 Jun 2023 06:01:20 GMT
+      Etag:
+      - '"xxw11pz9vb1w14"'
+      Server:
+      - Vercel
+      Strict-Transport-Security:
+      - max-age=63072000
+      X-Matched-Path:
+      - "/api/blogs/[slug]"
+      X-Vercel-Cache:
+      - MISS
+      X-Vercel-Id:
+      - fra1::iad1::fqtvx-1687068074466-8dd4935835aa
+      Connection:
+      - close
+    body:
+      encoding: UTF-8
|
47
|
+
string: '{"id":"tyfqw20","title":"iPhylo","description":"Rants, raves (and occasionally
|
48
|
+
considered opinions) on phyloinformatics, taxonomy, and biodiversity informatics. For
|
49
|
+
more ranty and less considered opinions, see my <a href=\"https://twitter.com/rdmpage\">Twitter
|
50
|
+
feed</a>.<br>ISSN 2051-8188. Written content on this site is licensed under
|
51
|
+
a <a href=\"https://creativecommons.org/licenses/by/4.0/\">Creative Commons
|
52
|
+
Attribution 4.0 International license</a>.","language":"en","favicon":null,"feed_url":"https://iphylo.blogspot.com/feeds/posts/default","feed_format":"application/atom+xml","home_page_url":"https://iphylo.blogspot.com/","indexed_at":"2023-02-06","modified_at":"2023-05-31T17:26:00+00:00","license":"https://creativecommons.org/licenses/by/4.0/legalcode","generator":"Blogger
|
53
|
+
7.00","category":"Natural Sciences","backlog":true,"prefix":"10.59350","items":[{"id":"https://doi.org/10.59350/en7e9-5s882","uuid":"20b9d31e-513f-496b-b399-4215306e1588","url":"https://iphylo.blogspot.com/2022/04/obsidian-markdown-and-taxonomic-trees.html","title":"Obsidian,
|
54
|
+
markdown, and taxonomic trees","summary":"Returning to the subject of personal
|
55
|
+
knowledge graphs Kyle Scheer has an interesting repository of Markdown files
|
56
|
+
that describe academic disciplines at https://github.com/kyletscheer/academic-disciplines
|
57
|
+
(see his blog post for more background). If you add these files to Obsidian
|
58
|
+
you get a nice visualisation of a taxonomy of academic disciplines. The applications
|
59
|
+
of this to biological taxonomy seem obvious, especially as a tool like Obsidian
|
60
|
+
enables all sorts of interesting links to be added...","date_published":"2022-04-07T21:07:00Z","date_modified":"2022-04-07T21:15:34Z","date_indexed":"1909-06-16T09:41:45+00:00","authors":[{"url":null,"name":"Roderic
|
61
|
+
Page"}],"image":null,"content_html":"<p>Returning to the subject of <a href=\"https://iphylo.blogspot.com/2020/08/personal-knowledge-graphs-obsidian-roam.html\">personal
|
62
|
+
knowledge graphs</a> Kyle Scheer has an interesting repository of Markdown
|
63
|
+
files that describe academic disciplines at <a href=\"https://github.com/kyletscheer/academic-disciplines\">https://github.com/kyletscheer/academic-disciplines</a>
|
64
|
+
(see <a href=\"https://kyletscheer.medium.com/on-creating-a-tree-of-knowledge-f099c1028bf6\">his
|
65
|
+
blog post</a> for more background).</p>\n\n<p>If you add these files to <a
|
66
|
+
href=\"https://obsidian.md/\">Obsidian</a> you get a nice visualisation of
|
67
|
+
a taxonomy of academic disciplines. The applications of this to biological
|
68
|
+
taxonomy seem obvious, especially as a tool like Obsidian enables all sorts
|
69
|
+
of interesting links to be added (e.g., we could add links to the taxonomic
|
70
|
+
research behind each node in the taxonomic tree, the people doing that research,
|
71
|
+
etc. - although that would mean we''d no longer have a simple tree).</p>\n\n<p>The
|
72
|
+
more I look at these sort of simple Markdown-based tools the more I wonder
|
73
|
+
whether we could make more use of them to create simple but persistent databases.
|
74
|
+
Text files seem the most stable, long-lived digital format around, maybe this
|
75
|
+
would be a way to minimise the inevitable obsolescence of database and server
|
76
|
+
software. Time for some experiments I feel... can we take a taxonomic group,
|
77
|
+
such as mammals, and create a richly connected database purely in Markdown?</p>\n\n<div
|
78
|
+
class=\"separator\" style=\"clear: both; text-align: center;\"><iframe allowfullscreen=''allowfullscreen''
|
79
|
+
webkitallowfullscreen=''webkitallowfullscreen'' mozallowfullscreen=''mozallowfullscreen''
|
80
|
+
width=''400'' height=''322'' src=''https://www.blogger.com/video.g?token=AD6v5dxZtweOTJTdg6aqvICq_tKF0la1QZuDAEpwPPCVQKtG5vjB-DzuQv-ApL8JnpyZ1FffYtWo6ymizNQ''
|
81
|
+
class=''b-hbp-video b-uploaded'' frameborder=''0''></iframe></div>","tags":["markdown","obsidian"],"language":"en","references":[]},{"id":"https://doi.org/10.59350/m48f7-c2128","uuid":"8aea47e4-f227-45f4-b37b-0454a8a7a3ff","url":"https://iphylo.blogspot.com/2023/04/chatgpt-semantic-search-and-knowledge.html","title":"ChatGPT,
|
82
|
+
semantic search, and knowledge graphs","summary":"One thing about ChatGPT
|
83
|
+
is it has opened my eyes to some concepts I was dimly aware of but am only
|
84
|
+
now beginning to fully appreciate. ChatGPT enables you ask it questions, but
|
85
|
+
the answers depend on what ChatGPT “knows”. As several people have noted,
|
86
|
+
what would be even better is to be able to run ChatGPT on your own content.
|
87
|
+
Indeed, ChatGPT itself now supports this using plugins. Paul Graham GPT However,
|
88
|
+
it’s still useful to see how to add ChatGPT functionality to your own content
|
89
|
+
from...","date_published":"2023-04-03T15:30:00Z","date_modified":"2023-04-03T15:32:04Z","date_indexed":"1909-06-16T09:02:34+00:00","authors":[{"url":null,"name":"Roderic
|
90
|
+
Page"}],"image":null,"content_html":"<p>One thing about ChatGPT is it has
|
91
|
+
opened my eyes to some concepts I was dimly aware of but am only now beginning
|
92
|
+
to fully appreciate. ChatGPT enables you ask it questions, but the answers
|
93
|
+
depend on what ChatGPT “knows”. As several people have noted, what would be
|
94
|
+
even better is to be able to run ChatGPT on your own content. Indeed, ChatGPT
|
95
|
+
itself now supports this using <a href=\"https://openai.com/blog/chatgpt-plugins\">plugins</a>.</p>\n<h4
|
96
|
+
id=\"paul-graham-gpt\">Paul Graham GPT</h4>\n<p>However, it’s still useful
|
97
|
+
to see how to add ChatGPT functionality to your own content from scratch.
|
98
|
+
A nice example of this is <a href=\"https://paul-graham-gpt.vercel.app/\">Paul
|
99
|
+
Graham GPT</a> by <a href=\"https://twitter.com/mckaywrigley\">Mckay Wrigley</a>.
|
100
|
+
Mckay Wrigley took essays by Paul Graham (a well known venture capitalist)
|
101
|
+
and built a question and answer tool very like ChatGPT.</p>\n<iframe width=\"560\"
|
102
|
+
height=\"315\" src=\"https://www.youtube.com/embed/ii1jcLg-eIQ\" title=\"YouTube
|
103
|
+
video player\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write;
|
104
|
+
encrypted-media; gyroscope; picture-in-picture; web-share\" allowfullscreen></iframe>\n<p>Because
|
105
|
+
you can send a block of text to ChatGPT (as part of the prompt) you can get
|
106
|
+
ChatGPT to summarise or transform that information, or answer questions based
|
107
|
+
on that information. But there is a limit to how much information you can
|
108
|
+
pack into a prompt. You can’t put all of Paul Graham’s essays into a prompt
|
109
|
+
for example. So a solution is to do some preprocessing. For example, given
|
110
|
+
a question such as “How do I start a startup?” we could first find the essays
|
111
|
+
that are most relevant to this question, then use them to create a prompt
|
112
|
+
for ChatGPT. A quick and dirty way to do this is simply do a text search over
|
113
|
+
the essays and take the top hits. But we aren’t searching for words, we are
|
114
|
+
searching for answers to a question. The essay with the best answer might
|
115
|
+
not include the phrase “How do I start a startup?”.</p>\n<h4 id=\"semantic-search\">Semantic
|
116
|
+
search</h4>\n<p>Enter <a href=\"https://en.wikipedia.org/wiki/Semantic_search\">Semantic
|
117
|
+
search</a>. The key concept behind semantic search is that we are looking
|
118
|
+
for documents with similar meaning, not just similarity of text. One approach
|
119
|
+
to this is to represent documents by “embeddings”, that is, a vector of numbers
|
120
|
+
that encapsulate features of the document. Documents with similar vectors
|
121
|
+
are potentially related. In semantic search we take the query (e.g., “How
|
122
|
+
do I start a startup?”), compute its embedding, then search among the documents
|
123
|
+
for those with similar embeddings.</p>\n<p>To create Paul Graham GPT Mckay
|
124
|
+
Wrigley did the following. First he sent each essay to the OpenAI API underlying
|
125
|
+
ChatGPT, and in return he got the embedding for that essay (a vector of 1536
|
126
|
+
numbers). Each embedding was stored in a database (Mckay uses Postgres with
|
127
|
+
<a href=\"https://github.com/pgvector/pgvector\">pgvector</a>). When a user
|
128
|
+
enters a query such as “How do I start a startup?” that query is also sent
|
129
|
+
to the OpenAI API to retrieve its embedding vector. Then we query the database
|
130
|
+
of embeddings for Paul Graham’s essays and take the top five hits. These hits
|
131
|
+
are, one hopes, the most likely to contain relevant answers. The original
|
132
|
+
question and the most similar essays are then bundled up and sent to ChatGPT
|
133
|
+
which then synthesises an answer. See his <a href=\"https://github.com/mckaywrigley/paul-graham-gpt\">GitHub
|
134
|
+
repo</a> for more details. Note that we are still using ChatGPT, but on a
|
135
|
+
set of documents it doesn’t already have.</p>\n<h4 id=\"knowledge-graphs\">Knowledge
|
136
|
+
graphs</h4>\n<p>I’m a fan of knowledge graphs, but they are not terribly easy
|
137
|
+
to use. For example, I built a knowledge graph of Australian animals <a href=\"https://ozymandias-demo.herokuapp.com\">Ozymandias</a>
|
138
|
+
that contains a wealth of information on taxa, publications, and people, wrapped
|
139
|
+
up in a web site. If you want to learn more you need to figure out how to
|
140
|
+
write queries in SPARQL, which is not fun. Maybe we could use ChatGPT to write
|
141
|
+
the SPARQL queries for us, but it would be much more fun to be simply ask
|
142
|
+
natural language queries (e.g., “who are the experts on Australian ants?”).
|
143
|
+
I made some naïve notes on these ideas <a href=\"https://iphylo.blogspot.com/2015/09/possible-project-natural-language.html\">Possible
|
144
|
+
project: natural language queries, or answering “how many species are there?”</a>
|
145
|
+
and <a href=\"https://iphylo.blogspot.com/2019/05/ozymandias-meets-wikipedia-with-notes.html\">Ozymandias
|
146
|
+
meets Wikipedia, with notes on natural language generation</a>.</p>\n<p>Of
|
147
|
+
course, this is a well known problem. Tools such as <a href=\"http://rdf2vec.org\">RDF2vec</a>
|
148
|
+
can take RDF from a knowledge graph and create embeddings which could in tern
|
149
|
+
be used to support semantic search. But it seems to me that we could simply
|
150
|
+
this process a bit by making use of ChatGPT.</p>\n<p>Firstly we would generate
|
151
|
+
natural language statements from the knowledge graph (e.g., “species x belongs
|
152
|
+
to genus y and was described in z”, “this paper on ants was authored by x”,
|
153
|
+
etc.) that cover the basic questions we expect people to ask. We then get
|
154
|
+
embeddings for these (e.g., using OpenAI). We then have an interface where
|
155
|
+
people can ask a question (“is species x a valid species?”, “who has published
|
156
|
+
on ants”, etc.), we get the embedding for that question, retrieve natural
|
157
|
+
language statements that the closest in embedding “space”, package everything
|
158
|
+
up and ask ChatGPT to summarise the answer.</p>\n<p>The trick, of course,
|
159
|
+
is to figure out how t generate natural language statements from the knowledge
|
160
|
+
graph (which amounts to deciding what paths to traverse in the knowledge graph,
|
161
|
+
and how to write those paths is something approximating English). We also
|
162
|
+
want to know something about the sorts of questions people are likely to ask
|
163
|
+
so that we have a reasonable chance of having the answers (for example, are
|
164
|
+
people going to ask about individual species, or questions about summary statistics
|
165
|
+
such as numbers of species in a genus, etc.).</p>\n<p>What makes this attractive
|
166
|
+
is that it seems a straightforward way to go from a largely academic exercise
|
167
|
+
(build a knowledge graph) to something potentially useful (a question and
|
168
|
+
answer machine). Imagine if something like the defunct BBC wildlife site (see
|
169
|
+
<a href=\"https://iphylo.blogspot.com/2017/12/blue-planet-ii-bbc-and-semantic-web.html\">Blue
|
170
|
+
Planet II, the BBC, and the Semantic Web: a tale of lessons forgotten and
|
171
|
+
opportunities lost</a>) revived <a href=\"https://aspiring-look.glitch.me\">here</a>
|
172
|
+
had a question and answer interface where we could ask questions rather than
|
173
|
+
passively browse.</p>\n<h4 id=\"summary\">Summary</h4>\n<p>I have so much
|
174
|
+
more to learn, and need to think about ways to incorporate semantic search
|
175
|
+
and ChatGPT-like tools into knowledge graphs.</p>\n<blockquote>\n<p>Written
|
176
|
+
with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":[],"language":"en","references":[]},{"id":"https://doi.org/10.59350/zc4qc-77616","uuid":"30c78d9d-2e50-49db-9f4f-b3baa060387b","url":"https://iphylo.blogspot.com/2022/09/does-anyone-cite-taxonomic-treatments.html","title":"Does
|
177
|
+
anyone cite taxonomic treatments?","summary":"Taxonomic treatments have come
|
178
|
+
up in various discussions I''m involved in, and I''m curious as to whether
|
179
|
+
they are actually being used, in particular, whether they are actually being
|
180
|
+
cited. Consider the following quote: The taxa are described in taxonomic treatments,
|
181
|
+
well defined sections of scientific publications (Catapano 2019). They include
|
182
|
+
a nomenclatural section and one or more sections including descriptions, material
|
183
|
+
citations referring to studied specimens, or notes ecology and...","date_published":"2022-09-01T16:49:00Z","date_modified":"2022-09-01T16:49:51Z","date_indexed":"1909-06-16T09:31:50+00:00","authors":[{"url":null,"name":"Roderic
|
184
|
+
Page"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
|
185
|
+
both;\"><a href=\"https://zenodo.org/record/5731100/thumb100\" style=\"display:
|
186
|
+
block; padding: 1em 0; text-align: center; clear: right; float: right;\"><img
|
187
|
+
alt=\"\" border=\"0\" height=\"128\" data-original-height=\"106\" data-original-width=\"100\"
|
188
|
+
src=\"https://zenodo.org/record/5731100/thumb250\"/></a></div>\nTaxonomic
|
189
|
+
treatments have come up in various discussions I''m involved in, and I''m
|
190
|
+
curious as to whether they are actually being used, in particular, whether
|
191
|
+
they are actually being cited. Consider the following quote:\n\n<blockquote>\nThe
|
192
|
+
taxa are described in taxonomic treatments, well defined sections of scientific
|
193
|
+
publications (Catapano 2019). They include a nomenclatural section and one
|
194
|
+
or more sections including descriptions, material citations referring to studied
|
195
|
+
specimens, or notes ecology and behavior. In case the treatment does not describe
|
196
|
+
a new discovered taxon, previous treatments are cited in the form of treatment
|
197
|
+
citations. This citation can refer to a previous treatment and add additional
|
198
|
+
data, or it can be a statement synonymizing the taxon with another taxon.
|
199
|
+
This allows building a citation network, and ultimately is a constituent part
|
200
|
+
of the catalogue of life. - Taxonomic Treatments as Open FAIR Digital Objects
|
201
|
+
<a href=\"https://doi.org/10.3897/rio.8.e93709\">https://doi.org/10.3897/rio.8.e93709</a>\n</blockquote>\n\n<p>\n
|
202
|
+
\"Traditional\" academic citation is from article to article. For example,
|
203
|
+
consider these two papers:\n\n<blockquote>\nLi Y, Li S, Lin Y (2021) Taxonomic
|
204
|
+
study on fourteen symphytognathid species from Asia (Araneae, Symphytognathidae).
|
205
|
+
ZooKeys 1072: 1-47. https://doi.org/10.3897/zookeys.1072.67935\n</blockquote>\n\n<blockquote>\nMiller
|
206
|
+
J, Griswold C, Yin C (2009) The symphytognathoid spiders of the Gaoligongshan,
|
207
|
+
Yunnan, China (Araneae: Araneoidea): Systematics and diversity of micro-orbweavers.
|
208
|
+
ZooKeys 11: 9-195. https://doi.org/10.3897/zookeys.11.160\n</blockquote>\n</p>\n\n<p>Li
|
209
|
+
et al. 2021 cites Miller et al. 2009 (although Pensoft seems to have broken
|
210
|
+
the citation such that it does appear correctly either on their web page or
|
211
|
+
in CrossRef).</p>\n\n<p>So, we have this link: [article]10.3897/zookeys.1072.67935
|
212
|
+
--cites--> [article]10.3897/zookeys.11.160. One article cites another.</p>\n\n<p>In
|
213
|
+
their 2021 paper Li et al. discuss <i>Patu jidanweishi</i> Miller, Griswold
|
214
|
+
& Yin, 2009:\n\n<div class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPavMHXqQNX1ls_zXo9kIcMLHPxc7ZpV9NCSof5wLumrg3ovPoi6nYKzZsINuqFtYoEvW1QrerePD-MEf2DJaYUXlT37d81x3L6ILls7u229rg0_Nc0uUmgW-ICzr6MI_QCZfgQbYGTxuofu-fuPVoygbCnm3vQVYOhLDLtp1EtQ9jRZHDvw/s1040/Screenshot%202022-09-01%20at%2017.12.27.png\"
|
215
|
+
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
|
216
|
+
border=\"0\" width=\"400\" data-original-height=\"314\" data-original-width=\"1040\"
|
217
|
+
src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPavMHXqQNX1ls_zXo9kIcMLHPxc7ZpV9NCSof5wLumrg3ovPoi6nYKzZsINuqFtYoEvW1QrerePD-MEf2DJaYUXlT37d81x3L6ILls7u229rg0_Nc0uUmgW-ICzr6MI_QCZfgQbYGTxuofu-fuPVoygbCnm3vQVYOhLDLtp1EtQ9jRZHDvw/s400/Screenshot%202022-09-01%20at%2017.12.27.png\"/></a></div>\n\n<p>There
|
218
|
+
is a treatment for the original description of <i>Patu jidanweishi</i> at
|
219
|
+
<a href=\"https://doi.org/10.5281/zenodo.3792232\">https://doi.org/10.5281/zenodo.3792232</a>,
|
220
|
+
which was created by Plazi with a time stamp \"2020-05-06T04:59:53.278684+00:00\".
|
221
|
+
The original publication date was 2009, the treatments are being added retrospectively.</p>\n\n<p>In
|
222
|
+
an ideal world my expectation would be that Li et al. 2021 would have cited
|
223
|
+
the treatment, instead of just providing the text string \"Patu jidanweishi
|
224
|
+
Miller, Griswold & Yin, 2009: 64, figs 65A–E, 66A, B, 67A–D, 68A–F, 69A–F,
|
225
|
+
70A–F and 71A–F (♂♀).\" Isn''t the expectation under the treatment model that
|
226
|
+
we would have seen this relationship:</p>\n\n<p>[article]10.3897/zookeys.1072.67935
|
227
|
+
--cites--> [treatment]https://doi.org/10.5281/zenodo.3792232</p>\n\n<p>Furthermore,
|
228
|
+
if it is the case that \"[i]n case the treatment does not describe a new discovered
|
229
|
+
taxon, previous treatments are cited in the form of treatment citations\"
|
230
|
+
then we should also see a citation between treatments, in other words Li et
|
231
|
+
al.''s 2021 treatment of <i>Patu jidanweishi</i> (which doesn''t seem to have
|
232
|
+
a DOI but is available on Plazi'' web site as <a href=\"https://tb.plazi.org/GgServer/html/1CD9FEC313A35240938EC58ABB858E74\">https://tb.plazi.org/GgServer/html/1CD9FEC313A35240938EC58ABB858E74</a>)
|
233
|
+
should also cite the original treatment? It doesn''t - but it does cite the
|
234
|
+
Miller et al. paper.</p>\n\n<p>So in this example we don''t see articles citing
|
235
|
+
treatments, nor do we see treatments citing treatments. Playing Devil''s advocate,
|
236
|
+
why then do we have treatments? Does''t the lack of citations suggest that
|
237
|
+
- despite some taxonomists saying this is the unit that matters - they actually
|
238
|
+
don''t. If we pay attention to what people do rather than what they say they
|
239
|
+
do, they cite articles.</p>\n\n<p>Now, there are all sorts of reasons why
|
240
|
+
we don''t see [article] -> [treatment] citations, or [treatment] -> [treatment]
|
241
|
+
citations. Treatments are being added after the fact by Plazi, not by the
|
242
|
+
authors of the original work. And in many cases the treatments that could
|
243
|
+
be cited haven''t appeared until after that potentially citing work was published.
|
244
|
+
In the example above the Miller et al. paper dates from 2009, but the treatment
|
245
|
+
extracted only went online in 2020. And while there is a long standing culture
|
246
|
+
of citing publications (ideally using DOIs) there isn''t an equivalent culture
|
247
|
+
of citing treatments (beyond the simple text strings).</p>\n\n<p>Obviously
|
248
|
+
this is but one example. I''d need to do some exploration of the citation
|
249
|
+
graph to get a better sense of citations patterns, perhaps using <a href=\"https://www.crossref.org/documentation/event-data/\">CrossRef''s
|
250
|
+
event data</a>. But my sense is that taxonomists don''t cite treatments.</p>\n\n<p>I''m
|
251
|
+
guessing Plazi would respond by saying treatments are cited, for example (indirectly)
|
252
|
+
in GBIF downloads. This is true, although arguably people aren''t citing the
|
253
|
+
treatment, they''re citing specimen data in those treatments, and that specimen
|
254
|
+
data could be extracted at the level of articles rather than treatments. In
|
255
|
+
other words, it''s not the treatments themselves that people are citing.</p>\n\n<p>To
|
256
|
+
be clear, I think there is value in being able to identify those \"well defined
|
257
|
+
sections\" of a publication that deal with a given taxon (i.e., treatments),
|
258
|
+
but it''s not clear to me that these are actually the citable units people
|
259
|
+
might hope them to be. Likewise, journals such as <i>ZooKeys</i> have DOIs
|
260
|
+
for individual figures. Does anyone actually cite those?</p>","tags":[],"language":"en","references":[]},{"id":"https://doi.org/10.59350/pmhat-5ky65","uuid":"5891c709-d139-440f-bacb-06244424587a","url":"https://iphylo.blogspot.com/2021/10/problems-with-plazi-parsing-how.html","title":"Problems
|
261
|
+
with Plazi parsing: how reliable are automated methods for extracting specimens
|
262
|
+
from the literature?","summary":"The Plazi project has become one of the major
|
263
|
+
contributors to GBIF with some 36,000 datasets yielding some 500,000 occurrences
|
264
|
+
(see Plazi''s GBIF page for details). These occurrences are extracted from
|
265
|
+
taxonomic publication using automated methods. New data is published almost
|
266
|
+
daily (see latest treatments). The map below shows the geographic distribution
|
267
|
+
of material citations provided to GBIF by Plazi, which gives you a sense of
|
268
|
+
the size of the dataset. By any metric Plazi represents a...","date_published":"2021-10-25T11:10:00Z","date_modified":"2021-10-28T16:08:18Z","date_indexed":"1970-01-01T00:00:00+00:00","authors":[{"url":null,"name":"Roderic
|
269
|
+
Page"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
|
270
|
+
both;\"><a href=\"https://1.bp.blogspot.com/-oiqSkA53FI4/YXaRAUpRgFI/AAAAAAAAgzY/PbPEOh_VlhIE0aaqqhVQPLAOE5pRULwJACLcBGAsYHQ/s240/Rf7UoXTw_400x400.jpg\"
|
271
|
+
style=\"display: block; padding: 1em 0; text-align: center; clear: right;
|
272
|
+
float: right;\"><img alt=\"\" border=\"0\" width=\"128\" data-original-height=\"240\"
|
273
|
+
data-original-width=\"240\" src=\"https://1.bp.blogspot.com/-oiqSkA53FI4/YXaRAUpRgFI/AAAAAAAAgzY/PbPEOh_VlhIE0aaqqhVQPLAOE5pRULwJACLcBGAsYHQ/s200/Rf7UoXTw_400x400.jpg\"/></a></div><p>The
|
274
|
+
<a href=\"http://plazi.org\">Plazi</a> project has become one of the major
|
275
|
+
contributors to GBIF with some 36,000 datasets yielding some 500,000 occurrences
|
276
|
+
(see <a href=\"https://www.gbif.org/publisher/7ce8aef0-9e92-11dc-8738-b8a03c50a862\">Plazi''s
|
277
|
+
GBIF page</a> for details). These occurrences are extracted from taxonomic
|
278
|
+
publication using automated methods. New data is published almost daily (see
|
279
|
+
<a href=\"https://tb.plazi.org/GgServer/static/newToday.html\">latest treatments</a>).
|
280
|
+
The map below shows the geographic distribution of material citations provided
|
281
|
+
to GBIF by Plazi, which gives you a sense of the size of the dataset.</p>\n\n<div
|
282
|
+
class=\"separator\" style=\"clear: both;\"><a href=\"https://1.bp.blogspot.com/-DCJ4HR8eej8/YXaRQnz22bI/AAAAAAAAgz4/AgRcree6jVgjtQL2ch7IXgtb_Xtx7fkngCPcBGAYYCw/s1030/Screenshot%2B2021-10-24%2Bat%2B20.43.23.png\"
|
283
|
+
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
|
284
|
+
border=\"0\" width=\"400\" data-original-height=\"514\" data-original-width=\"1030\"
|
285
|
+
src=\"https://1.bp.blogspot.com/-DCJ4HR8eej8/YXaRQnz22bI/AAAAAAAAgz4/AgRcree6jVgjtQL2ch7IXgtb_Xtx7fkngCPcBGAYYCw/s400/Screenshot%2B2021-10-24%2Bat%2B20.43.23.png\"/></a></div>\n\n<p>By
|
286
|
+
any metric Plazi represents a considerable achievement. But often when I browse
|
287
|
+
individual records on Plazi I find records that seem clearly incorrect. Text
|
288
|
+
mining the literature is a challenging problem, but at the moment Plazi seems
|
289
|
+
something of a \"black box\". PDFs go in, the content is mined, and data comes
|
290
|
+
up to be displayed on the Plazi web site and uploaded to GBIF. Nowhere does
|
291
|
+
there seem to be an evaluation of how accurate this text mining actually is.
|
292
|
+
Anecdotally it seems to work well in some cases, but in others it produces
|
293
|
+
what can only be described as bogus records.</p>\n\n<h2>Finding errors</h2>\n\n<p>A
|
294
|
+
treatment in Plazi is a block of text (and sometimes illustrations) that refers
|
295
|
+
to a single taxon. Often that text will include a description of the taxon,
|
296
|
+
and list one or more specimens that have been examined. These lists of specimens
|
297
|
+
(\"material citations\") are one of the key bits of information that Plaza
|
298
|
+
extracts from a treatment as these citations get fed into GBIF as occurrences.</p>\n\n<p>To
|
299
|
+
help explore treatments I''ve constructed a simple web site that takes the
|
300
|
+
Plazi identifier for a treatment and displays that treatment with the material
|
301
|
+
citations highlighted. For example, for the Plazi treatment <a href=\"https://tb.plazi.org/GgServer/html/03B5A943FFBB6F02FE27EC94FABEEAE7\">03B5A943FFBB6F02FE27EC94FABEEAE7</a>
|
302
|
+
you can view the marked up version at <a href=\"https://plazi-tester.herokuapp.com/?uri=622F7788-F0A4-449D-814A-5B49CD20B228\">https://plazi-tester.herokuapp.com/?uri=622F7788-F0A4-449D-814A-5B49CD20B228</a>.
|
303
|
+
Below is an example of a material citation with its component parts tagged:</p>\n\n<div
|
304
|
+
class=\"separator\" style=\"clear: both;\"><a href=\"https://1.bp.blogspot.com/-NIGuo9BggdA/YXaRQQrv0QI/AAAAAAAAgz4/SZDcA1jZSN47JMRTWDwSMRpHUShrCeOdgCPcBGAYYCw/s693/Screenshot%2B2021-10-24%2Bat%2B20.59.56.png\"
|
305
|
+
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
|
306
|
+
border=\"0\" width=\"400\" data-original-height=\"94\" data-original-width=\"693\"
|
307
|
+
src=\"https://1.bp.blogspot.com/-NIGuo9BggdA/YXaRQQrv0QI/AAAAAAAAgz4/SZDcA1jZSN47JMRTWDwSMRpHUShrCeOdgCPcBGAYYCw/s400/Screenshot%2B2021-10-24%2Bat%2B20.59.56.png\"/></a></div>\n\n<p>This
|
308
|
+
is an example where Plazi has successfully parsed the specimen. But I keep
|
309
|
+
coming across cases where specimens have not been parsed correctly, resulting
|
310
|
+
in issues such as single specimens being split into multiple records (e.g., <a
|
311
|
+
href=\"https://plazi-tester.herokuapp.com/?uri=5244B05EFFC8E20F7BC32056C178F496\">https://plazi-tester.herokuapp.com/?uri=5244B05EFFC8E20F7BC32056C178F496</a>),
|
312
|
+
geographical coordinates being misinterpreted (e.g., <a href=\"https://plazi-tester.herokuapp.com/?uri=0D228E6AFFC2FFEFFF4DE8118C4EE6B9\">https://plazi-tester.herokuapp.com/?uri=0D228E6AFFC2FFEFFF4DE8118C4EE6B9</a>),
|
313
|
+
or collector''s initials being confused with codes for natural history collections
|
314
|
+
(e.g., <a href=\"https://plazi-tester.herokuapp.com/?uri=252C87918B362C05FF20F8C5BFCB3D4E\">https://plazi-tester.herokuapp.com/?uri=252C87918B362C05FF20F8C5BFCB3D4E</a>).</p>\n\n<p>Parsing
|
315
|
+
specimens is a hard problem so it''s not unexpected to find errors. But they
|
316
|
+
do seem common enough to be easily found, which raises the question of just
|
317
|
+
what percentage of these material citations are correct? How much of the
|
318
|
+
data Plazi feeds to GBIF is correct? How would we know?</p>\n\n<h2>Systemic
|
319
|
+
problems</h2>\n\n<p>Some of the errors I''ve found concern the interpretation
|
320
|
+
of the parsed data. For example, it is striking that despite including marine
|
321
|
+
taxa <b>no</b> Plazi record has a value for depth below sea level (see <a
|
322
|
+
href=\"https://www.gbif.org/occurrence/map?depth=0,9999&publishing_org=7ce8aef0-9e92-11dc-8738-b8a03c50a862&advanced=1\">GBIF
|
323
|
+
search on depth range 0-9999 for Plazi</a>). But <a href=\"https://www.gbif.org/occurrence/map?elevation=0,9999&publishing_org=7ce8aef0-9e92-11dc-8738-b8a03c50a862&advanced=1\">many
|
324
|
+
records do have an elevation</a>, including records from marine environments.
|
325
|
+
Any record that has a depth value is interpreted by Plazi as being elevation,
|
326
|
+
so we have aerial crustacea and fish.</p>\n\n<h3>Map of Plazi records with
|
327
|
+
depth 0-9999m</h3>\n<div class=\"separator\" style=\"clear: both;\"><a href=\"https://1.bp.blogspot.com/-GD4pPtPCxVc/YXaRQ9bdn1I/AAAAAAAAgz8/A9YsypSvHfwWKAjDxSdeFVUkou88LGItACPcBGAYYCw/s673/Screenshot%2B2021-10-25%2Bat%2B12.03.51.png\"
|
328
|
+
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
|
329
|
+
border=\"0\" width=\"400\" data-original-height=\"258\" data-original-width=\"673\"
|
330
|
+
src=\"https://1.bp.blogspot.com/-GD4pPtPCxVc/YXaRQ9bdn1I/AAAAAAAAgz8/A9YsypSvHfwWKAjDxSdeFVUkou88LGItACPcBGAYYCw/s320/Screenshot%2B2021-10-25%2Bat%2B12.03.51.png\"/></a></div>\n\n<h3>Map
|
331
|
+
of Plazi records with elevation 0-9999m </h3>\n<div class=\"separator\" style=\"clear:
|
332
|
+
both;\"><a href=\"https://1.bp.blogspot.com/-BReSHtXTOkA/YXaRRKFW7dI/AAAAAAAAg0A/-FcBkMwyswIp0siWGVX3MNMANs7UkZFtwCPcBGAYYCw/s675/Screenshot%2B2021-10-25%2Bat%2B12.04.06.png\"
|
333
|
+
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
|
334
|
+
border=\"0\" width=\"400\" data-original-height=\"256\" data-original-width=\"675\"
|
335
|
+
src=\"https://1.bp.blogspot.com/-BReSHtXTOkA/YXaRRKFW7dI/AAAAAAAAg0A/-FcBkMwyswIp0siWGVX3MNMANs7UkZFtwCPcBGAYYCw/s320/Screenshot%2B2021-10-25%2Bat%2B12.04.06.png\"/></a></div>\n\n<p>Anecdotally
|
336
|
+
I''ve also noticed that Plazi seems to do well on zoological data, especially
|
337
|
+
journals like <i>Zootaxa</i>, but it often struggles with botanical specimens.
|
338
|
+
Botanists tend to cite specimens rather differently to zoologists (botanists
|
339
|
+
emphasise collector numbers rather than specimen codes). Hence data quality
|
340
|
+
in Plazi is likely to be taxonomically biased.</p>\n\n<p>Plazi is <a href=\"https://github.com/plazi/community/issues\">using
|
341
|
+
GitHub to track issues with treatments</a> so feedback on erroneous records
|
342
|
+
is possible, but this seems inadequate to the task. There are tens of thousands
|
343
|
+
of data sets, with more being released daily, and hundreds of thousands of
|
344
|
+
occurrences, and relying on GitHub issues devolves the responsibility for
|
345
|
+
error checking onto the data users. I don''t have a measure of how many records
|
346
|
+
in Plazi have problems, but I suspect it is a significant fraction
|
347
|
+
because for any given day''s output I can typically find errors.</p>\n\n<h2>What
|
348
|
+
to do?</h2>\n\n<p>Faced with a process that generates noisy data there are
|
349
|
+
several things we could do:</p>\n\n<ol>\n<li>Have tools to detect and flag
|
350
|
+
errors made in generating the data.</li>\n<li>Have the data generator give
|
351
|
+
estimates of the confidence of its results.</li>\n<li>Improve the data generator.</li>\n</ol>\n\n<p>I
|
352
|
+
think a comparison with the problem of parsing bibliographic references might
|
353
|
+
be instructive here. There is a long history of people developing tools to
|
354
|
+
parse references (<a href=\"https://iphylo.blogspot.com/2021/05/finding-citations-of-specimens.html\">I''ve
|
355
|
+
even had a go</a>). State-of-the-art tools such as <a href=\"https://anystyle.io\">AnyStyle</a>
|
356
|
+
feature machine learning, and are tested against <a href=\"https://github.com/inukshuk/anystyle/blob/master/res/parser/core.xml\">human
|
357
|
+
curated datasets</a> of tagged bibliographic records. This means we can evaluate
|
358
|
+
the performance of a method (how well does it retrieve the same results as
|
359
|
+
human experts?) and also improve the method by expanding the corpus of training
|
360
|
+
data. Some of these tools can provide measures of how confident they are
|
361
|
+
when classifying a string as, say, a person''s name, which means we could
|
362
|
+
flag potential issues for anyone wanting to use that record.</p>\n\n<p>We
|
363
|
+
don''t have equivalent tools for parsing specimens in the literature, and
|
364
|
+
hence have no easy way to quantify how good existing methods are, nor do we
|
365
|
+
have a public corpus of material citations that we can use as training data.
|
366
|
+
I <a href=\"https://iphylo.blogspot.com/2021/05/finding-citations-of-specimens.html\">blogged
|
367
|
+
about this</a> a few months ago and was considering using Plazi as a source
|
368
|
+
of marked up specimen data to use for training. However based on what I''ve
|
369
|
+
looked at so far Plazi''s data would need to be carefully scrutinised before
|
370
|
+
it could be used as training data.</p>\n\n<p>Going forward, I think it would
|
371
|
+
be desirable to have a set of records that can be used to benchmark specimen
|
372
|
+
parsers, and ideally have the parsers themselves available as web services
|
373
|
+
so that anyone can evaluate them. Even better would be a way to contribute
|
374
|
+
to the training data so that these tools improve over time.</p>\n\n<p>Plazi''s
|
375
|
+
data extraction tools are mostly desktop-based, that is, you need to download
|
376
|
+
software to use their methods. However, there are experimental web services
|
377
|
+
available as well. I''ve created a simple wrapper around the material citation
|
378
|
+
parser; you can try it at <a href=\"https://plazi-tester.herokuapp.com/parser.php\">https://plazi-tester.herokuapp.com/parser.php</a>.
|
379
|
+
It takes a single material citation and returns a version with elements such
|
380
|
+
as specimen code and collector name tagged in different colours.</p>\n\n<h2>Summary</h2>\n\n<p>Text
|
381
|
+
mining the taxonomic literature is clearly a gold mine of data, but at the
|
382
|
+
same time it is potentially fraught as we try and extract structured data
|
383
|
+
from semi-structured text. Plazi has demonstrated that it is possible to extract
|
384
|
+
a lot of data from the literature, but at the same time the quality of that
|
385
|
+
data seems highly variable. Even minor issues in parsing text can have big
|
386
|
+
implications for data quality (e.g., marine organisms apparently living above
|
387
|
+
sea level). Historically in biodiversity informatics we have favoured data
|
388
|
+
quantity over data quality. Quantity has an obvious metric, and has milestones
|
389
|
+
we can celebrate (e.g., <a href=\"GBIF at 1 billion - what''s next?\">one
|
390
|
+
billion specimens</a>). There aren''t really any equivalent metrics for data
|
391
|
+
quality.</p>\n\n<p>Adding new types of data can sometimes initially result
|
392
|
+
in a new set of quality issues (e.g., <a href=\"https://iphylo.blogspot.com/2019/12/gbif-metagenomics-and-metacrap.html\">GBIF
|
393
|
+
metagenomics and metacrap</a>) that take time to resolve. In the case of Plazi,
|
394
|
+
I think it would be worthwhile to quantify just how many records have errors,
|
395
|
+
and develop benchmarks that we can use to test methods for extracting specimen
|
396
|
+
data from text. If we don''t do this then there will remain uncertainty as
|
397
|
+
to how much trust we can place in data mined from the taxonomic literature.</p>\n\n<h2>Update</h2>\n\nPlazi
|
398
|
+
has responded, see <a href=\"http://plazi.org/posts/2021/10/liberation-first-step-toward-quality/\">Liberating
|
399
|
+
material citations as a first step to more better data</a>. My reading of
|
400
|
+
their response is that it essentially just reiterates Plazi''s approach and
|
401
|
+
doesn''t tackle the underlying issue: their method for extracting material
|
402
|
+
citations is error prone, and many of those errors end up in GBIF.","tags":["data
|
403
|
+
quality","parsing","Plazi","specimen","text mining"],"language":"en","references":[]},{"id":"https://doi.org/10.59350/j77nc-e8x98","uuid":"c6b101f4-bfbc-4d01-921d-805c43c85757","url":"https://iphylo.blogspot.com/2022/08/linking-taxonomic-names-to-literature.html","title":"Linking
|
404
|
+
taxonomic names to the literature","summary":"Just some thoughts as I work
|
405
|
+
through some datasets linking taxonomic names to the literature. In the diagram
|
406
|
+
above I''ve tried to capture the different situations I encounter. Much of
|
407
|
+
the work I''ve done on this has focussed on case 1 in the diagram: I want
|
408
|
+
to link a taxonomic name to an identifier for the work in which that name
|
409
|
+
was published. In practice this means linking names to DOIs. This has the
|
410
|
+
advantage of linking to a citable identifier, raising questions such as whether
|
411
|
+
citations...","date_published":"2022-08-22T17:19:00Z","date_modified":"2022-08-22T17:19:08Z","date_indexed":"1909-06-16T08:21:41+00:00","authors":[{"url":null,"name":"Roderic
|
412
|
+
Page"}],"image":null,"content_html":"Just some thoughts as I work through
|
413
|
+
some datasets linking taxonomic names to the literature.\n\n<div class=\"separator\"
|
414
|
+
style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5ZkbnumEWKZpf0isei_hUJucFlOOK08-SKnJknD8B4qhAX36u-vMQRnZdhRJK5rPb7HNYcnqB7qE4agbeStqyzMkWHrzUj14gPkz2ohmbOVg8P_nHo0hM94PD1wH3SPJsaLAumN8vih3ch9pjH2RaVWZBLwwhGhNu0FS1m5z6j5xt2NeZ4w/s2140/linking%20to%20names144.jpg\"
|
415
|
+
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
|
416
|
+
border=\"0\" height=\"600\" data-original-height=\"2140\" data-original-width=\"1604\"
|
417
|
+
src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5ZkbnumEWKZpf0isei_hUJucFlOOK08-SKnJknD8B4qhAX36u-vMQRnZdhRJK5rPb7HNYcnqB7qE4agbeStqyzMkWHrzUj14gPkz2ohmbOVg8P_nHo0hM94PD1wH3SPJsaLAumN8vih3ch9pjH2RaVWZBLwwhGhNu0FS1m5z6j5xt2NeZ4w/s600/linking%20to%20names144.jpg\"/></a></div>\n\n<p>In
|
418
|
+
the diagram above I''ve tried to capture the different situations I encounter.
|
419
|
+
Much of the work I''ve done on this has focussed on case 1 in the diagram:
|
420
|
+
I want to link a taxonomic name to an identifier for the work in which that
|
421
|
+
name was published. In practice this means linking names to DOIs. This has
|
422
|
+
the advantage of linking to a citable identifier, raising questions such
|
423
|
+
as whether citations of taxonomic papers by taxonomic databases could become
|
424
|
+
part of a <a href=\"https://iphylo.blogspot.com/2022/08/papers-citing-data-that-cite-papers.html\">taxonomist''s
|
425
|
+
Google Scholar profile</a>.</p>\n\n<p>In many taxonomic databases full work-level
|
426
|
+
citations are not the norm, instead taxonomists cite one or more pages within
|
427
|
+
a work that are relevant to a taxonomic name. These \"microcitations\" (what
|
428
|
+
the U.S. legal profession refer to as \"point citations\" or \"pincites\", see
|
429
|
+
<a href=\"https://rasmussen.libanswers.com/faq/283203\">What are pincites,
|
430
|
+
pinpoints, or jump legal references?</a>) require some work to map to the
|
431
|
+
work itself (which is typically the thing that has a citable identifier such
|
432
|
+
as a DOI).</p>\n\n<p>Microcitations (case 2 in the diagram above) can be quite
|
433
|
+
complex. Some might simply mention a single page, but others might list a
|
434
|
+
series of (not necessarily contiguous) pages, as well as figures, plates etc.
|
435
|
+
Converting these to citable identifiers can be tricky, especially as in most
|
436
|
+
cases we don''t have page-level identifiers. The Biodiversity Heritage Library
|
437
|
+
(BHL) does have URLs for each scanned page, and we have a standard for referring
|
438
|
+
to pages in a PDF (<code>page=<pageNum></code>, see <a href=\"https://datatracker.ietf.org/doc/html/rfc8118\">RFC
|
439
|
+
8118</a>). But how do we refer to a set of pages? Do we pick the first page?
|
440
|
+
Do we try and represent a set of pages, and if so, how?</p>\n\n<p>Another
|
441
|
+
issue with page-level identifiers is that not everything on a given page may
|
442
|
+
be relevant to the taxonomic name. In case 2 above I''ve shaded in the parts
|
443
|
+
of the pages and figure that refer to the taxonomic name. An example where
|
444
|
+
this can be problematic is the recent test case I created for BHL where a
|
445
|
+
page image was included for the taxonomic name <a href=\"https://www.gbif.org/species/195763322\"><i>Aphrophora
|
446
|
+
impressa</i></a>. The image includes the species description and an illustration,
|
447
|
+
as well as text that relates to other species.</p>\n\n<div class=\"separator\"
|
448
|
+
style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1SAv1XiVBPHXeMLPfdsnTh4Sj4m0AQyqjXM0faXyvbNhBXFDl7SawjFIkMo3qz4RQhDEyZXkKAh5nF4gtfHbVXA6cK3NJ46CuIWECC6HmJKtTjoZ0M3QQXvLY_X-2-RecjBqEy68M0cEr0ph3l6KY51kA9BvGt9d4id314P71PBitpWMATg/s3467/https---www.biodiversitylibrary.org-pageimage-29138463.jpeg\"
|
449
|
+
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
|
450
|
+
border=\"0\" height=\"400\" data-original-height=\"3467\" data-original-width=\"2106\"
|
451
|
+
src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1SAv1XiVBPHXeMLPfdsnTh4Sj4m0AQyqjXM0faXyvbNhBXFDl7SawjFIkMo3qz4RQhDEyZXkKAh5nF4gtfHbVXA6cK3NJ46CuIWECC6HmJKtTjoZ0M3QQXvLY_X-2-RecjBqEy68M0cEr0ph3l6KY51kA9BvGt9d4id314P71PBitpWMATg/s400/https---www.biodiversitylibrary.org-pageimage-29138463.jpeg\"/></a></div>\n\n<p>Given
|
452
|
+
that not everything on a page need be relevant, we could extract just the
|
453
|
+
relevant blocks of text and illustrations (e.g., paragraphs of text, panels
|
454
|
+
within a figure, etc.) and treat that set of elements as the thing to cite.
|
455
|
+
This is, of course, what <a href=\"http://plazi.org\">Plazi</a> are doing.
|
456
|
+
The set of extracted blocks is glued together as a \"treatment\", assigned
|
457
|
+
an identifier (often a DOI), and treated as a citable unit. It would be interesting
|
458
|
+
to see to what extent these treatments are actually cited, for example, do
|
459
|
+
subsequent revisions that cite works that include treatments cite those treatments,
|
460
|
+
or just the work itself? Put another way, are we creating <a href=\"https://iphylo.blogspot.com/2012/09/decoding-nature-encode-ipad-app-omg-it.html\">\"threads\"</a>
|
461
|
+
between taxonomic revisions?</p>\n\n<p>One reason for these notes is that
|
462
|
+
I''m exploring uploading taxonomic name - literature links to <a href=\"https://www.checklistbank.org\">ChecklistBank</a>
|
463
|
+
and case 1 above is easy, as is case 3 (if we have treatment-level identifiers).
|
464
|
+
But case 2 is problematic because we are linking to a set of things that may
|
465
|
+
not have an identifier, which means a decision has to be made about which
|
466
|
+
page to link to, and how to refer to that page.</p>","tags":[],"language":"en","references":[]},{"id":"https://doi.org/10.59350/ymc6x-rx659","uuid":"0807f515-f31d-4e2c-9e6f-78c3a9668b9d","url":"https://iphylo.blogspot.com/2022/09/dna-barcoding-as-intergenerational.html","title":"DNA
|
467
|
+
barcoding as intergenerational transfer of taxonomic knowledge","summary":"I
|
468
|
+
tweeted about this but want to bookmark it for later as well. The paper “A
|
469
|
+
molecular-based identification resource for the arthropods of Finland” doi:10.1111/1755-0998.13510
|
470
|
+
contains the following: …the annotated barcode records assembled by FinBOL
|
471
|
+
participants represent a tremendous intergenerational transfer of taxonomic
|
472
|
+
knowledge … the time contributed by current taxonomists in identifying and
|
473
|
+
contributing voucher specimens represents a great gift to future generations
|
474
|
+
who will benefit...","date_published":"2022-09-14T10:12:00Z","date_modified":"2022-09-29T13:57:30Z","date_indexed":"1909-06-16T11:02:21+00:00","authors":[{"url":null,"name":"Roderic
|
475
|
+
Page"}],"image":null,"content_html":"<p>I <a href=\"https://twitter.com/rdmpage/status/1569738844416638981?s=21&t=9OVXuoUEwZtQt-Ldzlutfw\">tweeted
|
476
|
+
about this</a> but want to bookmark it for later as well. The paper “A molecular-based
|
477
|
+
identification resource for the arthropods of Finland” <a href=\"https://doi.org/10.1111/1755-0998.13510\">doi:10.1111/1755-0998.13510</a>
|
478
|
+
contains the following:</p>\n<blockquote>\n<p>…the annotated barcode records
|
479
|
+
assembled by FinBOL participants represent a tremendous <mark>intergenerational
|
480
|
+
transfer of taxonomic knowledge</mark> … the time contributed by current taxonomists
|
481
|
+
in identifying and contributing voucher specimens represents a great gift
|
482
|
+
to future generations who will benefit from their expertise when they are
|
483
|
+
no longer able to process new material.</p>\n</blockquote>\n<p>I think this
|
484
|
+
is a very clever way to characterise the project. In an age of machine learning
|
485
|
+
this may be the commonest way to share knowledge, namely as expert-labelled training
|
486
|
+
data used to build tools for others. Of course, this means the expertise itself
|
487
|
+
may be lost, which has implications for updating the models if the data isn’t
|
488
|
+
complete. But it speaks to Charles Godfrey’s theme of <a href=\"https://biostor.org/reference/250587\">“Taxonomy
|
489
|
+
as information science”</a>.</p>\n<p>Note that the knowledge is also transformed
|
490
|
+
in the sense that the underlying expertise of interpreting morphology, ecology,
|
491
|
+
behaviour, genomics, and the past literature is not what is being passed on.
|
492
|
+
Instead it is probabilities that a DNA sequence belongs to a particular taxon.</p>\n<p>This
|
493
|
+
feels different to, say, iNaturalist, where there is a machine learning
|
494
|
+
model to identify images. In that case, the model is built on something the
|
495
|
+
community itself has created, and continues to create. Yes, the underlying
|
496
|
+
idea is the same: “experts” have labelled the data, a model is trained, the
|
497
|
+
model is used. But the benefits of the <a href=\"https://www.inaturalist.org\">iNaturalist</a>
|
498
|
+
model are immediately applicable to the people whose data built the model.
|
499
|
+
In the case of barcoding, because the technology itself is still not in the
|
500
|
+
hands of many (relative to, say, digital imaging), the benefits are perhaps
|
501
|
+
less tangible. Obviously researchers working with environmental DNA will find
|
502
|
+
it very useful, but broader impact may await the arrival of citizen science
|
503
|
+
DNA barcoding.</p>\n<p>The other consideration is whether the barcoding helps
|
504
|
+
taxonomists. Is it to be used to help prioritise future work (“we are getting
|
505
|
+
lots of unknown sequences in these taxa, let’s do some taxonomy there”), or
|
506
|
+
is it simply capturing the knowledge of a generation that won’t be replaced:</p>\n<blockquote>\n<p>The
|
507
|
+
need to capture such knowledge is essential because there are, for example,
|
508
|
+
no young Finnish taxonomists who can critically identify species in many key
|
509
|
+
groups of arthropods (e.g., aphids, chewing lice, chalcid wasps, gall midges,
|
510
|
+
most mite lineages).</p>\n</blockquote>\n<p>The cycle of collect data, test
|
511
|
+
and refine model, collect more data, rinse and repeat that happens with iNaturalist
|
512
|
+
creates a feedback loop. It’s not clear that a similar cycle exists for DNA
|
513
|
+
barcoding.</p>\n<blockquote>\n<p>Written with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":[],"language":"en","references":[]},{"id":"https://doi.org/10.59350/d3dc0-7an69","uuid":"545c177f-cea5-4b79-b554-3ccae9c789d7","url":"https://iphylo.blogspot.com/2021/10/reflections-on-macroscope-tool-for-21st.html","title":"Reflections
|
514
|
+
on \"The Macroscope\" - a tool for the 21st Century?","summary":"This is a
|
515
|
+
guest post by Tony Rees. It would be difficult to encounter a scientist, or
|
516
|
+
anyone interested in science, who is not familiar with the microscope, a tool
|
517
|
+
for making objects visible that are otherwise too small to be properly seen
|
518
|
+
by the unaided eye, or to reveal otherwise invisible fine detail in larger
|
519
|
+
objects. A select few with a particular interest in microscopy may also have
|
520
|
+
encountered the Wild-Leica \"Macroscope\", a specialised type of benchtop
|
521
|
+
microscope optimised for...","date_published":"2021-10-07T12:38:00Z","date_modified":"2021-10-08T10:26:22Z","date_indexed":"1909-06-16T10:02:25+00:00","authors":[{"url":null,"name":"Roderic
|
522
|
+
Page"}],"image":null,"content_html":"<p><img src=\"https://lh3.googleusercontent.com/-A99btr6ERMs/Vl1Wvjp2OtI/AAAAAAAAEFI/7bKdRjNG5w0/ytNkVT2U.jpg?imgmax=800\"
|
523
|
+
alt=\"YtNkVT2U\" title=\"ytNkVT2U.jpg\" border=\"0\" width=\"128\" height=\"128\"
|
524
|
+
style=\"float:right;\" /> This is a guest post by <a href=\"https://about.me/TonyRees\">Tony
|
525
|
+
Rees</a>.</p>\n\n<p>It would be difficult to encounter a scientist, or anyone
|
526
|
+
interested in science, who is not familiar with the microscope, a tool for
|
527
|
+
making objects visible that are otherwise too small to be properly seen by
|
528
|
+
the unaided eye, or to reveal otherwise invisible fine detail in larger objects.
|
529
|
+
A select few with a particular interest in microscopy may also have encountered
|
530
|
+
the Wild-Leica \"Macroscope\", a specialised type of benchtop microscope optimised
|
531
|
+
for low-power macro-photography. However in this overview I discuss the \"Macroscope\"
|
532
|
+
in a different sense, which is that of the antithesis to the microscope: namely
|
533
|
+
a method for visualizing subjects too large to be encompassed by a single
|
534
|
+
field of vision, such as the Earth or some subset of its phenomena (the biosphere,
|
535
|
+
for example), or conceptually, the universe.</p>\n\n<p><div class=\"separator\"
|
536
|
+
style=\"clear: both;\"><a href=\"https://1.bp.blogspot.com/-xXxH9aJn1tY/YV7vnKGQWPI/AAAAAAAAgyQ/7CSJJ663Ry4IXtav54nLcLiI0kda5_L7ACLcBGAsYHQ/s500/2020045672.jpg\"
|
537
|
+
style=\"display: block; padding: 1em 0; text-align: center; clear: right;
|
538
|
+
float: right;\"><img alt=\"\" border=\"0\" height=\"320\" data-original-height=\"500\"
|
539
|
+
data-original-width=\"303\" src=\"https://1.bp.blogspot.com/-xXxH9aJn1tY/YV7vnKGQWPI/AAAAAAAAgyQ/7CSJJ663Ry4IXtav54nLcLiI0kda5_L7ACLcBGAsYHQ/s320/2020045672.jpg\"/></a></div>My
|
540
|
+
introduction to the term was via addresses given by Jesse Ausubel in the formative
|
541
|
+
years of the 2001-2010 <a href=\"http://www.coml.org\">Census of Marine Life</a>,
|
542
|
+
for which he was a key proponent. In Ausubel''s view, the Census would perform
|
543
|
+
the function of a macroscope, permitting a view of everything that lives in
|
544
|
+
the global ocean (or at least, that subset which could realistically be sampled
|
545
|
+
in the time frame available) as opposed to more limited subsets available
|
546
|
+
via previous data collection efforts. My view (which could, of course, be
|
547
|
+
wrong) was that his thinking had been informed by a work entitled \"Le macroscope,
|
548
|
+
vers une vision globale\" published in 1975 by the French thinker Joël de
|
549
|
+
Rosnay, who had expressed such a concept as being globally applicable in many
|
550
|
+
fields, including the physical and natural worlds but also extending to human
|
551
|
+
society, the growth of cities, and more. Yet again, some ecologists may also
|
552
|
+
have encountered the term, sometimes in the guise of \"Odum''s macroscope\",
|
553
|
+
as an approach for obtaining \"big picture\" analyses of macroecological processes
|
554
|
+
suitable for mathematical modelling, typically by elimination of fine detail
|
555
|
+
so that only the larger patterns remain, as initially advocated by Howard
|
556
|
+
T. Odum in his 1971 book \"Environment, Power, and Society\".</p>\n\n<p>From
|
557
|
+
the standpoint of the 21st century, it seems that we are closer to achieving
|
558
|
+
a \"macroscope\" (or possibly, multiple such tools) than ever before, based
|
559
|
+
on the availability of existing and continuing new data streams, improved
|
560
|
+
technology for data assembly and storage, and advanced ways to query and combine
|
561
|
+
these large streams of data to produce new visualizations, data products,
|
562
|
+
and analytical findings. I devote the remainder of this article to examples
|
563
|
+
where either particular workers have employed \"macroscope\" terminology to
|
564
|
+
describe their activities, or where potentially equivalent actions are taking
|
565
|
+
place without the explicit \"macroscope\" association, but are equally worthy
|
566
|
+
of consideration. To save space here, references cited here (most or all)
|
567
|
+
can be found via a Wikipedia article entitled \"<a href=\"https://en.wikipedia.org/wiki/Macroscope_(science_concept)\">Macroscope
|
568
|
+
(science concept)</a>\" that I authored on the subject around a year ago,
|
569
|
+
and have continued to add to on occasion as new thoughts or information come
|
570
|
+
to hand (see <a href=\"https://en.wikipedia.org/w/index.php?title=Macroscope_(science_concept)&offset=&limit=500&action=history\">edit
|
571
|
+
history for the article</a>).</p>\n\n<p>First, one can ask, what constitutes
|
572
|
+
a macroscope, in the present context? In the Wikipedia article I point to
|
573
|
+
a book \"Big Data - Related Technologies, Challenges and Future Prospects\"
|
574
|
+
by Chen <em>et al.</em> (2014) (<a href=\"https://doi.org/10.1007/978-3-319-06245-7\">doi:10.1007/978-3-319-06245-7</a>),
|
575
|
+
in which the \"value chain of big data\" is characterised as divisible into
|
576
|
+
four phases, namely data generation, data acquisition (aka data assembly),
|
577
|
+
data storage, and data analysis. To my mind, data generation (which others
|
578
|
+
may term acquisition, differently from the usage by Chen <em>et al.</em>)
|
579
|
+
is obviously the first step, but does not in itself constitute the macroscope,
|
580
|
+
except in rare cases - such as Landsat imagery, perhaps - where on its own,
|
581
|
+
a single co-ordinated data stream is sufficient to meet the need for a particular
|
582
|
+
type of \"global view\". A variant of this might be a coordinated data collection
|
583
|
+
program - such as that of the ten year Census of Marine Life - which might
|
584
|
+
produce the data required for the desired global view; but again, in reality,
|
585
|
+
such data are collected in a series of discrete chunks, in many and often
|
586
|
+
disparate data formats, and must be \"wrangled\" into a more coherent whole
|
587
|
+
before any meaningful \"macroscope\" functionality becomes available.</p>\n\n<p>Here
|
588
|
+
we come to what, in my view, constitutes the heart of the \"macroscope\":
|
589
|
+
an intelligently organized (i.e. indexable and searchable), coherent data
|
590
|
+
store or repository (where \"data\" may include imagery and other non numeric
|
591
|
+
data forms, but much else besides). Taking the Census of Marine Life example,
|
592
|
+
the data repository for that project''s data (plus other available sources
|
593
|
+
as inputs) is the <a href=\"https://obis.org\">Ocean Biodiversity Information
|
594
|
+
System</a> or OBIS (previously the Ocean Biogeographic Information System),
|
595
|
+
which according to this view forms the \"macroscope\" for which the Census
|
596
|
+
data is a feed. (For non habitat-specific biodiversity data, <a href=\"https://www.gbif.org\">GBIF</a>
|
597
|
+
is an equivalent, and more extensive, operation). Other planetary scale \"macroscopes\",
|
598
|
+
by this definition (which may or may not have an explicit geographic, i.e.
|
599
|
+
spatial, component) would include inventories of biological taxa such as the
|
600
|
+
<a href=\"https://www.catalogueoflife.org\">Catalogue of Life</a> and so on,
|
601
|
+
all the way back to the pioneering compendia published by Linnaeus in the
|
602
|
+
eighteenth century; while for cartography and topographic imagery, the current
|
603
|
+
\"blockbuster\" of <a href=\"http://earth.google.com\">Google Earth</a> and
|
604
|
+
its predecessors also come well into public consciousness.</p>\n\n<p>In the
|
605
|
+
view of some workers and/or operations, both of these phases are precursors
|
606
|
+
to the real \"work\" of the macroscope which is to reveal previously unseen
|
607
|
+
portions of the \"big picture\" by means either of the availability of large,
|
608
|
+
synoptic datasets, or fusion between different data streams to produce novel
|
609
|
+
insights. Companies such as IBM and Microsoft have used phraseology such as:</p>\n\n<blockquote>\"By
|
610
|
+
2022 we will use machine-learning algorithms and software to help us organize
|
611
|
+
information about the physical world, helping bring the vast and complex data
|
612
|
+
gathered by billions of devices within the range of our vision and understanding.
|
613
|
+
We call this a \"macroscope\" – but unlike the microscope to see the very
|
614
|
+
small, or the telescope that can see far away, it is a system of software
|
615
|
+
and algorithms to bring all of Earth''s complex data together to analyze it
|
616
|
+
by space and time for meaning.\" (IBM)</blockquote>\n\n<blockquote>\"As the
|
617
|
+
Earth becomes increasingly instrumented with low-cost, high-bandwidth sensors,
|
618
|
+
we will gain a better understanding of our environment via a virtual, distributed
|
619
|
+
whole-Earth \"macroscope\"... Massive-scale data analytics will enable real-time
|
620
|
+
tracking of disease and targeted responses to potential pandemics. Our virtual
|
621
|
+
\"macroscope\" can now be used on ourselves, as well as on our planet.\" (Microsoft)
|
622
|
+
(references available via the Wikipedia article cited above).</blockquote>\n\n<p>Whether
|
623
|
+
or not the analytical capabilities described here are viewed as being an integral
|
624
|
+
part of the \"macroscope\" concept, or are maybe an add-on, is ultimately
|
625
|
+
a question of semantics and perhaps, personal opinion. Continuing the Census
|
626
|
+
of Marine Life/OBIS example, OBIS offers some (arguably rather basic) visualization
|
627
|
+
and summary tools, but also makes its data available for download to users
|
628
|
+
wishing to analyse it further according to their own particular interests;
|
629
|
+
using OBIS data in this manner, Mark Costello et al. in 2017 were able to
|
630
|
+
demarcate a finite number of data-supported marine biogeographic realms for
|
631
|
+
the first time (Costello et al. 2017: Nature Communications. 8: 1057. <a href=\"https://doi.org/10.1038/s41467-017-01121-2\">doi:10.1038/s41467-017-01121-2</a>),
|
632
|
+
a project which I was able to assist in a small way in an advisory capacity.
|
633
|
+
In a case such as this, perhaps the final function of the macroscope, namely
|
634
|
+
data visualization and analysis, was outsourced to the authors'' own research
|
635
|
+
institution. Similarly at an earlier phase, \"data aggregation\" can also
|
636
|
+
be virtual rather than actual, i.e. avoiding using a single physical system
|
637
|
+
to hold all the data, enabled by open web mapping standards WMS (web map service)
|
638
|
+
and WFS (web feature service) to access a set of distributed data stores,
|
639
|
+
e.g. as implemented on the portal for the <a href=\"https://portal.aodn.org.au/\">Australian
|
640
|
+
Ocean Data Network</a>.</p>\n\n<p>So, as we pass through the third decade
|
641
|
+
of the twenty first century, what developments await us in the \"macroscope\"
|
642
|
+
area\"? In the biodiversity space, one can reasonably presume that the existing
|
643
|
+
\"macroscopic\" data assembly projects such as OBIS and GBIF will continue,
|
644
|
+
and hopefully slowly fill current gaps in their coverage - although in the
|
645
|
+
marine area, strategic new data collection exercises may be required (Census
|
646
|
+
2020, or 2025, anyone?), while (again hopefully), the Catalogue of Life will
|
647
|
+
continue its progress towards a \"complete\" species inventory for the biosphere.
|
648
|
+
The Landsat project, with imagery dating back to 1972, continues with the
|
649
|
+
launch of its latest satellite Landsat 9 just this year (27 September 2021)
|
650
|
+
with a planned mission duration for the next 5 years, so the \"macroscope\"
|
651
|
+
functionality of that project seems set to continue for the medium term at
|
652
|
+
least. Meanwhile the ongoing development of sensor networks, both on land
|
653
|
+
and in the ocean, offers an exciting new method of \"instrumenting the earth\"
|
654
|
+
to obtain much more real time data than has ever been available in the past,
|
655
|
+
offering scope for many more, use case-specific \"macroscopes\" to be constructed
|
656
|
+
that can fuse (e.g.) satellite imagery with much more that is happening at
|
657
|
+
a local level.</p>\n\n<p>So, the \"macroscope\" concept appears to be alive
|
658
|
+
and well, even though the nomenclature can change from time to time (IBM''s
|
659
|
+
\"Macroscope\", foreshadowed in 2017, became the \"IBM Pairs Geoscope\" on
|
660
|
+
implementation, and is now simply the \"Geospatial Analytics component within
|
661
|
+
the IBM Environmental Intelligence Suite\" according to available IBM publicity
|
662
|
+
materials). In reality this illustrates a new dichotomy: even if \"everyone\"
|
663
|
+
in principle has access to huge quantities of publicly available data, maybe
|
664
|
+
only a few well funded entities now have the computational ability to make
|
665
|
+
sense of it, and can charge clients a good fee for their services...</p>\n\n<p>I
|
666
|
+
present this account partly to give a brief picture of \"macroscope\" concepts
|
667
|
+
today and in the past, for those who may be interested, and partly to present
|
668
|
+
a few personal views which would be out of scope in a \"neutral point of view\"
|
669
|
+
article such as is required on Wikipedia; also to see if readers of this blog
|
670
|
+
would like to contribute further to discussion of any of the concepts traversed
|
671
|
+
herein.</p>","tags":["guest post","macroscope"],"language":"en","references":[]},{"id":"https://doi.org/10.59350/gf1dw-n1v47","uuid":"a41163e0-9c9a-41e0-a141-f772663f2f32","url":"https://iphylo.blogspot.com/2023/03/dugald-stuart-page-1936-2022.html","title":"Dugald
|
672
|
+
Stuart Page 1936-2022","summary":"My dad died last weekend. Below is a notice
|
673
|
+
in today''s New Zealand Herald. I''m in New Zealand for his funeral. Don''t
|
674
|
+
really have the words for this right now.","date_published":"2023-03-14T03:00:00Z","date_modified":"2023-03-22T07:25:56Z","date_indexed":"1909-06-16T10:41:55+00:00","authors":[{"url":null,"name":"Roderic
|
675
|
+
Page"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
|
676
|
+
both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZweukxntl7R5jnk3knVFVrqZ5RxC7mPZBV4gKeDIglbFzs2O442nbxqs8t8jV2tLqCU24K6gS32jW-Pe8q3O_5JR1Ms3qW1aQAZ877cKkFfcUydqUba9HsgNlX-zS9Ne92eLxRGS8F-lStTecJw2oalp3u58Yoc0oM7CUin5LKPeFIJ7Rzg/s3454/_DSC5106.jpg\"
|
677
|
+
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
|
678
|
+
border=\"0\" width=\"400\" data-original-height=\"2582\" data-original-width=\"3454\"
|
679
|
+
src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZweukxntl7R5jnk3knVFVrqZ5RxC7mPZBV4gKeDIglbFzs2O442nbxqs8t8jV2tLqCU24K6gS32jW-Pe8q3O_5JR1Ms3qW1aQAZ877cKkFfcUydqUba9HsgNlX-zS9Ne92eLxRGS8F-lStTecJw2oalp3u58Yoc0oM7CUin5LKPeFIJ7Rzg/s400/_DSC5106.jpg\"/></a></div>\n\nMy
|
680
|
+
dad died last weekend. Below is a notice in today''s New Zealand Herald. I''m
|
681
|
+
in New Zealand for his funeral. Don''t really have the words for this right
|
682
|
+
now.\n\n<div class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRUTOFF1VWHCl8dg3FQuaWy5LM7aX8IivdRpTtzgrdQTEymsA5bLTZE3cSQf1WQIP3XrC46JsLScP8BxTK9C5a-B1i51yg8WGSJD0heJVaoDLnerv0lD1o3qloDjqEuuyfX4wagHB5YYBmjWnGeVQvyYVngvDDf9eM6pmMtZ7x94Y4jSVrug/s3640/IMG_2870.jpeg\"
|
683
|
+
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
|
684
|
+
border=\"0\" height=\"320\" data-original-height=\"3640\" data-original-width=\"1391\"
|
685
|
+
src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRUTOFF1VWHCl8dg3FQuaWy5LM7aX8IivdRpTtzgrdQTEymsA5bLTZE3cSQf1WQIP3XrC46JsLScP8BxTK9C5a-B1i51yg8WGSJD0heJVaoDLnerv0lD1o3qloDjqEuuyfX4wagHB5YYBmjWnGeVQvyYVngvDDf9eM6pmMtZ7x94Y4jSVrug/s320/IMG_2870.jpeg\"/></a></div>","tags":[],"language":"en","references":[]},{"id":"https://doi.org/10.59350/cbzgz-p8428","uuid":"a93134aa-8b33-4dc7-8cd4-76cdf64732f4","url":"https://iphylo.blogspot.com/2023/04/library-interfaces-knowledge-graphs-and.html","title":"Library
|
686
|
+
interfaces, knowledge graphs, and Miller columns","summary":"Some quick notes
|
687
|
+
on interface ideas for digital libraries and/or knowledge graphs. Recently
|
688
|
+
there’s been something of an explosion in bibliographic tools to explore the
|
689
|
+
literature. Examples include: Elicit which uses AI to search for and summarise
|
690
|
+
papers _scite which uses AI to do sentiment analysis on citations (does paper
|
691
|
+
A cite paper B favourably or not?) ResearchRabbit which uses lists, networks,
|
692
|
+
and timelines to discover related research Scispace which navigates connections
|
693
|
+
between...","date_published":"2023-04-25T13:01:00Z","date_modified":"2023-04-27T14:51:08Z","date_indexed":"1909-06-16T11:25:14+00:00","authors":[{"url":null,"name":"Roderic
|
694
|
+
Page"}],"image":null,"content_html":"<p>Some quick notes on interface ideas
|
695
|
+
for digital libraries and/or knowledge graphs.</p>\n<p>Recently there’s been
|
696
|
+
something of an explosion in bibliographic tools to explore the literature.
|
697
|
+
Examples include:</p>\n<ul>\n<li><a href=\"https://elicit.org\">Elicit</a>
|
698
|
+
which uses AI to search for and summarise papers</li>\n<li><a href=\"https://scite.ai\">_scite</a>
|
699
|
+
which uses AI to do sentiment analysis on citations (does paper A cite paper
|
700
|
+
B favourably or not?)</li>\n<li><a href=\"https://www.researchrabbit.ai\">ResearchRabbit</a>
|
701
|
+
which uses lists, networks, and timelines to discover related research</li>\n<li><a
|
702
|
+
href=\"https://typeset.io\">Scispace</a> which navigates connections between
|
703
|
+
papers, authors, topics, etc., and provides AI summaries.</li>\n</ul>\n<p>As
|
704
|
+
an aside, I think these (and similar tools) are a great example of how bibliographic
|
705
|
+
data such as abstracts, the citation graph and - to a lesser extent - full
|
706
|
+
text - have become commodities. That is, what was once proprietary information
|
707
|
+
is now free to anyone, which in turn means a whole ecosystem of new tools
|
708
|
+
can emerge. If I was clever I’d be building a <a href=\"https://en.wikipedia.org/wiki/Wardley_map\">Wardley
|
709
|
+
map</a> to explore this. Note that a decade or so ago reference managers like
|
710
|
+
<a href=\"https://www.zotero.org\">Zotero</a> were made possible by publishers
|
711
|
+
exposing basic bibliographic data on their articles. As we move to <a href=\"https://i4oc.org\">open
|
712
|
+
citations</a> we are seeing the next generation of tools.</p>\n<p>Back to
|
713
|
+
my main topic. As usual, rather than focus on what these tools do I’m more
|
714
|
+
interested in how they <strong>look</strong>. I have history here: when the
|
715
|
+
iPad came out I was intrigued by the possibilities it offered for displaying
|
716
|
+
academic articles, as discussed <a href=\"https://iphylo.blogspot.com/2010/08/viewing-scientific-articles-on-ipad.html\">here</a>,
|
717
|
+
<a href=\"https://iphylo.blogspot.com/2010/09/viewing-scientific-articles-on-ipad.html\">here</a>,
|
718
|
+
<a href=\"https://iphylo.blogspot.com/2010/08/viewing-scientific-articles-on-ipad_24.html\">here</a>,
|
719
|
+
<a href=\"https://iphylo.blogspot.com/2010/08/viewing-scientific-articles-on-ipad_3052.html\">here</a>,
|
720
|
+
and <a href=\"https://iphylo.blogspot.com/2010/08/viewing-scientific-articles-on-ipad_31.html\">here</a>.
|
721
|
+
ResearchRabbit looks like this:</p>\n<div style=\"padding:86.91% 0 0 0;position:relative;\"><iframe
|
722
|
+
src=\"https://player.vimeo.com/video/820871442?h=23b05b0dae&badge=0&autopause=0&player_id=0&app_id=58479\"
|
723
|
+
frameborder=\"0\" allow=\"autoplay; fullscreen; picture-in-picture\" allowfullscreen
|
724
|
+
style=\"position:absolute;top:0;left:0;width:100%;height:100%;\" title=\"ResearchRabbit\"></iframe></div><script
|
725
|
+
src=\"https://player.vimeo.com/api/player.js\"></script>\n<p>Scispace’s <a
|
726
|
+
href=\"https://typeset.io/explore/journals/parassitologia-1ieodjwe\">“trace”
|
727
|
+
view</a> looks like this:</p>\n<div style=\"padding:84.55% 0 0 0;position:relative;\"><iframe
|
728
|
+
src=\"https://player.vimeo.com/video/820871348?h=2db7b661ef&badge=0&autopause=0&player_id=0&app_id=58479\"
|
729
|
+
frameborder=\"0\" allow=\"autoplay; fullscreen; picture-in-picture\" allowfullscreen
|
730
|
+
style=\"position:absolute;top:0;left:0;width:100%;height:100%;\" title=\"Scispace
|
731
|
+
screencast\"></iframe></div><script src=\"https://player.vimeo.com/api/player.js\"></script>\n<p>What
|
732
|
+
is interesting about both is that they display content from left to right
|
733
|
+
in vertical columns, rather than the more common horizontal rows. This sort
|
734
|
+
of display is sometimes called <a href=\"https://en.wikipedia.org/wiki/Miller_columns\">Miller
|
735
|
+
columns</a> or a <a href=\"https://web.archive.org/web/20210726134921/http://designinginterfaces.com/firstedition/index.php?page=Cascading_Lists\">cascading
|
736
|
+
list</a>.</p>\n\n<div class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBPnV9fRBcvm-BX5PjzfG5Cff9PerCLsTW8d5ZbsL6b41t7ypD7ovmcgfTf3b4b34mbq8NM4sfwOHkgEq32FLYnD497RFQD4HQmYmh5Eveu1zWdDVyKyDtPyE98QoxTaOEnLA5kK0fnl3dOOEgUvtVKlTZ8bt1gj2v_8tDRWl9f50ybyei3A/s1024/GNUstep-liveCD.png\"
|
737
|
+
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
|
738
|
+
border=\"0\" width=\"400\" data-original-height=\"768\" data-original-width=\"1024\"
|
739
|
+
src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBPnV9fRBcvm-BX5PjzfG5Cff9PerCLsTW8d5ZbsL6b41t7ypD7ovmcgfTf3b4b34mbq8NM4sfwOHkgEq32FLYnD497RFQD4HQmYmh5Eveu1zWdDVyKyDtPyE98QoxTaOEnLA5kK0fnl3dOOEgUvtVKlTZ8bt1gj2v_8tDRWl9f50ybyei3A/s400/GNUstep-liveCD.png\"/></a></div>\n\n<p>By
|
740
|
+
Gürkan Sengün (talk) - Own work, Public Domain, <a href=\"https://commons.wikimedia.org/w/index.php?curid=594715\">https://commons.wikimedia.org/w/index.php?curid=594715</a></p>\n<p>I’ve
|
741
|
+
always found displaying a knowledge graph to be a challenge, as discussed
|
742
|
+
<a href=\"https://iphylo.blogspot.com/2019/07/notes-on-collections-knowledge-graphs.html\">elsewhere
|
743
|
+
on this blog</a> and in my paper on <a href=\"https://peerj.com/articles/6739/#p-29\">Ozymandias</a>.
|
744
|
+
Miller columns enable one to drill down in increasing depth, but it doesn’t
|
745
|
+
need to be a tree, it can be a path within a network. What I like about ResearchRabbit
|
746
|
+
and the original Scispace interface is that they present the current item
|
747
|
+
together with a list of possible connections (e.g., authors, citations) that
|
748
|
+
you can drill down on. Clicking on these will result in a new column being
|
749
|
+
appended to the right, with a view (typically a list) of the next candidates
|
750
|
+
to visit. In graph terms, these are adjacent nodes to the original item. The
|
751
|
+
clickable badges on each item can be thought of as sets of edges that have
|
752
|
+
the same label (e.g., “authored by”, “cites”, “funded”, “is about”, etc.).
|
753
|
+
Each of these nodes itself becomes a starting point for further exploration.
|
754
|
+
Note that the original starting point isn’t privileged, other than being the
|
755
|
+
starting point. That is, each time we drill down we are seeing the same type
|
756
|
+
of information displayed in the same way. Note also that the navigation can
|
757
|
+
be thought of as a <strong>card</strong> for a node, with <strong>buttons</strong>
|
758
|
+
grouping the adjacent nodes. When we click on an individual button, it expands
|
759
|
+
into a <strong>list</strong> in the next column. This can be thought of as
|
760
|
+
a preview for each adjacent node. Clicking on an element in the list generates
|
761
|
+
a new card (we are viewing a single node) and we get another set of buttons
|
762
|
+
corresponding to the adjacent nodes.</p>\n<p>One important behaviour in a
|
763
|
+
Miller column interface is that the current path can be pruned at any point.
|
764
|
+
If we go back (i.e., scroll to the left) and click on another tab on an item,
|
765
|
+
everything downstream of that item (i.e., to the right) gets deleted and replaced
|
766
|
+
by a new set of nodes. This could make retrieving a particular history of
|
767
|
+
browsing a bit tricky, but encourages exploration. Both Scispace and ResearchRabbit have
|
768
|
+
the ability to add items to a collection, so you can keep track of things
|
769
|
+
you discover.</p>\n<p>Lots of food for thought, I’m assuming that there is
|
770
|
+
some user interface/experience research on Miller columns. One thing to remember
|
771
|
+
is that Miller columns are most often associated with trees, but in this case
|
772
|
+
we are exploring a network. That means that potentially there is no limit
|
773
|
+
to the number of columns being generated as we wander through the graph. It
|
774
|
+
will be interesting to think about what the average depth is likely to be,
|
775
|
+
in other words, how deep down the rabbit hole will we go?</p>\n\n<h3>Update</h3>\n<p>Should
|
776
|
+
add link to David Regev''s explorations of <a href=\"https://medium.com/david-regev-on-ux/flow-browser-b730daf0f717\">Flow
|
777
|
+
Browser</a>.\n\n<blockquote>\n<p>Written with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":["cards","flow","Knowledge
|
778
|
+
Graph","Miller column","RabbitResearch"],"language":"en","references":[]},{"id":"https://doi.org/10.59350/t6fb9-4fn44","uuid":"8bc3fea6-cb86-4344-8dad-f312fbf58041","url":"https://iphylo.blogspot.com/2021/12/the-business-of-extracting-knowledge.html","title":"The
|
779
|
+
Business of Extracting Knowledge from Academic Publications","summary":"Markus
|
780
|
+
Strasser (@mkstra) wrote a fascinating article entitled \"The Business of Extracting
|
781
|
+
Knowledge from Academic Publications\". I spent months working on domain-specific
|
782
|
+
search engines and knowledge discovery apps for biomedicine and eventually
|
783
|
+
figured that synthesizing \"insights\" or building knowledge graphs by machine-reading
|
784
|
+
the academic literature (papers) is *barely useful* :https://t.co/eciOg30Odc—
|
785
|
+
Markus Strasser (@mkstra) December 7, 2021 His TL;DR: TL;DR: I worked on biomedical...","date_published":"2021-12-11T00:01:00Z","date_modified":"2021-12-11T00:01:21Z","date_indexed":"1909-06-16T11:32:09+00:00","authors":[{"url":null,"name":"Roderic
|
786
|
+
Page"}],"image":null,"content_html":"<p>Markus Strasser (<a href=\"https://twitter.com/mkstra\">@mkstra</a>)
|
787
|
+
wrote a fascinating article entitled <a href=\"https://markusstrasser.org/extracting-knowledge-from-literature/\">\"The
|
788
|
+
Business of Extracting Knowledge from Academic Publications\"</a>.</p>\n\n<blockquote
|
789
|
+
class=\"twitter-tweet\"><p lang=\"en\" dir=\"ltr\">I spent months working
|
790
|
+
on domain-specific search engines and knowledge discovery apps for biomedicine
|
791
|
+
and eventually figured that synthesizing &quot;insights&quot; or building
|
792
|
+
knowledge graphs by machine-reading the academic literature (papers) is *barely
|
793
|
+
useful* :<a href=\"https://t.co/eciOg30Odc\">https://t.co/eciOg30Odc</a></p>—
|
794
|
+
Markus Strasser (@mkstra) <a href=\"https://twitter.com/mkstra/status/1468334482113523716?ref_src=twsrc%5Etfw\">December
|
795
|
+
7, 2021</a></blockquote> <script async src=\"https://platform.twitter.com/widgets.js\"
|
796
|
+
charset=\"utf-8\"></script>\n\n<p>His TL;DR:</p>\n\n<p><blockquote>\nTL;DR:
|
797
|
+
I worked on biomedical literature search, discovery and recommender web applications
|
798
|
+
for many months and concluded that extracting, structuring or synthesizing
|
799
|
+
\"insights\" from academic publications (papers) or building knowledge bases
|
800
|
+
from a domain corpus of literature has negligible value in industry.</p>\n\n<p>Close
|
801
|
+
to nothing of what makes science actually work is published as text on the
|
802
|
+
web.\n</blockquote></p>\n\n<p>After recounting the many problems of knowledge
|
803
|
+
extraction - including a swipe at nanopubs which \"are ... dead in my view
|
804
|
+
(without admitting it)\" - he concludes:</p>\n\n<p><blockquote>\nI’ve been
|
805
|
+
flirting with this entire cluster of ideas including open source web annotation,
|
806
|
+
semantic search and semantic web, public knowledge graphs, nano-publications,
|
807
|
+
knowledge maps, interoperable protocols and structured data, serendipitous
|
808
|
+
discovery apps, knowledge organization, communal sense making and academic
|
809
|
+
literature/publishing toolchains for a few years on and off ... nothing of
|
810
|
+
it will go anywhere.</p>\n\n<p>Don’t take that as a challenge. Take it as
|
811
|
+
a red flag and run. Run towards better problems.\n</blockquote></p>\n\n<p>Well
|
812
|
+
worth a read, and much food for thought.</p>","tags":["ai","business model","text
|
813
|
+
mining"],"language":"en","references":[]},{"id":"https://doi.org/10.59350/463yw-pbj26","uuid":"dc829ab3-f0f1-40a4-b16d-a36dc0e34166","url":"https://iphylo.blogspot.com/2022/12/david-remsen.html","title":"David
|
814
|
+
Remsen","summary":"I heard yesterday from Martin Kalfatovic (BHL) that David
|
815
|
+
Remsen has died. Very sad news. It''s starting to feel like iPhylo might end
|
816
|
+
up being a list of obituaries of people working on biodiversity informatics
|
817
|
+
(e.g., Scott Federhen). I spent several happy visits at MBL at Woods Hole
|
818
|
+
talking to Dave at the height of the uBio project, which really kickstarted
|
819
|
+
large scale indexing of taxonomic names, and the use of taxonomic name finding
|
820
|
+
tools to index the literature. His work on uBio with David...","date_published":"2022-12-16T17:54:00Z","date_modified":"2022-12-17T08:12:23Z","date_indexed":"1909-06-16T11:41:39+00:00","authors":[{"url":null,"name":"Roderic
|
821
|
+
Page"}],"image":null,"content_html":"<p>I heard yesterday from Martin Kalfatovic
|
822
|
+
(BHL) that David Remsen has died. Very sad news. It''s starting to feel like
|
823
|
+
iPhylo might end up being a list of obituaries of people working on biodiversity
|
824
|
+
informatics (e.g., <a href=\"https://iphylo.blogspot.com/2016/05/scott-federhen-rip.html\">Scott
|
825
|
+
Federhen</a>).</p>\n\n<p>I spent several happy visits at MBL at Woods Hole
|
826
|
+
talking to Dave at the height of the uBio project, which really kickstarted
|
827
|
+
large scale indexing of taxonomic names, and the use of taxonomic name finding
|
828
|
+
tools to index the literature. His work on uBio with David (\"Paddy\") Patterson
|
829
|
+
led to the <a href=\"https://eol.org\">Encyclopedia of Life</a> (EOL).</p>\n\n<p>A
|
830
|
+
number of the things I''m currently working on are things Dave started. For
|
831
|
+
example, I recently uploaded a version of his dataset for Nomenclator Zoologicus[1]
|
832
|
+
to <a href=\"https://www.checklistbank.org/dataset/126539/about\">ChecklistBank</a>
|
833
|
+
where I''m working on augmenting that original dataset by adding links to
|
834
|
+
the taxonomic literature. My <a href=\"https://biorss.herokuapp.com/?feed=Y291bnRyeT1XT1JMRCZwYXRoPSU1QiUyMkJJT1RBJTIyJTVE\">BioRSS
|
835
|
+
project</a> is essentially an attempt to revive uBioRSS[2] (see <a href=\"https://iphylo.blogspot.com/2021/11/revisiting-rss-to-monitor-latests.html\">Revisiting
|
836
|
+
RSS to monitor the latest taxonomic research</a>).</p>\n\n<p>I have fond memories
|
837
|
+
of those visits to Woods Hole. A very sad day indeed.</p>\n\n<p><b>Update:</b>
|
838
|
+
The David Remsen Memorial Fund has been set up on <a href=\"https://www.gofundme.com/f/david-remsen-memorial-fund\">GoFundMe</a>.</p>\n\n<p>1.
|
839
|
+
Remsen, D. P., Norton, C., & Patterson, D. J. (2006). Taxonomic Informatics
|
840
|
+
Tools for the Electronic Nomenclator Zoologicus. The Biological Bulletin,
|
841
|
+
210(1), 18–24. https://doi.org/10.2307/4134533</p>\n\n<p>2. Patrick R. Leary,
|
842
|
+
David P. Remsen, Catherine N. Norton, David J. Patterson, Indra Neil Sarkar,
|
843
|
+
uBioRSS: Tracking taxonomic literature using RSS, Bioinformatics, Volume 23,
|
844
|
+
Issue 11, June 2007, Pages 1434–1436, https://doi.org/10.1093/bioinformatics/btm109</p>","tags":["David
|
845
|
+
Remsen","obituary","uBio"],"language":"en","references":[]},{"id":"https://doi.org/10.59350/3s376-6bm21","uuid":"62e7b438-67a3-44ac-a66d-3f5c278c949e","url":"https://iphylo.blogspot.com/2022/02/deduplicating-bibliographic-data.html","title":"Deduplicating
|
846
|
+
bibliographic data","summary":"There are several instances where I have a
|
847
|
+
collection of references that I want to deduplicate and merge. For example,
|
848
|
+
in Zootaxa has no impact factor I describe a dataset of the literature cited
|
849
|
+
by articles in the journal Zootaxa. This data is available on Figshare (https://doi.org/10.6084/m9.figshare.c.5054372.v4),
|
850
|
+
as is the equivalent dataset for Phytotaxa (https://doi.org/10.6084/m9.figshare.c.5525901.v1).
|
851
|
+
Given that the same articles may be cited many times, these datasets have
|
852
|
+
lots of...","date_published":"2022-02-03T15:09:00Z","date_modified":"2022-02-03T15:11:29Z","date_indexed":"1909-06-16T10:22:30+00:00","authors":[{"url":null,"name":"Roderic
|
853
|
+
Page"}],"image":null,"content_html":"<p>There are several instances where
|
854
|
+
I have a collection of references that I want to deduplicate and merge. For
|
855
|
+
example, in <a href=\"https://iphylo.blogspot.com/2020/07/zootaxa-has-no-impact-factor.html\">Zootaxa
|
856
|
+
has no impact factor</a> I describe a dataset of the literature cited by articles
|
857
|
+
in the journal <i>Zootaxa</i>. This data is available on Figshare (<a href=\"https://doi.org/10.6084/m9.figshare.c.5054372.v4\">https://doi.org/10.6084/m9.figshare.c.5054372.v4</a>),
|
858
|
+
as is the equivalent dataset for <i>Phytotaxa</i> (<a href=\"https://doi.org/10.6084/m9.figshare.c.5525901.v1\">https://doi.org/10.6084/m9.figshare.c.5525901.v1</a>).
|
859
|
+
Given that the same articles may be cited many times, these datasets have
|
860
|
+
lots of duplicates. Similarly, articles in <a href=\"https://species.wikimedia.org\">Wikispecies</a>
|
861
|
+
often have extensive lists of references cited, and the same reference may
|
862
|
+
appear on multiple pages (for an initial attempt to extract these references
|
863
|
+
see <a href=\"https://doi.org/10.5281/zenodo.5801661\">https://doi.org/10.5281/zenodo.5801661</a>
|
864
|
+
and <a href=\"https://github.com/rdmpage/wikispecies-parser\">https://github.com/rdmpage/wikispecies-parser</a>).</p>\n\n<p>There
|
865
|
+
are several reasons I want to merge these references. If I want to build a
|
866
|
+
citation graph for <i>Zootaxa</i> or <i>Phytotaxa</i> I need to merge references
|
867
|
+
that are the same so that I can accurately count citations. I am also interested
|
868
|
+
in harvesting the metadata to help find those articles in the <a href=\"https://www.biodiversitylibrary.org\">Biodiversity
|
869
|
+
Heritage Library</a> (BHL), and the literature cited section of scientific
|
870
|
+
articles is a potential goldmine of bibliographic metadata, as is Wikispecies.</p>\n\n<p>After
|
871
|
+
various experiments and false starts I''ve created a repository <a href=\"https://github.com/rdmpage/bib-dedup\">https://github.com/rdmpage/bib-dedup</a>
|
872
|
+
to host a series of PHP scripts to deduplicate bibliographic data. I''ve
|
873
|
+
settled on using CSL-JSON as the format for bibliographic data. Because deduplication
|
874
|
+
relies on comparing pairs of references, the standard format for most of the
|
875
|
+
scripts is a JSON array containing a pair of CSL-JSON objects to compare.
|
876
|
+
Below are the steps the code takes.</p>\n\n<h2>Generating pairs to compare</h2>\n\n<p>The
|
877
|
+
first step is to take a list of references and generate the pairs that will
|
878
|
+
be compared. I started with this approach as I wanted to explore machine learning
|
879
|
+
and wanted a simple format for training data, such as an array of two CSL-JSON
|
880
|
+
objects and an integer flag representing whether the two references were the
|
881
|
+
same or different.</p>\n\n<p>There are various ways to generate CSL-JSON for
|
882
|
+
a reference. I use a tool I wrote (see <a href=\"https://iphylo.blogspot.com/2021/07/citation-parsing-released.html\">Citation
|
883
|
+
parsing tool released</a>) that has a simple API where you parse one or more
|
884
|
+
references and it returns that reference as structured data in CSL-JSON.</p>\n\n<p>Attempting
|
885
|
+
to do all possible pairwise comparisons rapidly gets impractical as the number
|
886
|
+
of references increases, so we need some way to restrict the number of comparisons
|
887
|
+
we make. One approach I''ve explored is the “sorted neighbourhood method”
|
888
|
+
where we sort the references (for example by their title) then move a sliding
|
889
|
+
window down the list of references, comparing all references within that window.
|
890
|
+
This greatly reduces the number of pairwise comparisons. So the first step
|
891
|
+
is to sort the references, then run a sliding window over them, outputting all
|
892
|
+
the pairs in each window (ignoring pairwise comparisons already made in
|
893
|
+
a previous window). Other methods of \"blocking\" could also be used, such
|
894
|
+
as only including references in a particular year, or a particular journal.</p>\n\n<p>So,
|
895
|
+
the output of this step is a set of JSON arrays, each with a pair of references
|
896
|
+
in CSL-JSON format. Each array is stored on a single line in the same file
|
897
|
+
in <a href=\"https://en.wikipedia.org/wiki/JSON_streaming#Line-delimited_JSON_2\">line-delimited
|
898
|
+
JSON</a> (JSONL).</p>\n\n<h2>Comparing pairs</h2>\n\n<p>The next step is to
|
899
|
+
compare each pair of references and decide whether they are a match or not.
|
900
|
+
Initially I explored a machine learning approach used in the following paper:</p>\n\n<blockquote>\nWilson
|
901
|
+
DR. 2011. Beyond probabilistic record linkage: Using neural networks and complex
|
902
|
+
features to improve genealogical record linkage. In: The 2011 International
|
903
|
+
Joint Conference on Neural Networks. 9–14. DOI: <a href=\"https://doi.org/10.1109/IJCNN.2011.6033192\">10.1109/IJCNN.2011.6033192</a>\n</blockquote>\n\n<p>Initial
|
904
|
+
experiments using <a href=\"https://github.com/jtet/Perceptron\">https://github.com/jtet/Perceptron</a>
|
905
|
+
were promising and I want to play with this further, but I decided to skip
|
906
|
+
this for now and just use simple string comparison. So for each CSL-JSON object
|
907
|
+
I generate a citation string in the same format using CiteProc, then compute
|
908
|
+
the <a href=\"https://en.wikipedia.org/wiki/Levenshtein_distance\">Levenshtein
|
909
|
+
distance</a> between the two strings. By normalising this distance by the
|
910
|
+
length of the two strings being compared I can use an arbitrary threshold
|
911
|
+
to decide if the references are the same or not.</p>\n\n<h2>Clustering</h2>\n\n<p>For
|
912
|
+
this step we read the JSONL file produced above and record whether the two
|
913
|
+
references are a match or not. Assuming each reference has a unique identifier
|
914
|
+
(it need only be unique within the file) then we can use those identifiers to
|
915
|
+
record the clusters each reference belongs to. I do this using a <a href=\"https://en.wikipedia.org/wiki/Disjoint-set_data_structure\">Disjoint-set
|
916
|
+
data structure</a>. We start with a graph where each node
|
917
|
+
represents a reference, and each node has a pointer to a parent node. Initially
|
918
|
+
the reference is its own parent. A simple implementation is to have an array
|
919
|
+
indexed by reference identifiers, where the value of each cell in the array
|
920
|
+
is the node''s parent.</p>\n\n<p>As we discover pairs we update the parents
|
921
|
+
of the nodes to reflect this, such that once all the comparisons are done
|
922
|
+
we have one or more clusters corresponding to the references that
|
923
|
+
we think are the same. Another way to think of this is that we are getting
|
924
|
+
the components of a graph where each node is a reference and pairs of references
|
925
|
+
that match are connected by an edge.</p>\n\n<p>In the code I''m using I write
|
926
|
+
this graph in <a href=\"https://en.wikipedia.org/wiki/Trivial_Graph_Format\">Trivial
|
927
|
+
Graph Format</a> (TGF) which can be visualised using a tool such as <a href=\"https://www.yworks.com/products/yed\">yEd</a>.</p>\n\n<h2>Merging</h2>\n\n<p>Now
|
928
|
+
that we have a graph representing the sets of references that we think are
|
929
|
+
the same we need to merge them. This is where things get interesting as the
|
930
|
+
references are similar (by definition) but may differ in some details. The
|
931
|
+
paper below describes a simple Bayesian approach for merging records:</p>\n\n<blockquote>\nCouncill
|
932
|
+
IG, Li H, Zhuang Z, Debnath S, Bolelli L, Lee WC, Sivasubramaniam A, Giles
|
933
|
+
CL. 2006. Learning Metadata from the Evidence in an On-line Citation Matching
|
934
|
+
Scheme. In: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital
|
935
|
+
Libraries. JCDL ’06. New York, NY, USA: ACM, 276–285. DOI: <a href=\"https://doi.org/10.1145/1141753.1141817\">10.1145/1141753.1141817</a>.\n</blockquote>\n\n<p>So
|
936
|
+
the next step is to read the graph with the clusters, generate the sets of
|
937
|
+
bibliographic references that correspond to each cluster, then use the method
|
938
|
+
described in Councill et al. to produce a single bibliographic record for
|
939
|
+
that cluster. These records could then be used to, say, locate the corresponding
|
940
|
+
article in BHL, or populate Wikidata with missing references.</p>\n\n<p>Obviously
|
941
|
+
there is always the potential for errors, such as trying to merge references
|
942
|
+
that are not the same. As a quick and dirty check I flag as dubious any cluster
|
943
|
+
where the page numbers vary among members of the cluster. More sophisticated
|
944
|
+
checks are possible, especially if I go down the ML route (i.e., I would have
|
945
|
+
evidence for the probability that the same reference can disagree on some
|
946
|
+
aspects of metadata).</p>\n\n<h2>Summary</h2>\n\n<p>At this stage the code
|
947
|
+
is working well enough for me to play with and explore some example datasets.
|
948
|
+
The focus is on structured bibliographic metadata, but I may simplify things
|
949
|
+
and have a version that handles simple string matching, for example to cluster
|
950
|
+
together different abbreviations of the same journal name.</p>","tags":["data
|
951
|
+
cleaning","deduplication","Phytotaxa","Wikispecies","Zootaxa"],"language":"en","references":[]},{"id":"https://doi.org/10.59350/c79vq-7rr11","uuid":"3cb94422-5506-4e24-a41c-a250bb521ee0","url":"https://iphylo.blogspot.com/2021/12/graphql-for-wikidata-wikicite.html","title":"GraphQL
|
952
|
+
for WikiData (WikiCite)","summary":"I''ve released a very crude GraphQL endpoint
|
953
|
+
for WikiData. More precisely, the endpoint is for a subset of the entities
|
954
|
+
that are of interest to WikiCite, such as scholarly articles, people, and
|
955
|
+
journals. There is a crude demo at https://wikicite-graphql.herokuapp.com.
|
956
|
+
The endpoint itself is at https://wikicite-graphql.herokuapp.com/gql.php.
|
957
|
+
There are various ways to interact with the endpoint, personally I like the
|
958
|
+
Altair GraphQL Client by Samuel Imolorhe. As I''ve mentioned earlier it''s
|
959
|
+
taken...","date_published":"2021-12-20T13:16:00Z","date_modified":"2021-12-20T13:20:05Z","date_indexed":"1909-06-16T10:52:00+00:00","authors":[{"url":null,"name":"Roderic
|
960
|
+
Page"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
|
961
|
+
both;\"><a href=\"https://blogger.googleusercontent.com/img/a/AVvXsEh7WW8TlfN2u3xe_E-sH5IK6AWYnAoWaKrP2b32UawUeqMpPlq6ZFk5BtJVZsMmCNh5j3QRsTj5H0Ee55RRGntnc9yjj_mNB8KHmH2dzocCgyLS2VFOxsBji6u4Ey6qxlDAnT-zrsBpnDcTchbhgt1x0Sf7RkmIMkS1y4-_3KCQian-SeIF-g=s1000\"
|
962
|
+
style=\"display: block; padding: 1em 0; text-align: center; clear: right;
|
963
|
+
float: right;\"><img alt=\"\" border=\"0\" width=\"128\" data-original-height=\"1000\"
|
964
|
+
data-original-width=\"1000\" src=\"https://blogger.googleusercontent.com/img/a/AVvXsEh7WW8TlfN2u3xe_E-sH5IK6AWYnAoWaKrP2b32UawUeqMpPlq6ZFk5BtJVZsMmCNh5j3QRsTj5H0Ee55RRGntnc9yjj_mNB8KHmH2dzocCgyLS2VFOxsBji6u4Ey6qxlDAnT-zrsBpnDcTchbhgt1x0Sf7RkmIMkS1y4-_3KCQian-SeIF-g=s200\"/></a></div><p>I''ve
|
965
|
+
released a very crude GraphQL endpoint for WikiData. More precisely, the endpoint
|
966
|
+
is for a subset of the entities that are of interest to WikiCite, such as
|
967
|
+
scholarly articles, people, and journals. There is a crude demo at <a href=\"https://wikicite-graphql.herokuapp.com\">https://wikicite-graphql.herokuapp.com</a>.
|
968
|
+
The endpoint itself is at <a href=\"https://wikicite-graphql.herokuapp.com/gql.php\">https://wikicite-graphql.herokuapp.com/gql.php</a>.
|
969
|
+
There are various ways to interact with the endpoint, personally I like the
|
970
|
+
<a href=\"https://altair.sirmuel.design\">Altair GraphQL Client</a> by <a
|
971
|
+
href=\"https://github.com/imolorhe\">Samuel Imolorhe</a>.</p>\n\n<p>As I''ve
|
972
|
+
<a href=\"https://iphylo.blogspot.com/2021/04/it-been-while.html\">mentioned
|
973
|
+
earlier</a> it''s taken me a while to see the point of GraphQL. But it is
|
974
|
+
clear it is gaining traction in the biodiversity world (see for example the
|
975
|
+
<a href=\"https://dev.gbif.org/hosted-portals.html\">GBIF Hosted Portals</a>)
|
976
|
+
so it''s worth exploring. My take on GraphQL is that it is a way to create
|
977
|
+
a self-describing API that someone developing a web site can use without them
|
978
|
+
having to bury themselves in the gory details of how data is internally modelled.
|
979
|
+
For example, WikiData''s query interface uses SPARQL, a powerful language
|
980
|
+
that has a steep learning curve (in part because of the administrative overhead
|
981
|
+
brought by RDF namespaces, etc.). In my previous SPARQL-based projects such
|
982
|
+
as <a href=\"https://ozymandias-demo.herokuapp.com\">Ozymandias</a> and <a
|
983
|
+
href=\"http://alec-demo.herokuapp.com\">ALEC</a> I have either returned SPARQL
|
984
|
+
results directly (Ozymandias) or formatted SPARQL results as <a href=\"https://schema.org/DataFeed\">schema.org
|
985
|
+
DataFeeds</a> (equivalent to RSS feeds) (ALEC). Both approaches work, but
|
986
|
+
they are project-specific and if anyone else tried to build based on these
|
987
|
+
projects they might struggle for figure out what was going on. I certainly
|
988
|
+
struggle, and I wrote them!</p>\n\n<p>So it seems worthwhile to explore this
|
989
|
+
approach a little further and see if I can develop a GraphQL interface that
|
990
|
+
can be used to build the sort of rich apps that I want to see. The demo I''ve
|
991
|
+
created uses SPARQL under the hood to provide responses to the GraphQL queries.
|
992
|
+
So in this sense it''s not replacing SPARQL, it''s simply providing a (hopefully)
|
993
|
+
simpler overlay on top of SPARQL so that we can retrieve the data we want
|
994
|
+
without having to learn the intricacies of SPARQL, nor how Wikidata models
|
995
|
+
publications and people.</p>","tags":["GraphQL","SPARQL","WikiCite","Wikidata"],"language":"en","references":[]}]}'
|
996
|
+
recorded_at: Sun, 18 Jun 2023 06:01:20 GMT
|
997
|
+
recorded_with: VCR 6.1.0
|