commonmeta-ruby 3.3.15 → 3.3.16

@@ -23,23 +23,23 @@ http_interactions:
  Cache-Control:
  - public, max-age=0, must-revalidate
  Content-Length:
- - '162607'
+ - '26135'
  Content-Type:
  - application/json; charset=utf-8
  Date:
- - Thu, 15 Jun 2023 20:47:10 GMT
+ - Mon, 10 Jul 2023 20:37:16 GMT
  Etag:
- - '"6w7me0q1i23h72"'
+ - '"a986ggx9d0k3r"'
  Server:
  - Vercel
  Strict-Transport-Security:
  - max-age=63072000
  X-Matched-Path:
- - "/api/blogs/[slug]"
+ - "/api/blogs/[[...params]]"
  X-Vercel-Cache:
  - MISS
  X-Vercel-Id:
- - fra1::iad1::95rf9-1686862029064-f08aa4b0d0a5
+ - fra1::iad1::b9rbm-1689021435986-bb09a174a89c
  Connection:
  - close
  body:
@@ -49,91 +49,26 @@ http_interactions:
  more ranty and less considered opinions, see my <a href=\"https://twitter.com/rdmpage\">Twitter
  feed</a>.<br>ISSN 2051-8188. Written content on this site is licensed under
  a <a href=\"https://creativecommons.org/licenses/by/4.0/\">Creative Commons
52
- Attribution 4.0 International license</a>.","language":"en","favicon":null,"feed_url":"https://iphylo.blogspot.com/feeds/posts/default?alt=rss","feed_format":"application/rss+xml","home_page_url":"https://iphylo.blogspot.com/","indexed_at":"2023-02-06","modified_at":"2023-05-31T17:26:00+00:00","license":"https://creativecommons.org/licenses/by/4.0/legalcode","generator":"Blogger","category":"Natural
53
- Sciences","backlog":true,"prefix":"10.59350","items":[{"id":"https://doi.org/10.59350/btdk4-42879","uuid":"3e1278f6-e7c0-43e1-bb54-6829e1344c0d","url":"https://iphylo.blogspot.com/2022/09/the-ideal-taxonomic-journal.html","title":"The
54
- ideal taxonomic journal","summary":"This is just some random notes on an “ideal”
55
- taxonomic journal, inspired in part by some recent discussions on “turbo-taxonomy”
56
- (e.g., https://doi.org/10.3897/zookeys.1087.76720 and https://doi.org/10.1186/1742-9994-10-15),
57
- and also examples such as the Australian Journal of Taxonomy https://doi.org/10.54102/ajt.qxi3r
58
- which seems well-intentioned but limited. XML One approach is to have highly
59
- structured text that embeds detailed markup, and ideally a tool that generates
60
- markup in XML. This is...","date_published":"2022-09-29T14:00:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
61
- (Roderic Page)"}],"image":null,"content_html":"<p>This is just some random
62
- notes on an “ideal” taxonomic journal, inspired in part by some recent discussions
63
- on “turbo-taxonomy” (e.g., <a href=\"https://doi.org/10.3897/zookeys.1087.76720\">https://doi.org/10.3897/zookeys.1087.76720</a>
64
- and <a href=\"https://doi.org/10.1186/1742-9994-10-15\">https://doi.org/10.1186/1742-9994-10-15</a>),
65
- and also examples such as the Australian Journal of Taxonomy <a href=\"https://doi.org/10.54102/ajt.qxi3r\">https://doi.org/10.54102/ajt.qxi3r</a>
66
- which seems well-intentioned but limited.</p>\n<h2 id=\"xml\">XML</h2>\n<p>One
67
- approach is to have highly structured text that embeds detailed markup, and
68
- ideally a tool that generates markup in XML. This is the approach taken by
69
- Pensoft. There is an inevitable trade-off between the burden on authors of
70
- marking up text versus making the paper machine readable. In some ways this
71
- seems misplaced effort given that there is little evidence that publications
72
- by themselves have much value (see <a href=\"https://iphylo.blogspot.com/2021/12/the-business-of-extracting-knowledge.html\">The
73
- Business of Extracting Knowledge from Academic Publications</a>). “Value”
74
- in this case means as a source of data or factual statements that we can compute
75
- over. Human-readable text is not a good way to convey this sort of information.</p>\n<p>It’s
76
- also interesting that many editing tools are going in the opposite direction,
77
- for example there are minimalist tools using <a href=\"https://en.wikipedia.org/wiki/Markdown\">Markdown</a>
78
- where the goal is to <em>get out of the author’s way</em>, rather than impose
79
- a way of writing. Text is written by humans for humans, so the tools should
80
- be human-friendly.</p>\n<p>The idea of publishing using XML is attractive
81
- in that it gives you XML that can be archived by, say, PubMed Central, but
82
- other than that the value seems limited. A cursory glance at download stats
83
- for journals that provide PDF and XML downloads, such as <em>PLoS One</em>
84
- and <em>ZooKeys</em>, PDF is by far the more popular format. So arguably there
85
- is little value in providing XML. Those who have tried to use JATS-XML as
86
- an authoring tool have not had a happy time: <a href=\"https://doi.org/10.7557/15.5517\">How
87
- we tried to JATS XML</a>. However, there are various tools to help with the
88
- process, such as <a href=\"https://github.com/Vitaliy-1/docxToJats\">docxToJats</a>,<br>\ntexture,
89
- and <a href=\"https://github.com/elifesciences/jats-xml-to-pdf\">jats-xml-to-pdf</a>
90
- if this is the route one wants to take.</p>\n<h2 id=\"automating-writing-manuscripts\">Automating
91
- writing manuscripts</h2>\n<p>The dream, of course, is to have a tool where
92
- you store all your taxonomic data (literature, specimens, characters, images,
93
- sequences, media files, etc.) and at the click of a button generate a paper.
94
- Certainly some of this can be automated, much nomenclatural and specimen information
95
- could be converted to human-readable text. Ideally this computer-generated
96
- text would not be edited (otherwise it could get out of sync with the underlying
97
- data). The text should be <a href=\"https://en.wikipedia.org/wiki/Transclusion\">transcluded</a>.
98
- As an aside, one way to do this would be to include things such as lists of
99
- material examined as images rather than text while the manuscript is being
100
- edited. In the same way that you (probably) wouldn’t edit a photograph within
101
- your text editor, you shouldn’t be editing data. When the manuscript is published
102
- the data-generated portions can then be output as text.</p>\n<p>Of course
103
- all of this assumes that we have taxonomic data in a database (or some other
104
- storage format, including plain text and Mark-down, e.g. <a href=\"https://iphylo.blogspot.com/2022/04/obsidian-markdown-and-taxonomic-trees.html\">Obsidian,
105
- markdown, and taxonomic trees</a>) that can generate outputs in the various
106
- formats that we need.</p>\n<h2 id=\"archiving-data-and-images\">Archiving
107
- data and images</h2>\n<p>One of the really nice things that <a href=\"http://plazi.org\">Plazi</a>
108
- do is have a pipeline that sends taxonomic descriptions and images to Zenodo,
109
- and similar data to GBIF. Any taxonomic journal should be able to do this.
110
- Indeed, arguably each taxonomic treatment within the paper should be linked
111
- to the Zenodo DOI at the time of publication. Indeed, we could imagine ultimately
112
- having treatments as transclusions within the larger manuscript. Alternatively
113
- we could store the treatments as parts of the larger article (rather like
114
- chapters in a book), each with a CrossRef DOI. I’m still sceptical about whether
115
- these treatments are as important as we make out, see <a href=\"https://iphylo.blogspot.com/2022/09/does-anyone-cite-taxonomic-treatments.html\">Does
116
- anyone cite taxonomic treatments?</a>. But having machine-readable taxonomic
117
- data archived and accessible is a good thing. Uploading the same data to GBIF
118
- makes much of that data immediately accessible. Now that GBIF offers <a href=\"https://www.gbif.org/composition/3kQFinjwHbCGZeLb5OhwN2/gbif-hosted-portals\">hosted
119
- portals</a> there is the possibility of having custom interfaces to data from
120
- a particular journal.</p>\n<h2 id=\"name-and-identifier-registration\">Name
121
- and identifier registration</h2>\n<p>We would also want automatic registration
122
- of new taxonomic names, for which there are pipelines (see “A common registration-to-publication
123
- automated pipeline for nomenclatural acts for higher plants (International
124
- Plant Names Index, IPNI), fungi (Index Fungorum, MycoBank) and animals (ZooBank)”
125
- <a href=\"https://doi.org/10.3897/zookeys.550.9551\">https://doi.org/10.3897/zookeys.550.9551</a>).
126
- These pipelines do not seem to be documented in much detail, and the data
127
- formats differ across registration agencies (e.g., IPNI and ZooBank). For
128
- example, ZooBank seems to require TaxPub XML.</p>\n<p>Registration of names
129
- and identifiers, especially across multiple registration agencies (ZooBank,
130
- CrossRef, DataCite, etc.) requires some coordination, especially when one
131
- registration agency requires identifiers from another.</p>\n<h2 id=\"summary\">Summary</h2>\n<p>If
132
- data is key, then the taxonomic paper itself becomes something of a wrapper
133
- around that data. It still serves the function of being human-readable, providing
134
- broader context for the work, and as an archive that conforms to currently
135
- accepted ways to publish taxonomic names. But in some ways it is the last
136
- interesting part of the process.</p>\n<blockquote>\n<p>Written with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/37y2z-gre70","uuid":"f3629c86-06e0-42c0-844a-266b03a91ef1","url":"https://iphylo.blogspot.com/2023/05/ten-years-and-million-links.html","title":"Ten
+ Attribution 4.0 International license</a>.","language":"en","favicon":null,"feed_url":"https://iphylo.blogspot.com/feeds/posts/default","current_feed_url":null,"feed_format":"application/atom+xml","home_page_url":"https://iphylo.blogspot.com/","indexed_at":"2023-02-06","modified_at":"2023-06-17T15:38:20+00:00","license":"https://creativecommons.org/licenses/by/4.0/legalcode","generator":"Blogger
+ 7.00","category":"Natural Sciences","backlog":true,"prefix":"10.59350","expired":null,"items":[{"id":"37538c38-66e6-4ac4-ab5c-679684622ade","doi":"https://doi.org/10.59350/2b1j9-qmw12","url":"https://iphylo.blogspot.com/2022/05/round-trip-from-identifiers-to.html","title":"Round
+ trip from identifiers to citations and back again","summary":"Note to self
+ (basically rewriting last year''s Finding citations of specimens). Bibliographic
+ data supports going from identifier to citation string and back again, so
+ we can do a \"round trip.\" 1. Given a DOI we can get structured data with
+ a simple HTTP fetch, then use a tool such as citation.js to convert that data
+ into a human-readable string in a variety of formats. Identifier Structured
+ data Human readable string 10.7717/peerj-cs.214 HTTP with...","published_at":1653669240,"updated_at":1653669259,"indexed_at":1689006804,"authors":[{"url":"https://orcid.org/0000-0002-7101-9767","name":"Roderic
+ Page"}],"image":null,"tags":["citation","GBIF","material examined","specimen
+ codes"],"language":"en","reference":[]},{"id":"545c177f-cea5-4b79-b554-3ccae9c789d7","doi":"https://doi.org/10.59350/d3dc0-7an69","url":"https://iphylo.blogspot.com/2021/10/reflections-on-macroscope-tool-for-21st.html","title":"Reflections
+ on \"The Macroscope\" - a tool for the 21st Century?","summary":"This is a
+ guest post by Tony Rees. It would be difficult to encounter a scientist, or
+ anyone interested in science, who is not familiar with the microscope, a tool
+ for making objects visible that are otherwise too small to be properly seen
+ by the unaided eye, or to reveal otherwise invisible fine detail in larger
+ objects. A select few with a particular interest in microscopy may also have
+ encountered the Wild-Leica \"Macroscope\", a specialised type of benchtop
+ microscope optimised for...","published_at":1633610280,"updated_at":1633688782,"indexed_at":1689006804,"authors":[{"url":"https://orcid.org/0000-0002-7101-9767","name":"Roderic
+ Page"}],"image":null,"tags":["guest post","macroscope"],"language":"en","reference":[]},{"id":"f3629c86-06e0-42c0-844a-266b03a91ef1","doi":"https://doi.org/10.59350/37y2z-gre70","url":"https://iphylo.blogspot.com/2023/05/ten-years-and-million-links.html","title":"Ten
  years and a million links","summary":"As trailed on a Twitter thread last
  week I’ve been working on a manuscript describing the efforts to map taxonomic
  names to their original descriptions in the taxonomic literature. Putting
@@ -141,563 +76,24 @@ http_interactions:
  basically “um, what, exactly, have you been doing all these years?”. TL;DR
  Across fungi, plants, and animals approx 1.3 million names have been linked
  to a persistent identifier for a publication.— Roderic Page (@rdmpage) May
144
- 25,...","date_published":"2023-05-31T17:26:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
145
- (Roderic Page)"}],"image":null,"content_html":"<p>As trailed on a Twitter
146
- thread last week I’ve been working on a manuscript describing the efforts
147
- to map taxonomic names to their original descriptions in the taxonomic literature.</p>\n<blockquote
148
- class=\"twitter-tweet\"><p lang=\"en\" dir=\"ltr\">Putting together a manuscript
149
- on linking taxonomic names to the primary literature, basically “um, what,
150
- exactly, have you been doing all these years?”. TL;DR Across fungi, plants,
151
- and animals approx 1.3 million names have been linked to a persistent identifier
152
- for a publication.</p>— Roderic Page (@rdmpage) <a href=\"https://twitter.com/rdmpage/status/1661714128413573120?ref_src=twsrc%5Etfw\">May
153
- 25, 2023</a></blockquote> \n<p>The preprint is on bioRxiv <a href=\"https://doi.org/10.1101/2023.05.29.542697\">doi:10.1101/2023.05.29.542697</a></p>\n<blockquote>\n<p>A
154
- major gap in the biodiversity knowledge graph is a connection between taxonomic
155
- names and the taxonomic literature. While both names and publications often
156
- have persistent identifiers (PIDs), such as Life Science Identifiers (LSIDs)
157
- or Digital Object Identifiers (DOIs), LSIDs for names are rarely linked to
158
- DOIs for publications. This article describes efforts to make those connections
159
- across three large taxonomic databases: Index Fungorum, International Plant
160
- Names Index (IPNI), and the Index of Organism Names (ION). Over a million
161
- names have been matched to DOIs or other persistent identifiers for taxonomic
162
- publications. This represents approximately 36% of names for which publication
163
- data is available. The mappings between LSIDs and publication PIDs are made
164
- available through ChecklistBank. Applications of this mapping are discussed,
165
- including a web app to locate the citation of a taxonomic name, and a knowledge
166
- graph that uses data on researcher’s ORCID ids to connect taxonomic names
167
- and publications to authors of those names.</p>\n</blockquote>\n<p>Much of
168
- the work has been linking taxa to names, which still has huge gaps. There
169
- are also interesting differences in coverage between plants, animals, and
170
- fungi (see preprint for details).</p>\n\n<div class=\"separator\" style=\"clear:
171
- both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdWsSQhqi1DErXMIHm28g37-fiALNIsI5eQZmvoX_Fe03ZSwtKHbYt-LCsCCAUop0AGcwy_w7NpIjylVH1hNrM9oW-6j9e6tHASha49TTqFvDg2_tEx3r74RRFsjUo4M_Qat8NmKaZSChOt2hI3LsMjTVLrEVirEckU-9Ei7ug-7OHQlR4LA/s2276/animals-coverage.png\"
172
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
173
- border=\"0\" width=\"320\" data-original-height=\"2276\" data-original-width=\"2276\"
174
- src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdWsSQhqi1DErXMIHm28g37-fiALNIsI5eQZmvoX_Fe03ZSwtKHbYt-LCsCCAUop0AGcwy_w7NpIjylVH1hNrM9oW-6j9e6tHASha49TTqFvDg2_tEx3r74RRFsjUo4M_Qat8NmKaZSChOt2hI3LsMjTVLrEVirEckU-9Ei7ug-7OHQlR4LA/s320/animals-coverage.png\"/></a></div><div
175
- class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdyxlVJ-oyMCNPmHtHWjSxdxMSJvgzdWRGRF6Ad4dk7ab7gGDpuKdKmS9XhROkopw361ylfsTd1ZkwkF6BN0JlWNnVLCKY1AfryCfWKHkgPQM7u-0SELW9j8RlQIflb6ibaV64gwW7oJrEvOGECvR51F8EW8cRg-1usW-GBM5ymObj7zlObQ/s2276/fungi-coverage.png\"
176
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
177
- border=\"0\" width=\"320\" data-original-height=\"2276\" data-original-width=\"2276\"
178
- src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdyxlVJ-oyMCNPmHtHWjSxdxMSJvgzdWRGRF6Ad4dk7ab7gGDpuKdKmS9XhROkopw361ylfsTd1ZkwkF6BN0JlWNnVLCKY1AfryCfWKHkgPQM7u-0SELW9j8RlQIflb6ibaV64gwW7oJrEvOGECvR51F8EW8cRg-1usW-GBM5ymObj7zlObQ/s320/fungi-coverage.png\"/></a></div><div
179
- class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgf0YBuvNSXWAJTfQ1jk4XSocMzCYHP7t6IPUqhjQ3mftgM_850igWaD2copgNH6Xk6T62xBU641wvwOvXgCCDY3m2xC_gaILXO9RGx8H3Gpy5OOncsLb9smpT2LIgtYOExVBVdDRWqA0AZ8-mQjWL7dL5TiG7MqVu8spT8ACoGOPR_T36hRA/s2276/plants-coverage.png\"
180
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
181
- border=\"0\" width=\"320\" data-original-height=\"2276\" data-original-width=\"2276\"
182
- src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgf0YBuvNSXWAJTfQ1jk4XSocMzCYHP7t6IPUqhjQ3mftgM_850igWaD2copgNH6Xk6T62xBU641wvwOvXgCCDY3m2xC_gaILXO9RGx8H3Gpy5OOncsLb9smpT2LIgtYOExVBVdDRWqA0AZ8-mQjWL7dL5TiG7MqVu8spT8ACoGOPR_T36hRA/s320/plants-coverage.png\"/></a></div>\n\n\nThere
183
- is also a simple app to demonstrate these links, see <a href=\"https://species-cite.herokuapp.com\">https://species-cite.herokuapp.com</a>.\n\n\n\n<blockquote>\n<p>Written
184
- with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/92rdb-5fe58","uuid":"d33d4f49-b281-4997-9eb9-dbad1e52d9bd","url":"https://iphylo.blogspot.com/2022/09/local-global-identifiers-for.html","title":"Local
185
- global identifiers for decentralised wikis","summary":"I''ve been thinking
186
- a bit about how one could use a Markdown wiki-like tool such as Obsidian to
187
- work with taxonomic data (see earlier posts Obsidian, markdown, and taxonomic
188
- trees and Personal knowledge graphs: Obsidian, Roam, Wikidata, and Xanadu).
189
- One \"gotcha\" would be how to name pages. If we treat the database as entirely
190
- local, then the page names don''t matter, but what if we envisage sharing
191
- the database, or merging it with others (for example, if we divided a taxon
192
- up into chunks, and...","date_published":"2022-09-08T16:09:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
193
- (Roderic Page)"}],"image":null,"content_html":"<p>I''ve been thinking a bit
194
- about how one could use a Markdown wiki-like tool such as Obsidian to work
195
- with taxonomic data (see earlier posts <a href=\"https://iphylo.blogspot.com/2022/04/obsidian-markdown-and-taxonomic-trees.html\">Obsidian,
196
- markdown, and taxonomic trees</a> and <a href=\"https://iphylo.blogspot.com/2020/08/personal-knowledge-graphs-obsidian-roam.html\">Personal
197
- knowledge graphs: Obsidian, Roam, Wikidata, and Xanadu</a>).</p>\n\n<p>One
198
- \"gotcha\" would be how to name pages. If we treat the database as entirely
199
- local, then the page names don''t matter, but what if we envisage sharing
200
- the database, or merging it with others (for example, if we divided a taxon
201
- up into chunks, and different people worked on those different chunks)? </p>\n\n<p>This
202
- is the attraction of globally unique identifiers. You and I can independently
203
- work on the same thing, such as data linked to scientific paper, safe in the
204
- knowledge that if we both use the DOI for that paper we can easily combine
205
- what we''ve done. But global identifiers can also be a pain, especially if
206
- we need to use a service to look them up (\"is there a DOI for this paper?\",
207
- \"what is the LSID for this taxonomic name?\").</p>\n\n<p>Life would be easier
208
- if we could generate identifiers \"locally\", but had some assurance that
209
- they would be globally unique, and that anyone else generating an identifier
210
- for the same thing would arrive at the same identifier (this eliminates things
211
- such as <a href=\"https://en.wikipedia.org/wiki/Universally_unique_identifier\">UUIDs</a>
212
- which are intentionally designed to prvent people genrrating the same identifier).
213
- One approach is \"content addressing\" (see, e.g. <a href=\"https://web.archive.org/web/20210514054054/https://bentrask.com/notes/content-addressing.html\">Principles
214
- of Content Addressing</a> - dead link but in the Wayabck Machine, see also
215
- <a href=\"https://github.com/btrask/stronglink\">btrask/stronglink</a>). For
216
- example, we can generate a cryptographic hash of a file (such as a PDF) and
217
- use that as the identifier.</p>\n\n<p>Now the problem is that we have globally
218
- unique, but ugly and unfriendly identifiers (such as \"6c98136eba9084ea9a5fc0b7693fed8648014505\").
219
- What we need are nice, easy to use identifiers we can use as page names. <a
220
- href=\"https://species.wikimedia.org/wiki/Main_Page\">Wikispecies</a> serves
221
- as a possible role model, where taxon names serve as page names, as do simplified
222
- citations (e.g., authors and years). This model runs into the problem that
223
- taxon names aren''t unique, nor are author + year combinations. In Wikispecies
224
- this is resolved by having a centralised database where it''s first come,
225
- first served. If there is a name clash you have to create a new name for your
226
- page. This works, but what if you have multiple databases un by different
227
- people? How do we ensure the identifiers are the same?</p>\n\n<p>Then I remembered
228
- Roger Hyam''s flight of fantasy over a decade ago: <a href=\"http://www.hyam.net/blog/archives/1007\">SpeciesIndex.org
229
- – an impractical, practical solution</a>. He proposed the following rules
230
- to generate a unique URI for a taxonomic name:\n\n<ul>\n <li>The URI must
231
- start with \"http://speciesindex.org\" followed by one or more of the following
232
- separated by slashes.</li>\n\n <li>First word of name. Must only contain
233
- letters. Must not be the same as one of the names of the nomenclatural codes
234
- (icbn or iczn). Optional but highly recommended.</li> \n\n <li>Second word
235
- of name. Must only contain letters and not be a nomenclatural code name. Optional.</li>
236
- \n\n <li>Third word of name. Must only contain letters and not be a nomenclatural
237
- code name. Optional.</li> \n\n <li>Year of publication. Must be an integer
238
- greater than 1650 and equal to or less than the current year. If this is an
239
- ICZN name then this should be the year the species (epithet) was published
240
- as is commonly cited after the name. If this is an ICBN name at species or
241
- below then it is the date of the combination. Optional. Recommended for zoological
242
- names if known. Not recommended for botanical names unless there is a known
243
- problem with homonyms in use by non-taxonomists.</li>\n \n<li>Nomenclatural
244
- code governing the name of the taxon. Currently this must be either ''icbn''
245
- or ''iczn''. This may be omitted if the code is unknown or not relevant. Other
246
- codes may be added to this list.</li> \n <li>Qualifier This must be a Version
247
- 4 RFC-4122 UUID. Optional. Used to generate a new independent identifier for
248
- a taxon for which the conventional name is unknown or does not exist or to
249
- indicate a particular taxon concept that bears the embedded name.</li>\n\n <li>The
250
- whole speciesindex.org URI string should be considered case\nsensitive. Everything
251
- should be lower case apart from the first letter of words that are specified
252
- as having upper case in their relevant codes e.g. names at and above the rank
253
- of genus.</li>\n</ul>\n</p>\n\n<p>Roger is basically arging that while names
254
- aren''t unique (i.e., we have homonyms such as <i>Abronia</i>) they are pretty
255
- close to being so, and with a few tweaks we can come up with a unique representation.
256
- Another way to think about this if we had a database of all taxonomics, we
257
- could construct a <a href=\"https://en.wikipedia.org/wiki/Trie\">trie</a>
258
- and for each name find the shortest set of name parts (genus, species, etc),
259
- year, and code that gave us a unique string for that name. In many cases the
260
- species name may be all we need, in other cases we may need to add year and/or
261
- nomenclatural code to arrive at a unique string. \n\n</p>\n\n<p>What about
262
- bibliographic references? Well many of us will have databases (e.g., Endnote,
263
- Mendeley, Zotero, etc.) which generate \"cite keys\". These are typically
264
- short, memorable identifiers for a reference that are unique within that database.
265
- There is an interesting discussion on the <a href=\"https://discourse.jabref.org/t/universal-citekey-generator/2441/2\">JabRef
266
- forum</a> regarding a \"Universal Citekey Generator\", and source code is
267
- available <a href=\"https://github.com/cparnot/universal-citekey-js\">cparnot/universal-citekey-js</a>.
268
- I''ve yet to explore this in detail, but it looks a promising way to generate
269
- unique identifiers from basic metadata (echos of more elaborate schemes such
270
- as <a href=\"https://en.wikipedia.org/wiki/Serial_Item_and_Contribution_Identifier\">SICIs</a>).
271
- For example,\n\n<blockquote>Senna AR, Guedes UN, Andrade LF, Pereira-Filho
272
- GH. 2021. A new species of amphipod Pariphinotus Kunkel, 1910 (Amphipoda:
273
- Phliantidae) from Southwestern Atlantic. Zool Stud 60:57. doi:10.6620/ZS.2021.60-57.</blockquote>\n\nbecomes
274
- \"Senna:2021ck\". So if two people have the same, core, metadata for a paper
275
- they can generate the same key.</p>\n\n<p>Hence it seems with a few conventions
276
- (and maybe some simple tools to support them) we could have decentralised
277
- wiki-like tools that used the same identifiers for the same things, and yet
278
- those identfiiers were short and human-friendly.</p>","tags":["citekey","identfiiers","markdown","obsidian","Roger
279
- Hyam"],"language":"en","references":null},{"id":"https://doi.org/10.59350/j77nc-e8x98","uuid":"c6b101f4-bfbc-4d01-921d-805c43c85757","url":"https://iphylo.blogspot.com/2022/08/linking-taxonomic-names-to-literature.html","title":"Linking
280
- taxonomic names to the literature","summary":"Just some thoughts as I work
281
- through some datasets linking taxonomic names to the literature. In the diagram
282
- above I''ve tried to capture the different situatios I encounter. Much of
283
- the work I''ve done on this has focussed on case 1 in the diagram: I want
284
- to link a taxonomic name to an identifier for the work in which that name
285
- was published. In practise this means linking names to DOIs. This has the
286
- advantage of linking to a citable indentifier, raising questions such as whether
287
- citations...","date_published":"2022-08-22T17:19:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
288
- (Roderic Page)"}],"image":null,"content_html":"Just some thoughts as I work
289
- through some datasets linking taxonomic names to the literature.\n\n<div class=\"separator\"
290
- style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5ZkbnumEWKZpf0isei_hUJucFlOOK08-SKnJknD8B4qhAX36u-vMQRnZdhRJK5rPb7HNYcnqB7qE4agbeStqyzMkWHrzUj14gPkz2ohmbOVg8P_nHo0hM94PD1wH3SPJsaLAumN8vih3ch9pjH2RaVWZBLwwhGhNu0FS1m5z6j5xt2NeZ4w/s2140/linking%20to%20names144.jpg\"
291
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
292
- border=\"0\" height=\"600\" data-original-height=\"2140\" data-original-width=\"1604\"
293
- src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5ZkbnumEWKZpf0isei_hUJucFlOOK08-SKnJknD8B4qhAX36u-vMQRnZdhRJK5rPb7HNYcnqB7qE4agbeStqyzMkWHrzUj14gPkz2ohmbOVg8P_nHo0hM94PD1wH3SPJsaLAumN8vih3ch9pjH2RaVWZBLwwhGhNu0FS1m5z6j5xt2NeZ4w/s600/linking%20to%20names144.jpg\"/></a></div>\n\n<p>In
294
- the diagram above I''ve tried to capture the different situatios I encounter.
295
- Much of the work I''ve done on this has focussed on case 1 in the diagram:
296
- I want to link a taxonomic name to an identifier for the work in which that
297
- name was published. In practise this means linking names to DOIs. This has
298
- the advantage of linking to a citable indentifier, raising questions such
299
- as whether citations of taxonmic papers by taxonomic databases could become
300
- part of a <a href=\"https://iphylo.blogspot.com/2022/08/papers-citing-data-that-cite-papers.html\">taxonomist''s
301
- Google Scholar profile</a>.</p>\n\n<p>In many taxonomic databases full work-level
302
- citations are not the norm, instead taxonomists cite one or more pages within
303
- a work that are relevant to a taxonomic name. These \"microcitations\" (what
304
- the U.S. legal profession refer to as \"point citations\" or \"pincites\", see
305
- <a href=\"https://rasmussen.libanswers.com/faq/283203\">What are pincites,
306
- pinpoints, or jump legal references?</a>) require some work to map to the
307
- work itself (which is typically the thing that has a citatble identifier such
308
- as a DOI).</p>\n\n<p>Microcitations (case 2 in the diagram above) can be quite
309
- complex. Some might simply mention a single page, but others might list a
310
- series of (not necessarily contiguous) pages, as well as figures, plates etc.
311
- Converting these to citable identifiers can be tricky, especially as in most
312
- cases we don''t have page-level identifiers. The Biodiversity Heritage Library
313
- (BHL) does have URLs for each scanned page, and we have a standard for referring
314
- to pages in a PDF (<code>page=&lt;pageNum&gt;</code>, see <a href=\"https://datatracker.ietf.org/doc/html/rfc8118\">RFC
315
- 8118</a>). But how do we refer to a set of pages? Do we pick the first page?
316
- Do we try and represent a set of pages, and if so, how?</p>\n\n<p>Another
317
- issue with page-level identifiers is that not everything on a given page may
318
- be relevant to the taxonomic name. In case 2 above I''ve shaded in the parts
319
- of the pages and figure that refer to the taxonomic name. An example where
320
- this can be problematic is the recent test case I created for BHL where a
321
- page image was included for the taxonomic name <a href=\"https://www.gbif.org/species/195763322\"><i>Aphrophora
322
- impressa</i></a>. The image includes the species description and a illustration,
323
- as well as text that relates to other species.</p>\n\n<div class=\"separator\"
324
- style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1SAv1XiVBPHXeMLPfdsnTh4Sj4m0AQyqjXM0faXyvbNhBXFDl7SawjFIkMo3qz4RQhDEyZXkKAh5nF4gtfHbVXA6cK3NJ46CuIWECC6HmJKtTjoZ0M3QQXvLY_X-2-RecjBqEy68M0cEr0ph3l6KY51kA9BvGt9d4id314P71PBitpWMATg/s3467/https---www.biodiversitylibrary.org-pageimage-29138463.jpeg\"
325
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
326
- border=\"0\" height=\"400\" data-original-height=\"3467\" data-original-width=\"2106\"
327
- src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1SAv1XiVBPHXeMLPfdsnTh4Sj4m0AQyqjXM0faXyvbNhBXFDl7SawjFIkMo3qz4RQhDEyZXkKAh5nF4gtfHbVXA6cK3NJ46CuIWECC6HmJKtTjoZ0M3QQXvLY_X-2-RecjBqEy68M0cEr0ph3l6KY51kA9BvGt9d4id314P71PBitpWMATg/s400/https---www.biodiversitylibrary.org-pageimage-29138463.jpeg\"/></a></div>\n\n<p>Given
328
- that not everything on a page need be relevant, we could extract just the
329
- relevant blocks of text and illustrations (e.g., paragraphs of text, panels
330
- within a figure, etc.) and treat that set of elements as the thing to cite.
331
- This is, of course, what <a href=\"http://plazi.org\">Plazi</a> are doing.
332
- The set of extracted blocks is glued together as a \"treatment\", assigned
333
- an identifier (often a DOI), and treated as a citable unit. It would be interesting
334
- to see to what extent these treatments are actually cited, for example, do
335
- subsequent revisions that cite works that include treatments cite those treatments,
336
- or just the work itself? Put another way, are we creating <a href=\"https://iphylo.blogspot.com/2012/09/decoding-nature-encode-ipad-app-omg-it.html\">\"threads\"</a>
337
- between taxonomic revisions?</p>\n\n<p>One reason for these notes is that
338
- I''m exploring uploading taxonomic name - literature links to <a href=\"https://www.checklistbank.org\">ChecklistBank</a>
339
- and case 1 above is easy, as is case 3 (if we have treatment-level identifiers).
340
- But case 2 is problematic because we are linking to a set of things that may
341
- not have an identifier, which means a decision has to be made about which
342
- page to link to, and how to refer to that page.</p>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/w18j9-v7j10","uuid":"d811172e-7798-403c-a83d-3d5317a9657e","url":"https://iphylo.blogspot.com/2022/08/papers-citing-data-that-cite-papers.html","title":"Papers
343
- citing data that cite papers: CrossRef, DataCite, and the Catalogue of Life","summary":"Quick
344
- notes to self following on from a conversation about linking taxonomic names
345
- to the literature. Is there a way to turn those links into countable citations
346
- (even if just one per database) for Google Scholar?&mdash; Wayne Maddison
347
- (@WayneMaddison) August 3, 2022 There are different sorts of citation: Paper
348
- cites another paper Paper cites a dataset Dataset cites a paper Citation
349
- type (1) is largely a solved problem (although there are issues of the ownership
350
- and use of this...","date_published":"2022-08-03T11:33:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
351
- (Roderic Page)"}],"image":null,"content_html":"Quick notes to self following
352
- on from a conversation about linking taxonomic names to the literature.\n\n<blockquote
353
- class=\"twitter-tweet\"><p lang=\"en\" dir=\"ltr\">Is there a way to turn
354
- those links into countable citations (even if just one per database) for Google
355
- Scholar?</p>&mdash; Wayne Maddison (@WayneMaddison) <a href=\"https://twitter.com/WayneMaddison/status/1554644747406348288?ref_src=twsrc%5Etfw\">August
356
- 3, 2022</a></blockquote> <script async src=\"https://platform.twitter.com/widgets.js\"
357
- charset=\"utf-8\"></script>\n\nThere are different sorts of citation:\n\n<ol>\n <li>Paper
358
- cites another paper</li>\n <li>Paper cites a dataset</li>\n <li>Dataset
359
- cites a paper</li>\n</ol>\n\nCitation type (1) is largely a solved problem
360
- (although there are issues of the ownership and use of this data, see e.g.
361
- <a href=\"https://iphylo.blogspot.com/2020/07/zootaxa-has-no-impact-factor.html\">Zootaxa
362
- has no impact factor</a>).\n\nCitation type (2) is becoming more widespread
363
- (but not perfect as GBIF''s <a href=\"https://twitter.com/search?q=%23citethedoi&src=typed_query\">#citethedoi</a>
364
- campaign demonstrates). But the idea is well accepted and there are guides
365
- to how to do it, e.g.:\n\n<blockquote>\nCousijn, H., Kenall, A., Ganley, E.
366
- et al. A data citation roadmap for scientific publishers. Sci Data 5, 180259
367
- (2018). <a href=\"https://doi.org/10.1038/sdata.2018.259\">https://doi.org/10.1038/sdata.2018.259</a>\n</blockquote>\n\nHowever,
368
- things do get problematic because most (but not all) DOIs for publications
369
- are managed by CrossRef, which has an extensive citation database linking
370
- papers to other papers. Most datasets have DataCite DOIs, and DataCite manages
371
- its own citation links, but as far as I''m aware these two systems don''t
372
- really talk to each other.\n\nCitation type (3) is the case where a database
373
- is largely based on the literature, which applies to taxonomy. Taxonomic databases
374
- are essentially collections of literature that have opinions on taxa, and
375
- the database may simply compile those (e.g., a nomenclator), or come to some
376
- view on the applicability of each name. In an ideal world, each reference
377
- included in a taxonomic database would gain a citation, which would help better
378
- reflect the value of that work (a long standing bone of contention for taxonomists).\n\nIt
379
- would be interesting to explore these issues further. CrossRef and DataCite
380
- do share <a href=\"https://www.crossref.org/services/event-data/\">Event Data</a>
381
- (see also <a href=\"https://support.datacite.org/docs/eventdata-guide\">DataCite
382
- Event Data</a>). Can this track citations of papers by a dataset?\n \n \nMy
383
- take on Wayne''s question:\n\n<blockquote>\n Is there a way to turn those
384
- links into countable citations (even if just one per database) for Google
385
- Scholar?\n</blockquote>\n\nis that he''s after type 3 citations, which
386
- I don''t think we have a way to handle just yet (but I''d need to look at
387
- Event Data a bit more). Google Scholar is a black box, and the academic community''s
388
- reliance on it for metrics is troubling. But it would be interesting to try
389
- and figure out if there is a way to get Google Scholar to index the citations
390
- of taxonomic papers by databases. For instance, the <a href=\"https://www.catalogueoflife.org/\">Catalogue
391
- of Life</a> has an ISSN <a href=\"https://portal.issn.org/resource/ISSN/2405-884X\">2405-884X</a>
392
- so it can be treated as a publication. At the moment its web pages have lots
393
- of identifiers for people managing data and their organisations (lots of <a
394
- href=\"https://orcid.org\">ORCIDs</a> and <a href=\"https://ror.org\">RORs</a>,
395
- and DOIs for individual datasets (e.g., <a href=\"https://www.checklistbank.org/dataset/9828/about\">checklistbank.org</a>)
396
- but precious little in the way of DOIs for publications (or, indeed, ORCIDs
397
- for taxonomists). What would it take for taxonomic publications in the Catalogue
398
- of Life to be treated as first class citations?","tags":["Catalogue of Life","citation","CrossRef","DataCite","DOI"],"language":"en","references":null},{"id":"https://doi.org/10.59350/ws094-1w310","uuid":"6bed78ec-0029-4096-b1c3-48a55a9fdb3b","url":"https://iphylo.blogspot.com/2023/04/chatgpt-of-course.html","title":"ChatGPT,
79
+ 25,...","published_at":1685553960,"updated_at":1685554180,"indexed_at":1689006804,"authors":[{"url":"https://orcid.org/0000-0002-7101-9767","name":"Roderic
80
+ Page"}],"image":null,"tags":[],"language":"en","reference":[]},{"id":"6bed78ec-0029-4096-b1c3-48a55a9fdb3b","doi":"https://doi.org/10.59350/ws094-1w310","url":"https://iphylo.blogspot.com/2023/04/chatgpt-of-course.html","title":"ChatGPT,
399
81
  of course","summary":"I haven’t blogged for a while, work and other reasons
400
82
  have meant I’ve not had much time to think, and mostly I blog to help me think.
401
83
  ChatGPT is obviously a big thing at the moment, and once we get past the moral
402
84
  panic (“students can pass exams using AI!”) there are a lot of interesting
403
- possibilities to explore. Inspired by essays such as How Q&amp;A systems based
85
+ possibilities to explore. Inspired by essays such as How Q&A systems based
404
86
  on large language models (eg GPT4) will change things if they become the dominant
405
- search paradigm — 9 implications for libraries...","date_published":"2023-04-03T12:52:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
406
- (Roderic Page)"}],"image":null,"content_html":"<p>I haven’t blogged for a
407
- while, work and <a href=\"https://iphylo.blogspot.com/2023/03/dugald-stuart-page-1936-2022.html\">other
408
- reasons</a> have meant I’ve not had much time to think, and mostly I blog
409
- to help me think.</p>\n<p>ChatGPT is obviously a big thing at the moment,
410
- and once we get past the moral panic (“students can pass exams using AI!”)
411
- there are a lot of interesting possibilities to explore. Inspired by essays
412
- such as <a href=\"https://medium.com/@aarontay/how-q-a-systems-based-on-large-language-models-eg-gpt4-will-change-things-if-they-become-the-norm-c7cf62736ba\">How
413
- Q&amp;A systems based on large language models (eg GPT4) will change things
414
- if they become the dominant search paradigm — 9 implications for libraries</a>
415
- and <a href=\"https://about.sourcegraph.com/blog/cheating-is-all-you-need\">Cheating
416
- is All You Need</a>, as well as [<a href=\"https://paul-graham-gpt.vercel.app/\">Paul
417
- Graham GPT</a>](<a href=\"https://paul-graham-gpt.vercel.app\">https://paul-graham-gpt.vercel.app</a>)
418
- I thought I’d try a few things and see where this goes.</p>\n<p>ChatGPT can
419
- do some surprising things.</p>\n<h4 id=\"parse-bibliographic-data\">Parse
420
- bibliographic data</h4>\n<p>I spend a LOT of time working with bibliographic
421
- data, trying to parse it into structured data. ChatGPT can do this:</p>\n\n<div
422
- class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2bA-Xk6D2wfexP0cs-U3AQa534hNW7J06I0avM82WD3dZSiRvXyHFScv3Hb8QfKKp1fk4GU4olhv1BOq8Pqu4IJgptlBap3xcLI1bFJwB3EOmHKxqf4iy2exJy4vwZ2n4I0U0JVui4Xmhvdy3mTbMSXCRbmCglNFULi5oQEhtPVG4gLJjdw/s924/Screenshot%202023-04-03%20at%2012.59.30.png\"
423
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
424
- border=\"0\" height=\"400\" data-original-height=\"924\" data-original-width=\"738\"
425
- src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2bA-Xk6D2wfexP0cs-U3AQa534hNW7J06I0avM82WD3dZSiRvXyHFScv3Hb8QfKKp1fk4GU4olhv1BOq8Pqu4IJgptlBap3xcLI1bFJwB3EOmHKxqf4iy2exJy4vwZ2n4I0U0JVui4Xmhvdy3mTbMSXCRbmCglNFULi5oQEhtPVG4gLJjdw/s400/Screenshot%202023-04-03%20at%2012.59.30.png\"/></a></div>\n\n<p>Note
426
- that it does more than simply parse the strings, it expands journal abbreviations
427
- such as “J. Malay Brch. R. Asiat. Soc.” to the full name “Journal of the Malayan
428
- Branch of the Royal Asiatic Society”. So we can get clean, parsed data in
429
- a range of formats.</p>\n<h4 id=\"parse-specimens\">Parse specimens</h4>\n<p>Based
430
- on the success with parsing bibliographic strings I wondered how well it could
431
- handle citation software specimens (“material examined”). Elsewhere I’ve been
432
- critical of Plazi’s ability to do this, see <a href=\"https://iphylo.blogspot.com/2021/10/problems-with-plazi-parsing-how.html\">Problems
433
- with Plazi parsing: how reliable are automated methods for extracting specimens
434
- from the literature?</a>.</p>\n<p>For example, given this specimen record
435
- on p. 130 of <a href=\"https://doi.org/10.5852/ejt.2021.775.1553\">doi:10.5852/ejt.2021.775.1553</a></p>\n<blockquote>\n<p>LAOS
436
- • Kammoune Province, Bunghona Market, 7 km Nof Xe Bangfai River;<br>\n17.13674°
437
- N, 104.98591° E; E. Jeratthitikul, K. Wisittikoson, A. Fanka, N. Wutthituntisil
438
- and P. Prasankok leg.; sold by local people;<br>\nMUMNH-UNI2831.</p>\n</blockquote>\n<p>ChatGPT
439
- extracted a plausible Darwin Core record:</p>\n\n<div class=\"separator\"
440
- style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgIwVYuPXl47t1uSpJ3xpbsIT60UUWeKTKSBpIQgD8L2gkj_fuNf4sZKR76JgsfVYYHPs9G9VoM-caUhG1P3irMyB6erSYwYI5LDlLkxdrAFWdbY3ExDs34nMVOF0tq-NBtsDKMZsSJWjhJaGt9v3dtAqWvRyVMNZ46eG2sRhJTICTkFghKtw/s901/Screenshot%202023-04-03%20at%2013.30.54.png\"
441
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
442
- border=\"0\" height=\"400\" data-original-height=\"901\" data-original-width=\"764\"
443
- src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgIwVYuPXl47t1uSpJ3xpbsIT60UUWeKTKSBpIQgD8L2gkj_fuNf4sZKR76JgsfVYYHPs9G9VoM-caUhG1P3irMyB6erSYwYI5LDlLkxdrAFWdbY3ExDs34nMVOF0tq-NBtsDKMZsSJWjhJaGt9v3dtAqWvRyVMNZ46eG2sRhJTICTkFghKtw/s400/Screenshot%202023-04-03%20at%2013.30.54.png\"/></a></div>\n\n<p>I’ve
444
- been experimenting with parsing specimen records using the same machine learning
445
- approach for bibliographic data (e.g., <a href=\"https://iphylo.blogspot.com/2021/07/citation-parsing-released.html\">Citation
446
- parsing tool released</a>), perhaps it’s time to rethink that idea.</p>\n<h4
447
- id=\"correct-ocr-errors\">Correct OCR errors</h4>\n<p>OCR errors are a well
448
- known limitation of the Biodiversity Heritage Library (BHL), see for example
449
- <a href=\"https://blog.biodiversitylibrary.org/2022/07/ocr-improvements-early-analysis.html\">OCR
450
- Improvements: An Early Analysis</a>. Given that ChatGPT is built on top of
451
- a large language model it seems obvious to see how it handles OCR errors.
452
- I tried the following snippet of text from <a href=\"https://www.biodiversitylibrary.org/page/4312227\">BHL
453
- page 4312227</a>:</p>\n<blockquote>\n<p>i BREVIORA No. 185</p>\n<p>Jiiaii
454
- Jr., 16, who always join him and Hedriek in th(&gt; search for<br>\nfrogs.
455
- Iloraeio Mayorga, research assistant to tlie author, has also<br>\nbeen (wtreniely
456
- useful, not only in field work, but also in the<br>\ntedious curatorial and
457
- laboratory work that usually fol- lows. Dr.<br>\nJohn Randall kindly took
458
- tlie photouraplis of E. hedricki, and the<br>\nGaliiiaues family of San Juan
459
- made their El Verde house available as<br>\na (MMiter of activities for the
460
- colleetiuff party. To all these<br>\njx’opic, the autlioi- feels (iee])ly
461
- indebted.</p>\n<p>Elet’tiierodactylus iiKnRrcKi s]). n.</p>\n<p>Type. Museum
462
- of Coniparative Zoology Xo. ;}()!)()8, c^ from El Verde,<br>\nwest flank
463
- of El Yuiuinc Puerto Hico. I.IOO ft. Coll. J. A. Kivero,<br>\n11 Aug-. I!)(i2.</p>\n</blockquote>\n<p>ChatGPT
464
- returns this:</p>\n<blockquote>\n<p>i BREVIORA No. 185</p>\n<p>Juan Jr., 16,
465
- who always joins him and Hedrick in the search for<br>\nfrogs. Horacio Mayorga,
466
- research assistant to the author, has also<br>\nbeen extremely useful, not
467
- only in field work, but also in the tedious<br>\ncuratorial and laboratory
468
- work that usually follows. Dr. John Randall<br>\nkindly took the photographs
469
- of E. hedricki, and the Galinanes family<br>\nof San Juan made their El Verde
470
- house available as a center of<br>\nactivities for the collecting party. To
471
- all these people, the author<br>\nfeels deeply indebted.</p>\n<p>Eleutherodactylus
472
- hedricki sp. n.</p>\n<p>Type. Museum of Comparative Zoology No. 30908, c?
473
- from El Verde, west<br>\nflank of El Yunque, Puerto Rico. 1100 ft. Coll. J.
474
- A. Rivero, 11 Aug.<br>\n1962.</p>\n</blockquote>\n\n<div class=\"separator\"
475
- style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibZLxXSTdt64SMhfqS8NCcLOxbVaFmkhMQx0yEUWcm696EL6k2_VgxZFEDT-SEqOChmwDggoQLkuU6DpcOF_d-nkQRKzQLt65WKfkFlIzDRV9LVrDqmu-D0OZl1p5Vva4QUdvDb3JpeKPPDXZ-BcZqq8V4PyGepxlnwtt7-q4Sg6_aHD58gw/s859/Screenshot%202023-04-03%20at%2013.07.53.png\"
476
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
477
- border=\"0\" width=\"400\" data-original-height=\"510\" data-original-width=\"859\"
478
- src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibZLxXSTdt64SMhfqS8NCcLOxbVaFmkhMQx0yEUWcm696EL6k2_VgxZFEDT-SEqOChmwDggoQLkuU6DpcOF_d-nkQRKzQLt65WKfkFlIzDRV9LVrDqmu-D0OZl1p5Vva4QUdvDb3JpeKPPDXZ-BcZqq8V4PyGepxlnwtt7-q4Sg6_aHD58gw/s400/Screenshot%202023-04-03%20at%2013.07.53.png\"/></a></div>\n\n<p>Comparing
479
- this to the scanned image, ChatGPT does pretty well; for example, the gobbledegook
480
- “Elet’tiierodactylus iiKnRrcKi” is correctly translated as “Eleutherodactylus
481
- hedricki”. Running all of BHL through ChatGPT probably isn’t feasible, but
482
- one could imagine targeted cleaning of key papers.</p>\n<h4 id=\"summary\">Summary</h4>\n<p>These
483
- small experiments are fairly trivial, but they are the sort of tedious tasks
484
- that would otherwise require significant programming (or other resources)
485
- to solve. But ChatGPT can do rather more, as I hope to discuss in the next
486
- post.</p>\n<blockquote>\n<p>Written with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/7esgr-61v1","uuid":"96fa91d5-459c-482f-aa38-dda6e0a30e20","url":"https://iphylo.blogspot.com/2022/01/large-graph-viewer-experiments.html","title":"Large
487
- graph viewer experiments","summary":"I keep returning to the problem of viewing
488
- large graphs and trees, which means my hard drive has accumulated lots of
489
- failed prototypes. Inspired by some recent discussions on comparing taxonomic
490
- classifications I decided to package one of these (wildly incomplete) prototypes
491
- up so that I can document the idea and put the code somewhere safe. Very cool,
492
- thanks for sharing this-- the tree diff is similar to what J Rees has been
493
- cooking up lately with his &#39;cl diff&#39; tool. I&#39;ll tag...","date_published":"2022-01-02T11:25:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
494
- (Roderic Page)"}],"image":null,"content_html":"<p>I keep returning to the
495
- problem of viewing large graphs and trees, which means my hard drive has accumulated
496
- lots of failed prototypes. Inspired by some recent discussions on comparing
497
- taxonomic classifications I decided to package one of these (wildly incomplete)
498
- prototypes up so that I can document the idea and put the code somewhere safe.</p>\n\n<blockquote
499
- class=\"twitter-tweet\"><p lang=\"en\" dir=\"ltr\">Very cool, thanks for sharing
500
- this-- the tree diff is similar to what J Rees has been cooking up lately
501
- with his &#39;cl diff&#39; tool. I&#39;ll tag <a href=\"https://twitter.com/beckettws?ref_src=twsrc%5Etfw\">@beckettws</a>
502
- in here too so he can see potential crossover. The goal is autogenerate diffs
503
- like this as 1st step to mapping taxo name-to concept</p>&mdash; Nate Upham
504
- (@n8_upham) <a href=\"https://twitter.com/n8_upham/status/1475834371131289608?ref_src=twsrc%5Etfw\">December
505
- 28, 2021</a></blockquote> <script async src=\"https://platform.twitter.com/widgets.js\"
506
- charset=\"utf-8\"></script>\n\n<h2>Google Maps-like viewer</h2>\n\n<div class=\"separator\"
507
- style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/a/AVvXsEiGDVVKGmwRi0gn-kEtpxBCUEdK-WCLBSpmkALAUDHGeTPkTSICSYVgnsj5N7zUeUfQALfFFHJJCsfeFvRULKbmqxLz51rW5hp_11dVXh-FHrnlRA7RJTA7I82l7sERF5jAjlah0LyEheVayO9nAfHTGZDuw5rnCe9iEO3dHQmA8_5AFIlJJg=s500\"
508
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
509
- border=\"0\" width=\"400\" data-original-height=\"448\" data-original-width=\"500\"
510
- src=\"https://blogger.googleusercontent.com/img/a/AVvXsEiGDVVKGmwRi0gn-kEtpxBCUEdK-WCLBSpmkALAUDHGeTPkTSICSYVgnsj5N7zUeUfQALfFFHJJCsfeFvRULKbmqxLz51rW5hp_11dVXh-FHrnlRA7RJTA7I82l7sERF5jAjlah0LyEheVayO9nAfHTGZDuw5rnCe9iEO3dHQmA8_5AFIlJJg=s400\"/></a></div>\n\n<p>I''ve
511
- created a simple viewer that uses a tiled map viewer (like Google Maps) to
512
- display a large graph. The idea is to draw the entire graph scaled to a 256
513
- x 256 pixel tile. The graph is stored in a database that supports geospatial
514
- queries, which means the queries to retrieve the individual tiles needed to
515
- display the graph at different levels of resolution are simply bounding box
516
- queries to a database. I realise that this description is cryptic at best.
517
- The GitHub repository <a href=\"https://github.com/rdmpage/gml-viewer\">https://github.com/rdmpage/gml-viewer</a>
518
- has more details and the code itself. There''s a lot to do, especially adding
519
- support for labels(!) which presents some interesting challenges (<a href=\"https://en.wikipedia.org/wiki/Level_of_detail_(computer_graphics)\">levels
520
- of detail</a> and <a href=\"https://en.wikipedia.org/wiki/Cartographic_generalization\">generalization</a>).
521
- The code doesn''t do any layout of the graph itself, instead I''ve used the
522
- <a href=\"https://www.yworks.com/products/yed\">yEd</a> tool to compute the
523
- x,y coordinates of the graph.</p>\n\n<p>Since this exercise was inspired by
524
- a discussion of the <a href=\"https://www.mammaldiversity.org\">ASM Mammal
525
- Diversity Database</a>, the graph I''ve used for the demonstration above is
526
- the ASM classification of extant mammals. I guess I need to solve the labelling
527
- issue fairly quickly!</p>","tags":["Google Maps","graph","Mammal Species of
528
- the World","mammals","taxonomy"],"language":"en","references":null},{"id":"https://doi.org/10.59350/m48f7-c2128","uuid":"8aea47e4-f227-45f4-b37b-0454a8a7a3ff","url":"https://iphylo.blogspot.com/2023/04/chatgpt-semantic-search-and-knowledge.html","title":"ChatGPT,
529
- semantic search, and knowledge graphs","summary":"One thing about ChatGPT
530
- is it has opened my eyes to some concepts I was dimly aware of but am only
531
- now beginning to fully appreciate. ChatGPT enables you to ask it questions, but
532
- the answers depend on what ChatGPT “knows”. As several people have noted,
533
- what would be even better is to be able to run ChatGPT on your own content.
534
- Indeed, ChatGPT itself now supports this using plugins. Paul Graham GPT However,
535
- it’s still useful to see how to add ChatGPT functionality to your own content
536
- from...","date_published":"2023-04-03T15:30:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
537
- (Roderic Page)"}],"image":null,"content_html":"<p>One thing about ChatGPT
538
- is it has opened my eyes to some concepts I was dimly aware of but am only
539
- now beginning to fully appreciate. ChatGPT enables you to ask it questions, but
540
- the answers depend on what ChatGPT “knows”. As several people have noted,
541
- what would be even better is to be able to run ChatGPT on your own content.
542
- Indeed, ChatGPT itself now supports this using <a href=\"https://openai.com/blog/chatgpt-plugins\">plugins</a>.</p>\n<h4
543
- id=\"paul-graham-gpt\">Paul Graham GPT</h4>\n<p>However, it’s still useful
544
- to see how to add ChatGPT functionality to your own content from scratch.
545
- A nice example of this is <a href=\"https://paul-graham-gpt.vercel.app/\">Paul
546
- Graham GPT</a> by <a href=\"https://twitter.com/mckaywrigley\">Mckay Wrigley</a>.
547
- Mckay Wrigley took essays by Paul Graham (a well known venture capitalist)
548
- and built a question and answer tool very like ChatGPT.</p>\n<iframe width=\"560\"
549
- height=\"315\" src=\"https://www.youtube.com/embed/ii1jcLg-eIQ\" title=\"YouTube
550
- video player\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write;
551
- encrypted-media; gyroscope; picture-in-picture; web-share\" allowfullscreen></iframe>\n<p>Because
552
- you can send a block of text to ChatGPT (as part of the prompt) you can get
553
- ChatGPT to summarise or transform that information, or answer questions based
554
- on that information. But there is a limit to how much information you can
555
- pack into a prompt. You can’t put all of Paul Graham’s essays into a prompt
556
- for example. So a solution is to do some preprocessing. For example, given
557
- a question such as “How do I start a startup?” we could first find the essays
558
- that are most relevant to this question, then use them to create a prompt
559
- for ChatGPT. A quick and dirty way to do this is simply do a text search over
560
- the essays and take the top hits. But we aren’t searching for words, we are
561
- searching for answers to a question. The essay with the best answer might
562
- not include the phrase “How do I start a startup?”.</p>\n<h4 id=\"semantic-search\">Semantic
563
- search</h4>\n<p>Enter <a href=\"https://en.wikipedia.org/wiki/Semantic_search\">Semantic
564
- search</a>. The key concept behind semantic search is that we are looking
565
- for documents with similar meaning, not just similarity of text. One approach
566
- to this is to represent documents by “embeddings”, that is, a vector of numbers
567
- that encapsulate features of the document. Documents with similar vectors
568
- are potentially related. In semantic search we take the query (e.g., “How
569
- do I start a startup?”), compute its embedding, then search among the documents
570
- for those with similar embeddings.</p>\n<p>To create Paul Graham GPT Mckay
571
- Wrigley did the following. First he sent each essay to the OpenAI API underlying
572
- ChatGPT, and in return he got the embedding for that essay (a vector of 1536
573
- numbers). Each embedding was stored in a database (Mckay uses Postgres with
574
- <a href=\"https://github.com/pgvector/pgvector\">pgvector</a>). When a user
575
- enters a query such as “How do I start a startup?” that query is also sent
576
- to the OpenAI API to retrieve its embedding vector. Then we query the database
577
- of embeddings for Paul Graham’s essays and take the top five hits. These hits
578
- are, one hopes, the most likely to contain relevant answers. The original
579
- question and the most similar essays are then bundled up and sent to ChatGPT
580
- which then synthesises an answer. See his <a href=\"https://github.com/mckaywrigley/paul-graham-gpt\">GitHub
581
- repo</a> for more details. Note that we are still using ChatGPT, but on a
582
- set of documents it doesn’t already have.</p>\n<h4 id=\"knowledge-graphs\">Knowledge
583
- graphs</h4>\n<p>I’m a fan of knowledge graphs, but they are not terribly easy
584
- to use. For example, I built a knowledge graph of Australian animals <a href=\"https://ozymandias-demo.herokuapp.com\">Ozymandias</a>
585
- that contains a wealth of information on taxa, publications, and people, wrapped
586
- up in a web site. If you want to learn more you need to figure out how to
587
- write queries in SPARQL, which is not fun. Maybe we could use ChatGPT to write
588
- the SPARQL queries for us, but it would be much more fun to simply ask
589
- natural language queries (e.g., “who are the experts on Australian ants?”).
590
- I made some naïve notes on these ideas <a href=\"https://iphylo.blogspot.com/2015/09/possible-project-natural-language.html\">Possible
591
- project: natural language queries, or answering “how many species are there?”</a>
592
- and <a href=\"https://iphylo.blogspot.com/2019/05/ozymandias-meets-wikipedia-with-notes.html\">Ozymandias
593
- meets Wikipedia, with notes on natural language generation</a>.</p>\n<p>Of
594
- course, this is a well known problem. Tools such as <a href=\"http://rdf2vec.org\">RDF2vec</a>
595
- can take RDF from a knowledge graph and create embeddings which could in turn
596
- be used to support semantic search. But it seems to me that we could simplify
597
- this process a bit by making use of ChatGPT.</p>\n<p>Firstly we would generate
598
- natural language statements from the knowledge graph (e.g., “species x belongs
599
- to genus y and was described in z”, “this paper on ants was authored by x”,
600
- etc.) that cover the basic questions we expect people to ask. We then get
601
- embeddings for these (e.g., using OpenAI). We then have an interface where
602
- people can ask a question (“is species x a valid species?”, “who has published
603
- on ants”, etc.), we get the embedding for that question, retrieve natural
604
- language statements that are the closest in embedding “space”, package everything
605
- up and ask ChatGPT to summarise the answer.</p>\n<p>The trick, of course,
606
- is to figure out how to generate natural language statements from the knowledge
607
- graph (which amounts to deciding what paths to traverse in the knowledge graph,
608
- and how to write those paths in something approximating English). We also
609
- want to know something about the sorts of questions people are likely to ask
610
- so that we have a reasonable chance of having the answers (for example, are
611
- people going to ask about individual species, or questions about summary statistics
612
- such as numbers of species in a genus, etc.).</p>\n<p>What makes this attractive
613
- is that it seems a straightforward way to go from a largely academic exercise
614
- (build a knowledge graph) to something potentially useful (a question and
615
- answer machine). Imagine if something like the defunct BBC wildlife site (see
616
- <a href=\"https://iphylo.blogspot.com/2017/12/blue-planet-ii-bbc-and-semantic-web.html\">Blue
617
- Planet II, the BBC, and the Semantic Web: a tale of lessons forgotten and
618
- opportunities lost</a>) revived <a href=\"https://aspiring-look.glitch.me\">here</a>
619
- had a question and answer interface where we could ask questions rather than
620
- passively browse.</p>\n<h4 id=\"summary\">Summary</h4>\n<p>I have so much
621
- more to learn, and need to think about ways to incorporate semantic search
622
- and ChatGPT-like tools into knowledge graphs.</p>\n<blockquote>\n<p>Written
623
- with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/rfxj3-x6739","uuid":"6a4d5c44-f4a9-4d40-a32c-a4d5e512c55a","url":"https://iphylo.blogspot.com/2022/05/thoughts-on-treebase-dying.html","title":"Thoughts
624
- on TreeBASE dying(?)","summary":"@rvosa is Naturalis no longer hosting Treebase?
625
- https://t.co/MBRgcxaBmR&mdash; Hilmar Lapp (@hlapp) May 10, 2022 So it looks
626
- like TreeBASE is in trouble, its legacy Java code a victim of security issues.
627
- Perhaps this is a chance to rethink TreeBASE, assuming that a repository of
628
- published phylogenies is still considered a worthwhile thing to have (and
629
- I think that question is open). Here''s what I think could be done. The data
630
- (individual studies with trees and data) are packaged into...","date_published":"2022-05-11T16:53:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
631
- (Roderic Page)"}],"image":null,"content_html":"<blockquote class=\"twitter-tweet\"><p
632
- lang=\"en\" dir=\"ltr\"><a href=\"https://twitter.com/rvosa?ref_src=twsrc%5Etfw\">@rvosa</a>
633
- is Naturalis no longer hosting Treebase? <a href=\"https://t.co/MBRgcxaBmR\">https://t.co/MBRgcxaBmR</a></p>&mdash;
634
- Hilmar Lapp (@hlapp) <a href=\"https://twitter.com/hlapp/status/1524166490798309381?ref_src=twsrc%5Etfw\">May
635
- 10, 2022</a></blockquote> <script async src=\"https://platform.twitter.com/widgets.js\"
636
- charset=\"utf-8\"></script>\n\n<p>So it looks like <a href=\"http://treebase.org\">TreeBASE</a>
637
- is in trouble, its legacy Java code a victim of security issues. Perhaps
638
- this is a chance to rethink TreeBASE, assuming that a repository of published
639
- phylogenies is still considered a worthwhile thing to have (and I think that
640
- question is open).</p>\n\n<p>Here''s what I think could be done.</p>\n\n<ol>\n<li>\nThe
641
- data (individual studies with trees and data) are packaged into whatever format
642
- is easiest (NEXUS, XML, JSON) and uploaded to a repository such as <a href=\"https://zenodo.org\">Zenodo</a>
643
- for long term storage. They get DOIs for citability. This becomes the default
644
- storage for TreeBASE.\n</li>\n<li>\nThe data is transformed into JSON and
645
- indexed using Elasticsearch. A simple web interface is placed on top so that
646
- people can easily find trees (never a strong point of the original TreeBASE).
647
- Trees are displayed natively on the web using SVG. The number one goal is
648
- for people to be able to find trees, view them, and download them.\n</li>\n<li>\nTo
649
- add data to TreeBASE the easiest way would be for people to upload them direct
650
- to Zenodo and tag them \"treebase\". A bot then grabs a feed of these datasets
651
- and adds them to the search engine in (1) above. As time allows, add an interface
652
- where people upload data directly, it gets curated, then deposited in Zenodo.
653
- This presupposes that there are people available to do curation. Maybe have
654
- \"stars\" for the level of curation so that users know whether anyone has
655
- checked the data.\n</li>\n</ol>\n\n<p>There''s lots of details to tweak, for
656
- example how many of the existing URLs for studies are preserved (some URL
657
- mapping), and what about the API? And I''m unclear about the relationship
658
- with <a href=\"https://datadryad.org\">Dryad</a>.</p>\n\n<p>My sense is that
659
- the TreeBASE code is very much of its time (10-15 years ago), a monolithic
660
- block of code with SQL, Java, etc. If one was starting from scratch today
661
- I don''t think this would be the obvious solution. Things have trended towards
662
- being simpler, with lots of building blocks now available in the cloud. Need
663
- a search engine? Just spin up a container in the cloud and you have one. More
664
- and more functionality can be devolved elsewhere.</p>\n\n<p>Another other
665
- issue is how to support TreeBASE. It has essentially been a volunteer effort
666
- to date, with little or no funding. One reason I think having Zenodo as a
667
- storage engine is that it takes care of long term sustainability of the data.</p>\n\n<p>I
668
- realise that this is all wild arm waving, but maybe now is the time to reinvent
669
- TreeBASE?</p>\n\n<h2>Updates</h2>\n\n<p>It''s been a while since I''ve paid
670
- a lot of attention to phylogenetic databases, and it shows. There is a file-based
671
- storage system for phylogenies <a href=\"https://github.com/OpenTreeOfLife/phylesystem-1\">phylesystem</a>
672
- (see \"Phylesystem: a git-based data store for community-curated phylogenetic
673
- estimates\" <a href=\"https://doi.org/10.1093/bioinformatics/btv276\">https://doi.org/10.1093/bioinformatics/btv276</a>)
674
- that is sort of what I had in mind, although long term persistence is based
675
- on GitHub rather than a repository such as Zenodo. Phylesystem uses a truly
676
- horrible-looking JSON transformation of <a href=\"http://nexml.github.io\">NeXML</a>
677
- (NeXML itself is ugly), and TreeBASE also supports NeXML, so some form of
678
- NeXML or a JSON transformation seems the obvious storage format. It will probably
679
- need some cleaning and simplification if it is to be indexed easily. Looking
680
- back over the long history of TreeBASE and phylogenetic databases I''m struck
681
- by how much complexity has been introduced over time. I think the tech has
682
- gotten in the way sometimes (which might just be another way of saying that
683
- I''m not smart enough to make sense of it all.</p>\n\n<p>So we could imagine
684
- a search engine that covers both TreeBASE and <a href=\"https://tree.opentreeoflife.org/curator\">Open
685
- Tree of Life studies</a>.</p>\n\n<p>Basic metadata-based searches would be
686
- straightforward, and we could have a user interface that highlights the trees
687
- (I think TreeBASE''s biggest search rival is a Google image search). The harder
688
- problem is searching by tree structure, for which there is an interesting
689
- literature without any decent implementations that I''m aware of (as I said,
690
- I''ve been out of this field a while).</p>\n\n<p>So my instinct is we could
691
- go a long way with simply indexing JSON (CouchDB or Elasticsearch), then need
692
- to think a bit more cleverly about higher taxon and tree based searching.
693
- I''ve always thought that one killer query would be not so much \"show me
694
- all the trees for my taxon\" but \"show me a synthesis of the trees for my
695
- taxon\". Imagine a supertree of recent studies that we could use as a summary
696
- of our current knowledge, or a visualisation that summarises where there are
697
- conflicts among the trees.</p>\n\n<h3>Relevant code and sites</h3>\n\n<ul>\n<li><a
698
- href=\"https://github.com/rdmpage/cdaotools\">CDAO Tools</a>, see \"CDAO-Store:
699
- Ontology-driven Data Integration for Phylogenetic Analysis\" <a href=\"https://doi.org/10.1186/1471-2105-12-98\">https://doi.org/10.1186/1471-2105-12-98</a></li>\n<li><a
700
- href=\"https://github.com/NESCent/phylocommons\">PhyloCommons</a></li>\n</ul>","tags":["phylogeny","TreeBASE"],"language":"en","references":null},{"id":"https://doi.org/10.59350/jzvs4-r9559","uuid":"23fa1dd8-5c6b-4aa9-9cad-c6f6b14ae9e0","url":"https://iphylo.blogspot.com/2021/08/json-ld-in-wild-examples-of-how.html","title":"JSON-LD
87
+ search paradigm — 9 implications for libraries...","published_at":1680526320,"updated_at":1680526621,"indexed_at":1689006804,"authors":[{"url":"https://orcid.org/0000-0002-7101-9767","name":"Roderic
88
+ Page"}],"image":null,"tags":[],"language":"en","reference":[]},{"id":"7d814863-43b5-4faf-a475-da8de5efd3ef","doi":"https://doi.org/10.59350/m7gb7-d7c49","url":"https://iphylo.blogspot.com/2022/02/duplicate-dois-again.html","title":"Duplicate
89
+ DOIs (again)","summary":"This blog post provides some background to a recent
90
+ tweet where I expressed my frustration about the duplication of DOIs for the
91
+ same article. I''m going to document the details here. The DOI that alerted
92
+ me to this problem is https://doi.org/10.2307/2436688 which is for the article
93
+ Snyder, W. C., & Hansen, H. N. (1940). THE SPECIES CONCEPT IN FUSARIUM. American
94
+ Journal of Botany, 27(2), 64–67. This article is hosted by JSTOR at https://www.jstor.org/stable/2436688
95
+ which displays the DOI...","published_at":1644332760,"updated_at":1644332778,"indexed_at":1689006804,"authors":[{"url":"https://orcid.org/0000-0002-7101-9767","name":"Roderic
96
+ Page"}],"image":null,"tags":["CrossRef","DOI","duplicates"],"language":"en","reference":[]},{"id":"23fa1dd8-5c6b-4aa9-9cad-c6f6b14ae9e0","doi":"https://doi.org/10.59350/jzvs4-r9559","url":"https://iphylo.blogspot.com/2021/08/json-ld-in-wild-examples-of-how.html","title":"JSON-LD
701
97
  in the wild: examples of how structured data is represented on the web","summary":"I''ve
702
98
  created a GitHub repository so that I can keep track of the examples of JSON-LD
703
99
  that I''ve seen being actively used, for example embedded in web sites, or
@@ -705,504 +101,95 @@ http_interactions:
705
101
  The list is by no means exhaustive, I hope to add more examples as I come
706
102
  across them. One reason for doing this is to learn what others are doing.
707
103
  For example, after looking at SciGraph''s JSON-LD I now see how an ordered
708
- list can be modelled in RDF in...","date_published":"2021-08-27T13:20:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
709
- (Roderic Page)"}],"image":null,"content_html":"<p>I''ve created a GitHub repository
710
- so that I can keep track of the examples of JSON-LD that I''ve seen being
711
- actively used, for example embedded in web sites, or accessed using an API.
712
- The repository is <a href=\"https://github.com/rdmpage/wild-json-ld\">https://github.com/rdmpage/wild-json-ld</a>.
713
- The list is by no means exhaustive, I hope to add more examples as I come
714
- across them.</p>\n\n<p>One reason for doing this is to learn what others are
715
- doing. For example, after looking at SciGraph''s JSON-LD I now see how an
716
- ordered list can be modelled in RDF in such a way that the list of authors
717
- in a JSON-LD document for, say a scientific paper, is correct. By default
718
- RDF has no notion of ordered lists, so if you do a SPARQL query to get the
719
- authors of a paper, the order of the authors returned in the query will be
720
- arbitrary. There are various ways to try and tackle this. In my Ozymandias
721
- knowledge graph I used \"roles\" to represent order (see <a href=\"https://doi.org/10.7717/peerj.6739/fig-2\">Figure
722
- 2</a> in the Ozymandias paper). I then used properties of the role to order
723
- the list of authors.</p>\n\n<p>Another approach is to use rdf:lists (see <a
724
- href=\"http://www.snee.com/bobdc.blog/2014/04/rdf-lists-and-sparql.html\">RDF
725
- lists and SPARQL</a> and <a href=\"https://stackoverflow.com/questions/17523804/is-it-possible-to-get-the-position-of-an-element-in-an-rdf-collection-in-sparql/17530689#17530689\">Is
726
- it possible to get the position of an element in an RDF Collection in SPARQL?</a>
727
- for an introduction to lists). SciGraph uses this approach. The value for
728
- schema:author is not an author, but a blank node (bnode), and this bnode has
729
- two predicates, rdf:first and rdf:rest. One points to an author, the other
730
- points to another bnode. This pattern repeats until we encounter a value of
731
- rdf:nil for rdf:rest.</p>\n\n<div class=\"separator\" style=\"clear: both;\"><a
732
- href=\"https://1.bp.blogspot.com/-AESgWub1ZLQ/YSjoeo6O41I/AAAAAAAAgwg/5Edm7ZmuwL8NwxCcBvTqbI7js5nYmgggwCLcBGAsYHQ/s629/list.png\"
733
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
734
- border=\"0\" height=\"320\" data-original-height=\"629\" data-original-width=\"401\"
735
- src=\"https://1.bp.blogspot.com/-AESgWub1ZLQ/YSjoeo6O41I/AAAAAAAAgwg/5Edm7ZmuwL8NwxCcBvTqbI7js5nYmgggwCLcBGAsYHQ/s320/list.png\"/></a></div>\n\n<p>This
736
- introduces some complexity, but the benefit is that the JSON-LD version of
737
- the RDF will have the authors in the correct order, and hence any client that
738
- is using JSON will be able to treat the array of authors as ordered. Without
739
- some means of ordering the client could not make this assumption, hence the
740
- first author in the list might not actually be the first author of the paper.</p>","tags":["JSON-LD","RDF"],"language":"en","references":null},{"id":"https://doi.org/10.59350/zc4qc-77616","uuid":"30c78d9d-2e50-49db-9f4f-b3baa060387b","url":"https://iphylo.blogspot.com/2022/09/does-anyone-cite-taxonomic-treatments.html","title":"Does
741
- anyone cite taxonomic treatments?","summary":"Taxonomic treatments have come
742
- up in various discussions I''m involved in, and I''m curious as to whether
743
- they are actually being used, in particular, whether they are actually being
744
- cited. Consider the following quote: The taxa are described in taxonomic treatments,
745
- well defined sections of scientific publications (Catapano 2019). They include
746
- a nomenclatural section and one or more sections including descriptions, material
747
- citations referring to studied specimens, or notes ecology and...","date_published":"2022-09-01T16:49:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
748
- (Roderic Page)"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
749
- both;\"><a href=\"https://zenodo.org/record/5731100/thumb100\" style=\"display:
750
- block; padding: 1em 0; text-align: center; clear: right; float: right;\"><img
751
- alt=\"\" border=\"0\" height=\"128\" data-original-height=\"106\" data-original-width=\"100\"
752
- src=\"https://zenodo.org/record/5731100/thumb250\"/></a></div>\nTaxonomic
753
- treatments have come up in various discussions I''m involved in, and I''m
754
- curious as to whether they are actually being used, in particular, whether
755
- they are actually being cited. Consider the following quote:\n\n<blockquote>\nThe
756
- taxa are described in taxonomic treatments, well defined sections of scientific
757
- publications (Catapano 2019). They include a nomenclatural section and one
758
- or more sections including descriptions, material citations referring to studied
759
- specimens, or notes ecology and behavior. In case the treatment does not describe
760
- a new discovered taxon, previous treatments are cited in the form of treatment
761
- citations. This citation can refer to a previous treatment and add additional
762
- data, or it can be a statement synonymizing the taxon with another taxon.
763
- This allows building a citation network, and ultimately is a constituent part
764
- of the catalogue of life. - Taxonomic Treatments as Open FAIR Digital Objects
765
- <a href=\"https://doi.org/10.3897/rio.8.e93709\">https://doi.org/10.3897/rio.8.e93709</a>\n</blockquote>\n\n<p>\n
766
- \"Traditional\" academic citation is from article to article. For example,
767
- consider these two papers:\n\n<blockquote>\nLi Y, Li S, Lin Y (2021) Taxonomic
768
- study on fourteen symphytognathid species from Asia (Araneae, Symphytognathidae).
769
- ZooKeys 1072: 1-47. https://doi.org/10.3897/zookeys.1072.67935\n</blockquote>\n\n<blockquote>\nMiller
770
- J, Griswold C, Yin C (2009) The symphytognathoid spiders of the Gaoligongshan,
771
- Yunnan, China (Araneae: Araneoidea): Systematics and diversity of micro-orbweavers.
772
- ZooKeys 11: 9-195. https://doi.org/10.3897/zookeys.11.160\n</blockquote>\n</p>\n\n<p>Li
773
- et al. 2021 cites Miller et al. 2009 (although Pensoft seems to have broken
774
- the citation such that it does appear correctly either on their web page or
775
- in CrossRef).</p>\n\n<p>So, we have this link: [article]10.3897/zookeys.1072.67935
776
- --cites--> [article]10.3897/zookeys.11.160. One article cites another.</p>\n\n<p>In
777
- their 2021 paper Li et al. discuss <i>Patu jidanweishi</i> Miller, Griswold
778
- & Yin, 2009:\n\n<div class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPavMHXqQNX1ls_zXo9kIcMLHPxc7ZpV9NCSof5wLumrg3ovPoi6nYKzZsINuqFtYoEvW1QrerePD-MEf2DJaYUXlT37d81x3L6ILls7u229rg0_Nc0uUmgW-ICzr6MI_QCZfgQbYGTxuofu-fuPVoygbCnm3vQVYOhLDLtp1EtQ9jRZHDvw/s1040/Screenshot%202022-09-01%20at%2017.12.27.png\"
779
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
780
- border=\"0\" width=\"400\" data-original-height=\"314\" data-original-width=\"1040\"
781
- src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPavMHXqQNX1ls_zXo9kIcMLHPxc7ZpV9NCSof5wLumrg3ovPoi6nYKzZsINuqFtYoEvW1QrerePD-MEf2DJaYUXlT37d81x3L6ILls7u229rg0_Nc0uUmgW-ICzr6MI_QCZfgQbYGTxuofu-fuPVoygbCnm3vQVYOhLDLtp1EtQ9jRZHDvw/s400/Screenshot%202022-09-01%20at%2017.12.27.png\"/></a></div>\n\n<p>There
782
- is a treatment for the original description of <i>Patu jidanweishi</i> at
783
- <a href=\"https://doi.org/10.5281/zenodo.3792232\">https://doi.org/10.5281/zenodo.3792232</a>,
784
- which was created by Plazi with a time stamp \"2020-05-06T04:59:53.278684+00:00\".
785
- The original publication date was 2009, the treatments are being added retrospectively.</p>\n\n<p>In
786
- an ideal world my expectation would be that Li et al. 2021 would have cited
787
- the treatment, instead of just providing the text string \"Patu jidanweishi
788
- Miller, Griswold & Yin, 2009: 64, figs 65A–E, 66A, B, 67A–D, 68A–F, 69A–F,
789
- 70A–F and 71A–F (♂♀).\" Isn''t the expectation under the treatment model that
790
- we would have seen this relationship:</p>\n\n<p>[article]10.3897/zookeys.1072.67935
791
- --cites--> [treatment]https://doi.org/10.5281/zenodo.3792232</p>\n\n<p>Furthermore,
792
- if it is the case that \"[i]n case the treatment does not describe a new discovered
793
- taxon, previous treatments are cited in the form of treatment citations\"
794
- then we should also see a citation between treatments, in other words Li et
795
- al.''s 2021 treatment of <i>Patu jidanweishi</i> (which doesn''t seem to have
796
- a DOI but is available on Plazi'' web site as <a href=\"https://tb.plazi.org/GgServer/html/1CD9FEC313A35240938EC58ABB858E74\">https://tb.plazi.org/GgServer/html/1CD9FEC313A35240938EC58ABB858E74</a>)
797
- should also cite the original treatment? It doesn''t - but it does cite the
798
- Miller et al. paper.</p>\n\n<p>So in this example we don''t see articles citing
799
- treatments, nor do we see treatments citing treatments. Playing Devil''s advocate,
800
- why then do we have treatments? Does''t the lack of citations suggest that
801
- - despite some taxonomists saying this is the unit that matters - they actually
802
- don''t. If we pay attention to what people do rather than what they say they
803
- do, they cite articles.</p>\n\n<p>Now, there are all sorts of reasons why
804
- we don''t see [article] -> [treatment] citations, or [treatment] -> [treatment]
805
- citations. Treatments are being added after the fact by Plazi, not by the
806
- authors of the original work. And in many cases the treatments that could
807
- be cited haven''t appeared until after that potentially citing work was published.
808
- In the example above the Miller et al. paper dates from 2009, but the treatment
809
- extracted only went online in 2020. And while there is a long standing culture
810
- of citing publications (ideally using DOIs) there isn''t an equivalent culture
811
- of citing treatments (beyond the simple text strings).</p>\n\n<p>Obviously
812
- this is but one example. I''d need to do some exploration of the citation
813
- graph to get a better sense of citations patterns, perhaps using <a href=\"https://www.crossref.org/documentation/event-data/\">CrossRef''s
814
- event data</a>. But my sense is that taxonomists don''t cite treatments.</p>\n\n<p>I''m
815
- guessing Plazi would respond by saying treatments are cited, for example (indirectly)
816
- in GBIF downloads. This is true, although arguably people aren''t citing the
817
- treatment, they''re citing specimen data in those treatments, and that specimen
818
- data could be extracted at the level of articles rather than treatments. In
819
- other words, it''s not the treatments themselves that people are citing.</p>\n\n<p>To
820
- be clear, I think there is value in being able to identify those \"well defined
821
- sections\" of a publication that deal with a given taxon (i.e., treatments),
822
- but it''s not clear to me that these are actually the citable units people
823
- might hope them to be. Likewise, journals such as <i>ZooKeys</i> have DOIs
824
- for individual figures. Does anyone actually cite those?</p>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/en7e9-5s882","uuid":"20b9d31e-513f-496b-b399-4215306e1588","url":"https://iphylo.blogspot.com/2022/04/obsidian-markdown-and-taxonomic-trees.html","title":"Obsidian,
825
- markdown, and taxonomic trees","summary":"Returning to the subject of personal
826
- knowledge graphs Kyle Scheer has an interesting repository of Markdown files
827
- that describe academic disciplines at https://github.com/kyletscheer/academic-disciplines
828
- (see his blog post for more background). If you add these files to Obsidian
829
- you get a nice visualisation of a taxonomy of academic disciplines. The applications
830
- of this to biological taxonomy seem obvious, especially as a tool like Obsidian
831
- enables all sorts of interesting links to be added...","date_published":"2022-04-07T21:07:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
832
- (Roderic Page)"}],"image":null,"content_html":"<p>Returning to the subject
833
- of <a href=\"https://iphylo.blogspot.com/2020/08/personal-knowledge-graphs-obsidian-roam.html\">personal
834
- knowledge graphs</a> Kyle Scheer has an interesting repository of Markdown
835
- files that describe academic disciplines at <a href=\"https://github.com/kyletscheer/academic-disciplines\">https://github.com/kyletscheer/academic-disciplines</a>
836
- (see <a href=\"https://kyletscheer.medium.com/on-creating-a-tree-of-knowledge-f099c1028bf6\">his
837
- blog post</a> for more background).</p>\n\n<p>If you add these files to <a
838
- href=\"https://obsidian.md/\">Obsidian</a> you get a nice visualisation of
839
- a taxonomy of academic disciplines. The applications of this to biological
840
- taxonomy seem obvious, especially as a tool like Obsidian enables all sorts
841
- of interesting links to be added (e.g., we could add links to the taxonomic
842
- research behind each node in the taxonomic tree, the people doing that research,
843
- etc. - although that would mean we''d no longer have a simple tree).</p>\n\n<p>The
844
- more I look at these sort of simple Markdown-based tools the more I wonder
845
- whether we could make more use of them to create simple but persistent databases.
846
- Text files seem the most stable, long-lived digital format around, maybe this
847
- would be a way to minimise the inevitable obsolescence of database and server
848
- software. Time for some experiments I feel... can we take a taxonomic group,
849
- such as mammals, and create a richly connected database purely in Markdown?</p>\n\n<div
850
- class=\"separator\" style=\"clear: both; text-align: center;\"><iframe allowfullscreen=''allowfullscreen''
851
- webkitallowfullscreen=''webkitallowfullscreen'' mozallowfullscreen=''mozallowfullscreen''
852
- width=''400'' height=''322'' src=''https://www.blogger.com/video.g?token=AD6v5dy3Sa_SY_MJCZYYCT-bAGe9QD1z_V0tkE0qM5FaQJfAEgGOoHtYPATsNNbBvTEh_tHOZ83nMGzpYRg''
853
- class=''b-hbp-video b-uploaded'' frameborder=''0''></iframe></div>","tags":["markdown","obsidian"],"language":"en","references":null},{"id":"https://doi.org/10.59350/m7gb7-d7c49","uuid":"7d814863-43b5-4faf-a475-da8de5efd3ef","url":"https://iphylo.blogspot.com/2022/02/duplicate-dois-again.html","title":"Duplicate
854
- DOIs (again)","summary":"This blog post provides some background to a recent
855
- tweet where I expressed my frustration about the duplication of DOIs for the
856
- same article. I''m going to document the details here. The DOI that alerted
857
- me to this problem is https://doi.org/10.2307/2436688 which is for the article
858
- Snyder, W. C., & Hansen, H. N. (1940). THE SPECIES CONCEPT IN FUSARIUM. American
859
- Journal of Botany, 27(2), 64–67. This article is hosted by JSTOR at https://www.jstor.org/stable/2436688
860
- which displays the DOI...","date_published":"2022-02-08T15:06:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
861
- (Roderic Page)"}],"image":null,"content_html":"<p>This blog post provides
862
- some background to a <a href=\"https://twitter.com/rdmpage/status/1491023036199600132\">recent
863
- tweet</a> where I expressed my frustration about the duplication of DOIs for
864
- the same article. I''m going to document the details here.</p>\n\n<p>The DOI
865
- that alerted me to this problem is <a href=\"https://doi.org/10.2307/2436688\">https://doi.org/10.2307/2436688</a>
866
- which is for the article</p>\n\n<blockquote>\nSnyder, W. C., & Hansen, H.
867
- N. (1940). THE SPECIES CONCEPT IN FUSARIUM. American Journal of Botany, 27(2),
868
- 64–67.\n</blockquote>\n\n<p>This article is hosted by JSTOR at <a href=\"https://www.jstor.org/stable/2436688\">https://www.jstor.org/stable/2436688</a>
869
- which displays the DOI <a href=\"https://doi.org/10.2307/2436688\">https://doi.org/10.2307/2436688</a>
870
- .</p>\n\n<p>This same article is also hosted by Wiley at <a href=\"https://bsapubs.onlinelibrary.wiley.com/doi/abs/10.1002/j.1537-2197.1940.tb14217.x\">https://bsapubs.onlinelibrary.wiley.com/doi/abs/10.1002/j.1537-2197.1940.tb14217.x</a>
871
- with the DOI <a href=\"https://doi.org/10.1002/j.1537-2197.1940.tb14217.x\">https://doi.org/10.1002/j.1537-2197.1940.tb14217.x</a>.</p>\n\n<h2>Expected
872
- behaviour</h2>\n\n<p>What should happen is if Wiley is going to be the publisher
873
- of this content (taking over from JSTOR), the DOI <b>10.2307/2436688</b> should
874
- be redirected to the Wiley page, and the Wiley page displays this DOI (i.e.,
875
- <b>10.2307/2436688</b>). If I want to get metadata for this DOI, I should
876
- be able to use CrossRef''s API to retrieve that metadata, e.g. <a href=\"https://api.crossref.org/v1/works/10.2307/2436688\">https://api.crossref.org/v1/works/10.2307/2436688</a>
877
- should return metadata for the article.</p>\n\n<h2>What actually happens</h2>\n\n<p>Wiley
878
- display the same article on their web site with the DOI <b>10.1002/j.1537-2197.1940.tb14217.x</b>.
879
- They have minted a new DOI for the same article! The original JSTOR DOI now
880
- resolves to the Wiley page (you can see this using the <a href=\"https://hdl.handle.net\">Handle
881
- Resolver</a>), which is what is supposed to happen. However, Wiley should
882
- have reused the original DOI rather than mint their own.</p>\n\n<p>Furthermore,
883
- while the original DOI still resolves in a web browser, I can''t retrieve
884
- metadata about that DOI from CrossRef, so any attempt to build upon that DOI
885
- fails. However, I can retrieve metadata for the Wiley DOI, i.e. <a href=\"https://api.crossref.org/v1/works/10.1002/j.1537-2197.1940.tb14217.x\">https://api.crossref.org/v1/works/10.1002/j.1537-2197.1940.tb14217.x</a>
886
- works, but <a href=\"https://api.crossref.org/v1/works/10.2307/2436688\">https://api.crossref.org/v1/works/10.2307/2436688</a>
887
- doesn''t.</p>\n\n<h2>Why does this matter?</h2>\n\n<p>For anyone using DOIs
888
- as stable links to the literature the persistence of DOIs is something you
889
- should be able to rely upon, both for people clicking on links in web browsers
890
- and developers getting metadata from those DOIs. The whole rationale of the
891
- DOI system is a single, globally unique identifier for each article, and that
892
- these DOIs persist even when the publisher of the content changes. If this
893
- property doesn''t hold, then why would a developer such as myself invest effort
894
- in linking using DOIs?</p>\n\n<p>Just for the record, I think CrossRef is
895
- great and is a hugely important part of the scholarly landscape. There are
896
- lots of things that I do that would be nearly impossible without CrossRef
897
- and its tools. But cases like this where we get massive duplication of DOIs
898
- when a publishers takes over an existing journal fundamentally breaks the
899
- underlying model of stable, persistent identifiers.</p>","tags":["CrossRef","DOI","duplicates"],"language":"en","references":null},{"id":"https://doi.org/10.59350/d3dc0-7an69","uuid":"545c177f-cea5-4b79-b554-3ccae9c789d7","url":"https://iphylo.blogspot.com/2021/10/reflections-on-macroscope-tool-for-21st.html","title":"Reflections
900
- on \"The Macroscope\" - a tool for the 21st Century?","summary":"This is a
901
- guest post by Tony Rees. It would be difficult to encounter a scientist, or
902
- anyone interested in science, who is not familiar with the microscope, a tool
903
- for making objects visible that are otherwise too small to be properly seen
904
- by the unaided eye, or to reveal otherwise invisible fine detail in larger
905
- objects. A select few with a particular interest in microscopy may also have
906
- encountered the Wild-Leica \"Macroscope\", a specialised type of benchtop
907
- microscope optimised for...","date_published":"2021-10-07T12:38:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
908
- (Roderic Page)"}],"image":null,"content_html":"<p><img src=\"https://lh3.googleusercontent.com/-A99btr6ERMs/Vl1Wvjp2OtI/AAAAAAAAEFI/7bKdRjNG5w0/ytNkVT2U.jpg?imgmax=800\"
909
- alt=\"YtNkVT2U\" title=\"ytNkVT2U.jpg\" border=\"0\" width=\"128\" height=\"128\"
910
- style=\"float:right;\" /> This is a guest post by <a href=\"https://about.me/TonyRees\">Tony
911
- Rees</a>.</p>\n\n<p>It would be difficult to encounter a scientist, or anyone
912
- interested in science, who is not familiar with the microscope, a tool for
913
- making objects visible that are otherwise too small to be properly seen by
914
- the unaided eye, or to reveal otherwise invisible fine detail in larger objects.
915
- A select few with a particular interest in microscopy may also have encountered
916
- the Wild-Leica \"Macroscope\", a specialised type of benchtop microscope optimised
917
- for low-power macro-photography. However in this overview I discuss the \"Macroscope\"
918
- in a different sense, which is that of the antithesis to the microscope: namely
919
- a method for visualizing subjects too large to be encompassed by a single
920
- field of vision, such as the Earth or some subset of its phenomena (the biosphere,
921
- for example), or conceptually, the universe.</p>\n\n<p><div class=\"separator\"
922
- style=\"clear: both;\"><a href=\"https://1.bp.blogspot.com/-xXxH9aJn1tY/YV7vnKGQWPI/AAAAAAAAgyQ/7CSJJ663Ry4IXtav54nLcLiI0kda5_L7ACLcBGAsYHQ/s500/2020045672.jpg\"
923
- style=\"display: block; padding: 1em 0; text-align: center; clear: right;
924
- float: right;\"><img alt=\"\" border=\"0\" height=\"320\" data-original-height=\"500\"
925
- data-original-width=\"303\" src=\"https://1.bp.blogspot.com/-xXxH9aJn1tY/YV7vnKGQWPI/AAAAAAAAgyQ/7CSJJ663Ry4IXtav54nLcLiI0kda5_L7ACLcBGAsYHQ/s320/2020045672.jpg\"/></a></div>My
926
- introduction to the term was via addresses given by Jesse Ausubel in the formative
927
- years of the 2001-2010 <a href=\"http://www.coml.org\">Census of Marine Life</a>,
928
- for which he was a key proponent. In Ausubel''s view, the Census would perform
929
- the function of a macroscope, permitting a view of everything that lives in
930
- the global ocean (or at least, that subset which could realistically be sampled
931
- in the time frame available) as opposed to more limited subsets available
932
- via previous data collection efforts. My view (which could, of course, be
933
- wrong) was that his thinking had been informed by a work entitled \"Le macroscope,
934
- vers une vision globale\" published in 1975 by the French thinker Joël de
935
- Rosnay, who had expressed such a concept as being globally applicable in many
936
- fields, including the physical and natural worlds but also extending to human
937
- society, the growth of cities, and more. Yet again, some ecologists may also
938
- have encountered the term, sometimes in the guise of \"Odum''s macroscope\",
939
- as an approach for obtaining \"big picture\" analyses of macroecological processes
940
- suitable for mathematical modelling, typically by elimination of fine detail
941
- so that only the larger patterns remain, as initially advocated by Howard
942
- T. Odum in his 1971 book \"Environment, Power, and Society\".</p>\n\n<p>From
943
- the standpoint of the 21st century, it seems that we are closer to achieving
944
- a \"macroscope\" (or possibly, multiple such tools) than ever before, based
945
- on the availability of existing and continuing new data streams, improved
946
- technology for data assembly and storage, and advanced ways to query and combine
947
- these large streams of data to produce new visualizations, data products,
948
- and analytical findings. I devote the remainder of this article to examples
949
- where either particular workers have employed \"macroscope\" terminology to
950
- describe their activities, or where potentially equivalent actions are taking
951
- place without the explicit \"macroscope\" association, but are equally worthy
952
- of consideration. To save space here, references cited here (most or all)
953
- can be found via a Wikipedia article entitled \"<a href=\"https://en.wikipedia.org/wiki/Macroscope_(science_concept)\">Macroscope
954
- (science concept)</a>\" that I authored on the subject around a year ago,
955
- and have continued to add to on occasion as new thoughts or information come
956
- to hand (see <a href=\"https://en.wikipedia.org/w/index.php?title=Macroscope_(science_concept)&offset=&limit=500&action=history\">edit
957
- history for the article</a>).</p>\n\n<p>First, one can ask, what constitutes
958
- a macroscope, in the present context? In the Wikipedia article I point to
959
- a book \"Big Data - Related Technologies, Challenges and Future Prospects\"
960
- by Chen <em>et al.</em> (2014) (<a href=\"https://doi.org/10.1007/978-3-319-06245-7\">doi:10.1007/978-3-319-06245-7</a>),
961
- in which the \"value chain of big data\" is characterised as divisible into
962
- four phases, namely data generation, data acquisition (aka data assembly),
963
- data storage, and data analysis. To my mind, data generation (which others
964
- may term acquisition, differently from the usage by Chen <em>et al.</em>)
965
- is obviously the first step, but does not in itself constitute the macroscope,
966
- except in rare cases - such as Landsat imagery, perhaps - where on its own,
967
- a single co-ordinated data stream is sufficient to meet the need for a particular
968
- type of \"global view\". A variant of this might be a coordinated data collection
969
- program - such as that of the ten year Census of Marine Life - which might
970
- produce the data required for the desired global view; but again, in reality,
971
- such data are collected in a series of discrete chunks, in many and often
972
- disparate data formats, and must be \"wrangled\" into a more coherent whole
973
- before any meaningful \"macroscope\" functionality becomes available.</p>\n\n<p>Here
974
- we come to what, in my view, constitutes the heart of the \"macroscope\":
975
- an intelligently organized (i.e. indexable and searchable), coherent data
976
- store or repository (where \"data\" may include imagery and other non numeric
977
- data forms, but much else besides). Taking the Census of Marine Life example,
978
- the data repository for that project''s data (plus other available sources
979
- as inputs) is the <a href=\"https://obis.org\">Ocean Biodiversity Information
980
- System</a> or OBIS (previously the Ocean Biogeographic Information System),
981
- which according to this view forms the \"macroscope\" for which the Census
982
- data is a feed. (For non habitat-specific biodiversity data, <a href=\"https://www.gbif.org\">GBIF</a>
983
- is an equivalent, and more extensive, operation). Other planetary scale \"macroscopes\",
984
- by this definition (which may or may not have an explicit geographic, i.e.
985
- spatial, component) would include inventories of biological taxa such as the
986
- <a href=\"https://www.catalogueoflife.org\">Catalogue of Life</a> and so on,
987
- all the way back to the pioneering compendia published by Linnaeus in the
988
- eighteenth century; while for cartography and topographic imagery, the current
989
- \"blockbuster\" of <a href=\"http://earth.google.com\">Google Earth</a> and
990
- its predecessors also come well into public consciousness.</p>\n\n<p>In the
991
- view of some workers and/or operations, both of these phases are precursors
992
- to the real \"work\" of the macroscope which is to reveal previously unseen
993
- portions of the \"big picture\" by means either of the availability of large,
994
- synoptic datasets, or fusion between different data streams to produce novel
995
- insights. Companies such as IBM and Microsoft have used phraseology such as:</p>\n\n<blockquote>By
996
- 2022 we will use machine-learning algorithms and software to help us organize
997
- information about the physical world, helping bring the vast and complex data
998
- gathered by billions of devices within the range of our vision and understanding.
999
- We call this a \"macroscope\" – but unlike the microscope to see the very
1000
- small, or the telescope that can see far away, it is a system of software
1001
- and algorithms to bring all of Earth''s complex data together to analyze it
1002
- by space and time for meaning.\" (IBM)</blockquote>\n\n<blockquote>As the
1003
- Earth becomes increasingly instrumented with low-cost, high-bandwidth sensors,
1004
- we will gain a better understanding of our environment via a virtual, distributed
1005
- whole-Earth \"macroscope\"... Massive-scale data analytics will enable real-time
1006
- tracking of disease and targeted responses to potential pandemics. Our virtual
1007
- \"macroscope\" can now be used on ourselves, as well as on our planet.\" (Microsoft)
1008
- (references available via the Wikipedia article cited above).</blockquote>\n\n<p>Whether
1009
- or not the analytical capabilities described here are viewed as being an integral
1010
- part of the \"macroscope\" concept, or are maybe an add-on, is ultimately
1011
- a question of semantics and perhaps, personal opinion. Continuing the Census
1012
- of Marine Life/OBIS example, OBIS offers some (arguably rather basic) visualization
1013
- and summary tools, but also makes its data available for download to users
1014
- wishing to analyse it further according to their own particular interests;
1015
- using OBIS data in this manner, Mark Costello et al. in 2017 were able to
1016
- demarcate a finite number of data-supported marine biogeographic realms for
1017
- the first time (Costello et al. 2017: Nature Communications. 8: 1057. <a href=\"https://doi.org/10.1038/s41467-017-01121-2\">doi:10.1038/s41467-017-01121-2</a>),
1018
- a project which I was able to assist in a small way in an advisory capacity.
1019
- In a case such as this, perhaps the final function of the macroscope, namely
1020
- data visualization and analysis, was outsourced to the authors'' own research
1021
- institution. Similarly at an earlier phase, \"data aggregation\" can also
1022
- be virtual rather than actual, i.e. avoiding using a single physical system
1023
- to hold all the data, enabled by open web mapping standards WMS (web map service)
1024
- and WFS (web feature service) to access a set of distributed data stores,
1025
- e.g. as implemented on the portal for the <a href=\"https://portal.aodn.org.au/\">Australian
1026
- Ocean Data Network</a>.</p>\n\n<p>So, as we pass through the third decade
1027
- of the twenty first century, what developments await us in the \"macroscope\"
1028
- area? In the biodiversity space, one can reasonably presume that the existing
1029
- \"macroscopic\" data assembly projects such as OBIS and GBIF will continue,
1030
- and hopefully slowly fill current gaps in their coverage - although in the
1031
- marine area, strategic new data collection exercises may be required (Census
1032
- 2020, or 2025, anyone?), while (again hopefully), the Catalogue of Life will
1033
- continue its progress towards a \"complete\" species inventory for the biosphere.
1034
- The Landsat project, with imagery dating back to 1972, continues with the
1035
- launch of its latest satellite Landsat 9 just this year (21 September 2021)
1036
- with a planned mission duration for the next 5 years, so the \"macroscope\"
1037
- functionality of that project seems set to continue for the medium term at
1038
- least. Meanwhile the ongoing development of sensor networks, both on land
1039
- and in the ocean, offers an exciting new method of \"instrumenting the earth\"
1040
- to obtain much more real time data than has ever been available in the past,
1041
- offering scope for many more, use case-specific \"macroscopes\" to be constructed
1042
- that can fuse (e.g.) satellite imagery with much more that is happening at
1043
- a local level.</p>\n\n<p>So, the \"macroscope\" concept appears to be alive
1044
- and well, even though the nomenclature can change from time to time (IBM''s
1045
- \"Macroscope\", foreshadowed in 2017, became the \"IBM Pairs Geoscope\" on
1046
- implementation, and is now simply the \"Geospatial Analytics component within
1047
- the IBM Environmental Intelligence Suite\" according to available IBM publicity
1048
- materials). In reality this illustrates a new dichotomy: even if \"everyone\"
1049
- in principle has access to huge quantities of publicly available data, maybe
1050
- only a few well funded entities now have the computational ability to make
1051
- sense of it, and can charge clients a good fee for their services...</p>\n\n<p>I
1052
- present this account partly to give a brief picture of \"macroscope\" concepts
1053
- today and in the past, for those who may be interested, and partly to present
1054
- a few personal views which would be out of scope in a \"neutral point of view\"
1055
- article such as is required on Wikipedia; also to see if readers of this blog
1056
- would like to contribute further to discussion of any of the concepts traversed
1057
- herein.</p>","tags":["guest post","macroscope"],"language":"en","references":null},{"id":"https://doi.org/10.59350/2b1j9-qmw12","uuid":"37538c38-66e6-4ac4-ab5c-679684622ade","url":"https://iphylo.blogspot.com/2022/05/round-trip-from-identifiers-to.html","title":"Round
1058
- trip from identifiers to citations and back again","summary":"Note to self
1059
- (basically rewriting last year''s Finding citations of specimens). Bibliographic
1060
- data supports going from identifier to citation string and back again, so
1061
- we can do a \"round trip.\" 1. Given a DOI we can get structured data with
1062
- a simple HTTP fetch, then use a tool such as citation.js to convert that data
1063
- into a human-readable string in a variety of formats. Identifier ⟶ Structured
1064
- data ⟶ Human readable string 10.7717/peerj-cs.214 HTTP with...","date_published":"2022-05-27T16:34:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
1065
- (Roderic Page)"}],"image":null,"content_html":"<p>Note to self (basically
1066
- rewriting last year''s <a href=\"https://iphylo.blogspot.com/2021/05/finding-citations-of-specimens.html\">Finding
1067
- citations of specimens</a>).</p>\n\n<p>Bibliographic data supports going from
1068
- identifier to citation string and back again, so we can do a \"round trip.\"</p>\n\n<h2>1.</h2>\n\n<p>Given
1069
- a DOI we can get structured data with a simple HTTP fetch, then use a tool
1070
- such as <a href=\"https://citation.js.org\">citation.js</a> to convert that
1071
- data into a human-readable string in a variety of formats.</p>\n\n<table>\n<tr>\n<th>\nIdentifier\n</th>\n<th>\n⟶\n</th>\n<th>\nStructured
1072
- data\n</th>\n<th>\n⟶\n</th>\n<th>\nHuman readable string\n</th>\n<tr>\n<tr>\n<td>\n10.7717/peerj-cs.214\n</td>\n<td>\nHTTP
1073
- with content-negotiation\n</td>\n<td>\nCSL-JSON\n</td>\n<td>\nCSL templates\n</td>\n<td
1074
- width=\"25%\">\nWillighagen, L. G. (2019). Citation.js: a format-independent,
1075
- modular bibliography tool for the browser and command line. PeerJ Computer
1076
- Science, 5, e214. https://doi.org/10.7717/peerj-cs.214\n</td>\n</tr>\n</table>\n\n<h2>2.</h2>\n\n<p>Going
1077
- in the reverse direction (string to identifier) is a little more challenging.
1078
- In the \"old days\" a typical strategy was to attempt to parse the citation
1079
- string into structured data (see <a href=\"hhtps://anystyle.io\">AnyStyle</a>
1080
- for a nice example of this), then we could extract a tuple of (journal, volume,
1081
- starting page) and use that to query CrossRef to find if there was an article
1082
- with that tuple, which gave us the DOI.</p>\n\n<table>\n<tr>\n<th>\nIdentifier\n</th>\n<th>\n⟵\n</th>\n<th>\nStructured
1083
- data\n</th>\n<th>\n⟵\n</th>\n<th>\nHuman readable string\n</th>\n<tr>\n<tr>\n<td>\n10.7717/peerj-cs.214\n</td>\n<td>\nOpenURL
1084
- query\n</td>\n<td>\njournal, volume, start page\n</td>\n<td>\nCitation parser
1085
- \n</td>\n<td width=\"25%\">\nWillighagen, L. G. (2019). Citation.js: a format-independent,
1086
- modular bibliography tool for the browser and command line. PeerJ Computer
1087
- Science, 5, e214. https://doi.org/10.7717/peerj-cs.214\n</td>\n</tr>\n</table>\n\n<h2>3.</h2>\n\n<p>Another
1088
- strategy is to take all the citations strings for each DOI, index those in
1089
- a search engine, then just use a simple search to find the best match to your
1090
- citation string, and hence the DOI. This is what <a href=\"https://search.crossref.org\">https://search.crossref.org</a>
1091
- does.</p>\n\n<table>\n<tr>\n<th>\nIdentifier\n</th>\n<th>\n⟵\n</th>\n<th>\nHuman
1092
- readable string\n</th>\n<tr>\n<tr>\n<td>\n10.7717/peerj-cs.214\n</td>\n<td>\nsearch\n</td>\n<td
1093
- width=\"50%\">\nWillighagen, L. G. (2019). Citation.js: a format-independent,
1094
- modular bibliography tool for the browser and command line. PeerJ Computer
1095
- Science, 5, e214. https://doi.org/10.7717/peerj-cs.214\n</td>\n</tr>\n</table>\n\n<p>At
1096
- the moment my work on material citations (i.e., lists of specimens in taxonomic
1097
- papers) is focussing on 1 (generating citations from specimen data in GBIF)
1098
- and 2 (parsing citations into structured data).</p>","tags":["citation","GBIF","material
1099
- examined","specimen codes"],"language":"en","references":null},{"id":"https://doi.org/10.59350/3s376-6bm21","uuid":"62e7b438-67a3-44ac-a66d-3f5c278c949e","url":"https://iphylo.blogspot.com/2022/02/deduplicating-bibliographic-data.html","title":"Deduplicating
104
+ list can be modelled in RDF in...","published_at":1630070400,"updated_at":1630070987,"indexed_at":1688982503,"authors":[{"url":"https://orcid.org/0000-0002-7101-9767","name":"Roderic
105
+ Page"}],"image":null,"tags":["JSON-LD","RDF"],"language":"en","reference":[]},{"id":"5891c709-d139-440f-bacb-06244424587a","doi":"https://doi.org/10.59350/pmhat-5ky65","url":"https://iphylo.blogspot.com/2021/10/problems-with-plazi-parsing-how.html","title":"Problems
106
+ with Plazi parsing: how reliable are automated methods for extracting specimens
107
+ from the literature?","summary":"The Plazi project has become one of the major
108
+ contributors to GBIF with some 36,000 datasets yielding some 500,000 occurrences
109
+ (see Plazi''s GBIF page for details). These occurrences are extracted from
110
+ taxonomic publication using automated methods. New data is published almost
111
+ daily (see latest treatments). The map below shows the geographic distribution
112
+ of material citations provided to GBIF by Plazi, which gives you a sense of
113
+ the size of the dataset. By any metric Plazi represents a...","published_at":1635160200,"updated_at":1635437298,"indexed_at":1688982503,"authors":[{"url":"https://orcid.org/0000-0002-7101-9767","name":"Roderic
114
+ Page"}],"image":null,"tags":["data quality","parsing","Plazi","specimen","text
115
+ mining"],"language":"en","reference":[]},{"id":"3cb94422-5506-4e24-a41c-a250bb521ee0","doi":"https://doi.org/10.59350/c79vq-7rr11","url":"https://iphylo.blogspot.com/2021/12/graphql-for-wikidata-wikicite.html","title":"GraphQL
116
+ for WikiData (WikiCite)","summary":"I''ve released a very crude GraphQL endpoint
117
+ for WikiData. More precisely, the endpoint is for a subset of the entities
118
+ that are of interest to WikiCite, such as scholarly articles, people, and
119
+ journals. There is a crude demo at https://wikicite-graphql.herokuapp.com.
120
+ The endpoint itself is at https://wikicite-graphql.herokuapp.com/gql.php.
121
+ There are various ways to interact with the endpoint, personally I like the
122
+ Altair GraphQL Client by Samuel Imolorhe. As I''ve mentioned earlier it''s
123
+ taken...","published_at":1640006160,"updated_at":1640006405,"indexed_at":1689006804,"authors":[{"url":"https://orcid.org/0000-0002-7101-9767","name":"Roderic
124
+ Page"}],"image":null,"tags":["GraphQL","SPARQL","WikiCite","Wikidata"],"language":"en","reference":[]},{"id":"62e7b438-67a3-44ac-a66d-3f5c278c949e","doi":"https://doi.org/10.59350/3s376-6bm21","url":"https://iphylo.blogspot.com/2022/02/deduplicating-bibliographic-data.html","title":"Deduplicating
1100
125
  bibliographic data","summary":"There are several instances where I have a
1101
126
  collection of references that I want to deduplicate and merge. For example,
1102
127
  in Zootaxa has no impact factor I describe a dataset of the literature cited
1103
128
  by articles in the journal Zootaxa. This data is available on Figshare (https://doi.org/10.6084/m9.figshare.c.5054372.v4),
1104
129
  as is the equivalent dataset for Phytotaxa (https://doi.org/10.6084/m9.figshare.c.5525901.v1).
1105
130
  Given that the same articles may be cited many times, these datasets have
1106
- lots of...","date_published":"2022-02-03T15:09:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
1107
- (Roderic Page)"}],"image":null,"content_html":"<p>There are several instances
1108
- where I have a collection of references that I want to deduplicate and merge.
1109
- For example, in <a href=\"https://iphylo.blogspot.com/2020/07/zootaxa-has-no-impact-factor.html\">Zootaxa
1110
- has no impact factor</a> I describe a dataset of the literature cited by articles
1111
- in the journal <i>Zootaxa</i>. This data is available on Figshare (<a href=\"https://doi.org/10.6084/m9.figshare.c.5054372.v4\">https://doi.org/10.6084/m9.figshare.c.5054372.v4</a>),
1112
- as is the equivalent dataset for <i>Phytotaxa</i> (<a href=\"https://doi.org/10.6084/m9.figshare.c.5525901.v1\">https://doi.org/10.6084/m9.figshare.c.5525901.v1</a>).
1113
- Given that the same articles may be cited many times, these datasets have
1114
- lots of duplicates. Similarly, articles in <a href=\"https://species.wikimedia.org\">Wikispecies</a>
1115
- often have extensive lists of references cited, and the same reference may
1116
- appear on multiple pages (for an initial attempt to extract these references
1117
- see <a href=\"https://doi.org/10.5281/zenodo.5801661\">https://doi.org/10.5281/zenodo.5801661</a>
1118
- and <a href=\"https://github.com/rdmpage/wikispecies-parser\">https://github.com/rdmpage/wikispecies-parser</a>).</p>\n\n<p>There
1119
- are several reasons I want to merge these references. If I want to build a
1120
- citation graph for <i>Zootaxa</i> or <i>Phytotaxa</i> I need to merge references
1121
- that are the same so that I can accurately count citations. I am also interested
1122
- in harvesting the metadata to help find those articles in the <a href=\"https://www.biodiversitylibrary.org\">Biodiversity
1123
- Heritage Library</a> (BHL), and the literature cited section of scientific
1124
- articles is a potential goldmine of bibliographic metadata, as is Wikispecies.</p>\n\n<p>After
1125
- various experiments and false starts I''ve created a repository <a href=\"https://github.com/rdmpage/bib-dedup\">https://github.com/rdmpage/bib-dedup</a>
1126
- to host a series of PHP scripts to deduplicate bibliographic data. I''ve
1127
- settled on using CSL-JSON as the format for bibliographic data. Because deduplication
1128
- relies on comparing pairs of references, the standard format for most of the
1129
- scripts is a JSON array containing a pair of CSL-JSON objects to compare.
1130
- Below are the steps the code takes.</p>\n\n<h2>Generating pairs to compare</h2>\n\n<p>The
1131
- first step is to take a list of references and generate the pairs that will
1132
- be compared. I started with this approach as I wanted to explore machine learning
1133
- and wanted a simple format for training data, such as an array of two CSL-JSON
1134
- objects and an integer flag representing whether the two references were the
1135
- same or different.</p>\n\n<p>There are various ways to generate CSL-JSON for
1136
- a reference. I use a tool I wrote (see <a href=\"https://iphylo.blogspot.com/2021/07/citation-parsing-released.html\">Citation
1137
- parsing tool released</a>) that has a simple API where you parse one or more
1138
- references and it returns that reference as structured data in CSL-JSON.</p>\n\n<p>Attempting
1139
- to do all possible pairwise comparisons rapidly gets impractical as the number
1140
- of references increases, so we need some way to restrict the number of comparisons
1141
- we make. One approach I''ve explored is the “sorted neighbourhood method”
1142
- where we sort the references (for example by their title) then move a sliding
1143
- window down the list of references, comparing all references within that window.
1144
- This greatly reduces the number of pairwise comparisons. So the first step
1145
- is to sort the references, then run a sliding window over them, output all
1146
- the pairs in each window (ignoring pairwise comparisons already made in
1147
- a previous window). Other methods of \"blocking\" could also be used, such
1148
- as only including references in a particular year, or a particular journal.</p>\n\n<p>So,
1149
- the output of this step is a set of JSON arrays, each with a pair of references
1150
- in CSL-JSON format. Each array is stored on a single line in the same file
1151
- in <a href=\"https://en.wikipedia.org/wiki/JSON_streaming#Line-delimited_JSON_2\">line-delimited
1152
- JSON</a> (JSONL).</p>\n\n<h2>Comparing pairs</h2>\n\n<p>The next step is to
1153
- compare each pair of references and decide whether they are a match or not.
1154
- Initially I explored a machine learning approach used in the following paper:</p>\n\n<blockquote>\nWilson
1155
- DR. 2011. Beyond probabilistic record linkage: Using neural networks and complex
1156
- features to improve genealogical record linkage. In: The 2011 International
1157
- Joint Conference on Neural Networks. 9–14. DOI: <a href=\"https://doi.org/10.1109/IJCNN.2011.6033192\">10.1109/IJCNN.2011.6033192</a>\n</blockquote>\n\n<p>Initial
1158
- experiments using <a href=\"https://github.com/jtet/Perceptron\">https://github.com/jtet/Perceptron</a>
1159
- were promising and I want to play with this further, but I decided to skip
1160
- this for now and just use simple string comparison. So for each CSL-JSON object
1161
- I generate a citation string in the same format using CiteProc, then compute
1162
- the <a href=\"https://en.wikipedia.org/wiki/Levenshtein_distance\">Levenshtein
1163
- distance</a> between the two strings. By normalising this distance by the
1164
- length of the two strings being compared I can use an arbitrary threshold
1165
- to decide if the references are the same or not.</p>\n\n<h2>Clustering</h2>\n\n<p>For
1166
- this step we read the JSONL file produced above and record whether the two
1167
- references are a match or not. Assuming each reference has a unique identifier
1168
- (needs only be unique within the file) then we can use those identifiers to
1169
- record the clusters each reference belongs to. I do this using a <a href=\"https://en.wikipedia.org/wiki/Disjoint-set_data_structure\">Disjoint-set
1170
- data structure</a>. For each reference start with a graph where each node
1171
- represents a reference, and each node has a pointer to a parent node. Initially
1172
- the reference is its own parent. A simple implementation is to have an array
1173
- index by reference identifiers and where the value of each cell in the array
1174
- is the node''s parent.</p>\n\n<p>As we discover pairs we update the parents
1175
- of the nodes to reflect this, such that once all the comparisons are done
1176
- we have one or more sets of clusters corresponding to the references that
1177
- we think are the same. Another way to think of this is that we are getting
1178
- the components of a graph where each node is a reference and pair of references
1179
- that match are connected by an edge.</p>\n\n<p>In the code I''m using I write
1180
- this graph in <a href=\"https://en.wikipedia.org/wiki/Trivial_Graph_Format\">Trivial
1181
- Graph Format</a> (TGF) which can be visualised using a tool such as <a href=\"https://www.yworks.com/products/yed\">yEd</a>.</p>\n\n<h2>Merging</h2>\n\n<p>Now
1182
- that we have a graph representing the sets of references that we think are
1183
- the same we need to merge them. This is where things get interesting as the
1184
- references are similar (by definition) but may differ in some details. The
1185
- paper below describes a simple Bayesian approach for merging records:</p>\n\n<blockquote>\nCouncill
1186
- IG, Li H, Zhuang Z, Debnath S, Bolelli L, Lee WC, Sivasubramaniam A, Giles
1187
- CL. 2006. Learning Metadata from the Evidence in an On-line Citation Matching
1188
- Scheme. In: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital
1189
- Libraries. JCDL ’06. New York, NY, USA: ACM, 276–285. DOI: <a href=\"https://doi.org/10.1145/1141753.1141817\">10.1145/1141753.1141817</a>.\n</blockquote>\n\n<p>So
1190
- the next step is to read the graph with the clusters, generate the sets of
1191
- bibliographic references that correspond to each cluster, then use the method
1192
- described in Councill et al. to produce a single bibliographic record for
1193
- that cluster. These records could then be used to, say, locate the corresponding
1194
- article in BHL, or populate Wikidata with missing references.</p>\n\n<p>Obviously
1195
- there is always the potential for errors, such as trying to merge references
1196
- that are not the same. As a quick and dirty check I flag as dubious any cluster
1197
- where the page numbers vary among members of the cluster. More sophisticated
1198
- checks are possible, especially if I go down the ML route (i.e., I would have
1199
- evidence for the probability that the same reference can disagree on some
1200
- aspects of metadata).</p>\n\n<h2>Summary</h2>\n\n<p>At this stage the code
1201
- is working well enough for me to play with and explore some example datasets.
1202
- The focus is on structured bibliographic metadata, but I may simplify things
1203
- and have a version that handles simple string matching, for example to cluster
1204
- together different abbreviations of the same journal name.</p>","tags":["data
1205
- cleaning","deduplication","Phytotaxa","Wikispecies","Zootaxa"],"language":"en","references":null},{"id":"https://doi.org/10.59350/ndtkv-6ve80","uuid":"e8e95aaf-bacb-4b5a-bf91-54e903526ab2","url":"https://iphylo.blogspot.com/2021/11/revisiting-rss-to-monitor-latests.html","title":"Revisiting
131
+ lots of...","published_at":1643900940,"updated_at":1643901089,"indexed_at":1689006804,"authors":[{"url":"https://orcid.org/0000-0002-7101-9767","name":"Roderic
132
+ Page"}],"image":null,"tags":["data cleaning","deduplication","Phytotaxa","Wikispecies","Zootaxa"],"language":"en","reference":[]},{"id":"d33d4f49-b281-4997-9eb9-dbad1e52d9bd","doi":"https://doi.org/10.59350/92rdb-5fe58","url":"https://iphylo.blogspot.com/2022/09/local-global-identifiers-for.html","title":"Local
133
+ global identifiers for decentralised wikis","summary":"I''ve been thinking
134
+ a bit about how one could use a Markdown wiki-like tool such as Obsidian to
135
+ work with taxonomic data (see earlier posts Obsidian, markdown, and taxonomic
136
+ trees and Personal knowledge graphs: Obsidian, Roam, Wikidata, and Xanadu).
137
+ One \"gotcha\" would be how to name pages. If we treat the database as entirely
138
+ local, then the page names don''t matter, but what if we envisage sharing
139
+ the database, or merging it with others (for example, if we divided a taxon
140
+ up into chunks, and...","published_at":1662653340,"updated_at":1662657862,"indexed_at":1688982864,"authors":[{"url":"https://orcid.org/0000-0002-7101-9767","name":"Roderic
141
+ Page"}],"image":null,"tags":["citekey","identfiiers","markdown","obsidian","Roger
142
+ Hyam"],"language":"en","reference":[]},{"id":"6a4d5c44-f4a9-4d40-a32c-a4d5e512c55a","doi":"https://doi.org/10.59350/rfxj3-x6739","url":"https://iphylo.blogspot.com/2022/05/thoughts-on-treebase-dying.html","title":"Thoughts
143
+ on TreeBASE dying(?)","summary":"@rvosa is Naturalis no longer hosting Treebase?
144
+ https://t.co/MBRgcxaBmR— Hilmar Lapp (@hlapp) May 10, 2022 So it looks like
145
+ TreeBASE is in trouble, it''s legacy Java code a victim of security issues.
146
+ Perhaps this is a chance to rethink TreeBASE, assuming that a repository of
147
+ published phylogenies is still considered a worthwhile thing to have (and
148
+ I think that question is open). Here''s what I think could be done. The data
149
+ (individual studies with trees and data) are packaged into...","published_at":1652287980,"updated_at":1652350205,"indexed_at":1689006804,"authors":[{"url":"https://orcid.org/0000-0002-7101-9767","name":"Roderic
150
+ Page"}],"image":null,"tags":["phylogeny","TreeBASE"],"language":"en","reference":[]},{"id":"3e1278f6-e7c0-43e1-bb54-6829e1344c0d","doi":"https://doi.org/10.59350/btdk4-42879","url":"https://iphylo.blogspot.com/2022/09/the-ideal-taxonomic-journal.html","title":"The
151
+ ideal taxonomic journal","summary":"This is just some random notes on an “ideal”
152
+ taxonomic journal, inspired in part by some recent discussions on “turbo-taxonomy”
153
+ (e.g., https://doi.org/10.3897/zookeys.1087.76720 and https://doi.org/10.1186/1742-9994-10-15),
154
+ and also examples such as the Australian Journal of Taxonomy https://doi.org/10.54102/ajt.qxi3r
155
+ which seems well-intentioned but limited. XML One approach is to have highly
156
+ structured text that embeds detailed markup, and ideally a tool that generates
157
+ markup in XML. This is...","published_at":1664460000,"updated_at":1664460001,"indexed_at":1689006804,"authors":[{"url":"https://orcid.org/0000-0002-7101-9767","name":"Roderic
158
+ Page"}],"image":null,"tags":[],"language":"en","reference":[]},{"id":"a93134aa-8b33-4dc7-8cd4-76cdf64732f4","doi":"https://doi.org/10.59350/cbzgz-p8428","url":"https://iphylo.blogspot.com/2023/04/library-interfaces-knowledge-graphs-and.html","title":"Library
159
+ interfaces, knowledge graphs, and Miller columns","summary":"Some quick notes
160
+ on interface ideas for digital libraries and/or knowledge graphs. Recently
161
+ there’s been something of an explosion in bibliographic tools to explore the
162
+ literature. Examples include: Elicit which uses AI to search for and summarise
163
+ papers _scite which uses AI to do sentiment analysis on citations (does paper
164
+ A cite paper B favourably or not?) ResearchRabbit which uses lists, networks,
165
+ and timelines to discover related research Scispace which navigates connections
166
+ between...","published_at":1682427660,"updated_at":1682607068,"indexed_at":1689006804,"authors":[{"url":"https://orcid.org/0000-0002-7101-9767","name":"Roderic
167
+ Page"}],"image":null,"tags":["cards","flow","Knowledge Graph","Miller column","RabbitResearch"],"language":"en","reference":[]},{"id":"dc829ab3-f0f1-40a4-b16d-a36dc0e34166","doi":"https://doi.org/10.59350/463yw-pbj26","url":"https://iphylo.blogspot.com/2022/12/david-remsen.html","title":"David
168
+ Remsen","summary":"I heard yesterday from Martin Kalfatovic (BHL) that David
169
+ Remsen has died. Very sad news. It''s starting to feel like iPhylo might end
170
+ up being a list of obituaries of people working on biodiversity informatics
171
+ (e.g., Scott Federhen). I spent several happy visits at MBL at Woods Hole
172
+ talking to Dave at the height of the uBio project, which really kickstarted
173
+ large scale indexing of taxonomic names, and the use of taxonomic name finding
174
+ tools to index the literature. His work on uBio with David...","published_at":1671213240,"updated_at":1671264743,"indexed_at":1689006804,"authors":[{"url":"https://orcid.org/0000-0002-7101-9767","name":"Roderic
175
+ Page"}],"image":null,"tags":["David Remsen","obituary","uBio"],"language":"en","reference":[]},{"id":"30c78d9d-2e50-49db-9f4f-b3baa060387b","doi":"https://doi.org/10.59350/zc4qc-77616","url":"https://iphylo.blogspot.com/2022/09/does-anyone-cite-taxonomic-treatments.html","title":"Does
176
+ anyone cite taxonomic treatments?","summary":"Taxonomic treatments have come
177
+ up in various discussions I''m involved in, and I''m curious as to whether
178
+ they are actually being used, in particular, whether they are actually being
179
+ cited. Consider the following quote: The taxa are described in taxonomic treatments,
180
+ well defined sections of scientific publications (Catapano 2019). They include
181
+ a nomenclatural section and one or more sections including descriptions, material
182
+ citations referring to studied specimens, or notes ecology and...","published_at":1662050940,"updated_at":1662050991,"indexed_at":1688982503,"authors":[{"url":"https://orcid.org/0000-0002-7101-9767","name":"Roderic
+ Page"}],"image":null,"tags":[],"language":"en","reference":[]},{"id":"c6b101f4-bfbc-4d01-921d-805c43c85757","doi":"https://doi.org/10.59350/j77nc-e8x98","url":"https://iphylo.blogspot.com/2022/08/linking-taxonomic-names-to-literature.html","title":"Linking
+ taxonomic names to the literature","summary":"Just some thoughts as I work
+ through some datasets linking taxonomic names to the literature. In the diagram
+ above I''ve tried to capture the different situatios I encounter. Much of
+ the work I''ve done on this has focussed on case 1 in the diagram: I want
+ to link a taxonomic name to an identifier for the work in which that name
+ was published. In practise this means linking names to DOIs. This has the
+ advantage of linking to a citable indentifier, raising questions such as whether
+ citations...","published_at":1661188740,"updated_at":1661188748,"indexed_at":1689006804,"authors":[{"url":"https://orcid.org/0000-0002-7101-9767","name":"Roderic
+ Page"}],"image":null,"tags":[],"language":"en","reference":[]},{"id":"e8e95aaf-bacb-4b5a-bf91-54e903526ab2","doi":"https://doi.org/10.59350/ndtkv-6ve80","url":"https://iphylo.blogspot.com/2021/11/revisiting-rss-to-monitor-latests.html","title":"Revisiting
  RSS to monitor the latest taxonomic research","summary":"Over a decade ago
  RSS (RDF Site Summary or Really Simple Syndication) was attracting a lot of
  interest as a way to integrate data across various websites. Many science
@@ -1210,206 +197,45 @@ http_interactions:
  three flavours of RSS (RDF, RSS, Atom). This led to tools such as uBioRSS
  [1] and my own e-Biosphere Challenge: visualising biodiversity digitisation
  in real time. It was a time of enthusiasm for aggregating lots of data, such
- as the ill-fated PLoS...","date_published":"2021-11-23T20:53:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
- (Roderic Page)"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
- both;\"><a href=\"https://1.bp.blogspot.com/-dsij6_nhdsc/SgHYU5MCwsI/AAAAAAAAAe8/9KN6Gm87sj0PCuJG-crvZoMbL8MJvusegCPcBGAYYCw/s257/feedicon.png\"
- style=\"display: block; padding: 1em 0; text-align: center; clear: right;
- float: right;\"><img alt=\"\" border=\"0\" width=\"128\" data-original-height=\"257\"
- data-original-width=\"257\" src=\"https://1.bp.blogspot.com/-dsij6_nhdsc/SgHYU5MCwsI/AAAAAAAAAe8/9KN6Gm87sj0PCuJG-crvZoMbL8MJvusegCPcBGAYYCw/s200/feedicon.png\"/></a></div>\n<p>Over
- a decade ago <a href=\"https://en.wikipedia.org/wiki/RSS\">RSS</a> (RDF Site
- Summary or Really Simple Syndication) was attracting a lot of interest as
- a way to integrate data across various websites. Many science publishers would
- provide a list of their latest articles in XML in one of three flavours of
- RSS (RDF, RSS, Atom). This led to tools such as <a href=\"http://ubio.org/rss/\">uBioRSS</a>
- [<a href=\"#Leary2007\">1</a>] and my own <a href=\"https://iphylo.blogspot.com/2009/05/e-biosphere-challenge-visualising.html\">e-Biosphere
- Challenge: visualising biodiversity digitisation in real time</a>. It was
- a time of enthusiasm for aggregating lots of data, such as the <a href=\"https://iphylo.blogspot.com/2013/07/the-demise-of-plos-biodiversity-hub.html\">ill-fated</a>
- PLoS Biodiversity Hub [<a href=\"#Mindell2011\">2</a>].</p>\n\n<p>Since I
- seem to be condemned to revisit old ideas rather than come up with anything
- new, I''ve been looking at providing a tool like the now defunct uBioRSS.
- The idea is to harvest RSS feeds from journals (with an emphasis on taxonomic
- and systematic journals), aggregate the results, and make them browsable by
- taxon and geography. Here''s a sneak peak:</p>\n\n<div class=\"separator\"
- style=\"clear: both;\"><a href=\"https://1.bp.blogspot.com/-toYkBpS81tE/YZ1VY1DU1FI/AAAAAAAAg4E/yFMM4Xc3AEE8BjCj0jKO0sLtT9ZI-3k8ACLcBGAsYHQ/s1032/Screenshot%2B2021-11-23%2Bat%2B20.00.06.png\"
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
- border=\"0\" width=\"400\" data-original-height=\"952\" data-original-width=\"1032\"
- src=\"https://1.bp.blogspot.com/-toYkBpS81tE/YZ1VY1DU1FI/AAAAAAAAg4E/yFMM4Xc3AEE8BjCj0jKO0sLtT9ZI-3k8ACLcBGAsYHQ/s400/Screenshot%2B2021-11-23%2Bat%2B20.00.06.png\"/></a></div>\n\n<p>What
- seems like a straightforward task quickly became a bit of a challenge. Not
- all journals have RSS feeds (they seem to have become less widely supported
- over time) so I need to think of alternative ways to get lists of recent articles.
- These lists also need to be processed in various ways. There are three versions
- of RSS, each with their own idiosyncracies, so I need to standardise things
- like dates. I also want to augment them with things like DOIs (often missing
- from RSS feeds) and thumbnails for the articles (often available on publisher
- websites but not the feeds). Then I need to index the content by taxon and
- geography. For taxa I use a version of Patrick Leary''s \"taxonfinder\" (see
- <a href=\"https://right-frill.glitch.me\">https://right-frill.glitch.me</a>)
- to find names, then the <a href=\"https://index.globalnames.org\">Global Names
- Index</a> to assign names found to the GBIF taxonomic hierarchy.</p>\n\n<p>Indexing
- by geography proved harder. Typically <a href=\"https://en.wikipedia.org/wiki/Toponym_resolution#Geoparsing\">geoparsing</a>
- involves taking a body of text and doing the following:\n<ul><li>Using named-entity
- recognition <a href=\"https://en.wikipedia.org/wiki/Named-entity_recognition\">NER</a>
- to identity named entities in the text (e.g., place names, people names, etc.).</li>\n<li>Using
- a gazetteer of geographic names <a href=\"http://www.geonames.org\">GeoNames</a>
- to try and match the place names found by NER.</li>\n</ul></p>\n\n<p>An example
- of such a parser is the <a href=\"https://www.ltg.ed.ac.uk/software/geoparser/\">Edinburgh
- Geoparser</a>. Typically geoparsing software can be large and tricky to install,
- especially if you are looking to make your installation publicly accessible.
- Geoparsing services seem to have a short half-life (e.g., <a href=\"https://geoparser.io\">Geoparser.io</a>),
- perhaps because they are so useful they quickly get swamped by users.</p>\n\n<p>Bearing
- this in mind, the approach I’ve taken here is to create a very simple geoparser
- that is focussed on fairly large areas, especially those relevant to biodiversity,
- and is aimed at geoparsing text such as abstracts of scientific papers. I''ve
- created a small database of places by harvesting data from Wikidata, then
- I use the \"flash text\" algorithm [<a href=\"#Singh2017\">3</a>] to find
- geographic places. This approach uses a <a href=\"https://en.wikipedia.org/wiki/Trie\">trie</a>
- to store the place names. All I do is walk through the text seeing whether
- the current word matches a place name (or the start of one) in the trie, then
- moving on. This is very quick and seems to work quite well.</p>\n\n<p>Given
- that I need to aggregate data from a lot of sources, apply various transformations
- to that data, then merge it, there are a lot of moving parts. I started playing
- with a \"NoCode\" platform for creating workflows, in this case <a href=\"https://n8n.io\">n8n</a>
- (in many ways reminiscent of the now defunct <a href=\"https://en.wikipedia.org/wiki/Yahoo!_Pipes\">Yahoo
- Pipes</a>). This was quite fun for a while, but after lots of experimentation
- I moved back to writing code to aggregate the data into a CouchDB database.
- CouchDB is one of the NoSQL databases that I really like as it has a great
- interface, and makes queries very easy to do once you get your head around
- how it works.</p>\n\n<p>So the end result of this is \"BioRSS\" <a href=\"https://biorss.herokuapp.com\">https://biorss.herokuapp.com</a>.
- The interface comprises a stream of articles listed from newest to oldest,
- with a treemap and a geographic map on the left. You can use these to filter
- the articles by taxonomic group and/or country. For example the screen shot
- is showing arthropods from China (in this case from a month or two ago in
- the journal <i>ZooKeys</i>). As much fun as the interface has been to construct,
- in many ways I don''t really want to spend time making an interface. For each
- combination of taxon and country I provide a RSS feed so if you have a favour
- feed reader you can grab the feed and view it there. As BioRSS updates the
- data your feed reader should automatically update the feed. This means that
- you can have a feed that monitors, say, new papers on spiders in China.</p>\n\n<p>In
- the spirit of \"release early and release often\" this is an early version
- of this app. I need to add a lot more feeds, back date them to bring in older
- content, and I also want to make use of aggregators such as PubMed, CrossRef,
- and Google Scholar. The existence of these tools is, I suspect, one reason
- why RSS feeds are less common than they used to be.</p>\n\n<p>So, if this
- sounds useful please take it for a spin at <a href=\"https://biorss.herokuapp.com\">https://biorss.herokuapp.com</a>.
- Feedback is welcome, especially suggestions for journals to harvest and add
- to the news feed. Ultimately I''d like to have sufficient coverage of the
- taxonomic literature so that BioRSS becomes a place where we can go to find
- the latest papers on any taxon of interest.</p>\n\n<h2>References</h2>\n\n<blockquote>\n<a
- name=\"Leary2007\">1.</a> Patrick R. Leary, David P. Remsen, Catherine N.
- Norton, David J. Patterson, Indra Neil Sarkar, uBioRSS: Tracking taxonomic
- literature using RSS, Bioinformatics, Volume 23, Issue 11, June 2007, Pages
- 1434–1436, <a href=\"https://doi.org/10.1093/bioinformatics/btm109\">https://doi.org/10.1093/bioinformatics/btm109</a>\n</blockquote>\n\n<blockquote><a
- name=\"Mindell2011\">2.</a> Mindell, D. P., Fisher, B. L., Roopnarine, P.,
- Eisen, J., Mace, G. M., Page, R. D. M., & Pyle, R. L. (2011). Aggregating,
- Tagging and Integrating Biodiversity Research. PLoS ONE, 6(8), e19491. <a
- href=\"https://doi.org/10.1371/journal.pone.0019491\">doi:10.1371/journal.pone.0019491</a>\n</blockquote>\n\n<blockquote><a
- name=\"Singh2017\">3.</a> Singh, V. (2017). Replace or Retrieve Keywords In
- Documents at Scale. CoRR, abs/1711.00046. <a href=\"http://arxiv.org/abs/1711.00046\">http://arxiv.org/abs/1711.00046</a>\n\n</blockquote>","tags":["geocoding","NoCode","RSS"],"language":"en","references":[{"doi":"https://doi.org/10.1093/bioinformatics/btm109","key":"ref1"},{"doi":"https://doi.org/10.1371/journal.pone.0019491","key":"ref2"},{"key":"ref3","url":"http://arxiv.org/abs/1711.00046"}]},{"id":"https://doi.org/10.59350/gf1dw-n1v47","uuid":"a41163e0-9c9a-41e0-a141-f772663f2f32","url":"https://iphylo.blogspot.com/2023/03/dugald-stuart-page-1936-2022.html","title":"Dugald
+ as the ill-fated PLoS...","published_at":1637700780,"updated_at":1637701172,"indexed_at":1689006804,"authors":[{"url":"https://orcid.org/0000-0002-7101-9767","name":"Roderic
+ Page"}],"image":null,"tags":["geocoding","NoCode","RSS"],"language":"en","reference":[{"doi":"https://doi.org/10.1093/bioinformatics/btm109","key":"ref1"},{"doi":"https://doi.org/10.1371/journal.pone.0019491","key":"ref2"},{"key":"ref3","url":"http://arxiv.org/abs/1711.00046"}]},{"id":"20b9d31e-513f-496b-b399-4215306e1588","doi":"https://doi.org/10.59350/en7e9-5s882","url":"https://iphylo.blogspot.com/2022/04/obsidian-markdown-and-taxonomic-trees.html","title":"Obsidian,
+ markdown, and taxonomic trees","summary":"Returning to the subject of personal
+ knowledge graphs Kyle Scheer has an interesting repository of Markdown files
+ that describe academic disciplines at https://github.com/kyletscheer/academic-disciplines
+ (see his blog post for more background). If you add these files to Obsidian
+ you get a nice visualisation of a taxonomy of academic disciplines. The applications
+ of this to biological taxonomy seem obvious, especially as a tool like Obsidian
+ enables all sorts of interesting links to be added...","published_at":1649365620,"updated_at":1649366134,"indexed_at":1689006804,"authors":[{"url":"https://orcid.org/0000-0002-7101-9767","name":"Roderic
+ Page"}],"image":null,"tags":["markdown","obsidian"],"language":"en","reference":[]},{"id":"d811172e-7798-403c-a83d-3d5317a9657e","doi":"https://doi.org/10.59350/w18j9-v7j10","url":"https://iphylo.blogspot.com/2022/08/papers-citing-data-that-cite-papers.html","title":"Papers
+ citing data that cite papers: CrossRef, DataCite, and the Catalogue of Life","summary":"Quick
+ notes to self following on from a conversation about linking taxonomic names
+ to the literature. Is there a way to turn those links into countable citations
+ (even if just one per database) for Google Scholar?— Wayne Maddison (@WayneMaddison)
+ August 3, 2022 There are different sorts of citation: Paper cites another
+ paper Paper cites a dataset Dataset cites a paper Citation type (1) is largely
+ a solved problem (although there are issues of the ownership and use of this...","published_at":1659526380,"updated_at":1659526393,"indexed_at":1689006804,"authors":[{"url":"https://orcid.org/0000-0002-7101-9767","name":"Roderic
+ Page"}],"image":null,"tags":["Catalogue of Life","citation","CrossRef","DataCite","DOI"],"language":"en","reference":[]},{"id":"a41163e0-9c9a-41e0-a141-f772663f2f32","doi":"https://doi.org/10.59350/gf1dw-n1v47","url":"https://iphylo.blogspot.com/2023/03/dugald-stuart-page-1936-2022.html","title":"Dugald
  Stuart Page 1936-2022","summary":"My dad died last weekend. Below is a notice
  in today''s New Zealand Herald. I''m in New Zealand for his funeral. Don''t
- really have the words for this right now.","date_published":"2023-03-14T03:00:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
- (Roderic Page)"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
- both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZweukxntl7R5jnk3knVFVrqZ5RxC7mPZBV4gKeDIglbFzs2O442nbxqs8t8jV2tLqCU24K6gS32jW-Pe8q3O_5JR1Ms3qW1aQAZ877cKkFfcUydqUba9HsgNlX-zS9Ne92eLxRGS8F-lStTecJw2oalp3u58Yoc0oM7CUin5LKPeFIJ7Rzg/s3454/_DSC5106.jpg\"
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
- border=\"0\" width=\"400\" data-original-height=\"2582\" data-original-width=\"3454\"
- src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZweukxntl7R5jnk3knVFVrqZ5RxC7mPZBV4gKeDIglbFzs2O442nbxqs8t8jV2tLqCU24K6gS32jW-Pe8q3O_5JR1Ms3qW1aQAZ877cKkFfcUydqUba9HsgNlX-zS9Ne92eLxRGS8F-lStTecJw2oalp3u58Yoc0oM7CUin5LKPeFIJ7Rzg/s400/_DSC5106.jpg\"/></a></div>\n\nMy
- dad died last weekend. Below is a notice in today''s New Zealand Herald. I''m
- in New Zealand for his funeral. Don''t really have the words for this right
- now.\n\n<div class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRUTOFF1VWHCl8dg3FQuaWy5LM7aX8IivdRpTtzgrdQTEymsA5bLTZE3cSQf1WQIP3XrC46JsLScP8BxTK9C5a-B1i51yg8WGSJD0heJVaoDLnerv0lD1o3qloDjqEuuyfX4wagHB5YYBmjWnGeVQvyYVngvDDf9eM6pmMtZ7x94Y4jSVrug/s3640/IMG_2870.jpeg\"
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
- border=\"0\" height=\"320\" data-original-height=\"3640\" data-original-width=\"1391\"
- src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRUTOFF1VWHCl8dg3FQuaWy5LM7aX8IivdRpTtzgrdQTEymsA5bLTZE3cSQf1WQIP3XrC46JsLScP8BxTK9C5a-B1i51yg8WGSJD0heJVaoDLnerv0lD1o3qloDjqEuuyfX4wagHB5YYBmjWnGeVQvyYVngvDDf9eM6pmMtZ7x94Y4jSVrug/s320/IMG_2870.jpeg\"/></a></div>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/c79vq-7rr11","uuid":"3cb94422-5506-4e24-a41c-a250bb521ee0","url":"https://iphylo.blogspot.com/2021/12/graphql-for-wikidata-wikicite.html","title":"GraphQL
- for WikiData (WikiCite)","summary":"I''ve released a very crude GraphQL endpoint
- for WikiData. More precisely, the endpoint is for a subset of the entities
- that are of interest to WikiCite, such as scholarly articles, people, and
- journals. There is a crude demo at https://wikicite-graphql.herokuapp.com.
- The endpoint itself is at https://wikicite-graphql.herokuapp.com/gql.php.
- There are various ways to interact with the endpoint, personally I like the
- Altair GraphQL Client by Samuel Imolorhe. As I''ve mentioned earlier it''s
- taken...","date_published":"2021-12-20T13:16:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
- (Roderic Page)"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
- both;\"><a href=\"https://blogger.googleusercontent.com/img/a/AVvXsEh7WW8TlfN2u3xe_E-sH5IK6AWYnAoWaKrP2b32UawUeqMpPlq6ZFk5BtJVZsMmCNh5j3QRsTj5H0Ee55RRGntnc9yjj_mNB8KHmH2dzocCgyLS2VFOxsBji6u4Ey6qxlDAnT-zrsBpnDcTchbhgt1x0Sf7RkmIMkS1y4-_3KCQian-SeIF-g=s1000\"
- style=\"display: block; padding: 1em 0; text-align: center; clear: right;
- float: right;\"><img alt=\"\" border=\"0\" width=\"128\" data-original-height=\"1000\"
- data-original-width=\"1000\" src=\"https://blogger.googleusercontent.com/img/a/AVvXsEh7WW8TlfN2u3xe_E-sH5IK6AWYnAoWaKrP2b32UawUeqMpPlq6ZFk5BtJVZsMmCNh5j3QRsTj5H0Ee55RRGntnc9yjj_mNB8KHmH2dzocCgyLS2VFOxsBji6u4Ey6qxlDAnT-zrsBpnDcTchbhgt1x0Sf7RkmIMkS1y4-_3KCQian-SeIF-g=s200\"/></a></div><p>I''ve
- released a very crude GraphQL endpoint for WikiData. More precisely, the endpoint
- is for a subset of the entities that are of interest to WikiCite, such as
- scholarly articles, people, and journals. There is a crude demo at <a href=\"https://wikicite-graphql.herokuapp.com\">https://wikicite-graphql.herokuapp.com</a>.
- The endpoint itself is at <a href=\"https://wikicite-graphql.herokuapp.com/gql.php\">https://wikicite-graphql.herokuapp.com/gql.php</a>.
- There are various ways to interact with the endpoint, personally I like the
- <a href=\"https://altair.sirmuel.design\">Altair GraphQL Client</a> by <a
- href=\"https://github.com/imolorhe\">Samuel Imolorhe</a>.</p>\n\n<p>As I''ve
- <a href=\"https://iphylo.blogspot.com/2021/04/it-been-while.html\">mentioned
- earlier</a> it''s taken me a while to see the point of GraphQL. But it is
- clear it is gaining traction in the biodiversity world (see for example the
- <a href=\"https://dev.gbif.org/hosted-portals.html\">GBIF Hosted Portals</a>)
- so it''s worth exploring. My take on GraphQL is that it is a way to create
- a self-describing API that someone developing a web site can use without them
- having to bury themselves in the gory details of how data is internally modelled.
- For example, WikiData''s query interface uses SPARQL, a powerful language
- that has a steep learning curve (in part because of the administrative overhead
- brought by RDF namespaces, etc.). In my previous SPARQL-based projects such
- as <a href=\"https://ozymandias-demo.herokuapp.com\">Ozymandias</a> and <a
- href=\"http://alec-demo.herokuapp.com\">ALEC</a> I have either returned SPARQL
- results directly (Ozymandias) or formatted SPARQL results as <a href=\"https://schema.org/DataFeed\">schema.org
- DataFeeds</a> (equivalent to RSS feeds) (ALEC). Both approaches work, but
- they are project-specific and if anyone else tried to build based on these
- projects they might struggle for figure out what was going on. I certainly
- struggle, and I wrote them!</p>\n\n<p>So it seems worthwhile to explore this
- approach a little further and see if I can develop a GraphQL interface that
- can be used to build the sort of rich apps that I want to see. The demo I''ve
- created uses SPARQL under the hood to provide responses to the GraphQL queries.
- So in this sense it''s not replacing SPARQL, it''s simply providing a (hopefully)
- simpler overlay on top of SPARQL so that we can retrieve the data we want
- without having to learn the intricacies of SPARQL, nor how Wikidata models
- publications and people.</p>","tags":["GraphQL","SPARQL","WikiCite","Wikidata"],"language":"en","references":null},{"id":"https://doi.org/10.59350/ymc6x-rx659","uuid":"0807f515-f31d-4e2c-9e6f-78c3a9668b9d","url":"https://iphylo.blogspot.com/2022/09/dna-barcoding-as-intergenerational.html","title":"DNA
- barcoding as intergenerational transfer of taxonomic knowledge","summary":"I
- tweeted about this but want to bookmark it for later as well. The paper “A
- molecular-based identification resource for the arthropods of Finland” doi:10.1111/1755-0998.13510
- contains the following: …the annotated barcode records assembled by FinBOL
- participants represent a tremendous intergenerational transfer of taxonomic
- knowledge … the time contributed by current taxonomists in identifying and
- contributing voucher specimens represents a great gift to future generations
- who will benefit...","date_published":"2022-09-14T10:12:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
- (Roderic Page)"}],"image":null,"content_html":"<p>I <a href=\"https://twitter.com/rdmpage/status/1569738844416638981?s=21&amp;t=9OVXuoUEwZtQt-Ldzlutfw\">tweeted
- about this</a> but want to bookmark it for later as well. The paper “A molecular-based
- identification resource for the arthropods of Finland” <a href=\"https://doi.org/10.1111/1755-0998.13510\">doi:10.1111/1755-0998.13510</a>
- contains the following:</p>\n<blockquote>\n<p>…the annotated barcode records
- assembled by FinBOL participants represent a tremendous <mark>intergenerational
- transfer of taxonomic knowledge</mark> … the time contributed by current taxonomists
- in identifying and contributing voucher specimens represents a great gift
- to future generations who will benefit from their expertise when they are
- no longer able to process new material.</p>\n</blockquote>\n<p>I think this
- is a very clever way to characterise the project. In an age of machine learning
- this may be commonest way to share knowledge , namely as expert-labelled training
- data used to build tools for others. Of course, this means the expertise itself
- may be lost, which has implications for updating the models if the data isn’t
- complete. But it speaks to Charles Godfrey’s theme of <a href=\"https://biostor.org/reference/250587\">“Taxonomy
- as information science”</a>.</p>\n<p>Note that the knowledge is also transformed
- in the sense that the underlying expertise of interpreting morphology, ecology,
- behaviour, genomics, and the past literature is not what is being passed on.
- Instead it is probabilities that a DNA sequence belongs to a particular taxon.</p>\n<p>This
- feels is different to, say iNaturalist, where there is a machine learning
- model to identify images. In that case, the model is built on something the
- community itself has created, and continues to create. Yes, the underlying
- idea is that same: “experts” have labelled the data, a model is trained, the
- model is used. But the benefits of the <a href=\"https://www.inaturalist.org\">iNaturalist</a>
- model are immediately applicable to the people whose data built the model.
- In the case of barcoding, because the technology itself is still not in the
- hands of many (relative to, say, digital imaging), the benefits are perhaps
- less tangible. Obviously researchers working with environmental DNA will find
- it very useful, but broader impact may await the arrival of citizen science
- DNA barcoding.</p>\n<p>The other consideration is whether the barcoding helps
- taxonomists. Is it to be used to help prioritise future work (“we are getting
- lots of unknown sequences in these taxa, lets do some taxonomy there”), or
- is it simply capturing the knowledge of a generation that won’t be replaced:</p>\n<blockquote>\n<p>The
- need to capture such knowledge is essential because there are, for example,
- no young Finnish taxonomists who can critically identify species in many key
- groups of ar- thropods (e.g., aphids, chewing lice, chalcid wasps, gall midges,
- most mite lineages).</p>\n</blockquote>\n<p>The cycle of collect data, test
- and refine model, collect more data, rinse and repeat that happens with iNaturalist
- creates a feedback loop. It’s not clear that a similar cycle exists for DNA
- barcoding.</p>\n<blockquote>\n<p>Written with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/enxas-arj18","uuid":"ab5a6e04-d55e-4901-8269-9eea65ce7178","url":"https://iphylo.blogspot.com/2022/08/can-we-use-citation-graph-to-measure.html","title":"Can
+ really have the words for this right now.","published_at":1678762800,"updated_at":1679469956,"indexed_at":1689006804,"authors":[{"url":"https://orcid.org/0000-0002-7101-9767","name":"Roderic
+ Page"}],"image":null,"tags":[],"language":"en","reference":[]},{"id":"8bc3fea6-cb86-4344-8dad-f312fbf58041","doi":"https://doi.org/10.59350/t6fb9-4fn44","url":"https://iphylo.blogspot.com/2021/12/the-business-of-extracting-knowledge.html","title":"The
+ Business of Extracting Knowledge from Academic Publications","summary":"Markus
+ Strasser (@mkstra write a fascinating article entitled \"The Business of Extracting
+ Knowledge from Academic Publications\". I spent months working on domain-specific
+ search engines and knowledge discovery apps for biomedicine and eventually
+ figured that synthesizing \"insights\" or building knowledge graphs by machine-reading
+ the academic literature (papers) is *barely useful* :https://t.co/eciOg30Odc—
+ Markus Strasser (@mkstra) December 7, 2021 His TL;DR: TL;DR: I...","published_at":1639180860,"updated_at":1639180881,"indexed_at":1689006804,"authors":[{"url":"https://orcid.org/0000-0002-7101-9767","name":"Roderic
+ Page"}],"image":null,"tags":["ai","business model","text mining"],"language":"en","reference":[]},{"id":"96fa91d5-459c-482f-aa38-dda6e0a30e20","doi":"https://doi.org/10.59350/7esgr-61v1","url":"https://iphylo.blogspot.com/2022/01/large-graph-viewer-experiments.html","title":"Large
+ graph viewer experiments","summary":"I keep returning to the problem of viewing
+ large graphs and trees, which means my hard drive has accumulated lots of
+ failed prototypes. Inspired by some recent discussions on comparing taxonomic
+ classifications I decided to package one of these (wildly incomplete) prototypes
+ up so that I can document the idea and put the code somewhere safe. Very cool,
+ thanks for sharing this-- the tree diff is similar to what J Rees has been
+ cooking up lately with his ''cl diff'' tool. I''ll tag...","published_at":1641122700,"updated_at":1641122959,"indexed_at":1689006804,"authors":[{"url":"https://orcid.org/0000-0002-7101-9767","name":"Roderic
+ Page"}],"image":null,"tags":["Google Maps","graph","Mammal Species of the
+ World","mammals","taxonomy"],"language":"en","reference":[]},{"id":"ab5a6e04-d55e-4901-8269-9eea65ce7178","doi":"https://doi.org/10.59350/enxas-arj18","url":"https://iphylo.blogspot.com/2022/08/can-we-use-citation-graph-to-measure.html","title":"Can
  we use the citation graph to measure the quality of a taxonomic database?","summary":"More
  arm-waving notes on taxonomic databases. I''ve started to add data to ChecklistBank
  and this has got me thinking about the issue of data quality. When you add
@@ -1417,375 +243,34 @@ http_interactions:
  on the Catalogue of Life Checklist Confidence system of one - five stars:
  ★ - ★★★★★. I''m scepetical about the notion of confidence or \"trust\" when
  it is reduced to a star system (see also Can you trust EOL?). I could literally
- pick any number of stars, there''s...","date_published":"2022-08-24T14:33:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
- (Roderic Page)"}],"image":null,"content_html":"<p>More arm-waving notes on
- taxonomic databases. I''ve started to add data to <a href=\"https://www.checklistbank.org\">ChecklistBank</a>
- and this has got me thinking about the issue of data quality. When you add
- data to ChecklistBank you are asked to give a measure of confidence based
- on the <a href=\"https://www.catalogueoflife.org/about/glossary.html#checklist-confidence\">Catalogue
- of Life Checklist Confidence</a> system of one - five stars: ★ - ★★★★★. I''m
- scepetical about the notion of confidence or \"trust\" when it is reduced
- to a star system (see also <a href=\"https://iphylo.blogspot.com/2012/06/can-you-trust-eol.html\">Can
- you trust EOL?</a>). I could literally pick any number of stars, there''s
- no way to measure what number of stars is appropriate. This feeds into my
- biggest reservation about the <a href=\"https://www.catalogueoflife.org\">Catalogue
- of Life</a>, it''s almost entirely authority based, not evidence based. That
- is, rather than give us evidence for why a particular taxon is valid, we are
- (mostly) just given a list of taxa are asked to accept those as gospel, based
- on assertions by one or more authorities. I''m not necessarly doubting the
- knowledge of those making these lists, it''s just that I think we need to
- do better than \"these are the accepted taxa because I say so\" implict in
- the Catalogue of Life.\n</p>\n\n<p>So, is there any way we could objectively
- measure the quality of a particular taxonomic checklist? Since I have a long
- standing interest in link the primary taxonomic litertaure to names in databases
- (since that''s where the evidence is), I keep wondering whether measures based
- on that literture could be developed. \n</p>\n<p>\nI recently revisited the
- fascinating (and quite old) literature on rates of synonymy:\n</p>\n<blockquote>\nGaston
- Kevin J. and Mound Laurence A. 1993 Taxonomy, hypothesis testing and the
- biodiversity crisisProc. R. Soc. Lond. B.251139–142\n<a href=\"http://doi.org/10.1098/rspb.1993.0020\">http://doi.org/10.1098/rspb.1993.0020</a>\n</blockquote>\n \n<blockquote>\n Andrew
- R. Solow, Laurence A. Mound, Kevin J. Gaston, Estimating the Rate of Synonymy,
- Systematic Biology, Volume 44, Issue 1, March 1995, Pages 93–96, <a href=\"https://doi.org/10.1093/sysbio/44.1.93\">https://doi.org/10.1093/sysbio/44.1.93</a>\n</blockquote>\n\n</p>\n\n<p>\nA
- key point these papers make is that the observed rate of synonymy is quite
- high (that is, many \"new species\" end up being merged with already known
- species), and that because it can take time to discover that a species is
- a synonym the actual rate may be even higher. In other words, in diagrams
- like the one reproduced below, the reason the proportion of synonyms declines
- the nearer we get to the present day (this paper came out in 1995) is not
1454
- because we are creating fewer synonyms but because we''ve not yet had time
1455
- to do the work to uncover the remaining synonyms.\n</p>\n \n<div class=\"separator\"
1456
- style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhgwvBcDUD_IJZyAa0gaqXB--ogJeVooV19I635N_w2OuztDFsy-ZPXEFuW5S1NUxxUEHmN8pbdO9MPljIi5v0A2355kYotL1fKqdewP9kmWu9TfwtIJ3jD04SQjeF3SWK-yMVAx-rNc6TO43GIwmftPk6IghOrcmur6SoHe06ws7fEFAvxA/s621/Screenshot%202022-08-24%20at%2014.59.47.png\"
1457
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
1458
- border=\"0\" width=\"400\" data-original-height=\"404\" data-original-width=\"621\"
1459
- src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhgwvBcDUD_IJZyAa0gaqXB--ogJeVooV19I635N_w2OuztDFsy-ZPXEFuW5S1NUxxUEHmN8pbdO9MPljIi5v0A2355kYotL1fKqdewP9kmWu9TfwtIJ3jD04SQjeF3SWK-yMVAx-rNc6TO43GIwmftPk6IghOrcmur6SoHe06ws7fEFAvxA/s400/Screenshot%202022-08-24%20at%2014.59.47.png\"/></a></div>\n\n<p>Put
1460
- another way, these papers are arguing that the real work of taxonomy is revision,
1461
- not species discovery, especially since it''s not uncommon for > 50% of species
1462
- in a taxon to end up being synonymised. Indeed, if a taxonomic group has few
1463
- synonyms then these authors would argue that''s a sign of neglect. More revisionary
1464
- work would likely uncover additional synonyms. So, what we need is a way to
1465
- measure the amount of research on a taxonomic group. It occurs to me that
1466
- we could use the citation graph as a way to tackle this. Let''s imagine we have
1467
- a set of taxa (say a family) and we have all the papers that described new
1468
- species or undertook revisions (or both). The extensiveness of that work could
1469
- be measured by the citation graph. For example, build the citation graph for
1470
- those papers. How many original species descriptions are not cited? Those
1471
- species have been potentially neglected. How many large-scale revisions have
1472
- there been (as measured by the numbers of taxonomic papers those revisions
1473
- cite)? There are some interesting approaches to quantifying this, such as
1474
- using <a href=\"https://en.wikipedia.org/wiki/HITS_algorithm\">hubs and authorities</a>.</p>\n \n \n<p>I''m
1475
- aware that taxonomists have not had the happiest relationship with citations:\n \n<blockquote>\nPinto
1476
- ÂP, Mejdalani G, Mounce R, Silveira LF, Marinoni L, Rafael JA. Are publications
1477
- on zoological taxonomy under attack? R Soc Open Sci. 2021 Feb 10;8(2):201617.
1478
- <a href=\"https://doi.org/10.1098/rsos.201617\">doi: 10.1098/rsos.201617</a>.
1479
- PMID: 33972859; PMCID: PMC8074659.\n</blockquote>\n\nStill, I think there
1480
- is an intriguing possibility here. For this approach to work, we need to have
1481
- linked taxonomic names to publications, and have citation data for those publications.
1482
- This is happening on various platforms. Wikidata, for example, is becoming
1483
- a repository of the taxonomic literature, some of it with citation links.\n\n<blockquote>\nPage
1484
- RDM. 2022. Wikidata and the bibliography of life. PeerJ 10:e13712 <a href=\"https://doi.org/10.7717/peerj.13712\">https://doi.org/10.7717/peerj.13712</a>\n</blockquote>\n\nTime
1485
- for some experiments.\n</p>","tags":["Bibliography of Life","citation","synonymy","taxonomic
1486
- databases"],"language":"en","references":null},{"id":"https://doi.org/10.59350/cbzgz-p8428","uuid":"a93134aa-8b33-4dc7-8cd4-76cdf64732f4","url":"https://iphylo.blogspot.com/2023/04/library-interfaces-knowledge-graphs-and.html","title":"Library
1487
- interfaces, knowledge graphs, and Miller columns","summary":"Some quick notes
1488
- on interface ideas for digital libraries and/or knowledge graphs. Recently
1489
- there’s been something of an explosion in bibliographic tools to explore the
1490
- literature. Examples include: Elicit which uses AI to search for and summarise
1491
- papers _scite which uses AI to do sentiment analysis on citations (does paper
1492
- A cite paper B favourably or not?) ResearchRabbit which uses lists, networks,
1493
- and timelines to discover related research Scispace which navigates connections
1494
- between...","date_published":"2023-04-25T13:01:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
1495
- (Roderic Page)"}],"image":null,"content_html":"<p>Some quick notes on interface
1496
- ideas for digital libraries and/or knowledge graphs.</p>\n<p>Recently there’s
1497
- been something of an explosion in bibliographic tools to explore the literature.
1498
- Examples include:</p>\n<ul>\n<li><a href=\"https://elicit.org\">Elicit</a>
1499
- which uses AI to search for and summarise papers</li>\n<li><a href=\"https://scite.ai\">_scite</a>
1500
- which uses AI to do sentiment analysis on citations (does paper A cite paper
1501
- B favourably or not?)</li>\n<li><a href=\"https://www.researchrabbit.ai\">ResearchRabbit</a>
1502
- which uses lists, networks, and timelines to discover related research</li>\n<li><a
1503
- href=\"https://typeset.io\">Scispace</a> which navigates connections between
1504
- papers, authors, topics, etc., and provides AI summaries.</li>\n</ul>\n<p>As
1505
- an aside, I think these (and similar tools) are a great example of how bibliographic
1506
- data such as abstracts, the citation graph and - to a lesser extent - full
1507
- text - have become commodities. That is, what was once proprietary information
1508
- is now free to anyone, which in turn means a whole ecosystem of new tools
1509
- can emerge. If I was clever I’d be building a <a href=\"https://en.wikipedia.org/wiki/Wardley_map\">Wardley
1510
- map</a> to explore this. Note that a decade or so ago reference managers like
1511
- <a href=\"https://www.zotero.org\">Zotero</a> were made possible by publishers
1512
- exposing basic bibliographic data on their articles. As we move to <a href=\"https://i4oc.org\">open
1513
- citations</a> we are seeing the next generation of tools.</p>\n<p>Back to
1514
- my main topic. As usual, rather than focus on what these tools do I’m more
1515
- interested in how they <strong>look</strong>. I have history here, when the
1516
- iPad came out I was intrigued by the possibilities it offered for displaying
1517
- academic articles, as discussed <a href=\"https://iphylo.blogspot.com/2010/08/viewing-scientific-articles-on-ipad.html\">here</a>,
1518
- <a href=\"https://iphylo.blogspot.com/2010/09/viewing-scientific-articles-on-ipad.html\">here</a>,
1519
- <a href=\"https://iphylo.blogspot.com/2010/08/viewing-scientific-articles-on-ipad_24.html\">here</a>,
1520
- <a href=\"https://iphylo.blogspot.com/2010/08/viewing-scientific-articles-on-ipad_3052.html\">here</a>,
1521
- and <a href=\"https://iphylo.blogspot.com/2010/08/viewing-scientific-articles-on-ipad_31.html\">here</a>.
1522
- ResearchRabbit looks like this:</p>\n<div style=\"padding:86.91% 0 0 0;position:relative;\"><iframe
1523
- src=\"https://player.vimeo.com/video/820871442?h=23b05b0dae&amp;badge=0&amp;autopause=0&amp;player_id=0&amp;app_id=58479\"
1524
- frameborder=\"0\" allow=\"autoplay; fullscreen; picture-in-picture\" allowfullscreen
1525
- style=\"position:absolute;top:0;left:0;width:100%;height:100%;\" title=\"ResearchRabbit\"></iframe></div><script
1526
- src=\"https://player.vimeo.com/api/player.js\"></script>\n<p>Scispace’s <a
1527
- href=\"https://typeset.io/explore/journals/parassitologia-1ieodjwe\">“trace”
1528
- view</a> looks like this:</p>\n<div style=\"padding:84.55% 0 0 0;position:relative;\"><iframe
1529
- src=\"https://player.vimeo.com/video/820871348?h=2db7b661ef&amp;badge=0&amp;autopause=0&amp;player_id=0&amp;app_id=58479\"
1530
- frameborder=\"0\" allow=\"autoplay; fullscreen; picture-in-picture\" allowfullscreen
1531
- style=\"position:absolute;top:0;left:0;width:100%;height:100%;\" title=\"Scispace
1532
- screencast\"></iframe></div><script src=\"https://player.vimeo.com/api/player.js\"></script>\n<p>What
1533
- is interesting about both is that they display content from left to right
1534
- in vertical columns, rather than the more common horizontal rows. This sort
1535
- of display is sometimes called <a href=\"https://en.wikipedia.org/wiki/Miller_columns\">Miller
1536
- columns</a> or a <a href=\"https://web.archive.org/web/20210726134921/http://designinginterfaces.com/firstedition/index.php?page=Cascading_Lists\">cascading
1537
- list</a>.</p>\n\n<div class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBPnV9fRBcvm-BX5PjzfG5Cff9PerCLsTW8d5ZbsL6b41t7ypD7ovmcgfTf3b4b34mbq8NM4sfwOHkgEq32FLYnD497RFQD4HQmYmh5Eveu1zWdDVyKyDtPyE98QoxTaOEnLA5kK0fnl3dOOEgUvtVKlTZ8bt1gj2v_8tDRWl9f50ybyei3A/s1024/GNUstep-liveCD.png\"
1538
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
1539
- border=\"0\" width=\"400\" data-original-height=\"768\" data-original-width=\"1024\"
1540
- src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBPnV9fRBcvm-BX5PjzfG5Cff9PerCLsTW8d5ZbsL6b41t7ypD7ovmcgfTf3b4b34mbq8NM4sfwOHkgEq32FLYnD497RFQD4HQmYmh5Eveu1zWdDVyKyDtPyE98QoxTaOEnLA5kK0fnl3dOOEgUvtVKlTZ8bt1gj2v_8tDRWl9f50ybyei3A/s400/GNUstep-liveCD.png\"/></a></div>\n\n<p>By
1541
- Gürkan Sengün (talk) - Own work, Public Domain, <a href=\"https://commons.wikimedia.org/w/index.php?curid=594715\">https://commons.wikimedia.org/w/index.php?curid=594715</a></p>\n<p>I’ve
1542
- always found displaying a knowledge graph to be a challenge, as discussed
1543
- <a href=\"https://iphylo.blogspot.com/2019/07/notes-on-collections-knowledge-graphs.html\">elsewhere
1544
- on this blog</a> and in my paper on <a href=\"https://peerj.com/articles/6739/#p-29\">Ozymandias</a>.
1545
- Miller columns enable one to drill down in increasing depth, but it doesn’t
1546
- need to be a tree, it can be a path within a network. What I like about ResearchRabbit
1547
- and the original Scispace interface is that they present the current item
1548
- together with a list of possible connections (e.g., authors, citations) that
1549
- you can drill down on. Clicking on these will result in a new column being
1550
- appended to the right, with a view (typically a list) of the next candidates
1551
- to visit. In graph terms, these are adjacent nodes to the original item. The
1552
- clickable badges on each item can be thought of as sets of edges that have
1553
- the same label (e.g., “authored by”, “cites”, “funded”, “is about”, etc.).
1554
- Each of these nodes itself becomes a starting point for further exploration.
1555
- Note that the original starting point isn’t privileged, other than being the
1556
- starting point. That is, each time we drill down we are seeing the same type
1557
- of information displayed in the same way. Note also that the navigation can
1558
- be thought of as a <strong>card</strong> for a node, with <strong>buttons</strong>
1559
- grouping the adjacent nodes. When we click on an individual button, it expands
1560
- into a <strong>list</strong> in the next column. This can be thought of as
1561
- a preview for each adjacent node. Clicking on an element in the list generates
1562
- a new card (we are viewing a single node) and we get another set of buttons
1563
- corresponding to the adjacent nodes.</p>\n<p>One important behaviour in a
1564
- Miller column interface is that the current path can be pruned at any point.
1565
- If we go back (i.e., scroll to the left) and click on another tab on an item,
1566
- everything downstream of that item (i.e., to the right) gets deleted and replaced
1567
- by a new set of nodes. This could make retrieving a particular history of
1568
- browsing a bit tricky, but encourages exploration. Both Scispace and ResearchRabbit have
1569
- the ability to add items to a collection, so you can keep track of things
1570
- you discover.</p>\n<p>Lots of food for thought, I’m assuming that there is
1571
- some user interface/experience research on Miller columns. One thing to remember
1572
- is that Miller columns are most often associated with trees, but in this case
1573
- we are exploring a network. That means that potentially there is no limit
1574
- to the number of columns being generated as we wander through the graph. It
1575
- will be interesting to think about what the average depth is likely to be,
1576
- in other words, how deep down the rabbit hole will we go?</p>\n\n<h3>Update</h3>\n<p>Should
1577
- add link to David Regev''s explorations of <a href=\"https://medium.com/david-regev-on-ux/flow-browser-b730daf0f717\">Flow
1578
- Browser</a>.\n\n<blockquote>\n<p>Written with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":["cards","flow","Knowledge
1579
- Graph","Miller column","RabbitResearch"],"language":"en","references":null},{"id":"https://doi.org/10.59350/t6fb9-4fn44","uuid":"8bc3fea6-cb86-4344-8dad-f312fbf58041","url":"https://iphylo.blogspot.com/2021/12/the-business-of-extracting-knowledge.html","title":"The
1580
- Business of Extracting Knowledge from Academic Publications","summary":"Markus
1581
- Strasser (@mkstra) wrote a fascinating article entitled \"The Business of Extracting
1582
- Knowledge from Academic Publications\". I spent months working on domain-specific
1583
- search engines and knowledge discovery apps for biomedicine and eventually
1584
- figured that synthesizing &quot;insights&quot; or building knowledge graphs
1585
- by machine-reading the academic literature (papers) is *barely useful* :https://t.co/eciOg30Odc&mdash;
1586
- Markus Strasser (@mkstra) December 7, 2021 His TL;DR: TL;DR: I...","date_published":"2021-12-11T00:01:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
1587
- (Roderic Page)"}],"image":null,"content_html":"<p>Markus Strasser (<a href=\"https://twitter.com/mkstra\">@mkstra</a>)
1588
- wrote a fascinating article entitled <a href=\"https://markusstrasser.org/extracting-knowledge-from-literature/\">\"The
1589
- Business of Extracting Knowledge from Academic Publications\"</a>.</p>\n\n<blockquote
1590
- class=\"twitter-tweet\"><p lang=\"en\" dir=\"ltr\">I spent months working
1591
- on domain-specific search engines and knowledge discovery apps for biomedicine
1592
- and eventually figured that synthesizing &quot;insights&quot; or building
1593
- knowledge graphs by machine-reading the academic literature (papers) is *barely
1594
- useful* :<a href=\"https://t.co/eciOg30Odc\">https://t.co/eciOg30Odc</a></p>&mdash;
1595
- Markus Strasser (@mkstra) <a href=\"https://twitter.com/mkstra/status/1468334482113523716?ref_src=twsrc%5Etfw\">December
1596
- 7, 2021</a></blockquote> <script async src=\"https://platform.twitter.com/widgets.js\"
1597
- charset=\"utf-8\"></script>\n\n<p>His TL;DR:</p>\n\n<p><blockquote>\nTL;DR:
1598
- I worked on biomedical literature search, discovery and recommender web applications
1599
- for many months and concluded that extracting, structuring or synthesizing
1600
- \"insights\" from academic publications (papers) or building knowledge bases
1601
- from a domain corpus of literature has negligible value in industry.</p>\n\n<p>Close
1602
- to nothing of what makes science actually work is published as text on the
1603
- web.\n</blockquote></p>\n\n<p>After recounting the many problems of knowledge
1604
- extraction - including a swipe at nanopubs which \"are ... dead in my view
1605
- (without admitting it)\" - he concludes:</p>\n\n<p><blockquote>\nI’ve been
1606
- flirting with this entire cluster of ideas including open source web annotation,
1607
- semantic search and semantic web, public knowledge graphs, nano-publications,
1608
- knowledge maps, interoperable protocols and structured data, serendipitous
1609
- discovery apps, knowledge organization, communal sense making and academic
1610
- literature/publishing toolchains for a few years on and off ... nothing of
1611
- it will go anywhere.</p>\n\n<p>Don’t take that as a challenge. Take it as
1612
- a red flag and run. Run towards better problems.\n</blockquote></p>\n\n<p>Well
1613
- worth a read, and much food for thought.</p>","tags":["ai","business model","text
1614
- mining"],"language":"en","references":null},{"id":"https://doi.org/10.59350/463yw-pbj26","uuid":"dc829ab3-f0f1-40a4-b16d-a36dc0e34166","url":"https://iphylo.blogspot.com/2022/12/david-remsen.html","title":"David
1615
- Remsen","summary":"I heard yesterday from Martin Kalfatovic (BHL) that David
1616
- Remsen has died. Very sad news. It''s starting to feel like iPhylo might end
1617
- up being a list of obituaries of people working on biodiversity informatics
1618
- (e.g., Scott Federhen). I spent several happy visits at MBL at Woods Hole
1619
- talking to Dave at the height of the uBio project, which really kickstarted
1620
- large scale indexing of taxonomic names, and the use of taxonomic name finding
1621
- tools to index the literature. His work on uBio with David...","date_published":"2022-12-16T17:54:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
1622
- (Roderic Page)"}],"image":null,"content_html":"<p>I heard yesterday from Martin
1623
- Kalfatovic (BHL) that David Remsen has died. Very sad news. It''s starting
1624
- to feel like iPhylo might end up being a list of obituaries of people working
1625
- on biodiversity informatics (e.g., <a href=\"https://iphylo.blogspot.com/2016/05/scott-federhen-rip.html\">Scott
1626
- Federhen</a>).</p>\n\n<p>I spent several happy visits at MBL at Woods Hole
1627
- talking to Dave at the height of the uBio project, which really kickstarted
1628
- large scale indexing of taxonomic names, and the use of taxonomic name finding
1629
- tools to index the literature. His work on uBio with David (\"Paddy\") Patterson
1630
- led to the <a href=\"https://eol.org\">Encyclopedia of Life</a> (EOL).</p>\n\n<p>A
1631
- number of the things I''m currently working on are things Dave started. For
1632
- example, I recently uploaded a version of his dataset for Nomenclator Zoologicus[1]
1633
- to <a href=\"https://www.checklistbank.org/dataset/126539/about\">ChecklistBank</a>
1634
- where I''m working on augmenting that original dataset by adding links to
1635
- the taxonomic literature. My <a href=\"https://biorss.herokuapp.com/?feed=Y291bnRyeT1XT1JMRCZwYXRoPSU1QiUyMkJJT1RBJTIyJTVE\">BioRSS
1636
- project</a> is essentially an attempt to revive uBioRSS[2] (see <a href=\"https://iphylo.blogspot.com/2021/11/revisiting-rss-to-monitor-latests.html\">Revisiting
1637
- RSS to monitor the latest taxonomic research</a>).</p>\n\n<p>I have fond memories
1638
- of those visits to Woods Hole. A very sad day indeed.</p>\n\n<p><b>Update:</b>
1639
- The David Remsen Memorial Fund has been set up on <a href=\"https://www.gofundme.com/f/david-remsen-memorial-fund\">GoFundMe</a>.</p>\n\n<p>1.
1640
- Remsen, D. P., Norton, C., & Patterson, D. J. (2006). Taxonomic Informatics
1641
- Tools for the Electronic Nomenclator Zoologicus. The Biological Bulletin,
1642
- 210(1), 18–24. https://doi.org/10.2307/4134533</p>\n\n<p>2. Patrick R. Leary,
1643
- David P. Remsen, Catherine N. Norton, David J. Patterson, Indra Neil Sarkar,
1644
- uBioRSS: Tracking taxonomic literature using RSS, Bioinformatics, Volume 23,
1645
- Issue 11, June 2007, Pages 1434–1436, https://doi.org/10.1093/bioinformatics/btm109</p>","tags":["David
1646
- Remsen","obituary","uBio"],"language":"en","references":null},{"id":"https://doi.org/10.59350/pmhat-5ky65","uuid":"5891c709-d139-440f-bacb-06244424587a","url":"https://iphylo.blogspot.com/2021/10/problems-with-plazi-parsing-how.html","title":"Problems
1647
- with Plazi parsing: how reliable are automated methods for extracting specimens
1648
- from the literature?","summary":"The Plazi project has become one of the major
1649
- contributors to GBIF with some 36,000 datasets yielding some 500,000 occurrences
1650
- (see Plazi''s GBIF page for details). These occurrences are extracted from
1651
- taxonomic publications using automated methods. New data is published almost
1652
- daily (see latest treatments). The map below shows the geographic distribution
1653
- of material citations provided to GBIF by Plazi, which gives you a sense of
1654
- the size of the dataset. By any metric Plazi represents a...","date_published":"2021-10-25T11:10:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
1655
- (Roderic Page)"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
1656
- both;\"><a href=\"https://1.bp.blogspot.com/-oiqSkA53FI4/YXaRAUpRgFI/AAAAAAAAgzY/PbPEOh_VlhIE0aaqqhVQPLAOE5pRULwJACLcBGAsYHQ/s240/Rf7UoXTw_400x400.jpg\"
1657
- style=\"display: block; padding: 1em 0; text-align: center; clear: right;
1658
- float: right;\"><img alt=\"\" border=\"0\" width=\"128\" data-original-height=\"240\"
1659
- data-original-width=\"240\" src=\"https://1.bp.blogspot.com/-oiqSkA53FI4/YXaRAUpRgFI/AAAAAAAAgzY/PbPEOh_VlhIE0aaqqhVQPLAOE5pRULwJACLcBGAsYHQ/s200/Rf7UoXTw_400x400.jpg\"/></a></div><p>The
1660
- <a href=\"http://plazi.org\">Plazi</a> project has become one of the major
1661
- contributors to GBIF with some 36,000 datasets yielding some 500,000 occurrences
1662
- (see <a href=\"https://www.gbif.org/publisher/7ce8aef0-9e92-11dc-8738-b8a03c50a862\">Plazi''s
1663
- GBIF page</a> for details). These occurrences are extracted from taxonomic
1664
- publications using automated methods. New data is published almost daily (see
1665
- <a href=\"https://tb.plazi.org/GgServer/static/newToday.html\">latest treatments</a>).
1666
- The map below shows the geographic distribution of material citations provided
1667
- to GBIF by Plazi, which gives you a sense of the size of the dataset.</p>\n\n<div
1668
- class=\"separator\" style=\"clear: both;\"><a href=\"https://1.bp.blogspot.com/-DCJ4HR8eej8/YXaRQnz22bI/AAAAAAAAgz4/AgRcree6jVgjtQL2ch7IXgtb_Xtx7fkngCPcBGAYYCw/s1030/Screenshot%2B2021-10-24%2Bat%2B20.43.23.png\"
1669
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
1670
- border=\"0\" width=\"400\" data-original-height=\"514\" data-original-width=\"1030\"
1671
- src=\"https://1.bp.blogspot.com/-DCJ4HR8eej8/YXaRQnz22bI/AAAAAAAAgz4/AgRcree6jVgjtQL2ch7IXgtb_Xtx7fkngCPcBGAYYCw/s400/Screenshot%2B2021-10-24%2Bat%2B20.43.23.png\"/></a></div>\n\n<p>By
1672
- any metric Plazi represents a considerable achievement. But often when I browse
1673
- individual records on Plazi I find records that seem clearly incorrect. Text
1674
- mining the literature is a challenging problem, but at the moment Plazi seems
1675
- something of a \"black box\". PDFs go in, the content is mined, and data comes
1676
- up to be displayed on the Plazi web site and uploaded to GBIF. Nowhere does
1677
- there seem to be an evaluation of how accurate this text mining actually is.
1678
- Anecdotally it seems to work well in some cases, but in others it produces
1679
- what can only be described as bogus records.</p>\n\n<h2>Finding errors</h2>\n\n<p>A
1680
- treatment in Plazi is a block of text (and sometimes illustrations) that refers
1681
- to a single taxon. Often that text will include a description of the taxon,
1682
- and list one or more specimens that have been examined. These lists of specimens
1683
- (\"material citations\") are one of the key bits of information that Plazi
1684
- extracts from a treatment as these citations get fed into GBIF as occurrences.</p>\n\n<p>To
1685
- help explore treatments I''ve constructed a simple web site that takes the
1686
- Plazi identifier for a treatment and displays that treatment with the material
1687
- citations highlighted. For example, for the Plazi treatment <a href=\"https://tb.plazi.org/GgServer/html/03B5A943FFBB6F02FE27EC94FABEEAE7\">03B5A943FFBB6F02FE27EC94FABEEAE7</a>
1688
- you can view the marked up version at <a href=\"https://plazi-tester.herokuapp.com/?uri=622F7788-F0A4-449D-814A-5B49CD20B228\">https://plazi-tester.herokuapp.com/?uri=622F7788-F0A4-449D-814A-5B49CD20B228</a>.
1689
- Below is an example of a material citation with its component parts tagged:</p>\n\n<div
1690
- class=\"separator\" style=\"clear: both;\"><a href=\"https://1.bp.blogspot.com/-NIGuo9BggdA/YXaRQQrv0QI/AAAAAAAAgz4/SZDcA1jZSN47JMRTWDwSMRpHUShrCeOdgCPcBGAYYCw/s693/Screenshot%2B2021-10-24%2Bat%2B20.59.56.png\"
1691
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
1692
- border=\"0\" width=\"400\" data-original-height=\"94\" data-original-width=\"693\"
1693
- src=\"https://1.bp.blogspot.com/-NIGuo9BggdA/YXaRQQrv0QI/AAAAAAAAgz4/SZDcA1jZSN47JMRTWDwSMRpHUShrCeOdgCPcBGAYYCw/s400/Screenshot%2B2021-10-24%2Bat%2B20.59.56.png\"/></a></div>\n\n<p>This
1694
- is an example where Plazi has successfully parsed the specimen. But I keep
1695
- coming across cases where specimens have not been parsed correctly, resulting
1696
- in issues such as single specimens being split into multiple records (e.g., <a
1697
- href=\"https://plazi-tester.herokuapp.com/?uri=5244B05EFFC8E20F7BC32056C178F496\">https://plazi-tester.herokuapp.com/?uri=5244B05EFFC8E20F7BC32056C178F496</a>),
1698
- geographical coordinates being misinterpreted (e.g., <a href=\"https://plazi-tester.herokuapp.com/?uri=0D228E6AFFC2FFEFFF4DE8118C4EE6B9\">https://plazi-tester.herokuapp.com/?uri=0D228E6AFFC2FFEFFF4DE8118C4EE6B9</a>),
1699
- or collector''s initials being confused with codes for natural history collections
1700
- (e.g., <a href=\"https://plazi-tester.herokuapp.com/?uri=252C87918B362C05FF20F8C5BFCB3D4E\">https://plazi-tester.herokuapp.com/?uri=252C87918B362C05FF20F8C5BFCB3D4E</a>).</p>\n\n<p>Parsing
1701
- specimens is a hard problem so it''s not unexpected to find errors. But they
1702
- do seem common enough to be easily found, which raises the question of just
1703
- what percentage of these material citations are correct? How much of the
1704
- data Plazi feeds to GBIF is correct? How would we know?</p>\n\n<h2>Systemic
1705
- problems</h2>\n\n<p>Some of the errors I''ve found concern the interpretation
1706
- of the parsed data. For example, it is striking that despite including marine
1707
- taxa <b>no</b> Plazi record has a value for depth below sea level (see <a
1708
- href=\"https://www.gbif.org/occurrence/map?depth=0,9999&publishing_org=7ce8aef0-9e92-11dc-8738-b8a03c50a862&advanced=1\">GBIF
1709
- search on depth range 0-9999 for Plazi</a>). But <a href=\"https://www.gbif.org/occurrence/map?elevation=0,9999&publishing_org=7ce8aef0-9e92-11dc-8738-b8a03c50a862&advanced=1\">many
1710
- records do have an elevation</a>, including records from marine environments.
1711
- Any record that has a depth value is interpreted by Plazi as being elevation,
1712
- so we have aerial crustacea and fish.</p>\n\n<h3>Map of Plazi records with
1713
- depth 0-9999m</h3>\n<div class=\"separator\" style=\"clear: both;\"><a href=\"https://1.bp.blogspot.com/-GD4pPtPCxVc/YXaRQ9bdn1I/AAAAAAAAgz8/A9YsypSvHfwWKAjDxSdeFVUkou88LGItACPcBGAYYCw/s673/Screenshot%2B2021-10-25%2Bat%2B12.03.51.png\"
1714
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
1715
- border=\"0\" width=\"400\" data-original-height=\"258\" data-original-width=\"673\"
1716
- src=\"https://1.bp.blogspot.com/-GD4pPtPCxVc/YXaRQ9bdn1I/AAAAAAAAgz8/A9YsypSvHfwWKAjDxSdeFVUkou88LGItACPcBGAYYCw/s320/Screenshot%2B2021-10-25%2Bat%2B12.03.51.png\"/></a></div>\n\n<h3>Map
1717
- of Plazi records with elevation 0-9999m </h3>\n<div class=\"separator\" style=\"clear:
1718
- both;\"><a href=\"https://1.bp.blogspot.com/-BReSHtXTOkA/YXaRRKFW7dI/AAAAAAAAg0A/-FcBkMwyswIp0siWGVX3MNMANs7UkZFtwCPcBGAYYCw/s675/Screenshot%2B2021-10-25%2Bat%2B12.04.06.png\"
1719
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
1720
- border=\"0\" width=\"400\" data-original-height=\"256\" data-original-width=\"675\"
1721
- src=\"https://1.bp.blogspot.com/-BReSHtXTOkA/YXaRRKFW7dI/AAAAAAAAg0A/-FcBkMwyswIp0siWGVX3MNMANs7UkZFtwCPcBGAYYCw/s320/Screenshot%2B2021-10-25%2Bat%2B12.04.06.png\"/></a></div>\n\n<p>Anecdotally
1722
- I''ve also noticed that Plazi seems to do well on zoological data, especially
1723
- journals like <i>Zootaxa</i>, but it often struggles with botanical specimens.
1724
- Botanists tend to cite specimens rather differently to zoologists (botanists
1725
- emphasise collector numbers rather than specimen codes). Hence data quality
1726
- in Plazi is likely to be taxonomically biased.</p>\n\n<p>Plazi is <a href=\"https://github.com/plazi/community/issues\">using
1727
- GitHub to track issues with treatments</a> so feedback on erroneous records
1728
- is possible, but this seems inadequate to the task. There are tens of thousands
1729
- of data sets, with more being released daily, and hundreds of thousands of
1730
- occurrences, and relying on GitHub issues devolves the responsibility for
1731
- error checking onto the data users. I don''t have a measure of how many records
1732
- in Plazi have problems, but I suspect it is a significant fraction
1733
- because for any given day''s output I can typically find errors.</p>\n\n<h2>What
1734
- to do?</h2>\n\n<p>Faced with a process that generates noisy data there are
1735
- several things we could do:</p>\n\n<ol>\n<li>Have tools to detect and flag
1736
- errors made in generating the data.</li>\n<li>Have the data generator give
1737
- estimates of the confidence of its results.</li>\n<li>Improve the data generator.</li>\n</ol>\n\n<p>I
1738
- think a comparison with the problem of parsing bibliographic references might
1739
- be instructive here. There is a long history of people developing tools to
1740
- parse references (<a href=\"https://iphylo.blogspot.com/2021/05/finding-citations-of-specimens.html\">I''ve
1741
- even had a go</a>). State-of-the-art tools such as <a href=\"https://anystyle.io\">AnyStyle</a>
1742
- feature machine learning, and are tested against <a href=\"https://github.com/inukshuk/anystyle/blob/master/res/parser/core.xml\">human
1743
- curated datasets</a> of tagged bibliographic records. This means we can evaluate
1744
- the performance of a method (how well does it retrieve the same results as
1745
- human experts?) and also improve the method by expanding the corpus of training
1746
- data. Some of these tools can provide measures of how confident they are
1747
- when classifying a string as, say, a person''s name, which means we could
1748
- flag potential issues for anyone wanting to use that record.</p>\n\n<p>We
1749
- don''t have equivalent tools for parsing specimens in the literature, and
1750
- hence have no easy way to quantify how good existing methods are, nor do we
1751
- have a public corpus of material citations that we can use as training data.
1752
- I <a href=\"https://iphylo.blogspot.com/2021/05/finding-citations-of-specimens.html\">blogged
1753
- about this</a> a few months ago and was considering using Plazi as a source
1754
- of marked up specimen data to use for training. However based on what I''ve
1755
- looked at so far Plazi''s data would need to be carefully scrutinised before
1756
- it could be used as training data.</p>\n\n<p>Going forward, I think it would
1757
- be desirable to have a set of records that can be used to benchmark specimen
1758
- parsers, and ideally have the parsers themselves available as web services
1759
- so that anyone can evaluate them. Even better would be a way to contribute
1760
- to the training data so that these tools improve over time.</p>\n\n<p>Plazi''s
1761
- data extraction tools are mostly desktop-based, that is, you need to download
1762
- software to use their methods. However, there are experimental web services
1763
- available as well. I''ve created a simple wrapper around the material citation
1764
- parser, you can try it at <a href=\"https://plazi-tester.herokuapp.com/parser.php\">https://plazi-tester.herokuapp.com/parser.php</a>.
1765
- It takes a single material citation and returns a version with elements such
1766
- as specimen code and collector name tagged in different colours.</p>\n\n<h2>Summary</h2>\n\n<p>Text
1767
- mining the taxonomic literature is clearly a gold mine of data, but at the
1768
- same time it is potentially fraught as we try and extract structured data
1769
- from semi-structured text. Plazi has demonstrated that it is possible to extract
1770
- a lot of data from the literature, but at the same time the quality of that
1771
- data seems highly variable. Even minor issues in parsing text can have big
1772
- implications for data quality (e.g., marine organisms apparently living above
1773
- sea level). Historically in biodiversity informatics we have favoured data
1774
- quantity over data quality. Quantity has an obvious metric, and has milestones
1775
- we can celebrate (e.g., <a href=\"GBIF at 1 billion - what''s next?\">one
1776
- billion specimens</a>). There aren''t really any equivalent metrics for data
1777
- quality.</p>\n\n<p>Adding new types of data can sometimes initially result
1778
- in a new set of quality issues (e.g., <a href=\"https://iphylo.blogspot.com/2019/12/gbif-metagenomics-and-metacrap.html\">GBIF
1779
- metagenomics and metacrap</a>) that take time to resolve. In the case of Plazi,
1780
- I think it would be worthwhile to quantify just how many records have errors,
1781
- and develop benchmarks that we can use to test methods for extracting specimen
1782
- data from text. If we don''t do this then there will remain uncertainty as
1783
- to how much trust we can place in data mined from the taxonomic literature.</p>\n\n<h2>Update</h2>\n\nPlazi
1784
- has responded, see <a href=\"http://plazi.org/posts/2021/10/liberation-first-step-toward-quality/\">Liberating
1785
- material citations as a first step to more better data</a>. My reading of
1786
- their repsonse is that it essentially just reiterates Plazi''s approach and
1787
- doesn''t tackle the underlying issue: their method for extracting material
1788
- citations is error prone, and many of those errors end up in GBIF.","tags":["data
1789
- quality","parsing","Plazi","specimen","text mining"],"language":"en","references":null}]}'
1790
- recorded_at: Thu, 15 Jun 2023 20:47:10 GMT
1791
- recorded_with: VCR 6.1.0
246
+ pick any number of stars, there''s...","published_at":1661351580,"updated_at":1661351908,"indexed_at":1689006804,"authors":[{"url":"https://orcid.org/0000-0002-7101-9767","name":"Roderic
247
+ Page"}],"image":null,"tags":["Bibliography of Life","citation","synonymy","taxonomic
248
+ databases"],"language":"en","reference":[]},{"id":"0807f515-f31d-4e2c-9e6f-78c3a9668b9d","doi":"https://doi.org/10.59350/ymc6x-rx659","url":"https://iphylo.blogspot.com/2022/09/dna-barcoding-as-intergenerational.html","title":"DNA
249
+ barcoding as intergenerational transfer of taxonomic knowledge","summary":"I
250
+ tweeted about this but want to bookmark it for later as well. The paper “A
251
+ molecular-based identification resource for the arthropods of Finland” doi:10.1111/1755-0998.13510
252
+ contains the following: …the annotated barcode records assembled by FinBOL
253
+ participants represent a tremendous intergenerational transfer of taxonomic
254
+ knowledge the time contributed by current taxonomists in identifying and
255
+ contributing voucher specimens represents a great gift to future generations
256
+ who will benefit...","published_at":1663150320,"updated_at":1664459850,"indexed_at":1689006804,"authors":[{"url":"https://orcid.org/0000-0002-7101-9767","name":"Roderic
257
+ Page"}],"image":null,"tags":[],"language":"en","reference":[]},{"id":"8aea47e4-f227-45f4-b37b-0454a8a7a3ff","doi":"https://doi.org/10.59350/m48f7-c2128","url":"https://iphylo.blogspot.com/2023/04/chatgpt-semantic-search-and-knowledge.html","title":"ChatGPT,
258
+ semantic search, and knowledge graphs","summary":"One thing about ChatGPT
259
+ is it has opened my eyes to some concepts I was dimly aware of but am only
260
+ now beginning to fully appreciate. ChatGPT enables you ask it questions, but
261
+ the answers depend on what ChatGPT “knows”. As several people have noted,
262
+ what would be even better is to be able to run ChatGPT on your own content.
263
+ Indeed, ChatGPT itself now supports this using plugins. Paul Graham GPT However,
264
+ it’s still useful to see how to add ChatGPT functionality to your own content
265
+ from...","published_at":1680535800,"updated_at":1680535924,"indexed_at":1689006804,"authors":[{"url":"https://orcid.org/0000-0002-7101-9767","name":"Roderic
266
+ Page"}],"image":null,"tags":[],"language":"en","reference":[]},{"id":"9d19993f-a228-4883-8933-be5b5459530d","doi":"https://doi.org/10.59350/r3g44-d5s15","url":"https://iphylo.blogspot.com/2023/06/a-taxonomic-search-engine.html","title":"A
267
+ taxonomic search engine","summary":"Tony Rees commented on my recent post
268
+ Ten years and a million links. I’ve responded to some of his comments, but
269
+ I think the bigger question deserves more space, hence this blog post. Tony’s
270
+ comment Hi Rod, I like what you’re doing. Still struggling (a little) to find
271
+ the exact point where it answers the questions that are my “entry points”
272
+ so to speak, which (paraphrasing a post of yours from some years back) start
273
+ with: Is this a name that “we” (the human race I suppose) recognise as...","published_at":1687016280,"updated_at":1687016300,"indexed_at":1688982864,"authors":[{"url":"https://orcid.org/0000-0002-7101-9767","name":"Roderic
274
+ Page"}],"image":null,"tags":[],"language":"en","reference":[]}]}'
275
+ recorded_at: Mon, 10 Jul 2023 20:37:16 GMT
276
+ recorded_with: VCR 6.2.0