commonmeta-ruby 3.3.3 → 3.3.5
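For consumers of the gem, picking up this release is a one-line dependency change. Assuming the project manages the gem through Bundler (the repository's own `Gemfile.lock` appears in the diff below), the entry might look like this; the version constraint shown is illustrative:

```ruby
# Gemfile — pin commonmeta-ruby to the 3.3.x series, at or above this release
gem "commonmeta-ruby", "~> 3.3", ">= 3.3.5"
```

After editing the Gemfile, `bundle update commonmeta-ruby` would resolve and lock the new version.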

Files changed (632)
  1. checksums.yaml +4 -4
  2. data/Gemfile.lock +2 -2
  3. data/bin/commonmeta +2 -2
  4. data/commonmeta.gemspec +1 -1
  5. data/lib/commonmeta/cli.rb +7 -3
  6. data/lib/commonmeta/readers/json_feed_reader.rb +1 -1
  7. data/lib/commonmeta/utils.rb +34 -0
  8. data/lib/commonmeta/version.rb +1 -1
  9. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/change_metadata_as_datacite_xml/with_data_citation.yml +14 -14
  10. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/doi_registration_agency/crossref.yml +5 -5
  11. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/doi_registration_agency/datacite.yml +5 -5
  12. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/doi_registration_agency/jalc.yml +5 -5
  13. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/doi_registration_agency/kisti.yml +5 -5
  14. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/doi_registration_agency/medra.yml +5 -5
  15. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/doi_registration_agency/not_found.yml +5 -5
  16. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/doi_registration_agency/op.yml +5 -5
  17. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/find_from_format_by_ID/crossref.yml +5 -5
  18. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/find_from_format_by_ID/crossref_doi_not_url.yml +5 -5
  19. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/find_from_format_by_ID/datacite.yml +5 -5
  20. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/find_from_format_by_ID/datacite_doi_http.yml +5 -5
  21. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/find_from_format_by_ID/unknown_DOI_registration_agency.yml +5 -5
  22. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_cff_metadata/cff-converter-python.yml +9 -7
  23. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_cff_metadata/ruby-cff.yml +16 -14
  24. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_cff_metadata/ruby-cff_repository_url.yml +14 -12
  25. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_codemeta_metadata/maremma.yml +10 -8
  26. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_codemeta_metadata/metadata_reports.yml +9 -7
  27. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/DOI_with_ORCID_ID.yml +74 -74
  28. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/DOI_with_SICI_DOI.yml +73 -73
  29. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/DOI_with_data_citation.yml +70 -70
  30. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/JaLC.yml +159 -159
  31. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/KISTI.yml +128 -128
  32. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/OP.yml +72 -72
  33. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/affiliation_is_space.yml +73 -73
  34. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/another_book.yml +109 -109
  35. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/another_book_chapter.yml +71 -71
  36. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/article_id_as_page_number.yml +74 -74
  37. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/author_literal.yml +82 -82
  38. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/book.yml +71 -71
  39. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/book_chapter.yml +72 -72
  40. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/book_chapter_with_RDF_for_container.yml +70 -70
  41. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/book_oup.yml +69 -69
  42. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/component.yml +91 -91
  43. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/dataset.yml +101 -102
  44. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/dataset_usda.yml +133 -133
  45. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/date_in_future.yml +78 -78
  46. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/dissertation.yml +100 -100
  47. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/empty_given_name.yml +72 -72
  48. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/invalid_date.yml +73 -73
  49. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/journal_article.yml +72 -72
  50. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/journal_article_original_language_title.yml +70 -70
  51. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/journal_article_with.yml +76 -514
  52. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/journal_article_with_RDF_for_container.yml +70 -70
  53. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/journal_article_with_funding.yml +73 -73
  54. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/journal_issue.yml +69 -69
  55. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/mEDRA.yml +69 -69
  56. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/markup.yml +78 -78
  57. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/missing_creator.yml +73 -73
  58. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/multiple_issn.yml +72 -72
  59. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/multiple_titles.yml +71 -70
  60. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/multiple_titles_with_missing.yml +716 -716
  61. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/not_found_error.yml +63 -63
  62. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/peer_review.yml +74 -74
  63. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/posted_content.yml +71 -71
  64. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/posted_content_copernicus.yml +73 -73
  65. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/report_osti.yml +117 -117
  66. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/vor_with_url.yml +75 -75
  67. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/yet_another_book.yml +69 -69
  68. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/yet_another_book_chapter.yml +70 -70
  69. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_raw/journal_article.yml +10 -10
  70. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_datacite_metadata/dissertation.yml +9 -9
  71. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_datacite_metadata/funding_references.yml +11 -11
  72. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_datacite_metadata/subject_scheme.yml +20 -20
  73. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_doi_prefix_for_blog/by_blog_id.yml +6 -419
  74. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_doi_prefix_for_blog/by_blog_post_uuid.yml +7 -260
  75. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_doi_prefix_for_blog/by_blog_post_uuid_specific_prefix.yml +3 -136
  76. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed/by_blog_id.yml +225 -1432
  77. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed/not_indexed_posts.yml +1380 -2112
  78. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed/unregistered_posts.yml +6 -172
  79. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/blog_post_with_non-url_id.yml +7 -7
  80. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/blogger_post.yml +12 -12
  81. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/ghost_post_with_author_name_suffix.yml +8 -8
  82. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/ghost_post_with_doi.yml +7 -7
  83. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/ghost_post_with_organizational_author.yml +3 -3
  84. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/ghost_post_without_doi.yml +8 -8
  85. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/jekyll_post.yml +8 -8
  86. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/substack_post_with_broken_reference.yml +90 -176
  87. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/syldavia_gazette_post_with_references.yml +25 -25
  88. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/upstream_post_with_references.yml +61 -61
  89. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/wordpress_post.yml +8 -8
  90. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/wordpress_post_with_references.yml +20 -20
  91. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_one_author/has_familyName.yml +9 -9
  92. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_one_author/has_name_in_display-order_with_ORCID.yml +9 -9
  93. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_one_author/name_with_affiliation_crossref.yml +14 -14
  94. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_one_author/only_familyName_and_givenName.yml +43 -36
  95. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_schema_org_metadata/BlogPosting.yml +158 -158
  96. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_schema_org_metadata/BlogPosting_with_new_DOI.yml +162 -162
  97. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_schema_org_metadata/get_schema_org_metadata_front_matter/BlogPosting.yml +178 -180
  98. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_schema_org_metadata/harvard_dataverse.yml +226 -230
  99. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_schema_org_metadata/pangaea.yml +43 -36
  100. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_schema_org_metadata/upstream_blog.yml +94 -94
  101. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_schema_org_metadata/zenodo.yml +14 -14
  102. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/handle_input/DOI_RA_not_Crossref_or_DataCite.yml +5 -5
  103. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/handle_input/unknown_DOI_prefix.yml +5 -5
  104. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/json_schema_errors/is_valid.yml +13 -13
  105. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_bibtex/BlogPosting.yml +4 -4
  106. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_bibtex/Dataset.yml +6 -6
  107. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_bibtex/authors_with_affiliations.yml +14 -14
  108. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_bibtex/climate_data.yml +6 -6
  109. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_bibtex/from_schema_org.yml +159 -159
  110. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_bibtex/keywords_subject_scheme.yml +6 -6
  111. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_bibtex/maremma.yml +12 -10
  112. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_bibtex/text.yml +4 -4
  113. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_bibtex/with_data_citation.yml +14 -14
  114. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_bibtex/with_pages.yml +12 -12
  115. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_cff/Collection_of_Jupyter_notebooks.yml +9 -9
  116. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_cff/SoftwareSourceCode_Zenodo.yml +17 -17
  117. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_cff/SoftwareSourceCode_also_Zenodo.yml +12 -12
  118. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_cff/ruby-cff.yml +16 -14
  119. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_citation/Dataset.yml +6 -6
  120. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_citation/Journal_article.yml +14 -14
  121. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_citation/Journal_article_vancouver_style.yml +19 -19
  122. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_citation/Missing_author.yml +12 -12
  123. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_citation/interactive_resource_without_dates.yml +4 -4
  124. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_citation/software_w/version.yml +6 -6
  125. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_codemeta/SoftwareSourceCode_DataCite.yml +4 -4
  126. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_codemeta/SoftwareSourceCode_DataCite_check_codemeta_v2.yml +4 -4
  127. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_crossref/another_schema_org_from_front-matter.yml +27 -27
  128. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_crossref/journal_article.yml +4 -4
  129. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_crossref/journal_article_from_datacite.yml +4 -4
  130. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_crossref/json_feed_item_from_rogue_scholar_with_doi.yml +8 -54
  131. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_crossref/json_feed_item_from_rogue_scholar_with_organizational_author.yml +3 -3
  132. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_crossref/json_feed_item_from_upstream_blog.yml +10 -53
  133. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_crossref/json_feed_item_with_references.yml +62 -62
  134. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_crossref/posted_content.yml +15 -15
  135. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_crossref/schema_org_from_another_science_blog.yml +6 -6
  136. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_crossref/schema_org_from_front_matter.yml +29 -29
  137. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_crossref/schema_org_from_upstream_blog.yml +4 -4
  138. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/Another_dataset.yml +28 -28
  139. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/BlogPosting.yml +4 -4
  140. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/BlogPosting_schema_org.yml +158 -158
  141. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/Dataset.yml +6 -6
  142. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/container_title.yml +11 -11
  143. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/interactive_resource_without_dates.yml +4 -4
  144. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/journal_article.yml +14 -14
  145. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/keywords_subject_scheme.yml +6 -6
  146. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/maremma.yml +9 -7
  147. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/missing_creator.yml +12 -12
  148. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/multiple_abstracts.yml +6 -6
  149. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/organization_author.yml +19 -19
  150. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/software.yml +4 -4
  151. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/software_w/version.yml +6 -6
  152. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/with_only_first_page.yml +13 -13
  153. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/with_pages.yml +12 -12
  154. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csv/climate_data.yml +6 -6
  155. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csv/maremma.yml +10 -8
  156. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csv/text.yml +4 -4
  157. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csv/with_data_citation.yml +14 -14
  158. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csv/with_pages.yml +12 -12
  159. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_datacite/dissertation.yml +17 -17
  160. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_datacite/from_schema_org.yml +158 -158
  161. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_datacite/journal_article.yml +18 -18
  162. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_datacite/maremma.yml +10 -8
  163. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_datacite/with_ORCID_ID.yml +12 -12
  164. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_datacite/with_data_citation.yml +14 -14
  165. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_jats_xml/Dataset_in_schema_4_0.yml +6 -6
  166. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_jats_xml/Text_pass-thru.yml +4 -4
  167. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_jats_xml/book_chapter.yml +15 -13
  168. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_jats_xml/from_schema_org.yml +158 -158
  169. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_jats_xml/interactive_resource_without_dates.yml +4 -4
  170. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_jats_xml/maremma.yml +12 -10
  171. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_jats_xml/with_ORCID_ID.yml +12 -12
  172. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_jats_xml/with_data_citation.yml +14 -14
  173. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_jats_xml/with_editor.yml +13 -13
  174. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_ris/BlogPosting.yml +4 -4
  175. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_ris/BlogPosting_schema_org.yml +159 -159
  176. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_ris/Dataset.yml +6 -6
  177. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_ris/alternate_name.yml +4 -4
  178. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_ris/journal_article.yml +8 -8
  179. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_ris/keywords_with_subject_scheme.yml +6 -6
  180. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_ris/maremma.yml +9 -7
  181. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_ris/with_pages.yml +7 -7
  182. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/Another_Schema_org_JSON.yml +6 -6
  183. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/Funding.yml +9 -9
  184. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/Funding_OpenAIRE.yml +9 -9
  185. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/Schema_org_JSON.yml +17 -17
  186. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/Schema_org_JSON_Cyark.yml +33 -33
  187. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/alternate_identifiers.yml +9 -9
  188. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/data_catalog.yml +9 -9
  189. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/geo_location_box.yml +12 -12
  190. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/interactive_resource_without_dates.yml +9 -9
  191. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/journal_article.yml +14 -14
  192. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/maremma_schema_org_JSON.yml +10 -8
  193. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/series_information.yml +9 -9
  194. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/subject_scheme.yml +11 -11
  195. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/subject_scheme_multiple_keywords.yml +11 -11
  196. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_turtle/BlogPosting.yml +4 -4
  197. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_turtle/Dataset.yml +6 -6
  198. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_turtle/journal_article.yml +14 -14
  199. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_turtle/with_pages.yml +12 -12
  200. data/spec/readers/cff_reader_spec.rb +6 -6
  201. data/spec/readers/crossref_reader_spec.rb +3 -3
  202. data/spec/readers/crossref_xml_reader_spec.rb +7 -7
  203. data/spec/readers/json_feed_reader_spec.rb +13 -13
  204. data/spec/readers/schema_org_reader_spec.rb +2 -3
  205. data/spec/spec_helper.rb +1 -0
  206. data/spec/utils_spec.rb +1 -1
  207. data/spec/writers/cff_writer_spec.rb +3 -3
  208. data/spec/writers/ris_writer_spec.rb +2 -2
  209. data/spec/writers/schema_org_writer_spec.rb +1 -1
  210. metadata +5 -427
  211. data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_file/crossref/default.yml +0 -110
  212. data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_file/crossref/to_bibtex.yml +0 -110
  213. data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_file/crossref/to_crossref_xml.yml +0 -110
  214. data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_file/crossref/to_datacite.yml +0 -110
  215. data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_file/crossref/to_schema_org.yml +0 -110
  216. data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_file/crossref_xml/default.yml +0 -55
  217. data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_file/crossref_xml/to_bibtex.yml +0 -55
  218. data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_file/crossref_xml/to_crossref_xml.yml +0 -55
  219. data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_file/crossref_xml/to_datacite.yml +0 -55
  220. data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_file/crossref_xml/to_schema_org.yml +0 -55
  221. data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_from_id/crossref/default.yml +0 -299
  222. data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_from_id/crossref/to_bibtex.yml +0 -299
  223. data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_from_id/crossref/to_citation.yml +0 -299
  224. data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_from_id/crossref/to_crossref_xml.yml +0 -299
  225. data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_from_id/crossref/to_datacite.yml +0 -299
  226. data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_from_id/crossref/to_jats.yml +0 -299
  227. data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_from_id/crossref/to_schema_org.yml +0 -299
  228. data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_from_id/datacite/default.yml +0 -172
  229. data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_from_id/datacite/to_bibtex.yml +0 -172
  230. data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_from_id/datacite/to_citation.yml +0 -172
  231. data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_from_id/datacite/to_datacite.yml +0 -172
  232. data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_from_id/datacite/to_jats.yml +0 -172
  233. data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_from_id/datacite/to_schema_org.yml +0 -172
  234. data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_from_id/schema_org/default.yml +0 -1098
  235. data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_from_id/schema_org/to_datacite.yml +0 -1098
  236. data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_from_id/schema_org/to_schema_org.yml +0 -1100
  237. data/spec/fixtures/vcr_cassettes/Briard_CLI/find_from_format_by_id/crossref.yml +0 -55
  238. data/spec/fixtures/vcr_cassettes/Briard_CLI/find_from_format_by_id/datacite.yml +0 -55
  239. data/spec/fixtures/vcr_cassettes/Briard_CLI/find_from_format_by_id/jalc.yml +0 -55
  240. data/spec/fixtures/vcr_cassettes/Briard_CLI/find_from_format_by_id/kisti.yml +0 -55
  241. data/spec/fixtures/vcr_cassettes/Briard_CLI/find_from_format_by_id/medra.yml +0 -55
  242. data/spec/fixtures/vcr_cassettes/Briard_CLI/find_from_format_by_id/op.yml +0 -55
  243. data/spec/fixtures/vcr_cassettes/Briard_Metadata/authors_as_string/author.yml +0 -164
  244. data/spec/fixtures/vcr_cassettes/Briard_Metadata/authors_as_string/no_author.yml +0 -164
  245. data/spec/fixtures/vcr_cassettes/Briard_Metadata/authors_as_string/single_author.yml +0 -164
  246. data/spec/fixtures/vcr_cassettes/Briard_Metadata/authors_as_string/with_organization.yml +0 -164
  247. data/spec/fixtures/vcr_cassettes/Briard_Metadata/change_metadata_as_datacite_xml/with_data_citation.yml +0 -247
  248. data/spec/fixtures/vcr_cassettes/Briard_Metadata/doi_registration_agency/crossref.yml +0 -55
  249. data/spec/fixtures/vcr_cassettes/Briard_Metadata/doi_registration_agency/datacite.yml +0 -55
  250. data/spec/fixtures/vcr_cassettes/Briard_Metadata/doi_registration_agency/jalc.yml +0 -55
  251. data/spec/fixtures/vcr_cassettes/Briard_Metadata/doi_registration_agency/kisti.yml +0 -55
  252. data/spec/fixtures/vcr_cassettes/Briard_Metadata/doi_registration_agency/medra.yml +0 -55
  253. data/spec/fixtures/vcr_cassettes/Briard_Metadata/doi_registration_agency/not_found.yml +0 -55
  254. data/spec/fixtures/vcr_cassettes/Briard_Metadata/doi_registration_agency/op.yml +0 -55
  255. data/spec/fixtures/vcr_cassettes/Briard_Metadata/find_from_format_by_ID/crossref.yml +0 -55
  256. data/spec/fixtures/vcr_cassettes/Briard_Metadata/find_from_format_by_ID/crossref_doi_not_url.yml +0 -55
  257. data/spec/fixtures/vcr_cassettes/Briard_Metadata/find_from_format_by_ID/datacite.yml +0 -55
  258. data/spec/fixtures/vcr_cassettes/Briard_Metadata/find_from_format_by_ID/datacite_doi_http.yml +0 -55
  259. data/spec/fixtures/vcr_cassettes/Briard_Metadata/find_from_format_by_ID/unknown_DOI_registration_agency.yml +0 -55
  260. data/spec/fixtures/vcr_cassettes/Briard_Metadata/fos/hsh_to_fos_for_match.yml +0 -221
  261. data/spec/fixtures/vcr_cassettes/Briard_Metadata/fos/hsh_to_fos_for_with_schemeUri_in_hash.yml +0 -221
  262. data/spec/fixtures/vcr_cassettes/Briard_Metadata/fos/hsh_to_fos_match.yml +0 -221
  263. data/spec/fixtures/vcr_cassettes/Briard_Metadata/fos/hsh_to_fos_no_match.yml +0 -221
  264. data/spec/fixtures/vcr_cassettes/Briard_Metadata/fos/name_to_fos_for_match.yml +0 -221
  265. data/spec/fixtures/vcr_cassettes/Briard_Metadata/fos/name_to_fos_match.yml +0 -221
  266. data/spec/fixtures/vcr_cassettes/Briard_Metadata/fos/name_to_fos_no_match.yml +0 -221
  267. data/spec/fixtures/vcr_cassettes/Briard_Metadata/from_schema_org/with_id.yml +0 -221
  268. data/spec/fixtures/vcr_cassettes/Briard_Metadata/from_schema_org_creators/with_affiliation.yml +0 -221
  269. data/spec/fixtures/vcr_cassettes/Briard_Metadata/from_schema_org_creators/without_affiliation.yml +0 -221
  270. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_cff_metadata/cff-converter-python.yml +0 -200
  271. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_cff_metadata/ruby-cff.yml +0 -154
  272. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_cff_metadata/ruby-cff_repository_url.yml +0 -154
  273. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_codemeta_metadata/maremma.yml +0 -86
  274. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_codemeta_metadata/metadata_reports.yml +0 -93
  275. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/DOI_with_ORCID_ID.yml +0 -337
  276. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/DOI_with_SICI_DOI.yml +0 -347
  277. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/DOI_with_data_citation.yml +0 -359
  278. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/JaLC.yml +0 -384
  279. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/KISTI.yml +0 -330
  280. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/OP.yml +0 -969
  281. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/affiliation_is_space.yml +0 -358
  282. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/another_book.yml +0 -312
  283. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/another_book_chapter.yml +0 -465
  284. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/article_id_as_page_number.yml +0 -276
  285. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/author_literal.yml +0 -492
  286. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/book.yml +0 -523
  287. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/book_chapter.yml +0 -377
  288. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/book_chapter_with_RDF_for_container.yml +0 -336
  289. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/book_oup.yml +0 -289
  290. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/component.yml +0 -289
  291. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/dataset.yml +0 -299
  292. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/dataset_usda.yml +0 -341
  293. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/date_in_future.yml +0 -570
  294. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/dissertation.yml +0 -301
  295. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/empty_given_name.yml +0 -303
  296. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/invalid_date.yml +0 -307
  297. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/journal_article.yml +0 -461
  298. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/journal_article_original_language_title.yml +0 -276
  299. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/journal_article_with.yml +0 -470
  300. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/journal_article_with_RDF_for_container.yml +0 -519
  301. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/journal_article_with_funding.yml +0 -456
  302. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/journal_issue.yml +0 -270
  303. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/mEDRA.yml +0 -310
  304. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/markup.yml +0 -329
  305. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/missing_creator.yml +0 -307
  306. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/multiple_issn.yml +0 -393
  307. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/multiple_titles.yml +0 -265
  308. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/multiple_titles_with_missing.yml +0 -860
  309. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/not_found_error.yml +0 -209
  310. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/peer_review.yml +0 -287
  311. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/posted_content.yml +0 -326
  312. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/posted_content_copernicus.yml +0 -297
  313. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/report_osti.yml +0 -315
  314. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/vor_with_url.yml +0 -451
  315. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/yet_another_book.yml +0 -816
  316. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/yet_another_book_chapter.yml +0 -324
  317. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_raw/journal_article.yml +0 -110
  318. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_datacite_metadata/dissertation.yml +0 -152
  319. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_datacite_metadata/funding_references.yml +0 -175
  320. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_datacite_metadata/subject_scheme.yml +0 -328
  321. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_date/publication_date.yml +0 -221
  322. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_date_from_date_parts/date.yml +0 -221
  323. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_date_from_date_parts/year-month.yml +0 -221
  324. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_date_from_date_parts/year.yml +0 -221
  325. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_date_from_parts/date.yml +0 -221
  326. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_date_from_parts/year-month.yml +0 -221
  327. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_date_from_parts/year.yml +0 -221
  328. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_date_parts/date.yml +0 -221
  329. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_date_parts/year-month.yml +0 -221
  330. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_date_parts/year.yml +0 -221
  331. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_datetime_from_time/future.yml +0 -221
  332. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_datetime_from_time/invalid.yml +0 -221
  333. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_datetime_from_time/nil.yml +0 -221
  334. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_datetime_from_time/past.yml +0 -221
  335. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_datetime_from_time/present.yml +0 -221
  336. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_one_author/has_familyName.yml +0 -133
  337. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_one_author/has_name_in_display-order_with_ORCID.yml +0 -153
  338. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_one_author/is_organization.yml +0 -164
  339. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_one_author/name_with_affiliation_crossref.yml +0 -247
  340. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_one_author/only_familyName_and_givenName.yml +0 -468
  341. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_schema_org_metadata/BlogPosting.yml +0 -530
  342. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_schema_org_metadata/BlogPosting_with_new_DOI.yml +0 -530
  343. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_schema_org_metadata/get_schema_org_metadata_front_matter/BlogPosting.yml +0 -534
  344. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_schema_org_metadata/harvard_dataverse.yml +0 -1838
  345. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_schema_org_metadata/pangaea.yml +0 -468
  346. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_schema_org_metadata/upstream_blog.yml +0 -885
  347. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_schema_org_metadata/zenodo.yml +0 -583
  348. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_series_information/only_title.yml +0 -221
  349. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_series_information/title_and_pages.yml +0 -221
  350. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_series_information/title_volume_and_pages.yml +0 -221
  351. data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_series_information/title_volume_issue_and_pages.yml +0 -221
  352. data/spec/fixtures/vcr_cassettes/Briard_Metadata/github/github_as_cff_url.yml +0 -221
  353. data/spec/fixtures/vcr_cassettes/Briard_Metadata/github/github_as_codemeta_url.yml +0 -221
  354. data/spec/fixtures/vcr_cassettes/Briard_Metadata/github/github_from_url.yml +0 -221
  355. data/spec/fixtures/vcr_cassettes/Briard_Metadata/github/github_from_url_cff_file.yml +0 -221
  356. data/spec/fixtures/vcr_cassettes/Briard_Metadata/github/github_from_url_file.yml +0 -221
  357. data/spec/fixtures/vcr_cassettes/Briard_Metadata/handle_input/DOI_RA_not_Crossref_or_DataCite.yml +0 -55
  358. data/spec/fixtures/vcr_cassettes/Briard_Metadata/handle_input/unknown_DOI_prefix.yml +0 -55
  359. data/spec/fixtures/vcr_cassettes/Briard_Metadata/is_personal_name_/has_comma.yml +0 -164
  360. data/spec/fixtures/vcr_cassettes/Briard_Metadata/is_personal_name_/has_family_name.yml +0 -164
  361. data/spec/fixtures/vcr_cassettes/Briard_Metadata/is_personal_name_/has_id.yml +0 -164
  362. data/spec/fixtures/vcr_cassettes/Briard_Metadata/is_personal_name_/has_known_given_name.yml +0 -164
  363. data/spec/fixtures/vcr_cassettes/Briard_Metadata/is_personal_name_/has_no_info.yml +0 -164
  364. data/spec/fixtures/vcr_cassettes/Briard_Metadata/is_personal_name_/has_orcid_id.yml +0 -164
  365. data/spec/fixtures/vcr_cassettes/Briard_Metadata/is_personal_name_/has_type_organization.yml +0 -164
  366. data/spec/fixtures/vcr_cassettes/Briard_Metadata/json_schema_errors/is_valid.yml +0 -221
  367. data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_cc_url/not_found.yml +0 -221
  368. data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_cc_url/with_trailing_slash.yml +0 -221
  369. data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_cc_url/with_trailing_slash_and_to_https.yml +0 -221
  370. data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_id/doi.yml +0 -221
  371. data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_id/doi_as_url.yml +0 -221
  372. data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_id/filename.yml +0 -221
  373. data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_id/ftp.yml +0 -221
  374. data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_id/invalid_url.yml +0 -221
  375. data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_id/sandbox_via_options.yml +0 -221
  376. data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_id/sandbox_via_url.yml +0 -221
  377. data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_id/string.yml +0 -221
  378. data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_id/url.yml +0 -221
  379. data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_id/url_with_utf-8.yml +0 -221
  380. data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_ids/doi.yml +0 -221
  381. data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_ids/url.yml +0 -221
  382. data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_issn/from_array.yml +0 -221
  383. data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_issn/from_empty_array.yml +0 -221
  384. data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_issn/from_hash.yml +0 -221
  385. data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_issn/from_string.yml +0 -221
  386. data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_url/uri.yml +0 -221
  387. data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_url/with_trailing_slash.yml +0 -221
  388. data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_url/with_trailing_slash_and_to_https.yml +0 -221
  389. data/spec/fixtures/vcr_cassettes/Briard_Metadata/parse_attributes/array.yml +0 -221
  390. data/spec/fixtures/vcr_cassettes/Briard_Metadata/parse_attributes/array_of_strings.yml +0 -221
  391. data/spec/fixtures/vcr_cassettes/Briard_Metadata/parse_attributes/first.yml +0 -221
  392. data/spec/fixtures/vcr_cassettes/Briard_Metadata/parse_attributes/hash.yml +0 -221
  393. data/spec/fixtures/vcr_cassettes/Briard_Metadata/parse_attributes/hash_with_array_value.yml +0 -221
  394. data/spec/fixtures/vcr_cassettes/Briard_Metadata/parse_attributes/nil.yml +0 -221
  395. data/spec/fixtures/vcr_cassettes/Briard_Metadata/parse_attributes/string.yml +0 -221
  396. data/spec/fixtures/vcr_cassettes/Briard_Metadata/random_doi/decode_anothe_doi.yml +0 -221
  397. data/spec/fixtures/vcr_cassettes/Briard_Metadata/random_doi/decode_doi.yml +0 -221
  398. data/spec/fixtures/vcr_cassettes/Briard_Metadata/random_doi/encode_doi.yml +0 -221
  399. data/spec/fixtures/vcr_cassettes/Briard_Metadata/sanitize/onlies_keep_specific_tags.yml +0 -221
  400. data/spec/fixtures/vcr_cassettes/Briard_Metadata/sanitize/removes_a_tags.yml +0 -221
  401. data/spec/fixtures/vcr_cassettes/Briard_Metadata/spdx/hsh_to_spdx_id.yml +0 -221
  402. data/spec/fixtures/vcr_cassettes/Briard_Metadata/spdx/hsh_to_spdx_not_found.yml +0 -221
  403. data/spec/fixtures/vcr_cassettes/Briard_Metadata/spdx/hsh_to_spdx_url.yml +0 -221
  404. data/spec/fixtures/vcr_cassettes/Briard_Metadata/spdx/name_to_spdx_exists.yml +0 -221
  405. data/spec/fixtures/vcr_cassettes/Briard_Metadata/spdx/name_to_spdx_id.yml +0 -221
  406. data/spec/fixtures/vcr_cassettes/Briard_Metadata/to_schema_org/with_id.yml +0 -221
  407. data/spec/fixtures/vcr_cassettes/Briard_Metadata/to_schema_org_identifiers/with_identifiers.yml +0 -221
  408. data/spec/fixtures/vcr_cassettes/Briard_Metadata/validate_orcid/validate_orcid.yml +0 -221
  409. data/spec/fixtures/vcr_cassettes/Briard_Metadata/validate_orcid/validate_orcid_https.yml +0 -221
  410. data/spec/fixtures/vcr_cassettes/Briard_Metadata/validate_orcid/validate_orcid_id.yml +0 -221
  411. data/spec/fixtures/vcr_cassettes/Briard_Metadata/validate_orcid/validate_orcid_sandbox.yml +0 -221
  412. data/spec/fixtures/vcr_cassettes/Briard_Metadata/validate_orcid/validate_orcid_sandbox_https.yml +0 -221
  413. data/spec/fixtures/vcr_cassettes/Briard_Metadata/validate_orcid/validate_orcid_with_spaces.yml +0 -221
  414. data/spec/fixtures/vcr_cassettes/Briard_Metadata/validate_orcid/validate_orcid_wrong_id.yml +0 -221
  415. data/spec/fixtures/vcr_cassettes/Briard_Metadata/validate_orcid/validate_orcid_www.yml +0 -221
  416. data/spec/fixtures/vcr_cassettes/Briard_Metadata/validate_orcid_scheme/validate_orcid_scheme.yml +0 -221
  417. data/spec/fixtures/vcr_cassettes/Briard_Metadata/validate_orcid_scheme/validate_orcid_scheme_https.yml +0 -221
  418. data/spec/fixtures/vcr_cassettes/Briard_Metadata/validate_orcid_scheme/validate_orcid_scheme_trailing_slash.yml +0 -221
  419. data/spec/fixtures/vcr_cassettes/Briard_Metadata/validate_orcid_scheme/validate_orcid_scheme_www.yml +0 -221
  420. data/spec/fixtures/vcr_cassettes/Briard_Metadata/validate_url/DOI.yml +0 -221
  421. data/spec/fixtures/vcr_cassettes/Briard_Metadata/validate_url/ISSN.yml +0 -221
  422. data/spec/fixtures/vcr_cassettes/Briard_Metadata/validate_url/URL.yml +0 -221
  423. data/spec/fixtures/vcr_cassettes/Briard_Metadata/validate_url/string.yml +0 -221
  424. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_bibtex/BlogPosting.yml +0 -81
  425. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_bibtex/Dataset.yml +0 -120
  426. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_bibtex/authors_with_affiliations.yml +0 -186
  427. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_bibtex/climate_data.yml +0 -74
  428. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_bibtex/from_schema_org.yml +0 -530
  429. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_bibtex/keywords_subject_scheme.yml +0 -149
  430. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_bibtex/maremma.yml +0 -86
  431. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_bibtex/text.yml +0 -100
  432. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_bibtex/with_data_citation.yml +0 -247
  433. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_bibtex/with_pages.yml +0 -228
  434. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_cff/Collection_of_Jupyter_notebooks.yml +0 -143
  435. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_cff/SoftwareSourceCode_Zenodo.yml +0 -150
  436. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_cff/SoftwareSourceCode_also_Zenodo.yml +0 -93
  437. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_cff/ruby-cff.yml +0 -154
  438. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_citation/Dataset.yml +0 -120
  439. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_citation/Journal_article.yml +0 -247
  440. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_citation/Journal_article_vancouver_style.yml +0 -299
  441. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_citation/Missing_author.yml +0 -199
  442. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_citation/interactive_resource_without_dates.yml +0 -75
  443. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_citation/software_w/version.yml +0 -86
  444. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_codemeta/SoftwareSourceCode_DataCite.yml +0 -76
  445. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_codemeta/SoftwareSourceCode_DataCite_check_codemeta_v2.yml +0 -76
  446. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_crossref/another_schema_org_from_front-matter.yml +0 -541
  447. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_crossref/journal_article.yml +0 -55
  448. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_crossref/journal_article_from_datacite.yml +0 -85
  449. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_crossref/posted_content.yml +0 -283
  450. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_crossref/schema_org_from_another_science_blog.yml +0 -123
  451. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_crossref/schema_org_from_front_matter.yml +0 -477
  452. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_crossref/schema_org_from_upstream_blog.yml +0 -1025
  453. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csl/Another_dataset.yml +0 -110
  454. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csl/BlogPosting.yml +0 -81
  455. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csl/BlogPosting_schema_org.yml +0 -530
  456. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csl/Dataset.yml +0 -120
  457. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csl/container_title.yml +0 -153
  458. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csl/interactive_resource_without_dates.yml +0 -75
  459. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csl/journal_article.yml +0 -247
  460. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csl/keywords_subject_scheme.yml +0 -149
  461. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csl/maremma.yml +0 -86
  462. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csl/missing_creator.yml +0 -199
  463. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csl/multiple_abstracts.yml +0 -101
  464. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csl/organization_author.yml +0 -314
  465. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csl/software.yml +0 -90
  466. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csl/software_w/version.yml +0 -86
  467. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csl/with_only_first_page.yml +0 -333
  468. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csl/with_pages.yml +0 -228
  469. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csv/climate_data.yml +0 -74
  470. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csv/maremma.yml +0 -86
  471. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csv/text.yml +0 -100
  472. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csv/with_data_citation.yml +0 -247
  473. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csv/with_pages.yml +0 -228
  474. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_datacite/from_schema_org.yml +0 -530
  475. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_datacite/maremma.yml +0 -86
  476. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_datacite/with_ORCID_ID.yml +0 -228
  477. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_datacite/with_data_citation.yml +0 -247
  478. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_datacite_json/from_schema_org.yml +0 -530
  479. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_datacite_json/maremma.yml +0 -86
  480. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_datacite_json/with_ORCID_ID.yml +0 -228
  481. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_datacite_json/with_data_citation.yml +0 -247
  482. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_jats_xml/Dataset_in_schema_4_0.yml +0 -120
  483. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_jats_xml/Text_pass-thru.yml +0 -106
  484. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_jats_xml/book_chapter.yml +0 -163
  485. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_jats_xml/from_schema_org.yml +0 -530
  486. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_jats_xml/interactive_resource_without_dates.yml +0 -75
  487. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_jats_xml/maremma.yml +0 -86
  488. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_jats_xml/with_ORCID_ID.yml +0 -228
  489. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_jats_xml/with_data_citation.yml +0 -247
  490. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_jats_xml/with_editor.yml +0 -355
  491. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_rdf_xml/BlogPosting.yml +0 -81
  492. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_rdf_xml/BlogPosting_schema_org.yml +0 -530
  493. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_rdf_xml/journal_article.yml +0 -247
  494. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_rdf_xml/maremma.yml +0 -86
  495. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_rdf_xml/with_pages.yml +0 -228
  496. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_ris/BlogPosting.yml +0 -81
  497. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_ris/BlogPosting_schema_org.yml +0 -530
  498. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_ris/Dataset.yml +0 -120
  499. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_ris/alternate_name.yml +0 -138
  500. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_ris/journal_article.yml +0 -115
  501. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_ris/keywords_with_subject_scheme.yml +0 -149
  502. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_ris/maremma.yml +0 -86
  503. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_ris/with_pages.yml +0 -112
  504. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_schema_org/Another_Schema_org_JSON.yml +0 -120
  505. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_schema_org/Funding.yml +0 -192
  506. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_schema_org/Funding_OpenAIRE.yml +0 -150
  507. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_schema_org/Schema_org_JSON.yml +0 -98
  508. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_schema_org/Schema_org_JSON_Cyark.yml +0 -160
  509. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_schema_org/Schema_org_JSON_IsSupplementTo.yml +0 -153
  510. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_schema_org/alternate_identifiers.yml +0 -131
  511. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_schema_org/data_catalog.yml +0 -136
  512. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_schema_org/geo_location_box.yml +0 -181
  513. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_schema_org/interactive_resource_without_dates.yml +0 -127
  514. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_schema_org/journal_article.yml +0 -247
  515. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_schema_org/maremma_schema_org_JSON.yml +0 -86
  516. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_schema_org/series_information.yml +0 -174
  517. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_schema_org/subject_scheme.yml +0 -199
  518. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_schema_org/subject_scheme_multiple_keywords.yml +0 -201
  519. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_turtle/BlogPosting.yml +0 -81
  520. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_turtle/BlogPosting_schema_org.yml +0 -530
  521. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_turtle/Dataset.yml +0 -120
  522. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_turtle/journal_article.yml +0 -247
  523. data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_turtle/with_pages.yml +0 -228
  524. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/fos/hsh_to_fos_for_match.yml +0 -221
  525. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/fos/hsh_to_fos_for_with_schemeUri_in_hash.yml +0 -221
  526. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/fos/hsh_to_fos_match.yml +0 -221
  527. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/fos/hsh_to_fos_no_match.yml +0 -221
  528. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/fos/name_to_fos_for_match.yml +0 -221
  529. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/fos/name_to_fos_match.yml +0 -221
  530. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/fos/name_to_fos_no_match.yml +0 -221
  531. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/from_schema_org/with_id.yml +0 -221
  532. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_date/publication_date.yml +0 -221
  533. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_date_from_date_parts/date.yml +0 -221
  534. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_date_from_date_parts/year-month.yml +0 -221
  535. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_date_from_date_parts/year.yml +0 -221
  536. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_date_from_parts/date.yml +0 -221
  537. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_date_from_parts/year-month.yml +0 -221
  538. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_date_from_parts/year.yml +0 -221
  539. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_date_parts/date.yml +0 -221
  540. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_date_parts/year-month.yml +0 -221
  541. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_date_parts/year.yml +0 -221
  542. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_datetime_from_time/future.yml +0 -221
  543. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_datetime_from_time/invalid.yml +0 -221
  544. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_datetime_from_time/nil.yml +0 -221
  545. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_datetime_from_time/past.yml +0 -221
  546. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_datetime_from_time/present.yml +0 -221
  547. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed/all_posts.yml +0 -3602
  548. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed/behind_the_science.yml +0 -1176
  549. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed/citation_style_language.yml +0 -360
  550. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed/citation_style_language_blog.yml +0 -360
  551. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed/front-matter_blog.yml +0 -1034
  552. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed/upstream.yml +0 -2438
  553. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed/upstream_blog.yml +0 -2438
  554. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item/by_uuid.yml +0 -136
  555. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_link/license.yml +0 -221
  556. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_link/url.yml +0 -221
  557. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_series_information/only_title.yml +0 -221
  558. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_series_information/title_and_pages.yml +0 -221
  559. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_series_information/title_volume_and_pages.yml +0 -221
  560. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_series_information/title_volume_issue_and_pages.yml +0 -221
  561. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/github/github_as_cff_url.yml +0 -221
  562. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/github/github_as_codemeta_url.yml +0 -221
  563. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/github/github_from_url.yml +0 -221
  564. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/github/github_from_url_cff_file.yml +0 -221
  565. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/github/github_from_url_file.yml +0 -221
  566. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/json_feed_unregistered_url/all_posts.yml +0 -221
  567. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/normalize_cc_url/not_found.yml +0 -221
  568. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/normalize_cc_url/with_trailing_slash.yml +0 -221
  569. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/normalize_cc_url/with_trailing_slash_and_to_https.yml +0 -221
  570. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/normalize_id/doi.yml +0 -221
  571. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/normalize_id/doi_as_url.yml +0 -221
  572. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/normalize_id/filename.yml +0 -221
  573. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/normalize_id/ftp.yml +0 -221
  574. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/normalize_id/invalid_url.yml +0 -221
  575. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/normalize_id/sandbox_via_options.yml +0 -221
  576. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/normalize_id/sandbox_via_url.yml +0 -221
  577. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/normalize_id/string.yml +0 -221
  578. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/normalize_id/url.yml +0 -221
  579. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/normalize_id/url_with_utf-8.yml +0 -221
  580. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/normalize_issn/from_array.yml +0 -221
  581. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/normalize_issn/from_empty_array.yml +0 -221
  582. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/normalize_issn/from_hash.yml +0 -221
  583. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/normalize_issn/from_string.yml +0 -221
  584. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/normalize_url/uri.yml +0 -221
  585. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/normalize_url/with_trailing_slash.yml +0 -221
  586. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/normalize_url/with_trailing_slash_and_to_https.yml +0 -221
  587. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/parse_attributes/array.yml +0 -221
  588. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/parse_attributes/array_of_strings.yml +0 -221
  589. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/parse_attributes/first.yml +0 -221
  590. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/parse_attributes/hash.yml +0 -221
  591. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/parse_attributes/hash_with_array_value.yml +0 -221
  592. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/parse_attributes/nil.yml +0 -221
  593. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/parse_attributes/string.yml +0 -221
  594. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/random_doi/decode_anothe_doi.yml +0 -221
  595. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/random_doi/decode_another_doi.yml +0 -221
  596. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/random_doi/decode_doi.yml +0 -221
  597. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/random_doi/encode_doi.yml +0 -221
  598. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/random_id/decode_another_id.yml +0 -221
  599. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/random_id/decode_id.yml +0 -221
  600. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/random_id/encode_id.yml +0 -221
  601. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/sanitize/onlies_keep_specific_tags.yml +0 -221
  602. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/sanitize/removes_a_tags.yml +0 -221
  603. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/spdx/hsh_to_spdx_id.yml +0 -221
  604. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/spdx/hsh_to_spdx_not_found.yml +0 -221
  605. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/spdx/hsh_to_spdx_url.yml +0 -221
  606. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/spdx/name_to_spdx_exists.yml +0 -221
  607. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/spdx/name_to_spdx_id.yml +0 -221
  608. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/to_schema_org/with_id.yml +0 -221
  609. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/to_schema_org_identifiers/with_identifiers.yml +0 -221
  610. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/validate_orcid/validate_orcid.yml +0 -221
  611. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/validate_orcid/validate_orcid_https.yml +0 -221
  612. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/validate_orcid/validate_orcid_id.yml +0 -221
  613. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/validate_orcid/validate_orcid_sandbox.yml +0 -221
  614. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/validate_orcid/validate_orcid_sandbox_https.yml +0 -221
  615. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/validate_orcid/validate_orcid_with_spaces.yml +0 -221
  616. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/validate_orcid/validate_orcid_wrong_id.yml +0 -221
  617. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/validate_orcid/validate_orcid_www.yml +0 -221
  618. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/validate_orcid_scheme/validate_orcid_scheme.yml +0 -221
  619. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/validate_orcid_scheme/validate_orcid_scheme_https.yml +0 -221
  620. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/validate_orcid_scheme/validate_orcid_scheme_trailing_slash.yml +0 -221
  621. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/validate_orcid_scheme/validate_orcid_scheme_www.yml +0 -221
  622. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/validate_url/DOI.yml +0 -221
  623. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/validate_url/ISSN.yml +0 -221
  624. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/validate_url/URL.yml +0 -221
  625. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/validate_url/string.yml +0 -221
  626. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_rdf_xml/BlogPosting.yml +0 -81
  627. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_rdf_xml/BlogPosting_schema_org.yml +0 -530
  628. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_rdf_xml/journal_article.yml +0 -247
  629. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_rdf_xml/maremma.yml +0 -86
  630. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_rdf_xml/with_pages.yml +0 -228
  631. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/Schema_org_JSON_IsSupplementTo.yml +0 -153
  632. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_turtle/BlogPosting_schema_org.yml +0 -530
@@ -23,13 +23,13 @@ http_interactions:
  Cache-Control:
  - public, max-age=0, must-revalidate
  Content-Length:
- - '162607'
+ - '49530'
  Content-Type:
  - application/json; charset=utf-8
  Date:
- - Thu, 15 Jun 2023 20:39:01 GMT
+ - Sun, 18 Jun 2023 15:23:48 GMT
  Etag:
- - '"6w7me0q1i23h72"'
+ - '"xv42bhvvc21253"'
  Server:
  - Vercel
  Strict-Transport-Security:
@@ -39,7 +39,7 @@ http_interactions:
  X-Vercel-Cache:
  - MISS
  X-Vercel-Id:
- - fra1::iad1::r8zhw-1686861540247-e74748738960
+ - fra1::iad1::6sjfw-1687101826987-ed7849dc5f1b
  Connection:
  - close
  body:
@@ -49,854 +49,55 @@ http_interactions:
  more ranty and less considered opinions, see my <a href=\"https://twitter.com/rdmpage\">Twitter
  feed</a>.<br>ISSN 2051-8188. Written content on this site is licensed under
  a <a href=\"https://creativecommons.org/licenses/by/4.0/\">Creative Commons
- Attribution 4.0 International license</a>.","language":"en","favicon":null,"feed_url":"https://iphylo.blogspot.com/feeds/posts/default?alt=rss","feed_format":"application/rss+xml","home_page_url":"https://iphylo.blogspot.com/","indexed_at":"2023-02-06","modified_at":"2023-05-31T17:26:00+00:00","license":"https://creativecommons.org/licenses/by/4.0/legalcode","generator":"Blogger","category":"Natural
- Sciences","backlog":true,"prefix":"10.59350","items":[{"id":"https://doi.org/10.59350/btdk4-42879","uuid":"3e1278f6-e7c0-43e1-bb54-6829e1344c0d","url":"https://iphylo.blogspot.com/2022/09/the-ideal-taxonomic-journal.html","title":"The
- ideal taxonomic journal","summary":"This is just some random notes on an “ideal”
- taxonomic journal, inspired in part by some recent discussions onturbo-taxonomy”
- (e.g., https://doi.org/10.3897/zookeys.1087.76720 and https://doi.org/10.1186/1742-9994-10-15),
- and also examples such as the Australian Journal of Taxonomy https://doi.org/10.54102/ajt.qxi3r
- which seems well-intentioned but limited. XML One approach is to have highly
- structured text that embeds detailed markup, and ideally a tool that generates
- markup in XML. This is...","date_published":"2022-09-29T14:00:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
- (Roderic Page)"}],"image":null,"content_html":"<p>This is just some random
- notes on an “ideal” taxonomic journal, inspired in part by some recent discussions
- on “turbo-taxonomy” (e.g., <a href=\"https://doi.org/10.3897/zookeys.1087.76720\">https://doi.org/10.3897/zookeys.1087.76720</a>
- and <a href=\"https://doi.org/10.1186/1742-9994-10-15\">https://doi.org/10.1186/1742-9994-10-15</a>),
- and also examples such as the Australian Journal of Taxonomy <a href=\"https://doi.org/10.54102/ajt.qxi3r\">https://doi.org/10.54102/ajt.qxi3r</a>
- which seems well-intentioned but limited.</p>\n<h2 id=\"xml\">XML</h2>\n<p>One
- approach is to have highly structured text that embeds detailed markup, and
- ideally a tool that generates markup in XML. This is the approach taken by
- Pensoft. There is an inevitable trade-off between the burden on authors of
- marking up text versus making the paper machine readable. In some ways this
- seems misplaced effort given that there is little evidence that publications
- by themselves have much value (see <a href=\"https://iphylo.blogspot.com/2021/12/the-business-of-extracting-knowledge.html\">The
- Business of Extracting Knowledge from Academic Publications</a>). “Value”
- in this case means as a source of data or factual statements that we can compute
- over. Human-readable text is not a good way to convey this sort of information.</p>\n<p>It’s
- also interesting that many editing tools are going in the opposite direction,
- for example there are minimalist tools using <a href=\"https://en.wikipedia.org/wiki/Markdown\">Markdown</a>
- where the goal is to <em>get out of the author’s way</em>, rather than impose
- a way of writing. Text is written by humans for humans, so the tools should
- be human-friendly.</p>\n<p>The idea of publishing using XML is attractive
- in that it gives you XML that can be archived by, say, PubMed Central, but
- other than that the value seems limited. A cursory glance at download stats
- for journals that provide PDF and XML downloads, such as <em>PLoS One</em>
- and <em>ZooKeys</em>, PDF is by far the more popular format. So arguably there
- is little value in providing XML. Those who have tried to use JATS-XML as
- an authoring tool have not had a happy time: <a href=\"https://doi.org/10.7557/15.5517\">How
- we tried to JATS XML</a>. However, there are various tools to help with the
- process, such as <a href=\"https://github.com/Vitaliy-1/docxToJats\">docxToJats</a>,<br>\ntexture,
- and <a href=\"https://github.com/elifesciences/jats-xml-to-pdf\">jats-xml-to-pdf</a>
- if this is the route one wants to take.</p>\n<h2 id=\"automating-writing-manuscripts\">Automating
- writing manuscripts</h2>\n<p>The dream, of course, is to have a tool where
- you store all your taxonomic data (literature, specimens, characters, images,
- sequences, media files, etc.) and at the click of a button generate a paper.
- Certainly some of this can be automated, much nomenclatural and specimen information
- could be converted to human-readable text. Ideally this computer-generated
- text would not be edited (otherwise it could get out of sync with the underlying
- data). The text should be <a href=\"https://en.wikipedia.org/wiki/Transclusion\">transcluded</a>.
- As an aside, one way to do this would be to include things such as lists of
- material examined as images rather than text while the manuscript is being
- edited. In the same way that you (probably) wouldn’t edit a photograph within
- your text editor, you shouldn’t be editing data. When the manuscript is published
- the data-generated portions can then be output as text.</p>\n<p>Of course
- all of this assumes that we have taxonomic data in a database (or some other
- storage format, including plain text and Mark-down, e.g. <a href=\"https://iphylo.blogspot.com/2022/04/obsidian-markdown-and-taxonomic-trees.html\">Obsidian,
- markdown, and taxonomic trees</a>) that can generate outputs in the various
- formats that we need.</p>\n<h2 id=\"archiving-data-and-images\">Archiving
- data and images</h2>\n<p>One of the really nice things that <a href=\"http://plazi.org\">Plazi</a>
- do is have a pipeline that sends taxonomic descriptions and images to Zenodo,
- and similar data to GBIF. Any taxonomic journal should be able to do this.
- Indeed, arguably each taxonomic treatment within the paper should be linked
- to the Zenodo DOI at the time of publication. Indeed, we could imagine ultimately
- having treatments as transclusions within the larger manuscript. Alternatively
- we could store the treatments as parts of the larger article (rather like
- chapters in a book), each with a CrossRef DOI. I’m still sceptical about whether
- these treatments are as important as we make out, see <a href=\"https://iphylo.blogspot.com/2022/09/does-anyone-cite-taxonomic-treatments.html\">Does
- anyone cite taxonomic treatments?</a>. But having machine-readable taxonomic
- data archived and accessible is a good thing. Uploading the same data to GBIF
- makes much of that data immediately accessible. Now that GBIF offers <a href=\"https://www.gbif.org/composition/3kQFinjwHbCGZeLb5OhwN2/gbif-hosted-portals\">hosted
- portals</a> there is the possibility of having custom interfaces to data from
- a particular journal.</p>\n<h2 id=\"name-and-identifier-registration\">Name
- and identifier registration</h2>\n<p>We would also want automatic registration
- of new taxonomic names, for which there are pipelines (see “A common registration-to-publication
- automated pipeline for nomenclatural acts for higher plants (International
- Plant Names Index, IPNI), fungi (Index Fungorum, MycoBank) and animals (ZooBank)”
- <a href=\"https://doi.org/10.3897/zookeys.550.9551\">https://doi.org/10.3897/zookeys.550.9551</a>).
- These pipelines do not seem to be documented in much detail, and the data
- formats differ across registration agencies (e.g., IPNI and ZooBank). For
- example, ZooBank seems to require TaxPub XML.</p>\n<p>Registration of names
- and identifiers, especially across multiple registration agencies (ZooBank,
- CrossRef, DataCite, etc.) requires some coordination, especially when one
- registration agency requires identifiers from another.</p>\n<h2 id=\"summary\">Summary</h2>\n<p>If
- data is key, then the taxonomic paper itself becomes something of a wrapper
- around that data. It still serves the function of being human-readable, providing
- broader context for the work, and as an archive that conforms to currently
- accepted ways to publish taxonomic names. But in some ways it is the last
- interesting part of the process.</p>\n<blockquote>\n<p>Written with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/37y2z-gre70","uuid":"f3629c86-06e0-42c0-844a-266b03a91ef1","url":"https://iphylo.blogspot.com/2023/05/ten-years-and-million-links.html","title":"Ten
- years and a million links","summary":"As trailed on a Twitter thread last
- week I’ve been working on a manuscript describing the efforts to map taxonomic
- names to their original descriptions in the taxonomic literature. Putting
- together a manuscript on linking taxonomic names to the primary literature,
- basically “um, what, exactly, have you been doing all these years?”. TL;DR
- Across fungi, plants, and animals approx 1.3 million names have been linked
- to a persistent identifier for a publication.— Roderic Page (@rdmpage) May
- 25,...","date_published":"2023-05-31T17:26:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
- (Roderic Page)"}],"image":null,"content_html":"<p>As trailed on a Twitter
- thread last week I’ve been working on a manuscript describing the efforts
- to map taxonomic names to their original descriptions in the taxonomic literature.</p>\n<blockquote
- class=\"twitter-tweet\"><p lang=\"en\" dir=\"ltr\">Putting together a manuscript
- on linking taxonomic names to the primary literature, basically “um, what,
- exactly, have you been doing all these years?”. TL;DR Across fungi, plants,
- and animals approx 1.3 million names have been linked to a persistent identifier
- for a publication.</p>— Roderic Page (@rdmpage) <a href=\"https://twitter.com/rdmpage/status/1661714128413573120?ref_src=twsrc%5Etfw\">May
- 25, 2023</a></blockquote> \n<p>The preprint is on bioRxiv <a href=\"https://doi.org/10.1101/2023.05.29.542697\">doi:10.1101/2023.05.29.542697</a></p>\n<blockquote>\n<p>A
- major gap in the biodiversity knowledge graph is a connection between taxonomic
- names and the taxonomic literature. While both names and publications often
- have persistent identifiers (PIDs), such as Life Science Identifiers (LSIDs)
- or Digital Object Identifiers (DOIs), LSIDs for names are rarely linked to
- DOIs for publications. This article describes efforts to make those connections
- across three large taxonomic databases: Index Fungorum, International Plant
- Names Index (IPNI), and the Index of Organism Names (ION). Over a million
- names have been matched to DOIs or other persistent identifiers for taxonomic
- publications. This represents approximately 36% of names for which publication
- data is available. The mappings between LSIDs and publication PIDs are made
- available through ChecklistBank. Applications of this mapping are discussed,
- including a web app to locate the citation of a taxonomic name, and a knowledge
- graph that uses data on researcher’s ORCID ids to connect taxonomic names
- and publications to authors of those names.</p>\n</blockquote>\n<p>Much of
- the work has been linking taxa to names, which still has huge gaps. There
- are also interesting differences in coverage between plants, animals, and
- fungi (see preprint for details).</p>\n\n<div class=\"separator\" style=\"clear:
- both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdWsSQhqi1DErXMIHm28g37-fiALNIsI5eQZmvoX_Fe03ZSwtKHbYt-LCsCCAUop0AGcwy_w7NpIjylVH1hNrM9oW-6j9e6tHASha49TTqFvDg2_tEx3r74RRFsjUo4M_Qat8NmKaZSChOt2hI3LsMjTVLrEVirEckU-9Ei7ug-7OHQlR4LA/s2276/animals-coverage.png\"
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
- border=\"0\" width=\"320\" data-original-height=\"2276\" data-original-width=\"2276\"
- src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdWsSQhqi1DErXMIHm28g37-fiALNIsI5eQZmvoX_Fe03ZSwtKHbYt-LCsCCAUop0AGcwy_w7NpIjylVH1hNrM9oW-6j9e6tHASha49TTqFvDg2_tEx3r74RRFsjUo4M_Qat8NmKaZSChOt2hI3LsMjTVLrEVirEckU-9Ei7ug-7OHQlR4LA/s320/animals-coverage.png\"/></a></div><div
- class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdyxlVJ-oyMCNPmHtHWjSxdxMSJvgzdWRGRF6Ad4dk7ab7gGDpuKdKmS9XhROkopw361ylfsTd1ZkwkF6BN0JlWNnVLCKY1AfryCfWKHkgPQM7u-0SELW9j8RlQIflb6ibaV64gwW7oJrEvOGECvR51F8EW8cRg-1usW-GBM5ymObj7zlObQ/s2276/fungi-coverage.png\"
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
- border=\"0\" width=\"320\" data-original-height=\"2276\" data-original-width=\"2276\"
- src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdyxlVJ-oyMCNPmHtHWjSxdxMSJvgzdWRGRF6Ad4dk7ab7gGDpuKdKmS9XhROkopw361ylfsTd1ZkwkF6BN0JlWNnVLCKY1AfryCfWKHkgPQM7u-0SELW9j8RlQIflb6ibaV64gwW7oJrEvOGECvR51F8EW8cRg-1usW-GBM5ymObj7zlObQ/s320/fungi-coverage.png\"/></a></div><div
- class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgf0YBuvNSXWAJTfQ1jk4XSocMzCYHP7t6IPUqhjQ3mftgM_850igWaD2copgNH6Xk6T62xBU641wvwOvXgCCDY3m2xC_gaILXO9RGx8H3Gpy5OOncsLb9smpT2LIgtYOExVBVdDRWqA0AZ8-mQjWL7dL5TiG7MqVu8spT8ACoGOPR_T36hRA/s2276/plants-coverage.png\"
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
- border=\"0\" width=\"320\" data-original-height=\"2276\" data-original-width=\"2276\"
- src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgf0YBuvNSXWAJTfQ1jk4XSocMzCYHP7t6IPUqhjQ3mftgM_850igWaD2copgNH6Xk6T62xBU641wvwOvXgCCDY3m2xC_gaILXO9RGx8H3Gpy5OOncsLb9smpT2LIgtYOExVBVdDRWqA0AZ8-mQjWL7dL5TiG7MqVu8spT8ACoGOPR_T36hRA/s320/plants-coverage.png\"/></a></div>\n\n\nThere
- is also a simple app to demonstrate these links, see <a href=\"https://species-cite.herokuapp.com\">https://species-cite.herokuapp.com</a>.\n\n\n\n<blockquote>\n<p>Written
- with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/92rdb-5fe58","uuid":"d33d4f49-b281-4997-9eb9-dbad1e52d9bd","url":"https://iphylo.blogspot.com/2022/09/local-global-identifiers-for.html","title":"Local
- global identifiers for decentralised wikis","summary":"I''ve been thinking
- a bit about how one could use a Markdown wiki-like tool such as Obsidian to
- work with taxonomic data (see earlier posts Obsidian, markdown, and taxonomic
- trees and Personal knowledge graphs: Obsidian, Roam, Wikidata, and Xanadu).
- One \"gotcha\" would be how to name pages. If we treat the database as entirely
- local, then the page names don''t matter, but what if we envisage sharing
- the database, or merging it with others (for example, if we divided a taxon
- up into chunks, and...","date_published":"2022-09-08T16:09:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
- (Roderic Page)"}],"image":null,"content_html":"<p>I''ve been thinking a bit
- about how one could use a Markdown wiki-like tool such as Obsidian to work
- with taxonomic data (see earlier posts <a href=\"https://iphylo.blogspot.com/2022/04/obsidian-markdown-and-taxonomic-trees.html\">Obsidian,
- markdown, and taxonomic trees</a> and <a href=\"https://iphylo.blogspot.com/2020/08/personal-knowledge-graphs-obsidian-roam.html\">Personal
- knowledge graphs: Obsidian, Roam, Wikidata, and Xanadu</a>).</p>\n\n<p>One
- \"gotcha\" would be how to name pages. If we treat the database as entirely
- local, then the page names don''t matter, but what if we envisage sharing
- the database, or merging it with others (for example, if we divided a taxon
- up into chunks, and different people worked on those different chunks)? </p>\n\n<p>This
- is the attraction of globally unique identifiers. You and I can independently
- work on the same thing, such as data linked to scientific paper, safe in the
- knowledge that if we both use the DOI for that paper we can easily combine
- what we''ve done. But global identifiers can also be a pain, especially if
- we need to use a service to look them up (\"is there a DOI for this paper?\",
- \"what is the LSID for this taxonomic name?\").</p>\n\n<p>Life would be easier
- if we could generate identifiers \"locally\", but had some assurance that
- they would be globally unique, and that anyone else generating an identifier
- for the same thing would arrive at the same identifier (this eliminates things
- such as <a href=\"https://en.wikipedia.org/wiki/Universally_unique_identifier\">UUIDs</a>
- which are intentionally designed to prvent people genrrating the same identifier).
- One approach is \"content addressing\" (see, e.g. <a href=\"https://web.archive.org/web/20210514054054/https://bentrask.com/notes/content-addressing.html\">Principles
- of Content Addressing</a> - dead link but in the Wayabck Machine, see also
- <a href=\"https://github.com/btrask/stronglink\">btrask/stronglink</a>). For
- example, we can generate a cryptographic hash of a file (such as a PDF) and
- use that as the identifier.</p>\n\n<p>Now the problem is that we have globally
- unique, but ugly and unfriendly identifiers (such as \"6c98136eba9084ea9a5fc0b7693fed8648014505\").
- What we need are nice, easy to use identifiers we can use as page names. <a
- href=\"https://species.wikimedia.org/wiki/Main_Page\">Wikispecies</a> serves
- as a possible role model, where taxon names serve as page names, as do simplified
- citations (e.g., authors and years). This model runs into the problem that
- taxon names aren''t unique, nor are author + year combinations. In Wikispecies
- this is resolved by having a centralised database where it''s first come,
- first served. If there is a name clash you have to create a new name for your
- page. This works, but what if you have multiple databases un by different
- people? How do we ensure the identifiers are the same?</p>\n\n<p>Then I remembered
- Roger Hyam''s flight of fantasy over a decade ago: <a href=\"http://www.hyam.net/blog/archives/1007\">SpeciesIndex.org
- – an impractical, practical solution</a>. He proposed the following rules
- to generate a unique URI for a taxonomic name:\n\n<ul>\n <li>The URI must
- start with \"http://speciesindex.org\" followed by one or more of the following
- separated by slashes.</li>\n\n <li>First word of name. Must only contain
- letters. Must not be the same as one of the names of the nomenclatural codes
- (icbn or iczn). Optional but highly recommended.</li> \n\n <li>Second word
- of name. Must only contain letters and not be a nomenclatural code name. Optional.</li>
- \n\n <li>Third word of name. Must only contain letters and not be a nomenclatural
- code name. Optional.</li> \n\n <li>Year of publication. Must be an integer
- greater than 1650 and equal to or less than the current year. If this is an
- ICZN name then this should be the year the species (epithet) was published
- as is commonly cited after the name. If this is an ICBN name at species or
- below then it is the date of the combination. Optional. Recommended for zoological
- names if known. Not recommended for botanical names unless there is a known
- problem with homonyms in use by non-taxonomists.</li>\n \n<li>Nomenclatural
- code governing the name of the taxon. Currently this must be either ''icbn''
- or ''iczn''. This may be omitted if the code is unknown or not relevant. Other
- codes may be added to this list.</li> \n <li>Qualifier This must be a Version
- 4 RFC-4122 UUID. Optional. Used to generate a new independent identifier for
- a taxon for which the conventional name is unknown or does not exist or to
- indicate a particular taxon concept that bears the embedded name.</li>\n\n <li>The
- whole speciesindex.org URI string should be considered case\nsensitive. Everything
- should be lower case apart from the first letter of words that are specified
- as having upper case in their relevant codes e.g. names at and above the rank
- of genus.</li>\n</ul>\n</p>\n\n<p>Roger is basically arging that while names
- aren''t unique (i.e., we have homonyms such as <i>Abronia</i>) they are pretty
- close to being so, and with a few tweaks we can come up with a unique representation.
- Another way to think about this if we had a database of all taxonomics, we
- could construct a <a href=\"https://en.wikipedia.org/wiki/Trie\">trie</a>
- and for each name find the shortest set of name parts (genus, species, etc),
- year, and code that gave us a unique string for that name. In many cases the
- species name may be all we need, in other cases we may need to add year and/or
- nomenclatural code to arrive at a unique string. \n\n</p>\n\n<p>What about
- bibliographic references? Well many of us will have databases (e.g., Endnote,
- Mendeley, Zotero, etc.) which generate \"cite keys\". These are typically
- short, memorable identifiers for a reference that are unique within that database.
- There is an interesting discussion on the <a href=\"https://discourse.jabref.org/t/universal-citekey-generator/2441/2\">JabRef
- forum</a> regarding a \"Universal Citekey Generator\", and source code is
- available <a href=\"https://github.com/cparnot/universal-citekey-js\">cparnot/universal-citekey-js</a>.
- I''ve yet to explore this in detail, but it looks a promising way to generate
- unique identifiers from basic metadata (echos of more elaborate schemes such
- as <a href=\"https://en.wikipedia.org/wiki/Serial_Item_and_Contribution_Identifier\">SICIs</a>).
- For example,\n\n<blockquote>Senna AR, Guedes UN, Andrade LF, Pereira-Filho
- GH. 2021. A new species of amphipod Pariphinotus Kunkel, 1910 (Amphipoda:
- Phliantidae) from Southwestern Atlantic. Zool Stud 60:57. doi:10.6620/ZS.2021.60-57.</blockquote>\n\nbecomes
- \"Senna:2021ck\". So if two people have the same, core, metadata for a paper
- they can generate the same key.</p>\n\n<p>Hence it seems with a few conventions
- (and maybe some simple tools to support them) we could have decentralised
- wiki-like tools that used the same identifiers for the same things, and yet
- those identfiiers were short and human-friendly.</p>","tags":["citekey","identfiiers","markdown","obsidian","Roger
- Hyam"],"language":"en","references":null},{"id":"https://doi.org/10.59350/j77nc-e8x98","uuid":"c6b101f4-bfbc-4d01-921d-805c43c85757","url":"https://iphylo.blogspot.com/2022/08/linking-taxonomic-names-to-literature.html","title":"Linking
- taxonomic names to the literature","summary":"Just some thoughts as I work
- through some datasets linking taxonomic names to the literature. In the diagram
- above I''ve tried to capture the different situatios I encounter. Much of
- the work I''ve done on this has focussed on case 1 in the diagram: I want
- to link a taxonomic name to an identifier for the work in which that name
- was published. In practise this means linking names to DOIs. This has the
- advantage of linking to a citable indentifier, raising questions such as whether
- citations...","date_published":"2022-08-22T17:19:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
- (Roderic Page)"}],"image":null,"content_html":"Just some thoughts as I work
- through some datasets linking taxonomic names to the literature.\n\n<div class=\"separator\"
- style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5ZkbnumEWKZpf0isei_hUJucFlOOK08-SKnJknD8B4qhAX36u-vMQRnZdhRJK5rPb7HNYcnqB7qE4agbeStqyzMkWHrzUj14gPkz2ohmbOVg8P_nHo0hM94PD1wH3SPJsaLAumN8vih3ch9pjH2RaVWZBLwwhGhNu0FS1m5z6j5xt2NeZ4w/s2140/linking%20to%20names144.jpg\"
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
- border=\"0\" height=\"600\" data-original-height=\"2140\" data-original-width=\"1604\"
- src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5ZkbnumEWKZpf0isei_hUJucFlOOK08-SKnJknD8B4qhAX36u-vMQRnZdhRJK5rPb7HNYcnqB7qE4agbeStqyzMkWHrzUj14gPkz2ohmbOVg8P_nHo0hM94PD1wH3SPJsaLAumN8vih3ch9pjH2RaVWZBLwwhGhNu0FS1m5z6j5xt2NeZ4w/s600/linking%20to%20names144.jpg\"/></a></div>\n\n<p>In
- the diagram above I''ve tried to capture the different situatios I encounter.
- Much of the work I''ve done on this has focussed on case 1 in the diagram:
- I want to link a taxonomic name to an identifier for the work in which that
- name was published. In practise this means linking names to DOIs. This has
- the advantage of linking to a citable indentifier, raising questions such
- as whether citations of taxonmic papers by taxonomic databases could become
- part of a <a href=\"https://iphylo.blogspot.com/2022/08/papers-citing-data-that-cite-papers.html\">taxonomist''s
- Google Scholar profile</a>.</p>\n\n<p>In many taxonomic databases full work-level
- citations are not the norm, instead taxonomists cite one or more pages within
- a work that are relevant to a taxonomic name. These \"microcitations\" (what
- the U.S. legal profession refer to as \"point citations\" or \"pincites\", see
- <a href=\"https://rasmussen.libanswers.com/faq/283203\">What are pincites,
- pinpoints, or jump legal references?</a>) require some work to map to the
- work itself (which is typically the thing that has a citatble identifier such
- as a DOI).</p>\n\n<p>Microcitations (case 2 in the diagram above) can be quite
- complex. Some might simply mention a single page, but others might list a
- series of (not necessarily contiguous) pages, as well as figures, plates etc.
- Converting these to citable identifiers can be tricky, especially as in most
- cases we don''t have page-level identifiers. The Biodiversity Heritage Library
- (BHL) does have URLs for each scanned page, and we have a standard for referring
- to pages in a PDF (<code>page=&lt;pageNum&gt;</code>, see <a href=\"https://datatracker.ietf.org/doc/html/rfc8118\">RFC
- 8118</a>). But how do we refer to a set of pages? Do we pick the first page?
- Do we try and represent a set of pages, and if so, how?</p>\n\n<p>Another
- issue with page-level identifiers is that not everything on a given page may
- be relevant to the taxonomic name. In case 2 above I''ve shaded in the parts
- of the pages and figure that refer to the taxonomic name. An example where
- this can be problematic is the recent test case I created for BHL where a
- page image was included for the taxonomic name <a href=\"https://www.gbif.org/species/195763322\"><i>Aphrophora
- impressa</i></a>. The image includes the species description and a illustration,
- as well as text that relates to other species.</p>\n\n<div class=\"separator\"
- style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1SAv1XiVBPHXeMLPfdsnTh4Sj4m0AQyqjXM0faXyvbNhBXFDl7SawjFIkMo3qz4RQhDEyZXkKAh5nF4gtfHbVXA6cK3NJ46CuIWECC6HmJKtTjoZ0M3QQXvLY_X-2-RecjBqEy68M0cEr0ph3l6KY51kA9BvGt9d4id314P71PBitpWMATg/s3467/https---www.biodiversitylibrary.org-pageimage-29138463.jpeg\"
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
- border=\"0\" height=\"400\" data-original-height=\"3467\" data-original-width=\"2106\"
- src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1SAv1XiVBPHXeMLPfdsnTh4Sj4m0AQyqjXM0faXyvbNhBXFDl7SawjFIkMo3qz4RQhDEyZXkKAh5nF4gtfHbVXA6cK3NJ46CuIWECC6HmJKtTjoZ0M3QQXvLY_X-2-RecjBqEy68M0cEr0ph3l6KY51kA9BvGt9d4id314P71PBitpWMATg/s400/https---www.biodiversitylibrary.org-pageimage-29138463.jpeg\"/></a></div>\n\n<p>Given
- that not everything on a page need be relevant, we could extract just the
- relevant blocks of text and illustrations (e.g., paragraphs of text, panels
- within a figure, etc.) and treat that set of elements as the thing to cite.
- This is, of course, what <a href=\"http://plazi.org\">Plazi</a> are doing.
- The set of extracted blocks is glued together as a \"treatment\", assigned
- an identifier (often a DOI), and treated as a citable unit. It would be interesting
- to see to what extent these treatments are actually cited, for example, do
- subsequent revisions that cite work that include treatments cite those treatments,
- or just the work itself? Put another way, are we creating <a href=\"https://iphylo.blogspot.com/2012/09/decoding-nature-encode-ipad-app-omg-it.html\">\"threads\"</a>
- between taxonomic revisions?</p>\n\n<p>One reason for these notes is that
- I''m exploring uploading taxonomic name - literature links to <a href=\"https://www.checklistbank.org\">ChecklistBank</a>
- and case 1 above is easy, as is case 3 (if we have treatment-level identifiers).
- But case 2 is problematic because we are linking to a set of things that may
- not have an identifier, which means a decision has to be made about which
- page to link to, and how to refer to that page.</p>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/w18j9-v7j10","uuid":"d811172e-7798-403c-a83d-3d5317a9657e","url":"https://iphylo.blogspot.com/2022/08/papers-citing-data-that-cite-papers.html","title":"Papers
- citing data that cite papers: CrossRef, DataCite, and the Catalogue of Life","summary":"Quick
- notes to self following on from a conversation about linking taxonomic names
- to the literature. Is there a way to turn those links into countable citations
- (even if just one per database) for Google Scholar?&mdash; Wayne Maddison
- (@WayneMaddison) August 3, 2022 There are different sorts of citation: Paper
- cites another paper Paper cites a dataset Dataset cites a paper Citation
- type (1) is largely a solved problem (although there are issues of the ownership
- and use of this...","date_published":"2022-08-03T11:33:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
- (Roderic Page)"}],"image":null,"content_html":"Quick notes to self following
- on from a conversation about linking taxonomic names to the literature.\n\n<blockquote
- class=\"twitter-tweet\"><p lang=\"en\" dir=\"ltr\">Is there a way to turn
- those links into countable citations (even if just one per database) for Google
- Scholar?</p>&mdash; Wayne Maddison (@WayneMaddison) <a href=\"https://twitter.com/WayneMaddison/status/1554644747406348288?ref_src=twsrc%5Etfw\">August
- 3, 2022</a></blockquote> <script async src=\"https://platform.twitter.com/widgets.js\"
- charset=\"utf-8\"></script>\n\nThere are different sorts of citation:\n\n<ol>\n <li>Paper
- cites another paper</li>\n <li>Paper cites a dataset</li>\n <li>Dataset
- cites a paper</li>\n</ol>\n\nCitation type (1) is largely a solved problem
- (although there are issues of the ownership and use of this data, see e.g.
- <a href=\"https://iphylo.blogspot.com/2020/07/zootaxa-has-no-impact-factor.html\">Zootaxa
- has no impact factor</a>).\n\nCitation type (2) is becoming more widespread
- (but not perfect as GBIF''s <a href=\"https://twitter.com/search?q=%23citethedoi&src=typed_query\">#citethedoi</a>
- campaign demonstrates). But the idea is well accepted and there are guides
- to how to do it, e.g.:\n\n<blockquote>\nCousijn, H., Kenall, A., Ganley, E.
- et al. A data citation roadmap for scientific publishers. Sci Data 5, 180259
- (2018). <a href=\"https://doi.org/10.1038/sdata.2018.259\">https://doi.org/10.1038/sdata.2018.259</a>\n</blockquote>\n\nHowever,
- things do get problematic because most (but not all) DOIs for publications
- are managed by CrossRef, which has an extensive citation database linking
- papers to other papers. Most datasets have DataCite DOIs, and DataCite manages
- its own citation links, but as far as I''m aware these two systems don''t
- really talk to each other.\n\nCitation type (3) is the case where a database
- is largely based on the literature, which applies to taxonomy. Taxonomic databases
- are essentially collections of literature that have opinions on taxa, and
- the database may simply compile those (e.g., a nomenclator), or come to some
- view on the applicability of each name. In an ideal world, each reference
- included in a taxonomic database would gain a citation, which would help better
- reflect the value of that work (a long standing bone of contention for taxonomists).\n\nIt
- would be interesting to explore these issues further. CrossRef and DataCite
- do share <a href=\"https://www.crossref.org/services/event-data/\">Event Data</a>
- (see also <a href=\"https://support.datacite.org/docs/eventdata-guide\">DataCite
- Event Data</a>). Can this track citations of papers by a dataset?\n \n \nMy
- take on Wayne''s question:\n\n<blockquote>\n Is there a way to turn those
- links into countable citations (even if just one per database) for Google
- Scholar?\n</blockquote>\n\nis that he''s after type 3 citations, which
- I don''t think we have a way to handle just yet (but I''d need to look at
- Event Data a bit more). Google Scholar is a black box, and the academic community''s
- reliance on it for metrics is troubling. But it would be interesting to try
- and figure out if there is a way to get Google Scholar to index the citations
- of taxonomic papers by databases. For instance, the <a href=\"https://www.catalogueoflife.org/\">Catalogue
- of Life</a> has an ISSN <a href=\"https://portal.issn.org/resource/ISSN/2405-884X\">2405-884X</a>
- so it can be treated as a publication. At the moment its web pages have lots
- of identifiers for people managing data and their organisations (lots of <a
- href=\"https://orcid.org\">ORCIDs</a> and <a href=\"https://ror.org\">RORs</a>),
- and DOIs for individual datasets (e.g., <a href=\"https://www.checklistbank.org/dataset/9828/about\">checklistbank.org</a>)
- but precious little in the way of DOIs for publications (or, indeed, ORCIDs
- for taxonomists). What would it take for taxonomic publications in the Catalogue
- of Life to be treated as first class citations?","tags":["Catalogue of Life","citation","CrossRef","DataCite","DOI"],"language":"en","references":null},{"id":"https://doi.org/10.59350/ws094-1w310","uuid":"6bed78ec-0029-4096-b1c3-48a55a9fdb3b","url":"https://iphylo.blogspot.com/2023/04/chatgpt-of-course.html","title":"ChatGPT,
- of course","summary":"I haven’t blogged for a while, work and other reasons
- have meant I’ve not had much time to think, and mostly I blog to help me think.
- ChatGPT is obviously a big thing at the moment, and once we get past the moral
- panic (“students can pass exams using AI!”) there are a lot of interesting
- possibilities to explore. Inspired by essays such as How Q&amp;A systems based
- on large language models (eg GPT4) will change things if they become the dominant
- search paradigm — 9 implications for libraries...","date_published":"2023-04-03T12:52:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
- (Roderic Page)"}],"image":null,"content_html":"<p>I haven’t blogged for a
- while, work and <a href=\"https://iphylo.blogspot.com/2023/03/dugald-stuart-page-1936-2022.html\">other
- reasons</a> have meant I’ve not had much time to think, and mostly I blog
- to help me think.</p>\n<p>ChatGPT is obviously a big thing at the moment,
- and once we get past the moral panic (“students can pass exams using AI!”)
- there are a lot of interesting possibilities to explore. Inspired by essays
- such as <a href=\"https://medium.com/@aarontay/how-q-a-systems-based-on-large-language-models-eg-gpt4-will-change-things-if-they-become-the-norm-c7cf62736ba\">How
- Q&amp;A systems based on large language models (eg GPT4) will change things
- if they become the dominant search paradigm — 9 implications for libraries</a>
- and <a href=\"https://about.sourcegraph.com/blog/cheating-is-all-you-need\">Cheating
- is All You Need</a>, as well as <a href=\"https://paul-graham-gpt.vercel.app/\">Paul
- Graham GPT</a>,
- I thought I’d try a few things and see where this goes.</p>\n<p>ChatGPT can
- do some surprising things.</p>\n<h4 id=\"parse-bibliographic-data\">Parse
- bibliographic data</h4>\n<p>I spend a LOT of time working with bibliographic
- data, trying to parse it into structured data. ChatGPT can do this:</p>\n\n<div
- class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2bA-Xk6D2wfexP0cs-U3AQa534hNW7J06I0avM82WD3dZSiRvXyHFScv3Hb8QfKKp1fk4GU4olhv1BOq8Pqu4IJgptlBap3xcLI1bFJwB3EOmHKxqf4iy2exJy4vwZ2n4I0U0JVui4Xmhvdy3mTbMSXCRbmCglNFULi5oQEhtPVG4gLJjdw/s924/Screenshot%202023-04-03%20at%2012.59.30.png\"
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
- border=\"0\" height=\"400\" data-original-height=\"924\" data-original-width=\"738\"
- src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2bA-Xk6D2wfexP0cs-U3AQa534hNW7J06I0avM82WD3dZSiRvXyHFScv3Hb8QfKKp1fk4GU4olhv1BOq8Pqu4IJgptlBap3xcLI1bFJwB3EOmHKxqf4iy2exJy4vwZ2n4I0U0JVui4Xmhvdy3mTbMSXCRbmCglNFULi5oQEhtPVG4gLJjdw/s400/Screenshot%202023-04-03%20at%2012.59.30.png\"/></a></div>\n\n<p>Note
- that it does more than simply parse the strings, it expands journal abbreviations
- such as “J. Malay Brch. R. Asiat. Soc.” to the full name “Journal of the Malayan
- Branch of the Royal Asiatic Society”. So we can get clean, parsed data in
- a range of formats.</p>\n<h4 id=\"parse-specimens\">Parse specimens</h4>\n<p>Based
- on the success with parsing bibliographic strings I wondered how well it could
- handle citations of specimens (“material examined”). Elsewhere I’ve been
- critical of Plazi’s ability to do this, see <a href=\"https://iphylo.blogspot.com/2021/10/problems-with-plazi-parsing-how.html\">Problems
- with Plazi parsing: how reliable are automated methods for extracting specimens
- from the literature?</a>.</p>\n<p>For example, given this specimen record
- on p. 130 of <a href=\"https://doi.org/10.5852/ejt.2021.775.1553\">doi:10.5852/ejt.2021.775.1553</a></p>\n<blockquote>\n<p>LAOS
- • Kammoune Province, Bunghona Market, 7 km N of Xe Bangfai River;<br>\n17.13674°
- N, 104.98591° E; E. Jeratthitikul, K. Wisittikoson, A. Fanka, N. Wutthituntisil
- and P. Prasankok leg.; sold by local people;<br>\nMUMNH-UNI2831.</p>\n</blockquote>\n<p>ChatGPT
- extracted a plausible Darwin Core record:</p>\n\n<div class=\"separator\"
- style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgIwVYuPXl47t1uSpJ3xpbsIT60UUWeKTKSBpIQgD8L2gkj_fuNf4sZKR76JgsfVYYHPs9G9VoM-caUhG1P3irMyB6erSYwYI5LDlLkxdrAFWdbY3ExDs34nMVOF0tq-NBtsDKMZsSJWjhJaGt9v3dtAqWvRyVMNZ46eG2sRhJTICTkFghKtw/s901/Screenshot%202023-04-03%20at%2013.30.54.png\"
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
- border=\"0\" height=\"400\" data-original-height=\"901\" data-original-width=\"764\"
- src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgIwVYuPXl47t1uSpJ3xpbsIT60UUWeKTKSBpIQgD8L2gkj_fuNf4sZKR76JgsfVYYHPs9G9VoM-caUhG1P3irMyB6erSYwYI5LDlLkxdrAFWdbY3ExDs34nMVOF0tq-NBtsDKMZsSJWjhJaGt9v3dtAqWvRyVMNZ46eG2sRhJTICTkFghKtw/s400/Screenshot%202023-04-03%20at%2013.30.54.png\"/></a></div>\n\n<p>I’ve
- been experimenting with parsing specimen records using the same machine learning
- approach for bibliographic data (e.g., <a href=\"https://iphylo.blogspot.com/2021/07/citation-parsing-released.html\">Citation
- parsing tool released</a>), perhaps it’s time to rethink that idea.</p>\n<h4
- id=\"correct-ocr-errors\">Correct OCR errors</h4>\n<p>OCR errors are a well
- known limitation of the Biodiversity Heritage Library (BHL), see for example
- <a href=\"https://blog.biodiversitylibrary.org/2022/07/ocr-improvements-early-analysis.html\">OCR
- Improvements: An Early Analysis</a>. Given that ChatGPT is built on top of
- a large language model, it seems obvious to see how it handles OCR errors.
- I tried the following snippet of text from <a href=\"https://www.biodiversitylibrary.org/page/4312227\">BHL
- page 4312227</a>:</p>\n<blockquote>\n<p>i BREVIORA No. 185</p>\n<p>Jiiaii
- Jr., 16, who always join him and Hedriek in th(&gt; search for<br>\nfrogs.
- Iloraeio Mayorga, research assistant to tlie author, has also<br>\nbeen (wtreniely
- useful, not only in field work, but also in the<br>\ntedious curatorial and
- laboratory work that usually fol- lows. Dr.<br>\nJohn Randall kindly took
- tlie photouraplis of E. hedricki, and the<br>\nGaliiiaues family of San Juan
- made their El Verde house available as<br>\na (MMiter of activities for the
- colleetiuff party. To all these<br>\njx’opic, the autlioi- feels (iee])ly
- indebted.</p>\n<p>Elet’tiierodactylus iiKnRrcKi s]). n.</p>\n<p>Type. Museum
- of Coniparative Zoology Xo. ;}()!)()8, c^ from El Verde,<br>\nwest flank
- of El Yuiuinc Puerto Hico. I.IOO ft. Coll. J. A. Kivero,<br>\n11 Aug-. I!)(i2.</p>\n</blockquote>\n<p>ChatGPT
- returns this:</p>\n<blockquote>\n<p>i BREVIORA No. 185</p>\n<p>Juan Jr., 16,
- who always joins him and Hedrick in the search for<br>\nfrogs. Horacio Mayorga,
- research assistant to the author, has also<br>\nbeen extremely useful, not
- only in field work, but also in the tedious<br>\ncuratorial and laboratory
- work that usually follows. Dr. John Randall<br>\nkindly took the photographs
- of E. hedricki, and the Galinanes family<br>\nof San Juan made their El Verde
- house available as a center of<br>\nactivities for the collecting party. To
- all these people, the author<br>\nfeels deeply indebted.</p>\n<p>Eleutherodactylus
- hedricki sp. n.</p>\n<p>Type. Museum of Comparative Zoology No. 30908, c?
- from El Verde, west<br>\nflank of El Yunque, Puerto Rico. 1100 ft. Coll. J.
- A. Rivero, 11 Aug.<br>\n1962.</p>\n</blockquote>\n\n<div class=\"separator\"
- style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibZLxXSTdt64SMhfqS8NCcLOxbVaFmkhMQx0yEUWcm696EL6k2_VgxZFEDT-SEqOChmwDggoQLkuU6DpcOF_d-nkQRKzQLt65WKfkFlIzDRV9LVrDqmu-D0OZl1p5Vva4QUdvDb3JpeKPPDXZ-BcZqq8V4PyGepxlnwtt7-q4Sg6_aHD58gw/s859/Screenshot%202023-04-03%20at%2013.07.53.png\"
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
- border=\"0\" width=\"400\" data-original-height=\"510\" data-original-width=\"859\"
- src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibZLxXSTdt64SMhfqS8NCcLOxbVaFmkhMQx0yEUWcm696EL6k2_VgxZFEDT-SEqOChmwDggoQLkuU6DpcOF_d-nkQRKzQLt65WKfkFlIzDRV9LVrDqmu-D0OZl1p5Vva4QUdvDb3JpeKPPDXZ-BcZqq8V4PyGepxlnwtt7-q4Sg6_aHD58gw/s400/Screenshot%202023-04-03%20at%2013.07.53.png\"/></a></div>\n\n<p>Comparing
- this to the scanned image, ChatGPT does pretty well; for example, the gobbledegook
- “Elet’tiierodactylus iiKnRrcKi” is correctly translated as “Eleutherodactylus
- hedricki”. Running all of BHL through ChatGPT probably isn’t feasible, but
- one could imagine targeted cleaning of key papers.</p>\n<h4 id=\"summary\">Summary</h4>\n<p>These
- small experiments are fairly trivial, but they are the sort of tedious tasks
- that would otherwise require significant programming (or other resources)
- to solve. But ChatGPT can do rather more, as I hope to discuss in the next
- post.</p>\n<blockquote>\n<p>Written with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/7esgr-61v1","uuid":"96fa91d5-459c-482f-aa38-dda6e0a30e20","url":"https://iphylo.blogspot.com/2022/01/large-graph-viewer-experiments.html","title":"Large
- graph viewer experiments","summary":"I keep returning to the problem of viewing
- large graphs and trees, which means my hard drive has accumulated lots of
- failed prototypes. Inspired by some recent discussions on comparing taxonomic
- classifications I decided to package one of these (wildly incomplete) prototypes
- up so that I can document the idea and put the code somewhere safe. Very cool,
- thanks for sharing this-- the tree diff is similar to what J Rees has been
- cooking up lately with his &#39;cl diff&#39; tool. I&#39;ll tag...","date_published":"2022-01-02T11:25:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
- (Roderic Page)"}],"image":null,"content_html":"<p>I keep returning to the
- problem of viewing large graphs and trees, which means my hard drive has accumulated
- lots of failed prototypes. Inspired by some recent discussions on comparing
- taxonomic classifications I decided to package one of these (wildly incomplete)
- prototypes up so that I can document the idea and put the code somewhere safe.</p>\n\n<blockquote
- class=\"twitter-tweet\"><p lang=\"en\" dir=\"ltr\">Very cool, thanks for sharing
- this-- the tree diff is similar to what J Rees has been cooking up lately
- with his &#39;cl diff&#39; tool. I&#39;ll tag <a href=\"https://twitter.com/beckettws?ref_src=twsrc%5Etfw\">@beckettws</a>
- in here too so he can see potential crossover. The goal is autogenerate diffs
- like this as 1st step to mapping taxo name-to concept</p>&mdash; Nate Upham
- (@n8_upham) <a href=\"https://twitter.com/n8_upham/status/1475834371131289608?ref_src=twsrc%5Etfw\">December
- 28, 2021</a></blockquote> <script async src=\"https://platform.twitter.com/widgets.js\"
- charset=\"utf-8\"></script>\n\n<h2>Google Maps-like viewer</h2>\n\n<div class=\"separator\"
- style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/a/AVvXsEiGDVVKGmwRi0gn-kEtpxBCUEdK-WCLBSpmkALAUDHGeTPkTSICSYVgnsj5N7zUeUfQALfFFHJJCsfeFvRULKbmqxLz51rW5hp_11dVXh-FHrnlRA7RJTA7I82l7sERF5jAjlah0LyEheVayO9nAfHTGZDuw5rnCe9iEO3dHQmA8_5AFIlJJg=s500\"
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
- border=\"0\" width=\"400\" data-original-height=\"448\" data-original-width=\"500\"
- src=\"https://blogger.googleusercontent.com/img/a/AVvXsEiGDVVKGmwRi0gn-kEtpxBCUEdK-WCLBSpmkALAUDHGeTPkTSICSYVgnsj5N7zUeUfQALfFFHJJCsfeFvRULKbmqxLz51rW5hp_11dVXh-FHrnlRA7RJTA7I82l7sERF5jAjlah0LyEheVayO9nAfHTGZDuw5rnCe9iEO3dHQmA8_5AFIlJJg=s400\"/></a></div>\n\n<p>I''ve
- created a simple viewer that uses a tiled map viewer (like Google Maps) to
- display a large graph. The idea is to draw the entire graph scaled to a 256
- x 256 pixel tile. The graph is stored in a database that supports geospatial
- queries, which means the queries to retrieve the individual tiles needed to
- display the graph at different levels of resolution are simply bounding box
- queries to a database. I realise that this description is cryptic at best.
- The GitHub repository <a href=\"https://github.com/rdmpage/gml-viewer\">https://github.com/rdmpage/gml-viewer</a>
- has more details and the code itself. There''s a lot to do, especially adding
- support for labels(!) which presents some interesting challenges (<a href=\"https://en.wikipedia.org/wiki/Level_of_detail_(computer_graphics)\">levels
- of detail</a> and <a href=\"https://en.wikipedia.org/wiki/Cartographic_generalization\">generalization</a>).
- The code doesn''t do any layout of the graph itself, instead I''ve used the
- <a href=\"https://www.yworks.com/products/yed\">yEd</a> tool to compute the
- x,y coordinates of the graph.</p>\n\n<p>Since this exercise was inspired by
- a discussion of the <a href=\"https://www.mammaldiversity.org\">ASM Mammal
- Diversity Database</a>, the graph I''ve used for the demonstration above is
- the ASM classification of extant mammals. I guess I need to solve the labelling
- issue fairly quickly!</p>","tags":["Google Maps","graph","Mammal Species of
- the World","mammals","taxonomy"],"language":"en","references":null},{"id":"https://doi.org/10.59350/m48f7-c2128","uuid":"8aea47e4-f227-45f4-b37b-0454a8a7a3ff","url":"https://iphylo.blogspot.com/2023/04/chatgpt-semantic-search-and-knowledge.html","title":"ChatGPT,
- semantic search, and knowledge graphs","summary":"One thing about ChatGPT
- is it has opened my eyes to some concepts I was dimly aware of but am only
- now beginning to fully appreciate. ChatGPT enables you to ask it questions, but
- the answers depend on what ChatGPT “knows”. As several people have noted,
- what would be even better is to be able to run ChatGPT on your own content.
- Indeed, ChatGPT itself now supports this using plugins. Paul Graham GPT However,
- it’s still useful to see how to add ChatGPT functionality to your own content
- from...","date_published":"2023-04-03T15:30:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
- (Roderic Page)"}],"image":null,"content_html":"<p>One thing about ChatGPT
- is it has opened my eyes to some concepts I was dimly aware of but am only
- now beginning to fully appreciate. ChatGPT enables you to ask it questions, but
- the answers depend on what ChatGPT “knows”. As several people have noted,
- what would be even better is to be able to run ChatGPT on your own content.
- Indeed, ChatGPT itself now supports this using <a href=\"https://openai.com/blog/chatgpt-plugins\">plugins</a>.</p>\n<h4
- id=\"paul-graham-gpt\">Paul Graham GPT</h4>\n<p>However, it’s still useful
- to see how to add ChatGPT functionality to your own content from scratch.
- A nice example of this is <a href=\"https://paul-graham-gpt.vercel.app/\">Paul
- Graham GPT</a> by <a href=\"https://twitter.com/mckaywrigley\">Mckay Wrigley</a>.
- Mckay Wrigley took essays by Paul Graham (a well known venture capitalist)
- and built a question and answer tool very like ChatGPT.</p>\n<iframe width=\"560\"
- height=\"315\" src=\"https://www.youtube.com/embed/ii1jcLg-eIQ\" title=\"YouTube
- video player\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write;
- encrypted-media; gyroscope; picture-in-picture; web-share\" allowfullscreen></iframe>\n<p>Because
- you can send a block of text to ChatGPT (as part of the prompt) you can get
- ChatGPT to summarise or transform that information, or answer questions based
- on that information. But there is a limit to how much information you can
- pack into a prompt. You can’t put all of Paul Graham’s essays into a prompt
- for example. So a solution is to do some preprocessing. For example, given
- a question such as “How do I start a startup?” we could first find the essays
- that are most relevant to this question, then use them to create a prompt
- for ChatGPT. A quick and dirty way to do this is simply do a text search over
- the essays and take the top hits. But we aren’t searching for words, we are
- searching for answers to a question. The essay with the best answer might
- not include the phrase “How do I start a startup?”.</p>\n<h4 id=\"semantic-search\">Semantic
- search</h4>\n<p>Enter <a href=\"https://en.wikipedia.org/wiki/Semantic_search\">Semantic
- search</a>. The key concept behind semantic search is that we are looking
- for documents with similar meaning, not just similarity of text. One approach
- to this is to represent documents by “embeddings”, that is, a vector of numbers
- that encapsulate features of the document. Documents with similar vectors
- are potentially related. In semantic search we take the query (e.g., “How
- do I start a startup?”), compute its embedding, then search among the documents
- for those with similar embeddings.</p>\n<p>To create Paul Graham GPT Mckay
- Wrigley did the following. First he sent each essay to the OpenAI API underlying
- ChatGPT, and in return he got the embedding for that essay (a vector of 1536
- numbers). Each embedding was stored in a database (Mckay uses Postgres with
- <a href=\"https://github.com/pgvector/pgvector\">pgvector</a>). When a user
- enters a query such as “How do I start a startup?” that query is also sent
- to the OpenAI API to retrieve its embedding vector. Then we query the database
- of embeddings for Paul Graham’s essays and take the top five hits. These hits
- are, one hopes, the most likely to contain relevant answers. The original
- question and the most similar essays are then bundled up and sent to ChatGPT
- which then synthesises an answer. See his <a href=\"https://github.com/mckaywrigley/paul-graham-gpt\">GitHub
- repo</a> for more details. Note that we are still using ChatGPT, but on a
- set of documents it doesn’t already have.</p>\n<h4 id=\"knowledge-graphs\">Knowledge
- graphs</h4>\n<p>I’m a fan of knowledge graphs, but they are not terribly easy
- to use. For example, I built a knowledge graph of Australian animals <a href=\"https://ozymandias-demo.herokuapp.com\">Ozymandias</a>
- that contains a wealth of information on taxa, publications, and people, wrapped
- up in a web site. If you want to learn more you need to figure out how to
- write queries in SPARQL, which is not fun. Maybe we could use ChatGPT to write
- the SPARQL queries for us, but it would be much more fun to simply ask
- natural language queries (e.g., “who are the experts on Australian ants?”).
- I made some naïve notes on these ideas <a href=\"https://iphylo.blogspot.com/2015/09/possible-project-natural-language.html\">Possible
- project: natural language queries, or answering “how many species are there?”</a>
- and <a href=\"https://iphylo.blogspot.com/2019/05/ozymandias-meets-wikipedia-with-notes.html\">Ozymandias
- meets Wikipedia, with notes on natural language generation</a>.</p>\n<p>Of
- course, this is a well known problem. Tools such as <a href=\"http://rdf2vec.org\">RDF2vec</a>
- can take RDF from a knowledge graph and create embeddings which could in turn
- be used to support semantic search. But it seems to me that we could simplify
- this process a bit by making use of ChatGPT.</p>\n<p>Firstly we would generate
- natural language statements from the knowledge graph (e.g., “species x belongs
- to genus y and was described in z”, “this paper on ants was authored by x”,
- etc.) that cover the basic questions we expect people to ask. We then get
- embeddings for these (e.g., using OpenAI). We then have an interface where
- people can ask a question (“is species x a valid species?”, “who has published
- on ants”, etc.), we get the embedding for that question, retrieve natural
- language statements that are the closest in embedding “space”, package everything
- up and ask ChatGPT to summarise the answer.</p>\n<p>The trick, of course,
- is to figure out how to generate natural language statements from the knowledge
- graph (which amounts to deciding what paths to traverse in the knowledge graph,
- and how to write those paths in something approximating English). We also
- want to know something about the sorts of questions people are likely to ask
- so that we have a reasonable chance of having the answers (for example, are
- people going to ask about individual species, or questions about summary statistics
- such as numbers of species in a genus, etc.).</p>\n<p>What makes this attractive
- is that it seems a straightforward way to go from a largely academic exercise
- (build a knowledge graph) to something potentially useful (a question and
- answer machine). Imagine if something like the defunct BBC wildlife site (see
- <a href=\"https://iphylo.blogspot.com/2017/12/blue-planet-ii-bbc-and-semantic-web.html\">Blue
- Planet II, the BBC, and the Semantic Web: a tale of lessons forgotten and
- opportunities lost</a>) revived <a href=\"https://aspiring-look.glitch.me\">here</a>
- had a question and answer interface where we could ask questions rather than
- passively browse.</p>\n<h4 id=\"summary\">Summary</h4>\n<p>I have so much
- more to learn, and need to think about ways to incorporate semantic search
- and ChatGPT-like tools into knowledge graphs.</p>\n<blockquote>\n<p>Written
- with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/rfxj3-x6739","uuid":"6a4d5c44-f4a9-4d40-a32c-a4d5e512c55a","url":"https://iphylo.blogspot.com/2022/05/thoughts-on-treebase-dying.html","title":"Thoughts
- on TreeBASE dying(?)","summary":"@rvosa is Naturalis no longer hosting Treebase?
- https://t.co/MBRgcxaBmR&mdash; Hilmar Lapp (@hlapp) May 10, 2022 So it looks
- like TreeBASE is in trouble, its legacy Java code a victim of security issues.
- Perhaps this is a chance to rethink TreeBASE, assuming that a repository of
- published phylogenies is still considered a worthwhile thing to have (and
- I think that question is open). Here''s what I think could be done. The data
- (individual studies with trees and data) are packaged into...","date_published":"2022-05-11T16:53:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
- (Roderic Page)"}],"image":null,"content_html":"<blockquote class=\"twitter-tweet\"><p
- lang=\"en\" dir=\"ltr\"><a href=\"https://twitter.com/rvosa?ref_src=twsrc%5Etfw\">@rvosa</a>
- is Naturalis no longer hosting Treebase? <a href=\"https://t.co/MBRgcxaBmR\">https://t.co/MBRgcxaBmR</a></p>&mdash;
- Hilmar Lapp (@hlapp) <a href=\"https://twitter.com/hlapp/status/1524166490798309381?ref_src=twsrc%5Etfw\">May
- 10, 2022</a></blockquote> <script async src=\"https://platform.twitter.com/widgets.js\"
- charset=\"utf-8\"></script>\n\n<p>So it looks like <a href=\"http://treebase.org\">TreeBASE</a>
- is in trouble, its legacy Java code a victim of security issues. Perhaps
- this is a chance to rethink TreeBASE, assuming that a repository of published
- phylogenies is still considered a worthwhile thing to have (and I think that
- question is open).</p>\n\n<p>Here''s what I think could be done.</p>\n\n<ol>\n<li>\nThe
- data (individual studies with trees and data) are packaged into whatever format
- is easiest (NEXUS, XML, JSON) and uploaded to a repository such as <a href=\"https://zenodo.org\">Zenodo</a>
- for long term storage. They get DOIs for citability. This becomes the default
- storage for TreeBASE.\n</li>\n<li>\nThe data is transformed into JSON and
- indexed using Elasticsearch. A simple web interface is placed on top so that
- people can easily find trees (never a strong point of the original TreeBASE).
- Trees are displayed natively on the web using SVG. The number one goal is
- for people to be able to find trees, view them, and download them.\n</li>\n<li>\nTo
- add data to TreeBASE the easiest way would be for people to upload them direct
- to Zenodo and tag them \"treebase\". A bot then grabs a feed of these datasets
- and adds them to the search engine in (2) above. As time allows, add an interface
- where people upload data directly, it gets curated, then deposited in Zenodo.
- This presupposes that there are people available to do curation. Maybe have
- \"stars\" for the level of curation so that users know whether anyone has
- checked the data.\n</li>\n</ol>\n\n<p>There are lots of details to tweak, for
- example how many of the existing URLs for studies are preserved (some URL
- mapping), and what about the API? And I''m unclear about the relationship
- with <a href=\"https://datadryad.org\">Dryad</a>.</p>\n\n<p>My sense is that
- the TreeBASE code is very much of its time (10-15 years ago), a monolithic
- block of code with SQL, Java, etc. If one was starting from scratch today
- I don''t think this would be the obvious solution. Things have trended towards
- being simpler, with lots of building blocks now available in the cloud. Need
- a search engine? Just spin up a container in the cloud and you have one. More
- and more functionality can be devolved elsewhere.</p>\n\n<p>Another other
- issue is how to support TreeBASE. It has essentially been a volunteer effort
- to date, with little or no funding. One reason I think having Zenodo as a
- storage engine is that it takes care of long term sustainability of the data.</p>\n\n<p>I
- realise that this is all wild arm waving, but maybe now is the time to reinvent
- TreeBASE?</p>\n\n<h2>Updates</h2>\n\n<p>It''s been a while since I''ve paid
- a lot of attention to phylogenetic databases, and it shows. There is a file-based
- storage system for phylogenies <a href=\"https://github.com/OpenTreeOfLife/phylesystem-1\">phylesystem</a>
- (see \"Phylesystem: a git-based data store for community-curated phylogenetic
- estimates\" <a href=\"https://doi.org/10.1093/bioinformatics/btv276\">https://doi.org/10.1093/bioinformatics/btv276</a>)
- that is sort of what I had in mind, although long term persistence is based
- on GitHub rather than a repository such as Zenodo. Phylesystem uses a truly
- horrible-looking JSON transformation of <a href=\"http://nexml.github.io\">NeXML</a>
- (NeXML itself is ugly), and TreeBASE also supports NeXML, so some form of
- NeXML or a JSON transformation seems the obvious storage format. It will probably
- need some cleaning and simplification if it is to be indexed easily. Looking
- back over the long history of TreeBASE and phylogenetic databases I''m struck
- by how much complexity has been introduced over time. I think the tech has
- gotten in the way sometimes (which might just be another way of saying that
- I''m not smart enough to make sense of it all.</p>\n\n<p>So we could imagine
- a search engine that covers both TreeBASE and <a href=\"https://tree.opentreeoflife.org/curator\">Open
- Tree of Life studies</a>.</p>\n\n<p>Basic metadata-based searches would be
- straightforward, and we could have a user interface that highlights the trees
- (I think TreeBASE''s biggest search rival is a Google image search). The harder
- problem is searching by tree structure, for which there is an interesting
- literature without any decent implementations that I''m aware of (as I said,
- I''ve been out of this field a while).</p>\n\n<p>So my instinct is we could
- go a long way with simply indexing JSON (CouchDB or Elasticsearch), then need
- to think a bit more cleverly about higher taxon and tree based searching.
- I''ve always thought that one killer query would be not so much \"show me
- all the trees for my taxon\" but \"show me a synthesis of the trees for my
- taxon\". Imagine a supertree of recent studies that we could use as a summary
- of our current knowledge, or a visualisation that summarises where there are
- conflicts among the trees.</p>\n\n<h3>Relevant code and sites</h3>\n\n<ul>\n<li><a
- href=\"https://github.com/rdmpage/cdaotools\">CDAO Tools</a>, see \"CDAO-Store:
- Ontology-driven Data Integration for Phylogenetic Analysis\" <a href=\"https://doi.org/10.1186/1471-2105-12-98\">https://doi.org/10.1186/1471-2105-12-98</a></li>\n<li><a
- href=\"https://github.com/NESCent/phylocommons\">PhyloCommons</a></li>\n</ul>","tags":["phylogeny","TreeBASE"],"language":"en","references":null},{"id":"https://doi.org/10.59350/jzvs4-r9559","uuid":"23fa1dd8-5c6b-4aa9-9cad-c6f6b14ae9e0","url":"https://iphylo.blogspot.com/2021/08/json-ld-in-wild-examples-of-how.html","title":"JSON-LD
- in the wild: examples of how structured data is represented on the web","summary":"I''ve
- created a GitHub repository so that I can keep track of the examples of JSON-LD
- that I''ve seen being actively used, for example embedded in web sites, or
- accessed using an API. The repository is https://github.com/rdmpage/wild-json-ld.
- The list is by no means exhaustive, I hope to add more examples as I come
- across them. One reason for doing this is to learn what others are doing.
- For example, after looking at SciGraph''s JSON-LD I now see how an ordered
- list can be modelled in RDF in...","date_published":"2021-08-27T13:20:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
- (Roderic Page)"}],"image":null,"content_html":"<p>I''ve created a GitHub repository
- so that I can keep track of the examples of JSON-LD that I''ve seen being
- actively used, for example embedded in web sites, or accessed using an API.
- The repository is <a href=\"https://github.com/rdmpage/wild-json-ld\">https://github.com/rdmpage/wild-json-ld</a>.
- The list is by no means exhaustive, I hope to add more examples as I come
- across them.</p>\n\n<p>One reason for doing this is to learn what others are
- doing. For example, after looking at SciGraph''s JSON-LD I now see how an
- ordered list can be modelled in RDF in such a way that the list of authors
- in a JSON-LD document for, say a scientific paper, is correct. By default
- RDF has no notion of ordered lists, so if you do a SPARQL query to get the
- authors of a paper, the order of the authors returned in the query will be
- arbitrary. There are various ways to try and tackle this. In my Ozymandias
- knowledge graph I used \"roles\" to represent order (see <a href=\"https://doi.org/10.7717/peerj.6739/fig-2\">Figure
- 2</a> in the Ozymandias paper). I then used properties of the role to order
- the list of authors.</p>\n\n<p>Another approach is to use rdf:lists (see <a
- href=\"http://www.snee.com/bobdc.blog/2014/04/rdf-lists-and-sparql.html\">RDF
- lists and SPARQL</a> and <a href=\"https://stackoverflow.com/questions/17523804/is-it-possible-to-get-the-position-of-an-element-in-an-rdf-collection-in-sparql/17530689#17530689\">Is
- it possible to get the position of an element in an RDF Collection in SPARQL?</a>
- for an introduction to lists). SciGraph uses this approach. The value for
- schema:author is not an author, but a blank node (bnode), and this bnode has
- two predicates, rdf:first and rdf:rest. One points to an author, the other
- points to another bnode. This pattern repeats until we encounter a value of
- rdf:nil for rdf:rest.</p>\n\n<div class=\"separator\" style=\"clear: both;\"><a
- href=\"https://1.bp.blogspot.com/-AESgWub1ZLQ/YSjoeo6O41I/AAAAAAAAgwg/5Edm7ZmuwL8NwxCcBvTqbI7js5nYmgggwCLcBGAsYHQ/s629/list.png\"
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
- border=\"0\" height=\"320\" data-original-height=\"629\" data-original-width=\"401\"
- src=\"https://1.bp.blogspot.com/-AESgWub1ZLQ/YSjoeo6O41I/AAAAAAAAgwg/5Edm7ZmuwL8NwxCcBvTqbI7js5nYmgggwCLcBGAsYHQ/s320/list.png\"/></a></div>\n\n<p>This
- introduces some complexity, but the benefit is that the JSON-LD version of
- the RDF will have the authors in the correct order, and hence any client that
- is using JSON will be able to treat the array of authors as ordered. Without
- some means of ordering the client could not make this assumption, hence the
- first author in the list might not actually be the first author of the paper.</p>","tags":["JSON-LD","RDF"],"language":"en","references":null},{"id":"https://doi.org/10.59350/zc4qc-77616","uuid":"30c78d9d-2e50-49db-9f4f-b3baa060387b","url":"https://iphylo.blogspot.com/2022/09/does-anyone-cite-taxonomic-treatments.html","title":"Does
- anyone cite taxonomic treatments?","summary":"Taxonomic treatments have come
- up in various discussions I''m involved in, and I''m curious as to whether
- they are actually being used, in particular, whether they are actually being
- cited. Consider the following quote: The taxa are described in taxonomic treatments,
- well defined sections of scientific publications (Catapano 2019). They include
- a nomenclatural section and one or more sections including descriptions, material
- citations referring to studied specimens, or notes ecology and...","date_published":"2022-09-01T16:49:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
- (Roderic Page)"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
- both;\"><a href=\"https://zenodo.org/record/5731100/thumb100\" style=\"display:
- block; padding: 1em 0; text-align: center; clear: right; float: right;\"><img
- alt=\"\" border=\"0\" height=\"128\" data-original-height=\"106\" data-original-width=\"100\"
- src=\"https://zenodo.org/record/5731100/thumb250\"/></a></div>\nTaxonomic
- treatments have come up in various discussions I''m involved in, and I''m
- curious as to whether they are actually being used, in particular, whether
- they are actually being cited. Consider the following quote:\n\n<blockquote>\nThe
- taxa are described in taxonomic treatments, well defined sections of scientific
- publications (Catapano 2019). They include a nomenclatural section and one
- or more sections including descriptions, material citations referring to studied
- specimens, or notes ecology and behavior. In case the treatment does not describe
- a new discovered taxon, previous treatments are cited in the form of treatment
- citations. This citation can refer to a previous treatment and add additional
- data, or it can be a statement synonymizing the taxon with another taxon.
- This allows building a citation network, and ultimately is a constituent part
- of the catalogue of life. - Taxonomic Treatments as Open FAIR Digital Objects
- <a href=\"https://doi.org/10.3897/rio.8.e93709\">https://doi.org/10.3897/rio.8.e93709</a>\n</blockquote>\n\n<p>\n
- \"Traditional\" academic citation is from article to article. For example,
- consider these two papers:\n\n<blockquote>\nLi Y, Li S, Lin Y (2021) Taxonomic
- study on fourteen symphytognathid species from Asia (Araneae, Symphytognathidae).
- ZooKeys 1072: 1-47. https://doi.org/10.3897/zookeys.1072.67935\n</blockquote>\n\n<blockquote>\nMiller
- J, Griswold C, Yin C (2009) The symphytognathoid spiders of the Gaoligongshan,
- Yunnan, China (Araneae: Araneoidea): Systematics and diversity of micro-orbweavers.
- ZooKeys 11: 9-195. https://doi.org/10.3897/zookeys.11.160\n</blockquote>\n</p>\n\n<p>Li
- et al. 2021 cites Miller et al. 2009 (although Pensoft seems to have broken
- the citation such that it does appear correctly either on their web page or
- in CrossRef).</p>\n\n<p>So, we have this link: [article]10.3897/zookeys.1072.67935
- --cites--> [article]10.3897/zookeys.11.160. One article cites another.</p>\n\n<p>In
- their 2021 paper Li et al. discuss <i>Patu jidanweishi</i> Miller, Griswold
- & Yin, 2009:\n\n<div class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPavMHXqQNX1ls_zXo9kIcMLHPxc7ZpV9NCSof5wLumrg3ovPoi6nYKzZsINuqFtYoEvW1QrerePD-MEf2DJaYUXlT37d81x3L6ILls7u229rg0_Nc0uUmgW-ICzr6MI_QCZfgQbYGTxuofu-fuPVoygbCnm3vQVYOhLDLtp1EtQ9jRZHDvw/s1040/Screenshot%202022-09-01%20at%2017.12.27.png\"
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
- border=\"0\" width=\"400\" data-original-height=\"314\" data-original-width=\"1040\"
- src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPavMHXqQNX1ls_zXo9kIcMLHPxc7ZpV9NCSof5wLumrg3ovPoi6nYKzZsINuqFtYoEvW1QrerePD-MEf2DJaYUXlT37d81x3L6ILls7u229rg0_Nc0uUmgW-ICzr6MI_QCZfgQbYGTxuofu-fuPVoygbCnm3vQVYOhLDLtp1EtQ9jRZHDvw/s400/Screenshot%202022-09-01%20at%2017.12.27.png\"/></a></div>\n\n<p>There
- is a treatment for the original description of <i>Patu jidanweishi</i> at
- <a href=\"https://doi.org/10.5281/zenodo.3792232\">https://doi.org/10.5281/zenodo.3792232</a>,
- which was created by Plazi with a time stamp \"2020-05-06T04:59:53.278684+00:00\".
- The original publication date was 2009, the treatments are being added retrospectively.</p>\n\n<p>In
- an ideal world my expectation would be that Li et al. 2021 would have cited
- the treatment, instead of just providing the text string \"Patu jidanweishi
- Miller, Griswold & Yin, 2009: 64, figs 65A–E, 66A, B, 67A–D, 68A–F, 69A–F,
- 70A–F and 71A–F (♂♀).\" Isn''t the expectation under the treatment model that
- we would have seen this relationship:</p>\n\n<p>[article]10.3897/zookeys.1072.67935
- --cites--> [treatment]https://doi.org/10.5281/zenodo.3792232</p>\n\n<p>Furthermore,
- if it is the case that \"[i]n case the treatment does not describe a new discovered
- taxon, previous treatments are cited in the form of treatment citations\"
- then we should also see a citation between treatments, in other words Li et
- al.''s 2021 treatment of <i>Patu jidanweishi</i> (which doesn''t seem to have
- a DOI but is available on Plazi'' web site as <a href=\"https://tb.plazi.org/GgServer/html/1CD9FEC313A35240938EC58ABB858E74\">https://tb.plazi.org/GgServer/html/1CD9FEC313A35240938EC58ABB858E74</a>)
- should also cite the original treatment? It doesn''t - but it does cite the
- Miller et al. paper.</p>\n\n<p>So in this example we don''t see articles citing
- treatments, nor do we see treatments citing treatments. Playing Devil''s advocate,
- why then do we have treatments? Does''t the lack of citations suggest that
- - despite some taxonomists saying this is the unit that matters - they actually
- don''t. If we pay attention to what people do rather than what they say they
- do, they cite articles.</p>\n\n<p>Now, there are all sorts of reasons why
- we don''t see [article] -> [treatment] citations, or [treatment] -> [treatment]
- citations. Treatments are being added after the fact by Plazi, not by the
- authors of the original work. And in many cases the treatments that could
- be cited haven''t appeared until after that potentially citing work was published.
- In the example above the Miller et al. paper dates from 2009, but the treatment
- extracted only went online in 2020. And while there is a long standing culture
- of citing publications (ideally using DOIs) there isn''t an equivalent culture
- of citing treatments (beyond the simple text strings).</p>\n\n<p>Obviously
- this is but one example. I''d need to do some exploration of the citation
- graph to get a better sense of citations patterns, perhaps using <a href=\"https://www.crossref.org/documentation/event-data/\">CrossRef''s
- event data</a>. But my sense is that taxonomists don''t cite treatments.</p>\n\n<p>I''m
- guessing Plazi would respond by saying treatments are cited, for example (indirectly)
- in GBIF downloads. This is true, although arguably people aren''t citing the
- treatment, they''re citing specimen data in those treatments, and that specimen
- data could be extracted at the level of articles rather than treatments. In
- other words, it''s not the treatments themselves that people are citing.</p>\n\n<p>To
- be clear, I think there is value in being able to identify those \"well defined
- sections\" of a publication that deal with a given taxon (i.e., treatments),
- but it''s not clear to me that these are actually the citable units people
- might hope them to be. Likewise, journals such as <i>ZooKeys</i> have DOIs
- for individual figures. Does anyone actually cite those?</p>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/en7e9-5s882","uuid":"20b9d31e-513f-496b-b399-4215306e1588","url":"https://iphylo.blogspot.com/2022/04/obsidian-markdown-and-taxonomic-trees.html","title":"Obsidian,
- markdown, and taxonomic trees","summary":"Returning to the subject of personal
- knowledge graphs Kyle Scheer has an interesting repository of Markdown files
- that describe academic disciplines at https://github.com/kyletscheer/academic-disciplines
- (see his blog post for more background). If you add these files to Obsidian
- you get a nice visualisation of a taxonomy of academic disciplines. The applications
- of this to biological taxonomy seem obvious, especially as a tool like Obsidian
- enables all sorts of interesting links to be added...","date_published":"2022-04-07T21:07:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
- (Roderic Page)"}],"image":null,"content_html":"<p>Returning to the subject
- of <a href=\"https://iphylo.blogspot.com/2020/08/personal-knowledge-graphs-obsidian-roam.html\">personal
- knowledge graphs</a> Kyle Scheer has an interesting repository of Markdown
- files that describe academic disciplines at <a href=\"https://github.com/kyletscheer/academic-disciplines\">https://github.com/kyletscheer/academic-disciplines</a>
- (see <a href=\"https://kyletscheer.medium.com/on-creating-a-tree-of-knowledge-f099c1028bf6\">his
- blog post</a> for more background).</p>\n\n<p>If you add these files to <a
- href=\"https://obsidian.md/\">Obsidian</a> you get a nice visualisation of
- a taxonomy of academic disciplines. The applications of this to biological
- taxonomy seem obvious, especially as a tool like Obsidian enables all sorts
- of interesting links to be added (e.g., we could add links to the taxonomic
- research behind each node in the taxonomic tree, the people doing that research,
- etc. - although that would mean we''d no longer have a simple tree).</p>\n\n<p>The
- more I look at these sort of simple Markdown-based tools the more I wonder
- whether we could make more use of them to create simple but persistent databases.
- Text files seem the most stable, long-lived digital format around, maybe this
- would be a way to minimise the inevitable obsolescence of database and server
- software. Time for some experiments I feel... can we take a taxonomic group,
- such as mammals, and create a richly connected database purely in Markdown?</p>\n\n<div
- class=\"separator\" style=\"clear: both; text-align: center;\"><iframe allowfullscreen=''allowfullscreen''
- webkitallowfullscreen=''webkitallowfullscreen'' mozallowfullscreen=''mozallowfullscreen''
- width=''400'' height=''322'' src=''https://www.blogger.com/video.g?token=AD6v5dy3Sa_SY_MJCZYYCT-bAGe9QD1z_V0tkE0qM5FaQJfAEgGOoHtYPATsNNbBvTEh_tHOZ83nMGzpYRg''
- class=''b-hbp-video b-uploaded'' frameborder=''0''></iframe></div>","tags":["markdown","obsidian"],"language":"en","references":null},{"id":"https://doi.org/10.59350/m7gb7-d7c49","uuid":"7d814863-43b5-4faf-a475-da8de5efd3ef","url":"https://iphylo.blogspot.com/2022/02/duplicate-dois-again.html","title":"Duplicate
- DOIs (again)","summary":"This blog post provides some background to a recent
- tweet where I expressed my frustration about the duplication of DOIs for the
- same article. I''m going to document the details here. The DOI that alerted
- me to this problem is https://doi.org/10.2307/2436688 which is for the article
- Snyder, W. C., & Hansen, H. N. (1940). THE SPECIES CONCEPT IN FUSARIUM. American
- Journal of Botany, 27(2), 64–67. This article is hosted by JSTOR at https://www.jstor.org/stable/2436688
- which displays the DOI...","date_published":"2022-02-08T15:06:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
- (Roderic Page)"}],"image":null,"content_html":"<p>This blog post provides
- some background to a <a href=\"https://twitter.com/rdmpage/status/1491023036199600132\">recent
- tweet</a> where I expressed my frustration about the duplication of DOIs for
- the same article. I''m going to document the details here.</p>\n\n<p>The DOI
- that alerted me to this problem is <a href=\"https://doi.org/10.2307/2436688\">https://doi.org/10.2307/2436688</a>
- which is for the article</p>\n\n<blockquote>\nSnyder, W. C., & Hansen, H.
- N. (1940). THE SPECIES CONCEPT IN FUSARIUM. American Journal of Botany, 27(2),
- 64–67.\n</blockquote>\n\n<p>This article is hosted by JSTOR at <a href=\"https://www.jstor.org/stable/2436688\">https://www.jstor.org/stable/2436688</a>
- which displays the DOI <a href=\"https://doi.org/10.2307/2436688\">https://doi.org/10.2307/2436688</a>
- .</p>\n\n<p>This same article is also hosted by Wiley at <a href=\"https://bsapubs.onlinelibrary.wiley.com/doi/abs/10.1002/j.1537-2197.1940.tb14217.x\">https://bsapubs.onlinelibrary.wiley.com/doi/abs/10.1002/j.1537-2197.1940.tb14217.x</a>
- with the DOI <a href=\"https://doi.org/10.1002/j.1537-2197.1940.tb14217.x\">https://doi.org/10.1002/j.1537-2197.1940.tb14217.x</a>.</p>\n\n<h2>Expected
- behaviour</h2>\n\n<p>What should happen is if Wiley is going to be the publisher
- of this content (taking over from JSTOR), the DOI <b>10.2307/2436688</b> should
- be redirected to the Wiley page, and the Wiley page displays this DOI (i.e.,
- <b>10.2307/2436688</b>). If I want to get metadata for this DOI, I should
- be able to use CrossRef''s API to retrieve that metadata, e.g. <a href=\"https://api.crossref.org/v1/works/10.2307/2436688\">https://api.crossref.org/v1/works/10.2307/2436688</a>
- should return metadata for the article.</p>\n\n<h2>What actually happens</h2>\n\n<p>Wiley
- display the same article on their web site with the DOI <b>10.1002/j.1537-2197.1940.tb14217.x</b>.
- They have minted a new DOI for the same article! The original JSTOR DOI now
- resolves to the Wiley page (you can see this using the <a href=\"https://hdl.handle.net\">Handle
- Resolver</a>), which is what is supposed to happen. However, Wiley should
- have reused the original DOI rather than mint their own.</p>\n\n<p>Furthermore,
- while the original DOI still resolves in a web browser, I can''t retrieve
- metadata about that DOI from CrossRef, so any attempt to build upon that DOI
- fails. However, I can retrieve metadata for the Wiley DOI, i.e. <a href=\"https://api.crossref.org/v1/works/10.1002/j.1537-2197.1940.tb14217.x\">https://api.crossref.org/v1/works/10.1002/j.1537-2197.1940.tb14217.x</a>
- works, but <a href=\"https://api.crossref.org/v1/works/10.2307/2436688\">https://api.crossref.org/v1/works/10.2307/2436688</a>
- doesn''t.</p>\n\n<h2>Why does this matter?</h2>\n\n<p>For anyone using DOIs
- as stable links to the literature the persistence of DOIs is something you
- should be able to rely upon, both for people clicking on links in web browsers
- and developers getting metadata from those DOIs. The whole rationale of the
- DOI system is a single, globally unique identifier for each article, and that
- these DOIs persist even when the publisher of the content changes. If this
- property doesn''t hold, then why would a developer such as myself invest effort
- in linking using DOIs?</p>\n\n<p>Just for the record, I think CrossRef is
- great and is a hugely important part of the scholarly landscape. There are
- lots of things that I do that would be nearly impossible without CrossRef
- and its tools. But cases like this where we get massive duplication of DOIs
- when a publishers takes over an existing journal fundamentally breaks the
- underlying model of stable, persistent identifiers.</p>","tags":["CrossRef","DOI","duplicates"],"language":"en","references":null},{"id":"https://doi.org/10.59350/d3dc0-7an69","uuid":"545c177f-cea5-4b79-b554-3ccae9c789d7","url":"https://iphylo.blogspot.com/2021/10/reflections-on-macroscope-tool-for-21st.html","title":"Reflections
+ Attribution 4.0 International license</a>.","language":"en","favicon":null,"feed_url":"https://iphylo.blogspot.com/feeds/posts/default","feed_format":"application/atom+xml","home_page_url":"https://iphylo.blogspot.com/","indexed_at":"2023-02-06","modified_at":"2023-05-31T17:26:00+00:00","license":"https://creativecommons.org/licenses/by/4.0/legalcode","generator":"Blogger
+ 7.00","category":"Natural Sciences","backlog":true,"prefix":"10.59350","items":[{"id":"https://doi.org/10.59350/ymc6x-rx659","uuid":"0807f515-f31d-4e2c-9e6f-78c3a9668b9d","url":"https://iphylo.blogspot.com/2022/09/dna-barcoding-as-intergenerational.html","title":"DNA
+ barcoding as intergenerational transfer of taxonomic knowledge","summary":"I
+ tweeted about this but want to bookmark it for later as well. The paper A
+ molecular-based identification resource for the arthropods of Finland” doi:10.1111/1755-0998.13510
+ contains the following: …the annotated barcode records assembled by FinBOL
+ participants represent a tremendous intergenerational transfer of taxonomic
+ knowledge the time contributed by current taxonomists in identifying and
+ contributing voucher specimens represents a great gift to future generations
+ who will benefit...","date_published":"2022-09-14T10:12:00Z","date_modified":"2022-09-29T13:57:30Z","date_indexed":"1909-06-16T11:02:21+00:00","authors":[{"url":null,"name":"Roderic
+ Page"}],"image":null,"content_html":"<p>I <a href=\"https://twitter.com/rdmpage/status/1569738844416638981?s=21&amp;t=9OVXuoUEwZtQt-Ldzlutfw\">tweeted
+ about this</a> but want to bookmark it for later as well. The paper “A molecular-based
+ identification resource for the arthropods of Finland” <a href=\"https://doi.org/10.1111/1755-0998.13510\">doi:10.1111/1755-0998.13510</a>
+ contains the following:</p>\n<blockquote>\n<p>…the annotated barcode records
+ assembled by FinBOL participants represent a tremendous <mark>intergenerational
+ transfer of taxonomic knowledge</mark> the time contributed by current taxonomists
+ in identifying and contributing voucher specimens represents a great gift
+ to future generations who will benefit from their expertise when they are
+ no longer able to process new material.</p>\n</blockquote>\n<p>I think this
+ is a very clever way to characterise the project. In an age of machine learning
+ this may be commonest way to share knowledge , namely as expert-labelled training
+ data used to build tools for others. Of course, this means the expertise itself
+ may be lost, which has implications for updating the models if the data isn’t
+ complete. But it speaks to Charles Godfrey’s theme of <a href=\"https://biostor.org/reference/250587\">“Taxonomy
+ as information science”</a>.</p>\n<p>Note that the knowledge is also transformed
+ in the sense that the underlying expertise of interpreting morphology, ecology,
+ behaviour, genomics, and the past literature is not what is being passed on.
+ Instead it is probabilities that a DNA sequence belongs to a particular taxon.</p>\n<p>This
+ feels is different to, say iNaturalist, where there is a machine learning
+ model to identify images. In that case, the model is built on something the
+ community itself has created, and continues to create. Yes, the underlying
+ idea is that same: “experts” have labelled the data, a model is trained, the
+ model is used. But the benefits of the <a href=\"https://www.inaturalist.org\">iNaturalist</a>
+ model are immediately applicable to the people whose data built the model.
+ In the case of barcoding, because the technology itself is still not in the
+ hands of many (relative to, say, digital imaging), the benefits are perhaps
+ less tangible. Obviously researchers working with environmental DNA will find
+ it very useful, but broader impact may await the arrival of citizen science
+ DNA barcoding.</p>\n<p>The other consideration is whether the barcoding helps
+ taxonomists. Is it to be used to help prioritise future work (“we are getting
+ lots of unknown sequences in these taxa, lets do some taxonomy there”), or
+ is it simply capturing the knowledge of a generation that won’t be replaced:</p>\n<blockquote>\n<p>The
+ need to capture such knowledge is essential because there are, for example,
+ no young Finnish taxonomists who can critically identify species in many key
+ groups of ar- thropods (e.g., aphids, chewing lice, chalcid wasps, gall midges,
+ most mite lineages).</p>\n</blockquote>\n<p>The cycle of collect data, test
+ and refine model, collect more data, rinse and repeat that happens with iNaturalist
+ creates a feedback loop. It’s not clear that a similar cycle exists for DNA
+ barcoding.</p>\n<blockquote>\n<p>Written with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":[],"language":"en","references":[]},{"id":"https://doi.org/10.59350/d3dc0-7an69","uuid":"545c177f-cea5-4b79-b554-3ccae9c789d7","url":"https://iphylo.blogspot.com/2021/10/reflections-on-macroscope-tool-for-21st.html","title":"Reflections
  on \"The Macroscope\" - a tool for the 21st Century?","summary":"This is a
  guest post by Tony Rees. It would be difficult to encounter a scientist, or
  anyone interested in science, who is not familiar with the microscope, a tool
@@ -904,8 +105,8 @@ http_interactions:
  by the unaided eye, or to reveal otherwise invisible fine detail in larger
  objects. A select few with a particular interest in microscopy may also have
  encountered the Wild-Leica \"Macroscope\", a specialised type of benchtop
- microscope optimised for...","date_published":"2021-10-07T12:38:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
- (Roderic Page)"}],"image":null,"content_html":"<p><img src=\"https://lh3.googleusercontent.com/-A99btr6ERMs/Vl1Wvjp2OtI/AAAAAAAAEFI/7bKdRjNG5w0/ytNkVT2U.jpg?imgmax=800\"
+ microscope optimised for...","date_published":"2021-10-07T12:38:00Z","date_modified":"2021-10-08T10:26:22Z","date_indexed":"1909-06-16T10:02:25+00:00","authors":[{"url":null,"name":"Roderic
+ Page"}],"image":null,"content_html":"<p><img src=\"https://lh3.googleusercontent.com/-A99btr6ERMs/Vl1Wvjp2OtI/AAAAAAAAEFI/7bKdRjNG5w0/ytNkVT2U.jpg?imgmax=800\"
  alt=\"YtNkVT2U\" title=\"ytNkVT2U.jpg\" border=\"0\" width=\"128\" height=\"128\"
  style=\"float:right;\" /> This is a guest post by <a href=\"https://about.me/TonyRees\">Tony
  Rees</a>.</p>\n\n<p>It would be difficult to encounter a scientist, or anyone
@@ -1054,261 +255,11 @@ http_interactions:
  a few personal views which would be out of scope in a \"neutral point of view\"
  article such as is required on Wikipedia; also to see if readers of this blog
  would like to contribute further to discussion of any of the concepts traversed
- herein.</p>","tags":["guest post","macroscope"],"language":"en","references":null},{"id":"https://doi.org/10.59350/2b1j9-qmw12","uuid":"37538c38-66e6-4ac4-ab5c-679684622ade","url":"https://iphylo.blogspot.com/2022/05/round-trip-from-identifiers-to.html","title":"Round
- trip from identifiers to citations and back again","summary":"Note to self
- (basically rewriting last year''s Finding citations of specimens). Bibliographic
- data supports going from identifier to citation string and back again, so
- we can do a \"round trip.\" 1. Given a DOI we can get structured data with
- a simple HTTP fetch, then use a tool such as citation.js to convert that data
- into a human-readable string in a variety of formats. Identifier ⟶ Structured
- data ⟶ Human readable string 10.7717/peerj-cs.214 HTTP with...","date_published":"2022-05-27T16:34:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
- (Roderic Page)"}],"image":null,"content_html":"<p>Note to self (basically
- rewriting last year''s <a href=\"https://iphylo.blogspot.com/2021/05/finding-citations-of-specimens.html\">Finding
- citations of specimens</a>).</p>\n\n<p>Bibliographic data supports going from
- identifier to citation string and back again, so we can do a \"round trip.\"</p>\n\n<h2>1.</h2>\n\n<p>Given
- a DOI we can get structured data with a simple HTTP fetch, then use a tool
- such as <a href=\"https://citation.js.org\">citation.js</a> to convert that
- data into a human-readable string in a variety of formats.</p>\n\n<table>\n<tr>\n<th>\nIdentifier\n</th>\n<th>\n⟶\n</th>\n<th>\nStructured
- data\n</th>\n<th>\n⟶\n</th>\n<th>\nHuman readable string\n</th>\n<tr>\n<tr>\n<td>\n10.7717/peerj-cs.214\n</td>\n<td>\nHTTP
- with content-negotiation\n</td>\n<td>\nCSL-JSON\n</td>\n<td>\nCSL templates\n</td>\n<td
- width=\"25%\">\nWillighagen, L. G. (2019). Citation.js: a format-independent,
- modular bibliography tool for the browser and command line. PeerJ Computer
- Science, 5, e214. https://doi.org/10.7717/peerj-cs.214\n</td>\n</tr>\n</table>\n\n<h2>2.</h2>\n\n<p>Going
- in the reverse direction (string to identifier) is a little more challenging.
- In the \"old days\" a typical strategy was to attempt to parse the citation
- string into structured data (see <a href=\"hhtps://anystyle.io\">AnyStyle</a>
- for a nice example of this), then we could extract a truple of (journal, volume,
- starting page) and use that to query CrossRef to find if there was an article
- with that tuple, which gave us the DOI.</p>\n\n<table>\n<tr>\n<th>\nIdentifier\n</th>\n<th>\n⟵\n</th>\n<th>\nStructured
- data\n</th>\n<th>\n⟵\n</th>\n<th>\nHuman readable string\n</th>\n<tr>\n<tr>\n<td>\n10.7717/peerj-cs.214\n</td>\n<td>\nOpenURL
- query\n</td>\n<td>\njournal, volume, start page\n</td>\n<td>\nCitation parser
- \n</td>\n<td width=\"25%\">\nWillighagen, L. G. (2019). Citation.js: a format-independent,
- modular bibliography tool for the browser and command line. PeerJ Computer
- Science, 5, e214. https://doi.org/10.7717/peerj-cs.214\n</td>\n</tr>\n</table>\n\n<h2>3.</h2>\n\n<p>Another
- strategy is to take all the citations strings for each DOI, index those in
- a search engine, then just use a simple search to find the best match to your
- citation string, and hence the DOI. This is what <a href=\"https://search.crossref.org\">https://search.crossref.org</a>
- does.</p>\n\n<table>\n<tr>\n<th>\nIdentifier\n</th>\n<th>\n⟵\n</th>\n<th>\nHuman
- readable string\n</th>\n<tr>\n<tr>\n<td>\n10.7717/peerj-cs.214\n</td>\n<td>\nsearch\n</td>\n<td
- width=\"50%\">\nWillighagen, L. G. (2019). Citation.js: a format-independent,
- modular bibliography tool for the browser and command line. PeerJ Computer
- Science, 5, e214. https://doi.org/10.7717/peerj-cs.214\n</td>\n</tr>\n</table>\n\n<p>At
- the moment my work on material citations (i.e., lists of specimens in taxonomic
- papers) is focussing on 1 (generating citations from specimen data in GBIF)
- and 2 (parsing citations into structured data).</p>","tags":["citation","GBIF","material
- examined","specimen codes"],"language":"en","references":null},{"id":"https://doi.org/10.59350/3s376-6bm21","uuid":"62e7b438-67a3-44ac-a66d-3f5c278c949e","url":"https://iphylo.blogspot.com/2022/02/deduplicating-bibliographic-data.html","title":"Deduplicating
- bibliographic data","summary":"There are several instances where I have a
- collection of references that I want to deduplicate and merge. For example,
- in Zootaxa has no impact factor I describe a dataset of the literature cited
- by articles in the journal Zootaxa. This data is available on Figshare (https://doi.org/10.6084/m9.figshare.c.5054372.v4),
- as is the equivalent dataset for Phytotaxa (https://doi.org/10.6084/m9.figshare.c.5525901.v1).
- Given that the same articles may be cited many times, these datasets have
- lots of...","date_published":"2022-02-03T15:09:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
- (Roderic Page)"}],"image":null,"content_html":"<p>There are several instances
- where I have a collection of references that I want to deduplicate and merge.
- For example, in <a href=\"https://iphylo.blogspot.com/2020/07/zootaxa-has-no-impact-factor.html\">Zootaxa
- has no impact factor</a> I describe a dataset of the literature cited by articles
- in the journal <i>Zootaxa</i>. This data is available on Figshare (<a href=\"https://doi.org/10.6084/m9.figshare.c.5054372.v4\">https://doi.org/10.6084/m9.figshare.c.5054372.v4</a>),
- as is the equivalent dataset for <i>Phytotaxa</i> (<a href=\"https://doi.org/10.6084/m9.figshare.c.5525901.v1\">https://doi.org/10.6084/m9.figshare.c.5525901.v1</a>).
- Given that the same articles may be cited many times, these datasets have
- lots of duplicates. Similarly, articles in <a href=\"https://species.wikimedia.org\">Wikispecies</a>
- often have extensive lists of references cited, and the same reference may
- appear on multiple pages (for an initial attempt to extract these references
- see <a href=\"https://doi.org/10.5281/zenodo.5801661\">https://doi.org/10.5281/zenodo.5801661</a>
- and <a href=\"https://github.com/rdmpage/wikispecies-parser\">https://github.com/rdmpage/wikispecies-parser</a>).</p>\n\n<p>There
- are several reasons I want to merge these references. If I want to build a
- citation graph for <i>Zootaxa</i> or <i>Phytotaxa</i> I need to merge references
- that are the same so that I can accurate count citations. I am also interested
- in harvesting the metadata to help find those articles in the <a href=\"https://www.biodiversitylibrary.org\">Biodiversity
- Heritage Library</a> (BHL), and the literature cited section of scientific
- articles is a potential goldmine of bibliographic metadata, as is Wikispecies.</p>\n\n<p>After
- various experiments and false starts I''ve created a repository <a href=\"https://github.com/rdmpage/bib-dedup\">https://github.com/rdmpage/bib-dedup</a>
- to host a series of PHP scripts to deduplicate bibliographics data. I''ve
- settled on using CSL-JSON as the format for bibliographic data. Because deduplication
- relies on comparing pairs of references, the standard format for most of the
- scripts is a JSON array containing a pair of CSL-JSON objects to compare.
- Below are the steps the code takes.</p>\n\n<h2>Generating pairs to compare</h2>\n\n<p>The
- first step is to take a list of references and generate the pairs that will
- be compared. I started with this approach as I wanted to explore machine learning
- and wanted a simple format for training data, such as an array of two CSL-JSON
- objects and an integer flag representing whether the two references were the
- same of different.</p>\n\n<p>There are various ways to generate CSL-JSON for
- a reference. I use a tool I wrote (see <a href=\"https://iphylo.blogspot.com/2021/07/citation-parsing-released.html\">Citation
- parsing tool released</a>) that has a simple API where you parse one or more
- references and it returns that reference as structured data in CSL-JSON.</p>\n\n<p>Attempting
- to do all possible pairwise comparisons rapidly gets impractical as the number
- of references increases, so we need some way to restrict the number of comparisons
- we make. One approach I''ve explored is the “sorted neighbourhood method”
- where we sort the references 9for example by their title) then move a sliding
- window down the list of references, comparing all references within that window.
- This greatly reduces the number of pairwise comparisons. So the first step
- is to sort the references, then run a sliding window over them, output all
- the pairs in each window (ignoring in pairwise comparisons already made in
- a previous window). Other methods of \"blocking\" could also be used, such
- as only including references in a particular year, or a particular journal.</p>\n\n<p>So,
- the output of this step is a set of JSON arrays, each with a pair of references
- in CSL-JSON format. Each array is stored on a single line in the same file
- in <a href=\"https://en.wikipedia.org/wiki/JSON_streaming#Line-delimited_JSON_2\">line-delimited
- JSON</a> (JSONL).</p>\n\n<h2>Comparing pairs</h2>\n\n<p>The next step is to
- compare each pair of references and decide whether they are a match or not.
- Initially I explored a machine learning approach used in the following paper:</p>\n\n<blockquote>\nWilson
- DR. 2011. Beyond probabilistic record linkage: Using neural networks and complex
- features to improve genealogical record linkage. In: The 2011 International
- Joint Conference on Neural Networks. 9–14. DOI: <a href=\"https://doi.org/10.1109/IJCNN.2011.6033192\">10.1109/IJCNN.2011.6033192</a>\n</blockquote>\n\n<p>Initial
- experiments using <a href=\"https://github.com/jtet/Perceptron\">https://github.com/jtet/Perceptron</a>
- were promising and I want to play with this further, but I deciding to skip
- this for now and just use simple string comparison. So for each CSL-JSON object
- I generate a citation string in the same format using CiteProc, then compute
- the <a href=\"https://en.wikipedia.org/wiki/Levenshtein_distance\">Levenshtein
- distance</a> between the two strings. By normalising this distance by the
- length of the two strings being compared I can use an arbitrary threshold
- to decide if the references are the same or not.</p>\n\n<h2>Clustering</h2>\n\n<p>For
- this step we read the JSONL file produced above and record whether the two
- references are a match or not. Assuming each reference has a unique identifier
- (needs only be unique within the file) then we can use those identifier to
- record the clusters each reference belongs to. I do this using a <a href=\"https://en.wikipedia.org/wiki/Disjoint-set_data_structure\">Disjoint-set
- data structure</a>. For each reference start with a graph where each node
- represents a reference, and each node has a pointer to a parent node. Initially
- the reference is its own parent. A simple implementation is to have an array
- index by reference identifiers and where the value of each cell in the array
- is the node''s parent.</p>\n\n<p>As we discover pairs we update the parents
- of the nodes to reflect this, such that once all the comparisons are done
- we have a one or more sets of clusters corresponding to the references that
- we think are the same. Another way to think of this is that we are getting
- the components of a graph where each node is a reference and pair of references
- that match are connected by an edge.</p>\n\n<p>In the code I''m using I write
- this graph in <a href=\"https://en.wikipedia.org/wiki/Trivial_Graph_Format\">Trivial
- Graph Format</a> (TGF) which can be visualised using a tools such as <a href=\"https://www.yworks.com/products/yed\">yEd</a>.</p>\n\n<h2>Merging</h2>\n\n<p>Now
- that we have a graph representing the sets of references that we think are
- the same we need to merge them. This is where things get interesting as the
- references are similar (by definition) but may differ in some details. The
- paper below describes a simple Bayesian approach for merging records:</p>\n\n<blockquote>\nCouncill
- IG, Li H, Zhuang Z, Debnath S, Bolelli L, Lee WC, Sivasubramaniam A, Giles
- CL. 2006. Learning Metadata from the Evidence in an On-line Citation Matching
- Scheme. In: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital
- Libraries. JCDL ’06. New York, NY, USA: ACM, 276–285. DOI: <a href=\"https://doi.org/10.1145/1141753.1141817\">10.1145/1141753.1141817</a>.\n</blockquote>\n\n<p>So
- the next step is to read the graph with the clusters, generate the sets of
- bibliographic references that correspond to each cluster, then use the method
- described in Councill et al. to produce a single bibliographic record for
- that cluster. These records could then be used to, say locate the corresponding
- article in BHL, or populate Wikidata with missing references.</p>\n\n<p>Obviously
- there is always the potential for errors, such as trying to merge references
- that are not the same. As a quick and dirty check I flag as dubious any cluster
- where the page numbers vary among members of the cluster. More sophisticated
- checks are possible, especially if I go down the ML route (i.e., I would have
- evidence for the probability that the same reference can disagree on some
- aspects of metadata).</p>\n\n<h2>Summary</h2>\n\n<p>At this stage the code
- is working well enough for me to play with and explore some example datasets.
- The focus is on structured bibliographic metadata, but I may simplify things
- and have a version that handles simple string matching, for example to cluster
- together different abbreviations of the same journal name.</p>","tags":["data
- cleaning","deduplication","Phytotaxa","Wikispecies","Zootaxa"],"language":"en","references":null},{"id":"https://doi.org/10.59350/ndtkv-6ve80","uuid":"e8e95aaf-bacb-4b5a-bf91-54e903526ab2","url":"https://iphylo.blogspot.com/2021/11/revisiting-rss-to-monitor-latests.html","title":"Revisiting
- RSS to monitor the latest taxonomic research","summary":"Over a decade ago
- RSS (RDF Site Summary or Really Simple Syndication) was attracting a lot of
- interest as a way to integrate data across various websites. Many science
- publishers would provide a list of their latest articles in XML in one of
- three flavours of RSS (RDF, RSS, Atom). This led to tools such as uBioRSS
- [1] and my own e-Biosphere Challenge: visualising biodiversity digitisation
- in real time. It was a time of enthusiasm for aggregating lots of data, such
- as the ill-fated PLoS...","date_published":"2021-11-23T20:53:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
- (Roderic Page)"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
- both;\"><a href=\"https://1.bp.blogspot.com/-dsij6_nhdsc/SgHYU5MCwsI/AAAAAAAAAe8/9KN6Gm87sj0PCuJG-crvZoMbL8MJvusegCPcBGAYYCw/s257/feedicon.png\"
- style=\"display: block; padding: 1em 0; text-align: center; clear: right;
- float: right;\"><img alt=\"\" border=\"0\" width=\"128\" data-original-height=\"257\"
- data-original-width=\"257\" src=\"https://1.bp.blogspot.com/-dsij6_nhdsc/SgHYU5MCwsI/AAAAAAAAAe8/9KN6Gm87sj0PCuJG-crvZoMbL8MJvusegCPcBGAYYCw/s200/feedicon.png\"/></a></div>\n<p>Over
- a decade ago <a href=\"https://en.wikipedia.org/wiki/RSS\">RSS</a> (RDF Site
- Summary or Really Simple Syndication) was attracting a lot of interest as
- a way to integrate data across various websites. Many science publishers would
- provide a list of their latest articles in XML in one of three flavours of
- RSS (RDF, RSS, Atom). This led to tools such as <a href=\"http://ubio.org/rss/\">uBioRSS</a>
- [<a href=\"#Leary2007\">1</a>] and my own <a href=\"https://iphylo.blogspot.com/2009/05/e-biosphere-challenge-visualising.html\">e-Biosphere
- Challenge: visualising biodiversity digitisation in real time</a>. It was
- a time of enthusiasm for aggregating lots of data, such as the <a href=\"https://iphylo.blogspot.com/2013/07/the-demise-of-plos-biodiversity-hub.html\">ill-fated</a>
- PLoS Biodiversity Hub [<a href=\"#Mindell2011\">2</a>].</p>\n\n<p>Since I
- seem to be condemned to revisit old ideas rather than come up with anything
- new, I''ve been looking at providing a tool like the now defunct uBioRSS.
- The idea is to harvest RSS feeds from journals (with an emphasis on taxonomic
- and systematic journals), aggregate the results, and make them browsable by
- taxon and geography. Here''s a sneak peak:</p>\n\n<div class=\"separator\"
- style=\"clear: both;\"><a href=\"https://1.bp.blogspot.com/-toYkBpS81tE/YZ1VY1DU1FI/AAAAAAAAg4E/yFMM4Xc3AEE8BjCj0jKO0sLtT9ZI-3k8ACLcBGAsYHQ/s1032/Screenshot%2B2021-11-23%2Bat%2B20.00.06.png\"
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
- border=\"0\" width=\"400\" data-original-height=\"952\" data-original-width=\"1032\"
- src=\"https://1.bp.blogspot.com/-toYkBpS81tE/YZ1VY1DU1FI/AAAAAAAAg4E/yFMM4Xc3AEE8BjCj0jKO0sLtT9ZI-3k8ACLcBGAsYHQ/s400/Screenshot%2B2021-11-23%2Bat%2B20.00.06.png\"/></a></div>\n\n<p>What
- seems like a straightforward task quickly became a bit of a challenge. Not
- all journals have RSS feeds (they seem to have become less widely supported
- over time) so I need to think of alternative ways to get lists of recent articles.
- These lists also need to be processed in various ways. There are three versions
- of RSS, each with their own idiosyncracies, so I need to standardise things
- like dates. I also want to augment them with things like DOIs (often missing
- from RSS feeds) and thumbnails for the articles (often available on publisher
- websites but not the feeds). Then I need to index the content by taxon and
- geography. For taxa I use a version of Patrick Leary''s \"taxonfinder\" (see
- <a href=\"https://right-frill.glitch.me\">https://right-frill.glitch.me</a>)
- to find names, then the <a href=\"https://index.globalnames.org\">Global Names
- Index</a> to assign names found to the GBIF taxonomic hierarchy.</p>\n\n<p>Indexing
- by geography proved harder. Typically <a href=\"https://en.wikipedia.org/wiki/Toponym_resolution#Geoparsing\">geoparsing</a>
- involves taking a body of text and doing the following:\n<ul><li>Using named-entity
- recognition <a href=\"https://en.wikipedia.org/wiki/Named-entity_recognition\">NER</a>
- to identity named entities in the text (e.g., place names, people names, etc.).</li>\n<li>Using
- a gazetteer of geographic names <a href=\"http://www.geonames.org\">GeoNames</a>
- to try and match the place names found by NER.</li>\n</ul></p>\n\n<p>An example
- of such a parser is the <a href=\"https://www.ltg.ed.ac.uk/software/geoparser/\">Edinburgh
- Geoparser</a>. Typically geoparsing software can be large and tricky to install,
- especially if you are looking to make your installation publicly accessible.
- Geoparsing services seem to have a short half-life (e.g., <a href=\"https://geoparser.io\">Geoparser.io</a>),
- perhaps because they are so useful they quickly get swamped by users.</p>\n\n<p>Bearing
- this in mind, the approach I’ve taken here is to create a very simple geoparser
- that is focussed on fairly large areas, especially those relevant to biodiversity,
- and is aimed at geoparsing text such as abstracts of scientific papers. I''ve
- created a small database of places by harvesting data from Wikidata, then
- I use the \"flash text\" algorithm [<a href=\"#Singh2017\">3</a>] to find
- geographic places. This approach uses a <a href=\"https://en.wikipedia.org/wiki/Trie\">trie</a>
- to store the place names. All I do is walk through the text seeing whether
- the current word matches a place name (or the start of one) in the trie, then
- moving on. This is very quick and seems to work quite well.</p>\n\n<p>Given
- that I need to aggregate data from a lot of sources, apply various transformations
- to that data, then merge it, there are a lot of moving parts. I started playing
- with a \"NoCode\" platform for creating workflows, in this case <a href=\"https://n8n.io\">n8n</a>
- (in many ways reminiscent of the now defunct <a href=\"https://en.wikipedia.org/wiki/Yahoo!_Pipes\">Yahoo
- Pipes</a>). This was quite fun for a while, but after lots of experimentation
- I moved back to writing code to aggregate the data into a CouchDB database.
- CouchDB is one of the NoSQL databases that I really like as it has a great
- interface, and makes queries very easy to do once you get your head around
- how it works.</p>\n\n<p>So the end result of this is \"BioRSS\" <a href=\"https://biorss.herokuapp.com\">https://biorss.herokuapp.com</a>.
- The interface comprises a stream of articles listed from newest to oldest,
- with a treemap and a geographic map on the left. You can use these to filter
- the articles by taxonomic group and/or country. For example the screen shot
- is showing arthropods from China (in this case from a month or two ago in
- the journal <i>ZooKeys</i>). As much fun as the interface has been to construct,
- in many ways I don''t really want to spend time making an interface. For each
- combination of taxon and country I provide a RSS feed so if you have a favour
- feed reader you can grab the feed and view it there. As BioRSS updates the
- data your feed reader should automatically update the feed. This means that
- you can have a feed that monitors, say, new papers on spiders in China.</p>\n\n<p>In
- the spirit of \"release early and release often\" this is an early version
- of this app. I need to add a lot more feeds, back date them to bring in older
- content, and I also want to make use of aggregators such as PubMed, CrossRef,
- and Google Scholar. The existence of these tools is, I suspect, one reason
- why RSS feeds are less common than they used to be.</p>\n\n<p>So, if this
- sounds useful please take it for a spin at <a href=\"https://biorss.herokuapp.com\">https://biorss.herokuapp.com</a>.
- Feedback is welcome, especially suggestions for journals to harvest and add
- to the news feed. Ultimately I''d like to have sufficient coverage of the
- taxonomic literature so that BioRSS becomes a place where we can go to find
- the latest papers on any taxon of interest.</p>\n\n<h2>References</h2>\n\n<blockquote>\n<a
- name=\"Leary2007\">1.</a> Patrick R. Leary, David P. Remsen, Catherine N.
- Norton, David J. Patterson, Indra Neil Sarkar, uBioRSS: Tracking taxonomic
- literature using RSS, Bioinformatics, Volume 23, Issue 11, June 2007, Pages
- 1434–1436, <a href=\"https://doi.org/10.1093/bioinformatics/btm109\">https://doi.org/10.1093/bioinformatics/btm109</a>\n</blockquote>\n\n<blockquote><a
- name=\"Mindell2011\">2.</a> Mindell, D. P., Fisher, B. L., Roopnarine, P.,
- Eisen, J., Mace, G. M., Page, R. D. M., & Pyle, R. L. (2011). Aggregating,
- Tagging and Integrating Biodiversity Research. PLoS ONE, 6(8), e19491. <a
- href=\"https://doi.org/10.1371/journal.pone.0019491\">doi:10.1371/journal.pone.0019491</a>\n</blockquote>\n\n<blockquote><a
- name=\"Singh2017\">3.</a> Singh, V. (2017). Replace or Retrieve Keywords In
- Documents at Scale. CoRR, abs/1711.00046. <a href=\"http://arxiv.org/abs/1711.00046\">http://arxiv.org/abs/1711.00046</a>\n\n</blockquote>","tags":["geocoding","NoCode","RSS"],"language":"en","references":[{"doi":"https://doi.org/10.1093/bioinformatics/btm109","key":"ref1"},{"doi":"https://doi.org/10.1371/journal.pone.0019491","key":"ref2"},{"key":"ref3","url":"http://arxiv.org/abs/1711.00046"}]},{"id":"https://doi.org/10.59350/gf1dw-n1v47","uuid":"a41163e0-9c9a-41e0-a141-f772663f2f32","url":"https://iphylo.blogspot.com/2023/03/dugald-stuart-page-1936-2022.html","title":"Dugald
+ herein.</p>","tags":["guest post","macroscope"],"language":"en","references":[]},{"id":"https://doi.org/10.59350/gf1dw-n1v47","uuid":"a41163e0-9c9a-41e0-a141-f772663f2f32","url":"https://iphylo.blogspot.com/2023/03/dugald-stuart-page-1936-2022.html","title":"Dugald
  Stuart Page 1936-2022","summary":"My dad died last weekend. Below is a notice
  in today''s New Zealand Herald. I''m in New Zealand for his funeral. Don''t
- really have the words for this right now.","date_published":"2023-03-14T03:00:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
- (Roderic Page)"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
+ really have the words for this right now.","date_published":"2023-03-14T03:00:00Z","date_modified":"2023-03-22T07:25:56Z","date_indexed":"1909-06-16T10:41:55+00:00","authors":[{"url":null,"name":"Roderic
+ Page"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
  both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZweukxntl7R5jnk3knVFVrqZ5RxC7mPZBV4gKeDIglbFzs2O442nbxqs8t8jV2tLqCU24K6gS32jW-Pe8q3O_5JR1Ms3qW1aQAZ877cKkFfcUydqUba9HsgNlX-zS9Ne92eLxRGS8F-lStTecJw2oalp3u58Yoc0oM7CUin5LKPeFIJ7Rzg/s3454/_DSC5106.jpg\"
  style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
  border=\"0\" width=\"400\" data-original-height=\"2582\" data-original-width=\"3454\"
@@ -1318,172 +269,7 @@ http_interactions:
  now.\n\n<div class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRUTOFF1VWHCl8dg3FQuaWy5LM7aX8IivdRpTtzgrdQTEymsA5bLTZE3cSQf1WQIP3XrC46JsLScP8BxTK9C5a-B1i51yg8WGSJD0heJVaoDLnerv0lD1o3qloDjqEuuyfX4wagHB5YYBmjWnGeVQvyYVngvDDf9eM6pmMtZ7x94Y4jSVrug/s3640/IMG_2870.jpeg\"
  style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
  border=\"0\" height=\"320\" data-original-height=\"3640\" data-original-width=\"1391\"
- src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRUTOFF1VWHCl8dg3FQuaWy5LM7aX8IivdRpTtzgrdQTEymsA5bLTZE3cSQf1WQIP3XrC46JsLScP8BxTK9C5a-B1i51yg8WGSJD0heJVaoDLnerv0lD1o3qloDjqEuuyfX4wagHB5YYBmjWnGeVQvyYVngvDDf9eM6pmMtZ7x94Y4jSVrug/s320/IMG_2870.jpeg\"/></a></div>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/c79vq-7rr11","uuid":"3cb94422-5506-4e24-a41c-a250bb521ee0","url":"https://iphylo.blogspot.com/2021/12/graphql-for-wikidata-wikicite.html","title":"GraphQL
- for WikiData (WikiCite)","summary":"I''ve released a very crude GraphQL endpoint
- for WikiData. More precisely, the endpoint is for a subset of the entities
- that are of interest to WikiCite, such as scholarly articles, people, and
- journals. There is a crude demo at https://wikicite-graphql.herokuapp.com.
- The endpoint itself is at https://wikicite-graphql.herokuapp.com/gql.php.
- There are various ways to interact with the endpoint, personally I like the
- Altair GraphQL Client by Samuel Imolorhe. As I''ve mentioned earlier it''s
- taken...","date_published":"2021-12-20T13:16:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
- (Roderic Page)"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
- both;\"><a href=\"https://blogger.googleusercontent.com/img/a/AVvXsEh7WW8TlfN2u3xe_E-sH5IK6AWYnAoWaKrP2b32UawUeqMpPlq6ZFk5BtJVZsMmCNh5j3QRsTj5H0Ee55RRGntnc9yjj_mNB8KHmH2dzocCgyLS2VFOxsBji6u4Ey6qxlDAnT-zrsBpnDcTchbhgt1x0Sf7RkmIMkS1y4-_3KCQian-SeIF-g=s1000\"
- style=\"display: block; padding: 1em 0; text-align: center; clear: right;
- float: right;\"><img alt=\"\" border=\"0\" width=\"128\" data-original-height=\"1000\"
- data-original-width=\"1000\" src=\"https://blogger.googleusercontent.com/img/a/AVvXsEh7WW8TlfN2u3xe_E-sH5IK6AWYnAoWaKrP2b32UawUeqMpPlq6ZFk5BtJVZsMmCNh5j3QRsTj5H0Ee55RRGntnc9yjj_mNB8KHmH2dzocCgyLS2VFOxsBji6u4Ey6qxlDAnT-zrsBpnDcTchbhgt1x0Sf7RkmIMkS1y4-_3KCQian-SeIF-g=s200\"/></a></div><p>I''ve
- released a very crude GraphQL endpoint for WikiData. More precisely, the endpoint
- is for a subset of the entities that are of interest to WikiCite, such as
- scholarly articles, people, and journals. There is a crude demo at <a href=\"https://wikicite-graphql.herokuapp.com\">https://wikicite-graphql.herokuapp.com</a>.
- The endpoint itself is at <a href=\"https://wikicite-graphql.herokuapp.com/gql.php\">https://wikicite-graphql.herokuapp.com/gql.php</a>.
- There are various ways to interact with the endpoint, personally I like the
- <a href=\"https://altair.sirmuel.design\">Altair GraphQL Client</a> by <a
- href=\"https://github.com/imolorhe\">Samuel Imolorhe</a>.</p>\n\n<p>As I''ve
- <a href=\"https://iphylo.blogspot.com/2021/04/it-been-while.html\">mentioned
- earlier</a> it''s taken me a while to see the point of GraphQL. But it is
- clear it is gaining traction in the biodiversity world (see for example the
- <a href=\"https://dev.gbif.org/hosted-portals.html\">GBIF Hosted Portals</a>)
- so it''s worth exploring. My take on GraphQL is that it is a way to create
- a self-describing API that someone developing a web site can use without them
- having to bury themselves in the gory details of how data is internally modelled.
- For example, WikiData''s query interface uses SPARQL, a powerful language
- that has a steep learning curve (in part because of the administrative overhead
- brought by RDF namespaces, etc.). In my previous SPARQL-based projects such
- as <a href=\"https://ozymandias-demo.herokuapp.com\">Ozymandias</a> and <a
- href=\"http://alec-demo.herokuapp.com\">ALEC</a> I have either returned SPARQL
- results directly (Ozymandias) or formatted SPARQL results as <a href=\"https://schema.org/DataFeed\">schema.org
- DataFeeds</a> (equivalent to RSS feeds) (ALEC). Both approaches work, but
- they are project-specific and if anyone else tried to build based on these
- projects they might struggle for figure out what was going on. I certainly
- struggle, and I wrote them!</p>\n\n<p>So it seems worthwhile to explore this
- approach a little further and see if I can develop a GraphQL interface that
- can be used to build the sort of rich apps that I want to see. The demo I''ve
- created uses SPARQL under the hood to provide responses to the GraphQL queries.
- So in this sense it''s not replacing SPARQL, it''s simply providing a (hopefully)
- simpler overlay on top of SPARQL so that we can retrieve the data we want
- without having to learn the intricacies of SPARQL, nor how Wikidata models
- publications and people.</p>","tags":["GraphQL","SPARQL","WikiCite","Wikidata"],"language":"en","references":null},{"id":"https://doi.org/10.59350/ymc6x-rx659","uuid":"0807f515-f31d-4e2c-9e6f-78c3a9668b9d","url":"https://iphylo.blogspot.com/2022/09/dna-barcoding-as-intergenerational.html","title":"DNA
- barcoding as intergenerational transfer of taxonomic knowledge","summary":"I
- tweeted about this but want to bookmark it for later as well. The paper “A
- molecular-based identification resource for the arthropods of Finland” doi:10.1111/1755-0998.13510
- contains the following: …the annotated barcode records assembled by FinBOL
- participants represent a tremendous intergenerational transfer of taxonomic
- knowledge … the time contributed by current taxonomists in identifying and
- contributing voucher specimens represents a great gift to future generations
- who will benefit...","date_published":"2022-09-14T10:12:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
- (Roderic Page)"}],"image":null,"content_html":"<p>I <a href=\"https://twitter.com/rdmpage/status/1569738844416638981?s=21&amp;t=9OVXuoUEwZtQt-Ldzlutfw\">tweeted
- about this</a> but want to bookmark it for later as well. The paper “A molecular-based
- identification resource for the arthropods of Finland” <a href=\"https://doi.org/10.1111/1755-0998.13510\">doi:10.1111/1755-0998.13510</a>
- contains the following:</p>\n<blockquote>\n<p>…the annotated barcode records
- assembled by FinBOL participants represent a tremendous <mark>intergenerational
- transfer of taxonomic knowledge</mark> … the time contributed by current taxonomists
- in identifying and contributing voucher specimens represents a great gift
- to future generations who will benefit from their expertise when they are
- no longer able to process new material.</p>\n</blockquote>\n<p>I think this
- is a very clever way to characterise the project. In an age of machine learning
- this may be commonest way to share knowledge , namely as expert-labelled training
- data used to build tools for others. Of course, this means the expertise itself
- may be lost, which has implications for updating the models if the data isn’t
- complete. But it speaks to Charles Godfrey’s theme of <a href=\"https://biostor.org/reference/250587\">“Taxonomy
- as information science”</a>.</p>\n<p>Note that the knowledge is also transformed
- in the sense that the underlying expertise of interpreting morphology, ecology,
- behaviour, genomics, and the past literature is not what is being passed on.
- Instead it is probabilities that a DNA sequence belongs to a particular taxon.</p>\n<p>This
- feels is different to, say iNaturalist, where there is a machine learning
- model to identify images. In that case, the model is built on something the
- community itself has created, and continues to create. Yes, the underlying
- idea is that same: “experts” have labelled the data, a model is trained, the
- model is used. But the benefits of the <a href=\"https://www.inaturalist.org\">iNaturalist</a>
- model are immediately applicable to the people whose data built the model.
- In the case of barcoding, because the technology itself is still not in the
- hands of many (relative to, say, digital imaging), the benefits are perhaps
- less tangible. Obviously researchers working with environmental DNA will find
- it very useful, but broader impact may await the arrival of citizen science
- DNA barcoding.</p>\n<p>The other consideration is whether the barcoding helps
- taxonomists. Is it to be used to help prioritise future work (“we are getting
- lots of unknown sequences in these taxa, lets do some taxonomy there”), or
- is it simply capturing the knowledge of a generation that won’t be replaced:</p>\n<blockquote>\n<p>The
- need to capture such knowledge is essential because there are, for example,
- no young Finnish taxonomists who can critically identify species in many key
- groups of ar- thropods (e.g., aphids, chewing lice, chalcid wasps, gall midges,
- most mite lineages).</p>\n</blockquote>\n<p>The cycle of collect data, test
- and refine model, collect more data, rinse and repeat that happens with iNaturalist
- creates a feedback loop. It’s not clear that a similar cycle exists for DNA
- barcoding.</p>\n<blockquote>\n<p>Written with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/enxas-arj18","uuid":"ab5a6e04-d55e-4901-8269-9eea65ce7178","url":"https://iphylo.blogspot.com/2022/08/can-we-use-citation-graph-to-measure.html","title":"Can
- we use the citation graph to measure the quality of a taxonomic database?","summary":"More
- arm-waving notes on taxonomic databases. I''ve started to add data to ChecklistBank
- and this has got me thinking about the issue of data quality. When you add
- data to ChecklistBank you are asked to give a measure of confidence based
- on the Catalogue of Life Checklist Confidence system of one - five stars:
- ★ - ★★★★★. I''m scepetical about the notion of confidence or \"trust\" when
- it is reduced to a star system (see also Can you trust EOL?). I could literally
- pick any number of stars, there''s...","date_published":"2022-08-24T14:33:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
- (Roderic Page)"}],"image":null,"content_html":"<p>More arm-waving notes on
- taxonomic databases. I''ve started to add data to <a href=\"https://www.checklistbank.org\">ChecklistBank</a>
- and this has got me thinking about the issue of data quality. When you add
- data to ChecklistBank you are asked to give a measure of confidence based
- on the <a href=\"https://www.catalogueoflife.org/about/glossary.html#checklist-confidence\">Catalogue
- of Life Checklist Confidence</a> system of one - five stars: ★ - ★★★★★. I''m
- scepetical about the notion of confidence or \"trust\" when it is reduced
- to a star system (see also <a href=\"https://iphylo.blogspot.com/2012/06/can-you-trust-eol.html\">Can
- you trust EOL?</a>). I could literally pick any number of stars, there''s
- no way to measure what number of stars is appropriate. This feeds into my
- biggest reservation about the <a href=\"https://www.catalogueoflife.org\">Catalogue
- of Life</a>, it''s almost entirely authority based, not evidence based. That
- is, rather than give us evidence for why a particular taxon is valid, we are
- (mostly) just given a list of taxa are asked to accept those as gospel, based
- on assertions by one or more authorities. I''m not necessarly doubting the
- knowledge of those making these lists, it''s just that I think we need to
- do better than \"these are the accepted taxa because I say so\" implict in
- the Catalogue of Life.\n</p>\n\n<p>So, is there any way we could objectively
- measure the quality of a particular taxonomic checklist? Since I have a long
- standing interest in link the primary taxonomic litertaure to names in databases
- (since that''s where the evidence is), I keep wondering whether measures based
- on that literture could be developed. \n</p>\n<p>\nI recently revisited the
- fascinating (and quite old) literature on rates of synonymy:\n</p>\n<blockquote>\nGaston
- Kevin J. and Mound Laurence A. 1993 Taxonomy, hypothesis testing and the
- biodiversity crisisProc. R. Soc. Lond. B.251139–142\n<a href=\"http://doi.org/10.1098/rspb.1993.0020\">http://doi.org/10.1098/rspb.1993.0020</a>\n</blockquote>\n \n<blockquote>\n Andrew
- R. Solow, Laurence A. Mound, Kevin J. Gaston, Estimating the Rate of Synonymy,
- Systematic Biology, Volume 44, Issue 1, March 1995, Pages 93–96, <a href=\"https://doi.org/10.1093/sysbio/44.1.93\">https://doi.org/10.1093/sysbio/44.1.93</a>\n</blockquote>\n\n</p>\n\n<p>\nA
- key point these papers make is that the observed rate of synonymy is quite
- high (that is, many \"new species\" end up being merged with already known
- species), and that because it can take time to discover that a species is
- a synonym the actual rate may be even higher. In other words, in diagrams
- like the one reproduced below, the reason the proportion of synonyms declines
- the nearer we get to the present day (this paper came out in 1995) is not
- because are are creating fewer synonyms but because we''ve not yet had time
- to do the work to uncover the remaining synonyms.\n</p>\n \n<div class=\"separator\"
- style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhgwvBcDUD_IJZyAa0gaqXB--ogJeVooV19I635N_w2OuztDFsy-ZPXEFuW5S1NUxxUEHmN8pbdO9MPljIi5v0A2355kYotL1fKqdewP9kmWu9TfwtIJ3jD04SQjeF3SWK-yMVAx-rNc6TO43GIwmftPk6IghOrcmur6SoHe06ws7fEFAvxA/s621/Screenshot%202022-08-24%20at%2014.59.47.png\"
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
- border=\"0\" width=\"400\" data-original-height=\"404\" data-original-width=\"621\"
- src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhgwvBcDUD_IJZyAa0gaqXB--ogJeVooV19I635N_w2OuztDFsy-ZPXEFuW5S1NUxxUEHmN8pbdO9MPljIi5v0A2355kYotL1fKqdewP9kmWu9TfwtIJ3jD04SQjeF3SWK-yMVAx-rNc6TO43GIwmftPk6IghOrcmur6SoHe06ws7fEFAvxA/s400/Screenshot%202022-08-24%20at%2014.59.47.png\"/></a></div>\n\n<p>Put
- another way, these papers are arguing that real work of taxonomy is revision,
- not species discovery, especially since it''s not uncommon for > 50% of species
- in a taxon to end up being synonymised. Indeed, if a taxoomic group has few
- synonyms then these authors would argue that''s a sign of neglect. More revisionary
- work would likely uncover additional synonyms. So, what we need is a way to
- measure the amount of research on a taxonomic group. It occurs to me that
- we could use the citation graph as a way to tackle this. Lets imagine we have
- a set of taxa (say a family) and we have all the papers that described new
- species or undertook revisions (or both). The extensiveness of that work could
- be measured by the citation graph. For example, build the citation graph for
- those papers. How many original species decsriptions are not cited? Those
- species have been potentially neglected. How many large-scale revisions have
- there been (as measured by the numbers of taxonomic papers those revisions
- cite)? There are some interesting approaches to quantifying this, such as
- using <a href=\"https://en.wikipedia.org/wiki/HITS_algorithm\">hubs and authorities</a>.</p>\n \n \n<p>I''m
- aware that taxonomists have not had the happiest relationship with citations:\n \n<blockquote>\nPinto
- ÂP, Mejdalani G, Mounce R, Silveira LF, Marinoni L, Rafael JA. Are publications
- on zoological taxonomy under attack? R Soc Open Sci. 2021 Feb 10;8(2):201617.
- <a href=\"https://doi.org/10.1098/rsos.201617\">doi: 10.1098/rsos.201617</a>.
- PMID: 33972859; PMCID: PMC8074659.\n</blockquote>\n\nStill, I think there
- is an intriguing possibility here. For this approach to work, we need to have
- linked taxonomic names to publications, and have citation data for those publications.
- This is happening on various platforms. Wikidata, for example, is becoming
- a repository of the taxonomic literature, some of it with citation links.\n\n<blockquote>\nPage
- RDM. 2022. Wikidata and the bibliography of life. PeerJ 10:e13712 <a href=\"https://doi.org/10.7717/peerj.13712\">https://doi.org/10.7717/peerj.13712</a>\n</blockquote>\n\nTime
- for some experiments.\n</p>","tags":["Bibliography of Life","citation","synonymy","taxonomic
- databases"],"language":"en","references":null},{"id":"https://doi.org/10.59350/cbzgz-p8428","uuid":"a93134aa-8b33-4dc7-8cd4-76cdf64732f4","url":"https://iphylo.blogspot.com/2023/04/library-interfaces-knowledge-graphs-and.html","title":"Library
+ src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRUTOFF1VWHCl8dg3FQuaWy5LM7aX8IivdRpTtzgrdQTEymsA5bLTZE3cSQf1WQIP3XrC46JsLScP8BxTK9C5a-B1i51yg8WGSJD0heJVaoDLnerv0lD1o3qloDjqEuuyfX4wagHB5YYBmjWnGeVQvyYVngvDDf9eM6pmMtZ7x94Y4jSVrug/s320/IMG_2870.jpeg\"/></a></div>","tags":[],"language":"en","references":[]},{"id":"https://doi.org/10.59350/cbzgz-p8428","uuid":"a93134aa-8b33-4dc7-8cd4-76cdf64732f4","url":"https://iphylo.blogspot.com/2023/04/library-interfaces-knowledge-graphs-and.html","title":"Library
  interfaces, knowledge graphs, and Miller columns","summary":"Some quick notes
  on interface ideas for digital libraries and/or knowledge graphs. Recently
  there’s been something of an explosion in bibliographic tools to explore the
@@ -1491,10 +277,10 @@ http_interactions:
  papers _scite which uses AI to do sentiment analysis on citations (does paper
  A cite paper B favourably or not?) ResearchRabbit which uses lists, networks,
  and timelines to discover related research Scispace which navigates connections
- between...","date_published":"2023-04-25T13:01:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
- (Roderic Page)"}],"image":null,"content_html":"<p>Some quick notes on interface
- ideas for digital libraries and/or knowledge graphs.</p>\n<p>Recently there’s
- been something of an explosion in bibliographic tools to explore the literature.
+ between...","date_published":"2023-04-25T13:01:00Z","date_modified":"2023-04-27T14:51:08Z","date_indexed":"1909-06-16T11:25:14+00:00","authors":[{"url":null,"name":"Roderic
+ Page"}],"image":null,"content_html":"<p>Some quick notes on interface ideas
+ for digital libraries and/or knowledge graphs.</p>\n<p>Recently there’s been
+ something of an explosion in bibliographic tools to explore the literature.
  Examples include:</p>\n<ul>\n<li><a href=\"https://elicit.org\">Elicit</a>
  which uses AI to search for and summarise papers</li>\n<li><a href=\"https://scite.ai\">_scite</a>
  which uses AI to do sentiment analysis on citations (does paper A cite paper
@@ -1576,15 +362,15 @@ http_interactions:
  in other words, how deep down the rabbit hole will be go?</p>\n\n<h3>Update</h3>\n<p>Should
  add link to David Regev''s explorations of <a href=\"https://medium.com/david-regev-on-ux/flow-browser-b730daf0f717\">Flow
  Browser</a>.\n\n<blockquote>\n<p>Written with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":["cards","flow","Knowledge
- Graph","Miller column","RabbitResearch"],"language":"en","references":null},{"id":"https://doi.org/10.59350/t6fb9-4fn44","uuid":"8bc3fea6-cb86-4344-8dad-f312fbf58041","url":"https://iphylo.blogspot.com/2021/12/the-business-of-extracting-knowledge.html","title":"The
+ Graph","Miller column","RabbitResearch"],"language":"en","references":[]},{"id":"https://doi.org/10.59350/t6fb9-4fn44","uuid":"8bc3fea6-cb86-4344-8dad-f312fbf58041","url":"https://iphylo.blogspot.com/2021/12/the-business-of-extracting-knowledge.html","title":"The
  Business of Extracting Knowledge from Academic Publications","summary":"Markus
  Strasser (@mkstra write a fascinating article entitled \"The Business of Extracting
  Knowledge from Academic Publications\". I spent months working on domain-specific
  search engines and knowledge discovery apps for biomedicine and eventually
- figured that synthesizing &quot;insights&quot; or building knowledge graphs
- by machine-reading the academic literature (papers) is *barely useful* :https://t.co/eciOg30Odc&mdash;
- Markus Strasser (@mkstra) December 7, 2021 His TL;DR: TL;DR: I...","date_published":"2021-12-11T00:01:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
- (Roderic Page)"}],"image":null,"content_html":"<p>Markus Strasser (<a href=\"https://twitter.com/mkstra\">@mkstra</a>
+ figured that synthesizing \"insights\" or building knowledge graphs by machine-reading
+ the academic literature (papers) is *barely useful* :https://t.co/eciOg30Odc
+ Markus Strasser (@mkstra) December 7, 2021 His TL;DR: TL;DR: I worked on biomedical...","date_published":"2021-12-11T00:01:00Z","date_modified":"2021-12-11T00:01:21Z","date_indexed":"1909-06-16T11:32:09+00:00","authors":[{"url":null,"name":"Roderic
+ Page"}],"image":null,"content_html":"<p>Markus Strasser (<a href=\"https://twitter.com/mkstra\">@mkstra</a>
  write a fascinating article entitled <a href=\"https://markusstrasser.org/extracting-knowledge-from-literature/\">\"The
  Business of Extracting Knowledge from Academic Publications\"</a>.</p>\n\n<blockquote
  class=\"twitter-tweet\"><p lang=\"en\" dir=\"ltr\">I spent months working
@@ -1611,18 +397,18 @@ http_interactions:
  it will go anywhere.</p>\n\n<p>Don’t take that as a challenge. Take it as
  a red flag and run. Run towards better problems.\n</blockquote></p>\n\n<p>Well
  worth a read, and much food for thought.</p>","tags":["ai","business model","text
- mining"],"language":"en","references":null},{"id":"https://doi.org/10.59350/463yw-pbj26","uuid":"dc829ab3-f0f1-40a4-b16d-a36dc0e34166","url":"https://iphylo.blogspot.com/2022/12/david-remsen.html","title":"David
+ mining"],"language":"en","references":[]},{"id":"https://doi.org/10.59350/463yw-pbj26","uuid":"dc829ab3-f0f1-40a4-b16d-a36dc0e34166","url":"https://iphylo.blogspot.com/2022/12/david-remsen.html","title":"David
  Remsen","summary":"I heard yesterday from Martin Kalfatovic (BHL) that David
  Remsen has died. Very sad news. It''s starting to feel like iPhylo might end
  up being a list of obituaries of people working on biodiversity informatics
  (e.g., Scott Federhen). I spent several happy visits at MBL at Woods Hole
  talking to Dave at the height of the uBio project, which really kickstarted
  large scale indexing of taxonomic names, and the use of taxonomic name finding
- tools to index the literature. His work on uBio with David...","date_published":"2022-12-16T17:54:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
- (Roderic Page)"}],"image":null,"content_html":"<p>I heard yesterday from Martin
- Kalfatovic (BHL) that David Remsen has died. Very sad news. It''s starting
- to feel like iPhylo might end up being a list of obituaries of people working
- on biodiversity informatics (e.g., <a href=\"https://iphylo.blogspot.com/2016/05/scott-federhen-rip.html\">Scott
+ tools to index the literature. His work on uBio with David...","date_published":"2022-12-16T17:54:00Z","date_modified":"2022-12-17T08:12:23Z","date_indexed":"1909-06-16T11:41:39+00:00","authors":[{"url":null,"name":"Roderic
+ Page"}],"image":null,"content_html":"<p>I heard yesterday from Martin Kalfatovic
+ (BHL) that David Remsen has died. Very sad news. It''s starting to feel like
+ iPhylo might end up being a list of obituaries of people working on biodiversity
+ informatics (e.g., <a href=\"https://iphylo.blogspot.com/2016/05/scott-federhen-rip.html\">Scott
  Federhen</a>).</p>\n\n<p>I spent several happy visits at MBL at Woods Hole
  talking to Dave at the height of the uBio project, which really kickstarted
  large scale indexing of taxonomic names, and the use of taxonomic name finding
@@ -1643,149 +429,156 @@ http_interactions:
  David P. Remsen, Catherine N. Norton, David J. Patterson, Indra Neil Sarkar,
  uBioRSS: Tracking taxonomic literature using RSS, Bioinformatics, Volume 23,
  Issue 11, June 2007, Pages 1434–1436, https://doi.org/10.1093/bioinformatics/btm109</p>","tags":["David
- Remsen","obituary","uBio"],"language":"en","references":null},{"id":"https://doi.org/10.59350/pmhat-5ky65","uuid":"5891c709-d139-440f-bacb-06244424587a","url":"https://iphylo.blogspot.com/2021/10/problems-with-plazi-parsing-how.html","title":"Problems
- with Plazi parsing: how reliable are automated methods for extracting specimens
- from the literature?","summary":"The Plazi project has become one of the major
- contributors to GBIF with some 36,000 datasets yielding some 500,000 occurrences
- (see Plazi''s GBIF page for details). These occurrences are extracted from
- taxonomic publication using automated methods. New data is published almost
- daily (see latest treatments). The map below shows the geographic distribution
- of material citations provided to GBIF by Plazi, which gives you a sense of
- the size of the dataset. By any metric Plazi represents a...","date_published":"2021-10-25T11:10:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
- (Roderic Page)"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
- both;\"><a href=\"https://1.bp.blogspot.com/-oiqSkA53FI4/YXaRAUpRgFI/AAAAAAAAgzY/PbPEOh_VlhIE0aaqqhVQPLAOE5pRULwJACLcBGAsYHQ/s240/Rf7UoXTw_400x400.jpg\"
+ Remsen","obituary","uBio"],"language":"en","references":[]},{"id":"https://doi.org/10.59350/3s376-6bm21","uuid":"62e7b438-67a3-44ac-a66d-3f5c278c949e","url":"https://iphylo.blogspot.com/2022/02/deduplicating-bibliographic-data.html","title":"Deduplicating
+ bibliographic data","summary":"There are several instances where I have a
+ collection of references that I want to deduplicate and merge. For example,
+ in Zootaxa has no impact factor I describe a dataset of the literature cited
+ by articles in the journal Zootaxa. This data is available on Figshare (https://doi.org/10.6084/m9.figshare.c.5054372.v4),
+ as is the equivalent dataset for Phytotaxa (https://doi.org/10.6084/m9.figshare.c.5525901.v1).
+ Given that the same articles may be cited many times, these datasets have
+ lots of...","date_published":"2022-02-03T15:09:00Z","date_modified":"2022-02-03T15:11:29Z","date_indexed":"1909-06-16T10:22:30+00:00","authors":[{"url":null,"name":"Roderic
+ Page"}],"image":null,"content_html":"<p>There are several instances where
+ I have a collection of references that I want to deduplicate and merge. For
+ example, in <a href=\"https://iphylo.blogspot.com/2020/07/zootaxa-has-no-impact-factor.html\">Zootaxa
+ has no impact factor</a> I describe a dataset of the literature cited by articles
+ in the journal <i>Zootaxa</i>. This data is available on Figshare (<a href=\"https://doi.org/10.6084/m9.figshare.c.5054372.v4\">https://doi.org/10.6084/m9.figshare.c.5054372.v4</a>),
+ as is the equivalent dataset for <i>Phytotaxa</i> (<a href=\"https://doi.org/10.6084/m9.figshare.c.5525901.v1\">https://doi.org/10.6084/m9.figshare.c.5525901.v1</a>).
+ Given that the same articles may be cited many times, these datasets have
+ lots of duplicates. Similarly, articles in <a href=\"https://species.wikimedia.org\">Wikispecies</a>
+ often have extensive lists of references cited, and the same reference may
+ appear on multiple pages (for an initial attempt to extract these references
+ see <a href=\"https://doi.org/10.5281/zenodo.5801661\">https://doi.org/10.5281/zenodo.5801661</a>
+ and <a href=\"https://github.com/rdmpage/wikispecies-parser\">https://github.com/rdmpage/wikispecies-parser</a>).</p>\n\n<p>There
+ are several reasons I want to merge these references. If I want to build a
+ citation graph for <i>Zootaxa</i> or <i>Phytotaxa</i> I need to merge references
+ that are the same so that I can accurate count citations. I am also interested
+ in harvesting the metadata to help find those articles in the <a href=\"https://www.biodiversitylibrary.org\">Biodiversity
+ Heritage Library</a> (BHL), and the literature cited section of scientific
+ articles is a potential goldmine of bibliographic metadata, as is Wikispecies.</p>\n\n<p>After
+ various experiments and false starts I''ve created a repository <a href=\"https://github.com/rdmpage/bib-dedup\">https://github.com/rdmpage/bib-dedup</a>
+ to host a series of PHP scripts to deduplicate bibliographics data. I''ve
+ settled on using CSL-JSON as the format for bibliographic data. Because deduplication
+ relies on comparing pairs of references, the standard format for most of the
+ scripts is a JSON array containing a pair of CSL-JSON objects to compare.
+ Below are the steps the code takes.</p>\n\n<h2>Generating pairs to compare</h2>\n\n<p>The
+ first step is to take a list of references and generate the pairs that will
+ be compared. I started with this approach as I wanted to explore machine learning
+ and wanted a simple format for training data, such as an array of two CSL-JSON
+ objects and an integer flag representing whether the two references were the
+ same of different.</p>\n\n<p>There are various ways to generate CSL-JSON for
+ a reference. I use a tool I wrote (see <a href=\"https://iphylo.blogspot.com/2021/07/citation-parsing-released.html\">Citation
+ parsing tool released</a>) that has a simple API where you parse one or more
+ references and it returns that reference as structured data in CSL-JSON.</p>\n\n<p>Attempting
+ to do all possible pairwise comparisons rapidly gets impractical as the number
+ of references increases, so we need some way to restrict the number of comparisons
+ we make. One approach I''ve explored is the “sorted neighbourhood method”
+ where we sort the references 9for example by their title) then move a sliding
+ window down the list of references, comparing all references within that window.
+ This greatly reduces the number of pairwise comparisons. So the first step
+ is to sort the references, then run a sliding window over them, output all
+ the pairs in each window (ignoring in pairwise comparisons already made in
+ a previous window). Other methods of \"blocking\" could also be used, such
+ as only including references in a particular year, or a particular journal.</p>\n\n<p>So,
+ the output of this step is a set of JSON arrays, each with a pair of references
+ in CSL-JSON format. Each array is stored on a single line in the same file
+ in <a href=\"https://en.wikipedia.org/wiki/JSON_streaming#Line-delimited_JSON_2\">line-delimited
+ JSON</a> (JSONL).</p>\n\n<h2>Comparing pairs</h2>\n\n<p>The next step is to
+ compare each pair of references and decide whether they are a match or not.
+ Initially I explored a machine learning approach used in the following paper:</p>\n\n<blockquote>\nWilson
+ DR. 2011. Beyond probabilistic record linkage: Using neural networks and complex
+ features to improve genealogical record linkage. In: The 2011 International
+ Joint Conference on Neural Networks. 9–14. DOI: <a href=\"https://doi.org/10.1109/IJCNN.2011.6033192\">10.1109/IJCNN.2011.6033192</a>\n</blockquote>\n\n<p>Initial
+ experiments using <a href=\"https://github.com/jtet/Perceptron\">https://github.com/jtet/Perceptron</a>
+ were promising and I want to play with this further, but I deciding to skip
+ this for now and just use simple string comparison. So for each CSL-JSON object
+ I generate a citation string in the same format using CiteProc, then compute
+ the <a href=\"https://en.wikipedia.org/wiki/Levenshtein_distance\">Levenshtein
+ distance</a> between the two strings. By normalising this distance by the
+ length of the two strings being compared I can use an arbitrary threshold
+ to decide if the references are the same or not.</p>\n\n<h2>Clustering</h2>\n\n<p>For
+ this step we read the JSONL file produced above and record whether the two
+ references are a match or not. Assuming each reference has a unique identifier
+ (needs only be unique within the file) then we can use those identifier to
+ record the clusters each reference belongs to. I do this using a <a href=\"https://en.wikipedia.org/wiki/Disjoint-set_data_structure\">Disjoint-set
+ data structure</a>. For each reference start with a graph where each node
+ represents a reference, and each node has a pointer to a parent node. Initially
+ the reference is its own parent. A simple implementation is to have an array
+ index by reference identifiers and where the value of each cell in the array
+ is the node''s parent.</p>\n\n<p>As we discover pairs we update the parents
+ of the nodes to reflect this, such that once all the comparisons are done
+ we have a one or more sets of clusters corresponding to the references that
+ we think are the same. Another way to think of this is that we are getting
+ the components of a graph where each node is a reference and pair of references
+ that match are connected by an edge.</p>\n\n<p>In the code I''m using I write
+ this graph in <a href=\"https://en.wikipedia.org/wiki/Trivial_Graph_Format\">Trivial
+ Graph Format</a> (TGF) which can be visualised using a tools such as <a href=\"https://www.yworks.com/products/yed\">yEd</a>.</p>\n\n<h2>Merging</h2>\n\n<p>Now
+ that we have a graph representing the sets of references that we think are
+ the same we need to merge them. This is where things get interesting as the
+ references are similar (by definition) but may differ in some details. The
+ paper below describes a simple Bayesian approach for merging records:</p>\n\n<blockquote>\nCouncill
+ IG, Li H, Zhuang Z, Debnath S, Bolelli L, Lee WC, Sivasubramaniam A, Giles
+ CL. 2006. Learning Metadata from the Evidence in an On-line Citation Matching
+ Scheme. In: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital
+ Libraries. JCDL ’06. New York, NY, USA: ACM, 276–285. DOI: <a href=\"https://doi.org/10.1145/1141753.1141817\">10.1145/1141753.1141817</a>.\n</blockquote>\n\n<p>So
+ the next step is to read the graph with the clusters, generate the sets of
+ bibliographic references that correspond to each cluster, then use the method
+ described in Councill et al. to produce a single bibliographic record for
+ that cluster. These records could then be used to, say locate the corresponding
+ article in BHL, or populate Wikidata with missing references.</p>\n\n<p>Obviously
+ there is always the potential for errors, such as trying to merge references
+ that are not the same. As a quick and dirty check I flag as dubious any cluster
+ where the page numbers vary among members of the cluster. More sophisticated
+ checks are possible, especially if I go down the ML route (i.e., I would have
+ evidence for the probability that the same reference can disagree on some
+ aspects of metadata).</p>\n\n<h2>Summary</h2>\n\n<p>At this stage the code
+ is working well enough for me to play with and explore some example datasets.
+ The focus is on structured bibliographic metadata, but I may simplify things
+ and have a version that handles simple string matching, for example to cluster
+ together different abbreviations of the same journal name.</p>","tags":["data
+ cleaning","deduplication","Phytotaxa","Wikispecies","Zootaxa"],"language":"en","references":[]},{"id":"https://doi.org/10.59350/c79vq-7rr11","uuid":"3cb94422-5506-4e24-a41c-a250bb521ee0","url":"https://iphylo.blogspot.com/2021/12/graphql-for-wikidata-wikicite.html","title":"GraphQL
+ for WikiData (WikiCite)","summary":"I''ve released a very crude GraphQL endpoint
+ for WikiData. More precisely, the endpoint is for a subset of the entities
+ that are of interest to WikiCite, such as scholarly articles, people, and
+ journals. There is a crude demo at https://wikicite-graphql.herokuapp.com.
+ The endpoint itself is at https://wikicite-graphql.herokuapp.com/gql.php.
+ There are various ways to interact with the endpoint, personally I like the
+ Altair GraphQL Client by Samuel Imolorhe. As I''ve mentioned earlier it''s
+ taken...","date_published":"2021-12-20T13:16:00Z","date_modified":"2021-12-20T13:20:05Z","date_indexed":"1909-06-16T10:52:00+00:00","authors":[{"url":null,"name":"Roderic
+ Page"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
+ both;\"><a href=\"https://blogger.googleusercontent.com/img/a/AVvXsEh7WW8TlfN2u3xe_E-sH5IK6AWYnAoWaKrP2b32UawUeqMpPlq6ZFk5BtJVZsMmCNh5j3QRsTj5H0Ee55RRGntnc9yjj_mNB8KHmH2dzocCgyLS2VFOxsBji6u4Ey6qxlDAnT-zrsBpnDcTchbhgt1x0Sf7RkmIMkS1y4-_3KCQian-SeIF-g=s1000\"
  style=\"display: block; padding: 1em 0; text-align: center; clear: right;
- float: right;\"><img alt=\"\" border=\"0\" width=\"128\" data-original-height=\"240\"
- data-original-width=\"240\" src=\"https://1.bp.blogspot.com/-oiqSkA53FI4/YXaRAUpRgFI/AAAAAAAAgzY/PbPEOh_VlhIE0aaqqhVQPLAOE5pRULwJACLcBGAsYHQ/s200/Rf7UoXTw_400x400.jpg\"/></a></div><p>The
- <a href=\"http://plazi.org\">Plazi</a> project has become one of the major
1661
- contributors to GBIF with some 36,000 datasets yielding some 500,000 occurrences
1662
- (see <a href=\"https://www.gbif.org/publisher/7ce8aef0-9e92-11dc-8738-b8a03c50a862\">Plazi''s
1663
- GBIF page</a> for details). These occurrences are extracted from taxonomic
1664
- publication using automated methods. New data is published almost daily (see
1665
- <a href=\"https://tb.plazi.org/GgServer/static/newToday.html\">latest treatments</a>).
1666
- The map below shows the geographic distribution of material citations provided
1667
- to GBIF by Plazi, which gives you a sense of the size of the dataset.</p>\n\n<div
1668
- class=\"separator\" style=\"clear: both;\"><a href=\"https://1.bp.blogspot.com/-DCJ4HR8eej8/YXaRQnz22bI/AAAAAAAAgz4/AgRcree6jVgjtQL2ch7IXgtb_Xtx7fkngCPcBGAYYCw/s1030/Screenshot%2B2021-10-24%2Bat%2B20.43.23.png\"
1669
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
1670
- border=\"0\" width=\"400\" data-original-height=\"514\" data-original-width=\"1030\"
1671
- src=\"https://1.bp.blogspot.com/-DCJ4HR8eej8/YXaRQnz22bI/AAAAAAAAgz4/AgRcree6jVgjtQL2ch7IXgtb_Xtx7fkngCPcBGAYYCw/s400/Screenshot%2B2021-10-24%2Bat%2B20.43.23.png\"/></a></div>\n\n<p>By
1672
- any metric Plazi represents a considerable achievement. But often when I browse
1673
- individual records on Plazi I find records that seem clearly incorrect. Text
1674
- mining the literature is a challenging problem, but at the moment Plazi seems
1675
- something of a \"black box\". PDFs go in, the content is mined, and data comes
1676
- up to be displayed on the Plazi web site and uploaded to GBIF. Nowhere does
1677
- there seem to be an evaluation of how accurate this text mining actually is.
1678
- Anecdotally it seems to work well in some cases, but in others it produces
1679
- what can only be described as bogus records.</p>\n\n<h2>Finding errors</h2>\n\n<p>A
1680
- treatment in Plazi is a block of text (and sometimes illustrations) that refers
1681
- to a single taxon. Often that text will include a description of the taxon,
1682
- and list one or more specimens that have been examined. These lists of specimens
1683
- (\"material citations\") are one of the key bits of information that Plaza
1684
- extracts from a treatment as these citations get fed into GBIF as occurrences.</p>\n\n<p>To
1685
- help explore treatments I''ve constructed a simple web site that takes the
1686
- Plazi identifier for a treatment and displays that treatment with the material
1687
- citations highlighted. For example, for the Plazi treatment <a href=\"https://tb.plazi.org/GgServer/html/03B5A943FFBB6F02FE27EC94FABEEAE7\">03B5A943FFBB6F02FE27EC94FABEEAE7</a>
1688
- you can view the marked up version at <a href=\"https://plazi-tester.herokuapp.com/?uri=622F7788-F0A4-449D-814A-5B49CD20B228\">https://plazi-tester.herokuapp.com/?uri=622F7788-F0A4-449D-814A-5B49CD20B228</a>.
1689
- Below is an example of a material citation with its component parts tagged:</p>\n\n<div
- class=\"separator\" style=\"clear: both;\"><a href=\"https://1.bp.blogspot.com/-NIGuo9BggdA/YXaRQQrv0QI/AAAAAAAAgz4/SZDcA1jZSN47JMRTWDwSMRpHUShrCeOdgCPcBGAYYCw/s693/Screenshot%2B2021-10-24%2Bat%2B20.59.56.png\"
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
- border=\"0\" width=\"400\" data-original-height=\"94\" data-original-width=\"693\"
- src=\"https://1.bp.blogspot.com/-NIGuo9BggdA/YXaRQQrv0QI/AAAAAAAAgz4/SZDcA1jZSN47JMRTWDwSMRpHUShrCeOdgCPcBGAYYCw/s400/Screenshot%2B2021-10-24%2Bat%2B20.59.56.png\"/></a></div>\n\n<p>This
- is an example where Plazi has successfully parsed the specimen. But I keep
- coming across cases where specimens have not been parsed correctly, resulting
- in issues such as single specimens being split into multiple records (e.g., <a
- href=\"https://plazi-tester.herokuapp.com/?uri=5244B05EFFC8E20F7BC32056C178F496\">https://plazi-tester.herokuapp.com/?uri=5244B05EFFC8E20F7BC32056C178F496</a>),
- geographical coordinates being misinterpreted (e.g., <a href=\"https://plazi-tester.herokuapp.com/?uri=0D228E6AFFC2FFEFFF4DE8118C4EE6B9\">https://plazi-tester.herokuapp.com/?uri=0D228E6AFFC2FFEFFF4DE8118C4EE6B9</a>),
- or collector''s initials being confused with codes for natural history collections
- (e.g., <a href=\"https://plazi-tester.herokuapp.com/?uri=252C87918B362C05FF20F8C5BFCB3D4E\">https://plazi-tester.herokuapp.com/?uri=252C87918B362C05FF20F8C5BFCB3D4E</a>).</p>\n\n<p>Parsing
- specimens is a hard problem so it''s not unexpected to find errors. But they
- do seem common enough to be easily found, which raises the question of just
- what percentage of these material citations are correct? How much of the
- data Plazi feeds to GBIF is correct? How would we know?</p>\n\n<h2>Systemic
- problems</h2>\n\n<p>Some of the errors I''ve found concern the interpretation
- of the parsed data. For example, it is striking that despite including marine
- taxa <b>no</b> Plazi record has a value for depth below sea level (see <a
- href=\"https://www.gbif.org/occurrence/map?depth=0,9999&publishing_org=7ce8aef0-9e92-11dc-8738-b8a03c50a862&advanced=1\">GBIF
- search on depth range 0-9999 for Plazi</a>). But <a href=\"https://www.gbif.org/occurrence/map?elevation=0,9999&publishing_org=7ce8aef0-9e92-11dc-8738-b8a03c50a862&advanced=1\">many
- records do have an elevation</a>, including records from marine environments.
- Any record that has a depth value is interpreted by Plazi as being elevation,
- so we have aerial crustacea and fish.</p>\n\n<h3>Map of Plazi records with
- depth 0-9999m</h3>\n<div class=\"separator\" style=\"clear: both;\"><a href=\"https://1.bp.blogspot.com/-GD4pPtPCxVc/YXaRQ9bdn1I/AAAAAAAAgz8/A9YsypSvHfwWKAjDxSdeFVUkou88LGItACPcBGAYYCw/s673/Screenshot%2B2021-10-25%2Bat%2B12.03.51.png\"
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
- border=\"0\" width=\"400\" data-original-height=\"258\" data-original-width=\"673\"
- src=\"https://1.bp.blogspot.com/-GD4pPtPCxVc/YXaRQ9bdn1I/AAAAAAAAgz8/A9YsypSvHfwWKAjDxSdeFVUkou88LGItACPcBGAYYCw/s320/Screenshot%2B2021-10-25%2Bat%2B12.03.51.png\"/></a></div>\n\n<h3>Map
- of Plazi records with elevation 0-9999m </h3>\n<div class=\"separator\" style=\"clear:
- both;\"><a href=\"https://1.bp.blogspot.com/-BReSHtXTOkA/YXaRRKFW7dI/AAAAAAAAg0A/-FcBkMwyswIp0siWGVX3MNMANs7UkZFtwCPcBGAYYCw/s675/Screenshot%2B2021-10-25%2Bat%2B12.04.06.png\"
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
- border=\"0\" width=\"400\" data-original-height=\"256\" data-original-width=\"675\"
- src=\"https://1.bp.blogspot.com/-BReSHtXTOkA/YXaRRKFW7dI/AAAAAAAAg0A/-FcBkMwyswIp0siWGVX3MNMANs7UkZFtwCPcBGAYYCw/s320/Screenshot%2B2021-10-25%2Bat%2B12.04.06.png\"/></a></div>\n\n<p>Anecdotally
- src=\"https://1.bp.blogspot.com/-BReSHtXTOkA/YXaRRKFW7dI/AAAAAAAAg0A/-FcBkMwyswIp0siWGVX3MNMANs7UkZFtwCPcBGAYYCw/s320/Screenshot%2B2021-10-25%2Bat%2B12.04.06.png\"/></a></div>\n\n<p>Anecdotally
- I''ve also noticed that Plazi seems to do well on zoological data, especially
- journals like <i>Zootaxa</i>, but it often struggles with botanical specimens.
- Botanists tend to cite specimens rather differently to zoologists (botanists
- emphasise collector numbers rather than specimen codes). Hence data quality
- in Plazi is likely to taxonomic biased.</p>\n\n<p>Plazi is <a href=\"https://github.com/plazi/community/issues\">using
- GitHub to track issues with treatments</a> so feedback on erroneous records
- is possible, but this seems inadequate to the task. There are tens of thousands
- of data sets, with more being released daily, and hundreds of thousands of
- occurrences, and relying on GitHub issues devolves the responsibility for
- error checking onto the data users. I don''t have a measure of how many records
- in Plazi have problems, but because I suspect it is a significant fraction
- because for any given day''s output I can typically find errors.</p>\n\n<h2>What
- to do?</h2>\n\n<p>Faced with a process that generates noisy data there are
- several of things we could do:</p>\n\n<ol>\n<li>Have tools to detect and flag
- errors made in generating the data.</li>\n<li>Have the data generator give
- estimates the confidence of its results.</li>\n<li>Improve the data generator.</li>\n</ol>\n\n<p>I
- think a comparison with the problem of parsing bibliographic references might
- be instructive here. There is a long history of people developing tools to
- parse references (<a href=\"https://iphylo.blogspot.com/2021/05/finding-citations-of-specimens.html\">I''ve
- even had a go</a>). State-of-the art tools such as <a href=\"https://anystyle.io\">AnyStyle</a>
- feature machine learning, and are tested against <a href=\"https://github.com/inukshuk/anystyle/blob/master/res/parser/core.xml\">human
- curated datasets</a> of tagged bibliographic records. This means we can evaluate
- the performance of a method (how well does it retrieve the same results as
- human experts?) and also improve the method by expanding the corpus of training
- data. Some of these tools can provide a measures of how confident they are
- when classifying a string as, say, a person''s name, which means we could
- flag potential issues for anyone wanting to use that record.</p>\n\n<p>We
- don''t have equivalent tools for parsing specimens in the literature, and
- hence have no easy way to quantify how good existing methods are, nor do we
- have a public corpus of material citations that we can use as training data.
- I <a href=\"https://iphylo.blogspot.com/2021/05/finding-citations-of-specimens.html\">blogged
- about this</a> a few months ago and was considering using Plazi as a source
- looked at so far Plazi''s data would need to be carefully scrutinised before
- of marked up specimen data to use for training. However based on what I''ve
- looked at so far Plazi''s data would need to be carefully scrutinised before
- it could be used as training data.</p>\n\n<p>Going forward, I think it would
- be desirable to have a set of records that can be used to benchmark specimen
- parsers, and ideally have the parsers themselves available as web services
- so that anyone can evaluate them. Even better would be a way to contribute
- to the training data so that these tools improve over time.</p>\n\n<p>Plazi''s
- data extraction tools are mostly desktop-based, that is, you need to download
- software to use their methods. However, there are experimental web services
- available as well. I''ve created a simple wrapper around the material citation
- parser, you can try it at <a href=\"https://plazi-tester.herokuapp.com/parser.php\">https://plazi-tester.herokuapp.com/parser.php</a>.
- It takes a single material citation and returns a version with elements such
- as specimen code and collector name tagged in different colours.</p>\n\n<h2>Summary</h2>\n\n<p>Text
- mining the taxonomic literature is clearly a gold mine of data, but at the
- same time it is potentially fraught as we try and extract structured data
- from semi-structured text. Plazi has demonstrated that it is possible to extract
- a lot of data from the literature, but at the same time the quality of that
- data seems highly variable. Even minor issues in parsing text can have big
- implications for data quality (e.g., marine organisms apparently living above
- sea level). Historically in biodiversity informatics we have favoured data
- quantity over data quality. Quantity has an obvious metric, and has milestones
- we can celebrate (e.g., <a href=\"GBIF at 1 billion - what''s next?\">one
- billion specimens</a>). There aren''t really any equivalent metrics for data
- quality.</p>\n\n<p>Adding new types of data can sometimes initially result
- in a new set of quality issues (e.g., <a href=\"https://iphylo.blogspot.com/2019/12/gbif-metagenomics-and-metacrap.html\">GBIF
- metagenomics and metacrap</a>) that take time to resolve. In the case of Plazi,
- I think it would be worthwhile to quantify just how many records have errors,
- and develop benchmarks that we can use to test methods for extracting specimen
- data from text. If we don''t do this then there will remain uncertainty as
- to how much trust we can place in data mined from the taxonomic literature.</p>\n\n<h2>Update</h2>\n\nPlazi
- has responded, see <a href=\"http://plazi.org/posts/2021/10/liberation-first-step-toward-quality/\">Liberating
- material citations as a first step to more better data</a>. My reading of
- their repsonse is that it essentially just reiterates Plazi''s approach and
- doesn''t tackle the underlying issue: their method for extracting material
- citations is error prone, and many of those errors end up in GBIF.","tags":["data
- quality","parsing","Plazi","specimen","text mining"],"language":"en","references":null}]}'
- recorded_at: Thu, 15 Jun 2023 20:39:01 GMT
+ float: right;\"><img alt=\"\" border=\"0\" width=\"128\" data-original-height=\"1000\"
+ data-original-width=\"1000\" src=\"https://blogger.googleusercontent.com/img/a/AVvXsEh7WW8TlfN2u3xe_E-sH5IK6AWYnAoWaKrP2b32UawUeqMpPlq6ZFk5BtJVZsMmCNh5j3QRsTj5H0Ee55RRGntnc9yjj_mNB8KHmH2dzocCgyLS2VFOxsBji6u4Ey6qxlDAnT-zrsBpnDcTchbhgt1x0Sf7RkmIMkS1y4-_3KCQian-SeIF-g=s200\"/></a></div><p>I''ve
+ released a very crude GraphQL endpoint for WikiData. More precisely, the endpoint
+ is for a subset of the entities that are of interest to WikiCite, such as
+ scholarly articles, people, and journals. There is a crude demo at <a href=\"https://wikicite-graphql.herokuapp.com\">https://wikicite-graphql.herokuapp.com</a>.
+ The endpoint itself is at <a href=\"https://wikicite-graphql.herokuapp.com/gql.php\">https://wikicite-graphql.herokuapp.com/gql.php</a>.
+ There are various ways to interact with the endpoint, personally I like the
+ <a href=\"https://altair.sirmuel.design\">Altair GraphQL Client</a> by <a
+ href=\"https://github.com/imolorhe\">Samuel Imolorhe</a>.</p>\n\n<p>As I''ve
+ <a href=\"https://iphylo.blogspot.com/2021/04/it-been-while.html\">mentioned
+ earlier</a> it''s taken me a while to see the point of GraphQL. But it is
+ clear it is gaining traction in the biodiversity world (see for example the
+ <a href=\"https://dev.gbif.org/hosted-portals.html\">GBIF Hosted Portals</a>)
+ so it''s worth exploring. My take on GraphQL is that it is a way to create
+ a self-describing API that someone developing a web site can use without them
+ having to bury themselves in the gory details of how data is internally modelled.
+ For example, WikiData''s query interface uses SPARQL, a powerful language
+ that has a steep learning curve (in part because of the administrative overhead
+ brought by RDF namespaces, etc.). In my previous SPARQL-based projects such
+ as <a href=\"https://ozymandias-demo.herokuapp.com\">Ozymandias</a> and <a
+ href=\"http://alec-demo.herokuapp.com\">ALEC</a> I have either returned SPARQL
+ results directly (Ozymandias) or formatted SPARQL results as <a href=\"https://schema.org/DataFeed\">schema.org
+ DataFeeds</a> (equivalent to RSS feeds) (ALEC). Both approaches work, but
+ they are project-specific and if anyone else tried to build based on these
+ projects they might struggle to figure out what was going on. I certainly
+ struggle, and I wrote them!</p>\n\n<p>So it seems worthwhile to explore this
+ approach a little further and see if I can develop a GraphQL interface that
+ can be used to build the sort of rich apps that I want to see. The demo I''ve
+ created uses SPARQL under the hood to provide responses to the GraphQL queries.
+ So in this sense it''s not replacing SPARQL, it''s simply providing a (hopefully)
+ simpler overlay on top of SPARQL so that we can retrieve the data we want
+ without having to learn the intricacies of SPARQL, nor how Wikidata models
+ publications and people.</p>","tags":["GraphQL","SPARQL","WikiCite","Wikidata"],"language":"en","references":[]}]}'
+ recorded_at: Sun, 18 Jun 2023 15:23:48 GMT
  recorded_with: VCR 6.1.0