commonmeta-ruby 3.3.2 → 3.3.4
- checksums.yaml +4 -4
- data/Gemfile.lock +3 -1
- data/bin/commonmeta +1 -1
- data/commonmeta.gemspec +1 -0
- data/lib/commonmeta/cli.rb +10 -0
- data/lib/commonmeta/readers/json_feed_reader.rb +1 -1
- data/lib/commonmeta/utils.rb +55 -1
- data/lib/commonmeta/version.rb +1 -1
- data/lib/commonmeta.rb +1 -0
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/change_metadata_as_datacite_xml/with_data_citation.yml +14 -14
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/doi_registration_agency/crossref.yml +5 -5
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/doi_registration_agency/datacite.yml +5 -5
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/doi_registration_agency/jalc.yml +5 -5
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/doi_registration_agency/kisti.yml +5 -5
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/doi_registration_agency/medra.yml +5 -5
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/doi_registration_agency/not_found.yml +5 -5
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/doi_registration_agency/op.yml +5 -5
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/find_from_format_by_ID/crossref.yml +5 -5
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/find_from_format_by_ID/crossref_doi_not_url.yml +5 -5
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/find_from_format_by_ID/datacite.yml +5 -5
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/find_from_format_by_ID/datacite_doi_http.yml +5 -5
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/find_from_format_by_ID/unknown_DOI_registration_agency.yml +5 -5
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_cff_metadata/cff-converter-python.yml +9 -7
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_cff_metadata/ruby-cff.yml +16 -14
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_cff_metadata/ruby-cff_repository_url.yml +14 -12
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_codemeta_metadata/maremma.yml +10 -8
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_codemeta_metadata/metadata_reports.yml +9 -7
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/DOI_with_ORCID_ID.yml +74 -74
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/DOI_with_SICI_DOI.yml +73 -73
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/DOI_with_data_citation.yml +70 -70
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/JaLC.yml +159 -159
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/KISTI.yml +128 -128
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/OP.yml +72 -72
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/affiliation_is_space.yml +73 -73
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/another_book.yml +109 -109
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/another_book_chapter.yml +71 -71
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/article_id_as_page_number.yml +74 -74
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/author_literal.yml +82 -82
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/book.yml +71 -71
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/book_chapter.yml +72 -72
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/book_chapter_with_RDF_for_container.yml +70 -70
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/book_oup.yml +69 -69
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/component.yml +91 -91
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/dataset.yml +101 -102
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/dataset_usda.yml +133 -133
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/date_in_future.yml +78 -78
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/dissertation.yml +100 -100
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/empty_given_name.yml +72 -72
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/invalid_date.yml +73 -73
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/journal_article.yml +72 -72
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/journal_article_original_language_title.yml +70 -70
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/journal_article_with.yml +76 -514
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/journal_article_with_RDF_for_container.yml +70 -70
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/journal_article_with_funding.yml +73 -73
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/journal_issue.yml +69 -69
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/mEDRA.yml +69 -69
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/markup.yml +78 -78
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/missing_creator.yml +73 -73
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/multiple_issn.yml +72 -72
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/multiple_titles.yml +71 -70
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/multiple_titles_with_missing.yml +716 -716
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/not_found_error.yml +63 -63
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/peer_review.yml +74 -74
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/posted_content.yml +71 -71
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/posted_content_copernicus.yml +73 -73
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/report_osti.yml +117 -117
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/vor_with_url.yml +75 -75
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/yet_another_book.yml +69 -69
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/yet_another_book_chapter.yml +70 -70
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_raw/journal_article.yml +10 -10
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_datacite_metadata/dissertation.yml +9 -9
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_datacite_metadata/funding_references.yml +11 -11
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_datacite_metadata/subject_scheme.yml +20 -20
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_doi_prefix_for_blog/by_blog_id.yml +6 -419
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_doi_prefix_for_blog/by_blog_post_uuid.yml +7 -260
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_doi_prefix_for_blog/by_blog_post_uuid_specific_prefix.yml +3 -136
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed/by_blog_id.yml +225 -1432
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed/not_indexed_posts.yml +1380 -2112
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed/unregistered_posts.yml +6 -172
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/blog_post_with_non-url_id.yml +7 -7
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/blogger_post.yml +12 -12
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/ghost_post_with_author_name_suffix.yml +8 -8
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/ghost_post_with_doi.yml +7 -7
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/ghost_post_with_organizational_author.yml +3 -3
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/ghost_post_without_doi.yml +8 -8
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/jekyll_post.yml +8 -8
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/substack_post_with_broken_reference.yml +90 -176
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/syldavia_gazette_post_with_references.yml +25 -25
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/upstream_post_with_references.yml +61 -61
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/wordpress_post.yml +8 -8
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/wordpress_post_with_references.yml +20 -20
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_one_author/has_familyName.yml +9 -9
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_one_author/has_name_in_display-order_with_ORCID.yml +9 -9
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_one_author/name_with_affiliation_crossref.yml +14 -14
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_one_author/only_familyName_and_givenName.yml +43 -36
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_schema_org_metadata/BlogPosting.yml +158 -158
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_schema_org_metadata/BlogPosting_with_new_DOI.yml +162 -162
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_schema_org_metadata/get_schema_org_metadata_front_matter/BlogPosting.yml +178 -180
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_schema_org_metadata/harvard_dataverse.yml +226 -230
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_schema_org_metadata/pangaea.yml +43 -36
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_schema_org_metadata/upstream_blog.yml +94 -94
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_schema_org_metadata/zenodo.yml +14 -14
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/handle_input/DOI_RA_not_Crossref_or_DataCite.yml +5 -5
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/handle_input/unknown_DOI_prefix.yml +5 -5
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/json_schema_errors/is_valid.yml +13 -13
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_bibtex/BlogPosting.yml +4 -4
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_bibtex/Dataset.yml +6 -6
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_bibtex/authors_with_affiliations.yml +14 -14
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_bibtex/climate_data.yml +6 -6
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_bibtex/from_schema_org.yml +159 -159
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_bibtex/keywords_subject_scheme.yml +6 -6
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_bibtex/maremma.yml +12 -10
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_bibtex/text.yml +4 -4
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_bibtex/with_data_citation.yml +14 -14
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_bibtex/with_pages.yml +12 -12
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_cff/Collection_of_Jupyter_notebooks.yml +9 -9
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_cff/SoftwareSourceCode_Zenodo.yml +17 -17
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_cff/SoftwareSourceCode_also_Zenodo.yml +12 -12
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_cff/ruby-cff.yml +16 -14
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_citation/Dataset.yml +6 -6
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_citation/Journal_article.yml +14 -14
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_citation/Journal_article_vancouver_style.yml +19 -19
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_citation/Missing_author.yml +12 -12
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_citation/interactive_resource_without_dates.yml +4 -4
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_citation/software_w/version.yml +6 -6
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_codemeta/SoftwareSourceCode_DataCite.yml +4 -4
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_codemeta/SoftwareSourceCode_DataCite_check_codemeta_v2.yml +4 -4
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_crossref/another_schema_org_from_front-matter.yml +27 -27
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_crossref/journal_article.yml +4 -4
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_crossref/journal_article_from_datacite.yml +4 -4
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_crossref/json_feed_item_from_rogue_scholar_with_doi.yml +8 -54
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_crossref/json_feed_item_from_rogue_scholar_with_organizational_author.yml +3 -3
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_crossref/json_feed_item_from_upstream_blog.yml +10 -53
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_crossref/json_feed_item_with_references.yml +62 -62
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_crossref/posted_content.yml +15 -15
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_crossref/schema_org_from_another_science_blog.yml +6 -6
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_crossref/schema_org_from_front_matter.yml +29 -29
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_crossref/schema_org_from_upstream_blog.yml +4 -4
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/Another_dataset.yml +28 -28
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/BlogPosting.yml +4 -4
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/BlogPosting_schema_org.yml +158 -158
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/Dataset.yml +6 -6
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/container_title.yml +11 -11
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/interactive_resource_without_dates.yml +4 -4
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/journal_article.yml +14 -14
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/keywords_subject_scheme.yml +6 -6
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/maremma.yml +9 -7
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/missing_creator.yml +12 -12
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/multiple_abstracts.yml +6 -6
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/organization_author.yml +19 -19
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/software.yml +4 -4
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/software_w/version.yml +6 -6
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/with_only_first_page.yml +13 -13
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/with_pages.yml +12 -12
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csv/climate_data.yml +6 -6
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csv/maremma.yml +10 -8
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csv/text.yml +4 -4
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csv/with_data_citation.yml +14 -14
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csv/with_pages.yml +12 -12
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_datacite/dissertation.yml +17 -17
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_datacite/from_schema_org.yml +158 -158
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_datacite/journal_article.yml +18 -18
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_datacite/maremma.yml +10 -8
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_datacite/with_ORCID_ID.yml +12 -12
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_datacite/with_data_citation.yml +14 -14
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_jats_xml/Dataset_in_schema_4_0.yml +6 -6
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_jats_xml/Text_pass-thru.yml +4 -4
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_jats_xml/book_chapter.yml +15 -13
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_jats_xml/from_schema_org.yml +158 -158
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_jats_xml/interactive_resource_without_dates.yml +4 -4
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_jats_xml/maremma.yml +12 -10
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_jats_xml/with_ORCID_ID.yml +12 -12
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_jats_xml/with_data_citation.yml +14 -14
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_jats_xml/with_editor.yml +13 -13
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_ris/BlogPosting.yml +4 -4
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_ris/BlogPosting_schema_org.yml +159 -159
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_ris/Dataset.yml +6 -6
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_ris/alternate_name.yml +4 -4
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_ris/journal_article.yml +8 -8
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_ris/keywords_with_subject_scheme.yml +6 -6
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_ris/maremma.yml +9 -7
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_ris/with_pages.yml +7 -7
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/Another_Schema_org_JSON.yml +6 -6
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/Funding.yml +9 -9
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/Funding_OpenAIRE.yml +9 -9
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/Schema_org_JSON.yml +17 -17
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/Schema_org_JSON_Cyark.yml +33 -33
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/alternate_identifiers.yml +9 -9
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/data_catalog.yml +9 -9
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/geo_location_box.yml +12 -12
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/interactive_resource_without_dates.yml +9 -9
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/journal_article.yml +14 -14
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/maremma_schema_org_JSON.yml +10 -8
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/series_information.yml +9 -9
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/subject_scheme.yml +11 -11
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/subject_scheme_multiple_keywords.yml +11 -11
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_turtle/BlogPosting.yml +4 -4
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_turtle/Dataset.yml +6 -6
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_turtle/journal_article.yml +14 -14
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_turtle/with_pages.yml +12 -12
- data/spec/readers/cff_reader_spec.rb +6 -6
- data/spec/readers/crossref_reader_spec.rb +3 -3
- data/spec/readers/crossref_xml_reader_spec.rb +7 -7
- data/spec/readers/json_feed_reader_spec.rb +13 -13
- data/spec/readers/schema_org_reader_spec.rb +2 -3
- data/spec/spec_helper.rb +1 -0
- data/spec/utils_spec.rb +1 -1
- data/spec/writers/cff_writer_spec.rb +3 -3
- data/spec/writers/ris_writer_spec.rb +2 -2
- data/spec/writers/schema_org_writer_spec.rb +1 -1
- metadata +21 -423
- data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_file/crossref/default.yml +0 -110
- data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_file/crossref/to_bibtex.yml +0 -110
- data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_file/crossref/to_crossref_xml.yml +0 -110
- data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_file/crossref/to_datacite.yml +0 -110
- data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_file/crossref/to_schema_org.yml +0 -110
- data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_file/crossref_xml/default.yml +0 -55
- data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_file/crossref_xml/to_bibtex.yml +0 -55
- data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_file/crossref_xml/to_crossref_xml.yml +0 -55
- data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_file/crossref_xml/to_datacite.yml +0 -55
- data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_file/crossref_xml/to_schema_org.yml +0 -55
- data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_from_id/crossref/default.yml +0 -299
- data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_from_id/crossref/to_bibtex.yml +0 -299
- data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_from_id/crossref/to_citation.yml +0 -299
- data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_from_id/crossref/to_crossref_xml.yml +0 -299
- data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_from_id/crossref/to_datacite.yml +0 -299
- data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_from_id/crossref/to_jats.yml +0 -299
- data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_from_id/crossref/to_schema_org.yml +0 -299
- data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_from_id/datacite/default.yml +0 -172
- data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_from_id/datacite/to_bibtex.yml +0 -172
- data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_from_id/datacite/to_citation.yml +0 -172
- data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_from_id/datacite/to_datacite.yml +0 -172
- data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_from_id/datacite/to_jats.yml +0 -172
- data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_from_id/datacite/to_schema_org.yml +0 -172
- data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_from_id/schema_org/default.yml +0 -1098
- data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_from_id/schema_org/to_datacite.yml +0 -1098
- data/spec/fixtures/vcr_cassettes/Briard_CLI/convert_from_id/schema_org/to_schema_org.yml +0 -1100
- data/spec/fixtures/vcr_cassettes/Briard_CLI/find_from_format_by_id/crossref.yml +0 -55
- data/spec/fixtures/vcr_cassettes/Briard_CLI/find_from_format_by_id/datacite.yml +0 -55
- data/spec/fixtures/vcr_cassettes/Briard_CLI/find_from_format_by_id/jalc.yml +0 -55
- data/spec/fixtures/vcr_cassettes/Briard_CLI/find_from_format_by_id/kisti.yml +0 -55
- data/spec/fixtures/vcr_cassettes/Briard_CLI/find_from_format_by_id/medra.yml +0 -55
- data/spec/fixtures/vcr_cassettes/Briard_CLI/find_from_format_by_id/op.yml +0 -55
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/authors_as_string/author.yml +0 -164
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/authors_as_string/no_author.yml +0 -164
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/authors_as_string/single_author.yml +0 -164
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/authors_as_string/with_organization.yml +0 -164
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/change_metadata_as_datacite_xml/with_data_citation.yml +0 -247
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/doi_registration_agency/crossref.yml +0 -55
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/doi_registration_agency/datacite.yml +0 -55
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/doi_registration_agency/jalc.yml +0 -55
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/doi_registration_agency/kisti.yml +0 -55
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/doi_registration_agency/medra.yml +0 -55
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/doi_registration_agency/not_found.yml +0 -55
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/doi_registration_agency/op.yml +0 -55
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/find_from_format_by_ID/crossref.yml +0 -55
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/find_from_format_by_ID/crossref_doi_not_url.yml +0 -55
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/find_from_format_by_ID/datacite.yml +0 -55
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/find_from_format_by_ID/datacite_doi_http.yml +0 -55
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/find_from_format_by_ID/unknown_DOI_registration_agency.yml +0 -55
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/fos/hsh_to_fos_for_match.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/fos/hsh_to_fos_for_with_schemeUri_in_hash.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/fos/hsh_to_fos_match.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/fos/hsh_to_fos_no_match.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/fos/name_to_fos_for_match.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/fos/name_to_fos_match.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/fos/name_to_fos_no_match.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/from_schema_org/with_id.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/from_schema_org_creators/with_affiliation.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/from_schema_org_creators/without_affiliation.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_cff_metadata/cff-converter-python.yml +0 -200
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_cff_metadata/ruby-cff.yml +0 -154
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_cff_metadata/ruby-cff_repository_url.yml +0 -154
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_codemeta_metadata/maremma.yml +0 -86
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_codemeta_metadata/metadata_reports.yml +0 -93
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/DOI_with_ORCID_ID.yml +0 -337
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/DOI_with_SICI_DOI.yml +0 -347
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/DOI_with_data_citation.yml +0 -359
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/JaLC.yml +0 -384
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/KISTI.yml +0 -330
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/OP.yml +0 -969
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/affiliation_is_space.yml +0 -358
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/another_book.yml +0 -312
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/another_book_chapter.yml +0 -465
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/article_id_as_page_number.yml +0 -276
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/author_literal.yml +0 -492
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/book.yml +0 -523
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/book_chapter.yml +0 -377
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/book_chapter_with_RDF_for_container.yml +0 -336
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/book_oup.yml +0 -289
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/component.yml +0 -289
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/dataset.yml +0 -299
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/dataset_usda.yml +0 -341
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/date_in_future.yml +0 -570
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/dissertation.yml +0 -301
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/empty_given_name.yml +0 -303
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/invalid_date.yml +0 -307
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/journal_article.yml +0 -461
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/journal_article_original_language_title.yml +0 -276
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/journal_article_with.yml +0 -470
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/journal_article_with_RDF_for_container.yml +0 -519
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/journal_article_with_funding.yml +0 -456
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/journal_issue.yml +0 -270
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/mEDRA.yml +0 -310
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/markup.yml +0 -329
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/missing_creator.yml +0 -307
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/multiple_issn.yml +0 -393
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/multiple_titles.yml +0 -265
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/multiple_titles_with_missing.yml +0 -860
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/not_found_error.yml +0 -209
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/peer_review.yml +0 -287
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/posted_content.yml +0 -326
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/posted_content_copernicus.yml +0 -297
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/report_osti.yml +0 -315
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/vor_with_url.yml +0 -451
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/yet_another_book.yml +0 -816
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_metadata/yet_another_book_chapter.yml +0 -324
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_crossref_raw/journal_article.yml +0 -110
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_datacite_metadata/dissertation.yml +0 -152
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_datacite_metadata/funding_references.yml +0 -175
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_datacite_metadata/subject_scheme.yml +0 -328
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_date/publication_date.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_date_from_date_parts/date.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_date_from_date_parts/year-month.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_date_from_date_parts/year.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_date_from_parts/date.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_date_from_parts/year-month.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_date_from_parts/year.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_date_parts/date.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_date_parts/year-month.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_date_parts/year.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_datetime_from_time/future.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_datetime_from_time/invalid.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_datetime_from_time/nil.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_datetime_from_time/past.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_datetime_from_time/present.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_one_author/has_familyName.yml +0 -133
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_one_author/has_name_in_display-order_with_ORCID.yml +0 -153
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_one_author/is_organization.yml +0 -164
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_one_author/name_with_affiliation_crossref.yml +0 -247
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_one_author/only_familyName_and_givenName.yml +0 -468
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_schema_org_metadata/BlogPosting.yml +0 -530
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_schema_org_metadata/BlogPosting_with_new_DOI.yml +0 -530
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_schema_org_metadata/get_schema_org_metadata_front_matter/BlogPosting.yml +0 -534
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_schema_org_metadata/harvard_dataverse.yml +0 -1838
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_schema_org_metadata/pangaea.yml +0 -468
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_schema_org_metadata/upstream_blog.yml +0 -885
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_schema_org_metadata/zenodo.yml +0 -583
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_series_information/only_title.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_series_information/title_and_pages.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_series_information/title_volume_and_pages.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/get_series_information/title_volume_issue_and_pages.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/github/github_as_cff_url.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/github/github_as_codemeta_url.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/github/github_from_url.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/github/github_from_url_cff_file.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/github/github_from_url_file.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/handle_input/DOI_RA_not_Crossref_or_DataCite.yml +0 -55
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/handle_input/unknown_DOI_prefix.yml +0 -55
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/is_personal_name_/has_comma.yml +0 -164
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/is_personal_name_/has_family_name.yml +0 -164
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/is_personal_name_/has_id.yml +0 -164
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/is_personal_name_/has_known_given_name.yml +0 -164
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/is_personal_name_/has_no_info.yml +0 -164
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/is_personal_name_/has_orcid_id.yml +0 -164
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/is_personal_name_/has_type_organization.yml +0 -164
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/json_schema_errors/is_valid.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_cc_url/not_found.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_cc_url/with_trailing_slash.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_cc_url/with_trailing_slash_and_to_https.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_id/doi.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_id/doi_as_url.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_id/filename.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_id/ftp.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_id/invalid_url.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_id/sandbox_via_options.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_id/sandbox_via_url.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_id/string.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_id/url.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_id/url_with_utf-8.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_ids/doi.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_ids/url.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_issn/from_array.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_issn/from_empty_array.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_issn/from_hash.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_issn/from_string.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_url/uri.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_url/with_trailing_slash.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/normalize_url/with_trailing_slash_and_to_https.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/parse_attributes/array.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/parse_attributes/array_of_strings.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/parse_attributes/first.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/parse_attributes/hash.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/parse_attributes/hash_with_array_value.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/parse_attributes/nil.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/parse_attributes/string.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/random_doi/decode_anothe_doi.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/random_doi/decode_doi.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/random_doi/encode_doi.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/sanitize/onlies_keep_specific_tags.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/sanitize/removes_a_tags.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/spdx/hsh_to_spdx_id.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/spdx/hsh_to_spdx_not_found.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/spdx/hsh_to_spdx_url.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/spdx/name_to_spdx_exists.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/spdx/name_to_spdx_id.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/to_schema_org/with_id.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/to_schema_org_identifiers/with_identifiers.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/validate_orcid/validate_orcid.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/validate_orcid/validate_orcid_https.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/validate_orcid/validate_orcid_id.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/validate_orcid/validate_orcid_sandbox.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/validate_orcid/validate_orcid_sandbox_https.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/validate_orcid/validate_orcid_with_spaces.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/validate_orcid/validate_orcid_wrong_id.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/validate_orcid/validate_orcid_www.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/validate_orcid_scheme/validate_orcid_scheme.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/validate_orcid_scheme/validate_orcid_scheme_https.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/validate_orcid_scheme/validate_orcid_scheme_trailing_slash.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/validate_orcid_scheme/validate_orcid_scheme_www.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/validate_url/DOI.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/validate_url/ISSN.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/validate_url/URL.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/validate_url/string.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_bibtex/BlogPosting.yml +0 -81
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_bibtex/Dataset.yml +0 -120
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_bibtex/authors_with_affiliations.yml +0 -186
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_bibtex/climate_data.yml +0 -74
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_bibtex/from_schema_org.yml +0 -530
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_bibtex/keywords_subject_scheme.yml +0 -149
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_bibtex/maremma.yml +0 -86
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_bibtex/text.yml +0 -100
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_bibtex/with_data_citation.yml +0 -247
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_bibtex/with_pages.yml +0 -228
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_cff/Collection_of_Jupyter_notebooks.yml +0 -143
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_cff/SoftwareSourceCode_Zenodo.yml +0 -150
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_cff/SoftwareSourceCode_also_Zenodo.yml +0 -93
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_cff/ruby-cff.yml +0 -154
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_citation/Dataset.yml +0 -120
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_citation/Journal_article.yml +0 -247
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_citation/Journal_article_vancouver_style.yml +0 -299
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_citation/Missing_author.yml +0 -199
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_citation/interactive_resource_without_dates.yml +0 -75
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_citation/software_w/version.yml +0 -86
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_codemeta/SoftwareSourceCode_DataCite.yml +0 -76
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_codemeta/SoftwareSourceCode_DataCite_check_codemeta_v2.yml +0 -76
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_crossref/another_schema_org_from_front-matter.yml +0 -541
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_crossref/journal_article.yml +0 -55
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_crossref/journal_article_from_datacite.yml +0 -85
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_crossref/posted_content.yml +0 -283
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_crossref/schema_org_from_another_science_blog.yml +0 -123
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_crossref/schema_org_from_front_matter.yml +0 -477
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_crossref/schema_org_from_upstream_blog.yml +0 -1025
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csl/Another_dataset.yml +0 -110
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csl/BlogPosting.yml +0 -81
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csl/BlogPosting_schema_org.yml +0 -530
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csl/Dataset.yml +0 -120
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csl/container_title.yml +0 -153
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csl/interactive_resource_without_dates.yml +0 -75
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csl/journal_article.yml +0 -247
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csl/keywords_subject_scheme.yml +0 -149
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csl/maremma.yml +0 -86
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csl/missing_creator.yml +0 -199
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csl/multiple_abstracts.yml +0 -101
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csl/organization_author.yml +0 -314
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csl/software.yml +0 -90
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csl/software_w/version.yml +0 -86
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csl/with_only_first_page.yml +0 -333
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csl/with_pages.yml +0 -228
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csv/climate_data.yml +0 -74
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csv/maremma.yml +0 -86
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csv/text.yml +0 -100
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csv/with_data_citation.yml +0 -247
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_csv/with_pages.yml +0 -228
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_datacite/from_schema_org.yml +0 -530
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_datacite/maremma.yml +0 -86
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_datacite/with_ORCID_ID.yml +0 -228
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_datacite/with_data_citation.yml +0 -247
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_datacite_json/from_schema_org.yml +0 -530
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_datacite_json/maremma.yml +0 -86
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_datacite_json/with_ORCID_ID.yml +0 -228
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_datacite_json/with_data_citation.yml +0 -247
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_jats_xml/Dataset_in_schema_4_0.yml +0 -120
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_jats_xml/Text_pass-thru.yml +0 -106
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_jats_xml/book_chapter.yml +0 -163
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_jats_xml/from_schema_org.yml +0 -530
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_jats_xml/interactive_resource_without_dates.yml +0 -75
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_jats_xml/maremma.yml +0 -86
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_jats_xml/with_ORCID_ID.yml +0 -228
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_jats_xml/with_data_citation.yml +0 -247
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_jats_xml/with_editor.yml +0 -355
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_rdf_xml/BlogPosting.yml +0 -81
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_rdf_xml/BlogPosting_schema_org.yml +0 -530
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_rdf_xml/journal_article.yml +0 -247
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_rdf_xml/maremma.yml +0 -86
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_rdf_xml/with_pages.yml +0 -228
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_ris/BlogPosting.yml +0 -81
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_ris/BlogPosting_schema_org.yml +0 -530
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_ris/Dataset.yml +0 -120
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_ris/alternate_name.yml +0 -138
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_ris/journal_article.yml +0 -115
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_ris/keywords_with_subject_scheme.yml +0 -149
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_ris/maremma.yml +0 -86
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_ris/with_pages.yml +0 -112
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_schema_org/Another_Schema_org_JSON.yml +0 -120
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_schema_org/Funding.yml +0 -192
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_schema_org/Funding_OpenAIRE.yml +0 -150
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_schema_org/Schema_org_JSON.yml +0 -98
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_schema_org/Schema_org_JSON_Cyark.yml +0 -160
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_schema_org/Schema_org_JSON_IsSupplementTo.yml +0 -153
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_schema_org/alternate_identifiers.yml +0 -131
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_schema_org/data_catalog.yml +0 -136
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_schema_org/geo_location_box.yml +0 -181
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_schema_org/interactive_resource_without_dates.yml +0 -127
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_schema_org/journal_article.yml +0 -247
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_schema_org/maremma_schema_org_JSON.yml +0 -86
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_schema_org/series_information.yml +0 -174
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_schema_org/subject_scheme.yml +0 -199
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_schema_org/subject_scheme_multiple_keywords.yml +0 -201
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_turtle/BlogPosting.yml +0 -81
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_turtle/BlogPosting_schema_org.yml +0 -530
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_turtle/Dataset.yml +0 -120
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_turtle/journal_article.yml +0 -247
- data/spec/fixtures/vcr_cassettes/Briard_Metadata/write_metadata_as_turtle/with_pages.yml +0 -228
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/fos/hsh_to_fos_for_match.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/fos/hsh_to_fos_for_with_schemeUri_in_hash.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/fos/hsh_to_fos_match.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/fos/hsh_to_fos_no_match.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/fos/name_to_fos_for_match.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/fos/name_to_fos_match.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/fos/name_to_fos_no_match.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/from_schema_org/with_id.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_date/publication_date.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_date_from_date_parts/date.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_date_from_date_parts/year-month.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_date_from_date_parts/year.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_date_from_parts/date.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_date_from_parts/year-month.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_date_from_parts/year.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_date_parts/date.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_date_parts/year-month.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_date_parts/year.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_datetime_from_time/future.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_datetime_from_time/invalid.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_datetime_from_time/nil.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_datetime_from_time/past.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_datetime_from_time/present.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed/all_posts.yml +0 -3602
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed/behind_the_science.yml +0 -1176
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed/citation_style_language.yml +0 -360
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed/citation_style_language_blog.yml +0 -360
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed/front-matter_blog.yml +0 -1034
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed/upstream.yml +0 -2438
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed/upstream_blog.yml +0 -2438
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item/by_uuid.yml +0 -136
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_link/license.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_link/url.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_series_information/only_title.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_series_information/title_and_pages.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_series_information/title_volume_and_pages.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_series_information/title_volume_issue_and_pages.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/github/github_as_cff_url.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/github/github_as_codemeta_url.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/github/github_from_url.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/github/github_from_url_cff_file.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/github/github_from_url_file.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/json_feed_unregistered_url/all_posts.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/normalize_cc_url/not_found.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/normalize_cc_url/with_trailing_slash.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/normalize_cc_url/with_trailing_slash_and_to_https.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/normalize_id/doi.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/normalize_id/doi_as_url.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/normalize_id/filename.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/normalize_id/ftp.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/normalize_id/invalid_url.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/normalize_id/sandbox_via_options.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/normalize_id/sandbox_via_url.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/normalize_id/string.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/normalize_id/url.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/normalize_id/url_with_utf-8.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/normalize_issn/from_array.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/normalize_issn/from_empty_array.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/normalize_issn/from_hash.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/normalize_issn/from_string.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/normalize_url/uri.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/normalize_url/with_trailing_slash.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/normalize_url/with_trailing_slash_and_to_https.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/parse_attributes/array.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/parse_attributes/array_of_strings.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/parse_attributes/first.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/parse_attributes/hash.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/parse_attributes/hash_with_array_value.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/parse_attributes/nil.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/parse_attributes/string.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/random_doi/decode_anothe_doi.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/random_doi/decode_another_doi.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/random_doi/decode_doi.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/random_doi/encode_doi.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/random_id/decode_another_id.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/random_id/decode_id.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/random_id/encode_id.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/sanitize/onlies_keep_specific_tags.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/sanitize/removes_a_tags.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/spdx/hsh_to_spdx_id.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/spdx/hsh_to_spdx_not_found.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/spdx/hsh_to_spdx_url.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/spdx/name_to_spdx_exists.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/spdx/name_to_spdx_id.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/to_schema_org/with_id.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/to_schema_org_identifiers/with_identifiers.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/validate_orcid/validate_orcid.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/validate_orcid/validate_orcid_https.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/validate_orcid/validate_orcid_id.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/validate_orcid/validate_orcid_sandbox.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/validate_orcid/validate_orcid_sandbox_https.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/validate_orcid/validate_orcid_with_spaces.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/validate_orcid/validate_orcid_wrong_id.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/validate_orcid/validate_orcid_www.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/validate_orcid_scheme/validate_orcid_scheme.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/validate_orcid_scheme/validate_orcid_scheme_https.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/validate_orcid_scheme/validate_orcid_scheme_trailing_slash.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/validate_orcid_scheme/validate_orcid_scheme_www.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/validate_url/DOI.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/validate_url/ISSN.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/validate_url/URL.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/validate_url/string.yml +0 -221
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_rdf_xml/BlogPosting.yml +0 -81
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_rdf_xml/BlogPosting_schema_org.yml +0 -530
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_rdf_xml/journal_article.yml +0 -247
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_rdf_xml/maremma.yml +0 -86
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_rdf_xml/with_pages.yml +0 -228
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/Schema_org_JSON_IsSupplementTo.yml +0 -153
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_turtle/BlogPosting_schema_org.yml +0 -530
@@ -23,13 +23,13 @@ http_interactions:
      Cache-Control:
      - public, max-age=0, must-revalidate
      Content-Length:
-     - '
+     - '49530'
      Content-Type:
      - application/json; charset=utf-8
      Date:
-     -
+     - Sun, 18 Jun 2023 15:23:48 GMT
      Etag:
-     - '"
+     - '"xv42bhvvc21253"'
      Server:
      - Vercel
      Strict-Transport-Security:
@@ -39,7 +39,7 @@ http_interactions:
      X-Vercel-Cache:
      - MISS
      X-Vercel-Id:
-     - fra1::iad1::
+     - fra1::iad1::6sjfw-1687101826987-ed7849dc5f1b
      Connection:
      - close
      body:
@@ -49,854 +49,55 @@ http_interactions:
 more ranty and less considered opinions, see my <a href=\"https://twitter.com/rdmpage\">Twitter
 feed</a>.<br>ISSN 2051-8188. Written content on this site is licensed under
 a <a href=\"https://creativecommons.org/licenses/by/4.0/\">Creative Commons
-Attribution 4.0 International license</a>.","language":"en","favicon":null,"feed_url":"https://iphylo.blogspot.com/feeds/posts/default
-Sciences","backlog":true,"prefix":"10.59350","items":[{"id":"https://doi.org/10.59350/
[... removed lines 54-100 not rendered in the source diff view ...]
-your text editor, you shouldn’t be editing data. When the manuscript is published
-the data-generated portions can then be output as text.</p>\n<p>Of course
-all of this assumes that we have taxonomic data in a database (or some other
-storage format, including plain text and Mark-down, e.g. <a href=\"https://iphylo.blogspot.com/2022/04/obsidian-markdown-and-taxonomic-trees.html\">Obsidian,
-markdown, and taxonomic trees</a>) that can generate outputs in the various
-formats that we need.</p>\n<h2 id=\"archiving-data-and-images\">Archiving
-data and images</h2>\n<p>One of the really nice things that <a href=\"http://plazi.org\">Plazi</a>
-do is have a pipeline that sends taxonomic descriptions and images to Zenodo,
-and similar data to GBIF. Any taxonomic journal should be able to do this.
-Indeed, arguably each taxonomic treatment within the paper should be linked
-to the Zenodo DOI at the time of publication. Indeed, we could imagine ultimately
-having treatments as transclusions within the larger manuscript. Alternatively
-we could store the treatments as parts of the larger article (rather like
-chapters in a book), each with a CrossRef DOI. I’m still sceptical about whether
-these treatments are as important as we make out, see <a href=\"https://iphylo.blogspot.com/2022/09/does-anyone-cite-taxonomic-treatments.html\">Does
-anyone cite taxonomic treatments?</a>. But having machine-readable taxonomic
-data archived and accessible is a good thing. Uploading the same data to GBIF
-makes much of that data immediately accessible. Now that GBIF offers <a href=\"https://www.gbif.org/composition/3kQFinjwHbCGZeLb5OhwN2/gbif-hosted-portals\">hosted
-portals</a> there is the possibility of having custom interfaces to data from
-a particular journal.</p>\n<h2 id=\"name-and-identifier-registration\">Name
-and identifier registration</h2>\n<p>We would also want automatic registration
-of new taxonomic names, for which there are pipelines (see “A common registration-to-publication
-automated pipeline for nomenclatural acts for higher plants (International
-Plant Names Index, IPNI), fungi (Index Fungorum, MycoBank) and animals (ZooBank)”
-<a href=\"https://doi.org/10.3897/zookeys.550.9551\">https://doi.org/10.3897/zookeys.550.9551</a>).
-These pipelines do not seem to be documented in much detail, and the data
-formats differ across registration agencies (e.g., IPNI and ZooBank). For
-example, ZooBank seems to require TaxPub XML.</p>\n<p>Registration of names
-and identifiers, especially across multiple registration agencies (ZooBank,
-CrossRef, DataCite, etc.) requires some coordination, especially when one
-registration agency requires identifiers from another.</p>\n<h2 id=\"summary\">Summary</h2>\n<p>If
-data is key, then the taxonomic paper itself becomes something of a wrapper
-around that data. It still serves the function of being human-readable, providing
-broader context for the work, and as an archive that conforms to currently
-accepted ways to publish taxonomic names. But in some ways it is the last
-interesting part of the process.</p>\n<blockquote>\n<p>Written with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/37y2z-gre70","uuid":"f3629c86-06e0-42c0-844a-266b03a91ef1","url":"https://iphylo.blogspot.com/2023/05/ten-years-and-million-links.html","title":"Ten
-years and a million links","summary":"As trailed on a Twitter thread last
-week I’ve been working on a manuscript describing the efforts to map taxonomic
-names to their original descriptions in the taxonomic literature. Putting
-together a manuscript on linking taxonomic names to the primary literature,
-basically “um, what, exactly, have you been doing all these years?”. TL;DR
-Across fungi, plants, and animals approx 1.3 million names have been linked
-to a persistent identifier for a publication.— Roderic Page (@rdmpage) May
-25,...","date_published":"2023-05-31T17:26:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
-(Roderic Page)"}],"image":null,"content_html":"<p>As trailed on a Twitter
-thread last week I’ve been working on a manuscript describing the efforts
-to map taxonomic names to their original descriptions in the taxonomic literature.</p>\n<blockquote
-class=\"twitter-tweet\"><p lang=\"en\" dir=\"ltr\">Putting together a manuscript
-on linking taxonomic names to the primary literature, basically “um, what,
-exactly, have you been doing all these years?”. TL;DR Across fungi, plants,
-and animals approx 1.3 million names have been linked to a persistent identifier
-for a publication.</p>— Roderic Page (@rdmpage) <a href=\"https://twitter.com/rdmpage/status/1661714128413573120?ref_src=twsrc%5Etfw\">May
-25, 2023</a></blockquote> \n<p>The preprint is on bioRxiv <a href=\"https://doi.org/10.1101/2023.05.29.542697\">doi:10.1101/2023.05.29.542697</a></p>\n<blockquote>\n<p>A
-major gap in the biodiversity knowledge graph is a connection between taxonomic
-names and the taxonomic literature. While both names and publications often
-have persistent identifiers (PIDs), such as Life Science Identifiers (LSIDs)
-or Digital Object Identifiers (DOIs), LSIDs for names are rarely linked to
-DOIs for publications. This article describes efforts to make those connections
-across three large taxonomic databases: Index Fungorum, International Plant
-Names Index (IPNI), and the Index of Organism Names (ION). Over a million
-names have been matched to DOIs or other persistent identifiers for taxonomic
-publications. This represents approximately 36% of names for which publication
-data is available. The mappings between LSIDs and publication PIDs are made
-available through ChecklistBank. Applications of this mapping are discussed,
-including a web app to locate the citation of a taxonomic name, and a knowledge
-graph that uses data on researcher’s ORCID ids to connect taxonomic names
-and publications to authors of those names.</p>\n</blockquote>\n<p>Much of
-the work has been linking taxa to names, which still has huge gaps. There
-are also interesting differences in coverage between plants, animals, and
-fungi (see preprint for details).</p>\n\n<div class=\"separator\" style=\"clear:
-both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdWsSQhqi1DErXMIHm28g37-fiALNIsI5eQZmvoX_Fe03ZSwtKHbYt-LCsCCAUop0AGcwy_w7NpIjylVH1hNrM9oW-6j9e6tHASha49TTqFvDg2_tEx3r74RRFsjUo4M_Qat8NmKaZSChOt2hI3LsMjTVLrEVirEckU-9Ei7ug-7OHQlR4LA/s2276/animals-coverage.png\"
-style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
-border=\"0\" width=\"320\" data-original-height=\"2276\" data-original-width=\"2276\"
-src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdWsSQhqi1DErXMIHm28g37-fiALNIsI5eQZmvoX_Fe03ZSwtKHbYt-LCsCCAUop0AGcwy_w7NpIjylVH1hNrM9oW-6j9e6tHASha49TTqFvDg2_tEx3r74RRFsjUo4M_Qat8NmKaZSChOt2hI3LsMjTVLrEVirEckU-9Ei7ug-7OHQlR4LA/s320/animals-coverage.png\"/></a></div><div
-class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdyxlVJ-oyMCNPmHtHWjSxdxMSJvgzdWRGRF6Ad4dk7ab7gGDpuKdKmS9XhROkopw361ylfsTd1ZkwkF6BN0JlWNnVLCKY1AfryCfWKHkgPQM7u-0SELW9j8RlQIflb6ibaV64gwW7oJrEvOGECvR51F8EW8cRg-1usW-GBM5ymObj7zlObQ/s2276/fungi-coverage.png\"
-style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
-border=\"0\" width=\"320\" data-original-height=\"2276\" data-original-width=\"2276\"
-src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdyxlVJ-oyMCNPmHtHWjSxdxMSJvgzdWRGRF6Ad4dk7ab7gGDpuKdKmS9XhROkopw361ylfsTd1ZkwkF6BN0JlWNnVLCKY1AfryCfWKHkgPQM7u-0SELW9j8RlQIflb6ibaV64gwW7oJrEvOGECvR51F8EW8cRg-1usW-GBM5ymObj7zlObQ/s320/fungi-coverage.png\"/></a></div><div
-class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgf0YBuvNSXWAJTfQ1jk4XSocMzCYHP7t6IPUqhjQ3mftgM_850igWaD2copgNH6Xk6T62xBU641wvwOvXgCCDY3m2xC_gaILXO9RGx8H3Gpy5OOncsLb9smpT2LIgtYOExVBVdDRWqA0AZ8-mQjWL7dL5TiG7MqVu8spT8ACoGOPR_T36hRA/s2276/plants-coverage.png\"
-style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
-border=\"0\" width=\"320\" data-original-height=\"2276\" data-original-width=\"2276\"
-src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgf0YBuvNSXWAJTfQ1jk4XSocMzCYHP7t6IPUqhjQ3mftgM_850igWaD2copgNH6Xk6T62xBU641wvwOvXgCCDY3m2xC_gaILXO9RGx8H3Gpy5OOncsLb9smpT2LIgtYOExVBVdDRWqA0AZ8-mQjWL7dL5TiG7MqVu8spT8ACoGOPR_T36hRA/s320/plants-coverage.png\"/></a></div>\n\n\nThere
-is also a simple app to demonstrate these links, see <a href=\"https://species-cite.herokuapp.com\">https://species-cite.herokuapp.com</a>.\n\n\n\n<blockquote>\n<p>Written
-with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/92rdb-5fe58","uuid":"d33d4f49-b281-4997-9eb9-dbad1e52d9bd","url":"https://iphylo.blogspot.com/2022/09/local-global-identifiers-for.html","title":"Local
-global identifiers for decentralised wikis","summary":"I''ve been thinking
-a bit about how one could use a Markdown wiki-like tool such as Obsidian to
-work with taxonomic data (see earlier posts Obsidian, markdown, and taxonomic
-trees and Personal knowledge graphs: Obsidian, Roam, Wikidata, and Xanadu).
-One \"gotcha\" would be how to name pages. If we treat the database as entirely
-local, then the page names don''t matter, but what if we envisage sharing
-the database, or merging it with others (for example, if we divided a taxon
-up into chunks, and...","date_published":"2022-09-08T16:09:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
-(Roderic Page)"}],"image":null,"content_html":"<p>I''ve been thinking a bit
-about how one could use a Markdown wiki-like tool such as Obsidian to work
-with taxonomic data (see earlier posts <a href=\"https://iphylo.blogspot.com/2022/04/obsidian-markdown-and-taxonomic-trees.html\">Obsidian,
-markdown, and taxonomic trees</a> and <a href=\"https://iphylo.blogspot.com/2020/08/personal-knowledge-graphs-obsidian-roam.html\">Personal
-knowledge graphs: Obsidian, Roam, Wikidata, and Xanadu</a>).</p>\n\n<p>One
-\"gotcha\" would be how to name pages. If we treat the database as entirely
-local, then the page names don''t matter, but what if we envisage sharing
-the database, or merging it with others (for example, if we divided a taxon
-up into chunks, and different people worked on those different chunks)? </p>\n\n<p>This
-is the attraction of globally unique identifiers. You and I can independently
-work on the same thing, such as data linked to scientific paper, safe in the
-knowledge that if we both use the DOI for that paper we can easily combine
-what we''ve done. But global identifiers can also be a pain, especially if
-we need to use a service to look them up (\"is there a DOI for this paper?\",
-\"what is the LSID for this taxonomic name?\").</p>\n\n<p>Life would be easier
-if we could generate identifiers \"locally\", but had some assurance that
-they would be globally unique, and that anyone else generating an identifier
-for the same thing would arrive at the same identifier (this eliminates things
-such as <a href=\"https://en.wikipedia.org/wiki/Universally_unique_identifier\">UUIDs</a>
-which are intentionally designed to prvent people genrrating the same identifier).
-One approach is \"content addressing\" (see, e.g. <a href=\"https://web.archive.org/web/20210514054054/https://bentrask.com/notes/content-addressing.html\">Principles
-of Content Addressing</a> - dead link but in the Wayabck Machine, see also
-<a href=\"https://github.com/btrask/stronglink\">btrask/stronglink</a>). For
-example, we can generate a cryptographic hash of a file (such as a PDF) and
-use that as the identifier.</p>\n\n<p>Now the problem is that we have globally
-unique, but ugly and unfriendly identifiers (such as \"6c98136eba9084ea9a5fc0b7693fed8648014505\").
-What we need are nice, easy to use identifiers we can use as page names. <a
-href=\"https://species.wikimedia.org/wiki/Main_Page\">Wikispecies</a> serves
-as a possible role model, where taxon names serve as page names, as do simplified
-citations (e.g., authors and years). This model runs into the problem that
-taxon names aren''t unique, nor are author + year combinations. In Wikispecies
-this is resolved by having a centralised database where it''s first come,
-first served. If there is a name clash you have to create a new name for your
-page. This works, but what if you have multiple databases un by different
-people? How do we ensure the identifiers are the same?</p>\n\n<p>Then I remembered
-Roger Hyam''s flight of fantasy over a decade ago: <a href=\"http://www.hyam.net/blog/archives/1007\">SpeciesIndex.org
-– an impractical, practical solution</a>. He proposed the following rules
-to generate a unique URI for a taxonomic name:\n\n<ul>\n  <li>The URI must
-start with \"http://speciesindex.org\" followed by one or more of the following
-separated by slashes.</li>\n\n  <li>First word of name. Must only contain
-letters. Must not be the same as one of the names of the nomenclatural codes
-(icbn or iczn). Optional but highly recommended.</li> \n\n  <li>Second word
-of name. Must only contain letters and not be a nomenclatural code name. Optional.</li>
-\n\n  <li>Third word of name. Must only contain letters and not be a nomenclatural
-code name. Optional.</li> \n\n  <li>Year of publication. Must be an integer
-greater than 1650 and equal to or less than the current year. If this is an
-ICZN name then this should be the year the species (epithet) was published
-as is commonly cited after the name. If this is an ICBN name at species or
-below then it is the date of the combination. Optional. Recommended for zoological
-names if known. Not recommended for botanical names unless there is a known
-problem with homonyms in use by non-taxonomists.</li>\n \n<li>Nomenclatural
-code governing the name of the taxon. Currently this must be either ''icbn''
-or ''iczn''. This may be omitted if the code is unknown or not relevant. Other
-codes may be added to this list.</li> \n  <li>Qualifier This must be a Version
-4 RFC-4122 UUID. Optional. Used to generate a new independent identifier for
-a taxon for which the conventional name is unknown or does not exist or to
-indicate a particular taxon concept that bears the embedded name.</li>\n\n  <li>The
-whole speciesindex.org URI string should be considered case\nsensitive. Everything
-should be lower case apart from the first letter of words that are specified
-as having upper case in their relevant codes e.g. names at and above the rank
-of genus.</li>\n</ul>\n</p>\n\n<p>Roger is basically arging that while names
-aren''t unique (i.e., we have homonyms such as <i>Abronia</i>) they are pretty
-close to being so, and with a few tweaks we can come up with a unique representation.
-Another way to think about this if we had a database of all taxonomics, we
-could construct a <a href=\"https://en.wikipedia.org/wiki/Trie\">trie</a>
-and for each name find the shortest set of name parts (genus, species, etc),
-year, and code that gave us a unique string for that name. In many cases the
-species name may be all we need, in other cases we may need to add year and/or
-nomenclatural code to arrive at a unique string. \n\n</p>\n\n<p>What about
-bibliographic references? Well many of us will have databases (e.g., Endnote,
-Mendeley, Zotero, etc.) which generate \"cite keys\". These are typically
-short, memorable identifiers for a reference that are unique within that database.
-There is an interesting discussion on the <a href=\"https://discourse.jabref.org/t/universal-citekey-generator/2441/2\">JabRef
-forum</a> regarding a \"Universal Citekey Generator\", and source code is
-available <a href=\"https://github.com/cparnot/universal-citekey-js\">cparnot/universal-citekey-js</a>.
-I''ve yet to explore this in detail, but it looks a promising way to generate
-unique identifiers from basic metadata (echos of more elaborate schemes such
-as <a href=\"https://en.wikipedia.org/wiki/Serial_Item_and_Contribution_Identifier\">SICIs</a>).
-For example,\n\n<blockquote>Senna AR, Guedes UN, Andrade LF, Pereira-Filho
-GH. 2021. A new species of amphipod Pariphinotus Kunkel, 1910 (Amphipoda:
-Phliantidae) from Southwestern Atlantic. Zool Stud 60:57. doi:10.6620/ZS.2021.60-57.</blockquote>\n\nbecomes
-\"Senna:2021ck\". So if two people have the same, core, metadata for a paper
-they can generate the same key.</p>\n\n<p>Hence it seems with a few conventions
-(and maybe some simple tools to support them) we could have decentralised
-wiki-like tools that used the same identifiers for the same things, and yet
-those identfiiers were short and human-friendly.</p>","tags":["citekey","identfiiers","markdown","obsidian","Roger
-Hyam"],"language":"en","references":null},{"id":"https://doi.org/10.59350/j77nc-e8x98","uuid":"c6b101f4-bfbc-4d01-921d-805c43c85757","url":"https://iphylo.blogspot.com/2022/08/linking-taxonomic-names-to-literature.html","title":"Linking
-taxonomic names to the literature","summary":"Just some thoughts as I work
-through some datasets linking taxonomic names to the literature. In the diagram
-above I''ve tried to capture the different situatios I encounter. Much of
-the work I''ve done on this has focussed on case 1 in the diagram: I want
-to link a taxonomic name to an identifier for the work in which that name
-was published. In practise this means linking names to DOIs. This has the
-advantage of linking to a citable indentifier, raising questions such as whether
-citations...","date_published":"2022-08-22T17:19:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
-(Roderic Page)"}],"image":null,"content_html":"Just some thoughts as I work
-through some datasets linking taxonomic names to the literature.\n\n<div class=\"separator\"
-style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5ZkbnumEWKZpf0isei_hUJucFlOOK08-SKnJknD8B4qhAX36u-vMQRnZdhRJK5rPb7HNYcnqB7qE4agbeStqyzMkWHrzUj14gPkz2ohmbOVg8P_nHo0hM94PD1wH3SPJsaLAumN8vih3ch9pjH2RaVWZBLwwhGhNu0FS1m5z6j5xt2NeZ4w/s2140/linking%20to%20names144.jpg\"
-style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
-border=\"0\" height=\"600\" data-original-height=\"2140\" data-original-width=\"1604\"
-src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5ZkbnumEWKZpf0isei_hUJucFlOOK08-SKnJknD8B4qhAX36u-vMQRnZdhRJK5rPb7HNYcnqB7qE4agbeStqyzMkWHrzUj14gPkz2ohmbOVg8P_nHo0hM94PD1wH3SPJsaLAumN8vih3ch9pjH2RaVWZBLwwhGhNu0FS1m5z6j5xt2NeZ4w/s600/linking%20to%20names144.jpg\"/></a></div>\n\n<p>In
-the diagram above I''ve tried to capture the different situatios I encounter.
-Much of the work I''ve done on this has focussed on case 1 in the diagram:
-I want to link a taxonomic name to an identifier for the work in which that
-name was published. In practise this means linking names to DOIs. This has
-the advantage of linking to a citable indentifier, raising questions such
-as whether citations of taxonmic papers by taxonomic databases could become
-part of a <a href=\"https://iphylo.blogspot.com/2022/08/papers-citing-data-that-cite-papers.html\">taxonomist''s
-Google Scholar profile</a>.</p>\n\n<p>In many taxonomic databases full work-level
-citations are not the norm, instead taxonomists cite one or more pages within
-a work that are relevant to a taxonomic name. These \"microcitations\" (what
-the U.S. legal profession refer to as \"point citations\" or \"pincites\", see
-<a href=\"https://rasmussen.libanswers.com/faq/283203\">What are pincites,
-pinpoints, or jump legal references?</a>) require some work to map to the
-work itself (which is typically the thing that has a citatble identifier such
-as a DOI).</p>\n\n<p>Microcitations (case 2 in the diagram above) can be quite
-complex. Some might simply mention a single page, but others might list a
-series of (not necessarily contiguous) pages, as well as figures, plates etc.
-Converting these to citable identifiers can be tricky, especially as in most
-cases we don''t have page-level identifiers. The Biodiversity Heritage Library
-(BHL) does have URLs for each scanned page, and we have a standard for referring
-to pages in a PDF (<code>page=<pageNum></code>, see <a href=\"https://datatracker.ietf.org/doc/html/rfc8118\">RFC
-8118</a>). But how do we refer to a set of pages? Do we pick the first page?
-Do we try and represent a set of pages, and if so, how?</p>\n\n<p>Another
-issue with page-level identifiers is that not everything on a given page may
-be relevant to the taxonomic name. In case 2 above I''ve shaded in the parts
-of the pages and figure that refer to the taxonomic name. An example where
-this can be problematic is the recent test case I created for BHL where a
-page image was included for the taxonomic name <a href=\"https://www.gbif.org/species/195763322\"><i>Aphrophora
-impressa</i></a>. The image includes the species description and a illustration,
-as well as text that relates to other species.</p>\n\n<div class=\"separator\"
-style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1SAv1XiVBPHXeMLPfdsnTh4Sj4m0AQyqjXM0faXyvbNhBXFDl7SawjFIkMo3qz4RQhDEyZXkKAh5nF4gtfHbVXA6cK3NJ46CuIWECC6HmJKtTjoZ0M3QQXvLY_X-2-RecjBqEy68M0cEr0ph3l6KY51kA9BvGt9d4id314P71PBitpWMATg/s3467/https---www.biodiversitylibrary.org-pageimage-29138463.jpeg\"
-style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
-border=\"0\" height=\"400\" data-original-height=\"3467\" data-original-width=\"2106\"
-src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1SAv1XiVBPHXeMLPfdsnTh4Sj4m0AQyqjXM0faXyvbNhBXFDl7SawjFIkMo3qz4RQhDEyZXkKAh5nF4gtfHbVXA6cK3NJ46CuIWECC6HmJKtTjoZ0M3QQXvLY_X-2-RecjBqEy68M0cEr0ph3l6KY51kA9BvGt9d4id314P71PBitpWMATg/s400/https---www.biodiversitylibrary.org-pageimage-29138463.jpeg\"/></a></div>\n\n<p>Given
-that not everything on a page need be relevant, we could extract just the
-relevant blocks of text and illustrations (e.g., paragraphs of text, panels
-within a figure, etc.) and treat that set of elements as the thing to cite.
-This is, of course, what <a href=\"http://plazi.org\">Plazi</a> are doing.
-The set of extracted blocks is glued together as a \"treatment\", assigned
-an identifier (often a DOI), and treated as a citable unit. It would be interesting
-to see to what extent these treatments are actually cited, for example, do
-subsequent revisions that cite work that include treatments cite those treatments,
-or just the work itself? Put another way, are we creating <a href=\"https://iphylo.blogspot.com/2012/09/decoding-nature-encode-ipad-app-omg-it.html\">\"threads\"</a>
-between taxonomic revisions?</p>\n\n<p>One reason for these notes is that
-I''m exploring uploading taxonomic name - literature links to <a href=\"https://www.checklistbank.org\">ChecklistBank</a>
-and case 1 above is easy, as is case 3 (if we have treatment-level identifiers).
-But case 2 is problematic because we are linking to a set of things that may
-not have an identifier, which means a decision has to be made about which
-page to link to, and how to refer to that page.</p>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/w18j9-v7j10","uuid":"d811172e-7798-403c-a83d-3d5317a9657e","url":"https://iphylo.blogspot.com/2022/08/papers-citing-data-that-cite-papers.html","title":"Papers
-citing data that cite papers: CrossRef, DataCite, and the Catalogue of Life","summary":"Quick
-notes to self following on from a conversation about linking taxonomic names
-to the literature. Is there a way to turn those links into countable citations
-(even if just one per database) for Google Scholar?— Wayne Maddison
-(@WayneMaddison) August 3, 2022 There are different sorts of citation: Paper
-cites another paper Paper cites a dataset Dataset cites a paper Citation
-type (1) is largely a solved problem (although there are issues of the ownership
-and use of this...","date_published":"2022-08-03T11:33:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
-(Roderic Page)"}],"image":null,"content_html":"Quick notes to self following
-on from a conversation about linking taxonomic names to the literature.\n\n<blockquote
-class=\"twitter-tweet\"><p lang=\"en\" dir=\"ltr\">Is there a way to turn
-those links into countable citations (even if just one per database) for Google
-Scholar?</p>— Wayne Maddison (@WayneMaddison) <a href=\"https://twitter.com/WayneMaddison/status/1554644747406348288?ref_src=twsrc%5Etfw\">August
-3, 2022</a></blockquote> <script async src=\"https://platform.twitter.com/widgets.js\"
-charset=\"utf-8\"></script>\n\nThere are different sorts of citation:\n\n<ol>\n  <li>Paper
-cites another paper</li>\n  <li>Paper cites a dataset</li>\n  <li>Dataset
-cites a paper</li>\n</ol>\n\nCitation type (1) is largely a solved problem
-(although there are issues of the ownership and use of this data, see e.g.
-<a href=\"https://iphylo.blogspot.com/2020/07/zootaxa-has-no-impact-factor.html\">Zootaxa
-has no impact factor</a>.\n\nCitation type (2) is becoming more widespread
-(but not perfect as GBIF''s <a href=\"https://twitter.com/search?q=%23citethedoi&src=typed_query\">#citethedoi</a>
-campaign demonstrates. But the idea is well accepted and there are guides
-to how to do it, e.g.:\n\n<blockquote>\nCousijn, H., Kenall, A., Ganley, E.
-et al. A data citation roadmap for scientific publishers. Sci Data 5, 180259
-(2018). <a href=\"https://doi.org/10.1038/sdata.2018.259\">https://doi.org/10.1038/sdata.2018.259</a>\n</blockquote>\n\nHowever,
-things do get problematic because most (but not all) DOIs for publications
-are managed by CrossRef, which has an extensive citation database linking
-papers to other paopers. Most datasets have DataCite DOIs, and DataCite manages
-its own citations links, but as far as I''m aware these two systems don''t
-really taklk to each other.\n\nCitation type (3) is the case where a database
-is largely based on the literature, which applies to taxonomy. Taxonomic databases
-are essentially collections of literature that have opinions on taxa, and
-the database may simply compile those (e.g., a nomenclator), or come to some
-view on the applicability of each name. In an ideal would, each reference
-included in a taxonomic database would gain a citation, which would help better
-reflect the value of that work (a long standing bone of contention for taxonomists).\n\nIt
-would be interesting to explore these issues further. CrossRef and DataCite
-do share <a href=\"https://www.crossref.org/services/event-data/\">Event Data</a>
-(see also <a href=\"https://support.datacite.org/docs/eventdata-guide\">DataCite
-Event Data</a>). Can this track citations of papers by a dataset?\n \n \nMy
-take on Wayne''s question:\n\n<blockquote>\n  Is there a way to turn those
-links into countable citations (even if just one per database) for Google
-Scholar?\n</blockquote>\n\nis that he''s is after type 3 citations, which
-I don''t think we have a way to handle just yet (but I''d need to look at
-Event Data a bit more). Google Scholar is a black box, and the academic coimmunity''s
-reliance on it for metrics is troubling. But it would be interetsing to try
-and figure out if there is a way to get Google Scholar to index the citations
-of taxonomic papers by databases. For instance, the <a href=\"https://www.catalogueoflife.org/\">Catalogue
-of Life</a> has an ISSN <a href=\"https://portal.issn.org/resource/ISSN/2405-884X\">2405-884X</a>
-so it can be treated as a publication. At the moment its web pages have lots
-of identifiers for people managing data and their organisations (lots of <a
-href=\"https://orcid.org\">ORCIDs</a> and <a href=\"https://ror.org\">RORs</a>,
-and DOIs for individual datasets (e.g., <a href=\"https://www.checklistbank.org/dataset/9828/about\">checklistbank.org</a>)
-but precious little in the way of DOIs for publications (or, indeed, ORCIDs
-for taxonomists). What would it take for taxonomic publications in the Catalogue
-of Life to be treated as first class citations?","tags":["Catalogue of Life","citation","CrossRef","DataCite","DOI"],"language":"en","references":null},{"id":"https://doi.org/10.59350/ws094-1w310","uuid":"6bed78ec-0029-4096-b1c3-48a55a9fdb3b","url":"https://iphylo.blogspot.com/2023/04/chatgpt-of-course.html","title":"ChatGPT,
-of course","summary":"I haven’t blogged for a while, work and other reasons
-have meant I’ve not had much time to think, and mostly I blog to help me think.
-ChatGPT is obviously a big thing at the moment, and once we get past the moral
-panic (“students can pass exams using AI!”) there are a lot of interesting
-possibilities to explore. Inspired by essays such as How Q&A systems based
-on large language models (eg GPT4) will change things if they become the dominant
-search paradigm — 9 implications for libraries...","date_published":"2023-04-03T12:52:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
-(Roderic Page)"}],"image":null,"content_html":"<p>I haven’t blogged for a
-while, work and <a href=\"https://iphylo.blogspot.com/2023/03/dugald-stuart-page-1936-2022.html\">other
-reasons</a> have meant I’ve not had much time to think, and mostly I blog
-to help me think.</p>\n<p>ChatGPT is obviously a big thing at the moment,
-and once we get past the moral panic (“students can pass exams using AI!”)
-there are a lot of interesting possibilities to explore. Inspired by essays
-such as <a href=\"https://medium.com/@aarontay/how-q-a-systems-based-on-large-language-models-eg-gpt4-will-change-things-if-they-become-the-norm-c7cf62736ba\">How
-Q&A systems based on large language models (eg GPT4) will change things
-if they become the dominant search paradigm — 9 implications for libraries</a>
-and <a href=\"https://about.sourcegraph.com/blog/cheating-is-all-you-need\">Cheating
-is All You Need</a>, as well as [<a href=\"https://paul-graham-gpt.vercel.app/\">Paul
-Graham GPT</a>](<a href=\"https://paul-graham-gpt.vercel.app\">https://paul-graham-gpt.vercel.app</a>)
-I thought I’d try a few things and see where this goes.</p>\n<p>ChatGPT can
-do some surprising things.</p>\n<h4 id=\"parse-bibliographic-data\">Parse
|
420
|
-
bibliographic data</h4>\n<p>I spend a LOT of time working with bibliographic
|
421
|
-
data, trying to parse it into structured data. ChatGPT can do this:</p>\n\n<div
|
422
|
-
class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2bA-Xk6D2wfexP0cs-U3AQa534hNW7J06I0avM82WD3dZSiRvXyHFScv3Hb8QfKKp1fk4GU4olhv1BOq8Pqu4IJgptlBap3xcLI1bFJwB3EOmHKxqf4iy2exJy4vwZ2n4I0U0JVui4Xmhvdy3mTbMSXCRbmCglNFULi5oQEhtPVG4gLJjdw/s924/Screenshot%202023-04-03%20at%2012.59.30.png\"
|
423
|
-
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
|
424
|
-
border=\"0\" height=\"400\" data-original-height=\"924\" data-original-width=\"738\"
|
425
|
-
src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2bA-Xk6D2wfexP0cs-U3AQa534hNW7J06I0avM82WD3dZSiRvXyHFScv3Hb8QfKKp1fk4GU4olhv1BOq8Pqu4IJgptlBap3xcLI1bFJwB3EOmHKxqf4iy2exJy4vwZ2n4I0U0JVui4Xmhvdy3mTbMSXCRbmCglNFULi5oQEhtPVG4gLJjdw/s400/Screenshot%202023-04-03%20at%2012.59.30.png\"/></a></div>\n\n<p>Note
|
426
|
-
that it does more than simply parse the strings, it expands journal abbreviations
|
427
|
-
such as “J. Malay Brch. R. Asiat. Soc.” to the full name “Journal of the Malayan
|
428
|
-
Branch of the Royal Asiatic Society”. So we can get clean, parsed data in
|
429
|
-
a range of formats.</p>\n<h4 id=\"parse-specimens\">Parse specimens</h4>\n<p>Based
|
430
|
-
on the success with parsing bibliographic strings I wondered how well it could
|
431
|
-
handle citation software specimens (“material examined”). Elsewhere I’ve been
|
432
|
-
critical of Plazi’s ability to do this, see <a href=\"https://iphylo.blogspot.com/2021/10/problems-with-plazi-parsing-how.html\">Problems
|
433
|
-
with Plazi parsing: how reliable are automated methods for extracting specimens
|
434
|
-
from the literature?</a>.</p>\n<p>For example, given this specimen record
|
435
|
-
on p. 130 of <a href=\"https://doi.org/10.5852/ejt.2021.775.1553\">doi:10.5852/ejt.2021.775.1553</a></p>\n<blockquote>\n<p>LAOS
|
436
|
-
• Kammoune Province, Bunghona Market, 7 km Nof Xe Bangfai River;<br>\n17.13674°
|
437
|
-
N, 104.98591° E; E. Jeratthitikul, K. Wisittikoson, A. Fanka, N. Wutthituntisil
|
438
|
-
and P. Prasankok leg.; sold by local people;<br>\nMUMNH-UNI2831.</p>\n</blockquote>\n<p>ChatGPT
|
439
|
-
extracted a plausible Darwin Core record:</p>\n\n<div class=\"separator\"
|
440
|
-
style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgIwVYuPXl47t1uSpJ3xpbsIT60UUWeKTKSBpIQgD8L2gkj_fuNf4sZKR76JgsfVYYHPs9G9VoM-caUhG1P3irMyB6erSYwYI5LDlLkxdrAFWdbY3ExDs34nMVOF0tq-NBtsDKMZsSJWjhJaGt9v3dtAqWvRyVMNZ46eG2sRhJTICTkFghKtw/s901/Screenshot%202023-04-03%20at%2013.30.54.png\"
|
441
|
-
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
|
442
|
-
border=\"0\" height=\"400\" data-original-height=\"901\" data-original-width=\"764\"
|
443
|
-
src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgIwVYuPXl47t1uSpJ3xpbsIT60UUWeKTKSBpIQgD8L2gkj_fuNf4sZKR76JgsfVYYHPs9G9VoM-caUhG1P3irMyB6erSYwYI5LDlLkxdrAFWdbY3ExDs34nMVOF0tq-NBtsDKMZsSJWjhJaGt9v3dtAqWvRyVMNZ46eG2sRhJTICTkFghKtw/s400/Screenshot%202023-04-03%20at%2013.30.54.png\"/></a></div>\n\n<p>I’ve
|
444
|
-
been experimenting with parsing specimen records using the same machine learning
|
445
|
-
approach for bibliographic data (e.g., <a href=\"https://iphylo.blogspot.com/2021/07/citation-parsing-released.html\">Citation
|
446
|
-
parsing tool released</a>), perhaps it’s time to rethink that idea.</p>\n<h4
|
447
|
-
id=\"correct-ocr-errors\">Correct OCR errors</h4>\n<p>OCR errors are a well
|
448
|
-
known limitation of the Biodiversity Heritage Library (BHL), see for example
|
449
|
-
<a href=\"https://blog.biodiversitylibrary.org/2022/07/ocr-improvements-early-analysis.html\">OCR
|
450
|
-
Improvements: An Early Analysis</a>. Given that ChatGPT is built on top of
|
451
|
-
a large language model it seems obvious to see how it handles OCR errors.
|
452
|
-
I tried the following snippet of text from <a href=\"https://www.biodiversitylibrary.org/page/4312227\">BHL
|
453
|
-
page 4312227</a>:</p>\n<blockquote>\n<p>i BREVIORA No. 185</p>\n<p>Jiiaii
|
454
|
-
Jr., 16, who always join him and Hedriek in th(> search for<br>\nfrogs.
|
455
|
-
Iloraeio Mayorga, research assistant to tlie author, has also<br>\nbeen (wtreniely
|
456
|
-
useful, not only in field work, but also in the<br>\ntedious curatorial and
|
457
|
-
laboratory work that usually fol- lows. Dr.<br>\nJohn Randall kindly took
|
458
|
-
tlie photouraplis of E. hedricki, and the<br>\nGaliiiaues family of San Juan
|
459
|
-
made their El Verde house available as<br>\na (MMiter of activities for the
|
460
|
-
colleetiuff party. To all these<br>\njx’opic, the autlioi- feels (iee])ly
|
461
|
-
indebted.</p>\n<p>Elet’tiierodactylus iiKnRrcKi s]). n.</p>\n<p>Type. Museum
|
462
|
-
of Coniparative Zoology Xo. ;}()!)()8, c^ from El Verde,<br>\nwest flank
|
463
|
-
of El Yuiuinc Puerto Hico. I.IOO ft. Coll. J. A. Kivero,<br>\n11 Aug-. I!)(i2.</p>\n</blockquote>\n<p>ChatGPT
|
464
|
-
returns this:</p>\n<blockquote>\n<p>i BREVIORA No. 185</p>\n<p>Juan Jr., 16,
|
465
|
-
who always joins him and Hedrick in the search for<br>\nfrogs. Horacio Mayorga,
|
466
|
-
research assistant to the author, has also<br>\nbeen extremely useful, not
|
467
|
-
only in field work, but also in the tedious<br>\ncuratorial and laboratory
|
468
|
-
work that usually follows. Dr. John Randall<br>\nkindly took the photographs
|
469
|
-
of E. hedricki, and the Galinanes family<br>\nof San Juan made their El Verde
|
470
|
-
house available as a center of<br>\nactivities for the collecting party. To
|
471
|
-
all these people, the author<br>\nfeels deeply indebted.</p>\n<p>Eleutherodactylus
|
472
|
-
hedricki sp. n.</p>\n<p>Type. Museum of Comparative Zoology No. 30908, c?
|
473
|
-
from El Verde, west<br>\nflank of El Yunque, Puerto Rico. 1100 ft. Coll. J.
|
474
|
-
A. Rivero, 11 Aug.<br>\n1962.</p>\n</blockquote>\n\n<div class=\"separator\"
|
475
|
-
style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibZLxXSTdt64SMhfqS8NCcLOxbVaFmkhMQx0yEUWcm696EL6k2_VgxZFEDT-SEqOChmwDggoQLkuU6DpcOF_d-nkQRKzQLt65WKfkFlIzDRV9LVrDqmu-D0OZl1p5Vva4QUdvDb3JpeKPPDXZ-BcZqq8V4PyGepxlnwtt7-q4Sg6_aHD58gw/s859/Screenshot%202023-04-03%20at%2013.07.53.png\"
|
476
|
-
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
|
477
|
-
border=\"0\" width=\"400\" data-original-height=\"510\" data-original-width=\"859\"
|
478
|
-
src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibZLxXSTdt64SMhfqS8NCcLOxbVaFmkhMQx0yEUWcm696EL6k2_VgxZFEDT-SEqOChmwDggoQLkuU6DpcOF_d-nkQRKzQLt65WKfkFlIzDRV9LVrDqmu-D0OZl1p5Vva4QUdvDb3JpeKPPDXZ-BcZqq8V4PyGepxlnwtt7-q4Sg6_aHD58gw/s400/Screenshot%202023-04-03%20at%2013.07.53.png\"/></a></div>\n\n<p>Comparing
|
479
|
-
this to the scanned image ChatGPT it does pretty well, for example the gobbledegook
|
480
|
-
“Elet’tiierodactylus iiKnRrcKi” is correctly translated as “Eleutherodactylus
|
481
|
-
hedricki”. Running all of BHL through ChatGPT probably isn’t feasible, but
|
482
|
-
one could imagine targeted cleaning of key papers.</p>\n<h4 id=\"summary\">Summary</h4>\n<p>These
|
483
|
-
small experiments are fairly trivial, but they are the sort of tedious tasks
|
484
|
-
that would otherwise require significant programming (or other resources)
|
485
|
-
to solve. But ChatGPT can do rather more, as I hope to discuss in the next
|
486
|
-
post.</p>\n<blockquote>\n<p>Written with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/7esgr-61v1","uuid":"96fa91d5-459c-482f-aa38-dda6e0a30e20","url":"https://iphylo.blogspot.com/2022/01/large-graph-viewer-experiments.html","title":"Large
|
487
|
-
graph viewer experiments","summary":"I keep returning to the problem of viewing
|
488
|
-
large graphs and trees, which means my hard drive has accumulated lots of
|
489
|
-
failed prototypes. Inspired by some recent discussions on comparing taxonomic
|
490
|
-
classifications I decided to package one of these (wildly incomplete) prototypes
|
491
|
-
up so that I can document the idea and put the code somewhere safe. Very cool,
|
492
|
-
thanks for sharing this-- the tree diff is similar to what J Rees has been
|
493
|
-
cooking up lately with his 'cl diff' tool. I'll tag...","date_published":"2022-01-02T11:25:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
|
494
|
-
(Roderic Page)"}],"image":null,"content_html":"<p>I keep returning to the
|
495
|
-
problem of viewing large graphs and trees, which means my hard drive has accumulated
|
496
|
-
lots of failed prototypes. Inspired by some recent discussions on comparing
|
497
|
-
taxonomic classifications I decided to package one of these (wildly incomplete)
|
498
|
-
prototypes up so that I can document the idea and put the code somewhere safe.</p>\n\n<blockquote
|
499
|
-
class=\"twitter-tweet\"><p lang=\"en\" dir=\"ltr\">Very cool, thanks for sharing
|
500
|
-
this-- the tree diff is similar to what J Rees has been cooking up lately
|
501
|
-
with his 'cl diff' tool. I'll tag <a href=\"https://twitter.com/beckettws?ref_src=twsrc%5Etfw\">@beckettws</a>
|
502
|
-
in here too so he can see potential crossover. The goal is autogenerate diffs
|
503
|
-
like this as 1st step to mapping taxo name-to concept</p>— Nate Upham
|
504
|
-
(@n8_upham) <a href=\"https://twitter.com/n8_upham/status/1475834371131289608?ref_src=twsrc%5Etfw\">December
|
505
|
-
28, 2021</a></blockquote> <script async src=\"https://platform.twitter.com/widgets.js\"
|
506
|
-
charset=\"utf-8\"></script>\n\n<h2>Google Maps-like viewer</h2>\n\n<div class=\"separator\"
|
507
|
-
style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/a/AVvXsEiGDVVKGmwRi0gn-kEtpxBCUEdK-WCLBSpmkALAUDHGeTPkTSICSYVgnsj5N7zUeUfQALfFFHJJCsfeFvRULKbmqxLz51rW5hp_11dVXh-FHrnlRA7RJTA7I82l7sERF5jAjlah0LyEheVayO9nAfHTGZDuw5rnCe9iEO3dHQmA8_5AFIlJJg=s500\"
|
508
|
-
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
|
509
|
-
border=\"0\" width=\"400\" data-original-height=\"448\" data-original-width=\"500\"
|
510
|
-
src=\"https://blogger.googleusercontent.com/img/a/AVvXsEiGDVVKGmwRi0gn-kEtpxBCUEdK-WCLBSpmkALAUDHGeTPkTSICSYVgnsj5N7zUeUfQALfFFHJJCsfeFvRULKbmqxLz51rW5hp_11dVXh-FHrnlRA7RJTA7I82l7sERF5jAjlah0LyEheVayO9nAfHTGZDuw5rnCe9iEO3dHQmA8_5AFIlJJg=s400\"/></a></div>\n\n<p>I''ve
|
511
|
-
created a simple viewer that uses a tiled map viewer (like Google Maps) to
|
512
|
-
display a large graph. The idea is to draw the entire graph scaled to a 256
|
513
|
-
x 256 pixel tile. The graph is stored in a database that supports geospatial
|
514
|
-
queries, which means the queries to retrieve the individual tiles need to
|
515
|
-
display the graph at different levels of resolution are simply bounding box
|
516
|
-
queries to a database. I realise that this description is cryptic at best.
|
517
|
-
The GitHub repository <a href=\"https://github.com/rdmpage/gml-viewer\">https://github.com/rdmpage/gml-viewer</a>
|
518
|
-
has more details and the code itself. There''s a lot to do, especially adding
|
519
|
-
support for labels(!) which presents some interesting challenges (<a href=\"https://en.wikipedia.org/wiki/Level_of_detail_(computer_graphics)\">levels
|
520
|
-
of detail</a> and <a href=\"https://en.wikipedia.org/wiki/Cartographic_generalization\">generalization</a>).
|
521
|
-
The code doesn''t do any layout of the graph itself, instead I''ve used the
|
522
|
-
<a href=\"https://www.yworks.com/products/yed\">yEd</a> tool to compute the
|
523
|
-
x,y coordinates of the graph.</p>\n\n<p>Since this exercise was inspired by
|
524
|
-
a discussion of the <a href=\"https://www.mammaldiversity.org\">ASM Mammal
|
525
|
-
Diversity Database</a>, the graph I''ve used for the demonstration above is
|
526
|
-
the ASM classification of extant mammals. I guess I need to solve the labelling
|
527
|
-
issue fairly quickly!</p>","tags":["Google Maps","graph","Mammal Species of
|
528
|
-
the World","mammals","taxonomy"],"language":"en","references":null},{"id":"https://doi.org/10.59350/m48f7-c2128","uuid":"8aea47e4-f227-45f4-b37b-0454a8a7a3ff","url":"https://iphylo.blogspot.com/2023/04/chatgpt-semantic-search-and-knowledge.html","title":"ChatGPT,
|
529
|
-
semantic search, and knowledge graphs","summary":"One thing about ChatGPT
|
530
|
-
is it has opened my eyes to some concepts I was dimly aware of but am only
|
531
|
-
now beginning to fully appreciate. ChatGPT enables you ask it questions, but
|
532
|
-
the answers depend on what ChatGPT “knows”. As several people have noted,
|
533
|
-
what would be even better is to be able to run ChatGPT on your own content.
|
534
|
-
Indeed, ChatGPT itself now supports this using plugins. Paul Graham GPT However,
|
535
|
-
it’s still useful to see how to add ChatGPT functionality to your own content
|
536
|
-
from...","date_published":"2023-04-03T15:30:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
|
537
|
-
(Roderic Page)"}],"image":null,"content_html":"<p>One thing about ChatGPT
|
538
|
-
is it has opened my eyes to some concepts I was dimly aware of but am only
|
539
|
-
now beginning to fully appreciate. ChatGPT enables you ask it questions, but
|
540
|
-
the answers depend on what ChatGPT “knows”. As several people have noted,
|
541
|
-
what would be even better is to be able to run ChatGPT on your own content.
|
542
|
-
Indeed, ChatGPT itself now supports this using <a href=\"https://openai.com/blog/chatgpt-plugins\">plugins</a>.</p>\n<h4
|
543
|
-
id=\"paul-graham-gpt\">Paul Graham GPT</h4>\n<p>However, it’s still useful
|
544
|
-
to see how to add ChatGPT functionality to your own content from scratch.
|
545
|
-
A nice example of this is <a href=\"https://paul-graham-gpt.vercel.app/\">Paul
|
546
|
-
Graham GPT</a> by <a href=\"https://twitter.com/mckaywrigley\">Mckay Wrigley</a>.
|
547
|
-
Mckay Wrigley took essays by Paul Graham (a well known venture capitalist)
|
548
|
-
and built a question and answer tool very like ChatGPT.</p>\n<iframe width=\"560\"
|
549
|
-
height=\"315\" src=\"https://www.youtube.com/embed/ii1jcLg-eIQ\" title=\"YouTube
|
550
|
-
video player\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write;
|
551
|
-
encrypted-media; gyroscope; picture-in-picture; web-share\" allowfullscreen></iframe>\n<p>Because
|
552
|
-
you can send a block of text to ChatGPT (as part of the prompt) you can get
|
553
|
-
ChatGPT to summarise or transform that information, or answer questions based
|
554
|
-
on that information. But there is a limit to how much information you can
|
555
|
-
pack into a prompt. You can’t put all of Paul Graham’s essays into a prompt
|
556
|
-
for example. So a solution is to do some preprocessing. For example, given
|
557
|
-
a question such as “How do I start a startup?” we could first find the essays
|
558
|
-
that are most relevant to this question, then use them to create a prompt
|
559
|
-
for ChatGPT. A quick and dirty way to do this is simply do a text search over
|
560
|
-
the essays and take the top hits. But we aren’t searching for words, we are
|
561
|
-
searching for answers to a question. The essay with the best answer might
|
562
|
-
not include the phrase “How do I start a startup?”.</p>\n<h4 id=\"semantic-search\">Semantic
|
563
|
-
search</h4>\n<p>Enter <a href=\"https://en.wikipedia.org/wiki/Semantic_search\">Semantic
|
564
|
-
search</a>. The key concept behind semantic search is that we are looking
|
565
|
-
for documents with similar meaning, not just similarity of text. One approach
|
566
|
-
to this is to represent documents by “embeddings”, that is, a vector of numbers
|
567
|
-
that encapsulate features of the document. Documents with similar vectors
|
568
|
-
are potentially related. In semantic search we take the query (e.g., “How
|
569
|
-
do I start a startup?”), compute its embedding, then search among the documents
|
570
|
-
for those with similar embeddings.</p>\n<p>To create Paul Graham GPT Mckay
|
571
|
-
Wrigley did the following. First he sent each essay to the OpenAI API underlying
|
572
|
-
ChatGPT, and in return he got the embedding for that essay (a vector of 1536
|
573
|
-
numbers). Each embedding was stored in a database (Mckay uses Postgres with
|
574
|
-
<a href=\"https://github.com/pgvector/pgvector\">pgvector</a>). When a user
|
575
|
-
enters a query such as “How do I start a startup?” that query is also sent
|
576
|
-
to the OpenAI API to retrieve its embedding vector. Then we query the database
|
577
|
-
of embeddings for Paul Graham’s essays and take the top five hits. These hits
|
578
|
-
are, one hopes, the most likely to contain relevant answers. The original
|
579
|
-
question and the most similar essays are then bundled up and sent to ChatGPT
|
580
|
-
which then synthesises an answer. See his <a href=\"https://github.com/mckaywrigley/paul-graham-gpt\">GitHub
|
581
|
-
repo</a> for more details. Note that we are still using ChatGPT, but on a
|
582
|
-
set of documents it doesn’t already have.</p>\n<h4 id=\"knowledge-graphs\">Knowledge
|
583
|
-
graphs</h4>\n<p>I’m a fan of knowledge graphs, but they are not terribly easy
|
584
|
-
to use. For example, I built a knowledge graph of Australian animals <a href=\"https://ozymandias-demo.herokuapp.com\">Ozymandias</a>
|
585
|
-
that contains a wealth of information on taxa, publications, and people, wrapped
|
586
|
-
up in a web site. If you want to learn more you need to figure out how to
|
587
|
-
write queries in SPARQL, which is not fun. Maybe we could use ChatGPT to write
|
588
|
-
the SPARQL queries for us, but it would be much more fun to be simply ask
|
589
|
-
natural language queries (e.g., “who are the experts on Australian ants?”).
|
590
|
-
I made some naïve notes on these ideas <a href=\"https://iphylo.blogspot.com/2015/09/possible-project-natural-language.html\">Possible
|
591
|
-
project: natural language queries, or answering “how many species are there?”</a>
|
592
|
-
and <a href=\"https://iphylo.blogspot.com/2019/05/ozymandias-meets-wikipedia-with-notes.html\">Ozymandias
|
593
|
-
meets Wikipedia, with notes on natural language generation</a>.</p>\n<p>Of
|
594
|
-
course, this is a well known problem. Tools such as <a href=\"http://rdf2vec.org\">RDF2vec</a>
|
595
|
-
can take RDF from a knowledge graph and create embeddings which could in tern
|
596
|
-
be used to support semantic search. But it seems to me that we could simply
|
597
|
-
this process a bit by making use of ChatGPT.</p>\n<p>Firstly we would generate
|
598
|
-
natural language statements from the knowledge graph (e.g., “species x belongs
|
599
|
-
to genus y and was described in z”, “this paper on ants was authored by x”,
|
600
|
-
etc.) that cover the basic questions we expect people to ask. We then get
|
601
|
-
embeddings for these (e.g., using OpenAI). We then have an interface where
|
602
|
-
people can ask a question (“is species x a valid species?”, “who has published
|
603
|
-
on ants”, etc.), we get the embedding for that question, retrieve natural
|
604
|
-
language statements that the closest in embedding “space”, package everything
|
605
|
-
up and ask ChatGPT to summarise the answer.</p>\n<p>The trick, of course,
|
606
|
-
is to figure out how t generate natural language statements from the knowledge
|
607
|
-
graph (which amounts to deciding what paths to traverse in the knowledge graph,
|
608
|
-
and how to write those paths is something approximating English). We also
|
609
|
-
want to know something about the sorts of questions people are likely to ask
|
610
|
-
so that we have a reasonable chance of having the answers (for example, are
|
611
|
-
people going to ask about individual species, or questions about summary statistics
|
612
|
-
such as numbers of species in a genus, etc.).</p>\n<p>What makes this attractive
|
613
|
-
is that it seems a straightforward way to go from a largely academic exercise
|
614
|
-
(build a knowledge graph) to something potentially useful (a question and
|
615
|
-
answer machine). Imagine if something like the defunct BBC wildlife site (see
|
616
|
-
<a href=\"https://iphylo.blogspot.com/2017/12/blue-planet-ii-bbc-and-semantic-web.html\">Blue
|
617
|
-
Planet II, the BBC, and the Semantic Web: a tale of lessons forgotten and
|
618
|
-
opportunities lost</a>) revived <a href=\"https://aspiring-look.glitch.me\">here</a>
|
619
|
-
had a question and answer interface where we could ask questions rather than
|
620
|
-
passively browse.</p>\n<h4 id=\"summary\">Summary</h4>\n<p>I have so much
|
621
|
-
more to learn, and need to think about ways to incorporate semantic search
|
622
|
-
and ChatGPT-like tools into knowledge graphs.</p>\n<blockquote>\n<p>Written
|
623
|
-
with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/rfxj3-x6739","uuid":"6a4d5c44-f4a9-4d40-a32c-a4d5e512c55a","url":"https://iphylo.blogspot.com/2022/05/thoughts-on-treebase-dying.html","title":"Thoughts
|
624
|
-
on TreeBASE dying(?)","summary":"@rvosa is Naturalis no longer hosting Treebase?
|
625
|
-
https://t.co/MBRgcxaBmR— Hilmar Lapp (@hlapp) May 10, 2022 So it looks
|
626
|
-
like TreeBASE is in trouble, it''s legacy Java code a victim of security issues.
|
627
|
-
Perhaps this is a chance to rethink TreeBASE, assuming that a repository of
|
628
|
-
published phylogenies is still considered a worthwhile thing to have (and
|
629
|
-
I think that question is open). Here''s what I think could be done. The data
|
630
|
-
(individual studies with trees and data) are packaged into...","date_published":"2022-05-11T16:53:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
|
631
|
-
(Roderic Page)"}],"image":null,"content_html":"<blockquote class=\"twitter-tweet\"><p
|
632
|
-
lang=\"en\" dir=\"ltr\"><a href=\"https://twitter.com/rvosa?ref_src=twsrc%5Etfw\">@rvosa</a>
|
633
|
-
is Naturalis no longer hosting Treebase? <a href=\"https://t.co/MBRgcxaBmR\">https://t.co/MBRgcxaBmR</a></p>—
|
634
|
-
Hilmar Lapp (@hlapp) <a href=\"https://twitter.com/hlapp/status/1524166490798309381?ref_src=twsrc%5Etfw\">May
|
635
|
-
10, 2022</a></blockquote> <script async src=\"https://platform.twitter.com/widgets.js\"
|
636
|
-
charset=\"utf-8\"></script>\n\n<p>So it looks like <a href=\"http://treebase.org\">TreeBASE</a>
|
637
|
-
is in trouble, it''s legacy Java code a victim of security issues. Perhaps
|
638
|
-
this is a chance to rethink TreeBASE, assuming that a repository of published
|
639
|
-
phylogenies is still considered a worthwhile thing to have (and I think that
|
640
|
-
question is open).</p>\n\n<p>Here''s what I think could be done.</p>\n\n<ol>\n<li>\nThe
|
641
|
-
data (individual studies with trees and data) are packaged into whatever format
|
642
|
-
is easiest (NEXUS, XML, JSON) and uploaded to a repository such as <a href=\"https://zenodo.org\">Zenodo</a>
|
643
|
-
for long term storage. They get DOIs for citability. This becomes the default
|
644
|
-
storage for TreeBASE.\n</li>\n<li>\nThe data is transformed into JSON and
|
645
|
-
indexed using Elasticsearch. A simple web interface is placed on top so that
|
646
|
-
people can easily find trees (never a strong point of the original TreeBASE).
|
647
|
-
Trees are displayed natively on the web using SVG. The number one goal is
|
648
|
-
for people to be able to find trees, view them, and download them.\n</li>\n<li>\nTo
|
649
|
-
add data to TreeBASE the easiest way would be for people to upload them direct
|
650
|
-
to Zenodo and tag them \"treebase\". A bot then grabs a feed of these datasets
|
651
|
-
and adds them to the search engine in (1) above. As time allows, add an interface
|
652
|
-
where people upload data directly, it gets curated, then deposited in Zenodo.
|
653
|
-
This presupposes that there are people available to do curation. Maybe have
|
654
|
-
\"stars\" for the level of curation so that users know whether anyone has
|
655
|
-
checked the data.\n</li>\n</ol>\n\n<p>There''s lots of details to tweak, for
|
656
|
-
example how many of the existing URLs for studies are preserved (some URL
|
657
|
-
mapping), and what about the API? And I''m unclear about the relationship
|
658
|
-
with <a href=\"https://datadryad.org\">Dryad</a>.</p>\n\n<p>My sense is that
|
659
|
-
the TreeBASE code is very much of its time (10-15 years ago), a monolithic
|
660
|
-
block of code with SQL, Java, etc. If one was starting from scratch today
|
661
|
-
I don''t think this would be the obvious solution. Things have trended towards
|
662
|
-
being simpler, with lots of building blocks now available in the cloud. Need
|
663
|
-
a search engine? Just spin up a container in the cloud and you have one. More
|
664
|
-
and more functionality can be devolved elsewhere.</p>\n\n<p>Another other
|
665
|
-
issue is how to support TreeBASE. It has essentially been a volunteer effort
|
666
|
-
to date, with little or no funding. One reason I think having Zenodo as a
|
667
|
-
storage engine is that it takes care of long term sustainability of the data.</p>\n\n<p>I
|
668
|
-
realise that this is all wild arm waving, but maybe now is the time to reinvent
|
669
|
-
TreeBASE?</p>\n\n<h2>Updates</h2>\n\n<p>It''s been a while since I''ve paid
|
670
|
-
a lot of attention to phylogenetic databases, and it shows. There is a file-based
|
671
|
-
storage system for phylogenies <a href=\"https://github.com/OpenTreeOfLife/phylesystem-1\">phylesystem</a>
|
672
|
-
(see \"Phylesystem: a git-based data store for community-curated phylogenetic
|
673
|
-
estimates\" <a href=\"https://doi.org/10.1093/bioinformatics/btv276\">https://doi.org/10.1093/bioinformatics/btv276</a>)
|
674
|
-
that is sort of what I had in mind, although long term persistence is based
|
675
|
-
on GitHub rather than a repository such as Zenodo. Phylesystem uses a truly
|
676
|
-
horrible-looking JSON transformation of <a href=\"http://nexml.github.io\">NeXML</a>
|
677
|
-
(NeXML itself is ugly), and TreeBASE also supports NeXML, so some form of
|
678
|
-
NeXML or a JSON transformation seems the obvious storage format. It will probably
|
679
|
-
need some cleaning and simplification if it is to be indexed easily. Looking
|
680
|
-
back over the long history of TreeBASE and phylogenetic databases I''m struck
|
681
|
-
by how much complexity has been introduced over time. I think the tech has
|
682
|
-
gotten in the way sometimes (which might just be another way of saying that
|
683
|
-
I''m not smart enough to make sense of it all.</p>\n\n<p>So we could imagine
|
684
|
-
a search engine that covers both TreeBASE and <a href=\"https://tree.opentreeoflife.org/curator\">Open
|
685
|
-
Tree of Life studies</a>.</p>\n\n<p>Basic metadata-based searches would be
|
686
|
-
straightforward, and we could have a user interface that highlights the trees
|
687
|
-
(I think TreeBASE''s biggest search rival is a Google image search). The harder
|
688
|
-
problem is searching by tree structure, for which there is an interesting
|
689
|
-
literature without any decent implementations that I''m aware of (as I said,
|
690
|
-
I''ve been out of this field a while).</p>\n\n<p>So my instinct is we could
|
691
|
-
go a long way with simply indexing JSON (CouchDB or Elasticsearch), then need
|
692
|
-
to think a bit more cleverly about higher taxon and tree based searching.
|
693
|
-
I''ve always thought that one killer query would be not so much \"show me
|
694
|
-
all the trees for my taxon\" but \"show me a synthesis of the trees for my
|
695
|
-
taxon\". Imagine a supertree of recent studies that we could use as a summary
|
696
|
-
-of our current knowledge, or a visualisation that summarises where there are
-conflicts among the trees.</p>\n\n<h3>Relevant code and sites</h3>\n\n<ul>\n<li><a
-href=\"https://github.com/rdmpage/cdaotools\">CDAO Tools</a>, see \"CDAO-Store:
-Ontology-driven Data Integration for Phylogenetic Analysis\" <a href=\"https://doi.org/10.1186/1471-2105-12-98\">https://doi.org/10.1186/1471-2105-12-98</a></li>\n<li><a
-href=\"https://github.com/NESCent/phylocommons\">PhyloCommons</a></li>\n</ul>","tags":["phylogeny","TreeBASE"],"language":"en","references":null},{"id":"https://doi.org/10.59350/jzvs4-r9559","uuid":"23fa1dd8-5c6b-4aa9-9cad-c6f6b14ae9e0","url":"https://iphylo.blogspot.com/2021/08/json-ld-in-wild-examples-of-how.html","title":"JSON-LD
-in the wild: examples of how structured data is represented on the web","summary":"I''ve
-created a GitHub repository so that I can keep track of the examples of JSON-LD
-that I''ve seen being actively used, for example embedded in web sites, or
-accessed using an API. The repository is https://github.com/rdmpage/wild-json-ld.
-The list is by no means exhaustive, I hope to add more examples as I come
-across them. One reason for doing this is to learn what others are doing.
-For example, after looking at SciGraph''s JSON-LD I now see how an ordered
-list can be modelled in RDF in...","date_published":"2021-08-27T13:20:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
-(Roderic Page)"}],"image":null,"content_html":"<p>I''ve created a GitHub repository
-so that I can keep track of the examples of JSON-LD that I''ve seen being
-actively used, for example embedded in web sites, or accessed using an API.
-The repository is <a href=\"https://github.com/rdmpage/wild-json-ld\">https://github.com/rdmpage/wild-json-ld</a>.
-The list is by no means exhaustive, I hope to add more examples as I come
-across them.</p>\n\n<p>One reason for doing this is to learn what others are
-doing. For example, after looking at SciGraph''s JSON-LD I now see how an
-ordered list can be modelled in RDF in such a way that the list of authors
-in a JSON-LD document for, say a scientific paper, is correct. By default
-RDF has no notion of ordered lists, so if you do a SPARQL query to get the
-authors of a paper, the order of the authors returned in the query will be
-arbitrary. There are various ways to try and tackle this. In my Ozymandias
-knowledge graph I used \"roles\" to represent order (see <a href=\"https://doi.org/10.7717/peerj.6739/fig-2\">Figure
-2</a> in the Ozymandias paper). I then used properties of the role to order
-the list of authors.</p>\n\n<p>Another approach is to use rdf:lists (see <a
-href=\"http://www.snee.com/bobdc.blog/2014/04/rdf-lists-and-sparql.html\">RDF
-lists and SPARQL</a> and <a href=\"https://stackoverflow.com/questions/17523804/is-it-possible-to-get-the-position-of-an-element-in-an-rdf-collection-in-sparql/17530689#17530689\">Is
-it possible to get the position of an element in an RDF Collection in SPARQL?</a>
-for an introduction to lists). SciGraph uses this approach. The value for
-schema:author is not an author, but a blank node (bnode), and this bnode has
-two predicates, rdf:first and rdf:rest. One points to an author, the other
-points to another bnode. This pattern repeats until we encounter a value of
-rdf:nil for rdf:rest.</p>\n\n<div class=\"separator\" style=\"clear: both;\"><a
-href=\"https://1.bp.blogspot.com/-AESgWub1ZLQ/YSjoeo6O41I/AAAAAAAAgwg/5Edm7ZmuwL8NwxCcBvTqbI7js5nYmgggwCLcBGAsYHQ/s629/list.png\"
-style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
-border=\"0\" height=\"320\" data-original-height=\"629\" data-original-width=\"401\"
-src=\"https://1.bp.blogspot.com/-AESgWub1ZLQ/YSjoeo6O41I/AAAAAAAAgwg/5Edm7ZmuwL8NwxCcBvTqbI7js5nYmgggwCLcBGAsYHQ/s320/list.png\"/></a></div>\n\n<p>This
-introduces some complexity, but the benefit is that the JSON-LD version of
-the RDF will have the authors in the correct order, and hence any client that
-is using JSON will be able to treat the array of authors as ordered. Without
-some means of ordering the client could not make this assumption, hence the
-first author in the list might not actually be the first author of the paper.</p>","tags":["JSON-LD","RDF"],"language":"en","references":null},{"id":"https://doi.org/10.59350/zc4qc-77616","uuid":"30c78d9d-2e50-49db-9f4f-b3baa060387b","url":"https://iphylo.blogspot.com/2022/09/does-anyone-cite-taxonomic-treatments.html","title":"Does
-anyone cite taxonomic treatments?","summary":"Taxonomic treatments have come
-up in various discussions I''m involved in, and I''m curious as to whether
-they are actually being used, in particular, whether they are actually being
-cited. Consider the following quote: The taxa are described in taxonomic treatments,
-well defined sections of scientific publications (Catapano 2019). They include
-a nomenclatural section and one or more sections including descriptions, material
-citations referring to studied specimens, or notes ecology and...","date_published":"2022-09-01T16:49:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
-(Roderic Page)"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
-both;\"><a href=\"https://zenodo.org/record/5731100/thumb100\" style=\"display:
-block; padding: 1em 0; text-align: center; clear: right; float: right;\"><img
-alt=\"\" border=\"0\" height=\"128\" data-original-height=\"106\" data-original-width=\"100\"
-src=\"https://zenodo.org/record/5731100/thumb250\"/></a></div>\nTaxonomic
-treatments have come up in various discussions I''m involved in, and I''m
-curious as to whether they are actually being used, in particular, whether
-they are actually being cited. Consider the following quote:\n\n<blockquote>\nThe
-taxa are described in taxonomic treatments, well defined sections of scientific
-publications (Catapano 2019). They include a nomenclatural section and one
-or more sections including descriptions, material citations referring to studied
-specimens, or notes ecology and behavior. In case the treatment does not describe
-a new discovered taxon, previous treatments are cited in the form of treatment
-citations. This citation can refer to a previous treatment and add additional
-data, or it can be a statement synonymizing the taxon with another taxon.
-This allows building a citation network, and ultimately is a constituent part
-of the catalogue of life. - Taxonomic Treatments as Open FAIR Digital Objects
-<a href=\"https://doi.org/10.3897/rio.8.e93709\">https://doi.org/10.3897/rio.8.e93709</a>\n</blockquote>\n\n<p>\n
-\"Traditional\" academic citation is from article to article. For example,
-consider these two papers:\n\n<blockquote>\nLi Y, Li S, Lin Y (2021) Taxonomic
-study on fourteen symphytognathid species from Asia (Araneae, Symphytognathidae).
-ZooKeys 1072: 1-47. https://doi.org/10.3897/zookeys.1072.67935\n</blockquote>\n\n<blockquote>\nMiller
-J, Griswold C, Yin C (2009) The symphytognathoid spiders of the Gaoligongshan,
-Yunnan, China (Araneae: Araneoidea): Systematics and diversity of micro-orbweavers.
-ZooKeys 11: 9-195. https://doi.org/10.3897/zookeys.11.160\n</blockquote>\n</p>\n\n<p>Li
-et al. 2021 cites Miller et al. 2009 (although Pensoft seems to have broken
-the citation such that it does appear correctly either on their web page or
-in CrossRef).</p>\n\n<p>So, we have this link: [article]10.3897/zookeys.1072.67935
---cites--> [article]10.3897/zookeys.11.160. One article cites another.</p>\n\n<p>In
-their 2021 paper Li et al. discuss <i>Patu jidanweishi</i> Miller, Griswold
-& Yin, 2009:\n\n<div class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPavMHXqQNX1ls_zXo9kIcMLHPxc7ZpV9NCSof5wLumrg3ovPoi6nYKzZsINuqFtYoEvW1QrerePD-MEf2DJaYUXlT37d81x3L6ILls7u229rg0_Nc0uUmgW-ICzr6MI_QCZfgQbYGTxuofu-fuPVoygbCnm3vQVYOhLDLtp1EtQ9jRZHDvw/s1040/Screenshot%202022-09-01%20at%2017.12.27.png\"
-style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
-border=\"0\" width=\"400\" data-original-height=\"314\" data-original-width=\"1040\"
-src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPavMHXqQNX1ls_zXo9kIcMLHPxc7ZpV9NCSof5wLumrg3ovPoi6nYKzZsINuqFtYoEvW1QrerePD-MEf2DJaYUXlT37d81x3L6ILls7u229rg0_Nc0uUmgW-ICzr6MI_QCZfgQbYGTxuofu-fuPVoygbCnm3vQVYOhLDLtp1EtQ9jRZHDvw/s400/Screenshot%202022-09-01%20at%2017.12.27.png\"/></a></div>\n\n<p>There
-is a treatment for the original description of <i>Patu jidanweishi</i> at
-<a href=\"https://doi.org/10.5281/zenodo.3792232\">https://doi.org/10.5281/zenodo.3792232</a>,
-which was created by Plazi with a time stamp \"2020-05-06T04:59:53.278684+00:00\".
-The original publication date was 2009, the treatments are being added retrospectively.</p>\n\n<p>In
-an ideal world my expectation would be that Li et al. 2021 would have cited
-the treatment, instead of just providing the text string \"Patu jidanweishi
-Miller, Griswold & Yin, 2009: 64, figs 65A–E, 66A, B, 67A–D, 68A–F, 69A–F,
-70A–F and 71A–F (♂♀).\" Isn''t the expectation under the treatment model that
-we would have seen this relationship:</p>\n\n<p>[article]10.3897/zookeys.1072.67935
---cites--> [treatment]https://doi.org/10.5281/zenodo.3792232</p>\n\n<p>Furthermore,
-if it is the case that \"[i]n case the treatment does not describe a new discovered
-taxon, previous treatments are cited in the form of treatment citations\"
-then we should also see a citation between treatments, in other words Li et
-al.''s 2021 treatment of <i>Patu jidanweishi</i> (which doesn''t seem to have
-a DOI but is available on Plazi'' web site as <a href=\"https://tb.plazi.org/GgServer/html/1CD9FEC313A35240938EC58ABB858E74\">https://tb.plazi.org/GgServer/html/1CD9FEC313A35240938EC58ABB858E74</a>)
-should also cite the original treatment? It doesn''t - but it does cite the
-Miller et al. paper.</p>\n\n<p>So in this example we don''t see articles citing
-treatments, nor do we see treatments citing treatments. Playing Devil''s advocate,
-why then do we have treatments? Does''t the lack of citations suggest that
-- despite some taxonomists saying this is the unit that matters - they actually
-don''t. If we pay attention to what people do rather than what they say they
-do, they cite articles.</p>\n\n<p>Now, there are all sorts of reasons why
-we don''t see [article] -> [treatment] citations, or [treatment] -> [treatment]
-citations. Treatments are being added after the fact by Plazi, not by the
-authors of the original work. And in many cases the treatments that could
-be cited haven''t appeared until after that potentially citing work was published.
-In the example above the Miller et al. paper dates from 2009, but the treatment
-extracted only went online in 2020. And while there is a long standing culture
-of citing publications (ideally using DOIs) there isn''t an equivalent culture
-of citing treatments (beyond the simple text strings).</p>\n\n<p>Obviously
-this is but one example. I''d need to do some exploration of the citation
-graph to get a better sense of citations patterns, perhaps using <a href=\"https://www.crossref.org/documentation/event-data/\">CrossRef''s
-event data</a>. But my sense is that taxonomists don''t cite treatments.</p>\n\n<p>I''m
-guessing Plazi would respond by saying treatments are cited, for example (indirectly)
-in GBIF downloads. This is true, although arguably people aren''t citing the
-treatment, they''re citing specimen data in those treatments, and that specimen
-data could be extracted at the level of articles rather than treatments. In
-other words, it''s not the treatments themselves that people are citing.</p>\n\n<p>To
-be clear, I think there is value in being able to identify those \"well defined
-sections\" of a publication that deal with a given taxon (i.e., treatments),
-but it''s not clear to me that these are actually the citable units people
-might hope them to be. Likewise, journals such as <i>ZooKeys</i> have DOIs
-for individual figures. Does anyone actually cite those?</p>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/en7e9-5s882","uuid":"20b9d31e-513f-496b-b399-4215306e1588","url":"https://iphylo.blogspot.com/2022/04/obsidian-markdown-and-taxonomic-trees.html","title":"Obsidian,
-markdown, and taxonomic trees","summary":"Returning to the subject of personal
-knowledge graphs Kyle Scheer has an interesting repository of Markdown files
-that describe academic disciplines at https://github.com/kyletscheer/academic-disciplines
-(see his blog post for more background). If you add these files to Obsidian
-you get a nice visualisation of a taxonomy of academic disciplines. The applications
-of this to biological taxonomy seem obvious, especially as a tool like Obsidian
-enables all sorts of interesting links to be added...","date_published":"2022-04-07T21:07:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
-(Roderic Page)"}],"image":null,"content_html":"<p>Returning to the subject
-of <a href=\"https://iphylo.blogspot.com/2020/08/personal-knowledge-graphs-obsidian-roam.html\">personal
-knowledge graphs</a> Kyle Scheer has an interesting repository of Markdown
-files that describe academic disciplines at <a href=\"https://github.com/kyletscheer/academic-disciplines\">https://github.com/kyletscheer/academic-disciplines</a>
-(see <a href=\"https://kyletscheer.medium.com/on-creating-a-tree-of-knowledge-f099c1028bf6\">his
-blog post</a> for more background).</p>\n\n<p>If you add these files to <a
-href=\"https://obsidian.md/\">Obsidian</a> you get a nice visualisation of
-a taxonomy of academic disciplines. The applications of this to biological
-taxonomy seem obvious, especially as a tool like Obsidian enables all sorts
-of interesting links to be added (e.g., we could add links to the taxonomic
-research behind each node in the taxonomic tree, the people doing that research,
-etc. - although that would mean we''d no longer have a simple tree).</p>\n\n<p>The
-more I look at these sort of simple Markdown-based tools the more I wonder
-whether we could make more use of them to create simple but persistent databases.
-Text files seem the most stable, long-lived digital format around, maybe this
-would be a way to minimise the inevitable obsolescence of database and server
-software. Time for some experiments I feel... can we take a taxonomic group,
-such as mammals, and create a richly connected database purely in Markdown?</p>\n\n<div
-class=\"separator\" style=\"clear: both; text-align: center;\"><iframe allowfullscreen=''allowfullscreen''
-webkitallowfullscreen=''webkitallowfullscreen'' mozallowfullscreen=''mozallowfullscreen''
-width=''400'' height=''322'' src=''https://www.blogger.com/video.g?token=AD6v5dy3Sa_SY_MJCZYYCT-bAGe9QD1z_V0tkE0qM5FaQJfAEgGOoHtYPATsNNbBvTEh_tHOZ83nMGzpYRg''
-class=''b-hbp-video b-uploaded'' frameborder=''0''></iframe></div>","tags":["markdown","obsidian"],"language":"en","references":null},{"id":"https://doi.org/10.59350/m7gb7-d7c49","uuid":"7d814863-43b5-4faf-a475-da8de5efd3ef","url":"https://iphylo.blogspot.com/2022/02/duplicate-dois-again.html","title":"Duplicate
-DOIs (again)","summary":"This blog post provides some background to a recent
-tweet where I expressed my frustration about the duplication of DOIs for the
-same article. I''m going to document the details here. The DOI that alerted
-me to this problem is https://doi.org/10.2307/2436688 which is for the article
-Snyder, W. C., & Hansen, H. N. (1940). THE SPECIES CONCEPT IN FUSARIUM. American
-Journal of Botany, 27(2), 64–67. This article is hosted by JSTOR at https://www.jstor.org/stable/2436688
-which displays the DOI...","date_published":"2022-02-08T15:06:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
-(Roderic Page)"}],"image":null,"content_html":"<p>This blog post provides
-some background to a <a href=\"https://twitter.com/rdmpage/status/1491023036199600132\">recent
-tweet</a> where I expressed my frustration about the duplication of DOIs for
-the same article. I''m going to document the details here.</p>\n\n<p>The DOI
-that alerted me to this problem is <a href=\"https://doi.org/10.2307/2436688\">https://doi.org/10.2307/2436688</a>
-which is for the article</p>\n\n<blockquote>\nSnyder, W. C., & Hansen, H.
-N. (1940). THE SPECIES CONCEPT IN FUSARIUM. American Journal of Botany, 27(2),
-64–67.\n</blockquote>\n\n<p>This article is hosted by JSTOR at <a href=\"https://www.jstor.org/stable/2436688\">https://www.jstor.org/stable/2436688</a>
-which displays the DOI <a href=\"https://doi.org/10.2307/2436688\">https://doi.org/10.2307/2436688</a>
-.</p>\n\n<p>This same article is also hosted by Wiley at <a href=\"https://bsapubs.onlinelibrary.wiley.com/doi/abs/10.1002/j.1537-2197.1940.tb14217.x\">https://bsapubs.onlinelibrary.wiley.com/doi/abs/10.1002/j.1537-2197.1940.tb14217.x</a>
-with the DOI <a href=\"https://doi.org/10.1002/j.1537-2197.1940.tb14217.x\">https://doi.org/10.1002/j.1537-2197.1940.tb14217.x</a>.</p>\n\n<h2>Expected
-behaviour</h2>\n\n<p>What should happen is if Wiley is going to be the publisher
-of this content (taking over from JSTOR), the DOI <b>10.2307/2436688</b> should
-be redirected to the Wiley page, and the Wiley page displays this DOI (i.e.,
-<b>10.2307/2436688</b>). If I want to get metadata for this DOI, I should
-be able to use CrossRef''s API to retrieve that metadata, e.g. <a href=\"https://api.crossref.org/v1/works/10.2307/2436688\">https://api.crossref.org/v1/works/10.2307/2436688</a>
-should return metadata for the article.</p>\n\n<h2>What actually happens</h2>\n\n<p>Wiley
-display the same article on their web site with the DOI <b>10.1002/j.1537-2197.1940.tb14217.x</b>.
-They have minted a new DOI for the same article! The original JSTOR DOI now
-resolves to the Wiley page (you can see this using the <a href=\"https://hdl.handle.net\">Handle
-Resolver</a>), which is what is supposed to happen. However, Wiley should
-have reused the original DOI rather than mint their own.</p>\n\n<p>Furthermore,
-while the original DOI still resolves in a web browser, I can''t retrieve
-metadata about that DOI from CrossRef, so any attempt to build upon that DOI
-fails. However, I can retrieve metadata for the Wiley DOI, i.e. <a href=\"https://api.crossref.org/v1/works/10.1002/j.1537-2197.1940.tb14217.x\">https://api.crossref.org/v1/works/10.1002/j.1537-2197.1940.tb14217.x</a>
-works, but <a href=\"https://api.crossref.org/v1/works/10.2307/2436688\">https://api.crossref.org/v1/works/10.2307/2436688</a>
-doesn''t.</p>\n\n<h2>Why does this matter?</h2>\n\n<p>For anyone using DOIs
-as stable links to the literature the persistence of DOIs is something you
-should be able to rely upon, both for people clicking on links in web browsers
-and developers getting metadata from those DOIs. The whole rationale of the
-DOI system is a single, globally unique identifier for each article, and that
-these DOIs persist even when the publisher of the content changes. If this
-property doesn''t hold, then why would a developer such as myself invest effort
-in linking using DOIs?</p>\n\n<p>Just for the record, I think CrossRef is
-great and is a hugely important part of the scholarly landscape. There are
-lots of things that I do that would be nearly impossible without CrossRef
-and its tools. But cases like this where we get massive duplication of DOIs
-when a publishers takes over an existing journal fundamentally breaks the
-underlying model of stable, persistent identifiers.</p>","tags":["CrossRef","DOI","duplicates"],"language":"en","references":null},{"id":"https://doi.org/10.59350/d3dc0-7an69","uuid":"545c177f-cea5-4b79-b554-3ccae9c789d7","url":"https://iphylo.blogspot.com/2021/10/reflections-on-macroscope-tool-for-21st.html","title":"Reflections
+Attribution 4.0 International license</a>.","language":"en","favicon":null,"feed_url":"https://iphylo.blogspot.com/feeds/posts/default","feed_format":"application/atom+xml","home_page_url":"https://iphylo.blogspot.com/","indexed_at":"2023-02-06","modified_at":"2023-05-31T17:26:00+00:00","license":"https://creativecommons.org/licenses/by/4.0/legalcode","generator":"Blogger
+7.00","category":"Natural Sciences","backlog":true,"prefix":"10.59350","items":[{"id":"https://doi.org/10.59350/ymc6x-rx659","uuid":"0807f515-f31d-4e2c-9e6f-78c3a9668b9d","url":"https://iphylo.blogspot.com/2022/09/dna-barcoding-as-intergenerational.html","title":"DNA
+barcoding as intergenerational transfer of taxonomic knowledge","summary":"I
+tweeted about this but want to bookmark it for later as well. The paper “A
+molecular-based identification resource for the arthropods of Finland” doi:10.1111/1755-0998.13510
+contains the following: …the annotated barcode records assembled by FinBOL
+participants represent a tremendous intergenerational transfer of taxonomic
+knowledge … the time contributed by current taxonomists in identifying and
+contributing voucher specimens represents a great gift to future generations
+who will benefit...","date_published":"2022-09-14T10:12:00Z","date_modified":"2022-09-29T13:57:30Z","date_indexed":"1909-06-16T11:02:21+00:00","authors":[{"url":null,"name":"Roderic
+Page"}],"image":null,"content_html":"<p>I <a href=\"https://twitter.com/rdmpage/status/1569738844416638981?s=21&t=9OVXuoUEwZtQt-Ldzlutfw\">tweeted
+about this</a> but want to bookmark it for later as well. The paper “A molecular-based
+identification resource for the arthropods of Finland” <a href=\"https://doi.org/10.1111/1755-0998.13510\">doi:10.1111/1755-0998.13510</a>
+contains the following:</p>\n<blockquote>\n<p>…the annotated barcode records
+assembled by FinBOL participants represent a tremendous <mark>intergenerational
+transfer of taxonomic knowledge</mark> … the time contributed by current taxonomists
+in identifying and contributing voucher specimens represents a great gift
+to future generations who will benefit from their expertise when they are
+no longer able to process new material.</p>\n</blockquote>\n<p>I think this
+is a very clever way to characterise the project. In an age of machine learning
+this may be commonest way to share knowledge , namely as expert-labelled training
+data used to build tools for others. Of course, this means the expertise itself
+may be lost, which has implications for updating the models if the data isn’t
+complete. But it speaks to Charles Godfrey’s theme of <a href=\"https://biostor.org/reference/250587\">“Taxonomy
+as information science”</a>.</p>\n<p>Note that the knowledge is also transformed
+in the sense that the underlying expertise of interpreting morphology, ecology,
+behaviour, genomics, and the past literature is not what is being passed on.
+Instead it is probabilities that a DNA sequence belongs to a particular taxon.</p>\n<p>This
+feels is different to, say iNaturalist, where there is a machine learning
+model to identify images. In that case, the model is built on something the
+community itself has created, and continues to create. Yes, the underlying
+idea is that same: “experts” have labelled the data, a model is trained, the
+model is used. But the benefits of the <a href=\"https://www.inaturalist.org\">iNaturalist</a>
+model are immediately applicable to the people whose data built the model.
+In the case of barcoding, because the technology itself is still not in the
+hands of many (relative to, say, digital imaging), the benefits are perhaps
+less tangible. Obviously researchers working with environmental DNA will find
+it very useful, but broader impact may await the arrival of citizen science
+DNA barcoding.</p>\n<p>The other consideration is whether the barcoding helps
+taxonomists. Is it to be used to help prioritise future work (“we are getting
+lots of unknown sequences in these taxa, lets do some taxonomy there”), or
+is it simply capturing the knowledge of a generation that won’t be replaced:</p>\n<blockquote>\n<p>The
+need to capture such knowledge is essential because there are, for example,
+no young Finnish taxonomists who can critically identify species in many key
+groups of ar- thropods (e.g., aphids, chewing lice, chalcid wasps, gall midges,
+most mite lineages).</p>\n</blockquote>\n<p>The cycle of collect data, test
+and refine model, collect more data, rinse and repeat that happens with iNaturalist
+creates a feedback loop. It’s not clear that a similar cycle exists for DNA
+barcoding.</p>\n<blockquote>\n<p>Written with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":[],"language":"en","references":[]},{"id":"https://doi.org/10.59350/d3dc0-7an69","uuid":"545c177f-cea5-4b79-b554-3ccae9c789d7","url":"https://iphylo.blogspot.com/2021/10/reflections-on-macroscope-tool-for-21st.html","title":"Reflections
 on \"The Macroscope\" - a tool for the 21st Century?","summary":"This is a
 guest post by Tony Rees. It would be difficult to encounter a scientist, or
 anyone interested in science, who is not familiar with the microscope, a tool
@@ -904,8 +105,8 @@ http_interactions:
 by the unaided eye, or to reveal otherwise invisible fine detail in larger
 objects. A select few with a particular interest in microscopy may also have
 encountered the Wild-Leica \"Macroscope\", a specialised type of benchtop
-microscope optimised for...","date_published":"2021-10-07T12:38:00Z","date_modified":
-
+microscope optimised for...","date_published":"2021-10-07T12:38:00Z","date_modified":"2021-10-08T10:26:22Z","date_indexed":"1909-06-16T10:02:25+00:00","authors":[{"url":null,"name":"Roderic
+Page"}],"image":null,"content_html":"<p><img src=\"https://lh3.googleusercontent.com/-A99btr6ERMs/Vl1Wvjp2OtI/AAAAAAAAEFI/7bKdRjNG5w0/ytNkVT2U.jpg?imgmax=800\"
 alt=\"YtNkVT2U\" title=\"ytNkVT2U.jpg\" border=\"0\" width=\"128\" height=\"128\"
 style=\"float:right;\" /> This is a guest post by <a href=\"https://about.me/TonyRees\">Tony
 Rees</a>.</p>\n\n<p>It would be difficult to encounter a scientist, or anyone
@@ -1054,261 +255,11 @@ http_interactions:
|
|
1054
255
|
a few personal views which would be out of scope in a \"neutral point of view\"
|
1055
256
|
article such as is required on Wikipedia; also to see if readers of this blog
|
1056
257
|
would like to contribute further to discussion of any of the concepts traversed
|
1057
|
-
herein.</p>","tags":["guest post","macroscope"],"language":"en","references":
|
1058
|
-
trip from identifiers to citations and back again","summary":"Note to self
|
1059
|
-
(basically rewriting last year''s Finding citations of specimens). Bibliographic
|
1060
|
-
data supports going from identifier to citation string and back again, so
|
1061
|
-
we can do a \"round trip.\" 1. Given a DOI we can get structured data with
|
1062
|
-
a simple HTTP fetch, then use a tool such as citation.js to convert that data
|
1063
|
-
into a human-readable string in a variety of formats. Identifier ⟶ Structured
|
1064
|
-
data ⟶ Human readable string 10.7717/peerj-cs.214 HTTP with...","date_published":"2022-05-27T16:34:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
|
1065
|
-
(Roderic Page)"}],"image":null,"content_html":"<p>Note to self (basically
|
1066
|
-
rewriting last year''s <a href=\"https://iphylo.blogspot.com/2021/05/finding-citations-of-specimens.html\">Finding
|
1067
|
-
citations of specimens</a>).</p>\n\n<p>Bibliographic data supports going from
|
1068
|
-
identifier to citation string and back again, so we can do a \"round trip.\"</p>\n\n<h2>1.</h2>\n\n<p>Given
|
1069
|
-
a DOI we can get structured data with a simple HTTP fetch, then use a tool
|
1070
|
-
such as <a href=\"https://citation.js.org\">citation.js</a> to convert that
|
1071
|
-
data into a human-readable string in a variety of formats.</p>\n\n<table>\n<tr>\n<th>\nIdentifier\n</th>\n<th>\n⟶\n</th>\n<th>\nStructured
|
1072
|
-
data\n</th>\n<th>\n⟶\n</th>\n<th>\nHuman readable string\n</th>\n<tr>\n<tr>\n<td>\n10.7717/peerj-cs.214\n</td>\n<td>\nHTTP
|
1073
|
-
with content-negotiation\n</td>\n<td>\nCSL-JSON\n</td>\n<td>\nCSL templates\n</td>\n<td
|
1074
|
-
width=\"25%\">\nWillighagen, L. G. (2019). Citation.js: a format-independent,
|
1075
|
-
modular bibliography tool for the browser and command line. PeerJ Computer
|
1076
|
-
Science, 5, e214. https://doi.org/10.7717/peerj-cs.214\n</td>\n</tr>\n</table>\n\n<h2>2.</h2>\n\n<p>Going
|
1077
|
-
in the reverse direction (string to identifier) is a little more challenging.
|
1078
|
-
In the \"old days\" a typical strategy was to attempt to parse the citation
|
1079
|
-
string into structured data (see <a href=\"hhtps://anystyle.io\">AnyStyle</a>
|
1080
|
-
for a nice example of this), then we could extract a truple of (journal, volume,
|
1081
|
-
starting page) and use that to query CrossRef to find if there was an article
|
1082
|
-
with that tuple, which gave us the DOI.</p>\n\n<table>\n<tr>\n<th>\nIdentifier\n</th>\n<th>\n⟵\n</th>\n<th>\nStructured
|
1083
|
-
data\n</th>\n<th>\n⟵\n</th>\n<th>\nHuman readable string\n</th>\n<tr>\n<tr>\n<td>\n10.7717/peerj-cs.214\n</td>\n<td>\nOpenURL
|
1084
|
-
query\n</td>\n<td>\njournal, volume, start page\n</td>\n<td>\nCitation parser
|
1085
|
-
\n</td>\n<td width=\"25%\">\nWillighagen, L. G. (2019). Citation.js: a format-independent,
|
1086
|
-
modular bibliography tool for the browser and command line. PeerJ Computer
|
1087
|
-
Science, 5, e214. https://doi.org/10.7717/peerj-cs.214\n</td>\n</tr>\n</table>\n\n<h2>3.</h2>\n\n<p>Another
|
1088
|
-
strategy is to take all the citations strings for each DOI, index those in
|
1089
|
-
a search engine, then just use a simple search to find the best match to your
|
1090
|
-
citation string, and hence the DOI. This is what <a href=\"https://search.crossref.org\">https://search.crossref.org</a>
|
1091
|
-
does.</p>\n\n<table>\n<tr>\n<th>\nIdentifier\n</th>\n<th>\n⟵\n</th>\n<th>\nHuman
|
1092
|
-
readable string\n</th>\n<tr>\n<tr>\n<td>\n10.7717/peerj-cs.214\n</td>\n<td>\nsearch\n</td>\n<td
|
1093
|
-
width=\"50%\">\nWillighagen, L. G. (2019). Citation.js: a format-independent,
|
1094
|
-
modular bibliography tool for the browser and command line. PeerJ Computer
|
1095
|
-
Science, 5, e214. https://doi.org/10.7717/peerj-cs.214\n</td>\n</tr>\n</table>\n\n<p>At
|
1096
|
-
the moment my work on material citations (i.e., lists of specimens in taxonomic
|
1097
|
-
papers) is focussing on 1 (generating citations from specimen data in GBIF)
|
1098
|
-
and 2 (parsing citations into structured data).</p>","tags":["citation","GBIF","material
|
1099
|
-
examined","specimen codes"],"language":"en","references":null},{"id":"https://doi.org/10.59350/3s376-6bm21","uuid":"62e7b438-67a3-44ac-a66d-3f5c278c949e","url":"https://iphylo.blogspot.com/2022/02/deduplicating-bibliographic-data.html","title":"Deduplicating
|
1100
|
-
bibliographic data","summary":"There are several instances where I have a
|
1101
|
-
collection of references that I want to deduplicate and merge. For example,
|
1102
|
-
in Zootaxa has no impact factor I describe a dataset of the literature cited
|
1103
|
-
by articles in the journal Zootaxa. This data is available on Figshare (https://doi.org/10.6084/m9.figshare.c.5054372.v4),
|
1104
|
-
as is the equivalent dataset for Phytotaxa (https://doi.org/10.6084/m9.figshare.c.5525901.v1).
|
1105
|
-
Given that the same articles may be cited many times, these datasets have
|
1106
|
-
lots of...","date_published":"2022-02-03T15:09:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
|
1107
|
-
(Roderic Page)"}],"image":null,"content_html":"<p>There are several instances
|
1108
|
-
where I have a collection of references that I want to deduplicate and merge.
|
1109
|
-
For example, in <a href=\"https://iphylo.blogspot.com/2020/07/zootaxa-has-no-impact-factor.html\">Zootaxa
|
1110
|
-
has no impact factor</a> I describe a dataset of the literature cited by articles
|
1111
|
-
in the journal <i>Zootaxa</i>. This data is available on Figshare (<a href=\"https://doi.org/10.6084/m9.figshare.c.5054372.v4\">https://doi.org/10.6084/m9.figshare.c.5054372.v4</a>),
|
1112
|
-
as is the equivalent dataset for <i>Phytotaxa</i> (<a href=\"https://doi.org/10.6084/m9.figshare.c.5525901.v1\">https://doi.org/10.6084/m9.figshare.c.5525901.v1</a>).
|
1113
|
-
Given that the same articles may be cited many times, these datasets have
|
1114
|
-
lots of duplicates. Similarly, articles in <a href=\"https://species.wikimedia.org\">Wikispecies</a>
|
1115
|
-
often have extensive lists of references cited, and the same reference may
|
1116
|
-
appear on multiple pages (for an initial attempt to extract these references
|
1117
|
-
see <a href=\"https://doi.org/10.5281/zenodo.5801661\">https://doi.org/10.5281/zenodo.5801661</a>
|
1118
|
-
and <a href=\"https://github.com/rdmpage/wikispecies-parser\">https://github.com/rdmpage/wikispecies-parser</a>).</p>\n\n<p>There
|
1119
|
-
are several reasons I want to merge these references. If I want to build a
|
1120
|
-
citation graph for <i>Zootaxa</i> or <i>Phytotaxa</i> I need to merge references
|
1121
|
-
that are the same so that I can accurately count citations. I am also interested
|
1122
|
-
in harvesting the metadata to help find those articles in the <a href=\"https://www.biodiversitylibrary.org\">Biodiversity
|
1123
|
-
Heritage Library</a> (BHL), and the literature cited section of scientific
|
1124
|
-
articles is a potential goldmine of bibliographic metadata, as is Wikispecies.</p>\n\n<p>After
|
1125
|
-
various experiments and false starts I''ve created a repository <a href=\"https://github.com/rdmpage/bib-dedup\">https://github.com/rdmpage/bib-dedup</a>
|
1126
|
-
to host a series of PHP scripts to deduplicate bibliographics data. I''ve
|
1127
|
-
settled on using CSL-JSON as the format for bibliographic data. Because deduplication
|
1128
|
-
relies on comparing pairs of references, the standard format for most of the
|
1129
|
-
scripts is a JSON array containing a pair of CSL-JSON objects to compare.
|
1130
|
-
Below are the steps the code takes.</p>\n\n<h2>Generating pairs to compare</h2>\n\n<p>The
|
1131
|
-
first step is to take a list of references and generate the pairs that will
|
1132
|
-
be compared. I started with this approach as I wanted to explore machine learning
|
1133
|
-
and wanted a simple format for training data, such as an array of two CSL-JSON
|
1134
|
-
objects and an integer flag representing whether the two references were the
|
1135
|
-
same or different.</p>\n\n<p>There are various ways to generate CSL-JSON for
|
1136
|
-
a reference. I use a tool I wrote (see <a href=\"https://iphylo.blogspot.com/2021/07/citation-parsing-released.html\">Citation
|
1137
|
-
parsing tool released</a>) that has a simple API where you parse one or more
|
1138
|
-
references and it returns that reference as structured data in CSL-JSON.</p>\n\n<p>Attempting
|
1139
|
-
to do all possible pairwise comparisons rapidly gets impractical as the number
|
1140
|
-
of references increases, so we need some way to restrict the number of comparisons
|
1141
|
-
we make. One approach I''ve explored is the “sorted neighbourhood method”
|
1142
|
-
where we sort the references (for example, by their title) then move a sliding
|
1143
|
-
window down the list of references, comparing all references within that window.
|
1144
|
-
This greatly reduces the number of pairwise comparisons. So the first step
|
1145
|
-
is to sort the references, then run a sliding window over them, output all
|
1146
|
-
the pairs in each window (ignoring pairwise comparisons already made in
|
1147
|
-
a previous window). Other methods of \"blocking\" could also be used, such
|
1148
|
-
as only including references in a particular year, or a particular journal.</p>\n\n<p>So,
|
1149
|
-
the output of this step is a set of JSON arrays, each with a pair of references
|
1150
|
-
in CSL-JSON format. Each array is stored on a single line in the same file
|
1151
|
-
in <a href=\"https://en.wikipedia.org/wiki/JSON_streaming#Line-delimited_JSON_2\">line-delimited
|
1152
|
-
JSON</a> (JSONL).</p>\n\n<h2>Comparing pairs</h2>\n\n<p>The next step is to
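The pair-generation step described above (sort, slide a window, emit each unordered pair once) can be sketched as follows. The repository's actual scripts are PHP; this is a minimal Ruby sketch of the idea, and the method name, window size, and `"id"`/`"title"` keys are illustrative assumptions, not taken from the repository.

```ruby
# Sorted neighbourhood method: sort references (here by title), slide a
# fixed-size window down the list, and emit every pair within each window,
# skipping pairs already emitted by an earlier, overlapping window.
def generate_pairs(refs, window: 5)
  sorted = refs.sort_by { |r| r["title"].to_s.downcase }
  return [] if sorted.empty?
  seen = {}
  pairs = []
  sorted.each_cons([window, sorted.size].min) do |group|
    group.combination(2) do |a, b|
      key = [a["id"], b["id"]].sort
      next if seen[key]          # already compared in a previous window
      seen[key] = true
      pairs << [a, b]
    end
  end
  pairs
end
```

With six references and a window of three this yields 9 pairs instead of the full 15 pairwise comparisons, which is the point of the blocking step.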
|
1153
|
-
compare each pair of references and decide whether they are a match or not.
|
1154
|
-
Initially I explored a machine learning approach used in the following paper:</p>\n\n<blockquote>\nWilson
|
1155
|
-
DR. 2011. Beyond probabilistic record linkage: Using neural networks and complex
|
1156
|
-
features to improve genealogical record linkage. In: The 2011 International
|
1157
|
-
Joint Conference on Neural Networks. 9–14. DOI: <a href=\"https://doi.org/10.1109/IJCNN.2011.6033192\">10.1109/IJCNN.2011.6033192</a>\n</blockquote>\n\n<p>Initial
|
1158
|
-
experiments using <a href=\"https://github.com/jtet/Perceptron\">https://github.com/jtet/Perceptron</a>
|
1159
|
-
were promising and I want to play with this further, but I decided to skip
|
1160
|
-
this for now and just use simple string comparison. So for each CSL-JSON object
|
1161
|
-
I generate a citation string in the same format using CiteProc, then compute
|
1162
|
-
the <a href=\"https://en.wikipedia.org/wiki/Levenshtein_distance\">Levenshtein
|
1163
|
-
distance</a> between the two strings. By normalising this distance by the
|
1164
|
-
length of the two strings being compared I can use an arbitrary threshold
|
1165
|
-
to decide if the references are the same or not.</p>\n\n<h2>Clustering</h2>\n\n<p>For
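The length-normalised edit-distance comparison described above can be sketched in Ruby (the post's code is PHP). Ruby's standard library has no Levenshtein function, so a textbook dynamic-programming version is included; the 0.1 threshold is an illustrative assumption, not a value from the post.

```ruby
# Classic dynamic-programming Levenshtein distance between two strings.
def levenshtein(a, b)
  m, n = a.length, b.length
  prev = (0..n).to_a
  (1..m).each do |i|
    curr = [i] + [0] * n
    (1..n).each do |j|
      cost = a[i - 1] == b[j - 1] ? 0 : 1
      curr[j] = [curr[j - 1] + 1, prev[j] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev[n]
end

# Normalise the distance by the longer string's length, then apply an
# arbitrary threshold to decide whether two citation strings match.
def same_reference?(a, b, threshold: 0.1)
  return true if a == b
  normalised = levenshtein(a, b).to_f / [a.length, b.length].max
  normalised <= threshold
end
```

Normalising by length means a three-character difference is decisive for short strings but negligible for long formatted citations.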
|
1166
|
-
this step we read the JSONL file produced above and record whether the two
|
1167
|
-
references are a match or not. Assuming each reference has a unique identifier
|
1168
|
-
(it need only be unique within the file) then we can use those identifiers to
|
1169
|
-
record the clusters each reference belongs to. I do this using a <a href=\"https://en.wikipedia.org/wiki/Disjoint-set_data_structure\">Disjoint-set
|
1170
|
-
data structure</a>. For each reference start with a graph where each node
|
1171
|
-
represents a reference, and each node has a pointer to a parent node. Initially
|
1172
|
-
the reference is its own parent. A simple implementation is to have an array
|
1173
|
-
indexed by reference identifiers, where the value of each cell in the array
|
1174
|
-
is the node''s parent.</p>\n\n<p>As we discover pairs we update the parents
|
1175
|
-
of the nodes to reflect this, such that once all the comparisons are done
|
1176
|
-
we have one or more sets of clusters corresponding to the references that
|
1177
|
-
we think are the same. Another way to think of this is that we are getting
|
1178
|
-
the components of a graph where each node is a reference and each pair of references
|
1179
|
-
that match are connected by an edge.</p>\n\n<p>In the code I''m using I write
|
1180
|
-
this graph in <a href=\"https://en.wikipedia.org/wiki/Trivial_Graph_Format\">Trivial
|
1181
|
-
Graph Format</a> (TGF) which can be visualised using a tools such as <a href=\"https://www.yworks.com/products/yed\">yEd</a>.</p>\n\n<h2>Merging</h2>\n\n<p>Now
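The disjoint-set clustering step described above can be sketched as follows; a minimal Ruby version (the post's implementation is an array-based PHP script), with path compression added as the usual refinement. The class and method names are illustrative.

```ruby
# Disjoint-set (union-find): every reference starts as its own parent;
# each matching pair merges two clusters. find() follows parent pointers
# to the root, compressing the path as it goes.
class DisjointSet
  def initialize
    @parent = {}
  end

  def find(x)
    @parent[x] ||= x                                 # node is its own parent initially
    @parent[x] = find(@parent[x]) unless @parent[x] == x
    @parent[x]
  end

  def union(a, b)
    ra, rb = find(a), find(b)
    @parent[ra] = rb unless ra == rb
  end

  # Group every seen node by its root: one group per cluster,
  # i.e. the connected components of the match graph.
  def clusters
    @parent.keys.group_by { |x| find(x) }.values
  end
end
```

Feeding each matched pair to `union` leaves `clusters` holding exactly the connected components of the graph whose edges are the matches.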
|
1182
|
-
that we have a graph representing the sets of references that we think are
|
1183
|
-
the same we need to merge them. This is where things get interesting as the
|
1184
|
-
references are similar (by definition) but may differ in some details. The
|
1185
|
-
paper below describes a simple Bayesian approach for merging records:</p>\n\n<blockquote>\nCouncill
|
1186
|
-
IG, Li H, Zhuang Z, Debnath S, Bolelli L, Lee WC, Sivasubramaniam A, Giles
|
1187
|
-
CL. 2006. Learning Metadata from the Evidence in an On-line Citation Matching
|
1188
|
-
Scheme. In: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital
|
1189
|
-
Libraries. JCDL ’06. New York, NY, USA: ACM, 276–285. DOI: <a href=\"https://doi.org/10.1145/1141753.1141817\">10.1145/1141753.1141817</a>.\n</blockquote>\n\n<p>So
|
1190
|
-
the next step is to read the graph with the clusters, generate the sets of
|
1191
|
-
bibliographic references that correspond to each cluster, then use the method
|
1192
|
-
described in Councill et al. to produce a single bibliographic record for
|
1193
|
-
that cluster. These records could then be used to, say, locate the corresponding
|
1194
|
-
article in BHL, or populate Wikidata with missing references.</p>\n\n<p>Obviously
|
1195
|
-
there is always the potential for errors, such as trying to merge references
|
1196
|
-
that are not the same. As a quick and dirty check I flag as dubious any cluster
|
1197
|
-
where the page numbers vary among members of the cluster. More sophisticated
|
1198
|
-
checks are possible, especially if I go down the ML route (i.e., I would have
|
1199
|
-
evidence for the probability that the same reference can disagree on some
|
1200
|
-
aspects of metadata).</p>\n\n<h2>Summary</h2>\n\n<p>At this stage the code
|
1201
|
-
is working well enough for me to play with and explore some example datasets.
|
1202
|
-
The focus is on structured bibliographic metadata, but I may simplify things
|
1203
|
-
and have a version that handles simple string matching, for example to cluster
|
1204
|
-
together different abbreviations of the same journal name.</p>","tags":["data
|
1205
|
-
cleaning","deduplication","Phytotaxa","Wikispecies","Zootaxa"],"language":"en","references":null},{"id":"https://doi.org/10.59350/ndtkv-6ve80","uuid":"e8e95aaf-bacb-4b5a-bf91-54e903526ab2","url":"https://iphylo.blogspot.com/2021/11/revisiting-rss-to-monitor-latests.html","title":"Revisiting
|
1206
|
-
RSS to monitor the latest taxonomic research","summary":"Over a decade ago
|
1207
|
-
RSS (RDF Site Summary or Really Simple Syndication) was attracting a lot of
|
1208
|
-
interest as a way to integrate data across various websites. Many science
|
1209
|
-
publishers would provide a list of their latest articles in XML in one of
|
1210
|
-
three flavours of RSS (RDF, RSS, Atom). This led to tools such as uBioRSS
|
1211
|
-
[1] and my own e-Biosphere Challenge: visualising biodiversity digitisation
|
1212
|
-
in real time. It was a time of enthusiasm for aggregating lots of data, such
|
1213
|
-
as the ill-fated PLoS...","date_published":"2021-11-23T20:53:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
|
1214
|
-
(Roderic Page)"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
|
1215
|
-
both;\"><a href=\"https://1.bp.blogspot.com/-dsij6_nhdsc/SgHYU5MCwsI/AAAAAAAAAe8/9KN6Gm87sj0PCuJG-crvZoMbL8MJvusegCPcBGAYYCw/s257/feedicon.png\"
|
1216
|
-
style=\"display: block; padding: 1em 0; text-align: center; clear: right;
|
1217
|
-
float: right;\"><img alt=\"\" border=\"0\" width=\"128\" data-original-height=\"257\"
|
1218
|
-
data-original-width=\"257\" src=\"https://1.bp.blogspot.com/-dsij6_nhdsc/SgHYU5MCwsI/AAAAAAAAAe8/9KN6Gm87sj0PCuJG-crvZoMbL8MJvusegCPcBGAYYCw/s200/feedicon.png\"/></a></div>\n<p>Over
|
1219
|
-
a decade ago <a href=\"https://en.wikipedia.org/wiki/RSS\">RSS</a> (RDF Site
|
1220
|
-
Summary or Really Simple Syndication) was attracting a lot of interest as
|
1221
|
-
a way to integrate data across various websites. Many science publishers would
|
1222
|
-
provide a list of their latest articles in XML in one of three flavours of
|
1223
|
-
RSS (RDF, RSS, Atom). This led to tools such as <a href=\"http://ubio.org/rss/\">uBioRSS</a>
|
1224
|
-
[<a href=\"#Leary2007\">1</a>] and my own <a href=\"https://iphylo.blogspot.com/2009/05/e-biosphere-challenge-visualising.html\">e-Biosphere
|
1225
|
-
Challenge: visualising biodiversity digitisation in real time</a>. It was
|
1226
|
-
a time of enthusiasm for aggregating lots of data, such as the <a href=\"https://iphylo.blogspot.com/2013/07/the-demise-of-plos-biodiversity-hub.html\">ill-fated</a>
|
1227
|
-
PLoS Biodiversity Hub [<a href=\"#Mindell2011\">2</a>].</p>\n\n<p>Since I
|
1228
|
-
seem to be condemned to revisit old ideas rather than come up with anything
|
1229
|
-
new, I''ve been looking at providing a tool like the now defunct uBioRSS.
|
1230
|
-
The idea is to harvest RSS feeds from journals (with an emphasis on taxonomic
|
1231
|
-
and systematic journals), aggregate the results, and make them browsable by
|
1232
|
-
taxon and geography. Here''s a sneak peek:</p>\n\n<div class=\"separator\"
|
1233
|
-
style=\"clear: both;\"><a href=\"https://1.bp.blogspot.com/-toYkBpS81tE/YZ1VY1DU1FI/AAAAAAAAg4E/yFMM4Xc3AEE8BjCj0jKO0sLtT9ZI-3k8ACLcBGAsYHQ/s1032/Screenshot%2B2021-11-23%2Bat%2B20.00.06.png\"
|
1234
|
-
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
|
1235
|
-
border=\"0\" width=\"400\" data-original-height=\"952\" data-original-width=\"1032\"
|
1236
|
-
src=\"https://1.bp.blogspot.com/-toYkBpS81tE/YZ1VY1DU1FI/AAAAAAAAg4E/yFMM4Xc3AEE8BjCj0jKO0sLtT9ZI-3k8ACLcBGAsYHQ/s400/Screenshot%2B2021-11-23%2Bat%2B20.00.06.png\"/></a></div>\n\n<p>What
|
1237
|
-
seems like a straightforward task quickly became a bit of a challenge. Not
|
1238
|
-
all journals have RSS feeds (they seem to have become less widely supported
|
1239
|
-
over time) so I need to think of alternative ways to get lists of recent articles.
|
1240
|
-
These lists also need to be processed in various ways. There are three versions
|
1241
|
-
of RSS, each with its own idiosyncrasies, so I need to standardise things
|
1242
|
-
like dates. I also want to augment them with things like DOIs (often missing
|
1243
|
-
from RSS feeds) and thumbnails for the articles (often available on publisher
|
1244
|
-
websites but not the feeds). Then I need to index the content by taxon and
|
1245
|
-
geography. For taxa I use a version of Patrick Leary''s \"taxonfinder\" (see
|
1246
|
-
<a href=\"https://right-frill.glitch.me\">https://right-frill.glitch.me</a>)
|
1247
|
-
to find names, then the <a href=\"https://index.globalnames.org\">Global Names
|
1248
|
-
Index</a> to assign names found to the GBIF taxonomic hierarchy.</p>\n\n<p>Indexing
|
1249
|
-
by geography proved harder. Typically <a href=\"https://en.wikipedia.org/wiki/Toponym_resolution#Geoparsing\">geoparsing</a>
|
1250
|
-
involves taking a body of text and doing the following:\n<ul><li>Using named-entity
|
1251
|
-
recognition <a href=\"https://en.wikipedia.org/wiki/Named-entity_recognition\">NER</a>
|
1252
|
-
to identity named entities in the text (e.g., place names, people names, etc.).</li>\n<li>Using
|
1253
|
-
a gazetteer of geographic names <a href=\"http://www.geonames.org\">GeoNames</a>
|
1254
|
-
to try and match the place names found by NER.</li>\n</ul></p>\n\n<p>An example
|
1255
|
-
of such a parser is the <a href=\"https://www.ltg.ed.ac.uk/software/geoparser/\">Edinburgh
|
1256
|
-
Geoparser</a>. Typically geoparsing software can be large and tricky to install,
|
1257
|
-
especially if you are looking to make your installation publicly accessible.
|
1258
|
-
Geoparsing services seem to have a short half-life (e.g., <a href=\"https://geoparser.io\">Geoparser.io</a>),
|
1259
|
-
perhaps because they are so useful they quickly get swamped by users.</p>\n\n<p>Bearing
|
1260
|
-
this in mind, the approach I’ve taken here is to create a very simple geoparser
|
1261
|
-
that is focussed on fairly large areas, especially those relevant to biodiversity,
|
1262
|
-
and is aimed at geoparsing text such as abstracts of scientific papers. I''ve
|
1263
|
-
created a small database of places by harvesting data from Wikidata, then
|
1264
|
-
I use the \"flash text\" algorithm [<a href=\"#Singh2017\">3</a>] to find
|
1265
|
-
geographic places. This approach uses a <a href=\"https://en.wikipedia.org/wiki/Trie\">trie</a>
|
1266
|
-
to store the place names. All I do is walk through the text seeing whether
|
1267
|
-
the current word matches a place name (or the start of one) in the trie, then
|
1268
|
-
moving on. This is very quick and seems to work quite well.</p>\n\n<p>Given
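The trie-walking scan described above can be sketched in Ruby. This is a word-level simplification in the spirit of the "flash text" algorithm, not the post's actual code (and the post does not specify its implementation language for this part); the helper names and the longest-match policy are assumptions.

```ruby
# Build a word-level trie of place names; :end marks a complete name.
def build_trie(names)
  trie = {}
  names.each do |name|
    node = name.downcase.split.reduce(trie) { |t, w| t[w] ||= {} }
    node[:end] = name
  end
  trie
end

# Walk the words of the text, following the trie from each position and
# keeping the longest complete place name found, then jump past it.
def find_places(text, trie)
  words = text.downcase.scan(/[a-z]+/)
  found = []
  i = 0
  while i < words.length
    node = trie
    j = i
    match = nil
    while j < words.length && node.key?(words[j])
      node = node[words[j]]
      j += 1
      match = [node[:end], j] if node[:end]   # remember longest match so far
    end
    if match
      found << match[0]
      i = match[1]
    else
      i += 1
    end
  end
  found
end
```

Because the scan advances one word at a time and only descends the trie on matches, the cost is roughly linear in the text length regardless of how many place names are loaded.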
|
1269
|
-
that I need to aggregate data from a lot of sources, apply various transformations
|
1270
|
-
to that data, then merge it, there are a lot of moving parts. I started playing
|
1271
|
-
with a \"NoCode\" platform for creating workflows, in this case <a href=\"https://n8n.io\">n8n</a>
|
1272
|
-
(in many ways reminiscent of the now defunct <a href=\"https://en.wikipedia.org/wiki/Yahoo!_Pipes\">Yahoo
|
1273
|
-
Pipes</a>). This was quite fun for a while, but after lots of experimentation
|
1274
|
-
I moved back to writing code to aggregate the data into a CouchDB database.
|
1275
|
-
CouchDB is one of the NoSQL databases that I really like as it has a great
|
1276
|
-
interface, and makes queries very easy to do once you get your head around
|
1277
|
-
how it works.</p>\n\n<p>So the end result of this is \"BioRSS\" <a href=\"https://biorss.herokuapp.com\">https://biorss.herokuapp.com</a>.
|
1278
|
-
The interface comprises a stream of articles listed from newest to oldest,
|
1279
|
-
with a treemap and a geographic map on the left. You can use these to filter
|
1280
|
-
the articles by taxonomic group and/or country. For example the screen shot
|
1281
|
-
is showing arthropods from China (in this case from a month or two ago in
|
1282
|
-
the journal <i>ZooKeys</i>). As much fun as the interface has been to construct,
|
1283
|
-
in many ways I don''t really want to spend time making an interface. For each
|
1284
|
-
combination of taxon and country I provide an RSS feed so if you have a favourite
|
1285
|
-
feed reader you can grab the feed and view it there. As BioRSS updates the
|
1286
|
-
data your feed reader should automatically update the feed. This means that
|
1287
|
-
you can have a feed that monitors, say, new papers on spiders in China.</p>\n\n<p>In
|
1288
|
-
the spirit of \"release early and release often\" this is an early version
|
1289
|
-
of this app. I need to add a lot more feeds, back date them to bring in older
|
1290
|
-
content, and I also want to make use of aggregators such as PubMed, CrossRef,
|
1291
|
-
and Google Scholar. The existence of these tools is, I suspect, one reason
|
1292
|
-
why RSS feeds are less common than they used to be.</p>\n\n<p>So, if this
|
1293
|
-
sounds useful please take it for a spin at <a href=\"https://biorss.herokuapp.com\">https://biorss.herokuapp.com</a>.
|
1294
|
-
Feedback is welcome, especially suggestions for journals to harvest and add
|
1295
|
-
to the news feed. Ultimately I''d like to have sufficient coverage of the
|
1296
|
-
taxonomic literature so that BioRSS becomes a place where we can go to find
|
1297
|
-
the latest papers on any taxon of interest.</p>\n\n<h2>References</h2>\n\n<blockquote>\n<a
|
1298
|
-
name=\"Leary2007\">1.</a> Patrick R. Leary, David P. Remsen, Catherine N.
|
1299
|
-
Norton, David J. Patterson, Indra Neil Sarkar, uBioRSS: Tracking taxonomic
|
1300
|
-
literature using RSS, Bioinformatics, Volume 23, Issue 11, June 2007, Pages
|
1301
|
-
1434–1436, <a href=\"https://doi.org/10.1093/bioinformatics/btm109\">https://doi.org/10.1093/bioinformatics/btm109</a>\n</blockquote>\n\n<blockquote><a
|
1302
|
-
name=\"Mindell2011\">2.</a> Mindell, D. P., Fisher, B. L., Roopnarine, P.,
|
1303
|
-
Eisen, J., Mace, G. M., Page, R. D. M., & Pyle, R. L. (2011). Aggregating,
|
1304
|
-
Tagging and Integrating Biodiversity Research. PLoS ONE, 6(8), e19491. <a
|
1305
|
-
href=\"https://doi.org/10.1371/journal.pone.0019491\">doi:10.1371/journal.pone.0019491</a>\n</blockquote>\n\n<blockquote><a
|
1306
|
-
name=\"Singh2017\">3.</a> Singh, V. (2017). Replace or Retrieve Keywords In
|
1307
|
-
Documents at Scale. CoRR, abs/1711.00046. <a href=\"http://arxiv.org/abs/1711.00046\">http://arxiv.org/abs/1711.00046</a>\n\n</blockquote>","tags":["geocoding","NoCode","RSS"],"language":"en","references":[{"doi":"https://doi.org/10.1093/bioinformatics/btm109","key":"ref1"},{"doi":"https://doi.org/10.1371/journal.pone.0019491","key":"ref2"},{"key":"ref3","url":"http://arxiv.org/abs/1711.00046"}]},{"id":"https://doi.org/10.59350/gf1dw-n1v47","uuid":"a41163e0-9c9a-41e0-a141-f772663f2f32","url":"https://iphylo.blogspot.com/2023/03/dugald-stuart-page-1936-2022.html","title":"Dugald
|
258
|
+
herein.</p>","tags":["guest post","macroscope"],"language":"en","references":[]},{"id":"https://doi.org/10.59350/gf1dw-n1v47","uuid":"a41163e0-9c9a-41e0-a141-f772663f2f32","url":"https://iphylo.blogspot.com/2023/03/dugald-stuart-page-1936-2022.html","title":"Dugald
|
1308
259
|
Stuart Page 1936-2022","summary":"My dad died last weekend. Below is a notice
|
1309
260
|
in today''s New Zealand Herald. I''m in New Zealand for his funeral. Don''t
|
1310
|
-
really have the words for this right now.","date_published":"2023-03-14T03:00:00Z","date_modified":
|
1311
|
-
|
261
|
+
really have the words for this right now.","date_published":"2023-03-14T03:00:00Z","date_modified":"2023-03-22T07:25:56Z","date_indexed":"1909-06-16T10:41:55+00:00","authors":[{"url":null,"name":"Roderic
|
262
|
+
Page"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
|
1312
263
|
both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZweukxntl7R5jnk3knVFVrqZ5RxC7mPZBV4gKeDIglbFzs2O442nbxqs8t8jV2tLqCU24K6gS32jW-Pe8q3O_5JR1Ms3qW1aQAZ877cKkFfcUydqUba9HsgNlX-zS9Ne92eLxRGS8F-lStTecJw2oalp3u58Yoc0oM7CUin5LKPeFIJ7Rzg/s3454/_DSC5106.jpg\"
|
1313
264
|
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
|
1314
265
|
border=\"0\" width=\"400\" data-original-height=\"2582\" data-original-width=\"3454\"
|
@@ -1318,172 +269,7 @@ http_interactions:
|
|
1318
269
|
now.\n\n<div class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRUTOFF1VWHCl8dg3FQuaWy5LM7aX8IivdRpTtzgrdQTEymsA5bLTZE3cSQf1WQIP3XrC46JsLScP8BxTK9C5a-B1i51yg8WGSJD0heJVaoDLnerv0lD1o3qloDjqEuuyfX4wagHB5YYBmjWnGeVQvyYVngvDDf9eM6pmMtZ7x94Y4jSVrug/s3640/IMG_2870.jpeg\"
|
1319
270
|
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
|
1320
271
|
border=\"0\" height=\"320\" data-original-height=\"3640\" data-original-width=\"1391\"
|
1321
|
-
src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRUTOFF1VWHCl8dg3FQuaWy5LM7aX8IivdRpTtzgrdQTEymsA5bLTZE3cSQf1WQIP3XrC46JsLScP8BxTK9C5a-B1i51yg8WGSJD0heJVaoDLnerv0lD1o3qloDjqEuuyfX4wagHB5YYBmjWnGeVQvyYVngvDDf9eM6pmMtZ7x94Y4jSVrug/s320/IMG_2870.jpeg\"/></a></div>","tags":[],"language":"en","references":
|
1322
|
-
for WikiData (WikiCite)","summary":"I''ve released a very crude GraphQL endpoint
|
1323
|
-
for WikiData. More precisely, the endpoint is for a subset of the entities
|
1324
|
-
that are of interest to WikiCite, such as scholarly articles, people, and
|
1325
|
-
journals. There is a crude demo at https://wikicite-graphql.herokuapp.com.
|
1326
|
-
The endpoint itself is at https://wikicite-graphql.herokuapp.com/gql.php.
|
1327
|
-
There are various ways to interact with the endpoint, personally I like the
|
1328
|
-
Altair GraphQL Client by Samuel Imolorhe. As I''ve mentioned earlier it''s
|
1329
|
-
taken...","date_published":"2021-12-20T13:16:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
|
1330
|
-
(Roderic Page)"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
|
1331
|
-
both;\"><a href=\"https://blogger.googleusercontent.com/img/a/AVvXsEh7WW8TlfN2u3xe_E-sH5IK6AWYnAoWaKrP2b32UawUeqMpPlq6ZFk5BtJVZsMmCNh5j3QRsTj5H0Ee55RRGntnc9yjj_mNB8KHmH2dzocCgyLS2VFOxsBji6u4Ey6qxlDAnT-zrsBpnDcTchbhgt1x0Sf7RkmIMkS1y4-_3KCQian-SeIF-g=s1000\"
|
1332
|
-
style=\"display: block; padding: 1em 0; text-align: center; clear: right;
|
1333
|
-
float: right;\"><img alt=\"\" border=\"0\" width=\"128\" data-original-height=\"1000\"
|
1334
|
-
data-original-width=\"1000\" src=\"https://blogger.googleusercontent.com/img/a/AVvXsEh7WW8TlfN2u3xe_E-sH5IK6AWYnAoWaKrP2b32UawUeqMpPlq6ZFk5BtJVZsMmCNh5j3QRsTj5H0Ee55RRGntnc9yjj_mNB8KHmH2dzocCgyLS2VFOxsBji6u4Ey6qxlDAnT-zrsBpnDcTchbhgt1x0Sf7RkmIMkS1y4-_3KCQian-SeIF-g=s200\"/></a></div><p>I''ve
|
1335
|
-
released a very crude GraphQL endpoint for WikiData. More precisely, the endpoint
|
1336
|
-
is for a subset of the entities that are of interest to WikiCite, such as
|
1337
|
-
scholarly articles, people, and journals. There is a crude demo at <a href=\"https://wikicite-graphql.herokuapp.com\">https://wikicite-graphql.herokuapp.com</a>.
|
1338
|
-
The endpoint itself is at <a href=\"https://wikicite-graphql.herokuapp.com/gql.php\">https://wikicite-graphql.herokuapp.com/gql.php</a>.
|
1339
|
-
There are various ways to interact with the endpoint, personally I like the
|
1340
|
-
<a href=\"https://altair.sirmuel.design\">Altair GraphQL Client</a> by <a
|
1341
|
-
href=\"https://github.com/imolorhe\">Samuel Imolorhe</a>.</p>\n\n<p>As I''ve
|
1342
|
-
<a href=\"https://iphylo.blogspot.com/2021/04/it-been-while.html\">mentioned
|
1343
|
-
earlier</a> it''s taken me a while to see the point of GraphQL. But it is
|
1344
|
-
clear it is gaining traction in the biodiversity world (see for example the
|
1345
|
-
<a href=\"https://dev.gbif.org/hosted-portals.html\">GBIF Hosted Portals</a>)
|
1346
|
-
so it''s worth exploring. My take on GraphQL is that it is a way to create
|
1347
|
-
a self-describing API that someone developing a web site can use without them
|
1348
|
-
having to bury themselves in the gory details of how data is internally modelled.
|
1349
|
-
For example, WikiData''s query interface uses SPARQL, a powerful language
|
1350
|
-
that has a steep learning curve (in part because of the administrative overhead
|
1351
|
-
brought by RDF namespaces, etc.). In my previous SPARQL-based projects such
|
1352
|
-
as <a href=\"https://ozymandias-demo.herokuapp.com\">Ozymandias</a> and <a
|
1353
|
-
href=\"http://alec-demo.herokuapp.com\">ALEC</a> I have either returned SPARQL
|
1354
|
-
results directly (Ozymandias) or formatted SPARQL results as <a href=\"https://schema.org/DataFeed\">schema.org
|
1355
|
-
DataFeeds</a> (equivalent to RSS feeds) (ALEC). Both approaches work, but
|
1356
|
-
they are project-specific and if anyone else tried to build based on these
|
1357
|
-
projects they might struggle to figure out what was going on. I certainly
|
1358
|
-
struggle, and I wrote them!</p>\n\n<p>So it seems worthwhile to explore this
|
1359
|
-
approach a little further and see if I can develop a GraphQL interface that
|
1360
|
-
can be used to build the sort of rich apps that I want to see. The demo I''ve
|
1361
|
-
created uses SPARQL under the hood to provide responses to the GraphQL queries.
|
1362
|
-
So in this sense it''s not replacing SPARQL, it''s simply providing a (hopefully)
|
1363
|
-
simpler overlay on top of SPARQL so that we can retrieve the data we want
|
1364
|
-
without having to learn the intricacies of SPARQL, nor how Wikidata models
|
1365
|
-
publications and people.</p>","tags":["GraphQL","SPARQL","WikiCite","Wikidata"],"language":"en","references":null},{"id":"https://doi.org/10.59350/ymc6x-rx659","uuid":"0807f515-f31d-4e2c-9e6f-78c3a9668b9d","url":"https://iphylo.blogspot.com/2022/09/dna-barcoding-as-intergenerational.html","title":"DNA
|
1366
|
-
barcoding as intergenerational transfer of taxonomic knowledge","summary":"I
|
1367
|
-
tweeted about this but want to bookmark it for later as well. The paper “A
|
1368
|
-
molecular-based identification resource for the arthropods of Finland” doi:10.1111/1755-0998.13510
|
1369
|
-
contains the following: …the annotated barcode records assembled by FinBOL
|
1370
|
-
participants represent a tremendous intergenerational transfer of taxonomic
|
1371
|
-
knowledge … the time contributed by current taxonomists in identifying and
|
1372
|
-
contributing voucher specimens represents a great gift to future generations
|
1373
|
-
who will benefit...","date_published":"2022-09-14T10:12:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
|
1374
|
-
(Roderic Page)"}],"image":null,"content_html":"<p>I <a href=\"https://twitter.com/rdmpage/status/1569738844416638981?s=21&t=9OVXuoUEwZtQt-Ldzlutfw\">tweeted
|
1375
|
-
about this</a> but want to bookmark it for later as well. The paper “A molecular-based
|
1376
|
-
identification resource for the arthropods of Finland” <a href=\"https://doi.org/10.1111/1755-0998.13510\">doi:10.1111/1755-0998.13510</a>
|
1377
|
-
contains the following:</p>\n<blockquote>\n<p>…the annotated barcode records
|
1378
|
-
assembled by FinBOL participants represent a tremendous <mark>intergenerational
|
1379
|
-
transfer of taxonomic knowledge</mark> … the time contributed by current taxonomists
|
1380
|
-
in identifying and contributing voucher specimens represents a great gift
|
1381
|
-
to future generations who will benefit from their expertise when they are
|
1382
|
-
no longer able to process new material.</p>\n</blockquote>\n<p>I think this
|
1383
|
-
is a very clever way to characterise the project. In an age of machine learning
|
1384
|
-
this may be the commonest way to share knowledge, namely as expert-labelled training
|
1385
|
-
data used to build tools for others. Of course, this means the expertise itself
|
1386
|
-
may be lost, which has implications for updating the models if the data isn’t
|
1387
|
-
complete. But it speaks to Charles Godfrey’s theme of <a href=\"https://biostor.org/reference/250587\">“Taxonomy
|
1388
|
-
as information science”</a>.</p>\n<p>Note that the knowledge is also transformed
|
1389
|
-
in the sense that the underlying expertise of interpreting morphology, ecology,
|
1390
|
-
behaviour, genomics, and the past literature is not what is being passed on.
|
1391
|
-
Instead it is probabilities that a DNA sequence belongs to a particular taxon.</p>\n<p>This
|
1392
|
-
feels different to, say, iNaturalist, where there is a machine learning
|
1393
|
-
model to identify images. In that case, the model is built on something the
|
1394
|
-
community itself has created, and continues to create. Yes, the underlying
|
1395
|
-
idea is that same: “experts” have labelled the data, a model is trained, the
|
1396
|
-
model is used. But the benefits of the <a href=\"https://www.inaturalist.org\">iNaturalist</a>
|
1397
|
-
model are immediately applicable to the people whose data built the model.
|
1398
|
-
In the case of barcoding, because the technology itself is still not in the
|
1399
|
-
hands of many (relative to, say, digital imaging), the benefits are perhaps
|
1400
|
-
less tangible. Obviously researchers working with environmental DNA will find
|
1401
|
-
it very useful, but broader impact may await the arrival of citizen science
|
1402
|
-
DNA barcoding.</p>\n<p>The other consideration is whether the barcoding helps
|
1403
|
-
taxonomists. Is it to be used to help prioritise future work (“we are getting
|
1404
|
-
lots of unknown sequences in these taxa, lets do some taxonomy there”), or
|
1405
|
-
is it simply capturing the knowledge of a generation that won’t be replaced:</p>\n<blockquote>\n<p>The
|
1406
|
-
need to capture such knowledge is essential because there are, for example,
|
1407
|
-
no young Finnish taxonomists who can critically identify species in many key
|
1408
|
-
groups of arthropods (e.g., aphids, chewing lice, chalcid wasps, gall midges,
|
1409
|
-
most mite lineages).</p>\n</blockquote>\n<p>The cycle of collect data, test
|
1410
|
-
and refine model, collect more data, rinse and repeat that happens with iNaturalist
|
1411
|
-
creates a feedback loop. It’s not clear that a similar cycle exists for DNA
|
1412
|
-
barcoding.</p>\n<blockquote>\n<p>Written with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/enxas-arj18","uuid":"ab5a6e04-d55e-4901-8269-9eea65ce7178","url":"https://iphylo.blogspot.com/2022/08/can-we-use-citation-graph-to-measure.html","title":"Can
|
1413
|
-
we use the citation graph to measure the quality of a taxonomic database?","summary":"More
|
1414
|
-
arm-waving notes on taxonomic databases. I''ve started to add data to ChecklistBank
|
1415
|
-
and this has got me thinking about the issue of data quality. When you add
|
1416
|
-
data to ChecklistBank you are asked to give a measure of confidence based
|
1417
|
-
on the Catalogue of Life Checklist Confidence system of one - five stars:
|
1418
|
-
★ - ★★★★★. I''m sceptical about the notion of confidence or \"trust\" when
|
1419
|
-
it is reduced to a star system (see also Can you trust EOL?). I could literally
|
1420
|
-
pick any number of stars, there''s...","date_published":"2022-08-24T14:33:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
|
1421
|
-
(Roderic Page)"}],"image":null,"content_html":"<p>More arm-waving notes on
|
1422
|
-
taxonomic databases. I''ve started to add data to <a href=\"https://www.checklistbank.org\">ChecklistBank</a>
|
1423
|
-
and this has got me thinking about the issue of data quality. When you add
|
1424
|
-
data to ChecklistBank you are asked to give a measure of confidence based
|
1425
|
-
on the <a href=\"https://www.catalogueoflife.org/about/glossary.html#checklist-confidence\">Catalogue
|
1426
|
-
of Life Checklist Confidence</a> system of one - five stars: ★ - ★★★★★. I''m
|
1427
|
-
sceptical about the notion of confidence or \"trust\" when it is reduced
|
1428
|
-
to a star system (see also <a href=\"https://iphylo.blogspot.com/2012/06/can-you-trust-eol.html\">Can
|
1429
|
-
you trust EOL?</a>). I could literally pick any number of stars, there''s
|
1430
|
-
no way to measure what number of stars is appropriate. This feeds into my
|
1431
|
-
biggest reservation about the <a href=\"https://www.catalogueoflife.org\">Catalogue
|
1432
|
-
of Life</a>, it''s almost entirely authority based, not evidence based. That
|
1433
|
-
is, rather than give us evidence for why a particular taxon is valid, we are
|
1434
|
-
(mostly) just given a list of taxa and asked to accept those as gospel, based
|
1435
|
-
on assertions by one or more authorities. I''m not necessarily doubting the
|
1436
|
-
knowledge of those making these lists, it''s just that I think we need to
|
1437
|
-
do better than \"these are the accepted taxa because I say so\" implicit in
|
1438
|
-
the Catalogue of Life.\n</p>\n\n<p>So, is there any way we could objectively
|
1439
|
-
measure the quality of a particular taxonomic checklist? Since I have a long
|
1440
|
-
standing interest in linking the primary taxonomic literature to names in databases
|
1441
|
-
(since that''s where the evidence is), I keep wondering whether measures based
|
1442
|
-
on that literature could be developed. \n</p>\n<p>\nI recently revisited the
|
1443
|
-
fascinating (and quite old) literature on rates of synonymy:\n</p>\n<blockquote>\nGaston
|
1444
|
-
Kevin J. and Mound Laurence A. 1993 Taxonomy, hypothesis testing and the
|
1445
|
-
biodiversity crisis. Proc. R. Soc. Lond. B 251:139–142.\n<a href=\"http://doi.org/10.1098/rspb.1993.0020\">http://doi.org/10.1098/rspb.1993.0020</a>\n</blockquote>\n \n<blockquote>\n Andrew
|
1446
|
-
R. Solow, Laurence A. Mound, Kevin J. Gaston, Estimating the Rate of Synonymy,
|
1447
|
-
Systematic Biology, Volume 44, Issue 1, March 1995, Pages 93–96, <a href=\"https://doi.org/10.1093/sysbio/44.1.93\">https://doi.org/10.1093/sysbio/44.1.93</a>\n</blockquote>\n\n</p>\n\n<p>\nA
|
1448
|
-
key point these papers make is that the observed rate of synonymy is quite
|
1449
|
-
high (that is, many \"new species\" end up being merged with already known
|
1450
|
-
species), and that because it can take time to discover that a species is
|
1451
|
-
a synonym the actual rate may be even higher. In other words, in diagrams
|
1452
|
-
like the one reproduced below, the reason the proportion of synonyms declines
|
1453
|
-
the nearer we get to the present day (this paper came out in 1995) is not
|
1454
|
-
because we are creating fewer synonyms but because we''ve not yet had time
|
1455
|
-
to do the work to uncover the remaining synonyms.\n</p>\n \n<div class=\"separator\"
|
1456
|
-
style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhgwvBcDUD_IJZyAa0gaqXB--ogJeVooV19I635N_w2OuztDFsy-ZPXEFuW5S1NUxxUEHmN8pbdO9MPljIi5v0A2355kYotL1fKqdewP9kmWu9TfwtIJ3jD04SQjeF3SWK-yMVAx-rNc6TO43GIwmftPk6IghOrcmur6SoHe06ws7fEFAvxA/s621/Screenshot%202022-08-24%20at%2014.59.47.png\"
|
1457
|
-
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
|
1458
|
-
border=\"0\" width=\"400\" data-original-height=\"404\" data-original-width=\"621\"
|
1459
|
-
src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhgwvBcDUD_IJZyAa0gaqXB--ogJeVooV19I635N_w2OuztDFsy-ZPXEFuW5S1NUxxUEHmN8pbdO9MPljIi5v0A2355kYotL1fKqdewP9kmWu9TfwtIJ3jD04SQjeF3SWK-yMVAx-rNc6TO43GIwmftPk6IghOrcmur6SoHe06ws7fEFAvxA/s400/Screenshot%202022-08-24%20at%2014.59.47.png\"/></a></div>\n\n<p>Put
|
1460
|
-
another way, these papers are arguing that real work of taxonomy is revision,
|
1461
|
-
not species discovery, especially since it''s not uncommon for > 50% of species
|
1462
|
-
in a taxon to end up being synonymised. Indeed, if a taxonomic group has few
|
1463
|
-
synonyms then these authors would argue that''s a sign of neglect. More revisionary
|
1464
|
-
work would likely uncover additional synonyms. So, what we need is a way to
|
1465
|
-
measure the amount of research on a taxonomic group. It occurs to me that
|
1466
|
-
we could use the citation graph as a way to tackle this. Let's imagine we have
|
1467
|
-
a set of taxa (say a family) and we have all the papers that described new
|
1468
|
-
species or undertook revisions (or both). The extensiveness of that work could
|
1469
|
-
be measured by the citation graph. For example, build the citation graph for
|
1470
|
-
those papers. How many original species descriptions are not cited? Those
|
1471
|
-
species have been potentially neglected. How many large-scale revisions have
|
1472
|
-
there been (as measured by the numbers of taxonomic papers those revisions
|
1473
|
-
cite)? There are some interesting approaches to quantifying this, such as
|
1474
|
-
using <a href=\"https://en.wikipedia.org/wiki/HITS_algorithm\">hubs and authorities</a>.</p>\n \n \n<p>I''m
|
1475
|
-
aware that taxonomists have not had the happiest relationship with citations:\n \n<blockquote>\nPinto
|
1476
|
-
ÂP, Mejdalani G, Mounce R, Silveira LF, Marinoni L, Rafael JA. Are publications
|
1477
|
-
on zoological taxonomy under attack? R Soc Open Sci. 2021 Feb 10;8(2):201617.
|
1478
|
-
<a href=\"https://doi.org/10.1098/rsos.201617\">doi: 10.1098/rsos.201617</a>.
|
1479
|
-
PMID: 33972859; PMCID: PMC8074659.\n</blockquote>\n\nStill, I think there
|
1480
|
-
is an intriguing possibility here. For this approach to work, we need to have
|
1481
|
-
linked taxonomic names to publications, and have citation data for those publications.
|
1482
|
-
This is happening on various platforms. Wikidata, for example, is becoming
|
1483
|
-
a repository of the taxonomic literature, some of it with citation links.\n\n<blockquote>\nPage
|
1484
|
-
RDM. 2022. Wikidata and the bibliography of life. PeerJ 10:e13712 <a href=\"https://doi.org/10.7717/peerj.13712\">https://doi.org/10.7717/peerj.13712</a>\n</blockquote>\n\nTime
|
1485
|
-
for some experiments.\n</p>","tags":["Bibliography of Life","citation","synonymy","taxonomic
|
1486
|
-
databases"],"language":"en","references":null},{"id":"https://doi.org/10.59350/cbzgz-p8428","uuid":"a93134aa-8b33-4dc7-8cd4-76cdf64732f4","url":"https://iphylo.blogspot.com/2023/04/library-interfaces-knowledge-graphs-and.html","title":"Library
|
272
|
+
src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRUTOFF1VWHCl8dg3FQuaWy5LM7aX8IivdRpTtzgrdQTEymsA5bLTZE3cSQf1WQIP3XrC46JsLScP8BxTK9C5a-B1i51yg8WGSJD0heJVaoDLnerv0lD1o3qloDjqEuuyfX4wagHB5YYBmjWnGeVQvyYVngvDDf9eM6pmMtZ7x94Y4jSVrug/s320/IMG_2870.jpeg\"/></a></div>","tags":[],"language":"en","references":[]},{"id":"https://doi.org/10.59350/cbzgz-p8428","uuid":"a93134aa-8b33-4dc7-8cd4-76cdf64732f4","url":"https://iphylo.blogspot.com/2023/04/library-interfaces-knowledge-graphs-and.html","title":"Library
|
1487
273
|
interfaces, knowledge graphs, and Miller columns","summary":"Some quick notes
|
1488
274
|
on interface ideas for digital libraries and/or knowledge graphs. Recently
|
1489
275
|
there’s been something of an explosion in bibliographic tools to explore the
|
@@ -1491,10 +277,10 @@ http_interactions:
|
|
1491
277
|
papers _scite which uses AI to do sentiment analysis on citations (does paper
|
1492
278
|
A cite paper B favourably or not?) ResearchRabbit which uses lists, networks,
|
1493
279
|
and timelines to discover related research Scispace which navigates connections
|
1494
|
-
between...","date_published":"2023-04-25T13:01:00Z","date_modified":
|
280
|
+
between...","date_published":"2023-04-25T13:01:00Z","date_modified":"2023-04-27T14:51:08Z","date_indexed":"1909-06-16T11:25:14+00:00","authors":[{"url":null,"name":"Roderic
|
281
|
+
Page"}],"image":null,"content_html":"<p>Some quick notes on interface ideas
|
282
|
+
for digital libraries and/or knowledge graphs.</p>\n<p>Recently there’s been
|
283
|
+
something of an explosion in bibliographic tools to explore the literature.
|
1498
284
|
Examples include:</p>\n<ul>\n<li><a href=\"https://elicit.org\">Elicit</a>
|
1499
285
|
which uses AI to search for and summarise papers</li>\n<li><a href=\"https://scite.ai\">_scite</a>
|
1500
286
|
which uses AI to do sentiment analysis on citations (does paper A cite paper
|
@@ -1576,15 +362,15 @@ http_interactions:
|
|
1576
362
|
in other words, how deep down the rabbit hole will we go?</p>\n\n<h3>Update</h3>\n<p>Should
|
1577
363
|
add link to David Regev''s explorations of <a href=\"https://medium.com/david-regev-on-ux/flow-browser-b730daf0f717\">Flow
|
1578
364
|
Browser</a>.\n\n<blockquote>\n<p>Written with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":["cards","flow","Knowledge
|
1579
|
-
Graph","Miller column","RabbitResearch"],"language":"en","references":
|
365
|
+
Graph","Miller column","RabbitResearch"],"language":"en","references":[]},{"id":"https://doi.org/10.59350/t6fb9-4fn44","uuid":"8bc3fea6-cb86-4344-8dad-f312fbf58041","url":"https://iphylo.blogspot.com/2021/12/the-business-of-extracting-knowledge.html","title":"The
|
1580
366
|
Business of Extracting Knowledge from Academic Publications","summary":"Markus
|
1581
367
|
Strasser (@mkstra) wrote a fascinating article entitled \"The Business of Extracting
|
1582
368
|
Knowledge from Academic Publications\". I spent months working on domain-specific
|
1583
369
|
search engines and knowledge discovery apps for biomedicine and eventually
|
1584
|
-
figured that synthesizing
|
1586
|
-
Markus Strasser (@mkstra) December 7, 2021 His TL;DR: TL;DR: I...","date_published":"2021-12-11T00:01:00Z","date_modified":
|
370
|
+
figured that synthesizing \"insights\" or building knowledge graphs by machine-reading
|
371
|
+
the academic literature (papers) is *barely useful* :https://t.co/eciOg30Odc—
|
372
|
+
Markus Strasser (@mkstra) December 7, 2021 His TL;DR: TL;DR: I worked on biomedical...","date_published":"2021-12-11T00:01:00Z","date_modified":"2021-12-11T00:01:21Z","date_indexed":"1909-06-16T11:32:09+00:00","authors":[{"url":null,"name":"Roderic
|
373
|
+
Page"}],"image":null,"content_html":"<p>Markus Strasser (<a href=\"https://twitter.com/mkstra\">@mkstra</a>
|
1588
374
|
wrote a fascinating article entitled <a href=\"https://markusstrasser.org/extracting-knowledge-from-literature/\">\"The
|
1589
375
|
Business of Extracting Knowledge from Academic Publications\"</a>.</p>\n\n<blockquote
|
1590
376
|
class=\"twitter-tweet\"><p lang=\"en\" dir=\"ltr\">I spent months working
|
@@ -1611,18 +397,18 @@ http_interactions:
|
|
1611
397
|
it will go anywhere.</p>\n\n<p>Don’t take that as a challenge. Take it as
|
1612
398
|
a red flag and run. Run towards better problems.\n</blockquote></p>\n\n<p>Well
|
1613
399
|
worth a read, and much food for thought.</p>","tags":["ai","business model","text
|
1614
|
-
mining"],"language":"en","references":
|
400
|
+
mining"],"language":"en","references":[]},{"id":"https://doi.org/10.59350/463yw-pbj26","uuid":"dc829ab3-f0f1-40a4-b16d-a36dc0e34166","url":"https://iphylo.blogspot.com/2022/12/david-remsen.html","title":"David
|
1615
401
|
Remsen","summary":"I heard yesterday from Martin Kalfatovic (BHL) that David
|
1616
402
|
Remsen has died. Very sad news. It''s starting to feel like iPhylo might end
|
1617
403
|
up being a list of obituaries of people working on biodiversity informatics
|
1618
404
|
(e.g., Scott Federhen). I spent several happy visits at MBL at Woods Hole
|
1619
405
|
talking to Dave at the height of the uBio project, which really kickstarted
|
1620
406
|
large scale indexing of taxonomic names, and the use of taxonomic name finding
|
1621
|
-
tools to index the literature. His work on uBio with David...","date_published":"2022-12-16T17:54:00Z","date_modified":
|
407
|
+
tools to index the literature. His work on uBio with David...","date_published":"2022-12-16T17:54:00Z","date_modified":"2022-12-17T08:12:23Z","date_indexed":"1909-06-16T11:41:39+00:00","authors":[{"url":null,"name":"Roderic
|
408
|
+
Page"}],"image":null,"content_html":"<p>I heard yesterday from Martin Kalfatovic
|
409
|
+
(BHL) that David Remsen has died. Very sad news. It''s starting to feel like
|
410
|
+
iPhylo might end up being a list of obituaries of people working on biodiversity
|
411
|
+
informatics (e.g., <a href=\"https://iphylo.blogspot.com/2016/05/scott-federhen-rip.html\">Scott
|
1626
412
|
Federhen</a>).</p>\n\n<p>I spent several happy visits at MBL at Woods Hole
|
1627
413
|
talking to Dave at the height of the uBio project, which really kickstarted
|
1628
414
|
large scale indexing of taxonomic names, and the use of taxonomic name finding
|
@@ -1643,149 +429,156 @@ http_interactions:
|
|
1643
429
|
David P. Remsen, Catherine N. Norton, David J. Patterson, Indra Neil Sarkar,
|
1644
430
|
uBioRSS: Tracking taxonomic literature using RSS, Bioinformatics, Volume 23,
|
1645
431
|
Issue 11, June 2007, Pages 1434–1436, https://doi.org/10.1093/bioinformatics/btm109</p>","tags":["David
|
1646
|
-
Remsen","obituary","uBio"],"language":"en","references":
|
432
|
+
Remsen","obituary","uBio"],"language":"en","references":[]},{"id":"https://doi.org/10.59350/3s376-6bm21","uuid":"62e7b438-67a3-44ac-a66d-3f5c278c949e","url":"https://iphylo.blogspot.com/2022/02/deduplicating-bibliographic-data.html","title":"Deduplicating
|
433
|
+
bibliographic data","summary":"There are several instances where I have a
|
434
|
+
collection of references that I want to deduplicate and merge. For example,
|
435
|
+
in Zootaxa has no impact factor I describe a dataset of the literature cited
|
436
|
+
by articles in the journal Zootaxa. This data is available on Figshare (https://doi.org/10.6084/m9.figshare.c.5054372.v4),
|
437
|
+
as is the equivalent dataset for Phytotaxa (https://doi.org/10.6084/m9.figshare.c.5525901.v1).
|
438
|
+
Given that the same articles may be cited many times, these datasets have
|
439
|
+
lots of...","date_published":"2022-02-03T15:09:00Z","date_modified":"2022-02-03T15:11:29Z","date_indexed":"1909-06-16T10:22:30+00:00","authors":[{"url":null,"name":"Roderic
|
440
|
+
Page"}],"image":null,"content_html":"<p>There are several instances where
|
441
|
+
I have a collection of references that I want to deduplicate and merge. For
|
442
|
+
example, in <a href=\"https://iphylo.blogspot.com/2020/07/zootaxa-has-no-impact-factor.html\">Zootaxa
|
443
|
+
has no impact factor</a> I describe a dataset of the literature cited by articles
|
444
|
+
in the journal <i>Zootaxa</i>. This data is available on Figshare (<a href=\"https://doi.org/10.6084/m9.figshare.c.5054372.v4\">https://doi.org/10.6084/m9.figshare.c.5054372.v4</a>),
|
445
|
+
as is the equivalent dataset for <i>Phytotaxa</i> (<a href=\"https://doi.org/10.6084/m9.figshare.c.5525901.v1\">https://doi.org/10.6084/m9.figshare.c.5525901.v1</a>).
|
446
|
+
Given that the same articles may be cited many times, these datasets have
|
447
|
+
lots of duplicates. Similarly, articles in <a href=\"https://species.wikimedia.org\">Wikispecies</a>
|
448
|
+
often have extensive lists of references cited, and the same reference may
|
449
|
+
appear on multiple pages (for an initial attempt to extract these references
|
450
|
+
see <a href=\"https://doi.org/10.5281/zenodo.5801661\">https://doi.org/10.5281/zenodo.5801661</a>
|
451
|
+
and <a href=\"https://github.com/rdmpage/wikispecies-parser\">https://github.com/rdmpage/wikispecies-parser</a>).</p>\n\n<p>There
|
452
|
+
are several reasons I want to merge these references. If I want to build a
|
453
|
+
citation graph for <i>Zootaxa</i> or <i>Phytotaxa</i> I need to merge references
|
454
|
+
that are the same so that I can accurate count citations. I am also interested
|
455
|
+
in harvesting the metadata to help find those articles in the <a href=\"https://www.biodiversitylibrary.org\">Biodiversity
|
456
|
+
Heritage Library</a> (BHL), and the literature cited section of scientific
|
457
|
+
articles is a potential goldmine of bibliographic metadata, as is Wikispecies.</p>\n\n<p>After
|
458
|
+
various experiments and false starts I''ve created a repository <a href=\"https://github.com/rdmpage/bib-dedup\">https://github.com/rdmpage/bib-dedup</a>
|
459
|
+
to host a series of PHP scripts to deduplicate bibliographics data. I''ve
|
460
|
+
settled on using CSL-JSON as the format for bibliographic data. Because deduplication
|
461
|
+
relies on comparing pairs of references, the standard format for most of the
|
462
|
+
scripts is a JSON array containing a pair of CSL-JSON objects to compare.
|
463
|
+
Below are the steps the code takes.</p>\n\n<h2>Generating pairs to compare</h2>\n\n<p>The
|
464
|
+
first step is to take a list of references and generate the pairs that will
|
465
|
+
be compared. I started with this approach as I wanted to explore machine learning
|
466
|
+
and wanted a simple format for training data, such as an array of two CSL-JSON
|
467
|
+
objects and an integer flag representing whether the two references were the
|
468
|
+
same or different.</p>\n\n<p>There are various ways to generate CSL-JSON for
|
469
|
+
a reference. I use a tool I wrote (see <a href=\"https://iphylo.blogspot.com/2021/07/citation-parsing-released.html\">Citation
|
470
|
+
parsing tool released</a>) that has a simple API where you parse one or more
|
471
|
+
references and it returns that reference as structured data in CSL-JSON.</p>\n\n<p>Attempting
|
472
|
+
to do all possible pairwise comparisons rapidly gets impractical as the number
|
473
|
+
of references increases, so we need some way to restrict the number of comparisons
|
474
|
+
we make. One approach I''ve explored is the “sorted neighbourhood method”
|
475
|
+
where we sort the references (for example by their title) then move a sliding
|
476
|
+
window down the list of references, comparing all references within that window.
|
477
|
+
This greatly reduces the number of pairwise comparisons. So the first step
|
478
|
+
is to sort the references, then run a sliding window over them, output all
|
479
|
+
the pairs in each window (ignoring pairwise comparisons already made in
|
480
|
+
a previous window). Other methods of \"blocking\" could also be used, such
|
481
|
+
as only including references in a particular year, or a particular journal.</p>\n\n<p>So,
|
482
|
+
the output of this step is a set of JSON arrays, each with a pair of references
|
483
|
+
in CSL-JSON format. Each array is stored on a single line in the same file
|
484
|
+
in <a href=\"https://en.wikipedia.org/wiki/JSON_streaming#Line-delimited_JSON_2\">line-delimited
|
485
|
+
JSON</a> (JSONL).</p>\n\n<h2>Comparing pairs</h2>\n\n<p>The next step is to
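The sliding-window pair generation described above can be sketched in Ruby (a minimal illustration: the window size of 5, the choice of title as sort key, and the method names are assumptions, not taken from the actual bib-dedup scripts, which are PHP):

```ruby
require "json"

# Sorted neighbourhood method: sort the references, then compare each
# reference only with its near neighbours rather than all possible pairs.
# `window: 5` and sorting by title are illustrative choices.
def generate_pairs(references, window: 5)
  sorted = references.sort_by { |r| r["title"].to_s.downcase }
  pairs = []
  sorted.each_index do |i|
    # pair reference i with the next (window - 1) references
    ((i + 1)...[i + window, sorted.length].min).each do |j|
      pairs << [sorted[i], sorted[j]]
    end
  end
  pairs
end

# Write each pair as one JSON array per line (JSONL)
refs = [{ "title" => "Aus bus" }, { "title" => "Aus bus sp. nov." }, { "title" => "Xus yus" }]
jsonl = generate_pairs(refs).map { |pair| JSON.generate(pair) }.join("\n")
```

Comparing each reference only with its window neighbours reduces the number of comparisons from quadratic to roughly linear in the number of references.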
|
486
|
+
compare each pair of references and decide whether they are a match or not.
|
487
|
+
Initially I explored a machine learning approach used in the following paper:</p>\n\n<blockquote>\nWilson
|
488
|
+
DR. 2011. Beyond probabilistic record linkage: Using neural networks and complex
|
489
|
+
features to improve genealogical record linkage. In: The 2011 International
|
490
|
+
Joint Conference on Neural Networks. 9–14. DOI: <a href=\"https://doi.org/10.1109/IJCNN.2011.6033192\">10.1109/IJCNN.2011.6033192</a>\n</blockquote>\n\n<p>Initial
|
491
|
+
experiments using <a href=\"https://github.com/jtet/Perceptron\">https://github.com/jtet/Perceptron</a>
|
492
|
+
were promising and I want to play with this further, but I decided to skip
|
493
|
+
this for now and just use simple string comparison. So for each CSL-JSON object
|
494
|
+
I generate a citation string in the same format using CiteProc, then compute
|
495
|
+
the <a href=\"https://en.wikipedia.org/wiki/Levenshtein_distance\">Levenshtein
|
496
|
+
distance</a> between the two strings. By normalising this distance by the
|
497
|
+
length of the two strings being compared I can use an arbitrary threshold
|
498
|
+
to decide if the references are the same or not.</p>\n\n<h2>Clustering</h2>\n\n<p>For
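The comparison step just described might look like the following sketch (the Levenshtein implementation, the plain citation strings, and the 0.1 threshold are all illustrative assumptions; the post's scripts format strings with CiteProc):

```ruby
# Classic dynamic-programming Levenshtein edit distance between two strings.
def levenshtein(a, b)
  d = Array.new(a.length + 1) { |i| [i] + Array.new(b.length, 0) }
  (0..b.length).each { |j| d[0][j] = j }
  (1..a.length).each do |i|
    (1..b.length).each do |j|
      cost = a[i - 1] == b[j - 1] ? 0 : 1
      d[i][j] = [d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost].min
    end
  end
  d[a.length][b.length]
end

# Normalise the distance by the combined string length so one threshold
# works for both short and long citations. 0.1 is an illustrative value.
def same_reference?(s1, s2, threshold: 0.1)
  normalised = levenshtein(s1, s2).to_f / (s1.length + s2.length)
  normalised < threshold
end

same_reference?("Smith 2001 On aphids. Zootaxa 12:1-10",
                "Smith, 2001. On aphids. Zootaxa 12: 1-10")
```

Minor punctuation and spacing differences produce a small normalised distance, so the pair above falls under the threshold and is treated as a match.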
|
499
|
+
this step we read the JSONL file produced above and record whether the two
|
500
|
+
references are a match or not. Assuming each reference has a unique identifier
|
501
|
+
(it needs only be unique within the file) then we can use those identifiers to
|
502
|
+
record the clusters each reference belongs to. I do this using a <a href=\"https://en.wikipedia.org/wiki/Disjoint-set_data_structure\">Disjoint-set
|
503
|
+
data structure</a>. For each reference start with a graph where each node
|
504
|
+
represents a reference, and each node has a pointer to a parent node. Initially
|
505
|
+
the reference is its own parent. A simple implementation is to have an array
|
506
|
+
indexed by reference identifiers, where the value of each cell in the array
|
507
|
+
is the node''s parent.</p>\n\n<p>As we discover pairs we update the parents
|
508
|
+
of the nodes to reflect this, such that once all the comparisons are done
|
509
|
+
we have one or more sets of clusters corresponding to the references that
|
510
|
+
we think are the same. Another way to think of this is that we are getting
|
511
|
+
the components of a graph where each node is a reference and each pair of references
|
512
|
+
that match are connected by an edge.</p>\n\n<p>In the code I''m using I write
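The disjoint-set bookkeeping described above can be sketched as a small union-find class (a minimal illustration; class and method names are assumptions):

```ruby
# Union-find (disjoint-set) over reference identifiers: every id starts
# as its own parent, and each matching pair merges two sets.
class DisjointSet
  def initialize
    # each node is initially its own parent
    @parent = Hash.new { |h, k| h[k] = k }
  end

  def find(x)
    # follow parents to the root, compressing the path as we go
    @parent[x] = find(@parent[x]) unless @parent[x] == x
    @parent[x]
  end

  def union(a, b)
    @parent[find(a)] = find(b)
  end

  # group ids into clusters by their root parent
  def clusters(ids)
    ids.group_by { |id| find(id) }.values
  end
end

ds = DisjointSet.new
matches = [%w[r1 r2], %w[r2 r3], %w[r4 r5]] # pairs judged to be the same
matches.each { |a, b| ds.union(a, b) }
ds.clusters(%w[r1 r2 r3 r4 r5]) # two clusters: r1-r2-r3 and r4-r5
```

Each resulting cluster is a connected component of the match graph, i.e. one set of records believed to describe the same reference.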
|
513
|
+
this graph in <a href=\"https://en.wikipedia.org/wiki/Trivial_Graph_Format\">Trivial
|
514
|
+
Graph Format</a> (TGF) which can be visualised using a tools such as <a href=\"https://www.yworks.com/products/yed\">yEd</a>.</p>\n\n<h2>Merging</h2>\n\n<p>Now
|
515
|
+
that we have a graph representing the sets of references that we think are
|
516
|
+
the same we need to merge them. This is where things get interesting as the
|
517
|
+
references are similar (by definition) but may differ in some details. The
|
518
|
+
paper below describes a simple Bayesian approach for merging records:</p>\n\n<blockquote>\nCouncill
|
519
|
+
IG, Li H, Zhuang Z, Debnath S, Bolelli L, Lee WC, Sivasubramaniam A, Giles
|
520
|
+
CL. 2006. Learning Metadata from the Evidence in an On-line Citation Matching
|
521
|
+
Scheme. In: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital
|
522
|
+
Libraries. JCDL ’06. New York, NY, USA: ACM, 276–285. DOI: <a href=\"https://doi.org/10.1145/1141753.1141817\">10.1145/1141753.1141817</a>.\n</blockquote>\n\n<p>So
|
523
|
+
the next step is to read the graph with the clusters, generate the sets of
|
524
|
+
bibliographic references that correspond to each cluster, then use the method
|
525
|
+
described in Councill et al. to produce a single bibliographic record for
|
526
|
+
that cluster. These records could then be used to, say, locate the corresponding
|
527
|
+
article in BHL, or populate Wikidata with missing references.</p>\n\n<p>Obviously
|
528
|
+
there is always the potential for errors, such as trying to merge references
|
529
|
+
that are not the same. As a quick and dirty check I flag as dubious any cluster
|
530
|
+
where the page numbers vary among members of the cluster. More sophisticated
|
531
|
+
checks are possible, especially if I go down the ML route (i.e., I would have
|
532
|
+
evidence for the probability that the same reference can disagree on some
|
533
|
+
aspects of metadata).</p>\n\n<h2>Summary</h2>\n\n<p>At this stage the code
|
534
|
+
is working well enough for me to play with and explore some example datasets.
|
535
|
+
The focus is on structured bibliographic metadata, but I may simplify things
|
536
|
+
and have a version that handles simple string matching, for example to cluster
|
537
|
+
together different abbreviations of the same journal name.</p>","tags":["data
|
538
|
+
cleaning","deduplication","Phytotaxa","Wikispecies","Zootaxa"],"language":"en","references":[]},{"id":"https://doi.org/10.59350/c79vq-7rr11","uuid":"3cb94422-5506-4e24-a41c-a250bb521ee0","url":"https://iphylo.blogspot.com/2021/12/graphql-for-wikidata-wikicite.html","title":"GraphQL
|
539
|
+
for WikiData (WikiCite)","summary":"I''ve released a very crude GraphQL endpoint
|
540
|
+
for WikiData. More precisely, the endpoint is for a subset of the entities
|
541
|
+
that are of interest to WikiCite, such as scholarly articles, people, and
|
542
|
+
journals. There is a crude demo at https://wikicite-graphql.herokuapp.com.
|
543
|
+
The endpoint itself is at https://wikicite-graphql.herokuapp.com/gql.php.
|
544
|
+
There are various ways to interact with the endpoint, personally I like the
|
545
|
+
Altair GraphQL Client by Samuel Imolorhe. As I''ve mentioned earlier it''s
|
546
|
+
taken...","date_published":"2021-12-20T13:16:00Z","date_modified":"2021-12-20T13:20:05Z","date_indexed":"1909-06-16T10:52:00+00:00","authors":[{"url":null,"name":"Roderic
|
547
|
+
Page"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
|
548
|
+
both;\"><a href=\"https://blogger.googleusercontent.com/img/a/AVvXsEh7WW8TlfN2u3xe_E-sH5IK6AWYnAoWaKrP2b32UawUeqMpPlq6ZFk5BtJVZsMmCNh5j3QRsTj5H0Ee55RRGntnc9yjj_mNB8KHmH2dzocCgyLS2VFOxsBji6u4Ey6qxlDAnT-zrsBpnDcTchbhgt1x0Sf7RkmIMkS1y4-_3KCQian-SeIF-g=s1000\"
|
1657
549
|
style=\"display: block; padding: 1em 0; text-align: center; clear: right;
|
1692
|
-
border=\"0\" width=\"400\" data-original-height=\"94\" data-original-width=\"693\"
|
1693
|
-
src=\"https://1.bp.blogspot.com/-NIGuo9BggdA/YXaRQQrv0QI/AAAAAAAAgz4/SZDcA1jZSN47JMRTWDwSMRpHUShrCeOdgCPcBGAYYCw/s400/Screenshot%2B2021-10-24%2Bat%2B20.59.56.png\"/></a></div>\n\n<p>This
|
1694
|
-
is an example where Plazi has successfully parsed the specimen. But I keep
|
1695
|
-
coming across cases where specimens have not been parsed correctly, resulting
|
1696
|
-
in issues such as single specimens being split into multiple records (e.g., <a
|
1697
|
-
href=\"https://plazi-tester.herokuapp.com/?uri=5244B05EFFC8E20F7BC32056C178F496\">https://plazi-tester.herokuapp.com/?uri=5244B05EFFC8E20F7BC32056C178F496</a>),
|
1698
|
-
geographical coordinates being misinterpreted (e.g., <a href=\"https://plazi-tester.herokuapp.com/?uri=0D228E6AFFC2FFEFFF4DE8118C4EE6B9\">https://plazi-tester.herokuapp.com/?uri=0D228E6AFFC2FFEFFF4DE8118C4EE6B9</a>),
|
1699
|
-
or collector''s initials being confused with codes for natural history collections
|
1700
|
-
(e.g., <a href=\"https://plazi-tester.herokuapp.com/?uri=252C87918B362C05FF20F8C5BFCB3D4E\">https://plazi-tester.herokuapp.com/?uri=252C87918B362C05FF20F8C5BFCB3D4E</a>).</p>\n\n<p>Parsing
|
1701
|
-
specimens is a hard problem so it''s not unexpected to find errors. But they
|
1702
|
-
do seem common enough to be easily found, which raises the question of just
|
1703
|
-
what percentage of these material citations are correct? How much of the
|
1704
|
-
data Plazi feeds to GBIF is correct? How would we know?</p>\n\n<h2>Systemic
|
1705
|
-
problems</h2>\n\n<p>Some of the errors I''ve found concern the interpretation
|
1706
|
-
of the parsed data. For example, it is striking that despite including marine
|
1707
|
-
taxa <b>no</b> Plazi record has a value for depth below sea level (see <a
|
1708
|
-
href=\"https://www.gbif.org/occurrence/map?depth=0,9999&publishing_org=7ce8aef0-9e92-11dc-8738-b8a03c50a862&advanced=1\">GBIF
|
1709
|
-
search on depth range 0-9999 for Plazi</a>). But <a href=\"https://www.gbif.org/occurrence/map?elevation=0,9999&publishing_org=7ce8aef0-9e92-11dc-8738-b8a03c50a862&advanced=1\">many
|
1710
|
-
records do have an elevation</a>, including records from marine environments.
|
1711
|
-
Any record that has a depth value is interpreted by Plazi as being elevation,
|
1712
|
-
so we have aerial crustacea and fish.</p>\n\n<h3>Map of Plazi records with
|
1713
|
-
depth 0-9999m</h3>\n<div class=\"separator\" style=\"clear: both;\"><a href=\"https://1.bp.blogspot.com/-GD4pPtPCxVc/YXaRQ9bdn1I/AAAAAAAAgz8/A9YsypSvHfwWKAjDxSdeFVUkou88LGItACPcBGAYYCw/s673/Screenshot%2B2021-10-25%2Bat%2B12.03.51.png\"
|
1714
|
-
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
|
1715
|
-
border=\"0\" width=\"400\" data-original-height=\"258\" data-original-width=\"673\"
|
1716
|
-
src=\"https://1.bp.blogspot.com/-GD4pPtPCxVc/YXaRQ9bdn1I/AAAAAAAAgz8/A9YsypSvHfwWKAjDxSdeFVUkou88LGItACPcBGAYYCw/s320/Screenshot%2B2021-10-25%2Bat%2B12.03.51.png\"/></a></div>\n\n<h3>Map
|
1717
|
-
of Plazi records with elevation 0-9999m </h3>\n<div class=\"separator\" style=\"clear:
|
1718
|
-
both;\"><a href=\"https://1.bp.blogspot.com/-BReSHtXTOkA/YXaRRKFW7dI/AAAAAAAAg0A/-FcBkMwyswIp0siWGVX3MNMANs7UkZFtwCPcBGAYYCw/s675/Screenshot%2B2021-10-25%2Bat%2B12.04.06.png\"
|
1719
|
-
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
|
1720
|
-
border=\"0\" width=\"400\" data-original-height=\"256\" data-original-width=\"675\"
|
1721
|
-
src=\"https://1.bp.blogspot.com/-BReSHtXTOkA/YXaRRKFW7dI/AAAAAAAAg0A/-FcBkMwyswIp0siWGVX3MNMANs7UkZFtwCPcBGAYYCw/s320/Screenshot%2B2021-10-25%2Bat%2B12.04.06.png\"/></a></div>\n\n<p>Anecdotally
|
1722
|
-
I''ve also noticed that Plazi seems to do well on zoological data, especially
|
1723
|
-
journals like <i>Zootaxa</i>, but it often struggles with botanical specimens.
|
1724
|
-
Botanists tend to cite specimens rather differently to zoologists (botanists
|
1725
|
-
emphasise collector numbers rather than specimen codes). Hence data quality
|
1726
|
-
in Plazi is likely to be taxonomically biased.</p>\n\n<p>Plazi is <a href=\"https://github.com/plazi/community/issues\">using
|
1727
|
-
GitHub to track issues with treatments</a> so feedback on erroneous records
|
1728
|
-
is possible, but this seems inadequate to the task. There are tens of thousands
|
1729
|
-
of data sets, with more being released daily, and hundreds of thousands of
|
1730
|
-
occurrences, and relying on GitHub issues devolves the responsibility for
|
1731
|
-
error checking onto the data users. I don''t have a measure of how many records
|
1732
|
-
in Plazi have problems, but I suspect it is a significant fraction
|
1733
|
-
because for any given day''s output I can typically find errors.</p>\n\n<h2>What
|
1734
|
-
to do?</h2>\n\n<p>Faced with a process that generates noisy data there are
|
1735
|
-
several of things we could do:</p>\n\n<ol>\n<li>Have tools to detect and flag
|
1736
|
-
errors made in generating the data.</li>\n<li>Have the data generator give
|
1737
|
-
estimates the confidence of its results.</li>\n<li>Improve the data generator.</li>\n</ol>\n\n<p>I
|
1738
|
-
think a comparison with the problem of parsing bibliographic references might
|
1739
|
-
be instructive here. There is a long history of people developing tools to
|
1740
|
-
parse references (<a href=\"https://iphylo.blogspot.com/2021/05/finding-citations-of-specimens.html\">I''ve
|
1741
|
-
even had a go</a>). State-of-the art tools such as <a href=\"https://anystyle.io\">AnyStyle</a>
|
1742
|
-
feature machine learning, and are tested against <a href=\"https://github.com/inukshuk/anystyle/blob/master/res/parser/core.xml\">human
|
1743
|
-
curated datasets</a> of tagged bibliographic records. This means we can evaluate
|
1744
|
-
the performance of a method (how well does it retrieve the same results as
|
1745
|
-
human experts?) and also improve the method by expanding the corpus of training
|
1746
|
-
data. Some of these tools can provide a measures of how confident they are
|
1747
|
-
when classifying a string as, say, a person''s name, which means we could
|
1748
|
-
flag potential issues for anyone wanting to use that record.</p>\n\n<p>We
|
1749
|
-
don''t have equivalent tools for parsing specimens in the literature, and
|
1750
|
-
hence have no easy way to quantify how good existing methods are, nor do we
|
1751
|
-
have a public corpus of material citations that we can use as training data.
|
1752
|
-
I <a href=\"https://iphylo.blogspot.com/2021/05/finding-citations-of-specimens.html\">blogged
|
1753
|
-
about this</a> a few months ago and was considering using Plazi as a source
|
1754
|
-
of marked up specimen data to use for training. However based on what I''ve
|
1755
|
-
looked at so far Plazi''s data would need to be carefully scrutinised before
|
1756
|
-
it could be used as training data.</p>\n\n<p>Going forward, I think it would
|
1757
|
-
be desirable to have a set of records that can be used to benchmark specimen
|
1758
|
-
parsers, and ideally have the parsers themselves available as web services
|
1759
|
-
so that anyone can evaluate them. Even better would be a way to contribute
|
1760
|
-
to the training data so that these tools improve over time.</p>\n\n<p>Plazi''s
|
1761
|
-
data extraction tools are mostly desktop-based, that is, you need to download
|
1762
|
-
software to use their methods. However, there are experimental web services
|
1763
|
-
available as well. I''ve created a simple wrapper around the material citation
|
1764
|
-
parser, you can try it at <a href=\"https://plazi-tester.herokuapp.com/parser.php\">https://plazi-tester.herokuapp.com/parser.php</a>.
|
1765
|
-
It takes a single material citation and returns a version with elements such
|
1766
|
-
as specimen code and collector name tagged in different colours.</p>\n\n<h2>Summary</h2>\n\n<p>Text
|
1767
|
-
mining the taxonomic literature is clearly a gold mine of data, but at the
|
1768
|
-
same time it is potentially fraught as we try and extract structured data
|
1769
|
-
from semi-structured text. Plazi has demonstrated that it is possible to extract
|
1770
|
-
a lot of data from the literature, but at the same time the quality of that
|
1771
|
-
data seems highly variable. Even minor issues in parsing text can have big
|
1772
|
-
implications for data quality (e.g., marine organisms apparently living above
|
1773
|
-
sea level). Historically in biodiversity informatics we have favoured data
|
1774
|
-
quantity over data quality. Quantity has an obvious metric, and has milestones
|
1775
|
-
we can celebrate (e.g., <a href=\"GBIF at 1 billion - what''s next?\">one
|
1776
|
-
billion specimens</a>). There aren''t really any equivalent metrics for data
|
1777
|
-
quality.</p>\n\n<p>Adding new types of data can sometimes initially result
|
1778
|
-
in a new set of quality issues (e.g., <a href=\"https://iphylo.blogspot.com/2019/12/gbif-metagenomics-and-metacrap.html\">GBIF
|
1779
|
-
metagenomics and metacrap</a>) that take time to resolve. In the case of Plazi,
|
1780
|
-
I think it would be worthwhile to quantify just how many records have errors,
|
1781
|
-
and develop benchmarks that we can use to test methods for extracting specimen
|
1782
|
-
data from text. If we don''t do this then there will remain uncertainty as
|
1783
|
-
to how much trust we can place in data mined from the taxonomic literature.</p>\n\n<h2>Update</h2>\n\nPlazi
|
1784
|
-
has responded, see <a href=\"http://plazi.org/posts/2021/10/liberation-first-step-toward-quality/\">Liberating
|
1785
|
-
material citations as a first step to more better data</a>. My reading of
|
1786
|
-
their repsonse is that it essentially just reiterates Plazi''s approach and
|
1787
|
-
doesn''t tackle the underlying issue: their method for extracting material
|
1788
|
-
citations is error prone, and many of those errors end up in GBIF.","tags":["data
|
1789
|
-
quality","parsing","Plazi","specimen","text mining"],"language":"en","references":null}]}'
|
1790
|
-
recorded_at: Thu, 15 Jun 2023 20:39:01 GMT
+ float: right;\"><img alt=\"\" border=\"0\" width=\"128\" data-original-height=\"1000\"
+ data-original-width=\"1000\" src=\"https://blogger.googleusercontent.com/img/a/AVvXsEh7WW8TlfN2u3xe_E-sH5IK6AWYnAoWaKrP2b32UawUeqMpPlq6ZFk5BtJVZsMmCNh5j3QRsTj5H0Ee55RRGntnc9yjj_mNB8KHmH2dzocCgyLS2VFOxsBji6u4Ey6qxlDAnT-zrsBpnDcTchbhgt1x0Sf7RkmIMkS1y4-_3KCQian-SeIF-g=s200\"/></a></div><p>I''ve
+ released a very crude GraphQL endpoint for WikiData. More precisely, the endpoint
+ is for a subset of the entities that are of interest to WikiCite, such as
+ scholarly articles, people, and journals. There is a crude demo at <a href=\"https://wikicite-graphql.herokuapp.com\">https://wikicite-graphql.herokuapp.com</a>.
+ The endpoint itself is at <a href=\"https://wikicite-graphql.herokuapp.com/gql.php\">https://wikicite-graphql.herokuapp.com/gql.php</a>.
+ There are various ways to interact with the endpoint, personally I like the
+ <a href=\"https://altair.sirmuel.design\">Altair GraphQL Client</a> by <a
+ href=\"https://github.com/imolorhe\">Samuel Imolorhe</a>.</p>\n\n<p>As I''ve
+ <a href=\"https://iphylo.blogspot.com/2021/04/it-been-while.html\">mentioned
+ earlier</a> it''s taken me a while to see the point of GraphQL. But it is
+ clear it is gaining traction in the biodiversity world (see for example the
+ <a href=\"https://dev.gbif.org/hosted-portals.html\">GBIF Hosted Portals</a>)
+ so it''s worth exploring. My take on GraphQL is that it is a way to create
+ a self-describing API that someone developing a web site can use without them
+ having to bury themselves in the gory details of how data is internally modelled.
+ For example, WikiData''s query interface uses SPARQL, a powerful language
+ that has a steep learning curve (in part because of the administrative overhead
+ brought by RDF namespaces, etc.). In my previous SPARQL-based projects such
+ as <a href=\"https://ozymandias-demo.herokuapp.com\">Ozymandias</a> and <a
+ href=\"http://alec-demo.herokuapp.com\">ALEC</a> I have either returned SPARQL
+ results directly (Ozymandias) or formatted SPARQL results as <a href=\"https://schema.org/DataFeed\">schema.org
+ DataFeeds</a> (equivalent to RSS feeds) (ALEC). Both approaches work, but
+ they are project-specific and if anyone else tried to build based on these
+ projects they might struggle for figure out what was going on. I certainly
+ struggle, and I wrote them!</p>\n\n<p>So it seems worthwhile to explore this
+ approach a little further and see if I can develop a GraphQL interface that
+ can be used to build the sort of rich apps that I want to see. The demo I''ve
+ created uses SPARQL under the hood to provide responses to the GraphQL queries.
+ So in this sense it''s not replacing SPARQL, it''s simply providing a (hopefully)
+ simpler overlay on top of SPARQL so that we can retrieve the data we want
+ without having to learn the intricacies of SPARQL, nor how Wikidata models
+ publications and people.</p>","tags":["GraphQL","SPARQL","WikiCite","Wikidata"],"language":"en","references":[]}]}'
+ recorded_at: Sun, 18 Jun 2023 15:23:48 GMT
  recorded_with: VCR 6.1.0