Professor, Yale University, New Haven, Connecticut, United States
Abstract: Species occurrence records in biodiversity databases are highly heterogeneous, accumulated by a wide range of collectors and observers representing numerous organizations over long periods of time. This heterogeneity intersects with a taxonomic revision process that regularly lumps, splits, or otherwise reorganizes species. As a result, occurrence records are often ambiguous in the taxonomic concept they represent, because only a Latin binomial name is available to identify them. Occurrences predating and postdating a lump or split may be assigned to the same binomial, resulting in taxonomic mismatches when such occurrences are compared to reference data sources (such as an expert range map) that follow a single taxonomic concept. Taxonomic uncertainty in species occurrence data reduces their utility. Flexible methods are therefore needed for assessing taxonomic uncertainty, characterizing its occurrence-level and species-level predictors, and detecting taxonomic mismatch, in order to improve data quality for workflows that rely on taxonomically broad-scale species occurrence data.
We share a generalizable method for detecting taxonomic mismatches between species occurrence data and species range maps, intended for use at the scale of major species groups (tested on the set of all mammal species). We developed numerous metrics informative of the probability of taxonomic mismatch, based either on the spatial arrangement of species occurrence records inside and outside the species range or on metadata associated with those records. We manually verified the taxonomic mismatch status of species occurrences from 200 mammal species and used these manually classified species to train a logistic regression model that predicts a species’ taxonomic mismatch status from key informative metrics, achieving upwards of 75% prediction accuracy. We offer some perspective on the most important drivers of taxonomic mismatch potential for mammals that emerge from our model, and illustrate how the probability of taxonomic mismatch of a species’ occurrence records may vary across space and across the major regions of the world.
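As a rough illustration of the workflow described above, the following Python sketch computes one plausible spatial metric (the fraction of a species' occurrence records that fall outside its range polygon) and fits a one-feature logistic regression on manually labeled species. It is not the authors' implementation. The function names, the planar coordinates, the square range polygon, and the toy training values are all assumptions made for illustration. A real analysis would presumably use geographic coordinates, a GIS library, and the full set of spatial and metadata metrics.

```python
import math

def point_in_polygon(point, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon (list of vertices)."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point matter.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def fraction_outside_range(occurrences, range_polygon):
    """Hypothetical metric: share of occurrence points outside the range polygon."""
    if not occurrences:
        return 0.0
    outside = sum(1 for p in occurrences if not point_in_polygon(p, range_polygon))
    return outside / len(occurrences)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=5000):
    """Fit logistic regression by batch gradient descent; returns [bias, w1, ...]."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = p - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def predict_proba(w, x):
    """Predicted probability of taxonomic mismatch for one feature vector."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

# Toy example in planar coordinates (assumed, not real data).
range_polygon = [(0, 0), (10, 0), (10, 10), (0, 10)]
occurrences = [(1, 1), (5, 5), (12, 3), (-2, 4)]
frac = fraction_outside_range(occurrences, range_polygon)

# Invented training set: one feature per species (fraction outside range),
# label 1 = manually verified mismatch, 0 = no mismatch.
X_train = [[0.05], [0.10], [0.15], [0.60], [0.70], [0.80]]
y_train = [0, 0, 0, 1, 1, 1]
weights = fit_logistic(X_train, y_train)

print(f"fraction outside range: {frac:.2f}")
print(f"mismatch probability: {predict_proba(weights, [frac]):.2f}")
```

The hand-rolled gradient descent keeps the sketch dependency-free. In practice, an established statistics package would be the more sensible choice for fitting and evaluating the model.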