The code here was written by Erica Krimmel. Please see Use Case: Identify specimen records with suspicious coordinate data for context. The code is modified from the original given in a presentation at the 2019 ADBC Summit in Gainesville, FL.
# Load core libraries; install these packages if you have not already
library(ridigbio)
library(tidyverse)
# Load library for making nice HTML output
library(kableExtra)
# Load libraries for visualizing geographic data
library(leaflet)
In this use case for the iDigBio API we explore a situation where geographic coordinate data from the provider was modified by iDigBio during its data quality assurance process. See here for more information about iDigBio’s data quality flags.
First, let’s find all the specimen records for the data quality flag we are interested in. Do this using the idig_search_records function from the ridigbio package. You can learn more about this function from the iDigBio API documentation and ridigbio documentation. In this example, we want to start by searching for specimens flagged with “rev_geocode_corrected.”
# Edit the fields (e.g. `flags`) and values (e.g. "rev_geocode_corrected") in
# `list()` to adjust your query and the fields (e.g. `uuid`) in `fields` to
# adjust the columns returned in your results
df_flagCoord <- idig_search_records(rq = list(flags = "rev_geocode_corrected",
                                              institutioncode = "lacm"),
                                    fields = c("uuid",
                                               "institutioncode",
                                               "collectioncode",
                                               "country",
                                               "data.dwc:country",
                                               "stateprovince",
                                               "county",
                                               "locality",
                                               "geopoint",
                                               "data.dwc:decimalLongitude",
                                               "data.dwc:decimalLatitude",
                                               "flags"),
                                    limit = 100000) %>%
  # Rename fields to more easily reflect their provenance (either from the
  # data provider directly or modified by the data aggregator)
  rename(provider_lon = `data.dwc:decimalLongitude`,
         provider_lat = `data.dwc:decimalLatitude`,
         provider_country = `data.dwc:country`,
         aggregator_lon = `geopoint.lon`,
         aggregator_lat = `geopoint.lat`,
         aggregator_country = country,
         aggregator_stateprovince = stateprovince,
         aggregator_county = county,
         aggregator_locality = locality) %>%
  # Reorder columns for easier viewing
  select(uuid, institutioncode, collectioncode, provider_lat, aggregator_lat,
         provider_lon, aggregator_lon, provider_country, aggregator_country,
         aggregator_stateprovince, aggregator_county, aggregator_locality,
         flags)
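The query above hard-codes the flag and the institution. As a minimal sketch, the same request could be wrapped in small helpers so that other flags or institutions can be queried and the result checked against the `limit`. The names `make_rq`, `search_flagged`, and `is_truncated` are hypothetical and not part of ridigbio; only `idig_search_records` and `idig_count_records` come from the package, and the function that calls the API is defined but not run here.

```r
# Hypothetical helper: build the `rq` list passed to ridigbio query functions
make_rq <- function(flag, institution) {
  list(flags = flag, institutioncode = institution)
}

# Hypothetical wrapper around the search used above (requires network access
# and the ridigbio package when it is actually called)
search_flagged <- function(flag, institution, fields, limit = 100000) {
  ridigbio::idig_search_records(rq = make_rq(flag, institution),
                                fields = fields,
                                limit = limit)
}

# Hypothetical check: TRUE if fewer rows came back than the API reports exist
is_truncated <- function(n_returned, n_total) {
  n_returned < n_total
}

rq_lacm <- make_rq("rev_geocode_corrected", "lacm")
str(rq_lacm)
```

With network access, `is_truncated(nrow(df_flagCoord), idig_count_records(rq = rq_lacm))` would indicate whether the `limit` cut the result short.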
Here is what our query result data looks like:
uuid | institutioncode | collectioncode | provider_lat | aggregator_lat | provider_lon | aggregator_lon | provider_country | aggregator_country | aggregator_stateprovince | aggregator_county | aggregator_locality |
---|---|---|---|---|---|---|---|---|---|---|---|
004f15e3-92e8-4c8d-8a9b-3074a37a27dc | lacm | fish | -39.2333333 | -75.71667 | -75.7166667 | 39.23333 | Antarctica | antarctica | NA | NA | NA |
004fa3d0-7d99-4af4-98b8-dd6c64e68906 | lacm | fish | -62.5 | -62.50000 | -108.5833333 | 108.58333 | Antarctica | antarctica | NA | NA | NA |
00a0ea39-6602-4096-ab4e-a8e5dce43760 | lacm | fish | -59.2333333 | -69.21667 | -69.2166667 | 59.23333 | Antarctica | antarctica | NA | NA | NA |
00ae30fc-e9fb-4786-9f3c-ce6270371d7a | lacm | fish | -36.0166667 | -83.00000 | -83 | -36.01667 | Antarctica | antarctica | NA | NA | NA |
00b60cb3-2390-4125-b679-711e6cc06298 | lacm | fish | -55.3666667 | -78.13333 | -78.1333333 | 55.36667 | Antarctica | antarctica | NA | NA | NA |
00b92804-7166-48f2-81c5-583741d90400 | lacm | fish | -56.1 | -79.06667 | -79.0666667 | 56.10000 | Antarctica | antarctica | NA | NA | NA |
00bd0dab-5c51-475e-b93c-f0c52f703b5f | lacm | fish | -57.0666667 | -59.55000 | -59.55 | -57.06667 | Antarctica | antarctica | NA | NA | NA |
00dd683d-e6ac-4031-a388-f0b8de0506e0 | lacm | fish | -68.85 | -68.85000 | -114.1333333 | 114.13333 | Antarctica | antarctica | NA | NA | NA |
00e7660e-6f68-4966-a2d8-ad076261adc7 | lacm | fish | -39.2333333 | -75.71667 | -75.7166667 | 39.23333 | Antarctica | antarctica | NA | NA | NA |
00ece399-7933-4ffa-b6f4-319d41ea25dc | lacm | fish | -56.0666667 | -82.83333 | -82.8333333 | -56.06667 | Antarctica | antarctica | NA | NA | NA |
00fa11b3-9ba1-45ad-8cc9-b2b0586f7492 | lacm | fish | -51.5 | -77.58333 | -77.5833333 | 51.50000 | Antarctica | antarctica | NA | NA | NA |
0105c782-ad92-49fa-b88e-254c14767072 | lacm | fish | -62.2166667 | -62.21667 | -95.65 | 95.65000 | Antarctica | antarctica | NA | NA | NA |
010b1128-4d78-4545-b73d-9815ce39e215 | lacm | fish | -59.6333333 | -82.45000 | -82.45 | -59.63333 | Antarctica | antarctica | NA | NA | NA |
0120df13-f6a4-4854-8d9f-a59998ecc537 | lacm | fish | -60.1166667 | -70.13333 | -70.1333333 | 60.11667 | Antarctica | antarctica | NA | NA | NA |
01381f5e-6b15-44a8-a2f3-5761100b630c | lacm | herps | 26.753 | 26.75300 | 80.949 | -80.94900 | United States | united states | florida | hendry | best western hotel, clewiston |
0138fc7e-1a82-4526-8881-6bc74ba6749f | lacm | fish | -63.45 | -86.83333 | -86.8333333 | -63.45000 | Antarctica | antarctica | NA | NA | NA |
0143216a-d45d-4f44-8957-c8d9985e92e6 | lacm | fish | -56.15 | -60.90000 | -60.9 | -56.15000 | Antarctica | antarctica | NA | NA | NA |
01626bec-6af3-4b46-a184-971d002e5822 | lacm | fish | -57.2333333 | -62.76667 | -62.7666667 | -57.23333 | Antarctica | antarctica | NA | NA | NA |
01b3127e-5815-43c9-a5fc-d064bf21eb89 | lacm | fish | -59.6166667 | -88.90000 | -88.9 | -59.61667 | Antarctica | antarctica | NA | NA | NA |
01ce0ad8-a0b4-4dfd-93f6-3eb597070f90 | lacm | fish | -53.3 | -75.53333 | -75.5333333 | 53.30000 | Antarctica | antarctica | NA | NA | NA |
01e47af0-1c6d-49b3-b6df-86e7aeb5e8d6 | lacm | fish | -40.0166667 | -82.81667 | -82.8166667 | -40.01667 | Antarctica | antarctica | NA | NA | NA |
01fc820f-2b74-4e28-8470-db1bfcd37c2e | lacm | fish | -57.3333333 | -74.70000 | -74.7 | 57.33333 | Antarctica | antarctica | NA | NA | NA |
02249913-685c-422f-b73a-ac1d439ad607 | lacm | fish | -44.6833333 | -76.06667 | -76.0666667 | 44.68333 | Antarctica | antarctica | NA | NA | NA |
023cac7a-2468-42ab-a586-bcb0f856d31f | lacm | fish | -65.0833333 | -65.08333 | -41.3 | 41.30000 | Antarctica | antarctica | NA | NA | NA |
0283d543-06c5-4674-a42c-ca375db31892 | lacm | fish | -42 | -86.00000 | -86 | -42.00000 | Antarctica | antarctica | NA | NA | NA |
02984cf6-e6e8-408b-889d-e073ab15accf | lacm | fish | -59.9333333 | -69.00000 | -69 | 59.93333 | Antarctica | antarctica | NA | NA | NA |
029ac4ca-8f8c-4f1e-a33a-da7e517ef402 | lacm | fish | -39.3916667 | -73.55000 | -73.55 | 39.39167 | Antarctica | antarctica | NA | NA | NA |
02a0aa6f-fcdd-48c5-a45c-0a255b141d82 | lacm | fish | -47.4333333 | -76.66667 | -76.6666667 | 47.43333 | Antarctica | antarctica | NA | NA | NA |
02a5f38f-cd94-4b9c-af7d-260922884712 | lacm | herps | -26.8 | -20.88333 | 20.8833333 | 26.80000 | Botswana | botswana | NA | NA | 7 mi n, 12 mi e jct., molopo-nossob rivers |
02b1a6b5-0b26-4f26-b2a2-3f0e4d7e3cf6 | lacm | fish | -45.0166667 | -76.55000 | -76.55 | 45.01667 | Antarctica | antarctica | NA | NA | NA |
02b64cdc-54f3-442b-90d2-a2cf98bdc13a | lacm | fish | -53.55 | -70.75000 | -70.75 | 53.55000 | Antarctica | antarctica | NA | NA | NA |
02b80fb9-09aa-4d95-8e7d-6894327b7dec | lacm | fish | -41.9833333 | -86.11667 | -86.1166667 | -41.98333 | Antarctica | antarctica | NA | NA | NA |
02c9ee2d-561e-44de-85af-5a46a67f8cad | lacm | fish | -57.8666667 | -74.71667 | -74.7166667 | 57.86667 | Antarctica | antarctica | NA | NA | NA |
0309ac59-7460-4d3f-85e3-a3331f2b7443 | lacm | fish | -67.9166667 | -67.91667 | -103.2166667 | 103.21667 | Antarctica | antarctica | NA | NA | NA |
03349313-c8f0-4d21-b27a-fc566f86a557 | lacm | fish | -44.6833333 | -76.06667 | -76.0666667 | 44.68333 | Antarctica | antarctica | NA | NA | NA |
035b5037-fd54-493d-bc8d-cfffb5e69e80 | lacm | fish | -38.15 | -74.51667 | -74.5166667 | 38.15000 | Antarctica | antarctica | NA | NA | NA |
03965834-16f2-45f1-bf63-8d5975e9c218 | lacm | herps | -26.8 | -20.88333 | 20.8833333 | 26.80000 | Botswana | botswana | NA | NA | 7 mi n, 12 mi e jct., molopo-nossob rivers |
03b8814f-208b-411a-a72f-6edbc745e781 | lacm | fish | -57.2333333 | -70.95000 | -70.95 | 57.23333 | Antarctica | antarctica | NA | NA | NA |
03bb1ab1-5226-4a07-a8df-d826c911be45 | lacm | fish | -62.7 | -78.56667 | -78.5666667 | 62.70000 | Antarctica | antarctica | NA | NA | NA |
03bbad49-568d-4625-836b-7ca587be33af | lacm | fish | -33.55 | -72.75000 | -72.75 | 33.55000 | Antarctica | antarctica | NA | NA | NA |
03bc3bbe-af95-4adb-b9fa-9ed28dd70a94 | lacm | fish | -54.895 | -65.45667 | -65.4566667 | -54.89500 | Antarctica | antarctica | NA | NA | NA |
03dd5da9-ceb3-4481-b722-3913479b8b54 | lacm | fish | -41.0833333 | -74.90000 | -74.9 | 41.08333 | Antarctica | antarctica | NA | NA | NA |
03e5220a-9afe-4a8a-8858-4b47505fb6d3 | lacm | fish | -36.0166667 | -83.00000 | -83 | -36.01667 | Antarctica | antarctica | NA | NA | NA |
03e6c3f1-6343-4330-93c2-00511202dad5 | lacm | birds | -65.18 | -65.18000 | -43.72 | 43.72000 | Antarctica | antarctica | NA | NA | NA |
03ec203f-38aa-41a8-8c16-fc8be1d56ffb | lacm | fish | -59.8666667 | -82.83333 | -82.8333333 | -59.86667 | Antarctica | antarctica | NA | NA | NA |
04186251-8b4f-4973-a397-5f0c45e837f3 | lacm | fish | -36.3166667 | -83.00000 | -83 | -36.31667 | Antarctica | antarctica | NA | NA | NA |
04221f29-d45d-487e-8b24-b55975aa093e | lacm | fish | -51.5 | -77.58333 | -77.5833333 | 51.50000 | Antarctica | antarctica | NA | NA | NA |
04379435-5305-49a8-a308-fa4d252a8d94 | lacm | fish | -59.85 | -78.88333 | -78.8833333 | 59.85000 | Antarctica | antarctica | NA | NA | NA |
04730a4e-792f-452d-8f5a-e724f2876562 | lacm | fish | -59.1166667 | -66.96667 | -66.9666667 | -59.11667 | Antarctica | antarctica | NA | NA | NA |
04920726-4880-4495-9163-b7ec5475fb4b | lacm | fish | -53.1833333 | -70.83333 | -70.8333333 | 53.18333 | Antarctica | antarctica | NA | NA | NA |
One example of a geographic coordinate data quality issue is a latitude or longitude with a reversed sign, e.g. the data provider gave the value latitude = “7.1789” but meant latitude = “-7.1789.” In the map below we can see a few examples of specimen records published to iDigBio where this is the case. These data have been adjusted by iDigBio, and this action is recorded with the data quality flag “rev_geocode_lat_sign,” which is the flag the code below filters on.
# Create function to allow subsetting the `df_flagCoord` dataset by other flags
# found on these same records
df_flagSubset <- function(subsetFlag) {
  df_flagCoord %>%
    filter(grepl(subsetFlag, flags)) %>%
    select(uuid, matches("_lat|_lon")) %>%
    unite(provider_coords, c("provider_lat", "provider_lon"), sep = ",") %>%
    unite(aggregator_coords, c("aggregator_lat", "aggregator_lon"), sep = ",") %>%
    gather(key = type, value = coordinates, -uuid) %>%
    separate(coordinates, c("lat","lon"), sep = ",") %>%
    mutate(lat = as.numeric(lat)) %>%
    mutate(lon = as.numeric(lon)) %>%
    arrange(uuid, type)
}
# Subset `df_flagCoord` by records flagged for having had their latitude negated
# to place point in stated country by reverse geocoding process
df_rev_geocode_lat_sign <- df_flagSubset("rev_geocode_lat_sign")
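Note that `grepl()` matches a pattern against the text of each record's flags, so a short pattern can also match longer flag names that contain it. The sketch below contrasts this with exact matching on a toy list column. `has_flag` is a hypothetical helper, and the flag combinations are illustrative assumptions rather than real records.

```r
# Toy list column standing in for `df_flagCoord$flags` (illustrative values)
toy_flags <- list(
  c("rev_geocode_corrected", "rev_geocode_lat_sign"),
  c("rev_geocode_corrected", "rev_geocode_flip_lat_sign"),
  c("rev_geocode_corrected", "rev_geocode_flip")
)

# Hypothetical helper: TRUE where a record carries exactly the named flag
has_flag <- function(flags, flag) {
  vapply(flags, function(f) flag %in% f, logical(1))
}

# Pattern matching also hits "rev_geocode_flip_lat_sign"
pattern_hits <- grepl("rev_geocode_flip", toy_flags)

# Exact matching hits only the record flagged "rev_geocode_flip"
exact_hits <- has_flag(toy_flags, "rev_geocode_flip")

print(data.frame(record = 1:3, pattern_hits, exact_hits))
```

Which behavior is preferable depends on whether you want a whole family of related flags or one specific flag.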
# Create map displaying a few examples of records with the
# rev_geocode_lat_sign flag
pal <- colorFactor(palette = c("#d7191c", "#fdae61", "#ffffbf", "#abdda4", "#2b83ba"),
                   domain = df_rev_geocode_lat_sign$uuid[1:10])
map <- df_rev_geocode_lat_sign[1:10,] %>%
  mutate(popup = str_c(type, " = ", lat, ", ", lon, sep = "")) %>%
  leaflet() %>%
  addTiles() %>%
  addCircleMarkers(
    lng = ~lon,
    lat = ~lat,
    radius = 10,
    weight = 1,
    color = ~pal(uuid),
    stroke = FALSE,
    # Opacity values in leaflet range from 0 to 1
    fillOpacity = 1,
    popup = ~popup) %>%
  addLegend("bottomright", pal = pal, values = ~uuid,
            title = "Specimen Records",
            opacity = 1)
We can visualize these data on a map to better understand what the data quality flag is telling us. For example, in the map below you can see the effect of accidentally reversing the sign of the latitude on a few example georeferenced specimen records.
iDigBio uses the value provided for “country” to identify issues and apply reverse geocoding flags such as “rev_geocode_lat_sign.” This is frequently helpful but not always correct. For example, here is a specimen record where the country has been recorded as “Antarctica” and georeferenced accordingly, then adjusted incorrectly by iDigBio (probably because the data provider’s coordinates are farther offshore than the “country” of Antarctica extends). It is important to recognize what kinds of data quality adjustments (good and bad) aggregators are making to your data because researchers may not know which set of coordinates to use.
# Create map displaying example of a record whose coordinates were possibly
# adjusted incorrectly by the reverse geocoding process
df_flagCoord %>%
  filter(uuid == "004fa3d0-7d99-4af4-98b8-dd6c64e68906") %>%
  select(uuid, matches("_lat|_lon")) %>%
  unite(provider_coords, c("provider_lat", "provider_lon"), sep = ",") %>%
  unite(aggregator_coords, c("aggregator_lat", "aggregator_lon"), sep = ",") %>%
  gather(key = type, value = coordinates, -uuid) %>%
  separate(coordinates, c("lat","lon"), sep = ",") %>%
  mutate(lat = as.numeric(lat)) %>%
  mutate(lon = as.numeric(lon)) %>%
  arrange(uuid, type) %>%
  mutate(popup = str_c(type, " = ", lat, ", ", lon, sep = "")) %>%
  leaflet() %>%
  addTiles() %>%
  addCircleMarkers(
    lng = ~lon,
    lat = ~lat,
    radius = 10,
    weight = 1,
    color = ~pal(uuid),
    stroke = FALSE,
    # Opacity values in leaflet range from 0 to 1
    fillOpacity = 1,
    popup = ~popup) %>%
  addLegend("bottomright", pal = pal, values = ~uuid,
            title = "Specimen Record",
            opacity = 1)
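The comparison between provider and aggregator coordinates shown on the maps can also be made in a table. The sketch below is written in base R so that it runs without extra packages, and it uses three rows copied from the query result shown earlier. `describe_change` is a hypothetical helper, its labels are plain descriptions rather than iDigBio flag names, and the tolerance is an assumption that absorbs the rounding visible in the aggregator values.

```r
# Three records copied from the query result table above
toy <- data.frame(
  uuid = c("004fa3d0-7d99-4af4-98b8-dd6c64e68906",
           "01381f5e-6b15-44a8-a2f3-5761100b630c",
           "00ae30fc-e9fb-4786-9f3c-ce6270371d7a"),
  provider_lat   = c(-62.5, 26.753, -36.0166667),
  provider_lon   = c(-108.5833333, 80.949, -83),
  aggregator_lat = c(-62.5, 26.753, -83),
  aggregator_lon = c(108.58333, -80.949, -36.01667),
  stringsAsFactors = FALSE
)

# Hypothetical helper: describe how aggregator coordinates differ from the
# provider coordinates; `tol` is an assumed tolerance for rounding
describe_change <- function(p_lat, p_lon, a_lat, a_lon, tol = 1e-4) {
  same <- function(x, y) abs(x - y) < tol
  ifelse(same(p_lat, a_lat) & same(p_lon, a_lon), "unchanged",
  ifelse(same(-p_lat, a_lat) & same(p_lon, a_lon), "latitude sign reversed",
  ifelse(same(p_lat, a_lat) & same(-p_lon, a_lon), "longitude sign reversed",
  ifelse(same(p_lat, a_lon) & same(p_lon, a_lat),
         "latitude and longitude swapped", "other"))))
}

toy$change <- with(toy, describe_change(provider_lat, provider_lon,
                                        aggregator_lat, aggregator_lon))
print(toy[, c("uuid", "change")])
```

A summary like this makes it easier to see at a glance which kind of adjustment was applied to each record before deciding which set of coordinates to trust.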
The iDigBio API provides a means for an institution to examine data quality issues across collections, which sometimes is not possible internally when data in different collections are managed in different databases.
# Summarize flagged records by collection type
spmByColl <- df_flagCoord %>%
  group_by(collectioncode) %>%
  tally()
# Generate graph to display counts of flagged records by collection within the
# institution
graph_spmByColl <- ggplot(spmByColl,
                          aes(x = reorder(collectioncode, -n),
                              y = n,
                              fill = collectioncode)) +
  geom_col() +
  theme(panel.background = element_blank(),
        legend.title = element_blank(),
        axis.title.x = element_text(face = "bold"),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.y = element_text(face = "bold"),
        plot.title = element_text(size = 12, face = "bold")) +
  labs(x = "collection",
       y = "# of specimen records",
       title = "LACM records flagged with geo-coordinate data quality issues by iDigBio") +
  geom_text(aes(label = n), vjust = -0.5)
# Get count of total records published by the institution using function
# `idig_count_records`
totalInstSpm <- idig_count_records(rq = list(institutioncode = "lacm"))
# Calculate flagged records as percent of total records
percentFlagged <- sum(spmByColl$n)/totalInstSpm*100
For example, we can ask how many specimen records from which collections at the Natural History Museum of Los Angeles County (LACM) have been flagged as “rev_geocode_corrected” by iDigBio. As an aside, although this graph highlights the number of specimen records with data quality issues, these represent only 0.36% of the total specimen records published by LACM.
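As a worked illustration of the percentage calculation, the sketch below repeats the arithmetic on made-up numbers. The counts and the total are hypothetical placeholders chosen for easy checking, not LACM's actual figures.

```r
# Hypothetical counts of flagged records per collection (not real data)
toy_counts <- c(fish = 900, herps = 80, birds = 20)

# Hypothetical total number of records published by the institution
toy_total <- 250000

# Flagged records as a percent of all published records
toy_percent <- sum(toy_counts) / toy_total * 100
print(round(toy_percent, 2))
```

Here 1,000 flagged records out of 250,000 published records gives 0.4%, which is the same calculation that `percentFlagged` performs on the real counts.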
We can also explore which other data quality flags have been applied to these specimen records.
# Collate `df_flagAssoc` to describe other data quality flags that are associated
# with rev_geocode_corrected in `df_flagCoord`
df_flagAssoc <- df_flagCoord %>%
  select(uuid, flags) %>%
  unnest(flags) %>%
  group_by(flags) %>%
  tally() %>%
  mutate("category" = case_when(str_detect(flags, "geo|country|state")
                                ~ "geography",
                                str_detect(flags, "dwc_datasetid_added|dwc_multimedia_added|datecollected_bounds")
                                ~ "other",
                                str_detect(flags, "gbif|dwc|tax")
                                ~ "taxonomy")) %>%
  mutate("percent" = n/(nrow(df_flagCoord))*100) %>%
  arrange(category, desc(n))
# Visualize associated data quality flags
ggplot(df_flagAssoc, aes(x = reorder(flags, -percent), y = percent, fill = category)) +
  geom_col() +
  theme(axis.title.x = element_text(face = "bold"),
        axis.text.x = element_text(angle = 75, hjust = 1),
        axis.ticks.y = element_blank(),
        axis.title.y = element_text(face = "bold"),
        plot.title = element_text(size = 12, face = "bold")
  ) +
  labs(x = "additional iDigBio data quality flag",
       y = "% specimen records",
       title = "LACM records flagged for geo-coordinate issues are also flagged for...",
       fill = "flag category")
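The categories assigned with `case_when()` above depend on the order of the conditions, because the first matching pattern wins. The sketch below reproduces that logic in base R on a handful of flag names to show why the “other” patterns are tested before the broader “gbif|dwc|tax” pattern. `categorize` is a hypothetical helper; the first four flag names appear in the code above, while the last two are illustrative assumptions.

```r
# First four names appear in the code above; the last two are assumptions
toy_flags <- c("rev_geocode_corrected", "dwc_datasetid_added",
               "dwc_multimedia_added", "datecollected_bounds",
               "gbif_taxon_corrected", "some_unmatched_flag")

# Hypothetical helper mirroring the case_when() order: first match wins,
# and a flag matching no pattern is left as NA
categorize <- function(flags) {
  ifelse(grepl("geo|country|state", flags), "geography",
  ifelse(grepl("dwc_datasetid_added|dwc_multimedia_added|datecollected_bounds",
               flags), "other",
  ifelse(grepl("gbif|dwc|tax", flags), "taxonomy", NA_character_)))
}

toy_categories <- categorize(toy_flags)
print(data.frame(flag = toy_flags, category = toy_categories))

# "dwc_datasetid_added" also matches the taxonomy pattern, so testing that
# pattern first would mislabel it
matches_taxonomy_pattern <- grepl("gbif|dwc|tax", "dwc_datasetid_added")
```

Any flag that matches none of the patterns ends up with a missing category, so it is worth checking `df_flagAssoc` for `NA` values before plotting.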