Code written by Erica Krimmel, modified from the original presented at the 2019 ADBC Summit in Gainesville, FL. Please see Use Case: Identify specimen records with suspicious coordinate data for context.

# Load core libraries; install these packages if you have not already
library(ridigbio)
library(tidyverse)

# Load library for making nice HTML output
library(kableExtra)

# Load libraries for visualizing geographic data
library(leaflet)

In this use case for the iDigBio API we explore a situation where geographic coordinate data from the provider was modified by iDigBio during its data quality assurance process. See here for more information about iDigBio’s data quality flags.

Write a query to search for specimen records

First, let’s find all the specimen records for the data quality flag we are interested in. Do this using the idig_search_records function from the ridigbio package. You can learn more about this function from the iDigBio API documentation and ridigbio documentation. In this example, we want to start by searching for specimens flagged with “rev_geocode_corrected.”

# Edit the fields (e.g. `flags`) and values (e.g. "rev_geocode_corrected") in
# `list()` to adjust your query and the fields (e.g. `uuid`) in `fields` to
# adjust the columns returned in your results
df_flagCoord <- idig_search_records(
  rq = list(flags = "rev_geocode_corrected",
            institutioncode = "lacm"),
  fields = c("uuid",
             "institutioncode",
             "collectioncode",
             "country",
             "data.dwc:country",
             "stateprovince",
             "county",
             "locality",
             "geopoint",
             "data.dwc:decimalLongitude",
             "data.dwc:decimalLatitude",
             "flags"),
  limit = 100000) %>% 
  # Rename fields to more easily reflect their provenance (either from the
  # data provider directly or modified by the data aggregator)
  rename(provider_lon = `data.dwc:decimalLongitude`,
         provider_lat = `data.dwc:decimalLatitude`,
         provider_country = `data.dwc:country`,
         aggregator_lon = `geopoint.lon`,
         aggregator_lat = `geopoint.lat`,
         aggregator_country = country,
         aggregator_stateprovince = stateprovince,
         aggregator_county = county,
         aggregator_locality = locality) %>% 
  # Reorder columns for easier viewing
  select(uuid, institutioncode, collectioncode, provider_lat, aggregator_lat,
         provider_lon, aggregator_lon, provider_country, aggregator_country,
         aggregator_stateprovince, aggregator_county, aggregator_locality,
         flags)

Here is what our query results look like (first 50 rows, with the flags column omitted):

uuid institutioncode collectioncode provider_lat aggregator_lat provider_lon aggregator_lon provider_country aggregator_country aggregator_stateprovince aggregator_county aggregator_locality
004f15e3-92e8-4c8d-8a9b-3074a37a27dc lacm fish -39.2333333 -75.71667 -75.7166667 39.23333 Antarctica antarctica NA NA NA
004fa3d0-7d99-4af4-98b8-dd6c64e68906 lacm fish -62.5 -62.50000 -108.5833333 108.58333 Antarctica antarctica NA NA NA
00a0ea39-6602-4096-ab4e-a8e5dce43760 lacm fish -59.2333333 -69.21667 -69.2166667 59.23333 Antarctica antarctica NA NA NA
00ae30fc-e9fb-4786-9f3c-ce6270371d7a lacm fish -36.0166667 -83.00000 -83 -36.01667 Antarctica antarctica NA NA NA
00b60cb3-2390-4125-b679-711e6cc06298 lacm fish -55.3666667 -78.13333 -78.1333333 55.36667 Antarctica antarctica NA NA NA
00b92804-7166-48f2-81c5-583741d90400 lacm fish -56.1 -79.06667 -79.0666667 56.10000 Antarctica antarctica NA NA NA
00bd0dab-5c51-475e-b93c-f0c52f703b5f lacm fish -57.0666667 -59.55000 -59.55 -57.06667 Antarctica antarctica NA NA NA
00dd683d-e6ac-4031-a388-f0b8de0506e0 lacm fish -68.85 -68.85000 -114.1333333 114.13333 Antarctica antarctica NA NA NA
00e7660e-6f68-4966-a2d8-ad076261adc7 lacm fish -39.2333333 -75.71667 -75.7166667 39.23333 Antarctica antarctica NA NA NA
00ece399-7933-4ffa-b6f4-319d41ea25dc lacm fish -56.0666667 -82.83333 -82.8333333 -56.06667 Antarctica antarctica NA NA NA
00fa11b3-9ba1-45ad-8cc9-b2b0586f7492 lacm fish -51.5 -77.58333 -77.5833333 51.50000 Antarctica antarctica NA NA NA
0105c782-ad92-49fa-b88e-254c14767072 lacm fish -62.2166667 -62.21667 -95.65 95.65000 Antarctica antarctica NA NA NA
010b1128-4d78-4545-b73d-9815ce39e215 lacm fish -59.6333333 -82.45000 -82.45 -59.63333 Antarctica antarctica NA NA NA
0120df13-f6a4-4854-8d9f-a59998ecc537 lacm fish -60.1166667 -70.13333 -70.1333333 60.11667 Antarctica antarctica NA NA NA
01381f5e-6b15-44a8-a2f3-5761100b630c lacm herps 26.753 26.75300 80.949 -80.94900 United States united states florida hendry best western hotel, clewiston
0138fc7e-1a82-4526-8881-6bc74ba6749f lacm fish -63.45 -86.83333 -86.8333333 -63.45000 Antarctica antarctica NA NA NA
0143216a-d45d-4f44-8957-c8d9985e92e6 lacm fish -56.15 -60.90000 -60.9 -56.15000 Antarctica antarctica NA NA NA
01626bec-6af3-4b46-a184-971d002e5822 lacm fish -57.2333333 -62.76667 -62.7666667 -57.23333 Antarctica antarctica NA NA NA
01b3127e-5815-43c9-a5fc-d064bf21eb89 lacm fish -59.6166667 -88.90000 -88.9 -59.61667 Antarctica antarctica NA NA NA
01ce0ad8-a0b4-4dfd-93f6-3eb597070f90 lacm fish -53.3 -75.53333 -75.5333333 53.30000 Antarctica antarctica NA NA NA
01e47af0-1c6d-49b3-b6df-86e7aeb5e8d6 lacm fish -40.0166667 -82.81667 -82.8166667 -40.01667 Antarctica antarctica NA NA NA
01fc820f-2b74-4e28-8470-db1bfcd37c2e lacm fish -57.3333333 -74.70000 -74.7 57.33333 Antarctica antarctica NA NA NA
02249913-685c-422f-b73a-ac1d439ad607 lacm fish -44.6833333 -76.06667 -76.0666667 44.68333 Antarctica antarctica NA NA NA
023cac7a-2468-42ab-a586-bcb0f856d31f lacm fish -65.0833333 -65.08333 -41.3 41.30000 Antarctica antarctica NA NA NA
0283d543-06c5-4674-a42c-ca375db31892 lacm fish -42 -86.00000 -86 -42.00000 Antarctica antarctica NA NA NA
02984cf6-e6e8-408b-889d-e073ab15accf lacm fish -59.9333333 -69.00000 -69 59.93333 Antarctica antarctica NA NA NA
029ac4ca-8f8c-4f1e-a33a-da7e517ef402 lacm fish -39.3916667 -73.55000 -73.55 39.39167 Antarctica antarctica NA NA NA
02a0aa6f-fcdd-48c5-a45c-0a255b141d82 lacm fish -47.4333333 -76.66667 -76.6666667 47.43333 Antarctica antarctica NA NA NA
02a5f38f-cd94-4b9c-af7d-260922884712 lacm herps -26.8 -20.88333 20.8833333 26.80000 Botswana botswana NA NA 7 mi n, 12 mi e jct., molopo-nossob rivers
02b1a6b5-0b26-4f26-b2a2-3f0e4d7e3cf6 lacm fish -45.0166667 -76.55000 -76.55 45.01667 Antarctica antarctica NA NA NA
02b64cdc-54f3-442b-90d2-a2cf98bdc13a lacm fish -53.55 -70.75000 -70.75 53.55000 Antarctica antarctica NA NA NA
02b80fb9-09aa-4d95-8e7d-6894327b7dec lacm fish -41.9833333 -86.11667 -86.1166667 -41.98333 Antarctica antarctica NA NA NA
02c9ee2d-561e-44de-85af-5a46a67f8cad lacm fish -57.8666667 -74.71667 -74.7166667 57.86667 Antarctica antarctica NA NA NA
0309ac59-7460-4d3f-85e3-a3331f2b7443 lacm fish -67.9166667 -67.91667 -103.2166667 103.21667 Antarctica antarctica NA NA NA
03349313-c8f0-4d21-b27a-fc566f86a557 lacm fish -44.6833333 -76.06667 -76.0666667 44.68333 Antarctica antarctica NA NA NA
035b5037-fd54-493d-bc8d-cfffb5e69e80 lacm fish -38.15 -74.51667 -74.5166667 38.15000 Antarctica antarctica NA NA NA
03965834-16f2-45f1-bf63-8d5975e9c218 lacm herps -26.8 -20.88333 20.8833333 26.80000 Botswana botswana NA NA 7 mi n, 12 mi e jct., molopo-nossob rivers
03b8814f-208b-411a-a72f-6edbc745e781 lacm fish -57.2333333 -70.95000 -70.95 57.23333 Antarctica antarctica NA NA NA
03bb1ab1-5226-4a07-a8df-d826c911be45 lacm fish -62.7 -78.56667 -78.5666667 62.70000 Antarctica antarctica NA NA NA
03bbad49-568d-4625-836b-7ca587be33af lacm fish -33.55 -72.75000 -72.75 33.55000 Antarctica antarctica NA NA NA
03bc3bbe-af95-4adb-b9fa-9ed28dd70a94 lacm fish -54.895 -65.45667 -65.4566667 -54.89500 Antarctica antarctica NA NA NA
03dd5da9-ceb3-4481-b722-3913479b8b54 lacm fish -41.0833333 -74.90000 -74.9 41.08333 Antarctica antarctica NA NA NA
03e5220a-9afe-4a8a-8858-4b47505fb6d3 lacm fish -36.0166667 -83.00000 -83 -36.01667 Antarctica antarctica NA NA NA
03e6c3f1-6343-4330-93c2-00511202dad5 lacm birds -65.18 -65.18000 -43.72 43.72000 Antarctica antarctica NA NA NA
03ec203f-38aa-41a8-8c16-fc8be1d56ffb lacm fish -59.8666667 -82.83333 -82.8333333 -59.86667 Antarctica antarctica NA NA NA
04186251-8b4f-4973-a397-5f0c45e837f3 lacm fish -36.3166667 -83.00000 -83 -36.31667 Antarctica antarctica NA NA NA
04221f29-d45d-487e-8b24-b55975aa093e lacm fish -51.5 -77.58333 -77.5833333 51.50000 Antarctica antarctica NA NA NA
04379435-5305-49a8-a308-fa4d252a8d94 lacm fish -59.85 -78.88333 -78.8833333 59.85000 Antarctica antarctica NA NA NA
04730a4e-792f-452d-8f5a-e724f2876562 lacm fish -59.1166667 -66.96667 -66.9666667 -59.11667 Antarctica antarctica NA NA NA
04920726-4880-4495-9163-b7ec5475fb4b lacm fish -53.1833333 -70.83333 -70.8333333 53.18333 Antarctica antarctica NA NA NA
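
To see what kinds of changes sit behind these rows, the provider and aggregator coordinates can be compared directly. The following is a minimal sketch in base R using three rows copied from the table above; `classify_change()` and its labels are hypothetical names made up for this illustration and are not iDigBio flag names.

```r
# Three rows copied from the query result shown above
examples <- data.frame(
  uuid = c("004f15e3-92e8-4c8d-8a9b-3074a37a27dc",
           "004fa3d0-7d99-4af4-98b8-dd6c64e68906",
           "01381f5e-6b15-44a8-a2f3-5761100b630c"),
  provider_lat = c(-39.2333333, -62.5, 26.753),
  provider_lon = c(-75.7166667, -108.5833333, 80.949),
  aggregator_lat = c(-75.71667, -62.5, 26.753),
  aggregator_lon = c(39.23333, 108.58333, -80.949),
  stringsAsFactors = FALSE
)

# Hypothetical helper: describe how the aggregator coordinates differ from the
# provider coordinates, allowing for rounding in the printed values
classify_change <- function(p_lat, p_lon, a_lat, a_lon, tol = 1e-4) {
  near <- function(a, b) abs(a - b) < tol
  unswapped <- near(abs(p_lat), abs(a_lat)) & near(abs(p_lon), abs(a_lon))
  swapped   <- near(abs(p_lat), abs(a_lon)) & near(abs(p_lon), abs(a_lat))
  lat_neg   <- near(p_lat, -a_lat) & p_lat != 0
  lon_neg   <- near(p_lon, -a_lon) & p_lon != 0
  ifelse(unswapped & lat_neg & lon_neg, "both signs negated",
  ifelse(unswapped & lat_neg, "latitude sign negated",
  ifelse(unswapped & lon_neg, "longitude sign negated",
  ifelse(unswapped, "unchanged",
  ifelse(swapped, "latitude and longitude swapped", "other")))))
}

examples$change <- with(examples,
  classify_change(provider_lat, provider_lon, aggregator_lat, aggregator_lon))
examples[, c("uuid", "change")]
```

For these three rows the sketch reports one swapped latitude/longitude pair and two negated longitudes. The tolerance `tol` is an assumption chosen to absorb the rounding visible in the aggregator columns.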

Visualize suspicious coordinates

One example of a geographic coordinate data quality issue is a latitude or longitude with a reversed sign, e.g. the data provider gave the value latitude = “7.1789” but meant latitude = “-7.1789.” In the map below we can see a few examples of specimen records published to iDigBio where this is the case. These data have been adjusted by iDigBio, and this action is recorded with the data quality flag “rev_geocode_lat_sign,” which is the flag the code below filters on.

# Create function to allow subsetting the `df_flagCoord` dataset by other flags
# found on these same records
df_flagSubset <- function(subsetFlag) {
  df_flagCoord %>% 
    filter(grepl(subsetFlag, flags)) %>% 
    select(uuid, matches("_lat|_lon")) %>% 
    unite(provider_coords, c("provider_lat", "provider_lon"), sep = ",") %>% 
    unite(aggregator_coords, c("aggregator_lat", "aggregator_lon"), sep = ",") %>% 
    gather(key = type, value = coordinates, -uuid) %>% 
    separate(coordinates, c("lat", "lon"), sep = ",") %>% 
    mutate(lat = as.numeric(lat)) %>% 
    mutate(lon = as.numeric(lon)) %>% 
    arrange(uuid, type)
}

# Subset `df_flagCoord` by records flagged for having had their latitude negated
# to place point in stated country by reverse geocoding process
df_rev_geocode_lat_sign <- df_flagSubset("rev_geocode_lat_sign")

# Create map displaying a few examples of records with the
# rev_geocode_lat_sign flag (first 10 rows = 5 records x 2 coordinate sets)
pal <- colorFactor(palette = c("#d7191c", "#fdae61", "#ffffbf", "#abdda4", "#2b83ba"),
                   domain = df_rev_geocode_lat_sign$uuid[1:10])

map <- df_rev_geocode_lat_sign[1:10,] %>% 
  mutate(popup = str_c(type, " = ", lat, ", ", lon, sep = "")) %>% 
  leaflet() %>%
  addTiles() %>% 
  addCircleMarkers(
    lng = ~lon,
    lat = ~lat,
    radius = 10,
    weight = 1,
    color = ~pal(uuid),
    stroke = FALSE,
    fillOpacity = 1,
    popup = ~popup) %>% 
  addLegend("bottomright", pal = pal, values = ~uuid,
    title = "Specimen Records",
    opacity = 1)
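
The reshaping step inside `df_flagSubset` can also be written with `tidyr::pivot_longer()`, which keeps the coordinates numeric instead of pasting them into strings and splitting them again. This is a sketch of an alternative, not the approach used in the original code; `toy` is made-up input with the same column names as `df_flagCoord`, filled with two rows from the table above.

```r
library(tidyr)

# Toy input with the same coordinate columns as `df_flagCoord`
toy <- data.frame(
  uuid = c("004fa3d0-7d99-4af4-98b8-dd6c64e68906",
           "01381f5e-6b15-44a8-a2f3-5761100b630c"),
  provider_lat = c(-62.5, 26.753),
  aggregator_lat = c(-62.5, 26.753),
  provider_lon = c(-108.5833333, 80.949),
  aggregator_lon = c(108.58333, -80.949),
  stringsAsFactors = FALSE
)

# Split each column name at "_": the first part becomes `type`, and the second
# part (".value") names the output columns `lat` and `lon`
long <- pivot_longer(toy,
                     cols = -uuid,
                     names_to = c("type", ".value"),
                     names_sep = "_")
long
```

The result has one row per record and coordinate source, with `type` holding "provider" or "aggregator" rather than the "provider_coords" and "aggregator_coords" labels produced by the original function.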

We can visualize these data on a map to better understand what the data quality flag is telling us. For example, in the map below you can see the effect of accidentally reversing the sign of the latitude on a few example georeferenced specimen records.

iDigBio uses the value provided for “country” to identify issues and apply flags such as “rev_geocode_lat_sign.” This is frequently helpful but not always correct. For example, here is a specimen record where the country was recorded as “Antarctica” and georeferenced accordingly, but which iDigBio then changed in error (probably because the data provider coordinates are farther offshore than the “country” of Antarctica extends). In the query result above, this record's longitude went from -108.58 (provider) to 108.58 (aggregator). It is important to recognize what kinds of data quality adjustments (good and bad) aggregators are making to your data because researchers may not know which set of coordinates to use.

# Create map displaying example of record whose coordinates were possibly
# corrected by iDigBio in error
df_flagCoord %>% 
  filter(uuid == "004fa3d0-7d99-4af4-98b8-dd6c64e68906") %>% 
  select(uuid, matches("_lat|_lon")) %>% 
  unite(provider_coords, c("provider_lat", "provider_lon"), sep = ",") %>% 
  unite(aggregator_coords, c("aggregator_lat", "aggregator_lon"), sep = ",") %>% 
  gather(key = type, value = coordinates, -uuid) %>% 
  separate(coordinates, c("lat","lon"), sep = ",") %>% 
  mutate(lat = as.numeric(lat)) %>% 
  mutate(lon = as.numeric(lon)) %>% 
  arrange(uuid, type) %>% 
  mutate(popup = str_c(type, " = ", lat, ", ", lon, sep = "")) %>% 
  leaflet() %>%
  addTiles() %>% 
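  addCircleMarkers(
    lng = ~lon,
    lat = ~lat,
    radius = 10,
    weight = 1,
    # Use a fixed color here: `pal` was built for the UUIDs mapped earlier and
    # would return NA for a UUID outside its domain
    color = "#d7191c",
    stroke = FALSE,
    fillOpacity = 1,
    popup = ~popup) %>% 
  addLegend("bottomright", colors = "#d7191c",
    labels = "004fa3d0-7d99-4af4-98b8-dd6c64e68906",
    title = "Specimen Record",
    opacity = 1)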
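
The country-based check described above can be illustrated with a toy example. This is only a sketch of the general idea, not iDigBio's actual implementation: real reverse geocoding tests points against country polygons, whereas `us_bbox` below is a rough, hypothetical bounding box for the contiguous United States, and `in_bbox()` and `suggest_fix()` are made-up helper names. The coordinates come from the Clewiston, Florida record in the table above.

```r
# Rough, hypothetical bounding box for the contiguous United States
us_bbox <- list(lat = c(24, 50), lon = c(-125, -66))

# TRUE where a point falls inside the bounding box
in_bbox <- function(lat, lon, bbox) {
  lat >= bbox$lat[1] & lat <= bbox$lat[2] &
    lon >= bbox$lon[1] & lon <= bbox$lon[2]
}

# Try the coordinates as given, then with sign changes, and return the first
# variant that lands inside the stated country's bounding box (NULL if none)
suggest_fix <- function(lat, lon, bbox) {
  candidates <- data.frame(
    fix = c("none", "negate latitude", "negate longitude", "negate both"),
    lat = c(lat, -lat, lat, -lat),
    lon = c(lon, lon, -lon, -lon),
    stringsAsFactors = FALSE
  )
  hits <- candidates[in_bbox(candidates$lat, candidates$lon, bbox), ]
  if (nrow(hits) == 0) return(NULL)
  hits[1, ]
}

# Provider coordinates for the Florida record: 26.753, 80.949
fix <- suggest_fix(26.753, 80.949, us_bbox)
fix
```

Here only the "negate longitude" variant falls inside the box, which matches the aggregator coordinates shown for that record. A country outline that stops at the coastline will not contain an offshore collecting locality as given, so a check like this may go looking for a sign change that was never needed, which is the kind of miscorrection discussed above.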

Summarize and explore data

The iDigBio API provides a means for an institution to examine data quality issues across collections, which sometimes is not possible internally when data in different collections are managed in different databases.

# Summarize flagged records by collection type
spmByColl <- df_flagCoord %>% 
  group_by(collectioncode) %>% 
  tally()

# Generate graph to display counts of flagged records by collection within the
# institution
graph_spmByColl <- ggplot(spmByColl, 
                          aes(x = reorder(collectioncode, -n), 
                              y = n,
                              fill = collectioncode)) +
  geom_col() +
  theme(panel.background = element_blank(),
        legend.title = element_blank(),
        axis.title.x = element_text(face = "bold"),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.y = element_text(face = "bold"),
        plot.title = element_text(size = 12, face = "bold")) +
  labs(x = "collection", 
       y = "# of specimen records",
       title = "LACM records flagged with geo-coordinate data quality issues by iDigBio") +
  geom_text(aes(label = n), vjust = -0.5)

# Get count of total records published by the institution using function
# `idig_count_records`
totalInstSpm <- idig_count_records(rq = list(institutioncode = "lacm"))

# Calculate flagged records as percent of total records
percentFlagged <- sum(spmByColl$n)/totalInstSpm*100

For example, we can ask how many specimen records from which collections at the Natural History Museum of Los Angeles County (LACM) have been flagged with “rev_geocode_corrected” by iDigBio. As an aside, although this graph highlights the number of specimen records with data quality issues, these represent only 0.36% of the total specimen records published by LACM.

We can also explore what other data quality flags these specimen records have been flagged with.

# Collate `df_flagAssoc` to describe other data quality flags that are associated
# with rev_geocode_corrected in `df_flagCoord`
df_flagAssoc <- df_flagCoord %>% 
  select(uuid, flags) %>% 
  unnest(flags) %>% 
  group_by(flags) %>% 
  tally() %>% 
  # Rules are evaluated in order and the first match wins; flags that match no
  # rule get a category of NA
  mutate(category = case_when(
    str_detect(flags, "geo|country|state") ~ "geography",
    str_detect(flags, "dwc_datasetid_added|dwc_multimedia_added|datecollected_bounds") ~ "other",
    str_detect(flags, "gbif|dwc|tax") ~ "taxonomy")) %>% 
  mutate(percent = n / nrow(df_flagCoord) * 100) %>% 
  arrange(category, desc(n))

# Visualize associated data quality flags
ggplot(df_flagAssoc, aes(x = reorder(flags, -percent), y = percent, fill = category)) +
  geom_col() +
  theme(axis.title.x = element_text(face = "bold"),
        axis.text.x = element_text(angle = 75, hjust = 1),
        axis.ticks.y = element_blank(),
        axis.title.y = element_text(face = "bold"),
        plot.title = element_text(size = 12, face = "bold")
        ) +
  labs(x = "additional iDigBio data quality flag", 
       y = "% specimen records",
       title = "LACM records flagged for geo-coordinate issues are also flagged for...",
       fill = "flag category")
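
Because the category rules in `df_flagAssoc` are checked in order, a flag such as `dwc_datasetid_added` is labeled "other" before the broader `dwc` pattern can label it "taxonomy". The sketch below mirrors that ordering in base R on a handful of flag names; `categorize_flag()` is a hypothetical helper, `gbif_taxon_corrected` is included purely as an illustrative taxonomy-style name, and `some_unmatched_flag` is invented to show what happens when no rule applies.

```r
# Hypothetical helper that applies the same patterns, in the same order, as the
# `case_when()` call used to build `df_flagAssoc`
categorize_flag <- function(flags) {
  category <- rep(NA_character_, length(flags))
  rules <- list(
    geography = "geo|country|state",
    other     = "dwc_datasetid_added|dwc_multimedia_added|datecollected_bounds",
    taxonomy  = "gbif|dwc|tax"
  )
  for (name in names(rules)) {
    # Only fill in flags that no earlier rule has already categorized
    hit <- is.na(category) & grepl(rules[[name]], flags)
    category[hit] <- name
  }
  category
}

flags <- c("rev_geocode_corrected", "dwc_datasetid_added",
           "gbif_taxon_corrected", "some_unmatched_flag")
data.frame(flags, category = categorize_flag(flags))
```

Swapping the "other" and "taxonomy" rules would move `dwc_datasetid_added` into "taxonomy", so the order of the patterns is part of the categorization. Flags left as NA are worth reviewing, since they indicate a pattern that the rules do not yet cover.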