Function to check all values of a vector or data frame/tibble column are unique
checkAllUnique.Rd
Arguments
- dat
vector, single-dimensional list, or column(s) of a data frame or tibble to test for whether all values/rows are unique
- errIfNA
logical: defaults to TRUE to disallow any NA values
- allowJustOneNA
logical: defaults to TRUE, so if errIfNA is FALSE but dat contains more than one NA, the function returns FALSE
- allowMultipleColumns
logical: defaults to FALSE; set to TRUE if you really do want to check all the rows of a multi-column dat
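The way the two NA arguments interact can be sketched in base R. This is not the package's own code: sketchAllUnique() is a hypothetical stand-in, written only to reproduce the behaviour described above for plain vectors.

```r
# Hypothetical stand-in for illustration only, NOT the package implementation
sketchAllUnique <- function(x, errIfNA = TRUE, allowJustOneNA = TRUE) {
  nNA <- sum(is.na(x))
  # errIfNA = TRUE: any NA at all counts as a failure
  if (errIfNA && nNA > 0) return(FALSE)
  # allowJustOneNA = TRUE: a single NA is tolerated, more than one is not
  if (allowJustOneNA && nNA > 1) return(FALSE)
  # otherwise only the non-missing values have to be unique
  nonMissing <- x[!is.na(x)]
  anyDuplicated(nonMissing) == 0
}

sketchAllUnique(c(letters, NA))                   # FALSE
sketchAllUnique(c(letters, NA), errIfNA = FALSE)  # TRUE
sketchAllUnique(c(NA, letters, NA), errIfNA = FALSE)  # FALSE
sketchAllUnique(c(NA, letters, NA), errIfNA = FALSE, allowJustOneNA = FALSE)  # TRUE
```

The order of the checks is an assumption chosen to match the documented results; the real function may arrive at them differently.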
Background
This is a utility function to check that all the values of a vector, a single column of a data frame or tibble, or a single-dimensional list are unique. It can also check that the rows across multiple columns of a data frame or tibble are all unique.
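The difference between unique values in one column and unique rows across columns can be seen with base R's anyDuplicated(). This is a minimal sketch using made-up data, not a description of how checkAllUnique() works internally.

```r
# Hypothetical data: V1 repeats a value, but every (V1, V2) row is distinct
dat <- data.frame(V1 = c(1, 1, 2), V2 = c("a", "b", "a"))

# anyDuplicated() returns 0 when there are no duplicates
columnUnique <- anyDuplicated(dat$V1) == 0  # FALSE: the value 1 occurs twice
rowsUnique <- anyDuplicated(dat) == 0       # TRUE: no complete row is repeated
```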
This is important where you expect a single row of data for each participant, say, and so you expect the values of participantID in a data set to be unique. It will cripple things later if you don't identify duplicated ID values. Sometimes it's not just a single variable that should be unique. A typical example is when you have data from multiple services, each with its own ID values for participants. Then the same ID value may occur more than once, but each occurrence should come from a different service. (Mind you, once I have established that rows are unique across serviceID and participantID, I usually paste the two variables together into a single ID variable, either with base R or a tidyverse way of doing it.)
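A minimal sketch of that pasting step, assuming a data frame with columns named serviceID and participantID. The example data, the "_" separator and the name combinedID are illustrative assumptions, not taken from the package.

```r
# Hypothetical example data: two services that reuse the same participant IDs
dat <- data.frame(serviceID = c("A", "A", "B", "B"),
                  participantID = c(1, 2, 1, 2))

# Base R: paste the two ID variables into one combined ID
dat$combinedID <- paste(dat$serviceID, dat$participantID, sep = "_")

# A tidyverse equivalent, run only if dplyr is installed
if (requireNamespace("dplyr", quietly = TRUE)) {
  datTidy <- dplyr::mutate(dat[, c("serviceID", "participantID")],
                           combinedID = paste(serviceID, participantID, sep = "_"))
}

dat$combinedID
# [1] "A_1" "A_2" "B_1" "B_2"
```

Using a separator avoids accidental collisions such as service "1" with participant "12" and service "11" with participant "2" both pasting to "112".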
Making sure things are unique also prevents all sorts of sometimes very confusing error messages if you want to pivot data depending on an ID variable whose values ought to be unique but aren't. Checking for this is in my Wisdom! Rblog page.
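When a uniqueness check fails before a pivot, it usually helps to see which rows are responsible. One possible way to list them in base R is sketched below; the data and the column names are hypothetical.

```r
# Hypothetical long-format data with one accidental repeat of (P2, occasion 2)
longDat <- data.frame(participantID = c("P1", "P1", "P2", "P2", "P2"),
                      occasion = c(1, 2, 1, 2, 2),
                      score = c(10, 12, 9, 11, 13))
keyCols <- c("participantID", "occasion")

# duplicated() flags only the later copies, so also scan from the end
# to flag every row that shares its key with another row
dupRows <- duplicated(longDat[, keyCols]) |
  duplicated(longDat[, keyCols], fromLast = TRUE)

# rows 4 and 5 share the same participantID and occasion
longDat[dupRows, ]
```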
See also
Other data checking functions: getNNA(), getNOK(), isOneToOne()
Examples
if (FALSE) {
### letters gets us the 26 lower case letters of the English alphabet so should test OK
checkAllUnique(letters)
# [1] TRUE
### good!
### the checking is case sensitive:
checkAllUnique(c("A", letters), errIfNA = FALSE, allowJustOneNA = FALSE)
# [1] TRUE
### but ...
checkAllUnique(c("a", letters), errIfNA = FALSE, allowJustOneNA = FALSE)
# [1] FALSE
### both good!
### by default checkAllUnique doesn't allow any NA values,
### generally sensible for my data: ID codes or table indices
checkAllUnique(c(letters, NA))
# [1] FALSE
### good!
### but we can override that:
checkAllUnique(c(letters, NA), errIfNA = FALSE)
# [1] TRUE
### good!
### but generally I wouldn't want multiple NA values
### in the typical situations I'd be using
### checkAllUnique() so I have forced you to allow
### that explicitly if you really want it ...
checkAllUnique(c(NA, letters, NA), errIfNA = FALSE)
# [1] FALSE
### good!
### but you _can_ override that if you need to ...
checkAllUnique(c(NA, letters, NA), errIfNA = FALSE, allowJustOneNA = FALSE)
# [1] TRUE
### good!
### by default checkAllUnique expects vector input but it can handle data
### in multiple columns in a data frame or tibble
tmpDat <- as.data.frame(matrix(1:10, ncol = 2))
tmpDat
# V1 V2
# 1 1 6
# 2 2 7
# 3 3 8
# 4 4 9
# 5 5 10
checkAllUnique(tmpDat[, 1])
# [1] TRUE
checkAllUnique(tmpDat[, 2])
# [1] TRUE
### but remember column indexing a tibble returns a tibble not a vector
### (as_tibble(), %>% and pull() used below need the dplyr package loaded)
library(dplyr)
tmpTib <- as_tibble(tmpDat)
tmpTib
# # A tibble: 5 x 2
# V1 V2
# <int> <int>
# 1 1 6
# 2 2 7
# 3 3 8
# 4 4 9
# 5 5 10
checkAllUnique(tmpTib[, 1])
# Error in checkAllUnique(tmpTib[, 1]) :
# You have input an object of length one to checkAllUnique(): makes no sense!
### whoops, I remember, must pull() to extract as vector
checkAllUnique(tmpTib %>% pull(1))
# [1] TRUE
checkAllUnique(tmpTib %>% pull(2))
# [1] TRUE
### you _can_ allow multiple columns
checkAllUnique(tmpDat, allowMultipleColumns = TRUE)
# [1] TRUE
# Warning message:
# In checkAllUnique(tmpDat, allowMultipleColumns = TRUE) :
# Input of dat to checkAllUnique was not a vector: be careful please!
checkAllUnique(tmpTib, allowMultipleColumns = TRUE)
# [1] TRUE
# Warning message:
# In checkAllUnique(tmpTib, allowMultipleColumns = TRUE) :
# Input of dat to checkAllUnique was not a vector: be careful please!
### but it is checking all content in all columns so if two rows are the
### same it will return FALSE:
tmpDat2 <- tmpDat
tmpDat2[2, ] <- tmpDat[1, ] # make row 2 same as row 1
tmpDat2
# V1 V2
# 1 1 6
# 2 1 6
# 3 3 8
# 4 4 9
# 5 5 10
checkAllUnique(tmpDat2, allowMultipleColumns = TRUE)
# [1] FALSE
# Warning message:
# In checkAllUnique(tmpDat2, allowMultipleColumns = TRUE) :
# Input of dat to checkAllUnique was not a vector: be careful please!
### what about columns of different classes, e.g. numeric and character
tmpDatMixed <- cbind(tmpDat2, letters[1:5])
tmpDatMixed
# V1 V2 letters[1:5]
# 1 1 6 a
# 2 1 6 b
# 3 3 8 c
# 4 4 9 d
# 5 5 10 e
checkAllUnique(tmpDatMixed, allowMultipleColumns = TRUE)
# [1] TRUE
# Warning message:
# In checkAllUnique(tmpDatMixed, allowMultipleColumns = TRUE) :
# Input of dat to checkAllUnique was not a vector: be careful please!
### that came out as TRUE because R "promoted" the numerics to character
### and because the two different letters in rows 1 and 2 of column 3
### broke the non-unique tie of rows 1 and 2 in columns 1 and 2
### what about having NA in a multicolumn input?
tmpDatNA <- tmpDat
tmpDatNA[2, 1] <- NA
tmpDatNA
# V1 V2
# 1 1 6
# 2 NA 7
# 3 3 8
# 4 4 9
# 5 5 10
checkAllUnique(tmpDatNA, allowMultipleColumns = TRUE)
# [1] FALSE
# Warning message:
# In checkAllUnique(tmpDatNA, allowMultipleColumns = TRUE) :
# Input of dat to checkAllUnique was not a vector: be careful please!
### but you can relax the NA check for multi-column input too:
checkAllUnique(tmpDatNA, allowMultipleColumns = TRUE, errIfNA = FALSE)
}
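The "object of length one" error in the tibble example above appears to arise because single-bracket column indexing of a tibble keeps it as a tibble, and the length of a data frame or tibble is its number of columns, not its number of rows. A minimal base R sketch, using drop = FALSE on a data frame to mimic that tibble behaviour without needing the tibble package:

```r
tmpDat <- as.data.frame(matrix(1:10, ncol = 2))

# drop = FALSE keeps the result as a one-column data frame,
# which is what tibble[, 1] does by default
oneCol <- tmpDat[, 1, drop = FALSE]

length(oneCol)        # 1: one column, so it looks like a length-one object
length(tmpDat[, 1])   # 5: a plain data frame drops to a vector here
length(oneCol[[1]])   # 5: double-bracket indexing extracts the vector
```

That the function tests length() is an inference from the wording of the error message, not something confirmed from its source.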