This question was posed on R-Help; for the following data, "I would like to perform a simple matching but only row 1 compared to row 1, row 2 compared to row 2 (paired).......giving back a number as dissimilarity for each comparison."
MalVar29_37 <- read.table(textConnection("V1 V2 V3 V4 V5 V6 V7 V8 V9
0 0 0 0 0 1 0 0 0
0 0 0 0 0 1 0 0 0
0 0 0 0 0 1 0 0 0
NA NA NA NA NA NA NA NA NA
0 1 0 0 0 1 0 0 0"), header=TRUE)
FemVar29_37 <- read.table(textConnection("V1 V2 V3 V4 V5 V6 V7 V8 V9
1 1 0 0 0 0 0 0 0
0 1 0 0 1 1 0 0 0
1 0 0 1 0 0 0 0 0
0 1 0 0 1 0 0 0 0
0 1 0 0 0 0 0 0 0"), header=TRUE)
I have suggested:
comparison <- MalVar29_37 == FemVar29_37
dissimilar <- function(tRow){ length(tRow[tRow==FALSE]) }
dissimilarity <- apply(comparison, c(1), dissimilar)
dissimilarity
Variable comparison is an entry-by-entry comparison of the two data frames, resulting in values of TRUE or FALSE (or NA where either entry is missing). I've defined a function dissimilar that counts the number of FALSEs in a given object (tRow). Variable dissimilarity is then the result of applying this function to each row of comparison. In this example, 0 means all of the entries in a row matched, 9 means none of them matched.
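To see what the length-based function does with a missing comparison, here is a minimal sketch on a single made-up row. The vector tRow below is a toy example of my own, not part of the R-Help data; it only illustrates how logical indexing treats NA.

```r
# Toy logical row (hypothetical): two matches, two mismatches, one missing comparison
tRow <- c(TRUE, FALSE, NA, TRUE, FALSE)

# TRUE where the entries differed, NA where the comparison is missing
tRow == FALSE

# Indexing with a logical vector that contains NA keeps an NA placeholder,
# so the result holds the two FALSE entries plus one NA
mismatches <- tRow[tRow == FALSE]
mismatches

# The length therefore counts the NA along with the mismatches
length(mismatches)
```

So the length-based version already treats an NA as dissimilar, which is the working assumption of this recipe.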
Phil Spector kindly emailed me pointing out that using sum would be more efficient in the dissimilar function. However, I still want to treat NAs as dissimilar (my working assumption throughout this recipe), and sum needs to be told how to handle them. So a better function might be:
dissimilar <- function(tRow){ sum(tRow==FALSE, na.rm=TRUE) + sum(is.na(tRow)) }
Using sum is about 40% faster than subsetting the vector and taking its length, although the speed increase will only be noticeable on very long vectors.
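A rough way to check both points (that the two versions agree, and how their speed compares) is sketched below. The names dissimilar_length and dissimilar_sum and the random test vector x are mine, introduced only for this comparison; timings will vary by machine and R version, so I have not listed any expected numbers.

```r
# The two versions from this recipe, renamed (hypothetical names) so they can coexist
dissimilar_length <- function(tRow){ length(tRow[tRow==FALSE]) }
dissimilar_sum <- function(tRow){ sum(tRow==FALSE, na.rm=TRUE) + sum(is.na(tRow)) }

# Made-up test data: a long logical vector that includes some NAs
set.seed(1)
x <- sample(c(TRUE, FALSE, NA), 1e5, replace=TRUE)

# Both versions should count FALSEs and NAs identically
same_result <- dissimilar_length(x) == dissimilar_sum(x)
same_result

# Rough timing over repeated calls; results depend on the machine
system.time(for (i in 1:50) dissimilar_length(x))
system.time(for (i in 1:50) dissimilar_sum(x))
```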
The inquirer also wondered how to make the dissimilarity a number between 0 and 1, by dividing by the number of variables in the comparison. The nice thing about defining dissimilar as a function is that it's easy to change:
dissimilar <- function(tRow){ (sum(tRow==FALSE, na.rm=TRUE) + sum(is.na(tRow)))/length(tRow) }
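As a quick check of the scaled version, here is a small sketch on a made-up comparison matrix (the matrix is hypothetical, not the R-Help data). It also shows one possible vectorized alternative using rowMeans, which is my own suggestion rather than anything from the original thread.

```r
# Scaled version from this recipe: fraction of entries that are FALSE or NA
dissimilar <- function(tRow){ (sum(tRow==FALSE, na.rm=TRUE) + sum(is.na(tRow)))/length(tRow) }

# Hypothetical comparison matrix: 3 rows, 4 variables
comparison <- matrix(c(TRUE, TRUE,  TRUE, TRUE,
                       TRUE, FALSE, NA,   TRUE,
                       NA,   NA,    NA,   NA),
                     nrow=3, byrow=TRUE)

# Row 1: nothing differs; row 2: one FALSE and one NA out of 4; row 3: all NA
scaled <- apply(comparison, 1, dissimilar)
scaled  # 0.0 0.5 1.0

# Possible alternative without apply: an entry is dissimilar if it is NA or FALSE
scaled_vec <- rowMeans(is.na(comparison) | !comparison)
scaled_vec  # 0.0 0.5 1.0
```

The rowMeans form avoids the per-row function call, but keeping dissimilar as a separate function makes it easier to change the definition later.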