Row-by-row dissimilarity measure

This question was posed on R-Help. For the following data, the poster wrote: "I would like to perform a simple matching but only row 1 compared to row 1, row 2 compared to row 2 (paired)... giving back a number as dissimilarity for each comparison."

MalVar29_37 <- read.table(textConnection("V1 V2 V3 V4 V5 V6 V7 V8 V9
0  0  0  0  0  1  0  0  0
0  0  0  0  0  1  0  0  0
0  0  0  0  0  1  0  0  0
NA NA NA NA NA NA NA NA NA
0  1  0  0  0  1  0  0  0"), header=TRUE)

FemVar29_37 <- read.table(textConnection("V1 V2 V3 V4 V5 V6 V7 V8 V9
1  1  0  0  0  0  0  0  0
0  1  0  0  1  1  0  0  0
1  0  0  1  0  0  0  0  0
0  1  0  0  1  0  0  0  0
0  1  0  0  0  0  0  0  0"), header=TRUE)

I have suggested:

comparison <- MalVar29_37 == FemVar29_37

dissimilar <- function(tRow){ length(tRow[tRow==FALSE]) }

dissimilarity <- apply(comparison, c(1), dissimilar)

dissimilarity

The variable comparison is an entry-by-entry comparison of the two data frames, giving TRUE or FALSE for each entry (or NA where either value is missing). I've defined the function dissimilar as the number of FALSEs in a given object (tRow); because of the way R subsets with NA, entries that are NA are counted as well. The variable dissimilarity is then the result of applying this function to each row of comparison. In this example, 0 means all of the entries in a row matched and 9 means none of them matched.
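To make the result concrete, here is a self-contained sketch that rebuilds the two data frames with matrix() and runs the same three steps. The names male and female are stand-ins for MalVar29_37 and FemVar29_37, used only for illustration.

```r
# Stand-in copies of the two data frames (hypothetical names male/female).
male <- as.data.frame(matrix(c(
   0,  0,  0,  0,  0,  1,  0,  0,  0,
   0,  0,  0,  0,  0,  1,  0,  0,  0,
   0,  0,  0,  0,  0,  1,  0,  0,  0,
  NA, NA, NA, NA, NA, NA, NA, NA, NA,
   0,  1,  0,  0,  0,  1,  0,  0,  0), nrow = 5, byrow = TRUE))
female <- as.data.frame(matrix(c(
  1, 1, 0, 0, 0, 0, 0, 0, 0,
  0, 1, 0, 0, 1, 1, 0, 0, 0,
  1, 0, 0, 1, 0, 0, 0, 0, 0,
  0, 1, 0, 0, 1, 0, 0, 0, 0,
  0, 1, 0, 0, 0, 0, 0, 0, 0), nrow = 5, byrow = TRUE))

# Entry-by-entry comparison: TRUE, FALSE, or NA where a value is missing.
comparison <- male == female

# Count the entries that are FALSE; NA entries survive the subsetting
# and are therefore counted too.
dissimilar <- function(tRow){ length(tRow[tRow==FALSE]) }

dissimilarity <- apply(comparison, 1, dissimilar)
dissimilarity
# rows 1 to 5 give 3 2 3 9 1
```

Row 4 scores 9 because every entry of the first data frame's row is NA, and row 5 scores 1 because only V6 differs.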

Phil Spector kindly emailed me pointing out that using sum would be more efficient in the dissimilar function. However, I still want to be able to treat NAs as dissimilar (my working assumption throughout this recipe). So a better function might be:

dissimilar <- function(tRow){
	sum(tRow==FALSE, na.rm=TRUE) + sum(is.na(tRow))
}
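The two sum terms are needed because a comparison involving NA is itself NA, and na.rm=TRUE drops those entries from the first sum. A minimal sketch on a made-up logical vector (the values are invented for illustration):

```r
# A made-up row of comparison results: two mismatches and one missing value.
tRow <- c(TRUE, FALSE, NA, FALSE)

mismatches <- sum(tRow == FALSE, na.rm = TRUE)  # NA entries are dropped: 2
missing    <- sum(is.na(tRow))                  # NA entries counted here: 1

# Without na.rm = TRUE the NA propagates and the whole sum is NA.
unguarded <- sum(tRow == FALSE)

total <- mismatches + missing                   # 3
```

Leaving out the second term would silently treat missing entries as matches, which is the opposite of the assumption used in this recipe.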

Using sums is actually about 40% faster than subsetting the vector and taking its length (but the speed increase will only be noticeable on very long vectors).
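One way to check the speed difference on your own machine is a rough timing with system.time. This is only a sketch: the function names by_length and by_sum, the vector size, and the repetition count are arbitrary choices for illustration, and the ratio you see will depend on your R version and hardware.

```r
# The two variants under hypothetical names, so they can be timed side by side.
by_length <- function(tRow) length(tRow[tRow == FALSE])
by_sum    <- function(tRow) sum(tRow == FALSE, na.rm = TRUE) + sum(is.na(tRow))

# A long random logical vector containing TRUE, FALSE and NA.
set.seed(1)
x <- sample(c(TRUE, FALSE, NA), 1e5, replace = TRUE)

# Both variants should count FALSE and NA entries alike.
same_answer <- by_length(x) == by_sum(x)

# Repeat each call so the elapsed time is large enough to compare.
time_length <- system.time(for (i in 1:200) by_length(x))
time_sum    <- system.time(for (i in 1:200) by_sum(x))
time_length
time_sum
```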

The inquirer also wondered how to make dissimilar a number between 0 and 1, by dividing by the number of variables in the comparison. The nice thing about using dissimilar as a function is that it's easy to change:

dissimilar <- function(tRow){
	(sum(tRow==FALSE, na.rm=TRUE) + sum(is.na(tRow)))/length(tRow)
}
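As a quick check of the scaled version, the sketch below applies it to a small hand-made logical matrix (the values are made up for illustration). It also shows a vectorised alternative using rowMeans, which is not the approach from the R-Help thread but should give the same numbers under the assumption that NA counts as dissimilar.

```r
dissimilar <- function(tRow){
	(sum(tRow==FALSE, na.rm=TRUE) + sum(is.na(tRow)))/length(tRow)
}

# Made-up comparison matrix: one row per pair, one column per variable.
comparison <- rbind(
  c(TRUE, TRUE, FALSE, TRUE),   # one mismatch out of four
  c(NA,   NA,   NA,    NA),     # all missing
  c(TRUE, TRUE, TRUE,  TRUE))   # perfect match

scaled <- apply(comparison, 1, dissimilar)
scaled
# 0.25 1.00 0.00

# Alternative: an entry is dissimilar if it is NA or FALSE.
alternative <- rowMeans(is.na(comparison) | !comparison)
```

The rowMeans form avoids the explicit apply call, while keeping dissimilar as a function makes it easier to change the rule later.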