EDMAinR is an R package for Euclidean Distance Matrix Analysis (EDMA). EDMA is a coordinate‐free approach for comparing biological shapes using landmark data.
You can install the package from GitHub as with the help of the remotes R package:
if (!require(EDMAinR)) {
if (!require(remotes))
install.packages("remotes")
remotes::install_github("psolymos/EDMAinR")
}
You can now load the EDMAinR package:
Use the read_xyz
function to read 2D or 3D landmark data
from *.xyz
files. First we specify the paths to the xyz
file:
Note: we use the
system.file()
function to access example files from the package. When defining your own files, you will either set the working directory usingsetwd()
or use a path likefile <- "c:/Users/<user>/<etc>"
.
The xyz landmark data has the following structure:
XYZ
: indicates dimensions, XYZ means 3D landmark
data.42L 3 9
: dimensions, e.g. 42 landmarks (K), 3 dimensions (D), 9 specimens (n).Now we can read in these text files:
Note: find more information below about reading in other morphometrics file formats.
We stored the data from the file in the object called x
.
By typing the name x
(or by typing print(x)
)
we get to see what is inside:
The printout of x
told us the most important facts about
the data: the header information and the dimensions.
The EDMA data object (class ‘edma_data’) is a list with two elements:
$name
is the data set name (header information from the
.xyz
file),$data
is a list of n matrices (the list can be named if
specimen information is present), each matrix is of dimension K × D, dimension names for
the matrices describing landmark names (rows) and coordinate names
(columns).The following methods are used to retrieve useful information from EDMA data objects.
Use dim
to access dimensions (landmarks, K; dimensions, D; replicates, n). landmarks
,
dimensions
, and specimens
return the landmark,
dimension, and specimen names, respectively, dimnames
returns all three in a list:
dim(x) # dimensions
#> [1] 47 3 28
dimnames(x) # dimension names
#> [[1]]
#> [1] "amsph" "bas" "cpsh" "ethma" "ethmp" "laalf" "lasph" "lflac" "lnsla"
#> [10] "lnslp" "locc" "loci" "lpalf" "lpfl" "lpmx" "lpns" "lpsh" "lpsq"
#> [19] "lpto" "lptyp" "lsqu" "lsyn" "lzya" "lzygo" "lzyt" "opi" "raalf"
#> [28] "rasph" "rflac" "rmaxi" "rnsla" "rnslp" "rocc" "roci" "rpalf" "rpfl"
#> [37] "rpmx" "rpns" "rpsh" "rpsq" "rpto" "rptyp" "rsqu" "rsyn" "rzya"
#> [46] "rzygo" "rzyt"
#>
#> [[2]]
#> [1] "X" "Y" "Z"
#>
#> [[3]]
#> [1] "CZCD1_1" "CZCD1_2" "CZCD1_7" "CZCD1_10" "CZCD1_11" "CZCD1_15"
#> [7] "CZCD1_16" "CZCD1_18" "CZCD1_20" "CZCD1_21" "CZCD1_23" "CZCD1_24"
#> [13] "CZCD1_25" "CZCD1_28" "CZCD1_30" "CZCD1_32" "CZCD1_36" "CZCD1_37"
#> [19] "CZCD1_39" "CZCD1_40" "CZCD1_42" "CZCD1_53" "CZCD1_56" "CZCD1_59"
#> [25] "CZCD1_65" "CZCD1_66" "CZCD1_72" "CZCD1_73"
landmarks(x) # landmark names
#> [1] "amsph" "bas" "cpsh" "ethma" "ethmp" "laalf" "lasph" "lflac" "lnsla"
#> [10] "lnslp" "locc" "loci" "lpalf" "lpfl" "lpmx" "lpns" "lpsh" "lpsq"
#> [19] "lpto" "lptyp" "lsqu" "lsyn" "lzya" "lzygo" "lzyt" "opi" "raalf"
#> [28] "rasph" "rflac" "rmaxi" "rnsla" "rnslp" "rocc" "roci" "rpalf" "rpfl"
#> [37] "rpmx" "rpns" "rpsh" "rpsq" "rpto" "rptyp" "rsqu" "rsyn" "rzya"
#> [46] "rzygo" "rzyt"
dimensions(x) # the names of the coordinate dimensions
#> [1] "X" "Y" "Z"
specimens(x) # specimen IDs
#> [1] "CZCD1_1" "CZCD1_2" "CZCD1_7" "CZCD1_10" "CZCD1_11" "CZCD1_15"
#> [7] "CZCD1_16" "CZCD1_18" "CZCD1_20" "CZCD1_21" "CZCD1_23" "CZCD1_24"
#> [13] "CZCD1_25" "CZCD1_28" "CZCD1_30" "CZCD1_32" "CZCD1_36" "CZCD1_37"
#> [19] "CZCD1_39" "CZCD1_40" "CZCD1_42" "CZCD1_53" "CZCD1_56" "CZCD1_59"
#> [25] "CZCD1_65" "CZCD1_66" "CZCD1_72" "CZCD1_73"
When the dimension names are missing from the input data file, we use
L
to denote landmarks and S
to denote
specimens. These are prepended to numeric landmark and specimen IDs that
follows the original ordering of the data. The coordinate dimensions are
"X"
, "Y"
, and optionally "Z"
for
3D landmark data.
It is also possible to manipulate dimension names. This is useful when different files have the same landmarks with slight differences (lower/upper case differences, etc.):
lm <- landmarks(x) # copy original names
landmarks(x) <- toupper(lm)
landmarks(x) # new landmark names
#> [1] "AMSPH" "BAS" "CPSH" "ETHMA" "ETHMP" "LAALF" "LASPH" "LFLAC" "LNSLA"
#> [10] "LNSLP" "LOCC" "LOCI" "LPALF" "LPFL" "LPMX" "LPNS" "LPSH" "LPSQ"
#> [19] "LPTO" "LPTYP" "LSQU" "LSYN" "LZYA" "LZYGO" "LZYT" "OPI" "RAALF"
#> [28] "RASPH" "RFLAC" "RMAXI" "RNSLA" "RNSLP" "ROCC" "ROCI" "RPALF" "RPFL"
#> [37] "RPMX" "RPNS" "RPSH" "RPSQ" "RPTO" "RPTYP" "RSQU" "RSYN" "RZYA"
#> [46] "RZYGO" "RZYT"
landmarks(x) <- lm # reset names to original
You can use the edma_scale()
function on the EDMA data
objects that implements various landmark scaling procedures.
Subsetting the data comes handy sometimes, i.e. we might want to select only a few of the landmarks to get a simplified picture of the relationships, or because different input files list different landmarks and we want to select the ones that are present in both data set. Or we need to filter some specimens, because we want to select only specimens of certain ages, gender, etc. Often we need to draw a random sample of the specimens (for randomization and bootstrap).
The most general way to subset the data sets is via the
[
function (i.e. x[i, j, k]
), the 3 indices
inside the brackets refer to the landmarks (i
), dimensions
(j
), and specimens (k
). Imagine the data as a
3D data cube and the square brackets allow us to slice the cube from 3
different directions. The colon notation in the next part mean what
parts should be taken using the from:to
way of defining a
sequence of integers:
x[1:10, , ] # select the 1st 10 landmarks
#> EDMA data: Crouzon P0 MUT
#> 10 landmarks, 3 dimensions, 28 specimens
x[ , 1:2, ] # select 2 of the 2 dimensions
#> EDMA data: Crouzon P0 MUT
#> 47 landmarks, 2 dimensions, 28 specimens
x[ , , 11:20] # select individuals from 11 to 20
#> EDMA data: Crouzon P0 MUT
#> 47 landmarks, 3 dimensions, 10 specimens
x[1:10, , 1:20] # combine multiple indices
#> EDMA data: Crouzon P0 MUT
#> 10 landmarks, 3 dimensions, 20 specimens
Read more about how to use square brackets to extract parts of the
data object: ?Extract
(see section ‘Matrices and arrays’).
The most important thing to keep in mind is that the empty/missing index
(e.g. x[,,]
) means ‘select everything’.
We can take subsets using:
Logical indices might be handy if, e.g., we have ancillary data about
the specimens. If we have age
of each specimens (values in
ages matching the order of specimens in the EDMA data object), we can
use it as x[,,age < 1]
.
Character based subsets are useful if we want to select homologous
landmarks for multiple objects, or we want to make sure that the
landmarks’ order is identical among multiple objects. The same
[
function can be used not only to take a subset but also
to reorder landmarks or specimens. Character indices are most suitable
for this purpose, e.g.:
x[c("amsph", "bas", "cpsh"), , ]
The data ($data
) format inside the object x
is list of the K × D
matrices, one for each individual. Sometimes it is handy to stack these
matrices and create a rectangular data (either as a matrix, or data
frame, with n × K
rows and D columns):
str(as.matrix(x))
#> num [1:1316, 1:3] 5.85 2.79 6.86 9.11 8.25 ...
#> - attr(*, "dimnames")=List of 2
#> ..$ : chr [1:1316] "CZCD1_1_amsph" "CZCD1_1_bas" "CZCD1_1_cpsh" "CZCD1_1_ethma" ...
#> ..$ : chr [1:3] "X" "Y" "Z"
Note: we are using
str
to show the structure of these objects, this is not necessary when exploring the data.str
here tells us the dimensions (1316 rows and 3 columns, the first few values, and the dimension names).
Alternatively, we can store the data as an array (K × D × n):
The as.edma_data
function turns a 3D array to an EDMA
data object. This is useful when handling 3D array objects returned by
many functions of the geomorph
package (i.e. after reading Morphologika, NTS, TPS files).
The header information gets lost in this way, but we can set it quite easily:
xnew <- as.edma_data(as.array(x))
xnew
#> EDMA data: Landmark data
#> 47 landmarks, 3 dimensions, 28 specimens
xnew$name <- "This is the same data as before"
xnew
#> EDMA data: This is the same data as before
#> 47 landmarks, 3 dimensions, 28 specimens
Note: the geomorph package needs to be installed separately using
install.packages("geomorph")
.
The plot_2d
function shows the convex hulls or
confidence ellipses for the landmarks based on all the specimens. The
plot is showing the first two axes of multidimensional scaling based on
the form matrices (FM; pairwise Euclidean
distances):
The images by default show the convex hull for each landmark enclosing all specimens. It is also possible to use 95% confidence ellipses:
Data sets can be checked using the plot
function. By
default, this function steps through all the specimens to check for any
surprises. The blue areas show the convex hull for the landmark leaving
a single specimen out, while red dots show the actual specimen. It is a
sign of problem when a dot for a specimen is located far outside of the
convex hull.
Using the which
argument we can specify which specimen
we want to display (it is not possible to select multiple specimen, use
the [
method to take a subset of the data that and plot it
with plot_2d
):
We will insert an error to show how the plot
function
can help identify problems. We change the coordinates for the 1st
landmark of the 1st specimen. The error is showing up as an outlier
(long red line):
The ordination plot shows the specimens based on multidimensional
scaling using pairwise distances between specimens. These distances are
based on the T-statistic and can be calculated as
as.dist(x)
. The calculation involves the following
steps:
The multidimensional scaling uses the square root transformed log-T distances.
The error shows up nicely in the ordination plot (colored red):
plot_ord(x, col=c(2, rep(1, dim(x)[3])))
plot_ord(xwrong, col=c("red", rep("black", dim(xwrong)[3])))
The cluster dendrogram is based on the same log-T distance matrix
described above. The default agglomeration algorithm is set to the
Ward’s minimum variance method ("ward.D2"
) which is using
the square root of the log-T distances too.
Other agglomeration methods can be set by passing a different
method
argument, e.g. single linkage (see
?hclust
for a list of options):
Both the ordination and clustering functions return the ordination/cluster results, so that plots can be further customized. Here is an example to create a plot
mds <- plot_ord(x, plot=FALSE)
str(mds)
#> List of 5
#> $ points: num [1:28, 1:2] -0.181 -0.757 -0.287 -0.116 -0.564 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:28] "CZCD1_1" "CZCD1_2" "CZCD1_7" "CZCD1_10" ...
#> .. ..$ : chr [1:2] "Axis 1" "Axis 2"
#> $ eig : NULL
#> $ x : NULL
#> $ ac : num 0.143
#> $ GOF : num [1:2] 0.253 0.253
plot(mds$points, pch=4, col="darkgreen",
main="Ordination diagram")
abline(h=0, v=0, lty=2, col="grey")
text(mds$points, labels=specimens(x), pos=1, cex=0.6, col="tan")
ggplot2 is a popular alternative for R’s base graphics. Here is how to create plots with it:
library(ggplot2)
df <- as.data.frame(mds$points)
p <- ggplot(data=df, aes(x=`Axis 1`, y=`Axis 2`, label=rownames(df))) +
geom_label() +
geom_vline(xintercept=0, lty=2) +
geom_hline(yintercept=0, lty=2)
p
See ?ape::plot.phylo
for how to customize
dendrograms:
h <- plot_clust(x, plot=FALSE)
h
#>
#> Call:
#> hclust(d = d, method = method)
#>
#> Cluster method : ward.D2
#> Number of objects: 28
plot(h)
ggdendro is a package that can plot tree-like data structures with ggplot2. Here is an example:
Different kinds of morphometrics data formats can be turned into EDMA
data format as long as it can be organized into a K × D × n array
and then using the as.edma_data
method.
Once the data is an EDMA data object, the write_xyz
function will write that into a text file with extension xyz:
f <- tempfile(fileext = ".xyz") # create a temporary file
write_xyz(x, file=f) # write data to temp file
tmp <- read_xyz(file=f) # read back the data
x # original data
#> EDMA data: Crouzon P0 MUT
#> 47 landmarks, 3 dimensions, 28 specimens
tmp # this should be the same too
#> EDMA data: Crouzon P0 MUT
#> 47 landmarks, 3 dimensions, 28 specimens
## test if the all the dimnames are the same
stopifnot(identical(dimnames(x), dimnames(tmp)))
unlink(f) # delete temp file
Note: tuse a real file name and path, the temp file is used here for demonstration purposes only.
Read on to see how the xyz data are used to fit EDMA models to it.