EDMA data

Introduction

EDMAinR is an R package for Euclidean Distance Matrix Analysis (EDMA). EDMA is a coordinate‐free approach for comparing biological shapes using landmark data.

You can install the package from GitHub as with the help of the remotes R package:

if (!require(EDMAinR)) {
  if (!require(remotes))
      install.packages("remotes")
  remotes::install_github("psolymos/EDMAinR")
}

You can now load the EDMAinR package:

library(EDMAinR)
#> EDMAinR 0.3-0     2023-08-21

Landmark data

Reading data from xyz files

Use the read_xyz function to read 2D or 3D landmark data from *.xyz files. First we specify the paths to the xyz file:

file <- system.file("extdata/crouzon/Crouzon_P0_Global_MUT.xyz", 
    package="EDMAinR")

Note: we use the system.file() function to access example files from the package. When defining your own files, you will either set the working directory using setwd() or use a path like file <- "c:/Users/<user>/<etc>".

The xyz landmark data has the following structure:

  • Header: this is the description of the data.
  • XYZ: indicates dimensions, XYZ means 3D landmark data.
  • 42L 3 9: dimensions, e.g. 42 landmarks (K), 3 dimensions (D), 9 specimens (n).
  • Landmark names, separated by space.
  • The stacked data of landmark coordinates, e.g. 3 columns, space separated numeric values with K × n rows, the K landmarks per individuals stacked n times.
  • Blank line.
  • Date on of scans for each specimen (n rows), this part is also used to get specimen IDs.

Now we can read in these text files:

x <- read_xyz(file)

Note: find more information below about reading in other morphometrics file formats.

We stored the data from the file in the object called x. By typing the name x (or by typing print(x)) we get to see what is inside:

x
#> EDMA data: Crouzon P0 MUT
#> 47 landmarks, 3 dimensions, 28 specimens

The EDMA data format

The printout of x told us the most important facts about the data: the header information and the dimensions.

The EDMA data object (class ‘edma_data’) is a list with two elements:

  • $name is the data set name (header information from the .xyz file),
  • $data is a list of n matrices (the list can be named if specimen information is present), each matrix is of dimension K × D, dimension names for the matrices describing landmark names (rows) and coordinate names (columns).

EDMA data dimensions and names

The following methods are used to retrieve useful information from EDMA data objects.

Use dim to access dimensions (landmarks, K; dimensions, D; replicates, n). landmarks, dimensions, and specimens return the landmark, dimension, and specimen names, respectively, dimnames returns all three in a list:

dim(x) # dimensions
#> [1] 47  3 28
dimnames(x) # dimension names
#> [[1]]
#>  [1] "amsph" "bas"   "cpsh"  "ethma" "ethmp" "laalf" "lasph" "lflac" "lnsla"
#> [10] "lnslp" "locc"  "loci"  "lpalf" "lpfl"  "lpmx"  "lpns"  "lpsh"  "lpsq" 
#> [19] "lpto"  "lptyp" "lsqu"  "lsyn"  "lzya"  "lzygo" "lzyt"  "opi"   "raalf"
#> [28] "rasph" "rflac" "rmaxi" "rnsla" "rnslp" "rocc"  "roci"  "rpalf" "rpfl" 
#> [37] "rpmx"  "rpns"  "rpsh"  "rpsq"  "rpto"  "rptyp" "rsqu"  "rsyn"  "rzya" 
#> [46] "rzygo" "rzyt" 
#> 
#> [[2]]
#> [1] "X" "Y" "Z"
#> 
#> [[3]]
#>  [1] "CZCD1_1"  "CZCD1_2"  "CZCD1_7"  "CZCD1_10" "CZCD1_11" "CZCD1_15"
#>  [7] "CZCD1_16" "CZCD1_18" "CZCD1_20" "CZCD1_21" "CZCD1_23" "CZCD1_24"
#> [13] "CZCD1_25" "CZCD1_28" "CZCD1_30" "CZCD1_32" "CZCD1_36" "CZCD1_37"
#> [19] "CZCD1_39" "CZCD1_40" "CZCD1_42" "CZCD1_53" "CZCD1_56" "CZCD1_59"
#> [25] "CZCD1_65" "CZCD1_66" "CZCD1_72" "CZCD1_73"
landmarks(x) # landmark names
#>  [1] "amsph" "bas"   "cpsh"  "ethma" "ethmp" "laalf" "lasph" "lflac" "lnsla"
#> [10] "lnslp" "locc"  "loci"  "lpalf" "lpfl"  "lpmx"  "lpns"  "lpsh"  "lpsq" 
#> [19] "lpto"  "lptyp" "lsqu"  "lsyn"  "lzya"  "lzygo" "lzyt"  "opi"   "raalf"
#> [28] "rasph" "rflac" "rmaxi" "rnsla" "rnslp" "rocc"  "roci"  "rpalf" "rpfl" 
#> [37] "rpmx"  "rpns"  "rpsh"  "rpsq"  "rpto"  "rptyp" "rsqu"  "rsyn"  "rzya" 
#> [46] "rzygo" "rzyt"
dimensions(x) # the names of the coordinate dimensions
#> [1] "X" "Y" "Z"
specimens(x) # specimen IDs
#>  [1] "CZCD1_1"  "CZCD1_2"  "CZCD1_7"  "CZCD1_10" "CZCD1_11" "CZCD1_15"
#>  [7] "CZCD1_16" "CZCD1_18" "CZCD1_20" "CZCD1_21" "CZCD1_23" "CZCD1_24"
#> [13] "CZCD1_25" "CZCD1_28" "CZCD1_30" "CZCD1_32" "CZCD1_36" "CZCD1_37"
#> [19] "CZCD1_39" "CZCD1_40" "CZCD1_42" "CZCD1_53" "CZCD1_56" "CZCD1_59"
#> [25] "CZCD1_65" "CZCD1_66" "CZCD1_72" "CZCD1_73"

When the dimension names are missing from the input data file, we use L to denote landmarks and S to denote specimens. These are prepended to numeric landmark and specimen IDs that follows the original ordering of the data. The coordinate dimensions are "X", "Y", and optionally "Z" for 3D landmark data.

It is also possible to manipulate dimension names. This is useful when different files have the same landmarks with slight differences (lower/upper case differences, etc.):

lm <- landmarks(x) # copy original names
landmarks(x) <- toupper(lm)
landmarks(x) # new landmark names
#>  [1] "AMSPH" "BAS"   "CPSH"  "ETHMA" "ETHMP" "LAALF" "LASPH" "LFLAC" "LNSLA"
#> [10] "LNSLP" "LOCC"  "LOCI"  "LPALF" "LPFL"  "LPMX"  "LPNS"  "LPSH"  "LPSQ" 
#> [19] "LPTO"  "LPTYP" "LSQU"  "LSYN"  "LZYA"  "LZYGO" "LZYT"  "OPI"   "RAALF"
#> [28] "RASPH" "RFLAC" "RMAXI" "RNSLA" "RNSLP" "ROCC"  "ROCI"  "RPALF" "RPFL" 
#> [37] "RPMX"  "RPNS"  "RPSH"  "RPSQ"  "RPTO"  "RPTYP" "RSQU"  "RSYN"  "RZYA" 
#> [46] "RZYGO" "RZYT"
landmarks(x) <- lm # reset names to original

You can use the edma_scale() function on the EDMA data objects that implements various landmark scaling procedures.

Selecting subsets of EDMA data objects

Subsetting the data comes handy sometimes, i.e. we might want to select only a few of the landmarks to get a simplified picture of the relationships, or because different input files list different landmarks and we want to select the ones that are present in both data set. Or we need to filter some specimens, because we want to select only specimens of certain ages, gender, etc. Often we need to draw a random sample of the specimens (for randomization and bootstrap).

The most general way to subset the data sets is via the [ function (i.e. x[i, j, k]), the 3 indices inside the brackets refer to the landmarks (i), dimensions (j), and specimens (k). Imagine the data as a 3D data cube and the square brackets allow us to slice the cube from 3 different directions. The colon notation in the next part mean what parts should be taken using the from:to way of defining a sequence of integers:

x[1:10, , ]  # select the 1st 10 landmarks
#> EDMA data: Crouzon P0 MUT
#> 10 landmarks, 3 dimensions, 28 specimens
x[ , 1:2, ]  # select 2 of the 2 dimensions
#> EDMA data: Crouzon P0 MUT
#> 47 landmarks, 2 dimensions, 28 specimens
x[ , , 11:20] # select individuals from 11 to 20
#> EDMA data: Crouzon P0 MUT
#> 47 landmarks, 3 dimensions, 10 specimens
x[1:10, , 1:20] # combine multiple indices
#> EDMA data: Crouzon P0 MUT
#> 10 landmarks, 3 dimensions, 20 specimens

Read more about how to use square brackets to extract parts of the data object: ?Extract (see section ‘Matrices and arrays’). The most important thing to keep in mind is that the empty/missing index (e.g. x[,,]) means ‘select everything’.

We can take subsets using:

  • numeric indices (like as we saw above),
  • logical indices,
  • and by character names.

Logical indices might be handy if, e.g., we have ancillary data about the specimens. If we have age of each specimens (values in ages matching the order of specimens in the EDMA data object), we can use it as x[,,age < 1].

Character based subsets are useful if we want to select homologous landmarks for multiple objects, or we want to make sure that the landmarks’ order is identical among multiple objects. The same [ function can be used not only to take a subset but also to reorder landmarks or specimens. Character indices are most suitable for this purpose, e.g.:

x[c("amsph", "bas", "cpsh"), , ]

Coercing EDMA data into other object types

The data ($data) format inside the object x is list of the K × D matrices, one for each individual. Sometimes it is handy to stack these matrices and create a rectangular data (either as a matrix, or data frame, with n × K rows and D columns):

str(as.matrix(x))
#>  num [1:1316, 1:3] 5.85 2.79 6.86 9.11 8.25 ...
#>  - attr(*, "dimnames")=List of 2
#>   ..$ : chr [1:1316] "CZCD1_1_amsph" "CZCD1_1_bas" "CZCD1_1_cpsh" "CZCD1_1_ethma" ...
#>   ..$ : chr [1:3] "X" "Y" "Z"

Note: we are using str to show the structure of these objects, this is not necessary when exploring the data. str here tells us the dimensions (1316 rows and 3 columns, the first few values, and the dimension names).

Alternatively, we can store the data as an array (K × D × n):

str(as.array(x))
#>  num [1:47, 1:3, 1:28] 5.85 2.79 6.86 9.11 8.25 ...
#>  - attr(*, "dimnames")=List of 3
#>   ..$ : chr [1:47] "amsph" "bas" "cpsh" "ethma" ...
#>   ..$ : chr [1:3] "X" "Y" "Z"
#>   ..$ : chr [1:28] "CZCD1_1" "CZCD1_2" "CZCD1_7" "CZCD1_10" ...

Turning 3D arrays into EDMA data objects

The as.edma_data function turns a 3D array to an EDMA data object. This is useful when handling 3D array objects returned by many functions of the geomorph package (i.e. after reading Morphologika, NTS, TPS files).

The header information gets lost in this way, but we can set it quite easily:

xnew <- as.edma_data(as.array(x))
xnew
#> EDMA data: Landmark data
#> 47 landmarks, 3 dimensions, 28 specimens
xnew$name <- "This is the same data as before"
xnew
#> EDMA data: This is the same data as before
#> 47 landmarks, 3 dimensions, 28 specimens

Note: the geomorph package needs to be installed separately using install.packages("geomorph").

Visual inspection of the data

The plot_2d function shows the convex hulls or confidence ellipses for the landmarks based on all the specimens. The plot is showing the first two axes of multidimensional scaling based on the form matrices (FM; pairwise Euclidean distances):

plot_2d(x)

The images by default show the convex hull for each landmark enclosing all specimens. It is also possible to use 95% confidence ellipses:

plot_2d(x, hull=FALSE)

Spotting data errors

Data sets can be checked using the plot function. By default, this function steps through all the specimens to check for any surprises. The blue areas show the convex hull for the landmark leaving a single specimen out, while red dots show the actual specimen. It is a sign of problem when a dot for a specimen is located far outside of the convex hull.

Using the which argument we can specify which specimen we want to display (it is not possible to select multiple specimen, use the [ method to take a subset of the data that and plot it with plot_2d):

plot(x, which=1)

We will insert an error to show how the plot function can help identify problems. We change the coordinates for the 1st landmark of the 1st specimen. The error is showing up as an outlier (long red line):

## original values for specimen 1
plot(x, which=1)
## we change the 1st landmark
xwrong <- x
xwrong$data[[1]][1,1:2] <- c(2, 2)
plot(xwrong, which=1)

Ordination and clustering

The ordination plot shows the specimens based on multidimensional scaling using pairwise distances between specimens. These distances are based on the T-statistic and can be calculated as as.dist(x). The calculation involves the following steps:

  • calculate Euclidean distances (EDi) between all pairs of landmarks for specimen i (i = 1, …, n) based on the coordinates,
  • calculate Euclidean distances (EDj) between all pairs of landmarks for specimen j (j = 1, …, n) based on the coordinates,
  • calculate the ratio EDi/EDj,
  • calculate the T-statistic on the log scale (so that a max/min ratio of 1, that is 0 distance, becomes log(1)=0): dij = log(max[EDi/EDj]/min[EDi/EDj]).

The multidimensional scaling uses the square root transformed log-T distances.

The error shows up nicely in the ordination plot (colored red):

plot_ord(x, col=c(2, rep(1, dim(x)[3])))
plot_ord(xwrong, col=c("red", rep("black", dim(xwrong)[3])))

The cluster dendrogram is based on the same log-T distance matrix described above. The default agglomeration algorithm is set to the Ward’s minimum variance method ("ward.D2") which is using the square root of the log-T distances too.

plot_clust(x)
plot_clust(xwrong)

Other agglomeration methods can be set by passing a different method argument, e.g. single linkage (see ?hclust for a list of options):

plot_clust(x, method = "single")
plot_clust(xwrong, method = "single")

Both the ordination and clustering functions return the ordination/cluster results, so that plots can be further customized. Here is an example to create a plot

mds <- plot_ord(x, plot=FALSE)
str(mds)
#> List of 5
#>  $ points: num [1:28, 1:2] -0.181 -0.757 -0.287 -0.116 -0.564 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:28] "CZCD1_1" "CZCD1_2" "CZCD1_7" "CZCD1_10" ...
#>   .. ..$ : chr [1:2] "Axis 1" "Axis 2"
#>  $ eig   : NULL
#>  $ x     : NULL
#>  $ ac    : num 0.143
#>  $ GOF   : num [1:2] 0.253 0.253
plot(mds$points, pch=4, col="darkgreen",
  main="Ordination diagram")
abline(h=0, v=0, lty=2, col="grey")
text(mds$points, labels=specimens(x), pos=1, cex=0.6, col="tan")

ggplot2 is a popular alternative for R’s base graphics. Here is how to create plots with it:

library(ggplot2)
df <- as.data.frame(mds$points)
p <- ggplot(data=df, aes(x=`Axis 1`, y=`Axis 2`, label=rownames(df))) +
  geom_label() +
  geom_vline(xintercept=0, lty=2) +
  geom_hline(yintercept=0, lty=2)
p

See ?ape::plot.phylo for how to customize dendrograms:

h <- plot_clust(x, plot=FALSE)
h
#> 
#> Call:
#> hclust(d = d, method = method)
#> 
#> Cluster method   : ward.D2 
#> Number of objects: 28
plot(h)

plot(ape::as.phylo(h), type="cladogram", font=1, cex=0.6)

ggdendro is a package that can plot tree-like data structures with ggplot2. Here is an example:

library(ggdendro)
ggdendrogram(h, rotate = TRUE, size = 2)

Writing data to xyz format

Different kinds of morphometrics data formats can be turned into EDMA data format as long as it can be organized into a K × D × n array and then using the as.edma_data method.

Once the data is an EDMA data object, the write_xyz function will write that into a text file with extension xyz:

f <- tempfile(fileext = ".xyz") # create a temporary file
write_xyz(x, file=f)            # write data to temp file
tmp <- read_xyz(file=f)         # read back the data
x                               # original data
#> EDMA data: Crouzon P0 MUT
#> 47 landmarks, 3 dimensions, 28 specimens
tmp                             # this should be the same too
#> EDMA data: Crouzon P0 MUT
#> 47 landmarks, 3 dimensions, 28 specimens
## test if the all the dimnames are the same
stopifnot(identical(dimnames(x), dimnames(tmp)))
unlink(f)                       # delete temp file

Note: tuse a real file name and path, the temp file is used here for demonstration purposes only.

Read on to see how the xyz data are used to fit EDMA models to it.