--- title: "EDMA data" author: "Peter Solymos" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{EDMA data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup,include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) set.seed(429) ``` # Introduction **EDMAinR** is an R package for Euclidean Distance Matrix Analysis (EDMA). EDMA is a coordinateā€free approach for comparing biological shapes using landmark data. You can install the package from GitHub as with the help of the **remotes** R package: ```{r install,eval=FALSE} if (!require(EDMAinR)) { if (!require(remotes)) install.packages("remotes") remotes::install_github("psolymos/EDMAinR") } ``` You can now load the **EDMAinR** package: ```{r load} library(EDMAinR) ``` # Landmark data ## Reading data from xyz files Use the `read_xyz` function to read 2D or 3D landmark data from `*.xyz` files. First we specify the paths to the xyz file: ```{r filexyz} file <- system.file("extdata/crouzon/Crouzon_P0_Global_MUT.xyz", package="EDMAinR") ``` > *Note*: we use the `system.file()` function to access example files > from the package. When defining your own files, > you will either set the working directory using `setwd()` or use > a path like `file <- "c:/Users//"`. The xyz landmark data has the following structure: - Header: this is the description of the data. - `XYZ`: indicates dimensions, XYZ means 3D landmark data. - `42L 3 9`: dimensions, e.g. 42 landmarks ($K$), 3 dimensions ($D$), 9 specimens ($n$). - Landmark names, separated by space. - The stacked data of landmark coordinates, e.g. 3 columns, space separated numeric values with $K \times n$ rows, the $K$ landmarks per individuals stacked $n$ times. - Blank line. - Date on of scans for each specimen ($n$ rows), this part is also used to get specimen IDs. Now we can read in these text files: ```{r readxyz} x <- read_xyz(file) ``` > *Note*: find more information [below](#turning-3d-arrays-into-edma-data-objects) > about reading in other morphometrics file formats. We stored the data from the file in the object called `x`. By typing the name `x` (or by typing `print(x)`) we get to see what is inside: ```{r printxyz} x ``` ## The EDMA data format The printout of `x` told us the most important facts about the data: the header information and the dimensions. The EDMA data object (class 'edma_data') is a list with two elements: - `$name` is the data set name (header information from the `.xyz` file), - `$data` is a list of $n$ matrices (the list can be named if specimen information is present), each matrix is of dimension $K \times D$, dimension names for the matrices describing landmark names (rows) and coordinate names (columns). ## EDMA data dimensions and names The following methods are used to retrieve useful information from EDMA data objects. Use `dim` to access dimensions (landmarks, $K$; dimensions, $D$; replicates, $n$). `landmarks`, `dimensions`, and `specimens` return the landmark, dimension, and specimen names, respectively, `dimnames` returns all three in a list: ```{r datamethods} dim(x) # dimensions dimnames(x) # dimension names landmarks(x) # landmark names dimensions(x) # the names of the coordinate dimensions specimens(x) # specimen IDs ``` When the dimension names are missing from the input data file, we use `L` to denote landmarks and `S` to denote specimens. These are prepended to numeric landmark and specimen IDs that follows the original ordering of the data. The coordinate dimensions are `"X"`, `"Y"`, and optionally `"Z"` for 3D landmark data. It is also possible to manipulate dimension names. This is useful when different files have the same landmarks with slight differences (lower/upper case differences, etc.): ```{r datamethods2} lm <- landmarks(x) # copy original names landmarks(x) <- toupper(lm) landmarks(x) # new landmark names landmarks(x) <- lm # reset names to original ``` You can use the `edma_scale()` function on the EDMA data objects that implements various landmark scaling procedures. ## Selecting subsets of EDMA data objects Subsetting the data comes handy sometimes, i.e. we might want to select only a few of the landmarks to get a simplified picture of the relationships, or because different input files list different landmarks and we want to select the ones that are present in both data set. Or we need to filter some specimens, because we want to select only specimens of certain ages, gender, etc. Often we need to draw a random sample of the specimens (for randomization and bootstrap). The most general way to subset the data sets is *via* the `[` function (i.e. `x[i, j, k]`), the 3 indices inside the brackets refer to the landmarks (`i`), dimensions (`j`), and specimens (`k`). Imagine the data as a 3D data cube and the square brackets allow us to slice the cube from 3 different directions. The colon notation in the next part mean what parts should be taken using the `from:to` way of defining a sequence of integers: ```{r datasubset} x[1:10, , ] # select the 1st 10 landmarks x[ , 1:2, ] # select 2 of the 2 dimensions x[ , , 11:20] # select individuals from 11 to 20 x[1:10, , 1:20] # combine multiple indices ``` Read more about how to use square brackets to extract parts of the data object: `?Extract` (see section 'Matrices and arrays'). The most important thing to keep in mind is that the empty/missing index (e.g. `x[,,]`) means 'select everything'. We can take subsets using: - numeric indices (like as we saw above), - logical indices, - and by character names. Logical indices might be handy if, e.g., we have ancillary data about the specimens. If we have `age` of each specimens (values in ages matching the order of specimens in the EDMA data object), we can use it as `x[,,age < 1]`. Character based subsets are useful if we want to select homologous landmarks for multiple objects, or we want to make sure that the landmarks' order is identical among multiple objects. The same `[` function can be used not only to take a subset but also to reorder landmarks or specimens. Character indices are most suitable for this purpose, e.g.: ```{lms} x[c("amsph", "bas", "cpsh"), , ] ``` ## Coercing EDMA data into other object types The data (`$data`) format inside the object `x` is list of the $K \times D$ matrices, one for each individual. Sometimes it is handy to stack these matrices and create a rectangular data (either as a matrix, or data frame, with $n \times K$ rows and $D$ columns): ```{r dataflat} str(as.matrix(x)) ``` > *Note*: we are using `str` to show the structure of these objects, > this is not necessary when exploring the data. > `str` here tells us the dimensions (`r nrow(as.matrix(x))` rows and > `r ncol(as.matrix(x))` columns, the first few values, and the dimension names). Alternatively, we can store the data as an array ($K \times D \times n$): ```{r dataarray} str(as.array(x)) ``` ## Turning 3D arrays into EDMA data objects The `as.edma_data` function turns a 3D array to an EDMA data object. This is useful when handling 3D array objects returned by many functions of the [**geomorph**](https://cran.r-project.org/package=geomorph) package (i.e. after reading Morphologika, NTS, TPS files). The header information gets lost in this way, but we can set it quite easily: ```{r arraytoedma} xnew <- as.edma_data(as.array(x)) xnew xnew$name <- "This is the same data as before" xnew ``` > *Note*: the **geomorph** package needs to be installed separately > using `install.packages("geomorph")`. # Visual inspection of the data The `plot_2d` function shows the convex hulls or confidence ellipses for the landmarks based on all the specimens. The plot is showing the first two axes of multidimensional scaling based on the form matrices ($FM$; pairwise Euclidean distances): ```{r allspec} plot_2d(x) ``` The images by default show the convex hull for each landmark enclosing all specimens. It is also possible to use 95% confidence ellipses: ```{r allspec2} plot_2d(x, hull=FALSE) ``` ## Spotting data errors Data sets can be checked using the `plot` function. By default, this function steps through all the specimens to check for any surprises. The blue areas show the convex hull for the landmark leaving a single specimen out, while red dots show the actual specimen. It is a sign of problem when a dot for a specimen is located far outside of the convex hull. Using the `which` argument we can specify which specimen we want to display (it is not possible to select multiple specimen, use the `[` method to take a subset of the data that and plot it with `plot_2d`): ```{r onespec} plot(x, which=1) ``` We will insert an error to show how the `plot` function can help identify problems. We change the coordinates for the 1st landmark of the 1st specimen. The error is showing up as an outlier (long red line): ```{r fig.width=7,fig.height=5,out.width='45%',fig.show='hold'} ## original values for specimen 1 plot(x, which=1) ## we change the 1st landmark xwrong <- x xwrong$data[[1]][1,1:2] <- c(2, 2) plot(xwrong, which=1) ``` ## Ordination and clustering The ordination plot shows the specimens based on multidimensional scaling using pairwise distances between specimens. These distances are based on the T-statistic and can be calculated as `as.dist(x)`. The calculation involves the following steps: - calculate Euclidean distances ($ED_{i}$) between all pairs of landmarks for specimen $i$ ($i = 1, \dots, n$) based on the coordinates, - calculate Euclidean distances ($ED_{j}$) between all pairs of landmarks for specimen $j$ ($j = 1, \dots, n$) based on the coordinates, - calculate the ratio $ED_{i} / ED_{j}$, - calculate the T-statistic on the log scale (so that a max/min ratio of 1, that is 0 distance, becomes log(1)=0): $d_{ij}=log(max[ED_{i} / ED_{j}] / min[ED_{i} / ED_{j}])$. The multidimensional scaling uses the square root transformed log-T distances. The error shows up nicely in the ordination plot (colored red): ```{r fig.width=7,fig.height=5,out.width='45%',fig.show='hold'} plot_ord(x, col=c(2, rep(1, dim(x)[3]))) plot_ord(xwrong, col=c("red", rep("black", dim(xwrong)[3]))) ``` The cluster dendrogram is based on the same log-T distance matrix described above. The default agglomeration algorithm is set to the Ward's minimum variance method (`"ward.D2"`) which is using the square root of the log-T distances too. ```{r fig.width=7,fig.height=7,out.width='45%',fig.show='hold'} plot_clust(x) plot_clust(xwrong) ``` Other agglomeration methods can be set by passing a different `method` argument, e.g. single linkage (see `?hclust` for a list of options): ```{r fig.width=7,fig.height=7,out.width='45%',fig.show='hold'} plot_clust(x, method = "single") plot_clust(xwrong, method = "single") ``` Both the ordination and clustering functions return the ordination/cluster results, so that plots can be further customized. Here is an example to create a plot ```{r} mds <- plot_ord(x, plot=FALSE) str(mds) plot(mds$points, pch=4, col="darkgreen", main="Ordination diagram") abline(h=0, v=0, lty=2, col="grey") text(mds$points, labels=specimens(x), pos=1, cex=0.6, col="tan") ``` [**ggplot2**](https://ggplot2-book.org/) is a popular alternative for R's base graphics. Here is how to create plots with it: ```{r} library(ggplot2) df <- as.data.frame(mds$points) p <- ggplot(data=df, aes(x=`Axis 1`, y=`Axis 2`, label=rownames(df))) + geom_label() + geom_vline(xintercept=0, lty=2) + geom_hline(yintercept=0, lty=2) p ``` See `?ape::plot.phylo` for how to customize dendrograms: ```{r} h <- plot_clust(x, plot=FALSE) h plot(h) plot(ape::as.phylo(h), type="cladogram", font=1, cex=0.6) ``` [**ggdendro**](https://CRAN.R-project.org/package=ggdendro) is a package that can plot tree-like data structures with **ggplot2**. Here is an example: ```{r} library(ggdendro) ggdendrogram(h, rotate = TRUE, size = 2) ``` ## Writing data to xyz format Different kinds of morphometrics data formats can be turned into EDMA data format as long as it can be organized into a $K \times D \times n$ array and then using the `as.edma_data` method. Once the data is an EDMA data object, the `write_xyz` function will write that into a text file with extension xyz: ```{r} f <- tempfile(fileext = ".xyz") # create a temporary file write_xyz(x, file=f) # write data to temp file tmp <- read_xyz(file=f) # read back the data x # original data tmp # this should be the same too ## test if the all the dimnames are the same stopifnot(identical(dimnames(x), dimnames(tmp))) unlink(f) # delete temp file ``` > *Note*: tuse a real file name and path, the temp file is used here > for demonstration purposes only. Read on to see how the xyz data are used to fit EDMA models to it.