Compositional Data Analysis on Nova Scotia vegetation, and the driving influences for Barium compositions

Results

Tidying the data

Before I begin my analysis, I need to tidy the data so that it is usable and informative.

Firstly, I loaded in packages and data. This data is from the Canadian Geochemical survey website(https://geochem.nrcan.gc.ca/cdogs/content/svy_dat/svy210254_e.dat.htm). Download the version under the red arrow,where results below the detection limit is denoted as -(detection limit). This spreadsheet (c54) details the types of vegetation present in central Nova Scotia, and its respective chemical composition.

NOTE: If GeochemUtils cannot be installed, check if the below packages are installed:“compositions”,“qdapRegex”,“ggrepel”

As the species and organ is currently combined into the same column, I would like to make two new columns, each specifying only plant species and only plant organ.

c54[, Sample_Type_Name_en := as.factor(Sample_Type_Name_en)]
c54[, plant_species := as.factor(str_remove(Sample_Type_Name_en, "[Tt]wigs"))]
c54[, plant_organ := as.factor(str_extract(Sample_Type_Name_en, "[Tt]wig"))]

The columns of elements all end in ‘INA’. These should be renamed without INA. Also, I would like to remove empty “Wt” column

setnames(c54, names(c54)[str_detect_elements(c54)], str_extract_elements(c54, nomatch = NULL))

Here I would like to unify all the units (ppm, ppb and %) to ppm. The corresponding units for each element are available in the link above.

NOTE: Sodium units are in ppm rather than %.

#Make the unit an object listing the associated elements
ppb = c("Au", "Ir")
percent = c("Ca", "Fe", "K")

#find out which column name correspond to which index 
cbind(names(c54)) 

#Include the these units (ppb,%) in their column names
setnames(c54, colnames(c54)[c(13,25)], paste0(ppb, "_ppb"))
setnames(c54, colnames(c54)[c(16, 22, 26)], paste0(percent, "_%"))

#Include ppm unit for all other column names
setnames(c54, select.VarsElements(c54), paste0(select.VarsElements(c54), "_ppm"))

#Remove the element names, leaving only the unit associated to the index
my_conc_info = names(c54)[str_detect_elements(names(c54))] %>% str_remove_elements() %>% str_remove("_")

#There is an extra "Wt" that is wrongfully selected as an element. We need to remove this from the object and from the dataset.
my_conc_info=my_conc_info[-35]
c54=c54[,-"Wt"]

#Specify which units correspond to which element
names(my_conc_info) = names(c54)[str_detect_elements(names(c54))]

#Unify all units to ppm
unify_conc(c54, vars = names(my_conc_info), target_conc = c("ppm"), conc_info=my_conc_info)

Now that I have unified all the values in ppm, I would like to discard the unit in the column name, and all NA values. c54_cl contains a cleaned data file.

names(c54) = stringr::str_remove(names(c54), "_pp[bm]$")
names(c54) = stringr::str_remove(names(c54), "_%$")
c54_cl = remove_NAs(c54, vars = names(c54)[str_detect_elements(c54)]) %>% data.frame()
write.csv(c54_cl, "/Users/Jasthecoolbean/Desktop/PCA_barium/c54_cl.csv", row.names=FALSE)

Geology

Before we delve into analysis, it is nice to visualize what kind of geology we have beneath are sample points. Before running the code below, please carry out the following steps:

Download the shapefile of Nova Scotia: https://novascotia.ca/natr/meb/download/dp043md.asp
Download the shapefile of the vegetation sample area: https://novascotia.ca/natr/meb/download/dp129md.asp
Load both shapefiles (p00001gb.shp and c129gynl.shp) into qgis
Vector > Geoprocessing tools > Clip > Input layer: Nova Scota, Overlay layer: Sample area
Make the new layer permanent, and export as a shapefile.

NOTE: On QGIS, you must click the three dots next to the file name to specify where you will store the layer/exporting file. It will show up as an error if not.

The new shapefile now contains the geological information for our vegetation sampling area. We now want to draw our vegetation sampling points on top of this geological map. See the rmd file for more guidance on how to clean and plot the data.

The plot shows that it is mostly sandstone, turbites and slate (blue), with slate and siltstone running across east-west (purple).

PCA biplot

I will try both method 1 and method 2 for generating biplots.

Method 1

Plot 1A

The first method uses ggplots to make a biplots. Plot 1A will show all elements (except for REE) on the biplot. The spread is mostly affected by the negative BDL values (Plot 1B will improve on this).

## [1] explained variance:
## [1] NA

Plot 1B

All elements have been ‘imputed BDL’ here. This means, the negative number showing BDL are transformed, so they are now very small values that will vary randomly (This is done so we do not falsely assign meaning to these uniform negative numbers). Eu is excluded.

NOTE: when using impute_BDL, download a datafile, that IF there is a number below detection limit, that it marks as -(detection limit).

## Selected columns are: As,Au,Ba,Br,Ca,Ce,Co,Cr,Cs,Eu,Fe,Hf,Hg,Ir,La,Lu,Mo,Na,Nd,Ni,Rb,Sb,Sc,Se,Sm,Sr,Ta,Tb,Th,U,W,Yb,Zn

## [1] explained variance:
## [1] 0.5294349

Plot 1C

Some elemental concentrations are so low, that it is lower than the limit of detection. This is tested out through hist + ecdf + QQ-plot loop. As I would only like to look at the elements without a detection limit problem, I will select the following elements: “Ba”, “Br”, “Ca”, “Cr”, “Fe”, “La”, “Na”, “Rb”, “Sc”, “Zn”.

## [1] explained variance:
## [1] 0.7326497

Method 2

The second method is from http://file.statistik.tuwien.ac.at/StatDA/R-scripts/page195.html. It produces the same biplot as Plot 1C.

We’ve seen now that Ba and Na are strongly influenced by PC2.

Mapping PCA

Mapping the PCA scores will give us an idea how elemental compositions may vary with the local geography. We see that generally, there is a higher PC2 score inland, and a lower PC2 score towards the coast.

Mapping Barium concentrations

Whilst it makes sense for Na uptake into plants to be higher near the coast, it is odd that there is pattern in Barium uptake by plants. This may be due to a number of conditions, such as redox processes, geology, pH and more. Below we will plot Barium’s concentration on a map.

PERSONAL NOTE: This is just a visual representation that maybe there is a correlation of barium with the local geography. We cannot simply infer there is more barium uptake inland from plotting barium concentrations alone. This is because all elements are dependent on each other. Say, if each tree can only take up x amount of nutrients every day, uptaking more sodium due to seaspray effects will lead to less barium being taken up comparatively. Hence, we need to look at ratios between elements to reduce this spurious effect, as ratios do not change as much as concentration values.

So in the next tab, lets explore using ratios of Barium, to see whether this pattern is significantly influenced by a) Distance b) Elevation or c) Slope.

Covariables of Barium change

Distance from coast

Firstly, we need to find the distance of each sampling point is from the coastline. Then we can plot a linear model showing barium ratios against distance.

You will need to import both the sampling area shapefile (sema_litho), and sampling points (c54_impute) into qgis under the same CRS.

NOTE: If under properties it says that it is in the same crs, but it does not appear on the same space, try the following:For csv/excel files, Vector> Data management tool> Reproject layer> Change to desired csv. For DEM and other rasters, Raster> Projections> Warp(reproject)> Change to desired csv. If this still doesn’t work, consider using st_transform in R before importing the file into qgis.

When both layers are in QGIS, follow the steps below to get the distance of each point to the coast:

New Shapefile Layer (third icon on the left) > Geometry Type: Polygon > Toggle Editting > Add Line feature
Draw a line as accurately as you can around the coastline
Processing toolbox > Vector Analysis > Distance to nearest hub (points) > distance.
Put in Destination layer as the line, and Source point layer as your data points
Exported the data table as an excel file and read that in R (I have read it as cd)

Here is a linear model for Barium concentration and distance:

As mentioned before, we cannot just look at barium concentrations. Rather, we need to look at ratios. Let’s take into 2 considerations: One, we should remove Na from the data set to reduce the influence of seaspray on the composition. Two, We need Ba to be in ratio with an element that is present in rock, but not present in sea spray. Elements that would be good include Zn, Ca, Fe etc…we will use all the elements in c54_sel.

If we log this Barium ratio, say log(Zn/Ba), log(Ca/Ba) etc…, then this can break down into log(Zn) - log(Ba), log(Ca) - log(Ba) etc… This can help help us reduce dimensionality, and calculate the distances between elements in this isometric space, which can then tell us: do certain elements cluster together? If so, does distance play a role driving this cluster?

Coefficients and p-values for each elemental ratio:

## Response Br :
## 
## Call:
## lm(formula = Br ~ distance, data = bd)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92440 -0.46975 -0.04321  0.45465  2.23845 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.769e+00  7.222e-02  -52.19  < 2e-16 ***
## distance    -5.507e-06  1.587e-06   -3.47 0.000557 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7346 on 626 degrees of freedom
## Multiple R-squared:  0.01887,    Adjusted R-squared:  0.0173 
## F-statistic: 12.04 on 1 and 626 DF,  p-value: 0.0005567
## 
## 
## Response Ca :
## 
## Call:
## lm(formula = Ca ~ distance, data = bd)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.55318 -0.35586 -0.01074  0.35677  1.67218 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.000e+00  5.462e-02  91.552   <2e-16 ***
## distance    -2.870e-06  1.200e-06  -2.391   0.0171 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5555 on 626 degrees of freedom
## Multiple R-squared:  0.00905,    Adjusted R-squared:  0.007467 
## F-statistic: 5.717 on 1 and 626 DF,  p-value: 0.01709
## 
## 
## Response Cr :
## 
## Call:
## lm(formula = Cr ~ distance, data = bd)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.90414 -0.39778  0.05954  0.44433  1.72609 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.786e+00  6.252e-02 -60.552   <2e-16 ***
## distance    -1.845e-06  1.374e-06  -1.343     0.18    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6359 on 626 degrees of freedom
## Multiple R-squared:  0.002872,   Adjusted R-squared:  0.001279 
## F-statistic: 1.803 on 1 and 626 DF,  p-value: 0.1798
## 
## 
## Response Fe :
## 
## Call:
## lm(formula = Fe ~ distance, data = bd)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.22133 -0.61300 -0.04979  0.57283  2.00632 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.291e+00  8.157e-02  15.830   <2e-16 ***
## distance    -7.447e-07  1.792e-06  -0.415    0.678    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8296 on 626 degrees of freedom
## Multiple R-squared:  0.0002756,  Adjusted R-squared:  -0.001321 
## F-statistic: 0.1726 on 1 and 626 DF,  p-value: 0.678
## 
## 
## Response La :
## 
## Call:
## lm(formula = La ~ distance, data = bd)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.40578 -0.70167 -0.05244  0.64043  2.73180 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.419e+00  9.138e-02 -59.305   <2e-16 ***
## distance     3.452e-07  2.008e-06   0.172    0.864    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9294 on 626 degrees of freedom
## Multiple R-squared:  4.721e-05,  Adjusted R-squared:  -0.00155 
## F-statistic: 0.02955 on 1 and 626 DF,  p-value: 0.8636
## 
## 
## Response Rb :
## 
## Call:
## lm(formula = Rb ~ distance, data = bd)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.67335 -0.45883  0.05227  0.51330  2.20101 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.532e+00  7.676e-02 -19.959   <2e-16 ***
## distance    -1.994e-07  1.687e-06  -0.118    0.906    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7808 on 626 degrees of freedom
## Multiple R-squared:  2.231e-05,  Adjusted R-squared:  -0.001575 
## F-statistic: 0.01397 on 1 and 626 DF,  p-value: 0.906
## 
## 
## Response Sc :
## 
## Call:
## lm(formula = Sc ~ distance, data = bd)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.30518 -0.67680 -0.04742  0.63813  2.23299 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.952e+00  8.944e-02 -77.735   <2e-16 ***
## distance     5.685e-07  1.965e-06   0.289    0.772    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9097 on 626 degrees of freedom
## Multiple R-squared:  0.0001336,  Adjusted R-squared:  -0.001464 
## F-statistic: 0.08366 on 1 and 626 DF,  p-value: 0.7725
## 
## 
## Response Zn :
## 
## Call:
## lm(formula = Zn ~ distance, data = bd)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.59789 -0.37687  0.01039  0.40000  1.63598 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.462e-01  5.741e-02  11.256   <2e-16 ***
## distance    -2.132e-06  1.262e-06  -1.689   0.0916 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.584 on 626 degrees of freedom
## Multiple R-squared:  0.004539,   Adjusted R-squared:  0.002949 
## F-statistic: 2.854 on 1 and 626 DF,  p-value: 0.09163

Anova:

## Analysis of Variance Table
## 
##              Df  Pillai approx F num Df den Df    Pr(>F)    
## (Intercept)   1 0.99984   480987      8    619 < 2.2e-16 ***
## distance      1 0.06323        5      8    619 2.447e-06 ***
## Residuals   626                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Only log(Br/Ba) changes significantly with distance (p<0.001). It decreases with increasing distance.

Elevation

I have included some code hear needed to change the crs of the shapefiles. These are the steps of what I have done:

Download three map quadrants relevant to nova scotia from Canadian data (use the index file to find out which quadrants are useful) https://open.canada.ca/data/en/dataset/7f245e4d-76c2-4caa-951a-45d1d2051333, https://ftp.maps.canada.ca/pub/nrcan_rncan/elevation/cdem_mnec/doc/CDEM_product_specs.pdf
Raster>merge in qgis
import sample area (in the correct crs, crs = 4617)
Raster>Clip Raster to mask (so only the topography of the sample area is shown)
Layer properties of newly clipped+merged file>symbols>change it to a continuous color, click classify, and specify the number of classes
Print layout>Add map>Add legend

Keep in mind, that these elevation values are relative. The real values will be extracted in the next step when plotting linear models.

To extract the real values, install a plugin in qgis called point sampling tool to obtain attributes from the raster (DEM) for the sample points specified. You will need to add in the sampling points layer (c54_cl) and map DEM. I have saved the extracted elevation points as an excel sheet, which I named ele.

NOTE: It will make life much simpler if you select all columns of elements, coordinates, and map data in the input. This will minimize data cleaning later.

Coefficients and p-values for each elemental ratio:

## Response Br :
## 
## Call:
## lm(formula = Br ~ elevation, data = ele)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9178 -0.4569 -0.0721  0.5048  2.2176 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.7661928  0.0542478 -69.426  < 2e-16 ***
## elevation   -0.0026768  0.0005289  -5.061 5.48e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7269 on 626 degrees of freedom
## Multiple R-squared:  0.03931,    Adjusted R-squared:  0.03778 
## F-statistic: 25.62 on 1 and 626 DF,  p-value: 5.484e-07
## 
## 
## Response Ca :
## 
## Call:
## lm(formula = Ca ~ elevation, data = ele)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.58088 -0.35922 -0.02081  0.35262  1.56935 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.0134745  0.0411686 121.779  < 2e-16 ***
## elevation   -0.0015313  0.0004014  -3.815  0.00015 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5517 on 626 degrees of freedom
## Multiple R-squared:  0.02272,    Adjusted R-squared:  0.02116 
## F-statistic: 14.56 on 1 and 626 DF,  p-value: 0.0001496
## 
## 
## Response Cr :
## 
## Call:
## lm(formula = Cr ~ elevation, data = ele)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.91593 -0.37339  0.06034  0.44531  1.85493 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.7052407  0.0469390 -78.937  < 2e-16 ***
## elevation   -0.0018140  0.0004576  -3.964 8.22e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.629 on 626 degrees of freedom
## Multiple R-squared:  0.02449,    Adjusted R-squared:  0.02293 
## F-statistic: 15.71 on 1 and 626 DF,  p-value: 8.222e-05
## 
## 
## Response Fe :
## 
## Call:
## lm(formula = Fe ~ elevation, data = ele)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.23341 -0.60993 -0.05017  0.58670  1.98972 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.3416127  0.0618030  21.708   <2e-16 ***
## elevation   -0.0009391  0.0006025  -1.559     0.12    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8282 on 626 degrees of freedom
## Multiple R-squared:  0.003865,   Adjusted R-squared:  0.002274 
## F-statistic: 2.429 on 1 and 626 DF,  p-value: 0.1196
## 
## 
## Response La :
## 
## Call:
## lm(formula = La ~ elevation, data = ele)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.41824 -0.71854 -0.05175  0.65438  2.74099 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.3814162  0.0693511 -77.597   <2e-16 ***
## elevation   -0.0002681  0.0006761  -0.397    0.692    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9293 on 626 degrees of freedom
## Multiple R-squared:  0.0002511,  Adjusted R-squared:  -0.001346 
## F-statistic: 0.1572 on 1 and 626 DF,  p-value: 0.6919
## 
## 
## Response Rb :
## 
## Call:
## lm(formula = Rb ~ elevation, data = ele)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.67383 -0.46136  0.05309  0.51030  2.19640 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.5278840  0.0582652 -26.223   <2e-16 ***
## elevation   -0.0001443  0.0005681  -0.254      0.8    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7808 on 626 degrees of freedom
## Multiple R-squared:  0.0001031,  Adjusted R-squared:  -0.001494 
## F-statistic: 0.06454 on 1 and 626 DF,  p-value: 0.7995
## 
## 
## Response Sc :
## 
## Call:
## lm(formula = Sc ~ elevation, data = ele)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.32433 -0.66939 -0.04764  0.64603  2.22231 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) -6.8951099  0.0678728 -101.589   <2e-16 ***
## elevation   -0.0003883  0.0006617   -0.587    0.558    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9095 on 626 degrees of freedom
## Multiple R-squared:  0.0005498,  Adjusted R-squared:  -0.001047 
## F-statistic: 0.3444 on 1 and 626 DF,  p-value: 0.5575
## 
## 
## Response Zn :
## 
## Call:
## lm(formula = Zn ~ elevation, data = ele)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.61127 -0.37955  0.01332  0.40180  1.54661 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.6754245  0.0433227  15.591  < 2e-16 ***
## elevation   -0.0013595  0.0004224  -3.219  0.00135 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5805 on 626 degrees of freedom
## Multiple R-squared:  0.01628,    Adjusted R-squared:  0.01471 
## F-statistic: 10.36 on 1 and 626 DF,  p-value: 0.001354

Anova:

## Analysis of Variance Table
## 
##              Df  Pillai approx F num Df den Df    Pr(>F)    
## (Intercept)   1 0.99984   489889      8    619 < 2.2e-16 ***
## elevation     1 0.13990       13      8    619 < 2.2e-16 ***
## Residuals   626                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Br, Ca and Cr ratio decreases significantly with elevation (p<0.001)

Slope

Import the merged map and data points.

NOTE: CRS must be in UTM when calculating slope. This is because our elevation data is in meters, so crs has to use the same metric system.

Calculate the slope of the DEM using Raster terrain analysis > Slope. Do this for the merged map layer. Keep the terrain as Z-factor as 1.
Extract slopes at the sampling points using Point sampling tool. Save it as a csv file and load it into R.

Coefficients and p-values for each elemental ratio:

## Response Br :
## 
## Call:
## lm(formula = Br ~ slope, data = slo)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.81142 -0.48537 -0.04539  0.47854  2.27513 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.87662    0.04780 -81.107  < 2e-16 ***
## slope       -0.03398    0.01054  -3.223  0.00133 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7356 on 626 degrees of freedom
## Multiple R-squared:  0.01632,    Adjusted R-squared:  0.01475 
## F-statistic: 10.39 on 1 and 626 DF,  p-value: 0.001334
## 
## 
## Response Ca :
## 
## Call:
## lm(formula = Ca ~ slope, data = slo)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.60763 -0.36448 -0.01206  0.37530  1.63641 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.994467   0.035800 139.509  < 2e-16 ***
## slope       -0.031780   0.007896  -4.025  6.4e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.551 on 626 degrees of freedom
## Multiple R-squared:  0.02522,    Adjusted R-squared:  0.02367 
## F-statistic:  16.2 on 1 and 626 DF,  p-value: 6.401e-05
## 
## 
## Response Cr :
## 
## Call:
## lm(formula = Cr ~ slope, data = slo)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.89477 -0.39709  0.06235  0.43116  1.74161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.814493   0.041309 -92.340   <2e-16 ***
## slope       -0.013408   0.009111  -1.472    0.142    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6357 on 626 degrees of freedom
## Multiple R-squared:  0.003447,   Adjusted R-squared:  0.001856 
## F-statistic: 2.166 on 1 and 626 DF,  p-value: 0.1416
## 
## 
## Response Fe :
## 
## Call:
## lm(formula = Fe ~ slope, data = slo)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.12483 -0.60997 -0.03101  0.57914  2.05559 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.35430    0.05371   25.22   <2e-16 ***
## slope       -0.02629    0.01185   -2.22   0.0268 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8265 on 626 degrees of freedom
## Multiple R-squared:  0.007809,   Adjusted R-squared:  0.006224 
## F-statistic: 4.927 on 1 and 626 DF,  p-value: 0.0268
## 
## 
## Response La :
## 
## Call:
## lm(formula = La ~ slope, data = slo)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.29711 -0.69008 -0.05233  0.66777  2.66780 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.30658    0.06019 -88.166   <2e-16 ***
## slope       -0.02741    0.01327  -2.065   0.0394 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9263 on 626 degrees of freedom
## Multiple R-squared:  0.006763,   Adjusted R-squared:  0.005176 
## F-statistic: 4.262 on 1 and 626 DF,  p-value: 0.03938
## 
## 
## Response Rb :
## 
## Call:
## lm(formula = Rb ~ slope, data = slo)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5585 -0.4525  0.0384  0.5179  2.1718 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.44461    0.05050 -28.604   <2e-16 ***
## slope       -0.02677    0.01114  -2.403   0.0165 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7772 on 626 degrees of freedom
## Multiple R-squared:  0.009141,   Adjusted R-squared:  0.007558 
## F-statistic: 5.775 on 1 and 626 DF,  p-value: 0.01655
## 
## 
## Response Sc :
## 
## Call:
## lm(formula = Sc ~ slope, data = slo)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.24863 -0.69908 -0.03468  0.63770  2.23141 
## 
## Coefficients:
##             Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) -6.83985    0.05894 -116.042   <2e-16 ***
## slope       -0.02485    0.01300   -1.912   0.0564 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9071 on 626 degrees of freedom
## Multiple R-squared:  0.005803,   Adjusted R-squared:  0.004215 
## F-statistic: 3.654 on 1 and 626 DF,  p-value: 0.05639
## 
## 
## Response Zn :
## 
## Call:
## lm(formula = Zn ~ slope, data = slo)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.68958 -0.37576  0.01379  0.40157  1.63272 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.651650   0.037733  17.270  < 2e-16 ***
## slope       -0.026286   0.008322  -3.159  0.00166 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5807 on 626 degrees of freedom
## Multiple R-squared:  0.01569,    Adjusted R-squared:  0.01411 
## F-statistic: 9.976 on 1 and 626 DF,  p-value: 0.001662

Anova:

## Analysis of Variance Table
## 
##              Df  Pillai approx F num Df den Df    Pr(>F)    
## (Intercept)   1 0.99984   471112      8    619 < 2.2e-16 ***
## slope         1 0.03851        3      8    619  0.001946 ** 
## Residuals   626                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Only Ca changes with slope (p<0.001)

Interactions between co-variables

After seeing how these different co-variables affect the subcompositions, we can study an interaction model of this system. We can look at things like: do the covariables slope and elevation interact? Is the effect of slope on the composition, just a proxy for elevation?

Firstly, we test the co-variables individually in the model.

Distance and barium ratios show a significant correlation (p<0.001).

## Analysis of Variance Table
## 
##              Df  Pillai approx F num Df den Df    Pr(>F)    
## (Intercept)   1 0.99984   480987      8    619 < 2.2e-16 ***
## bd$distance   1 0.06323        5      8    619 2.447e-06 ***
## Residuals   626                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Slope and barium ratios show a significant correlation (p<0.001).

## Analysis of Variance Table
## 
##              Df  Pillai approx F num Df den Df    Pr(>F)    
## (Intercept)   1 0.99984   471112      8    619 < 2.2e-16 ***
## slo$slope     1 0.03851        3      8    619  0.001946 ** 
## Residuals   626                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Elevation and barium ratios show a significant correlation (p<0.001).

## Analysis of Variance Table
## 
##                Df  Pillai approx F num Df den Df    Pr(>F)    
## (Intercept)     1 0.99984   489889      8    619 < 2.2e-16 ***
## ele$elevation   1 0.13990       13      8    619 < 2.2e-16 ***
## Residuals     626                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Then, we test the individual co-variables together in the model.

If I pass the data onto distance first, then elevation, ANOVA says that both variables can explain the variance of the Barium ratios.

## Analysis of Variance Table
## 
##                Df  Pillai approx F num Df den Df    Pr(>F)    
## (Intercept)     1 0.99984   493269      8    618 < 2.2e-16 ***
## bd$distance     1 0.06867        6      8    618 5.265e-07 ***
## ele$elevation   1 0.10190        9      8    618 2.185e-11 ***
## Residuals     625                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

However, if I switch up the order of variables, we see that elevation and barium ratios show a significant correlation (p<0.001). Elevation and barium ratios do not show a significant correlation (p<0.001).

This implies, that most of the variance is explained by elevation. When the remaining variance is passed down onto distance, this co-variable is unable to strongly explain the rest of the data set. Thus distance could be a proxy for elevation.

## Analysis of Variance Table
## 
##                Df  Pillai approx F num Df den Df  Pr(>F)    
## (Intercept)     1 0.99984   493269      8    618 < 2e-16 ***
## ele$elevation   1 0.14153       13      8    618 < 2e-16 ***
## bd$distance     1 0.02185        2      8    618 0.08945 .  
## Residuals     625                                           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Elevation and barium ratios show a significant correlation (p<0.001). Slope and barium ratios show a significant correlation (p<0.001).

## Analysis of Variance Table
## 
##                Df  Pillai approx F num Df den Df    Pr(>F)    
## (Intercept)     1 0.99984   493399      8    618 < 2.2e-16 ***
## ele$elevation   1 0.14213       13      8    618 < 2.2e-16 ***
## slo$slope       1 0.05388        4      8    618 3.443e-05 ***
## Residuals     625                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

I obtained the same results even after switching the order around, as seen below. This means that slope can explain some of the variance in our data, but elevation remains the main variable.

## Analysis of Variance Table
## 
##                Df  Pillai approx F num Df den Df    Pr(>F)    
## (Intercept)     1 0.99984   493399      8    618 < 2.2e-16 ***
## slo$slope       1 0.03946        3      8    618  0.001555 ** 
## ele$elevation   1 0.15366       14      8    618 < 2.2e-16 ***
## Residuals     625                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now that I know distance may be a proxy variable, I will only include elevation and slope in my final model. The model should look a little like this at the moment:

Let’s test the interaction between elevation and slope on the barium ratios.

There seems to be a significant interaction between elevation and slope affecting barium ratios (p<0.001).

## Analysis of Variance Table
## 
##                          Df  Pillai approx F num Df den Df    Pr(>F)    
## (Intercept)               1 0.99984   479174      8    619 < 2.2e-16 ***
## ele$elevation:slo$slope   1 0.07262        6      8    619 1.607e-07 ***
## Residuals               626                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Although I cannot directly infer causality from this interaction, it makes much more sense (in a geochemical way) that elevation influences slope, which influences the barium ratios. So (maybe) the model could look like this:

However, when I combine this interaction with elevation and slope as individual variables, it becomes insignificant.

## Analysis of Variance Table
## 
##                          Df  Pillai approx F num Df den Df    Pr(>F)    
## (Intercept)               1 0.99984   494345      8    617 < 2.2e-16 ***
## ele$elevation             1 0.14400       13      8    617 < 2.2e-16 ***
## slo$slope                 1 0.05411        4      8    617 3.311e-05 ***
## ele$elevation:slo$slope   1 0.02130        2      8    617    0.1005    
## Residuals               624                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We may say here, that the individual components of elevation and slope is much more important in explaining the variation. So, the first diagram more accurately paints the picture. As we are dealing with a lot of elemental ratios, it could be that most variation of the elemental ratios are explained by elevation. Whilst few are explained by slope.

We will run Solveig’s Property test to explore this hypothesis:

PERSONAL NOTE: Say if Br is pointing to Zn in the distance plot, this means, that with increasing distance, there is a smaller Br/Zn ratio.

Distance: Looking at the bigger picture, we see that Br and Na (relative to most elements) are decreasing with increasing distance. This can be mostly explained by seaspray.

Elevation: Many arrows are pointing from sodium, so we can say that Na/x is highly influenced (decreasing) by elevation. Same goes to Bromine. The pattern seen for elevation is similar to what was seen for distance. This allows us to visualise how distance may be a proxy for elevation.

Other elemental ratios like Ba/x and La/x is increasing with elevation. Elevation may play such role, because stronger precipitation over elevated regions may alter the chemical composition drastically.

Slope: With increasing slope, there is a smaller Ca/Ba ratio only. This supports what we found previously, such that slope only explains a limited number of elemental ratios, whilst elevation is the main co-variable.

Steeper slopes are often associated to better drainage. This can promote oxic conditions in the soil. Hence, perhaps more oxic conditions can lead to more barium mobility and uptake into the plant.

Next steps: It would be interesting to study the subcompositions that are not greatly influenced within the composition. For example in elevation, Ca, Cr and Zn mostly are not strongly influenced by other elements, as they don’t have a strong arrow preference. These are the elements won’t change as much respective to other elemental uptake. Hence, they will give us a better picture of how the variables affect the elements.

Compositional Data Analysis on Nova Scotia vegetation, and the driving influences for Barium compositions

Jasmine Ng, Lucjia Dujmovic, Solveig Pospeich, Raimon Tolosana Delgado, K. Gerald van den boogart

28-07-2023

Results

Tidying the data

Geology

PCA biplot

Method 1

Plot 1A

Plot 1B

Plot 1C

Method 2

Mapping PCA

Mapping Barium concentrations

Covariables of Barium change

Distance from coast

Elevation

Slope

Interactions between co-variables