Sunday, July 31, 2016

Hey, what's up with those environmental overlaps in the ENMTools R package?

I'm so glad I asked!  The env.overlap metrics produced by the ENMTools R package are based on methods developed by John Baumgartner and myself.  The purpose of these metrics is to address one of the key issues with the niche overlap metrics currently implemented in ENMTools and elsewhere; the difference between the geographic distribution of suitability and the distribution of suitability in environment space.

Existing methods in ENMTools and most other packages measure similarity between models via some metric that quantifies the similarity in predicted suitability of habitat in geographic space.  While this may be exactly the sort of thing you'd like to measure if you're wondering about the potential for species to occupy the same habitat in an existing landscape, it can be somewhat misleading if the availability of habitat types on the landscape is strongly biased.  For instance, what if two species have very little niche overlap in environment space, but that overlap happens to occur in a combination of environments that turns out to be very common in the current landscape?

To illustrate, let's take two species (red and blue) and look at their niches in environment space:




So they're pretty different, right? But what if the only available environments in the study region occur in that area of overlap?  E.g., what if the current environment space is represented by the green area here?



Well then the only environments we have within which we can measure similarity between species happen to be those environments that are suitable for both!  This means that our measure of overlap between models in geographic space could be arbitrarily disconnected from the actual similarity between those models in environment space.  Depending on the sort of question we're trying to ask, that could be quite misleading.

The method of measuring overlap developed by Broennimann et al. (2012) deals with this issue to some extent.  However, those methods only work in two dimensions, and only work by using a kernel density approach based on occurrence points in environment space.  My guess is that that's still way better than what the original ENMTools approach did for most purposes, but it's not very useful if (for instance) you want to ask how similar the environmental predictions of a GLM are to, say, an RF model.  You simply can't do it.  Or if you want to ask questions in a higher dimensional space, you're basically out of luck.

So what can we do?  Can we figure out a way to measure overlap between two arbitrary models in a n-dimensional space?  It turns out that this is not easy to do exactly, but you can get approximate measures to an arbitrary level of precision fairly easily!

Our approach leverages the fact that R already has great packages for doing Latin hypercube sampling.  This allows us to draw random, but largely independent, points from that n-dimensional environment space.  We can then use dismo's predict function to project our models to those points in environment space, and measure suitability differences between species.  Obviously throwing just a couple of points into a 19-dimensional space (for instance, if you're using all Bioclim variables) isn't going to get you very close to the truth, but of course if you keep throwing more points in there you will get closer and closer to the true average similarity between models across the space.

SO that's what the method does: it starts by making a random Latin hypercube sample of 10,000 points from the space of all combinations of environments, with each variable bound by it's maximum and minimum in the current environment space.  Then it chucks another 10,000 points in there, and it asks how different the answer with 20,000 points is compared to the answer with 10,000.  Then repeat for 30,000 vs. 20,000, and so on, until subsequent measures fall below some threshold tolerance level.  This allows us to get arbitrarily close to the true overlap by specifying our tolerance level.  Lower tolerances take longer and longer to process, since it requires more samples for the value to converge.  However, we've found that with tolerances set at around .001 we get very consistent results for a 4-dimensional comparison with an execution time of around two seconds.

Pretty cool, huh?  Now you can compare the environmental predictions of any two models that can be projected using ENMTools' predict() function in environment space, instead of just looking at their projections into a given geographic space!

Friday, July 29, 2016

All rangebreak tests now available in ENMTools R package

As of yesterday, all rangebreak tests are now available in the R package.  There are still some rough edges to be smoothed off and all that, but if you are doing a rangebreak-y study and need that sort of thing, it is usable.  Demo code and example outputs are now available on the readme at:

https://github.com/danlwarren/ENMTools


ENMTools R package model plots are now all done in ggplot2

Newest update (I did say they'd be coming very rapidly now, didn't I?): I've switched the enmtools model plots from using base raster plotting functions to using ggplot2.  They don't actually look very different, as you can see by comparing base (top) to ggplot2 (bottom) plots:


However the switch to ggplot2 allows for a lot more flexibility down the line, and also makes it easier to store plots as objects for future manipulation.

Thursday, July 28, 2016

New features for ENMTools model objects and functions: response plots, model evaluation, and new color ramps

One of the advantages of the enmtools.species object structure is that I can now provide a much more accessible interface to much of dismo's modeling functionality, and can add new functionality that automates a lot of the outputs you might typically want from an SDM/ENM.  You've already seen some of this here, but in the past week I've added a lot more.  For one thing I've switched from the default color ramps, such as this:



To viridis color ramps, e.g.,


This has a number of advantages.  First, viridis color ramps are very pretty, and this particular one has a very familiar Maxent-y sort of look about it which makes it easy for an experienced SDM modeler to interpret.  More importantly, the viridis color ramps are designed with accessibility in mind.  The authors put a ton of work into figuring out a set of color ramps that are interpretable when printed in greyscale AND accessible to people with varying types of color blindness.  That's pretty awesome.

You'll also notice that there are two sets of points plotted there.  Those are training and test points.  You can now call all of the enmtools modeling functions with an argument "test.prop", e.g.,

ahli.glm = enmtools.glm(pres ~ layer.1 + layer.2 + layer.3 + layer.4, ahli, env, test.prop = 0.2)

And they will automatically withhold that proportion of your data for model testing.  Now when you call your model object, you get training and test evaluation metrics!

>ahli.glm


Formula:  presence ~ layer.1 + layer.2 + layer.3 + layer.4



Data table (top ten lines): 

|   | Longitude| Latitude| layer.1| layer.2| layer.3| layer.4| presence|
|:--|---------:|--------:|-------:|-------:|-------:|-------:|--------:|
|1  |  -80.0106|  21.8744|    2765|    1235|    1174|     252|        1|
|2  |  -79.9086|  21.8095|    2289|    1732|     957|     231|        1|
|3  |  -79.8065|  21.7631|    2158|    1870|     983|     253|        1|
|4  |  -79.8251|  21.8095|    2207|    1877|     967|     259|        1|
|5  |  -79.8807|  21.8374|    2244|    1828|     945|     249|        1|
|6  |  -79.9550|  21.8374|    2250|    1766|     919|     235|        1|
|7  |  -80.3446|  22.0136|    2201|    1822|     978|     277|        1|
|8  |  -80.2983|  21.9951|    2214|    1786|     986|     284|        1|
|10 |  -80.1591|  21.9673|    2984|     965|    1311|     237|        1|
|11 |  -80.1498|  21.9858|    3042|     841|    1371|     221|        1|


Model:  
Call:
glm(formula = f, family = "binomial", data = analysis.df[, -c(1, 
    2)])

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-0.65556  -0.18280  -0.12121  -0.08065   3.10812  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)  
(Intercept) 40.033648  27.208173   1.471   0.1412  
layer.1     -0.012770   0.007165  -1.782   0.0747 .
layer.2     -0.009662   0.007346  -1.315   0.1884  
layer.3      0.006954   0.006638   1.047   0.2949  
layer.4     -0.020317   0.025631  -0.793   0.4280  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 130.29  on 1011  degrees of freedom
Residual deviance: 119.18  on 1007  degrees of freedom
AIC: 129.18

Number of Fisher Scoring iterations: 8



Model fit (training data):  class          : ModelEvaluation 
n presences    : 12 
n absences     : 1000 
AUC            : 0.7485833 
cor            : 0.09753628 
max TPR+TNR at : -4.82228 


Proportion of data wittheld for model fitting:  0.2

Model fit (test data):  class          : ModelEvaluation 
n presences    : 4 
n absences     : 1000 
AUC            : 0.74075 
cor            : 0.05048264 
max TPR+TNR at : -4.570937 


So that's cool, obviously.  Even cooler is that your model object now contains marginal response functions.  These are calculated by varying each predictor from the minimum value to the maximum value found in the environmental layers, while holding the other predictors constant at the mean value across all presence points.  At present these plots aren't printed by default when you call your object, but I may change that.  For now, you can type:

>ahli.glm$response.plots

And you get:






I may do some more tweaking in the future, but these are ggplot2 plots so you can easily modify them however you want.

Friday, July 22, 2016

Background/similarity tests in the ENMTools R package

Okay, let's use our two species to run a background/similarity test.  This works a lot like the identity test (see the post preceding this one), but there's a new option called "test.type" that can be set to "asymmetric" or "symmetric".  Here's an asymmetric background test using Bioclim:




bg.bc.asym = background.test(species.1 = ahli, species.2 = allogus, env = env, type = "bc", nreps = 99, test.type = "asymmetric")
bg.bc.asym
## 
## 
##  
## 
## Asymmetric background test ahli vs. allogus background
## 
## background test p-values:
##        D        I rank.cor 
##      0.32      0.76      0.43 
## 
## 
## Replicates:
## 
## 
## |          |         D|         I|  rank.cor|
## |:---------|---------:|---------:|---------:|
## |empirical | 0.1328502| 0.3177390| 0.0706201|
## |rep 1     | 0.1430965| 0.3114858| 0.0824412|
## |rep 2     | 0.1284871| 0.2801639| 0.0156034|
## |rep 3     | 0.1599120| 0.3384525| 0.1136082|
## |rep 4     | 0.1431022| 0.3101197| 0.0766638|
plot of chunk unnamed-chunk-17


What is "symmetric" vs. "asymmetric"?  Well, an asymmetric test means that we are comparing the empirical overlap to a null distribution generated by comparing one species' real occurrences to the background of another (species.1 vs. background of species.2).  In the Warren et al. 2008 paper we used this sort of asymmetric test, repeating it in each direction (species.1 vs. background of species.2 and species.2 vs. background of species.1).  While we had the idea that that might generate some interesting biological insight, I think it's generated just as much (if not more) confusion.  For this reason, the new R package also provides the option to do symmetric tests.  These tests compare the empirical overlap to the overlap expected when points are drawn randomly from the background of both species (species.1 background vs. species.2 background), keeping sample sizes for each species constant, of course.

And now a symmetric background test using Domain:


bg.dm.sym = background.test(species.1 = ahli, species.2 = allogus, env = env, type = "dm", nreps = 99, test.type = "symmetric")

bg.dm.sym
## 
## 
##  
## 
## Symmetric background test ahli background vs. allogus background
## 
## background test p-values:
##        D        I rank.cor 
##      0.38      0.36      0.21 
## 
## 
## Replicates:
## 
## 
## |          |         D|         I|  rank.cor|
## |:---------|---------:|---------:|---------:|
## |empirical | 0.1328502| 0.3177390| 0.0706201|
## |rep 1     | 0.2382775| 0.4428653| 0.1774936|
## |rep 2     | 0.1518903| 0.3555431| 0.1002003|
## |rep 3     | 0.1250674| 0.3029139| 0.0717565|
## |rep 4     | 0.1165355| 0.2946842| 0.0841041|
plot of chunk unnamed-chunk-20


Wednesday, July 20, 2016

Running an identity/equivalency test in the ENMTools R package

Okay, let's say we've got two enmtools.species objects: ahli and allogus.  How can we run an identity test?

Here's what we need:


  1. Our two species
  2. Our RasterStack of environmental layers
  3. The type of model we'd like ("glm", "bc", "dm", or "mx", for GLM, Bioclim, Domain, or Maxent)
  4. A formula (GLM only)
  5. The number of reps to perform

So here's how we'd run an identity test using GLM for our two species. 




id.glm = identity.test(species.1 = ahli, species.2 = allogus, env = env, type = "glm", f = presence ~ layer.1 + layer.2 + layer.3 + layer.4, nreps = 99)

Doing 99 reps takes a while, but when you're done, you get an "identity.test" object.  That contains all sorts of useful information.  A quick summary will show you some of it:

id.glm
## 
## 
##  
## 
## Identity test ahli vs. allogus
## 
## Identity test p-values:
##        D        I rank.cor 
##      0.01      0.01      0.01 
## 
## 
## Replicates:
## 
## 
## |          |         D|         I|   rank.cor|
## |:---------|---------:|---------:|----------:|
## |empirical | 0.2221752| 0.4661581| -0.4761597|
## |rep 1     | 0.8883545| 0.9899271|  0.8942366|
## |rep 2     | 0.8486324| 0.9828760|  0.9315827|
## |rep 3     | 0.8227838| 0.9742077|  0.8881490|
## |rep 4     | 0.7255044| 0.9469161|  0.5551645|
plot of chunk unnamed-chunk-14
If you want to access the empirical or replicate models, those are stored in that object as well:

names(id.glm)
[1] "description"               "reps.overlap"              "p.values"                  "empirical.species.1.model" "empirical.species.2.model"
[6] "replicate.models"          "d.plot"                    "i.plot"                    "cor.plot"     

As with building species models, identity.test works pretty much the same for Domain, Bioclim, and Maxent models with the exception that you don't need to supply a formula.

Monday, July 18, 2016

Using an enmtools.species object to build an ENM

Now that we've got our enmtools.species object and have assigned it presence and background data and a species name, we can use it to build models very simply!  For this functionality, ENMTools is basically just acting as a wrapper for dismo, using those functions to actually build models.  At present ENMTools only has interfaces for GLM, Maxent, Bioclim, and Domain, but that will change with time.

So let's use our enmtools.species object "ahli" to build a quick Bioclim model.  I've got a RasterStack object made up of four environmental layers.  It's named "env", and the layers are just "layer.1", etc.

Let's build a model!

ahli.bc = enmtools.bc(species = ahli, env = env)

For Bioclim, Domain, and Maxent, it's that easy!  ENMTools extracts the presence and background data from env using the data stored in the species object, builds a model, and returns some lovely formatted output.

For GLM we need to supply a formula as well, but other than that it's idential.


ahli.glm = enmtools.glm(f = pres ~ layer.1 + layer.2 + layer.3 + layer.4, species = ahli, env = env)

Let's look at the output:

ahli.glm
## 
## 
## Formula:  presence ~ layer.1 + layer.2 + layer.3 + layer.4
## 
## 
## 
## Data table (top ten lines): 
## 
## | layer.1| layer.2| layer.3| layer.4| presence|
## |-------:|-------:|-------:|-------:|--------:|
## |    2765|    1235|    1174|     252|        1|
## |    2289|    1732|     957|     231|        1|
## |    2158|    1870|     983|     253|        1|
## |    2207|    1877|     967|     259|        1|
## |    2244|    1828|     945|     249|        1|
## |    2250|    1766|     919|     235|        1|
## |    2201|    1822|     978|     277|        1|
## |    2214|    1786|     986|     284|        1|
## |    2287|    1722|     992|     266|        1|
## |    2984|     965|    1311|     237|        1|
## 
## 
## Model:  
## Call:
## glm(formula = f, family = "binomial", data = analysis.df)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.67171  -0.20485  -0.14150  -0.09528   3.08762  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept) 46.178922  24.777923   1.864   0.0624 .
## layer.1     -0.013347   0.006276  -2.127   0.0334 *
## layer.2     -0.011985   0.006612  -1.813   0.0699 .
## layer.3      0.003485   0.006586   0.529   0.5967  
## layer.4     -0.009092   0.021248  -0.428   0.6687  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 164.58  on 1015  degrees of freedom
## Residual deviance: 150.15  on 1011  degrees of freedom
## AIC: 160.15
## 
## Number of Fisher Scoring iterations: 8
## 
## 
## 
## Suitability:  
## class       : RasterLayer 
## dimensions  : 418, 1535, 641630  (nrow, ncol, ncell)
## resolution  : 0.008333333, 0.008333333  (x, y)
## extent      : -86.90809, -74.11642, 19.80837, 23.2917  (xmin, xmax, ymin, ymax)
## coord. ref. : NA 
## data source : in memory
## names       : layer 
## values      : 6.419793e-08, 0.999983  (min, max)
plot of chunk unnamed-chunk-8
You get pretty, formatted output and a nice plot as well.  Next up: identity/equivalency tests!



Saturday, July 16, 2016

The new R version of ENMTools is in the works! Here's how to build an enmtools.species object.

But for real this time.  I've started over entirely from scratch, and I'm using the new R package as a foundation for some novel analyses that I'm developing as part of my current research.  You can download it and view a fairly lengthy manual of what's currently implemented here:

https://github.com/danlwarren/ENMTools


For reasons that will become clear with time (when some of the downstream stuff gets finished), the way you interface with ENMTools is going to be a bit different from how you work with dismo or Biomod.  First off, you start by defining enmtools.species objects for each species (or population) that you want to compare.


Here I'll create one called ahli (based on data from Anolis ahli).



ahli = enmtools.species()

Now that doesn't have any data associated with it, so if we get a summary of it, we basically just hear back from R that we don't have any data.



ahli
## 
## 
## Range raster not defined.
## 
## Presence points not defined.
## 
## Background points not defined.
## 
## Species name not defined.

So let's add some data:





ahli$species.name = "ahli"
ahli$presence.points = read.csv("test/testdata/ahli.csv")[,3:4]
ahli$background.points = background.points.buffer(ahli$presence.points, 20000, 1000, env[[1]])
ahli

And then look at it again:



## 
## 
## Range raster: 
## class       : RasterLayer 
## dimensions  : 418, 1535, 641630  (nrow, ncol, ncell)
## resolution  : 0.008333333, 0.008333333  (x, y)
## extent      : -86.90809, -74.11642, 19.80837, 23.2917  (xmin, xmax, ymin, ymax)
## coord. ref. : NA 
## data source : in memory
## names       : layer.1 
## values      : 1, 1  (min, max)
## 
## 
## 
## Presence points (first ten only): 
## 
## | Longitude| Latitude|
## |---------:|--------:|
## |  -80.0106|  21.8744|
## |  -79.9086|  21.8095|
## |  -79.8065|  21.7631|
## |  -79.8251|  21.8095|
## |  -79.8807|  21.8374|
## |  -79.9550|  21.8374|
## |  -80.3446|  22.0136|
## |  -80.2983|  21.9951|
## |  -80.1776|  21.9023|
## |  -80.1591|  21.9673|
## 
## 
## Background points (first ten only): 
## 
## | Longitude| Latitude|
## |---------:|--------:|
## | -79.78726| 21.72920|
## | -79.82892| 21.73754|
## | -79.83726| 21.69587|
## | -80.01226| 22.01254|
## | -79.63726| 21.76254|
## | -79.92892| 21.78754|
## | -79.99559| 22.12920|
## | -79.81226| 21.87087|
## | -80.30392| 22.07920|
## | -79.97892| 21.85420|
## 
## 
## Species name:  ahli

Neat, huh?  Next up I'll show you how to build an ENM.