Wednesday, October 28, 2015

Handy little snippet of R code for thinning occurrence data

I came up with this a few months back.  I was using the R package spThin, by Aiello-Lammens et al, but found that it didn't quite do what I wanted it to do.  The purpose of that package is to return the maximum number of records for a given thinning distance, which is obviously very valuable in situations where you (1) don't have a ton of data and (2) are concerned about spatial autocorrelation.

However, it wasn't quite what I needed for two reasons.  First, I didn't need to maximize the number of points given a specified thinning distance; I needed to grab a fixed number of points that did the best possible job of spanning the variation (spatial or otherwise) in my initial data set.  Second, the spThin algorithm, because it's trying to optimize sample size, can take a very long time to run for larger data sets.

Here's the algorithm I fudged together:

1. Pick a single random point from your input data set X and move it to your output set Y.

Then, while the number of points in Y is less than the number you want:
2. Calculate the distance between the points in X and the points in Y.
3. Pick the point from X that has the highest minimum distance to points in Y (i.e., is the furthest away from any of the points you've already decided to keep).

Lather, rinse, repeat.  Stop when you've got the number of points you want.

Just so you can get an idea of what's going on, here's some data on Leucadendron discolor from a project I'm working on with Haris Saslis-Lagoudakis:



That's a lot of points!  4,874, to be exact.  Now let's run it through thin.max and ask for 100 output points:



Lovely!  We've got a pretty good approximation of the spatial coverage of the initial data set, but with a lot fewer points.  What's extra-nice about this approach is that you can use it for pretty much any set of variables that you can use to calculate Euclidean distances.  That means you can thin data based on geographic distance, environmental distance, or a combination of both!  Just pass it the right column names, and it should work.  Of course since it's Euclidean distances you might run into some scaling issues if you try to use a combination of geographic and environmental variables, but you could probably fix that by rescaling axes before selecting points in some useful way.

Also, since it starts from a randomly chosen point, you will get somewhat different solutions each time you run it.  Could be useful for some sort of spatially-overdispersed jackknife if you want to do something like that for some reason.

There's no R package as such for this, but it will probably be folded into the R version of ENMTools when I get around to working on that again.  For now, you can get the code here:

https://gist.github.com/danlwarren/271288d5bab45d2da549