Monday 23 November 2020

Spatially partition your randomisations

In previous posts I described how the rand_structured algorithm works, and then how one can use spatially structured randomisations to compare observed patterns with stricter null models.   

This post is concerned with how one can spatially partition a randomisation.  Some of these details are in a previous blog post, but this post is a much more detailed description. 

In the standard implementation of a randomisation, in Biodiverse and I suspect more generally also, labels are randomly allocated to any group across the entire data set.  This works well for many studies and is very effective.  However, when one begins to scale analyses to larger extents, for example North America, Asia or globally, then the total pool of labels begins to span many different environments.  The randomisations could allocate a polar taxon to the tropics, and a desert taxon to a rainforest.  This does not make the randomisation invalid, but it is perhaps not as strict as it could be.  

The effect is perhaps best considered with a scenario using phylogenetic diversity where polar and tropical taxa are distinct clades on the tree.   For a randomisation that allocates labels anywhere across the data set, one will commonly end up with a random realisation containing many groups with a mixture of polar and tropical taxa.  The PD on the random data set will be higher than on the observed data set in almost all cases because both clades are sampled and thus more of the tree is represented.  Note that this does not make the result wrong - it means that, when compared with a random sample from the full pool of taxa, the observed PD is less than expected.  That is useful to know, but the next question to ask is "are the patterns in the tropics higher or lower than expected for the tropics?", and the same for the polar taxa.  

One could quite easily remove any groups outside a region of interest and then rerun the analysis.  However, then the ranges of the labels will be reduced and any endemism analyses will not be comparable with the larger analyses.  

This can be readily fixed in Biodiverse by specifying a spatial condition to define subsets. Or, if you only want to randomise a subset of your data while holding the rest as-observed, you can use a definition query.    The spatial condition approach was used in Mishler et al. (2020) to (very approximately) partition Canada, the US and Mexico to assess sensitivity of the results for the full data set.  The definition query was used in Allen et al. (2019) to only randomise values within Florida, while holding the adjacent regions constant.

How does it work?  Both approaches slice the data into subsets, apply the randomisation algorithm to each subset independently, and then reassemble the randomised subsets into a full randomised basedata that is then used for the comparisons.  The difference for the definition query is that the data is divided into two sets, and only the subset that passes the query is randomised.

The definition query is run first.  This means that, if you specify both a definition query and a spatial condition, then only the groups that pass the definition query are considered for randomisations, and those randomisations will be spatially partitioned.

This approach is general, and works for any of the randomisation algorithms in Biodiverse.  Note, however, it will not apply different randomisation algorithms in different subsets.  If there is a good reason to do so then we can look at it, but it will make the interface much more complex to implement.  

One other point to note is that, for the structured randomisations, the swapping algorithm is applied within the subsets.  Each subset is run to completion before they are all aggregated.


Some images will work better than text, so here are some clipped screenshots from Biodiverse.

The observed distribution of Acacia tenuissima (red cells), with outlines of Australian states and territories in blue.  Acacia data are from Mishler et al. 2014, and the cell size is 50 km.  

A tenuissima incidences in a randomisation constrained such that each incidence is allocated within the polygon that contains it. Note the absence of incidences in NSW, Victoria and Tasmania, and that there are only three in South Australia.  This is consistent with the observed data.

A tenuissima records randomised using a definition query so that only incidences within Western Australia are randomised.  All others are kept unchanged. Compare with the observed distribution above.   


So how does one use it?  It is done as part of the standard interface (this functionality has actually been in Biodiverse since version 1, but the interface was slightly reconfigured in version 2).  The user specifies spatial conditions using the same syntax as for the spatial analyses.   


Subsets to randomise independently are defined using a spatial condition (red arrow), while the definition query (blue arrow) is used to randomise a subset of the data.

The choice of condition is up to the user, and they can specify whatever condition they like (it is their analysis, after all...).  Generally speaking, though, it is better to use a condition that generates non-overlapping regions.  One that uses a shapefile would fit with many geographic cases.  Many conditions will work but might not be very sensible, for example overlapping circles.  Each group is allocated to only one subset, so in such overlapping cases it is the first one that contains the group "wins" the group.

One thing to watch for, and which is fixed for version 4, is that if the spatial condition does not capture all groups then the system will throw an error.  The workaround for version 3 and earlier is to use a definition query that matches (shadows) the spatial condition, although keep in mind that any groups outside the definition query will not be randomised.

To sum up, Biodiverse allows a high degree of complexity and fine grained control when using randomisations to assess the significance of observed patterns.  This post has described the spatial conditions and definition queries.  More features and approaches are described in other posts, and are grouped under the randomisation tag.  



Shawn Laffan

23-Nov-2020


For more details about Biodiverse, see http://shawnlaffan.github.io/biodiverse/

To see what else Biodiverse has been used for, see https://github.com/shawnlaffan/biodiverse/wiki/PublicationsList

You can also join the Biodiverse-users mailing list at http://groups.google.com/group/Biodiverse-users


No comments:

Post a Comment

Note: only a member of this blog may post a comment.