Monday, 23 November 2020

Spatially partition your randomisations

In previous posts I described how the rand_structured algorithm works, and then how one can use spatially structured randomisations to compare observed patterns with stricter null models.   

This post is concerned with how one can spatially partition a randomisation.  Some of these details are in a previous blog post, but this post is a much more detailed description. 

In the standard implementation of a randomisation, in Biodiverse and I suspect more generally also, labels are randomly allocated to any group across the entire data set.  This works well for many studies and is very effective.  However, when one begins to scale analyses to larger extents, for example North America, Asia or globally, then the total pool of labels begins to span many different environments.  The randomisations could allocate a polar taxon to the tropics, and a desert taxon to a rainforest.  This does not make the randomisation invalid, but it is perhaps not as strict as it could be.  

The effect is perhaps best considered with a scenario using phylogenetic diversity where polar and tropical taxa are distinct clades on the tree.   For a randomisation that allocates labels anywhere across the data set, one will commonly end up with a random realisation containing many groups with a mixture of polar and tropical taxa.  The PD on the random data set will be higher than on the observed data set in almost all cases because both clades are sampled and thus more of the tree is represented.  Note that this does not make the result wrong - it means that, when compared with a random sample from the full pool of taxa, the observed PD is less than expected.  That is useful to know, but the next question to ask is "are the patterns in the tropics higher or lower than expected for the tropics?", and the same for the polar taxa.  

One could quite easily remove any groups outside a region of interest and then rerun the analysis.  However, then the ranges of the labels will be reduced and any endemism analyses will not be comparable with the larger analyses.  

This can be readily fixed in Biodiverse by specifying a spatial condition to define subsets. Or, if you only want to randomise a subset of your data while holding the rest as-observed, you can use a definition query.    The spatial condition approach was used in Mishler et al. (2020) to (very approximately) partition Canada, the US and Mexico to assess sensitivity of the results for the full data set.  The definition query was used in Allen et al. (2019) to only randomise values within Florida, while holding the adjacent regions constant.

How does it work?  Both approaches slice the data into subsets, apply the randomisation algorithm to each subset independently, and then reassemble the randomised subsets into a full randomised basedata that is then used for the comparisons.  The difference for the definition query is that the data is divided into two sets, and only the subset that passes the query is randomised.

The definition query is run first.  This means that, if you specify both a definition query and a spatial condition, then only the groups that pass the definition query are considered for randomisations, and those randomisations will be spatially partitioned.

This approach is general, and works for any of the randomisation algorithms in Biodiverse.  Note, however, it will not apply different randomisation algorithms in different subsets.  If there is a good reason to do so then we can look at it, but it will make the interface much more complex to implement.  

One other point to note is that, for the structured randomisations, the swapping algorithm is applied within the subsets.  Each subset is run to completion before they are all aggregated.


Some images will work better than text, so here are some clipped screenshots from Biodiverse.

The observed distribution of Acacia tenuissima (red cells), with outlines of Australian states and territories in blue.  Acacia data are from Mishler et al. 2014, and the cell size is 50 km.  

A tenuissima incidences in a randomisation constrained such that each incidence is allocated within the polygon that contains it. Note the absence of incidences in NSW, Victoria and Tasmania, and that there are only three in South Australia.  This is consistent with the observed data.

A tenuissima records randomised using a definition query so that only incidences within Western Australia are randomised.  All others are kept unchanged. Compare with the observed distribution above.   


So how does one use it?  It is done as part of the standard interface (this functionality has actually been in Biodiverse since version 1, but the interface was slightly reconfigured in version 2).  The user specifies spatial conditions using the same syntax as for the spatial analyses.   


Subsets to randomise independently are defined using a spatial condition (red arrow), while the definition query (blue arrow) is used to randomise a subset of the data.

The choice of condition is up to the user, and they can specify whatever condition they like (it is their analysis, after all...).  Generally speaking, though, it is better to use a condition that generates non-overlapping regions.  One that uses a shapefile would fit with many geographic cases.  Many conditions will work but might not be very sensible, for example overlapping circles.  Each group is allocated to only one subset, so in such overlapping cases it is the first one that contains the group "wins" the group.

One thing to watch for, and which is fixed for version 4, is that if the spatial condition does not capture all groups then the system will throw an error.  The workaround for version 3 and earlier is to use a definition query that matches (shadows) the spatial condition, although keep in mind that any groups outside the definition query will not be randomised.

To sum up, Biodiverse allows a high degree of complexity and fine grained control when using randomisations to assess the significance of observed patterns.  This post has described the spatial conditions and definition queries.  More features and approaches are described in other posts, and are grouped under the randomisation tag.  



Shawn Laffan

23-Nov-2020


For more details about Biodiverse, see http://shawnlaffan.github.io/biodiverse/

To see what else Biodiverse has been used for, see https://github.com/shawnlaffan/biodiverse/wiki/PublicationsList

You can also join the Biodiverse-users mailing list at http://groups.google.com/group/Biodiverse-users


Thursday, 19 November 2020

Randomisations - modelling spatial structure

In the previous post, I described how the rand_structured algorithm works.  If you don't feel like reading that post right now, then a short description is that it uses a filling algorithm to randomly allocate the labels in a basedata onto the groups of a new basedata, with the end result being random assemblages in each group, and where each group has the same number of labels as the original basedata by default. A swapping algorithm is then used to allocate any labels that could not be allocated using the filling process, usually because there were no unfilled groups that did not already contain these labels.

The filling algorithm has some nice advantages over other methods that also match the richness patterns, such as the independent swaps algorithm.  The main one is that filling algorithms are easily generalised to model spatial processes in their allocation, for example random walks and diffusion processes.

The videos below show some of these spatially structured randomisations at work, using Acacia tetragonophylla incidences from Mishler et al. (2014).  Note that the shade of red is used only to indicate when in the sequence a cell was allocated to, with lighter shades being earlier and darker later.

The first video shows a random walk model, in which a seed location is chosen for the first allocation.  The next location is a randomly selected neighbour of the current location, and the process repeats.  If a location has no available neighbours (all are full or already contain this label) then it backtracks until it finds one it can allocate to (backtracking can be from the last allocated, the first, or randomly selected).   If there are no possible allocations then it will try a new seed location.  One can also set a probability to randomly reseed even if the window has not be fully allocated to.   



The second uses a diffusion process.  This also uses a seed location, but considers the neighbours of all of its current locations for the next allocation.  As with the random walk, this process is also constrained by the filled and previously allocated groups, and will reseed if necessary or at random.   There is no backtracking as it is not needed since there is no "path" being followed. 



The final model is a proximity allocation process.  A seed location is chosen, and a spatial condition is then used to select a window of neighbours.   The example here uses a circle of a three cell radius, but any supported condition could be used such as predefined regions represented using a shapefile.  Once all groups in the window have been allocated to, excluding fully allocated groups, a new seed location is chosen. 



As with the main rand_structured algorithm, any unallocated incidences are handled using a swapping algorithm (see details in the previous post).  This ensures richness targets are met, at the cost of some loss of spatial contiguity in the randomly generated distributions.  

The outliers for this proximity allocation randomisation are due to the swapping algorithm used to reach the per-group richness targets.



The above randomisations are all implemented under the rand_spatially_structured randomisation, for which parameters can be set to model each approach.  However, setting parameters can be complicated and error prone so there are variants called rand_diffusion and rand_random_walk that restrict the set of arguments needed and set some useful defaults, but otherwise are just wrappers around rand_spatially_structured (which itself is implemented in Biodiverse as a particular case of the rand_structured algorithm). There is not currently a wrapper for the proximity allocation, but one could be added if needed.  The next few screenshots show the parameter settings.   


The options to control the spatially structured randomisations are marked with the red and blue arrows.  The spatial condition (marked by the red arrow) determines which neighbours of the considered group will be allocated to next.  This example uses a square, but it could be any of the supported conditions.  The other parameters (marked with the blue arrow) control reseeding, backtracking, and allocation order.  

The rand_diffusion parameters are almost identical to those for rand_spatially_structured, as the former is just a special case of the latter and this simplifies setting it up.   

As with the rand_diffusion parameters, and for the same reason, the parameters for rand_random_walk are also almost identical to those for rand_spatially_structured.   

As an aside, if you want a good introduction to the general process of dynamic spatial modelling, then have a look at O'Sullivan and Perry (2013).  There is also a good introduction to the limits of the complete spatial randomness model in O'Sullivan and Unwin (2010).  

To sum up, there is more to randomisations than just randomly shuffling the data around.  Biodiversity patterns are normally spatially structured, so finding patterns that are significantly different from those that are completely random is often not much of a challenge.  Adding constraints such as matching species richness patterns is very useful, but one can go further and introduce even more spatial structure, making the tests "harder to pass" (an example is in Laffan and Crisp, 2003).  As noted by Gotelli (2001), trying multiple hammers is a useful thing, and to quote that paper: "The advantage of null models is that they provide flexibility and specificity that cannot often be obtained with conventional statistical analyses".  Using multiple approaches might be regarded as a fishing expedition, but it does allow one to generate a distribution of models and thus deeper understanding of the patterns one is analysing.  And fish are nutritious.


Shawn Laffan

19-Nov-2020


For more details about Biodiverse, see http://shawnlaffan.github.io/biodiverse/

To see what else Biodiverse has been used for, see https://github.com/shawnlaffan/biodiverse/wiki/PublicationsList

You can also join the Biodiverse-users mailing list at http://groups.google.com/group/Biodiverse-users


Wednesday, 18 November 2020

Randomisations - how the rand_structured algorithm works

The default randomisation algorithm in Biodiverse is called rand_structured.  This has been available in Biodiverse since the beginning and has been used in several papers.  It forms the basis of the current examples of the CANAPE protocol, and an early variant was used in Laffan and Crisp (2003) which used an ancestral variant of Biodiverse.  

The rand_structured algorithm is so called because it maintain the structure of the original data, specifically the spatial distribution of the richness patterns, while randomly allocating taxa (labels) to groups (cells).  By default, the richness patterns are matched exactly, but users also have the option to allow tolerances such as an additional number of labels per group (e.g. five extra), or some multiple of the observed (e.g. 1.5 times as many).  Irrespective of these tolerances, only as many label/group combinations are allocated as are observed in the original data.   If one thinks of the data as a site by species matrix then, if the richness patterns are matched exactly, then the row and column sums are identical between the original and randomised matrix.  If tolerances are used then the row and column sums will not match.  In both cases, however, the number of non-zero entries (incidences) across the matrix is constant.

The rand_structured randomisation uses a filling algorithm, followed by a swapping algorithm to reach its richness targets.  

The filling process selects a label at random and allocates all of its occurrences to groups (cells), also at random.  Thinking spatially, it is scattering the incidences randomly across the map (but keep in mind that your data do not need to be spatial for them to be analysed).  The filling algorithm will not consider any groups that have already reached their richness targets. 

At the end of the allocation process there will normally be a set of labels that have not yet been fully allocated, because they are already allocated to the unfilled groups, and all the other groups are full (have reached their richness targets).  This is where the swapping algorithm comes in. 

In the swapping process, an unallocated label is selected at random (call it label1).  The set of filled groups that do not contain this label is then searched to find a label that can be allocated to one of the unfilled groups (call it label2).  Once that is done, the label2 incidence is moved to the unfilled group, and label1 is added to the previously filled group.  This is repeated until all incidences have been allocated.  

All the above might be easier to understand if presented visually.  The red cells in the image below represent the distribution of Acacia tetragonophylla across Australia at a 50 km resolution.  Grey cells contain one or more other Acacia species.  

The video below it shows the allocation of the A tetragonophylla incidences, with the intensity of the red representing when in the sequence they were allocated (lighter is earlier, darker is later).  Note how there is no apparent spatial pattern in the allocation, but remember that the richness patterns for each group (cell) are being preserved. 



The distribution of Acacia tetragonaphylla (red cells).  Grey cells have at least one other Acacia species.  Data are from Mishler et al. (2014), and cell size is 50 km.  






Some readers might be thinking that the search for labels and groups to swap can take some time.  However, the process is much faster than simply thrashing around, randomly swapping labels until they have all been allocated. In fact, a huge amount of work has been spent over the years to optimise the filling and swapping code to ensure that it runs fast and scales linearly (or sub-linearly) as the number of labels and groups increases (and sometimes is unaffected). Much of this is done using indexing and tracking lists to reduce searching, a memory-for-speed trade-off that is in some ways key to the Biodiverse development philosophy.  As a result of this work, the current implementation is several orders of magnitude faster than what it was eight years ago.


Note that, as small caveat to the above, it is possible there are small deviations of the description from the implementation.  In such cases it is the code that is canonical.  (This is somewhat like descriptive comments in a computer program being what the programmer wanted some code to do, not necessarily what it actually does).


Shawn Laffan

18-Nov-2020


For more details about Biodiverse, see http://shawnlaffan.github.io/biodiverse/

To see what else Biodiverse has been used for, see https://github.com/shawnlaffan/biodiverse/wiki/PublicationsList

You can also join the Biodiverse-users mailing list at http://groups.google.com/group/Biodiverse-users


Thursday, 12 November 2020

Randomisations now also generate z-scores

Randomisations in Biodiverse have always generated rank-relative scores from which significance can be derived. See for example this previous post: https://biodiverse-analysis-software.blogspot.com/2016/08/easier-to-use-randomisation-results.html 

However, sometimes users want to use z-scores to assess a randomisation. Several other tools such as phylocom also calculate z-scores, and often these are the default.

From version 4 of Biodiverse, z-scores are now also be calculated for randomisations. More details about what happens are in the updated help page, but a short summary is given here. 

For each iteration of the randomisation, each analysis is regenerated using the same set of spatial conditions, but using the randomised version of the basedata. The values of each index for each group (cell) are then checked, and running tallies are kept for the sum of randomised index values as well as the sum of squared index values. These sums, along with the count of iterations, allows the calculation of the mean (sum of x / count) and standard deviation using the running variance method. 

Once the randomisations have completed, the z-score for each index value for each group can be calculated in the usual way as (observed - expected / sqrt (variance)). Interpretation is the same as any other z-score, with values outside the interval [-1.96,1.96] being significant for two tailed test with an alpha of 0.05, providing the number of samples is large. Note that the strictly correct value depends on the degrees of freedom, which depends on the number of samples. But why would you run a small number of iterations? 

The z-scores are calculated automatically whenever a randomisation is run.  The results are collated for each group and are accessible through the ">>z_scores>>" lists.  In this case Rand1 is the name of the randomisation, hence the list name is Rand1>>z_scores>>SPATIAL_RESULTS. 



If you want to add z-scores to an existing randomisation then you will need to run at least one extra randomisation iteration to trigger their calculation. If there is demand to do so without the extra iteration then we can add an option for it.  (Consider the relative numeric difference between 999 and 1000 iterations, without worrying about aesthetics such as the number of observed plus random realisations being a multiple of 10).  


Shawn Laffan, 12-Nov-2020



For more details about Biodiverse, see http://shawnlaffan.github.io/biodiverse/ 

For the full list of changes in the 3.99 development series, leading to version 4, see https://github.com/shawnlaffan/biodiverse/wiki/ReleaseNotes#version-399-dev-series 

To see what else Biodiverse has been used for, see https://github.com/shawnlaffan/biodiverse/wiki/PublicationsList 

You can also join the Biodiverse-users mailing list at http://groups.google.com/group/Biodiverse-users

Friday, 6 November 2020

Updated handling of Cluster and Region Grower analyses in randomisations

Randomisations in Biodiverse are used to assess the statistical significance of a set of analysis results given some randomisation scheme such as shuffling the species around the map (labels across groups in Biodiverse terms), subject to constraints such as each cell (group) must maintain the same number of species. Randomisations are key to interpreting where results differ from what would be expected, and are integral to protocols such as the Categorical Analysis of Paleo and Neo Endemism (CANAPE)


The basic process of the randomisation is this.  For each iteration of the randomisation analysis, Biodiverse will:
  1. Create a new basedata object with a random allocation of labels to groups
  2. For each analysis in the basedata 
    1. Regenerate a version using the randomised basedata
    2. Compare the values of the analyses from the original and randomised basedata and track if they are higher or lower, on a cell by cell basis
    3. Track basic statistics of the distribution to allow the calculation of the mean and standard deviation, and thus z-scores (this is new in Version 4 - there will be more on this in another post).

The tracking is used to reduce memory usage, as the randomised basedatas and outputs can be discarded as soon as they have been collated.  This is more efficient than keeping all the results in memory for later ranking, especially with large data sets such as those used for Mishler et al. (2020).

A huge amount of work has been spent over the last several years to make the randomisation process go faster and scale better with larger data sets, with recent versions being orders of magnitude faster than some of the earlier versions. In some cases analyses now take minutes where they used to take days. However, one slow-down that still affects things in version 3.1 is the effect of tree structures in a basedata. These are the Cluster and Region Grower analyses. 

The reason the Cluster and Region Grower analyses slow things down is simply that they take longer to calculate.  A spatial analysis only need to pass over each processing group (cell) once, so one can think of it as N calculations.  In comparison, a cluster analysis will compare each group with each other group to create its matrix.  If one is clustering a full data set then this will be N(N-1)/2 calculations, but the same scaling effect applies for subsets defined using definition queries (e.g. for CANAPE).  Even when one considers that the spatial analysis might include many neighbours for each group, the Cluster and Region Grower analyses take much longer.  And then the Cluster analysis needs to process the matrix.

Even with the slow-down, a comparison of the randomised Cluster or Region Grower analysis with the original is not always informative.  

In Biodiverse Version 4, this process has been changed.  Cluster and Region grower analyses are now skipped by default.  This will substantially speed up randomisations containing such analyses.  

If you still want to run them then there is an option to do so.  See the red arrow in the screenshot below.  


Randomisations of Cluster and Regions Grower analyses is off by default in Version 4, but can be re-enabled if needed (see red arrow).



One point to note is that any calculations per node (branch) are still done.  This was modified some time ago to use the set of randomised groups under each branch from the original, non-randomised tree.  This works because the group IDs are the same across both the original and the randomised basedatas.  It is just that the randomised version has a randomly allocated set of labels. 



Shawn Laffan
06-Nov-2020



For more details about Biodiverse, see http://shawnlaffan.github.io/biodiverse/ 

For the full list of changes in the 3.99 development series, leading to version 4, see https://github.com/shawnlaffan/biodiverse/wiki/ReleaseNotes#version-399-dev-series 


To see what else Biodiverse has been used for, see https://github.com/shawnlaffan/biodiverse/wiki/PublicationsList 


You can also join the Biodiverse-users mailing list at http://groups.google.com/group/Biodiverse-users