Friday 6 November 2020

Updated handling of Cluster and Region Grower analyses in randomisations

Randomisations in Biodiverse are used to assess the statistical significance of a set of analysis results given some randomisation scheme, such as shuffling the species around the map (labels across groups in Biodiverse terms), subject to constraints such as each cell (group) retaining the same number of species. Randomisations are key to interpreting where results differ from what would be expected, and are integral to protocols such as the Categorical Analysis of Paleo and Neo Endemism (CANAPE).

The basic process of the randomisation is this.  For each iteration of the randomisation analysis, Biodiverse will:
  1. Create a new basedata object with a random allocation of labels to groups
  2. For each analysis in the basedata 
    1. Regenerate a version using the randomised basedata
    2. Compare the values of the analyses from the original and randomised basedata and track if they are higher or lower, on a cell by cell basis
    3. Track basic statistics of the distribution to allow the calculation of the mean and standard deviation, and thus z-scores (this is new in Version 4 - there will be more on this in another post).

The tracking is used to reduce memory usage, as the randomised basedatas and outputs can be discarded as soon as they have been collated.  This is more efficient than keeping all the results in memory for later ranking, especially with large data sets such as those used for Mishler et al. (2020).
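The steps above can be sketched in a few lines of Python. This is only an illustration of the idea and not Biodiverse's actual code: the names `CellTracker`, `randomise` and `toy_index` are hypothetical, the "original higher than randomised" convention for the counts is an assumption, and the randomisation scheme is a toy one that keeps each group's richness but nothing else. The running mean and variance use Welford's method, so each randomised result can be discarded as soon as it has been collated.

```python
import math
import random


class CellTracker:
    """Running tallies for one group (cell); no per-iteration values are kept."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0          # sum of squared deviations (Welford's method)
        self.count_higher = 0  # iterations where original > randomised
        self.count_lower = 0   # iterations where original < randomised

    def update(self, original, randomised):
        self.n += 1
        if original > randomised:
            self.count_higher += 1
        elif original < randomised:
            self.count_lower += 1
        delta = randomised - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (randomised - self.mean)

    def z_score(self, original):
        if self.n < 2:
            return None
        sd = math.sqrt(self.m2 / (self.n - 1))
        return (original - self.mean) / sd if sd > 0 else None


def randomise(basedata, analysis, iterations, seed=0):
    """basedata maps group ID -> set of labels; analysis maps basedata -> {group: value}."""
    rng = random.Random(seed)
    original = analysis(basedata)
    trackers = {group: CellTracker() for group in basedata}
    pool = sorted(set().union(*basedata.values()))
    for _ in range(iterations):
        # Toy scheme: each group keeps its richness but gets random labels.
        rand_bd = {g: set(rng.sample(pool, len(labels)))
                   for g, labels in basedata.items()}
        rand_result = analysis(rand_bd)
        for group, tracker in trackers.items():
            tracker.update(original[group], rand_result[group])
        # rand_bd and rand_result are replaced on the next pass, so memory stays flat
    return original, trackers


# Hypothetical index: sum of made-up per-label weights in each group.
WEIGHTS = {"sp1": 1.0, "sp2": 2.0, "sp3": 4.0, "sp4": 8.0}


def toy_index(basedata):
    return {g: sum(WEIGHTS[label] for label in labels)
            for g, labels in basedata.items()}


basedata = {"cell_a": {"sp1", "sp2"}, "cell_b": {"sp3"},
            "cell_c": {"sp2", "sp3", "sp4"}}
original, trackers = randomise(basedata, toy_index, iterations=200)
for group in sorted(trackers):
    t = trackers[group]
    print(group, original[group], t.count_higher, t.count_lower,
          t.z_score(original[group]))
```

Only a handful of numbers are stored per group regardless of how many iterations are run, which is the point of tracking rather than keeping every randomised output for later ranking.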

A huge amount of work has gone into making the randomisation process faster and better at scaling to larger data sets over the last several years, with recent versions being orders of magnitude faster than some of the earlier versions. In some cases analyses now take minutes where they used to take days. However, one slow-down that still affects things in version 3.1 is the presence of tree-based outputs in a basedata, namely the Cluster and Region Grower analyses.

The reason the Cluster and Region Grower analyses slow things down is simply that they take longer to calculate.  A spatial analysis only needs to pass over each processing group (cell) once, so one can think of it as N calculations.  In comparison, a cluster analysis will compare each group with each other group to create its matrix.  If one is clustering a full data set then this will be N(N-1)/2 calculations, and the same scaling effect applies for subsets defined using definition queries (e.g. for CANAPE).  Even when one considers that the spatial analysis might include many neighbours for each group, the Cluster and Region Grower analyses take much longer.  And then the Cluster analysis needs to process the matrix.
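To put rough numbers on the difference, the short Python sketch below compares the N passes of a spatial analysis with the N(N-1)/2 pairwise comparisons needed to build a cluster matrix. The function names are made up for illustration, and the counts ignore neighbourhood sizes and the later processing of the matrix.

```python
def spatial_calculations(n_groups):
    """One pass per processing group."""
    return n_groups


def cluster_matrix_calculations(n_groups):
    """One comparison per unordered pair of groups."""
    return n_groups * (n_groups - 1) // 2


for n in (100, 1_000, 10_000):
    print(n, spatial_calculations(n), cluster_matrix_calculations(n))
# 100 100 4950
# 1000 1000 499500
# 10000 10000 49995000
```

The ratio between the two is (N-1)/2, so the gap widens in proportion to the number of groups.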

On top of the slow-down, a comparison of the randomised Cluster or Region Grower analysis with the original is not always informative.

In Biodiverse Version 4, this process has been changed.  Cluster and Region Grower analyses are now skipped by default.  This will substantially speed up randomisations containing such analyses.

If you still want to run them then there is an option to do so.  See the red arrow in the screenshot below.  

Randomisation of Cluster and Region Grower analyses is off by default in Version 4, but can be re-enabled if needed (see red arrow).

One point to note is that any calculations per node (branch) are still done.  This was modified some time ago to use the set of randomised groups under each branch from the original, non-randomised tree.  This works because the group IDs are the same across both the original and the randomised basedatas.  It is just that the randomised version has a randomly allocated set of labels. 
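A minimal Python sketch of this idea follows. It assumes a tree stored as a plain dictionary of node name to child names, with group IDs as the tips, and uses label richness as a stand-in for whatever per-node calculation is being run. The data structures and the function names `terminal_groups` and `node_richness` are hypothetical and are not the Biodiverse API.

```python
def terminal_groups(tree, node):
    """Return the set of group IDs at the tips below `node`."""
    children = tree.get(node, [])
    if not children:
        return {node}  # a tip: the node is itself a group ID
    groups = set()
    for child in children:
        groups |= terminal_groups(tree, child)
    return groups


def node_richness(tree, node, basedata):
    """Count distinct labels across all groups under `node`."""
    labels = set()
    for group in terminal_groups(tree, node):
        labels |= basedata[group]
    return len(labels)


# Tree built once, from the original (non-randomised) data.
tree = {"root": ["n1", "c"], "n1": ["a", "b"]}

original = {"a": {"sp1", "sp2"}, "b": {"sp3"}, "c": {"sp1"}}
# Same group IDs and richness per group, labels reallocated.
randomised = {"a": {"sp1", "sp3"}, "b": {"sp1"}, "c": {"sp2"}}

print(node_richness(tree, "n1", original))    # 3
print(node_richness(tree, "n1", randomised))  # 2
```

The tree is never rebuilt here. The same branch membership is simply looked up against each randomised basedata, which is far cheaper than regenerating the cluster matrix and tree on every iteration.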

Shawn Laffan

For more details about Biodiverse, see 

For the full list of changes in the 3.99 development series, leading to version 4, see 

To see what else Biodiverse has been used for, see 

You can also join the Biodiverse-users mailing list at 
