Monday, 7 December 2020

Biodiverse now includes the independent swaps randomisation algorithm

Randomisations are one of the key analyses in Biodiverse.  They give an assessment of whether an observed result is significantly different from a distribution of randomly generated versions.   Several recent posts have described how the spatially structured randomisations work, including the (default) rand_structured algorithm and its spatially structured variants. These can also be spatially constrained.  

These approaches have been available in Biodiverse for some time: rand_structured since the very beginning, spatial constraints since version 1, and the spatially structured variants since version 2.  

One of the general principles when developing Biodiverse is to provide options for users[1].  In that light, version 4 of Biodiverse will include the independent swaps algorithm.  

How does independent swaps work?  

A short summary is below, and more details and assessments are in Gotelli (2000) and Miklos & Podani (2004).  Code for the R implementation is available through the Picante package.  

  1. Pick two different groups at random (call them group1 and group2)
  2. Pick two different labels at random (call them label1 and label2)
  3. If label1 is not in group1, or label2 is not in group2, or if label1 is already in group2, or label2 is already in group1  
    1. Then they cannot be swapped, so switch label1 and label2.  
    2. If the switched pairs cannot be swapped using the condition in #3 then go back to step #1
  4. Swap the labels between groups
  5. Increment the number of successful swaps by 1
  6. Start back at step #1


This is a very simple algorithm, and simple has many advantages when it comes to implementation.  However, it is also a brute-force approach.  The fact that the pairs are selected at random means there is no consideration of "swappability".  Consider the case of a site by species matrix where most of the matrix entries are zero, such as where most taxa are highly range restricted.  In that case, there is a very high chance that the algorithm will select a pair that cannot be swapped and must retry with a different selection, e.g. because neither label is in either group.  This can lead to many wasted CPU cycles.  As an unflattering description one might call this a "throw enough mud at a wall and some of it will stick" approach.  Keep in mind, though, that for many cases this will work well enough, and the results are most definitely random - good enough is good enough.  

Another issue with the independent swaps approach is that one can never be quite certain how many iterations are needed to fully randomise a data set.  Miklos & Podani (2004) suggest twice the number of non-zero entries in the matrix, but I am not aware of whether that has been rigorously assessed.  Too few iterations and some of the original structure will be retained.  Too many and the analysis might take too long (even for very patient people)[2].

There is also no stopping criterion except the total number of iterations.  It is therefore possible for the algorithm to be trapped in an infinite loop when none of the group/label pairs can be swapped, as the iteration count is only incremented on a successful swap.   

In light of the above points, the implementation of independent swaps in Biodiverse has three parameters, two of which lead to early stopping.  

  1. The default number of swaps is set to be twice the number of non-zero cells (if one thinks of the data as a site by species matrix).  This value follows Miklos & Podani.  Users can specify any positive integer here.  The algorithm will stop when this value is reached.  
  2. The maximum number of iterations defaults to 100 times the number of swaps, but can also be specified as a positive integer.  If this number is exceeded then the algorithm stops, regardless of the number of swaps completed.  (Note that this value will likely be increased in the final Biodiverse 4 release, see below for why).  
  3. The number of times each label/group pair in the original matrix is swapped out is tracked.  The algorithm ends once each label/group pair has been swapped at least once.  This is off by default.

The last case uses the assumption that, once a pair has been randomly swapped, there is no need to swap it any further as it will not be any more random.  If your random number generator is not good then more swaps can actually make things worse (but his is unlikely to be a problem as most systems use good random number generators now). 

A modified independent swaps algorithm

As noted in a previous post, a huge amount of effort has gone into optimising the randomisations in Biodiverse to make them run faster.  Armed with this knowledge, the independent swaps algorithm has also been optimised.  The primary aim here has been to reduce the search for swappable label/group pairs, at the cost of extra memory to store extra index lists.  

An overview of the algorithm is this:

  1. Select a group using a multinomial probability distribution, with the probabilities being the richness scores of each group.  Call this group1.
  2. Select label1 from the labels in group1 using a uniform random distribution (each label in group1 is equiprobable).
  3. Select a group that does not already contain label1, again using a multinomial probability distribution.  Call this group2.
  4. Select label2 from the labels in group2 using a uniform random distribution (each label in group2 is equiprobable).
    1. If needed, repeat #4 until label2 is not one that is already in group1.
  5. Swap label1 to group2, and label2 to group1.
  6. Update the multinomial probability distributions.
  7. Increment the swap count.
  8. Go to step #1.

The multinomial distributions are used to replicate the chances of selecting a non-zero matrix entry when selecting entirely at random.  The use of the "absence" lists in step #3 also focuses the search to swappable pairs.  For example, a very wide ranging label (species) will occur in most groups so there is a high probability that the normal algorithm will select a group1 and group2 that both contain that label.

There is clearly a memory burden when one considers the "absence" tracking lists.  However, memory is cheap these days and the data sets must be relatively enormous for it to make too much difference.  Perl, in which Biodiverse is programmed, also has a copy-on-write mechanism to save memory when dealing with large strings (e.g. a copy of a 100MB string can be stored using a link to the original, and only needs to be really copied when it is changed in some way or is written out).  

The modified implementation has the same default stopping parameters as the unmodified implementation.

How long do they take?

So how fast are they?  The table below summarises a benchmark run with the Acacia data set from Mishler et al. (2014).  This has 506 species (labels) distributed across 3037 cells (groups).   

Both the original and modified Biodiverse implementations are used, as is that from the R Picante library and a variant implemented in pure R (and run using R version 4.03).  All were run on the same computer.  The Biodiverse runs used the GUI, the R runs were under RStudio.  Code for the R runs is available at the Biodiverse GitHub repo.  

Only one to five runs of each combination was used, and values are rounded off.  Ideally more would be run, but the results are consistent when re-run and the differences are of sufficient magnitude to use for a blog post.

There are 29,175 non-zero entries in the site by species matrix, so approximately 81% of the matrix is empty.  The target number of swaps for the independent swaps runs was set to twice the number of non-empty matrix cells, 58,350, which as noted above is the default in Biodiverse and follows Miklos and Podani (2004).

Values in the table are sorted by run time in seconds.  The "early stop" column indicates if swapping is stopped once all label/group pairs (matrix cells) have been swapped at least once.  "Num swaps" is the number of swaps actually run, "cells swapped" is the count of matrix cells swapped at least once, and is only tracked when early stopping is in use.  "Swap attempts" is how many iterations were used overall, including those where a swap was not possible.  "max attempts" is the maximum number of swap attempts before giving up, with the default being 100 times the target number of swaps.  The larger number is 2^31-1, which is the maximum value for a signed 32 bit integer.    




The main takeaway from the table is that the rand_structured algorithm is fastest of all those run, and is much faster than any of the unmodified independent swaps runs.  This difference will be more pronounced as the data set size increases.   That said, the rand_structured algorithm has been through many rounds of optimisation, so the difference might be reduced with further optimisations to the other algorithms. 

The times for the modified independent swaps are aggregates of iterations 2-5, with the value in brackets being the first iteration when many of the lists are set up and cached for later reuse.  

The modified independent swaps algorithm reaches its target with much less "wastage" of iterations, hence it is fastest of the independent swaps.  This is further aided by the early stopping condition.  

For the unmodified independent swaps algorithm, more than 24,600,000  swap attempts are needed to ensure all matrix cells have been swapped.  This is ~843 times the number of non-zero matrix cells.  The two fastest unmodified runs reach the maximum number of swap attempts when the number of actual swaps is still very low, at ~6% of the target number of swaps.

The unbounded independent swaps run in Biodiverse takes nearly seven minutes.  If used with 999 randomisation iterations in sequence then the run time will be nearly 4.8 days for the randomisations alone (not including running the various analyses for the randomised basedatas).  A faster computer could probably reduce it to 3 days sequential time...

It should also be noted that the Biodiverse implementations are all in Perl.  The Picante implementation is written in C++, which is much faster than languages like Perl, R and Python (they themselves are written in C or C++).  

Despite not being written in C, the Biodiverse modified independent swaps algorithm is two to five times faster than the Picante implementation for this data set.  The modified algorithms do more work per iteration, but run fewer iterations overall.  The Perl implementation could also use be reimplemented in C, but the flexibility of the rand_structured approach (see below) means any such efforts will likely be directed there. 

As a final note regarding speeds, the pure R implementation could be made faster.  Profiling shows that most of the processing time is spent in the sample.int() function.  Even so, it would still not be faster than the Picante implementation so the only advantage would be if early stopping were implemented.


Spatially structured independent swaps

As a final note, an issue with the independent swaps algorithm is that it is very difficult to add spatial constraints to model diffusion processes or random walks.  One can apply spatial constraints to the swapping, i.e. an incidence must be swapped with another within some distance, but that has not been implemented.  If there is sufficient demand, and perhaps funding or a code contribution, then it could be added.

Keep in mind, though, that you can apply spatial constraints so any swapping is done within sub-datasets, for example within regions.  This process is described in the spatially constrained randomisation post.  


Footnotes

[1]  One of the general principles when developing Biodiverse is to provide options for users.  This can sometimes lead to a huge array of options, for example the indices.  However, focusing on that example, many of the indices are provided to allow insights into how others are calculated, such as showing the relative contribution of each label to an endemism score

[2] Speaking of patient, Biodiverse is written for and by the impatient - impatience is one of the three virtues of programming, after all.


Shawn Laffan

07-Dec-2020


For more details about Biodiverse, see http://shawnlaffan.github.io/biodiverse/

To see what else Biodiverse has been used for, see https://github.com/shawnlaffan/biodiverse/wiki/PublicationsList

You can also join the Biodiverse-users mailing list at http://groups.google.com/group/Biodiverse-users



Monday, 23 November 2020

Spatially partition your randomisations

In previous posts I described how the rand_structured algorithm works, and then how one can use spatially structured randomisations to compare observed patterns with stricter null models.   

This post is concerned with how one can spatially partition a randomisation.  Some of these details are in a previous blog post, but this post is a much more detailed description. 

In the standard implementation of a randomisation, in Biodiverse and I suspect more generally also, labels are randomly allocated to any group across the entire data set.  This works well for many studies and is very effective.  However, when one begins to scale analyses to larger extents, for example North America, Asia or globally, then the total pool of labels begins to span many different environments.  The randomisations could allocate a polar taxon to the tropics, and a desert taxon to a rainforest.  This does not make the randomisation invalid, but it is perhaps not as strict as it could be.  

The effect is perhaps best considered with a scenario using phylogenetic diversity where polar and tropical taxa are distinct clades on the tree.   For a randomisation that allocates labels anywhere across the data set, one will commonly end up with a random realisation containing many groups with a mixture of polar and tropical taxa.  The PD on the random data set will be higher than on the observed data set in almost all cases because both clades are sampled and thus more of the tree is represented.  Note that this does not make the result wrong - it means that, when compared with a random sample from the full pool of taxa, the observed PD is less than expected.  That is useful to know, but the next question to ask is "are the patterns in the tropics higher or lower than expected for the tropics?", and the same for the polar taxa.  

One could quite easily remove any groups outside a region of interest and then rerun the analysis.  However, then the ranges of the labels will be reduced and any endemism analyses will not be comparable with the larger analyses.  

This can be readily fixed in Biodiverse by specifying a spatial condition to define subsets. Or, if you only want to randomise a subset of your data while holding the rest as-observed, you can use a definition query.    The spatial condition approach was used in Mishler et al. (2020) to (very approximately) partition Canada, the US and Mexico to assess sensitivity of the results for the full data set.  The definition query was used in Allen et al. (2019) to only randomise values within Florida, while holding the adjacent regions constant.

How does it work?  Both approaches slice the data into subsets, apply the randomisation algorithm to each subset independently, and then reassemble the randomised subsets into a full randomised basedata that is then used for the comparisons.  The difference for the definition query is that the data is divided into two sets, and only the subset that passes the query is randomised.

The definition query is run first.  This means that, if you specify both a definition query and a spatial condition, then only the groups that pass the definition query are considered for randomisations, and those randomisations will be spatially partitioned.

This approach is general, and works for any of the randomisation algorithms in Biodiverse.  Note, however, it will not apply different randomisation algorithms in different subsets.  If there is a good reason to do so then we can look at it, but it will make the interface much more complex to implement.  

One other point to note is that, for the structured randomisations, the swapping algorithm is applied within the subsets.  Each subset is run to completion before they are all aggregated.


Some images will work better than text, so here are some clipped screenshots from Biodiverse.

The observed distribution of Acacia tenuissima (red cells), with outlines of Australian states and territories in blue.  Acacia data are from Mishler et al. 2014, and the cell size is 50 km.  

A tenuissima incidences in a randomisation constrained such that each incidence is allocated within the polygon that contains it. Note the absence of incidences in NSW, Victoria and Tasmania, and that there are only three in South Australia.  This is consistent with the observed data.

A tenuissima records randomised using a definition query so that only incidences within Western Australia are randomised.  All others are kept unchanged. Compare with the observed distribution above.   


So how does one use it?  It is done as part of the standard interface (this functionality has actually been in Biodiverse since version 1, but the interface was slightly reconfigured in version 2).  The user specifies spatial conditions using the same syntax as for the spatial analyses.   


Subsets to randomise independently are defined using a spatial condition (red arrow), while the definition query (blue arrow) is used to randomise a subset of the data.

The choice of condition is up to the user, and they can specify whatever condition they like (it is their analysis, after all...).  Generally speaking, though, it is better to use a condition that generates non-overlapping regions.  One that uses a shapefile would fit with many geographic cases.  Many conditions will work but might not be very sensible, for example overlapping circles.  Each group is allocated to only one subset, so in such overlapping cases it is the first one that contains the group "wins" the group.

One thing to watch for, and which is fixed for version 4, is that if the spatial condition does not capture all groups then the system will throw an error.  The workaround for version 3 and earlier is to use a definition query that matches (shadows) the spatial condition, although keep in mind that any groups outside the definition query will not be randomised.

To sum up, Biodiverse allows a high degree of complexity and fine grained control when using randomisations to assess the significance of observed patterns.  This post has described the spatial conditions and definition queries.  More features and approaches are described in other posts, and are grouped under the randomisation tag.  



Shawn Laffan

23-Nov-2020


For more details about Biodiverse, see http://shawnlaffan.github.io/biodiverse/

To see what else Biodiverse has been used for, see https://github.com/shawnlaffan/biodiverse/wiki/PublicationsList

You can also join the Biodiverse-users mailing list at http://groups.google.com/group/Biodiverse-users


Thursday, 19 November 2020

Randomisations - modelling spatial structure

In the previous post, I described how the rand_structured algorithm works.  If you don't feel like reading that post right now, then a short description is that it uses a filling algorithm to randomly allocate the labels in a basedata onto the groups of a new basedata, with the end result being random assemblages in each group, and where each group has the same number of labels as the original basedata by default. A swapping algorithm is then used to allocate any labels that could not be allocated using the filling process, usually because there were no unfilled groups that did not already contain these labels.

The filling algorithm has some nice advantages over other methods that also match the richness patterns, such as the independent swaps algorithm.  The main one is that filling algorithms are easily generalised to model spatial processes in their allocation, for example random walks and diffusion processes.

The videos below show some of these spatially structured randomisations at work, using Acacia tetragonophylla incidences from Mishler et al. (2014).  Note that the shade of red is used only to indicate when in the sequence a cell was allocated to, with lighter shades being earlier and darker later.

The first video shows a random walk model, in which a seed location is chosen for the first allocation.  The next location is a randomly selected neighbour of the current location, and the process repeats.  If a location has no available neighbours (all are full or already contain this label) then it backtracks until it finds one it can allocate to (backtracking can be from the last allocated, the first, or randomly selected).   If there are no possible allocations then it will try a new seed location.  One can also set a probability to randomly reseed even if the window has not be fully allocated to.   



The second uses a diffusion process.  This also uses a seed location, but considers the neighbours of all of its current locations for the next allocation.  As with the random walk, this process is also constrained by the filled and previously allocated groups, and will reseed if necessary or at random.   There is no backtracking as it is not needed since there is no "path" being followed. 



The final model is a proximity allocation process.  A seed location is chosen, and a spatial condition is then used to select a window of neighbours.   The example here uses a circle of a three cell radius, but any supported condition could be used such as predefined regions represented using a shapefile.  Once all groups in the window have been allocated to, excluding fully allocated groups, a new seed location is chosen. 



As with the main rand_structured algorithm, any unallocated incidences are handled using a swapping algorithm (see details in the previous post).  This ensures richness targets are met, at the cost of some loss of spatial contiguity in the randomly generated distributions.  

The outliers for this proximity allocation randomisation are due to the swapping algorithm used to reach the per-group richness targets.



The above randomisations are all implemented under the rand_spatially_structured randomisation, for which parameters can be set to model each approach.  However, setting parameters can be complicated and error prone so there are variants called rand_diffusion and rand_random_walk that restrict the set of arguments needed and set some useful defaults, but otherwise are just wrappers around rand_spatially_structured (which itself is implemented in Biodiverse as a particular case of the rand_structured algorithm). There is not currently a wrapper for the proximity allocation, but one could be added if needed.  The next few screenshots show the parameter settings.   


The options to control the spatially structured randomisations are marked with the red and blue arrows.  The spatial condition (marked by the red arrow) determines which neighbours of the considered group will be allocated to next.  This example uses a square, but it could be any of the supported conditions.  The other parameters (marked with the blue arrow) control reseeding, backtracking, and allocation order.  

The rand_diffusion parameters are almost identical to those for rand_spatially_structured, as the former is just a special case of the latter and this simplifies setting it up.   

As with the rand_diffusion parameters, and for the same reason, the parameters for rand_random_walk are also almost identical to those for rand_spatially_structured.   

As an aside, if you want a good introduction to the general process of dynamic spatial modelling, then have a look at O'Sullivan and Perry (2013).  There is also a good introduction to the limits of the complete spatial randomness model in O'Sullivan and Unwin (2010).  

To sum up, there is more to randomisations than just randomly shuffling the data around.  Biodiversity patterns are normally spatially structured, so finding patterns that are significantly different from those that are completely random is often not much of a challenge.  Adding constraints such as matching species richness patterns is very useful, but one can go further and introduce even more spatial structure, making the tests "harder to pass" (an example is in Laffan and Crisp, 2003).  As noted by Gotelli (2001), trying multiple hammers is a useful thing, and to quote that paper: "The advantage of null models is that they provide flexibility and specificity that cannot often be obtained with conventional statistical analyses".  Using multiple approaches might be regarded as a fishing expedition, but it does allow one to generate a distribution of models and thus deeper understanding of the patterns one is analysing.  And fish are nutritious.


Shawn Laffan

19-Nov-2020


For more details about Biodiverse, see http://shawnlaffan.github.io/biodiverse/

To see what else Biodiverse has been used for, see https://github.com/shawnlaffan/biodiverse/wiki/PublicationsList

You can also join the Biodiverse-users mailing list at http://groups.google.com/group/Biodiverse-users


Wednesday, 18 November 2020

Randomisations - how the rand_structured algorithm works

The default randomisation algorithm in Biodiverse is called rand_structured.  This has been available in Biodiverse since the beginning and has been used in several papers.  It forms the basis of the current examples of the CANAPE protocol, and an early variant was used in Laffan and Crisp (2003) which used an ancestral variant of Biodiverse.  

The rand_structured algorithm is so called because it maintain the structure of the original data, specifically the spatial distribution of the richness patterns, while randomly allocating taxa (labels) to groups (cells).  By default, the richness patterns are matched exactly, but users also have the option to allow tolerances such as an additional number of labels per group (e.g. five extra), or some multiple of the observed (e.g. 1.5 times as many).  Irrespective of these tolerances, only as many label/group combinations are allocated as are observed in the original data.   If one thinks of the data as a site by species matrix then, if the richness patterns are matched exactly, then the row and column sums are identical between the original and randomised matrix.  If tolerances are used then the row and column sums will not match.  In both cases, however, the number of non-zero entries (incidences) across the matrix is constant.

The rand_structured randomisation uses a filling algorithm, followed by a swapping algorithm to reach its richness targets.  

The filling process selects a label at random and allocates all of its occurrences to groups (cells), also at random.  Thinking spatially, it is scattering the incidences randomly across the map (but keep in mind that your data do not need to be spatial for them to be analysed).  The filling algorithm will not consider any groups that have already reached their richness targets. 

At the end of the allocation process there will normally be a set of labels that have not yet been fully allocated, because they are already allocated to the unfilled groups, and all the other groups are full (have reached their richness targets).  This is where the swapping algorithm comes in. 

In the swapping process, an unallocated label is selected at random (call it label1).  The set of filled groups that do not contain this label is then searched to find a label that can be allocated to one of the unfilled groups (call it label2).  Once that is done, the label2 incidence is moved to the unfilled group, and label1 is added to the previously filled group.  This is repeated until all incidences have been allocated.  

All the above might be easier to understand if presented visually.  The red cells in the image below represent the distribution of Acacia tetragonophylla across Australia at a 50 km resolution.  Grey cells contain one or more other Acacia species.  

The video below it shows the allocation of the A tetragonophylla incidences, with the intensity of the red representing when in the sequence they were allocated (lighter is earlier, darker is later).  Note how there is no apparent spatial pattern in the allocation, but remember that the richness patterns for each group (cell) are being preserved. 



The distribution of Acacia tetragonaphylla (red cells).  Grey cells have at least one other Acacia species.  Data are from Mishler et al. (2014), and cell size is 50 km.  






Some readers might be thinking that the search for labels and groups to swap can take some time.  However, the process is much faster than simply thrashing around, randomly swapping labels until they have all been allocated. In fact, a huge amount of work has been spent over the years to optimise the filling and swapping code to ensure that it runs fast and scales linearly (or sub-linearly) as the number of labels and groups increases (and sometimes is unaffected). Much of this is done using indexing and tracking lists to reduce searching, a memory-for-speed trade-off that is in some ways key to the Biodiverse development philosophy.  As a result of this work, the current implementation is several orders of magnitude faster than what is was eight years ago.


Note that, as small caveat to the above, it is possible there are small deviations of the description from the implementation.  In such cases it is the code that is canonical.  (This is somewhat like descriptive comments in a computer program being what the programmer wanted some code to do, not necessarily what it actually does).


Shawn Laffan

18-Nov-2020


For more details about Biodiverse, see http://shawnlaffan.github.io/biodiverse/

To see what else Biodiverse has been used for, see https://github.com/shawnlaffan/biodiverse/wiki/PublicationsList

You can also join the Biodiverse-users mailing list at http://groups.google.com/group/Biodiverse-users


Thursday, 12 November 2020

Randomisations now also generate z-scores

Randomisations in Biodiverse have always generated rank-relative scores from which significance can be derived. See for example this previous post: https://biodiverse-analysis-software.blogspot.com/2016/08/easier-to-use-randomisation-results.html 

However, sometimes users want to use z-scores to assess a randomisation. Several other tools such as phylocom also calculate z-scores, and often these are the default.

From version 4 of Biodiverse, z-scores are now also be calculated for randomisations. More details about what happens are in the updated help page, but a short summary is given here. 

For each iteration of the randomisation, each analysis is regenerated using the same set of spatial conditions, but using the randomised version of the basedata. The values of each index for each group (cell) are then checked, and running tallies are kept for the sum of randomised index values as well as the sum of squared index values. These sums, along with the count of iterations, allows the calculation of the mean (sum of x / count) and standard deviation using the running variance method. 

Once the randomisations have completed, the z-score for each index value for each group can be calculated in the usual way as (observed - expected / sqrt (variance)). Interpretation is the same as any other z-score, with values outside the interval [-1.96,1.96] being significant for two tailed test with an alpha of 0.05, providing the number of samples is large. Note that the strictly correct value depends on the degrees of freedom, which depends on the number of samples. But why would you run a small number of iterations? 

The z-scores are calculated automatically whenever a randomisation is run.  The results are collated for each group and are accessible through the ">>z_scores>>" lists.  In this case Rand1 is the name of the randomisation, hence the list name is Rand1>>z_scores>>SPATIAL_RESULTS. 



If you want to add z-scores to an existing randomisation then you will need to run at least one extra randomisation iteration to trigger their calculation. If there is demand to do so without the extra iteration then we can add an option for it.  (Consider the relative numeric difference between 999 and 1000 iterations, without worrying about aesthetics such as the number of observed plus random realisations being a multiple of 10).  


Shawn Laffan, 12-Nov-2020



For more details about Biodiverse, see http://shawnlaffan.github.io/biodiverse/ 

For the full list of changes in the 3.99 development series, leading to version 4, see https://github.com/shawnlaffan/biodiverse/wiki/ReleaseNotes#version-399-dev-series 

To see what else Biodiverse has been used for, see https://github.com/shawnlaffan/biodiverse/wiki/PublicationsList 

You can also join the Biodiverse-users mailing list at http://groups.google.com/group/Biodiverse-users

Friday, 6 November 2020

Updated handling of Cluster and Region Grower analyses in randomisations

Randomisations in Biodiverse are used to assess the statistical significance of a set of analysis results given some randomisation scheme such as shuffling the species around the map (labels across groups in Biodiverse terms), subject to constraints such as each cell (group) must maintain the same number of species. Randomisations are key to interpreting where results differ from what would be expected, and are integral to protocols such as the Categorical Analysis of Paleo and Neo Endemism (CANAPE)


The basic process of the randomisation is this.  For each iteration of the randomisation analysis, Biodiverse will:
  1. Create a new basedata object with a random allocation of labels to groups
  2. For each analysis in the basedata 
    1. Regenerate a version using the randomised basedata
    2. Compare the values of the analyses from the original and randomised basedata and track if they are higher or lower, on a cell by cell basis
    3. Track basic statistics of the distribution to allow the calculation of the mean and standard deviation, and thus z-scores (this is new in Version 4 - there will be more on this in another post).

The tracking is used to reduce memory usage, as the randomised basedatas and outputs can be discarded as soon as they have been collated.  This is more efficient than keeping all the results in memory for later ranking, especially with large data sets such as those used for Mishler et al. (2020).

A huge amount of work has been spent over the last several years to make the randomisation process go faster and scale better with larger data sets, with recent versions being orders of magnitude faster than some of the earlier versions. In some cases analyses now take minutes where they used to take days. However, one slow-down that still affects things in version 3.1 is the effect of tree structures in a basedata. These are the Cluster and Region Grower analyses. 

The reason the Cluster and Region Grower analyses slow things down is simply that they take longer to calculate.  A spatial analysis only need to pass over each processing group (cell) once, so one can think of it as N calculations.  In comparison, a cluster analysis will compare each group with each other group to create its matrix.  If one is clustering a full data set then this will be N(N-1)/2 calculations, but the same scaling effect applies for subsets defined using definition queries (e.g. for CANAPE).  Even when one considers that the spatial analysis might include many neighbours for each group, the Cluster and Region Grower analyses take much longer.  And then the Cluster analysis needs to process the matrix.

Even with the slow-down, a comparison of the randomised Cluster or Region Grower analysis with the original is not always informative.  

In Biodiverse Version 4, this process has been changed.  Cluster and Region grower analyses are now skipped by default.  This will substantially speed up randomisations containing such analyses.  

If you still want to run them then there is an option to do so.  See the red arrow in the screenshot below.  


Randomisations of Cluster and Regions Grower analyses is off by default in Version 4, but can be re-enabled if needed (see red arrow).



One point to note is that any calculations per node (branch) are still done.  This was modified some time ago to use the set of randomised groups under each branch from the original, non-randomised tree.  This works because the group IDs are the same across both the original and the randomised basedatas.  It is just that the randomised version has a randomly allocated set of labels. 



Shawn Laffan
06-Nov-2020



For more details about Biodiverse, see http://shawnlaffan.github.io/biodiverse/ 

For the full list of changes in the 3.99 development series, leading to version 4, see https://github.com/shawnlaffan/biodiverse/wiki/ReleaseNotes#version-399-dev-series 


To see what else Biodiverse has been used for, see https://github.com/shawnlaffan/biodiverse/wiki/PublicationsList 


You can also join the Biodiverse-users mailing list at http://groups.google.com/group/Biodiverse-users 

Tuesday, 24 March 2020

Publications using Biodiverse in 2019

2020 is now in full swing, so here is a list of publications from 2019 that used Biodiverse. 

If you want to see the full list (111 at the time of writing), then go to https://github.com/shawnlaffan/biodiverse/wiki/PublicationsList 

For more details about Biodiverse, see http://shawnlaffan.github.io/biodiverse/ 

Shawn Laffan
24-Mar-2020



Allen, J.M., Germain-Aubrey, C.C., Barve, N., Neubig, K., Majure, L.C., Laffan, S.W., Mishler, B.D., Owens, H.L., Smith, S.A., Whitten, W.M., Abbott, J.R., Soltis, D.E., Guralnick, R. and Soltis, P.S. (2019) Spatial phylogenetics of Florida vascular plants: The effects of calibration and uncertainty on diversity estimates. iScience, 11, 57-70.

Crouch, N.M.A., Capurucho, J.M.G., Hackett, S.J. and Bates, J.M. (2019) Evaluating the contribution of dispersal to community structure in Neotropical passerine birds. Ecography, 42, 390-399.
Dagallier, L.-P.M.J., Janssens, S.B., Dauby, G., Blach-Overgaard, A., Mackinder, B.A., Droissart, V., Svenning, J.-C., Sosef, M.S.M., Stévart, T., Harris, D.J., Sonké, B., Wieringa, J.J., Hardy, O.J. and Couvreur, T.L.P. (2019) Cradles and museums of generic plant diversity across tropical Africa. New Phytologist.

Daru, B.H., le Roux, P.C., Gopalraj, J., Park, D.S., Holt, B.G. and Greve, M. (2019) Spatial overlaps between the global protected areas network and terrestrial hotspots of evolutionary diversity. Global Ecology and Biogeography, 28, 757-766.

Fuentes-Castillo, T., Scherson, R.A., Marquet, P.A., Fajardo, P., Corcoran, D., Román, M.J. and Pliscoff, P. (2019) Modelling the current and future biodiversity distribution in the Chilean Mediterranean hotspot. The role of protected areas network in a warmer future. Diversity and Distributions, 25, 1897-1909.

Gallien, L., Thornhill, A.H., Zurell, D., Miller, J.T., and Richardson, D.M. (2019) Global predictors of alien plant establishment success: combining niche and trait proxies. Proceedings of the Royal Society B, 286, 20182477.

Garcia–R, J.C., Gonzalez-Orozco, C.E. and Trewick, S.A. (2019) Contrasting patterns of diversification in a bird family (Aves: Gruiformes: Rallidae) are revealed by analysis of geospatial distribution of species and phylogenetic diversity. Ecography, 42, 500-510.

Greenlon, A., Chang, P.L., Damtew, Z.M., et al. (2019) Global-level population genomics reveals differential effects of geography and phylogeny on horizontal gene transfer in soil bacteria. Proceedings of the National Academy of Sciences, 201900056.

Lam-Gordillo, O. and Ardisson, P-L. (2019) The global distribution and richness of frog crabs (Raninoidea). Crustaceana, 92, 1-31.

López-Aguirre, C., Hand, S.J., Laffan, S.W., & Archer, M. (2019) Zoogeographical regions and geospatial patterns of phylogenetic diversity and endemism of New World bats. Ecography, 42, 1188-1199.

McDonald-Spicer, C., Knerr, N.J., Encinas-Viso, F. and Schmidt-Lebuhn, A.N. (2019) Big data for a large clade: Bioregionalization and ancestral range estimation in the daisy family (Asteraceae). Journal of Biogeography, 46, 255-267.

Mekala, S., Donoghue, M.J., Farjon, A., Filer, D., Mathews, S., Jetz, W. and Leslie, A.B. (2019) Accumulation over evolutionary time as a major cause of biodiversity hotspots in conifers. Proceedings of the Royal Society B, 286, 20191887.

Shelley, J.J., Dempster, T., Le Feuvre, M.C., Unmack, P.J., Laffan, S.W. & Swearer, S.E. (2019) A revision of the bioregionalisation of freshwater fish communities in the Australian Monsoonal Tropics. Ecology and Evolution, 9, 4568-4588.

Tolley, K.A., da Silva, J.M. & Jansen van Vuuren, B. (2019) South African National Biodiversity Assessment 2018 Technical Report Volume 7: Genetic Diversity. South African National Biodiversity Institute, Pretoria. http://hdl.handle.net/20.500.12143/6376