Friday, 24 September 2021

Faster calculation of PhyloCom indices: NRI, NTI, MPD and MNTD

This post covers a bit more of the internals than we usually do.  Hopefully it is useful.  

From version 4, Biodiverse will use faster methods to calculate the indices in the PhyloCom set, with ultrametric trees seeing the greatest speedup.  These are the NRI, NTI, MPD and MNTD indices, which, with the acronyms expanded, are the Net Relatedness Index, the Nearest Taxon Index, the Mean Phylogenetic Distance and the Mean Nearest Taxon Distance.  These were originally implemented in the phylocom software (Webb et al. 2008), and many readers will be familiar with the R package picante.  

The MPD index is the mean length of the paths along the tree connecting each pair of tips in a sample.  The contribution of a branch is proportional to its length and the number of paths it forms part of.  

The MNTD index is the mean of the shortest path between each tip in a sample and the nearest tip that is also in the sample.  For a sample of ten tips there are only ten paths, but in the naïve case one needs to evaluate all paths to determine which is the shortest.  

One point to keep in mind is that branches in MPD and MNTD are counted one or more times, specifically once for each path they form part of.  This is in contrast with PD, where each branch counts only once.  The number of paths also increases quadratically with the number of tips in the sample.  For example, if there are ten tips then there will be 10*(10-1)/2=45 paths to connect all pairs, and if there are 10,000 tips then there are 49,995,000 paths.  To my mind these make PD a better index, but that discussion is for another day.  Certainly it is simpler to calculate.

The NRI and NTI are z-scores of the MPD and MNTD scores, respectively, and indicate if the paths are longer or shorter than expected given a random resampling of tips across the tree.  The resampling algorithm can vary, but the simplest is to use the same number of tips as is found in the sample. In other words it matches the richness of the observed sample, so if one has ten tips in an observed sample then each random iteration draws ten tips at random and calculates the MPD and MNTD score.  There are other random sampling algorithms, such as abundance weighted, but Biodiverse only implements the richness approach for NRI and NTI.

The final NRI z-score is calculated as (observed_MPD - mean_random_MPD) / standard_deviation_random_MPD.   Interpretation follows the usual z-score distribution, with values more extreme than +/-1.96 being in the outer 5% of the distribution and thus significant at alpha=0.05.  The same process applies to NTI, but using MNTD instead of MPD.  (One point of difference between Biodiverse and phylocom is that positive NRI and NTI values in Biodiverse correspond with values larger than expected, whereas in phylocom these are negative.)
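To make that concrete, here is a minimal Python sketch of the NRI calculation under the richness-matched resampling scheme (Biodiverse itself is written in Perl; the mpd_fn function and tree_tips list here are placeholders, not Biodiverse internals):

```python
import random
import statistics

def nri_z_score(observed_mpd, tree_tips, richness, mpd_fn, n_iters=999):
    """Z-score of observed MPD against random draws matching the sample richness."""
    rand_mpds = []
    for _ in range(n_iters):
        rand_tips = random.sample(tree_tips, richness)  # richness-matched draw
        rand_mpds.append(mpd_fn(rand_tips))
    mean_rand = statistics.mean(rand_mpds)
    sd_rand = statistics.stdev(rand_mpds)  # sample SD; the estimator choice is a detail
    # Biodiverse sign convention: positive means paths longer than expected
    return (observed_mpd - mean_rand) / sd_rand
```

The same function gives NTI if mpd_fn is swapped for an MNTD function.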

It is also worth noting that the random resamplings used for NRI and NTI in Biodiverse do not use its more general randomisation framework.  One can use the MPD and MNTD scores with such randomisations to try more complex or spatially constrained schemes.  For more on such randomisations see posts with the randomisation tag.

A key problem to date with the Biodiverse implementation is that these calculations are very slow, and become substantially slower as the size of the data sets increases (trees become deeper and have more tips).  The rest of this post describes some of the ways these have been substantially sped up in Biodiverse version 4.  Much of this optimisation work was done through code profiling with the excellent Devel::NYTProf, and also by implementing the algorithms described in Tsirogiannis et al. (2012, 2014, 2016) and implemented in the PhyloMeasures package (followed by more code profiling with Devel::NYTProf).

Find the Last Common Ancestor

The search for the last common ancestor (or LCA, also referred to as the Most Recent Common Ancestor and Last Shared Ancestor) between a pair of terminal branches is what takes the most time in the MPD and MNTD calculations.  This is a key step in calculating the path connecting two tips.  Biodiverse has always cached the path distance between a pair of branches so it only needs to be calculated once.  However, the process of finding the path took a non-trivial amount of time, something that was exacerbated when run under the random resampling process.  This has been optimised in several ways.

For ultrametric trees Biodiverse caches the same path distance between each pair of tips that share the LCA.  This pre-warming of the cache obviates the need to repeatedly find the same LCA in later checks.  This works because, as noted in Tsirogiannis et al. (2014), the distance from an internal branch to any of its tips is always the same for an ultrametric tree.

For non-ultrametric trees Biodiverse caches the last common ancestor for each pair of tips to save looking for it next time.  The distance is not calculated until it is needed, but Biodiverse also caches the cumulative path lengths from each tip to the root so there is no need to repeatedly traverse the tree to get the distance from a tip to the LCA.
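As a rough Python sketch of this second cache (the real code is Perl, and the node objects with parent and length attributes are assumptions of the sketch):

```python
# Sketch only: nodes are assumed to have .parent (None at the root)
# and .length attributes, and to be hashable.
lca_cache = {}        # pair of tips -> their LCA node
root_path_len = {}    # node -> cumulative branch length from node to root

def dist_to_root(node):
    if node not in root_path_len:
        root_path_len[node] = 0.0 if node.parent is None \
            else node.length + dist_to_root(node.parent)
    return root_path_len[node]

def lca(tip1, tip2):
    key = frozenset((tip1, tip2))
    if key not in lca_cache:
        ancestors = set()
        node = tip1
        while node is not None:
            ancestors.add(node)
            node = node.parent
        node = tip2
        while node not in ancestors:   # first shared ancestor is the LCA
            node = node.parent
        lca_cache[key] = node
    return lca_cache[key]

def path_length(tip1, tip2):
    ancestor = lca(tip1, tip2)
    # both root paths pass through the LCA, so subtract its depth twice
    return dist_to_root(tip1) + dist_to_root(tip2) - 2 * dist_to_root(ancestor)
```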

NRI and NTI

Faster calculation of MNTD and MPD is always good, but the real time sink is running the NTI and NRI calculations.  Even with faster MPD and MNTD calculations, a calculation that takes 4 seconds for a sample expands to more than an hour when repeated over 999 randomisation iterations.  And keep in mind that Biodiverse uses a convergence approach instead of a fixed number of iterations, so more than 2000 iterations is not unusual.

Re-use expected values for a given sample size

The expected values for a given sample size will not change in any meaningful way across randomisations that have converged on a stable result, so Biodiverse caches these and re-uses them.  For example, if the expected values for a sample of 10 have been calculated then they are re-used for each other sample of 10 in the data set. 

This has actually been in Biodiverse since NRI and NTI were first added, but is worth noting.  It is an easy thing to implement for other systems.
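A sketch of the idea, with get_expected standing in for whatever (expensive) routine estimates the mean and standard deviation for a random sample of k tips:

```python
# Memoise expected values by sample size (richness).
expected_cache = {}

def expected_for_richness(k, get_expected):
    if k not in expected_cache:
        expected_cache[k] = get_expected(k)  # calculated once per richness level
    return expected_cache[k]                 # re-used for every other sample of size k
```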

In randomisation analyses

The calculation time for NRI and NTI was exacerbated when users ran a randomisation on an analysis that included NRI and NTI.  There is really no need to run the NTI and NRI calculations through a randomisation process, as they are based on a random resampling process to begin with (one would be randomising a randomisation).  However, this is not always obvious to users.  If a user follows a philosophy of "push buttons and watch what happens" (as I do) then a long period of time can be spent waiting for the randomisations to finish as the expected values are recomputed.

The re-use of expected values described above helps here, but these were only cached within the analysis being run.  This means they were not available between randomisations, or to other calculations using the same tree.

Now Biodiverse caches the calculated expected values on the tree and reuses them whenever the tree is used in a subsequent analysis (unless the cache is cleared).  This means they are calculated once only regardless of how many analyses need them.

But the random resampling process was still too slow...

Exact estimates of expected values without needing to randomise

The next improvement was to implement the exact algorithms described by Tsirogiannis et al. (2012 and 2014).  The PhyloMeasures package implements additional steps not described in these papers, but which could be extracted from the package C code.  This is another great example of the benefits of open source code as one can see and understand how an algorithm is implemented in practice.  Where code is complex, or otherwise opaque, one can insert debugging statements in the local version to see what values are being passed into and returned from functions, and how they change within a function.  Importantly, one can also build tests to check the new implementation matches the old.

The exact estimates take advantage of the phylogenetic structure of the trees and repetitions within the calculations.  They are comparatively complex but lead to processing times that are many orders of magnitude faster than the random sampling approaches, to the point that analysis times previously measured in days now take seconds.  Because they are exact, they also return the same answer each time they are run, so there is no margin of error (although this is normally very small for values that have converged under a random resampling process).

The exact NRI algorithms apply to both ultrametric and non-ultrametric trees so are applied in all cases.  However, the NTI algorithms only apply to ultrametric trees, so analyses with non-ultrametric trees still require the random resampling approach.  It might be possible to develop better approaches for non-ultrametric trees given the main aim is to calculate probabilities from possible combinations, but that needs more experience with combinatorics than I have.

MNTD and NTI for the full tree

One final optimisation is to implement an algorithm to calculate the MNTD for a sample comprising the full set of tips on a tree.  This is perhaps something of an edge case as analyses will usually work with subsets, but it did not take long to implement.

This optimisation also applies to the NTI because there is only one possible realisation of the MNTD if all the tree tips are in the sample, so the expected mean is the same as the MNTD and the standard deviation is zero (we will conveniently ignore the resultant divide by zero error in the z-score calculation in this case).  I have not checked if the phylocom, picante and PhyloMeasures implementations check for this condition, but it would not be hard to implement if they do not.

The algorithm to find the shortest path for each target tip is as follows (a code sketch is given below):

  1. Get the shortest distance from the target tip to any of its siblings' tips
    1. Set this as min_dist
    2. Set the processing node as the target tip
  2. Set the processing node as the processing node's parent
    1. Set the ancestral path distance to be the length from the parent to the target tip
    2. Stop if the ancestral path distance exceeds min_dist
  3. Get the shortest distance from the processing node to its siblings' tips
    1. Add this distance to the ancestral path distance.
    2. If the sum of these distances is shorter than min_dist then assign that value to min_dist.
  4. Stop if the processing node is the root node 
  5. Otherwise go to step 2

Each check should complete in approximately O(log(d)) time, where d is the depth of the tree, so it will take O(n log(d)) for a tree with n tips.  

The calculation of distances to tips for each branch is the cumulative path length from the tip, as noted above for the LCA calculations.  This is cached on the first calculation and then re-used, leading to further speed gains over a naïve implementation.
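For the curious, here is a Python sketch of the search described above (node objects with parent, length and children attributes are assumptions of the sketch, not the Biodiverse internals):

```python
def min_tip_dist(node, _cache={}):
    """Shortest distance from node down to any tip beneath it (cached)."""
    if node not in _cache:
        if not node.children:   # node is itself a tip
            _cache[node] = 0.0
        else:
            _cache[node] = min(c.length + min_tip_dist(c) for c in node.children)
    return _cache[node]

def nearest_tip_distance(tip):
    """Shortest path from tip to any other tip when all tips are in the sample."""
    min_dist = float('inf')
    node = tip
    ancestral_dist = 0.0   # path length from the target tip up to the processing node
    while node.parent is not None:
        parent = node.parent
        ancestral_dist += node.length
        if ancestral_dist >= min_dist:   # no shorter path can pass this far up
            break
        for sib in parent.children:      # subtrees branching off at this level
            if sib is not node:
                cand = ancestral_dist + sib.length + min_tip_dist(sib)
                min_dist = min(min_dist, cand)
        node = parent
    return min_dist
```

The full-tree MNTD is then just the mean of nearest_tip_distance over all tips.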

More?

There are potentially more optimisations (there always are) but these will do for now.  

If you want to see all the changes to the internals then they are tracked under issues 786, 788, 789, 793, 794 and 797.  Suggestions for other approaches are always welcome and can be raised on the mailing list (see link below) or on the issue tracker.  


Shawn Laffan

24-Sep-2021


For more details about Biodiverse, see http://shawnlaffan.github.io/biodiverse/  


To see what else Biodiverse has been used for, see https://github.com/shawnlaffan/biodiverse/wiki/PublicationsList 


You can also join the Biodiverse-users mailing list at https://groups.google.com/group/Biodiverse-users 


Thursday, 23 September 2021

Label and group property median and percentile statistics are changing

Biodiverse supports the analysis of additional data values attached to each label and group.  For labels, these could be things like species or population traits such as specific leaf area.  For groups these are things like the average phosphorus content in the soil across a group or set of groups.  More examples of analysing label traits are in this post, and group traits are described in Bein et al. (2020).  

The simplest means of analysing the label and group properties is to calculate summary statistics of the relevant values across the neighbour sets in use.  The relevant indices for these are under the Element Properties category because labels and groups in Biodiverse are referred to by the more generic term "elements".   

However, the implementation of these summary statistics to date has been relatively inefficient, especially for the range weighted statistics.  Where a property was assigned a weight of more than 1, the value was repeated that many times in the vector of values used to calculate the statistics, e.g. [1,1,1,2,2,2,2,2].  This is not an issue for small data sets, but imagine that repeated for 10,000 unique label values, each of which has weights between 1 and 200, and then across 10,000 groups.  That leads to considerable inefficiency in the calculations.  The repetition is also needless, given a weighted statistics approach can be used.

From version 4, Biodiverse uses a weighted implementation for its statistics.  This is slightly less efficient for cases where all weights are equal but there are always trade-offs when writing code for the more general case.
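As a small illustration of why the weighted form is equivalent, here is the [1,1,1,2,2,2,2,2] example from above expressed as values and weights (a Python sketch; population forms shown, and bias correction is a detail):

```python
import math

values  = [1, 2]
weights = [3, 5]   # equivalent to the vector [1,1,1,2,2,2,2,2]

wsum  = sum(weights)
wmean = sum(v * w for v, w in zip(values, weights)) / wsum            # 1.625
wvar  = sum(w * (v - wmean) ** 2 for v, w in zip(values, weights)) / wsum
wsd   = math.sqrt(wvar)
```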

This new approach will have no impact on the results for statistics like the mean, standard deviation, skewness or kurtosis.

However, there will be a change to the way the percentiles and the median are calculated.  Previously, the library used would snap to the next-lowest data value when a percentile was calculated that did not exactly align with the data values.  The new approach uses interpolation, with the results being consistent with how percentiles are calculated in R (for an unweighted vector).  
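For those who want to check their own numbers, numpy's default linear interpolation matches R's default (type 7) quantiles for an unweighted vector:

```python
import numpy as np

vals = np.array([1, 2, 3, 4])
np.percentile(vals, 50)   # 2.5 (interpolated median)
np.percentile(vals, 90)   # 3.7; snapping to the lower value would return 3
```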

This means that any calculations of the median or percentiles in Biodiverse 4 will likely return higher values for some percentiles.  The effect will be greater for smaller samples, where the numeric gaps between sequential values are larger, but such sample size effects are hardly unusual in statistics.

One point that is yet to be dealt with is when the weights are not integers, e.g. where the sample counts used for a Basedata are from Species Distribution Model likelihoods (see the BCCVL if you need an online tool to calculate these).  In such cases the percentiles cannot use interpolation and will use the centre of mass.  Bias correction is also not possible for statistics like standard deviation, skewness and kurtosis in such cases, as the sum of weights is not the same as the number of samples.  This issue is for the future, though, as we do not yet support abundance weighted label stats.

For those interested in the implementation details, the approach uses the Statistics::Descriptive::PDL package which in turn uses tools provided by the Perl Data Language (PDL).  For those more familiar with R or Python, PDL provides Perl support for fast calculations using matrices and vectors.


Shawn Laffan

23-Sep-2021


For more details about Biodiverse, see http://shawnlaffan.github.io/biodiverse/  


To see what else Biodiverse has been used for, see https://github.com/shawnlaffan/biodiverse/wiki/PublicationsList 


You can also join the Biodiverse-users mailing list at https://groups.google.com/group/Biodiverse-users 


Monday, 7 December 2020

Biodiverse now includes the independent swaps randomisation algorithm

Randomisations are one of the key analyses in Biodiverse.  They give an assessment of whether an observed result is significantly different from a distribution of randomly generated versions.   Several recent posts have described how the spatially structured randomisations work, including the (default) rand_structured algorithm and its spatially structured variants. These can also be spatially constrained.  

These approaches have been available in Biodiverse for some time: rand_structured since the very beginning, spatial constraints since version 1, and the spatially structured variants since version 2.  

One of the general principles when developing Biodiverse is to provide options for users[1].  In that light, version 4 of Biodiverse will include the independent swaps algorithm.  

How does independent swaps work?  

A short summary is below, and more details and assessments are in Gotelli (2000) and Miklos & Podani (2004).  Code for the R implementation is available through the Picante package.  

  1. Pick two different groups at random (call them group1 and group2)
  2. Pick two different labels at random (call them label1 and label2)
  3. If label1 is not in group1, or label2 is not in group2, or if label1 is already in group2, or label2 is already in group1  
    1. Then they cannot be swapped, so switch label1 and label2.  
    2. If the switched pairs cannot be swapped using the condition in #3 then go back to step #1
  4. Swap the labels between groups
  5. Increment the number of successful swaps by 1
  6. Start back at step #1
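A minimal Python sketch of one pass of this loop, with the data held as a mapping from group to a set of labels (this is a sketch of the published algorithm, not the picante or Biodiverse code; groups and labels are assumed to be lists):

```python
import random

def try_swap(matrix, groups, labels):
    """One attempt; returns True on a successful swap (step 5)."""
    g1, g2 = random.sample(groups, 2)   # step 1
    l1, l2 = random.sample(labels, 2)   # step 2

    def swappable(a, b):
        return (a in matrix[g1] and b in matrix[g2]
                and a not in matrix[g2] and b not in matrix[g1])

    if not swappable(l1, l2):           # step 3
        l1, l2 = l2, l1                 # step 3.1: switch the labels
        if not swappable(l1, l2):
            return False                # step 3.2: back to step 1
    matrix[g1].remove(l1); matrix[g1].add(l2)   # step 4: swap between groups
    matrix[g2].remove(l2); matrix[g2].add(l1)
    return True
```

The driving loop then simply keeps calling try_swap until the target number of successful swaps is reached (or the attempt limit is hit).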


This is a very simple algorithm, and simple has many advantages when it comes to implementation.  However, it is also a brute-force approach.  The fact that the pairs are selected at random means there is no consideration of "swappability".  Consider the case of a site by species matrix where most of the matrix entries are zero, such as where most taxa are highly range restricted.  In that case, there is a very high chance that the algorithm will select a pair that cannot be swapped and must retry with a different selection, e.g. because neither label is in either group.  This can lead to many wasted CPU cycles.  As an unflattering description one might call this a "throw enough mud at a wall and some of it will stick" approach.  Keep in mind, though, that for many cases this will work well enough, and the results are most definitely random - good enough is good enough.  

Another issue with the independent swaps approach is that one can never be quite certain how many iterations are needed to fully randomise a data set.  Miklos & Podani (2004) suggest twice the number of non-zero entries in the matrix, but I am not aware of whether that has been rigorously assessed.  Too few iterations and some of the original structure will be retained.  Too many and the analysis might take too long (even for very patient people)[2].

There is also no stopping criterion except the total number of iterations.  It is therefore possible for the algorithm to be trapped in an infinite loop when none of the group/label pairs can be swapped, as the iteration count is only incremented on a successful swap.   

In light of the above points, the implementation of independent swaps in Biodiverse has three parameters, two of which lead to early stopping.  

  1. The default number of swaps is set to be twice the number of non-zero cells (if one thinks of the data as a site by species matrix).  This value follows Miklos & Podani.  Users can specify any positive integer here.  The algorithm will stop when this value is reached.  
  2. The maximum number of iterations defaults to 100 times the number of swaps, but can also be specified as a positive integer.  If this number is exceeded then the algorithm stops, regardless of the number of swaps completed.  (Note that this value will likely be increased in the final Biodiverse 4 release, see below for why).  
  3. The number of times each label/group pair in the original matrix is swapped out is tracked.  The algorithm ends once each label/group pair has been swapped at least once.  This is off by default.

The last case uses the assumption that, once a pair has been randomly swapped, there is no need to swap it any further as it will not be any more random.  If your random number generator is not good then more swaps can actually make things worse (but this is unlikely to be a problem as most systems use good random number generators now). 

A modified independent swaps algorithm

As noted in a previous post, a huge amount of effort has gone into optimising the randomisations in Biodiverse to make them run faster.  Armed with this knowledge, the independent swaps algorithm has also been optimised.  The primary aim here has been to reduce the search for swappable label/group pairs, at the cost of extra memory to store extra index lists.  

An overview of the algorithm is this:

  1. Select a group using a multinomial probability distribution, with the probabilities being the richness scores of each group.  Call this group1.
  2. Select label1 from the labels in group1 using a uniform random distribution (each label in group1 is equiprobable).
  3. Select a group that does not already contain label1, again using a multinomial probability distribution.  Call this group2.
  4. Select label2 from the labels in group2 using a uniform random distribution (each label in group2 is equiprobable).
    1. If needed, repeat #4 until label2 is not one that is already in group1.
  5. Swap label1 to group2, and label2 to group1.
  6. Update the multinomial probability distributions.
  7. Increment the swap count.
  8. Go to step #1.
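A rough Python sketch of these selection steps is below.  It assumes present maps each group to its set of labels and absent maps each label to a list of groups lacking it; a real implementation also needs guards for labels that occur in every group and for the retry loop in step 4.

```python
import random

def one_modified_swap(present, absent, groups):
    # step 1: richness-weighted (multinomial) choice of group1
    g1 = random.choices(groups, weights=[len(present[g]) for g in groups])[0]
    l1 = random.choice(tuple(present[g1]))                     # step 2
    cands = absent[l1]                                         # groups lacking label1
    g2 = random.choices(cands, weights=[len(present[g]) for g in cands])[0]  # step 3
    l2 = random.choice(tuple(present[g2]))                     # step 4
    while l2 in present[g1]:                                   # step 4.1 (needs a guard)
        l2 = random.choice(tuple(present[g2]))
    present[g1].remove(l1); present[g1].add(l2)                # step 5: do the swap
    present[g2].remove(l2); present[g2].add(l1)
    absent[l1].remove(g2); absent[l1].append(g1)               # step 6: update lists
    absent[l2].remove(g1); absent[l2].append(g2)
```

Note that in this sketch the per-group richness never changes under a one-for-one swap, so the multinomial weights themselves are stable; it is the absence lists, and hence the candidate sets in step 3, that need updating.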

The multinomial distributions are used to replicate the chances of selecting a non-zero matrix entry when selecting entirely at random.  The use of the "absence" lists in step #3 also focuses the search to swappable pairs.  For example, a very wide ranging label (species) will occur in most groups so there is a high probability that the normal algorithm will select a group1 and group2 that both contain that label.

There is clearly a memory burden when one considers the "absence" tracking lists.  However, memory is cheap these days and the data sets must be relatively enormous for it to make much difference.  Perl, in which Biodiverse is programmed, also has a copy-on-write mechanism to save memory when dealing with large strings (e.g. a copy of a 100MB string can be stored using a link to the original, and only needs to be really copied when it is changed in some way or is written out).  

The modified implementation has the same default stopping parameters as the unmodified implementation.

How long do they take?

So how fast are they?  The table below summarises a benchmark run with the Acacia data set from Mishler et al. (2014).  This has 506 species (labels) distributed across 3037 cells (groups).   

Both the original and modified Biodiverse implementations are used, as is that from the R Picante library and a variant implemented in pure R (and run using R version 4.0.3).  All were run on the same computer.  The Biodiverse runs used the GUI, the R runs were under RStudio.  Code for the R runs is available at the Biodiverse GitHub repo.  

Only one to five runs of each combination were used, and values are rounded off.  Ideally more would be run, but the results are consistent when re-run and the differences are of sufficient magnitude to use for a blog post.

There are 29,175 non-zero entries in the site by species matrix, so approximately 81% of the matrix is empty.  The target number of swaps for the independent swaps runs was set to twice the number of non-empty matrix cells, 58,350, which as noted above is the default in Biodiverse and follows Miklos and Podani (2004).

Values in the table are sorted by run time in seconds.  The "early stop" column indicates if swapping is stopped once all label/group pairs (matrix cells) have been swapped at least once.  "Num swaps" is the number of swaps actually run, "cells swapped" is the count of matrix cells swapped at least once, and is only tracked when early stopping is in use.  "Swap attempts" is how many iterations were used overall, including those where a swap was not possible.  "max attempts" is the maximum number of swap attempts before giving up, with the default being 100 times the target number of swaps.  The larger number is 2^31-1, which is the maximum value for a signed 32 bit integer.    




The main takeaway from the table is that the rand_structured algorithm is fastest of all those run, and is much faster than any of the unmodified independent swaps runs.  This difference will be more pronounced as the data set size increases.   That said, the rand_structured algorithm has been through many rounds of optimisation, so the difference might be reduced with further optimisations to the other algorithms. 

The times for the modified independent swaps are aggregates of iterations 2-5, with the value in brackets being the first iteration when many of the lists are set up and cached for later reuse.  

The modified independent swaps algorithm reaches its target with much less "wastage" of iterations, hence it is fastest of the independent swaps.  This is further aided by the early stopping condition.  

For the unmodified independent swaps algorithm, more than 24,600,000 swap attempts are needed to ensure all matrix cells have been swapped.  This is ~843 times the number of non-zero matrix cells.  The two fastest unmodified runs reach the maximum number of swap attempts when the number of actual swaps is still very low, at ~6% of the target number of swaps.

The unbounded independent swaps run in Biodiverse takes nearly seven minutes.  If used with 999 randomisation iterations in sequence then the run time will be nearly 4.8 days for the randomisations alone (not including running the various analyses for the randomised basedatas).  A faster computer could probably reduce it to 3 days sequential time...

It should also be noted that the Biodiverse implementations are all in Perl.  The Picante implementation is written in C++, which is much faster than languages like Perl, R and Python (they themselves are written in C or C++).  

Despite not being written in C, the Biodiverse modified independent swaps algorithm is two to five times faster than the Picante implementation for this data set.  The modified algorithms do more work per iteration, but run fewer iterations overall.  The Perl implementation could also be reimplemented in C, but the flexibility of the rand_structured approach (see below) means any such efforts will likely be directed there. 

As a final note regarding speeds, the pure R implementation could be made faster.  Profiling shows that most of the processing time is spent in the sample.int() function.  Even so, it would still not be faster than the Picante implementation so the only advantage would be if early stopping were implemented.


Spatially structured independent swaps

One remaining issue with the independent swaps algorithm is that it is very difficult to add spatial constraints to model diffusion processes or random walks.  One can apply spatial constraints to the swapping, i.e. an incidence must be swapped with another within some distance, but that has not been implemented.  If there is sufficient demand, and perhaps funding or a code contribution, then it could be added.

Keep in mind, though, that you can apply spatial constraints so any swapping is done within sub-datasets, for example within regions.  This process is described in the spatially constrained randomisation post.  


Footnotes

[1]  One of the general principles when developing Biodiverse is to provide options for users.  This can sometimes lead to a huge array of options, for example the indices.  However, focusing on that example, many of the indices are provided to allow insights into how others are calculated, such as showing the relative contribution of each label to an endemism score.

[2] Speaking of patient, Biodiverse is written for and by the impatient - impatience is one of the three virtues of programming, after all.


Shawn Laffan

07-Dec-2020


For more details about Biodiverse, see http://shawnlaffan.github.io/biodiverse/

To see what else Biodiverse has been used for, see https://github.com/shawnlaffan/biodiverse/wiki/PublicationsList

You can also join the Biodiverse-users mailing list at http://groups.google.com/group/Biodiverse-users



Monday, 23 November 2020

Spatially partition your randomisations

In previous posts I described how the rand_structured algorithm works, and then how one can use spatially structured randomisations to compare observed patterns with stricter null models.   

This post is concerned with how one can spatially partition a randomisation.  Some of these details are in a previous blog post, but this post is a much more detailed description. 

In the standard implementation of a randomisation, in Biodiverse and I suspect more generally also, labels are randomly allocated to any group across the entire data set.  This works well for many studies.  However, when one begins to scale analyses to larger extents, for example North America, Asia or globally, then the total pool of labels begins to span many different environments.  The randomisations could allocate a polar taxon to the tropics, and a desert taxon to a rainforest.  This does not make the randomisation invalid, but it is perhaps not as strict as it could be.  

The effect is perhaps best considered with a scenario using phylogenetic diversity where polar and tropical taxa are distinct clades on the tree.   For a randomisation that allocates labels anywhere across the data set, one will commonly end up with a random realisation containing many groups with a mixture of polar and tropical taxa.  The PD on the random data set will be higher than on the observed data set in almost all cases because both clades are sampled and thus more of the tree is represented.  Note that this does not make the result wrong - it means that, when compared with a random sample from the full pool of taxa, the observed PD is less than expected.  That is useful to know, but the next question to ask is "are the patterns in the tropics higher or lower than expected for the tropics?", and the same for the polar taxa.  

One could quite easily remove any groups outside a region of interest and then rerun the analysis.  However, then the ranges of the labels will be reduced and any endemism analyses will not be comparable with the larger analyses.  

This can be readily fixed in Biodiverse by specifying a spatial condition to define subsets. Or, if you only want to randomise a subset of your data while holding the rest as-observed, you can use a definition query.    The spatial condition approach was used in Mishler et al. (2020) to (very approximately) partition Canada, the US and Mexico to assess sensitivity of the results for the full data set.  The definition query was used in Allen et al. (2019) to only randomise values within Florida, while holding the adjacent regions constant.

How does it work?  Both approaches slice the data into subsets, apply the randomisation algorithm to each subset independently, and then reassemble the randomised subsets into a full randomised basedata that is then used for the comparisons.  The difference for the definition query is that the data is divided into two sets, and only the subset that passes the query is randomised.
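In pseudo-Python, the overall flow is something like the following sketch, where passes_defq and subset_id stand in for the definition query and the spatial condition, and randomise can be any whole-dataset randomisation algorithm (all names here are placeholders, not Biodiverse internals):

```python
from collections import defaultdict

def partitioned_randomise(groups, passes_defq, subset_id, randomise):
    held = {g: labs for g, labs in groups.items() if not passes_defq(g)}
    subsets = defaultdict(dict)
    for g, labs in groups.items():
        if passes_defq(g):
            subsets[subset_id(g)][g] = labs      # slice into independent subsets
    randomised = dict(held)                      # groups failing the query stay as-is
    for subset in subsets.values():
        randomised.update(randomise(subset))     # each subset randomised independently
    return randomised                            # reassembled full basedata
```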

The definition query is run first.  This means that, if you specify both a definition query and a spatial condition, then only the groups that pass the definition query are considered for randomisations, and those randomisations will be spatially partitioned.

This approach is general, and works for any of the randomisation algorithms in Biodiverse.  Note, however, it will not apply different randomisation algorithms in different subsets.  If there is a good reason to do so then we can look at it, but it will make the interface much more complex to implement.  

One other point to note is that, for the structured randomisations, the swapping algorithm is applied within the subsets.  Each subset is run to completion before they are all aggregated.


Some images will work better than text, so here are some clipped screenshots from Biodiverse.

The observed distribution of Acacia tenuissima (red cells), with outlines of Australian states and territories in blue.  Acacia data are from Mishler et al. 2014, and the cell size is 50 km.  

A. tenuissima incidences in a randomisation constrained such that each incidence is allocated within the polygon that contains it. Note the absence of incidences in NSW, Victoria and Tasmania, and that there are only three in South Australia.  This is consistent with the observed data.

A. tenuissima records randomised using a definition query so that only incidences within Western Australia are randomised.  All others are kept unchanged. Compare with the observed distribution above.   


So how does one use it?  It is done as part of the standard interface (this functionality has actually been in Biodiverse since version 1, but the interface was slightly reconfigured in version 2).  The user specifies spatial conditions using the same syntax as for the spatial analyses.   


Subsets to randomise independently are defined using a spatial condition (red arrow), while the definition query (blue arrow) is used to randomise a subset of the data.

The choice of condition is up to the user, and they can specify whatever condition they like (it is their analysis, after all...).  Generally speaking, though, it is better to use a condition that generates non-overlapping regions.  One that uses a shapefile would fit with many geographic cases.  Many conditions will work but might not be very sensible, for example overlapping circles.  Each group is allocated to only one subset, so in such overlapping cases the first subset that contains the group "wins" it.

One thing to watch for, and which is fixed for version 4, is that if the spatial condition does not capture all groups then the system will throw an error.  The workaround for version 3 and earlier is to use a definition query that matches (shadows) the spatial condition, although keep in mind that any groups outside the definition query will not be randomised.

To sum up, Biodiverse allows a high degree of complexity and fine grained control when using randomisations to assess the significance of observed patterns.  This post has described the spatial conditions and definition queries.  More features and approaches are described in other posts, and are grouped under the randomisation tag.  



Shawn Laffan

23-Nov-2020


For more details about Biodiverse, see http://shawnlaffan.github.io/biodiverse/

To see what else Biodiverse has been used for, see https://github.com/shawnlaffan/biodiverse/wiki/PublicationsList

You can also join the Biodiverse-users mailing list at http://groups.google.com/group/Biodiverse-users


Thursday, 19 November 2020

Randomisations - modelling spatial structure

In the previous post, I described how the rand_structured algorithm works.  If you don't feel like reading that post right now, then a short description is that it uses a filling algorithm to randomly allocate the labels in a basedata onto the groups of a new basedata, with the end result being random assemblages in each group, where by default each group has the same richness as in the original basedata. A swapping algorithm is then used to allocate any labels that could not be allocated using the filling process, usually because there were no unfilled groups that did not already contain these labels.

The filling algorithm has some nice advantages over other methods that also match the richness patterns, such as the independent swaps algorithm.  The main one is that filling algorithms are easily generalised to model spatial processes in their allocation, for example random walks and diffusion processes.

The videos below show some of these spatially structured randomisations at work, using Acacia tetragonophylla incidences from Mishler et al. (2014).  Note that the shade of red is used only to indicate when in the sequence a cell was allocated to, with lighter shades being earlier and darker later.

The first video shows a random walk model, in which a seed location is chosen for the first allocation.  The next location is a randomly selected neighbour of the current location, and the process repeats.  If a location has no available neighbours (all are full or already contain this label) then it backtracks until it finds one it can allocate to (backtracking can be from the last allocated location, the first, or one selected at random).   If there are no possible allocations then it will try a new seed location.  One can also set a probability to randomly reseed even if the window has not been fully allocated to.   
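A Python sketch of the walk, assuming hypothetical helpers seed_location, neighbours_of and is_available (a group is available if it is unfilled and does not already contain the label); the random-reseed probability is omitted for brevity:

```python
import random

def random_walk_allocate(n_incidences, seed_location, neighbours_of, is_available):
    allocated = []
    current = seed_location()
    while len(allocated) < n_incidences:
        if is_available(current):
            allocated.append(current)
        options = [n for n in neighbours_of(current) if is_available(n)]
        if options:
            current = random.choice(options)   # step to a random neighbour
        else:
            # backtrack to an earlier allocation that can still continue,
            # or reseed if none can
            restart = [a for a in allocated
                       if any(is_available(n) for n in neighbours_of(a))]
            current = random.choice(restart) if restart else seed_location()
    return allocated
```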



The second uses a diffusion process.  This also uses a seed location, but considers the neighbours of all of its current locations for the next allocation.  As with the random walk, this process is also constrained by the filled and previously allocated groups, and will reseed if necessary or at random.   There is no backtracking, as there is no "path" being followed so it is not needed. 



The final model is a proximity allocation process.  A seed location is chosen, and a spatial condition is then used to select a window of neighbours.   The example here uses a circle of a three cell radius, but any supported condition could be used such as predefined regions represented using a shapefile.  Once all groups in the window have been allocated to, excluding fully allocated groups, a new seed location is chosen. 



As with the main rand_structured algorithm, any unallocated incidences are handled using a swapping algorithm (see details in the previous post).  This ensures richness targets are met, at the cost of some loss of spatial contiguity in the randomly generated distributions.  

The outliers for this proximity allocation randomisation are due to the swapping algorithm used to reach the per-group richness targets.



The above randomisations are all implemented under the rand_spatially_structured randomisation, for which parameters can be set to model each approach.  However, setting parameters can be complicated and error prone, so there are variants called rand_diffusion and rand_random_walk that restrict the set of arguments needed and set some useful defaults, but otherwise are just wrappers around rand_spatially_structured (which itself is implemented in Biodiverse as a particular case of the rand_structured algorithm).  There is not currently a wrapper for the proximity allocation, but one could be added if needed.  The next few screenshots show the parameter settings.   


The options to control the spatially structured randomisations are marked with the red and blue arrows.  The spatial condition (marked by the red arrow) determines which neighbours of the considered group will be allocated to next.  This example uses a square, but it could be any of the supported conditions.  The other parameters (marked with the blue arrow) control reseeding, backtracking, and allocation order.  

The rand_diffusion parameters are almost identical to those for rand_spatially_structured, as the former is just a special case of the latter and this simplifies setting it up.   

As with the rand_diffusion parameters, and for the same reason, the parameters for rand_random_walk are also almost identical to those for rand_spatially_structured.   

As an aside, if you want a good introduction to the general process of dynamic spatial modelling, then have a look at O'Sullivan and Perry (2013).  There is also a good introduction to the limits of the complete spatial randomness model in O'Sullivan and Unwin (2010).  

To sum up, there is more to randomisations than just randomly shuffling the data around.  Biodiversity patterns are normally spatially structured, so finding patterns that are significantly different from those that are completely random is often not much of a challenge.  Adding constraints such as matching species richness patterns is very useful, but one can go further and introduce even more spatial structure, making the tests "harder to pass" (an example is in Laffan and Crisp, 2003).  As noted by Gotelli (2001), trying multiple hammers is a useful thing, and to quote that paper: "The advantage of null models is that they provide flexibility and specificity that cannot often be obtained with conventional statistical analyses".  Using multiple approaches might be regarded as a fishing expedition, but it does allow one to generate a distribution of models and thus deeper understanding of the patterns one is analysing.  And fish are nutritious.


Shawn Laffan

19-Nov-2020


For more details about Biodiverse, see http://shawnlaffan.github.io/biodiverse/

To see what else Biodiverse has been used for, see https://github.com/shawnlaffan/biodiverse/wiki/PublicationsList

You can also join the Biodiverse-users mailing list at http://groups.google.com/group/Biodiverse-users


Wednesday, 18 November 2020

Randomisations - how the rand_structured algorithm works

The default randomisation algorithm in Biodiverse is called rand_structured.  This has been available in Biodiverse since the beginning and has been used in several papers.  It forms the basis of the current examples of the CANAPE protocol, and an early variant was used in Laffan and Crisp (2003) which used an ancestral variant of Biodiverse.  

The rand_structured algorithm is so called because it maintains the structure of the original data, specifically the spatial distribution of the richness patterns, while randomly allocating taxa (labels) to groups (cells).  By default, the richness patterns are matched exactly, but users also have the option to allow tolerances such as an additional number of labels per group (e.g. five extra), or some multiple of the observed (e.g. 1.5 times as many).  Irrespective of these tolerances, only as many label/group combinations are allocated as are observed in the original data.   If one thinks of the data as a site by species matrix and the richness patterns are matched exactly, then the row and column sums are identical between the original and randomised matrices.  If tolerances are used then the row and column sums will not match.  In both cases, however, the number of non-zero entries (incidences) across the matrix is constant.

The rand_structured randomisation uses a filling algorithm, followed by a swapping algorithm to reach its richness targets.  

The filling process selects a label at random and allocates all of its occurrences to groups (cells), also at random.  Thinking spatially, it is scattering the incidences randomly across the map (but keep in mind that your data do not need to be spatial for them to be analysed).  The filling algorithm will not consider any groups that have already reached their richness targets. 

At the end of the allocation process there will normally be a set of labels that have not yet been fully allocated, because they are already allocated to the unfilled groups, and all the other groups are full (have reached their richness targets).  This is where the swapping algorithm comes in. 

In the swapping process, an unallocated label is selected at random (call it label1).  The set of filled groups that do not contain this label is then searched to find a label that can be allocated to one of the unfilled groups (call it label2).  Once that is done, the label2 incidence is moved to the unfilled group, and label1 is added to the previously filled group.  This is repeated until all incidences have been allocated.  
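A condensed Python sketch of one such swap (the real Perl code tracks filled and unfilled groups with index lists; here they are plain lists, and rand maps each group to its set of labels):

```python
import random

def swap_in(rand, label1, unfilled, filled):
    """Allocate label1 by moving some label2 from a filled group to an unfilled one."""
    for g_full in random.sample(filled, len(filled)):
        if label1 in rand[g_full]:
            continue                          # label1 must not already be here
        for label2 in tuple(rand[g_full]):
            targets = [g for g in unfilled if label2 not in rand[g]]
            if targets:
                g_open = random.choice(targets)
                rand[g_full].remove(label2)   # move label2 to the unfilled group
                rand[g_open].add(label2)
                rand[g_full].add(label1)      # label1 takes its place
                return True                   # caller updates filled/unfilled lists
    return False
```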

All the above might be easier to understand if presented visually.  The red cells in the image below represent the distribution of Acacia tetragonophylla across Australia at a 50 km resolution.  Grey cells contain one or more other Acacia species.  

The video below it shows the allocation of the A tetragonophylla incidences, with the intensity of the red representing when in the sequence they were allocated (lighter is earlier, darker is later).  Note how there is no apparent spatial pattern in the allocation, but remember that the richness patterns for each group (cell) are being preserved. 



The distribution of Acacia tetragonophylla (red cells).  Grey cells have at least one other Acacia species.  Data are from Mishler et al. (2014), and cell size is 50 km.  






Some readers might be thinking that the search for labels and groups to swap can take some time.  However, the process is much faster than simply thrashing around, randomly swapping labels until they have all been allocated. In fact, a huge amount of work has been spent over the years to optimise the filling and swapping code to ensure that it runs fast and scales linearly (or sub-linearly) as the number of labels and groups increases (and sometimes is unaffected). Much of this is done using indexing and tracking lists to reduce searching, a memory-for-speed trade-off that is in some ways key to the Biodiverse development philosophy.  As a result of this work, the current implementation is several orders of magnitude faster than it was eight years ago.


Note that, as a small caveat to the above, it is possible there are small deviations of the description from the implementation.  In such cases it is the code that is canonical.  (This is somewhat like descriptive comments in a computer program describing what the programmer wanted the code to do, not necessarily what it actually does).


Shawn Laffan

18-Nov-2020


For more details about Biodiverse, see http://shawnlaffan.github.io/biodiverse/

To see what else Biodiverse has been used for, see https://github.com/shawnlaffan/biodiverse/wiki/PublicationsList

You can also join the Biodiverse-users mailing list at http://groups.google.com/group/Biodiverse-users


Thursday, 12 November 2020

Randomisations now also generate z-scores

Randomisations in Biodiverse have always generated rank-relative scores from which significance can be derived. See for example this previous post: https://biodiverse-analysis-software.blogspot.com/2016/08/easier-to-use-randomisation-results.html 

However, sometimes users want to use z-scores to assess a randomisation. Several other tools such as phylocom also calculate z-scores, and often these are the default.

From version 4 of Biodiverse, z-scores are now also calculated for randomisations. More details about what happens are in the updated help page, but a short summary is given here. 

For each iteration of the randomisation, each analysis is regenerated using the same set of spatial conditions, but using the randomised version of the basedata. The values of each index for each group (cell) are then checked, and running tallies are kept of the sum of randomised index values as well as the sum of squared index values. These sums, along with the count of iterations, allow the calculation of the mean (sum of x / count) and the standard deviation using the running variance method. 

Once the randomisations have completed, the z-score for each index value for each group can be calculated in the usual way as (observed - expected) / sqrt(variance). Interpretation is the same as any other z-score, with values outside the interval [-1.96, 1.96] being significant for a two-tailed test with an alpha of 0.05, providing the number of samples is large. Note that the strictly correct value depends on the degrees of freedom, which depends on the number of samples. But why would you run a small number of iterations? 
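A sketch of the tallies and the final calculation, using the sum-and-sum-of-squares form described above (a production version would also guard against tiny negative variances from floating point error):

```python
import math

class RunningZ:
    def __init__(self):
        self.n = 0
        self.sum_x = 0.0
        self.sum_x2 = 0.0

    def add(self, x):           # called once per randomisation iteration
        self.n += 1
        self.sum_x += x
        self.sum_x2 += x * x

    def z_score(self, observed):
        mean = self.sum_x / self.n
        variance = self.sum_x2 / self.n - mean ** 2   # running variance method
        return (observed - mean) / math.sqrt(variance)
```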

The z-scores are calculated automatically whenever a randomisation is run.  The results are collated for each group and are accessible through the ">>z_scores>>" lists.  In this case Rand1 is the name of the randomisation, hence the list name is Rand1>>z_scores>>SPATIAL_RESULTS. 



If you want to add z-scores to an existing randomisation then you will need to run at least one extra randomisation iteration to trigger their calculation. If there is demand to do so without the extra iteration then we can add an option for it.  (Consider the relative numeric difference between 999 and 1000 iterations, without worrying about aesthetics such as the number of observed plus random realisations being a multiple of 10).  


Shawn Laffan, 12-Nov-2020



For more details about Biodiverse, see http://shawnlaffan.github.io/biodiverse/ 

For the full list of changes in the 3.99 development series, leading to version 4, see https://github.com/shawnlaffan/biodiverse/wiki/ReleaseNotes#version-399-dev-series 

To see what else Biodiverse has been used for, see https://github.com/shawnlaffan/biodiverse/wiki/PublicationsList 

You can also join the Biodiverse-users mailing list at http://groups.google.com/group/Biodiverse-users