These approaches have been available in Biodiverse for some time: rand_structured since the very beginning, spatial constraints since version 1, and the spatially structured variants since version 2.
One of the general principles when developing Biodiverse is to provide options for users[1]. In that light, version 4 of Biodiverse will include the independent swaps algorithm.
How does independent swaps work?
A short summary is below, and more details and assessments are in Gotelli (2000) and Miklos & Podani (2004). Code for the R implementation is available through the Picante package.
- Pick two different groups at random (call them group1 and group2)
- Pick two different labels at random (call them label1 and label2)
- If label1 is not in group1, or label2 is not in group2, or if label1 is already in group2, or label2 is already in group1
- Then they cannot be swapped, so switch label1 and label2.
- If the switched pairs cannot be swapped using the condition in #3 then go back to step #1
- Swap the labels between groups
- Increment the number of successful swaps by 1
- Start back at step #1
This is a very simple algorithm, and simple has many advantages when it comes to implementation. However, it is also a brute-force approach. The fact that the pairs are selected at random means there is no consideration of "swappability". Consider the case of a site by species matrix where most of the matrix entries are zero, such as where most taxa are highly range restricted. In that case, there is a very high chance that the algorithm will select a pair that cannot be swapped and must retry with a different selection, e.g. because neither label is in either group. This can lead to many wasted CPU cycles. As an unflattering description one might call this a "throw enough mud at a wall and some of it will stick" approach. Keep in mind, though, that for many cases this will work well enough, and the results are most definitely random - good enough is good enough.
Another issue with the independent swaps approach is that one can never be quite certain how many iterations are needed to fully randomise a data set. Miklos & Podani (2004) suggest twice the number of non-zero entries in the matrix, but I am not aware of whether that has been rigorously assessed. Too few iterations and some of the original structure will be retained. Too many and the analysis might take too long (even for very patient people)[2].
There is also no stopping criterion except the total number of iterations. It is therefore possible for the algorithm to be trapped in an infinite loop when none of the group/label pairs can be swapped, as the iteration count is only incremented on a successful swap.
In light of the above points, the implementation of independent swaps in Biodiverse has three parameters, two of which lead to early stopping.
- The default number of swaps is set to be twice the number of non-zero cells (if one thinks of the data as a site by species matrix). This value follows Miklos & Podani. Users can specify any positive integer here. The algorithm will stop when this value is reached.
- The maximum number of iterations defaults to 100 times the number of swaps, but can also be specified as a positive integer. If this number is exceeded then the algorithm stops, regardless of the number of swaps completed. (Note that this value will likely be increased in the final Biodiverse 4 release, see below for why).
- The number of times each label/group pair in the original matrix is swapped out is tracked. The algorithm ends once each label/group pair has been swapped at least once. This is off by default.
The last case uses the assumption that, once a pair has been randomly swapped, there is no need to swap it any further as it will not be any more random. If your random number generator is not good then more swaps can actually make things worse (but his is unlikely to be a problem as most systems use good random number generators now).
A modified independent swaps algorithm
As noted in a previous post, a huge amount of effort has gone into optimising the randomisations in Biodiverse to make them run faster. Armed with this knowledge, the independent swaps algorithm has also been optimised. The primary aim here has been to reduce the search for swappable label/group pairs, at the cost of extra memory to store extra index lists.
An overview of the algorithm is this:
- Select a group using a multinomial probability distribution, with the probabilities being the richness scores of each group. Call this group1.
- Select label1 from the labels in group1 using a uniform random distribution (each label in group1 is equiprobable).
- Select a group that does not already contain label1, again using a multinomial probability distribution. Call this group2.
- Select label2 from the labels in group2 using a uniform random distribution (each label in group2 is equiprobable).
- If needed, repeat #4 until label2 is not one that is already in group1.
- Swap label1 to group2, and label2 to group1.
- Update the multinomial probability distributions.
- Increment the swap count.
- Go to step #1.
The multinomial distributions are used to replicate the chances of selecting a non-zero matrix entry when selecting entirely at random. The use of the "absence" lists in step #3 also focuses the search to swappable pairs. For example, a very wide ranging label (species) will occur in most groups so there is a high probability that the normal algorithm will select a group1 and group2 that both contain that label.
There is clearly a memory burden when one considers the "absence" tracking lists. However, memory is cheap these days and the data sets must be relatively enormous for it to make too much difference. Perl, in which Biodiverse is programmed, also has a copy-on-write mechanism to save memory when dealing with large strings (e.g. a copy of a 100MB string can be stored using a link to the original, and only needs to be really copied when it is changed in some way or is written out).
The modified implementation has the same default stopping parameters as the unmodified implementation.
How long do they take?
So how fast are they? The table below summarises a benchmark run with the Acacia data set from Mishler et al. (2014). This has 506 species (labels) distributed across 3037 cells (groups).
Both the original and modified Biodiverse implementations are used, as is that from the R Picante library and a variant implemented in pure R (and run using R version 4.03). All were run on the same computer. The Biodiverse runs used the GUI, the R runs were under RStudio. Code for the R runs is available at the Biodiverse GitHub repo.
Only one to five runs of each combination was used, and values are rounded off. Ideally more would be run, but the results are consistent when re-run and the differences are of sufficient magnitude to use for a blog post.
There are 29,175 non-zero entries in the site by species matrix, so approximately 81% of the matrix is empty. The target number of swaps for the independent swaps runs was set to twice the number of non-empty matrix cells, 58,350, which as noted above is the default in Biodiverse and follows Miklos and Podani (2004).
Values in the table are sorted by run time in seconds. The "early stop" column indicates if swapping is stopped once all label/group pairs (matrix cells) have been swapped at least once. "Num swaps" is the number of swaps actually run, "cells swapped" is the count of matrix cells swapped at least once, and is only tracked when early stopping is in use. "Swap attempts" is how many iterations were used overall, including those where a swap was not possible. "max attempts" is the maximum number of swap attempts before giving up, with the default being 100 times the target number of swaps. The larger number is 2^31-1, which is the maximum value for a signed 32 bit integer.
The main takeaway from the table is that the rand_structured algorithm is fastest of all those run, and is much faster than any of the unmodified independent swaps runs. This difference will be more pronounced as the data set size increases. That said, the rand_structured algorithm has been through many rounds of optimisation, so the difference might be reduced with further optimisations to the other algorithms.
The times for the modified independent swaps are aggregates of iterations 2-5, with the value in brackets being the first iteration when many of the lists are set up and cached for later reuse.
The modified independent swaps algorithm reaches its target with much less "wastage" of iterations, hence it is fastest of the independent swaps. This is further aided by the early stopping condition.
For the unmodified independent swaps algorithm, more than 24,600,000 swap attempts are needed to ensure all matrix cells have been swapped. This is ~843 times the number of non-zero matrix cells. The two fastest unmodified runs reach the maximum number of swap attempts when the number of actual swaps is still very low, at ~6% of the target number of swaps.
The unbounded independent swaps run in Biodiverse takes nearly seven minutes. If used with 999 randomisation iterations in sequence then the run time will be nearly 4.8 days for the randomisations alone (not including running the various analyses for the randomised basedatas). A faster computer could probably reduce it to 3 days sequential time...
It should also be noted that the Biodiverse implementations are all in Perl. The Picante implementation is written in C++, which is much faster than languages like Perl, R and Python (they themselves are written in C or C++).
Despite not being written in C, the Biodiverse modified independent swaps algorithm is two to five times faster than the Picante implementation for this data set. The modified algorithms do more work per iteration, but run fewer iterations overall. The Perl implementation could also use be reimplemented in C, but the flexibility of the rand_structured approach (see below) means any such efforts will likely be directed there.
As a final note regarding speeds, the pure R implementation could be made faster. Profiling shows that most of the processing time is spent in the sample.int() function. Even so, it would still not be faster than the Picante implementation so the only advantage would be if early stopping were implemented.
Spatially structured independent swaps
As a final note, an issue with the independent swaps algorithm is that it is very difficult to add spatial constraints to model diffusion processes or random walks. One can apply spatial constraints to the swapping, i.e. an incidence must be swapped with another within some distance, but that has not been implemented. If there is sufficient demand, and perhaps funding or a code contribution, then it could be added.
Keep in mind, though, that you can apply spatial constraints so any swapping is done within sub-datasets, for example within regions. This process is described in the spatially constrained randomisation post.
Footnotes
[1] One of the general principles when developing Biodiverse is to provide options for users. This can sometimes lead to a huge array of options, for example the indices. However, focusing on that example, many of the indices are provided to allow insights into how others are calculated, such as showing the relative contribution of each label to an endemism score.
[2] Speaking of patient, Biodiverse is written for and by the impatient - impatience is one of the three virtues of programming, after all.
Shawn Laffan
07-Dec-2020
For more details about Biodiverse, see http://shawnlaffan.github.io/biodiverse/
To see what else Biodiverse has been used for, see https://github.com/shawnlaffan/biodiverse/wiki/PublicationsList
You can also join the Biodiverse-users mailing list at http://groups.google.com/group/Biodiverse-users
No comments:
Post a Comment
Note: only a member of this blog may post a comment.