Wednesday 18 November 2020

Randomisations - how the rand_structured algorithm works

The default randomisation algorithm in Biodiverse is called rand_structured.  This has been available in Biodiverse since the beginning and has been used in several papers.  It forms the basis of the current examples of the CANAPE protocol, and an early variant was used in Laffan and Crisp (2003) which used an ancestral variant of Biodiverse.  

The rand_structured algorithm is so called because it maintain the structure of the original data, specifically the spatial distribution of the richness patterns, while randomly allocating taxa (labels) to groups (cells).  By default, the richness patterns are matched exactly, but users also have the option to allow tolerances such as an additional number of labels per group (e.g. five extra), or some multiple of the observed (e.g. 1.5 times as many).  Irrespective of these tolerances, only as many label/group combinations are allocated as are observed in the original data.   If one thinks of the data as a site by species matrix then, if the richness patterns are matched exactly, then the row and column sums are identical between the original and randomised matrix.  If tolerances are used then the row and column sums will not match.  In both cases, however, the number of non-zero entries (incidences) across the matrix is constant.

The rand_structured randomisation uses a filling algorithm, followed by a swapping algorithm to reach its richness targets.  

The filling process selects a label at random and allocates all of its occurrences to groups (cells), also at random.  Thinking spatially, it is scattering the incidences randomly across the map (but keep in mind that your data do not need to be spatial for them to be analysed).  The filling algorithm will not consider any groups that have already reached their richness targets. 

At the end of the allocation process there will normally be a set of labels that have not yet been fully allocated, because they are already allocated to the unfilled groups, and all the other groups are full (have reached their richness targets).  This is where the swapping algorithm comes in. 

In the swapping process, an unallocated label is selected at random (call it label1).  The set of filled groups that do not contain this label is then searched to find a label that can be allocated to one of the unfilled groups (call it label2).  Once that is done, the label2 incidence is moved to the unfilled group, and label1 is added to the previously filled group.  This is repeated until all incidences have been allocated.  

All the above might be easier to understand if presented visually.  The red cells in the image below represent the distribution of Acacia tetragonophylla across Australia at a 50 km resolution.  Grey cells contain one or more other Acacia species.  

The video below it shows the allocation of the A tetragonophylla incidences, with the intensity of the red representing when in the sequence they were allocated (lighter is earlier, darker is later).  Note how there is no apparent spatial pattern in the allocation, but remember that the richness patterns for each group (cell) are being preserved. 



The distribution of Acacia tetragonaphylla (red cells).  Grey cells have at least one other Acacia species.  Data are from Mishler et al. (2014), and cell size is 50 km.  






Some readers might be thinking that the search for labels and groups to swap can take some time.  However, the process is much faster than simply thrashing around, randomly swapping labels until they have all been allocated. In fact, a huge amount of work has been spent over the years to optimise the filling and swapping code to ensure that it runs fast and scales linearly (or sub-linearly) as the number of labels and groups increases (and sometimes is unaffected). Much of this is done using indexing and tracking lists to reduce searching, a memory-for-speed trade-off that is in some ways key to the Biodiverse development philosophy.  As a result of this work, the current implementation is several orders of magnitude faster than what is was eight years ago.


Note that, as small caveat to the above, it is possible there are small deviations of the description from the implementation.  In such cases it is the code that is canonical.  (This is somewhat like descriptive comments in a computer program being what the programmer wanted some code to do, not necessarily what it actually does).


Shawn Laffan

18-Nov-2020


For more details about Biodiverse, see http://shawnlaffan.github.io/biodiverse/

To see what else Biodiverse has been used for, see https://github.com/shawnlaffan/biodiverse/wiki/PublicationsList

You can also join the Biodiverse-users mailing list at http://groups.google.com/group/Biodiverse-users


No comments:

Post a Comment

Note: only a member of this blog may post a comment.