Tuesday, 5 November 2024

Randomisations: Curveball algorithm now in Biodiverse

Biodiverse supports a range of randomisations to assess significance of analysis results.  Most use cases in the published literature use the rand_structured algorithm, which is explained in this post, but several common algorithms are supported.  

One of the design principles of Biodiverse is to give the user choice.  To that end, the curveball algorithm is available from version 5.  

The publication describing Curveball is Strona et al. (2014).  The name is derived from a baseball card trading pastime popular in North America.  

The curveball algorithm is applied to a data set of items (species, genera, words, or some other set of identifiers).  In the common biodiversity case this is a sites by species matrix, transformed to a list of lists, e.g. a list of site lists, where each site list comprises its species (or vice versa).  These lists can be considered as sets.  At each iteration, two lists (sets of items) are randomly selected.  Any items found in both sets are ignored.  The rest can be swapped between the two sets, with the number swapped limited by the smaller number of unique items in the two sets to ensure after swapping that each set retains the same number of items it started with.  As an example, consider the case where set 1 has ten items, set 2 has eight, and there are six common items found in both lists.  This means two items can be swapped between the two lists.
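For readers who prefer code, here is a minimal Python sketch of a single trade between two sets.  It is an illustration of the description above and not the Biodiverse implementation (Biodiverse is written in Perl and the details will differ).  The function name curveball_trade and the example species labels are made up for this post.  The sketch pools the items unique to each set, shuffles the pool, and hands each set back the same number of unique items it contributed, so at most the smaller unique count can change sides.

```python
import random


def curveball_trade(set1, set2, rng=random):
    """Trade unshared items between two sets, keeping each set's size."""
    shared = set1 & set2
    # Sort so results are reproducible when a seeded generator is passed in.
    only1 = sorted(set1 - set2)
    only2 = sorted(set2 - set1)

    # Pool the items unique to either set, then shuffle the pool.
    pool = only1 + only2
    rng.shuffle(pool)

    # Each set gets back as many pooled items as it put in.
    new1 = shared | set(pool[:len(only1)])
    new2 = shared | set(pool[len(only1):])
    return new1, new2


# The example from the text: ten items, eight items, six in common.
site1 = set("ABCDEFGHIJ")
site2 = set("ABCDEF") | {"X", "Y"}
new1, new2 = curveball_trade(site1, site2, random.Random(42))
print(len(new1), len(new2))
```

The shared items never move, so the two sets still have exactly six items in common after the trade.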

The general formula for the number of possible swaps at an iteration is (min (|A|,|B|) - |A ∩ B|), where A and B are the two sets being considered, and the pipes || denote the lengths of the sets (the numbers of items they contain).   If one prefers to think in terms of dissimilarity measures where a is the number of shared items, b the number unique to set 1 and c the number unique to set 2, then the formula is (min (b,c)).  Purely as an aside, this is also part of the denominator in Simpson's dissimilarity index.  
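As a quick check of the formula, the short Python sketch below calculates the number of possible swaps both ways, from the set sizes and from the b and c counts.  The function names are hypothetical and used only for illustration.

```python
def max_swaps(set_a, set_b):
    """min(|A|, |B|) - |A intersect B|"""
    return min(len(set_a), len(set_b)) - len(set_a & set_b)


def max_swaps_bc(set_a, set_b):
    """min(b, c), where b and c are the counts unique to each set."""
    b = len(set_a - set_b)
    c = len(set_b - set_a)
    return min(b, c)


# Ten items, eight items, six in common.
set_a = set("ABCDEFGHIJ")
set_b = set("ABCDEF") | {"X", "Y"}
print(max_swaps(set_a, set_b))     # 2
print(max_swaps_bc(set_a, set_b))  # 2
```

The two versions agree because removing the shared count from each set size leaves b and c.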

The curveball algorithm is related to the independent swaps algorithm.  The chief advantage of curveball over independent swaps is that, because it swaps as many items as it can at each iteration, it converges on a randomised result much faster.  Curveball also avoids the main pitfall of the independent swaps algorithm where a pair can be selected that cannot be swapped, thus "wasting" an iteration (swap attempt).  

Curveball does, however, have the same issue that independent swaps has in that the user needs to specify the number of iterations over which swaps will be attempted.  Too few and the resulting matrix will not be sufficiently random.  Too many and time will be "wasted".  This is addressed in Biodiverse by optionally tracking which of the original matrix entries have been swapped, and stopping when all have been done (the stop_on_all_swapped parameter).  This has some overhead in the tracking but generally this should be balanced by the time saved by running fewer iterations overall.  For those interested, the default number of swaps is the same as for the independent swaps algorithm, which is twice the number of non-zero matrix entries (twice the sum of the lengths of all lists).
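The stopping logic can be sketched in Python as below.  This is a rough illustration under stated assumptions, not the Biodiverse code.  The function name curveball_randomise and its arguments are made up, apart from stop_on_all_swapped, which borrows the name of the Biodiverse parameter.  I have also assumed that an original entry counts as "swapped" once its item has left its original site at least once, which may not match the book-keeping Biodiverse actually uses.

```python
import random


def curveball_randomise(site_lists, max_iters=None,
                        stop_on_all_swapped=True, seed=None):
    """Repeat curveball trades on a dict of site -> list of items."""
    rng = random.Random(seed)
    sites = {k: set(v) for k, v in site_lists.items()}
    keys = sorted(sites)

    # Default iterations: twice the number of non-zero matrix entries.
    n_entries = sum(len(s) for s in sites.values())
    if max_iters is None:
        max_iters = 2 * n_entries

    # Original (site, item) entries that have not yet been moved.
    unmoved = {(k, item) for k, s in sites.items() for item in s}

    iters = 0
    while iters < max_iters:
        if stop_on_all_swapped and not unmoved:
            break
        iters += 1
        k1, k2 = rng.sample(keys, 2)
        shared = sites[k1] & sites[k2]
        only1 = sorted(sites[k1] - sites[k2])
        only2 = sorted(sites[k2] - sites[k1])
        pool = only1 + only2
        rng.shuffle(pool)
        sites[k1] = shared | set(pool[:len(only1)])
        sites[k2] = shared | set(pool[len(only1):])
        # Any item that left its site is marked as swapped.
        unmoved -= {(k1, i) for i in only1 if i not in sites[k1]}
        unmoved -= {(k2, i) for i in only2 if i not in sites[k2]}
    return sites, iters


data = {
    "s1": ["a", "b", "c"],
    "s2": ["b", "d"],
    "s3": ["a", "d", "e"],
    "s4": ["c"],
}
result, n_iters = curveball_randomise(data, seed=1)
print(n_iters, {k: sorted(v) for k, v in result.items()})
```

The iteration cap is still needed in a sketch like this because some entries can never move (an item found in every site, for example), in which case the tracking set would never empty.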

Accessing the curveball algorithm in Biodiverse is the same as for any of the randomisations.  Open the Randomisation tab, select rand_curveball as the randomise function, select the number of randomisation iterations and any other algorithm specific parameters, then press Go (see image below).  The results are in the same format as always (e.g. see here, here and here).

Since it is just another algorithm, all the common options are available (another new change in version 5 is that more options are available across all algorithms in the GUI - see issue 946).  Users can define regions that are randomised separately before reassembly for analysis, including some that are not to be randomised.  One can also add some of the randomised results to the project to inspect them.

In terms of speed, curveball is faster than rand_structured.  This is largely due to there being less book-keeping required.  However, as with independent swaps, curveball can only be applied on a per-cell basis.  It does not extend to spatially structured randomisations like rand_structured does (one could ensure swap candidates come from within some local neighbourhood, but this is a different model to something like a diffusion process or a random walk.  Update 20241109: This has been implemented and will be available in V5).

All that is needed to run the curveball algorithm is to choose rand_curveball as the "Randomise function".  Other parameters are set as usual.


And that's pretty much it for the description.  If you want to read more randomisation related blog posts then check out the posts tagged with the randomisation label.  


----

Shawn Laffan

05-Nov-2024


For more details about Biodiverse, see http://shawnlaffan.github.io/biodiverse/  


For a list of some of the analyses Biodiverse has been used for, see https://github.com/shawnlaffan/biodiverse/wiki/PublicationsList 


You can also join the Biodiverse-users mailing list at https://groups.google.com/group/Biodiverse-users or start a discussion at https://github.com/shawnlaffan/biodiverse/discussions 


Monday, 4 November 2024

Plotting indices with divergent colour schemes

Many diversity indices have numerical distributions that are divergent, i.e. they are centred on some value and the interesting bit is the magnitude of the differences away from that value.  A simple example is z-scores, where the data are centred on a value of zero and the values indicate how many standard deviations above or below the expected value the input data are.   These have been plotted using a divergent scheme since version 4.1, as described here.

However, one can also have indices that are simple differences, and also ratios where 1 is the centre of the distribution, and values of 1/2 and 2 are the same magnitude difference from the centre.  The relative phylogenetic diversity and endemism indices are examples of the latter.  

From version 5, Biodiverse plots difference and ratio indices using a divergent colour scheme.    These use the same colour range as the z-scores but plotted along a continuous scale instead of as ordinal classes.  

The colouring happens automatically based on metadata stored with the indices (incidentally, much of the GUI is built using this metadata).  

Colours are also scaled so the most extreme "high" colour is equivalent to the most extreme "low" colour, i.e. if the range of difference values is -5 to 1 then the colours are assigned to the range -5 to 5, and the range -5 to 5 is also used when the values span -1 to 5.  This is also accounted for when the data are log scaled or percentile trimmed to de-emphasise extreme values.  
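The scaling for difference indices can be sketched in a few lines of Python.  This only illustrates the idea and is not how Biodiverse does it internally.  The function names are hypothetical, and the mapping to a 0 to 1 colour position is my own simplification.

```python
def symmetric_difference_range(lo, hi):
    """Expand a range so it is symmetric about zero."""
    extreme = max(abs(lo), abs(hi))
    return -extreme, extreme


def colour_position(value, lo, hi):
    """Position of a value along the colour ramp, from 0 (low) to 1 (high)."""
    sym_lo, sym_hi = symmetric_difference_range(lo, hi)
    return (value - sym_lo) / (sym_hi - sym_lo)


print(symmetric_difference_range(-5, 1))  # (-5, 5)
print(symmetric_difference_range(-1, 5))  # (-5, 5)
print(colour_position(0, -5, 1))          # 0.5
```

A difference of zero always lands in the middle of the colour ramp, regardless of how lopsided the data range is.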

A useful point to note is that the colour schemes can be flipped, so if one prefers blue as extreme positive values then this can be done under the Map menu at the left of the display.  

An example is below to compare the old behaviour with the new.  


Prior to version 5, ratio data were plotted using the same colour scheme as any other data, making it difficult to interpret the relative magnitude of the index values across cells.  These are the Relative Phylogenetic Diversity results for the Acacia data set of Mishler et al. (2014), scaled to emphasise the inner 90% of the distribution (i.e. the upper 5% are assigned the same colour, so too the lower 5%).  This is the interval [0.406, 0.896], which means red cells include ratios <1 which is not ideal.  Compare with the next figure.    




The same data as in the previous figure, but now using a divergent colour scheme.  Biodiverse knows this is a ratio index, so assigns colours accordingly.  Red cells have ratios exceeding 1, blue cells less than 1.  Ratios close to 1 are in yellow.  The colours are assigned to the interval [0.406,2.463], where 2.463=1/0.406.  This means one can be sure red cells have ratios exceeding 1, and there is less chance of misinterpreting the results.  
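The interval used for ratio indices can be reproduced with a small Python sketch.  Again this is only an illustration under my own assumptions and not the Biodiverse code, and the function name is hypothetical.  The idea is to work on the log scale, where a ratio and its reciprocal are the same distance from 1.

```python
import math


def symmetric_ratio_range(lo, hi):
    """Expand a range of positive ratios so it is symmetric about 1."""
    extreme = max(abs(math.log(lo)), abs(math.log(hi)))
    return math.exp(-extreme), math.exp(extreme)


# The trimmed interval from the first figure.
low, high = symmetric_ratio_range(0.406, 0.896)
print(round(low, 3), round(high, 3))  # 0.406 2.463
```

The lower bound is furthest from 1 on the log scale in this example, so the upper bound becomes its reciprocal.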





It is not shown here, but the metadata is also stored for tree-based indices so divergent colours are assigned to the tree branches where appropriate.  More details about that process are in this post.  


----

Shawn Laffan

04-Nov-2024




GUI: Polygon overlays (and underlays)

Since its first release, Biodiverse has supported plotting of polygon and polyline feature class data (from shapefiles).  That support has been very basic: users could change the colours, but could only plot the outlines of polygons.  

This has worked well overall, but there are times when the linework from the feature data gets in the way of the cells being plotted.  There are also times when it is useful to plot polygons as solid fills instead of just as the outline.  From version 5 of Biodiverse it is possible to do just this.  

The process is relatively simple.  If a polygon overlay is loaded then it is listed twice in the selection window, once for lines and once for solid fill (with no outline).  The default choice is polylines, which is the current behaviour.  Users then have the option of plotting one overlay above or below the cells.



Colours can be assigned in the usual way.  In this next selection window, the polygon data will be displayed below the cells using a grey colour (grey is quite useful as it does not visually dominate when coloured cells are used).  




Polygon data are displayed as a solid grey fill, under the cells.  In this case it makes it more obvious where there are unsampled regions.  (Cell outlines have also been turned off using the map menu).


Other uses for polygon overlays are in plotting ocean polygons over terrestrial cells to cover over parts of cells that are in the sea (and vice versa for marine data).  


There is no doubt more work to be done, for example plotting more than one layer at a time, but it is a useful improvement.  If more complex plotting is needed then this is when it is best to leverage the power of GIS software.  


----

Shawn Laffan

04-Nov-2024

