Biodiverse analysis software: what's new

Tuesday, 5 November 2024

Randomisations: Curveball algorithm now in Biodiverse

Biodiverse supports a range of randomisations to assess significance of analysis results. Most use cases in the published literature use the rand_structured algorithm, which is explained in this post, but several common algorithms are supported.

One of the design principles of Biodiverse is to give the user choice. To that end, the curveball algorithm is available from version 5.

The publication describing Curveball is Strona et al. (2014). The name is derived from a baseball card trading pastime popular in North America.

The curveball algorithm is applied to a data set of items (species, genera, words, or some other set of identifiers). In the common biodiversity case this is a sites by species matrix, transformed to a list of lists, e.g. a list of site lists, where each site list comprises its species (or vice versa). These lists can be considered as sets. At each iteration, two lists (sets of items) are randomly selected. Any items found in both sets are ignored. The rest can be swapped between the two sets, with the number swapped limited by the smaller number of unique items in the two sets to ensure after swapping that each set retains the same number of items it started with. As an example, consider the case where set 1 has ten items, set 2 has eight, and there are six common items found in both lists. This means two items can be swapped between the two lists.

The general formula for the number of possible swaps at an iteration is (min (|A|,|B|) - |A ∩ B|), where A and B are the two sets being considered, and the pipes || denote the lengths of the sets (the numbers of items they contain). If one prefers to think in terms of dissimilarity measures where a is the number of shared items, b the number unique to set 1 and c the number unique to set 2, then the formula is (min (b,c)). Purely as an aside, this is also part of the denominator in Simpson's dissimilarity index.
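The two forms of the formula can be checked against the worked example above (ten items, eight items, six shared). This is a minimal sketch using Python sets; the function names are made up for illustration and are not part of Biodiverse.

```python
def swaps_from_sets(set_a, set_b):
    """min(|A|, |B|) - |A ∩ B| for two Python sets."""
    return min(len(set_a), len(set_b)) - len(set_a & set_b)


def swaps_from_counts(b, c):
    """The same quantity from the counts of items unique to each set."""
    return min(b, c)


set_1 = set(range(10))       # ten items: 0..9
set_2 = set(range(4, 12))    # eight items: 4..11, six of them shared with set_1

b = len(set_1 - set_2)       # items unique to set 1
c = len(set_2 - set_1)       # items unique to set 2

print(swaps_from_sets(set_1, set_2))   # 2
print(swaps_from_counts(b, c))         # 2
```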

The curveball algorithm is related to the independent swaps algorithm. The chief advantage of curveball over independent swaps is that, because it swaps as many items as it can at each iteration, it converges on a randomised result much faster. Curveball also avoids the main pitfall of the independent swaps algorithm where a pair can be selected that cannot be swapped, thus "wasting" an iteration (swap attempt).

Curveball does, however, have the same issue that independent swaps has in that the user needs to specify the number of iterations over which swaps will be attempted. Too few and the resulting matrix will not be sufficiently random. Too many and time will be "wasted". This is addressed in Biodiverse by optionally tracking which of the original matrix entries have been swapped, and stopping when all have been done (the stop_on_all_swapped parameter). This has some overhead in the tracking but generally this should be balanced by the time saved by running fewer iterations overall. For those interested, the default number of swaps is the same as for the independent swaps algorithm, which is twice the number of non-zero matrix entries (twice the sum of the lengths of all lists).
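To make the stopping rule concrete, here is a rough sketch of a curveball loop that tracks which original entries have been swapped. It follows the prose description in this post and is not taken from the Biodiverse source; the function names, the `seed` argument and the data layout (a dict of site name to item list) are assumptions for illustration only.

```python
import random


def default_iterations(site_lists):
    """Twice the number of non-zero matrix entries (twice the summed list lengths)."""
    return 2 * sum(len(items) for items in site_lists.values())


def curveball(site_lists, max_iters=None, stop_on_all_swapped=True, seed=0):
    """Randomise a dict of site -> items, preserving list sizes and item frequencies."""
    rng = random.Random(seed)
    sites = {site: set(items) for site, items in site_lists.items()}
    keys = sorted(sites)
    if max_iters is None:
        max_iters = default_iterations(site_lists)

    # Original (site, item) entries that have not yet been moved.
    unswapped = {(site, item) for site in keys for item in sites[site]}

    iters = 0
    while iters < max_iters:
        if stop_on_all_swapped and not unswapped:
            break
        iters += 1
        k1, k2 = rng.sample(keys, 2)
        a, b = sites[k1], sites[k2]
        only_a = sorted(a - b)              # shared items are ignored
        only_b = sorted(b - a)
        n = min(len(only_a), len(only_b))   # number of items that can be swapped
        out_a = rng.sample(only_a, n)
        out_b = rng.sample(only_b, n)
        sites[k1] = (a - set(out_a)) | set(out_b)
        sites[k2] = (b - set(out_b)) | set(out_a)
        unswapped -= {(k1, item) for item in out_a}
        unswapped -= {(k2, item) for item in out_b}
    return sites, iters
```

Entries that can never be swapped (for example an item present in every site) stay in the tracking set, in which case the loop in this sketch simply runs to the iteration limit.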

Accessing the curveball algorithm in Biodiverse is the same as for any of the randomisations. Open the Randomisation tab, select rand_curveball as the randomise function, select the number of randomisation iterations and any other algorithm specific parameters, then press Go (see image below). The results are in the same format as always (e.g. see here, here and here).

Since it is just another algorithm, all the common options are available (another new change in version 5 is that more options are available across all algorithms in the GUI - see issue 946). Users can define regions that are randomised separately before reassembly for analysis, including some that are not to be randomised. One can also add some of the randomised results to the project to inspect them.

In terms of speed, curveball is faster than rand_structured. This is largely due to there being less book-keeping required. However, as with independent swaps, curveball can only be applied on a per-cell basis. It does not extend to spatially structured randomisations like rand_structured does (one could ensure swap candidates come from within some local neighbourhood, but this is a different model to something like a diffusion process or a random walk. Update 20241109: This has been implemented and will be available in V5).

All that is needed to run the curveball algorithm is to choose rand_curveball as the "Randomise function". Other parameters are set as usual.

And that's pretty much it for the description. If you want to read more randomisation related blog posts then check out the posts tagged with the randomisation label.

----

Shawn Laffan

05-Nov-2024

For more details about Biodiverse, see http://shawnlaffan.github.io/biodiverse/

For a list of some of the analyses Biodiverse has been used for, see https://github.com/shawnlaffan/biodiverse/wiki/PublicationsList

You can also join the Biodiverse-users mailing list at https://groups.google.com/group/Biodiverse-users or start a discussion at https://github.com/shawnlaffan/biodiverse/discussions

Monday, 4 November 2024

Plotting indices with divergent colour schemes

Many diversity indices have numerical distributions that are divergent, i.e. they are centred on some value and the interesting bit is the magnitude of the differences away from that value. A simple example is z-scores, where the data are centred on a value of zero and the values indicate how many standard deviations above or below the expected value the input data are. These have been plotted using a divergent scheme since version 4.1, as described here.

However, one can also have indices that are simple differences, and also ratios where 1 is the centre of the distribution, and values of 1/2 and 2 are the same magnitude difference from the centre. The relative phylogenetic diversity and endemism indices are examples of the latter.

From version 5, Biodiverse plots difference and ratio indices using a divergent colour scheme. These use the same colour range as the z-scores but plotted along a continuous scale instead of as ordinal classes.

The colouring happens automatically based on metadata stored with the indices (incidentally, much of the GUI is built using this metadata).

Colours are also scaled so the most extreme "high" colour is equivalent to the most extreme "low" colour, i.e. if the range of difference values is -5 to 1 then the colours are assigned to the range -5 to 5, and the same for -1 to 5. This is also accounted for when the data are log scaled or percentile trimmed to de-emphasise extreme values.
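The symmetric scaling can be sketched as follows. This is an illustration of the idea only, not the Biodiverse code: the function name and the `kind` argument are hypothetical, and ratios are assumed to be strictly positive so that 1/x mirrors x around 1.

```python
def divergent_range(vmin, vmax, kind):
    """Expand (vmin, vmax) so both extremes are equally far from the centre."""
    if kind == "difference":
        # Centre is 0, so mirror the largest absolute value.
        extreme = max(abs(vmin), abs(vmax))
        return -extreme, extreme
    if kind == "ratio":
        # Centre is 1, and 1/2 is as far from it as 2, so mirror by reciprocal.
        if vmin <= 0:
            raise ValueError("ratios must be positive")
        extreme = max(vmax, 1 / vmin)
        return 1 / extreme, extreme
    raise ValueError(f"unknown kind: {kind}")


print(divergent_range(-5, 1, "difference"))   # (-5, 5)
print(divergent_range(-1, 5, "difference"))   # (-5, 5)

low, high = divergent_range(0.406, 0.896, "ratio")
print(round(low, 3), round(high, 3))          # 0.406 2.463
```

The ratio call uses the trimmed interval from the Acacia example in this post, and reproduces the [0.406, 2.463] colour range quoted for it.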

A useful point to note is that the colour schemes can be flipped, so if one prefers blue as extreme positive values then this can be done under the Map menu at the left of the display.

An example is below to compare the old behaviour with the new.

Prior to version 5, ratio data were plotted using the same colour scheme as any other data, making it difficult to interpret the relative magnitude of the index values across cells. These are the Relative Phylogenetic Diversity results for the Acacia data set of Mishler et al. (2014), scaled to emphasise the inner 90% of the distribution (i.e. the upper 5% are assigned the same colour, so too the lower 5%). This is the interval [0.406, 0.896], which means red cells include ratios <1 which is not ideal. Compare with the next figure.

The same data as in the previous figure, but now using a divergent colour scheme. Biodiverse knows this is a ratio index, so assigns colours accordingly. Red cells have ratios exceeding 1, blue cells less than 1. Ratios close to 1 are in yellow. The colours are assigned to the interval [0.406,2.463], where 2.463=1/0.406. This means one can be sure red cells have ratios exceeding 1, and there is less chance of misinterpreting the results.

It is not shown here, but the metadata is also stored for tree-based indices so divergent colours are assigned to the tree branches where appropriate. More details about that process are in this post.

----

Shawn Laffan

04-Nov-2024

GUI: Polygon overlays (and underlays)

Since its first release, Biodiverse has supported plotting of polygon and polyline feature class data (from shapefiles). The support is very basic given users can only plot the outlines of polygons, even though the colours could be changed.

This has worked well overall, but there are times when the linework from the feature data gets in the way of the cells being plotted. There are also times when it is useful to plot polygons as solid fills instead of just as the outline. From version 5 of Biodiverse it is possible to do just this.

The process is relatively simple. If a polygon overlay is loaded then it is listed twice in the selection window, once for lines and once for solid fill (with no outline). The default choice is polylines, which is the current behaviour. Users then have the option of plotting one overlay above or below the cells.

Colours can be assigned in the usual way. In this next selection window, the polygon data will be displayed below the cells using a grey colour (grey is quite useful as it does not visually dominate when coloured cells are used).

Polygon data are displayed as a solid grey fill, under the cells. In this case it makes it more obvious where there are unsampled regions. (Cell outlines have also been turned off using the map menu).

Other uses for polygon overlays are in plotting ocean polygons over terrestrial cells to cover over parts of cells that are in the sea (and vice versa for marine data).

There is no doubt more work to be done, for example plotting more than one layer at a time, but it is a useful improvement. If more complex plotting is needed then this is when it is best to leverage the power of GIS software.

----

Shawn Laffan

04-Nov-2024

Saturday, 3 February 2024

Trimming basedatas has been generalised

It has long been possible to trim the basedata labels to keep only those that match either the selected tree or selected matrix.

From Version 5 (actually 4.99_002 if you like development versions) it is possible to trim using a different basedata. The interface has also been generalised in the process.

There's not much to it, so here are some screenshots to demonstrate the process.

Generalised trimming is accessed from the basedata menu

It has the usual interface where one can specify a new name. "Trimming a clone" ensures it operates on a copy. "Delete matching" allows one to invert the trim, i.e. if one wants to keep only the labels that do not match.

Any of the basedatas, trees or matrices in the project can be selected to use as the label source.

----

Shawn Laffan

03-Feb-2024

Monday, 1 May 2023

Biodiverse 4.3 has been released

Biodiverse version 4.3 has now been released.

Versions for Windows, Mac and Linux (Ubuntu) are available and can be accessed via https://github.com/shawnlaffan/biodiverse/wiki/Downloads

Installation instructions are at https://github.com/shawnlaffan/biodiverse/wiki/Installation

This release contains a small number of bug fixes and improved functionality.

For the full list of issues and changes leading to the 4.3 release, see https://github.com/shawnlaffan/biodiverse/milestone/21

Main changes:

GUI:

z-score plotting has been fixed (colours were reversed). Issue 857.

Randomisations

The p-rank calculations now generate ranks for all defined values. The GUI also now colours the values, similar to the z-scores. Issue 856. More details in the blog post.

Spatial conditions

The sp_points_in_same_poly_shape condition is now faster when any points do not intersect any polygons. See commit 3ca2703.

----

Shawn Laffan

01-May-2023

Thursday, 27 April 2023

Changes to randomisation results - the p-rank data

Randomisations in Biodiverse produce a range of outputs. These are kept in a range of lists, differing by name (see the help system).

One of the lists that is generated is the p-ranks. This is essentially the same as the P_ values in the main randomisation lists but where the low values account for ties so one can be sure the values represent the relative ranking of the observed value against those generated from the randomised data. For example, the significance of a low value should account for any ties.

The p-ranks were implemented a few versions ago and are detailed in this blog post. Due to how the plotting was set up at the time, only values in the outer 10% of the distribution were retained. This helped understand which groups contained significant results without a major update to the display system but in the end was probably confusing. Now that the z-score plotting has been implemented the system has the infrastructure to handle the full range of values.

So what has changed?

Two things: the calculation of values and how they are plotted.

Note that the set of cells that can be regarded as significant using the standard alpha threshold of 0.05 for high or low values is unchanged. All that has changed is the number of cells with defined values and how they are displayed in the GUI.

The calculation

Put simply, all values are now retained. Any "P_" value less than 0.5 accounts for the number of ties. Expressed as pseudocode it is:

if P_index > 0.5
    p_rank = P_index
else
    p_rank = ((C_index + T_index) / Q_index)

where "index" is whichever index is being compared at the time.
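For those who prefer something runnable, the pseudocode translates to Python roughly as below. The function name is made up, and the sketch assumes that C_ is the count of randomisations where the observed value was higher, T_ the count of ties, Q_ the number of comparisons, and P_ = C_ / Q_. Check the help system if you need the exact definitions.

```python
def p_rank(c_count, t_count, q_count):
    """p-rank for one index from its C_, T_ and Q_ counts (assumed meanings above)."""
    p_value = c_count / q_count
    if p_value > 0.5:
        return p_value
    # Low values also count the ties.
    return (c_count + t_count) / q_count


print(p_rank(90, 5, 100))   # 0.9
print(p_rank(10, 5, 100))   # 0.15
```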

This makes post-hoc calculation of compound indices like CANAPE easier (although remember that Biodiverse now does that for you).

The display

The addition of the z-score plotting means that the infrastructure for the plotting is in place so it was not too difficult to re-use it to instead display percentile classes. This is applied to the p-rank lists by default.

Compare the two plots below and consider which is easier to work with.

The p-rank plotting in Biodiverse version 4.2 and earlier works, but it is difficult to see which cells are in specific percentile bands. For example which of these cells is in the outer 5%?

Indices in the p-rank lists are now plotted as percentile classes. Compare with the plot above.

As with other plots, the coloured cells can be exported as RGB geotiffs to display in a GIS or other plotting system.

----

Shawn Laffan

27-Apr-2023

Wednesday, 29 March 2023

Biodiverse version 4.2 has been released

Biodiverse version 4.2 has now been released.

Versions for Windows, Mac and Linux (Ubuntu) are available and can be accessed via https://github.com/shawnlaffan/biodiverse/wiki/Downloads

Installation instructions are at https://github.com/shawnlaffan/biodiverse/wiki/Installation

This release contains a small number of bug fixes and improved functionality. For the full list of issues and changes leading to the 4.2 release, see https://github.com/shawnlaffan/biodiverse/milestone/20

Main changes:

GUI

- Branch highlighting in the View Labels tab works again. This was broken in version 4.1. Issue #850.

Data imports

- Raster imports now include the band labels if defined in multiband files. Issue #852.
- Importing a raster now works when the nodata value is NaN. Issue #851.

----

Shawn Laffan

29-Mar-2023

Tuesday, 7 February 2023

Biodiverse version 4.1 has been released

We are pleased to announce the release of Biodiverse version 4.1.

Versions for Windows, Mac and Linux (Ubuntu) are available and can be accessed via https://github.com/shawnlaffan/biodiverse/wiki/Downloads

Installation instructions are at https://github.com/shawnlaffan/biodiverse/wiki/Installation

Version 4.1 represents five issues closed across 96 source code commits.

Highlights of the changes since version 4.0 are at https://github.com/shawnlaffan/biodiverse/wiki/ReleaseNotes#version-41, and the related blog posts can be accessed via https://biodiverse-analysis-software.blogspot.com/search/label/Version41

A more detailed listing of the closed issues is at https://github.com/shawnlaffan/biodiverse/milestone/19?closed=1

The main user visible change is that z-score indices are now plotted using a divergent colour scale using z-score significance thresholds. More details are in this blog post.

----

Shawn Laffan

07-Feb-2023

Plotting z-score indices and randomisation results

From version 4.1, Biodiverse will plot indices it knows are z-scores using a divergent colour scheme, with values classified into intervals (adapted from the ArcGIS implementation). This makes it much easier to see which locations are potentially significant given the expected values.

This process applies to indices like the Net Relatedness Index and Net Taxon Index, all of the Gi* indices such as for group properties and label properties (more on such analyses here), as well as the z-scores generated by randomisation analyses. It also applies to branches of a cluster dendrogram when indices have been calculated for each node/branch.

You can export the coloured images to geotiff in the same way as for any data set.

There is not much more to it than that, so here are some images of what it looks like for a spatial analysis using the Acacia data set of Mishler et al. (2014).

The Net Relatedness Index

Z-scores for Phylogenetic Diversity after a spatial randomisation process

Net Relatedness Index calculated for the groups (cells) under each branch of a cluster analysis. Coloured cells are associated with the dendrogram branches that intersect the blue slider bar.

The spatial distribution of PD significance (left) with branches occurring in a cell in south-west Western Australia (black dot) coloured by clade score significance against the same randomisation process.

----

Shawn Laffan

07-Feb-2023

Friday, 25 November 2022

Export cluster groups to shapefile

Biodiverse Version 4 allows users to export their cluster analyses using the same grouping process as is used to colour the branches.

This can be convenient to reconstruct the clusters in a GIS or other graphics system.

One issue is that only the cluster polygons (or points) are exported. If you want to attach data from the clusters then you can export them to delimited text using the Table Grouped method (with the same grouping parameters) and use a database join to attach them to the shapefile. The main reason for this is that shapefiles have a limit of 10 characters for field names, and many indices in Biodiverse exceed this (as well as sometimes containing characters other than letters, numbers and the underscore).

Another point to be aware of is that each group (cell) is a separate polygon so use a dissolve to merge them if you want to remove the internal boundaries.

Pictures are better than words so here are some screenshots.

An example cluster analysis, in this case with six clusters coloured.

The export option is in the usual place. It can also be accessed through the outputs tab.

In this case the export is set to use six clusters to match the display, but you can choose whatever you like. Other options include selecting by depth or by distance from the root (by length or depth).

And here we have a plot of the clusters. The colours differ but the clusters themselves are the same (and one can always update the colours).

If you want to use the grouped clusters in a spatial condition then it is easier to do so directly - see more details here.

If you just want to replicate the display then it is better to export the spatial data to an RGB geotiff and the tree to nexus with the colours embedded - see geotiff details here and the tree details here.

--------

Shawn Laffan

25-Nov-2022

Trees: Merge single-child branches with their children

When Biodiverse is used to trim a tree to a subset of branches, for example to match the selected BaseData object, any branch with no remaining descendants is removed from the tree. All other branches are retained.

What this means is that some internal branches (nodes) can be left with only one child branch (node). These can be referred to as single-child nodes and also knuckles. Retaining such nodes can be useful if some of the structure of the original tree needs to be kept, for example to indicate that there is phylogenetic data but that it has been removed from the tree. The counter to this is that most phylogenetic trees are samples and so are likely to be missing many branches anyway.

In the spirit of letting the user decide, Biodiverse version 4 supports the merger of internal branches with their children if they have only one child.

Names are important, and like many systems any node can be named in Biodiverse. In fact, all nodes have names but internal nodes default to a number with three trailing underscores (so "1___", "35___" etc). This is what allows many of the branch and clade level indices, such as the phylogenetic endemism clade contributions and PD clade loss, to work.

The general rule when merging is that the name of the merged node is whichever node had a non-default name to begin with. If both have non-default names then a child that is a terminal wins. Otherwise the parent name is used.
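The naming rule can be written out as a small function. This is a sketch of the rule as described here, not the Biodiverse implementation: the function names are hypothetical, and the case where both names are defaults is not covered by the text, so the parent name is assumed.

```python
import re

# Default internal node names: a number followed by three underscores, e.g. "35___".
DEFAULT_NAME = re.compile(r"^\d+___$")


def is_default_name(name):
    return bool(DEFAULT_NAME.match(name))


def merged_name(parent, child, child_is_terminal):
    """Name given to the node that results from merging a parent with its only child."""
    parent_named = not is_default_name(parent)
    child_named = not is_default_name(child)
    if child_named and not parent_named:
        return child
    if child_named and parent_named and child_is_terminal:
        return child            # a named terminal wins over a named parent
    return parent               # otherwise (including two defaults, by assumption)


print(merged_name("12___", "Acacia_sp1", True))    # Acacia_sp1
print(merged_name("CladeA", "13___", False))       # CladeA
print(merged_name("CladeA", "Acacia_sp1", True))   # Acacia_sp1
print(merged_name("CladeA", "CladeB", False))      # CladeA
```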

The process is best demonstrated using images.

An example tree plotted using depth instead of length to show the individual branches. The black branches are not in the basedata.

The tree trimming interface includes the option to merge single child nodes. In this case it is not selected.

The black branches from the previous screenshot have been deleted but one can see several branches that appear twice as long as the others. These are actually pairs of branches.

Repeating the process above but this time merging the single child (knuckle) nodes.

In this case all the branches are the same length because all single child branches have been merged with their children.

The examples above all use the tree trimming process, but if you have a tree that already has knuckles or forget to merge them then you can also merge the nodes directly from the tree menu.

Direct access to the merging process.

--------

Shawn Laffan

25-Nov-2022

Tuesday, 25 October 2022

Biodiverse now calculates CANAPE for you

The CANAPE protocol is one of the analyses Biodiverse is most commonly used for (see examples amongst the list of publications using Biodiverse).

The method, or protocol, was originally described in Mishler et al. (2014) and is conceptually simple. Run an analysis that includes phylogenetic endemism and relative phylogenetic endemism, run those through a randomisation, and then categorise the results based on the significance score of the indices. This process is described in more detail in previous posts here and here.

The main issue with the approach to date is that the CANAPE classes are determined outside of Biodiverse using systems like a GIS, R code or a spreadsheet. So while the process is conceptually simple, the actual implementation can all get a bit complex. Many users are not entirely sure which indices to pass through their functions, or even which lists to extract them from.

As of Version 4 Biodiverse now calculates it for you. This occurs automatically whenever an analysis has included the Phylogenetic Endemism and Relative Phylogenetic Endemism type 2 calculations. (If you want it sooner than version 4 then it is in the development release 3.99_005, which was current at the time of writing. See the downloads page for links).

Biodiverse now calculates the CANAPE scores when the requisite indices have been calculated, and a randomisation has been run. Like many of the posts on this blog, this example uses the Acacia data set from Mishler et al. (2014).

How does Biodiverse store the results?

The results are stored in a new list where the name is the randomisation output used followed by ">>CANAPE>>". So for a randomisation called "rand" you would see "rand>>CANAPE>>". The use of angle brackets might look a bit strange at first but makes the naming consistent with the other randomisation lists and simplifies the underlying code.

The CANAPE classes are stored in an index called CANAPE_CODE, with a numeric code indicating which of the categories a cell falls in. Currently this code is 0 for not significant, 1 for neo-endemism, 2 for palaeo-endemism and 3 for mixed endemism.
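If you want to check the codes against your own calculations, the classification can be sketched as below. The numeric codes are those listed above. The thresholds are an assumption based on the usual reading of the protocol in Mishler et al. (2014): a one-tailed 0.95 test on the two phylogenetic endemism terms, then a two-tailed test on the relative phylogenetic endemism ratio. The function and argument names are hypothetical, and this is not the Biodiverse code.

```python
CANAPE_LABELS = {0: "not significant", 1: "neo", 2: "palaeo", 3: "mixed"}


def canape_code(p_pe_observed, p_pe_comparison, p_rpe):
    """Classify one cell from three randomisation p-ranks in the range [0, 1].

    p_pe_observed   : rank of PE on the observed tree (RPE numerator)
    p_pe_comparison : rank of PE on the comparison tree (RPE denominator)
    p_rpe           : rank of the RPE ratio itself
    """
    # Step 1 (assumed threshold): is either PE term significantly high?
    if p_pe_observed <= 0.95 and p_pe_comparison <= 0.95:
        return 0
    # Step 2 (assumed thresholds): two-tailed test of the ratio.
    if p_rpe < 0.025:
        return 1    # neo-endemism
    if p_rpe > 0.975:
        return 2    # palaeo-endemism
    return 3        # mixed endemism


print(CANAPE_LABELS[canape_code(0.50, 0.50, 0.99)])   # not significant
print(CANAPE_LABELS[canape_code(0.97, 0.50, 0.01)])   # neo
print(CANAPE_LABELS[canape_code(0.50, 0.99, 0.99)])   # palaeo
print(CANAPE_LABELS[canape_code(0.99, 0.99, 0.50)])   # mixed
```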

Biodiverse also provides individual indices for neo, palaeo and mixed in the event a user only wants to see which cells are in a specific class. For example one might want to run a cluster analysis using only neo-endemism cells following the process described here.

The same data as above but highlighting Palaeo-endemism cells in red. All other cells containing data are in blue.

Visualisation

A big advantage of generating CANAPE results within Biodiverse is that users can now explore the results using the functionality Biodiverse provides. As an example, the next screen shot shows an exploration of the contribution of each clade on the tree in relation to the analysis groups (cells) (see more details about that process here and here).

Each tree branch is coloured by the relative contribution of the clade subtending it to the PE score in the cell being hovered over (black dot in south-western WA). This allows an understanding of which clade is driving the PE scores, and thus CANAPE, in a cell. The visualisation process is explained in more detail here.

Displaying the results in other systems

If you then want to use the plots as part of a map then they can be exported to an RGB Geotiff. Details of how to do this are in another post but the next two screenshots show the start and end.

What about a different colour scheme?

The colour scheme used is from Mishler et al. (2014) where neo is red (new is hot), palaeo is blue (old is cold) and purple is between blue and red on a colour wheel.

If you prefer a different colour scheme then you can export the data as you normally would, for example as CSV files or as non-RGB geotiffs, and recreate the plot to your own tastes.

Changing the colours within Biodiverse would be very useful and contributions are always welcome.

What about the Super class?

The system does not currently generate the Super class. It can be added if there is demand. (Edit: It was added for Biodiverse Version 5).

Do I have to run a new randomisation analysis to see the CANAPE list?

The CANAPE lists are generated at the end of any sequence of randomisations. If you already have a randomisation analysis then they can be created by running one additional iteration.

If you are concerned that your analysis is already at 999 iterations then all you lose is a bit of numeric neatness as there are now 1001 realisations in total instead of 1000 (one original plus all the random ones). This is unlikely to make any meaningful difference once that many iterations have been run.

--------

Shawn Laffan

25-Oct-2022

Monday, 2 May 2022

Use clusters in spatial conditions

Spatial conditions are a core part of Biodiverse.

Most people seem to focus on using single cells for their analysis and trying to find the ideal cell size. This is missing much of the benefit of spatial analyses. You are not constrained to using single cells in isolation.

You can analyse regions around each focal location (processing group) using geometric shapes like circles. Varying the size of the window gives an understanding of the spatial scale of the patterns (the operational scale). However, there is no need to be geometric - you can use arbitrarily complex spatial conditions based on polygon features, proximity and/or matching text. See for example Laffan and Crisp (2003) and Laity et al. (2015).

You can also use cluster (and region grower) analyses to define your spatial windows. These allow you to let the data define the regions, with the calculations then applied giving you more understanding of the groupings that have been identified. Care needs to be taken with interpretation due to the risk of circularity, but that's not unusual. And sometimes you just want to understand something about the assemblage that falls under a node (branch). You might also be interested in the environmental properties associated with a cluster.

One issue with the cluster approach is that it can be difficult to use the branches in a spatial condition for a different analysis. Consider the case where one wants to spatially partition a randomisation so labels are kept within their associated clusters (for a given cluster cutoff). You could export the clusters to shapefile format, extract the relevant features to a new shapefile, and then use that in a new spatial condition. But that's a lot of work and not easy for people less familiar with geoprocessing and GIS.

From version 4 (and already in the 3.99_003 development version) you can access the set of groups under a cluster analysis and use it to define spatial conditions. The sp_points_in_same_cluster condition can use any of the current cutting methods, so you can slice by distance from the tips, by depth, or by number of clusters from the root. You can also select individual branches (nodes) by name using sp_point_in_cluster.
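For those who like to see the idea in code, the following is a rough conceptual sketch of what "points in the same cluster" means once a tree has been cut at a given depth. It is not Biodiverse's implementation. The toy tree, the node names and all of the function names are hypothetical, and how tips shallower than the cut are handled is an assumption.

```python
# Hypothetical toy tree: each internal node maps to its children.
# Names that are not keys in the dict are tips (groups).
tree = {
    "root": ["n1", "n2"],
    "n1": ["g1", "g2"],
    "n2": ["n3", "g5"],
    "n3": ["g3", "g4"],
}

def tips_under(node):
    """Collect all tips (groups) below a node."""
    children = tree.get(node)
    if children is None:
        return [node]
    tips = []
    for child in children:
        tips.extend(tips_under(child))
    return tips

def cut_at_depth(node, target_depth, depth=1):
    """Nodes found when slicing at target_depth (the root is depth 1).
    Tips shallower than the cut are kept as their own cluster."""
    if depth == target_depth or node not in tree:
        return [node]
    nodes = []
    for child in tree[node]:
        nodes.extend(cut_at_depth(child, target_depth, depth + 1))
    return nodes

def cluster_lookup(target_depth):
    """Map each tip to the cluster node it falls under."""
    lookup = {}
    for node in cut_at_depth("root", target_depth):
        for tip in tips_under(node):
            lookup[tip] = node
    return lookup

def in_same_cluster(lookup, focal, other):
    """The neighbour test: do both groups fall under the same node?"""
    return lookup[focal] == lookup[other]

lookup = cluster_lookup(2)
print(in_same_cluster(lookup, "g3", "g5"))  # both under n2
print(in_same_cluster(lookup, "g1", "g3"))  # n1 versus n2
```

The neighbour set for a focal group is then every group for which the test is true, which is what lets a randomisation keep labels within their clusters.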

Some snippets are below that can be copied into your spatial conditions windows. No screenshots this time, but I can add a new post if that is needed.

Note that the cluster analysis being referred to must be in the same basedata.

## sp_points_in_same_cluster examples

# Try to use the highest four clusters from the root.
# Note that the next highest number will be used
# if four is not possible, e.g. there might be five
# siblings below the root. Fewer will be returned
# if the tree has insufficient tips.
sp_points_in_same_cluster (
output       => "some_cluster_output",
num_clusters => 4,
)

# Cut the tree at a distance of 0.25 from the tips
sp_points_in_same_cluster (
output          => "some_cluster_output",
target_distance => 0.25,
)

# Cut the tree at a depth of 3 from the root.
# The root is depth 1.
sp_points_in_same_cluster (
output          => "some_cluster_output",
target_distance => 3,
group_by_depth => 1,
)

# Select four clusters below a specified node
sp_points_in_same_cluster (
output       => "some_cluster_output",
num_clusters => 4,
from_node    => '118___', # use the node's name
)

# target_distance is ignored if num_clusters is set
# so this is the same as the first example
sp_points_in_same_cluster (
output          => "some_cluster_output",
num_clusters    => 4,
target_distance => 0.25,
)

## sp_point_in_cluster examples

# This will select any element that is a terminal in the cluster output.
# It is useful when the cluster analysis was run under
# a definition query to reduce the number of elements clustered,
# and you want the same set of elements.
sp_point_in_cluster (
output       => "some_cluster_output",
)

# Now specify a cluster within the output
sp_point_in_cluster (
output       => "some_cluster_output",
from_node    => '118___', # use the node's name
)

# Specify an element to check instead of the current
# processing element.
sp_point_in_cluster (
output       => "some_cluster_output",
from_node    => '118___', # use the node's name
element      => '123:456', # specify an element to check
)

Shawn Laffan

02-May-2022

Importing group properties directly from rasters

What environmental conditions relate to my biodiversity patterns?

Often one wants to understand which environmental conditions are associated with the taxonomic, phylogenetic and/or trait data. Examples include edaphic and climatic variables, and publications doing so include Bickford and Laffan (2006), González-Orozco et al. (2013), González-Orozco et al. (2014a), González-Orozco et al. (2014a), Nagalingum et al. (2015) and Bein et al. (2020).

Such data are typically obtained as rasters, with spatial resolutions often of the order of hundreds of metres. This is in contrast to the resolution typically used for Biodiverse analyses (tens to hundreds of kilometres).

Up until now this has been something of a complex process. The raster data need to be aggregated to the same resolution as the Biodiverse data, and aligned as part of that process. Some sort of summary statistic needs to be calculated for each cell, usually the mean. Then the data need to be converted to a CSV format with coordinates that exactly match the Basedata group labels so they can be attached as group properties using the import process. The latter can be done by importing the rasters as their own basedatas, running numeric label statistics, exporting the results to CSV format and then attaching from there. Still not simple, and not easy when there are tens of rasters to process.

Now it is much easier

This process is greatly simplified in Biodiverse version 4, with early access via the 3.99_003 development release. (Access to releases is via the downloads page).

A set of rasters can be selected, imported and attached. Biodiverse takes care of all the spatial matching and runs the summary statistics. As a bonus, the imported data can also be attached to the project in the event the user wants to run other analyses on them.

Currently there is support for the mean, standard deviation, min, max etc. If there is demand for other statistics like the median or inter-quartile range then these can be added.
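The general idea behind the matching and summarising can be sketched as below. This is only an illustration of the approach, not the Biodiverse code. The 50 km cell size, the "x:y" label format using the cell centre, the function names and the sample values are all assumptions made for the example, as is the use of the population standard deviation.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev

CELL_SIZE = 50_000  # assumed basedata cell size in map units (metres)

def group_label(x, y, cell_size=CELL_SIZE):
    """Label of the basedata cell containing (x, y), written as the
    cell centre in 'x:y' form (an assumed label format)."""
    cx = (math.floor(x / cell_size) + 0.5) * cell_size
    cy = (math.floor(y / cell_size) + 0.5) * cell_size
    return f"{cx:.0f}:{cy:.0f}"

def summarise(raster_cells, cell_size=CELL_SIZE):
    """Allocate raster cell centroids to basedata cells and
    calculate summary stats of their values for each cell."""
    values = defaultdict(list)
    for x, y, value in raster_cells:
        values[group_label(x, y, cell_size)].append(value)
    return {
        label: {"mean": mean(v), "sd": pstdev(v), "min": min(v), "max": max(v)}
        for label, v in values.items()
    }

# Hypothetical raster cell centroids as (x, y, value).
raster_cells = [
    (9_000, 9_000, 20.0),
    (27_000, 9_000, 22.0),
    (45_000, 9_000, 24.0),
    (63_000, 9_000, 30.0),
]
stats = summarise(raster_cells)
for label, cell_stats in stats.items():
    print(label, cell_stats)
```

The first three centroids land in one basedata cell and the fourth in its neighbour, so the first cell gets a mean across three values while the second has only one value to work with.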

Any raster data supported by GDAL can be imported. Development has used geotiffs as they are the most common. The process could probably also be generalised to support other file formats like CSV and shapefile. It depends on demand and developer time.

The key criteria for the raster data are that they must be in the same coordinate system as your basedata and they must represent continuous data (i.e. not be numerical categories). The latter point is important because the group property analyses do not work with nominal/categorical values. If you need to summarise categorical data then use an indicator approach where each class is represented by its own raster, and that raster has values of 1 for where that class occurs, and zero elsewhere.
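The indicator approach is simple enough to sketch. The example below uses a plain list of lists as a stand-in for a categorical raster. In practice you would do this with your GIS or raster library of choice, and the class codes and function name here are hypothetical.

```python
def indicator_rasters(class_raster):
    """Split a categorical raster into one 0/1 raster per class."""
    classes = sorted({v for row in class_raster for v in row})
    return {
        c: [[int(v == c) for v in row] for row in class_raster]
        for c in classes
    }

# Hypothetical land cover codes.
land_cover = [
    [1, 1, 2],
    [3, 2, 2],
]
indicators = indicator_rasters(land_cover)
print(indicators[2])

# The mean of an indicator is the proportion of cells in that class.
flat = [v for row in indicators[2] for v in row]
print(sum(flat) / len(flat))
```

A useful side effect is that the mean of an indicator raster within a basedata cell is the proportion of that cell's raster cells belonging to the class, which is a continuous value the group property analyses can work with.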

How it works

Some screenshots are probably the best means of showing the process.

In these examples I import two data sets from WorldClim at a 5 arc minute resolution, the Annual Mean Temperature and Mean Diurnal Range. These are just the first two of the Bioclim layers provided by WorldClim. The data have been projected into a Lambert Conic Conformal coordinate system to match the basedata being used (the example data that come with Biodiverse) and have been cropped to the Australian extent.

Annual rainfall from WorldClim2 for Australia, using a Lambert Conic Conformal projection. Brown is low, blue is high.

The data are going to be attached to the example data that come with Biodiverse.

The process is accessed via the Basedata menu.

Rasters are selected from a folder at the same time as the options. In this case the mean and standard deviation stats will be attached as properties to the selected basedata, and the intermediate basedatas will be added to the project so they can be visualised and/or analysed further.

The process provides some general feedback when it completes (successfully or otherwise).

The outputs tab shows the intermediate basedatas have been added. Each contains a spatial analysis that was used to calculate the statistics.

The property data cannot be visualised directly (yet). To explore them without using an analysis you need to open the View Labels window for the basedata they were attached to and control click on a cell using your mouse.

The popup window shows the properties for the cell that was clicked on (you will need to change the list being shown to be Properties).

The group properties can be analysed in a spatial or cluster analysis. Look for the calculations starting with "Group properties" under the Element Properties set. In this case the analyses will follow those linked to at the very top of this post and calculate summary stats and Gi* hotspot stats for each branch in a cluster tree.

And here is a visualisation of the Gi* hotspot stat for branches cut at 0.4744 from the tips (you can slide the blue line to change this value). The interpretation depends on your significance threshold but Gi* scores are z-scores so, for a two-tailed test where values could be high or low, values above 1.96 are hotspots at alpha=0.05, while those below -1.96 are coldspots.
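The thresholding described above can be written as a small helper. This is a sketch only. The function name and labels are hypothetical, and it simply applies a two-tailed normal critical value to a z-score.

```python
from statistics import NormalDist

def classify_gi_star(z, alpha=0.05):
    """Classify a Gi* z-score using a two-tailed test."""
    critical = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha=0.05
    if z > critical:
        return "hotspot"
    if z < -critical:
        return "coldspot"
    return "not significant"

for z in (2.3, -2.1, 1.2):
    print(z, classify_gi_star(z))
```

Changing alpha changes the critical value, so a stricter threshold such as alpha=0.01 will leave fewer branches flagged.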

And here are the same clusters but this time coloured by the mean stat across all groups in the sample. (The naming scheme results in lots of "means").

And here is an example of the imported raster data (diurnal range) that were used to generate the group properties.

This image demonstrates what can happen when coarse resolution data are used. The 5 arc minute resolution translates to approximately 18 km when projected. The cells in the basedata containing the species observations are 50 km. The system uses raster cell centroid coordinates to allocate their values to a basedata cell and there are clearly alignment offsets here. There are many sources of finer resolution data you can use.
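The offsets arise because 50 is not a multiple of 18. A quick sketch along a single axis shows how the number of raster centroids allocated to each basedata cell varies. It assumes both grids share an origin at zero, which will not generally be true of real data, and the function name is hypothetical.

```python
import math
from collections import Counter

RASTER_RES = 18    # km, approximate projected raster resolution
BASEDATA_RES = 50  # km, basedata cell size

def centroids_per_cell(n_raster_cells, raster_res=RASTER_RES, cell_res=BASEDATA_RES):
    """Count the raster cell centroids falling in each basedata cell
    along one axis, assuming a shared origin at zero."""
    counts = Counter()
    for i in range(n_raster_cells):
        centroid = (i + 0.5) * raster_res
        counts[math.floor(centroid / cell_res)] += 1
    return counts

counts = centroids_per_cell(25)  # 25 raster cells span 450 km
for cell_index in sorted(counts):
    print(cell_index, counts[cell_index])
```

Some basedata cells receive three centroids along the axis and others only two, so the stats for neighbouring cells are based on differently sized and differently positioned samples. Finer resolution rasters reduce the relative size of this effect.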

Shawn Laffan

02-May-2022
