Tuesday 19 April 2016

More CANAPE - how to restrict your cluster analysis to groups with significant endemism

Two previous posts described more of the background to the CANAPE analysis and how to do it yourself.

[[  2017-08-10 UPDATE - the second part of the condition used an incorrect index name, so the clustering only applied to the palaeo and some of the mixed cells.  This has been corrected in the post.  ]]


What was not covered in those posts was the final step in the Mishler et al. (2014) paper, in which the cells classified as palaeo-, neo- or mixed-endemism under the CANAPE scheme were clustered to identify groupings with shared branches from the tree.

The aim of this post is to fill that gap.


The way it is done is conceptually simple.  We just need to constrain the cluster analysis to only use cells that are identified as having significant phylogenetic endemism.  This takes just a few steps, and a bit of typing.


Analyses in Biodiverse can be constrained to work only on a subset of groups (cells) using a definition query (the same idea as a "where clause" in a database query).  The way it is currently done is a little more involved than is ideal, largely due to the amount of typing needed, but it is not too complex to get working.  (One day we will develop a spatial conditions builder interface to make spatial conditions easier to write.)

So, imagine you have run a spatial analysis called "phylo_end" in which you have told the system to calculate the indices under the Phylogenetic Endemism and Relative Phylogenetic Endemism, type 2 calculations.  You then ran 999 iterations of a randomisation called CANAPE to assess how often the observed index values were higher than the randomised values.  The spatial results object now has an additional list called CANAPE>>SPATIAL_RESULTS which contains, for each index, the count of how often the observed value was higher than the randomised values, how many iterations were applied, how many ties there were, and the proportion of iterations in which the observed value was higher.  (See the help for more details about the naming scheme.)
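
As a rough illustration of what those summary values mean, the sketch below compares one observed value with a handful of randomised values and tallies the results.  It is not Biodiverse code: the numbers are made up, the subroutine name is hypothetical, and the hash keys C, Q, T and P are just illustrative labels for the count, iterations, ties and proportion (see the help for the real index names).

```perl
use strict;
use warnings;

# Hypothetical data: one observed index value for a cell, and the
# values obtained for that cell from five randomisation iterations.
my $observed   = 0.42;
my @randomised = (0.10, 0.42, 0.55, 0.30, 0.25);

# Tally how the observed value compares with each randomised value.
sub summarise_randomisation {
    my ($obs, @rand) = @_;
    my %summary = (C => 0, Q => 0, T => 0);
    for my $value (@rand) {
        $summary{Q}++;                      # one more comparison made
        $summary{C}++ if $obs >  $value;    # observed is higher
        $summary{T}++ if $obs == $value;    # observed is tied
    }
    # Proportion of comparisons in which the observed value was higher.
    $summary{P} = $summary{Q} ? $summary{C} / $summary{Q} : undef;
    return %summary;
}

my %s = summarise_randomisation($observed, @randomised);
printf "C=%d Q=%d T=%d P=%.2f\n", @s{qw(C Q T P)};
```

With 999 iterations, a proportion above 0.95 means the observed value exceeded the randomised value in more than 95% of the iterations.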


An example of the setup for the randomisation.
The randomisations have been completed so now the spatial analysis has an output list called CANAPE>>SPATIAL_RESULTS.  This list contains indices that summarise how often the observed values were higher than (or tied with) those calculated using the randomised data.


The next step is to start a new cluster analysis and set our definition query.  The selected basedata has a randomisation output, so you will be warned about possible issues with synchronisation.  This is only a potential problem if you run more iterations of the randomisation, which is not the case here so you can safely ignore the warning.

You will see this warning when you open a new cluster analysis, but can ignore it in this case.  


Set the Metric to be PHYLO_JACCARD (or whatever metric you like - it is your analysis after all) and enter the text below in the Definition Query box (you will need to expand it to enter text).  If your output and randomisation names differ then edit the text as appropriate.

sp_get_spatial_output_list_value (
    output => 'phylo_end', 
    list   => 'CANAPE>>SPATIAL_RESULTS', 
    index  => 'P_PE_WE'
) > 0.95
|| 
sp_get_spatial_output_list_value (
    output => 'phylo_end', 
    list   => 'CANAPE>>SPATIAL_RESULTS', 
    index  => 'P_PHYLO_RPE_NULL2'
) > 0.95



What this does is use the sp_get_spatial_output_list_value() subroutine to access the randomisation indices.  If either of the PE scores for a group, using the observed or alternate trees, is higher than the randomised values in more than 95% of iterations (these are the P_PE_WE and P_PHYLO_RPE_NULL2 indices) then that group will pass the test.  Only those groups (cells) that pass the test will be considered for the cluster analysis.
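
To make the logic of the test concrete, here is a small stand-alone sketch that applies the same "either index above 0.95" rule to a hash of made-up values.  It does not call Biodiverse: the cell names, the scores and the passes_definition_query() subroutine are all hypothetical, and the hash simply stands in for the CANAPE>>SPATIAL_RESULTS list.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the CANAPE>>SPATIAL_RESULTS list,
# keyed by group (cell) name.  All values are invented.
my %canape_results = (
    cell_a => { P_PE_WE => 0.99, P_PHYLO_RPE_NULL2 => 0.40 },
    cell_b => { P_PE_WE => 0.50, P_PHYLO_RPE_NULL2 => 0.97 },
    cell_c => { P_PE_WE => 0.60, P_PHYLO_RPE_NULL2 => 0.70 },
    cell_d => { P_PE_WE => 0.95, P_PHYLO_RPE_NULL2 => 0.95 },
);

# Same rule as the definition query: pass if either proportion
# is strictly greater than 0.95.
sub passes_definition_query {
    my ($scores) = @_;
    return $scores->{P_PE_WE} > 0.95
        || $scores->{P_PHYLO_RPE_NULL2} > 0.95;
}

my @kept = sort { $a cmp $b }
           grep { passes_definition_query($canape_results{$_}) }
           keys %canape_results;

print "Groups used for clustering: @kept\n";
```

Note that the comparison is strict, so a group sitting exactly on 0.95 for both indices (cell_d here) does not pass.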

Example of the cluster analysis setup, with a definition query specified.  Only those groups (cells) that pass the test will be used in the cluster analysis.
(The text in the image is different from the main text only in whitespace and formatting.  The code is the same).    

Once the analysis is run it will look something like the image below.
Completed cluster analysis.  Cells that failed the definition query were not considered for clustering and are plotted in grey.  
And there it is, a cluster analysis of the CANAPE regions to identify which ones have similar sets of branches from the tree.

Of course, definition queries are extremely versatile and can be used for all sorts of applications.  You might, for example, want to cluster within a region defined by a shapefile but consider geographic ranges from across a broader study region.

It is also worth noting that spatial analyses can use definition queries, but they affect the analysis differently.  Calculations will only be completed for groups that pass the definition query, but those groups can have neighbours that fail the definition query.  This allows one to more easily analyse data with larger window sizes but with a guard area (data on the edges of the analysis region which, if not considered, could cause biased results).


Shawn Laffan, 19-Apr-2016


For more details about Biodiverse, see http://purl.org/biodiverse

To see what else Biodiverse has been used for, see https://purl.org/biodiverse/wiki/PublicationsList

You can also join the Biodiverse-users mailing list at http://groups.google.com/group/Biodiverse-users