Thursday, 20 November 2014

Do it yourself CANAPE

In the previous blog post I described some of the methods behind CANAPE (Categorical Analysis of Neo and Palaeo Endemism).

The purpose of this post is to give more details about how to run it with your own data.  You can also use the Biodiverse pipeline, but these instructions focus on how to do it using the Biodiverse GUI (at least to the point of generating all the necessary data).

These instructions assume you have already imported all your data, so both the species distribution data and the tree.  Instructions for how to do so are in the quick start guide.

The CANAPE analyses also need a version later than 0.19, which at the time of writing is the development version 0.99_005. [Update 2015-04-20 - version 1.0 has now been released http://purl.org/biodiverse/wiki/Downloads ]

Step 1.  View the data

This is a general step you should always do anyway, but open the View Labels tab so you can cross check the spatial data (the basedata object in Biodiverse parlance) and the tree.  This is accessed by the menu option Basedata->View Labels, the keyboard shortcut Shift-Control-V, or by double clicking on the basedata object in the outputs tab.

As you hover over cells on the map (cells are called groups in Biodiverse) you should see paths on the tree being highlighted.  As you hover over branches on the tree you should see cells being highlighted.  If you click on cells or branches then species names (labels) in the list at the top left should be highlighted.

 Hovering over a cell highlights any branches on the tree that are found in that cell.

 Hovering over a branch highlights the set of cells containing any of the named branches beneath it (usually just the terminals).  Clicking on the branch will also select the matching labels in the list at the top left, and colour all cells in the map based on how many of those labels are found in the associated group.

If there is no highlighting then there is a mismatch between your tree and the basedata.  You will need to either rename the labels, for which there is an option in the basedata menu, or re-import the tree and specify a remap table when you do so.  Both approaches can use the same table, so it is up to you which data set should be the canonical source of names.  It does not matter which one is canonical for CANAPE, but for other analyses it can do.  It depends on what you want to do with the data.

Step 2.  Run the spatial analyses

The next step is to run the calculations for the observed data.  This uses a spatial analysis, and can be run using menu option Analyses->Spatial.  You should see something like the image below.  Make sure the tree you want to use is selected.

 The initial spatial analysis window with default spatial conditions.

We now need to change the settings.  For this particular analysis we are only interested in each group (cell) in isolation, not collections of groups, so we need to delete the second spatial condition.

We also need to select the Phylogenetic Endemism calculation.  This can be accessed under the Phylogenetic Endemism category (or Phylogenetic Indices if you are using a version earlier than 0.99_005, but remember that CANAPE won't work on version 0.19 since it lacks the next set of calculations).  Then select the Relative Phylogenetic Endemism, type 2 calculation under the Phylogenetic Indices (relative) category.

 Delete the second spatial condition and select the Phylogenetic Endemism calculation.

 Also select the Relative Phylogenetic Endemism, type 2 calculation.

When all is selected, hit the Go! button.  The example data set distributed with Biodiverse will take very little time to run.  The data set used in Mishler et al. (2014) takes approximately 5 seconds on a modern laptop.  (The code includes a number of optimisations such as caching of results for later re-use and binary searches where possible).

Step 2 (continued).  View the results

Now you get to bask in the glory of a set of results.

Screenshots of the example data are given in the previous blog post so won't be repeated here.  See the section "Step 1" at this URL:   http://biodiverse-analysis-software.blogspot.com.au/2014/11/canape-categorical-analysis-of-palaeo.html

It is time to stop basking and get back to work.

Step 3.  Run the randomisations

To assess which of the cells are candidates for palaeo- or neo-endemism we need to run a randomisation.  In this case we select the rand_structured option in Biodiverse, constraining the richness of each group (cell) to be equal across all randomisations.

 CANAPE uses the rand_structured randomisation to ensure the richness of each cell is constant across all randomisation iterations/realisations.

In the screenshot above you can see in the Setup section that we have chosen the rand_structured function for the randomisation.  We will run 999 iterations with all results being prefixed with "CANAPE".  See the Biodiverse help system for more details on the naming scheme.  The checkpoint save iterations setting is useful when one wishes to track a long running randomisation process to see if it has converged.  The basedata will be saved at any iteration ending in the value given, so a setting of 99 means iterations 99, 199, 299 etc. will be saved.  It is probably not needed here, so can be set to 999.

The Parameters section allows us to control other options.  In this case we will leave the trees alone (randomise_trees_by equals no_change), so each analysis across the random iterations will use the original tree (for other analyses one might choose to randomise the tree but not the spatial data).  We have no group properties set so can also leave randomise_group_props_by as no_change.  A richness_multiplier of 1 and richness_addition of 0 means we will replicate the exact richness scores, as the formula used is a linear function (max_random_richness = observed_richness * richness_multiplier + richness_addition).
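The richness formula is easy to check by hand.  The snippet below is a minimal Python sketch of the linear function given above.  It is not Biodiverse code, and the cell names and richness values are invented for illustration.

```python
def max_random_richness(observed_richness, richness_multiplier=1.0, richness_addition=0.0):
    """Richness target for a group in a random realisation (linear formula from the text)."""
    return observed_richness * richness_multiplier + richness_addition

# Hypothetical observed richness values for three groups (cells).
observed = {"cell_a": 12, "cell_b": 3, "cell_c": 0}

# Multiplier 1 and addition 0 (the CANAPE settings) reproduce the observed richness.
targets = {cell: max_random_richness(r) for cell, r in observed.items()}

# Other settings relax the constraint, e.g. a multiplier of 1.5 and an addition of 2.
relaxed = {cell: max_random_richness(r, 1.5, 2) for cell, r in observed.items()}
```

With the CANAPE settings every target equals the observed richness, which is why the richness of each cell stays constant across the iterations.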

Now we hit the Go! button again.  This might take a while, depending on how large your data set is.  For the example data it is about 3 minutes on a modern laptop, but it can scale to hours for data sets of the size used in Mishler et al. (2014).

 Unsurprisingly, the progress bar displays how much progress has been made.

Step 4.  View the randomisation results

This is another case of "the images are in the previous blog post".  Look for step 2 in http://biodiverse-analysis-software.blogspot.com.au/2014/11/canape-categorical-analysis-of-palaeo.html  (but note that those images use a randomisation name of Rand1 instead of CANAPE).

Step 5.  Export the results

Exporting the results in Biodiverse version 0.99_005 and later can be done while viewing the spatial output (see this previous blog post), or from the outputs tab.  In earlier versions exporting can only be done from the outputs tab.

The screenshot below shows the export options.  If you are using the Biodiverse pipeline then you need to export to delimited text format files.  If you are going to hand roll your classification then use whichever format is appropriate.

 Choose the export option most appropriate to your needs.

 Export the SPATIAL_RESULTS list.  This contains all the PE and RPE results.  This image is for the Delimited text method.  Some options will differ for other export methods.

 Export the randomisation results.

The delimited text exports can be viewed in a spreadsheet program or imported into a stats package such as R.  The results are in two tables (one for the spatial analyses, one for the randomisation results) so might need to be linked.  Use the Element field in each table for this, as it is the unique identifier for each group (in Biodiverse a group is just a special type of element, as is a label).  The Axis_0 and Axis_1 fields are the coordinates of the cell centroids.  There can be any number of axis columns in a basedata, but in this case we have only two.  Remember that the randomisation naming scheme is explained in the Biodiverse help system.
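If you would rather link the two tables in a script than in a spreadsheet, the join on the Element field can be sketched as follows.  This is an illustrative Python example using only the standard library.  The element names, the values and the exact column sets are made up, so check them against your own exports.

```python
import csv
import io

# Hypothetical extracts of the two delimited text exports; the values are invented.
spatial_csv = """Element,Axis_0,Axis_1,PE_WE_P,PHYLO_RPE2
3250000:850000,3250000,850000,0.012,1.31
3250000:950000,3250000,950000,0.004,0.87
"""
rand_csv = """Element,Axis_0,Axis_1,C_PE_WE_P,C_PHYLO_RPE2
3250000:950000,3250000,950000,412,35
3250000:850000,3250000,850000,995,990
"""

def read_by_element(text):
    """Index the rows of a delimited text export by their Element field."""
    return {row["Element"]: row for row in csv.DictReader(io.StringIO(text))}

spatial = read_by_element(spatial_csv)
rand = read_by_element(rand_csv)

# Inner join on Element.  Columns already present (Element, Axis_0, Axis_1) are not duplicated.
joined = {}
for element, row in spatial.items():
    if element in rand:
        extra = {k: v for k, v in rand[element].items() if k not in row}
        joined[element] = {**row, **extra}
```

For real files, replace the io.StringIO wrappers with open() calls on the exported file names.  Row order in the two files does not matter because the join is keyed on Element.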

If you export to one of the raster formats (except the ER-Mapper format) then you will have one raster per list item, so one for PE_WE_P, one for PHYLO_RPE_NULL2, etc.  The ER-Mapper format packs all of one list into one multiband raster, with one band per list item.

Step 6.  Classify the data

The next step is to classify the data into neo, palaeo and mixed-endemism. Blow-by-blow details of that will be left to another blog post, as this one is getting pretty long.  More of the pipeline also needs to be shaken out so it works with other data sets before I can write about that.

If you want to go ahead and do it yourself then it is not particularly complex.  Details of the classification system are given in the previous blog post.   Look for Step 3.  Use the indices PE_WE_P, PHYLO_RPE_NULL2 and PHYLO_RPE2 for PE_orig, PE_alt and RPE, respectively.

Shawn Laffan 20-Nov-2014

For more details about Biodiverse, see http://purl.org/biodiverse

For the full list of changes in the 0.99 series (leading to version 1) see https://purl.org/biodiverse/wiki/ReleaseNotes

To see what else Biodiverse has been used for, see https://purl.org/biodiverse/wiki/PublicationsList

You can also join the Biodiverse-users mailing list at http://groups.google.com/group/Biodiverse-users

Wednesday, 19 November 2014

CANAPE - Categorical Analysis of Palaeo and Neo Endemism

The purpose of this blog is to explain in a bit more detail how the CANAPE method works.  This was prompted by some very useful questions from Stu Marsden.

The paper describing the CANAPE method (Mishler et al. 2014) is at http://dx.doi.org/10.1038/ncomms5473

The CANAPE method is a three step process.  In Step 1 a set of three primary indices is calculated for each region in the data set.  In Step 2 a randomisation is run to identify regions with significant endemism.  In Step 3 these regions are classified into palaeo, neo, mixed or non-endemism.

Step 1.  Calculate the observed endemism

The first step is to calculate a set of three observed endemism scores for each region in the data set:
1. Phylogenetic Endemism (PE) calculated using a user specified tree.  This will be called PE_orig below.
2. PE calculated using an alternate tree.  This will be called PE_alt below.
3. Relative Phylogenetic Endemism (RPE) which is calculated as the ratio of PE_orig to PE_alt.
In this case each region is a single cell, but it could be any collection of cells for which one is interested in running the calculations.

The formula for PE for a region "i" is:

$PE_i = \sum_{\lambda_i \in \Lambda_i} \lambda_i \frac{r_{\lambda_i}}{R_{\lambda_i}}$

where $\Lambda_i$ is the set of branches found in region i, $r_{\lambda_i}$ is the local range of branch $\lambda_i$ (the number of cells in region i in which it is found), and $R_{\lambda_i}$ is the global range of branch $\lambda_i$ (calculated as the number of cells in which it is found across the whole data set).  Put in words, PE for a region is the sum of the branch lengths found in that region, but where each branch is weighted by the fraction of its geographic range that is found in that region.  It is worth noting that PE is basically a range-weighted variant of PD (phylogenetic diversity), as the sum of PE scores across all cells will equal the PD for the set of branches found in those cells.
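The PE formula can be sketched in a few lines of Python.  This is a toy illustration rather than Biodiverse's implementation.  The branch names, lengths and cell occurrences are invented, and the ranges of internal branches are simply the union of the ranges of their descendants.

```python
# Toy data, invented for illustration: branch lengths and the cells each branch occurs in.
branch_lengths = {"A": 1.0, "B": 2.0, "AB": 0.5, "C": 3.0, "root": 0.25}
branch_cells = {
    "A": {"c1", "c2"},
    "B": {"c2"},
    "AB": {"c1", "c2"},                  # internal branch: union of A and B
    "C": {"c2", "c3", "c4"},
    "root": {"c1", "c2", "c3", "c4"},    # internal branch: union of everything
}

def phylogenetic_endemism(region):
    """PE of a region (a set of cells): sum of length * local range / global range."""
    total = 0.0
    for branch, cells in branch_cells.items():
        local_range = len(cells & region)
        if local_range:
            total += branch_lengths[branch] * local_range / len(cells)
    return total

all_cells = set().union(*branch_cells.values())
pe_per_cell = {cell: phylogenetic_endemism({cell}) for cell in sorted(all_cells)}

# PD of the full set of cells is just the sum of the lengths of the branches found there.
pd_all = sum(branch_lengths.values())
```

Summing pe_per_cell over the four cells gives the same value as pd_all (6.75 for these toy numbers), which is the PE and PD relationship noted above.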

PE_orig and PE_alt are calculated in the same way, the difference is simply in the trees being used.  In Mishler et al. (2014) the alternate tree is one with the same topology as the original tree but where the non-zero branches are modified to be of equal length.  It should be noted that the per-branch range weighting is the same as for PE_orig, so each equalised branch receives the same range weighting as its counterpart in the original tree.

RPE for a region is simply the ratio of PE_orig to PE_alt for that region.  It will be >1 when the range weighted branches are longer on the original tree than on the alternate tree (PE_orig is greater than PE_alt), and <1 when they are shorter (PE_orig is less than PE_alt).  This translates to determining if a region has a collection of longer or shorter range weighted branches.

$RPE_i = \frac{PE\_orig_i}{PE\_alt_i}$
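A small worked example may help here.  The Python sketch below uses an invented four-branch tree and is not Biodiverse code.  It assumes both trees are rescaled so that their branch lengths sum to 1 before PE is calculated, in line with the "proportional to the total tree length" scaling mentioned for the plots, so that the ratio is comparable around 1.

```python
def rescale(lengths):
    """Scale branch lengths so the whole tree sums to 1 (an assumption of this sketch)."""
    total = sum(lengths.values())
    return {branch: length / total for branch, length in lengths.items()}

def pe_of_cell(lengths, branch_cells, cell):
    """PE of a single cell: each branch found there contributes length / global range."""
    return sum(lengths[b] / len(cells) for b, cells in branch_cells.items() if cell in cells)

# Toy data, invented for illustration.
orig_lengths = {"A": 0.5, "B": 0.5, "AB": 1.0, "C": 6.0}
branch_cells = {"A": {"c1"}, "B": {"c2"}, "AB": {"c1", "c2"}, "C": {"c3"}}

# Alternate tree: same topology, non-zero branches set to equal length.
alt_lengths = {b: (1.0 if length > 0 else 0.0) for b, length in orig_lengths.items()}

orig_scaled = rescale(orig_lengths)
alt_scaled = rescale(alt_lengths)

rpe = {}
for cell in ("c1", "c2", "c3"):
    pe_orig = pe_of_cell(orig_scaled, branch_cells, cell)
    pe_alt = pe_of_cell(alt_scaled, branch_cells, cell)
    rpe[cell] = pe_orig / pe_alt
```

Cell c3 holds the one long branch (C) and gets an RPE well above 1, while c1 and c2 hold only short branches and get an RPE below 1.  Note that the range weights (1 / global range) are identical for both trees, as described above.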

The following plots illustrate the calculation of PE_orig and PE_alt for the example data set that is distributed with Biodiverse.

 PE_orig (scaled to be proportional to the total tree length).  Branches highlighted in blue are those found in the cell marked with a circle.  (The sum of these branch lengths is the PD of the cell.)  Grey branches are not found in the highlighted cell.  (See this blog post for more details about the tree plots.)

 Same as above, but with the tree branches scaled to be proportional to their ranges.  PE for the highlighted cell is the sum of these range weighted branches.  It is clear from this plot that the PE in the highlighted cell is largely due to one narrow ranged branch as the other branches in that cell have been considerably downweighted.  (Note also that the branch weighting and thus the weighted lengths will change if collections of cells are used in the analysis, e.g. to calculate the PE of a region instead of a single cell).

 The map on the left is PE_alt, with the alternate tree on the right (where branches are scaled to be of equal length).  The map colours are not directly comparable between plots due to differing numerical ranges (something to fix in a future version of Biodiverse), but the differing highs and lows are readily differentiated.

 PE_alt, but now plotting the range weighted version of the equal branch length tree.  Note the similarity with the range weighted tree above for PE_orig, but that it has more detail in the internal branches.

Step 2.  Use a randomisation to identify regions with significant endemism

The randomisations are needed because we don't have a good basis to directly threshold the values of PE_orig, PE_alt and RPE.  The same value of PE can be obtained by different combinations of terminals (species) as it is a combination of branch lengths and their range weighting, e.g. two long but narrow range branches could be the same as 100 long but wide-ranged branches, so using some predetermined threshold is not likely to be useful.  The same applies to RPE, where the same ratio can arise from different inputs.

The randomisation is done at the level of the species rather than the cell.  In each iteration, species (tree tips) are randomly allocated to the landscape, with the constraint that the richness of each cell is held constant and the same as in the original data set, and thus so is each species' range.  For each random realisation we then calculate PE_orig, PE_alt and RPE for each cell.
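To make the constraint concrete, here is a Python sketch of a randomisation that keeps both the richness of each cell and the range of each species constant.  Please note that this is a simple pairwise swap algorithm swapped in for illustration.  It is not the rand_structured algorithm used by Biodiverse, and the cell and species names are invented.

```python
import random

def swap_randomise(cell_species, n_swaps=1000, seed=None):
    """Shuffle species among cells by pairwise swaps.

    Each swap exchanges one species unique to cell a with one unique to cell b,
    so cell richness and species ranges are both left unchanged.
    """
    rng = random.Random(seed)
    cells = {cell: set(species) for cell, species in cell_species.items()}
    names = sorted(cells)
    for _ in range(n_swaps):
        a, b = rng.sample(names, 2)
        only_a = sorted(cells[a] - cells[b])
        only_b = sorted(cells[b] - cells[a])
        if only_a and only_b:
            s, t = rng.choice(only_a), rng.choice(only_b)
            cells[a].remove(s)
            cells[a].add(t)
            cells[b].remove(t)
            cells[b].add(s)
    return cells

def species_ranges(cell_species):
    """Number of cells each species occurs in."""
    ranges = {}
    for species in cell_species.values():
        for sp in species:
            ranges[sp] = ranges.get(sp, 0) + 1
    return ranges

# Hypothetical observed data: the species (tree tips) found in each cell.
observed = {
    "c1": {"sp1", "sp2"},
    "c2": {"sp2", "sp3", "sp4"},
    "c3": {"sp1"},
    "c4": {"sp4", "sp5"},
}
randomised = swap_randomise(observed, seed=42)
```

The indices are then recalculated on each randomised data set, using the same tree as for the observed data.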

We repeat the randomisation 999 times (or more) and keep track of where the observed PE_orig, PE_alt and RPE for each cell rank against the PE_orig, PE_alt and RPE calculated using the 999 random realisations.  The significance of each cell for each index (PE_orig, PE_alt and RPE) is then the relative rank position of the observed value against those of the random realisations, so anything in the top 5% is significantly high for a one tailed test, while anything in the upper 2.5% or lower 2.5% is significant for a two tailed test.
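The rank test can be sketched as below.  This is an illustrative Python function, not part of Biodiverse.  It takes the number of iterations in which the observed value was greater than the randomised value, plus an optional count of ties, and the way ties are handled in the lower tail is an assumption of this sketch.

```python
def classify_rank(count_greater, n_iterations, count_tied=0, alpha=0.05, two_tailed=False):
    """Return 'high', 'low' or 'ns' for an observed value ranked against random realisations.

    count_greater: iterations where the observed value exceeded the randomised value.
    count_tied: iterations where the two were equal (only used for the lower tail).
    """
    tail = alpha / 2 if two_tailed else alpha
    if count_greater / n_iterations > 1 - tail:
        return "high"
    # Lower tail: ties count against significance (an assumption of this sketch).
    if two_tailed and (count_greater + count_tied) / n_iterations < tail:
        return "low"
    return "ns"

pe_result = classify_rank(995, 999)                     # one tailed, as used for PE
rpe_result = classify_rank(10, 999, two_tailed=True)    # two tailed, as used for RPE
```

Here 995 of 999 is in the top 5% so the PE value is significantly high, while 10 of 999 is in the lower 2.5% so the RPE value is significantly low.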

The randomisations are plotted below.  The C_ prefix in the index name means it is the count of times the observed value was greater than the randomised values, so 995 means 995 of the 999 randomisations.  The RPE test is two-tailed so we look for both high and low values (accounting for tied values in the lows, albeit there are none in this example).  These plots don’t have the thresholding applied to them because it is not yet supported in Biodiverse, but it can be done using other tools such as the Biodiverse pipeline or a GIS.

 Plot of the number of times PE_orig was greater than the same index calculated using the 999 random realisations.

 Plot of the number of times PE_alt was greater than the same index calculated using the 999 random realisations.

 Plot of the number of times RPE was greater than the same index calculated using the 999 random realisations.

Step 3.  Classify the results for each cell

The CANAPE test is then a process of checking the rank relative significance of the PE_orig, PE_alt and RPE values for each cell.  Indents are sub-branches in the decision process.

1)    If either PE_orig or PE_alt is significantly high then we look for palaeo or neo endemism
    a)    If RPE is significantly high then we have palaeo-endemism (PE_orig is consistently higher than PE_alt across the random realisations)
    b)    Else if RPE is significantly low then we have neo-endemism (PE_orig is consistently lower than PE_alt across the random realisations)
    c)    Else we have mixed age endemism in which case
        i)    If both PE_orig and PE_alt are highly significant (p<0.01) then we have super endemism (high in both palaeo and neo)
        ii)    Else we have mixed (some mixture of palaeo, neo and non endemic)
2)    Else if neither PE_orig nor PE_alt is significantly high then we have a non-endemic cell
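The decision process above translates fairly directly into code.  The following is a minimal Python sketch and not the code used in the pipeline.  It assumes each input is the proportion of random realisations in which the observed value was greater than the randomised value (e.g. 995/999), uses a one tailed test for PE_orig and PE_alt and a two tailed test for RPE, and ignores tied values.

```python
def canape_class(pe_orig_p, pe_alt_p, rpe_p, alpha=0.05, super_alpha=0.01):
    """Classify one cell from the rank proportions of PE_orig, PE_alt and RPE."""
    pe_orig_high = pe_orig_p > 1 - alpha
    pe_alt_high = pe_alt_p > 1 - alpha
    if not (pe_orig_high or pe_alt_high):
        return "non-endemic"
    # RPE is tested in both tails, so each tail gets half of alpha.
    if rpe_p > 1 - alpha / 2:
        return "palaeo"
    if rpe_p < alpha / 2:
        return "neo"
    # Mixed age endemism: super if both PE scores are highly significant.
    if pe_orig_p > 1 - super_alpha and pe_alt_p > 1 - super_alpha:
        return "super"
    return "mixed"

# Hypothetical rank proportions for five cells: (PE_orig, PE_alt, RPE).
examples = {
    "cell_1": (0.99, 0.50, 0.99),
    "cell_2": (0.50, 0.97, 0.01),
    "cell_3": (0.995, 0.995, 0.50),
    "cell_4": (0.97, 0.50, 0.50),
    "cell_5": (0.50, 0.50, 0.99),
}
classes = {cell: canape_class(*p) for cell, p in examples.items()}
```

Note that cell_5 is non-endemic even though its RPE is significantly high, because RPE is only consulted once PE_orig or PE_alt has passed the first test.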

The bulk of the CANAPE process can be run using the development version of Biodiverse (https://purl.org/biodiverse/wiki/Downloads [Update 2015-04-20 - Version 1 has now been released]).  It is just the final classification which needs external processing, for which there is a pipeline at https://github.com/NunzioKnerr/biodiverse_pipeline but it needs tweaking for the plots to work with other data sets.  The pipeline can actually be used to run the whole thing, but it takes a bit of work to set up at the moment.

Shawn Laffan
19-Nov-2014


Equations are all