Saturday, 6 August 2016

Biodiverse now categorises your randomisation results

[[ 2016-08-30:  This post has been superseded - the categorisation was not sufficiently clear.  See this followup post for how the system works now ]]


One of the issues users face with the randomisations in Biodiverse is what to do with them once they have been run.

One key point is that the results are stored on the other analysis objects themselves, as extra indices and lists.  The index names themselves are a bit cryptic, but are consistent, and there is a description of what they mean here:  https://purl.org/biodiverse/wiki/AnalysisTypes#where-do-the-randomisation-results-go-and-what-do-they-mean

Even then, it can be difficult working out which of your groups have index scores that are significantly different from the randomisation.  This is because the data are plotted as a continuum (which in turn is because it uses the same plotting process as the original index scores).  The first image below is an example of this plotting.

One can easily export the data and work with them in a GIS or stats package, but any tied values need to be factored in for lower tail tests, for example as used in the CANAPE process.

With the next version of Biodiverse this categorisation will become a little easier.  Biodiverse will automatically categorise your randomisation results into significance levels, putting the results into new lists on the objects that can be displayed and exported in the same way as any other data.

As an example, imagine you have run a randomisation analysis for a BaseData containing a spatial analysis in which you calculated phylogenetic endemism.  Assume that the randomisation's name is rr (not a good name, but it's convenient to type here).  The spatial analysis will now have three lists you can plot.  The first two are the same as ever:  SPATIAL_RESULTS contains the observed results for each group (cell), and rr>>SPATIAL_RESULTS contains indices to track the randomisations for each index in SPATIAL_RESULTS.  For example, for PE_WE there will be C_PE_WE, Q_PE_WE, T_PE_WE and P_PE_WE.  These collate, respectively, the number of times the observed PE_WE was higher than that generated using the randomised data, the number of times the observed PE_WE was compared against the scores from the randomised data, the number of times the observed and random scores were tied, and the proportion of iterations in which the observed score was higher than the random scores (P_PE_WE = C_PE_WE / Q_PE_WE).
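As a quick illustration of how the C_, Q_ and P_ indices relate, here is a minimal Python sketch.  It is not Biodiverse code; the function name and the example counts are made up for illustration.

```python
def proportion_higher(c_count: int, q_count: int) -> float:
    """Return P = C / Q, the proportion of comparisons in which the
    observed score was higher than the randomised score.

    c_count stands in for C_PE_WE and q_count for Q_PE_WE.
    """
    if q_count <= 0:
        raise ValueError("Q must be a positive number of comparisons")
    return c_count / q_count


# Hypothetical counts: observed PE_WE was higher in 950 of 1000 comparisons.
p_pe_we = proportion_higher(950, 1000)
print(p_pe_we)
```

Note that ties (T_PE_WE) are not part of this ratio, which is why they need separate handling for lower tail tests.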

The new list is rr>>sig>>SPATIAL_RESULTS.  This contains a set of categorisations for one and two tailed tests for each index found in SPATIAL_RESULTS.  The lower tail tests take into account any ties in the comparisons.  An example plot is in the second image below.

For the PE_WE example, one has SIG_1TAIL_PE_WE and SIG_2TAIL_PE_WE.  SIG_1TAIL_PE_WE is a one-tailed test for higher or lower than expected.  It has a value of 0.01 if it is significantly high at alpha=0.01, 0.05 if high for alpha=0.05, -0.01 if it is significantly low at alpha=0.01, and -0.05 for low at alpha=0.05.  If it is not significant then it has a null (undefined) value.

SIG_2TAIL_PE_WE has the same numbers, but for a two tailed test.  Values of -0.05 and 0.05 are low or high for a two tailed alpha=0.05, i.e. the observed scores are in the outer 5% of the distribution of random scores (<2.5% or >97.5%, respectively), while those with -0.01 or 0.01 are in the outer 1% and significant at alpha=0.01.
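The categorisation described above can be sketched in Python as follows.  This is an illustration under stated assumptions rather than the Biodiverse implementation: the function name is hypothetical, the comparisons are assumed to be strict, and the lower tail is assumed to use (C + T) / Q so that ties count against a "significantly low" result.

```python
from typing import Optional


def categorise(c: int, q: int, t: int, two_tailed: bool = False) -> Optional[float]:
    """Return 0.01, 0.05, -0.01, -0.05, or None (not significant).

    c, q and t stand in for the C_, Q_ and T_ indices of one group.
    Positive values mean significantly high, negative significantly low.
    """
    if q <= 0:
        return None
    p_high = c / q        # proportion of comparisons where observed was higher
    p_low = (c + t) / q   # assumption: ties are added in for the lower tail
    for alpha in (0.01, 0.05):          # check the stricter level first
        tail = alpha / 2 if two_tailed else alpha
        if p_high > 1 - tail:
            return alpha
        if p_low < tail:
            return -alpha
    return None


# Hypothetical counts for a few groups, 1000 comparisons each.
print(categorise(996, 1000, 0))                   # one-tailed
print(categorise(960, 1000, 0, two_tailed=True))  # two-tailed
print(categorise(10, 1000, 25))                   # low, with ties
```

With these assumptions a group with P = 0.96 is significantly high for the one-tailed test at alpha=0.05 but not significant for the two-tailed test, since it falls short of the 97.5% threshold.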

The upper and lower one-tailed tests have been combined into the same list to reduce the number of indices and lists generated, and thus use less memory and disk space (an index cannot be both significantly high and low so there is no overlap).  If you are interested in a one tailed test for high values then ignore the low values, and vice-versa.  The values can be easily separated after exporting the results.
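Separating the high and low values after export might look something like the Python sketch below.  The group names and the dictionary layout are assumptions for illustration; the actual layout depends on the export format you choose.

```python
from typing import Dict, Optional, Tuple

SigValues = Dict[str, Optional[float]]


def split_tails(sig_values: SigValues) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Split a combined one-tailed list into (high, low) dictionaries.

    Non-significant groups (None) are dropped from both.
    """
    high = {g: v for g, v in sig_values.items() if v is not None and v > 0}
    low = {g: v for g, v in sig_values.items() if v is not None and v < 0}
    return high, low


# Hypothetical SIG_1TAIL_PE_WE values keyed by group (cell) name.
exported = {"125:225": 0.01, "125:275": None, "175:225": -0.05, "175:275": 0.05}
high, low = split_tails(exported)
print(high)
print(low)
```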

Currently the results are plotted in the same manner as any data, but there are plans to allow users to overlay the randomisation significance results over the top of the observed results, for example by masking out any non-significant scores.

The current system plots all scores, regardless of whether they pass a threshold or not.  This is useful, but is difficult to interpret when looking for significance against the randomisation.

The new categorisation filters out any score that is not significant at an alpha level of either 0.05 or 0.01 (here the one-tailed results are plotted, so negative values are significantly low).  The plotting could be improved, but it will work well enough now for exploration - proper maps can also be made using a GIS or stats package.  

This is a new implementation, so any feedback about usability would be very useful.  

Shawn Laffan
06-Aug-2016

For more details about Biodiverse, see http://purl.org/biodiverse 

For the full list of changes in the 1.99 series (leading to version 2) see https://purl.org/biodiverse/wiki/ReleaseNotes 

To see what else Biodiverse has been used for, see https://purl.org/biodiverse/wiki/PublicationsList


You can also join the Biodiverse-users mailing list at http://groups.google.com/group/Biodiverse-users 

