Thursday 23 September 2021

Label and group property median and percentile statistics are changing

Biodiverse supports the analysis of additional data values attached to each label and group.  For labels, these could be things like species or population traits such as specific leaf area.  For groups these are things like the average phosphorus content in the soil across a group or set of groups.  More examples of analysing label traits are in this post, and group traits are described in Bein et al. (2020).  

The simplest means of analysing the label and group properties is to calculate summary statistics of the relevant values across the neighbour sets in use.  The relevant indices for these are under the Element Properties category because labels and groups in Biodiverse are referred to by the more generic term "elements".   

However, the implementation of these summary statistics to date has been relatively inefficient, especially for the range-weighted statistics.  Where a property was assigned a weight of more than 1, the value was repeated that many times in the vector of values used to calculate the statistics, e.g. [1,1,1,2,2,2,2,2] for a value of 1 with weight 3 and a value of 2 with weight 5.  This is not an issue for small data sets, but imagine that repeated for 10,000 unique label values, each of which has a weight between 1 and 200, and then across 10,000 groups.  That leads to substantial inefficiency in the calculations.  The repetition is also needless, given that a weighted statistics approach can be used.
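To make the equivalence concrete, here is a minimal Python sketch (not the Biodiverse code, which is written in Perl).  The values, weights and function names are hypothetical and chosen only to match the small example vector above.  It compares statistics calculated from the expanded vector with the same statistics calculated directly from value and weight pairs.

```python
from statistics import fmean, pvariance

# Hypothetical property values and their integer weights.
values = [1, 2]
weights = [3, 5]

# Repetition approach: repeat each value as many times as its weight.
expanded = [v for v, w in zip(values, weights) for _ in range(w)]


def weighted_mean(values, weights):
    """Mean of the values, each counted according to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def weighted_pvariance(values, weights):
    """Population variance of the values, each counted according to its weight."""
    mu = weighted_mean(values, weights)
    return sum(w * (v - mu) ** 2 for v, w in zip(values, weights)) / sum(weights)


print(expanded)
print(fmean(expanded), weighted_mean(values, weights))
print(pvariance(expanded), weighted_pvariance(values, weights))
```

The weighted versions only ever touch one entry per unique value, so their cost does not grow with the size of the weights.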

From version 4, Biodiverse uses a weighted implementation for its statistics.  This is slightly less efficient for cases where all weights are equal, but there are always trade-offs when writing code for the more general case.

This new approach will have no impact on the results for statistics like the mean, standard deviation, skewness or kurtosis.

However, there will be a change to the way the percentiles and the median are calculated.  Previously, when a requested percentile did not align exactly with a data value, the library in use would snap to the lower of the neighbouring data values.  The new approach uses interpolation, with the results being consistent with how percentiles are calculated in R (for an unweighted vector).  
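The difference between the two behaviours can be sketched in a few lines of Python.  This is an illustration under stated assumptions, not the Biodiverse implementation: the data and function names are hypothetical, the "lower" function is only one plausible reading of the previous snapping behaviour, and the interpolated function follows the default method of R's quantile function (type 7) for an unweighted vector.

```python
import math


def percentile_lower(sorted_vals, p):
    """Snap down to the data value at or below the requested position
    (an assumed stand-in for the previous behaviour)."""
    h = (len(sorted_vals) - 1) * p / 100
    return sorted_vals[math.floor(h)]


def percentile_linear(sorted_vals, p):
    """Linearly interpolate between the two neighbouring data values,
    as in R's default quantile method (type 7)."""
    h = (len(sorted_vals) - 1) * p / 100
    lo = math.floor(h)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = h - lo
    return sorted_vals[lo] + frac * (sorted_vals[hi] - sorted_vals[lo])


# Hypothetical sorted sample with uneven gaps between values.
data = [1, 2, 4, 8]

for p in (25, 50, 75):
    print(p, percentile_lower(data, p), percentile_linear(data, p))
```

For this sample the median moves from 2 under the snapping rule to 3 under interpolation, which is the value R returns for median(c(1, 2, 4, 8)).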

This means that any calculations of the median or percentiles in Biodiverse 4 will likely return higher values for some percentiles.  The effect will be greater for smaller samples, where the numeric gaps between sequential values are larger, but such sample size effects are hardly unusual in statistics.

One point that is yet to be dealt with is the case where the weights are not integers, for example where the sample counts used for a Basedata are derived from Species Distribution Model likelihoods (see the BCCVL if you need an online tool to calculate these).  In such cases the percentiles cannot use interpolation and will use the centre of mass.  Bias correction is also not possible for statistics like the standard deviation, skewness and kurtosis in such cases, as the sum of weights is not the same as the number of samples.  This issue is for the future, though, as we do not yet support abundance-weighted label stats.
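The bias correction problem is easy to demonstrate with a small Python sketch.  Again, this is illustrative only: the function name and the weights are hypothetical.  With integer weights, dividing by the sum of weights minus one reproduces the usual sample variance of the expanded vector.  With fractional weights the sum of weights no longer counts samples, and the same formula can return nonsense.

```python
from statistics import variance


def weighted_sample_variance(values, weights):
    """Variance with an n - 1 style correction, treating the sum of
    weights as if it were the number of samples."""
    total = sum(weights)
    mu = sum(v * w for v, w in zip(values, weights)) / total
    ss = sum(w * (v - mu) ** 2 for v, w in zip(values, weights))
    return ss / (total - 1)


# Integer weights: matches the sample variance of the expanded vector.
expanded = [1] * 3 + [2] * 5
int_result = weighted_sample_variance([1, 2], [3, 5])
print(int_result, variance(expanded))

# Hypothetical fractional weights, e.g. likelihoods.  Their sum is 0.5,
# so "sum of weights minus one" is negative and so is the "variance".
frac_result = weighted_sample_variance([1, 2], [0.2, 0.3])
print(frac_result)
```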

For those interested in the implementation details, the approach uses the Statistics::Descriptive::PDL package, which in turn uses tools provided by the Perl Data Language (PDL).  For those more familiar with R or Python, PDL provides Perl support for fast calculations using matrices and vectors.


Shawn Laffan

23-Sep-2021


For more details about Biodiverse, see http://shawnlaffan.github.io/biodiverse/  


To see what else Biodiverse has been used for, see https://github.com/shawnlaffan/biodiverse/wiki/PublicationsList 


You can also join the Biodiverse-users mailing list at https://groups.google.com/group/Biodiverse-users 

