Friday, 24 September 2021

Faster calculation of PhyloCom indices: NRI, NTI, MPD and MNTD

This post covers a bit more of the internals than we usually do.  Hopefully it is useful.  

From version 4, Biodiverse will use faster methods to calculate the indices in the PhyloCom set, with ultrametric trees seeing the greatest speedup.  These are the NRI, NTI, MPD and MNTD indices, which in full are the Net Relatedness Index, the Nearest Taxon Index, the Mean Phylogenetic Distance and the Mean Nearest Taxon Distance.  These were originally implemented in the phylocom software (Webb et al. 2008), and many readers will be familiar with them from the R package picante.  

The MPD index is the mean of the path lengths along the tree between each pair of tips in a sample.  The contribution of a branch is proportional to its length and the number of paths it forms part of.  

The MNTD index is the mean of the shortest path between each tip in a sample and the nearest tip that is also in the sample.  For a sample of ten tips there are only ten paths, but in the naïve case one needs to evaluate all paths to determine which is the shortest.  

One point to keep in mind is that branches in MPD and MNTD are counted one or more times, more specifically as many times as the number of paths they form part of.  This is in contrast with PD, where each branch counts only once.  The number of paths also increases quadratically with the number of tips in the sample.  For example, if there are ten tips then there will be 10*(10-1)/2=45 paths to connect all pairs, and if there are 10,000 tips then there are 49,995,000 paths.  To my mind these properties make PD a better index, but that discussion is for another day.  Certainly it is simpler to calculate.
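As a concrete illustration, here is a minimal sketch of the naïve MPD and MNTD calculations for a small sample.  The tip names and pairwise path distances are hypothetical, not taken from any real tree.

```python
# Naive MPD and MNTD for a small sample of tips.
# The tip names and pairwise path distances below are hypothetical.
from itertools import combinations

dist = {
    ("a", "b"): 2.0, ("a", "c"): 5.0, ("a", "d"): 5.0,
    ("b", "c"): 5.0, ("b", "d"): 5.0, ("c", "d"): 1.0,
}

def d(x, y):
    # symmetric lookup of the tip-to-tip path length
    return dist[(x, y)] if (x, y) in dist else dist[(y, x)]

sample = ["a", "b", "c", "d"]

# MPD: the mean over all n*(n-1)/2 pairwise path lengths
pairs = list(combinations(sample, 2))
mpd = sum(d(x, y) for x, y in pairs) / len(pairs)

# MNTD: the mean, over tips, of the shortest path to any other tip in the sample
mntd = sum(min(d(t, o) for o in sample if o != t) for t in sample) / len(sample)

print(len(pairs), mpd, mntd)   # 6 pairs for 4 tips
```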

The NRI and NTI are z-scores of the MPD and MNTD scores, respectively, and indicate whether the paths are longer or shorter than expected given a random resampling of tips across the tree.  The resampling algorithm can vary, but the simplest is to use the same number of tips as is found in the sample.  In other words it matches the richness of the observed sample, so if one has ten tips in an observed sample then each random iteration draws ten tips at random and calculates the MPD and MNTD scores.  There are other random sampling algorithms, such as abundance weighted, but Biodiverse only implements the richness approach for NRI and NTI.

The final NRI z-score is calculated as (observed_MPD - mean_random_MPD) / standard_deviation_random_MPD.  Interpretation follows the usual z-score distribution, with values more extreme than +/-1.96 being in the outer 5% of the distribution and thus significant at alpha=0.05.  The same process applies to NTI, but using MNTD instead of MPD.  (One point of difference between Biodiverse and phylocom is that positive NRI and NTI values in Biodiverse correspond to paths longer than expected, whereas in phylocom these have negative values.)
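The following sketch shows the richness-matched resampling and z-score described above.  The function and variable names (such as `mpd_of` and `all_tips`) are placeholders rather than the Biodiverse internals, and the same function applies to NTI by substituting an MNTD calculation.

```python
# Richness-matched random resampling and the resulting z-score.
# `mpd_of` is a placeholder for an MPD (or MNTD) calculation such as the one sketched earlier.
import random
import statistics

def z_score(observed_sample, all_tips, mpd_of, iterations=999):
    observed = mpd_of(observed_sample)
    k = len(observed_sample)                 # match the richness of the observed sample
    null = [mpd_of(random.sample(all_tips, k)) for _ in range(iterations)]
    # Biodiverse sign convention: positive means paths are longer than expected
    return (observed - statistics.mean(null)) / statistics.stdev(null)
```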

It is also worth noting that the random resamplings used for NRI and NTI in Biodiverse do not use its more general randomisation framework.  One can use the MPD and MNTD scores with such randomisations to try more complex or spatially constrained schemes.  For more on such randomisations see posts with the randomisation tag.

A key problem to date with the Biodiverse implementation is that these calculations are very slow, and become substantially slower as the size of the data sets increases (trees become deeper and have more tips).  The rest of this post describes some of the ways these have been substantially sped up in Biodiverse version 4.  Much of this optimisation work was done by profiling the code with the excellent Devel::NYTProf, and also by implementing the algorithms described in Tsirogiannis et al. (2012, 2014, 2016) and implemented in the PhyloMeasures package (followed by more code profiling with Devel::NYTProf).

Find the Last Common Ancestor

The search for the last common ancestor (or LCA, also referred to as the Most Recent Common Ancestor and Last Shared Ancestor) between a pair of terminal branches is what takes the most time in the MPD and MNTD calculations.  This is a key step in calculating the path connecting two tips.  Biodiverse has always cached the result of the path distance between a pair of branches so it only needs to be calculated once.  However the process of finding the path took a reasonable amount of time, something that was exacerbated when run under the random resampling process.  This has been optimised in several ways.

For ultrametric trees, once the LCA for one pair of tips has been found, Biodiverse caches the same path distance for every pair of tips that shares that LCA.  This pre-warming of the cache obviates the need to repeatedly find the same LCA in later checks.  It works because, as noted in Tsirogiannis et al. (2014), the distance from an internal branch to any of its tips is always the same for an ultrametric tree.

For non-ultrametric trees, Biodiverse caches the last common ancestor for each pair of tips to save searching for it again.  The distance is not calculated until it is needed, but Biodiverse also caches the cumulative path length from each tip to the root, so there is no need to repeatedly traverse the tree to get the distance from a tip to the LCA.
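As a rough illustration of these two caches for the non-ultrametric case, here is a sketch in Python.  The node layout and names are assumptions for illustration, not the Biodiverse code (which is written in Perl).

```python
# Cached tip-to-root path lengths plus a cached LCA per pair of tips.
# The Node layout and names are illustrative only.

class Node:
    def __init__(self, name, parent=None, length=0.0):
        self.name = name
        self.parent = parent
        self.length = length    # length of the branch leading to this node

_root_dist = {}   # cumulative path length from each node up to the root
_lca_cache = {}   # last common ancestor for each pair of tips

def dist_to_root(node):
    if node.name not in _root_dist:
        above = dist_to_root(node.parent) if node.parent else 0.0
        _root_dist[node.name] = node.length + above
    return _root_dist[node.name]

def find_lca(a, b):
    ancestors = set()
    n = a
    while n is not None:
        ancestors.add(n.name)
        n = n.parent
    n = b
    while n.name not in ancestors:
        n = n.parent
    return n

def path_length(a, b):
    key = frozenset((a.name, b.name))
    if key not in _lca_cache:          # the LCA is searched for once, then cached
        _lca_cache[key] = find_lca(a, b)
    lca = _lca_cache[key]
    # path = (a to root) + (b to root) - 2 * (LCA to root): no repeated tree traversal
    return dist_to_root(a) + dist_to_root(b) - 2 * dist_to_root(lca)
```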

NRI and NTI

Faster calculations of MNTD and MPD are always good, but the real time sink is running the NTI and NRI calculations.  Even with faster MPD and MNTD calculations, a calculation that takes 4 seconds for a sample expands to more than an hour when repeated over 999 randomisation iterations.  And keep in mind that Biodiverse uses a convergence approach instead of a fixed number of iterations, so more than 2000 iterations is not unusual.

Re-use expected values for a given sample size

The expected values for a given sample size will not change in any meaningful way across randomisations that have converged on a stable result, so Biodiverse caches these and re-uses them.  For example, if the expected values for a sample of 10 have been calculated then they are re-used for every other sample of 10 in the data set.

This has actually been in Biodiverse since NRI and NTI were first added, but is worth noting.  It is an easy thing to implement for other systems.
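A minimal sketch of the idea, assuming the expected values can be represented as a mean and standard deviation keyed on sample richness (the function name is a placeholder, not the Biodiverse internals):

```python
# Expected values are computed once per sample size (richness) and re-used.
# `expected_stats` is a placeholder for however they are derived (random resampling
# or the exact algorithms described later in this post).

_expected_cache = {}

def expected_for_richness(n_tips, expected_stats):
    if n_tips not in _expected_cache:
        _expected_cache[n_tips] = expected_stats(n_tips)   # e.g. (mean, standard deviation)
    return _expected_cache[n_tips]
```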

In randomisation analyses

The calculation time for NRI and NTI was exacerbated when users ran a randomisation on an analysis that included NRI and NTI.  There is really no need to run the NTI and NRI calculations through a randomisation process, as they are based on a random resampling process to begin with (one would be randomising a randomisation).  However, this is not always obvious to users.  If a user follows a philosophy of "push buttons and watch what happens" (as I do) then a long period of time can be spent waiting for the randomisations to finish as the expected values are recomputed.

The re-use of expected values described above helps here, but these were only cached within the analysis being run.  This means they were not available between randomisations, or to other calculations using the same tree.

Now Biodiverse caches the calculated expected values on the tree and reuses them whenever the tree is used in a subsequent analysis (unless the cache is cleared).  This means they are calculated once only regardless of how many analyses need them.

But the random resampling process was still too slow...

Exact estimates of expected values without needing to randomise

The next improvement was to implement the exact algorithms described by Tsirogiannis et al. (2012 and 2014).  The PhyloMeasures package implements additional steps not described in these papers, but which could be extracted from the package C code.  This is another great example of the benefits of open source code as one can see and understand how an algorithm is implemented in practice.  Where code is complex, or otherwise opaque, one can insert debugging statements in the local version to see what values are being passed into and returned from functions, and how they change within a function.  Importantly, one can also build tests to check the new implementation matches the old.

The exact estimates take advantage of the phylogenetic structure of the trees and repetitions within the calculations.  They are comparatively complex, but lead to processing times that are many orders of magnitude faster than the random sampling approaches, to the point that analysis times previously measured in days now take seconds.  Given they are exact, they also give the same answer each time they are run, so there is no margin of error (although the error is normally very small for values that have converged under a random resampling process).

The exact NRI algorithms apply to both ultrametric and non-ultrametric trees so are applied in all cases.  However, the NTI algorithms only apply to ultrametric trees, so analyses with non-ultrametric trees still require the random resampling approach.  It might be possible to develop better approaches for non-ultrametric trees given the main aim is to calculate probabilities from possible combinations, but that needs more experience with combinatorics than I have.

MNTD and NTI for the full tree

One final optimisation is to implement an algorithm to calculate the MNTD for a sample comprising the full set of tips on a tree.  This is perhaps something of an edge case as analyses will usually work with subsets, but it did not take long to implement.

This optimisation also applies to the NTI because there is only one possible realisation of the MNTD if all the tree tips are in the sample, so the expected mean is the same as the MNTD and the standard deviation is zero (we will conveniently ignore the resultant divide by zero error in the z-score calculation in this case).  I have not checked if the phylocom, picante and PhyloMeasures implementations check for this condition, but it would not be hard to implement if they do not.

The algorithm to find the shortest path for each target tip is as follows (a code sketch is given after the list):

  1. Get the shortest distance from the target tip to any of its siblings' tips
    1. Set this as min_dist
    2. Set the processing node as the target tip
  2. Set the processing node as the processing node's parent
    1. Set the ancestral path distance to be the length from the parent to the target tip
    2. Stop if the ancestral path distance exceeds min_dist
  3. Get the shortest distance from the processing node to its siblings' tips
    1. Add this distance to the ancestral path distance.
    2. If the sum of these distances is shorter than min_dist then assign that value to min_dist.
  4. Stop if the processing node is the root node 
  5. Otherwise go to step 2
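Here is the code sketch referred to above.  The node layout, caching and helper names are illustrative assumptions rather than the Biodiverse implementation, but the control flow follows the numbered steps.

```python
# Nearest-tip search for a single target tip, run once per tip of the full tree.
# The Node layout is illustrative only, not the Biodiverse internals.
import math

class Node:
    def __init__(self, name, length=0.0, children=None):
        self.name = name
        self.length = length              # length of the branch leading to this node
        self.children = children or []
        self.parent = None
        for c in self.children:
            c.parent = self

_min_depth = {}   # cached shortest downward distance from a node to any tip beneath it

def min_depth_to_tip(node):
    if node.name not in _min_depth:
        if not node.children:             # the node is itself a tip
            _min_depth[node.name] = 0.0
        else:
            _min_depth[node.name] = min(c.length + min_depth_to_tip(c) for c in node.children)
    return _min_depth[node.name]

def dist_via_siblings(node):
    # shortest path from `node`, up through its parent, then down to a tip under a sibling
    if node.parent is None:
        return math.inf
    sibs = [c for c in node.parent.children if c is not node]
    if not sibs:
        return math.inf
    return node.length + min(s.length + min_depth_to_tip(s) for s in sibs)

def nearest_tip_distance(tip):
    min_dist = dist_via_siblings(tip)     # step 1: tips beneath the target tip's siblings
    processing, anc_dist = tip, 0.0
    while processing.parent is not None:  # steps 4 and 5: stop once the root is reached
        anc_dist += processing.length     # step 2: move up; anc_dist is the tip-to-node distance
        processing = processing.parent
        if anc_dist > min_dist:
            break                         # step 2.2: deeper ancestors can only give longer paths
        candidate = anc_dist + dist_via_siblings(processing)   # step 3
        if candidate < min_dist:
            min_dist = candidate
    return min_dist

# MNTD for the full tree is then the mean of nearest_tip_distance over every tip:
# mntd = sum(nearest_tip_distance(t) for t in tips) / len(tips)
```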

Each check should complete in approximately O(log(d)) time, where d is the depth of the tree, so the full calculation will take approximately O(n log(d)) time for a tree with n tips.  

The calculation of distances to tips for each branch is the cumulative path length from the tip, as noted above for the LCA calculations.   This is cached on the first calculation and then re-used, leading to further speed gains over a naïve implementation.

More?

There are potentially more optimisations (there always are) but these will do for now.  

If you want to see all the changes to the internals then they are tracked under issues 786, 788, 789, 793, 794 and 797.  Suggestions for other approaches are always welcome and can be raised on the mailing list (see link below) or on the issue tracker.  


Shawn Laffan

24-Sep-2021


For more details about Biodiverse, see http://shawnlaffan.github.io/biodiverse/  


To see what else Biodiverse has been used for, see https://github.com/shawnlaffan/biodiverse/wiki/PublicationsList 


You can also join the Biodiverse-users mailing list at https://groups.google.com/group/Biodiverse-users 


Thursday, 23 September 2021

Label and group property median and percentile statistics are changing

Biodiverse supports the analysis of additional data values attached to each label and group.  For labels, these could be things like species or population traits such as specific leaf area.  For groups these are things like the average phosphorus content in the soil across a group or set of groups.  More examples of analysing label traits are in this post, and group traits are described in Bein et al. (2020).  

The simplest means of analysing the label and group properties is to calculate summary statistics of the relevant values across the neighbour sets in use.  The relevant indices for these are under the Element Properties category because labels and groups in Biodiverse are referred to by the more generic term "elements".   

However, the implementation of these summary statistics to date has been relatively inefficient, especially for the range weighted statistics.  Where a property was assigned a weight of more than 1, the value was repeated that many times in the vector of values used to calculate the statistics, e.g. [1,1,1,2,2,2,2,2] for values 1 and 2 with weights 3 and 5.  This is not an issue for small data sets, but imagine that repeated for 10,000 unique label values, each of which has weights between 1 and 200, and then across 10,000 groups.  The calculations become quite inefficient, and the repetition is needless given a weighted statistics approach can be used instead.

From version 4, Biodiverse uses a weighted implementation for its statistics.  This is slightly less efficient for cases where all weights are equal but there are always trade-offs when writing code for the more general case.

This new approach will have no impact on the results for statistics like the mean, standard deviation, skewness or kurtosis.
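A small sketch showing why: with integer weights the weighted mean and standard deviation are identical to those from the old repeated-value vectors.  The values and weights here are hypothetical.

```python
# With integer weights, weighted statistics match the old repeated-value approach.
import statistics

values  = [1, 2]
weights = [3, 5]                      # the old approach expanded this to [1,1,1,2,2,2,2,2]

expanded = [v for v, w in zip(values, weights) for _ in range(w)]

# weighted mean and (bias-corrected) standard deviation
n = sum(weights)
w_mean = sum(v * w for v, w in zip(values, weights)) / n
w_var  = sum(w * (v - w_mean) ** 2 for v, w in zip(values, weights)) / (n - 1)

print(statistics.mean(expanded), w_mean)            # identical means
print(statistics.stdev(expanded), w_var ** 0.5)     # identical standard deviations
```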

However, there will be a change to the way the percentiles and the median are calculated.  Previously, the library used would snap down to the nearest data value when a requested percentile did not exactly align with the data values.  The new approach uses interpolation, with the results being consistent with how percentiles are calculated in R (for an unweighted vector).  

This means that any calculations of the median or percentiles in Biodiverse 4 will likely return higher values for some percentiles.  The effect will be greater for smaller samples where the numeric gaps between sequential values are larger, but such sample size effects are hardly unusual in statistics.
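The example below illustrates the difference on a small hypothetical vector.  The interpolated version follows R's default quantile formula for an unweighted vector; the snap-to-lower version is only an approximation of the behaviour of the previous library.

```python
# Old snap-to-lower percentiles versus the new interpolated behaviour,
# on a small hypothetical unweighted vector.
import math

def percentile_snap_lower(values, p):
    # approximation of the old behaviour: snap down to a data value
    xs = sorted(values)
    idx = int(math.floor((len(xs) - 1) * p / 100))
    return xs[idx]

def percentile_interpolated(values, p):
    # new behaviour: linear interpolation, matching R's default quantile formula
    xs = sorted(values)
    h = (len(xs) - 1) * p / 100
    lo = int(math.floor(h))
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])

data = [1, 2, 4, 8]
for p in (50, 90):
    print(p, percentile_snap_lower(data, p), percentile_interpolated(data, p))
# 50 -> 2 vs 3.0 (the median), 90 -> 4 vs 6.8
```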

One point that is yet to be dealt with is when the weights are not integers, for example where the sample counts used for a Basedata are from Species Distribution Model likelihoods (see the BCCVL if you need an online tool to calculate these).  In such cases the percentiles cannot use interpolation and will use the centre of mass.  Bias correction is also not possible for statistics like standard deviation, skewness and kurtosis in such cases, as the sum of weights is not the same as the number of samples.  This issue is for the future, though, as we do not yet support abundance weighted label stats.

For those interested in the implementation details, the approach uses the Statistics::Descriptive::PDL package which in turn uses tools provided by the Perl Data Language (PDL).  For those more familiar with R or Python, PDL provides Perl support for fast calculations using matrices and vectors.


Shawn Laffan

23-Sep-2021


For more details about Biodiverse, see http://shawnlaffan.github.io/biodiverse/  


To see what else Biodiverse has been used for, see https://github.com/shawnlaffan/biodiverse/wiki/PublicationsList 


You can also join the Biodiverse-users mailing list at https://groups.google.com/group/Biodiverse-users