Wednesday 26 April 2017

Matching spatial, tree, matrix and property data is now easier

One of the biggest bugbears when using Biodiverse has been matching the names between the spatial data and any trees, matrices, group properties and label properties.

Biodiverse uses an exact matching scheme to link tree branch names with Basedata labels (e.g. species).  If even a single character differs then the two names will not match, and analyses either do not run or are incomplete.
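To illustrate why exact matching is so unforgiving, here is a minimal Python sketch.  It is not Biodiverse code, and the label names are made up for the example.  It simply treats each set of names as a set of strings and intersects them.

```python
# Hypothetical label sets, invented for illustration only.
basedata_labels = {"Genus:sp1", "Genus:sp2", "Genus:sp3"}
# The second name uses an underscore, the third has a trailing space.
tree_labels = {"Genus:sp1", "Genus_sp2", "Genus:sp3 "}


def exact_matches(a, b):
    """Return the names that are identical, character for character, in both sets."""
    return a & b


matched = exact_matches(basedata_labels, tree_labels)
unmatched_tree = tree_labels - basedata_labels

print(sorted(matched))
print(sorted(unmatched_tree))
```

Only one of the three tree names survives the comparison, even though a human reader would pair all three without hesitation.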

The standard approach used in Biodiverse to ensure the data match is for users to provide a remap table that "maps" labels from one data set to another (e.g. make the tree names match the basedata).  However, experience suggests that users are often unsure which columns should be used for what (and sometimes how many), and that generating the remap file is more difficult than desired.
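The idea behind a remap table can be sketched in a few lines of Python.  This is an illustration of the concept only: the column names (tree_name, basedata_label) and the function names are assumptions made for the example, not the format Biodiverse requires.

```python
import csv
import io

# Hypothetical remap table with one input column and one remapped column.
remap_text = """tree_name,basedata_label
Genus_sp2,Genus:sp2
Genus sp3,Genus:sp3
"""


def load_remap(text, input_col, remapped_col):
    """Read a delimited table and return {input name: remapped name}."""
    reader = csv.DictReader(io.StringIO(text))
    return {row[input_col]: row[remapped_col] for row in reader}


def apply_remap(names, remap):
    """Rename each name found in the remap; leave all other names unchanged."""
    return [remap.get(name, name) for name in names]


remap = load_remap(remap_text, "tree_name", "basedata_label")
tree_names = ["Genus:sp1", "Genus_sp2", "Genus sp3"]
remapped = apply_remap(tree_names, remap)
print(remapped)
```

The user has to say which column holds the input names and which holds the remapped names, and that is exactly the step that tends to cause confusion.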

Biodiverse version 2 has a greatly simplified remap process that can generate possible matches automatically.  Any object can have its element names remapped at import, as before, but now there are also options under the Basedata, Tree and Matrix menus.  Better yet, there is also a centralised interface that can be accessed under the File menu.


In the centralised remap interface, one can match any object to any other object in the project, as well as to names loaded from a file.  If an object is chosen then Biodiverse will search the sets of names in each object to find possible matches.  The "User defined from file" option is the conventional process, where the file needs to define the set of input and remapped columns.  The "Auto from file" option allows users to load a list of names which will be searched to find possible matches, in the same way as other objects are searched.

The search process uses a fuzzy text matching scheme, as well as searching for differences purely in punctuation or in quoting characters.  The Maximum acceptable distance option allows control over how different possible matches can be.  If a very large value is used then anything can match anything else, which is unlikely to be useful.
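The general shape of such a search can be sketched as follows.  This post does not say which distance measure Biodiverse uses, so the sketch assumes plain Levenshtein edit distance, and the function names and category labels are invented for the example.  Note that the punctuation check here also ignores whitespace and underscores, which is another assumption.

```python
import re


def strip_punct(name):
    """Drop punctuation, quotes, whitespace and underscores; keep letters and digits."""
    return re.sub(r"[\W_]+", "", name)


def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def find_match(name, candidates, max_distance):
    """Return (matched name or None, category) for one name."""
    if name in candidates:
        return name, "exact"
    for cand in candidates:
        if strip_punct(cand) == strip_punct(name):
            return cand, "punctuation"
    best, best_dist = None, None
    for cand in candidates:
        dist = edit_distance(name, cand)
        if best_dist is None or dist < best_dist:
            best, best_dist = cand, dist
    if best is not None and best_dist <= max_distance:
        return best, "typo"
    return None, "none"


candidates = ["Genus:sp1", "Genus:sp2", "Acacia:dealbata"]
for query in ["Genus:sp1", "Genus_sp2", "Acacia:dealbatta", "Eucalyptus:regnans"]:
    print(query, find_match(query, candidates, max_distance=2))
```

With a maximum distance of 2 the last name finds no partner.  Raise the limit to something like 100 and it will be paired with one of the candidates regardless, which is the "anything can match anything else" situation described above.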

The interface for the automatic remaps is as below.  There are panes listing exact matches, non-matches (i.e. differences were too great), punctuation matches and possible typos.  Users can choose to ignore any of these sets, and also select subsets within the sets if, for example, there are false matches.

The menus for Basedata, Matrices and Trees all have remap options, but there is also a centralised system under the File menu.


Any object can be matched with any other object, as well as loading remaps from files.

The Maximum acceptable distance controls how different names can be before they are no longer considered as possible matches.
Users can control the level of remapping used within the match categories.  
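As a rough sketch of what choosing categories and excluding false matches amounts to, consider the following Python.  The data structure, category names and function name are all assumptions made for the illustration and do not reflect how Biodiverse stores its matches internally.

```python
def build_remap(categorised, use_categories, excluded=()):
    """Combine the chosen match categories into one remap, skipping excluded names."""
    excluded = set(excluded)
    remap = {}
    for category in use_categories:
        for source, target in categorised.get(category, {}).items():
            if source not in excluded:
                remap[source] = target
    return remap


# Hypothetical results of an automatic search, grouped by category.
categorised = {
    "exact": {"Genus:sp1": "Genus:sp1"},
    "punctuation": {"Genus_sp2": "Genus:sp2"},
    "typo": {"Acacia:dealbatta": "Acacia:dealbata", "Genus:sp4": "Genus:sp3"},
    "not_matched": {},
}

# Apply the punctuation and typo matches, but reject one false typo match.
remap = build_remap(categorised, ["punctuation", "typo"], excluded={"Genus:sp4"})
print(remap)
```

Here Genus:sp4 is only one character away from Genus:sp3, so a distance-based search would flag it as a possible typo, but it is a different taxon and the user leaves it out.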



Users can also choose to export the remap to a file to re-use later on (or inspect for possible problems) using the Export remap to file button.  A copy of the remap can also be sent to the clipboard for direct use in spreadsheets, text editors and the like.



Remapping of BaseData objects is not permitted if they have existing outputs, as that would potentially wreak havoc with the matches between the BaseData and the outputs.




The system will also warn if you try to map an object to itself (but it will let you try if you are determined to do so).




Kudos to Luke Fitzpatrick who got it all working.


Shawn Laffan
26-Apr-2017


For more details about Biodiverse, see http://purl.org/biodiverse 

For the full list of changes in the 1.99 series (leading to version 2) see https://purl.org/biodiverse/wiki/ReleaseNotes (for all issues addressed or being targeted to fix for version 2, see https://github.com/shawnlaffan/biodiverse/milestone/4 ).

To see what else Biodiverse has been used for, see https://purl.org/biodiverse/wiki/PublicationsList


You can also join the Biodiverse-users mailing list at http://groups.google.com/group/Biodiverse-users

