Thursday, 13 December 2018

Import polygon and polyline data

The short summary

As of Biodiverse version 3, you can directly import polygon and polyline data from GIS feature data sets.  If you want to try it before v3 is released, it is available in the current development release.

The more detailed explanation:

Ever since development started on Biodiverse it has been able to import spatial data as point records from delimited text files (e.g. CSV format).  The ability to import raster data was later added, as well as the capacity to import point records from more sources (spreadsheets and shapefiles).

However, taxon distribution records are also frequently provided in the form of polygon range maps.  One commonly used example is the IUCN Red List data, but there are many other sources.

The way to import such data using Biodiverse 2.1 and earlier is to process the data outside Biodiverse so they can be represented as points or tables.  This is done by intersecting them with a fishnet of polygons (also called a vector grid) that aligns with the cells that will be used in Biodiverse.  Once intersected, they can be converted to points, or their coordinates added to the attribute tables, using centroid calculations.  This is what was done in López-Aguirre et al. (2018), for example.

The fishnet approach is relatively simple if one is familiar with GIS operations, but is not something that should be done by hand when numerous taxa are to be analysed or different coordinate origins are being tried.  In such cases one can script the process, but for many this can be yet another thing to learn, and not something that can be done in a hurry to meet a short deadline.  (Note that scripting is a very useful skill to have, and is portable beyond the language du jour one might first learn.)

With some recent changes to the Biodiverse codebase, importation of polygon data is automated and part of the standard Biodiverse data import process.  As an added bonus, polyline data are also supported, so if you have data such as for crustacean presences along stream segments then they can also be imported.

As another bonus, if you have a mix of point, line and polygon data then they can all be imported in one pass, providing they all have the fields or attributes you select.  If not then the system will throw an error.

The geometry attributes available to select from are :shape_x, :shape_y, :shape_z, :shape_m, :shape_area and :shape_length.  Not all files have all attributes.  Point files do not have :shape_area or :shape_length, polygon files do not have :shape_length, and polyline files do not have :shape_area.  Many files do not have :shape_z or :shape_m axes - these are for 3D shapefiles or those with time measures.
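The rules above can be summarised as a small lookup.  This is only an illustrative sketch in Python, not Biodiverse code: the names GEOMETRY_FIELDS, available_fields and check_selection are hypothetical, and it assumes that every geometry type has :shape_x and :shape_y.

```python
# Hypothetical summary of which geometry attributes each file type offers.
# Assumption: all geometry types have :shape_x and :shape_y.
GEOMETRY_FIELDS = {
    "point":    {":shape_x", ":shape_y"},
    "polyline": {":shape_x", ":shape_y", ":shape_length"},
    "polygon":  {":shape_x", ":shape_y", ":shape_area"},
}


def available_fields(geom_type, table_fields=(), has_z=False, has_m=False):
    """Return the fields a file offers: geometry attributes plus table columns."""
    fields = set(GEOMETRY_FIELDS[geom_type]) | set(table_fields)
    if has_z:
        fields.add(":shape_z")
    if has_m:
        fields.add(":shape_m")
    return fields


def check_selection(selected, files):
    """Raise ValueError if any file lacks one of the selected fields.

    files maps a file name to the set of fields that file offers.
    """
    for name, fields in files.items():
        missing = sorted(set(selected) - fields)
        if missing:
            raise ValueError(f"{name} lacks {', '.join(missing)}")


files = {
    "points.shp": available_fields("point", ["taxon"]),
    "ranges.shp": available_fields("polygon", ["taxon"]),
}
# Fields shared by both files pass the check.
check_selection([":shape_x", ":shape_y", "taxon"], files)
```

Selecting :shape_area for the same pair of files would raise an error, since the point file has no area.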

A worked example

A worked example is probably the best way to show how to use it.  Those familiar with the process of importing data will note that it is almost the same as the current process, which is quite convenient in that it is one fewer thing to learn.  

Some example data.  The polygons have no specific meaning.

As usual, select the data set (or data sets) to import.  Make sure you select Shapefile as the Format.  

This step is identical to the spreadsheet and delimited text imports.

Select the fields or attributes as appropriate.  The attributes that are visible (:shape_x, :shape_m, :shape_length, :shape_area etc) depend on properties of the first file selected. 

And here is the file imported.  There is only one taxon label in this data set, so there is not much more to show, but once imported the data are analysed like any other.

How does it work?

It is essentially an automation of the process described above.

First, a fishnet grid of polygons is generated to match the cell size of the BaseData object being imported into.
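As a rough idea of what generating such a fishnet involves, here is a minimal sketch in plain Python.  It is not the Biodiverse implementation (which uses GDAL and GEOS); the function names and the origin parameter are assumptions made for illustration.

```python
import math


def fishnet_cells(xmin, ymin, xmax, ymax, cell_size, origin=(0.0, 0.0)):
    """Yield (x0, y0, x1, y1) rectangles covering a bounding box.

    Cell edges are snapped to the grid defined by cell_size and origin,
    so the rectangles line up with the analysis cells.
    """
    cx, cy = cell_size
    ox, oy = origin
    # Index of the first cell, and one past the last cell, on each axis.
    i0 = math.floor((xmin - ox) / cx)
    j0 = math.floor((ymin - oy) / cy)
    i1 = max(math.ceil((xmax - ox) / cx), i0 + 1)
    j1 = max(math.ceil((ymax - oy) / cy), j0 + 1)
    for i in range(i0, i1):
        for j in range(j0, j1):
            yield (ox + i * cx, oy + j * cy,
                   ox + (i + 1) * cx, oy + (j + 1) * cy)


def cell_centre(cell):
    """Return the centre coordinate of a fishnet cell."""
    x0, y0, x1, y1 = cell
    return ((x0 + x1) / 2, (y0 + y1) / 2)


# Cover a data extent with 1 x 1 cells aligned to an origin of (0, 0).
cells = list(fishnet_cells(0.5, 0.5, 2.5, 1.5, (1.0, 1.0)))
```

Snapping to the cell size and origin is what keeps the fishnet aligned with the cells used in the analysis, which is the same requirement as in the manual workflow described earlier.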

There are then two ways of handling the data.

The default approach is to treat the polygon and polyline data as presence-only, so a taxon is recorded as present in any fishnet polygon that its feature data intersect with.  This is by far the fastest approach as the system can stop checking and return true as soon as it finds an intersection.
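The early exit is easy to see in a small sketch.  The code below is plain Python using textbook tests (vertex in cell, cell corner in polygon, edge crossings) rather than the GEOS predicates Biodiverse actually calls, and all function names are hypothetical.  It handles a single polygon ring without holes, and counts touching as intersecting.

```python
def point_in_ring(x, y, ring):
    """Ray-casting point-in-polygon test; ring is a list of (x, y) vertices."""
    inside = False
    for k in range(len(ring)):
        x1, y1 = ring[k - 1]
        x2, y2 = ring[k]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def segments_cross(p, q, r, s):
    """True if segment pq and segment rs share at least one point."""
    def orient(a, b, c):
        v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
        return (v > 0) - (v < 0)

    def within(a, b, c):
        return (min(a[0], b[0]) <= c[0] <= max(a[0], b[0])
                and min(a[1], b[1]) <= c[1] <= max(a[1], b[1]))

    o1, o2 = orient(p, q, r), orient(p, q, s)
    o3, o4 = orient(r, s, p), orient(r, s, q)
    if o1 != o2 and o3 != o4:
        return True
    return ((o1 == 0 and within(p, q, r)) or (o2 == 0 and within(p, q, s))
            or (o3 == 0 and within(r, s, p)) or (o4 == 0 and within(r, s, q)))


def ring_intersects_cell(ring, cell):
    """Presence test: return True as soon as any evidence of overlap is found."""
    x0, y0, x1, y1 = cell
    # Cheapest check first: any() stops at the first vertex inside the cell.
    if any(x0 <= x <= x1 and y0 <= y <= y1 for x, y in ring):
        return True
    # The polygon may completely contain the cell.
    if point_in_ring(x0, y0, ring):
        return True
    # Otherwise look for a polygon edge crossing a cell edge.
    corners = [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]
    for k in range(len(ring)):
        for c in range(4):
            if segments_cross(ring[k - 1], ring[k], corners[c - 1], corners[c]):
                return True
    return False


triangle = [(0.2, 0.2), (1.6, 0.2), (0.2, 1.6)]
cells = [(0, 0, 1, 1), (1, 0, 2, 1), (0, 1, 1, 2), (1, 1, 2, 2)]
present = [cell for cell in cells if ring_intersects_cell(triangle, cell)]
```

No geometry is constructed or measured here, only a yes or no answer per cell, which is why the presence-only approach is so much cheaper.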

The second approach is to calculate a new data set that is the intersection of the input data set and the fishnet data set (imagine using the fishnet polygons as a cookie cutter on the taxon polygons).  This process can be substantially slower, as the system must iterate over the polygon or polyline vertices, identify where they intersect, and then cut them as appropriate.  However, if you need the additional information then so be it.  That said, this approach is only used if the area or length of the intersecting features is needed, for example when it is to be added as a group property or used for the sample counts.
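To give a feel for the cookie cutter step, here is a minimal sketch that clips one polygon ring to one fishnet cell and measures the clipped area.  It uses the textbook Sutherland-Hodgman algorithm and the shoelace formula in plain Python, as a stand-in for the GEOS intersection used in practice; holes, multipart features and polylines are not handled, and the function names are hypothetical.

```python
def clip_ring_to_cell(ring, cell):
    """Clip a polygon ring to an axis-aligned cell (Sutherland-Hodgman)."""
    x0, y0, x1, y1 = cell

    def clip(points, inside, intersect):
        # Keep the part of the ring on the inner side of one cell edge.
        out = []
        for k in range(len(points)):
            prev, cur = points[k - 1], points[k]
            if inside(cur):
                if not inside(prev):
                    out.append(intersect(prev, cur))
                out.append(cur)
            elif inside(prev):
                out.append(intersect(prev, cur))
        return out

    def at_x(xc):
        return lambda p, q: (xc, p[1] + (q[1] - p[1]) * (xc - p[0]) / (q[0] - p[0]))

    def at_y(yc):
        return lambda p, q: (p[0] + (q[0] - p[0]) * (yc - p[1]) / (q[1] - p[1]), yc)

    points = list(ring)
    for inside, intersect in (
        (lambda p: p[0] >= x0, at_x(x0)),
        (lambda p: p[0] <= x1, at_x(x1)),
        (lambda p: p[1] >= y0, at_y(y0)),
        (lambda p: p[1] <= y1, at_y(y1)),
    ):
        points = clip(points, inside, intersect)
        if not points:
            break  # nothing left inside the cell
    return points


def ring_area(ring):
    """Unsigned area of a ring using the shoelace formula."""
    total = 0.0
    for k in range(len(ring)):
        xa, ya = ring[k - 1]
        xb, yb = ring[k]
        total += xa * yb - xb * ya
    return abs(total) / 2


triangle = [(0, 0), (2, 0), (0, 2)]
cells = [(0, 0, 1, 1), (1, 0, 2, 1), (0, 1, 1, 2), (1, 1, 2, 2)]
areas = [ring_area(clip_ring_to_cell(triangle, cell)) for cell in cells]
```

Every vertex has to be visited for every cell edge, and new vertices are computed wherever the ring crosses a cell boundary.  That is the extra work the presence-only approach avoids.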

The underlying processing is all done using the GDAL and GEOS libraries, so some of the operations will be familiar to some users as there are interfaces for Python and R, amongst other languages.

Spatial indexes 

Both approaches use spatial indexes to speed up the calculations.  As an example of the difference this makes, one data set used in testing took 9 minutes without the index, and 70 seconds with it.  For comparison, testing for presence only takes a few seconds (with the index).  It is worth noting that spatial indexes have long been used in Biodiverse to speed up processing, albeit using a different approach.

Note that, even with the spatial index, large and complex polygons will take longer to import than simple polygons.  Multipart polygons can also sometimes take longer than single part, especially if the envelope of the features (the bounding rectangle) is very large.  This is because Biodiverse needs to check all the fishnet polygons within the envelope, so if most of the fishnet polygons do not overlap the taxon polygon then most of these checks are redundant.  If there is a need then further optimisations for the above issues can be looked for.
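The envelope effect can be illustrated with a few lines of plain Python.  This is only a sketch of the general idea, with hypothetical function names and a grid origin assumed to be (0, 0); it is not a description of how Biodiverse builds or queries its index.

```python
import math


def envelope(points):
    """Bounding rectangle (xmin, ymin, xmax, ymax) of a set of vertices."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def candidate_cells(env, cell_size):
    """Grid indices (i, j) of every cell that overlaps an envelope."""
    xmin, ymin, xmax, ymax = env
    i0, i1 = math.floor(xmin / cell_size), math.floor(xmax / cell_size)
    j0, j1 = math.floor(ymin / cell_size), math.floor(ymax / cell_size)
    return [(i, j) for i in range(i0, i1 + 1) for j in range(j0, j1 + 1)]


# A multipart feature: two small triangles a long way apart.
parts = [
    [(0.1, 0.1), (0.9, 0.1), (0.5, 0.9)],
    [(99.1, 99.1), (99.9, 99.1), (99.5, 99.9)],
]

# One envelope around the whole feature covers a 100 x 100 block of cells...
whole = envelope([vertex for part in parts for vertex in part])
n_whole = len(candidate_cells(whole, 1.0))

# ...while the parts themselves only fall in one cell each.
n_parts = sum(len(candidate_cells(envelope(part), 1.0)) for part in parts)
```

In this contrived case the single envelope yields 10,000 candidate cells against 2 when each part is considered separately, so almost all of the intersection checks are wasted.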

[Update 22-12-2018 - several optimisations have since been implemented to address the above issues and will be available in the 2.99_002 development release.]


You can also just use the attribute table

There are some occasions when you only want the data from the attribute table.  If you don't use any of the geometry fields (:shape_x, :shape_area etc.) then Biodiverse will treat the table in the same way that it imports delimited text and spreadsheet data.  This means that each record in the table is the same as a row in a spreadsheet or line in a text file.

An example of when this might be useful is if you have data summarised across biomes or other regions and are not interested in analysing the data spatially, e.g. you only want to calculate Phylogenetic Diversity for the biome level assemblages and not at every location in the biome.




Shawn Laffan
10-Dec-2018



For more details about Biodiverse, see http://shawnlaffan.github.io/biodiverse/  

For the full list of changes in the 2.99 series (leading to version 3) see https://github.com/shawnlaffan/biodiverse/wiki/ReleaseNotes#version-299 (for all issues addressed or being targeted to fix for version 2, see https://github.com/shawnlaffan/biodiverse/milestone/15 ).


To see what Biodiverse has been used for, see https://github.com/shawnlaffan/biodiverse/wiki/PublicationsList


You can also join the Biodiverse-users mailing list at http://groups.google.com/group/Biodiverse-users or follow the google plus page:  https://plus.google.com/+BiodiverseSoftware

