iMarine Partners Biodiversity Position
The management of biodiversity data covers the observations of species occurrences, their distribution mapping, and the visual and statistical analysis of both observations (point data) and distributions (areas). It requires import of structured data in various formats, in particular compliant with the Darwin Core as xml or csv datasets. The purpose of the management is to produce reliable datasets and maps on species and their potential current, historic and future distribution.
Compared to other initiatives, iMarine already offers the basic components to load, share, publish and analyze data. This makes the iMarine infrastructure an attractive option for the further development of biodiversity components.
Many data owners in the marine biodiversity domain have difficulty in gaining consistent access to biodiversity and environmental data in enough detail and with relevant metadata. There are concerns about the sheer number of datasets that have to be maintained, with multiple data streams and formats putting pressure on software developers. Concerns about the interoperability of software and the related risk of exploding support costs for software maintenance make OS development in a CoP a potentially attractive proposition.
In D4Science-II, considerable experience has been gained in the acquisition and management of species occurrence data. In addition, many services marshalling the data from “Sea to Shelf” are available: curation, metadata collection, transformation, mapping and repository services, to name a few. The aim of the iMarine Biodiversity partners is to provide a stronger, more resilient and flexible framework for Biodiversity Data Management. Collaboration in an Ecosystem Approach Community of Practice can help to achieve that.
This collaboration needs to be based on reliable and free resources. Access to and maintenance of these resources is the responsibility of all partners of the EA-CoP, and users of the supporting eInfrastructure will have to develop and commit to an open data policy.
An effective Biodiversity data policy is an iMarine EA-CoP policy; it needs to be defined and approved by the iMarine Board. After all, making clear, fair agreements on the component development of OS software and effectively enforcing these rules will increase political and CoP support for iMarine. This proposal lists some of the components that can help achieve this.
We are aware that Biodiversity is just one facet in the iMarine decision-making processes. To facilitate these processes, the component proposals not only put forward implementation actions, but also describe the background and the anticipated impact. We are keen to enter into discussion with other iMarine partners and iMarine supporting institutions. We invite other parties (Board partners, iMarine institutions and CoP) to have a say.
The Biodiversity proposal: objective and means
The main objective in the management of biodiversity data in iMarine is to check the data that flow between primary data providers and a varying number of aggregators such as FishBase, OBIS (possibly FishNet) and GBIF, with the general rules:
- detailed specific data flows to the more general aggregator,
- data is best managed at the most detailed level,
- data must be shared using well-known schemas,
- data is never static; it can evolve over time and space, and should be easily updatable; version control is needed to be able to repeat an analysis later on exactly the same datasets,
- reference data, such as codes for species, locations, and countries, are often resource specific (=stored with the data),
- mapping between reference data is most important for biological taxonomies, and should be contextual,
- data is never public; it has to be published by an authorizer.
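The version-control rule above can be illustrated with a content fingerprint: if two copies of a dataset have the same fingerprint, an analysis can be repeated on exactly the same data. The following is only a minimal sketch, not a description of the OBIS or iMarine implementation. The function name `dataset_fingerprint` and the sample records are assumptions made for illustration; the column names are standard Darwin Core terms.

```python
import csv
import hashlib
import io


def dataset_fingerprint(csv_text: str) -> str:
    """Return a SHA-256 digest identifying the exact content of a CSV dataset.

    Assumption of this sketch: record order carries no meaning, so the rows
    are sorted before hashing and a re-ordered export gets the same digest.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = sorted(reader)
    digest = hashlib.sha256()
    for row in [header] + rows:
        # The unit separator avoids ambiguity between adjacent fields.
        digest.update("\x1f".join(row).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


# Illustrative sample data (invented records).
v1 = (
    "scientificName,decimalLatitude,decimalLongitude\n"
    "Gadus morhua,60.1,5.2\n"
    "Solea solea,51.3,2.9\n"
)
v1_reordered = (
    "scientificName,decimalLatitude,decimalLongitude\n"
    "Solea solea,51.3,2.9\n"
    "Gadus morhua,60.1,5.2\n"
)
v2 = v1.replace("60.1", "60.2")  # a corrected coordinate yields a new version

print(dataset_fingerprint(v1) == dataset_fingerprint(v1_reordered))
print(dataset_fingerprint(v1) == dataset_fingerprint(v2))
```

Storing such a digest with every analysis result would let a later user verify that the dataset they retrieve is the one originally analysed.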
Most of the above rules are already implemented in the existing OBIS system, and are expected to be made available also through the iMarine eInfrastructure.
The data exchange with GBIF is still not an easy process, and needs considerable iMarine effort to achieve a level of acceptable stability. Initial activities should discuss, define and develop these data-flows.
The OBIS and AquaMaps systems are growing closer, which opens interesting perspectives on mutual data services such as import, validation checks, mapping and sharing of data and reference data. Specifically, OBIS can offer species names services and distribution validation to AquaMaps, while the GIS and R integration in AquaMaps/D4Science is of interest to OBIS. A first objective in iMarine is an analysis of the mutual services that can be provided.
The validation of data may require advanced statistical, geospatial and environmental data processing. These cannot be defined yet, but in the second semester of the project effort will be needed for their description.
Participation and integration are essential for successful software development and for coherent usage and sharing policies. Every component should be expected to function in a software ecosystem and to be maintained by a community. This implies a careful approach to software architecture and development, aiming at sustainability after the project.
The Biodiversity community proposals for achieving a stronger, safer and more prosperous EA CoP relate to the following iMarine components:
- Data import and selection related to Darwin Core;
- Management of taxonomic names;
- Freshwater VRE;
- Species Distribution related components.
Depending on any relevant developments that take place, other software components could also play a role in the future. These include components for statistical analysis, GIS tools, geospatial tools, and semantic technology support.
Each section looks at the following:
- Background to the proposal;
- What is proposed;
- Anticipated impact.
As described above, the production of Darwin Core data is the main activity expected to be supported by the D4Science infrastructure. The CoP already provides software that can be evaluated (list ...), although these have been used with varying results in the past. The Community seeks in iMarine the development and release of an environment for the management of marine species observational data that offers:
- Selection of data from a source (browse repository, filter).
- More flexibility in the occurrence datasets that can be used than is available today (OBIS, ...),
- Interactive occurrence selection in a GIS-viewer (e.g. sliders to-from dates),
- Occurrence data cleaning: Darwin Core is already the standard for exchange, but there is no standard for exchanging corrections (e.g., if we find an error, how do we send feedback to the provider in a form that they can easily use? The data themselves are structured by DwC, but the indications of what was changed, and why, are not).
- This also includes:
- private data: they will be integrated through DwC, but there is no way yet to convey assessments of quality, possible traps and known issues;
- data editing, including through interactive maps.
- Environmental data associated with biological observations, and to be used in performing the extrapolations between observed (point) data to the distribution range (polygons)
- Statistics on probabilities, e.g., a summary on a given area, bootstrap or any other robustness measures.
- Change of scale (e.g. size of grid) up to changing the grid system (for an equal area grid system: the current cells have 4 times more surface near equator than near the poles) and provide transformation standards.
- Managing shape files and related statistics, graphs, etc: can be a transect, along a coast line, a geographical area. This includes sharing maps, and analyses.
- Dynamic links with WoRMS, CoL, ITIS, NCBI and other code lists will facilitate integration and quality control of taxonomic names; see also names-based activities below
- Tools to check GBIF holdings for marine data that are not available in OBIS will make it possible to detect these datasets and ingest them in OBIS, and make OBIS a one-stop shop for marine biogeography
- Tool for visualising the data - most important through a map interface
- Tools for quality control, such as detection of duplicates; detection of outliers in environmental space...
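Two of the requirements above, selection by date range and detection of duplicates, can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not an existing iMarine component: the function names and sample records are hypothetical, the field names are Darwin Core terms, and two records are treated as duplicates when name, date and rounded coordinates coincide.

```python
from datetime import date


def select_by_date(records, start, end):
    """Keep records whose eventDate (ISO format, YYYY-MM-DD) lies in [start, end]."""
    return [
        r for r in records
        if start <= date.fromisoformat(r["eventDate"]) <= end
    ]


def find_duplicates(records, precision=3):
    """Return (first_id, duplicate_id) pairs for records sharing the same key.

    Assumption: a duplicate has the same name (case-insensitive), the same
    date, and coordinates equal after rounding to `precision` decimals.
    """
    seen = {}
    duplicates = []
    for r in records:
        key = (
            r["scientificName"].strip().lower(),
            r["eventDate"],
            round(float(r["decimalLatitude"]), precision),
            round(float(r["decimalLongitude"]), precision),
        )
        if key in seen:
            duplicates.append((seen[key], r["occurrenceID"]))
        else:
            seen[key] = r["occurrenceID"]
    return duplicates


# Invented sample records for illustration only.
records = [
    {"occurrenceID": "a1", "scientificName": "Gadus morhua",
     "eventDate": "2009-05-01", "decimalLatitude": "60.1000", "decimalLongitude": "5.2000"},
    {"occurrenceID": "a2", "scientificName": "gadus morhua ",
     "eventDate": "2009-05-01", "decimalLatitude": "60.10004", "decimalLongitude": "5.2"},
    {"occurrenceID": "a3", "scientificName": "Solea solea",
     "eventDate": "2011-08-12", "decimalLatitude": "51.3", "decimalLongitude": "2.9"},
]

recent = select_by_date(records, date(2010, 1, 1), date(2011, 12, 31))
dupes = find_duplicates(records)
```

A slider in a GIS viewer would simply call the date selection with new bounds; the duplicate key would in practice need to be agreed with data providers.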
If the project manages to provide a solid environment for the management of biodiversity data, the potential interest in an even wider community will certainly be raised. However, to make an impact, several conditions must be met: Open source, using well known technologies, etc.
Most species information, including occurrence data, is hooked to a scientific species name that is coined to designate a taxon at the species level. Data from various sources can be integrated for a species only because they are attached to the same name. Alternatives would be to attach information either directly to specimens, or to gene sequences (often done for microbes); these alternatives will be explored later. However, both taxonomy (the way to split the diversity of living organisms into well-defined and identifiable species in a hierarchical classification) and nomenclature (the proper way to assign unique names to the different taxa) are constantly work in progress. This leads to the difficult situation where a species may be designated by several names (synonymy), or a name may designate several species (homonymy), either over time or simultaneously according to different authors.
Several major efforts are on-going to collect and collate all taxonomic names:
- The Catalogue of Life (CoL) aims at gathering all species names and their synonymy and homonymy relationships so that interoperability between various information datasets can be automated. As of November 2011, CoL gathers the names for almost 1.4 million species of the 1.9 million estimated to be known to science, from more than 110 separate Global Species Databases (GSDs). Easy access to CoL from within the iMarine ecosystem is essential to facilitate the compilation of very large datasets from various sources (which, for example, may not use the same name for the same species) and to link them to other sources.
- The World Register of Marine Species (WoRMS) is engaged in a similar venture, but specifically for the marine domain. WoRMS and CoL are collaborating, and WoRMS is the main provider of GSDs to CoL; it is WoRMS that is used as the primary (but not exclusive) name reference by OBIS. It is important to realise that CoL and WoRMS are collaborating, but are separate activities; the collections of names they hold obviously overlap, but a significant number of names are held by only one of the two systems.
- Apart from WoRMS and CoL, which will be the main sources of taxonomic names, there are several other systems that should be taken into account. The Integrated Taxonomic Information System (ITIS) is an initiative of the federal government of the USA, and is the oldest player in this game. It is one of the two partners in the Catalogue of Life, the other being Species 2000. As between WoRMS and CoL, there is a significant number of names that are only available in ITIS.
- The National Center for Biotechnology Information (NCBI) is one of the organisations hosting GenBank. In order to manage the genetic sequence information, they also keep a register of taxonomic names, which is compiled independently of CoL, WoRMS or ITIS. As such it is an important 'second opinion' on biological names. Especially for microbes, the NCBI is often the best source.
- Last but not least, the Interim Register of Marine and Nonmarine Genera (IRMNG) was created within the OBIS community, with the objective of making it possible to discriminate between marine and non-marine taxa, and between fossil (extinct) and extant taxa.
This is also related to the topic 'Merging Lists of Species Names coming from different Datasources', as listed below.
- As a separate VRE or App together with WoRMS; this can be decided once it is clear whether freshwater data will be integrated or not.
- Tools to check names from OBIS and GBIF;
- Potentially could be extended to a check name VRE for primary data providers to check their names themselves (since the primary providers are the ones who know their own data best), before they provide to aggregators.
- Significantly decrease the human resources spent to cross-check names over multiple datasets;
- Increase the quality of the primary data sources by sending them feedback on issues (up to providing them with a service to check their data themselves);
- Integrate the developed tools in the efforts developed by the Biodiversity Informatics community to establish a Global Name Architecture in order to implement in the VRE the technological solutions proposed by this project.
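The synonymy and homonymy problem described in this section suggests that a name-checking tool needs to return more than a yes/no answer. The sketch below is a hypothetical data structure, not the API of WoRMS, CoL or any other register: the class `NameRegister` is an assumption, and the names used are deliberately invented placeholders ("Aus bus" and so on), not real taxa.

```python
from collections import defaultdict


class NameRegister:
    """Maps any recorded name to the accepted name(s) it may designate."""

    def __init__(self):
        # lower-cased name -> set of accepted names
        self._accepted = defaultdict(set)

    def add(self, name, accepted_name):
        self._accepted[name.lower()].add(accepted_name)

    def resolve(self, name):
        """Return (status, candidates).

        'resolved'  : exactly one accepted name (valid name or synonym)
        'ambiguous' : several accepted names (homonym; context is needed)
        'unknown'   : the name is not in the register
        """
        candidates = sorted(self._accepted.get(name.lower(), set()))
        if not candidates:
            return ("unknown", [])
        if len(candidates) == 1:
            return ("resolved", candidates)
        return ("ambiguous", candidates)


# Placeholder names, invented for illustration only.
register = NameRegister()
register.add("Aus bus", "Aus bus")   # accepted name
register.add("Aus cus", "Aus bus")   # synonym of Aus bus
register.add("Xus dus", "Aus bus")   # homonym: used for two taxa
register.add("Xus dus", "Yus eus")
register.add("Yus eus", "Yus eus")

print(register.resolve("Aus cus"))
print(register.resolve("Xus dus"))
```

Returning the list of candidates for an ambiguous name leaves the contextual decision (for instance based on author, date or region) to the data provider, who knows the data best.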
Freshwater VRE Data Management Tools
The freshwater VRE would require co-funding which has not been sourced yet. It is included here to provide scope for future extensions, and as a reference for developers where requirements are likely to emerge.
The existing infrastructure manages data at a rather low spatial and temporal resolution. However, there are no technical limits to increasing the resolution. In the past, data and processing power limited the use of predictive modeling to larger habitats, and few algorithms were developed to describe species distributions over space and time. With the rapid advance in technology, both data and processing can now cope much better with high resolutions, and this has opened the possibility of including freshwater models in the infrastructure. The measures proposed here serve not only the freshwater community, but aim to bring much higher resolutions and integrated management of metadata on reliability, completeness and ownership to the data products.
The extension into freshwater is a very welcome addition, although the project is about marine data. The main requirements for freshwater models are also very useful to boost marine niche modeling. In addition, the contribution of inland fisheries and estuaries to capture fisheries and aquaculture is large, and cannot be neglected. A partnership with the FP7-funded project BioFresh will permit iMarine to focus on the technical integration in a VRE while the content could be managed by BioFresh. The proposed measures aim to boost the precision over temporal and spatial scales, yet build on the existing infrastructure for taxonomic, spatial and environmental data acquisition and management. The need to use higher resolution data calls for access to data providers in the following domains:
- Species occurrence and distribution; very precise locations (meters!)
- Environmental data; with temporal resolutions from days to years, averages over time and space, and seasonal products. This requires that the calculations are brought to the data, as it is unlikely that D4science can contain all relevant data.
The range of algorithms has to be extended, and the proposed contributions of the openModeller community and software will be analyzed for suitability and correctness before they are linked to the infrastructure. Not only the algorithms themselves, but also the management of reliability and [...]. Finally, the generated products require a stable publication environment. Advanced calculations are of little use if they lose their context or can only be repeated with difficulty. This implies that ALL steps are brought under the control of a workflow, and that ALL products contain provenance and quality metadata. Also, in order to be able to analyze the results, the infrastructure must be able to generate statistics with each product, ranging from the number of data points used to correlations between spatial products. Only then will there be a truly scientific contribution from the project, and a sustainable interest from a scientific community.
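The requirement that every product carries provenance metadata and statistics can be made concrete with a small record type. The following is a sketch only: the `Product` class, its field names and the sample values are assumptions made for illustration, not an existing D4Science data model. The correlation is a plain Pearson coefficient computed over the cells that two products have in common.

```python
from dataclasses import dataclass, field
from math import sqrt


@dataclass
class Product:
    """A gridded product together with the metadata needed to repeat it."""
    name: str
    cell_values: dict                            # cell id -> value (e.g. probability)
    sources: list                                # identifiers of the input datasets
    steps: list = field(default_factory=list)    # workflow steps applied, in order
    n_points: int = 0                            # number of data points used


def pearson(a: Product, b: Product) -> float:
    """Pearson correlation between two products over their shared cells."""
    cells = sorted(set(a.cell_values) & set(b.cell_values))
    if len(cells) < 2:
        raise ValueError("need at least two shared cells")
    xs = [a.cell_values[c] for c in cells]
    ys = [b.cell_values[c] for c in cells]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant product")
    return cov / (sx * sy)


# Invented example values.
map_a = Product("model A", {1: 0.1, 2: 0.5, 3: 0.9, 4: 0.3},
                sources=["dataset-1"], steps=["clean", "model"], n_points=120)
map_b = Product("model B", {1: 0.2, 2: 1.0, 3: 1.8},
                sources=["dataset-2"], steps=["clean", "model"], n_points=85)

r = pearson(map_a, map_b)
```

Because the sources and steps travel with the values, a later user can see how a map was produced before comparing it with another one.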
- Many diadromous species are important economically (sturgeons, eels, salmons, some gobies, etc.) and it is important to model and manage their life cycle in the two environments, marine and freshwater, if one wants to achieve the Ecosystem Approach to Fisheries;
- Many primary datasets contain both marine and freshwater information. It will help to clean marine datasets of freshwater data and vice versa.
Species Distribution Data Management Tools
The existing data infrastructure already contains and maintains a species distribution modeling environment: AquaMaps. This meets many of the WFC requirements, but may need extensions to better serve the existing and new CoP members. The VRE and gCubeApp are based on a PostgreSQL database with geospatial extensions, and GeoServer. This architecture will be the reference model for other geospatial data management facilities. In addition, access to environmental datasets is provided by GENESI-DEC, and this can be extended to providers disseminating datasets that are more geographically restricted. The AquaMaps VRE can also be used to compare map products. In D4Science-II, a new model was introduced for the spatial re-allocation of captures. One of the information sources used was species distribution maps, as species captures are evidently constrained by species distributions. Bringing these resources together is no easy feat, yet D4Science-II has proven the feasibility of defining VREs that contain data from different domains.
The component defined here would offer a range of analytical tools to scientists to generate maps using
- different information sources,
- different formats (maps, KML),
- different distribution analysis methods (visual, statistical),
- predefined re-allocation mechanisms to change the spatial resolution of a reported occurrence in an area.
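One possible re-allocation mechanism is to split a quantity reported for a whole area over the grid cells of that area in proportion to cell surface. The sketch below assumes a spherical Earth and a purely area-based rule; it is not the re-allocation model introduced in D4Science-II, which also used species distribution maps. All function names are hypothetical. On a sphere of radius R, the area of a cell between latitudes φ1 and φ2 spanning Δλ of longitude is R² · Δλ · (sin φ2 − sin φ1).

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; a spherical Earth is assumed


def cell_area_km2(lat_south, size_deg):
    """Area of a size_deg x size_deg cell whose southern edge is at lat_south."""
    phi1 = math.radians(lat_south)
    phi2 = math.radians(lat_south + size_deg)
    dlon = math.radians(size_deg)
    return EARTH_RADIUS_KM ** 2 * dlon * (math.sin(phi2) - math.sin(phi1))


def reallocate(total, cells, size_deg=0.5):
    """Split `total`, reported for a whole area, over its cells by surface.

    `cells` is a list of (lat_south, lon_west) tuples identifying the cells
    that make up the reporting area.
    """
    areas = {cell: cell_area_km2(cell[0], size_deg) for cell in cells}
    area_sum = sum(areas.values())
    return {cell: total * area / area_sum for cell, area in areas.items()}


# Invented example: 100 units reported for an area made of three cells.
area_cells = [(0.0, 10.0), (30.0, 10.0), (60.0, 10.0)]
shares = reallocate(100.0, area_cells)

for cell, share in shares.items():
    print(cell, round(share, 2))
```

The same area function shows why cells of a regular latitude/longitude grid are not comparable across latitudes: a cell at the equator is several times larger than a cell of the same angular size at high latitude.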
The many different reference data used, and their varying quality, require a harmonization module that converts data from one definition scheme to another. Here, the use of a semantic KB may be required for translations, mapping across coding schemas, disambiguation etc.
The definition of codelists and their use may require the services of a CodelistManager VRE.
The maintenance of mappings between elements in codelists across spatio-temporal dimensions may require the services of a CodelistMapper VRE.
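A mapping between codelists that changes over time can be represented as a list of mappings, each with a period of validity. The sketch below only illustrates the idea; the class names, the codes and the years are all invented, and it does not describe an existing CodelistMapper service.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Mapping:
    source_code: str
    target_code: str
    valid_from: int                  # first year in which the mapping applies
    valid_to: Optional[int] = None   # last year; None means still valid


class CodelistMapper:
    """Translates codes of one scheme into another for a given year."""

    def __init__(self, mappings):
        self._mappings = list(mappings)

    def translate(self, code, year):
        for m in self._mappings:
            still_valid = m.valid_to is None or year <= m.valid_to
            if m.source_code == code and m.valid_from <= year and still_valid:
                return m.target_code
        return None  # no mapping known for this code and year


# Hypothetical codes: area "A1" was mapped to "X-OLD" until 1999 and to
# "X-NEW" from 2000 onwards.
mapper = CodelistMapper([
    Mapping("A1", "X-OLD", 1950, 1999),
    Mapping("A1", "X-NEW", 2000),
])

print(mapper.translate("A1", 1985))
print(mapper.translate("A1", 2012))
```

Returning `None` rather than guessing makes unmapped codes visible, so that they can be passed to a curator or to the semantic KB mentioned above.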
- Management of “shape files” of the distribution with various probabilities from number to ordinal qualification (e.g., present, doubtful; similar to the recent FAO maps);
- Standards to create and exchange distribution statements: e.g., Western central Pacific from the Philippines to Marquesas Islands, from Ryukyu Is. to New Caledonia. In other words, textual descriptions of the shape files. This may require support from a semantic KB.
- A system to qualify biological/biogeographic point data with the oceanographic conditions that prevailed at the time of the observation. Most important oceanographic parameters are temperature, salinity, pH, nutrients, bottom composition and depth (for benthic species), turbidity.
- Maps of these oceanographic parameters at various resolutions in time and space, to allow extrapolation of the biological point observations to the range of the species.
- Amplify the capacity of analyses of data;
- Moving from global perspectives and trends to more restricted areas, e.g., at regional and possibly country level;
- Propose better prediction tools with respect to the climate change.
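Qualifying point observations with the oceanographic conditions that prevailed at the time of observation amounts to a lookup in gridded environmental data by cell and period. The sketch below assumes a regular half-degree grid, monthly values and temperature as the only parameter; the function names, the grid layout and the values are assumptions made for illustration.

```python
import math


def cell_index(lat, lon, size=0.5):
    """Row and column of the grid cell containing a point.

    Rows count from the South Pole, columns from longitude -180. Points
    exactly at latitude 90 or longitude 180 would need special handling,
    which this sketch omits.
    """
    return (math.floor((lat + 90) / size), math.floor((lon + 180) / size))


def annotate(observations, grid, size=0.5):
    """Add the gridded temperature for the month of each observation.

    `grid` maps (year-month, row, column) to a temperature; observations
    without a matching cell get None.
    """
    annotated = []
    for obs in observations:
        row, col = cell_index(obs["decimalLatitude"], obs["decimalLongitude"], size)
        month = obs["eventDate"][:7]  # "YYYY-MM" part of an ISO date
        annotated.append({**obs, "temperature": grid.get((month, row, col))})
    return annotated


# Invented values for illustration.
grid = {("2010-06",) + cell_index(10.2, 20.3): 27.5}
observations = [
    {"occurrenceID": "o1", "eventDate": "2010-06-15",
     "decimalLatitude": 10.4, "decimalLongitude": 20.1},
    {"occurrenceID": "o2", "eventDate": "2010-07-02",
     "decimalLatitude": 10.4, "decimalLongitude": 20.1},
]

result = annotate(observations, grid)
```

The same lookup, run over every cell instead of every observation, is the starting point for extrapolating point data to a species range.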
Merging Lists of Species Names coming from different Datasources
The main goal of this topic is to identify, in a formal way, the critical characteristics of the problems that can arise when harmonising lists of species names coming from different data sources, in order to determine when two entries refer to the same taxon. This identification task is not trivial because of the deep differences between the nomenclature protocols followed in different areas of biology: nomenclature varies between Zoology, Botany and Bacteriology. This section has the ambitious aim of investigating the scope for building a merging algorithm that solves this problem, as no automatic solution has been found so far.
A discussion page named Taxa Merging Discussion has been set up, which tries to bring together the points of view of biologists and computer scientists. Based on the considerations discussed in this document, and long experience in cleaning taxonomic names, a set of rules was drafted that will help spell out the requirements of the iMarine taxon name matching tool.
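To give an idea of what a name matching step can look like, the sketch below normalises a name and then tries an exact match followed by an approximate one. It does not implement the drafted rules and is not the matching tool itself: the crude normalisation (keeping the first two words as genus and species), the similarity cutoff and the function names are assumptions, and the approximate match relies on the general-purpose `difflib` module of the Python standard library.

```python
import difflib
import re


def normalise(name):
    """Reduce a name string to a lower-case 'genus species' form.

    Crude assumption: parenthesised parts are dropped and the first two
    remaining words are the genus and the specific epithet; authorship and
    year that follow are ignored.
    """
    name = re.sub(r"\(.*?\)", " ", name)
    words = re.findall(r"[A-Za-z-]+", name)
    return " ".join(words[:2]).lower()


def match(name, reference_names, cutoff=0.85):
    """Return (reference name, kind), with kind 'exact', 'fuzzy' or 'none'."""
    index = {normalise(ref): ref for ref in reference_names}
    key = normalise(name)
    if key in index:
        return (index[key], "exact")
    close = difflib.get_close_matches(key, list(index), n=1, cutoff=cutoff)
    if close:
        return (index[close[0]], "fuzzy")
    return (None, "none")


reference = ["Gadus morhua", "Solea solea"]

print(match("Gadus morhua Linnaeus, 1758", reference))
print(match("Gadus morrhua", reference))   # misspelt epithet
print(match("Thunnus albacares", reference))
```

A fuzzy match should be treated as a suggestion to be confirmed rather than as an identification, since closely spelled names can designate different taxa.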
The ultimate goal of this set of activities is to develop 'BiOnym', a flexible workflow approach to taxon name matching running on the D4Science/iMarine infrastructure. BiOnym was presented at the conference of the 'Taxonomic Data Working Group' held in Firenze, from 28 October to 1 November 2013; more on the conference at http://www.tdwg.org/conference-2013/. The abstract of the BiOnym presentation is available from here.