From IMarine Wiki

iMarine inherited its software stack (gCube) from two previous EU projects, D4Science and D4Science II. The software was further extended during the iMarine project to strengthen this foundation and to build marine applications on top of it, concentrating the effort on three main functional areas:

  • Core Facilities: dedicated to providing users with a range of services for the operation and management of the whole infrastructure. They are detailed in this section
  • Data Management Facilities: dedicated to providing users with a rich array of services for the management of data
  • Data Consumption Facilities: dedicated to providing users with a rich array of services for the exploitation of data


Data Management Facilities

The Data Management facilities can be further categorized into three areas, each grouping a series of components.

Data Access and Storage

The data access and storage components provide secure, scalable, efficient, standards-based storage and retrieval of data, where the data may be maintained by the system or outside the system and may vary in structure, size, and semantics. The key features of this family of components are:

  • uniform model and access API over structured data:
    • dynamically pluggable architecture of model and API transformations to and from internal and external data sources;
    • plugins for document, biodiversity, statistical, and semantic data sources, including sources with custom APIs and standards-based APIs;
  • fine-grained access to structured data:
    • horizontal and vertical filtering based on pattern matching; URI-based resolution; in-place remote updates;
  • scalable access to structured data:
    • autonomic service replication with infrastructure-wide load balancing;
    • efficient and scalable storage of structured data based on graph database technology;
  • uniform modelling and access API over document data:
    • rich descriptions of document content, metadata, annotations, parts, and alternatives;
    • transformations from model and API of key document sources, including OAI providers;
  • uniform modelling and access API over semantic data: tree-views over RDF graph data
    • transformations from model and API of key document sources, including SPARQL endpoints;
  • uniform modelling and access over biodiversity data:
    • access API tailored to biodiversity data sources; dynamically pluggable architecture of transformations from external sources of biodiversity data;
    • plugins for key biodiversity data sources, including OBIS, GBIF and Catalogue of Life;
  • efficient and scalable storage of files:
    • unstructured storage back-end based on MongoDB for replication and high availability, automatic balancing for changes in load and data distribution, no single points of failure, data sharding;
  • standards-based and structured storage of files:
    • POSIX-like client API;
    • support for hierarchical folder structures;
  • uniform model and access API over environmental data:
    • investigation of heterogeneous external data sources;
    • uniform access to OGC compliant services.
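
The horizontal and vertical filtering mentioned in the list above can be illustrated with a small sketch. It is not the gCube API: the nested-dictionary tree representation, the `ANY` wildcard, and the functions `prune` and `filter_trees` are all hypothetical names chosen for illustration. The sketch assumes that "horizontal" filtering means dropping whole trees that do not match a pattern, and "vertical" filtering means keeping only the matched branches of each remaining tree.

```python
from typing import Any, Optional

ANY = object()  # hypothetical wildcard: a pattern leaf that accepts any value


def prune(tree: Any, pattern: Any) -> Optional[Any]:
    """Return the part of `tree` selected by `pattern`, or None on mismatch.

    Assumption of this sketch: None is never used as a real value in a tree.
    """
    if pattern is ANY:
        return tree
    if isinstance(pattern, dict):
        if not isinstance(tree, dict):
            return None
        kept = {}
        for key, sub_pattern in pattern.items():
            if key not in tree:
                return None
            matched = prune(tree[key], sub_pattern)
            if matched is None:
                return None
            kept[key] = matched  # vertical filtering: keep matched branches only
        return kept
    # Literal pattern: the value must be equal.
    return tree if tree == pattern else None


def filter_trees(trees, pattern):
    """Horizontal filtering: drop the trees that do not match at all."""
    pruned = (prune(tree, pattern) for tree in trees)
    return [tree for tree in pruned if tree is not None]


records = [
    {"id": "1", "name": "Thunnus albacares", "rank": "species",
     "meta": {"source": "OBIS"}},
    {"id": "2", "name": "Thunnus", "rank": "genus",
     "meta": {"source": "GBIF"}},
]
species_names = filter_trees(records, {"name": ANY, "rank": "species"})
```

Here `species_names` contains a single pruned tree holding only the `name` and `rank` branches of the first record.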

The services belonging to this area are:

Data Transfer

The Data Transfer components implement reliable data transfer mechanisms between the nodes of a gCube-based infrastructure. The key features are:

  • Point-to-point transfer: one writer, one reader as core functionality
  • Intuitive stream- and iterator-based interface: simplified usage with reasonable default behavior for common use cases, plus a variety of features for increased usability and flexibility
  • Multiple protocol support: data transfer currently supports TCP and HTTP
  • Reliable data transfer between Infrastructure Data Sources and Data Storages: by exploiting the uniform access interfaces provided by gCube
  • Structured and unstructured Data Transfer: both tree-based and file-based transfer to cover all possible use cases
  • Transfers to local nodes for data staging: data staging for particular use cases can be enabled on each node of the infrastructure
  • Advanced transfer scheduling and transfer optimization: a dedicated gCube service responsible for data transfer scheduling and transfer optimization
  • Transfer statistics availability: transfers are logged by the system and made available to interested consumers.
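
The one-writer, one-reader model with an iterator-style interface can be sketched in a few lines. This is only an in-process analogy built on the Python standard library, not the gCube Data Transfer API: the `Channel` class, its `capacity` parameter, and the `transfer` helper are hypothetical, and a real transfer would move the chunks over TCP or HTTP rather than through a local queue.

```python
import queue
import threading
from typing import Iterable, Iterator

_END = object()  # sentinel marking the end of the stream


class Channel:
    """Hypothetical point-to-point channel: one writer, one reader."""

    def __init__(self, capacity: int = 8) -> None:
        # A bounded queue makes the writer wait when the reader falls behind.
        self._queue = queue.Queue(maxsize=capacity)

    def write_all(self, chunks: Iterable[bytes]) -> None:
        for chunk in chunks:
            self._queue.put(chunk)
        self._queue.put(_END)

    def __iter__(self) -> Iterator[bytes]:
        # The reader consumes the stream with a plain for-loop.
        while True:
            item = self._queue.get()
            if item is _END:
                return
            yield item


def transfer(chunks: Iterable[bytes]) -> bytes:
    """Run a writer thread and read everything it sends."""
    channel = Channel()
    writer = threading.Thread(target=channel.write_all, args=(chunks,))
    writer.start()
    received = b"".join(channel)
    writer.join()
    return received


payload = transfer([b"abc", b"def"])
```

The bounded queue is a design choice of the sketch: it gives simple back-pressure so that a fast writer cannot exhaust memory while a slow reader catches up.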

The services belonging to the area are:

Data Harmonization

The components of the Data Harmonization area give unified views over diverse data items. In particular, they offer:

  • workflow-oriented tabular data manipulation
    • user definition and execution of workflows of data manipulation steps
    • rich array of data manipulation facilities offered 'as-a-Service'
    • rich array of data mining facilities offered 'as-a-Service'
    • rich array of data visualisation facilities offered 'as-a-Service'
  • reference-data management support
    • uniform model for reference-data representation including versioning and provenance
  • data curation and enrichment support
    • species occurrence data enrichment with environmental data dynamically acquired from data providers
    • data provenance recording
  • standards-based data presentation
    • OGC standards-based geospatial data presentation
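
A workflow of tabular data manipulation steps, as listed above, can be pictured as an ordered list of functions applied to a table one after another. The sketch below is an assumption-laden illustration rather than the actual service interface: tables are modelled as lists of dictionaries, and `run_workflow`, `drop_missing`, and `rename` are hypothetical names.

```python
from functools import reduce


def run_workflow(rows, steps):
    """Apply each step to the output of the previous one, in order."""
    return reduce(lambda table, step: step(table), steps, rows)


def drop_missing(column):
    """Step factory: remove rows where `column` is absent or None."""
    return lambda rows: [row for row in rows if row.get(column) is not None]


def rename(old, new):
    """Step factory: rename column `old` to `new` in every row."""
    return lambda rows: [
        {(new if key == old else key): value for key, value in row.items()}
        for row in rows
    ]


occurrences = [
    {"sp": "Thunnus albacares", "count": 3},
    {"sp": "Gadus morhua", "count": None},
]
cleaned = run_workflow(occurrences, [drop_missing("count"), rename("sp", "species")])
```

Because every step takes a table and returns a table, user-defined steps can be added to the list without changing `run_workflow`.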

The services belonging to this area can be further categorized according to the data type:

Data Consumption Facilities

The Data Consumption facilities can be further categorized into five areas, each grouping a series of components.

Data Retrieval

gCube provides Information Retrieval facilities over large heterogeneous environments. The architecture and mechanisms provided by the framework ensure flexibility, scalability, high performance and availability. In particular:

  • Declarative Query Language over a heterogeneous environment. The gCube Data Retrieval framework unifies Data Sources that use different data representations and semantics through the CQL standard.
  • On-the-fly Integration of Data Sources. A Data Source that publishes its Information Retrieval capabilities can be involved in the IR process on the fly.
  • Scalability in the number of Data Sources. Planning and Optimization mechanisms detect the minimum number of Sources that need to be involved in answering a query, along with an optimal plan for execution.
  • Direct Integration of External Information Providers. Through the OpenSearch standard, external Information Providers can be queried dynamically. The results they provide can be aggregated with the other results during query answering.
  • Indexing Capabilities for Replication and High Availability. Multidimensional and full-text indexing capabilities using an architecture that efficiently supports replication and high availability.
  • Distributed Execution Environment offering High Performance and Flexibility. Efficient execution of search plans over a large heterogeneous environment.
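
The idea of involving only the Data Sources that are needed for a query can be sketched as a simple capability check. This is a deliberate simplification and not the gCube planner: it assumes that each source publishes the set of fields it can search and that a source is useful only if it covers every field referenced by the query. The function `select_sources` and the source names are hypothetical.

```python
def select_sources(query_fields, capabilities):
    """Return the names of the sources able to answer the whole query.

    `capabilities` maps a source name to the fields that source has published.
    """
    needed = set(query_fields)
    return sorted(
        name for name, fields in capabilities.items() if needed <= set(fields)
    )


published = {
    "fulltext-index": {"title", "abstract"},
    "geo-index": {"title", "bbox"},
    "opensearch-provider": {"title"},
}
selected = select_sources({"title", "bbox"}, published)
```

In this example only `geo-index` is selected, so the other two sources would not be contacted at all. A real planner would also weigh cost and produce an execution plan, which the sketch leaves out.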

Services in this area:

Data Manipulation

gCube provides Data Manipulation Facilities responsible for transforming content and metadata among different formats and specifications. The architecture and mechanisms provided by the framework satisfy the requirements for arbitrary transformation or homogenization of content and metadata. Its features are useful for:

  • information retrieval
  • information presentation
  • processing and exporting

In particular it offers:

  • Automatic transformation path identification. Given the content type of a source object and the target content type, the framework identifies the appropriate transformation to use. It can also dynamically form a path of several transformation steps to produce the final format. Shorter paths are preferred.
  • Pluggable algorithms for content transformation. A generic transformation framework based on pluggable components (transformation programs and algorithms). Transformation programs and algorithms expose the transformation capabilities of the framework. This approach makes it possible to provide domain- and application-specific data transformations.
  • Exploitation of Distributed Infrastructure. Integration with a Workflow Engine gives access to vast amounts of processing power and makes it possible to handle virtually any transformation task, making this the standard Data Manipulation facility for gCube applications.
  • Advanced geospatial analytical and modeling features (e.g. R geospatial, reallocation, aggregation). It is possible to define the advanced geospatial processes required for reallocation, aggregation and interpolation, and to design and plan their implementation. In particular, the WPS standard interface gives access to the Hadoop framework, which exploits the power of distributed computation.
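
Automatic transformation path identification can be viewed as a shortest-path search over a graph whose nodes are content types and whose edges are the available transformation programs. The sketch below uses a breadth-first search, which returns a path with the fewest steps. It is an illustration under stated assumptions, not the framework's actual algorithm: the function `find_path`, the tuple layout, and the program names are hypothetical.

```python
from collections import deque


def find_path(transformations, source, target):
    """Return the shortest list of program names turning `source` into `target`.

    `transformations` is an iterable of (from_type, to_type, program_name).
    Returns [] when source equals target, and None when no path exists.
    """
    graph = {}
    for from_type, to_type, name in transformations:
        graph.setdefault(from_type, []).append((to_type, name))

    pending = deque([(source, [])])
    visited = {source}
    while pending:
        current, path = pending.popleft()
        if current == target:
            return path
        for next_type, name in graph.get(current, []):
            if next_type not in visited:
                visited.add(next_type)
                pending.append((next_type, path + [name]))
    return None


programs = [
    ("text/xml", "text/html", "xml2html"),
    ("text/html", "application/pdf", "html2pdf"),
    ("text/xml", "text/plain", "xml2txt"),
    ("text/plain", "text/html", "txt2html"),
]
steps = find_path(programs, "text/xml", "application/pdf")
```

With these programs, the two-step route through `text/html` is chosen over the three-step route through `text/plain`, so `steps` is `["xml2html", "html2pdf"]`.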

Services in this area:

Data Mining

Data Mining facilities include a set of features, services and methods for performing data processing and mining on information sets. These features address several aspects of biological data processing, ranging from ecological modeling to niche modeling experiments. Algorithms are executed in a parallel and possibly distributed fashion. Furthermore, services performing Data Mining operations are deployed according to a distributed architecture, in order to balance the load of those procedures that require local resources.

By means of the above features, Data Mining in iMarine aims to address problems such as:

  • the prediction of the impact of climate changes on biodiversity,
  • the prevention of the spread of invasive species,
  • the identification of geographical and ecological aspects of disease transmission,
  • conservation planning,
  • the prediction of suitable habitats for marine species.

Services in the area:

Data Visualization

Data Visualisation facilities include a set of features, software and methods for performing visualisation of data. Data Visualisation is particularly meant for geo-spatial data, a kind of information that naturally lends itself to visualisation. Data are reproduced on interactive maps and can be explored by means of several inspection tools. In particular it offers:

  • uniform access over geospatial GIS layers
    • investigation over layers indexed by GeoNetwork;
    • visualization of distributed layers;
    • addition of remote layers published in standard OGC formats (WMS or WFS);
  • Filtering and analysis capabilities
    • ability to apply CQL filters to layers;
    • ability to trace transect charts;
    • ability to select areas in order to investigate environmental features;
  • Search and indexing capabilities
    • ability to sort a very large number of layers by title;
    • ability to search a very large number of layers by title and name;
    • ability to index layers by invoking GeoNetwork functionality;
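
As an illustration of applying a CQL filter to a layer, the sketch below builds a WMS 1.1.1 GetMap request URL. It assumes that the map server accepts a GeoServer-style `CQL_FILTER` vendor parameter, which is not part of the WMS standard and is not confirmed by this page. The endpoint, layer name, attribute name, and the function `getmap_url` are hypothetical.

```python
from urllib.parse import urlencode


def getmap_url(base_url, layer, bbox, cql_filter, width=512, height=256):
    """Build a WMS 1.1.1 GetMap URL that restricts `layer` with a CQL filter."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",
        "SRS": "EPSG:4326",
        "BBOX": ",".join(str(value) for value in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
        # Assumption: vendor parameter understood by the target server.
        "CQL_FILTER": cql_filter,
    }
    return base_url + "?" + urlencode(params)


url = getmap_url(
    "https://example.org/wms",       # hypothetical endpoint
    "demo:occurrences",              # hypothetical layer name
    (-180, -90, 180, 90),
    "depth > 200",                   # hypothetical attribute
)
```

`urlencode` takes care of escaping the filter expression, so spaces and comparison operators can be written naturally in the `cql_filter` argument.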

Services in the area:

Semantic Data Analysis

Semantic Data Analysis comprises a set of libraries and services to bridge the gap between communities and link distributed data across community boundaries. The introduction of the Semantic Web and the publication of expressive metadata in a shared knowledge framework enable the deployment of services that can intelligently use Web resources.

In particular it offers:

  • Provision of results clustering over any search system that returns textual snippets and for which there is an OpenSearch description
  • Provision of snippet- or contents-based entity recognition. Generic as well as vertical, based on predetermined entity categories and lists which can be obtained by querying SPARQL endpoints
  • Provision of gradual faceted (session-based) search. Allows the answer to be gradually restricted based on the selected entities and/or clusters
  • Ability to fetch and display semantic information of an identified entity. Achieved by querying appropriate SPARQL endpoints
  • Ability to apply these services on any web page through a web browser, using the functionality of bookmarklets
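
The gradual, session-based faceted search described above can be sketched as a session object that remembers the entities selected so far and narrows the result list with each new selection. This is a minimal illustration, not the actual service: it assumes that each search result already carries the set of entities recognised in it, and the class `FacetedSession` and its methods are hypothetical.

```python
class FacetedSession:
    """Hypothetical session that restricts results as entities are selected."""

    def __init__(self, results):
        self._results = list(results)
        self._selected = set()

    def current(self):
        """Results that mention every entity selected so far."""
        return [r for r in self._results if self._selected <= r["entities"]]

    def select(self, entity):
        """Add one entity to the selection and return the restricted answer."""
        self._selected.add(entity)
        return self.current()

    def facets(self):
        """Count the not-yet-selected entities in the current answer."""
        counts = {}
        for result in self.current():
            for entity in result["entities"] - self._selected:
                counts[entity] = counts.get(entity, 0) + 1
        return counts


results = [
    {"title": "Tuna catch statistics", "entities": {"Thunnus", "Atlantic"}},
    {"title": "Tuna tagging in the Pacific", "entities": {"Thunnus", "Pacific"}},
    {"title": "Cod stock assessment", "entities": {"Gadus", "Atlantic"}},
]
session = FacetedSession(results)
after_tuna = session.select("Thunnus")
remaining_facets = session.facets()
```

After selecting `Thunnus`, two results remain and the facets still on offer are `Atlantic` and `Pacific`, each with a count of one. A further selection would narrow the answer again.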

Services in the area:
