2012.01 WP10: Data Consumption Facilities Development Monthly Activity By Task and Beneficiary

From IMarine Wiki

This WP10 Activity Report describes the activities performed in January 2012 by Beneficiary and Task. It is part of the January 2012 Activity Report.

T10.1 Data Retrieval Facilities

NKUA Activities

The snippet mechanism was designed and implemented in the Search System and Index Layer. To this end, the activities performed were focused on the following directions:

  • Integration of the snippet mechanism (#7) in the search planner/optimizer and in the CQL language as a projected field.
  • Mechanism for identifying the fields and the corresponding terms involved in a query-tree, that drive the creation of a snippet for each result.
  • Creation of a snippet for each result, in the Index Layer.
  • Configuration for controlling the maximum size of a snippet for each query result.
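
The last three points can be illustrated with a minimal Python sketch. It is not the Search System or Index Layer code: the tuple-based query tree, the function names `collect_terms` and `make_snippet`, and the `max_size` parameter are all assumptions made for this example.

```python
def collect_terms(node, found=None):
    """Walk a query tree and group the query terms by field.

    Assumed tree shape (hypothetical): a leaf is ("term", field, term),
    an inner node is (operator, child, child, ...).
    """
    found = {} if found is None else found
    if node[0] == "term":
        _, field, term = node
        found.setdefault(field, []).append(term)
    else:
        for child in node[1:]:
            collect_terms(child, found)
    return found


def make_snippet(text, terms, max_size=80):
    """Return at most `max_size` characters of `text` around the first
    occurrence of any query term (case-insensitive match)."""
    lowered = text.lower()
    hits = [lowered.find(t.lower()) for t in terms]
    hits = [h for h in hits if h >= 0]
    if not hits:
        # No term matched this field: fall back to the start of the text.
        return text[:max_size]
    first = min(hits)
    # Start a little before the hit, but never outside the text.
    start = max(0, min(first - max_size // 4, len(text) - max_size))
    return text[start:start + max_size]
```

For a result, the terms collected for a field would be passed to `make_snippet` together with that field's content and the configured maximum size.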

In addition, the following tasks have been performed in Resource Registry:

  • Integration of hosting node selection by requirements. Up until now, to achieve the same operation clients would contact the Information System directly.
  • Fixing of defects which rendered information not accessible under certain circumstances, resulting in Search System not being able to answer some queries. (#56,#57)

NKUA is also working on improving the Resource Registry by:

  • Investigating whether changing the database back-end from Derby to another solution would increase robustness.
  • Investigating performance improvements by optimizing queries at the JDOQL level and adding database indexes.

More concrete results on the latter are expected in the upcoming periods.


  • Design of the snippet mechanism in the Search and Index Layer
  • Implementation of the snippet mechanism in the Search and Index Layer
  • Testing of the snippet mechanism in the development infrastructure
  • Release of a new version of Resource Registry. All three components comprising the Resource Registry (ResourceRegistry, RRModel, RRGCubeBridge) are now in version 1.2.1 and are included in gCube maintenance release 2.7.3.

FORTH Activities

  1. We have started setting up the gCube portal on a local machine, to better understand the infrastructure and to make future development and testing easier.
  2. We have started using the HTTP API of the gCube Application Support Layer to retrieve results from the Search Service.
  3. Discussions with NKUA about snippets and other requirements for the Search Service. For more, see also T10.4.


None


None

Terradue Activities

The beneficiary should report here a summary of the activities performed in the reporting period


The beneficiary should report here major issues faced in the reporting period and the identified corrective actions, if any.


The beneficiary should report here a bullet list highlighting the main achievements of the reporting period

T10.2 Data Manipulation Facilities

NKUA Activities

NKUA focused on finalizing the new design of the Data Transformation Service (DTS). To provide scalability, DTS has been redesigned to produce execution plans for the PE2ng engine instead of performing transformations locally. The primary concerns driving the new design were the minimization of DTS bottlenecks, the preservation of all present features (flow control, pipelining, etc.) and the easy incorporation of upcoming enhancements, such as execution model abstractions. gRS2 will be used for communication between transformation units and for the merging of results.

The primary endpoint of DTS now consists of a planner, which is used only to create execution plans by consulting the transformation graph and to forward the outcome of the transformation to the caller. In this way, network traffic is minimized, as all interfacing with data sources is delegated to the worker nodes that handle the actual transformation. Whenever the planner encounters a new data type, it constructs a new execution plan corresponding to a transformation sequence for this type.
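
The planner's role can be sketched as follows. This is a simplified illustration, not the DTS implementation: the `Planner` class, the dictionary-based transformation graph, the transformation names and the use of breadth-first search are assumptions made for the example.

```python
from collections import deque


class Planner:
    """Builds (and caches) one transformation sequence per data type."""

    def __init__(self, graph):
        # Hypothetical graph layout: source type -> {target type: transformation name}
        self.graph = graph
        self.plans = {}  # (source type, target type) -> list of transformation names

    def plan_for(self, source_type, target_type):
        key = (source_type, target_type)
        if key not in self.plans:
            # A data type seen for the first time triggers the creation of a plan.
            self.plans[key] = self._search(source_type, target_type)
        return self.plans[key]

    def _search(self, source, target):
        # Breadth-first search: returns the shortest chain of transformations,
        # or None when the target type cannot be reached.
        queue = deque([(source, [])])
        seen = {source}
        while queue:
            current, steps = queue.popleft()
            if current == target:
                return steps
            for nxt, name in self.graph.get(current, {}).items():
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + [name]))
        return None
```

Note that the planner only manipulates type names and transformation names; no data passes through it, which mirrors the intent of keeping data transfer away from the central endpoint.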

In the new design, data input is decoupled from transformation planning: all that is forwarded to the transformation units is the source's endpoint and the data type to be processed. In this way, transformation planning is kept as lightweight as possible and the central DTS planner is freed from managing data transfer. The first transformation unit in the sequence is now responsible for reading elements of all data types from the source input and keeping only the elements suitable for the transformation it has been assigned. In addition to minimizing bottlenecks, this mode of operation prevents a computationally intensive transformation from blocking other transformations in progress, thereby solving a problem present in the previous design.

The central idea of the new design is that each transformation unit corresponds to a different execution plan element. A transformation plan is therefore composed of many transformation units, which may be carried out by different execution nodes. Each execution plan element instructs that a transformation be performed on a specific computational node that has the required software. This software is independent of DTS, so the full set of computational resources present in the infrastructure can be used, as long as they support the required transformations. In addition, the functionality for merging the outcomes of the different transformation chains will be incorporated into a separate execution plan, which will be executed on a distinct execution node. As the number of transformation sequences varies dynamically at runtime, gRS2 will be used to support dynamic merging. The merger is notified about the gRS2 locators that correspond to the output of each sequence, and it can include results from additional sources while its operation is ongoing. The outcome is then transferred back to the Data Transformation Service, so the execution engine is kept transparent to the caller.
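
The dynamic merging described above can be sketched in a few lines of Python. The sketch does not use gRS2: a standard in-process queue stands in for the result sets, and the `Merger` class with its `add_source` and `merged` methods is hypothetical. It only shows how a merger can accept additional sources while it is already delivering results.

```python
import queue
import threading


class Merger:
    """Merges records from a number of sources that is not known in advance."""

    def __init__(self):
        self._records = queue.Queue()
        self._open = 0  # number of registered sources that have not finished
        self._lock = threading.Lock()

    def add_source(self, records):
        """Register a new source; may be called while merging is ongoing."""
        with self._lock:
            self._open += 1
        threading.Thread(target=self._drain, args=(records,), daemon=True).start()

    def _drain(self, records):
        # Each source pushes its records, then an end-of-source marker.
        for record in records:
            self._records.put(("record", record))
        self._records.put(("done", None))

    def merged(self):
        """Yield records until every registered source has finished."""
        while True:
            with self._lock:
                if self._open == 0:
                    return
            kind, record = self._records.get()
            if kind == "done":
                with self._lock:
                    self._open -= 1
            else:
                yield record
```

A limitation of this sketch is that the merged stream ends as soon as all currently registered sources are exhausted, so a source added after that point is not picked up.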

To sum up, we have finalized a new DTS design that distributes different transformation processes to separate execution nodes. Our future work will also address the concurrent execution of the same transformation process on different execution nodes.

None


  • Finalization of the new design for gCube Data Transformation Service

CNR Activities

The beneficiary should report here a summary of the activities performed in the reporting period


The beneficiary should report here major issues faced in the reporting period and the identified corrective actions, if any.


The beneficiary should report here a bullet list highlighting the main achievements of the reporting period

FAO Activities

The beneficiary should report here a summary of the activities performed in the reporting period


The beneficiary should report here major issues faced in the reporting period and the identified corrective actions, if any.


The beneficiary should report here a bullet list highlighting the main achievements of the reporting period

T10.3 Data Mining and Visualisation Facilities

CNR Activities

As reported last month, CNR is working on the development of a statistical service that will provide facilities for performing data mining on users' data sources. The features exposed by the service can be categorized as follows:

  • GENERATORS: includes probability distributions, classifications, matching or distance measurements etc.
  • MODELING: includes models to be trained, e.g. neural networks, species envelopes, support vector machines etc. The result will typically be a binary file.
  • CLUSTERING: involves clustering procedures for grouping together phenomena or multidimensional points.
  • TRANSDUCERS: involves algorithms for transforming a dataset into another.
  • EVALUATORS: a set of procedures for measuring the quality of a model.

Currently the following has been implemented for each class:

  • GENERATORS:
    • Aquamaps Suitable
    • Aquamaps Native
    • Aquamaps Suitable 2050
    • Aquamaps Native 2050
    • Aquamaps Native+Neural Network
    • Aquamaps Suitable+NeuralNetwork

The above algorithms can run in parallel on a local machine.

  • MODELING:
    • Aquamaps HSPEN
    • FeedForward Neural Network
  • CLUSTERING: -
  • TRANSDUCERS: -
  • EVALUATORS:
    • Discrepancy Analysis between two spatial probability distributions
      • mean
      • variance
      • accuracy
      • errors count
      • max error
    • Quality Analysis for a classification or probability distributions
      • Accuracy
      • Sensitivity
      • Omission Rate
      • Specificity
      • AUC
      • ROC Curve
      • ROC Chart
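
For reference, the quality measures listed above can be computed from predicted probabilities and known presence/absence labels as in the following Python sketch. It uses the standard textbook definitions and is not taken from the Statistical Service; the function names and the default decision threshold of 0.5 are assumptions.

```python
def quality(scores, labels, threshold=0.5):
    """Confusion-matrix measures for predicted probabilities `scores`
    against true presence (1) / absence (0) `labels`."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
        else:
            tn += 1
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),    # presences correctly predicted
        "specificity": tn / (tn + fp),    # absences correctly predicted
        "omission_rate": fn / (tp + fn),  # presences missed (1 - sensitivity)
    }


def auc(scores, labels):
    """Area under the ROC curve, computed as the probability that a random
    presence point scores higher than a random absence point (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Both functions assume that the input contains at least one presence and one absence point; otherwise the ratios are undefined.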

We report here an experiment in which we tried to reproduce the Aquamaps distributions, both those manually reviewed and those automatically generated by the Aquamaps algorithm. The following images show a distribution manually reviewed by biologists and a map automatically generated by the Aquamaps algorithm for the Basking Shark.

Suitable Map Generated by Aquamaps biologists.
Suitable Map Generated by Aquamaps Algorithm.
Absence data from close positions.
Absence data from random positions.
Presence data for Basking Shark.

We trained two feed forward neural networks with presence data and absence data from the figures above. In the first case absence data were close together, while in the second case the distribution of absence data was randomly chosen. For absence data we took the points of the manually reviewed distribution where the map reported probability values less than 0.01. The resulting distributions are completely different, as shown in the following figures.
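
The two ways of choosing absence data can be illustrated with the Python sketch below. Only the 0.01 probability threshold comes from the experiment; the cell layout `(lon, lat, probability)`, the function names, and in particular the way "close together" points are picked (the candidates nearest to a given origin) are assumptions, since the report does not describe the selection procedure.

```python
import random


def absence_candidates(cells, threshold=0.01):
    """Cells of the reviewed map whose probability is below the threshold.
    Each cell is assumed to be a (lon, lat, probability) tuple."""
    return [(lon, lat) for lon, lat, p in cells if p < threshold]


def random_absences(candidates, n, seed=0):
    """Absence points spread randomly over all candidates."""
    return random.Random(seed).sample(candidates, n)


def clustered_absences(candidates, n, origin):
    """Absence points close together: the n candidates nearest to `origin`
    (plain squared distance in degrees, good enough for a sketch)."""
    def distance(cell):
        return (cell[0] - origin[0]) ** 2 + (cell[1] - origin[1]) ** 2
    return sorted(candidates, key=distance)[:n]
```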

Suitable Map Generated by a Neural Network. Training absence data are close together.
Suitable Map Generated by a Neural Network. Training absence data are randomly chosen.


The underlying idea was to supply the neural network with the same parameters as the Aquamaps algorithm, i.e. depth mean, depth max, depth min, sst mean, sbt mean, salinity mean, salinity b mean, primary production mean, ice concentration, land distance and ocean area. These should contain a sufficient amount of information to capture the fish's environmental preferences. The results obtained could mean that, for the Basking Shark, the information set is not rich enough, or that the biologist designed the map based on local information about the fish, as the neural network did.

The setup used for this scenario was the following:

  • Neural Network uses 2 inner layers with 100 and 2 neurons
  • Presence data are made up of 449 points
  • Absence data are made up of 449 points
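
The network structure described above can be written down as a small pure-Python sketch. The two inner layers of 100 and 2 neurons and the eleven input parameters come from the experiment; the single output neuron, the sigmoid activation, the random initial weights and the function names are assumptions, and no training is shown.

```python
import math
import random

# 11 environmental inputs, inner layers of 100 and 2 neurons,
# and one output neuron (the output size is an assumption).
LAYER_SIZES = [11, 100, 2, 1]


def init_network(sizes, seed=0):
    """One (weights, biases) pair per layer, with random initial weights."""
    rng = random.Random(seed)
    return [
        ([[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)],
         [0.0] * n_out)
        for n_in, n_out in zip(sizes, sizes[1:])
    ]


def forward(network, features):
    """Propagate one feature vector through the network.
    Features are assumed to be scaled to [0, 1] beforehand."""
    activations = features
    for weights, biases in network:
        activations = [
            1.0 / (1.0 + math.exp(-(sum(w * a for w, a in zip(row, activations)) + b)))
            for row, b in zip(weights, biases)
        ]
    return activations[0]


def parameter_count(network):
    """Total number of weights and biases."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in network)
```

Under these assumptions the network has 1405 trainable parameters, to be fitted on the 898 presence and absence points.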


  • The Statistical Service has been tested and enhanced with new features
  • A testing experiment has been carried out to assess the service's performance
  • Quality analysis has been developed and improved
  • Statistical analysis has been introduced with a state-of-the-art system (Feed Forward Neural Network)

NKUA Activities

The beneficiary should report here a summary of the activities performed in the reporting period


The beneficiary should report here major issues faced in the reporting period and the identified corrective actions, if any.


The beneficiary should report here a bullet list highlighting the main achievements of the reporting period

FAO Activities

The beneficiary should report here a summary of the activities performed in the reporting period


The beneficiary should report here major issues faced in the reporting period and the identified corrective actions, if any.


The beneficiary should report here a bullet list highlighting the main achievements of the reporting period

T10.4 Semantic Data Analysis Facilities

FORTH Activities

We are investigating the use of search standards (OpenSearch) and the publishing/returning of advanced search results in Semantic Web formats (RDF).
We are currently preparing a document describing these issues, which can be found at BSCW ([[2]]).
We have started the development of a prototype.


None


None

FAO Activities

The beneficiary should report here a summary of the activities performed in the reporting period


The beneficiary should report here major issues faced in the reporting period and the identified corrective actions, if any.


The beneficiary should report here a bullet list highlighting the main achievements of the reporting period
