The ScalableDataMining Virtual Research Environment

From IMarine Wiki




Scalable Data Mining is a VRE designed to apply Data Mining techniques to biological data. The algorithms are executed in a distributed fashion on the e-Infrastructure nodes or on local multi-core machines. Scalability here means both distributed data processing and services that are provided dynamically to the users: the system scales with the number of users and with the size of the data to process.

Statistical data processing can be applied to perform Niche Modelling or Ecological Modelling experiments. Other applications can use general-purpose techniques such as Bayesian models. Time series of observations can be managed as well, in order to classify trends, detect anomalous patterns and perform simulations.

The idea behind distributing the computation of data mining techniques is to overcome limitations that commonly arise when using statistical algorithms:

  • the time required by the training and projection procedures,
  • the linear or non-linear growth of processing time as the amount of data to process increases,
  • the multiple runs needed to reduce overfitting or local minima problems,
  • the multiple model topologies to be evaluated in order to find the optimal model configuration.

All of the above issues strongly limit the amount of time a scientist can dedicate to evaluating the results and to combining and comparing the outcomes of different experiments. The Scalable Data Mining VRE, on the other hand, benefits from running on a distributed e-Infrastructure endowed with many data sources. Some of the advantages are:

  • efficiency and time saving in computations;
  • availability of a set of data sources containing environmental or species features;
  • reliability of the quality of the features;
  • certification of compliance between e-Infrastructure data sources and algorithm inputs/outputs;
  • import of users' own files;
  • sharing of results and users' files.
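
The multiple runs and multiple model topologies mentioned above are independent of one another, which is what makes them suitable for execution on a multi-core machine or on several nodes. The following is a minimal Python sketch of that idea using only the standard library; it is not the VRE's implementation. The names train_and_score, _error and run_grid are hypothetical, and the "training" is a toy random local search on a bumpy error surface, chosen only because different seeds can end in different local minima.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import product
import math
import random


def _error(x):
    # Toy multimodal error surface with several local minima (never below 0).
    return math.sin(3.0 * x) + 0.1 * x * x + 1.0


def train_and_score(job):
    """Hypothetical stand-in for one training run.

    Takes a (topology, seed) pair and returns (topology, seed, error).
    """
    topology, seed = job
    rng = random.Random(seed)           # seeded, so each run is reproducible
    steps = 50 * sum(topology)          # assumption: larger models train longer
    x = rng.uniform(-4.0, 4.0)          # random initialisation
    best = _error(x)
    for _ in range(steps):
        candidate = x + rng.gauss(0.0, 0.1)
        err = _error(candidate)
        if err < best:                  # keep only improving moves
            x, best = candidate, err
    return topology, seed, best


def run_grid(topologies, seeds, executor_cls=ProcessPoolExecutor, max_workers=None):
    """Run every (topology, seed) pair in parallel.

    Returns a dict mapping each topology to the (seed, error) of its best run.
    """
    jobs = list(product(topologies, seeds))
    best_per_topology = {}
    with executor_cls(max_workers=max_workers) as pool:
        for topology, seed, err in pool.map(train_and_score, jobs):
            current = best_per_topology.get(topology)
            if current is None or err < current[1]:
                best_per_topology[topology] = (seed, err)
    return best_per_topology
```

With the default process pool, the call should be placed under an `if __name__ == "__main__":` guard, for example `run_grid([(4,), (8, 4)], range(10))`. Passing a different executor class keeps the same code usable where processes are not appropriate; distributing jobs across infrastructure nodes would need a different backend, which this sketch does not cover.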


The ScalableDataMining Virtual Research Environment Homepage


The main facilities this VRE offers are:

  • Statistical Manager:
  • Workspace: to enable every user to store and organise the information objects he/she is interested in working with. In addition, the user can collaborate with other users by sharing objects and messages;
  • VRE Management: to enable authorised users (i.e. VRE Managers) to manage other users who are using or wish to access the VRE. VRE Managers can (i) authorise users to access the VRE, (ii) assign roles to or withdraw roles from users, (iii) remove users, and (iv) send a communication to the current users.



  • (DONE) gCube 2.11.0 (October '12);
  • (DONE) gCube 2.12.0 (February '13);
  • (DONE) gCube 2.13.0 (March '13);
  • (DONE) gCube 2.14.0 (May '13);
  • (DONE) gCube 2.15.0 (June '13);
  • (DONE) gCube 2.16.0 (July '13);
  • (DONE) gCube 2.17.0 (December '13);
  • (DONE) gCube 2.17.1 (December '13);
  • (DONE) gCube 3.0.0 (March '14);