16.02.2012 T8.3 Conference Call
From IMarine Wiki
Time: Thursday February 16, 2012, 11:30 - 13:00 CET
Conference call agenda:
- Understand what support CNR needs in the scientific scenarios iMarine deals with, and how NKUA can be involved.
- Reach a pre-agreement on the targeted platforms
- Understand what the needs are for evolving PE2ng and its model of resources and operation without disrupting the architecture and code of scientific tools, so that they can seamlessly become part of gCube resources.
Participants:
- Pasquale Pagano (CNR)
- Gianpaolo Coro (CNR)
- George Kakaletris (NKUA)
- Gerasimos Farantatos (NKUA)
- Lefteris Stamatogiannakis (NKUA)
CNR is implementing a statistical service for T10.3.
The goal is to parallelize not only Aquamaps, but all other algorithms in the statistical service.
The idea is to use PE2ng for job execution and to move the RainyCloud approach to PE2ng.
CNR (Lino) describes the RainyCloud model.
RainyCloud is not just for Aquamaps algorithms; it supports a multitude of models.
RainyCloud supports different back-ends, transparently to the user.
The input to RainyCloud consists of the input tables and their location, the maximum number of computational resources, and the core function that can be parallelized. The process of storing data is managed by the RainyCloud procedure.
The core function must be present on the execution node that will be called. The maximum number of nodes is an artificial limit imposed in order to support future fee-based execution, the way the Cloud works.
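The three inputs listed above can be pictured with a minimal Python sketch. It is not RainyCloud or PE2ng code: the names `JobRequest`, `split` and `run`, the field names, and the use of local threads in place of remote execution nodes are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class JobRequest:
    """Hypothetical job descriptor; the field names are illustrative only."""
    input_tables: Dict[str, str]   # table name -> location of the table
    max_nodes: int                 # upper bound on computational resources
    core_function: Callable[[Sequence[int]], List[int]]  # parallelizable unit


def split(items: Sequence[int], parts: int) -> List[Sequence[int]]:
    """Split items into at most `parts` contiguous, near-equal chunks."""
    parts = max(1, min(parts, len(items)))
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def run(job: JobRequest, records: Sequence[int]) -> List[int]:
    """Apply the core function to each chunk with at most max_nodes workers."""
    chunks = split(records, job.max_nodes)
    # Threads stand in for execution nodes in this sketch.
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        results = pool.map(job.core_function, chunks)  # keeps chunk order
        return [row for part in results for row in part]


if __name__ == "__main__":
    job = JobRequest(
        input_tables={"input_table": "placeholder-location"},  # placeholder
        max_nodes=4,
        core_function=lambda chunk: [x * x for x in chunk],
    )
    print(run(job, list(range(10))))
```

In this sketch the caller only supplies the descriptor; how chunks are distributed stays hidden behind `run`, which mirrors the "transparent to the user" property described in the call.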
CNR (Lino) inquires if PE2ng can be used and if there is need for extensions.
CNR (Gianpaolo) describes Ecological Modeling Application. EM supports many computational infrastructures for execution, PE2ng could be one of them.
NKUA (Lefteris) asks about the amount of data to be processed.
CNR (Gianpaolo): For Aquamaps, 1-3 million records are to be written to the table.
CNR (Lino): In Aquamaps, the data are already available in tables and the algorithms are run on them (e.g. to compute probability distributions). Every time an algorithm is executed, new data are generated. So far the input data are stored in several databases (hosted by different organizations, with read-only access) and the output data in another database.
CNR (Gianpaolo) informs about computational costs. On a local machine with 8 cores the algorithm takes about 3 hours to complete.
NKUA (Lefteris) raises the point that in batch processing all data is usually harvested first and put on Hadoop, and then the outcome is obtained.
CNR (Gianpaolo) informs that RainyCloud is a gateway towards Hadoop. CNR would like to use the d4S infrastructure as the computational infrastructure.
NKUA (Lefteris) points out that data should be brought to one place, to prevent data from moving around.
CNR (Lino) Describes the attributes and requirements of processing tasks. Data are generated by several different organizations, in databases. E.g. GBIF, Aquamaps communities.
CNR (Gianpaolo) further notes that the bottleneck of storing data in the database takes up 20% of the computation. The other 80% is calculation, meaning that if we parallelize the calculations, we essentially parallelize the whole procedure.
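As a rough worked example of the 80/20 split, Amdahl's law gives the overall speedup when only the calculation part is parallelized. The sketch assumes that the 20% spent storing data stays serial, which the minutes do not state; the function name `amdahl_speedup` is chosen here for illustration.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Overall speedup when only `parallel_fraction` of the run time scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Assumption: 80% calculation (parallel), 20% database storing (serial).
    for workers in (1, 2, 8, 32):
        print(workers, round(amdahl_speedup(0.8, workers), 2))
    # prints: 1 1.0, 2 1.67, 8 3.33, 32 4.44 (one pair per line)
```

Under that assumption the speedup approaches, but never exceeds, 1 / 0.2 = 5, so the storing step would eventually dominate unless it is parallelized or overlapped with the calculations as well.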
NKUA (Lefteris) asks about the possibility of harvesting data every night in order to populate a processing engine like Hadoop and query the summaries produced.
CNR (Gianpaolo) describes the on-demand requirements of the tasks. The user has their own database. The user wants to start the calculation and receive results after a certain amount of time. We do not know where the database is and there might be real-time requirements, therefore we cannot perform batch processing. The requirement is to process GBs of data on demand.
NKUA (George) points out that most grid workflow engines, as well as PE2ng, operate based on this model, drawbacks notwithstanding. Support data registration to avoid reloads.
NKUA (George): We must understand the need for new algorithms, and which of those are parallelizable and which are not.
NKUA (George) asks if there are requirements for Modeling and Transducer algorithms and if NKUA could assist CNR on the latter.
CNR (Lino): We are not really working based on requirements from users. Instead, knowing the user's needs, we propose and implement solutions. Not driven by user requirements, but by user knowledge.
NKUA (George) asks if there is any need to put effort into clustering algorithms.
CNR (Gianpaolo): CNR will provide the algorithms and perhaps a parallelization plan. NKUA will support with PE2ng.
CNR (Lino) asks whether abstractions will be built on top of PE2ng, e.g. so that map and reduce functions can be provided to an abstraction layer.
NKUA (George): It is UoA's intention to provide execution abstractions for PE2ng. In D4Science much effort was put on Search, now revising Indexes.
We also want to exploit grid infrastructures in addition to hadoop.
One thing we will have to tackle is code deployment. This means collaborating with deployment services, e.g. gCube services or the manager of Hadoop (a complex task).
We can provide a draft model of execution abstractions, e.g. map/reduce (so that CNR could provide the map and reduce functions).
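To make the map/reduce idea concrete, a minimal single-process Python sketch follows. It is not the PE2ng abstraction discussed in the call: the function `map_reduce`, its signature and the sample records are assumptions, and a real engine would distribute the map and reduce phases across nodes.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, Tuple


def map_reduce(
    records: Iterable[object],
    map_fn: Callable[[object], Iterable[Tuple[Hashable, object]]],
    reduce_fn: Callable[[Hashable, List[object]], object],
) -> Dict[Hashable, object]:
    """Run user-supplied map and reduce functions over the records."""
    groups: Dict[Hashable, List[object]] = defaultdict(list)
    for record in records:                   # map phase
        for key, value in map_fn(record):
            groups[key].append(value)        # group values by key
    # reduce phase: one call per distinct key
    return {key: reduce_fn(key, values) for key, values in groups.items()}


if __name__ == "__main__":
    # Made-up sample rows: (species id, probability for one cell).
    rows = [("species_a", 0.9), ("species_b", 0.4), ("species_a", 0.7)]
    best = map_reduce(
        rows,
        map_fn=lambda row: [(row[0], row[1])],
        reduce_fn=lambda key, values: max(values),
    )
    print(best)
```

With such a layer, the algorithm provider would write only `map_fn` and `reduce_fn`, while grouping, scheduling and deployment stay on the engine side.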
CNR (Lino): NKUA could also be involved in Data Mining and Visualization.
NKUA (George): Anomaly pattern detection and bibliometric analysis are of interest. Lefteris has experience in these subjects, among others.
Action points:
- CNR (Gianpaolo) will take a look at the documentation and perform some simple executions.
- NKUA (Gerasimos) will contact Marco Mikulicic for RainyCloud integration with PE2ng (via a PE2ng adaptor).
- NKUA (Lefteris) will start examining algorithms (and data) for making proposals on the topic.
- NKUA (George) will work on offering abstractions for computational models over PE2ng.