2011.12 WP9: Data Management Facilities Development Monthly Activity By Task and Beneficiary
From IMarine Wiki
This activity report documents the activities performed in November and December 2011. It is an integral part of the Activity Report for the period.
T9.1 Data Access and Storage Facilities
FAO has been working in two main directions:
- in alignment with the methodology presented at the project’s KOM for the task, FAO has liaised with CNR on the selection of a first group of relevant data sources, the characterisation of the models and access protocols, and the design of solutions to access them from within the infrastructure. The following data sources have been looked at and discussed:
- FLOD, in representation of data sources that expose Open Linked Data via SPARQL access protocols.
- FAO’s SDMX repository, in representation of data sources that expose timeseries data in standard SDMX formats and access protocols.
- OBIS, in representation of data sources that expose biodiversity data in standard Darwin Core formats and standard DiGIR protocols.
- WorldBank, in representation of data sources that expose structured data under ad-hoc models and ad-hoc protocols.
In all cases, it has been decided to reuse and instantiate solutions already available in the Content Management subsystem of gCube, i.e. by developing plugins for the Content Manager service that translate the target protocols and models to the tree-based model and access API of the service. This will require defining canonical tree forms for the identified classes, though the Open Linked Data case and the ad-hoc access cases such as WorldBank will likely be accommodated with mappings to unconstrained trees. It has been decided that access solutions for at least two of the data sources above will be prototyped for the first deliverable due in M4.
- given the central role of gCube’s existing access mechanisms in the solutions planned above, it has been decided that the beginning of the project is the best, and in fact the only, moment at which known issues with the Content Management subsystem can be addressed, so as to give iMarine and its development activities in the area of data access the most solid basis possible.
Accordingly, FAO has carried out intensive work to address issues such as: unnecessary orientation towards content management and document management, limited modularity of subsystem components, excessive dependencies on gCore, sub-optimal failure management, non-standard plugin installation mechanisms, and many other issues of a more minor nature. To avoid retro-compatibility issues, FAO has completely forked the Content Manager and its associated components into a new gCube subsystem, the Data Access subsystem. Existing clients will migrate to the new components at their own pace.
The forking has resulted in 10 components oriented towards general tree management, including:
- tree-manager service (a fork of Content Manager)
- tree-manager-stubs (a fork of the Content Manager stubs)
- tree-manager-library (a fork of the Content Manager Library)
- tree-manager-framework (previously embedded in Content Manager)
- streams library (previously embedded in Content Manager)
- trees library (previously embedded in Content Manager)
- document-model-library (forks the gCube Document Model Library)
- oai-tm-plugin (forks the OAI plugin of the Content Manager)
- tree-repository (new, replaces the SMS Plugin of the Content Manager as the internal storage solution)
The components are all built with Maven, are fully documented in the code, are under version control, and already integrate with the Etics infrastructure using the latest support for Maven builds, which is about to be rolled out in WP7. Service archives for each component are also available and are built every night. Wiki-based documentation will next be forked from the Content Manager’s, with the appropriate changes.
tree-manager-library makes use of two new enabling components that fall in the scope of WP8 activities, i.e.
common-gcore-clients, which have been developed using the new subsystem as a source of requirements and first test-bed.
tree-manager-framework directly addresses the requirements for plugins for the new data sources listed above and is already available for early prototyping. It is foreseen that plugin development will begin in earnest in the new year.
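The plugin approach described above, in which a source's own model is translated into trees, can be illustrated with a minimal sketch. It is not the tree-manager-framework API: the `Node` class, the `to_tree` function and the `AdHocSourcePlugin` class are hypothetical names introduced only for illustration, and the records stand in for whatever an ad-hoc source such as WorldBank might return. The mapping is unconstrained in the sense that it imposes no canonical form on the resulting trees.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, Iterator, List


@dataclass
class Node:
    """A labelled tree node: a leaf carries a value, an inner node carries children."""
    label: str
    value: Any = None
    children: List["Node"] = field(default_factory=list)


def to_tree(label: str, data: Any) -> Node:
    """Map a nested record (dicts, lists, scalars) onto an unconstrained tree."""
    node = Node(label)
    if isinstance(data, dict):
        for key, item in data.items():
            # A list yields one sibling per element, all sharing the key as label.
            items = item if isinstance(item, list) else [item]
            node.children.extend(to_tree(key, element) for element in items)
    else:
        node.value = data
    return node


class AdHocSourcePlugin:
    """Hypothetical plugin: exposes the records of an ad-hoc source as trees."""

    def __init__(self, records: Iterable[Dict[str, Any]]):
        self._records = list(records)

    def trees(self) -> Iterator[Node]:
        for index, record in enumerate(self._records):
            yield to_tree(f"record-{index}", record)
```

A source with a standard model (e.g. Darwin Core) would instead use a mapping that always produces the same canonical tree shape, so that clients can rely on fixed labels.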
- all the activities below concern access to structured data. Activities more focused on unstructured data are ongoing at CNR, but there has been little alignment between the two strands of work. FAO recommends that design be governed by related principles under a common plan.
- a model for task distribution and development activities across task partners is still to be clearly defined.
- Identification of clear targets for a first iteration of development activities.
- full fork of the Content Management subsystem as a better foundation for future development activities
Extensions of storage functionalities of the StorageManager library, for integration in the gCube infrastructure
The activity can be summarized by the following points:
- StorageManager: wrapper library implementation for gCube environment;
- StorageManager: functionality extensions:
- access control on remote resources: private, shared, public;
- locking system on remote resources;
- TTL management for locked resources;
- StorageManager: implementation of a protocol handler called "smp" (StorageManagerProtocol);
- MongoDB: Supports asynchronous replication of data between servers for failover and redundancy
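The access control and locking extensions listed above can be sketched as follows. This is an in-memory stand-in under stated assumptions, not the StorageManager library: the class `RemoteResourceStore`, its methods, and the exact meaning given to private (owner only), shared (owner and group) and public (anyone) are hypothetical. Locks carry a time-to-live, so a lock left behind by a failed client expires on its own.

```python
import time
from typing import Callable, Dict, Tuple

ACCESS_LEVELS = ("private", "shared", "public")


class RemoteResourceStore:
    """Hypothetical stand-in for a remote store with access levels and TTL locks."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._meta: Dict[str, Tuple[str, str, str]] = {}  # path -> (owner, group, access)
        self._locks: Dict[str, Tuple[str, float]] = {}    # path -> (holder, expiry time)

    def put(self, path: str, owner: str, group: str, access: str = "private") -> None:
        if access not in ACCESS_LEVELS:
            raise ValueError(f"unknown access level: {access}")
        self._meta[path] = (owner, group, access)

    def can_read(self, path: str, user: str, user_group: str) -> bool:
        owner, group, access = self._meta[path]
        if access == "public":
            return True
        if access == "shared":
            return user == owner or user_group == group
        return user == owner  # private

    def lock(self, path: str, holder: str, ttl_seconds: float) -> bool:
        """Acquire or renew a lock; fails while another holder's lock is still live."""
        now = self._clock()
        current = self._locks.get(path)
        if current is not None and current[1] > now and current[0] != holder:
            return False
        self._locks[path] = (holder, now + ttl_seconds)
        return True

    def unlock(self, path: str, holder: str) -> bool:
        current = self._locks.get(path)
        if current is not None and current[0] == holder:
            del self._locks[path]
            return True
        return False

    def is_locked(self, path: str) -> bool:
        current = self._locks.get(path)
        return current is not None and current[1] > self._clock()
```

The clock is injected so that expiry can be exercised without waiting; a real deployment would keep lock state on the storage back-end rather than in the client.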
T9.2 Data Transfer Facilities
As presented during the KOM, CERN activities in the first period of the project with respect to this task have focused on studying and identifying a possible integration path between gCube and the FTS service, the file transfer solution implemented at CERN and exploited by the WLCG Grid.
Several meetings have been set up between the iMarine CERN team and the FTS team in order to understand the benefits of a possible integration of FTS in gCube, and whether the envisaged functionality of the gCube Data Transfer facilities could in turn be exploited by FTS.
In parallel, the iMarine CERN team has installed two versions of the FTS service in order to test their functionality.
First, the EMI 1 version (FTS 2.2.6) has been set up on the node d4science-dt.cern.ch (https://d4science-dt.cern.ch:8443/glite-data-transfer-fts/services/FileTransfer) together with version 2.2.6 of the FTS Agent. Both components have been configured to join the gLite VO (Virtual Organization) d4science.research-infrastructures.eu, and some tests have been performed from the EMI UI installed at CNR (ui.emi.d4science), in particular file transfers using url-copy between the Storage Elements installed at the CNR and NKUA sites.
In a second phase, a new version of FTS and FTA (2.2.8) has been installed on the same node. This version is still a pilot and has not yet been released for EMI 1. It includes a transfer monitoring mechanism which exploits the messaging infrastructure based on ActiveMQ message brokers. Since the monitoring solution implemented in gCube also relies on ActiveMQ brokers, this mechanism could be easily integrated in gCube.
Taking into account the deadlines and milestones for design and implementation, the CERN iMarine team has decided to continue investigating the FTS integration in gCube until the end of M3 (January 12) and to decide at that time whether to use FTS as the base service of the gCube Data Transfer Facilities or to start implementing a pure gCube component. The deadline has also been included in the EMI-iMarine DoW under definition.
In addition CERN edited an abstract concerning the study of the integration of the FTS service within gCube, which has been submitted to the EGI Community Forum 2012 to be held in Munich from 26 to 30 March.
- Meetings with EMI FTS team to understand a possible integration path of the FTS service in gCube.
- Study and installation of the EMI FTS service for the gLite VO (Virtual Organization) d4science.research-infrastructures.eu;
- Abstract on the study of the FTS integration within gCube submitted to the EGI CF 2012 (Munich, 26-30 March).
The activity consists in setting up an environment for the publication of geospatial data described in netCDF format. The publication is meant to produce content in OGC-compliant formats. The activity also had to investigate whether services can access files remotely, including via the OPeNDAP protocol. The result is to be used by data transfer procedures or by data access services. In the former case, the use of OGC protocols could allow geospatial content to be transferred from one machine to another; in the latter case, data could be retrieved by the data access service and released in another format for use by other consumers in the infrastructure.
The following figure depicts the architecture of the investigated procedure.
The THREDDS software has been used for publishing the contents of netCDF files, and a custom library has been released for accessing the content via OPeNDAP. Tests of the new library in the gCube VTI environment were successful. The library will be released in the production environment with the gCube 2.7.2 release. The hypothesized architecture will take netCDF files from MyOcean and publish them by means of the THREDDS software. In cases where data processing is required before supplying contents, a WPS service provided by 52North software (http://52north.org/maven/project-sites/wps/52n-wps-webapp/) will be used. THREDDS is currently running on an OpenNebula installation at the Terradue site. When ready, the whole system will be moved to the D4Science infrastructure production environment. The 52North WPS service will then be extended with plugins for resampling netCDF data. It will be integrated in the infrastructure by means of a connector in the Data Access gCube Service.
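As an illustration of remote access via OPeNDAP, the sketch below builds the kind of request URL a client could send to a THREDDS server. It is not the custom library mentioned above. It assumes a default THREDDS installation, which serves OPeNDAP under the `/thredds/dodsC/` path, and DAP2 constraint syntax, where a subset of a variable is selected with `[start:stride:stop]` per dimension. The host name, file path and variable name in the usage note are hypothetical.

```python
from typing import Sequence, Tuple

# One (start, stride, stop) triple per dimension; stop is inclusive in DAP2.
Hyperslab = Tuple[int, int, int]

DAP2_RESPONSES = ("", "dds", "das", "ascii", "dods")


def opendap_url(
    base: str,
    dataset_path: str,
    response: str = "",
    variable: str = "",
    slices: Sequence[Hyperslab] = (),
) -> str:
    """Build a DAP2 request URL for a dataset published by a THREDDS server."""
    if response not in DAP2_RESPONSES:
        raise ValueError(f"unsupported DAP2 response type: {response}")
    # Assumption: the server uses the default THREDDS context path.
    url = f"{base.rstrip('/')}/thredds/dodsC/{dataset_path.lstrip('/')}"
    if response:
        url += f".{response}"
    if variable:
        selection = "".join(f"[{start}:{stride}:{stop}]" for start, stride, stop in slices)
        url += f"?{variable}{selection}"
    return url
```

For example, `opendap_url("http://thredds.example.org:8080", "myocean/sst.nc", "ascii", "sst", [(0, 1, 0), (10, 2, 20)])` asks for the first step of the first dimension and every second index from 10 to 20 of the second. Some HTTP clients require the square brackets to be percent-encoded before the request is sent.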
T9.3 Data Assessment, Harmonization and Certification Facilities
The activity concerned the analysis and investigation of a framework for performing data assessment, harmonization and certification. The main idea is to use a state-of-the-art, possibly open-source framework to manage the import, harmonization and certification of data. The framework will manage sets of operations, each focused on a small task (e.g. import from CSV, replacement of data). The operations will be combined in a workflow in order to produce the desired output.
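The idea of combining small operations into a workflow can be sketched as below. This is only an illustration of the principle, not the framework under analysis: the function names (`import_csv`, `replace_value`, `run_workflow`) and the sample data in the usage note are hypothetical. Each operation takes a list of rows and returns a new list, so operations can be chained in any order.

```python
import csv
import io
from typing import Callable, Dict, Iterable, List, Optional

Row = Dict[str, str]
Operation = Callable[[List[Row]], List[Row]]


def import_csv(text: str) -> Operation:
    """Operation that ignores its input and loads rows from CSV text."""
    def operation(_rows: List[Row]) -> List[Row]:
        return [dict(row) for row in csv.DictReader(io.StringIO(text))]
    return operation


def replace_value(column: str, old: str, new: str) -> Operation:
    """Operation that replaces one value with another in a given column."""
    def operation(rows: List[Row]) -> List[Row]:
        return [{**row, column: new} if row.get(column) == old else row for row in rows]
    return operation


def run_workflow(operations: Iterable[Operation], rows: Optional[List[Row]] = None) -> List[Row]:
    """Apply the operations in order, feeding each one the previous output."""
    current = list(rows or [])
    for operation in operations:
        current = operation(current)
    return current
```

A workflow such as `run_workflow([import_csv(text), replace_value("area", "n/a", "")])` imports a CSV table and then blanks out placeholder values in the `area` column. The result is returned to the caller rather than kept inside the engine, so it can be stored wherever the caller chooses.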
The analysis is continuing and has not yet produced definitive results. A need has emerged for workflow output to be stored externally to the engine. The team has identified an existing technology for implementing the framework. The use of ETL tools from data warehouse systems is currently under analysis. In the Time Series development area, the team has started the transition to Maven artifacts.