MarineTLO-based warehouse

General Description

The MarineTLO-based warehouse contains information about marine species. The information in the warehouse is structured according to the MarineTLO ontology and integrates data from FLOD, ECOSCOPE, WoRMS, FishBase and DBpedia. Below we describe in more detail the underlying sources, the technology used for constructing the MarineTLO-based warehouse (hereafter simply the warehouse), and its current uses.

Building the warehouse

The underlying sources

Fisheries Linked Open Data (FLOD) RDF dataset. FLOD, created and maintained by the Food and Agriculture Organization (FAO), is dedicated to creating a dense network of relationships among the entities of the fisheries domain, and to serving them programmatically to both semantic and traditional application environments. The FLOD content is exposed either via a public SPARQL endpoint[1] (suitable for semantic applications) or via a Java API that can be embedded in consumers' application code. Currently, the FLOD network includes entities and relationships from the domains of Marine Species, Water Areas, Land Areas and Exclusive Economic Zones, and serves software applications in the domains of statistics and GIS.

ECOSCOPE Knowledge Base. IRD (the French Institut de Recherche pour le Développement) offers a public SPARQL endpoint[2] for its knowledge base, which contains geographical data, pictures and information about marine ecosystems (specifically data about fishes, sharks, related persons, countries and organizations, harbors, vessels, etc.).

WoRMS. The World Register of Marine Species[3] currently contains more than 200 thousand species, around 380 thousand species names (including synonyms), and 470 thousand taxa (from infraspecies to kingdoms).

FishBase[4] is a global database of fish species. It is a relational database containing information about taxonomy, geographical distribution, biometrics, population, genetic data and much more. Currently it contains more than 32 thousand species and more than 300 thousand common names in various languages.

DBpedia[5] is a project that converts content from Wikipedia into structured knowledge so that Semantic Web techniques can be employed against it. At the time of writing, the English version of the DBpedia knowledge base describes more than 4.5 million things, including persons, places, works, species, etc. In our case we use a subset of DBpedia's knowledge base containing only fishes (i.e. instances classified under the class http://dbpedia.org/ontology/Fish).

Applying data transformations and entity matching rules

Some data can be stored in the warehouse as fetched, while other data may need to be transformed first. For example, a literal may need to be transformed into a URI, or to be split into its constituents, or an intermediate node may need to be created: instead of (x, hasName, y) we may store (x, hasNameAssignment, z), (z, name, y), (z, date, d). To handle such cases we created a number of transformation rules that are applied to the fetched data before ingestion to the warehouse.
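
To illustrate, the following is a minimal sketch (in Python, using rdflib) of such a transformation rule; the namespace and the property names hasName/hasNameAssignment are the illustrative ones from the example above, not the warehouse's actual vocabulary:

    from rdflib import Graph, Namespace, Literal, BNode
    from rdflib.namespace import XSD

    EX = Namespace("http://example.org/marinetlo/")  # illustrative namespace

    def reify_name_triples(g, date):
        # Replace each direct (x, hasName, y) triple with an intermediate
        # "name assignment" node z carrying the name and its assignment date.
        out = Graph()
        for x, _, y in g.triples((None, EX.hasName, None)):
            z = BNode()
            out.add((x, EX.hasNameAssignment, z))
            out.add((z, EX.name, y))
            out.add((z, EX.date, Literal(date, datatype=XSD.date)))
        return out

    g = Graph()
    g.add((EX.thunnus_albacares, EX.hasName, Literal("yellowfin tuna")))
    print(reify_name_triples(g, "2014-07-01").serialize(format="turtle"))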

Assessing the connectivity

From this activity we observed that the data fetched from the sources are in many cases problematic (consistency problems, duplicates, wrong values). We have noticed that placing them together in a warehouse makes the identification of such errors easier. Furthermore, the availability of the warehouse enables the definition of owl:sameAs connections: they can be derived by exploiting transitively induced equivalences, or produced by matching rules expressed in the SILK framework[6]. In any case, inspecting the repository to detect the missing connections that are required for satisfying the competence queries is an important requirement. To this end we have devised metrics for quantifying the value of the warehouse and the contribution of each source to it. These metrics are described in detail in [Tzitzikas-LWDM'14], while a vocabulary that allows the representation, exchange and querying of such measurements is described in [Mountantonakis-PROFILES'14].
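
As a rough illustration of the kind of measurement involved (the actual metrics of [Tzitzikas-LWDM'14] are considerably richer), the following Python sketch counts the URIs shared by each pair of sources, a simple indicator of how connected their contents are:

    from itertools import combinations
    from rdflib import URIRef

    def uris(g):
        # All URIs appearing in any position of any triple of graph g.
        return {t for triple in g for t in triple if isinstance(t, URIRef)}

    def common_uri_matrix(sources):
        # 'sources' maps a source name (e.g. "FLOD") to its rdflib Graph; the
        # result gives, for each pair of sources, how many URIs they share.
        sets = {name: uris(g) for name, g in sources.items()}
        return {(a, b): len(sets[a] & sets[b]) for a, b in combinations(sets, 2)}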

Handling the provenance

After ingesting data coming from several sources into the warehouse, we can still identify their provenance. We support four levels of provenance: (a) at the conceptual modeling level, (b) at the URIs and values level, (c) at the triple level, and (d) at the query level.

As regards (a), MarineTLO itself models the provenance of species names, codes, etc. (who assigned them and when), so there is no need to adopt an additional provenance model (e.g. OPM[7]). As regards (b), we adopt the namespace mechanism for reflecting the source of origin of an individual; for example, the URI http://www.fishbase.org/entity#thunnus_albacares denotes that this URI has been derived from FishBase. Furthermore, during the construction of the warehouse there is the option of applying a uniform notation @source to literals (where source can be FLOD, ECOSCOPE, WoRMS, FishBase or DBpedia). As regards (c), we store the triples from each source in a separate graph space. This is useful not only for provenance, but also for refreshing parts of the warehouse, as well as for computing the connectivity metrics described previously. Finally, as regards (d), we have implemented a framework called MatWare (described below) that offers a query rewriting functionality which exploits the graph spaces and returns the sources that contributed to the query results. An extensive discussion about provenance in MarineTLO-based warehouses can be found in [Tzitzikas-ESWC'14].
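
The following minimal sketch (in Python with rdflib; the graph names and triples are illustrative) shows how separate graph spaces make triple-level provenance queryable:

    from rdflib import Dataset, Namespace, Literal, URIRef

    EX = Namespace("http://example.org/marinetlo/")  # illustrative namespace
    ds = Dataset()

    # Store each source's triples in its own named graph.
    worms = ds.graph(URIRef("http://example.org/graphs/WoRMS"))
    worms.add((EX.thunnus_albacares, EX.commonName, Literal("yellowfin tuna")))

    # A GRAPH query recovers which source contributed each triple.
    q = """
        SELECT ?g ?name WHERE {
          GRAPH ?g {
            <http://example.org/marinetlo/thunnus_albacares>
              <http://example.org/marinetlo/commonName> ?name }
        }"""
    for g, name in ds.query(q):
        print(g, name)  # the graph URI reveals the source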

The process

The following figure sketches the construction process. The main effort is dedicated to setting up the warehouse the first time, since at that point one has to select/define the schema, the schema mappings, the instance matching rules, etc. Afterwards the warehouse can be reconstructed periodically to refresh its content, without requiring human intervention. For this reason we have developed a framework, called MatWare [Tzitzikas-ESWC'14], that automates the construction, maintenance and quantitative evaluation of the warehouse.

The process for constructing and evolving the warehouse

The main steps for constructing the warehouse, as shown in the figure above, are:

  • Define the requirements of the marine community. These requirements are expressed as competence queries and allow testing and evaluating the MarineTLO ontology and the warehouse.
  • Get the required data from the underlying sources. We use different methods to fetch and transform these data, in particular:
    • FLOD: we fetch all data from its SPARQL endpoint.
    • ECOSCOPE: we download all the RDF files from the ECOSCOPE web site.
    • WoRMS: we use the Species Data Discovery Service (SDDS)[8] to retrieve all data from WoRMS. This produces a DwC-A file which is transformed to RDF with an appropriate tool (called MarineTLO Wrapper).
    • FishBase: we extract various information about marine species from the relational tables of FishBase and transform it to RDF with an appropriate tool (called FishBaseReaper).
    • DBpedia: we fetch all data about marine species from its SPARQL endpoint (a minimal fetching sketch follows this list).
  • Inspect the connectivity of the "draft" warehouse by exploiting the competence queries, and formulate the SILK rules that will produce the owl:sameAs links.
  • Create the triples that describe the warehouse and its contents (VoID triples).
  • Test and evaluate the warehouse using a set of connectivity metrics.
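
As a minimal illustration of the fetching step, the sketch below (Python with the SPARQLWrapper package) retrieves fish instances from DBpedia's public endpoint; the query shape is illustrative, not the one actually used for the warehouse:

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?fish ?label WHERE {
          ?fish a <http://dbpedia.org/ontology/Fish> ;
                rdfs:label ?label .
          FILTER (lang(?label) = "en")
        } LIMIT 100
    """)
    sparql.setReturnFormat(JSON)

    # Each binding is one fish instance with its English label.
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["fish"]["value"], row["label"]["value"])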

Versions

In total we have released 5 different versions of the warehouse. Each version contained the corresponding MarineTLO version and the required schema mappings, in addition to the following:

  • Version 1: contents from FLOD, ECOSCOPE and WoRMS, about the scientific names and predators of species.
  • Version 2: contents from FLOD, ECOSCOPE, WoRMS and DBpedia, about the same concepts as Version 1 (i.e., scientific names and predators) plus authorship information of species.
  • Version 3: contents from FLOD, ECOSCOPE, WoRMS, FishBase and DBpedia, about the same concepts as Version 2 plus common names of species and information about ecosystems, countries, water areas, vessels, gears and EEZs.
  • Version 3+: the same contents as the 3rd version of the warehouse. The difference is that we used multiple graph spaces for storing data coming from different sources. This allowed us to easily track the provenance of the information in the warehouse (e.g., the fact that "yellowfin tuna" is an English common name of the species Thunnus albacares is derived from WoRMS and FishBase).
  • Version 4: contents from FLOD, ECOSCOPE, WoRMS, FishBase and DBpedia, about the same concepts as Version 3, plus information about bibliographic citations and statistical indicators.

The following figure shows the contents of the latest version of the warehouse (last updated July 2014).

The contents of the MarineTLO-based warehouse V4 (July 2014)

Current uses of the warehouse

The warehouse is currently in use by the X-Search[9] system. Before building the MarineTLO-based warehouse, X-Search exploited FLOD as its underlying knowledge base and was able to detect no more than 11,000 species. Note also that for each species the MarineTLO-based warehouse has on average about 30 properties, while in FLOD each species has on average only 6 properties. In addition, the MarineTLO-based warehouse contains about 200 distinct predicates that connect two URIs (compared to about 40 predicates in FLOD), allowing a richer experience while browsing the properties of an entity. The following figure shows an example of (part of) an entity card. An entity card is a popup window describing a resource (e.g., a species) which is displayed to the user on demand (by clicking the small icon next to an entity name), offering entity exploration and browsing. In the figure we divided the card into four groups, each presenting information derived from a different source: group A comes from DBpedia, B from FLOD, C from ECOSCOPE and D from WoRMS. Note that this information is derived in real time (in less than one second); a sketch of such a per-source retrieval is given after the figure.

An entity exploration card displayed by XSearch for the species Thunnus Albacares
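
A hedged sketch of how such a per-source card could be assembled: the query below asks the warehouse's public SPARQL endpoint (see References) for every property of the entity together with the graph space it comes from, and groups the results by source; the grouping into the A-D panels is application logic inside X-Search:

    from collections import defaultdict
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://virtuoso.i-marine.d4science.org:8890/sparql")
    sparql.setQuery("""
        SELECT ?g ?p ?o WHERE {
          GRAPH ?g { <http://www.fishbase.org/entity#thunnus_albacares> ?p ?o }
        }
    """)
    sparql.setReturnFormat(JSON)

    # Group the entity's properties by the graph space (i.e. source) they
    # come from, mirroring the per-source groups of the entity card.
    card = defaultdict(list)
    for row in sparql.query().convert()["results"]["bindings"]:
        card[row["g"]["value"]].append((row["p"]["value"], row["o"]["value"]))

    for source_graph, props in card.items():
        print(source_graph, len(props), "properties")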

Furthermore, the FactSheetGenerator[10], which uses this warehouse, is under development and will offer more elaborate information. Its current version, which focuses on tuna and is called TunaAtlas, is already publicly available[11]; an indicative screen is shown below.

The Tuna Atlas application

References

  • MarineTLO-based Warehouse SPARQL endpoint: http://virtuoso.i-marine.d4science.org:8890/sparql
  • MarineTLO-based Warehouse faceted browser: http://virtuoso.i-marine.d4science.org:8890/fct
  • [Tzitzikas-MTSR'13] Y. Tzitzikas, C. Allocca, C. Bekiari, Y. Marketakis, P. Fafalios, M. Doerr, N. Minadakis, T. Patkos and L. Candela. Integrating Heterogeneous and Distributed Information about Marine Species through a Top Level Ontology, Proceedings of the 7th Metadata and Semantics Research Conference (MTSR'13), Thessaloniki, Greece, November 2013.
  • [Tzitzikas-ERCIM'14] Y. Tzitzikas, C. Allocca, C. Bekiari, Y. Marketakis, P. Fafalios and N. Minadakis. Ontology-based Integration of Heterogeneous and Distributed Information of the Marine Domain, ERCIM News 2014 (96), Special theme: Linked Open Data, January 2014.
  • [Tzitzikas-LWDM'14] Y. Tzitzikas, N. Minadakis, Y. Marketakis, P. Fafalios, C. Allocca and M. Mountantonakis. Quantifying the Connectivity of a Semantic Warehouse, 4th International Workshop on Linked Web Data Management (LWDM'14), Athens, Greece, March 2014.
  • [Mountantonakis-PROFILES'14] M. Mountantonakis, C. Allocca, P. Fafalios, N. Minadakis, Y. Marketakis, C. Lantzaki and Y. Tzitzikas. Extending VoID for Expressing the Connectivity Metrics of a Semantic Warehouse, 1st International Workshop on Dataset Profiling & Federated Search for Linked Data (PROFILES'14), in conjunction with the 11th Extended Semantic Web Conference (ESWC'14), Anissaras, Crete, Greece, May 2014.
  • [Tzitzikas-ESWC'14] Y. Tzitzikas, N. Minadakis, Y. Marketakis, P. Fafalios, C. Allocca, M. Mountantonakis and I. Zidianaki. MatWare: Constructing and Exploiting Domain Specific Warehouses by Aggregating Semantic Data, 11th Extended Semantic Web Conference (ESWC'14), Anissaras, Crete, Greece, May 2014.