ONETT: Systematic Knowledge Graph Generation for National Access Points

Adolfo Antón; Jhon Toledo; David Chaves-Fraga; Oscar Corcho

Introduction

Transport data is currently published by transport authorities and operators in many different formats. Some are well-known de-facto standards, such as the General Transit Feed Specification (GTFS); others are ad-hoc formats whose structure is decided by the data publisher (e.g., the datasets and APIs published by Empresa Municipal de Transportes de Madrid in its open data portal, tram information in Zaragoza, etc.).

All of these datasets share similarities because they describe overlapping sets of information (schedules, stops, vehicles, lines, etc.). They are also commonly made available in tabular data formats. For example, GTFS feeds are essentially zip-compressed sets of CSV files that follow the GTFS specification, and the other data sources mentioned above provide their data in either CSV or JSON.

Having all this data available in a homogeneous manner would reduce the total cost of reusing data sources, especially across operators/authorities and cities/regions. That is, developers would be able to build one application that could be deployed in any city in the world with minor adaptations. This is already happening with GTFS, which is used not only by Google Maps to provide transport infrastructure data and route planning, but also by other route planners, such as Navitia.io and OpenTripPlanner.

To achieve this homogeneity, there are several options that may be followed:

Transport authorities and operators may agree on a common data format and publish their data accordingly. They know well the type of data that they handle, the quality properties of such data, etc., so they should be able to provide this data easily. To some extent, this is what is currently happening with GTFS, and what should happen in the near future in the European Union with NeTEx, according to directive 2010/40/EU and regulation 2017/1926 (MMTIS).
Third parties (as well as operators and authorities themselves) may create transformation rules that convert the original data sources into other generally agreed formats, republishing the transformed data either in the original data portals, if allowed to do so, or on other servers. Transformations may be done programmatically (that is, with ad-hoc code) or declaratively (using mappings in existing languages like R2RML [1] or RML [2]).

In this paper, we present our work on ensuring that declarative mappings can be used for the purpose of transforming transport data published by transport authorities and operators into a homogeneous representation based on Transmodel (the reference data model for public transport at European level, which will be further described in section 2). This data can then be further transformed into NeTEx so as to comply with the EU regulations for the publication of transport-related data in National Access Points.

Transmodel Ontology and GTFS

In its drive to foster interoperability across Europe, the EU is requiring each Member State to provide access to transportation data via a National Access Point (NAP). According to EU Regulation 2017/1926, all transportation authorities, transport operators and infrastructure managers must provide static and dynamic data in specific data formats (e.g., NeTEx, SIRI). The Regulation applies to different transportation modes, including air, train, road vehicle, bus, ferry, metro, tram, shuttlebus, car-sharing, car-pooling and bike-sharing.

Transmodel is the European Reference Data Model for Public Transport. It provides a conceptual model of common public transport concepts and data structures that can be used to build many different kinds of public transport information systems, such as timetabling, fares, operational management, real-time data and journey planning. It is divided into eight Parts: Common Concepts (CC), Public Transport Network Topology (NT), Network Description (ND), Operations Monitoring & Control (OM), Fare Management (FM), Passenger Information (PI), Driver Management (DM), and Management Information & Statistics (MI).

These Parts are usually implemented by different standards or specific data formats. One of the most relevant implementations is NeTEx, which partially covers the Parts CC, NT, ND, FM and PI. Its role was established by EU Regulation 2017/1926 (May 2017), in which the European Commission recognized NeTEx as a strategic standard for the cross-border exchange of data. The first step must be taken before December 2019, by which date every European country must make data available in NeTEx format at its National Access Point to enable EU-wide multi-modal travel information services.

The General Transit Feed Specification (GTFS) is a de-facto standard for representing public transport data. A feed is a collection of up to fifteen CSV files (with extension .txt and preferably encoded as UTF-8), of which at least five are required and two are conditionally required, contained within a .zip file that describes the scheduled operations of a transit system. The aim of GTFS is to provide at least trip-planning functionality. The specification defines the headers and a set of rules that must be followed when a dataset is created. Each file, as well as each of its headers, can be mandatory or optional, and the files are related to one another. The specification supports the representation of several public transport features, such as trips, routes, stops, times, fares and calendars.
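As an illustration of this file-based structure, the following minimal Python sketch opens a feed, reads the header row of each .txt file and reports which required files are absent. The file names in `REQUIRED` and `CONDITIONALLY_REQUIRED` are an assumption taken from the public GTFS reference rather than from this paper, and the function names `inspect_feed` and `missing_files` are hypothetical.

```python
import csv
import io
import zipfile

# Assumption: file names as listed in the public GTFS reference.
REQUIRED = {"agency.txt", "stops.txt", "routes.txt", "trips.txt", "stop_times.txt"}
CONDITIONALLY_REQUIRED = {"calendar.txt", "calendar_dates.txt"}


def inspect_feed(feed):
    """Return {file name: list of headers} for every .txt file in a GTFS zip.

    `feed` may be a path or a binary file-like object.
    """
    headers = {}
    with zipfile.ZipFile(feed) as archive:
        for name in archive.namelist():
            if not name.endswith(".txt"):
                continue
            with archive.open(name) as raw:
                # utf-8-sig drops the byte-order mark that some feeds include.
                text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
                headers[name] = next(csv.reader(text), [])
    return headers


def missing_files(headers):
    """List the required files that the feed does not contain."""
    present = set(headers)
    missing = sorted(REQUIRED - present)
    # At least one of the two calendar files has to be present.
    if not (CONDITIONALLY_REQUIRED & present):
        missing.append("calendar.txt or calendar_dates.txt")
    return missing
```

A call such as `missing_files(inspect_feed("feed.zip"))` (with a hypothetical path) would return an empty list for a feed that contains all of the files assumed above.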

In order to provide a better GTFS to NeTEx conversion and, further on, full data interoperability, we have started to build a Transmodel Ontology. The development is released in a GitHub repository, where all material generated during the development of the vocabulary (for instance use cases, user stories and a glossary of terms) will be available in the project's Vocabulary Wiki. Eventually, queries will be run against a SPARQL endpoint to test and illustrate its operability. Furthermore, a CEN Transmodel working group has published a base URI that ONETT uses to perform the transformations. Before performing the transformation from GTFS to the ontology-based format of Transmodel, we analyse the relationship between the two standards. For example, Table 1 shows the relation between the properties of calendar.txt in the GTFS model and the corresponding properties in Transmodel, using the NeTEx implementation. The full relation between the two standards is available online.

GTFS-Calendar	Transmodel (NeTEx)
service_id	<DayType>@id + <ServiceCalendarFrame>@id
monday	<DayType><properties><PropertyOfDay><DaysOfWeek>monday
tuesday	<DayType><properties><PropertyOfDay><DaysOfWeek>tuesday
wednesday	<DayType><properties><PropertyOfDay><DaysOfWeek>wednesday
thursday	<DayType><properties><PropertyOfDay><DaysOfWeek>thursday
friday	<DayType><properties><PropertyOfDay><DaysOfWeek>friday
saturday	<DayType><properties><PropertyOfDay><DaysOfWeek>saturday
sunday	<DayType><properties><PropertyOfDay><DaysOfWeek>sunday
start_date	<ServiceCalendar><FromDate>
end_date	<ServiceCalendar><ToDate>

Table 1: Relation between GTFS-Calendar properties and Transmodel in its NeTEx implementation
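The correspondence in Table 1 can be illustrated with a minimal Python sketch that rearranges one calendar.txt row into a nested structure whose keys mirror the NeTEx element names in the table. This is only an assumption-laden illustration: the output is a plain dictionary, not a valid NeTEx document, the function name `calendar_row_to_day_type` and the sample row are hypothetical, and the day columns are assumed to hold the 0/1 flags defined by GTFS.

```python
import csv
import io

DAYS = ("monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday")


def calendar_row_to_day_type(row):
    """Rearrange one calendar.txt row along the lines of Table 1."""
    # GTFS marks each weekday column with "1" (service runs) or "0".
    active_days = [day for day in DAYS if row.get(day) == "1"]
    return {
        "ServiceCalendarFrame": {"id": row["service_id"]},
        "DayType": {"id": row["service_id"], "DaysOfWeek": active_days},
        "ServiceCalendar": {
            "FromDate": row["start_date"],
            "ToDate": row["end_date"],
        },
    }


# Hypothetical sample row, not taken from a real feed.
sample = io.StringIO(
    "service_id,monday,tuesday,wednesday,thursday,friday,"
    "saturday,sunday,start_date,end_date\n"
    "WD,1,1,1,1,1,0,0,20190101,20191231\n"
)
for row in csv.DictReader(sample):
    print(calendar_row_to_day_type(row))
```

In ONETT this correspondence is expressed declaratively in the mapping rather than in code; the sketch only makes the column-to-element relation concrete.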

The ONETT demo

The Open NEtwork of public Transport application (ONETT) uses Semantic Web technologies to generate knowledge graphs in the transport domain. In more detail, ONETT applies the concept of Ontology-Based Data Access (OBDA) [3], which aims to provide a unified view of, and common access to, a set of data sources using ontologies and mappings.

In this specific case, we define a general mapping between GTFS and the ontology-based Transmodel using the RML specification in its YARRRML [4] serialization. To transform the raw CSV data into RDF, ONETT integrates the SDM-RDFizer engine for RML mappings. Before running the transformation, we have to perform a mapping translation [5] process to adapt the general mapping to the input data, as the input is not always going to have the same structure and number of files. The workflow of the application is shown in Fig. 1. In more detail, the steps followed by ONETT to generate the desired RDF knowledge graph, based on the Transmodel ontology, from a GTFS feed are:

Analyse the input data: It decompresses and analyses the input GTFS feed to know the files and the structure of each file (headers).
Mapping translation: It takes the general GTFS YARRRML mapping that represents the full specification and generates a new mapping corresponding to the input data.
Knowledge Graph Generation: It runs the SDM-RDFizer engine to transform the raw data to RDF.
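The mapping translation step can be sketched as follows, under strong simplifying assumptions. The general mapping is modelled here as a plain dictionary from file name to column-to-property rules, which is only a stand-in for the actual YARRRML document; the function name `translate_mapping`, the `ex:` property labels and the sample data are all hypothetical, and this is not ONETT's implementation.

```python
def translate_mapping(general_mapping, feed_headers):
    """Keep only the rules whose source file and column exist in the feed."""
    translated = {}
    for file_name, rules in general_mapping.items():
        columns = feed_headers.get(file_name)
        if columns is None:
            continue  # the feed does not contain this file
        kept = {col: prop for col, prop in rules.items() if col in columns}
        if kept:
            translated[file_name] = kept
    return translated


# Hypothetical excerpt of a general mapping (column -> property label).
GENERAL_MAPPING = {
    "calendar.txt": {
        "service_id": "ex:dayTypeId",
        "monday": "ex:monday",
        "start_date": "ex:fromDate",
    },
    "shapes.txt": {"shape_id": "ex:shapeId"},
}

# Headers as they might be found when analysing an input feed.
feed_headers = {"calendar.txt": ["service_id", "start_date", "end_date"]}

print(translate_mapping(GENERAL_MAPPING, feed_headers))
```

With this input, the rules for shapes.txt and for the monday column are dropped because the feed does not provide them, so the engine only receives rules it can actually evaluate.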

[ONETT workflow] — Fig. 1: The ONETT workflow for the systematic generation of Knowledge Graph following Transmodel from GTFS feeds.

These steps are a black box for the transport authorities that want to obtain the knowledge graph from their GTFS feeds. Using the web application, the user only has to upload the compressed feed or provide a URL, and ONETT automatically generates the corresponding knowledge graph. With this approach, we provide a useful tool for systematically generating National Access Point compliant data from a very popular de-facto standard data format.

Conclusions and Future Work

The availability of homogeneous transport data from transport authorities and operators worldwide makes it possible to create new types of transport applications (trip planners, fare calculators, ticket recommenders, etc.) that can be deployed easily in different regions or cities. In this paper, we have shown our approach to creating such homogeneous transport data, based on declarative mappings that can be used to generate transport knowledge graphs for any region or city in the world that currently publishes data in GTFS. The mappings allow GTFS data to be transformed into RDF according to a Transmodel-based ontology. Such data can be queried in a homogeneous manner, so that the aforementioned applications can be created more easily.