In its current state, the application uses the following four data sources, for the associated reasons:
- The ontology of the domain of scientific publications (namespace 'sr') that is being developed during this course (and which forms the 'internal' SPARQL endpoint for Milestone 3): Created to model scientific collaboration and publication, this ontology is at the heart of the project. Currently, it is flexible and responsive to conceptual changes that may occur in the early phases of research; in the future, it will serve as a platform on which data sources from various places can be integrated and can communicate with each other.
- VU-Pure database (namespace 'pvu') for populating the ontology with instances (previously integrated with the ontology): Pure holds detailed records of researchers at Vrije Universiteit Amsterdam, and because it was readily accessible for practical reasons (i.e., it was already in our possession), it was seen as a good starting point before other datasets are added.
- Web of Science categories (namespace 'wsc') for additional classes (also previously integrated with the ontology): Because a good model of scientific fields is essential for a study that aims to understand collaboration patterns between disciplines, we decided to integrate into our ontology the Web of Science Category Terms \cite{reuters2017}, a popular and established way of categorizing scientific fields. As mentioned, these categories have been made part of the combined ontology, but they are not yet mapped to scientific publications.
- A database of universities of the world (separate ontology/triple store), which is queried from DBpedia's SPARQL endpoint: Initially a demo database that ships with the Linked Data Reactor (LD-R) framework, this database was kept on the server because of its relevance to the current project (and modified to switch from the then non-functional 'live.dbpedia.org' domain to 'dbpedia.org'). It will likely be integrated with the current ontology in the future, as it would be efficient and beneficial to be able to name universities of the world without having to build a new ontology.
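To make the last data source more concrete, the following is a minimal sketch of how universities could be requested from DBpedia's public SPARQL endpoint using only the Python standard library. It is not the query that LD-R issues: the query text, the use of the `dbo:University` class, and the function names `build_request` and `fetch_universities` are assumptions made for illustration.

```python
import json
import urllib.parse
import urllib.request

# Public DBpedia endpoint (the 'dbpedia.org' domain mentioned above).
ENDPOINT = "https://dbpedia.org/sparql"

# Illustrative query, assumed for this sketch; the project's real query may differ.
QUERY_TEMPLATE = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?university ?label WHERE {
  ?university a dbo:University ;
              rdfs:label ?label .
  FILTER (lang(?label) = "en")
}
LIMIT %d
"""


def build_request(limit=10):
    """Build a SPARQL Protocol GET request asking for JSON results."""
    query = QUERY_TEMPLATE % int(limit)
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def fetch_universities(limit=10, timeout=30):
    """Send the request and return (IRI, label) pairs; needs network access."""
    with urllib.request.urlopen(build_request(limit), timeout=timeout) as response:
        data = json.load(response)
    return [
        (row["university"]["value"], row["label"]["value"])
        for row in data["results"]["bindings"]
    ]
```

Calling `fetch_universities(5)` would contact the endpoint and return up to five pairs. Keeping request construction separate from the network call makes the query easy to inspect or adapt without going online.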
Producing the Data through Parsing, Querying, and Inferencing
The Pure and Web of Science databases were integrated with the current ontology using a self-made Python parser, and were then joined together with the support of inferencing and Protege. The last dataset (i.e., the universities data), however, is not joined in this way; it is integrated into the current application through SPARQL queries (fig. \ref{494903}). In all stages, inferencing has reduced the effort needed to specify relationships between entities, since explicitly stating every possible relationship would be infeasible. The difference that inferencing makes for the current database is evident from queries like the ones in fig. \ref{596289}, \ref{470350}, and \ref{659082}, which return, respectively, three authors, only one publication, and nothing when inferencing is off. When inferencing is on, the number of returned results is much higher, but still modest. This is because development is still being carried out with a truncated version of the Pure dataset; the full dataset has an order of magnitude more instances. Thus, when the full Pure dataset is added as instances of the ontology (or if it is, at least during the course, given the limited computing power), the number of inferred relationships and instances can be expected to increase dramatically. Finally, inference is also utilized, with the help of the LD-R framework, in the application's visual query interface (fig. \ref{444678}). This visual interface has the potential to make otherwise complex queries (see fig. \ref{347168}) intuitive, and to help linked data workflows become more efficient and user-friendly.
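The effect of inferencing on query results described above can be illustrated with a toy example. The sketch below applies two standard RDFS entailment rules (transitivity of `rdfs:subClassOf`, and propagation of `rdf:type` to superclasses) to a small hand-made triple set. The class and instance names are hypothetical and are not taken from the project's ontology, and the actual triple store may apply a richer rule set.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

# Hypothetical asserted triples; only ex:paper3 is typed directly as a Publication.
asserted = {
    ("sr:JournalArticle", SUBCLASS, "sr:Article"),
    ("sr:Article", SUBCLASS, "sr:Publication"),
    ("ex:paper1", RDF_TYPE, "sr:JournalArticle"),
    ("ex:paper2", RDF_TYPE, "sr:Article"),
    ("ex:paper3", RDF_TYPE, "sr:Publication"),
}


def infer(triples):
    """Forward-chain two RDFS rules until no new triples appear."""
    closed = set(triples)
    while True:
        new = set()
        for (s, p, o) in closed:
            if p != SUBCLASS:
                continue
            for (s2, p2, o2) in closed:
                # rdfs11: (s subClassOf o), (o subClassOf o2) => (s subClassOf o2)
                if p2 == SUBCLASS and s2 == o:
                    new.add((s, SUBCLASS, o2))
                # rdfs9: (s subClassOf o), (s2 type s) => (s2 type o)
                if p2 == RDF_TYPE and o2 == s:
                    new.add((s2, RDF_TYPE, o))
        if new <= closed:
            return closed
        closed |= new


def instances_of(triples, cls):
    """Return the sorted subjects explicitly typed as cls in the given triples."""
    return sorted(s for (s, p, o) in triples if p == RDF_TYPE and o == cls)


without_inference = instances_of(asserted, "sr:Publication")
with_inference = instances_of(infer(asserted), "sr:Publication")
print(without_inference)  # ['ex:paper3']
print(with_inference)     # ['ex:paper1', 'ex:paper2', 'ex:paper3']
```

Without inference, only the instance typed directly as `sr:Publication` is found. After the closure is computed, the instances of its subclasses are returned as well, which mirrors the jump in result counts between the two query modes.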