For each article, Scopus extracts the list of references and an algorithm matches those references with existing Scopus records. If a match is found, a citation is added to that record which then allows us to evaluate the impact of this record on research from a quantitative citation standpoint.
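As a minimal sketch of this matching step, the toy example below credits a record once for each citing article whose reference list contains it. The matching key (normalised title plus year), the record identifiers and the sample data are all hypothetical and chosen for illustration; the actual Scopus matching algorithm is not described here and is certainly more sophisticated.

```python
import re
from collections import Counter

def match_key(title, year):
    """Build a crude matching key: lower-cased alphanumeric title plus year.

    This key is an assumption made for the sketch, not the Scopus method.
    """
    return (re.sub(r"[^a-z0-9]+", " ", title.lower()).strip(), year)

# Hypothetical index of existing records, keyed by match_key.
records = {
    match_key("Example study one", 2014): "rec-1",
    match_key("Example study two", 2017): "rec-2",
}

# Hypothetical reference lists extracted from two citing articles.
reference_lists = {
    "art-A": [("Example Study One.", 2014), ("Unindexed report", 2010)],
    "art-B": [("example study one", 2014), ("Example Study Two", 2017)],
}

citation_counts = Counter()
for citing_id, references in reference_lists.items():
    # A record is credited at most once per citing article.
    matched = {
        records[match_key(title, year)]
        for title, year in references
        if match_key(title, year) in records
    }
    citation_counts.update(matched)

print(dict(citation_counts))
```

References that match no record (here "Unindexed report") are simply ignored, which mirrors the "if a match is found" condition above.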
In principle, the same approach could be applied to datasets, but a few roadblocks make the process considerably more complicated for data:
- Completeness of metadata: creating records for datasets is a challenge in itself. The very concept of authorship means something different for data, and repositories often do not provide metadata as complete as that available for articles.
- References: article citations always come from the reference section. This is not necessarily true for datasets. Sometimes they are indeed cited in the references (following the FORCE Data Citation principles \cite{m2014}\cite{Cousijn_2017}), but data can also be connected to an article via a "Data Availability section" or simply as supplementary material. Sometimes data are also just linked in the body of the article to an external repository. For data, the definition of what a citation is may still need to be decided on by the community.
- Cited by what? Article citations come predominantly from other articles. They sometimes come from books or conference proceedings, but the community more or less agrees on how to count these citations. For datasets, are we only considering citations coming from articles? Can we have citations from other datasets? Should datasets have references? Should we also consider datasets that are not connected to an article?
Because of these considerations, we will start by analysing the current landscape using Scopus as a proxy for article output, assuming that Scholix has a wide enough coverage to represent the data sharing landscape.
(A validation of this hypothesis will follow in future publications of this work.)
Scholix and Scopus coverage
In this section, we analyse Scopus and Scholix coverage, in particular to highlight any bias or significant pattern in data sharing depending on years and subject areas.
Scholix is a framework that allows for peer-to-peer sharing of information about articles and datasets \cite{Burton_2017}. In this work we use the implementation by OpenAIRE, with the Data-Literature Interlinking service \cite{service}.
To perform this analysis, we merge the Scopus and Scholix datasets on DOIs, i.e. a link is kept when a Scopus Record DOI matches a Scholix Source DOI. Linking Scholix to Scopus Records allows us to leverage the Scopus Knowledge Graph described in the previous section, in particular to perform more advanced analyses based on the metadata computed and curated by Scopus for other entities such as Authors, Affiliations and Sources. As an example, we can now analyse datasets by the country of their authors' affiliations.
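The DOI-based merge can be sketched as follows. This is an illustration under stated assumptions: the field names (`doi`, `country`, `source_doi`, `target_id`), the DOI normalisation rules and the sample values are hypothetical, and do not reflect the actual Scopus or Scholix schemas.

```python
from collections import defaultdict

def normalize_doi(doi):
    """Lower-case a DOI and strip a resolver prefix (assumed normalisation)."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.startswith(prefix):
            return doi[len(prefix):]
    return doi

# Hypothetical Scopus records carrying curated affiliation metadata.
scopus_records = [
    {"doi": "10.1000/ART.1", "country": "NL"},
    {"doi": "10.1000/art.2", "country": "IT"},
    {"doi": "10.1000/art.3", "country": "NL"},
]

# Hypothetical Scholix links: an article (source) pointing to a dataset (target).
scholix_links = [
    {"source_doi": "https://doi.org/10.1000/art.1", "target_id": "ds-1"},
    {"source_doi": "10.1000/art.1", "target_id": "ds-2"},
    {"source_doi": "10.1000/art.2", "target_id": "ds-2"},
    {"source_doi": "10.5555/not-in-scopus", "target_id": "ds-3"},
]

# Index Scopus records by normalised DOI so each link is a dictionary lookup.
by_doi = {normalize_doi(r["doi"]): r for r in scopus_records}

datasets_by_country = defaultdict(set)
for link in scholix_links:
    record = by_doi.get(normalize_doi(link["source_doi"]))
    if record is None:
        continue  # the source article is not in Scopus; drop the link
    datasets_by_country[record["country"]].add(link["target_id"])

# Number of unique datasets linked from articles of each country.
datasets_per_country = {c: len(ds) for c, ds in datasets_by_country.items()}
print(datasets_per_country)
```

Normalising DOIs before the join matters because DOIs are case-insensitive and are often stored with a resolver prefix, so a naive string comparison would miss valid matches.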
As of June 2018, 433,345 documents in Scopus are connected to one or more datasets via Scholix, with a total of 1,296,925 links to 1,126,133 unique datasets. This represents only about 1% of all Scopus articles, mostly because data sharing is a very recent practice.
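Two simple ratios follow from the figures above: the average number of links per linked document and per unique dataset. The short calculation below uses only the counts reported in the text; the variable names are ours.

```python
# Counts reported in the text (as of June 2018).
linked_documents = 433_345
links = 1_296_925
unique_datasets = 1_126_133

links_per_document = links / linked_documents
links_per_dataset = links / unique_datasets

print(f"{links_per_document:.2f} links per linked document")  # 2.99
print(f"{links_per_dataset:.2f} links per unique dataset")    # 1.15
```

On average, a linked document therefore points to about three datasets, while most datasets are linked from a single document.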