This process has generated a vast body of custom algorithms and concepts that support reproducible analysis of library catalogues [REFS - bibliographica R package]. These methods complement traditional software interfaces, which have been designed for browsing and automated retrieval rather than for scalable statistical research. Data harmonization and quantitative analysis are intricately related objectives: the analysis itself often reveals previously unnoticed shortcomings in the data. Bibliographic data science is therefore an inherently iterative process, in which an improved understanding of the data and of historical trends can lead to enhancements in the harmonization procedures, and to new, independent ways to validate the data and the observed patterns.
A key innovation in our approach is that the available large-scale data allow us to rigorously estimate how gathering sizes varied across places and time periods, and thus to obtain more accurate estimates of the missing entries and of overall paper consumption.
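The estimation strategy described above can be sketched as a group-wise imputation: missing gathering sizes are filled in from the sizes observed in the same place and period. The record layout, the column names, and the use of the group median are illustrative assumptions for this sketch, not the estimator actually used in the study.

```python
from statistics import median

# Hypothetical records: (place, decade, sheets_per_copy);
# None marks a missing entry that must be imputed.
records = [
    ("London", 1700, 12), ("London", 1700, 14), ("London", 1700, None),
    ("Paris", 1700, 8), ("Paris", 1700, None), ("Paris", 1710, 10),
]

def impute_gathering_sizes(records):
    """Fill each missing size with the median size observed in the same
    (place, decade) group, falling back to the overall median when a
    group has no observed values at all."""
    groups = {}
    for place, decade, size in records:
        if size is not None:
            groups.setdefault((place, decade), []).append(size)
    overall = median(s for _, _, s in records if s is not None)
    filled = []
    for place, decade, size in records:
        if size is None:
            observed = groups.get((place, decade))
            size = median(observed) if observed else overall
        filled.append((place, decade, size))
    return filled

completed = impute_gathering_sizes(records)
# Aggregate paper consumption over all (now complete) records.
total_sheets = sum(size for _, _, size in completed)
```

With the toy data above, the missing London 1700 entry is filled with the London 1700 median (13) and the missing Paris 1700 entry with 8, so place- and period-specific variation is preserved rather than flattened into one global average.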
Our open science approach facilitates collaborative methods development: we constantly take advantage of, and contribute to, the growing body of open-source algorithms in the relevant research fields.