Introduction

Library catalogues are essential tools in information science, and their utility has been greatly advanced by digitization. The need to manage and organize an ever-increasing body of digital information has motivated the development of new concepts and technologies, such as Linked Open Data (LOD), which was first introduced some twenty years ago and has been on the agenda of most National Libraries since then. Metadata collections of the published material that different libraries hold are particularly suitable for interlinking and enriching with different semantic layers [FOOTNOTE: good examples of LOD services among National Libraries are \cite{finto} and \cite{data}. On the development of LOD in the library sector, see \cite{Yoose_2013} **]. To take full advantage of digital resources, it is important to consider the future of global information infrastructures, the integration of web sources, and open, reusable metadata and its enrichment.
This article relates closely to these efforts in National Libraries, and it makes one central claim: for research purposes, it is essential that we can rely on the quality of these data collections and thus make robust statistical claims based on them [FOOTNOTE: the relevance of this type of quantitative analysis of library catalogues has been recognized in \cite{Buringh_2009} and \cite{Baten_2008} **]. Thus, even though no cumulative, integrated catalogue of national bibliographic data will ever be perfect and free from errors, it can be sufficiently representative of important trends in knowledge production. This hypothesis carries huge research potential, but it has yet to be systematically explored and tested.
We have started to develop novel ways of addressing these needs by creating a data analytical ecosystem designed to harmonize and integrate different sources of library catalogue metadata maintained by research libraries, so that they can be used in quantitative research. Whereas contemporary cataloguing efforts and data management technologies, including LOD, have focused on data storage, management, and distribution, our efforts have a different, complementary target. We focus on enhancing overall data quality and commensurability between independently maintained library catalogues through systematic large-scale harmonization and quality control. It is widely observed that metadata collections contain large numbers of inaccurate entries, data collection biases, and missing information. While this has posed severe challenges for the reliable research use of these collections, many of these issues can potentially be overcome by systematic data harmonization and careful validation. Hence, we aim to fill a critical gap that has repeatedly proven to be a central bottleneck in the large-scale quantitative analysis of library catalogues. At the same time, as we demonstrate in the case of the Finnish National Bibliography (FNB), the harmonized data can then be combined into LOD releases, opening new doors for research on national bibliographies. To emphasize the need for such complementary approaches and their vast research potential, we propose a new research paradigm of bibliographic data science, which is specifically targeted at enabling the use of library catalogues as research objects.
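To give a concrete flavour of what such harmonization involves, consider the publication-year field, which in historical catalogues often mixes brackets, uncertainty markers, and Roman numerals. The following minimal Python sketch illustrates the idea; the function and its matching rules are our illustrative assumptions, not the actual pipeline, which must handle far more cases:

```python
import re

def harmonize_year(raw):
    """Extract a publication year from a messy catalogue field.

    Returns an integer year, or None when no plausible year is found.
    Hypothetical example rules; a real pipeline needs many more cases.
    """
    if raw is None:
        return None
    # Strip cataloguing brackets and uncertainty markers, e.g. "[1642?]".
    cleaned = re.sub(r"[\[\]?]", "", raw.strip())
    # Take the first four-digit token that looks like a plausible year.
    match = re.search(r"\b(1[0-9]{3}|20[0-9]{2})\b", cleaned)
    return int(match.group(1)) if match else None

# A few raw entries of the kind found in catalogue dumps.
for raw in ["1642", "[1642?]", "MDCXLII [1642]", "anno 1642.", "s.a."]:
    print(f"{raw!r:18} -> {harmonize_year(raw)}")
```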
The question of data quality has not been raised with respect to national bibliographies because, even though they are cherished within information science, bibliographic information has nevertheless remained an undervalued research resource. Our study is motivated by the analysis of the Swedish National Bibliography (SNB) and the Finnish National Bibliography, focusing on publication patterns in Sweden and Finland, where we have encountered specific and largely overlooked challenges in using bibliographic catalogues for historical research. Using national bibliographies as a research resource, rather than as a mere information-retrieval tool, has proven challenging, as obtaining valid conclusions depends critically not only on an overall understanding of the historical context but also on technical issues of data quality and completeness. Scalable solutions to these challenges, and subsequent research cases that build on the statistical analysis of these data collections, have been missing.
We demonstrate how such challenges can be overcome by specifically tailored open and collaborative data analytical ecosystems that provide scalable tools for data processing and analysis, from the efficient and reliable harmonization of raw entries and the augmentation of missing data to the integration and statistical analysis of national bibliographies. We show how external data sources, for instance on authors, publishers, and places, can be used to enrich and verify bibliographic information; a sketch of such an enrichment step follows this paragraph. We present an analysis of the overall publishing landscape in the period c. 1500–1800, based on the comprehensive harmonization and joint analysis of four large bibliographies, which has allowed us to assess publishing activity beyond what is accessible through national catalogues alone. In particular, we have prepared the first harmonized versions of the FNB and the SNB, the English Short-Title Catalogue (ESTC), and the Heritage of the Printed Book Database (HPBD). Such a systematic approach has vast potential for wider implementation in related studies and other bibliographies. Our work clearly indicates that whereas national bibliographies are essentially about mapping the national canon of publishing, integrating data across borders should be managed in a way that takes specific local circumstances into account while also helping to overcome the national view in analyzing the past. Such integration can help scholarship reach a more precise view of print culture beyond the confines of national bibliographies.
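As a sketch of such an enrichment step, the snippet below links harmonized publication places to an external geographical source. The gazetteer here is a toy in-memory dictionary standing in for a real external dataset; a real workflow would draw on a full gazetteer and flag unmatched names for manual review:

```python
# Illustrative enrichment: attach coordinates from an external
# geographical source to harmonized publication places. The toy
# gazetteer and its coordinates are for illustration only.

GAZETTEER = {
    "Turku": (60.45, 22.27),
    "Stockholm": (59.33, 18.07),
    "London": (51.51, -0.13),
}

# Harmonized records keep the original imprint form ("place_orig")
# alongside the canonical place name produced by harmonization.
records = [
    {"place_orig": "Aboae",    "place": "Turku"},
    {"place_orig": "Holmiae",  "place": "Stockholm"},
    {"place_orig": "Londini",  "place": "London"},
    {"place_orig": "Parisiis", "place": "Paris"},  # not in the toy gazetteer
]

for rec in records:
    # Enrich when the canonical name matches; otherwise mark for review.
    rec["coordinates"] = GAZETTEER.get(rec["place"])
    status = rec["coordinates"] if rec["coordinates"] else "needs review"
    print(f'{rec["place_orig"]:9} -> {rec["place"]:9} {status}')
```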

Bibliographic data science

Quantitative, data-intensive research has not been the original or intended goal of analytical bibliography. Instead, a primary motivation for cataloguing has been to preserve as much information as possible about the original document and its physical creation, including potential errors and shortcomings [FOOTNOTE: for a good discussion of W. W. Greg and Fredson Bowers, who largely shaped the field, see \cite{analytical} **]. Thus, if, for instance, a place name is wrongly spelled, for cataloguing purposes it is relevant to preserve that misspelling as well. Whereas library catalogues and national bibliographies have traditionally been used as search tools for information retrieval, recent studies have indicated the research potential of these data resources for publishing history when the information is appropriately controlled and verified [REFS]. Our work builds on traditional bibliographic research, and we use established definitions of bibliographic concepts where possible [FOOTNOTE: for most analytical bibliographical definitions, we rely on \cite{gaskell1995new} **].
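A research-oriented harmonization can respect this cataloguing principle by never overwriting the original transcription: the raw form is preserved verbatim, and a separate normalized field is added for quantitative use. A minimal sketch, with an illustrative variant map rather than a real authority file:

```python
# Keep the cataloguer's transcription intact while adding a normalized
# layer for analysis. The variant-to-canonical map below is a small
# illustrative excerpt, not an exhaustive authority file.

PLACE_VARIANTS = {
    "abo": "Turku",
    "aboae": "Turku",
    "holmiae": "Stockholm",
    "stokholm": "Stockholm",  # a misspelling, preserved in the raw field
}

def normalize_place(raw):
    """Return (original, canonical); canonical is None when unmapped."""
    canonical = PLACE_VARIANTS.get(raw.strip().lower())
    return raw, canonical

for raw in ["Aboae", "Stokholm", "Uppsala"]:
    original, canonical = normalize_place(raw)
    print(f"{original!r} kept verbatim; normalized: {canonical}")
```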
Our work prepares the ground for the quantitative research use of bibliographic information. We use the term bibliographic data science to describe this emerging research area, in which bibliographic catalogues are viewed as quantitative research material and systematic efforts are carried out to ensure data reliability and completeness. The harmonized data sets can be further integrated and converted into Linked Open Data [REFS] and other popular formats in order to utilize the vast pool of existing software tools. In addition to improving overall data quality, and hence the value of LOD and other data infrastructures that focus on data management and retrieval, harmonization enables statistical analysis in scientific programming environments such as R [REFS] or Python [REFS], which provide advanced tools for modern data analysis and statistical inference. Hence, these two approaches serve different, complementary purposes.
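As a sketch of the conversion step, the snippet below expresses one harmonized record as RDF triples using the Python rdflib library and Dublin Core terms. The namespace and record identifier are hypothetical placeholders; an actual LOD release would involve richer vocabularies and stable URIs:

```python
# Minimal sketch: publish one harmonized record as Linked Open Data.
# Requires rdflib (pip install rdflib); the base URI is a placeholder.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, XSD

FNB = Namespace("http://example.org/fnb/")  # hypothetical namespace

g = Graph()
g.bind("dcterms", DCTERMS)

record = URIRef(FNB["record/0001"])  # hypothetical identifier
g.add((record, DCTERMS.title, Literal("An example harmonized title")))
g.add((record, DCTERMS.issued, Literal("1642", datatype=XSD.gYear)))
g.add((record, DCTERMS.spatial, Literal("Turku")))  # harmonized place

# Serialize as Turtle, one of the standard LOD exchange formats.
print(g.serialize(format="turtle"))
```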